Why not just have AI generate n-grams data, such as the "top 100 phrases" that meet certain criteria? Because AI-generated phrase-frequency data does a very poor job of modeling what happens in actual human-generated language, such as the COCA corpus.
These n-grams are based on the largest publicly available, genre-balanced corpus of English -- the one-billion-word Corpus of Contemporary American English (COCA). With this n-grams data (2-, 3-, 4-, and 5-word sequences, with their frequencies), you can carry out powerful queries offline -- without needing to access the corpus via the web interface.
frequency   word1   word2    word3
31891       much    of       the
13261       much    of       a
8000        much    more     than
7396        much    as       i
5650        much    the      same
5633        much    of       it
4229        much    better   than
4191        much    as       the
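An offline query like the one behind the table above can be sketched as follows. The column layout (tab-separated lines of frequency, word1, word2, word3) and the idea of filtering on the first word are taken from the example; the function name and file handling are illustrative assumptions, so check the layout of the files you actually receive.

```python
import csv

# Sketch of an offline query against a 3-grams file. Assumed layout:
# tab-separated lines of frequency, word1, word2, word3.
def top_trigrams(path, first_word, n=8):
    """Return the n most frequent 3-grams that begin with first_word."""
    matches = []
    with open(path, encoding="utf-8") as f:
        for freq, w1, w2, w3 in csv.reader(f, delimiter="\t"):
            if w1 == first_word:
                matches.append((int(freq), w1, w2, w3))
    matches.sort(reverse=True)  # highest frequency first
    return matches[:n]
```

Since the data is a flat file, the same pattern extends to any condition on the words (part of speech, position, wildcards) without touching a web interface.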
These are just a few examples from among an unlimited number of possible queries.
The data is available in three different formats, and purchasing the data gives you access to all three. (The numbers refer to how many millions of entries there are for each format / n-grams set.)
Type   Data                                 2-grams   3-grams   4-grams   5-grams
1      Words                                8.2 m     16.3 m    13.1 m    6.2 m
2      Words + part of speech               11.6 m    28.5 m    28.2 m    17.4 m
db     Database: integer values + lexicon   13.5 m    18.9 m    27.1 m    16.2 m
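The "db" format's combination of integer values and a lexicon can be pictured roughly like this. The schema below is a minimal sketch, not the vendor's actual layout: each word is stored once in a lexicon table, the n-gram rows hold only integer IDs (which keeps the files small), and a join turns the IDs back into words.

```python
import sqlite3

# Illustrative schema (assumed): a lexicon maps each word to an
# integer ID, and the trigram rows store frequencies against IDs.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE lexicon (id INTEGER PRIMARY KEY, word TEXT)")
cur.execute("CREATE TABLE trigrams (freq INTEGER, w1 INTEGER, w2 INTEGER, w3 INTEGER)")
cur.executemany("INSERT INTO lexicon VALUES (?, ?)",
                [(1, "much"), (2, "of"), (3, "the"), (4, "a")])
cur.executemany("INSERT INTO trigrams VALUES (?, ?, ?, ?)",
                [(31891, 1, 2, 3), (13261, 1, 2, 4)])
# Join the integer IDs back to words to reconstruct the phrases.
rows = cur.execute("""
    SELECT t.freq, a.word, b.word, c.word
    FROM trigrams t
    JOIN lexicon a ON a.id = t.w1
    JOIN lexicon b ON b.id = t.w2
    JOIN lexicon c ON c.id = t.w3
    ORDER BY t.freq DESC
""").fetchall()
# rows -> [(31891, 'much', 'of', 'the'), (13261, 'much', 'of', 'a')]
```

Storing integers instead of repeated strings is the standard trade-off behind this format: smaller tables and faster comparisons, at the cost of one join per word position when you want readable output.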
You might also be interested in the frequency of single words (including frequency by genre and sub-genre), or collocates (all words "nearby" a given word).
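Collocate counting of the kind mentioned above can be sketched as follows. The +/- 4-word window is an illustrative assumption -- the corpus tools define their own span settings.

```python
from collections import Counter

# Minimal sketch of collocate counting: tally every word that occurs
# within a +/- 4-word window of the node word. Window size is an
# illustrative choice, not the corpus interface's setting.
def collocates(tokens, node, window=4):
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - window), i + window + 1
            counts.update(t for t in tokens[lo:hi] if t != node)
    return counts
```

Real collocate tools typically go one step further and rank the raw counts by an association measure (such as mutual information) so that frequent function words do not dominate the list.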