Most of the information at this website deals with data from the COCA corpus. You might also be interested in the n-grams data from the 14 billion word iWeb corpus.
These n-grams are based on the largest publicly available, genre-balanced corpus
of English -- the one billion word Corpus of
Contemporary American English (COCA). With these n-grams (2-, 3-, 4-, and
5-word sequences, each with its frequency), you can carry
out powerful queries offline -- without needing to access the
corpus via the web interface.
frequency | word1 | word2 | word3
31891 | much | of | the
13261 | much | of | a
8000 | much | more | than
7396 | much | as | i
5650 | much | the | same
5633 | much | of | it
4229 | much | better | than
4191 | much | as | the
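A table like the one above could be produced offline with a few lines of Python. The following is only a minimal sketch: it assumes each n-gram is stored on one line as tab-separated fields with the frequency first, which may not match the actual files, and the function names `parse_line` and `ngrams_with_prefix` are hypothetical. The sample rows are copied from the table above.

```python
# Sample 3-gram rows copied from the table above.
# ASSUMPTION: one n-gram per line, tab-separated, frequency first.
# The real download may use a different layout.
SAMPLE_LINES = [
    "31891\tmuch\tof\tthe",
    "13261\tmuch\tof\ta",
    "8000\tmuch\tmore\tthan",
    "7396\tmuch\tas\ti",
    "5650\tmuch\tthe\tsame",
    "5633\tmuch\tof\tit",
    "4229\tmuch\tbetter\tthan",
    "4191\tmuch\tas\tthe",
]


def parse_line(line):
    """Split one line into (frequency, (word1, word2, ...))."""
    fields = line.rstrip("\n").split("\t")
    return int(fields[0]), tuple(fields[1:])


def ngrams_with_prefix(lines, prefix, limit=None):
    """Return n-grams whose first words equal `prefix`, most frequent first."""
    prefix = tuple(prefix)
    hits = []
    for line in lines:
        freq, words = parse_line(line)
        if words[:len(prefix)] == prefix:
            hits.append((freq, words))
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits if limit is None else hits[:limit]


# All 3-grams in the sample that begin with "much of".
results = ngrams_with_prefix(SAMPLE_LINES, ("much", "of"))
for freq, words in results:
    print(freq, " ".join(words))
```

Because the function accepts any iterable of lines, an open file handle could be passed in place of the sample list, so a large file would be read one line at a time rather than loaded into memory.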
A few more examples (from among an
unlimited number of queries) might be:
The data is available in three different formats,
and when you purchase the data you have access to all three formats. (The numbers
show how many million entries each format contains for each n-grams set.)
Type | Data | 2-grams | 3-grams | 4-grams | 5-grams
1 | Words | 8.2 m | 16.3 m | 13.1 m | 6.2 m
2 | Words + part of speech | 11.6 m | 28.5 m | 28.2 m | 17.4 m
db | Database: integer values + lexicon | 13.5 m | 18.9 m | 27.1 m | 16.2 m
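The "db" format pairs integer values with a lexicon, which suggests that each n-gram row stores word IDs that are looked up in a separate word list. The sketch below illustrates that idea with an in-memory SQLite database. Everything about the schema is an assumption made for illustration: the table names `lexicon` and `ngrams3`, the column names, and the word IDs are hypothetical, and only the words and frequencies come from the example table earlier on this page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# ASSUMPTION: hypothetical schema, not the layout of the actual database files.
conn.executescript(
    """
    CREATE TABLE lexicon (word_id INTEGER PRIMARY KEY, word TEXT);
    CREATE TABLE ngrams3 (freq INTEGER, w1 INTEGER, w2 INTEGER, w3 INTEGER);
    """
)

# Hypothetical word IDs; the real lexicon assigns its own values.
conn.executemany(
    "INSERT INTO lexicon (word_id, word) VALUES (?, ?)",
    [(1, "much"), (2, "of"), (3, "the"), (4, "a"), (5, "it")],
)

# Frequencies taken from the example table; words replaced by the IDs above.
conn.executemany(
    "INSERT INTO ngrams3 (freq, w1, w2, w3) VALUES (?, ?, ?, ?)",
    [(31891, 1, 2, 3), (13261, 1, 2, 4), (5633, 1, 2, 5)],
)

# Join the n-gram table to the lexicon once per word slot to turn IDs into words.
query = """
    SELECT n.freq, a.word, b.word, c.word
    FROM ngrams3 AS n
    JOIN lexicon AS a ON a.word_id = n.w1
    JOIN lexicon AS b ON b.word_id = n.w2
    JOIN lexicon AS c ON c.word_id = n.w3
    WHERE a.word = ?
    ORDER BY n.freq DESC
"""
decoded = conn.execute(query, ("much",)).fetchall()
for row in decoded:
    print(row)

conn.close()
```

Storing integers instead of repeated strings generally keeps n-gram tables compact and makes equality comparisons cheap, at the cost of one lexicon join per word position when readable output is needed.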
You might also be interested in the frequency of single words (including frequency by genre and sub-genre), or collocates (all words "nearby" a given word).