N-grams data


Why not just have AI generate n-grams data, such as the "top 100 phrases" that meet certain criteria? Because AI-generated phrase-frequency data does a very poor job of modeling what actually occurs in human-generated language, such as the language of the COCA corpus.

These n-grams are based on the largest publicly available, genre-balanced corpus of English -- the one-billion-word Corpus of Contemporary American English (COCA). With this n-grams data (2-, 3-, 4-, and 5-word sequences, with their frequencies), you can carry out powerful queries offline -- without needing to access the corpus via the web interface.

 
For example, here is a sample of the 3-grams data (the most frequent 3-grams beginning with "much"):

frequency   word1   word2    word3
31891       much    of       the
13261       much    of       a
8000        much    more     than
7396        much    as       i
5650        much    the      same
5633        much    of       it
4229        much    better   than
4191        much    as       the
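
As a rough illustration of these offline queries, here is a minimal Python sketch that loads a 3-grams file and reproduces the sample above. The file name (3grams.txt) and the exact tab-delimited column layout are assumptions based on the sample shown here; adjust both to match the files you actually receive.

import csv

# Load the 3-grams file: one n-gram per row, assumed to be
# tab-delimited as frequency, word1, word2, word3.
trigrams = []
with open("3grams.txt", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="\t"):
        trigrams.append((int(row[0]), row[1], row[2], row[3]))

# Example query: the most frequent 3-grams beginning with "much",
# which should reproduce the sample above.
much = [t for t in trigrams if t[1] == "much"]
for freq, w1, w2, w3 in sorted(much, reverse=True)[:8]:
    print(freq, w1, w2, w3)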

A few more examples (from among an unlimited number of possible queries) might be the following; one of them is sketched in code after the list:

- NOUN + NOUN sequences
- VERB + the + NOUN sequences
- like + word + word
- three-word strings with a preposition in the middle position
- two-word strings where the words begin or end with certain letters
- (potential) phrasal verbs: VERB + ADV particle
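
To make one of these concrete, here is a minimal sketch of the "preposition in the middle position" query. It assumes the Words + part of speech format (described below) is laid out as frequency, three words, and three tags per row, and that preposition tags begin with "i" (as in CLAWS-style tagging); the file name and both layout assumptions should be verified against the actual files.

import csv

# Hypothetical layout for the Words + part-of-speech format:
# frequency, word1, word2, word3, tag1, tag2, tag3 (tab-delimited).
def rows(path):
    with open(path, encoding="utf-8") as f:
        for r in csv.reader(f, delimiter="\t"):
            yield int(r[0]), r[1:4], r[4:7]

# Query: three-word strings with a preposition in the middle position,
# assuming CLAWS-style tags in which preposition tags begin with "i".
hits = [(freq, words) for freq, words, tags in rows("3grams_pos.txt")
        if len(tags) == 3 and tags[1].startswith("i")]
for freq, words in sorted(hits, reverse=True)[:10]:
    print(freq, *words)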

The data is available in three different formats, and when you purchase the data you have access to all three. (The numbers in the table below indicate how many millions of entries there are for each format and n-grams set.)

Type  Data                                  2-grams   3-grams   4-grams   5-grams
1     Words                                 8.2 m     16.3 m    13.1 m    6.2 m
2     Words + part of speech                11.6 m    28.5 m    28.2 m    17.4 m
db    Database: integer values + lexicon    13.5 m    18.9 m    27.1 m    16.2 m
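
To make the "integer values + lexicon" idea concrete, here is a minimal sketch of that kind of organization, using SQLite from Python. The table names, column names, and toy rows are purely illustrative -- the actual schema that ships with the data may differ.

import sqlite3

# A lexicon table maps integer IDs to word forms; the n-grams table
# stores a frequency plus one integer ID per word slot.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE lexicon (word_id INTEGER PRIMARY KEY, word TEXT);
CREATE TABLE trigrams (freq INTEGER, w1 INTEGER, w2 INTEGER, w3 INTEGER);
INSERT INTO lexicon VALUES (1,'much'),(2,'of'),(3,'the'),(4,'a');
INSERT INTO trigrams VALUES (31891,1,2,3),(13261,1,2,4);
""")

# Join the integer IDs back to their word forms when querying.
query = """
SELECT t.freq, a.word, b.word, c.word
FROM trigrams t
JOIN lexicon a ON a.word_id = t.w1
JOIN lexicon b ON b.word_id = t.w2
JOIN lexicon c ON c.word_id = t.w3
ORDER BY t.freq DESC
"""
for row in con.execute(query):
    print(*row)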

You might also be interested in the frequency of single words (including frequency by genre and sub-genre), or in collocates (all of the words occurring near a given word).