N-grams data


Why not just have AI generate n-grams data, such as the "top 100 phrases" that meet certain criteria? Because AI-generated phrase-frequency data does a very poor job of modeling what actually occurs in human-generated language, such as the language of the COCA corpus.

These n-grams are based on the largest publicly available, genre-balanced corpus of English -- the one-billion-word Corpus of Contemporary American English (COCA). With this n-grams data (2-, 3-, 4-, and 5-word sequences, with their frequencies), you can carry out powerful queries offline -- without needing to access the corpus via the web interface.

 
For example, here is a sample of the 3-grams data (the most frequent 3-grams beginning with "much"):

frequency   word1   word2    word3
31891       much    of       the
13261       much    of       a
8000        much    more     than
7396        much    as       i
5650        much    the      same
5633        much    of       it
4229        much    better   than
4191        much    as       the
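
As a rough illustration of these offline queries, here is a minimal Python sketch that loads a 3-grams file and reproduces the sample above. The file name (3grams.txt) and the exact tab-delimited column layout are assumptions based on the sample shown here; adjust both to match the files you actually receive.

import csv

# Load the 3-grams file: one n-gram per row, assumed to be
# tab-delimited as frequency, word1, word2, word3.
trigrams = []
with open("3grams.txt", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="\t"):
        trigrams.append((int(row[0]), row[1], row[2], row[3]))

# Example query: the most frequent 3-grams beginning with "much",
# which should reproduce the sample above.
much = [t for t in trigrams if t[1] == "much"]
for freq, w1, w2, w3 in sorted(much, reverse=True)[:8]:
    print(freq, w1, w2, w3)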

A few more examples (from among an unlimited number of possible queries) might be the following; one of them is sketched in code after the list:

- NOUN + NOUN sequences
- VERB + the + NOUN sequences
- like + word + word
- three-word strings with a preposition in the middle position
- two-word strings where the words begin or end with certain letters
- (potential) phrasal verbs: VERB + ADV particle
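
To make one of these concrete, here is a minimal sketch of the "preposition in the middle position" query. It assumes the Words + part of speech format (described below) is laid out as frequency, three words, and three tags per row, and that preposition tags begin with "i" (as in CLAWS-style tagging); the file name and both layout assumptions should be verified against the actual files.

import csv

# Hypothetical layout for the Words + part-of-speech format:
# frequency, word1, word2, word3, tag1, tag2, tag3 (tab-delimited).
def rows(path):
    with open(path, encoding="utf-8") as f:
        for r in csv.reader(f, delimiter="\t"):
            yield int(r[0]), r[1:4], r[4:7]

# Query: three-word strings with a preposition in the middle position,
# assuming CLAWS-style tags in which preposition tags begin with "i".
hits = [(freq, words) for freq, words, tags in rows("3grams_pos.txt")
        if len(tags) == 3 and tags[1].startswith("i")]
for freq, words in sorted(hits, reverse=True)[:10]:
    print(freq, *words)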

The data is available in three different formats, and when you purchase the data you have access to all three. (The numbers in the table below indicate how many millions of entries there are for each format and n-grams set.)

Type  Data                                  2-grams   3-grams   4-grams   5-grams
1     Words                                 8.2 m     16.3 m    13.1 m    6.2 m
2     Words + part of speech                11.6 m    28.5 m    28.2 m    17.4 m
db    Database: integer values + lexicon    13.5 m    18.9 m    27.1 m    16.2 m
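
To make the "integer values + lexicon" idea concrete, here is a minimal sketch of that kind of organization, using SQLite from Python. The table names, column names, and toy rows are purely illustrative -- the actual schema that ships with the data may differ.

import sqlite3

# A lexicon table maps integer IDs to word forms; the n-grams table
# stores a frequency plus one integer ID per word slot.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE lexicon (word_id INTEGER PRIMARY KEY, word TEXT);
CREATE TABLE trigrams (freq INTEGER, w1 INTEGER, w2 INTEGER, w3 INTEGER);
INSERT INTO lexicon VALUES (1,'much'),(2,'of'),(3,'the'),(4,'a');
INSERT INTO trigrams VALUES (31891,1,2,3),(13261,1,2,4);
""")

# Join the integer IDs back to their word forms when querying.
query = """
SELECT t.freq, a.word, b.word, c.word
FROM trigrams t
JOIN lexicon a ON a.word_id = t.w1
JOIN lexicon b ON b.word_id = t.w2
JOIN lexicon c ON c.word_id = t.w3
ORDER BY t.freq DESC
"""
for row in con.execute(query):
    print(*row)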

You might also be interested in the frequency of single words (including frequency by genre and sub-genre), or in collocates (all of the words occurring near a given word).