Statistical Natural Language Processing
Statistical techniques can help remove much of the ambiguity in natural language.
A type is a distinct word form, while a token is an individual occurrence of a type. N-grams are sequences of N consecutive words: unigrams, bigrams, trigrams, etc. Statistics on the occurrences of n-grams can be gathered from text corpora.[Corpus (Latin for body) is singular; corpora is plural. A corpus is a collection of natural language text, sometimes analyzed and annotated by humans.]
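As a rough illustration of these definitions, the following Python sketch counts tokens, types, and n-grams in a tiny made-up sentence. The sentence, the whitespace tokenizer, and the helper name ngrams are assumptions chosen for illustration; a real corpus would need proper tokenization.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-word sequences in tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy text (hypothetical); splitting on whitespace stands in for a tokenizer.
text = "the cat sat on the mat and the cat slept"
tokens = text.split()   # every occurrence counts as a token
types = set(tokens)     # each distinct word form counts once

# Frequency tables for unigrams and bigrams.
unigram_counts = Counter(ngrams(tokens, 1))
bigram_counts = Counter(ngrams(tokens, 2))

print(len(tokens), "tokens,", len(types), "types")
print(bigram_counts.most_common(1))
```

Here the word "the" is a single type that contributes three tokens, and the bigram "the cat" occurs twice.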
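Unigrams give the frequencies of occurrence of individual words. Bigrams begin to take context into account by relating each word to the one before it. Trigrams capture more context, but reliable statistics are harder to gather for longer sequences: the number of possible n-grams grows exponentially with N, so most of them never occur in a corpus of practical size.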
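One common way to use bigram counts, sketched here under simple assumptions, is to estimate the probability of a word given the previous word by relative frequency: count(prev word) divided by the number of times prev starts a bigram. The toy sentence and the function name bigram_prob are hypothetical, and no smoothing is applied, so unseen bigrams get probability zero.

```python
from collections import Counter

# Toy text (hypothetical); whitespace splitting stands in for a tokenizer.
tokens = "the cat sat on the mat and the cat slept".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
# How often each word appears as the first word of a bigram.
context_counts = Counter(tokens[:-1])

def bigram_prob(prev, word):
    """Relative-frequency estimate of P(word | prev); 0.0 if prev was never a context."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

# Sparsity: compare distinct bigrams observed with the number possible.
vocab_size = len(unigram_counts)
possible_bigrams = vocab_size ** 2
observed_bigrams = len(bigram_counts)

print("P(cat | the) =", bigram_prob("the", "cat"))
print(observed_bigrams, "of", possible_bigrams, "possible bigrams observed")
```

Even in this tiny example only 8 of the 49 possible bigrams appear. For trigrams the space of possibilities is the cube of the vocabulary size, which is why statistics on larger groups are harder to obtain.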
N-gram approximations to Shakespeare:[D. Jurafsky and J. Martin, Speech and Language Processing, Prentice-Hall, 2000.]