CS 378

Natural Language Processing

Fall, 2003

Homework 3

Due Tuesday, Sept. 30 at 3:30

You are welcome (in fact encouraged) to work in teams to solve these problems. But each person should write up his/her solutions separately, except for problems 4 and 5, where a group solution is fine.

1) Linguistic Aside: English words are composed of morphemes of two types: roots and affixes. Another way to subdivide morphemes is into bound morphemes, which cannot occur by themselves but must be combined with other morphemes to form words, and free morphemes, which can stand alone. In English, most roots are free morphemes: they can stand alone, although they can also combine with other morphemes. All affixes are bound morphemes. For example, interest is a root, and dis-, -ed, and -ly are affixes; these morphemes can be combined to form words like disinterestedly. For more information, see:

http://www.sil.org/linguistics/GlossaryOfLinguisticTerms/WhatIsAMorphemeType.htm

But not all roots are free. For example, consider plode (as in implode and explode) and flect (as in inflect, deflect, reflect, genuflect). Give two more examples of roots that cannot occur on their own.

2) J & M 5.1.

3) Minimum edit distance is one way to compute the likelihood that an observed word O is actually a mistyping of some intended word I. But, as presented in the book, it fails to take into account two things: the prior probabilities of each of the candidate words and the differences in likelihood of each of the individual errors that can be made. (For example, it is much more likely that someone will type m for n than that they will type t for n.) So we need to combine the Bayesian techniques described in Section 5.5 with the minimum edit distance notion of Section 5.6. Come up with an example text in which:

See http://www.cs.utexas.edu/users/ear/cs378NLP/EnglishWordFrequencies.txt for a list of English word frequencies from one example corpus (news wires; thanks to Kevin Knight).
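
As a concrete (though not required) way to think about the combination, here is a rough Python sketch in which a weighted edit distance stands in for -log P(O|I) and corpus frequencies supply the prior P(I). The candidate words, frequency counts, and per-character confusion costs below are made-up illustrative values, not numbers from the corpus above:

import math

def weighted_edit_distance(observed, intended, sub_cost, ins_cost=1.0, del_cost=1.0):
    # Standard dynamic-programming edit distance, except that substitution
    # costs vary by character pair (e.g., m-for-n is cheap because that typo
    # is common).  sub_cost is a table you would estimate from error data.
    m, n = len(observed), len(intended)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + del_cost
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if observed[i - 1] == intended[j - 1] else \
                sub_cost.get((observed[i - 1], intended[j - 1]), 2.0)
            d[i][j] = min(d[i - 1][j] + del_cost,
                          d[i][j - 1] + ins_cost,
                          d[i - 1][j - 1] + sub)
    return d[m][n]

def score(observed, candidate, freq, total, sub_cost):
    # Channel model (edit distance standing in for -log P(O|I)) plus
    # prior (-log P(I) from corpus counts); lower scores are better.
    prior = -math.log(freq.get(candidate, 0.5) / total)
    return weighted_edit_distance(observed, candidate, sub_cost) + prior

# Hypothetical counts and confusion costs for the observed typo "acress".
freq = {"across": 10000, "access": 9000, "acres": 3000, "actress": 1500}
total = sum(freq.values())
sub_cost = {("e", "o"): 0.5, ("m", "n"): 0.5}
for w in freq:
    print(w, round(score("acress", w, freq, total, sub_cost), 2))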

4) The word frequencies in the table that is referenced in problem 3 are dependent on the dataset from which they were derived. It's easy to calculate frequencies yourself for other online texts: Find some different sort of text, compute the frequencies, and compare them to the ones shown in the table referenced above.


A basic command to do this is

% cat yourtext | tr -cd ' a-zA-Z' | tr ' ' '\012' | sort | uniq -c | sort -nr

But it doesn't treat "return" (the newline character) as a word separator, and it's case sensitive, so the same word with different capitalization is counted as two different words. So Ruifang has suggested some changes you could make:

## text

titian.cs.utexas.edu$ cat a.txt

hi

HI

hello HELLO

see

## The original command

## tr -cd ' a-zA-Z' will delete any other characters not in the list,

## including "return"

titian.cs.utexas.edu$ cat a.txt | tr -cd ' a-zA-Z' | tr ' ' '\012' | sort | uniq -c | sort -nr

1 hiHIhello

1 HELLOsee

## Modified: keep "return"

## change to tr -cd ' a-zA-Z\012', \012 is return

titian.cs.utexas.edu$ cat a.txt | tr -cd ' a-zA-Z\012' | tr ' ' '\012' | sort | uniq -c | sort -nr

1 see

1 hi

1 hello

1 HI

1 HELLO

## Modified: convert lowercase to uppercase

## Adding tr "a-z" "A-Z"

titian.cs.utexas.edu$ cat a.txt | tr "a-z" "A-Z" | tr -cd ' a-zA-Z\012' |

tr ' ' '\012' | sort | uniq -c | sort -nr

2 HI

2 HELLO

1 SEE

Of course, the next step is to strip off affixes so that you get a table of frequencies of stems rather than words. If you want to experiment with this, you might want to try the Porter Stemmer, described in Appendix B of J & M. I suggest it because, although it's not perfect, it does not require a lexicon. You might also want to try a modified stemmer in which you only strip off grammatical suffixes like -s and -ed but leave semantic affixes like dis- and -ation.
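
If you experiment with the modified-stemmer idea, a very crude Python sketch (not the Porter algorithm, and not meant as the required approach) might look like the following; the suffix list and the minimum-stem-length check are arbitrary choices you would want to refine:

import re
from collections import Counter

GRAMMATICAL_SUFFIXES = ["ing", "ed", "es", "s"]   # try longer suffixes first

def strip_grammatical(word):
    # Strip one grammatical suffix, but leave semantic affixes (dis-, -ation)
    # alone and refuse to shorten a word below three letters.
    for suf in GRAMMATICAL_SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:-len(suf)]
    return word

def stem_frequencies(text):
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(strip_grammatical(w) for w in words)

print(stem_frequencies("The cats walked while the dog walks; walking is healthy."))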

5) Plot the data you got in problem 4. Do the frequencies follow Zipf's Law?
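
One simple check is a log-log plot of frequency against rank; Zipf's Law predicts a roughly straight line with slope near -1. Here is a sketch using matplotlib, assuming you saved the output of the pipeline above (lines of the form "count word") to a file called freqs.txt (a hypothetical name):

import matplotlib.pyplot as plt

# Read "count word" lines as produced by uniq -c | sort -nr above.
counts = []
with open("freqs.txt") as f:        # hypothetical file name
    for line in f:
        parts = line.split()
        if len(parts) == 2:
            counts.append(int(parts[0]))

ranks = range(1, len(counts) + 1)

# Zipf's Law: frequency is proportional to 1/rank, so frequency vs. rank
# on log-log axes should be close to a straight line with slope -1.
plt.loglog(ranks, counts, marker=".", linestyle="none")
plt.xlabel("rank")
plt.ylabel("frequency")
plt.show()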