CS 378 Assignment 7: Natural Language Interfaces

Due: December 9, 2024.

Files:

cs378.clj Utility Functions
atn.clj Augmented Transition Net
db.clj Database Functions
gramcom.clj Grammar Compiler
gram.clj Simple Example Grammar and Small Database for Orders
restgram.clj Starter Restaurant Grammar
physgram.clj Starter Physics Grammar
restdbsmall.clj Smaller Restaurant Database
restdb.clj Restaurant Database
smalllex.clj Small Restaurant Lexicon
restlex.clj Restaurant Lexicon
restqueries.clj Example Restaurant Queries
restaurant.clj Loads restaurant files
restexample.txt Example queries and results for restaurants
physics.clj Loads physics files
physexample.txt Example queries and results for physics

Computer interfaces based on natural languages such as English can be used by non-programmers with little or no training. Such an interface can easily be constructed provided that the English coverage is limited to a narrow domain; this requirement is satisfied by database systems and by many application areas.

For this exercise, you are to construct a natural language interface using the semantic grammar technique. A semantic grammar differs from a general English grammar in that phrases in the grammar have meaning in the domain rather than being purely syntactic descriptions of English phrases. For example, a general English grammar would have categories such as "noun phrase", while a semantic grammar would have categories such as "date" or "part-name".

One thing that needs to be handled in the parser is the scanner pointer that points to the current position in the sentence. The position must be moved as the sentence is read, but it must be saved at strategic points to allow for backup in case of a parsing error. A set of simple functions to do this has been provided in the file atn.clj .

Lexicon

Lexicon entries have the form:

(category  (word ...  (word semantics) ...) )
Each of the listed words is a member of the category. It is okay to have redundancies, slang, or abbreviations (these are good because they are usually short and unambiguous.) If a word is specified as a list of two items, the first item is the word and the second item is the semantics or internal form, e.g. (february 2) or (shoes shoe) to reduce a plural form to the singular form.

Grammar Compiler

Parsing programs using the ATN functions can be written by hand, but there is a grammar compiler gramcom.clj that makes things easier. Enter (gramcom grammar) to compile each time you change your grammar.

A grammar rule has the form:

(nonterm -> (right-hand side items) semantics)
nonterm is a nonterminal symbol that is the left-hand side of the production; the rule says that the left-hand side nonterminal can be composed of the sequence of items on the right-hand side.

The allowable items on the right-hand side are:
wordexactly the specified word
(nonterminal)a grammar nonterminal, like a subroutine call to a sub-grammar.
(category)a member of the lexical category, e.g. (month)
(number)any number
(symbol)any symbol
?preceding item is optional
(separate ? from a word by a space)

The semantics is clojure code to be executed when the grammar rule is satisfied. The right-hand side items are available as variables $1, $2, $3, etc., similar to what is done in Yacc. For example, consider the rule:

 (loc      ->  (in (city))      (restrict 'city $2))
In this case, $2 refers to whatever matches the (city) part of the grammar rule, so that an input in berkeley would restrict the database records to those where the city field is berkeley.

If you use (def tracegram true), the grammar compiler will compile code to trace each grammar rule; this can help debugging.

Application Area Options

There are two options for the application used for this assignment:

Choose one of these applications for your assignment. A starter file is provided for each: restgram.clj or physgram.clj . Expand the language coverage of the grammar and the kinds of questions that can be answered.

Relational Database

The database to be used is provided in the file restdb.clj; this is a version of the same database used in restaurant query systems by Prof. Mooney. A good first step in developing your program is to write a grammar that will describe the legal sentences that may be used in querying this database. The grammar might be in the form of productions or an ATN. Note that the grammar only needs to represent sensible questions about this one particular database. Your grammar should meet two goals: avoidance of ambiguity by accepting only the "correct" parsing of any phrase, and generality in the sense that a variety of database queries can be handled by the grammar.

An example of a restaurant query is:

(ask '(where can i get american food in berkeley))
The file restqueries.clj gives some sample queries; you do not need to implement all of them, but you can use these queries for ideas for extending the grammar. The first 30 sample queries is enough for this assignment. Some natural language systems produce a parse tree as output; in this case, the desired output is a database query. The output from parsing each phrase will in general be a part of a query; by collecting the parts of the query, the parsing algorithm can output a completed query that can be executed by the database system. Use the supplied database system and database in the file restdb.clj to answer the questions; the file restlex.clj provides lexicon definitions. Note that phrases in the English question will in general do one of several things to the database query:
  1. Restrict the set of records applicable to the query (e.g., "in los-altos"). This is done using a semantics form such as (restrict 'city $2) or (restrictb '>= 'rating $3).
  2. Request information to be reported from the chosen records (e.g., "where ..."). This is done using a semantics form such as (retrieve 'street).
  3. Post-process the set of database records. A post-processing request is of the form:
    (postpr '(fn (quote $$)))
    
    where fn is the function to be called and $$ represents the answer to the query. For example, to answer the question How many ... the fn could be length. To answer ... the best ... , you could write a function to find the best-rated restaurant from the list of restaurants that were retrieved.
  4. Multiple actions can be combined using (do ... ) . For example, a postpr form would have to be preceded by some retrieve forms within a (do ... ).

A grammar compiler, gramcom.clj, is provided to compile a Yacc-like grammar into ATN form. A small grammar for the database application is restgram.clj You can turn on a trace as the database queries are added by entering (def tracedb true) .

To use the code, start Clojure and enter:

(load-file "cs378/restaurant.clj")
(load-files)
(gramcom grammar)
(ask '(where can i get american food in palo-alto))

Algebraic Solver

Numerical questions about problems that are governed by algebraic equations can be answered by a system based on sets of equations and a symbolic algebra solver. Examples of questions include:

(phys '(what is the area of a circle with radius = 2))

(phys '(what is the circumference of a circle with area = 12))

(phys '(what is the power of a lift with mass = 5 and height = 10
        and time = 4))

An example of a program of this type is at: http://www.cs.utexas.edu/users/novak/cgi/physdemod.cgi . (You are not expected to handle this many question types.)

For this option, you will need your files for the first three assignments.

To use the code, start Clojure and enter:

(load-file "cs378/physics.clj")
(load-files)
(gramcom grammar)
(phys '(what is the area of a circle with radius = 2))

Some additional physics equations can be found here:

http://www.cs.utexas.edu/users/novak/cgi/physlaws.lsp .

Note that there is an easy way to add units to the physics option for this assignment.

  1. Define the units as numbers: (meter 1.0) (inch 0.0254) (inches 0.0254) where the number is the conversion factor to convert the unit to the equivalent metric (SI) unit. See http://www.cs.utexas.edu/users/novak/cgi/unitsdemoty.cgi to get conversion factors.
  2. For a value with a unit, just multiply the two numbers. "5 inches" = (* 5 0.0254) = 0.127
  3. The answerphys function will accept a product of a variable and a number, e.g. "radius in inches" becomes (calculate (* radius 0.0254) ...)

Another interesting category of questions is:

(phys '(how does the force of gravitation vary with radius))

(phys '(how does the area of a circle vary if radius is doubled))
There is an easy trick for answering such questions:

  1. Find an equation that relates the variables in the question.
  2. Solve the equation for the variable in the question.
  3. Evaluate the rhs of that equation, with all variables set to 1.
  4. Evaluate the rhs of that equation, with all variables except the variable to be changed set to 1, and the changed variable (radius in the above examples) set to 2 or another value.
  5. Take the ratio of the two values: that is the answer in the second question, or implies the answer in the first question, e.g. 4 means square, 1/4 means inverse square, etc.

Grading Criteria:

Expand the coverage of your program to allow more complex queries than the examples given.

You might consider adding some logical or comparison expressions to the database language. Make sure your responses to questions give the information that a user would want, e.g. street address of a restaurant.

Add new physical principles, units, and different kinds of questions for the algebraic solver.

References:

Woods, W. A., "Transition Network Grammars for Natural Language Analysis", Communications of the ACM, Oct. 1970, pp. 591-606.

Hendrix, G. G., Sacerdoti, E. D., Sagalowicz, D., and Slocum, J., "Developing a Natural Language Interface to Complex Data", ACM Trans. on Database Systems, Vol. 3, pp. 105-147, 1978.