For the WASP Semantic Parser, Version 1.0
Yuk Wah Wong (ywwong@cs.utexas.edu), The University of Texas at Austin, August 2006.
WASP is a statistical learning algorithm for semantic parsing. Semantic parsing is the construction of complete, formal meaning representations given input sentences. A semantic parser is learned given a set of sentences annotated with their correct meaning representations.
Here is a sample meaning representation (MR) written in a meaning-representation language (MRL) called CLang, which is used for encoding coach advice given to simulated soccer-playing agents in the RoboCup domain:
((bowner our {4}) (do our {6} (pos (left (half our)))))
If our player 4 has the ball, then our player 6 should stay in the left side of our half.
The CLang predicate bowner means ball owner, our means our team, 4 means the uniform number 4, and so on. Details about CLang can be found in the RoboCup Server Manual. (Details about the additional predicates that we added to the CLang language, such as left and half, can be found on this page.) Given the above sample sentence, a semantic parser would construct the above MR in CLang.
WASP is a statistical approach to semantic parsing based on machine translation techniques. More specifically, a statistical word alignment model is used to acquire a bilingual lexicon consisting of NL substrings coupled with their translations in the target MRL. Complete MRs are then formed by combining these NL substrings and their translations under the synchronous parsing framework, which has been widely used in syntax-based machine translation (Wu, 1997; Yamada & Knight, 2001; Melamed, 2004; Chiang, 2005). The WASP algorithm is described in the following paper:
Yuk Wah Wong and Raymond J. Mooney. Learning for Semantic Parsing with Statistical Machine Translation. In Proceedings of the 2006 Human Language Technology Conference - North American Chapter of the Association for Computational Linguistics Annual Meeting (HLT/NAACL-2006), pp. 439-446, New York City, NY, June 2006.
This distribution contains the Java source code of WASP, along with sample corpora, configuration files, and example masks for the RoboCup and Geoquery domains.
All code is written in Java. To compile the code, you need JDK 1.4 or higher. In addition, training of word alignment models requires the GIZA++ package, which is available on Franz Josef Och's website. To evaluate formal queries in the Geoquery domain, you also need a copy of SICStus Prolog. The evaluation scripts have been tested on SICStus Prolog version 3.11.2.
First, you have to compile the Java code. All Java code is in the src directory. To compile the code, do the following:
cd src
find . -name '*.java' | xargs javac -O
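Since javac is invoked without a -d option, the compiled classes are placed alongside the sources in src. When running the programs described in the following sections, make sure src is on your classpath, for example:

java -cp src wasp.main.Trainer ...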
All programs are in the wasp.main and wasp.main.eval packages: wasp.main.Trainer trains a semantic parser, wasp.main.Parser does the parsing, and wasp.main.eval.ParserEvaluator evaluates the parser output.
The following sections will quickly walk you through the training, testing and evaluation stages.
The program that trains the semantic parser requires the following command-line arguments to run:
java wasp.main.Trainer config-file model-dir mask-file
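For example, a training run in the RoboCup domain might look like this (the model directory and mask file names are illustrative; example masks are explained below):

java -cp src wasp.main.Trainer data/robocup-clang/config my-model-dir train-mask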
For your convenience, we included two sample configuration files in this distribution, namely data/robocup-clang/config (for the RoboCup domain) and data/geo-funql/config (for the Geoquery domain). Let us take a look at the configuration file for RoboCup:
wasp.nl=en
wasp.mrl=robocup-clang
wasp.mrl.grammar=/usr/share/wasp/data/robocup-clang/mrl-grammar
wasp.corpus.file=/usr/share/wasp/data/robocup-clang/corpus.xml
wasp.align.model=giza++
wasp.align.giza++.exec=/usr/local/bin/GIZA++
wasp.translation.model=scfg
wasp.scfg.init=/usr/share/wasp/data/robocup-clang/scfg-init-rules
The configuration file is a Java property list, where each line is a key-value pair. The most important settings are at the top: wasp.nl is the input natural language (en for English), wasp.mrl is the target MRL (robocup-clang for CLang), wasp.mrl.grammar is the absolute pathname of the MRL grammar file, and wasp.corpus.file is the absolute pathname of the corpus file.
The rest of the settings specify the types of word alignment model and translation model to use. Some of the settings are model-specific. For example, if the current word alignment model is GIZA++ (giza++), then you have to specify the absolute pathname of the GIZA++ executable file (wasp.align.giza++.exec).
Apart from the configuration file, you also need to specify the training set (using the command-line argument mask-file [2]). For this, you need an example mask. An example mask is a text file that contains a list of example IDs (one ID per line). All examples in a corpus have an example ID. For your convenience, we included a set of example masks for each application domain, namely data/robocup-clang/split-* and data/geo-funql/split-*. These masks can be used to generate learning curves using 10x10-fold cross validation. (The 10-fold cross validation experiments in Kate et al. (2005) and Wong & Mooney (2006) are based on the data splits in data/robocup-clang/split-*/run-0 and data/geo-funql/split-*/run-0.)
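For instance, an example mask that selects five (hypothetical) examples is just five IDs, one per line:

0
1
4
7
9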
The Trainer program writes the learned translation model to the output directory (the command-line argument model-dir). It also writes a message log to the standard error stream, which can be captured for detailed error analysis.
The program that does semantic parsing based on a previously-learned translation model accepts command-line arguments similar to those of the trainer:
java wasp.main.Parser config-file model-dir mask-file output-file
The configuration file is the same as before. However, there is a parser-specific setting that you need to specify:
wasp.kbest=1
This specifies the maximum number of top-scoring parses to return. Set this to 1 for Viterbi decoding. This setting is ignored by the trainer and the parser evaluator.
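For example, to make the parser return up to 10 top-scoring parses per sentence, set:

wasp.kbest=10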
The output XML file follows roughly the same format as the corpus file. Instead of gold-standard MRs, it contains the top-scoring translations given by the parser. Some examples may not have any translations at all. Like the trainer, the Parser program writes a message log to the standard error stream.
The ParserEvaluator program reads the output of the parser, compares the top-scoring parse of each example against the gold-standard MR, and generates statistics using various evaluation metrics:
java wasp.main.eval.ParserEvaluator config-file output-file input-file-1 input-file-2 ...
The configuration file is the same as before. The input XML files are the output of the Parser program. If there is more than one input file, then each one should be produced by a separately-trained semantic parser; for example, there would be one input file for each fold in the 10-fold cross validation setting. Macro-averaging is used to obtain average statistics across all folds.
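For example, to macro-average over the output of a 10-fold cross validation run (file names are illustrative):

java -cp src wasp.main.eval.ParserEvaluator data/robocup-clang/config eval-stats parses-fold-0.xml ... parses-fold-9.xml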
The parser evaluator generates the following statistics: precision (the percentage of parses that are correct), recall (the percentage of examples for which a correct parse is produced), and F-measure (the harmonic mean of precision and recall).
The definition of correctness is domain-specific. For example, in the RoboCup domain, an MR translation is correct if it exactly matches the gold-standard MR, up to permutation of arguments of certain predicates (e.g. and and or).
The output of the parser evaluator is a plain text file with the following format. It starts with the main section, where the gold-standard MRs and the automatically-generated parses are listed; incorrect parses are preceded by asterisks (*). After the main section come the statistics, with a section for each evaluation metric. Here is a sample section for the precision metric:
begin precision
mean 0.8636014401946894
95%-confidence-interval 0.8173097234842457 0.9098931569051332
end precision
The first line begins the section for precision, and the last line ends it. The second line is the mean precision across all input files. The third line is the range of values within which the mean precision may lie with 95% confidence. The sections for recall and F-measure follow the same format.
There is also a separate section for the precision-recall (PR) curve. A PR curve is obtained by varying the score threshold. In a PR curve, the x axis is the recall level, and the y axis is the mean precision. A PR curve is represented as a list of (x, y) pairs.
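For illustration, such a section might look like the following; the section name and the values here are made up, but the begin/end convention matches the other sections:

begin precision-recall-curve
0.1 0.9846
0.2 0.9712
...
end precision-recall-curve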
Why stick with the sample data sets that are given to you? Create your own data set in a new application domain! This section explains everything about extending WASP to your exciting new application.
Here are the things you need to do to create a new application domain: (1) write an MRL grammar file; (2) build a corpus; (3) implement a string tokenizer for your MRL; (4) define the notion of MR correctness in your domain; and (5) implement the remaining domain-specific methods and link your MRLGrammar subclass to a new language tag.
We will cover these tasks in the following subsections.
Recall that WASP requires the target MRL to have an unambiguous context-free grammar [1], and the MRL grammar file is a text file that contains all production rules of this grammar. Each line of this file represents a production rule. Here is a sample production rule:
*n:Action -> ({ ( pos *n:Region ) })
The token before the arrow (->) is the LHS nonterminal. Everything between the ({ and }) tokens is the RHS string of terminals and nonterminals. Tokens with the prefix *n: are nonterminals, so both *n:Action and *n:Region are nonterminals. On the other hand, (, pos and ) are terminal symbols. All symbols must be separated by whitespace; there must also be whitespace between the LHS nonterminal and the arrow, and so on.
Following the production rule, there can be a list of modifiers. One such modifier is zero-fertility. Production rules modified by zero-fertility will be forced to have zero fertility in the output word alignments. You can create your own modifiers. See Section 4.6.
There is also a special class of terminal symbols called wildcard symbols. A wildcard symbol is a disjunction of terminal symbols. For example, the wildcard symbol *t:Num is the disjunction of all terminal symbols that represent real numbers. The following production:
*n:Num -> ({ *t:Num })
is equivalent to the following set of productions, which obviously can be a very large set:
*n:Num -> ({ 0 })
*n:Num -> ({ 0.1 })
*n:Num -> ({ 0.01 })
...
For simplicity, whenever a wildcard symbol is used, it must be the only symbol in the RHS.
Currently, three wildcard symbols are defined: *t:Num (for real numbers), *t:Unum (for uniform numbers) and *t:Ident (for CLang identifiers). The last two are specific to the RoboCup domain. To define your own wildcard symbols, modify the following classes: wasp.data.Dictionary and wasp.data.Terminal. See the Javadoc for more information.
The corpus is stored in an XML file. Note that the file must be valid XML! A minimal corpus file should look like this:
<?xml version="1.0"?>
<examples>
<example id="0">
<nl lang="en">
...
</nl>
<mrl lang="robocup-clang">
...
</mrl>
</example>
...
</examples>
Each example has a unique ID, which is a non-negative integer. Every sentence has a language tag (e.g. en), and so does every MR (e.g. robocup-clang). These tags correspond to those specified in the configuration file (wasp.nl and wasp.mrl, respectively).
The corpora for RoboCup and Geoquery are a lot more verbose. But the tags shown above are the only essential ones. (See Section 5.1 for information about other tags.)
Every MRL has its own string tokenizer. You provide one by extending the class wasp.mrl.MRLGrammar:
public class MyMRLGrammar extends MRLGrammar {
public Symbol[] tokenize(String str) {
...
}
public String combine(Symbol[] syms) {
...
}
...
}
The tokenize method tokenizes an MRL string. This method returns an array of symbols. Symbols are created using the read(String token) method of the class wasp.data.Symbol. The combine method is simply the inverse of tokenize.
The classes wasp.domain.RoboCupCLangGrammar and wasp.domain.GeoFunqlGrammar implement the RoboCup and Geoquery tokenizers.
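For illustration, here is a minimal sketch of a tokenizer for an MRL whose tokens are whitespace-delimited. It assumes that Symbol.read is a static factory method and that a symbol's string form is the original token; consult the Javadoc for the actual contracts:

public class MyMRLGrammar extends MRLGrammar {
    // A minimal sketch: assumes MRL tokens are separated by whitespace.
    public Symbol[] tokenize(String str) {
        String[] tokens = str.trim().split("\\s+");
        Symbol[] syms = new Symbol[tokens.length];
        for (int i = 0; i < tokens.length; i++)
            syms[i] = Symbol.read(tokens[i]);  // create a symbol from each token
        return syms;
    }
    // The inverse of tokenize: join the symbols' string forms with spaces.
    public String combine(Symbol[] syms) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < syms.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(syms[i]);  // assumes toString() gives back the token
        }
        return sb.toString();
    }
    // ... other required methods (evaluate, getStart, countNonterms) omitted
}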
The definition of MR correctness is domain-specific. For example, in the RoboCup domain, an MR is correct if it exactly matches the gold-standard MR. In the Geoquery domain, a query is correct if it retrieves the same answer from the U.S. geography database as the gold-standard query does.
Again, you define the notion of MR correctness in your domain by extending the class wasp.mrl.MRLGrammar:
public class MyMRLGrammar extends MRLGrammar {
...
public boolean[][] evaluate(Examples examples, Examples gold) {
...
}
...
}
The evaluate method takes two sets of examples, one containing the MR translations to evaluate (examples) and the other containing the gold-standard MRs (gold), and returns a list of Boolean arrays such that the i-th element of the j-th array is true if and only if the i-th top-scoring parse of the j-th example in examples is correct. This method allows for batch processing of MRs, which is useful because processing each MR individually can be very inefficient (e.g. in the Geoquery domain).
Currently, the only translation model supported is based on synchronous context-free grammars (SCFG). Here is a sample SCFG rule:
*n:Directive -> ({ position *n:Player#1 *n:Region#2 })({ ( do *n:Player#1 ( pos *n:Region#2 ) ) })
The token before the arrow (->) is the LHS nonterminal. Between the ({ and })({ tokens is the NL portion of the RHS string. Between the })({ and }) tokens is the MRL portion of the RHS string. All nonterminals on the RHS are indexed, as indicated by the non-negative integers following the pound signs (#). For every nonterminal in the NL portion of the RHS string, there is exactly one identical nonterminal in the MRL portion that has the same index. The same is true for wildcard symbols.
Given an MRL grammar, a set of initial rules is always created and added to the learned SCFG.
It is sometimes useful to provide additional initial rules that are always included in the learned SCFG. For example, initial rules can be used to represent a dictionary of foreign terms that refer to domain-specific entities:
*n:CityName -> ({ ausuchin })({ ' austin ' })
*n:CityName -> ({ nyuu yooku })({ ' new york ' })
...
Additional initial rules are stored in a text file. Each line of this file represents an SCFG rule. This file is used by the trainer, and it only accepts rules whose RHS symbols are all terminals, with no word gaps on the NL side. The absolute pathname of the text file is specified in the configuration file via the key wasp.scfg.init.
Following an SCFG rule, there can be a list of modifiers. One such modifier is tied-to rule, where rule is the representation of another rule with the same LHS nonterminal. The two rules are then said to have their parameters tied; parameter tying allows for more robust parameter estimation.
There are a few more methods of the class MRLGrammar that you have to implement:
public class MyMRLGrammar extends MRLGrammar {
...
public int getStart() {
...
}
public int countNonterms() {
...
}
protected void readModifiers(Production prod, String[] line, Int index) {
...
}
}
The getStart method returns the integer ID of the start symbol of the MRL grammar. You can get the integer ID using the method Dictionary.nonterm(String str). The countNonterms method returns the number of possible nonterminals in this grammar. The readModifiers method allows you to define your own MRL production modifiers (see the Javadoc for more information). If you do not want to define any modifiers, you can leave this method empty.
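As a rough illustration only (the start symbol name, the argument format expected by Dictionary.nonterm, and the nonterminal count below are all assumptions for a hypothetical grammar):

public class MyMRLGrammar extends MRLGrammar {
    ...
    public int getStart() {
        // assuming the start symbol is *n:Statement; whether Dictionary.nonterm
        // expects the bare name or the *n: prefix is an assumption here
        return Dictionary.nonterm("Statement");
    }
    public int countNonterms() {
        // the number of distinct nonterminals in this hypothetical grammar
        return 25;
    }
    protected void readModifiers(Production prod, String[] line, Int index) {
        // no custom modifiers defined, so this method is left empty
    }
}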
After defining your own subclass of MRLGrammar, you have to link it to a new language tag. Language tags are defined in the MRLGrammar.createNew() method. You should use the same tag in your corpus and your configuration file. Your configuration file should also point to the MRL grammar file, the corpus, and the initial SCFG rules that you have created. Once you have done that, you are all set!
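Conceptually, the linking is a dispatch on the MRL language tag inside createNew(); the sketch below is purely illustrative, and the actual signature and body in the distribution will differ:

// inside MRLGrammar.createNew(); here mrl is assumed to hold the
// value of the wasp.mrl property (a hypothetical variable name)
if (mrl.equals("my-mrl"))
    return new MyMRLGrammar();
...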
Instead of relying on GIZA++, the trainer can extract SCFG rules from gold-standard word alignments. Gold-standard word alignments are generally less noisy, and the resulting semantic parser is often much more precise, although word alignments are more time-consuming to annotate than plain MRs.
The RoboCup and Geoquery corpora included in this distribution come with gold-standard word alignments. These alignments are derived from previous annotations by Ge & Mooney (2005).
The gold-standard word alignments can be found inside the augsyn tags in the corpus XML files. The tags are called augsyn because the annotations were originally semantically-augmented parse trees (Ge & Mooney, 2005). For clarity, the annotations that we provide are degenerate parse trees that are only one level deep. (Full syntactic parses can be found inside the syn tags.)
Consistent with the use of MRL productions in WASP, semantic labels in our corpora are pointers to nodes in the gold-standard MR parse trees:
If-[Rule:1] the ball-[Condition:2] is-[Condition:2] in-[Condition:2] our-[Team:4] half-[Region:3] ...
Inside the square brackets ([ and ]) are the semantic labels. The number after each colon (:) is a node ID. It refers to the node with the same ID in the gold-standard MR parse tree [3]. The parse trees are made available inside the mrl-parse tags, where nodes are listed in top-down, leftmost order:
<mrl-parse>
<node id="0"> *n:Statement -> ({ *n:Rule }) </node>
<node id="1"> *n:Rule -> ({ ( *n:Condition *n:Directive ) }) </node>
<node id="2"> *n:Condition -> ({ ( bpos *n:Region ) }) </node>
<node id="3"> *n:Region -> ({ ( half *n:Team ) }) </node>
<node id="4"> *n:Team -> ({ our }) </node>
...
</mrl-parse>
In each semantic label, the part before the colon is the LHS nonterminal of the MRL production used in that node. This redundant information is useful for identifying annotation errors.
To use gold-standard word alignments during training, set the wasp.align.model property to gold-standard in the configuration file:
wasp.align.model=gold-standard
[1] In fact, the programs will not break down even if the MRL grammar is ambiguous. Given an MR, the MRL parser will pick the first parse that it can find. Although there is no guarantee which parse it will pick, it is guaranteed that, given the same MR, the same parse will always be picked.
[2] Some settings are specified in the configuration file, while others are specified on the command line. This is to make sure that the configuration file can be re-used during testing and evaluation.
[3] The MRL strings inside the mrl tags would be redundant when gold-standard word alignments are used during training: training would be based on the gold-standard MR parse trees, not the MRL strings.
Yuk Wah Wong (ywwong@cs.utexas.edu)