The best-performing NLP models to date are learned from large volumes
of manually annotated data. For tasks like part-of-speech tagging and grammatical
parsing, high performance can be achieved with plentiful supervised data. However,
such resources are extremely costly to produce, making them an unlikely option for
building NLP tools in under-resourced languages or domains.
This dissertation is concerned with reducing the annotation required to learn
NLP models, with the goal of opening up the range of domains and languages to
which NLP technologies may be applied. In this work, we explore learning from
a degree of supervision at or close to what could reasonably be collected from
annotators for a particular domain or language that currently has none. We show
that even a small amount of annotation input, such as can be collected in just a
few hours, can provide enormous advantages if we have learning algorithms that
can appropriately exploit it.
This work presents new algorithms, models, and approaches designed to
learn grammatical information from weak supervision. In particular, we look at
ways of intersecting a variety of forms of supervision in complementary ways,
thus lowering the overall annotation burden. Sources of information include
tag dictionaries, morphological analyzers, constituent bracketings, and partial tree
annotations, as well as unannotated corpora. For example, we present algorithms
that are able to combine faster-to-obtain type-level annotation with unannotated text
to remove the need for slower-to-obtain token-level annotation.
Much of this dissertation describes work on Combinatory Categorial Grammar
(CCG), a grammatical formalism notable for its use of structured, logic-backed
categories that describe how each word and constituent fits into the overall syntax
of the sentence. This work shows how linguistic universals intrinsic to the CCG
formalism itself can be encoded as Bayesian priors to improve learning.
PhD Thesis, Department of Computer Science, The University of Texas at Austin.