A project of the Knowledge Systems Group, UT-Austin
Learning by reading is one of the most challenging tasks in Artificial Intelligence. It subsumes at least two subtasks that are open problems in AI. The first subtask - natural language understanding - is difficult because of the ambiguity of language, which in part results from the speaker's intentional omission of information that the recipient is assumed to know. Researchers have made progress on isolated tasks in the syntactic and semantic analysis of utterances, but there have been few attempts to integrate them into a comprehensive system that builds machine-sensible representations of the content of text. The second subtask - knowledge integration - involves combining new information gleaned from individual sentences of texts, along with a priori knowledge, to form a comprehensive and computationally useful knowledge base. Researchers in knowledge engineering have studied some of the core issues, such as flexible matching of information structures and integrating new concepts into ontologies, but the source of new information has been knowledge engineers, or subject-matter experts trained as knowledge engineers, who perform knowledge integration manually.
Although these subtasks are undeniably difficult, combining them might simplify both. The knowledge integration task produces a knowledge base that might help natural language understanding, and natural language understanding might automatically produce new content for knowledge integration. In fact, if the two tasks are tightly coupled in a cycle, a learning by reading system might start with only general knowledge and a corpus of relevant texts and bootstrap itself to a state of domain expertise.
In collaboration with Ed Hovy, Jerry Hobbs and their NLP group at ISI, we have taken a first step toward building a learning by reading system: cobbling together a prototype, analyzing its performance and identifying the major obstacles to success. We built the prototype system by assembling three off-the-shelf systems for the tasks of parsing, semantic elaboration and knowledge integration. The system's starting knowledge is our Component Library, which contains formal representations of about 700 general concepts - such as the events Penetrate and Enter, and the entities Barrier and Container. We applied the system to the domain of heart biology, giving it numerous paragraphs on the structure and function of the human heart. The texts were unrestricted in their use of English, and were roughly at the level of Wikipedia articles. To help the system get started, we extended its general knowledge with ten concepts - such as Pump and Muscle - that are domain-general, but important to understanding heart texts.
By reading texts, the system attempts to learn a knowledge base of concept-relation-concept triples for the information conveyed by the text. In addition, it attempts to formulate hypotheses (also triples) for inferences it derives, but cannot confirm, from the text. To evaluate its performance, we compare the system's recall and precision with those of human readers, thereby establishing a performance baseline for evaluating future systems.
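As a rough illustration of this kind of evaluation, the sketch below scores a set of concept-relation-concept triples against a reference set. It is not the project's actual scoring procedure: the exact-match comparison, the relation names, the example triples and the function name `precision_recall` are all assumptions made for illustration.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """A concept-relation-concept triple (field names are illustrative)."""
    head: str
    relation: str
    tail: str


def precision_recall(predicted: set[Triple], reference: set[Triple]) -> tuple[float, float]:
    """Score predicted triples against a reference set.

    Assumption: a predicted triple counts as correct only if it matches a
    reference triple exactly. A real evaluation may use more flexible matching.
    """
    correct = len(predicted & reference)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical reference triples, e.g. as agreed on by human readers.
reference = {
    Triple("Heart", "is-a", "Pump"),
    Triple("Heart", "has-part", "Ventricle"),
    Triple("Blood", "enters", "Atrium"),
    Triple("Valve", "is-a", "Barrier"),
}

# Hypothetical system output: two correct triples and one incorrect one.
system = {
    Triple("Heart", "is-a", "Pump"),
    Triple("Heart", "has-part", "Ventricle"),
    Triple("Heart", "is-a", "Container"),
}

precision, recall = precision_recall(system, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The same function could be applied to a human reader's triples to obtain the comparison figures, provided both sets are scored against the same reference.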
With funding from DARPA and an expanded team of researchers - including Peter Clark and Ralph Weischedel - and project management by Noah Friedland and David Israel, we're now in Phase II of the project. Our group will continue to focus on the key research challenge: Knowledge Integration.