Department of Computer Science

Machine Learning Research Group

University of Texas at Austin Artificial Intelligence Lab

Publications: 2023

  1. SAGEViz: SchemA GEneration and Visualization
    [Details] [PDF]
    Sugam Devare, Mahnaz Koupaee, Gautham Gunapati, Sayontan Ghosh, Sai Vallurupalli, Yash Kumar Lal, Francis Ferraro, Nathanael Chambers, Greg Durrett, Raymond Mooney, Katrin Erk, Niranjan Balasubramanian
    In Empirical Methods in Natural Language Processing (EMNLP) Demo Track, December 2023.
    Schema induction involves creating a graph representation depicting how events unfold in a scenario. We present SAGEViz, an intuitive and modular tool that utilizes human-AI collaboration to create and update complex schema graphs efficiently, where multiple annotators (humans and models) can work simultaneously on a schema graph from any domain. The tool consists of two components: (1) a curation component powered by plug-and-play event language models to create and expand event sequences while human annotators validate and enrich the sequences to build complex hierarchical schemas, and (2) an easy-to-use visualization component to visualize schemas at varying levels of hierarchy. Using supervised and few-shot approaches, our event language models can continually predict relevant events starting from a seed event. We conduct a user study and show that users need less effort in terms of interaction steps with SAGEViz to generate schemas of better quality. We also include a video demonstrating the system.
    ML ID: 424
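
    A minimal sketch (hypothetical function names, not part of SAGEViz) of the human-AI curation loop summarized in the entry above: an event language model proposes successor events for a schema node, and a human annotator accepts or rejects each suggestion before it is added to the schema graph.

        # Hypothetical sketch: event_lm and annotator are assumed callables,
        # not actual SAGEViz components.
        def expand_node(schema, node, event_lm, annotator, k=5):
            """Add up to k model-proposed successor events that the annotator accepts."""
            for event in event_lm(node)[:k]:
                if annotator(event):
                    schema.setdefault(node, []).append(event)
            return schema

        # Toy usage with a stand-in model and an annotator that accepts everything.
        schema = {"plan a trip": []}
        fake_lm = lambda node: ["book flights", "reserve hotel", "pack bags"]
        schema = expand_node(schema, "plan a trip", fake_lm, annotator=lambda e: True)
        print(schema)  # {'plan a trip': ['book flights', 'reserve hotel', 'pack bags']}
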
  2. What is the Best Automated Metric for Text to Motion Generation?
    [Details] [PDF]
    Jordan Voas
    Master's Thesis, Department of Computer Science, UT Austin, Austin, TX, May 2023.
    There is growing interest in generating skeleton-based human motions from natural language descriptions. While most efforts have focused on developing better neural architectures for this task, there has been no significant work on determining the proper evaluation metric. Human evaluation is the ultimate accuracy measure for this task, and automated metrics should correlate well with human quality judgments. Since descriptions are compatible with many motions, determining the right metric is critical for evaluating and designing meaningful training losses for supervising generative models. This paper systematically studies which metrics best align with human evaluations and proposes new metrics that align even better. Our findings indicate that none of the metrics currently used for this task show even a moderate correlation with human judgments on a sample level. However, for assessing average model performance, commonly used metrics such as R-Precision and rarely used coordinate errors show strong correlations. Several recently developed metrics are not recommended due to their low correlation compared to alternatives. Additionally, we propose multiple novel metrics that exhibit improved correlation and potential for future use.
    ML ID: 423
  3. What is the Best Automated Metric for Text to Motion Generation?
    [Details] [PDF] [Slides (PPT)] [Video]
    Jordan Voas, Yili Wang, Qixing Huang, Raymond Mooney
    In ACM SIGGRAPH Asia, December 2023.
    There is growing interest in generating skeleton-based human motions from natural language descriptions. While most efforts have focused on developing better neural architectures for this task, there has been no significant work on determining the proper evaluation metric. Human evaluation is the ultimate accuracy measure for this task, and automated metrics should correlate well with human quality judgments. Since descriptions are compatible with many motions, determining the right metric is critical for evaluating and designing effective generative models. This paper systematically studies which metrics best align with human evaluations and proposes new metrics that align even better. Our findings indicate that none of the metrics currently used for this task show even a moderate correlation with human judgments on a sample level. However, for assessing average model performance, commonly used metrics such as R-Precision and less-used coordinate errors show strong correlations. Additionally, several recently developed metrics are not recommended due to their low correlation compared to alternatives. We also introduce a novel metric based on a multimodal BERT-like model, MoBERT, which offers strongly human-correlated sample-level evaluations while maintaining near-perfect model-level correlation. Our results demonstrate that this new metric exhibits extensive benefits over all current alternatives.
    ML ID: 422
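
    The two correlation levels discussed in the two entries above can be illustrated with a short sketch (toy data and model names, not the papers' code): sample-level correlation pools every (metric score, human rating) pair, while model-level correlation compares only per-model averages.

        # Toy illustration of sample-level vs. model-level metric correlation.
        import numpy as np
        from scipy.stats import spearmanr

        rng = np.random.default_rng(0)

        # Hypothetical data: 3 models, 50 generated motions each, with one human
        # quality rating and one automated metric score per sample.
        human = {m: rng.uniform(1, 5, size=50) for m in ["modelA", "modelB", "modelC"]}
        metric = {m: human[m] + rng.normal(0, 1.0, size=50) for m in human}  # noisy proxy

        # Sample-level correlation: pool every (metric, human) pair.
        sample_rho, _ = spearmanr(np.concatenate(list(metric.values())),
                                  np.concatenate(list(human.values())))

        # Model-level correlation: compare per-model averages only.
        model_rho, _ = spearmanr([metric[m].mean() for m in human],
                                 [human[m].mean() for m in human])

        print(f"sample-level Spearman rho = {sample_rho:.2f}")
        print(f"model-level  Spearman rho = {model_rho:.2f}")
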
  4. Explaining Competitive-Level Programming Solutions using LLMs
    [Details] [PDF] [Poster]
    Jierui Li, Szymon Tworkowski, Yingying Wu, Raymond Mooney
    In Association for Computational Linguistics (ACL), Natural Language Reasoning and Structured Explanations Workshop, July 2023.
    In this paper, we approach competitive-level programming problem-solving as a composite task of reasoning and code generation. We propose a novel method to automatically annotate natural language explanations to (problem, solution) pairs. We show that despite poor performance in solving competitive-level programming problems, state-of-the-art LLMs exhibit a strong capacity in describing and explaining solutions. Our explanation generation methodology can generate a structured solution explanation for the problem containing descriptions and analysis. To evaluate the quality of the annotated explanations, we examine their effectiveness in two aspects: 1) satisfying the human programming expert who authored the oracle solution, and 2) aiding LLMs in solving problems more effectively. The experimental results on the CodeContests dataset demonstrate that while GPT-3.5's and GPT-4's abilities in describing the solution are comparable, GPT-4 shows a better understanding of the key idea behind the solution.
    ML ID: 421
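
    A minimal sketch (hypothetical prompt and helper names, not the paper's pipeline) of the two steps described in the entry above: annotating a (problem, solution) pair with a structured explanation, then checking whether that explanation helps the model solve the problem.

        # `llm` is any callable mapping a prompt string to a completion string;
        # no particular API is assumed.
        EXPLAIN_TEMPLATE = """You are given a competitive programming problem and an
        accepted solution. Explain it in a structured way:
        1. Restate the problem in your own words.
        2. Describe the key idea / algorithm behind the solution.
        3. Give a step-by-step analysis of the code.

        Problem:
        {problem}

        Solution:
        {solution}
        """

        def generate_explanation(llm, problem: str, solution: str) -> str:
            """Annotate a (problem, solution) pair with a structured explanation."""
            return llm(EXPLAIN_TEMPLATE.format(problem=problem, solution=solution))

        def solve_with_hint(llm, problem: str, explanation: str) -> str:
            """Evaluate usefulness: ask the model to solve the problem again,
            this time with the generated explanation provided as a hint."""
            prompt = f"Solve this problem in Python.\n\nProblem:\n{problem}\n\nHint:\n{explanation}"
            return llm(prompt)
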
  5. Using Planning to Improve Semantic Parsing of Instructional Texts
    [Details] [PDF] [Slides (PDF)]
    Vanya Cohen, Raymond Mooney
    In Association for Computational Linguistics (ACL), Natural Language Reasoning and Structured Explanations Workshop, July 2023.
    We develop a symbolic planning-based decoder to improve the few-shot semantic parsing of instructional texts. The system takes long-form instructional texts as input and produces sequences of actions in a formal language that enable execution of the instructions. This task poses unique challenges since input texts may contain long context dependencies and ambiguous and domain-specific language. Valid semantic parses also require sequences of steps that constitute an executable plan. We build on recent progress in semantic parsing by leveraging large language models to learn parsers from small amounts of training data. During decoding, our method employs planning methods and domain information to rank and correct candidate parses. To validate our method, we evaluate on four domains: two household instruction-following domains and two cooking recipe interpretation domains. We present results for few-shot semantic parsing using leave-one-out cross-validation. We show that utilizing planning domain information improves the quality of generated plans. Through ablations we also explore the effects of our decoder design choices.
    ML ID: 420
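
    The reranking step described in the entry above can be sketched with a toy planning domain (hypothetical actions, not the paper's evaluation domains): candidate action sequences produced by the language model are ordered by how many of their steps can actually be executed in sequence.

        # Toy household domain: each action lists preconditions and effects on a set-valued state.
        DOMAIN = {
            "open_fridge":  {"pre": set(),           "add": {"fridge_open"},  "del": set()},
            "take_milk":    {"pre": {"fridge_open"}, "add": {"holding_milk"}, "del": set()},
            "close_fridge": {"pre": {"fridge_open"}, "add": set(),            "del": {"fridge_open"}},
        }

        def executable_prefix(plan, state=frozenset()):
            """Return how many consecutive actions can be executed from the start."""
            state = set(state)
            for i, action in enumerate(plan):
                spec = DOMAIN.get(action)
                if spec is None or not spec["pre"] <= state:
                    return i
                state |= spec["add"]
                state -= spec["del"]
            return len(plan)

        def rank_candidates(candidates):
            """Prefer candidates with the longest executable prefix (ties keep model order)."""
            return sorted(candidates, key=lambda plan: -executable_prefix(plan))

        candidates = [
            ["take_milk", "open_fridge"],                  # invalid: milk taken before fridge is open
            ["open_fridge", "take_milk", "close_fridge"],  # fully executable
        ]
        print(rank_candidates(candidates)[0])  # ['open_fridge', 'take_milk', 'close_fridge']
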
  6. Text-to-SQL Error Correction with Language Models of Code
    [Details] [PDF] [Poster]
    Ziru Chen, Shijie Chen, Michael White, Raymond Mooney, Ali Payani, Jayanth Srinivasa, Yu Su, Huan Sun
    In Proceedings of the Association for Computational Linguistics (ACL), July 2023.
    Despite recent progress in text-to-SQL parsing, current semantic parsers are still not accurate enough for practical use. In this paper, we investigate how to build automatic text-to-SQL error correction models. Noticing that token-level edits are out of context and sometimes ambiguous, we propose building clause-level edit models instead. In addition, while most language models of code are not specifically pre-trained for SQL, they know common data structures and their operations in programming languages such as Python. Thus, we propose a novel representation for SQL queries and their edits that adheres more closely to the pre-training corpora of language models of code. Our error correction model improves the exact set match accuracy of different parsers by 2.4–6.5 points and obtains up to a 4.3-point absolute improvement over two strong baselines.
    ML ID: 419
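
    A rough sketch of the clause-level edit idea from the entry above (illustration only, not the paper's representation or model): an erroneous SQL parse is split into clauses, and a correction replaces a whole clause rather than individual tokens.

        import re

        # Very rough clause splitter for flat SELECT queries (illustration only).
        KEYWORDS = ["SELECT", "FROM", "WHERE", "GROUP BY", "HAVING", "ORDER BY"]

        def split_clauses(sql: str) -> dict:
            parts = re.split("(" + "|".join(KEYWORDS) + ")", sql, flags=re.IGNORECASE)
            clauses, current = {}, None
            for part in parts:
                if part.strip().upper() in KEYWORDS:
                    current = part.strip().upper()
                    clauses[current] = ""
                elif current:
                    clauses[current] = part.strip()
            return clauses

        def apply_clause_edit(sql: str, clause: str, new_body: str) -> str:
            """Replace one whole clause of the query with a corrected version."""
            clauses = split_clauses(sql)
            clauses[clause] = new_body
            return " ".join(f"{k} {v}".strip() for k, v in clauses.items() if v)

        wrong = "SELECT name FROM singer WHERE age > 30 ORDER BY name"
        print(apply_clause_edit(wrong, "WHERE", "age >= 30"))
        # SELECT name FROM singer WHERE age >= 30 ORDER BY name
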
  7. Directly Optimizing Evaluation Metrics to Improve Text to Motion
    [Details] [PDF]
    Yili Wang
    Master's Thesis, Department of Computer Science, UT Austin, May 2023.
    There is a long-standing discrepancy between the training and testing processes of most generative models, including both text-to-text models like machine translation (MT) and multi-modal models like image captioning and text-to-motion generation. These models are usually trained to optimize a specific objective, such as the log-likelihood (MLE) in Seq2Seq models or the KL-divergence in variational autoencoder (VAE) models. However, they are tested using different evaluation metrics, such as the BLEU score and Fréchet Inception Distance (FID). This thesis aims to address this discrepancy for text-to-motion generation models by developing algorithms to directly optimize the target metric at training time. We explore three major techniques originally applied in natural language processing, namely reinforcement learning, contrastive learning, and differentiable metrics, and adapt them to the language-and-motion domain.
    ML ID: 418
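
    A minimal sketch of the reinforcement-learning technique mentioned in the entry above (toy generator and toy reward, not the thesis code): REINFORCE provides gradients to the generator even though the evaluation metric being optimized is itself non-differentiable.

        import torch
        import torch.nn as nn

        VOCAB, SEQ_LEN = 32, 8

        class TinyGenerator(nn.Module):
            """Stand-in for a motion generator: emits a short discrete sequence."""
            def __init__(self):
                super().__init__()
                self.logits = nn.Parameter(torch.zeros(SEQ_LEN, VOCAB))

            def sample(self):
                dist = torch.distributions.Categorical(logits=self.logits)
                tokens = dist.sample()                      # shape (SEQ_LEN,)
                return tokens, dist.log_prob(tokens).sum()  # sequence + total log-probability

        def metric_reward(tokens):
            """Placeholder for a non-differentiable metric (e.g. an FID-style or retrieval score)."""
            return (tokens == 0).float().mean()  # toy reward: fraction of zero tokens

        gen = TinyGenerator()
        opt = torch.optim.Adam(gen.parameters(), lr=0.1)
        baseline = 0.0
        for step in range(200):
            tokens, logp = gen.sample()
            reward = metric_reward(tokens).item()
            baseline = 0.9 * baseline + 0.1 * reward  # moving-average baseline
            loss = -(reward - baseline) * logp        # REINFORCE objective
            opt.zero_grad()
            loss.backward()
            opt.step()
        print("final reward:", metric_reward(gen.sample()[0]).item())
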
  8. Learning Deep Semantics for Test Completion
    [Details] [PDF] [Slides (PDF)]
    Pengyu Nie, Rahul Banerjee, Junyi Jessy Li, Raymond Mooney, Milos Gligoric
    In International Conference on Software Engineering (ICSE), May 2023.
    Writing tests is a time-consuming yet essential task during software development. We propose to leverage recent advances in deep learning for text and code generation to assist developers in writing tests. We formalize the novel task of test completion to automatically complete the next statement in a test method based on the context of prior statements and the code under test. We develop TeCo, a deep learning model using code semantics for test completion. The key insight underlying TeCo is that predicting the next statement in a test method requires reasoning about code execution, which is hard to do with only the syntax-level data that existing code completion models use. TeCo extracts and uses six kinds of code semantics data, including the execution result of prior statements and the execution context of the test method. To provide a testbed for this new task, as well as to evaluate TeCo, we collect a corpus of 130,934 test methods from 1,270 open-source Java projects. Our results show that TeCo achieves an exact-match accuracy of 18, which is 29 percent higher than the best baseline using syntax-level data only. When measuring the functional correctness of the generated next statement, TeCo can generate runnable code in 29 percent of the cases compared to 18 percent obtained by the best baseline. Moreover, TeCo is significantly better than prior work on test oracle generation.
    ML ID: 417
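
    A small sketch of the general idea behind the inputs described in the entry above (hypothetical field names, not TeCo's actual data or implementation): the model input combines syntax-level context with code-semantics signals such as execution-derived types, serialized into one sequence for a code model to complete.

        from dataclasses import dataclass, field

        @dataclass
        class TestContext:
            method_under_test: str                                  # signature/body of the code being tested
            prior_statements: list = field(default_factory=list)    # statements already in the test
            prior_exec_types: list = field(default_factory=list)    # runtime types of values set up so far
            setup_fields: list = field(default_factory=list)        # fields available to the test class

        def build_model_input(ctx: TestContext) -> str:
            """Serialize syntax plus semantics into one input string for a code model."""
            return " <sep> ".join([
                ctx.method_under_test,
                " ; ".join(ctx.prior_statements),
                " , ".join(ctx.prior_exec_types),
                " , ".join(ctx.setup_fields),
            ])

        ctx = TestContext(
            method_under_test="int add(int a, int b)",
            prior_statements=["Calculator c = new Calculator()"],
            prior_exec_types=["Calculator"],
        )
        print(build_model_input(ctx))  # fed to a model that predicts the next test statement
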
  9. Using Both Demonstrations and Language Instructions to Efficiently Learn Robotic Tasks
    [Details] [PDF] [Video]
    Albert Yu, Raymond J. Mooney
    In International Conference on Learning Representations, May 2023.
    Demonstrations and natural language instructions are two common ways to specify and teach robots novel tasks. However, for many complex tasks, a demonstration or language instruction alone contains ambiguities, preventing tasks from being specified clearly. In such cases, a combination of both a demonstration and an instruction more concisely and effectively conveys the task to the robot than either modality alone. To instantiate this problem setting, we train a single multi-task policy on a few hundred challenging robotic pick-and-place tasks and propose DeL-TaCo (Joint Demo-Language Task Conditioning), a method for conditioning a robotic policy on task embeddings comprised of two components: a visual demonstration and a language instruction. By allowing these two modalities to mutually disambiguate and clarify each other during novel task specification, DeL-TaCo (1) substantially decreases the teacher effort needed to specify a new task and (2) achieves better generalization performance on novel objects and instructions over previous task-conditioning methods. To our knowledge, this is the first work to show that simultaneously conditioning a multi-task robotic manipulation policy on both demonstration and language embeddings improves sample efficiency and generalization over conditioning on either modality alone. See additional materials at https://sites.google.com/view/del-taco-learning
    ML ID: 408
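
    A minimal sketch of the joint conditioning idea from the entry above (toy dimensions and stand-in encoders, not the DeL-TaCo architecture): the policy is conditioned on a task embedding built from both a demonstration encoding and a language encoding, so the two modalities can disambiguate each other.

        import torch
        import torch.nn as nn

        class JointTaskPolicy(nn.Module):
            def __init__(self, obs_dim=64, demo_dim=128, lang_dim=128, act_dim=7):
                super().__init__()
                self.demo_enc = nn.Linear(demo_dim, 64)   # stand-in demonstration encoder
                self.lang_enc = nn.Linear(lang_dim, 64)   # stand-in language encoder
                self.policy = nn.Sequential(
                    nn.Linear(obs_dim + 64 + 64, 256), nn.ReLU(), nn.Linear(256, act_dim))

            def forward(self, obs, demo_feat, lang_feat):
                # Task embedding = concatenation of both modality encodings.
                task = torch.cat([self.demo_enc(demo_feat), self.lang_enc(lang_feat)], dim=-1)
                return self.policy(torch.cat([obs, task], dim=-1))

        policy = JointTaskPolicy()
        action = policy(torch.randn(1, 64), torch.randn(1, 128), torch.randn(1, 128))
        print(action.shape)  # torch.Size([1, 7])
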
  10. "Female Astronaut: Because sandwiches won't make themselves up there!": Towards multi-modal misogyny detection in memes
    [Details] [PDF]
    Smriti Singh, Amritha Haridasan, Raymond Mooney
    In Association for Computational Linguistics (ACL), Workshop on Online Abuse and Harms (WOAH), July 2023.
    A rise in the circulation of memes has led to the spread of a new form of multimodal hateful content. Unfortunately, women receive a disproportionate share of the hate circulated online. This, combined with the fact that multimodal misogyny is more challenging to detect than traditional text-based misogyny, means that identifying misogynistic memes online is a task of utmost importance. To this end, the MAMI dataset was released, consisting of 12,000 memes annotated for misogyny and four sub-classes of misogyny: shame, objectification, violence, and stereotype. While this balanced dataset is widely cited, we find that the task itself remains largely unsolved. Thus, in our work, we investigate the performance of multiple models on this task, analyze why even state-of-the-art models find it so challenging, and examine whether domain-specific pretraining can help. Our results show that pretraining BERT on hateful memes and leveraging an attention-based approach with ViT outperforms state-of-the-art models by more than 10 percent. Further, we provide insight into why these models may be struggling with this task through an extensive qualitative analysis of random samples from the test set.
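
    A minimal sketch of an attention-based text-image fusion model of the kind described in the entry above (toy feature encoders and dimensions, not the paper's model): text token features attend to image patch features before a binary misogyny classification head.

        import torch
        import torch.nn as nn

        class MemeClassifier(nn.Module):
            def __init__(self, text_dim=768, img_dim=768, hid=256):
                super().__init__()
                # In the paper, a BERT variant encodes the caption and a ViT encodes the
                # image; here random features stand in for both encoders' outputs.
                self.text_proj = nn.Linear(text_dim, hid)
                self.img_proj = nn.Linear(img_dim, hid)
                self.cross_attn = nn.MultiheadAttention(hid, num_heads=4, batch_first=True)
                self.head = nn.Linear(hid, 2)  # misogynistic vs. not

            def forward(self, text_tokens, img_patches):
                q = self.text_proj(text_tokens)        # (B, T, hid)
                kv = self.img_proj(img_patches)        # (B, P, hid)
                fused, _ = self.cross_attn(q, kv, kv)  # text attends to image patches
                return self.head(fused.mean(dim=1))    # pooled logits

        model = MemeClassifier()
        logits = model(torch.randn(2, 16, 768), torch.randn(2, 196, 768))
        print(logits.shape)  # torch.Size([2, 2])
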