Distilling explicit chain-of-thought reasoning paths has emerged as an effective method for improving the reasoning abilities of large language models (LLMs) across various tasks. However, when tackling complex tasks that pose significant challenges for state-of-the-art models, this technique often struggles to produce effective chains of thought that lead to correct answers. In this work, we propose a novel approach to distill reasoning abilities from LLMs by leveraging their capacity to explain solutions. We apply our method to solving competitive-level programming challenges. More specifically, we employ an LLM to generate explanations for a set of (problem, solution) pairs, then use the resulting (problem, explanation) pairs to fine-tune a smaller language model, which we refer to as the Reasoner, to learn algorithmic reasoning that can generate "how-to-solve" hints for unseen problems. Our experiments demonstrate that learning from explanations enables the Reasoner to more effectively guide program implementation by a Coder, resulting in higher solve rates than strong chain-of-thought baselines on competitive-level programming problems. It also outperforms models that learn directly from (problem, solution) pairs. We curated an additional test set in the CodeContests format, which includes 246 more recent problems posted after the models' knowledge cutoff.
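As a rough sketch of the pipeline this abstract describes (not the paper's actual code), the snippet below builds (problem, explanation) fine-tuning data and then chains the Reasoner and Coder at inference time. The `complete` helper, the `reasoner`/`coder` callables, and the dict field names are placeholders for whatever LLM interface and dataset schema are actually used.

```python
def complete(prompt: str) -> str:
    """Stand-in for an LLM call; returns a canned string so the sketch runs."""
    return "<step-by-step explanation of the solution>"

def build_reasoner_training_data(pairs):
    """Turn (problem, solution) pairs into (problem, explanation) pairs.

    `pairs` is an iterable of dicts with 'statement' and 'solution' keys
    (hypothetical field names).
    """
    examples = []
    for p in pairs:
        prompt = (
            "Explain, step by step, how the following solution solves the problem.\n\n"
            f"Problem:\n{p['statement']}\n\nSolution program:\n{p['solution']}\n"
        )
        # The Reasoner is later fine-tuned to map statement -> explanation ("how-to-solve" hint).
        examples.append({"input": p["statement"], "target": complete(prompt)})
    return examples

def solve(problem_statement: str, reasoner, coder) -> str:
    """At test time, the Reasoner produces a hint and the Coder conditions on it."""
    hint = reasoner(problem_statement)  # fine-tuned smaller model
    return coder(f"{problem_statement}\n\nHint:\n{hint}\n\nWrite a program:")

data = build_reasoner_training_data(
    [{"statement": "Given n, print the n-th Fibonacci number.", "solution": "def f(n): ..."}]
)
print(data[0]["target"])
```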
ML ID: 427
In this paper, we approach competitive-level programming problem-solving as a composite task of reasoning and code generation. We propose a novel method to automatically annotate natural language explanations to (problem, solution) pairs. We show that despite poor performance in solving competitive-level programming problems, state-of-the-art LLMs exhibit a strong capacity in describing and explaining solutions. Our explanation generation methodology can generate a structured solution explanation for each problem, containing descriptions and analysis. To evaluate the quality of the annotated explanations, we examine their effectiveness in two aspects: 1) satisfying the human programming expert who authored the oracle solution, and 2) aiding LLMs in solving problems more effectively. The experimental results on the CodeContests dataset demonstrate that while GPT-3.5's and GPT-4's abilities in describing the solution are comparable, GPT-4 shows a better understanding of the key idea behind the solution.
ML ID: 421
Writing tests is a time-consuming yet essential task during software development. We propose to leverage recent advances in deep learning for text and code generation to assist developers in writing tests. We formalize the novel task of test completion to automatically complete the next statement in a test method based on the context of prior statements and the code under test. We develop TeCo, a deep learning model using code semantics for test completion. The key insight underlying TeCo is that predicting the next statement in a test method requires reasoning about code execution, which is hard to do with only the syntax-level data that existing code completion models use. TeCo extracts and uses six kinds of code semantics data, including the execution result of prior statements and the execution context of the test method. To provide a testbed for this new task, as well as to evaluate TeCo, we collect a corpus of 130,934 test methods from 1,270 open-source Java projects. Our results show that TeCo achieves an exact-match accuracy of 18 percent, which is 29 percent higher than the best baseline using syntax-level data only. When measuring the functional correctness of the generated next statement, TeCo can generate runnable code in 29 percent of the cases, compared to 18 percent obtained by the best baseline. Moreover, TeCo is significantly better than prior work on test oracle generation.
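A simplified illustration of feeding execution-level semantics to a test-completion model is sketched below. The field names, separator tokens, and reduced set of semantics channels are simplifications assumed for exposition; they are not TeCo's actual input format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TestCompletionExample:
    prior_statements: List[str]   # statements already written in the test method
    code_under_test: str          # the method being tested
    execution_result: str = ""    # e.g., runtime values after executing prior statements
    execution_context: str = ""   # e.g., types and fields visible at this point

def build_model_input(ex: TestCompletionExample) -> str:
    """Linearize syntax- and semantics-level context into one seq2seq input."""
    return " ".join([
        "<code_under_test> " + ex.code_under_test,
        "<prior> " + " ".join(ex.prior_statements),
        "<exec_result> " + ex.execution_result,
        "<exec_context> " + ex.execution_context,
    ])

ex = TestCompletionExample(
    prior_statements=["StringUtils util = new StringUtils();"],
    code_under_test="public static boolean isBlank(String s) { ... }",
    execution_result="util -> StringUtils instance",
    execution_context="locals: util",
)
print(build_model_input(ex))  # the model would predict the next statement, e.g., an assertion
```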
ML ID: 417
Automatically fixing software bugs is a challenging task. While recent work showed that natural language context is useful in guiding bug-fixing models, the approach required prompting developers to provide this context, which was simulated through commit messages written after the bug-fixing code changes were made. We instead propose using bug report discussions, which are available before the task is performed and are also naturally occurring, avoiding the need for any additional information from developers. For this, we augment standard bug-fixing datasets with bug report discussions. Using these newly compiled datasets, we demonstrate that various forms of natural language context derived from such discussions can aid bug-fixing, even leading to improved performance over using commit messages corresponding to the oracle bug-fixing commits.
ML ID: 414
Software projects are continually evolving, as developers incorporate changes to refactor code, support new functionality, and fix bugs. To uphold software quality amidst constant changes and also facilitate prompt implementation of critical changes, it is desirable to have automated tools for supporting and driving software evolution. In this thesis, we explore tasks and data and design machine learning approaches which leverage natural language to serve this purpose.
When developers make code changes, they sometimes fail to update the accompanying natural language comments documenting various aspects of the code, which can lead to confusion and vulnerability to bugs. We present our work on alerting developers of inconsistent comments upon code changes and suggesting updates by learning to correlate comments and code.
When a bug is reported, developers engage in a dialogue to collaboratively understand it and ultimately resolve it. While the solution is likely formulated within the discussion, it is often buried in a large amount of text, making it difficult to comprehend, which delays its implementation through the necessary repository changes. To guide developers in more easily absorbing information relevant towards making these changes and consequently expedite bug resolution, we investigate generating a concise natural language description of the solution by synthesizing relevant content as it emerges in the discussion. We benchmark models for generating solution descriptions and design a classifier for determining when sufficient context for generating an informative description becomes available. We investigate approaches for real-time generation, entailing separately trained and jointly trained classification and generation models. Furthermore, we also study techniques for deriving natural language context from bug report discussions and generated solution descriptions to guide models in generating suggested bug-resolving code changes.
ML ID: 409
When a software bug is reported, developers engage in a discussion to collaboratively resolve it. While the solution is likely formulated within the discussion, it is often buried in a large amount of text, making it difficult to comprehend and delaying its implementation. To expedite bug resolution, we propose generating a concise natural language description of the solution by synthesizing relevant content within the discussion, which encompasses both natural language and source code. We build a corpus for this task using a novel technique for obtaining noisy supervision from repository changes linked to bug reports, with which we establish benchmarks. We also design two systems for generating a description during an ongoing discussion by classifying when sufficient context for performing the task emerges in real-time. With automated and human evaluation, we find this task to form an ideal testbed for complex reasoning in long, bimodal dialogue context.
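The real-time setup can be pictured as the control flow below: a classifier watches the discussion as utterances arrive, and generation is triggered only once context is judged sufficient. Both models are trivial stand-ins here; this is a sketch of the loop, not the paper's systems.

```python
def has_sufficient_context(utterances) -> bool:
    """Stand-in for the learned classifier; here, a trivial length heuristic."""
    return len(utterances) >= 3

def generate_description(utterances) -> str:
    """Stand-in for the learned generation model."""
    return "Suggested solution: " + utterances[-1]

def monitor_discussion(utterance_stream):
    seen = []
    for utterance in utterance_stream:  # utterances arrive one at a time
        seen.append(utterance)
        if has_sufficient_context(seen):
            return generate_description(seen)
    return None  # discussion ended before enough context accumulated

print(monitor_discussion([
    "The app crashes when the config file is missing.",
    "Stack trace points to ConfigLoader.load().",
    "We should return a default config instead of throwing.",
]))
```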
ML ID: 404
There has been a growing interest in developing machine learning (ML) models for code summarization tasks, e.g., comment generation and method naming. Despite a substantial increase in the effectiveness of ML models, the evaluation methodologies, i.e., the way people split datasets into training, validation, and test sets, were not well studied. Specifically, no prior work on code summarization considered the timestamps of code and comments during evaluation. This may lead to evaluations that are inconsistent with the intended use cases. In this paper, we introduce the time-segmented evaluation methodology, which is novel to the code summarization research community, and compare it with the mixed-project and cross-project methodologies that have been commonly used. Each methodology can be mapped to some use cases, and the time-segmented methodology should be adopted in the evaluation of ML models for code summarization. To assess the impact of methodologies, we collect a dataset of (code, comment) pairs with timestamps to train and evaluate several recent ML models for code summarization. Our experiments show that different methodologies lead to conflicting evaluation results. We invite the community to expand the set of methodologies used in evaluations.
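A minimal sketch of the time-segmented split is shown below, assuming each example carries a timestamp; the cutoff dates and field names are illustrative. In contrast, mixed-project and cross-project splits ignore time and can let a model train on code written after the code it is tested on.

```python
from datetime import date

def time_segmented_split(examples, train_end: date, valid_end: date):
    """Each example is a dict with a 'timestamp' key (datetime.date)."""
    train = [e for e in examples if e["timestamp"] <= train_end]
    valid = [e for e in examples if train_end < e["timestamp"] <= valid_end]
    test = [e for e in examples if e["timestamp"] > valid_end]
    return train, valid, test

examples = [
    {"code": "int add(int a, int b) {...}", "comment": "Adds two ints.", "timestamp": date(2019, 5, 1)},
    {"code": "void save(File f) {...}", "comment": "Persists to disk.", "timestamp": date(2021, 2, 3)},
    {"code": "User find(long id) {...}", "comment": "Looks up a user.", "timestamp": date(2022, 8, 9)},
]
train, valid, test = time_segmented_split(examples, date(2020, 1, 1), date(2021, 12, 31))
print(len(train), len(valid), len(test))  # 1 1 1
```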
ML ID: 403
Software projects are continually evolving, as developers incorporate changes to refactor code, support new functionality, and fix bugs. To uphold software quality amidst constant changes and also facilitate the prompt implementation of critical changes, it is desirable to have automated tools for guiding developers in making methodical software changes. We explore tasks and data and design machine learning approaches which leverage natural language to serve this purpose. When developers make code changes, they sometimes fail to update the accompanying natural language comments documenting various aspects of the code, which can lead to confusion and vulnerability to bugs. We present our completed work on alerting developers of inconsistent comments upon code changes and suggesting updates by learning to correlate comments and code. When a bug is reported, developers engage in a dialogue to collaboratively understand it and ultimately resolve it. While the solution is likely formulated within the discussion, it is often buried in a large amount of text, making it difficult to comprehend, which delays its implementation through the necessary repository changes. To guide developers in more easily absorbing information relevant towards making these changes and consequently expedite bug resolution, we investigate generating a concise natural language description of the solution by synthesizing relevant content as it emerges in the discussion. In completed work, we benchmark models for generating solution descriptions and design a classifier for determining when sufficient context for generating an informative description becomes available. We also investigate a pipelined approach for real-time generation, entailing separate classification and generation models. For future work, we propose an improved classifier and also a more intricate system that is jointly trained on generation and classification. Next, we intend to study a system that can interactively generate natural language descriptions that can drive code changes. Finally, we plan to investigate how we can leverage the discussion context to also suggest concrete code changes for bug resolution.
ML ID: 399
Neural sequence-to-sequence models are finding increasing use in editing of documents, for example in correcting a text document or repairing source code. In this paper, we argue that common seq2seq models (with a facility to copy single tokens) are not a natural fit for such tasks, as they have to explicitly copy each unchanged token. We present an extension of seq2seq models capable of copying entire spans of the input to the output in one step, greatly reducing the number of decisions required during inference. This extension means that there are now many ways of generating the same output, which we handle by deriving a new objective for training and a variation of beam search for inference that explicitly handles this problem. In our experiments on a range of editing tasks of natural language and source code, we show that our new model consistently outperforms simpler baselines.
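To make the span-copy intuition concrete, the sketch below greedily decomposes a target sequence into copy-span and generate-token actions over the input, showing why far fewer decisions are needed than with single-token copying. It is only an illustration of the action space, not the paper's model or training objective, which explicitly handles the fact that many decompositions yield the same output.

```python
def longest_copy(inp, out, j):
    """Length of the longest input span matching out[j:] anywhere in inp."""
    best = 0
    for i in range(len(inp)):
        k = 0
        while i + k < len(inp) and j + k < len(out) and inp[i + k] == out[j + k]:
            k += 1
        best = max(best, k)
    return best

def decompose(inp, out):
    """Greedily emit COPY_SPAN / GEN actions that reproduce `out` from `inp`."""
    actions, j = [], 0
    while j < len(out):
        k = longest_copy(inp, out, j)
        if k > 1:
            actions.append(("COPY_SPAN", out[j:j + k]))
            j += k
        else:
            actions.append(("GEN", out[j]))
            j += 1
    return actions

src = "if ( x > 0 ) return x ;".split()
tgt = "if ( x >= 0 ) return x ;".split()
print(decompose(src, tgt))
# [('COPY_SPAN', ['if', '(', 'x']), ('GEN', '>='), ('COPY_SPAN', ['0', ')', 'return', 'x', ';'])]
```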
ML ID: 393
Natural language comments convey key aspects of source code such as implementation, usage, and pre- and post-conditions. Failure to update comments accordingly when the corresponding code is modified introduces inconsistencies, which is known to lead to confusion and software bugs. In this paper, we aim to detect whether a comment becomes inconsistent as a result of changes to the corresponding body of code, in order to catch potential inconsistencies just-in-time, i.e., before they are committed to a version control system. To achieve this, we develop a deep-learning approach that learns to correlate a comment with code changes. By evaluating on a large corpus of comment/code pairs spanning various comment types, we show that our model outperforms multiple baselines by significant margins. For extrinsic evaluation, we show the usefulness of our approach by combining it with a comment update model to build a more comprehensive automatic comment maintenance system that can both detect and resolve inconsistent comments based on code changes.
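One way to picture the combined detect-and-resolve system is the pre-commit check sketched below; the detector and updater are trivial stand-ins for the learned models, and the removed-identifier heuristic is only for illustration.

```python
def detect_inconsistent(comment: str, old_code: str, new_code: str) -> bool:
    """Stand-in detector: flag comments that mention identifiers removed by the change."""
    removed = set(old_code.split()) - set(new_code.split())
    return any(token.strip(".,") in removed for token in comment.split())

def suggest_update(comment: str, new_code: str) -> str:
    """Stand-in for the comment update model."""
    return comment + " (TODO: update to reflect code change)"

def precommit_check(changes):
    """changes: list of (comment, old_code, new_code) triples for modified methods."""
    suggestions = []
    for comment, old_code, new_code in changes:
        if detect_inconsistent(comment, old_code, new_code):
            suggestions.append((comment, suggest_update(comment, new_code)))
    return suggestions

print(precommit_check([
    ("Returns the cached value.", "return cached ;", "return computeFresh ( ) ;"),
]))
```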
ML ID: 391
We formulate the novel task of automatically updating an existing natural language comment based on changes in the body of code it accompanies. We propose an approach that learns to correlate changes across two distinct language representations, to generate a sequence of edits that are applied to the existing comment to reflect the source code modifications. We train and evaluate our model using a dataset that we collected from commit histories of open-source software projects, with each example consisting of a concurrent update to a method and its corresponding comment. We compare our approach against multiple baselines using both automatic metrics and human evaluation. Results reflect the challenge of this task and show that our model outperforms baselines with respect to making edits.
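To show what a sequence of edits applied to an existing comment can look like, the snippet below derives one with Python's difflib. In the paper the model predicts such edits conditioned on the code change, rather than diffing against an already-known updated comment, so this is only a visualization of the output space.

```python
import difflib

def comment_edits(old_comment: str, new_comment: str):
    """Represent the update as KEEP / DELETE / INSERT / REPLACE operations over tokens."""
    old, new = old_comment.split(), new_comment.split()
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=old, b=new).get_opcodes():
        if tag == "equal":
            ops.append(("KEEP", old[i1:i2]))
        elif tag == "delete":
            ops.append(("DELETE", old[i1:i2]))
        elif tag == "insert":
            ops.append(("INSERT", new[j1:j2]))
        else:  # replace
            ops.append(("REPLACE", old[i1:i2], new[j1:j2]))
    return ops

old = "Returns the user name as a string ."
new = "Returns the user email as a string ."
print(comment_edits(old, new))
# [('KEEP', ['Returns', 'the', 'user']), ('REPLACE', ['name'], ['email']), ('KEEP', ['as', 'a', 'string', '.'])]
```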
ML ID: 383
Comments are an integral part of software development; they are natural language descriptions associated with source code elements. Understanding explicit associations can be useful in improving code comprehensibility and maintaining the consistency between code and comments. As an initial step towards this larger goal, we address the task of associating entities in Javadoc comments with elements in Java source code. We propose an approach for automatically extracting supervised data using revision histories of open source projects and present a manually annotated evaluation dataset for this task. We develop a binary classifier and a sequence labeling model by crafting a rich feature set which encompasses various aspects of code, comments, and the relationships between them. Experiments show that our systems outperform several baselines learning from the proposed supervision.
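A toy, purely lexical version of the association task is sketched below using regular expressions; the paper instead trains a binary classifier and a sequence labeler over a rich feature set, so the heuristics and example identifiers here are assumptions for illustration.

```python
import re

def extract_comment_entities(javadoc: str):
    """Pull candidate code-like entities out of a Javadoc comment."""
    params = re.findall(r"@param\s+(\w+)", javadoc)
    links = re.findall(r"\{@link\s+([\w.#]+)\}", javadoc)
    camel = re.findall(r"\b[a-z]+[A-Z]\w*\b", javadoc)  # camelCase words
    return set(params + links + camel)

def associate(javadoc: str, code_elements):
    """Mark which comment entities lexically match an element of the documented code."""
    return {entity: (entity in code_elements) for entity in extract_comment_entities(javadoc)}

javadoc = """/** Copies srcFile to destDir.
 * @param srcFile the file to copy
 * @param destDir target directory, see {@link FileUtils#copy}
 */"""
code_elements = {"srcFile", "destDir", "FileUtils#copy", "copyInternal"}
print(associate(javadoc, code_elements))
```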
ML ID: 382
Natural language elements, e.g., todo comments, are frequently used to communicate among developers and to describe tasks that need to be performed (actions) when specific conditions hold on artifacts related to the code repository (triggers), e.g., from the Apache Struts project: “remove expectedJDK15 and if() after switching to Java 1.6”. As projects evolve, development processes change, and development teams reorganize, these comments, because of their informal nature, frequently become irrelevant or forgotten. We present the first framework, dubbed TrigIt, to specify trigger-action todo comments in executable format. Thus, actions are executed automatically when triggers evaluate to true. TrigIt specifications are written in the host language (e.g., Java) and are evaluated as part of the build process. The triggers are specified as query statements over abstract syntax trees, abstract representation of build configuration scripts, issue tracking systems, and system clock time. The actions are either notifications to developers or code transformation steps. We implemented TrigIt for the Java programming language and migrated 44 existing trigger-action comments from several popular open-source projects. Evaluation of TrigIt, via a user study, showed that users find TrigIt easy to learn and use. TrigIt has the potential to enforce more discipline in writing and maintaining comments in large code repositories.
ML ID: 377
This report discusses initial work on the AInix Platform. This platform is designed to allow developers to add natural language interfaces to Unix-like shell commands. This can be used with the aish shell, which allows users to intermix natural language with shell commands. We create a high-level way of specifying semantic parsing grammars and collect a dataset of basic shell commands. We experiment with seq2seq models, abstract syntax networks (ASNs), and embedded nearest-neighbor-based models. We find that the highest accuracy is achieved with seq2seq models and ASNs. While not as accurate, we find that when embedders are pretrained on large-scale code-related text, nearest-neighbor models can achieve decent performance.
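The embedded nearest-neighbor idea can be sketched as follows, with a bag-of-words embedding standing in for the pretrained embedders and a tiny hand-written training set; the utterances and commands are illustrative only.

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a pretrained encoder would be used in practice."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na, nb = sqrt(sum(v * v for v in a.values())), sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

train = [
    ("list all files including hidden ones", "ls -a"),
    ("show disk usage of the current directory", "du -sh ."),
    ("count the lines in a file", "wc -l FILE"),
]

def predict(query: str) -> str:
    """Return the command paired with the nearest training utterance."""
    return max(train, key=lambda ex: cosine(embed(query), embed(ex[0])))[1]

print(predict("show hidden files too"))  # -> "ls -a"
```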
ML ID: 372
Natural language elements (e.g., API comments, todo comments) form a substantial part of software repositories. While developers routinely use many natural language elements (e.g., todo comments) for communication, the semantic content of these elements is often neglected by software engineering techniques and tools. Additionally, as software evolves and development teams re-organize, these natural language elements are frequently forgotten, or just become outdated, imprecise and irrelevant. We envision several techniques, which combine natural language processing and program analysis, to help developers maintain their todo comments. Specifically, we propose techniques to synthesize code from comments, make comments executable, answer questions in comments, improve comment quality, and detect dangling comments.
ML ID: 358
Generating computer code from natural language descriptions has been a long-standing problem. Prior work in this domain has restricted itself to generating code in one shot from a single description. To overcome this limitation, we propose a system that can engage users in a dialog to clarify their intent until it has all the information to produce correct code. To evaluate the efficacy of dialog in code generation, we focus on synthesizing conditional statements in the form of IFTTT recipes.
ML ID: 353
Generating computer code from natural language descriptions has been a longstanding problem in computational linguistics. Prior work in this domain has restricted itself to generating code in one shot from a single description. To overcome this limitation, we propose a system that can engage users in a dialog to clarify their intent until it is confident that it has all the information to produce correct and complete code. Further, we demonstrate how the dialog conversations can be leveraged for continuous improvement of the dialog system. To evaluate the efficacy of dialog in code generation, we focus on synthesizing conditional statements in the form of IFTTT recipes. IFTTT (if-this-then-that) is a web-service that provides event-driven automation, enabling control of smart devices and web-applications based on user-defined events.
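The dialog loop can be sketched as below: the system fills the four fields of an IFTTT recipe (trigger channel, trigger, action channel, action) and asks about the first missing one until the recipe is complete. The keyword-based slot filler and the channel/function names are placeholders, not the paper's parser.

```python
SLOTS = ["trigger_channel", "trigger", "action_channel", "action"]

def extract_slots(utterance: str, recipe: dict) -> None:
    """Toy keyword-based slot filler standing in for a learned semantic parser."""
    rules = {
        "instagram": ("trigger_channel", "Instagram"),
        "new photo": ("trigger", "Any_new_photo_by_you"),
        "dropbox": ("action_channel", "Dropbox"),
        "save": ("action", "Add_file_from_URL"),
    }
    for keyword, (slot, value) in rules.items():
        if keyword in utterance.lower():
            recipe.setdefault(slot, value)

def dialog(user_turns):
    """Ask clarifying questions until all recipe slots are filled."""
    recipe = {}
    for utterance in user_turns:
        extract_slots(utterance, recipe)
        missing = [s for s in SLOTS if s not in recipe]
        if not missing:
            return recipe
        print(f"System: Could you tell me the {missing[0].replace('_', ' ')}?")
    return recipe

print(dialog([
    "When I post a new photo on Instagram, do something with it.",
    "Save it to my Dropbox.",
]))
```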
ML ID: 347
Digital personal assistants are becoming both more common and more useful. The major NLP challenge for personal assistants is machine understanding: translating natural language user commands into an executable representation. This paper focuses on understanding rules written as If-Then statements, though the techniques should be portable to other semantic parsing tasks. We view understanding as structure prediction and show improved models using both conventional techniques and neural network models. We also discuss various ways to improve generalization and reduce overfitting: synthetic training data from paraphrase, grammar combinations, feature selection, and ensembles of multiple systems. An ensemble of these techniques achieves a new state-of-the-art result with an 8% accuracy improvement.
ML ID: 332
Using natural language to write programs is a touchstone problem for computational linguistics. We present an approach that learns to map natural-language descriptions of simple "if-then" rules to executable code. By training and testing on a large corpus of naturally-occurring programs (called "recipes") and their natural language descriptions, we demonstrate the ability to effectively map language to code. We compare a number of semantic parsing approaches on the highly noisy training data collected from ordinary users, and find that loosely synchronous systems perform best.
ML ID: 317