Instructor: Greg Durrett, gdurrett@cs.utexas.edu
Lecture: Tuesday and Thursday 9:30am - 11:00am, Zoom
TA: Xi Ye
Office Hours: see main course webpage
Natural language processing (NLP) is a subfield of AI focused on solving problems that involve dealing with human language in a sophisticated way: these include information extraction, machine translation, automatic summarization, conversational dialogue, syntactic analysis, and many others. Much of the progress on these problems over the last 25 years has been driven by statistical machine learning and, more recently, deep learning. One distinctive feature of language compared to other types of data is its structured nature: modeling language involves understanding the linguistic phenomena it exhibits and grappling with it as a sequentially-structured, tree-structured, or graph-structured entity.
This class is intended to be a survey of modern NLP in two respects. First, it covers the main applications of NLP techniques today, both in academia and in industry, as well as enough linguistics to put these problems in context and understand their challenges. Second, it covers a range of models in structured prediction and deep learning, including classifiers, sequence models, statistical parsers, neural network encoders, encoder-decoder models, and more. We study the models themselves, examples of problems they are applied to, inference methods, parameter estimation (primarily supervised and semi-supervised / pre-trained), and optimization. Programming assignments involve building scalable machine learning systems for various NLP tasks and seeing how these models can be put into practice.
Prerequisites
Lectures are 9:30-11:00am Tuesday and Thursday held remotely on Zoom. A complete schedule of lectures and assignments, complete with readings, is on the main website page. The Zoom lectures will be recorded and made available later for students in the class to watch. We will not distribute these recorded lectures to anyone outside of the class, and you should not either, for privacy and copyright reasons.
Class Recordings: Class recordings are reserved only for students in this class for educational purposes and are protected under FERPA. The recordings should not be shared outside the class in any form. Violation of this restriction by a student could lead to Student Misconduct proceedings.
There is no required textbook for this course. Readings from book chapters and papers will be posted on the course website.
There are five assignments in the course: two "mini" assignments, two projects, and a final project. The timeline of these assignments is on the course calendar. Assignment specifications, code, and data will be made available on the course website and Canvas.
We will use two platforms for assignment submission: Gradescope and Canvas. You will submit written work on Gradescope and code on Canvas.
The mini assignments are designed to be relatively straightforward programming assignments. In each one, you will implement a simple system and run it on some data. The main goal is to gain familiarity with the techniques you'll be using in the following project and get accustomed to coding up ML systems that perform well.
Grading: Minis will be graded on a 0-100 scale. 80% of the grade is determined on the basis of the code/results and 20% is on the basis of a minimal writeup that describes what you did and reports results. Code performance requirements for getting full credit will be described in each assignment. The writeup should include a table/graph of your results and any accompanying description necessary to understand that table. For example, if you're comparing two different classification techniques, at least briefly specify what features the models consider, what optimization techniques you used, whether you used regularization, etc.
The projects are more substantial programming assignments. Each project centers around an NLP task on a standard dataset, with part of the project being an open-ended extension where you'll have options for exactly what you want to implement or explore further.
Grading: Projects are graded 0-100 as well. These weight the writeup more heavily, as well as the open-ended "extension" that you add to the project. Code performance requirements for getting full credit will be described in each assignment. Getting full credit on the extension requires going above and beyond: your extension should really demonstrate some improvement, or you should have some particularly insightful analysis. See the website for examples of projects with successful extensions.
Writeup: Your project writeup should be 2-3 pages (excluding references, if you have any). Your report should briefly restate the core problem and what you are doing, describe relevant details of your implementation, present results, describe your extension, and optionally discuss error cases addressed by your extension or describe how the system could be further improved. Your report should be written in the tone and style of an ACL/NIPS conference paper. Any format with reasonably small (1") margins is fine, including the ACL style files or any one- or two-column format with similar density.
The final project is an opportunity for open-ended exploration of concepts in the course. This project should constitute novel work beyond directly implementing concepts from lecture and should result in a report that roughly reads like an NLP/ML conference submission in terms of presentation and scope. You may work on the final project either individually or in groups of two; however, groups of two are preferred from the standpoint of enabling more substantial projects.
Proposal: You will write a brief proposal (around 1 page) explaining your idea, which the course staff will provide feedback on.
Writeup: Your final project report should be 4-8 pages---use your discretion about the length. Groups of two should have reports closer to 8 pages. The scope should be similar to that of an ACL paper: you should present a novel idea, discuss related work, describe your implementation or what you did, give results, and provide discussion or error analysis.
Presentation: You will give lightning talks (around 5 minutes) about your project on the last two class days.
Your final grade is computed based on the total points earned across all assignments. The final grade is mapped to a letter as follows, with grades on the boundary receiving the higher grade:
A | 100 - 93.3 |
A- | 93.3 - 90.0 |
B+ | 90.0 - 86.6 |
B | 86.6 - 83.3 |
B- | 83.3 - 80.0 |
C+ | 80.0 - 76.6 |
C | 76.6 - 73.3 |
C- | 73.3 - 70.0 |
D | 70 - 65 |
F | below 65 |
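As an illustration only, the mapping in the table above can be sketched in a few lines of Python. The names `GRADE_CUTOFFS` and `letter_grade` are hypothetical and not part of any course tooling; the cutoffs are copied from the table, and the `>=` comparison encodes the rule that a score on a boundary receives the higher grade.

```python
# Hypothetical sketch of the letter-grade table above (not official course code).
# Each entry is (lower bound of the range, letter); checked from highest to lowest.
GRADE_CUTOFFS = [
    (93.3, "A"), (90.0, "A-"), (86.6, "B+"), (83.3, "B"), (80.0, "B-"),
    (76.6, "C+"), (73.3, "C"), (70.0, "C-"), (65.0, "D"),
]


def letter_grade(score: float) -> str:
    """Map a final score on the 0-100 scale to a letter grade."""
    for lower_bound, letter in GRADE_CUTOFFS:
        # ">=" means a score exactly on a boundary gets the higher grade.
        if score >= lower_bound:
            return letter
    return "F"  # anything below 65
```

For example, `letter_grade(90.0)` falls on the A-/B+ boundary and returns the higher grade, `"A-"`.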
You are free to discuss the homework assignments with other students and work towards solutions together. However, all of the code you write must be your own! We will be using Moss, and any copied code will be treated as a violation of academic honesty and may result in a failing grade. In addition, your writeup must be entirely your own and your extension cannot duplicate those of your collaborators.
Projects will be submitted on Canvas. Submissions should include your writeup, a gzipped tar or zip file of code, and any requested system output (e.g., model predictions on the blind test set).
Each student is given 5 slip days to use throughout the term. Any number of these days can be applied to any mini or project excluding the final project to extend the deadline for that assignment. E.g., you can turn the first project in 2 days late and the second project 3 days late. After your slip days are exhausted, each day of lateness will incur a 20% penalty to that assignment's grade, so a project worth 20 points loses 4 points per day. Plan your slip day budget accordingly: you may want to save them up if you know you'll be traveling for a conference around a due date for a later project. Additional extensions may be granted in cases of medical or other emergencies, but must be agreed on with the course staff before the project's original due date.
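The late-penalty arithmetic above can be sketched as follows. This is a rough illustration under stated assumptions, not the course staff's actual grading procedure: the function name `late_penalty_points` and its parameters are hypothetical, lateness is counted in whole days, and the deduction is assumed to be capped at the assignment's total points (the policy above does not say what happens after five penalized days).

```python
def late_penalty_points(days_late: int, slip_days_applied: int,
                        points_possible: float) -> float:
    """Points lost on one mini or project under the late policy described above.

    Hypothetical helper: days covered by slip days are free, and every
    remaining late day costs 20% of the assignment's total points.
    """
    if days_late < 0 or slip_days_applied < 0:
        raise ValueError("day counts must be non-negative")
    penalized_days = max(0, days_late - slip_days_applied)
    penalty = 0.20 * points_possible * penalized_days
    # Assumption: an assignment cannot lose more points than it is worth.
    return min(points_possible, penalty)
```

With the numbers from the policy, a project worth 20 points turned in one day late with no slip days left gives `late_penalty_points(1, 0, 20)`, which is 4 points; the same project turned in 2 days late with 2 slip days applied loses nothing.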
Religious Holy Days: A student who cannot meet an assignment deadline due to the observance of a religious holy day may submit the assignment up to 24 hours late without penalty, if proper notice of the planned absence has been given. Notice must be given at least 14 days prior to the classes which will be missed. For religious holy days that fall within the first 2 weeks of the semester, notice should be given on the first day of the semester. Notice should be personally delivered to the instructor and signed and dated by the instructor, or emailed, in which case a student submitting email notification must receive email confirmation from the instructor.
The assignments are designed to be doable on personal computers (assuming you write your code efficiently!). However, for extensions and for your final project, you may wish to run longer experiments. We encourage you to do so using the department's Condor pool. An overview of Condor can be found here and some documentation can be found here.
Be aware that Condor may terminate your jobs if they are competing for resources; plan ahead for this if you choose to use Condor.
For the final project, the course will have an instructional allocation on TACC that you may use. More details about this are forthcoming.
Finally, another useful resource is Google Cloud Platform. When you create an account for the first time, you will receive free credits that are often sufficient to run experiments for the final project or other projects in this class, if you haven't created an account or exhausted these already.
Disabilities: Students with disabilities may request appropriate academic accommodations from the Division of Diversity and Community Engagement, Services for Students with Disabilities at 471-6259.
Diversity: It is our intent that students from all diverse backgrounds and perspectives be well served by this course, that students' learning needs be addressed both in and out of class, and that the diversity that students bring to this class be viewed as a resource, strength and benefit. It is our intent to present materials and activities that are respectful of diversity: gender, sexuality, disability, age, socioeconomic status, ethnicity, race, and culture. Your suggestions are encouraged and appreciated. Please let the course staff know of ways to improve the effectiveness of the course for you personally or for other students.
Furthermore, at times throughout the semester, we will discuss the broader cultural impact of machine learning, NLP, and language technology. I ask that students approach these topics seriously and recognize the power technology has to both support and undermine efforts to create a more inclusive society.