Topics and Videos | Readings |
Week 1: Intro and Linear Classification |
Course Preview |
|
Introduction |
Note: this introduction video is from an older run of the class and references an outdated schedule. Please refer to the new course structure here.
|
Linear Binary Classification |
Eisenstein 2.0-2.5, 4.2-4.4.1
Perceptron and logistic regression
|
Sentiment Analysis and Basic Feature Extraction |
Eisenstein 4.1 |
Basics of Learning, Gradient Descent |
|
Perceptron |
|
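Note: as a companion to the perceptron videos, here is a minimal sketch of the binary perceptron in numpy; the toy data and function names are illustrative, not from the course materials.

```python
import numpy as np

def train_perceptron(X, y, epochs=10):
    """Binary perceptron: labels y are in {-1, +1}, rows of X are feature vectors."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            # Update only when the current weights make a mistake.
            if y_i * np.dot(w, x_i) <= 0:
                w += y_i * x_i
    return w

# Toy example: two linearly separable points.
X = np.array([[1.0, 2.0], [2.0, -1.0]])
y = np.array([1, -1])
w = train_perceptron(X, y)
print(np.sign(X @ w))  # [ 1. -1.]
```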
Perceptron as Minimizing Loss |
|
Logistic Regression |
Perceptron and LR connections |
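Note: the connection between the perceptron and logistic regression is easiest to see in the update rule. Here is a sketch of one SGD step on the logistic loss (hypothetical toy example, not course code).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lr_sgd_step(w, x, y, lr=0.1):
    """One SGD step on the logistic loss with label y in {0, 1}:
    the gradient of -log P(y|x) with respect to w is (sigmoid(w.x) - y) * x."""
    return w - lr * (sigmoid(np.dot(w, x)) - y) * x

w = lr_sgd_step(np.zeros(2), np.array([1.0, 2.0]), 1)
print(w)  # moves toward the positive example: [0.05 0.1]
```

Replacing the sigmoid with a hard 0/1 threshold recovers exactly the perceptron update, which is the connection these lectures draw.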
Sentiment Analysis |
Thumbs up? Sentiment Classification using Machine Learning Techniques Bo Pang et al., 2002
Baselines and Bigrams: Simple, Good Sentiment and Topic Classification Sida Wang and Christopher Manning, 2012
Convolutional Neural Networks for Sentence Classification Yoon Kim, 2014
[GitHub] NLP Progress on Sentiment Analysis
|
Optimization Basics |
|
Week 2: Multiclass and Neural Classification |
Multiclass Classification |
Eisenstein 4.2
Multiclass lecture note
|
Multiclass Perceptron and Logistic Regression |
|
Multiclass Classification Examples |
A large annotated corpus for learning natural language inference Sam Bowman et al., 2015
Authorship Attribution of Micro-Messages Roy Schwartz et al., 2013
|
Fairness in Classification |
50 Years of Test (Un)fairness: Lessons for Machine Learning Ben Hutchinson and Margaret Mitchell, 2018
[Article] Amazon scraps secret AI recruiting tool that showed bias against women
|
Neural Networks |
|
Neural Network Visualization |
[Blog] Neural Networks, Manifolds, and Topology Chris Olah |
Feedforward Neural Networks, Backpropagation |
Eisenstein 3.1-3.3 |
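Note: to make the backpropagation equations concrete, here is a minimal one-hidden-layer example with manually derived gradients; the shapes and data are made up for illustration.

```python
import numpy as np

# One hidden layer, tanh nonlinearity, squared error on a single example.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
x, y = np.array([1.0, -1.0]), np.array([0.5])

# Forward pass.
h = np.tanh(W1 @ x + b1)
y_hat = W2 @ h + b2
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backward pass: chain rule, layer by layer.
d_yhat = y_hat - y                      # dL/dy_hat
dW2 = np.outer(d_yhat, h)
db2 = d_yhat
dh = W2.T @ d_yhat                      # backprop through the linear layer
dz = dh * (1 - h ** 2)                  # tanh'(z) = 1 - tanh(z)^2
dW1 = np.outer(dz, x)
db1 = dz

# One gradient-descent step.
lr = 0.1
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
```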
Neural Net Implementation |
|
Neural Net Training, Optimization |
Dropout: a simple way to prevent neural networks from overfitting Nitish Srivastava et al., 2014
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift Sergey Ioffe and Christian Szegedy, 2015
Adam: A Method for Stochastic Optimization Durk Kingma and Jimmy Ba, 2015
The Marginal Value of Adaptive Gradient Methods in Machine Learning Ashia Wilson et al., 2017
|
Week 3: Word Embeddings |
Word Embeddings |
Eisenstein 14.5 |
Skip-gram |
Distributed Representations of Words and Phrases and their Compositionality Tomas Mikolov et al., 2013 |
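Note: a rough sketch of what skip-gram training looks like: extract (center, context) pairs from a window, then update the center vector with negative sampling. The update below only adjusts the center vector, for brevity, and all data is toy data.

```python
import numpy as np

def skipgram_pairs(tokens, window=2):
    """Yield (center, context) training pairs as in word2vec's skip-gram."""
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield center, tokens[j]

def sgns_step(v_c, u_o, u_negs, lr=0.025):
    """One negative-sampling update: push v_c toward the true context
    vector u_o and away from sampled negative vectors u_negs."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    grad_c = (sig(u_o @ v_c) - 1) * u_o + sum(sig(u @ v_c) * u for u in u_negs)
    return v_c - lr * grad_c

print(list(skipgram_pairs(["the", "cat", "sat"], window=1)))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```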
Other Word Embedding Methods |
A Scalable Hierarchical Distributed Language Model Andriy Mnih and Geoff Hinton, 2008
Neural Word Embedding as Implicit Matrix Factorization Omer Levy and Yoav Goldberg, 2014
GloVe: Global Vectors for Word Representation Jeffrey Pennington et al., 2014
Enriching Word Vectors with Subword Information Piotr Bojanowski et al., 2016
|
Bias in Word Embeddings |
Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings Tolga Bolukbasi et al., 2016
Black is to Criminal as Caucasian is to Police: Detecting and Removing Multiclass Bias in Word Embeddings Thomas Manzini et al., 2019
Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them Hila Gonen and Yoav Goldberg, 2019
|
Applying Embeddings, Deep Averaging Networks |
Deep Unordered Composition Rivals Syntactic Methods for Text Classification Mohit Iyyer et al., 2015 |
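Note: the deep averaging network of Iyyer et al. is simple to sketch: average the word embeddings, then apply feedforward layers. Below is an illustrative forward pass with assumed shapes, not the paper's exact configuration.

```python
import numpy as np

def dan_forward(embeddings, W1, b1, W2, b2):
    """Deep averaging network sketch: average the word vectors, then
    feed the average through feedforward layers (shapes are assumed)."""
    avg = embeddings.mean(axis=0)            # composition is just a mean
    h = np.tanh(W1 @ avg + b1)               # "deep" part: nonlinear layers
    return W2 @ h + b2                       # logits over classes

rng = np.random.default_rng(0)
E = rng.normal(size=(4, 50))                 # 4 words, 50-dim embeddings
logits = dan_forward(E, rng.normal(size=(64, 50)), np.zeros(64),
                     rng.normal(size=(2, 64)), np.zeros(2))
print(logits.shape)  # (2,)
```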
Week 4: Language Modeling and Self-Attention |
n-gram LMs |
Eisenstein 6.1 |
Smoothing in n-gram LMs |
Eisenstein 6.2 |
LM Evaluation |
Eisenstein 6.4 |
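Note: as a worked example tying the n-gram, smoothing, and evaluation lectures together, here is a sketch of an add-k smoothed bigram model with perplexity evaluation. The toy corpus and vocabulary handling are one illustrative choice, not the only one.

```python
import math
from collections import Counter

def train_bigram(corpus):
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        unigrams.update(toks[:-1])
        bigrams.update(zip(toks[:-1], toks[1:]))
    return unigrams, bigrams

def prob(w_prev, w, unigrams, bigrams, vocab_size, k=1.0):
    """Add-k smoothed bigram probability P(w | w_prev)."""
    return (bigrams[(w_prev, w)] + k) / (unigrams[w_prev] + k * vocab_size)

def perplexity(sent, unigrams, bigrams, vocab_size):
    toks = ["<s>"] + sent + ["</s>"]
    logp = sum(math.log(prob(p, w, unigrams, bigrams, vocab_size))
               for p, w in zip(toks[:-1], toks[1:]))
    return math.exp(-logp / (len(toks) - 1))

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
uni, bi = train_bigram(corpus)
vocab = len(set(w for s in corpus for w in s)) + 1   # +1 for </s>
print(perplexity(["the", "cat", "sat"], uni, bi, vocab))
```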
Neural Language Models |
|
RNNs and their Shortcomings |
Eisenstein 6.3
[Blog] Understanding LSTMs Chris Olah |
Attention |
Neural Machine Translation by Jointly Learning to Align and Translate Dzmitry Bahdanau et al., 2015 |
Self-Attention |
Attention Is All You Need Ashish Vaswani et al., 2017 |
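Note: the core self-attention computation from Vaswani et al. fits in a few lines. Here is a single-head, unbatched sketch in numpy with random weights for illustration.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X
    (one row per token); a sketch of the Vaswani et al. formulation."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])         # (len, len) attention logits
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ V                             # each token mixes all values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                       # 5 tokens, d_model=16
out = self_attention(X, *(rng.normal(size=(16, 16)) for _ in range(3)))
print(out.shape)  # (5, 16)
```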
Multi-Head Self-Attention |
Attention Is All You Need Ashish Vaswani et al., 2017
[Blog] The Illustrated Transformer Jay Alammar |
Position Encodings |
Attention Is All You Need Ashish Vaswani et al., 2017
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation Ofir Press et al., 2021
The Impact of Positional Encoding on Length Generalization in Transformers Amirhossein Kazemnejad et al., 2023
|
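Note: for reference, the fixed sinusoidal encodings from the original Transformer paper can be sketched as follows (even d_model assumed).

```python
import numpy as np

def sinusoidal_encodings(max_len, d_model):
    """Fixed sin/cos position encodings from 'Attention Is All You Need':
    PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe  # added to token embeddings before the first layer

print(sinusoidal_encodings(100, 64).shape)  # (100, 64)
```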
Week 5: Transformers and Decoding |
Transformer Architecture |
Attention Is All You Need Ashish Vaswani et al., 2017
|
Using Transformers |
|
Transformer Language Modeling |
|
Transformer Extensions |
Scaling Laws for Neural Language Models Jared Kaplan et al., 2020
Efficient Transformers: A Survey Yi Tay et al., 2020
Rethinking Attention with Performers Krzysztof Choromanski et al., 2021
Longformer: The Long-Document Transformer Iz Beltagy et al., 2020
|
Beam Search |
|
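Note: a generic sketch of beam search over log probabilities; the `step_scores` interface here is hypothetical, standing in for whatever model scores the next token.

```python
import math

def beam_search(step_scores, beam_size=2, length=3):
    """Beam search sketch: step_scores(prefix) returns a dict of
    {token: log-prob} for extending the prefix."""
    beams = [((), 0.0)]                      # (prefix, total log-prob)
    for _ in range(length):
        candidates = []
        for prefix, score in beams:
            for tok, logp in step_scores(prefix).items():
                candidates.append((prefix + (tok,), score + logp))
        # Keep only the beam_size highest-scoring prefixes.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

# Toy model: always prefers "a" slightly over "b".
toy = lambda prefix: {"a": math.log(0.6), "b": math.log(0.4)}
print(beam_search(toy))
```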
Nucleus Sampling |
The Curious Case of Neural Text Degeneration Ari Holtzman et al., 2019
|
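Note: nucleus (top-p) sampling from Holtzman et al. is a small modification to sampling from the full distribution; here is a sketch over a toy probability vector.

```python
import numpy as np

def nucleus_sample(probs, p=0.9, rng=np.random.default_rng(0)):
    """Top-p (nucleus) sampling: sample only from the smallest set of
    tokens whose cumulative probability exceeds p, renormalized."""
    order = np.argsort(probs)[::-1]            # tokens from most to least likely
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    nucleus = order[:cutoff]
    renorm = probs[nucleus] / probs[nucleus].sum()
    return rng.choice(nucleus, p=renorm)

probs = np.array([0.5, 0.3, 0.15, 0.05])
print(nucleus_sample(probs))  # samples token 0, 1, or 2; token 3 is cut off here
```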
Week 6: Pre-training, seq2seq LMs |
BERT: Masked Language Modeling |
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Jacob Devlin et al., 2019 |
BERT: Model and Applications |
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding Jacob Devlin et al., 2019
To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks Matthew Peters et al., 2019
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding Alex Wang et al., 2019
What Does BERT Look At? An Analysis of BERT's Attention Kevin Clark et al., 2019
RoBERTa: A Robustly Optimized BERT Pretraining Approach Yinhan Liu et al., 2019
|
Seq2seq Models |
|
BART |
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension Mike Lewis et al., 2019
|
T5 |
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer Colin Raffel et al., 2020
UnifiedQA: Crossing Format Boundaries With a Single QA System Daniel Khashabi et al., 2020
|
Word Piece and Byte Pair Encoding |
Neural Machine Translation of Rare Words with Subword Units Rico Sennrich et al., 2016
Byte Pair Encoding is Suboptimal for Language Model Pretraining Kaj Bostrom and Greg Durrett, 2020
|
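Note: the BPE merge-learning loop from Sennrich et al. can be sketched compactly. This version operates on whole words for simplicity and omits the end-of-word marker the paper uses.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Sketch of BPE merge learning: repeatedly merge the most frequent
    adjacent symbol pair across the (word, frequency) vocabulary."""
    vocab = Counter(tuple(w) for w in words)   # words as tuples of symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1]); i += 2
                else:
                    out.append(word[i]); i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        vocab = Counter(merged)
    return merges

print(learn_bpe(["low", "lower", "lowest", "low"], num_merges=3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```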
Week 7-8: Structured Prediction: Part-of-speech, Syntactic Parsing |
Note: this unit was previously presented as Week 4, right after classification, and a few videos refer to it as our first brush with structured models. In the current structure of the course, it is still our first exposure to models of linguistic structure, as opposed to surface-level sequential structure (i.e., token sequences in generation).
|
Part-of-Speech Tagging |
Eisenstein 8.1
|
Sequence Labeling, Tagging with Classifiers |
Eisenstein 7.1
|
Hidden Markov Models |
Eisenstein 7.4
|
HMMs: Parameter Estimation |
Eisenstein 7.4.1
|
HMMs: Viterbi Algorithm |
Eisenstein 7.3
|
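Note: a log-space sketch of Viterbi decoding for an HMM, with the usual dynamic program over tag scores and backpointers; the usage line runs a toy uniform model just to show the interface.

```python
import numpy as np

def viterbi(obs, log_pi, log_A, log_B):
    """Viterbi decoding for an HMM in log space: log_pi[t] is the initial
    tag score, log_A[s, t] the transition, log_B[t, o] the emission."""
    T, N = len(obs), len(log_pi)
    score = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)
    score[0] = log_pi + log_B[:, obs[0]]
    for i in range(1, T):
        for t in range(N):
            cand = score[i - 1] + log_A[:, t]
            back[i, t] = np.argmax(cand)
            score[i, t] = cand[back[i, t]] + log_B[t, obs[i]]
    # Follow backpointers from the best final tag.
    tags = [int(np.argmax(score[-1]))]
    for i in range(T - 1, 0, -1):
        tags.append(int(back[i, tags[-1]]))
    return tags[::-1]

# Two tags, two symbols, uniform toy model: just checks shapes and runs.
logu = np.log(np.full(2, 0.5))
print(viterbi([0, 1, 0], logu, np.log(np.full((2, 2), 0.5)),
              np.log(np.full((2, 2), 0.5))))
```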
HMMs for POS Tagging |
TnT - A Statistical Part-of-Speech Tagger Thorsten Brants, 2000
Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger Kristina Toutanova and Christopher Manning, 2000
Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics? Christopher Manning, 2011
Natural Language Processing with Small Feed-Forward Networks Jan Botha et al., 2017
|
Constituency Parsing |
Eisenstein 10.1-10.2
|
Probabilistic Context-Free Grammars |
Eisenstein 10.3-10.4
|
CKY Algorithm |
Eisenstein 10.3.1
|
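Note: a sketch of probabilistic (max-product) CKY over a grammar in Chomsky normal form; the dictionary encoding of the lexicon and rules is one illustrative choice.

```python
from collections import defaultdict

def cky(words, lexicon, rules):
    """Probabilistic CKY sketch for a CNF grammar. lexicon maps
    (tag, word) -> prob; rules maps (parent, left, right) -> prob.
    chart[(i, j)] holds the best probability of each nonterminal over words[i:j]."""
    n = len(words)
    chart = defaultdict(dict)
    for i, w in enumerate(words):
        for (tag, word), p in lexicon.items():
            if word == w:
                chart[(i, i + 1)][tag] = p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                 # split point
                for (par, l, r), p in rules.items():
                    if l in chart[(i, k)] and r in chart[(k, j)]:
                        score = p * chart[(i, k)][l] * chart[(k, j)][r]
                        if score > chart[(i, j)].get(par, 0.0):
                            chart[(i, j)][par] = score
    return chart[(0, n)].get("S", 0.0)

lexicon = {("NP", "dogs"): 1.0, ("V", "bark"): 1.0}
rules = {("S", "NP", "V"): 1.0}
print(cky(["dogs", "bark"], lexicon, rules))  # 1.0
```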
Refining Grammars |
Accurate Unlexicalized Parsing Dan Klein and Chris Manning, 2003
Eisenstein 10.5
|
Dependencies |
Eisenstein 11.1
Finding Optimal 1-Endpoint-Crossing Trees Emily Pitler et al., 2013 |
Transition-based Dependency Parsing |
Eisenstein 11.3
|
Week 9: Modern Large Language Models |
GPT-3 |
Language Models are Unsupervised Multitask Learners Alec Radford et al., 2019
Language Models are Few-Shot Learners Tom B. Brown et al., 2020
Llama 2: Open Foundation and Fine-Tuned Chat Models Hugo Touvron et al., 2023
Llama 2 is one of the latest models with publicly available weights (although it is not fully open-source, as many details of the training are not public).
|
Zero-shot Prompting |
Demystifying Prompts in Language Models via Perplexity Estimation Hila Gonen et al., 2022
|
Few-shot Prompting |
Calibrate Before Use: Improving Few-Shot Performance of Language Models Tony Z. Zhao et al., 2021
Holistic Evaluation of Language Models Percy Liang et al., 2022
Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? Sewon Min et al., 2022
|
Understanding ICL: Induction Heads |
In-context Learning and Induction Heads Catherine Olsson et al., 2022
|
Instruction Tuning |
Multitask Prompted Training Enables Zero-Shot Task Generalization Victor Sanh et al., 2021
Scaling Instruction-Finetuned Language Models Hyung Won Chung et al., 2022
|
Reinforcement Learning from Human Feedback (RLHF) |
Training language models to follow instructions with human feedback Long Ouyang et al., 2022
[Website] Stanford Alpaca: An Instruction-following LLaMA Model Rohan Taori et al., 2023
|
Factuality of LLMs |
Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation Yixin Liu et al., 2023
WiCE: Real-World Entailment for Claims in Wikipedia Ryo Kamoi et al., 2023
SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization Philippe Laban et al., 2022
FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation Sewon Min et al., 2023
RARR: Researching and Revising What Language Models Say, Using Language Models Luyu Gao et al., 2022
|
Week 10: Explanations |
Explainability in NLP |
The Mythos of Model Interpretability Zach Lipton, 2016
Deep Unordered Composition Rivals Syntactic Methods for Text Classification Mohit Iyyer et al., 2015
Analysis Methods in Neural Language Processing: A Survey Yonatan Belinkov and Jim Glass, 2019
|
Local Explanations: Highlights |
"Why Should I Trust You?" Explaining the
Predictions of Any Classifier Marco Tulio Ribeiro et al., 2016
Axiomatic Attribution for Deep Networks
Mukund Sundararajan et al., 2017
|
Model Probing |
BERT Rediscovers the Classical NLP Pipeline Ian Tenney et al., 2019
What Do You Learn From Context? Probing For Sentence Structure In Contextualized Word Representations Ian Tenney et al., 2019
|
Annotation Artifacts |
Annotation Artifacts in Natural Language Inference Data Suchin Gururangan et al., 2018
Hypothesis Only Baselines in Natural Language Inference Adam Poliak et al., 2018
Did the Model Understand the Question? Pramod Kaushik Mudrakarta et al., 2018
Swag: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference Rowan Zellers et al., 2018
|
Text Explanations |
Generating Visual Explanations Lisa Anne Hendricks et al., 2016
e-SNLI: Natural Language Inference with Natural Language Explanations Oana-Maria Camburu et al., 2018
Explaining Question Answering Models through Text Generation Veronica Latcinnik and Jonathan Berant, 2020
|
Chain-of-thought |
Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems Wang Ling et al., 2017
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models Jason Wei et al., 2022
The Unreliability of Explanations in Few-shot Prompting for Textual Reasoning Xi Ye and Greg Durrett, 2022
Large Language Models are Zero-Shot Reasoners Takeshi Kojima et al., 2022
|
Chain-of-thought: Extensions and Analysis |
Complementary Explanations for Effective In-Context Learning Xi Ye et al., 2023
PAL: Program-aided Language Models Luyu Gao et al., 2022
Measuring and Narrowing the Compositionality Gap in Language Models Ofir Press et al., 2022
|
Week 11: Question Answering, Dialogue Systems |
Reading comprehension intro |
|
Reading comprehension: setup and baselines |
MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text Matthew Richardson et al., 2013
SQuAD: 100,000+ Questions for Machine Comprehension of Text Pranav Rajpurkar et al., 2016
|
BERT for QA |
|
Problems with Reading Comprehension |
Adversarial Examples for Evaluating Reading Comprehension Systems Robin Jia and Percy Liang, 2017 |
Open-domain QA |
Reading Wikipedia to Answer Open-Domain Questions Danqi Chen et al., 2017
Latent Retrieval for Weakly Supervised Open Domain Question Answering Kenton Lee et al., 2019
[Website] Natural Questions Tom Kwiatkowski et al., 2019
Most modern open-domain QA systems are either "closed-book" models like ChatGPT or "open-book" models that do retrieval, similar to the Chen et al. and Lee et al. papers above. These are typically described under the general framework of retrieval-augmented generation; an example of how these systems work is WebGPT (similar to the "new Bing" chatbot).
|
Multi-hop QA |
HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering Zhilin Yang et al., 2018
Understanding Dataset Design Choices for Multi-hop Reasoning Jifan Chen and Greg Durrett, 2019
Learning to Retrieve Reasoning Paths over Wikipedia Graph for Question Answering Akari Asai et al., 2020
Modern QA systems operating over the web are largely multi-hop by default; multi-hop QA has been subsumed by open-domain QA to a large extent. For a more recent multi-hop QA dataset, see QAMPARI.
|
Dialogue: Chatbots |
|
Task-Oriented Dialogue |
Wizard of Wikipedia: Knowledge-Powered Conversational Agents Emily Dinan et al., 2019
Task-Oriented Dialogue as Dataflow Synthesis Semantic Machines, 2020 |
Neural Chatbots |
A Neural Network Approach to Context-Sensitive Generation of Conversational Responses Alessandro Sordoni et al., 2015
A Diversity-Promoting Objective Function for Neural Conversation Models Jiwei Li et al., 2016
Recipes for building an open-domain chatbot Stephen Roller et al., 2020
Note: an updated version of BlenderBot is described in Kurt Shuster et al. Other chatbots discussed, like character.ai, can be found online and you can play with them, but less information about their precise internals is available in published papers.
|
Week 12: Machine Translation, Summarization |
Machine Translation Intro |
Eisenstein 18.1
|
MT: Framework and Evaluation |
Eisenstein 18.1
|
MT: Word alignment |
|
MT: IBM Models |
HMM-Based Word Alignment in Statistical Translation Stephan Vogel et al., 1996 |
Phrase-based Machine Translation |
Pharaoh: A Beam Search Decoder for Phrase-Based Statistical Machine Translation Models Philipp Koehn, 2004
Minimum Error Rate Training in Statistical Machine Translation Franz Och, 2003
Eisenstein 18.4
|
Neural and Pre-Trained Machine Translation |
Revisiting Low-Resource Neural Machine Translation: A Case Study Rico Sennrich and Biao Zhang, 2019
In Neural Machine Translation, What Does Transfer Learning Transfer? Alham Fikri Aji et al., 2020
Multilingual Denoising Pre-training for Neural Machine Translation Yinhan Liu et al., 2020
Large Language Models Are State-of-the-Art Evaluators of Translation Quality Tom Kocmi and Christian Federmann, 2023
|
Summarization Intro |
|
Extractive Summarization |
The use of MMR, diversity-based reranking for reordering documents and producing summaries Jaime Carbonell and Jade Goldstein, 1998
LexRank: Graph-based Lexical Centrality as Salience in Text Summarization Gunes Erkan and Dragomir Radev, 2004
A Scalable Global Model for Summarization Dan Gillick and Benoit Favre, 2009
Revisiting the Centroid-based Method: A Strong Baseline for Multi-Document Summarization Demian Gholipour Ghalandari, 2017
|
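Note: the MMR criterion from Carbonell and Goldstein trades off query relevance against redundancy with already-selected sentences. Here is a sketch assuming sentences and the query are already embedded as vectors.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def mmr_select(doc_vecs, query_vec, k=3, lam=0.7):
    """Maximal Marginal Relevance: greedily pick sentences that are
    relevant to the query but not redundant with earlier picks."""
    selected = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(doc_vecs[i], query_vec)
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j])
                              for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
print(mmr_select(rng.normal(size=(6, 32)), rng.normal(size=32), k=2))
```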
Pre-trained Summarization and Factuality |
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension Mike Lewis et al., 2019
PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization Jingqing Zhang et al., 2020
Evaluating Factuality in Generation with Dependency-level Entailment Tanya Goyal and Greg Durrett, 2020
Asking and Answering Questions to Evaluate the Factual Consistency of Summaries Alex Wang et al., 2020
Note: while the specific fine-tuned modeling approaches and factuality detection systems presented in the video are no longer state-of-the-art, they are representative of ideas from pre-training that are still used today. For discussion of how LLMs relate to summarization, see News Summarization and Evaluation in the Era of GPT-3 by Tanya Goyal, Junyi Jessy Li, and Greg Durrett.
|
Week 13-14: Multilinguality, Language Grounding, Ethical Issues |
Morphology |
|
Cross-lingual Tagging and Parsing |
Unsupervised Part-of-Speech Tagging with Bilingual Graph-Based Projections Dipanjan Das and Slav Petrov, 2011
Multi-Source Transfer of Delexicalized Dependency Parsers Ryan McDonald et al., 2011
|
Cross-lingual Pre-training |
Massively Multilingual Word Embeddings Waleed Ammar et al., 2016
Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond Mikel Artetxe and Holger Schwenk, 2019
How multilingual is Multilingual BERT? Telmo Pires et al., 2019
|
Language Grounding |
Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data Emily Bender and Alexander Koller, 2020
Provable Limitations of Acquiring Meaning from Ungrounded Form: What Will Future Language Models Understand? Will Merrill et al., 2021
Entailment Semantics Can Be Extracted from an Ideal Language Model Will Merrill et al., 2022
Experience Grounds Language Yonatan Bisk et al., 2020
|
Language and Vision |
VQA: Visual Question Answering Aishwarya Agrawal et al., 2015
Learning Transferable Visual Models From Natural Language Supervision Alec Radford et al., 2021
|
Ethics: Bias |
The Social Impact of Natural Language Processing Dirk Hovy and Shannon Spruit, 2016
Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints Jieyu Zhao et al., 2017
|
Ethics: Exclusion |
GeoMLAMA: Geo-Diverse Commonsense Probing on Multilingual Pre-Trained Language Models Da Yin et al., 2022
Visually Grounded Reasoning across Languages and Cultures Fangyu Liu et al., 2021
|
Ethics: Dangers of Automation |
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? Emily Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell, 2021
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models Samuel Gehman et al., 2020
|
Ethics: Unethical Use and Paths Forward |
Datasheets for Datasets Timnit Gebru et al., 2018
Closing the AI Accountability Gap: Defining an End-to-End Framework for Internal Algorithmic Auditing Deb Raji et al., 2020
|