Syllabus for CS 329E Elements of Data Integration - Spring 2024

Class Meetings: Friday 2:00pm - 5:00pm
Class Location: Zoom and RLP 0.128*

Instructor: Shirley Cohen
Email: scohen at cs dot utexas dot edu
Office Hours: Monday from 6:30pm - 7:30pm on Zoom

TA: Ritvik Renikunta
Email: ritvik dot renikunta at utexas dot edu
Office Hours: Wednesday from 3:30 - 4:30pm and Friday from 12:00 - 1:00pm on Zoom

TA: Grace Kim
Email: yeeunk at utexas dot edu
Office Hours: Tuesday from 3:00 - 4:00pm and Thursday from 4:00 - 5:00pm

*This class will have meetings both on Zoom and in-person. For all Zoom links (class meetings and office hours), please see Canvas.


Course Description
This new course on Data Integration takes a deep dive into the practices of modeling, transforming, and unifying data at scale, with a focus on creating reference and master data from multiple sources and managing this data over time.

We will cover some of the most common data engineering techniques, including data generation, ingestion, cleansing, modeling, validation, and orchestration. We will store raw, staging, and golden copies in a blob store and data warehouse, perform incremental loads, publish changes to subscribers, and implement long-running pipelines. We will use Google Cloud Storage, BigQuery, Pub/Sub, Airflow, and Colab for this work.

In the last phase of the course, we will examine data engineering concerns through the lens of AI research and emerging technologies. We will prototype LLM-powered data enrichment functions that assist with attribute detection, data repair, and entity matching. Our goal will be to evaluate whether language models can accomplish these data wrangling tasks with sufficient recall and precision. We will use Vertex AI, Gemini, BigQuery, and Colab for this work.
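To make the evaluation criteria concrete, here is a minimal sketch of how precision and recall could be computed for an entity matching task. It is only an illustration, not course-provided code: the function name `precision_recall` and the record IDs are hypothetical, and it assumes that matches are represented as unordered pairs of record IDs compared against a hand-labeled set of true matches.

```python
def precision_recall(predicted_pairs, labeled_pairs):
    """Return (precision, recall) for predicted match pairs vs. labeled true matches.

    Pairs are treated as unordered, so ("a1", "b1") equals ("b1", "a1").
    """
    predicted = {frozenset(pair) for pair in predicted_pairs}
    labeled = {frozenset(pair) for pair in labeled_pairs}
    true_positives = len(predicted & labeled)
    # Precision: share of predicted matches that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of true matches that were found.
    recall = true_positives / len(labeled) if labeled else 0.0
    return precision, recall


# Hypothetical record IDs, for illustration only.
predicted_matches = [("a1", "b1"), ("a2", "b2"), ("a3", "b9")]
labeled_matches = [("b1", "a1"), ("a2", "b2"), ("a4", "b4"), ("a5", "b5")]

precision, recall = precision_recall(predicted_matches, labeled_matches)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this example, two of the three predicted pairs are correct and two of the four true matches are found, so precision is about 0.67 and recall is 0.50.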


Prerequisites
This course assumes deep experience with SQL, databases, and Python. CS 327E or an equivalent upper-division course in databases is required. CS 303E, CS 307, or an equivalent programming course is also required. Familiarity with machine learning is helpful, but not required.


Textbooks
- Joe Reis and Matt Housley, Fundamentals of Data Engineering, O'Reilly, 2022.
- Sinan Ozdemir, Quick Start Guide to Large Language Models, Addison-Wesley, 2023.


Tutorials
- Apache Airflow Fundamentals.
- Apache Airflow Core Concepts.


Research Papers
The following papers are optional. You won't be assigned to read them, but you are encouraged to do so as they are extremely relevant to the last phase of the course.

- Ashish Vaswani et al. Attention is All You Need, NIPS 2017.
- Shunyu Yao et al. ReAct: Synergizing Reasoning and Acting in Language Models, ICLR 2023.
- Avanika Narayan et al. Can Foundation Models Wrangle Your Data?, VLDB 2022.
- Jinyang Li et al. BIRD: Can LLM Already Serve as a Database Interface?, NeurIPS 2023.


Projects
The projects are the most important aspect of this course. All projects except the first will involve major programming assignments. The projects will build on each other throughout the term, up to and including the final project.

There will be 10 weekly projects and 1 final project in total. The weekly projects are worth 70% of your grade (7% per project). The final project is worth 10% of your grade.

The projects will be carried out in groups of two. You will pair up on the first day of class and will work with the same person throughout the term. If you have issues collaborating with your partner, you are encouraged to email me or come to office hours to discuss.


Presentations
There will be two presentation days in which you and your partner will present the highlights of your project work to the class. You will also have a chance to listen to other presentations and give your peers feedback on their work. You will be assigned three presentations to peer review.

The first presentation day will occur around week 8 and the second one will occur on the last day of class. The presentations are worth 10% of your grade, 5% per presentation day.


Quizzes
There will be a quiz in most weeks when there is an assigned reading. The quizzes are designed to keep you on schedule and check your understanding of the basics from the readings. They will mostly consist of multiple choice questions on material that is not covered in the projects.

You are expected to take the quizzes on your own without collaboration. If you are struggling with any of the questions on the quizzes, you are encouraged to come to office hours for help. There will be around 9 quizzes overall and we will drop your lowest score. The quizzes are worth 10% of your grade.


Grading Rubric
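Your final grade will be made up of the following components:

- Weekly projects: 70% (10 projects at 7% each)
- Final project: 10%
- Presentations: 10% (2 presentation days at 5% each)
- Quizzes: 10%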

Due to the experimental nature of this class, there will be no midterm or final exams this semester.

The final grades will use the plus/minus grading system.
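As a rough illustration of how the weights stated in this syllabus combine, the sketch below computes an overall course percentage. It is not an official grade calculator: the function and variable names are hypothetical, scores are assumed to be fractions between 0 and 1, each item within a component is assumed to count equally, and the mapping from percentages to plus/minus letter grades is not specified here and is left out.

```python
# Component weights as stated in this syllabus.
WEIGHTS = {
    "weekly_projects": 0.70,  # 10 projects at 7% each
    "final_project": 0.10,
    "presentations": 0.10,    # 2 presentation days at 5% each
    "quizzes": 0.10,          # lowest quiz score is dropped
}


def mean(scores):
    """Average of a non-empty list of scores."""
    return sum(scores) / len(scores)


def course_percentage(weekly_projects, final_project, presentations, quizzes):
    """Combine component scores (fractions in [0, 1]) into a percentage."""
    # Drop the single lowest quiz score, per the quiz policy.
    kept_quizzes = sorted(quizzes)[1:] if len(quizzes) > 1 else list(quizzes)
    total = (
        WEIGHTS["weekly_projects"] * mean(weekly_projects)
        + WEIGHTS["final_project"] * final_project
        + WEIGHTS["presentations"] * mean(presentations)
        + WEIGHTS["quizzes"] * mean(kept_quizzes)
    )
    return 100 * total


# Hypothetical scores, for illustration only.
example = course_percentage(
    weekly_projects=[0.9] * 10,
    final_project=0.8,
    presentations=[1.0, 0.9],
    quizzes=[0.5] + [1.0] * 8,
)
print(f"{example:.1f}%")
```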


Academic Integrity
This course will abide by UTCS' code of academic integrity.


Class Structure
Typically, in-class time will be spent on project work. Class will start with a short lecture in which the content for the project is introduced through live examples. For all working sessions, we will meet on Zoom and use breakout rooms and screensharing to facilitate collaboration.

We will meet in our physical classroom three times during the term: on the first day of class and on the two presentation days.


Communication
Please make sure you have bookmarked this page and are enrolled on Ed. Ed will be our primary method of communication outside of class, and you are responsible for checking it frequently. This page is where all projects will be released and Canvas will be where all assignments are submitted.


Late Policy and Extensions
You'll be docked 10% of your score for each day your submission is late. This applies to all project submissions throughout the term.
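The sketch below shows one possible reading of this penalty. It rests on an assumption rather than a statement of the official policy: each late day removes 10% of the earned score, and the result never drops below zero. The function name `apply_late_penalty` is hypothetical; ask on Ed if you need the exact calculation for a specific submission.

```python
def apply_late_penalty(score, days_late, penalty_per_day=0.10):
    """Reduce an earned score by 10% of that score per late day (assumed reading)."""
    if days_late < 0:
        raise ValueError("days_late must be zero or positive")
    # Assumption: the penalty accumulates linearly and is floored at zero.
    remaining_fraction = max(0.0, 1.0 - penalty_per_day * days_late)
    return score * remaining_fraction


# A submission that earned 90 points and was turned in 2 days late.
print(apply_late_penalty(90, 2))
```

Under this assumption, a submission that is 10 or more days late would receive no credit.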

For deadline extension requests, alternate quiz requests, SSD accommodations, or special accommodations (for emergencies or personal issues), please make a private post on Ed. Please include your reason for requesting an extension and any relevant documentation if applicable.


Students with Disabilities
Students with disabilities may request appropriate academic accommodations.


Tools
- Canvas for project submissions and grade reporting.
- Ed Discussion for announcements, questions, discussions.
- Zoom for online instruction.
- Google Cloud Platform for project work.
- GitHub for code repository.
- Lucidchart for diagramming.


Week-by-Week Schedule (subject to change)


Acknowledgments
This course is generously supported by Google, which has given us access to Google Cloud Platform.