Syllabus for 378 Foundations of Data Warehousing - Fall 2024

Class Meetings: Friday 2:00pm - 5:00pm
Class Location: Zoom and JGB 2.218*

Instructor: Shirley Cohen
Email: scohen at cs dot utexas dot edu
Office Hours: Monday from 6:30pm - 7:30pm on Zoom

TA: Amogh Hasotkar
Email: amoghhasotkar at utexas dot edu
Office Hours: Tuesday, Wednesday, and Thursday from 3pm - 4pm on Zoom

*This class will have meetings both on Zoom and in-person. For all Zoom links (class meetings and office hours), please see Canvas.

Course Description
This new course will introduce students to designing and implementing a data warehouse through three pillars. The first pillar is data modeling. We will study techniques to model independently sourced and related datasets into a cohesive data model that is suitable for a variety of business intelligence applications. The datasets will contain structured, semi-structured, and unstructured data.

The second pillar is ETL. We will focus on the "T" of ETL and look at how to design data pipelines that populate the warehouse with initial loads and incremental changes. We will get hands-on and build some long-running pipelines. We will also examine some operational aspects of pipeline development that include debugging, quality testing, data lineage, and orchestration.

The third pillar of the course is AI engineering, which will be interleaved throughout the term. We will look at new approaches to data sourcing and enrichment whose solutions are not yet well understood: how to incorporate large language models for entity extraction, entity matching, and error detection and repair. We will review the research literature and experiment with prompting, embeddings, and retrieval.

This will be a project-based course. The projects will be coded in SQL and Python. We will use various technologies, including BigQuery, Vertex AI, Cloud Storage, Colab, dbt, and Apache Airflow.


Prerequisites
This course assumes knowledge of SQL, databases, and Python. In addition, CS 429 is required. Familiarity with machine learning is helpful, but not required.


Required Readings
There will be assigned readings most weeks. They will come from two texts and two research papers:
- Lawrence Corr and Jim Stagnitto. Agile Data Warehouse Design. DecisionOne Press, 2011.
- Rui Pedro Machado and Helder Russa. Analytics Engineering with SQL and dbt. O'Reilly, 2024.
- Avanika Narayan et al. Can Foundation Models Wrangle Your Data? Proceedings of the VLDB Endowment 16(4), 2022.
- Xue Li and Till Dohmen. Towards Efficient Data Wrangling with LLMs using Code Generation. DEEM@SIGMOD, 2024.


Optional Readings
- Sinan Ozdemir. Quick Start Guide to Large Language Models. Second Edition. Addison-Wesley, 2024. Note: The Second Edition won't be available until October.


Projects
The projects are the most important component of this course. All but the first will involve substantial design and coding work. The projects will build on each other throughout the term, up to and including the final project.

There will be 8 weekly projects and 1 final project. The weekly projects are worth 70% of the grade, or 8.75% per assignment. The final project is worth 10% of the grade.

The projects will be carried out in groups of two, and you will stay in the same group throughout the term. I expect both partners to be fully engaged and to contribute evenly to every project. If that is not the case and you run into problems with your partner, please reach out to me and the TA for help. It is always better to reach out early and often than to wait until the problems get worse.


Presentations
There will be two presentations where your group will present the highlights of your project work. You will also have the opportunity to listen to other presentations, ask questions, and give feedback to your peers.

The first presentation day will occur around week 7 and the second one will occur on the last day of class. The presentations are worth 10% of the grade or 5% per presentation day.


Quizzes
There will be in-class quizzes most weeks. The quizzes are based on the assigned readings and are designed to keep you on track and check your understanding. They will mostly consist of multiple-choice questions covering material that is not part of the projects.

Students are expected to take the quizzes on their own without collaboration. If you are struggling with any of the content on the quizzes, I encourage you to come to office hours and ask for help. There will be around 10 quizzes overall. Together they are worth 10% of your grade.


Class Structure
We will meet in our physical classroom three times during the term, on the first day of class and on the two presentation days (weeks 6 and 14). The rest of the time we will meet on Zoom so that we can take advantage of class time to make significant progress on our projects. Typically, there will be a lecture at the start of class, followed by a working period in which we will split up into breakout rooms. The TA and I will be visiting each group during the working period to help with any issues and questions that come up.

Attendance is mandatory. If you have to miss a class, you should notify both your partner and the instructors ahead of time. This will allow us to make alternative arrangements and give your partner extra support with the project. If you miss more than three classes, we will ask you to work solo on the remaining projects.


Academic Integrity
This course will abide by UTCS' code of academic integrity.


Late Policy and Extensions
You will lose 10% of your score for each late day of your project submission. This applies to all projects throughout the term.

For deadline extension requests, alternate quiz requests, SSD accommodations, or special accommodations (for emergencies or personal issues), please make a private post on Ed. Please include your reason for requesting an extension and any relevant documentation if applicable.

Students with Disabilities
Students with disabilities may request appropriate academic accommodations.


Grading Rubric
The final grade will be made up of the following components:
- Weekly projects: 70%
- Final project: 10%
- Presentations: 10%
- Quizzes: 10%


Tools
- Canvas for project submissions and grade reporting.
- Ed Discussion for announcements, questions, discussions.
- Zoom for online instruction.
- Google Cloud Platform for project work.
- GitHub for code repository.
- Lucidchart for diagramming.


Tentative Schedule (subject to change)


Acknowledgments
This course is generously supported by Google, which has given us access to its Cloud Platform.