Class Times: Mondays 6:00pm - 9:00pm
Class Location: RLP 1.106
Instructor: Shirley Cohen
Email: scohen at cs dot utexas dot edu
Office hours: After class or by appointment in RLP 1.106
TA: Prithvi Chowhan
Email: chowhan at utexas dot edu
Office hours: Wednesdays from 12:30 to 1:30pm and Fridays from 2pm to 3pm at TA Stations, GDC basement
TA: William Chia
Email: william dot chia at utexas dot edu
Office hours: Tuesdays and Thursdays from 1pm - 2pm at TA Stations, GDC basement
Course Description:
This course is designed to give students a practical understanding of databases and data systems. The goal is to learn modern data management and data processing techniques through lots of problem solving and hands-on projects.
The content of the course is organized into three broad areas: 1) SQL with an emphasis on the standard features of the language; 2) data models from relational to hierarchical; and 3) dataflow programming for processing data at-scale.
We will first construct a simple analytics database from structured data. We will then expand the scope of the database with additional datasets from a variety of sources. We will explore the raw data through SQL and gradually enrich and transform the data to create a single view across datasets. This work will be implemented on Google Cloud Platform using BigQuery, Apache Beam/Dataflow, Apache Airflow/Composer, and Data Studio.
Below are some of the topics we will be covering during the term:
SQL:
- select-from-where
- order-bys
- joins
- inserts, updates, deletes
- aggregates
- group-bys
- subqueries
- functions
Data Modeling:
- referential integrity
- normalization
- hierarchies
- repeated fields
- views
- partitioning
Dataflow Programming:
- ingestion
- cleansing
- transformation
- data integration
- workflow orchestration
Prerequisites:
The course assumes a programming background and in particular, a solid working knowledge of Python scripting. As such, the prerequisites for this course
are CS 303E, CS 307 or the equivalent. Familiarity with SQL is also helpful, but not required.
Textbook:
There is one required text for this class:
- Clare Churcher, Beginning SQL Queries: From Notice to Professional, Second Edition, 2016. Available as an ebook from the UT Library.
Supplemental Readings:
The course requires pouring over product documentation for BigQuery, Apache Beam/Dataflow, Apache Airflow/Composer, and Data Studio. The product documentation is all available online.
Quizzes:
There will be weekly quizzes based on assigned readings. The quizzes will have 5 multiple-choice questions and they will be taken at the start of class with your iClicker. iClickers are available as a remote and mobile app. Choose one of these options (not both) and register your device through Canvas.
Project:
The most important component of this course is the project. The project will be to construct a data warehouse that brings together previously siloed datasets with the goal of deriving some interesting insights from the combined data. You will pair up with another student in the class and work collaboratively on the project. Each team will choose their datasets and formulate their queries for analysis. The project will be divided into several milestones and will span the entire term.
Exam:
There is only one exam in this course which is a midterm. The exam is comprehensive and will cover all the material on SQL, data modeling, and dataflow programming seen up to that point in the semester. It is a closed-book exam and it will be done in class. No make-up exam will be offered.
Participation:
This class will be taught using active learning techniques. The goal is to spend the majority of class time clarifying difficult concepts and actively wo
rking through problems rather than listening to a traditional lecture. You will need to bring a laptop to class and sit with your partner to able to work
together on project milestones. Participation questions will be answered with your iClicker and they will be based on practice problems and project mile
stones.
Grading Rubric:
The basic grading rubric is comprised of the four components listed below: