Class Meetings: Fridays 2:00pm - 5:00pm
Class Location: JGB 2.218 and Zoom*
Instructor: Shirley Cohen
Email: scohen at cs dot utexas dot edu
Office Hours: Mondays 6:30pm - 7:30pm on Zoom
Teaching Assistant: Sai Surya Duvvuri
Email: saisurya at cs dot utexas dot edu
Office Hours: Tuesdays 4:00pm - 5:00pm and Fridays 12:00pm - 1:00pm on Zoom
Teaching Assistant: Ritvik Renikunta
Email: ritvik dot renikunta at utexas dot edu
Office Hours: Wednesdays 4:00pm - 5:00pm and Thursdays 2:00pm - 3:00pm on Zoom
*This class will meet mostly in person and occasionally on Zoom. For all Zoom links (class meetings and office hours), please see Canvas.
Course Description:
This course is designed to give students a practical introduction to databases and data systems. The goals are to learn modern data management and data processing techniques through a mix of best practices, experimentation, and problem solving.
The contents of the course are organized into three broad themes: 1) query languages with an emphasis on SQL; 2) data models from relational to document to graph; and 3) data engineering with a focus on data cleansing, data exploration, and data analysis.
We will construct multiple databases for operational and analytical purposes throughout the term. The work will be done on Google Cloud Platform using a variety of database technologies and data science tools: MySQL, Postgres, BigQuery, Firestore, MongoDB, Neo4j, Jupyter Notebooks, Looker Studio, and Trino.
Below are some of the topics we will cover:
SQL:
- select-from-where
- order-bys
- joins
- inserts, updates, deletes
- aggregates
- group-bys
- subqueries
Data Models:
- relational
- document
- graph
- hybrid
Data Engineering:
- data ingestion
- data cleansing
- data integration
- data analysis
- data visualization
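As a rough preview of the SQL topics listed above, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is used only so the example runs without any setup; it is not one of the course systems. The student and enrollment tables and their rows are hypothetical and are not taken from any course project.

```python
import sqlite3

# In-memory SQLite database (an assumption for illustration; the course
# itself works with MySQL, Postgres, BigQuery, and other systems).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical schema, made up for this sketch.
cur.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, major TEXT)")
cur.execute("CREATE TABLE enrollment (student_id INTEGER, course TEXT, grade REAL)")

# inserts
cur.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(1, "Ana", "CS"), (2, "Ben", "Math"), (3, "Cy", "CS")],
)
cur.executemany(
    "INSERT INTO enrollment VALUES (?, ?, ?)",
    [(1, "DB", 3.5), (1, "OS", 4.0), (2, "DB", 3.0), (3, "DB", 2.5)],
)

# updates and deletes
cur.execute("UPDATE enrollment SET grade = 3.0 WHERE student_id = 3 AND course = 'DB'")
cur.execute("DELETE FROM enrollment WHERE student_id = 1 AND course = 'OS'")

# select-from-where with an order-by
cs_students = cur.execute(
    "SELECT name FROM student WHERE major = 'CS' ORDER BY name DESC"
).fetchall()

# join, aggregate, and group-by: average grade per major
avg_by_major = cur.execute(
    """
    SELECT s.major, AVG(e.grade)
    FROM student s
    JOIN enrollment e ON s.id = e.student_id
    GROUP BY s.major
    ORDER BY s.major
    """
).fetchall()

# subquery: students with a grade above the overall average
above_avg = cur.execute(
    """
    SELECT s.name
    FROM student s
    JOIN enrollment e ON s.id = e.student_id
    WHERE e.grade > (SELECT AVG(grade) FROM enrollment)
    ORDER BY s.name
    """
).fetchall()

print(cs_students)
print(avg_by_major)
print(above_avg)
conn.close()
```

The same statements should carry over to the relational systems used in class with little or no change, although SQL dialects differ in details such as data types and quoting.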
Prerequisites:
The course assumes a programming background and, in particular, a solid working knowledge of Python scripting. As such, the prerequisites for this course are CS 303E, CS 307, or the equivalent. Familiarity with SQL is also helpful, but not required.
Textbooks:
There are two required texts for this course:
- Alan Beaulieu, Learning SQL, Third Edition, 2020.
- Dan Sullivan, NoSQL for Mere Mortals, First Edition, 2015.
Supplemental Readings:
In addition to the required readings, the assignments will involve consulting the product documentation on Cloud SQL, BigQuery, Firestore, MongoDB, Neo4j, Looker Studio and others. All product documentation will be available online.
Projects:
The projects are the most important component of this course. They are intended to give you hands-on experience with the database systems and tools. They will start with the basic CRUD operations and move on to more advanced capabilities.
There are two types of projects: weekly projects and a final project. The weekly projects are aimed at giving you practice with each of the database systems in turn. They will be assigned as homework and will require time outside of class to complete. The final project will introduce an area that many enterprises struggle with today: integrating data from disparate sources. We will look at query federation as a way of querying across relational and NoSQL databases with Trino, a SQL query engine.
All projects will be carried out in groups of two students. You will form groups at the start of the term and work with the same partner throughout the term. More details on the projects will be provided in the week-by-week section below.
Exams:
There will be two midterms and no final exam. The tests are comprehensive and will cover all the material to date, including readings, projects, and lectures. They will be closed book and taken in class via Canvas. Unfortunately, no make-up tests will be offered due to limited availability.
Participation:
We will be holding synchronous class meetings so that you have the opportunity to discuss questions and work together with other students. My goal is to spend the majority of class time actively working through problems and clarifying difficult concepts. You will need to have a stable internet connection and a laptop so that you can be fully present for each class.
Participation questions will mostly be multiple choice and will be answered through UT Instapoll. The questions will be based on the assigned setup guide of the day or on practice problems that we work on in class.
Absences:
Excused absences may be given only for verifiable medical or family emergencies. Written documentation must be provided to qualify for an excused absence. The medical documentation must specifically state that you could not attend class due to your illness and must be signed by a physician. A job or internship interview or any other appointment does not constitute an excused absence.
Grading Rubric:
The basic grading rubric consists of the five components listed below:
Acknowledgments:
Google generously supports this course by giving us access to Google Cloud Platform.