CS 378: Big Data Programming

Overview, Infrastructure

[Datacenter] The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, L.A. Barroso, U. Hölzle, Synthesis Lectures on Computer Architecture, 2009. Chapters 1 and 2.
- [VL2: Extra Reading] VL2: A Scalable and Flexible Data Center Network, Greenberg et al., SIGCOMM 2009.
[HDFS] The Hadoop Distributed File System, Shvachko et al, MSST, 2010.
- [GFS: Extra Reading] The Google File System, Ghemawat et al, SOSP, 2003.
[Map-Reduce] MapReduce: Simplified Data Processing on Large Clusters, Dean and Ghemawat, OSDI, 2004.
[Spark] Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, Zaharia et al, NSDI, 2012.
- [Spark Architecture: Extra Reading] Spark Architecture: Shuffle, Alexey Grishchenko

[Mesos] Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, Hindman et al, NSDI, 2011.
- [Packing: Extra Reading] Multi-Resource Packing for Cluster Schedulers, Grandl et al, SIGCOMM, 2014.
- [YARN: Extra Reading] Apache Hadoop YARN: Yet Another Resource Negotiator, Vavilapalli et al, SoCC, 2013.
[Resource Allocation] Dominant Resource Fairness: Fair Allocation of Multiple Resource Types, Ghodsi et al, NSDI, 2011.

[SparkSQL] Spark SQL: Relational Data Processing in Spark, Armbrust et al, SIGMOD, 2015.
- [QOOP: Extra Reading] Dynamic Query Re-Planning using QOOP, Kshiteej Mahajan, Mosharaf Chowdhury, Aditya Akella, and Shuchi Chawla.
- [Hive: Extra Reading] Major technical advancements in Apache Hive, Huai et al, SIGMOD, 2014.
[Snowflake] The Snowflake Elastic Data Warehouse, Dageville et al, SIGMOD, 2016.
- [WHIZ: Extra Reading] Whiz: Data-Driven Analytics Execution, Grandl et al, NSDI, 2021.

[SparkStreaming] Discretized Streams: Fault-Tolerant Streaming Computation at Scale, Zaharia et al, SOSP, 2013. Also read this introduction to Structured Streaming.
[Heron] Twitter Heron: Stream Processing at Scale, Kulkarni et al, SIGMOD, 2015.
- [Dataflow: Extra Reading] The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing, Akidau et al, VLDB, 2015.
[Flink] Apache Flink: Stream and Batch Processing in a Single Engine, Carbone et al, Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2015.

[GraphX] GraphX: Graph Processing in a Distributed Dataflow Framework, Gonzalez et al, OSDI, 2014.
[Pregel] Pregel: A System for Large-Scale Graph Processing, Malewicz et al, SIGMOD, 2010.
- [PowerGraph: Extra Reading] PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs, Gonzalez et al, OSDI, 2012.

[PyTorch] PyTorch Distributed: Experiences on Accelerating Data Parallel Training, S. Li et al, VLDB, 2020.
- [Parameter Server: Extra Reading] Scaling Distributed Machine Learning with the Parameter Server, Mu Li et al, OSDI, 2014.
[Clipper] Clipper: A Low-Latency Online Prediction Serving System, Crankshaw et al, NSDI, 2017.
[Ray] Ray: A Distributed Framework for Emerging AI Applications, Moritz et al, OSDI, 2018.
[Gandiva] Gandiva: Introspective Cluster Scheduling for Deep Learning, Wencong Xiao et al, OSDI, 2018.

[PyWren] Occupy the Cloud: Distributed Computing for the 99%, Jonas et al, SoCC, 2017.
[TPU] In-Datacenter Performance Analysis of a Tensor Processing Unit, Jouppi et al, ISCA, 2017.