Overview, Infrastructure
- [Datacenter] The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, L.A. Barroso, U. Hölzle, Synthesis Lectures on Computer Architecture, 2009. Chapters 1 and 2.
- [VL2: Extra Reading] VL2: A Scalable and Flexible Data Center Network, Greenberg et al, SIGCOMM, 2009.
- [HDFS] The Hadoop Distributed File System, Shvachko et al, MSST, 2010.
- [GFS: Extra Reading] The Google File System, Ghemawat et al, SOSP, 2003.
- [Map-Reduce] MapReduce: Simplified Data Processing on Large Clusters, Dean and Ghemawat, OSDI, 2004.
- [Spark] Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, Zaharia et al, NSDI, 2012.
Execution Engines, Schedulers
- [Mesos] Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, Hindman et al, NSDI, 2011.
- [Packing: Extra Reading] Multi-Resource Packing for Cluster Schedulers, Grandl et al, SIGCOMM, 2014.
- [YARN: Extra Reading] Apache Hadoop YARN: Yet Another Resource Negotiator, Vavilapalli et al, SoCC, 2013.
- [Resource Allocation] Dominant Resource Fairness: Fair Allocation of Multiple Resource Types, Ghodsi et al, NSDI, 2011.
Batch Analytics
- [SparkSQL] Spark SQL: Relational Data Processing in Spark, Armbrust et al, SIGMOD, 2015.
- [QOOP: Extra Reading] Dynamic Query Re-Planning using QOOP, Mahajan et al, OSDI, 2018.
- [Hive: Extra Reading] Major technical advancements in Apache Hive, Huai et al, SIGMOD, 2014.
- [Snowflake] The Snowflake Elastic Data Warehouse, Dageville et al, SIGMOD, 2016.
- [WHIZ: Extra Reading] Whiz: Data-Driven Analytics Execution, Grandl et al, NSDI, 2021.
Stream Analytics
- [SparkStreaming] Discretized Streams: Fault-Tolerant Streaming Computation at Scale, Zaharia et al, SOSP, 2013. Also read this introduction to Structured Streaming.
- [Heron] Twitter Heron: Stream Processing at Scale, Kulkarni et al, SIGMOD, 2015.
- [Dataflow: Extra Reading] The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing, Akidau et al, VLDB, 2015.
- [Flink] Apache Flink: Stream and Batch Processing in a Single Engine, Carbone et al, Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2015.
Graph Processing
- [GraphX] GraphX: Graph Processing in a Distributed Dataflow Framework, Gonzalez et al, OSDI, 2014.
- [Pregel] Pregel: A System for Large-Scale Graph Processing, Malewicz et al, SIGMOD, 2010.
- [PowerGraph: Extra Reading] PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs, Gonzalez et al, OSDI, 2012.
Machine Learning
- [PyTorch] PyTorch Distributed: Experiences on Accelerating Data Parallel Training, S. Li et al, VLDB, 2020.
- [Parameter Server: Extra Reading] Scaling Distributed Machine Learning with the Parameter Server, Mu Li et al, OSDI, 2014.
- [Clipper] Clipper: A Low-Latency Online Prediction Serving System, Crankshaw et al, NSDI, 2017.
- [Ray] Ray: A Distributed Framework for Emerging AI Applications, Moritz et al, OSDI, 2018.
- [Gandiva] Gandiva: Introspective Cluster Scheduling for Deep Learning, Wencong Xiao et al, OSDI, 2018.
Platforms: Serverless and Hardware
- [PyWren] Occupy the Cloud: Distributed Computing for the 99%, Jonas et al, SoCC, 2017.
- [TPU] In-Datacenter Performance Analysis of a Tensor Processing Unit, Jouppi et al, ISCA, 2017.