Large-Scale Data Mining

CS 395T/CAM 395T

CS Unique No. 51085 / CAM Unique No. 61210

Fall 2001
TTh 11:00am-12:30pm
TAY 2.106

Professor: Inderjit Dhillon
Office: ACES 2.332
Office Hours: Tue 2-3pm

TA: Yuqiang Guan
Phone: 232-7420
Office Hours: Th 1:30-3pm at PAI 5.40B

Handouts

  • Course Information (contains grading information) handed out on Aug 30.
  • Class Survey, Aug 30.

Lectures

  • Lecture 1 - Finding good "hubs" and "authorities" for broad-topic queries. Material from:
      • Authoritative Sources in a Hyperlinked Environment by Jon Kleinberg.
      • Improved Algorithms for Topic Distillation in a Hyperlinked Environment by Krishna Bharat and Monika Henzinger.
  • Lecture 2 - Review of basic linear algebra (vectors, norms, eigenvalues/eigenvectors, SVD); proof that hub and authority vectors converge to the dominant singular vectors.
  • Lectures 3 & 4 - HITS, Clever Project, Google's PageRank.
  • Lectures 5 & 6 - Vector Space Model, Latent Semantic Indexing, SVD.
  • Lectures 7, 8 & 9 - PCA, Clustering, Hierarchical Agglomerative Clustering (HAC), k-means.
  • Lecture 10 - Graph partitioning algorithms (Kernighan-Lin, Spectral Partitioning, Multilevel methods such as Metis).
  • Lecture 11 - Co-clustering paper.

Homeworks

  • Homework 1, due date: Sept 25.
  • Homework 2, due date: Oct 2.
  • Homework 3, due date: Oct 11.
  • Homework 4, due date: Nov 2.

Related Courses

  • Univ of Minnesota's CSci 8363, Numerical Linear Algebra in Data Exploration, Spring 2001.
  • Stanford's CS 349, Data Mining, Search, and the World Wide Web, Fall 1998.
  • Stanford's Data Mining Course, 2000.
  • UT Austin ECE course EE 380L, A Practicum in Data Mining, Fall 1999.