Large-Scale Data Mining

CS 395T/CAM 395T

CS Unique No. 52795 / CAM Unique No. 63815

Course Announcement

Fall 2003
MW 4-5:30pm
GEO 2.102

Professor: Inderjit Dhillon
Office: ACES 2.332
Office Hours: Wed 2-3pm

Handouts

  • Course Information (contains grading information) handed out on Aug 27.
  • Class Survey, Aug 27.

Reference Textbooks

  • "The Elements of Statistical Learning: Data Mining, Inference, and Prediction" by T. Hastie, R. Tibshirani, J. Friedman, Springer-Verlag, 2001.
  • "Pattern Classification" by R. Duda, P. Hart and D. Stork, John Wiley and Sons, November 2000.
  • Reading List for Student Presentations

Lecture Notes

  • Lecture 1 - Finding good "hubs" and "authorities" for broad-topic queries. Material from:
      • "Authoritative Sources in a Hyperlinked Environment" by Jon Kleinberg.
      • "Improved Algorithms for Topic Distillation in a Hyperlinked Environment" by Krishna Bharat and Monika Henzinger.
  • Lecture 2 - Review of basic linear algebra (vectors, norms, eigenvalues/eigenvectors, SVD); proof that the hub and authority vectors converge to the dominant singular vectors.
  • Lectures 3 & 4 - HITS, Clever Project, Google's PageRank.
  • Lectures 5 & 6 - Vector Space Model, Latent Semantic Indexing, SVD.
  • Lectures 7, 8 & 9 - PCA, Clustering, Hierarchical Agglomerative Clustering (HAC), k-means.
  • Lecture 10 - Information Theory, Clustering and Bregman Divergences.
  • Lecture 11 - Graph partitioning algorithms (Kernighan-Lin, Spectral Partitioning, Multilevel methods such as Metis).
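The convergence result named under Lecture 2 can be illustrated with a small sketch. It is not taken from the lecture notes: the 4-page link graph `ADJ`, the function names, and the iteration count are all assumptions chosen for illustration. The sketch runs the hub/authority updates (a = A^T h, h = A a, each followed by normalization) and then checks numerically that the authority vector is an eigenvector of A^T A, i.e. a right singular vector of the adjacency matrix A.

```python
import math

def normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def hits(adj, iterations=100):
    """Run hub/authority updates on adjacency matrix adj (adj[i][j] = 1 if page i links to page j)."""
    n = len(adj)
    hubs = [1.0] * n
    auths = [1.0] * n
    for _ in range(iterations):
        # Authority update: a = A^T h (sum hub scores of pages pointing here).
        auths = normalize([sum(adj[i][j] * hubs[i] for i in range(n)) for j in range(n)])
        # Hub update: h = A a (sum authority scores of pages pointed to).
        hubs = normalize([sum(adj[i][j] * auths[j] for j in range(n)) for i in range(n)])
    return hubs, auths

def ata_times(adj, v):
    """Compute (A^T A) v without forming A^T A explicitly."""
    n = len(adj)
    av = [sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
    return [sum(adj[i][j] * av[i] for i in range(n)) for j in range(n)]

# Hypothetical 4-page web graph, used only for illustration.
ADJ = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
]

hubs, auths = hits(ADJ)

# If auths is an eigenvector of A^T A, then (A^T A) auths = eigenvalue * auths.
w = ata_times(ADJ, auths)
eigenvalue = sum(x * y for x, y in zip(w, auths))  # Rayleigh quotient (auths has unit length)
residual = max(abs(wi - eigenvalue * ai) for wi, ai in zip(w, auths))

print("authorities:", [round(x, 4) for x in auths])
print("hubs:       ", [round(x, 4) for x in hubs])
print("eigenvalue of A^T A:", round(eigenvalue, 5), "residual:", residual)
```

For this particular graph the largest eigenvalue of A^T A is 2 + sqrt(2), so the square root of the printed eigenvalue is the dominant singular value of A. Convergence of the iteration relies on a gap between the two largest eigenvalues and on a starting vector that is not orthogonal to the dominant eigenvector; both hold for the assumed graph.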
Homeworks

  • Homework 1, due date: Oct 1.
  • Homework 2, due date: Nov 5.
  • Homework 3, due date: Nov 17.

Related Courses

  • Previous offering of CS 395T in Fall 2001.
  • Univ of Minnesota's CSci 8363, Linear Algebra in Data Exploration, Spring 2003.
  • Stanford's CS 349, Data Mining, Search, and the World Wide Web, Fall 1998.
  • Stanford's Data Mining Course, 2000.
  • UT Austin ECE course EE 380L, A Practicum in Data Mining, Fall 1999.