Schedule -- June 26, 2014

09:00 - 09:40 Peder Olsen
09:40 - 10:20 Pradeep Ravikumar
10:20 - 10:40 Coffee break
10:40 - 11:20 Jason Lee
11:20 - 12:00 Weibin Zhang

Lunch

14:00 - 14:40 Po-Ling Loh
14:40 - 15:20 Cho-Jui Hsieh
15:20 - 15:40 Coffee break
15:40 - 16:20 Arindam Banerjee
16:20 - 17:00 Steven Rennie



Peder Olsen (IBM Research, Watson)
[homepage] [slides]

Title: Covariance Selection by L1 Regularization
Abstract: This talk will be a selective overview of the covariance selection problem. Covariance selection by optimizing an L1-penalized log likelihood leads to a convex objective function known as the graphical LASSO. A number of methods have been developed to solve increasingly larger covariance selection problems: methods that solve for a single element at a time, methods that solve a row/column at a time, coordinate descent methods, quasi-Newton-type methods, FISTA-type solvers, and many more. Some methods are designed for parallel implementations, some for very sparse solutions, some for denser matrices, and some work better when the problem is well-conditioned. First-order methods tend to be slower, but are easier to implement and to prove convergence for. Second-order methods now appear to be the fastest, and (legitimate) convergence results have been proven for some variants. We will attempt to review some of these results, as well as a beautiful result that shows how we can infer the block-sparsity structure of the minimizer extremely efficiently without explicitly solving the problem.
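For reference, the graphical LASSO objective the talk surveys is the L1-penalized negative Gaussian log likelihood over positive-definite precision matrices, with S the sample covariance and lambda the regularization weight:

    \min_{\Theta \succ 0} \; \mathrm{tr}(S\Theta) - \log\det\Theta + \lambda \|\Theta\|_1

The methods mentioned above are different strategies for minimizing this same convex objective.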

Pradeep Ravikumar (UT Austin)
[homepage] [slides]

Title: Elementary Estimators for Graphical Models
Abstract: We consider the problem of learning high-dimensional exponential family distributions, a class of problems that includes the learning of Gaussian as well as discrete graphical models. This class of problems has attracted considerable attention over the last decade, with state-of-the-art statistical estimators based on solving regularized convex programs. Scaling these typically non-smooth convex programs to very large-scale problems is thus an ongoing and rich area of research. Here, we attempt to address this scaling issue at the source, by asking whether one can build simpler, possibly closed-form estimators that nonetheless come with statistical guarantees comparable to those of regularized likelihood estimators. Surprisingly, we answer this question in the affirmative. We analyze our estimators in the high-dimensional setting, and provide empirical corroboration of their statistical and computational performance on simulated data.
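The exact estimators and guarantees are the subject of the talk; the following is only a rough Python sketch of what a closed-form, thresholding-based construction of a precision matrix can look like. The specific operators and tuning constants (nu, lam) here are illustrative assumptions, not the talk's recipe.

    import numpy as np

    def soft_threshold(A, t):
        # element-wise soft-thresholding operator
        return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

    def closed_form_precision(X, nu=0.1, lam=0.1):
        # X: (n_samples, p) data matrix; nu, lam: illustrative tuning constants
        S = np.cov(X, rowvar=False)
        # threshold the entries so the surrogate covariance is well-conditioned,
        # keeping the original diagonal
        T = soft_threshold(S, nu)
        np.fill_diagonal(T, np.diag(S))
        # one closed-form step: soft-threshold the inverse of the thresholded covariance
        return soft_threshold(np.linalg.inv(T), lam)

The appeal of such constructions is that they need only elementary matrix operations, with no iterative convex solver.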

Jason Lee (Stanford)
[homepage] [slides]

Title: Learning and Optimization for Mixed Graphical Models
Abstract: We consider the problem of learning the structure of a pairwise graphical model over continuous and discrete variables. We present a new pairwise model over both continuous and discrete variables that is amenable to structure learning. Previous work has considered structure learning for Gaussian graphical models and for discrete graphical models; our approach is a natural generalization of these two lines of work to the mixed case. We also discuss proximal Newton-type optimization algorithms for parameter learning in graphical models, and model selection consistency results.
(This is joint work with Trevor Hastie, Michael Saunders, Yuekai Sun, and Jonathan Taylor.)
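As a first-order caricature of the proximal methods mentioned above, here is a minimal Python sketch of a proximal gradient step for an L1-penalized smooth loss. The proximal Newton-type algorithms of the talk instead scale the step by a Hessian approximation and solve the resulting L1-penalized quadratic subproblem for the search direction; this simplified step is illustrative only.

    import numpy as np

    def prox_l1(z, t):
        # proximal operator of t * ||.||_1 (soft-thresholding)
        return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

    def proximal_gradient_step(theta, grad, step, lam):
        # one step on  loss(theta) + lam * ||theta||_1 :
        # gradient step on the smooth loss, then the L1 proximal map
        return prox_l1(theta - step * grad, step * lam)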

Weibin Zhang (Hong Kong University of Science and Technology)
[slides]

Title: Sparse Models for Speech Recognition
Abstract: In this talk, we propose sparse models that subsume traditional GMM-HMM acoustic modeling methods with diagonal or full covariance matrices as special cases. Previously, diagonal or full covariance matrices were chosen heuristically according to the amount of training data. Full covariance models are seldom used when training data is limited, since they tend to overfit; on the other hand, diagonal covariance models simply assume feature independence, which is an oversimplification. Acoustic models with sparse inverse covariance matrices are proposed to deal with the problems that conventional diagonal and full covariance models face: incorrect model assumptions and overfitting when training data is insufficient. Both maximum likelihood training and discriminative training of acoustic models with sparse inverse covariance matrices are discussed. Experimental results show that sparse models significantly outperform diagonal and full covariance models.
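For intuition only, a sparse-precision Gaussian for a single component can be fit with an off-the-shelf graphical lasso solver; the sketch below uses scikit-learn's GraphicalLasso as a stand-in and is not the talk's maximum likelihood or discriminative training procedure for full GMM-HMM systems.

    from sklearn.covariance import GraphicalLasso

    def sparse_precision_gaussian(frames, alpha=0.05):
        # frames: (n_frames, n_dims) acoustic features aligned to one Gaussian component.
        # An L1-penalized fit yields a sparse inverse covariance that interpolates
        # between the diagonal and full covariance extremes.
        gl = GraphicalLasso(alpha=alpha).fit(frames)
        return gl.location_, gl.precision_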

Po-Ling Loh (UC Berkeley)
[homepage] [slides]

Title: Beyond the graphical Lasso: Structure learning via inverse covariance estimation
Abstract: In this talk, we highlight a variety of recent results in structure estimation for graphical models. Our results are motivated by well-known theory for Gaussian graphical models; however, a key insight is that methods such as the graphical Lasso and nodewise linear regression may be statistically consistent for edge recovery even when data are non-Gaussian. We outline some theoretical advances relating the support of generalized inverse covariance matrices to the edge structure of undirected graphs for discrete-valued variables, and also discuss the significance of the inverse covariance matrix for directed graphs in the case of linear structural equation models. Finally, we show how our proposed structure estimation methods may be adjusted to yield statistically consistent estimators even when data are systematically corrupted by noise or missing data.
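As a concrete instance of the nodewise linear regression approach mentioned above, here is a minimal Python sketch: regress each variable on all the others with the Lasso and read edges off the selected coefficients. The regularization level and the OR rule for combining the two regressions per pair are illustrative choices.

    import numpy as np
    from sklearn.linear_model import Lasso

    def nodewise_edges(X, alpha=0.1):
        # X: (n_samples, p) data matrix; returns a boolean (p, p) edge support matrix
        n, p = X.shape
        support = np.zeros((p, p), dtype=bool)
        for j in range(p):
            others = np.delete(np.arange(p), j)
            coef = Lasso(alpha=alpha).fit(X[:, others], X[:, j]).coef_
            support[j, others] = coef != 0
        # OR rule: keep an edge if either regression selects it
        return support | support.T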

Cho-Jui Hsieh (UT Austin)
[homepage] [slides]

Title: Sparse Inverse Covariance Estimation for a Million Variables
Abstract: The L1-regularized Gaussian maximum likelihood estimator has been shown to have strong statistical guarantees in recovering a sparse inverse covariance matrix even in high-dimensional settings. However, it requires solving a difficult non-smooth log-determinant program whose number of parameters scales quadratically with the number of Gaussian variables. State-of-the-art methods thus do not scale to problems with more than 20,000 variables. In this talk, we will describe a quadratic approximation method that can solve one-million-dimensional L1-regularized log-determinant problems (which thus have a trillion parameters) on a single computer. In spite of these approximations, the proposed BIGQUIC algorithm can achieve super-linear or even quadratic convergence rates. Finally, we will show that this approach can be extended to estimate GMRFs with a mix of sparse and low-rank structures.
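The "quadratic approximation" refers to a second-order model of the smooth part of the objective. Writing g(\Theta) = \mathrm{tr}(S\Theta) - \log\det\Theta, a Newton-type direction \Delta approximately minimizes

    g(\Theta + \Delta) \;\approx\; g(\Theta) + \mathrm{tr}\big[(S - \Theta^{-1})\,\Delta\big] + \tfrac{1}{2}\,\mathrm{tr}\big(\Theta^{-1}\Delta\,\Theta^{-1}\Delta\big)

plus the penalty \lambda\|\Theta + \Delta\|_1. In QUIC-style methods this L1-penalized quadratic subproblem is solved by coordinate descent restricted to an active set of variables, which is what keeps very high-dimensional problems tractable.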

Arindam Banerjee (University of Minnesota, Twin Cities)
[homepage] [slides]

Title: Large Scale Gaussian Copula Precision Estimation with Missing Values
Abstract: This talk will focus on both statistical and computational aspects of structure estimation in high-dimensional Gaussian copula graphical models using the CLIME estimator. As in the case of multivariate Gaussians, the structure estimation problem for Gaussian copulas reduces to the estimation of the sparse inverse of a suitable high-dimensional matrix. On the statistical side, we show that the sparse inverse can be estimated correctly even from multivariate samples with entries missing at random. Experimental results will be discussed illustrating different facets of the estimation. On the computational side, we will discuss CLIME-ADMM, a parallel algorithm for estimating the sparse matrix inverse. Experiments on high-dimensional distributions, e.g., one million dimensions and a corresponding trillion entries in the matrix, show that the method scales linearly with the number of cores, especially in a distributed-memory setting.
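For reference, the CLIME estimator mentioned above recovers the precision matrix \Omega column by column via the linear program

    \hat\Omega = \arg\min_{\Omega} \; \|\Omega\|_1 \quad \text{subject to} \quad \|\hat\Sigma\,\Omega - I\|_\infty \le \lambda,

followed by a symmetrization step, where \hat\Sigma is a suitable covariance surrogate. In the Gaussian copula setting, \hat\Sigma is typically built from rank correlations (e.g. Kendall's tau) rather than the raw sample covariance; the handling of missing values and the ADMM parallelization are as developed in the talk.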

Steven Rennie (IBM Research, Watson)
[homepage]

Title: Structured Deep Neural Networks for Automatic Speech Recognition
Abstract: Deep neural networks (DNNs) have re-defined the state of the art in fields like automatic speech recognition and computer vision in recent years. This success has been facilitated by progress in computing, and fueled by new models and algorithms for deep learning. A cornerstone of deep network representations is their apparent ability to automatically extract abstract, high-level representations for data modeling or classification, without the need to inject prior knowledge. In practice, however, DNNs can be tedious to optimize, and can benefit tremendously from information about the preferred structure of the network. In addition, there is accumulating evidence that the choice of non-linearity can significantly affect performance, perhaps because certain choices (e.g. conditionally linear) make the network easier to optimize. In this talk, I'll describe our recent investigations into structured DNNs for ASR. I'll also discuss some new deep network architectures that we've been exploring, their relation to the recently introduced maxout and sum-product network architectures, and experimental results demonstrating that such networks can be utilized to achieve state-of-the-art ASR performance.
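Of the architectures mentioned, maxout is easy to state: each hidden unit outputs the maximum over a small group of affine functions of its input, giving a piecewise-linear (conditionally linear) non-linearity. A minimal numpy sketch follows; the shapes and grouping are illustrative, not the talk's specific architecture.

    import numpy as np

    def maxout(x, W, b):
        # x: (d,) input; W: (d, m, k) weights; b: (m, k) biases.
        # Each of the m output units is the max over its k affine "pieces".
        z = np.einsum('d,dmk->mk', x, W) + b
        return z.max(axis=-1)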