Computational strategies for large-scale statistical data analysis

ICMS, 15 South College Street, Edinburgh, EH8 9AA

2 – 6 July 2018

Organisers

  • Guang Cheng, Purdue University
  • Chenlei Leng, University of Warwick


This workshop brings together active researchers working at the interface of statistics, machine learning and optimization. Its aims are:

  • to exchange recent methodological, algorithmic and theoretical developments in distributed data analysis and aggregated inference, with attention to the computational complexity and statistical properties of the relevant estimators;
  • to discuss open challenges, exchange research ideas and forge collaborations across three research areas: statistics, machine learning and optimization;
  • to promote the development of software that is both statistically justified and computationally efficient;
  • to engage more young UK researchers to work at the interface of computing and statistics.

Arrangements

PARTICIPATION

Invitations will be sent out by ICMS in March 2018.

VENUE

The workshop will be held at ICMS, 15 South College Street, Edinburgh, EH8 9AA.

TALKS AND AUDIO/VISUAL EQUIPMENT

All talks will be held in the Newhaven Lecture Theatre. The lecture theatre has a built-in computer, data projector, and visualiser/document camera. In addition, there are two blackboards; the projector and one board may be used simultaneously. Where possible, speakers are advised to bring their talk on a memory stick/USB to load onto our computer in advance of their session, either at the start of the day or during coffee/lunch breaks; ICMS staff can help with this. You may use your own laptop, but it is then your responsibility to change its resolution if necessary (ICMS will not change the resolution on our equipment). If you use a Mac, please bring your own adaptor.

UK VISAS

If you are travelling from outside the UK you may require an entry visa. A European visa does not guarantee entry to the UK. Please consult the UK Visas website to find out whether you need a visa and, if so, how to apply for one.

TRAVEL

Please note that it is your responsibility to have adequate travel insurance to cover medical and other emergencies that may occur on your trip.

A taxi directly from the airport will cost approximately 20.00 to 25.00 GBP to the city centre for a one-way journey. There is also a bus service direct from the airport to the city centre which will cost 4.50 GBP single or 7.50 GBP return - the Airlink 100.  This is a frequent service (every 10 minutes during peak times) and will bring you close to Waverley Railway Station, only a short walk to the workshop venue.

Lothian buses charge £1.70 for a single, £4.00 for a day ticket. Please note that the exact fare is required and no change is given.

If travelling by train, please note that Edinburgh has several railway stations - Waverley Railway Station being the main station and closest to the workshop venue at 15 South College Street. If you alight at Edinburgh Waverley, the workshop venue is an easy 10 minute walk over North and South Bridge.

Provisional programme

May be subject to change

Monday 2 July

08:30-09:20

Registration in the Chapterhouse, Level 1

09:20-09:30

Welcome

09:30-10:15

David Dunson (Duke University)
Scaling up Bayesian inference

10:15-10:45

Tea/Coffee in the Chapterhouse, Level 1

10:45-11:30

Ata Kaban (University of Birmingham)
Structure aware generalisation error bounds using random projections

11:30-12:15

Ping Ma (University of Georgia)
To follow

12:15-14:00

Lunch provided in the Chapterhouse, Level 1

14:00-14:45

Yoonkyung Lee (The Ohio State University)
To follow

14:45-15:30

Jinchi Lv (University of Southern California)
Asymptotics of eigenvectors and eigenvalues for large structured random matrices

15:30-16:00

Tea/Coffee in the Chapterhouse, Level 1

16:00-16:45

Faming Liang (Purdue University)
To follow

16:45-17:30

Eric Xing (Carnegie Mellon University)
To follow

17:30-18:30

Informal wine reception in the Chapterhouse

 

Tuesday 3 July

09:00-09:45

Mladen Kolar (University of Chicago)
Recovery of simultaneous low rank and two-way sparse coefficient matrices, a nonconvex approach

09:45-10:30

Guang  Cheng (Purdue University)
Large-scale nearest neighbor classification with statistical guarantee

10:30-11:00

Tea/Coffee in the Chapterhouse, Level 1

11:00-11:45

Yining  Chen (LSE)        
Narrowest-Over-Threshold detection of multiple change-points and change-point-like features

11:45-12:30

Haeran Cho (University of Bristol)        
To follow

12:30-14:00

Lunch provided in the Chapterhouse, Level 1

14:00-14:45

Jason Lee (University of Southern California)    
Geometry of optimization landscapes and implicit regularization of optimization algorithms

14:45-15:30

Junchi Li (Princeton University)        
Stochastic approximation for feature extraction: nonconvex optimization, online learning, and diffusion approximation

15:30-16:00

Tea/Coffee in the Chapterhouse, Level 1

16:00-16:45

Jeremias Knoblauch (University of Warwick)
Bayesian on-line changepoint detection and model selection in high-dimensional data

16:45-17:30

Stanislav Volgushev (University of Toronto)        
To follow

17:30

Doors open to the public

18:00-19:00

Possible public lecture (to be decided)

19:00-20:00

Informal wine reception in the Chapterhouse, Level 1

 

Wednesday 4 July

09:00-09:45

Hua Zhou (University of California, Los Angeles)        
To follow

09:45-10:30

Matteo Fasiolo (University of Bristol)        
Calibrated additive quantile regression

10:30-11:00

Tea/Coffee in the Chapterhouse, Level 1

11:00-11:45

Didong Li (Duke University)        
Efficient manifold and subspace approximations with spherelets

11:45-12:30

Eric Xing (Carnegie Mellon University)
To follow

12:30

Free afternoon

 

Thursday 5 July

09:00-09:45

Mouli Banerjee  (University of Michigan)        
To follow

09:45-10:30

Qifan Song  (Purdue University)        
To follow

10:30-11:00

Tea/Coffee in the Chapterhouse, Level 1

11:00-11:45

Binyan  Jiang (The Hong Kong Polytechnic University)        
Penalized interaction estimation for ultrahigh dimensional quadratic regression

11:45-12:30

Hao Zhang (University of Arizona)        
To follow

12:30-14:00

Lunch provided in the Chapterhouse, Level 1

14:00-14:45

Mike Bing (Cornell University)        
To follow

14:45-15:30

Cheng Qian (London School of Economics)        
Covariance and graphical modelling for high-dimensional longitudinal and functional data

15:30-16:00

Tea/Coffee in the Chapterhouse, Level 1

16:00-16:45

Xiangyu Wang (Google)        
To follow

16:45-17:30

Chao Zheng (Lancaster University)        
To follow

 

Friday 6 July

09:00-09:45

Wenxuan Zhong (University of Georgia)        
Leverage sampling to overcome the computational challenges for big spatial data

09:45-10:15

Xiaoming Huo  (Georgia Institute of Technology)        
To follow

10:15-10:45

Tea/Coffee in the Chapterhouse

10:45-11:30

Sandipan Roy (University College London)        
Network homogeneity via spectral unimodality: concept, theory and application

11:30

Closing and discussion

 

Abstracts

Yining Chen
Narrowest-over-threshold detection of multiple change-points and change-point-like features
We propose a new, generic and flexible methodology for nonparametric function estimation, in which we first estimate the number and locations of any features that may be present in the function, and then estimate the function parametrically between each pair of neighbouring detected features. Examples of features handled by our methodology include change-points in the piecewise-constant signal model, kinks in the piecewise-linear signal model, and other similar irregularities, which we also refer to as generalised change-points.

Our methodology works with only minor modifications across a range of generalised change-point scenarios, and we achieve such a high degree of generality by proposing and using a new multiple generalised change-point detection device, termed Narrowest-Over-Threshold (NOT). The key ingredient of NOT is its focus on the smallest local sections of the data on which the existence of a feature is suspected. Crucially, this adaptive localisation technique prevents NOT from considering subsamples containing two or more features, a key factor that ensures the general applicability of NOT.

For selected scenarios, we show the consistency and near-optimality of NOT in detecting the number and locations of generalised change-points. Furthermore, we discuss its computational cost and demonstrate that the total cost is close-to-linear.
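A minimal sketch of the NOT recursion for the simplest case (piecewise-constant mean with a CUSUM contrast), assuming Gaussian-style noise; threshold selection, the exact interval-drawing scheme and the other generalised change-point classes are omitted, and the function names are illustrative rather than taken from the authors' software.

    import numpy as np

    def cusum_contrast(x):
        # CUSUM contrast for a single change in mean on the segment x;
        # returns (max absolute statistic, split position within the segment)
        n = len(x)
        if n < 2:
            return 0.0, None
        s = np.cumsum(x)
        b = np.arange(1, n)  # candidate split points (size of the left part)
        stat = np.sqrt((n - b) / (n * b)) * s[b - 1] \
             - np.sqrt(b / (n * (n - b))) * (s[-1] - s[b - 1])
        k = int(np.argmax(np.abs(stat)))
        return float(np.abs(stat[k])), k + 1

    def not_changepoints(x, threshold, n_intervals=1000, seed=0):
        # Narrowest-Over-Threshold, piecewise-constant mean case (illustration only)
        rng = np.random.default_rng(seed)
        found = []

        def recurse(lo, hi):
            if hi - lo < 2:
                return
            intervals = np.sort(rng.integers(lo, hi + 1, size=(n_intervals, 2)), axis=1)
            candidates = []
            for s, e in intervals:
                if e - s < 2:
                    continue
                val, k = cusum_contrast(x[s:e])
                if val > threshold:
                    candidates.append((e - s, s + k))  # (interval width, change-point)
            if not candidates:
                return
            _, cp = min(candidates)  # narrowest interval over the threshold wins
            found.append(cp)
            recurse(lo, cp)
            recurse(cp, hi)

        recurse(0, len(x))
        return sorted(found)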

Guang Cheng
Large-scale nearest neighbor classification with statistical guarantee
In this talk, we develop distributed classification methods based on the nearest neighbor principle. Through majority voting, the distributed classifier can achieve the oracle rate of regret and instability if the neighbors are carefully chosen. The only loss is a multiplicative constant that depends only on the data dimensionality; this can be remedied by replacing majority voting with continuous aggregation.
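A toy illustration of the divide-and-conquer scheme sketched above (random split, local nearest-neighbor vote on each machine, then a majority vote over machines); the oracle choice of the local number of neighbors and the continuous-aggregation refinement are not shown.

    import numpy as np

    def knn_vote(X, y, query, k):
        # plain k-nearest-neighbour majority vote for a single query point;
        # class labels are assumed to be coded as 0, 1, ... integers
        nearest = np.argsort(np.linalg.norm(X - query, axis=1))[:k]
        return int(np.bincount(y[nearest]).argmax())

    def distributed_knn(X, y, query, k, n_machines, seed=0):
        # randomly split the sample across machines, run kNN locally on each
        # shard, then combine the local predictions by a second majority vote
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(X))
        votes = [knn_vote(X[part], y[part], query, k)
                 for part in np.array_split(idx, n_machines)]
        return int(np.bincount(np.array(votes)).argmax())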

Haeran Cho
Title to follow

David Dunson
Scaling up Bayesian inference
This talk discusses some recent work on scaling up Bayesian inference to large sample size (n) and/or high-dimensional (large p) problems. We focus in particular on algorithms having theoretical guarantees on performance, including divide-and-conquer Markov chain Monte Carlo (MCMC) and approximate MCMC. We are also heavily motivated by real world applications in the sciences.
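One simple member of the divide-and-conquer family, consensus Monte Carlo (and not necessarily the specific algorithms of the talk), runs MCMC on each data shard with the prior tempered to the power 1/M and then combines the shard-level draws by precision-weighted averaging; the sketch below assumes approximately Gaussian subposteriors.

    import numpy as np

    def consensus_combine(subposterior_draws):
        # subposterior_draws: list of M arrays, each of shape (n_draws, dim),
        # produced by running MCMC on each shard with the prior raised to 1/M.
        # Combined draw s: (sum_m W_m)^{-1} sum_m W_m theta_m^(s), with W_m the
        # inverse sample covariance of shard m (exact for Gaussian subposteriors).
        weights = [np.linalg.inv(np.atleast_2d(np.cov(d, rowvar=False)))
                   for d in subposterior_draws]
        total_inv = np.linalg.inv(sum(weights))
        combined = sum(d @ w for d, w in zip(subposterior_draws, weights))
        return combined @ total_inv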

Matteo Fasiolo
Calibrated additive quantile regression

Xiaoming Huo
To follow

Binyan Jiang
Penalized interaction estimation for ultrahigh dimensional quadratic regression
Quadratic regression goes beyond the linear model by simultaneously including main effects and interactions between the covariates. The problem of interaction estimation in high dimensional quadratic regression has received extensive attention in the past decade. In this article we introduce a novel method which allows us to estimate the main effects and interactions separately. Unlike existing methods for ultrahigh dimensional quadratic regression, our proposal does not require the widely used heredity assumption. In addition, our proposed estimates have explicit formulas and obey the invariance principle at the population level. We estimate the interactions in matrix form under a penalized convex loss function. The resulting estimates are shown to be consistent even when the covariate dimension is exponential in the sample size. We develop an efficient ADMM algorithm to implement the penalized estimation. This ADMM algorithm fully exploits the cheap computational cost of matrix multiplication and is hence much more efficient than existing penalized methods under heredity constraints. We demonstrate the promising performance of our proposal through extensive numerical studies.

Ata Kaban
Structure aware generalisation error bounds using random projections
We present new risk bounds for binary classification in high-dimensional settings where the dimensionality can be larger than the sample size. In particular, we prove upper bounds for 'compressive learning' by empirical risk minimization (ERM), that is, when the ERM classifier is learned from data that have been projected from the high-dimensional space onto a randomly selected low-dimensional subspace, as well as uniform upper bounds in the full high-dimensional space. For the latter, random projection serves as an analytic tool to discover and exploit benign geometry inherent in the data by means of the distortion of the loss function of a learning problem. Our bounds are data dependent, highlight benign geometric structures that make a classification problem easier or harder, and remain informative in small sample conditions. We then use our approach to draw connections between existing classification approaches, and finally we empirically demonstrate that our basic bound is informative enough in practice to serve as the objective function for learning a classifier.
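A minimal sketch of the compressive-learning setting described above, with a Gaussian random projection and a logistic-regression classifier standing in for the generic ERM learner (scikit-learn is assumed to be available):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def compressive_erm(X, y, k, seed=0):
        # project the d-dimensional data onto a random k-dimensional subspace
        # (k << d), then learn the classifier entirely in the compressed space
        rng = np.random.default_rng(seed)
        R = rng.normal(size=(X.shape[1], k)) / np.sqrt(k)  # random projection matrix
        clf = LogisticRegression(max_iter=1000).fit(X @ R, y)
        predict = lambda X_new: clf.predict(X_new @ R)
        return predict, R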

Jeremias Knoblauch
Bayesian on-line changepoint detection and model selection in high-dimensional data
Bayesian on-line changepoint detection is extended to on-line model selection and non-stationary spatio-temporal processes. We propose spatially structured Vector Autoregressions (VARs) for modelling the process between changepoints (CPs) and give an upper bound on the approximation error of such models. The resulting algorithm performs prediction, model selection and CP detection on-line. Its time complexity is linear and its space complexity constant, making it two orders of magnitude faster than its closest competitor. In addition, it outperforms the state of the art for multivariate data.
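For context, the standard Bayesian on-line changepoint framework (Adams and MacKay, 2007) that this work extends maintains a posterior over the current run length r_t through the recursion

    p(r_t, x_{1:t}) = \sum_{r_{t-1}} p(x_t \mid r_{t-1}, x^{(r)}) \, p(r_t \mid r_{t-1}) \, p(r_{t-1}, x_{1:t-1}),

where p(r_t | r_{t-1}) is the hazard-based changepoint prior and the predictive term is evaluated under the within-segment model (here, the proposed spatially structured VARs) using only the data x^{(r)} in the current run.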

Mladen Kolar
Recovery of simultaneous low rank and two-way sparse coefficient matrices, a nonconvex approach
We study the problem of recovering matrices that are simultaneously low rank and row and/or column sparse. Such matrices appear in recent applications in cognitive neuroscience, imaging, computer vision, macroeconomics, and genetics. We propose a GDT (Gradient Descent with hard Thresholding) algorithm to efficiently recover matrices with such structure, by minimizing a bi-convex function over a nonconvex set of constraints. We show linear convergence of the iterates obtained by GDT to a region within statistical error of an optimal solution. As an application of our method, we consider multi-task learning problems and show that the statistical error rate obtained by GDT is near-optimal compared to the minimax rate. Experiments demonstrate competitive performance and much faster running speed compared to existing methods, on both simulations and real data sets.
Joint work with Ming Yu, Zhaoran Wang, and Varun Gupta.
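A rough sketch of the gradient-descent-with-hard-thresholding template, written for a factorised parameter Theta = U V^T; the initialization, step-size choices and exact thresholding rules of the paper are not reproduced, and grad_fn is a placeholder for the gradient of whatever smooth loss is being minimised.

    import numpy as np

    def keep_largest_rows(M, s):
        # hard thresholding: keep the s rows of M with largest Euclidean norm
        out = np.zeros_like(M)
        keep = np.argsort(np.linalg.norm(M, axis=1))[-s:]
        out[keep] = M[keep]
        return out

    def gdt(grad_fn, U0, V0, s1, s2, step=1e-2, n_iter=500):
        # grad_fn(U, V) -> (dU, dV): partial gradients of the smooth loss at U V^T
        # s1, s2: assumed numbers of nonzero rows / columns of the target matrix
        U, V = U0.copy(), V0.copy()
        for _ in range(n_iter):
            dU, dV = grad_fn(U, V)
            U = keep_largest_rows(U - step * dU, s1)  # enforces row sparsity
            V = keep_largest_rows(V - step * dV, s2)  # enforces column sparsity
        return U @ V.T                                # low rank by construction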

Jason Lee
Geometry of optimization landscapes and implicit regularization of optimization algorithms
We first study the problem of learning a Gaussian-input two-layer ReLU network with a positive output layer, and the symmetric matrix completion problem. Despite the non-convexity of both problems, we prove that every local minimizer is a global minimizer. Since gradient descent converges to local minimizers, this shows that simple gradient-based methods can find the global optimum of these non-convex problems. In the second part, we analyze the implicit regularization effects of various optimization algorithms. In particular, we prove that for least squares with mirror descent, the algorithm converges to the closest solution in terms of the Bregman divergence. For linearly separable classification problems, we prove that steepest descent with respect to a norm solves the SVM with respect to the same norm. For over-parametrized non-convex problems such as matrix sensing or neural networks with quadratic activations, we prove that gradient descent converges to the minimum nuclear norm solution, which allows for both meaningful optimization and generalization guarantees.
This is joint work with Rong Ge, Suriya Gunasekar, Tengyu Ma, Mor Shpigel, Daniel Soudry, and Nati Srebro.

Didong Li
Efficient manifold and subspace approximations with spherelets
Data lying in a high-dimensional ambient space are commonly thought to have a much lower intrinsic dimension. In particular, the data may be concentrated near a lower-dimensional subspace or manifold. There is an immense literature focused on approximating the unknown subspace, and on exploiting such approximations in clustering, data compression, and building of predictive models. Most of the literature relies on approximating subspaces using a locally linear, and potentially multiscale, dictionary. In this presentation, we propose a simple and general alternative, which instead uses pieces of spheres, or spherelets, to locally approximate the unknown subspace. Theory is developed showing that spherelets can produce dramatically lower covering numbers and MSEs for many manifolds. We develop spherical principal components analysis (SPCA). Results relative to state-of-the-art competitors show dramatic gains in ability to accurately approximate the subspace with orders of magnitude fewer components. This leads to substantial gains in data compressibility, fewer clusters and hence better interpretability, and much lower MSE based on small to moderate sample sizes. Furthermore, a Bayesian nonparametric model based on spherelets is built to estimate the manifold and the density simultaneously.
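A hedged sketch of the basic local building block (not the paper's exact SPCA estimator): project a neighbourhood onto its best (d+1)-dimensional PCA subspace and fit a sphere inside it by an algebraic least-squares fit.

    import numpy as np

    def fit_spherelet(X, d):
        # X: (n, D) points from one neighbourhood; d: assumed intrinsic dimension
        mu = X.mean(axis=0)
        _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
        V = Vt[:d + 1].T                      # basis of the (d+1)-dim PCA subspace
        Y = (X - mu) @ V                      # local coordinates in that subspace
        # algebraic sphere fit: ||y||^2 = 2 y.c + b with b = r^2 - ||c||^2
        A = np.hstack([2 * Y, np.ones((len(Y), 1))])
        sol, *_ = np.linalg.lstsq(A, np.sum(Y ** 2, axis=1), rcond=None)
        c, b = sol[:-1], sol[-1]
        radius = np.sqrt(max(b + c @ c, 0.0))
        centre = mu + V @ c                   # sphere centre in ambient coordinates
        return centre, radius, V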

Junchi Li
Stochastic approximation for feature extraction: nonconvex optimization, online learning, and diffusion approximation
Statistical dimension-reduction and feature-extraction procedures such as Principal Component Analysis (PCA), Independent Component Analysis (ICA), Partial Least Squares (PLS) and Canonical Correlation Analysis (CCA) present significant computational challenges with large-scale data. Online algorithms that update the estimator by processing streaming data on-the-fly are of tremendous practical and theoretical interest. In this talk, we formulate these statistical procedures as nonconvex statistical optimization problems and view the corresponding online algorithms as stochastic approximation methods. Using standard tools of martingale concentration inequalities, we obtain for the first time a finite-sample error bound of order O(\sqrt{d/N}). We prove that (i) for PCA, this bound closely matches the minimax information lower bound (L. et al., 2015 Mathematical Programming; L. et al., 2017 NIPS); (ii) for the tensor formulation of ICA, it is consistent with the computational lower bound for the spiked tensor model (L., Wang & Liu, 2016 NIPS; L. et al., 2017; Wang, Yang, L. & Liu, 2016); (iii) for PLS and CCA, a novel online location-dependent stochastic gradient method is proposed to achieve this bound. Time permitting, we further outline recent progress on diffusion approximation methods for generic first-order algorithms. We show that, in the continuous-time framework, the first-order stochastic gradient method (Hu & L., 2017+) as well as the stochastic heavy-ball method (Hu, L. & Su, 2017+) quickly avoid all saddle points within a time complexity that is inversely proportional to the step size.
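For the PCA case, a generic stochastic-approximation update of the kind analysed above is the classical Oja iteration, which processes one observation at a time with O(d) memory; this is a textbook sketch rather than the speakers' algorithms, and the step size is left constant for simplicity.

    import numpy as np

    def oja_top_eigenvector(stream, d, step=1e-2, seed=0):
        # one-pass stochastic approximation for the leading principal component
        rng = np.random.default_rng(seed)
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)
        for x in stream:                # x: a single observation in R^d
            u += step * x * (x @ u)     # stochastic ascent step for u' Sigma u
            u /= np.linalg.norm(u)      # retract back onto the unit sphere
        return u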

Jinchi Lv
Asymptotics of eigenvectors and eigenvalues for large structured random matrices
Characterizing the exact asymptotic distributions of high-dimensional eigenvectors for large structured random matrices poses important challenges yet can provide useful insights into a range of applications. This paper establishes the asymptotic properties of the spiked eigenvectors and eigenvalues for the generalized Wigner random matrix, where the mean matrix is assumed to have a low-rank structure. Under some mild regularity conditions, we provide the asymptotic expansions for the spiked eigenvalues and show that they are asymptotically normal after some normalization. For the spiked eigenvectors, we provide novel asymptotic expansions for the general linear combination and further show that the linear combination is asymptotically normal after some normalization, where the weight vector can be an arbitrary unit vector. Simulation studies verify the validity of our new theoretical results. Our family of models encompasses many popularly used ones such as the stochastic block models with or without overlapping communities for network analysis and the topic models for text analysis, and our general theory can be exploited for statistical inference in these large-scale applications. This is a joint work with Jianqing Fan, Yingying Fan and Xiao Han.

Ping Ma
Asympirical method: a new paradigm for statistical analysis for large samples
Traditional statistical theory and methods were developed for small and moderate sample sizes: model fitting and inference are carried out on such samples to obtain empirical results, and asymptotic theory is then established to extrapolate the performance of these empirical results to large samples. This traditional, coherent paradigm of statistical analysis falls apart for large samples, the key challenge being that many traditional statistical methods are computationally too expensive to yield meaningful empirical results. A new statistical paradigm for the analysis of large samples is therefore urgently needed.

In this talk, I will present an asympirical (asymptotic + empirical) method, which is designed by the principle that theory informs practice. I will present it in the context of smoothing spline ANOVA models. Simulation and real data analysis will be used to demonstrate the performance of the new paradigm.

Cheng Qian
Covariance and graphical modelling for high-dimensional longitudinal and functional data
The problem of estimating functional covariance and graphical models from a data set consisting of multivariate sparse longitudinal data is considered. The underlying trajectories are represented through functional principal components expansions, where the covariance matrix of the principal component scores characterizes the global covariance feature and the principal component functions provide the functional representation of the covariance relationships. Our proposed estimation procedure first implements a nonparametric method to perform functional principal components analysis for sparse longitudinal data, and then computes functional regularized covariance or precision matrices. We derive the relevant concentration inequalities for high dimensional sparsely sampled functional data and use them to establish uniform consistency results for our proposed estimators. The finite sample performance of our proposed methods is illustrated through an extensive set of simulation studies and two real data examples.

Sandipan Roy
Network homogeneity via spectral unimodality: concept, theory and application

Stefan Stein
To follow

Lizhu Tao
To follow

Eric Xing
To follow

Chen Zhang
Variational Gaussian approximation for Poisson data
The Poisson model is frequently employed to describe count data, but in a Bayesian context it leads to an analytically intractable posterior probability distribution. In this work, we analyze a variational Gaussian approximation to the posterior distribution arising from the Poisson model with a Gaussian prior. This is achieved by seeking an optimal Gaussian distribution minimizing the Kullback–Leibler divergence from the posterior distribution to the approximation, or equivalently maximizing the lower bound for the model evidence. We derive an explicit expression for the lower bound, and show the existence and uniqueness of the optimal Gaussian approximation. The lower bound functional can be viewed as a variant of classical Tikhonov regularization that penalizes also the covariance. Then we develop an efficient alternating direction maximization algorithm for solving the optimization problem, and analyze its convergence. We discuss strategies for reducing the computational complexity via low rank structure of the forward operator and the sparsity of the covariance. Further, as an application of the lower bound, we discuss hierarchical Bayesian modeling for selecting the hyperparameter in the prior distribution, and propose a monotonically convergent algorithm for determining the hyperparameter. We present extensive numerical experiments to illustrate the Gaussian approximation and the algorithms.
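To illustrate the shape of the bound, assume for concreteness a log-link parametrisation with forward map A (which may differ from the exact formulation in the talk), prior N(x_0, \Sigma_0) and variational family q = N(m, C); since E_q[\exp(a_i^\top x)] = \exp(a_i^\top m + a_i^\top C a_i / 2), the lower bound reads

    F(m, C) = \sum_i \Big[ y_i (Am)_i - \exp\big( (Am)_i + \tfrac{1}{2} (A C A^\top)_{ii} \big) \Big] - \mathrm{KL}\big( \mathcal{N}(m, C) \,\|\, \mathcal{N}(x_0, \Sigma_0) \big) + \mathrm{const},

and the alternating direction scheme maximises it over m with C fixed and over C with m fixed in turn.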

Hua Zhou
To follow

Wenxuan Zhong
Leverage sampling to overcome the computational challenges for big spatial data
Within the realms of big data, spatial data is one of the fastest growing data types and poses significant challenges for statistical analysis. With advances in remote sensors, sensor networks, and the proliferation of location-sensing devices in daily life activities and common business practices, the generation of disparate, dynamic, and geographically distributed spatial data has exploded in recent years. In addition, large-scale spatiotemporal data generated by social media outlets are proving to be highly useful in disease surveillance, disaster mapping, and national security applications.

In general, spatial models are limited in their efficacy for large data sets due to their expensive computational cost. For example, in spatial modeling the computation typically scales super-linearly (often cubically) in n, the number of spatial locations. Methods for reducing the computational burden in spatial models are becoming more common in the spatial statistics literature.

Here we introduce a data-dependent random sampling scheme called the spatial leveraging sampling method. Leverage sampling has been successful in improving the computational efficiency of analyses with independent data. Taking into account the spatial dependency of the data, we develop the spatial leveraging score under the Gaussian spatial linear model. An empirical evaluation of the leverage-based methods is carried out on both synthetic and real datasets. The empirical results indicate that our method has good practical performance. For example, with the same computational reduction as in the random uniform sampling approach, our proposed spatial leveraging sampling method typically leads to an improved mean squared error. To establish the potential of our method, we model the daily spatiotemporal Twitter trend in the United States. Our results show that the proposed spatial leveraging sampling approach can provide computationally efficient and informative estimates of spatiotemporal social media trends.
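As a point of reference, the independent-data version of leverage sampling for least squares can be sketched as follows; the spatial method above replaces these i.i.d. leverage scores with ones derived under the Gaussian spatial linear model.

    import numpy as np

    def leverage_subsample_ols(X, y, m, seed=0):
        # sample m rows with probability proportional to their leverage scores
        # h_ii (row norms of the thin-QR factor), then solve a reweighted
        # least-squares problem on the subsample
        rng = np.random.default_rng(seed)
        Q, _ = np.linalg.qr(X)
        h = np.sum(Q ** 2, axis=1)
        p = h / h.sum()
        idx = rng.choice(len(X), size=m, replace=True, p=p)
        w = 1.0 / np.sqrt(m * p[idx])         # importance-sampling weights
        beta, *_ = np.linalg.lstsq(X[idx] * w[:, None], y[idx] * w, rcond=None)
        return beta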