Computational strategies for large-scale statistical data analysis

ICMS, 15 South College Street, Edinburgh, EH8 9AA

2 – 6 July 2018

Organisers

  • Guang Cheng, Purdue University
  • Chenlei Leng, University of Warwick

Large-scale data are increasingly encountered in biology, medicine, engineering, the social sciences and economics as measurement technology advances. A distinctive feature of such data is that they usually come with a large sample size and/or a large number of features, creating challenges even for data storage and processing, let alone data analysis. Classical statistical methodology, theory and computation, on the other hand, were developed under the assumption that the entire data set resides in a central location. As a result, most classical statistical methods face computational challenges when analysing large-scale data in the big data era. Specifically, big data are often said to possess the so-called 4D features: Distributed, Dirty, Dimensionality and Dynamic. These features, which are often mixed together in practice, make it very challenging to apply traditional statistical thinking to massive data.

Bringing together active researchers working at the interface of statistics, machine learning and optimization, the aims of this workshop are:

  • to exchange recent methodological, algorithmic and theoretical developments in distributed data analysis and aggregated inference, with consideration of the computational complexity and statistical properties of the relevant estimators;
  • to discuss open challenges, exchange research ideas and forge collaborations across three research areas: statistics, machine learning and optimization;
  • to promote the development of software with justified statistical properties and efficient computation;
  • to engage more young UK researchers to work at the interface of computing and statistics.

Photographs of the workshop delegates are now available on our Flickr account.

Arrangements

You are expected to arrive at ICMS on Monday 2 July between 08:30 and 09:20 for registration. The workshop will begin with a welcome talk at 09:20 followed by the first talk at 09:30. The workshop will close at lunchtime on Friday 6 July 2018.

PARTICIPATION

The registration period for this workshop is now closed.

REGISTRATION FEE

There will be a registration fee of 100.00 GBP for this workshop, collected at registration on arrival at ICMS. The registration fee is payable by all participants with the exception of organisers and local participants.

VENUE

The workshop will be held at ICMS, 15 South College Street, Edinburgh, EH8 9AA.

TALKS AND AUDIO/VISUAL EQUIPMENT

All talks will be held in the Newhaven Lecture Theatre. The lecture theatre has a built-in computer, data projector, and visualiser/document camera. In addition, there are two blackboards. The projector and one board may be used simultaneously. We advise speakers, where possible, to bring their talk on a memory stick/USB to put onto our computer in advance of their session, either at the start of the day or during coffee/lunch breaks. ICMS staff can help with this. You may use your own laptop, but it is then your responsibility to change the resolution on your laptop if necessary (ICMS will not change the resolution on our equipment). If you use a Mac, we expect you to bring your own adaptor.

UK VISAS

If you are travelling from outside the UK you may require an entry visa. A European visa does not guarantee entry to the UK. Please consult the UK Visas website to find out whether you need a visa and, if so, how to apply for one.

TRAVEL

Please note that it is your responsibility to have adequate travel insurance to cover medical and other emergencies that may occur on your trip.

A taxi directly from the airport to the city centre costs approximately 20.00 to 25.00 GBP for a one-way journey. There is also a direct bus service from the airport to the city centre, the Airlink 100, which costs 4.50 GBP single or 7.50 GBP return. This is a frequent service (every 10 minutes at peak times) and will bring you close to Waverley Railway Station, only a short walk from the workshop venue.

Lothian buses charge £1.70 for a single, £4.00 for a day ticket. Please note that the exact fare is required and no change is given.

If travelling by train, please note that Edinburgh has several railway stations, with Waverley Railway Station being the main station and the closest to the workshop venue at 15 South College Street. If you alight at Edinburgh Waverley, the workshop venue is an easy 10-minute walk over North Bridge and South Bridge.

CATERING

Refreshments will be served during the coffee breaks listed in the programme. Lunch will be provided on Monday, Tuesday and Thursday. As Wednesday is a free afternoon, lunch is not provided and participants are free to visit the many cafés and restaurants nearby. The workshop will close at lunchtime on Friday to enable travel that day; again, no lunch is provided. There will be an informal wine reception on Monday evening, and a workshop dinner will take place at Blonde Restaurant, 75 St Leonard's Street, Edinburgh, at 19:00 on Thursday 5 July 2018. There is no charge to workshop participants for attendance at the dinner.

PUBLIC LECTURE

On Monday 2 July, there will be a Public Lecture, 18:00-19:00, given by Eric Xing, Carnegie Mellon University, entitled A builder’s take on AI: magical, useful, or usable? Doors will open to the public at 17:30. The lecture will be followed by an informal wine reception from 19:00-19:45. Places have been reserved for all delegates.

Provisional programme

May be subject to change

Monday 2 July

08:30-09:20

Registration in the Chapterhouse, Level 1

09:20-09:30

Welcome

09:30-10:15

David Dunson (Duke University)
Scaling up Bayesian inference

10:15-10:45

Tea/Coffee in the Chapterhouse, Level 1

10:45-11:30

Ata Kaban (University of Birmingham)
Structure aware generalisation error bounds using random projections

11:30-12:15

Eric Xing (Carnegie Mellon University)
On system and algorithm co-design and automatic machine learning

12:15-14:00

Lunch provided in the Chapterhouse, Level 1

14:00-14:45

Yoonkyung Lee (The Ohio State University)
Dimensionality reduction for exponential family data

14:45-15:30

Jinchi Lv (University of Southern California)
Asymptotics of eigenvectors and eigenvalues for large structured random matrices

15:30-15:35

Group photograph to be taken

15:35-16:00

Tea/Coffee in the Chapterhouse, Level 1

16:00-16:45

Faming Liang (Purdue University)
Markov neighbourhood regression for high-dimensional inference

16:45-17:30

Ping Ma (University of Georgia)
Asympirical analysis: a new paradigm for data science

17:40

Doors open to the public

18:00-19:00

Public lecture by Eric Xing (Carnegie Mellon University)
A builder’s take on AI: magical, useful, or usable?

19:00-19:45

Informal wine reception in the Chapterhouse, Level 1

 

Tuesday 3 July

09:00-09:45

Mladen Kolar (University of Chicago)
Recovery of simultaneous low rank and two-way sparse coefficient matrices, a nonconvex approach

09:45-10:30

Guang Cheng (Purdue University)
Large-scale nearest neighbour classification with statistical guarantee

10:30-11:00

Tea/Coffee in the Chapterhouse, Level 1

11:00-11:45

Yining Chen (LSE)
Narrowest-Over-Threshold detection of multiple change-points and change-point-like features

11:45-12:30

Haeran Cho (University of Bristol)        
Multiscale MOSUM procedure with localised pruning

12:30-14:00

Lunch provided in the Chapterhouse, Level 1

14:00-14:45

Jason Lee (University of Southern California)
Geometry of optimization landscapes and implicit regularization of optimization algorithms

14:45-15:30

Chen Zhang (University College London)
Variational Gaussian approximation for Poisson data

15:30-16:00

Tea/Coffee in the Chapterhouse, Level 1

16:00-16:45

Jeremias Knoblauch (University of Warwick)
Bayesian on-line changepoint detection and model selection in high-dimensional data

16:45-17:30

Stanislav Volgushev (University of Toronto)        
Distributed inference for quantile regression processes

 

Wednesday 4 July

09:00-09:45

Hua Zhou (University of California, Los Angeles)        
Global solutions of generalized canonical correlation analysis problems

09:45-10:30

Matteo Fasiolo (University of Bristol)        
Calibrated additive quantile regression

10:30-11:15

Tea/Coffee in the Chapterhouse, Level 1

11:15-12:00

Wenxuan Zhong (University of Georgia)        
Leverage sampling to overcome the computational challenges for big spatial data

12:00

Free afternoon

 

Thursday 5 July

09:00-09:45

Moulinath Banerjee  (University of Michigan)        
Divide and conquer in nonstandard problems: the super-efficiency phenomenon

09:45-10:30

Qifan Song (Purdue University)
Bayesian shrinkage towards sharp minimaxity

10:30-11:00

Tea/Coffee in the Chapterhouse, Level 1

11:00-11:45

Binyan Jiang (The Hong Kong Polytechnic University)
Penalized interaction estimation for ultrahigh dimensional quadratic regression

11:45-12:30

Chao Zheng (Lancaster University)
Revisiting Huber’s M-estimation: a tuning-free approach

12:30-14:00

Lunch provided in the Chapterhouse, Level 1

14:00-14:45

Xin Bing (Cornell University)        
A fast algorithm with minimax optimal guarantees for topic models with an unknown number of topics

14:45-15:30

Cheng Qian (London School of Economics)        
Covariance and graphical modelling for high-dimensional longitudinal and functional data

15:30-16:00

Tea/Coffee in the Chapterhouse, Level 1

 

 

19:00

Workshop dinner at Blonde Restaurant, 75 St Leonard's Street, Edinburgh

 

Friday 6 July 

09:30-10:15

Didong Li (Duke University)        
Efficient manifold and subspace approximations with spherelets

10:15-11:00

Xiaoming Huo (Georgia Institute of Technology)        
Non-convex optimization and statistical properties

11:00-11:15

Discussion

11:15-12:00

Tea/Coffee in the Chapterhouse, Level 1

12:00

Close of workshop

 

 

Abstracts

Moulinath Banerjee
Divide and conquer in nonstandard problems: the super-efficiency phenomenon
We study how the divide and conquer principle - partition the available data into subsamples, compute an estimate from each subsample and combine these appropriately to form the final estimator - works in non-standard problems where rates of convergence are typically slower than the square root of n and limit distributions are non-Gaussian, with a special emphasis on the least squares estimator (and its inverse) of a monotone regression function.

We find that the pooled estimator, obtained by averaging non-standard estimates across the mutually exclusive subsamples (typically stored on different servers), outperforms the non-standard estimator based on the entire sample in the sense of pointwise inference. We also show that, under appropriate conditions, if the number of subsamples is allowed to increase at appropriate rates, the pooled estimator is asymptotically normally distributed with a variance that is empirically estimable from the subsample-level estimates. Further, in the context of monotone function estimation we show that this gain in pointwise efficiency comes at a price - the pooled estimator's performance, in a uniform sense (maximal risk) over a class of models worsens as the number of subsamples increases, leading to a version of the super-efficiency phenomenon. 

Finally, we propose an alternative pooled estimator that merges data from the different servers in an appropriate way - namely, by smoothing the data over bins at each local server, conveying these summary statistics to a central machine, and performing isotonic regression from the combined summary statistics at the central machine - and manages to overcome the super-efficiency phenomenon.

This is joint work with Cecile Durot and Bodhisattva Sen. 
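
As a toy illustration of the basic pooling step described above (not the authors' smoothed variant), the following sketch splits the data into subsamples, fits an isotonic regression on each with scikit-learn, and averages the fitted values pointwise; the number of subsamples and the evaluation grid are arbitrary choices made for the example.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def pooled_isotonic_fit(x, y, n_subsamples, x_eval):
    """Divide-and-conquer estimate of a monotone regression function:
    fit isotonic regression on each subsample and average the fits
    pointwise at the evaluation points x_eval."""
    idx = np.random.permutation(len(x))
    fits = []
    for part in np.array_split(idx, n_subsamples):
        ir = IsotonicRegression(out_of_bounds="clip")
        ir.fit(x[part], y[part])
        fits.append(ir.predict(x_eval))
    return np.mean(fits, axis=0)  # pooled (averaged) estimator

# toy example: monotone signal plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 5000)
y = np.sqrt(x) + rng.normal(scale=0.2, size=x.size)
x_eval = np.linspace(0.05, 0.95, 50)
f_hat = pooled_isotonic_fit(x, y, n_subsamples=10, x_eval=x_eval)
```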

Xin Bing
A fast algorithm with minimax optimal guarantees for topic models with an unknown number of topics 
Topic models have become popular for the analysis of data consisting of a collection of n independent multinomial observations, with parameters N_i ∈ ℕ and Π_i ∈ [0, 1]^p for i = 1, ..., n. The model links all cell probabilities, collected in a p × n matrix Π, via the assumption that Π can be factorized as the product of two nonnegative matrices A ∈ [0, 1]^(p×K) and W ∈ [0, 1]^(K×n). Topic models were originally developed in text mining, where one browses through n documents, based on a dictionary of p words, and covering K topics. In this terminology, the matrix A is called the word-topic matrix, and is the main target of estimation. It can be viewed as a matrix of conditional probabilities, and it is uniquely defined under appropriate separability assumptions, discussed in detail in this work. Notably, the unique A is required to satisfy what is commonly known as the anchor word assumption, under which A has an unknown number of rows respectively proportional to the canonical basis vectors in R^K. The indices of such rows are referred to as anchor words. Recent computationally feasible algorithms, with theoretical guarantees, utilize this assumption constructively by linking the estimation of the set of anchor words with that of estimating the K vertices of a simplex. This crucial step in the estimation of A requires K to be known, and cannot be easily extended to the more realistic setting in which K is unknown.

This work takes a different view of anchor word estimation, and of the estimation of A. We propose a new method of estimation in topic models that is not a variation on the existing simplex-finding algorithms, and that estimates K from the observed data. We derive new finite sample minimax lower bounds for the estimation of A, as well as new upper bounds for our proposed estimator. We describe the scenarios in which our estimator is minimax adaptive. Our finite sample analysis is valid for any n, N_i, p and K, and both p and K are allowed to increase with n, a situation not handled well by previous analyses. We complement our theoretical results with a detailed simulation study. We illustrate that the new algorithm is faster and more accurate than the current ones, even though we start out with the computational and theoretical disadvantage of not knowing the correct number of topics K, while the competing methods are provided with the correct value in our simulations.

Yining Chen
Narrowest-Over-Threshold detection of multiple change-points and change-point-like features
We propose a new, generic and flexible methodology for nonparametric function estimation, in which we first estimate the number and locations of any features that may be present in the function, and then estimate the function parametrically between each pair of neighbouring detected features. Examples of features handled by our methodology include change-points in the piecewise-constant signal model, kinks in the piecewise-linear signal model, and other similar irregularities, which we also refer to as generalised change-points. Our methodology works with only minor modifications across a range of generalised change-point scenarios, and we achieve such a high degree of generality by proposing and using a new multiple generalised change-point detection device, termed Narrowest-Over-Threshold (NOT). The key ingredient of NOT is its focus on the smallest local sections of the data on which the existence of a feature is suspected. Crucially, this adaptive localisation technique prevents NOT from considering subsamples containing two or more features, a key factor that ensures the general applicability of NOT. For selected scenarios, we show the consistency and near-optimality of NOT in detecting the number and locations of generalised change-points. Furthermore, we discuss its computational cost and demonstrate that the total cost is close-to-linear.
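
To make the localisation idea concrete, here is a much-simplified sketch of a NOT-style recursion for the piecewise-constant mean model: draw random sub-intervals, keep those whose CUSUM statistic exceeds a threshold, accept the change-point found on the narrowest such interval, and recurse on either side. The threshold, the number of intervals and the CUSUM form are illustrative defaults (e.g. a threshold of order sigma * sqrt(2 log n)), not the tuning analysed in the talk.

```python
import numpy as np

def cusum(x):
    """Max absolute CUSUM statistic on x and the corresponding split point."""
    n = len(x)
    s = np.cumsum(x)
    k = np.arange(1, n)
    stat = np.abs(np.sqrt((n - k) / (n * k)) * s[k - 1]
                  - np.sqrt(k / (n * (n - k))) * (s[-1] - s[k - 1]))
    j = np.argmax(stat)
    return stat[j], j + 1  # split after the (j+1)-th observation of the segment

def not_changepoints(x, threshold, n_intervals=100, rng=None):
    """Simplified Narrowest-Over-Threshold: draw random sub-intervals, keep those
    whose CUSUM exceeds the threshold, accept the change-point found on the
    narrowest one, and recurse to its left and right."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    if n < 2:
        return []
    lo = rng.integers(0, n - 1, size=n_intervals)
    hi = rng.integers(0, n, size=n_intervals)
    cands = []
    for a, b in zip(np.minimum(lo, hi), np.maximum(lo, hi) + 1):
        if b - a >= 2:
            stat, j = cusum(x[a:b])
            if stat > threshold:
                cands.append((b - a, a + j))  # (interval width, change location)
    if not cands:
        return []
    _, cp = min(cands)                         # narrowest interval over the threshold
    left = not_changepoints(x[:cp], threshold, n_intervals, rng)
    right = not_changepoints(x[cp:], threshold, n_intervals, rng)
    return sorted(left + [cp] + [cp + c for c in right])
```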

Guang Cheng
Large-scale nearest neighbor classification with statistical guarantee
In this talk, we develop distributed classification methods based on the nearest neighbor principle. Through majority voting, the distributed classifier can achieve the oracle rate of regret and instability if the neighbors are carefully chosen. The only loss is a multiplicative constant that depends only on the data dimensionality. This loss can be remedied by replacing majority voting with continuous aggregation.
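
A minimal sketch of the majority-voting construction, assuming the training data are split uniformly across machines and the labels are coded 0..C-1; the careful choice of the number of neighbours per machine, on which the statistical guarantees depend, is left to the user.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def distributed_knn_predict(X, y, X_test, n_machines, k):
    """Big-NN by majority voting: split the training data across machines,
    run k-nearest-neighbour classification on each subsample, and take a
    majority vote over the machines' predictions.
    Assumes integer class labels coded 0..C-1."""
    idx = np.random.permutation(len(X))
    votes = []
    for part in np.array_split(idx, n_machines):
        clf = KNeighborsClassifier(n_neighbors=k).fit(X[part], y[part])
        votes.append(clf.predict(X_test))
    votes = np.stack(votes)                  # shape: (machines, test points)
    # majority vote over machines for each test point
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```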

Haeran Cho
Multiscale MOSUM procedure with localised pruning
In this work, we investigate the problem of multiple change-point detection where the changes occur in the mean of univariate data. The proposed localised pruning methodology is applicable when conflicting change-point estimates are returned together with information about the local interval in which each is detected. We study the theoretical consistency of the localised pruning in combination with the multiscale extension of the Moving SUM (MOSUM) procedure of Eichinger and Kirch (2018). Extensive simulation studies show the computational efficiency and good finite sample performance of the combined methodology, which is available as an R package 'mosum'. This is joint work with Claudia Kirch (OvGU Magdeburg).
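
For orientation, the following numpy sketch computes a single-bandwidth MOSUM statistic for changes in the mean; the multiscale extension and the localised pruning of conflicting estimates discussed in the talk (implemented in the R package 'mosum') are not reproduced here.

```python
import numpy as np

def mosum_statistic(x, G):
    """Moving-sum (MOSUM) statistic for changes in the mean: at each time t,
    compare the sum of the G observations after t with the sum of the G
    observations before t, scaled by a global noise-level estimate."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    sigma = np.std(np.diff(x)) / np.sqrt(2)       # crude noise-level estimate
    s = np.concatenate(([0.0], np.cumsum(x)))     # s[i] = sum of first i observations
    t = np.arange(G, n - G + 1)
    right = s[t + G] - s[t]                       # sum over (t, t+G]
    left = s[t] - s[t - G]                        # sum over (t-G, t]
    stat = np.abs(right - left) / (sigma * np.sqrt(2 * G))
    return t, stat  # local maxima of `stat` above a threshold are candidate change-points
```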

David Dunson
Scaling up Bayesian inference
This talk discusses some recent work on scaling up Bayesian inference to large sample size (n) and/or high-dimensional (large p) problems. We focus in particular on algorithms having theoretical guarantees on performance, including divide-and-conquer Markov chain Monte Carlo (MCMC) and approximate MCMC. We are also heavily motivated by real world applications in the sciences.

Matteo Fasiolo
Calibrated additive quantile regression
Generalized Additive Models (GAMs) are an extension of Generalized Linear Models (GLMs), which allow the inclusion of random effects, complex non-linear effects (built using spline bases) and, more recently, response distributions outside the exponential family. In this talk I will describe a computationally efficient Bayesian framework for fitting quantile GAMs, which are based on the pinball loss, rather than on a probabilistic response distribution. The new hierarchical framework selects both the smoothing parameter and the so-called "learning-rate" automatically and efficiently, and provides posterior credible intervals that are approximately calibrated in a frequentist sense. I will illustrate the new methods in the context of electricity demand forecasting, where they provide a predictive performance that is competitive with that of Gradient Boosting (GB), but at a fraction of GB's computational cost.
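
The quantile GAMs above are built on the pinball (check) loss rather than a likelihood. Below is a minimal sketch of that loss and of fitting a plain linear quantile model by minimising it numerically (no smoothing, no learning-rate calibration); Nelder-Mead is used only because it copes with the non-smooth objective in this toy setting.

```python
import numpy as np
from scipy.optimize import minimize

def pinball_loss(y, yhat, tau):
    """Pinball (check) loss for quantile level tau."""
    r = y - yhat
    return np.mean(np.maximum(tau * r, (tau - 1) * r))

def fit_linear_quantile(X, y, tau):
    """Fit a linear tau-th quantile model by minimising the pinball loss."""
    X1 = np.column_stack([np.ones(len(X)), X])    # add intercept
    beta0 = np.zeros(X1.shape[1])
    res = minimize(lambda b: pinball_loss(y, X1 @ b, tau), beta0, method="Nelder-Mead")
    return res.x
```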

Xiaoming Huo
Non-convex optimization and statistical properties
Non-convex optimization has been introduced into statistics with a range of applications. One application is model selection under the sparse regression framework, with celebrated methods such as the smoothly clipped absolute deviation (SCAD), the minimax concave penalty (MCP), and many more. The newly emerged deep-learning-related techniques often involve non-convex objective functions as well. A non-convex optimization problem is generally NP-hard; therefore there is no guaranteed polynomial-time numerical solution, and one can only hope to identify a local optimum. A difference-of-convex (DC) function can be expressed as a difference of two convex functions, even though the original function itself may be non-convex. There is a large existing literature on optimization problems whose objectives and/or constraints involve DC functions; the associated methods are commonly referred to as difference-of-convex algorithms (DCA), and efficient numerical solutions have been proposed. Under the DC framework, directional-stationary (d-stationary) solutions are considered, and they are in general not unique. We show that under some mild conditions, a certain subset of d-stationary solutions in an optimization problem (with a DC objective) has some ideal statistical properties: namely, asymptotic estimation consistency, asymptotic model selection consistency, and asymptotic efficiency. These are the properties that have been proven by many researchers for a range of proposed non-convex penalties in sparse regression. Our analysis indicates that even with non-convex optimization, statistical theoretical guarantees can still be established in fairly general settings. Our work potentially bridges the communities of optimization and statistics. This is joint work with Shanshan Cao.
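
A generic DCA iteration is easy to state: linearise the concave part at the current iterate and solve the resulting convex subproblem. The sketch below does this for a toy DC objective; the SCAD and MCP penalties mentioned above admit analogous decompositions, but they are not implemented here.

```python
import numpy as np
from scipy.optimize import minimize

def dca(g, grad_h, x0, n_iter=50):
    """Difference-of-convex algorithm for minimising f(x) = g(x) - h(x):
    at each step linearise the concave part -h at the current iterate and
    solve the convex subproblem  min_z  g(z) - <grad_h(x_k), z>."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        gk = grad_h(x)
        x = minimize(lambda z: g(z) - gk @ z, x, method="Nelder-Mead").x
    return x

# toy DC example: f(x) = ||x||_1 + ||x||^4 - 2||x||^2 (nonconvex),
# decomposed as g(x) = ||x||_1 + ||x||^4 (convex) and h(x) = 2||x||^2 (convex).
g = lambda x: np.abs(x).sum() + np.sum(x**2) ** 2
grad_h = lambda x: 4.0 * x
x_star = dca(g, grad_h, x0=np.array([3.0, -2.0]))
```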

Binyan Jiang
Penalized interaction estimation for ultrahigh dimensional quadratic regression
Quadratic regression goes beyond the linear model by simultaneously including main effects and interactions between the covariates. The problem of interaction estimation in high dimensional quadratic regression has received extensive attention in the past decade. In this article we introduce a novel method which allows us to estimate the main effects and interactions separately. Unlike existing methods for ultrahigh dimensional quadratic regression, our proposal does not require the widely used heredity assumption. In addition, our proposed estimates have explicit formulas and obey the invariance principle at the population level. We estimate the matrix of interactions under a penalized convex loss function. The resulting estimates are shown to be consistent even when the covariate dimension is of an exponential order of the sample size. We develop an efficient ADMM algorithm to implement the penalized estimation. This ADMM algorithm fully exploits the cheap computational cost of matrix multiplication and hence is much more efficient than existing penalized methods under heredity constraints. We demonstrate the promising performance of our proposal through extensive numerical studies.

Ata Kaban
Structure aware generalisation error bounds using random projections
We present new risk bounds for binary classification in high-dimensional settings when the dimensionality can be larger than the sample size. In particular, we prove upper bounds for 'compressive learning' by empirical risk minimization (ERM) -- that is, when the ERM classifier is learned from data that have been projected from the high-dimensional space onto a randomly selected low-dimensional subspace -- as well as uniform upper bounds in the full high-dimensional space. For the latter, random projection serves as an analytic tool to discover and exploit benign geometry inherent in the data by means of the distortion of the loss function of a learning problem. Our bounds are data dependent, highlight benign geometric structures that make a classification problem easier or harder, and remain informative in small sample conditions. We then use our approach to draw connections between existing classification approaches, and finally we empirically demonstrate that our basic bound is informative enough in practice to serve as the objective function for learning a classifier.
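
A minimal sketch of the 'compressive learning' setting that the first bound refers to, with a Gaussian random projection and logistic regression standing in for a generic ERM classifier; the projection dimension k is a user choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def compressive_erm(X_train, y_train, X_test, k, rng=None):
    """Compressive learning sketch: project the d-dimensional data onto a random
    k-dimensional subspace and run empirical risk minimisation (here a linear
    classifier) in the compressed space."""
    rng = np.random.default_rng() if rng is None else rng
    d = X_train.shape[1]
    R = rng.normal(size=(d, k)) / np.sqrt(k)       # Gaussian random projection
    clf = LogisticRegression(max_iter=1000).fit(X_train @ R, y_train)
    return clf.predict(X_test @ R)
```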

Jeremias Knoblauch
Bayesian on-line changepoint detection and model selection in high-dimensional data
Bayesian on-line changepoint detection is extended to on-line model selection and non-stationary spatio-temporal processes. We propose spatially structured Vector Autoregressions (VARs) for modelling the process between changepoints (CPs) and give an upper bound on the approximation error of such models. The resulting algorithm performs prediction, model selection and CP detection on-line. Its time complexity is linear and its space complexity constant, making it two orders of magnitude faster than its closest competitor. In addition, it outperforms the state of the art for multivariate data.
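
For context, the classical univariate run-length recursion that this work extends (Adams-MacKay-style Bayesian on-line changepoint detection, here with a Gaussian observation model and known noise level) can be sketched as follows; the spatially structured VAR models and on-line model selection of the talk are not shown.

```python
import numpy as np
from scipy.stats import norm

def bocpd_gaussian(x, hazard=1/200, mu0=0.0, tau0=10.0, sigma=1.0):
    """Classical Bayesian on-line change-point detection for a univariate
    Gaussian stream with known noise level `sigma` and a Normal(mu0, tau0^2)
    prior on each segment mean. Returns the run-length posterior over time."""
    T = len(x)
    R = np.zeros((T + 1, T + 1))
    R[0, 0] = 1.0
    n = np.array([0.0]); s = np.array([0.0])   # per-run-length counts and sums
    for t, xt in enumerate(x):
        # posterior predictive mean/variance for each possible run length
        prec = 1.0 / tau0**2 + n / sigma**2
        pred_mean = (mu0 / tau0**2 + s / sigma**2) / prec
        pred_var = 1.0 / prec + sigma**2
        pp = norm.pdf(xt, loc=pred_mean, scale=np.sqrt(pred_var))
        growth = R[t, :t + 1] * pp * (1 - hazard)    # the current run continues
        cp = np.sum(R[t, :t + 1] * pp * hazard)      # a change-point occurs now
        R[t + 1, 1:t + 2] = growth
        R[t + 1, 0] = cp
        R[t + 1] /= R[t + 1].sum()
        # update sufficient statistics for every run length
        n = np.concatenate(([0.0], n + 1))
        s = np.concatenate(([0.0], s + xt))
    return R
```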

Mladen Kolar
Recovery of simultaneous low rank and two-way sparse coefficient matrices, a nonconvex approach
We study the problem of recovery of matrices that are simultaneously low rank and row and/or column sparse. Such matrices appear in recent applications in cognitive neuroscience, imaging, computer vision, macroeconomics, and genetics. We propose a GDT (Gradient Descent with hard Thresholding) algorithm to efficiently recover matrices with such structure, by minimizing a bi-convex function over a nonconvex set of constraints. We show linear convergence of the iterates obtained by GDT to a region within statistical error of an optimal solution. As an application of our method, we consider multi-task learning problems and show that the statistical error rate obtained by GDT is near-optimal compared to the minimax rate. Experiments demonstrate competitive performance and much faster running speed compared to existing methods, on both simulations and real data sets.
Joint work with Ming Yu, Zhaoran Wang, and Varun Gupta.
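
The shape of a GDT-style iteration can be sketched for multi-task regression with a simultaneously low-rank and row/column-sparse coefficient matrix: alternate gradient steps on the two factors with hard thresholding of their rows. The step size, initialisation and sparsity levels below are placeholders, not the choices required by the theory.

```python
import numpy as np

def hard_threshold_rows(M, s):
    """Keep the s rows of M with largest Euclidean norm; zero out the rest."""
    norms = np.linalg.norm(M, axis=1)
    keep = np.argsort(norms)[-s:]
    out = np.zeros_like(M)
    out[keep] = M[keep]
    return out

def gdt_sketch(X, Y, rank, s_row, s_col, step=1e-3, n_iter=500):
    """Sketch of a gradient-descent-with-hard-thresholding iteration for
    Y ~ X @ U @ V.T, with the coefficient matrix low-rank (rank `rank`),
    row-sparse through U (s_row nonzero rows) and column-sparse through V
    (s_col nonzero rows of V)."""
    p, q = X.shape[1], Y.shape[1]
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.01, size=(p, rank))
    V = rng.normal(scale=0.01, size=(q, rank))
    for _ in range(n_iter):
        R = X @ U @ V.T - Y                    # residual
        gU = X.T @ R @ V                       # gradient of 0.5*||R||_F^2 w.r.t. U
        gV = R.T @ X @ U                       # gradient w.r.t. V
        U = hard_threshold_rows(U - step * gU, s_row)
        V = hard_threshold_rows(V - step * gV, s_col)
    return U @ V.T
```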

Jason Lee
Geometry of optimization landscapes and implicit regularization of optimization algorithms
We first study the problem of learning a two-layer ReLU network with Gaussian input and a positive output layer, and the symmetric matrix completion problem. Despite the non-convexity of both problems, we prove that every local minimizer is a global minimizer. Since gradient descent converges to local minimizers, this shows that simple gradient-based methods can find the global optimum of these non-convex problems. In the second part, we analyze the implicit regularization effects of various optimization algorithms. In particular, we prove that for least squares with mirror descent, the algorithm converges to the closest solution in terms of the Bregman divergence. For linearly separable classification problems, we prove that steepest descent with respect to a norm solves the SVM with respect to the same norm. For over-parametrized non-convex problems such as matrix sensing or neural networks with quadratic activation, we prove that gradient descent converges to the minimum nuclear norm solution, which allows for both meaningful optimization and generalization guarantees.
This is joint work with Rong Ge, Suriya Gunasekar, Tengyu Ma, Mor Shpigel, Daniel Soudry, and Nati Srebro.

Yoonkyung Lee
Dimensionality reduction for exponential family data
Principal component analysis (PCA) is useful for a wide range of data analysis tasks. However, its implicit link to the Gaussian distribution can be undesirable for discrete data such as binary and multi-category responses or counts. We generalize PCA to handle various types of data using the generalized linear model framework. In contrast to the existing approach of matrix factorizations for exponential family data, our generalized PCA provides low-rank estimates of the natural parameters by projecting the saturated model parameters. Due to this difference, the number of parameters does not grow with the number of observations and the principal component scores on new data can be computed with simple matrix multiplication. We provide a computationally efficient algorithm for finding the principal component loadings and demonstrate the benefits of the proposed approach numerically. Further we discuss its extension to the supervised setting where principal components from predictors are associated with a response.
This is joint work with Andrew Landgraf.

Didong Li
Efficient manifold and subspace approximations with spherelets
Data lying in a high-dimensional ambient space are commonly thought to have a much lower intrinsic dimension. In particular, the data may be concentrated near a lower-dimensional subspace or manifold. There is an immense literature focused on approximating the unknown subspace, and on exploiting such approximations in clustering, data compression, and the building of predictive models. Most of the literature relies on approximating subspaces using a locally linear, and potentially multiscale, dictionary. In this presentation, we propose a simple and general alternative, which instead uses pieces of spheres, or spherelets, to locally approximate the unknown subspace. Theory is developed showing that spherelets can produce dramatically lower covering numbers and MSEs for many manifolds. We develop spherical principal components analysis (SPCA). Results relative to state-of-the-art competitors show dramatic gains in the ability to accurately approximate the subspace with orders of magnitude fewer components. This leads to substantial gains in data compressibility, fewer clusters and hence better interpretability, and much lower MSE based on small to moderate sample sizes. Furthermore, a Bayesian nonparametric model based on spherelets is built to estimate the manifold and the density simultaneously.
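
The basic building block, fitting a sphere to a local point cloud, reduces to a linear least-squares problem. Below is a sketch of that step alone, via a standard algebraic formulation; the PCA step that SPCA uses to first restrict to a low-dimensional affine subspace is omitted.

```python
import numpy as np

def fit_sphere(X):
    """Algebraic least-squares fit of a sphere to points X (n x d):
    rewrite ||x - c||^2 = r^2 as 2 c.x + b = ||x||^2 with b = r^2 - ||c||^2,
    and solve the resulting linear system for (c, b)."""
    A = np.column_stack([2 * X, np.ones(len(X))])
    y = np.sum(X**2, axis=1)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    c, b = coef[:-1], coef[-1]
    r = np.sqrt(b + c @ c)
    return c, r

def project_to_sphere(X, c, r):
    """Project points onto the fitted sphere (the local 'spherelet' approximation)."""
    D = X - c
    return c + r * D / np.linalg.norm(D, axis=1, keepdims=True)
```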

Faming Liang
Markov neighbourhood regression for high-dimensional inference
We propose an innovative method for constructing p-values and confidence intervals in high-dimensional regression. Unlike existing methods, such as the desparsified Lasso, ridge projection and multi sample-splitting, which strive to work on the original high-dimensional problem directly, the proposed method reduces the original high-dimensional inference problem to a series of low-dimensional inference problems by making use of conditional independence relations between the predictors. The proposed method has been tested on high-dimensional linear, logistic and Cox regression. The numerical results indicate that the proposed method significantly outperforms the existing ones. The idea of using conditional independence relations for dimension reduction is general and can potentially be extended to other high-dimensional or big data problems as well.
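
To illustrate the conditioning idea, the sketch below forms a neighbourhood for one predictor with a nodewise lasso (a simple stand-in for the estimated Markov neighbourhood) and then reads off inference for that predictor from a low-dimensional OLS fit; this is not the paper's exact construction, and the post-selection p-values are purely illustrative.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV

def neighbourhood_inference(X, y, j):
    """Illustration of neighbourhood conditioning: select predictors related to
    X_j (here via a nodewise lasso of X_j on the other columns), then run a
    low-dimensional OLS of y on X_j plus that set and return the coefficient,
    p-value and confidence interval for X_j."""
    n, p = X.shape
    others = [k for k in range(p) if k != j]
    lasso = LassoCV(cv=5).fit(X[:, others], X[:, j])
    nbhd = [others[k] for k in np.flatnonzero(lasso.coef_)]
    cols = [j] + nbhd
    ols = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
    return ols.params[1], ols.pvalues[1], ols.conf_int()[1]   # entry 1 is X_j
```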

Jinchi Lv
Asymptotics of eigenvectors and eigenvalues for large structured random matrices
Characterizing the exact asymptotic distributions of high-dimensional eigenvectors for large structured random matrices poses important challenges yet can provide useful insights into a range of applications. This paper establishes the asymptotic properties of the spiked eigenvectors and eigenvalues for the generalized Wigner random matrix, where the mean matrix is assumed to have a low-rank structure. Under some mild regularity conditions, we provide the asymptotic expansions for the spiked eigenvalues and show that they are asymptotically normal after some normalization. For the spiked eigenvectors, we provide novel asymptotic expansions for the general linear combination and further show that the linear combination is asymptotically normal after some normalization, where the weight vector can be an arbitrary unit vector. Simulation studies verify the validity of our new theoretical results. Our family of models encompasses many popularly used ones such as the stochastic block models with or without overlapping communities for network analysis and the topic models for text analysis, and our general theory can be exploited for statistical inference in these large-scale applications. This is a joint work with Jianqing Fan, Yingying Fan and Xiao Han.

Ping Ma
Asympirical method: a new paradigm for statistical analysis of large samples
Traditional statistical theory and methods were developed for samples of small and moderate size. In particular, statistical model fitting and inference are conducted on small and moderate size samples to obtain empirical results, and asymptotic theory is established to extrapolate the performance of these empirical results to large samples. However, this traditional, coherent paradigm of statistical analysis falls apart for large samples: the key challenge is that many traditional statistical methods are computationally too expensive to yield meaningful empirical results. A new statistical paradigm is urgently needed for the statistical analysis of large samples.

Cheng Qian
Covariance and graphical modelling for high-dimensional longitudinal and functional data
The problem of estimating functional covariance and graphical models from a data set consisting of multivariate sparse longitudinal data is considered. The underlying trajectories are represented through functional principal component expansions, where the covariance matrix of the principal component scores characterizes the global covariance feature and the principal component functions provide the functional representation of the covariance relationships. Our proposed estimation procedure first implements a nonparametric method to perform functional principal component analysis for sparse longitudinal data, and then computes functional regularized covariance or precision matrices. We derive the relevant concentration inequalities for high dimensional sparsely sampled functional data and use them to establish uniform consistency results for our proposed estimators. The finite sample performance of our proposed methods is illustrated through an extensive set of simulation studies and two real data examples.

Qifan Song
Bayesian shrinkage towards sharp minimaxity
Shrinkage priors have become increasingly popular in Bayesian modelling of high dimensional sparse problems due to their computational efficiency. Recent work shows that a polynomially decaying prior leads to satisfactory posterior asymptotics under regression models. In the literature, statisticians have investigated how the global shrinkage parameter, i.e., the scale parameter, in a heavy-tailed prior affects the posterior contraction. In this work, we explore how the shape of the prior, or more specifically the polynomial order of the prior tail, affects the posterior. We discover that, under sparse normal means models, the polynomial order does affect the multiplicative constant of the posterior contraction rate. More importantly, if the polynomial order is sufficiently close to 1, it induces the optimal Bayesian posterior convergence, in the sense that the Bayesian contraction rate is sharply minimax, i.e., not only the order but also the multiplicative constant of the posterior contraction rate is optimal. The above Bayesian sharp minimaxity holds when the global shrinkage parameter follows a deterministic choice that depends on the unknown sparsity level s. A Beta modelling is therefore further proposed, such that our sharply minimax Bayesian procedure is adaptive to the unknown s. Our theoretical discoveries are justified by simulation studies.

Stanislav Volgushev
Distributed inference for quantile regression processes
The increased availability of massive data sets provides a unique opportunity to discover subtle patterns in their distributions, but also imposes overwhelming computational challenges. To fully utilize the information contained in big data, we propose a two-step procedure:

  1. estimate conditional quantile functions at different levels in a parallel computing environment;
  2. construct a conditional quantile regression process through projection based on these estimated
    quantile curves.

Our framework covers both linear models with fixed or growing dimension and series approximation models. We prove that the proposed procedure does not sacrifice any statistical inferential accuracy provided that the number of distributed computing units and quantile levels are chosen properly. In particular, a sharp upper bound for the former and a sharp lower bound for the latter are derived to capture the minimal computational cost from a statistical perspective. Moreover, we propose computationally efficient approaches to conducting inference in the distributed estimation setting described above. These approaches directly utilize the availability of estimators from sub-samples and can be carried out at almost no additional computational cost. In particular, we propose a novel multiplier bootstrap procedure on aggregated samples, which should also be useful outside of the quantile regression setting.
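
A sketch of step 1, assuming for simplicity that the subsamples sit in memory rather than on separate machines: fit quantile regressions at each level on each subsample with statsmodels and average the coefficients across subsamples. The projection step that turns the averaged fits into a smooth quantile regression process is not shown.

```python
import numpy as np
import statsmodels.api as sm

def distributed_quantreg(X, y, taus, n_machines):
    """Step 1 of the two-step procedure: fit quantile regressions at several
    quantile levels on each subsample (in parallel in practice) and average
    the coefficient estimates across subsamples at each level."""
    X1 = sm.add_constant(X)
    parts = np.array_split(np.random.permutation(len(y)), n_machines)
    coefs = {tau: [] for tau in taus}
    for part in parts:
        for tau in taus:
            fit = sm.QuantReg(y[part], X1[part]).fit(q=tau)
            coefs[tau].append(fit.params)
    # averaged coefficient estimate at each quantile level
    return {tau: np.mean(coefs[tau], axis=0) for tau in taus}
```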

Eric Xing
On system and algorithm co-design and automatic machine learning

The rise of Big Data and AI computing has led to new demands for machine learning systems to learn complex models, with millions to billions of parameters, that promise adequate capacity to digest massive datasets and offer powerful and real-time predictive analytics thereupon. In this talk, I discuss a recent trend toward building new distributed frameworks for AI at massive scale known as “system and algorithm co-design”: system designs are tailored to the unique properties of ML algorithms, and algorithms are re-designed to better fit into the system architecture. I show how one can exploit the statistical and algorithmic characteristics unique to ML programs, but not typical of traditional computer programs, in designing the system architecture to achieve a significant, universal, and theoretically sound power-up of ML programs across the board. I also present a brief introduction to the Petuum system, built on such interdisciplinary innovations, which aims to dramatically improve the adoption of AI solutions by lowering the barrier of entry to AI technologies via automatic machine learning. I show how, through automatable, product-grade, hardware-agnostic, standardized building blocks that can be assembled and customized, AI users can liberate themselves from the demanding experience of algorithm programming and system tuning, and easily experiment with different AI methods, parameters, and speed/resource trade-offs, by themselves or automatically.

Chen Zhang
Variational Gaussian approximation for Poisson data
The Poisson model is frequently employed to describe count data, but in a Bayesian context it leads to an analytically intractable posterior probability distribution. In this work, we analyze a variational Gaussian approximation to the posterior distribution arising from the Poisson model with a Gaussian prior. This is achieved by seeking an optimal Gaussian distribution minimizing the Kullback–Leibler divergence from the posterior distribution to the approximation, or equivalently maximizing the lower bound for the model evidence. We derive an explicit expression for the lower bound, and show the existence and uniqueness of the optimal Gaussian approximation. The lower bound functional can be viewed as a variant of classical Tikhonov regularization that penalizes also the covariance. Then we develop an efficient alternating direction maximization algorithm for solving the optimization problem, and analyze its convergence. We discuss strategies for reducing the computational complexity via low rank structure of the forward operator and the sparsity of the covariance. Further, as an application of the lower bound, we discuss hierarchical Bayesian modeling for selecting the hyperparameter in the prior distribution, and propose a monotonically convergent algorithm for determining the hyperparameter. We present extensive numerical experiments to illustrate the Gaussian approximation and the algorithms.
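
In the simplest setting, direct Poisson observations (an identity forward operator) and a factorised Gaussian approximation, the lower bound has a closed form through E_q[exp(eta)] = exp(m + v/2) and can be maximised directly; the sketch below does this numerically rather than with the alternating-direction algorithm described in the talk.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def vga_poisson(y, prior_var=1.0):
    """Variational Gaussian approximation for the simple model
    y_i ~ Poisson(exp(eta_i)) with independent N(0, prior_var) priors on eta_i.
    A factorised Gaussian q(eta_i) = N(m_i, v_i) is found by maximising the
    evidence lower bound, using E_q[exp(eta_i)] = exp(m_i + v_i / 2)."""
    y = np.asarray(y, dtype=float)
    n = len(y)

    def neg_elbo(params):
        m, log_v = params[:n], params[n:]
        v = np.exp(log_v)
        exp_term = np.exp(m + v / 2)                          # E_q[exp(eta)]
        loglik = np.sum(y * m - exp_term - gammaln(y + 1))    # expected log-likelihood
        kl = 0.5 * np.sum((v + m**2) / prior_var - 1.0 - np.log(v / prior_var))
        return -(loglik - kl)

    x0 = np.concatenate([np.log(y + 0.5), np.full(n, -1.0)])
    res = minimize(neg_elbo, x0, method="L-BFGS-B")
    m, v = res.x[:n], np.exp(res.x[n:])
    return m, v   # variational means and variances for eta
```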

Chao Zheng
Revisiting Huber’s M-estimation: a tuning-free approach
We introduce a novel tuning-free scheme to choose the scale, or robustification, parameter in Huber’s method for mean estimation and linear regression in both low and high dimensional settings. For robustly estimating the mean of a univariate distribution, we first consider the adaptive Huber estimator with the robustification parameter calibrated via the censored equation approach. Our theoretical results provide finite sample guarantees for both the estimation and the calibration parts. To further reduce the computational complexity, we next develop an alternating M-estimation procedure, which estimates the mean and variance simultaneously in a sequential manner. This idea extends naturally to regression problems in both low and high dimensions. We provide simple and fast algorithms to implement this procedure under various scenarios and study its numerical performance through simulations.
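
A simplified version of the alternating idea for the univariate mean: alternate a Huber location estimate (for the current tau) with a data-driven update of tau of the form sigma_hat * sqrt(n / log n). The talk's calibration solves a censored equation; the update below is only an illustrative stand-in for it.

```python
import numpy as np

def huber_mean(x, tau, n_iter=50):
    """Huber M-estimate of location for a given robustification parameter tau,
    computed by iteratively reweighted averaging."""
    mu = np.median(x)
    for _ in range(n_iter):
        r = x - mu
        w = np.minimum(1.0, tau / np.maximum(np.abs(r), 1e-12))
        mu = np.sum(w * x) / np.sum(w)
    return mu

def adaptive_huber_mean(x, n_outer=20):
    """Illustrative alternating scheme (not the censored-equation calibration):
    alternate (i) a Huber location estimate for the current tau and
    (ii) an update tau = sigma_hat * sqrt(n / log n), with sigma_hat a
    truncated-second-moment scale estimate."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    tau = np.std(x) * np.sqrt(n / np.log(n))
    for _ in range(n_outer):
        mu = huber_mean(x, tau)
        r = x - mu
        sigma = np.sqrt(np.mean(np.minimum(r**2, tau**2)))   # truncated second moment
        tau = sigma * np.sqrt(n / np.log(n))
    return mu, tau
```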

Hua Zhou
Global solutions of generalized canonical correlation analysis problems
Generalizations of canonical correlation analysis (CCA) can be written as the problem of maximizing the sum of traces of quadratic forms of orthogonal matrices, which we term the orthogonal trace-sum maximization (OTSM) problem. OTSM also generalizes many interesting problems other than CCA, from Procrustes analysis to the cryo-electron microscopy of Nobel Prize fame. Due to the orthogonality constraints, OTSM is a nonconvex maximization problem on a product of Stiefel manifolds. We first introduce an efficient block ascent algorithm that locally solves the problem with provable convergence. We then provide a simple method to certify global optimality of the local solution. This method only requires testing the sign of the smallest eigenvalue of a symmetric matrix constructed from the local solution and the data, and does not rely on a particular solution method. Our result relies on a semidefinite programming (SDP) relaxation of OTSM, but does not require solving the SDP in the lifted dimension. This is joint work with Joong-Ho Won and Kenneth Lange.

Wenxuan Zhong
Leverage sampling to overcome the computational challenges for big spatial data
Within the realm of big data, spatial data are one of the fastest growing data types and pose significant challenges for statistical analysis. With advances in remote sensors, sensor networks, and the proliferation of location-sensing devices in daily life and common business practices, the generation of disparate, dynamic, and geographically distributed spatial data has exploded in recent years. In addition, large-scale spatiotemporal data generated by social media outlets are proving to be highly useful in disease surveillance, disaster mapping, and national security applications.

In general, spatial models are limited in their efficacy for large data sets due to their expensive computational cost. For example, in spatial modelling the computation is typically of the order O(n³), with n being the number of spatial locations. Methods for reducing the computational burden of spatial models are becoming more common in the spatial statistics literature.

Here we introduce a data-dependent random sampling scheme called the spatial leveraging sampling method. Leverage-based sampling has been successful in improving the computational efficiency of analyses of independent data. Taking the spatial dependency of the data into account, we develop a spatial leveraging score under the Gaussian spatial linear model. An empirical evaluation of the leverage-based methods is carried out on both synthetic and real datasets. The empirical results indicate that our method has good practical performance. For example, with the same reduction in computation as the uniform random sampling approach, our proposed spatial leveraging sampling method typically leads to improved mean squared error. To establish the potential of our method, we model the daily spatiotemporal Twitter trend in the United States. Our results show that the proposed spatial leveraging sampling approach can provide computationally efficient and informative estimates of spatiotemporal social media trends.
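
For reference, the independent-data leveraging scheme that the spatial method builds on can be sketched as follows: sample rows with probability proportional to their leverage scores and solve a reweighted least-squares problem. The spatial leveraging scores under the Gaussian spatial linear model are not derived here.

```python
import numpy as np

def leverage_subsample_ols(X, y, m, rng=None):
    """Leverage-score subsampling for least squares: sample m rows with
    probabilities proportional to their statistical leverages h_ii, then fit
    a weighted least-squares estimate on the subsample (inverse-probability
    weights keep the estimating equation approximately unbiased)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    Q, _ = np.linalg.qr(X)                     # thin QR; leverages are row norms of Q
    lev = np.sum(Q**2, axis=1)
    prob = lev / lev.sum()
    idx = rng.choice(n, size=m, replace=True, p=prob)
    w = 1.0 / (m * prob[idx])                  # inverse-probability weights
    Xw = X[idx] * np.sqrt(w)[:, None]
    yw = y[idx] * np.sqrt(w)
    beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    return beta
```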