Computational strategies for large-scale statistical data analysis
ICMS, 15 South College Street, Edinburgh, EH8 9AA
2 – 6 July 2018
Organisers
 Guang Cheng, Purdue University
 Chenlei Leng, University of Warwick
Large-scale data are increasingly encountered in biology, medicine, engineering, the social sciences and economics with advances in measurement technology. A distinctive feature of such data is that they usually come with a large sample size and/or a large number of features, creating challenges even for data storage and processing, let alone data analysis. Classical statistical methodology, theory and computation, on the other hand, were developed under the assumption that the entire data set resides in a central location. As a result, most classical statistical methods face computational challenges when analysing large-scale data in the big data era. Specifically, big data are known to possess the so-called 4D features: Distributed, Dirty, Dimensionality and Dynamic. These features, often mixed together in practice, make it very challenging to apply traditional statistical thinking to massive data.
Bringing together active researchers working at the interface of statistics, machine learning and optimization, the aims of this workshop are:
 to exchange the methodological, algorithmic and theoretical developments recently made in distributed data analysis and aggregated inference, with consideration of the computational complexity and statistical properties of the relevant estimators;
 to discuss open challenges, exchange research ideas and forge collaborations across three research areas: statistics, machine learning and optimization;
 to promote the development of software with justified statistical properties and efficient computation;
 to engage more young UK researchers in working at the interface of computing and statistics.
Photographs of the workshop delegates are now available on our Flickr account.
Arrangements
You are expected to arrive at ICMS on Monday 2 July between 08:30 and 09:20 for registration. The workshop will begin with a welcome talk at 09:20 followed by the first talk at 09:30. The workshop will close at lunchtime on Friday 6 July 2018.
PARTICIPATION
The registration period for this workshop is now closed.
REGISTRATION FEE
There will be a registration fee of 100.00 GBP for this workshop. This will be collected on arrival at ICMS at registration. The registration fee is payable by all participants with the exception of organisers and locals.
VENUE
The workshop will be held at ICMS, 15 South College Street, Edinburgh, EH8 9AA.
TALKS AND AUDIO/VISUAL EQUIPMENT
All talks will be held in the Newhaven Lecture Theatre. The lecture theatre has a built-in computer, data projector, and visualiser/document camera. In addition, there are two blackboards. The projector and one board may be used simultaneously. We advise speakers, where possible, to bring your talk on a USB memory stick to put onto our computer in advance of your session, either at the start of the day or during coffee/lunch breaks. ICMS staff can help with this. It is possible to use your own laptop, but it is then your responsibility to ensure that the resolution is changed on the laptop if necessary (ICMS will not change the resolution on our equipment). If you use a Mac, we expect you to bring your own adaptor.
UK VISAS
If you are travelling from outside of the UK you may require an entry visa. A European visa does not guarantee entry to the UK. Please use this link to the UK Visas site to find out if you need a visa and if so how to apply for one.
TRAVEL
Please note that it is your responsibility to have adequate travel insurance to cover medical and other emergencies that may occur on your trip.
A taxi directly from the airport to the city centre will cost approximately 20.00 to 25.00 GBP for a one-way journey. There is also a direct bus service from the airport to the city centre, the Airlink 100, which costs 4.50 GBP single or 7.50 GBP return. This is a frequent service (every 10 minutes during peak times) and will bring you close to Waverley Railway Station, only a short walk from the workshop venue.
Lothian buses charge £1.70 for a single, £4.00 for a day ticket. Please note that the exact fare is required and no change is given.
If travelling by train, please note that Edinburgh has several railway stations, with Waverley Railway Station being the main station and the closest to the workshop venue at 15 South College Street. If you alight at Edinburgh Waverley, the workshop venue is an easy 10-minute walk over North and South Bridge.
CATERING
Refreshments will be served during the coffee breaks in the programme. Lunch will be provided on Monday, Tuesday and Thursday. As Wednesday is a free afternoon, lunch is not provided and participants are free to visit the many cafés and restaurants nearby. The workshop will close at lunchtime on Friday to enable travel that day, so again no lunch is provided. There will be an informal wine reception on Monday evening, and a workshop dinner will take place at Blonde Restaurant, 75 St Leonard's Street, Edinburgh, at 19:00 on Thursday 5 July 2018. There is no charge to workshop participants for attendance at the dinner.
PUBLIC LECTURE
On Monday 2 July, there will be a Public Lecture, 18:00–19:00, given by Eric Xing, Carnegie Mellon University, entitled A builder’s take on AI: magical, useful, or usable? Doors will open to the public at 17:30. This will be followed by an informal wine reception from 19:00–19:45. Places have been reserved for all delegates.
Provisional programme
May be subject to change
Monday 2 July
08:30–09:20  Registration in the Chapterhouse, Level 1
09:20–09:30  Welcome
09:30–10:15  David Dunson (Duke University)
10:15–10:45  Tea/Coffee in the Chapterhouse, Level 1
10:45–11:30  Ata Kaban (University of Birmingham)
11:30–12:15  Eric Xing (Carnegie Mellon University)
12:15–14:00  Lunch provided in the Chapterhouse, Level 1
14:00–14:45  Yoonkyung Lee (The Ohio State University)
14:45–15:30  Jinchi Lv (University of Southern California)
15:30–15:35  Group photograph to be taken
15:35–16:00  Tea/Coffee in the Chapterhouse, Level 1
16:00–16:45  Faming Liang (Purdue University)
16:45–17:30  Ping Ma (University of Georgia)
17:40  Doors open to the public
18:00–19:00  Public lecture by Eric Xing (Carnegie Mellon University)
19:00–19:45  Informal wine reception in the Chapterhouse, Level 1
Tuesday 3 July
09:00–09:45  Mladen Kolar (University of Chicago)
09:45–10:30  Guang Cheng (Purdue University)
10:30–11:00  Tea/Coffee in the Chapterhouse, Level 1
11:00–11:45  Yining Chen (LSE)
11:45–12:30  Haeran Cho (University of Bristol)
12:30–14:00  Lunch provided in the Chapterhouse, Level 1
14:00–14:45  Jason Lee (University of Southern California)
14:45–15:30  Chen Zhang (University College London)
15:30–16:00  Tea/Coffee in the Chapterhouse, Level 1
16:00–16:45  Jeremias Knoblauch (University of Warwick)
16:45–17:30  Stanislav Volgushev (University of Toronto)
Wednesday 4 July
09:00–09:45  Hua Zhou (University of California, Los Angeles)
09:45–10:30  Matteo Fasiolo (University of Bristol)
10:30–11:15  Tea/Coffee in the Chapterhouse, Level 1
11:15–12:00  Wenxuan Zhong (University of Georgia)
12:00  Free afternoon
Thursday 5 July
09:00–09:45  Moulinath Banerjee (University of Michigan)
09:45–10:30  Qifan Song (Purdue University)
10:30–11:00  Tea/Coffee in the Chapterhouse, Level 1
11:00–11:45  Binyan Jiang (The Hong Kong Polytechnic University)
11:45–12:30  Chao Zheng (Lancaster University)
12:30–14:00  Lunch provided in the Chapterhouse, Level 1
14:00–14:45  Xin Bing (Cornell University)
14:45–15:30  Cheng Qian (London School of Economics)
15:30–16:00  Tea/Coffee in the Chapterhouse, Level 1
19:00  Workshop dinner at Blonde Restaurant, 75 St Leonard's Street, Edinburgh
Friday 6 July
09:30–10:15  Didong Li (Duke University)
10:15–11:00  Xiaoming Huo (Georgia Institute of Technology)
11:00–11:15  Discussion
11:15–12:00  Tea/Coffee in the Chapterhouse, Level 1
12:00  Close of workshop
Abstracts
Moulinath Banerjee
Divide and conquer in nonstandard problems: the super-efficiency phenomenon
We study how the divide and conquer principle, namely to partition the available data into subsamples, compute an estimate from each subsample and combine these appropriately to form the final estimator, works in nonstandard problems where rates of convergence are typically slower than the square root of n and limit distributions are non-Gaussian, with special emphasis on the least squares estimator (and its inverse) of a monotone regression function.
We find that the pooled estimator, obtained by averaging nonstandard estimates across mutually exclusive subsamples (typically stored on different servers), outperforms the nonstandard estimator based on the entire sample in the sense of pointwise inference. We also show that, under appropriate conditions, if the number of subsamples is allowed to increase at appropriate rates, the pooled estimator is asymptotically normally distributed with a variance that is empirically estimable from the subsample-level estimates. Further, in the context of monotone function estimation, we show that this gain in pointwise efficiency comes at a price: the pooled estimator's performance, in a uniform sense (maximal risk) over a class of models, worsens as the number of subsamples increases, leading to a version of the super-efficiency phenomenon.
Finally, we propose an alternative pooled estimator that merges data from the different servers in an appropriate way, namely by smoothing the data over bins at each local server, conveying these summary statistics to a central machine, and performing isotonic regression on the combined summary statistics at the central machine, and which manages to overcome the super-efficiency phenomenon.
This is joint work with Cecile Durot and Bodhisattva Sen.
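The pooling scheme described above can be sketched in a few lines: fit an isotonic regression (here via a simple pool-adjacent-violators routine) on each subsample, evaluate each fit at a fixed point, and average. This is an illustrative toy version with made-up data, not the authors' code.

```python
import numpy as np

def pava(y):
    """Pool-adjacent-violators: least-squares non-decreasing fit to y."""
    blocks = []  # each block: [mean, count]
    for v in y:
        blocks.append([float(v), 1])
        # merge adjacent blocks while the monotonicity constraint is violated
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, c2 = blocks.pop()
            m1, c1 = blocks[-1]
            blocks[-1] = [(m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2]
    return np.concatenate([np.full(c, m) for m, c in blocks])

def pooled_isotonic_at(x, y, x0, n_subsamples, rng=None):
    """Average the isotonic estimates of f(x0) across disjoint subsamples."""
    rng = np.random.default_rng(rng)
    perm = rng.permutation(len(x))
    estimates = []
    for part in np.array_split(perm, n_subsamples):
        xs, ys = x[part], y[part]
        fit = pava(ys[np.argsort(xs)])
        j = np.searchsorted(np.sort(xs), x0)   # step-function evaluation
        estimates.append(fit[min(j, len(fit) - 1)])
    return np.mean(estimates)
```

Averaging the cube-root-rate subsample estimates is exactly the pointwise pooling whose super-efficiency trade-off the talk analyses.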
Xin Bing
A fast algorithm with minimax optimal guarantees for topic models with an unknown number of topics
Topic models have become popular for the analysis of data consisting of a collection of n independent multinomial observations, with parameters Ni ∈ N and Πi ∈ [0, 1]^{p} for i = 1, . . . , n. The model links all cell probabilities, collected in a p×n matrix Π, via the assumption that Π can be factorized as the product of two nonnegative matrices A ∈ [0, 1]^{p×K} and W ∈ [0, 1]^{K×n}. Topic models were originally developed in text mining, where one browses through n documents, based on a dictionary of p words, covering K topics. In this terminology, the matrix A is called the word-topic matrix and is the main target of estimation. It can be viewed as a matrix of conditional probabilities, and it is uniquely defined under appropriate separability assumptions, discussed in detail in this work. Notably, the unique A is required to satisfy what is commonly known as the anchor word assumption, under which A has an unknown number of rows respectively proportional to the canonical basis vectors in R^{K}. The indices of such rows are referred to as anchor words. Recent computationally feasible algorithms with theoretical guarantees use this assumption constructively by linking the estimation of the set of anchor words to that of estimating the K vertices of a simplex. This crucial step in the estimation of A requires K to be known, and cannot easily be extended to the more realistic setup in which K is unknown.
This work takes a different view on anchor word estimation, and on the estimation of A. We propose a new method of estimation in topic models, that is not a variation on the existing simplex finding algorithms, and that estimates K from the observed data. We derive new finite sample minimax lower bounds for the estimation of A, as well as new upper bounds for our proposed estimator. We describe the scenarios where our estimator is minimax adaptive. Our finite sample analysis is valid for any n, Ni, p and K, and both p and K are allowed to increase with n, a situation not handled well by previous analyses. We complement our theoretical results with a detailed simulation study. We illustrate that the new algorithm is faster and more accurate than the current ones, although we start out with a computational and theoretical disadvantage of not knowing the correct number of topics K, while we provide the competing methods with the correct value in our simulations.
Yining Chen
Narrowest-Over-Threshold detection of multiple changepoints and changepoint-like features
We propose a new, generic and flexible methodology for nonparametric function estimation, in which we first estimate the number and locations of any features that may be present in the function, and then estimate the function parametrically between each pair of neighbouring detected features. Examples of features handled by our methodology include changepoints in the piecewise-constant signal model, kinks in the piecewise-linear signal model, and other similar irregularities, which we also refer to as generalised changepoints. Our methodology works with only minor modifications across a range of generalised changepoint scenarios, and we achieve such a high degree of generality by proposing and using a new multiple generalised changepoint detection device, termed Narrowest-Over-Threshold (NOT). The key ingredient of NOT is its focus on the smallest local sections of the data on which the existence of a feature is suspected. Crucially, this adaptive localisation technique prevents NOT from considering subsamples containing two or more features, a key factor that ensures the general applicability of NOT. For selected scenarios, we show the consistency and near-optimality of NOT in detecting the number and locations of generalised changepoints. Furthermore, we discuss its computational cost and demonstrate that the total cost is close to linear.
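A minimal, single-changepoint illustration of the NOT idea for the piecewise-constant case can be sketched as follows: among randomly drawn intervals whose maximum CUSUM contrast exceeds a threshold, take the narrowest and locate the feature inside it. The interval count, threshold and toy data below are my own assumptions, not the authors' implementation.

```python
import numpy as np

def cusum(x):
    """|CUSUM| contrasts |C_b| for every split point b = 1, ..., len(x)-1."""
    n = len(x)
    b = np.arange(1, n)
    left = np.cumsum(x)[:-1]              # partial sums S_b
    tot = x.sum()
    c = (np.sqrt((n - b) / (n * b)) * left
         - np.sqrt(b / (n * (n - b))) * (tot - left))
    return np.abs(c)

def not_single_cp(x, n_intervals=300, threshold=3.0, rng=None):
    """Among random intervals whose max |CUSUM| exceeds the threshold,
    keep the narrowest one and return the location of its best split."""
    rng = np.random.default_rng(rng)
    n = len(x)
    best = None  # (interval length, estimated changepoint)
    for _ in range(n_intervals):
        s, e = np.sort(rng.choice(n + 1, size=2, replace=False))
        if e - s < 2:
            continue
        c = cusum(x[s:e])
        if c.max() > threshold and (best is None or e - s < best[0]):
            best = (e - s, s + int(np.argmax(c)) + 1)
    return None if best is None else best[1]
```

Focusing on the narrowest qualifying interval is precisely what keeps each local fit away from a second feature.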
Guang Cheng
Large-scale nearest neighbor classification with statistical guarantees
In this talk, we develop distributed classification methods based on the nearest neighbor principle. Through majority voting, the distributed classifier can achieve the oracle rate of regret and instability if the neighbors are carefully chosen. The only loss is a multiplicative constant that depends only on the data dimensionality; this can be remedied by replacing majority voting with continuous aggregation.
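The majority-voting scheme is easy to sketch: each machine runs a k-nearest-neighbour vote on its own shard, and the final label is the majority vote over machines. The data, k and machine count below are illustrative assumptions only.

```python
import numpy as np

def knn_predict(X, y, x0, k):
    """Plain k-NN majority vote on one machine's shard (use odd k)."""
    d = np.linalg.norm(X - x0, axis=1)
    neighbours = np.argsort(d)[:k]
    return int(np.mean(y[neighbours]) > 0.5)

def distributed_knn(X, y, x0, k, n_machines, rng=None):
    """Randomly shard the data, run k-NN per shard, then majority-vote
    the machine-level labels."""
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(y))
    votes = [knn_predict(X[part], y[part], x0, k)
             for part in np.array_split(idx, n_machines)]
    return int(np.mean(votes) > 0.5)
```

Replacing the final hard vote by an average of shard-level class-probability estimates is the "continuous aggregation" variant mentioned in the abstract.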
Haeran Cho
Multiscale MOSUM procedure with localised pruning
In this work, we investigate the problem of multiple changepoint detection where the changes occur in the mean of univariate data. The proposed localised pruning methodology is applicable when conflicting changepoint estimates are returned together with information about the local intervals in which they are detected. We study the theoretical consistency of the localised pruning in combination with the multiscale extension of the Moving SUM (MOSUM) procedure of Eichinger and Kirch (2018). Extensive simulation studies show the computational efficiency and good finite sample performance of the combined methodology, which is available as the R package 'mosum'. This is joint work with Claudia Kirch (OvGU Magdeburg).
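The core MOSUM statistic compares sums over adjacent windows of a fixed bandwidth G. The sketch below is a plain single-bandwidth version with toy data, not the multiscale 'mosum' package implementation.

```python
import numpy as np

def mosum_stat(x, G):
    """MOSUM statistic T(k) = |sum(x[k:k+G]) - sum(x[k-G:k])| / sqrt(2G),
    computed for k = G, ..., n-G so both windows fit inside the data.
    Returns the positions k and the statistics; peaks flag mean changes."""
    n = len(x)
    cs = np.concatenate([[0.0], np.cumsum(x)])   # prefix sums
    k = np.arange(G, n - G + 1)
    right = cs[k + G] - cs[k]
    left = cs[k] - cs[k - G]
    return k, np.abs(right - left) / np.sqrt(2 * G)
```

The multiscale procedure of the talk runs this over several bandwidths and then prunes conflicting local estimates.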
David Dunson
Scaling up Bayesian inference
This talk discusses some recent work on scaling up Bayesian inference to large sample size (n) and/or high-dimensional (large p) problems. We focus in particular on algorithms having theoretical guarantees on performance, including divide-and-conquer Markov chain Monte Carlo (MCMC) and approximate MCMC. We are also heavily motivated by real-world applications in the sciences.
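One well-known divide-and-conquer MCMC scheme, consensus Monte Carlo, can be sketched on a toy conjugate model (a generic illustration with assumed names, not the specific algorithms of the talk): draw from each shard's subposterior, then combine the draws by inverse-variance weighted averaging, which is exact in the Gaussian case.

```python
import numpy as np

def consensus_draws(shards, sigma2, n_draws=4000, rng=None):
    """Consensus Monte Carlo for a normal mean with known variance sigma2
    and a flat prior: sample each shard's subposterior N(mean, sigma2/m),
    then combine draws by inverse-variance weighted averaging."""
    rng = np.random.default_rng(rng)
    draws, weights = [], []
    for xs in shards:
        sub_var = sigma2 / len(xs)               # subposterior variance
        draws.append(rng.normal(xs.mean(), np.sqrt(sub_var), n_draws))
        weights.append(1.0 / sub_var)
    w = np.asarray(weights)
    return (w[:, None] * np.asarray(draws)).sum(axis=0) / w.sum()
```

For non-Gaussian subposteriors the same weighted combination is only approximate, which is where the theoretical guarantees discussed in the talk become relevant.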
Matteo Fasiolo
Calibrated additive quantile regression
Generalized Additive Models (GAMs) are an extension of Generalized Linear Models (GLMs), which allow the inclusion of random effects, complex nonlinear effects (built using spline bases) and, more recently, response distributions outside the exponential family. In this talk I will describe a computationally efficient Bayesian framework for fitting quantile GAMs, which are based on the pinball loss, rather than on a probabilistic response distribution. The new hierarchical framework selects both the smoothing parameter and the so-called "learning rate" automatically and efficiently, and provides posterior credible intervals that are approximately calibrated in a frequentist sense. I will illustrate the new methods in the context of electricity demand forecasting, where they provide a predictive performance that is competitive with that of Gradient Boosting (GB), but at a fraction of GB's computational cost.
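The pinball loss underlying quantile GAMs is concrete enough to state in code; a quick numerical check on toy data (my own example) confirms that the empirical τ-quantile minimizes the average pinball loss.

```python
import numpy as np

def pinball_loss(y, q, tau):
    """Average pinball (check) loss: rho_tau(u) = u * (tau - 1{u < 0}),
    with u = y - q. Minimized over constant q by the tau-quantile of y."""
    u = y - q
    return np.mean(u * (tau - (u < 0)))
```

Quantile GAMs replace the constant q by an additive spline model and minimize this loss; the Bayesian framework of the talk adds the automatic smoothing-parameter and learning-rate selection.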
Xiaoming Huo
Nonconvex optimization and statistical properties
Nonconvex optimization has been introduced into statistics with a range of applications. One application is model selection under the sparse regression framework, with celebrated methods such as the smoothly clipped absolute deviation (SCAD), the minimax concave penalty (MCP), and many more. The newly emerged deep-learning-related techniques often involve nonconvex objective functions as well. A nonconvex optimization problem is generally NP-hard, so there is no guaranteed polynomial-time numerical solution; one can only hope to identify a local optimum. A difference-of-convex (DC) function can be expressed as a difference of two convex functions, even though the original function itself may be nonconvex. There is a large existing literature on optimization problems whose objectives and/or constraints involve DC functions; the associated methods are commonly referred to as difference-of-convex algorithms (DCA), and efficient numerical solutions have been proposed. Under the DC framework, directional-stationary (d-stationary) solutions are considered, and they are in general not unique. We show that under some mild conditions, a certain subset of d-stationary solutions of an optimization problem with a DC objective has some ideal statistical properties: namely, asymptotic estimation consistency, asymptotic model selection consistency, and asymptotic efficiency. These are the properties that have been proven by many researchers for a range of proposed nonconvex penalties in sparse regression. Our analysis indicates that even with nonconvex optimization, some statistical theoretical guarantees can still be established in a general sense. Our work potentially bridges the communities of optimization and statistics. This is joint work with Shanshan Cao.
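The DCA iteration is easy to illustrate on a one-dimensional toy problem (my own example, not from the talk): minimize f(x) = x^4 - 2x^2, written as g - h with g(x) = x^4 and h(x) = 2x^2. Linearizing h at the current iterate gives a convex subproblem with a closed-form solution.

```python
import numpy as np

def dca_step(x):
    """One DCA step for f = g - h with g(x) = x^4, h(x) = 2x^2:
    x_next = argmin_z g(z) - h'(x) * z = argmin_z z^4 - 4*x*z,
    whose first-order condition 4z^3 = 4x gives z = cbrt(x)."""
    return np.cbrt(x)

def dca(x0, n_iter=100):
    """Iterate the convex-surrogate minimization from x0."""
    x = x0
    for _ in range(n_iter):
        x = dca_step(x)
    return x
```

From any nonzero start the iterates converge to a d-stationary point x = ±1, which for this toy f is also a global minimizer, mirroring the talk's point that well-chosen d-stationary solutions can be statistically ideal.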
Binyan Jiang
Penalized interaction estimation for ultra-high dimensional quadratic regression
Quadratic regression goes beyond the linear model by simultaneously including main effects and interactions between the covariates. The problem of interaction estimation in high dimensional quadratic regression has received extensive attention in the past decade. In this article we introduce a novel method which allows us to estimate the main effects and interactions separately. Unlike existing methods for ultra-high dimensional quadratic regression, our proposal does not require the widely used heredity assumption. In addition, our proposed estimates have explicit formulas and obey the invariance principle at the population level. We estimate the interactions in matrix form under a penalized convex loss function. The resulting estimates are shown to be consistent even when the covariate dimension is of an exponential order of the sample size. We develop an efficient ADMM algorithm to implement the penalized estimation. This ADMM algorithm fully exploits the low computational cost of matrix multiplication and hence is much more efficient than existing penalized methods under heredity constraints. We demonstrate the promising performance of our proposal through extensive numerical studies.
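The ADMM template used here can be sketched on the simpler lasso problem (a generic illustration, not the authors' interaction estimator): alternate a ridge-type linear solve, an elementwise soft-thresholding step, and a dual update.

```python
import numpy as np

def admm_lasso(X, y, lam, rho=1.0, n_iter=300):
    """ADMM for min_b 0.5*||y - X b||^2 + lam*||b||_1."""
    n, p = X.shape
    Xty = X.T @ y
    L = np.linalg.cholesky(X.T @ X + rho * np.eye(p))  # factor once, reuse
    z = np.zeros(p)
    u = np.zeros(p)
    for _ in range(n_iter):
        # b-update: solve (X'X + rho*I) b = X'y + rho*(z - u)
        b = np.linalg.solve(L.T, np.linalg.solve(L, Xty + rho * (z - u)))
        # z-update: elementwise soft-thresholding
        v = b + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)
        # dual update
        u = u + b - z
    return z
```

The matrix-variate version in the paper replaces the vector updates with matrix multiplications, which is exactly where its computational advantage comes from.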
Ata Kaban
Structure aware generalisation error bounds using random projections
We present new risk bounds for binary classification in high-dimensional settings when the dimensionality can be larger than the sample size. In particular, we prove upper bounds for both 'compressive learning' by empirical risk minimization (ERM), that is, when the ERM classifier is learned from data that have been projected from high dimensions onto a randomly selected low-dimensional subspace, as well as uniform upper bounds in the full high-dimensional space. For the latter, random projection serves as an analytic tool to discover and exploit benign geometry inherent in the data by means of the distortion of the loss function of a learning problem. Our bounds are data dependent, highlight benign geometric structures that make a classification problem easier or harder, and remain informative in small sample conditions. We then use our approach to draw connections between existing classification approaches, and finally we empirically demonstrate that our basic bound is informative enough in practice to serve as the objective function for learning a classifier.
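The compressive-learning setting is simple to simulate (all names, sizes and the nearest-centroid stand-in for ERM below are my own assumptions): project the data onto a random low-dimensional subspace, then learn entirely in the compressed space.

```python
import numpy as np

def compressive_erm(X, y, k, rng=None):
    """Compressive-learning sketch: map d-dimensional inputs through a
    random Gaussian projection to k dimensions, then fit a simple
    classifier (nearest class centroid) in the compressed space."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    R = rng.normal(size=(d, k)) / np.sqrt(k)   # Johnson-Lindenstrauss map
    Z = X @ R
    mu0 = Z[y == 0].mean(axis=0)
    mu1 = Z[y == 1].mean(axis=0)
    def predict(Xnew):
        Zn = Xnew @ R
        d0 = np.linalg.norm(Zn - mu0, axis=1)
        d1 = np.linalg.norm(Zn - mu1, axis=1)
        return (d1 < d0).astype(int)
    return predict
```

When the class geometry is benign, as in the talk's bounds, little accuracy is lost despite the drastic dimension reduction.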
Jeremias Knoblauch
Bayesian online changepoint detection and model selection in high-dimensional data
Bayesian online changepoint detection is extended to online model selection and nonstationary spatio-temporal processes. We propose spatially structured Vector Autoregressions (VARs) for modelling the process between changepoints (CPs) and give an upper bound on the approximation error of such models. The resulting algorithm performs prediction, model selection and CP detection online. Its time complexity is linear and its space complexity constant, so it is two orders of magnitude faster than its closest competitor. In addition, it outperforms the state of the art for multivariate data.
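The run-length recursion behind Bayesian online changepoint detection fits in a short function. The sketch below is the basic Adams-MacKay recursion for a univariate Gaussian mean with known noise variance and constant hazard, far simpler than the spatially structured VAR models of the talk.

```python
import numpy as np

def normal_pdf(x, m, v):
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

def bocpd_map_runlength(x, hazard=0.01, mu0=0.0, v0=10.0, vx=1.0):
    """Run-length filter for a Gaussian mean with known noise variance vx,
    N(mu0, v0) prior on the mean, and constant hazard rate.
    Returns the MAP run length after each observation."""
    R = np.array([1.0])                # run-length distribution
    mu = np.array([mu0])               # posterior mean per run length
    v = np.array([v0])                 # posterior variance per run length
    map_rl = []
    for xt in x:
        pred = normal_pdf(xt, mu, v + vx)      # predictive per run length
        growth = R * pred * (1.0 - hazard)     # the run continues
        cp = (R * pred * hazard).sum()         # the run resets to 0
        R = np.concatenate([[cp], growth])
        R /= R.sum()
        # conjugate Gaussian update for every surviving run length
        v_new = 1.0 / (1.0 / v + 1.0 / vx)
        mu_new = v_new * (mu / v + xt / vx)
        mu = np.concatenate([[mu0], mu_new])
        v = np.concatenate([[v0], v_new])
        map_rl.append(int(np.argmax(R)))
    return np.array(map_rl)
</imports>```

The MAP run length collapsing towards zero is the online changepoint signal; pruning the run-length vector is what yields the constant space complexity claimed in the abstract.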
Mladen Kolar
Recovery of simultaneously low-rank and two-way sparse coefficient matrices: a nonconvex approach
We study the problem of recovering matrices that are simultaneously low-rank and row- and/or column-sparse. Such matrices appear in recent applications in cognitive neuroscience, imaging, computer vision, macroeconomics, and genetics. We propose a GDT (Gradient Descent with hard Thresholding) algorithm to efficiently recover matrices with such structure, by minimizing a biconvex function over a nonconvex set of constraints. We show linear convergence of the iterates obtained by GDT to a region within statistical error of an optimal solution. As an application of our method, we consider multi-task learning problems and show that the statistical error rate obtained by GDT is near optimal compared to the minimax rate. Experiments demonstrate competitive performance and much faster running speed compared to existing methods, on both simulations and real data sets.
Joint work with Ming Yu, Zhaoran Wang, and Varun Gupta.
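A stripped-down, vector-valued cousin of GDT, iterative hard thresholding for sparse least squares, shows the projected-gradient template (my simplification; GDT itself operates on low-rank, two-way sparse matrices): take a gradient step on the smooth loss, then project onto the nonconvex constraint set by hard thresholding.

```python
import numpy as np

def iht(X, y, s, step=None, n_iter=300):
    """Iterative hard thresholding for s-sparse least squares:
    gradient step on 0.5*||y - Xb||^2, then keep only the s largest
    entries of b (projection onto the nonconvex sparsity constraint)."""
    n, p = X.shape
    if step is None:
        step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / spectral norm squared
    b = np.zeros(p)
    for _ in range(n_iter):
        b = b + step * (X.T @ (y - X @ b))       # gradient step
        small = np.argsort(np.abs(b))[:-s]       # all but the s largest
        b[small] = 0.0                           # hard-thresholding projection
    return b
```

As in the GDT analysis, the iterates contract linearly to a neighbourhood of the truth whose radius is set by the statistical error.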
Jason Lee
Geometry of optimization landscapes and implicit regularization of optimization algorithms
We first study the problem of learning a Gaussian-input two-layer ReLU network with a positive output layer, and the symmetric matrix completion problem. Despite the nonconvexity of both problems, we prove that every local minimizer is a global minimizer. Since gradient descent converges to local minimizers, this shows that simple gradient-based methods can find the global optimum of these nonconvex problems. In the second part, we analyze the implicit regularization effects of various optimization algorithms. In particular, we prove that for least squares with mirror descent, the algorithm converges to the closest solution in terms of the Bregman divergence. For linearly separable classification problems, we prove that steepest descent with respect to a norm solves the SVM with respect to the same norm. For overparametrized nonconvex problems such as matrix sensing or neural nets with quadratic activation, we prove that gradient descent converges to the minimum nuclear norm solution, which allows for both meaningful optimization and generalization guarantees.
This is joint work with Rong Ge, Suriya Gunasekar, Tengyu Ma, Mor Shpigel, Daniel Soudry, and Nati Srebro.
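The least-squares case of the implicit-regularization story is easy to verify numerically: gradient descent from a zero initialization on an underdetermined system converges to the minimum-Euclidean-norm interpolant (mirror descent with the identity potential). The toy sizes below are assumptions for illustration.

```python
import numpy as np

def gd_least_squares(X, y, n_iter=5000):
    """Gradient descent on 0.5*||Xb - y||^2 from b = 0. The iterates stay
    in the row space of X, so the limit is the minimum-norm solution,
    i.e. pinv(X) @ y."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        b -= step * X.T @ (X @ b - y)
    return b
```

Swapping the Euclidean geometry for a Bregman divergence, or the squared loss for a classification loss, gives the other implicit biases described in the talk.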
Yoonkyung Lee
Dimensionality reduction for exponential family data
Principal component analysis (PCA) is useful for a wide range of data analysis tasks. However, its implicit link to the Gaussian distribution can be undesirable for discrete data such as binary and multicategory responses or counts. We generalize PCA to handle various types of data using the generalized linear model framework. In contrast to the existing approach of matrix factorizations for exponential family data, our generalized PCA provides lowrank estimates of the natural parameters by projecting the saturated model parameters. Due to this difference, the number of parameters does not grow with the number of observations and the principal component scores on new data can be computed with simple matrix multiplication. We provide a computationally efficient algorithm for finding the principal component loadings and demonstrate the benefits of the proposed approach numerically. Further we discuss its extension to the supervised setting where principal components from predictors are associated with a response.
This is joint work with Andrew Landgraf.
Didong Li
Efficient manifold and subspace approximations with spherelets
Data lying in a high-dimensional ambient space are commonly thought to have a much lower intrinsic dimension. In particular, the data may be concentrated near a lower-dimensional subspace or manifold. There is an immense literature focused on approximating the unknown subspace, and on exploiting such approximations in clustering, data compression, and building predictive models. Most of the literature relies on approximating subspaces using a locally linear, and potentially multiscale, dictionary. In this presentation, we propose a simple and general alternative, which instead uses pieces of spheres, or spherelets, to locally approximate the unknown subspace. Theory is developed showing that spherelets can produce dramatically lower covering numbers and MSEs for many manifolds. We develop spherical principal components analysis (SPCA). Results relative to state-of-the-art competitors show dramatic gains in the ability to accurately approximate the subspace with orders of magnitude fewer components. This leads to substantial gains in data compressibility, fewer clusters and hence better interpretability, and much lower MSE based on small to moderate sample sizes. Furthermore, a Bayesian nonparametric model based on spherelets is built to estimate the manifold and the density simultaneously.
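The basic building block, fitting a sphere to local data, has a neat linear-algebra form (an illustrative algebraic fit, not the exact SPCA procedure): since ||x - c||^2 = r^2 expands to 2 x·c + (r^2 - ||c||^2) = ||x||^2, the centre solves a linear least-squares problem.

```python
import numpy as np

def fit_sphere(X):
    """Algebraic least-squares sphere fit: solve 2 x·c + d = ||x||^2 for
    the centre c and d = r^2 - ||c||^2, then recover the radius r."""
    A = np.hstack([2 * X, np.ones((len(X), 1))])
    rhs = (X ** 2).sum(axis=1)
    sol, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    c, d = sol[:-1], sol[-1]
    r = np.sqrt(d + c @ c)
    return c, r
```

A spherelet dictionary applies fits of this flavour on local neighbourhoods, replacing the tangent planes of locally linear dictionaries.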
Faming Liang
Markov neighbourhood regression for high-dimensional inference
We propose an innovative method for constructing p-values and confidence intervals in high-dimensional regression. Unlike existing methods, such as the desparsified Lasso, ridge projection and multi sample-splitting, which strive to work on the original-scale high-dimensional problem, the proposed method successfully reduces the original high-dimensional inference problem to a series of low-dimensional inference problems by making use of conditional independence relations between different predictors. The proposed method has been tested on high-dimensional linear, logistic and Cox regression. The numerical results indicate that the proposed method significantly outperforms the existing ones. The idea of using conditional independence relations for dimension reduction is general and can potentially be extended to other high-dimensional or big data problems as well.
Jinchi Lv
Asymptotics of eigenvectors and eigenvalues for large structured random matrices
Characterizing the exact asymptotic distributions of high-dimensional eigenvectors for large structured random matrices poses important challenges yet can provide useful insights into a range of applications. This paper establishes the asymptotic properties of the spiked eigenvectors and eigenvalues for the generalized Wigner random matrix, where the mean matrix is assumed to have a low-rank structure. Under some mild regularity conditions, we provide asymptotic expansions for the spiked eigenvalues and show that they are asymptotically normal after some normalization. For the spiked eigenvectors, we provide novel asymptotic expansions for a general linear combination and further show that the linear combination is asymptotically normal after some normalization, where the weight vector can be an arbitrary unit vector. Simulation studies verify the validity of our new theoretical results. Our family of models encompasses many popular ones, such as stochastic block models with or without overlapping communities for network analysis and topic models for text analysis, and our general theory can be exploited for statistical inference in these large-scale applications. This is joint work with Jianqing Fan, Yingying Fan and Xiao Han.
Ping Ma
Asympirical method: a new paradigm for statistical analysis of large samples
Traditional statistical theory and methods were developed for samples of small and moderate size. In particular, statistical model fitting and inference are conducted on small and moderate-sized samples to obtain empirical results, and asymptotic theory is established to extrapolate the performance of these empirical results to large samples. However, this traditional, coherent paradigm of statistical analysis falls apart for large samples: the key challenge is that many traditional statistical methods are computationally too expensive to produce meaningful empirical results. A new statistical paradigm is urgently needed for the statistical analysis of large samples.
Cheng Qian
Covariance and graphical modelling for high-dimensional longitudinal and functional data
We consider the problem of estimating functional covariance and graphical models from a data set consisting of multivariate sparse longitudinal data. The underlying trajectories are represented through functional principal component expansions, where the covariance matrix of the principal component scores characterizes the global covariance features and the principal component functions provide the functional representation of the covariance relationships. Our proposed estimation procedure first implements a nonparametric method to perform functional principal component analysis for sparse longitudinal data, and then computes functional regularized covariance or precision matrices. We derive the relevant concentration inequalities for high-dimensional sparsely sampled functional data and use them to investigate the uniform consistency of our proposed estimators. The finite sample performance of our proposed methods is illustrated through an extensive set of simulation studies and two real data examples.
Qifan Song
Bayesian shrinkage towards sharp minimaxity
Shrinkage priors have become increasingly popular in Bayesian modeling of high-dimensional sparse problems due to their computational efficiency. Recent work shows that a polynomially decaying prior leads to satisfactory posterior asymptotics under regression models. In the literature, statisticians have investigated how the global shrinkage parameter, i.e., the scale parameter, in a heavy-tailed prior affects posterior contraction. In this work, we explore how the shape of the prior, or more specifically the polynomial order of its tail, affects the posterior. We discover that, under sparse normal means models, the polynomial order does affect the multiplicative constant of the posterior contraction rate. More importantly, if the polynomial order is sufficiently close to 1, it induces the optimal Bayesian posterior convergence, in the sense that the Bayesian contraction rate is sharply minimax, i.e., not only the order but also the multiplicative constant of the posterior contraction rate is optimal. This Bayesian sharp minimaxity holds when the global shrinkage parameter follows a deterministic choice that depends on the unknown sparsity s. A Beta modeling is therefore further proposed, such that our sharply minimax Bayesian procedure is adaptive to the unknown s. Our theoretical discoveries are supported by simulation studies.
Stanislav Volgushev
Distributed inference for quantile regression processes
The increased availability of massive data sets provides a unique opportunity to discover subtle patterns in their distributions, but also imposes overwhelming computational challenges. To fully utilize the information contained in big data, we propose a two-step procedure:
 estimate conditional quantile functions at different levels in a parallel computing environment;
 construct a conditional quantile regression process through projection based on these estimated quantile curves.
Our framework covers both linear models with fixed or growing dimension and series approximation models. We prove that the proposed procedure does not sacrifice any statistical inferential accuracy provided that the number of distributed computing units and quantile levels are chosen properly. In particular, a sharp upper bound for the former and a sharp lower bound for the latter are derived to capture the minimal computational cost from a statistical perspective. Moreover, we propose computationally efficient approaches to conducting inference in the distributed estimation setting described above. These approaches directly utilize the availability of estimators from subsamples and can be carried out at almost no additional computational cost. In particular, we propose a novel multiplier bootstrap procedure on aggregated samples, which should also be useful outside the quantile regression setting.
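The first step of the two-step procedure above can be sketched as a standard divide-and-conquer fit. This is a minimal illustration under simplifying assumptions: only one quantile level is used and the subsample estimates are aggregated by plain averaging, whereas the talk's second step constructs a full quantile regression process by projection across many levels.

```python
import numpy as np
from scipy.optimize import minimize

def fit_qr(X, y, tau):
    """Linear quantile regression at level tau by minimizing the pinball loss."""
    def pinball(beta):
        r = y - X @ beta
        return np.mean(np.where(r >= 0, tau * r, (tau - 1) * r))
    beta0 = np.linalg.lstsq(X, y, rcond=None)[0]   # least-squares warm start
    return minimize(pinball, beta0, method="Nelder-Mead",
                    options={"xatol": 1e-6, "fatol": 1e-8, "maxiter": 5000}).x

rng = np.random.default_rng(0)
n, S = 20000, 10                       # total sample size, number of machines
x = rng.uniform(0, 2, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)

# Step 1: fit the tau = 0.5 quantile regression on each subsample (parallelizable).
betas = [fit_qr(X[i::S], y[i::S], 0.5) for i in range(S)]
# Step 2 (simplified): aggregate the subsample estimates by averaging.
beta_bar = np.mean(betas, axis=0)
```

With symmetric noise the median regression coefficients coincide with the mean regression coefficients (intercept 1, slope 2), so the averaged estimate should recover them closely.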
Eric Xing
On system and algorithm co-design and automatic machine learning
The rise of big data and AI computing has led to new demands for machine learning systems to learn complex models with millions to billions of parameters, promising adequate capacity to digest massive datasets and offer powerful, real-time predictive analytics thereupon. In this talk, I discuss a recent trend toward building new distributed frameworks for AI at massive scale known as "system and algorithm co-design": system designs are tailored to the unique properties of ML algorithms, and algorithms are redesigned to better fit into the system architecture. I show how one can exploit the underlying statistical and algorithmic characteristics unique to ML programs, but not typical of traditional computer programs, when designing the system architecture to achieve a significant, universal, and theoretically sound speed-up of ML programs across the board. I also give a brief introduction to the Petuum system based on such interdisciplinary innovations, which aims to dramatically improve adoption of AI solutions by lowering the barrier of entry to AI technologies via automatic machine learning through Petuum. I show how, through automatable, product-grade, hardware-agnostic, standardized building blocks that can be assembled and customized, AI users can liberate themselves from the demanding experience of algorithm programming and system tuning, and easily experiment with different AI methods, parameters, and speed/resource trade-offs, either by themselves or automatically.
Chen Zhang
Variational Gaussian approximation for Poisson data
The Poisson model is frequently employed to describe count data, but in a Bayesian context it leads to an analytically intractable posterior probability distribution. In this work, we analyze a variational Gaussian approximation to the posterior distribution arising from the Poisson model with a Gaussian prior. This is achieved by seeking an optimal Gaussian distribution minimizing the Kullback–Leibler divergence from the posterior distribution to the approximation, or equivalently maximizing the lower bound on the model evidence. We derive an explicit expression for the lower bound, and show the existence and uniqueness of the optimal Gaussian approximation. The lower bound functional can be viewed as a variant of classical Tikhonov regularization that also penalizes the covariance. We then develop an efficient alternating direction maximization algorithm for solving the optimization problem, and analyze its convergence. We discuss strategies for reducing the computational complexity via the low-rank structure of the forward operator and the sparsity of the covariance. Further, as an application of the lower bound, we discuss hierarchical Bayesian modeling for selecting the hyperparameter in the prior distribution, and propose a monotonically convergent algorithm for determining the hyperparameter. We present extensive numerical experiments to illustrate the Gaussian approximation and the algorithms.
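The lower bound has a simple closed form in the special case of an identity forward operator and a diagonal variational covariance, which makes the idea easy to sketch. The following is a minimal illustration, not the paper's alternating direction algorithm: it maximizes the evidence lower bound directly with a generic quasi-Newton solver, and the mean-field (diagonal) restriction is an assumption made for brevity.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def vga_poisson(y, sigma0=1.0):
    """Variational Gaussian approximation N(m, diag(s^2)) to the posterior of x
    given y_i ~ Poisson(exp(x_i)) with prior x ~ N(0, sigma0^2 I), obtained by
    maximizing the evidence lower bound (identity forward operator)."""
    y = np.asarray(y, dtype=float)
    n = y.size

    def neg_elbo(p):
        m, log_s = p[:n], p[n:]
        s2 = np.exp(2 * log_s)
        # E_q[log p(y | x)], using E_q[exp(x_i)] = exp(m_i + s_i^2 / 2)
        loglik = np.sum(y * m - np.exp(m + s2 / 2) - gammaln(y + 1))
        # KL(N(m, diag(s^2)) || N(0, sigma0^2 I)) in closed form
        kl = 0.5 * np.sum(s2 / sigma0**2 + m**2 / sigma0**2
                          - 1 - np.log(s2 / sigma0**2))
        return -(loglik - kl)

    p0 = np.concatenate([np.log(y + 0.5), np.full(n, -1.0)])
    res = minimize(neg_elbo, p0, method="L-BFGS-B")
    return res.x[:n], np.exp(res.x[n:])   # variational mean and std. deviations
```

For a single count y = 5 with a weak prior, the variational mean sits near the posterior mode of 5x - e^x - x^2/(2 sigma0^2), close to log 5.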
Chao Zheng
Revisiting Huber’s M-estimation: a tuning-free approach
We introduce a novel, tuning-free scheme to choose the scale, or robustification, parameter in Huber’s method for mean estimation and linear regression in both low- and high-dimensional settings. For robustly estimating the mean of a univariate distribution, we first consider the adaptive Huber estimator with the robustification parameter calibrated via the censored equation approach. Our theoretical results provide finite-sample guarantees for both the estimation and calibration parts. To further reduce the computational complexity, we next develop an alternating M-estimation procedure, which estimates the mean and variance simultaneously in a sequential manner. This idea extends naturally to regression problems in both low and high dimensions. We provide simple and fast algorithms to implement this procedure under various scenarios and study their numerical performance through simulations.
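The alternating idea can be sketched for the univariate mean. This is an illustrative sketch only: the calibration tau = sqrt(v * n / log n) follows the general adaptive Huber scaling rather than the paper's censored equation, and the simple fixed-point update for the mean stands in for a full M-estimation solver.

```python
import numpy as np

def adaptive_huber_mean(x, n_iter=50):
    """Alternating Huber mean estimation: iterate between (i) one fixed-point
    step for the Huber M-estimate of the mean at the current robustification
    parameter tau, and (ii) updating tau from a truncated variance estimate.
    The calibration tau ~ sqrt(v * n / log n) is an assumed illustrative rule."""
    x = np.asarray(x, dtype=float)
    n = x.size
    mu, v = np.median(x), np.var(x)        # robust-ish starting values
    for _ in range(n_iter):
        tau = np.sqrt(v * n / np.log(n))   # robustification parameter
        r = x - mu
        psi = np.clip(r, -tau, tau)        # Huber score of the residuals
        mu = mu + psi.mean()               # fixed-point step for the mean
        v = np.mean(np.minimum(r**2, tau**2))   # truncated second moment
    return mu
```

On heavy-tailed data the clipped residuals keep extreme observations from dominating the estimate, while tau adapts to the (truncated) noise scale.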
Hua Zhou
Global solutions of generalized canonical correlation analysis problems
Generalizations of canonical correlation analysis (CCA) can be written as the problem of maximizing a sum of traces of quadratic forms of orthogonal matrices, which is termed the orthogonal trace-sum maximization (OTSM) problem. OTSM also generalizes many interesting problems other than CCA, from Procrustes analysis to cryo-electron microscopy of Nobel Prize fame. Due to the orthogonality constraints, OTSM is a nonconvex maximization problem on a product of Stiefel manifolds. We first introduce an efficient block ascent algorithm that locally solves the problem with provable convergence. We then provide a simple method to certify global optimality of the local solution. This method only requires testing the sign of the smallest eigenvalue of a symmetric matrix constructed from the local solution and the data, and does not rely on a particular solution method. Our result relies on a semidefinite programming (SDP) relaxation of OTSM, but need not solve the SDP of lifted dimension. This is joint work with Joong-Ho Won and Kenneth Lange.
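The block ascent step has a closed form: fixing all blocks but one reduces OTSM to a Procrustes problem solved by a polar factor from an SVD. The sketch below shows the local solver only, under the assumption that the cross-product matrices satisfy S[j][i] = S[i][j].T (as in generalized CCA with S[i][j] = X_i' X_j); the talk's eigenvalue-based global-optimality certificate is not included.

```python
import numpy as np

def otsm_block_ascent(S, d, r, n_iter=200, seed=0):
    """Block ascent for orthogonal trace-sum maximization: maximize
    sum_{i<j} tr(O_i' S[i][j] O_j) over d x r matrices O_i with orthonormal
    columns. Each block update is a Procrustes step (polar factor via SVD)."""
    m = len(S)
    rng = np.random.default_rng(seed)
    O = [np.linalg.qr(rng.standard_normal((d, r)))[0] for _ in range(m)]

    for _ in range(n_iter):
        for i in range(m):
            # Fixing the other blocks, O_i maximizes tr(O_i' M): polar factor.
            M = sum(S[i][j] @ O[j] for j in range(m) if j != i)
            U, _, Vt = np.linalg.svd(M, full_matrices=False)
            O[i] = U @ Vt

    value = sum(np.trace(O[i].T @ S[i][j] @ O[j])
                for i in range(m) for j in range(m) if i != j) / 2
    return O, value
```

Each Procrustes step cannot decrease the objective, which gives the monotone convergence of the local solver.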
Wenxuan Zhong
Leverage sampling to overcome the computational challenges for big spatial data
Within the realms of big data, spatial data is one of the fastest-growing data types and poses significant challenges to researchers conducting statistical analyses. With advances in remote sensors, sensor networks, and the proliferation of location-sensing devices in daily life activities and common business practices, the generation of disparate, dynamic, and geographically distributed spatial data has exploded in recent years. In addition, large-scale spatiotemporal data generated by social media outlets is proving to be highly useful in disease surveillance, disaster mapping, and national security applications.
In general, spatial models are limited in their efficacy for large data sets due to their expensive computational cost. For example, in spatial modeling, the computation is typically of order O(n³), with n being the number of spatial locations. Methods for reducing the computational burden in spatial models are becoming more common in the spatial statistics literature.
Here we introduce a data-dependent random sampling scheme called the spatial leverage sampling method. Leverage sampling methods have been successful in improving the computational efficiency of analyses of independent data. Taking the spatial dependency of the data into account, we develop the spatial leverage score under the Gaussian spatial linear model. An empirical evaluation of the leverage-based methods is carried out on both synthetic and real datasets. The empirical results indicate that our method has good practical performance. For example, with the same computational reduction as the uniform random sampling approach, our proposed spatial leverage sampling method typically leads to an improved mean squared error. To establish the potential of our method, we model the daily spatiotemporal Twitter trend in the United States. Our results show that the proposed spatial leverage sampling approach can provide computationally efficient and informative estimates of spatiotemporal social media trends.
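The independent-data precursor that the spatial method builds on can be sketched in a few lines: sample rows with probability proportional to their leverage scores and solve a reweighted least-squares fit on the subsample. This is the standard leveraging scheme for ordinary regression, shown here as background; the talk's spatial leverage score, which accounts for Gaussian spatial dependence, is a different and more involved construction.

```python
import numpy as np

def leverage_subsample_ols(X, y, m, seed=0):
    """Leverage-score subsampling for least squares (independent-data case).
    Rows are sampled with probability proportional to their leverage h_ii,
    then a least-squares fit is solved on the importance-reweighted subsample."""
    n = X.shape[0]
    Q, _ = np.linalg.qr(X)                  # thin QR: leverage h_ii = ||Q_i||^2
    h = np.sum(Q**2, axis=1)
    p = h / h.sum()                         # sampling probabilities
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=m, replace=True, p=p)
    w = 1.0 / np.sqrt(p[idx])               # importance weights (sqrt for LS)
    beta, *_ = np.linalg.lstsq(w[:, None] * X[idx], w * y[idx], rcond=None)
    return beta
```

With m much smaller than n, the reweighted subsample fit stays close to the full least-squares solution at a fraction of the cost, which is the computational gain the spatial variant inherits.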