Research and Teaching in Statistical and Data Sciences

 Dec 10 2021

15:00 - 16:00

This diverse seminar series will highlight novel advances in methodology and application in statistics and data science, and will take the place of the University of Glasgow Statistics Group seminar during this period of remote working. We welcome all interested attendees at Glasgow and further afield. For more information please see the University of Glasgow webpage

Register here

Call details will be sent out 30 minutes before the start of the seminar.

These seminars are recorded. All recordings can be found here.


Next seminar:

10 December 2021

Xiaoming Huo (Georgia Institute of Technology, USA) - Identification of Underlying Partial Differential Equations from Noisy Data with Splines
  • We propose a two-stage method called Spline Assisted Partial Differential Equation involved Model Identification (SAPDEMI) to efficiently identify the underlying partial differential equation (PDE) models from noisy data. In the first stage, we employ cubic splines to estimate the unobservable derivatives, some of which govern the underlying PDE models. This stage is computationally efficient: its computational complexity is linear in the sample size, which is the lowest possible order of complexity. In the second stage, we apply the Least Absolute Shrinkage and Selection Operator (Lasso) to identify the underlying PDE models, where we focus on model selection rather than on parameter estimation, as in the existing literature. Moreover, we develop statistical properties of our method for correct identification. Finally, we validate our theory through various numerical examples and apply it to a case study analyzing data downloaded from the National Aeronautics and Space Administration (NASA).


Future Seminars:  

The seminar series will break for Christmas and return in January 2022.


Past Seminars:

23 April 2020

Neil Chada, (National University of Singapore) - Advancements of non-Gaussian random fields for statistical inversion

  • Developing informative priors for Bayesian inverse problems is an important direction, which can help quantify information on the posterior. In this talk we introduce a new class of priors for inversion based on alpha-stable sheets, which incorporate multiple known processes such as a Gaussian and Cauchy process. We analyze various convergence properties, which are achieved through the different representations these sheets can take. Other aspects we wish to address are well-posedness of the inverse problem and finite-dimensional approximations. To complement the analysis we provide some connections with machine learning, which will allow us to use sampling-based MCMC schemes. We will conclude the talk with some numerical experiments, highlighting the robustness of the established connection, on various inverse problems arising in regression and PDEs.


14 May 2020

Roberta Pappadà, (University of Trieste) - Consensus clustering based on pivotal methods

  • Despite its widespread use, one major limitation of the K-means clustering algorithm is its sensitivity to the initial seeding used to produce the final partition. We propose a modified version of the classical approach, which exploits the information contained in a co-association matrix obtained from clustering ensembles. Our proposal is based on the identification of a set of data points (pivotal units) that are representative of the group they belong to. The presented approach can thus be viewed as a possible strategy to perform consensus clustering. The selection of pivotal units was originally employed for solving the so-called label-switching problem in Bayesian estimation of finite mixture models. Different criteria for identifying the pivots are discussed and compared. We investigate the performance of the proposed algorithm via simulation experiments and comparison with other consensus methods available in the literature.
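To make the co-association matrix mentioned in this abstract concrete, here is a minimal sketch in Python. It assumes the common definition in which entry (i, j) is the fraction of ensemble partitions that place points i and j in the same cluster; the function name `co_association` and the toy ensemble are hypothetical, and the pivot-selection criteria discussed in the talk are not reproduced here.

```python
def co_association(partitions):
    """Return the n x n co-association matrix for a clustering ensemble.

    `partitions` is a list of label lists, one label per data point.
    Entry [i][j] is the fraction of partitions in which points i and j
    share a cluster (assumed definition, see lead-in).
    """
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    runs = len(partitions)
    return [[c / runs for c in row] for row in counts]


# Hypothetical ensemble: three clustering runs on five points.
# Label values are arbitrary within each run; only co-membership matters.
ensemble = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
C = co_association(ensemble)
```

Because only co-membership is counted, the matrix is unaffected by the arbitrary relabelling of clusters between runs, which is what makes it usable as a summary of an ensemble.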


21 May 2020

Ana Basiri, (UCL) - Who Are the "Crowd"? Learning from Large but Patchy Samples

  • This talk will look at the challenges of crowdsourced/self-reporting data, such as missingness and biases in ‘new forms of data’ and consider them as a useful source of data itself. A few applications and examples of these will be discussed, including extracting the 3D map of cities using the patterns of blockage, reflection, and attenuation of the GPS signals (or other similar signals), that are contributed by the volunteers/crowd. In the era of big data, open data, social media and crowdsourced data when “we are drowning in data”, gaps and unavailability, representativeness and bias issues associated with them may indicate some hidden problems or reasons allowing us to understand the data, society and cities better.


4 June 2020

Colin Gillespie, (University of Newcastle) - Getting the most out of other people's R sessions

  • Have you ever wondered how you could hack other people's R sessions? Well, I did, and discovered that it wasn't that hard! In this talk, I discuss a few ways I got people to run arbitrary, and hence very dangerous, R scripts. This is certainly worrying now that we have all moved to working from home.


18 June 2020

Jo Eidsvik, (NTNU) - Autonomous Oceanographic Sampling Designs Using Excursion Sets for Multivariate Gaussian random fields

  • Improving and optimizing oceanographic sampling is a crucial task for marine science and maritime management. Faced with limited resources to understand processes in the water-column, the combination of statistics and autonomous robotics provides new opportunities for experimental designs. In this work we develop methods for efficient spatial sampling applied to the mapping of coastal processes by providing informative descriptions of spatial characteristics of ocean phenomena. Specifically, we define a design criterion based on improved characterization of the uncertainty in the excursions of vector-valued Gaussian random fields, and derive tractable expressions for the expected Bernoulli variance reduction in such a framework. We demonstrate how this criterion can be used to prioritize sampling efforts at locations that are ambiguous, making exploration more effective. We use simulations to study the properties of methods and to compare them with state-of-the-art approaches, followed by results from field deployments with an autonomous underwater vehicle as part of a case study mapping the boundary of a river plume. The results demonstrate the potential of combining statistical methods and robotic platforms to effectively inform and execute data-driven environmental sampling.


9 July 2020

Vianey Leos-Barajas, (NCSU) - Spatially-coupled hidden Markov models for short-term wind speed forecasting

  • Hidden Markov models (HMMs) provide a flexible framework to model time series data where the observation process, Y_t, is taken to be driven by an underlying latent state process, Z_t. In this talk, we will focus on discrete-time, finite-state HMMs as they provide a flexible framework that facilitates extending the basic structure in many interesting ways. HMMs can accommodate multivariate processes by (i) assuming that a single state governs the M observations at time t, (ii) assuming that each observation process is governed by its own HMM, irrespective of what occurs elsewhere, or (iii) a balance between the two, as in the coupled HMM framework. Coupled HMMs assume that a collection of M observation processes is governed by its respective M state processes. However, the mth state process at time t, Z_{m,t}, depends not only on Z_{m,t−1} but also on the collection of other state processes, Z_{−m,t−1}. We introduce spatially-coupled hidden Markov models whereby the state processes interact according to an imposed spatial structure and the observations are collected at S spatial locations. We outline an application (in progress) to short-term forecasting of wind speed using data collected across multiple wind turbines at a wind farm.
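As a rough illustration of the basic building block in this abstract, the sketch below evaluates the likelihood of an observation sequence under a single discrete-time, finite-state HMM using the standard forward recursion. It covers only the uncoupled case with discrete observations; the function name `hmm_likelihood` and all parameter values are hypothetical and are not taken from the talk.

```python
import itertools


def hmm_likelihood(init, trans, emit, obs):
    """Forward algorithm: probability of `obs` under a finite-state HMM.

    init[k]     : probability that the first latent state is k
    trans[j][k] : probability of moving from state j to state k
    emit[k][y]  : probability of observing symbol y while in state k
    """
    n = len(init)
    # alpha[k] = P(observations so far, current state = k)
    alpha = [init[k] * emit[k][obs[0]] for k in range(n)]
    for y in obs[1:]:
        alpha = [
            sum(alpha[j] * trans[j][k] for j in range(n)) * emit[k][y]
            for k in range(n)
        ]
    return sum(alpha)


# Hypothetical two-state, two-symbol model.
init = [0.6, 0.4]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.7, 0.3], [0.1, 0.9]]

# Sanity check: the likelihoods of all length-3 sequences sum to one.
total = sum(
    hmm_likelihood(init, trans, emit, seq)
    for seq in itertools.product([0, 1], repeat=3)
)
```

For long series the recursion is usually run on the log scale or with rescaling to avoid underflow; that refinement is omitted here for brevity.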


6 August 2020

Helen Ogden, (University of Southampton) - Towards More Flexible Models for Count Data

  • Count data are widely encountered across a range of application areas, including medicine, engineering and ecology. Many of the models used for the statistical analysis of count data are quite simple and make strong assumptions about the data generating process, and it is common to encounter situations in which these models fail to fit data well. I will review various existing models for count data, and describe some simple scenarios where each of these models fails. I will describe current work on an extension to existing Poisson mixture models, and demonstrate the performance of this new class of models in some simple examples.


17 September 2020

Andrew Zammit Mangion, (University of Wollongong) - Statistical Machine Learning for Spatio-Temporal Forecasting

  • Conventional spatio-temporal statistical models are well-suited for modelling and forecasting using data collected over short time horizons. However, they are generally time-consuming to fit, and often do not realistically encapsulate temporally-varying dynamics. Here, we tackle these two issues by using a deep convolution neural network (CNN) in a hierarchical statistical framework, where the CNN is designed to extract process dynamics from the process' most recent behaviour. Once the CNN is fitted, probabilistic forecasting can be done extremely quickly online using an ensemble Kalman filter with no requirement for repeated parameter estimation. We conduct an experiment where we train the model using 13 years of daily sea-surface temperature data in the North Atlantic Ocean. Forecasts are seen to be accurate and calibrated. A key advantage of the approach is that the CNN provides a global prior model for the dynamics that is realistic, interpretable, and computationally efficient to forecast with. We show the versatility of the approach by successfully producing 10-minute nowcasts of weather radar reflectivities in Sydney using the same model that was trained on daily sea-surface temperature data in the North Atlantic Ocean. This is joint work with Christopher Wikle, University of Missouri.


25 September 2020

Ed Hill, (University of Warwick) - Predictions of COVID-19 dynamics in the UK: short-term forecasting, analysis of potential exit strategies and impact of contact networks

  • Regarding the future course of the COVID-19 outbreak in the UK, mathematical models have provided, and continue to provide, short and long term forecasts to support evidence-based policymaking. We present a deterministic, age-structured transmission model for SARS-CoV-2 that uses real-time data on confirmed cases requiring hospital care and mortality to provide predictions on epidemic spread in ten regions of the UK. The model captures a range of age-dependent heterogeneities, reduced transmission from asymptomatic infections and is fit to the key epidemic features over time. We illustrate how the model has been used to generate short-term predictions and assess potential lockdown exit strategies. As steps are taken to relax social distancing measures, questions also surround the ramifications on community disease spread of workers returning to the workplace and students returning to university. To study these aspects, we present a network model to capture the transmission of SARS-CoV-2 over overlapping sets of networks in household, social and work/study settings.


2 October 2020

Eleni Matechou, (University of Kent) - Environmental DNA as a monitoring tool at a single and multi-species level

  • Environmental DNA (eDNA) is a survey tool with rapidly expanding applications for assessing presence of a wildlife species at surveyed sites. eDNA surveys consist of two stages: stage 1, when a sample is collected from a site, and stage 2, when the sample is analysed in the lab for presence of species' DNA. The methods were originally developed to target particular species (single-species), but can now be used to obtain a list of species at each surveyed site (multi-species/metabarcoding). In this talk, I will present a novel Bayesian model for analysing single-species eDNA data, while accounting for false positive and false negative errors, which are known to be non-negligible, in both stages of eDNA surveys. All model parameters can be modelled as functions of covariates and the proposed methodology allows us to perform efficient Bayesian variable selection that does not require the use of trans-dimensional algorithms. I will also discuss joint species distribution models as the starting point for modelling multi-species eDNA data and will outline the next steps required to obtain a unifying modelling framework for eDNA surveys.


9 October 2020

Daniela Castro Camilo, (University of Glasgow) - Bayesian space-time gap filling for inference on extreme hot-spots: an application to Red Sea surface temperatures

  • We develop a method for probabilistic prediction of extreme value hot-spots in a spatio-temporal framework, tailored to big datasets containing important gaps. In this setting, direct calculation of summaries from data, such as the minimum over a space-time domain, is not possible. To obtain predictive distributions for such cluster summaries, we propose a two-step approach. We first model marginal distributions with a focus on accurate modeling of the right tail and then, after transforming the data to a standard Gaussian scale, we estimate a Gaussian space-time dependence model defined locally in the time domain for the space-time subregions where we want to predict. In the first step, we detrend the mean and standard deviation of the data and fit a spatially resolved generalized Pareto distribution to apply a correction of the upper tail. To ensure spatial smoothness of the estimated trends, we either pool data using nearest-neighbor techniques, or apply generalized additive regression modeling. To cope with high space-time resolution of data, the local Gaussian models use a Markov representation of the Matérn correlation function based on the stochastic partial differential equations (SPDE) approach. In the second step, they are fitted in a Bayesian framework through the integrated nested Laplace approximation implemented in R-INLA. Finally, posterior samples are generated to provide statistical inferences through Monte-Carlo estimation. Motivated by the 2019 Extreme Value Analysis data challenge, we illustrate our approach to predict the distribution of local space-time minima in anomalies of Red Sea surface temperatures, using a gridded dataset (11,315 days, 16,703 pixels) with artificially generated gaps. In particular, we show the improved performance of our two-step approach over a purely Gaussian model without tail transformations.


16 October 2020

Daniel Lawson, (University of Bristol) - CLARITY - Comparing heterogeneous data using dissimiLARITY
  • Integrating datasets from different disciplines is hard because the data are often qualitatively different in meaning, scale, and reliability. When two datasets describe the same entities, many scientific questions can be phrased around whether the similarities between entities are conserved. Our method, CLARITY, quantifies consistency across datasets, identifies where inconsistencies arise, and aids in their interpretation. We explore three diverse comparisons: Gene Methylation vs Gene Expression, evolution of language sounds vs word use, and country-level economic metrics vs cultural beliefs. The non-parametric approach is robust to noise and differences in scaling, and makes only weak assumptions about how the data were generated. It operates by decomposing similarities into two components: the 'structural' component, analogous to a clustering, and an underlying 'relationship' between those structures. This allows a 'structural comparison' between two similarity matrices using their predictability from 'structure'. This presentation describes work, with accompanying software, presented in arXiv:2006.00077; the work can be found here.


22 October 2020

Charlotte Jones-Todd, (University of Auckland) - Modelling systematic effects and latent phenomena in point referenced data

  • The spatial location and time of events (or objects) is the currency that point process statistics invests to estimate the drivers of the intensity, or rate of occurrence, of those events. Yet, the assumed processes giving rise to the observed data typically fail to represent the full complexity of the driving mechanisms. Ignoring spatial or temporal dependence between events leads to incorrect inference, as does assuming the wrong dependency structure. Latent Gaussian models are a flexible class of models that account for dependency structures in a hide-all-ills fashion; the stochastic structures in these models absorb and amalgamate the underlying, unaccounted-for mechanisms leading to the observed data. In this talk I will introduce this class of models and discuss recent work using latent Gaussian fields to model the fluctuations in the data that cannot otherwise be accounted for.


30 October 2020

Theresa Smith, (University of Bath) - A collaborative project to monitor and improve engagement in talking therapies

  • Over the past two years, I have been working on a joint research project between the University of Bath and the software company Mayden to develop and test models to predict whether a patient in NHS England’s Improving Access to Psychological Therapy (IAPT) programme will attend their therapy appointments, allowing for targeted interventions to be developed. Currently only a third of patients referred to IAPT complete their treatment and of these only half reach recovery. Given that nearly two thirds of people in the UK report having experienced mental health problems, these rates are concerning. In this talk, I’ll give an overview of the collaboration and present the co-development and findings of two recent papers investigating factors associated with engagement in IAPT services:
  • Predicting patient engagement in IAPT services: a statistical analysis of electronic health records (doi: 10.1136/ebmental-2019-300133)

  • Impact of COVID-19 on Primary Care Mental Health Services: A Descriptive, Cross-Sectional Timeseries of Electronic Healthcare Records (doi:

6 November 2020

Manuele Leonelli, (IE University) - Diagonal distributions

  • Diagonal distributions are an extension of marginal distributions, which can be used to summarize and efficiently visualize the main features of multivariate distributions in arbitrary dimensions. The main diagonal is studied in detail; it consists of a mean-constrained univariate distribution function on [0, 1], whose variance connects with Spearman's rho, and whose mass at the endpoints 0 and 1 offers insights on the strength of tail dependence. To learn about the main diagonal density from data, we introduce histogram and kernel-based methods that take advantage of auxiliary information in the form of a moment constraint that diagonal distributions must obey. An application is given illustrating how diagonal densities can be used to contrast the diversification of a portfolio based on FAANG stocks against one based on crypto-assets.

20 November 2020

Nicole Augustin, (University of Edinburgh) - Introduction of standardised tobacco packaging and minimum excise tax in the UK: a prospective study

  • Standardised packaging for factory made and roll your own tobacco was implemented in the UK in May 2017, alongside a minimum excise tax for factory made products. As other jurisdictions attempt to implement standardised packaging, the tobacco industry continues to suggest that it would be counterproductive, in part by leading to falls in price due to commoditisation. Here, we assess the impact of the introduction of these policies on the UK tobacco market. We carried out a prospective study of UK commercial electronic point-of-sale data from 11 constituent geographic areas. Data were available for each tobacco product (or Stock Keeping Unit (SKU)): the tobacco brand, brand family, brand variant, and specific features of the pack. For each SKU, three years (May 2015 to April 2018) of monthly data on volume of sales, sales prices, and extent of distribution of sales within the 11 UK geographical areas were available. The main outcomes were changes in sales volumes, volume-weighted real prices, and tobacco industry revenue. To estimate temporal trends of monthly price per stick, revenue and volume sold we used additive mixed models. In the talk we will cover some of the statistically interesting details on data preparation, model choice, trend estimation and presentation of model results. We will also present the main results and talk about limitations. This is joint work with Rosemary Hiscock, Rob Branston and Anna Gilmore. The project was funded by Cancer Research UK.


27 November 2020

Mark Brewer, (BIOSS), & Marcel van Oijen, (CEH) - Drought risk analysis for forested landscapes: project prafor
  • This NERC-funded project aims to extend theory for probabilistic risk analysis of continuous systems, test its use against forest data, use process models to predict future risks, and develop decision-support tools. Risk is commonly defined as the expectation value for loss. Most risk theory is developed for discrete hazards such as accidents, disasters and other forms of sudden system failure and not for systems where the hazard variable is always present and continuously varying, with matching continuous system response. Risks from such continuous hazards (levels of water, pollutants) are not associated with sudden discrete events, but with extended periods of time during which the hazard variable exceeds a threshold. To manage such risks, we need to know whether we should aim to reduce the probability of hazard threshold exceedance or the vulnerability of the system. In earlier work, we showed that there is only one possible definition of vulnerability that allows formal decomposition of risk as the product of hazard probability and system vulnerability. We have used this approach to analyse risks from summer droughts to the productivity of vegetation across Europe under current and future climatic conditions; this showed that climate change will likely lead to greatest drought risks in southern Europe, primarily because of increased hazard probability rather than significant changes in vulnerability. We plan to improve on this earlier work by: adding exposure to hazard; quantifying uncertainties in our risk estimates; relaxing assumptions via Bayesian hierarchical modelling; testing our approach on both observational data from forests in the U.K., Spain and Finland and on simulated data from process-based modelling of forest response to climate change; embedding the approach in Bayesian decision theory; and developing an interactive web application as a tool for preliminary exploration of risk and its components to support decision-making.
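The decomposition of risk into hazard probability times vulnerability can be checked numerically. The sketch below is an illustration under stated assumptions, not the project's own code: it assumes that risk is the expected loss relative to non-hazardous conditions, E[y | no hazard] − E[y], and that vulnerability is E[y | no hazard] − E[y | hazard], with a hazard defined as the environmental variable falling below a threshold. The abstract does not spell out these definitions, and the function name, data, and threshold are all hypothetical.

```python
def risk_decomposition(env, response, threshold):
    """Return (hazard probability, vulnerability, risk) from paired samples.

    env[i]      : environmental variable (hazardous when below `threshold`)
    response[i] : system response, e.g. productivity
    Both hazardous and non-hazardous samples must be present.
    """
    hazardous = [y for x, y in zip(env, response) if x < threshold]
    safe = [y for x, y in zip(env, response) if x >= threshold]

    p_hazard = len(hazardous) / len(env)
    mean_safe = sum(safe) / len(safe)
    mean_hazardous = sum(hazardous) / len(hazardous)
    mean_all = sum(response) / len(response)

    vulnerability = mean_safe - mean_hazardous  # assumed definition
    risk = mean_safe - mean_all                 # assumed definition
    return p_hazard, vulnerability, risk


# Hypothetical data: a water-availability index and vegetation productivity.
env = [0.2, 0.5, 0.9, 0.1, 0.7, 0.8, 0.3, 0.6]
response = [2.0, 4.0, 5.0, 1.0, 5.0, 6.0, 3.0, 4.0]

p_hazard, vulnerability, risk = risk_decomposition(env, response, threshold=0.4)
```

Under these definitions the identity risk = p_hazard × vulnerability follows from the law of total expectation, so a change in risk can be attributed either to a change in how often the threshold is crossed or to a change in how strongly the system responds.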

4 December 2020

Ruth King, (University of Edinburgh): To integrated models ... and beyond …

  • Capture-recapture studies are often conducted on wildlife populations in order to improve our understanding of the given species and/or for conservation and management purposes. Such studies involve the repeated observation of individuals over a period of time, and in many cases, over years and even decades. At initial capture each individual is uniquely marked (e.g. using a tag/ring/using natural markings) and released. The data then correspond to the set of observed individual capture histories, detailing at which capture occasions each observed individual is recorded. For standard capture-recapture models, it is well known that the survival probabilities correspond to apparent survival, with permanent migration and death confounded. We describe an integrated modelling approach, where in addition to the live resighting of individuals, we are also able to recover dead individuals, typically referred to as capture-recapture-recovery data. We show how such data permit the estimation of the dispersal (or migration) parameter and hence true survival probabilities. We describe the general modelling approach and, using sufficient statistics, calculate a corresponding efficient likelihood expression, which makes use of hidden Markov model-type ideas. We apply the approach to data from a colony of guillemots (Uria aalge) collected over the period 1992-2018. For these data there are physical challenges with the data collection process, with the aim of minimising the disturbance of the population in their natural habitat. Individuals are ringed as chicks on the beach below the breeding ledges, but future sightings are then based on visual sightings using long-range telescopes, which are only able to survey a small proportion of the colony, resulting in a partially monitored capture-recapture study. Thus, individuals who locate away from the monitored areas to breed become unobservable (guillemots are philopatric to their breeding location). We show that applying the standard capture-recapture (Cormack-Jolly-Seber) models leads to biased estimates of the survival parameters, which can be corrected when using the additional ring-recovery data. Finally we discuss some further ongoing challenges for these data.

22 January 2021

Luca Del (Groningen): Stochastic modelling of COVID-19 spread in Italy
  • Italy was particularly hard hit during the COVID-19 pandemic and the aim of this work is to describe the dynamics of infections within each region and the geographical spread of the virus over time. To this end we extend the standard SIRD framework by introducing pairwise interaction terms between the region-specific infected compartments. That is, we check how the increments of infected people in one region depend on those in another region. That information could be used as a proxy to measure how the epidemic spreads geographically. Furthermore, we make all the transition parameters dependent on time so as to capture potential changes in the epidemic spread due to the effect of external factors such as containment measures. We also include the number of tests performed as a predictor for the infection parameters. We then use the master equation as a probabilistic model to describe the evolution of the temporal process of the states and we focus on its first two moments by considering a local linear approximation. Finally we infer the corresponding parameters using a step-wise optimisation procedure which leads to the maximum likelihood estimate.
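For readers unfamiliar with the baseline that this talk extends, the following is a minimal sketch of a standard single-region, deterministic SIRD model integrated with a forward Euler scheme. It is not the speaker's stochastic, multi-region model: there are no interaction terms, no time-varying parameters, and no master equation. The function name `simulate_sird` and all rates and initial values are hypothetical.

```python
def simulate_sird(s0, i0, beta, gamma, mu, days, dt=0.1):
    """Forward Euler integration of a deterministic SIRD model.

    beta  : transmission rate
    gamma : recovery rate
    mu    : disease-induced death rate
    Returns the final (S, I, R, D) compartment sizes after `days`.
    """
    s, i, r, d = float(s0), float(i0), 0.0, 0.0
    n = s + i  # total population, constant across S + I + R + D
    for _ in range(int(round(days / dt))):
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        new_deaths = mu * i * dt
        s -= new_infections
        i += new_infections - new_recoveries - new_deaths
        r += new_recoveries
        d += new_deaths
    return s, i, r, d


# Hypothetical parameters: 1000 people, one initial infection, 60 days.
final = simulate_sird(s0=999, i0=1, beta=0.3, gamma=0.1, mu=0.01, days=60)
```

Every individual leaving one compartment enters another, so the four compartments always sum to the initial population; this is a convenient check on any extension of the scheme.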

29 January 2021

Agnieszka Borowska, (University of Glasgow)

  • Chemotaxis is a type of cell movement in response to a chemical stimulus which plays a key role in multiple biophysical processes, such as embryogenesis and wound healing, and which is crucial for understanding metastasis in cancer research. In the literature, chemotaxis has been modelled using biophysical models based on systems of nonlinear stochastic partial differential equations (NSPDEs), which are known to be challenging for statistical inference due to the intractability of the associated likelihood and the high computational costs of their numerical integration. Therefore, data analysis in this context has been limited to comparing predictions from NSPDE models to laboratory data using simple descriptive statistics. In my talk, I will present a statistically rigorous framework for parameter estimation in complex biophysical systems described by NSPDEs such as the one of chemotaxis. I will adopt a likelihood-free approach based on approximate Bayesian computations with sequential Monte Carlo (ABC-SMC) which allows for circumventing the intractability of the likelihood. To find informative summary statistics, crucial for the performance of ABC, I will discuss a Gaussian process (GP) regression model. The interpolation provided by the GP regression turns out useful on its own merits: it relatively accurately estimates the parameters of the NSPDE model and allows for uncertainty quantification, at a very low computational cost. The proposed methodology allows for a considerable part of computations to be completed before having observed any data, providing a practical toolbox to experimental scientists whose modes of operation frequently involve experiments and inference taking place at distinct points in time. In an application to externally provided synthetic data I will demonstrate that the correction provided by ABC-SMC is essential for accurate estimation of some of the NSPDE model parameters and for more flexible uncertainty quantification.

5 February 2021

Mihaela Paun (Glasgow): Assessing model mismatch and model selection in a Bayesian uncertainty quantification analysis of a fluid-dynamics model of pulmonary blood circulation

  • There have recently been impressive advancements in the mathematical modelling of complex cardio-vascular systems. However, parameter estimation and uncertainty quantification are still challenging problems in this field. In my talk, I will describe a study that uses Bayesian inference to quantify the uncertainty of model parameters and haemodynamic predictions in a one-dimensional pulmonary circulation model based on an integration of mouse haemodynamic and micro-computed tomography imaging data. I will discuss an often neglected, though important source of uncertainty: in the mathematical model form due to the discrepancy between the model and the reality, and in the measurements due to the wrong noise model (jointly called ‘model mismatch’). I will demonstrate that minimising the mean squared error between the measured and the predicted data (the conventional method) in the presence of model mismatch leads to biased and overly confident parameter estimates and haemodynamic predictions. I will show that the proposed method allowing for model mismatch, which is represented with Gaussian Processes, corrects the bias. Additionally, I will compare a linear and a non-linear wall model, as well as models with different vessel stiffness relations. I'll use formal model selection analysis based on the Watanabe Akaike Information Criterion to select the model that best predicts the pulmonary haemodynamics. Results show that the non-linear pressure-area relationship with stiffness dependent on the unstressed radius predicts best the data measured in a control mouse.

12 February 2021

Bernal Arzola (Groningen): Improved network reconstruction with shrinkage-based Gaussian graphical models

  • Gaussian graphical models (GGMs) are undirected network models where the nodes represent the random variables and the edges their partial correlations. GGMs are straightforward to implement, easy to interpret, and have an acceptable computational cost. These advantages have made GGMs popular for reconstructing gene regulatory networks from high-throughput data. In this talk, I will discuss the reconstruction of GGMs in the high-dimensional case (n << p). In this scenario, inference becomes challenging and requires regularization/shrinkage. I will present a novel approach to infer GGM structures based on the Ledoit-Wolf (LW) shrinkage, which accounts for the sample size, the number of variables, and the shrinkage value.

19 February 2021

Mu Niu (Glasgow): Intrinsic Gaussian processes on nonlinear manifolds and point clouds

  • We propose a class of intrinsic Gaussian processes (GPs) for interpolation and regression on manifolds, with a primary focus on complex constrained domains or irregularly shaped spaces arising as subsets or submanifolds of R, R^2, R^3 and beyond. For example, intrinsic GPs can accommodate spatial domains arising as complex subsets of Euclidean space. Intrinsic GPs respect the potentially complex boundary or interior conditions as well as the intrinsic geometry of the spaces. The key novelty of the proposed approach is to utilize the relationship between heat kernels and the transition density of Brownian motion on manifolds for constructing and approximating valid and computationally feasible covariance kernels.

26 February 2021

Sara Wade & Karla Monterrubio-Gomez (Edinburgh): On MCMC for variationally sparse Gaussian processes: A pseudo-marginal approach

  • Gaussian processes (GPs) are frequently used in machine learning and statistics to construct powerful models. However, when employing GPs in practice, important considerations must be made regarding the high computational burden, approximation of the posterior, form of the covariance function and inference of its hyperparameters. To address these issues, Hensman et al. (2015) combine variationally sparse GPs with Markov chain Monte Carlo (MCMC) to derive a scalable, flexible, and general framework for GP models. Nevertheless, the resulting approach requires intractable likelihood evaluations for many observation models. To bypass this problem, we propose a pseudo-marginal (PM) scheme that offers asymptotically exact inference as well as computational gains through doubly stochastic estimators for the intractable likelihood and large datasets. In complex models, the advantages of the PM scheme are particularly evident, and we demonstrate this on a two-level GP regression model with a nonparametric covariance function to capture non-stationarity.

5 March 2021

Robert Gramacy (Virginia Tech Department of Statistics): Replication or Exploration? Sequential Design for Stochastic Simulation Experiments

  • We investigate the merits of replication, and provide methods that search for optimal designs (including replicates), in the context of noisy computer simulation experiments. We first show that replication offers the potential to be beneficial from both design and computational perspectives, in the context of Gaussian process surrogate modeling. We then develop a lookahead-based sequential design scheme that can determine if a new run should be at an existing input location (i.e. replicate) or at a new one (explore). When paired with a newly developed heteroskedastic Gaussian process model, our dynamic design scheme facilitates learning of signal and noise relationships which can vary throughout the input space. We show that it does so efficiently, on both computational and statistical grounds. In addition to illustrative synthetic examples, we demonstrate performance on two challenging real-data simulation experiments, from inventory management and epidemiology.

12 March 2021

Chris Williams (Edinburgh): Multi-task Dynamical Systems

  • Time series datasets are often composed of a variety of sequences from the same domain, but from different entities, such as individuals, products, or organizations. We are interested in how time series models can be specialized to individual sequences (capturing the specific characteristics) while still retaining statistical power by sharing commonalities across the sequences. We describe the multi-task dynamical system (MTDS), a general methodology for extending multi-task learning (MTL) to time series models. Our approach endows dynamical systems with a set of hierarchical latent variables which can modulate all model parameters. To our knowledge, this is a novel development of MTL, and applies to time series both with and without control inputs. We apply the MTDS to motion-capture data of people walking in various styles using a multi-task recurrent neural network (RNN), and to patient drug-response data using a multi-task pharmacodynamic model. Joint work with Alex Bird and Chris Hawthorne.

19 March 2021

Ernst C. Wit (Università della Svizzera italiana): Causal regularization

  • Causality is the holy grail of science, but for millennia humankind has struggled to operationalize it efficiently. In recent decades, a number of more successful ways of dealing with causality in practice, such as propensity score matching, the PC algorithm and invariant causal prediction, have been introduced. However, approaches that use a graphical model formulation tend to struggle with the computational complexity whenever the system gets large. Finding the causal structure typically becomes a combinatorially hard problem. In our causal inference approach, we build on ideas present in invariant causal prediction and the causal Dantzig, replacing the combinatorial optimization with a continuous optimization using a form of causal regularization. This makes our method applicable to large systems. Furthermore, our method allows a precise formulation of the trade-off between in-sample and out-of-sample prediction error. This is joint work with Lucas Kania.

26 March 2021

Guido Sanguinetti (SISSA): Robustness and interpretability of Bayesian neural networks

  • Deep neural networks (DNNs) have surprised the world in the last decade with their successes in a number of difficult machine learning tasks. However, while their successes are now part of everyday life, DNNs also exhibit some profound weaknesses: chief amongst them, in my opinion, are their black-box nature and brittleness under adversarial attacks. In this talk, I will discuss a geometric perspective which sheds light on the origins of their vulnerability under adversarial attack, and also has considerable implications for their interpretability. I will also show how a Bayesian treatment of DNNs provably avoids adversarial weaknesses, and improves interpretability (in a saliency context).

    Refs: Carbone et al., NeurIPS 2020; Carbone et al., under review.

23 April 2021

Theodore Papamarkou - Challenges in Markov chain Monte Carlo for Bayesian neural networks

  • Markov chain Monte Carlo (MCMC) methods have not been broadly adopted in Bayesian neural networks (BNNs). This paper initially reviews the main challenges in sampling from the parameter posterior of a neural network via MCMC. Such challenges culminate in a lack of convergence to the parameter posterior. Nevertheless, this paper shows that a non-converged Markov chain, generated via MCMC sampling from the parameter space of a neural network, can yield via Bayesian marginalization a valuable predictive posterior of the output of the neural network. Classification examples based on multilayer perceptrons showcase highly accurate predictive posteriors. The postulate of limited scope for MCMC developments in BNNs is partially valid; an asymptotically exact parameter posterior seems less plausible, yet an accurate predictive posterior is a tenable research avenue. This is joint work with Jacob Hinkle, M. Todd Young and David Womble.

30 April 2021

Samuel Jackson - Understanding Scientific Processes via Sequential History Matching and Emulation of Computer Models

  • Computer models are essential for aiding the understanding of real-world processes of interest. History matching aims to find the set of all possible combinations of computer model input rate parameters which are not inconsistent with observed data, gathered from a collection of physical experiments, given all the sources of uncertainty involved with the model and the measurements. Analysis of this set permits understanding of the links between the model, the parameter space and experimental observations, thus allowing the model to be informative for the corresponding scientific process. Additional insight can be gained from sequential history matching - namely analysing how the sets of acceptable parameters change in accordance with successive sets of physical observations. We discuss how sequential history matching, often making use of statistical emulators (approximations) of the model, can be informative for many scientific analyses. In addition, this methodology naturally extends to aiding the choice of the most appropriate physical experiment for answering specific scientific questions of interest. We demonstrate our techniques on an important systems biology model of hormonal crosstalk in the roots of an Arabidopsis plant.

7 May 2021

Mitchel Colebank (North Carolina State University) - On the effects of vascular network size for hemodynamic parameter inference

  • Computational fluid dynamics (CFD) modeling is an emerging tool for understanding the prognosis and development of cardiovascular disease. Advances in image analysis and data acquisition have led to patient-specific CFD, whereby imaging data is directly used as the vascular domain for computational hemodynamic simulations. These techniques are underutilized in understanding pulmonary vascular disease, e.g., pulmonary hypertension (PH), which affects hundreds to thousands of pulmonary blood vessels. One-dimensional (1D) CFD models can predict wave propagation throughout a network of blood vessels, yet it is unclear how model predictions and parameter inference are affected by the size of the vascular network used. This talk investigates the effect of pulmonary arterial network size on model sensitivity, parameter inference, and model predictions in both normotensive and PH models, utilizing data from mice.

14 May 2021

Alen Alexanderian (North Carolina State University) - Optimal experimental design for inverse problems governed by PDEs with reducible model uncertainty

  • We consider the problem of optimal experimental design (OED) for infinite-dimensional Bayesian linear inverse problems governed by PDEs that contain model uncertainties, in addition to the uncertainty in the inversion parameters. The focus will be on the case where these additional (secondary) uncertainties are reducible; such parametric uncertainties can, in principle, be reduced through parameter estimation. We seek experimental designs that minimize the posterior uncertainty in the inversion parameters, while accounting for the uncertainty in the secondary parameters. To accomplish this, we derive a marginalized A-optimality criterion and present an efficient computational approach for its optimization. We illustrate our proposed methods in a problem involving inference of an uncertain time-dependent source in a contaminant transport model, with an uncertain initial state as secondary uncertainty. We will also discuss a sensitivity analysis framework that can help reduce the complexity of inversion and design of experiments under additional model uncertainties.

21 May 2021

Short Presentation Series - Statistical Inference and Uncertainty Quantification in Cardio-physiological Modelling
    This is a series of short presentations to showcase the work carried out at the new EPSRC-funded research hub "SoftMech-Set". 

    Dirk Husmeier - Overview of the Hub’s research remit

    Richard Clayton - Calibration and sensitivity analysis in cardiac electrophysiology

    Hao Gao - Parameter inference for myocardial constitutive laws based on cardiac magnetic resonance (CMR) images

    Yalei Yang - Bayesian hierarchical modelling for lesion detection from CMR scans

    Mihaela Paun - Haemodynamic modelling for detecting pulmonary hypertension

    Alan Lazarus - Parameter estimation and uncertainty quantification in cardiac mechanics

    David Dalton - Graph neural network emulation of cardiac mechanics

    Arash Rabbani - Quantification of cardiac endotypes in Covid-19

28 May 2021

Cian Scannell (King's College London) - Automating cardiac MRI
  • The first part of the talk will focus on addressing the main technical challenges of quantitative myocardial perfusion MRI. This will cover the problem of respiratory motion in the images and the use of dimension reduction techniques, such as robust principal component analysis, to mitigate this problem. I will then discuss our deep learning-based image processing pipeline that solves the series of computer vision tasks required for the blood flow modelling and introduce the Bayesian inference framework in which the kinetic parameter values are inferred from the imaging data. The second part of this talk will discuss some of the challenges of integrating deep learning models in clinical routine. This will cover recent work on building generalisable models that perform well when tested on unseen datasets acquired from distinct MRI scanners or clinical centres and work on learning in a data-efficient manner, without large training datasets, using synthesised data and physics-informed models.
4 June 2021
Ryan McClarren (University of Notre Dame) - Intrusive Uncertainty Quantification for Hyperbolic Equations
  • In this talk I will cover numerical techniques for quantifying parametric uncertainties in hyperbolic problems (e.g., fluid flow, kinetic models, etc.) where we change the underlying equations to include uncertain dimensions. Upon discretizing the resulting equations we obtain an approximation of the solution including the uncertainty in the solution. One aspect of these approaches is that the behavior of the solutions can change, such as shocks no longer being sharp. Additionally, when using a continuous basis for the uncertain dimensions such as polynomial chaos, the resulting equations can exhibit spurious oscillations. I will discuss approaches to tackle these oscillations and demonstrate that techniques developed in kinetic theory can be transferred to intrusive UQ. Numerical experiments will convey the effectiveness of these approaches.
11 June 2021 - Please note there is no seminar this week.
18 June 2021 - Please note there is no seminar this week.
25 June 2021

  • This talk describes recent joint work with Hau-Tieng Wu and Nan Wu. In nonparametric regression and spatial process modeling, it is common for the inputs to fall in a restricted subset of Euclidean space. For example, the locations at which spatial data are collected may be restricted to a narrow non-linear subset, such as near the edge of a lake. Typical kernel-based methods that do not take into account the intrinsic geometry of the domain across which observations are collected may produce sub-optimal results. In this talk, we focus on solving this problem in the context of Gaussian process (GP) models, proposing a new class of diffusion-based GPs (DB-GPs), which learn a covariance that respects the geometry of the input domain. We use the term ‘diffusion-based’ as the idea is to measure intrinsic distances between inputs in a restricted domain via a diffusion process. As the heat kernel is intractable computationally, we approximate the covariance using finitely many eigenpairs of the Graph Laplacian (GL). Our proposed algorithm has the same order of computational complexity as current GP algorithms using simple covariance kernels. We provide substantial theoretical support for the DB-GP methodology, and illustrate performance gains through toy examples, simulation studies, and applications to ecology data.
2 July 2021


  • Today, it is natural that great efforts are directed towards the development of tools to improve our knowledge about molecular interactions. The representation of biological systems as Genetic Regulatory Networks (GRN) that form a map of the interactions between the molecules in an organism is a way of representing such biological complexity. In the past few years, for simulation and inference purposes, many different mathematical and algorithmic models have been adopted to represent the GRN. Among these methods, Multiagent Systems (MAS) are somewhat neglected. Thus, in this paper, a Systematic Literature Review (SLR) was performed to clarify the use of MAS in the representation of GRN. The results show that there are very few studies in which MAS are applied to the task of modeling the GRN. Therefore, given the interesting properties of MAS, it is expected that they can be further investigated for the task of GRN modelling.
17 September 2021
Wei Zhang (School of Mathematics and Statistics, University of Glasgow) - Latent multinomial models for capture-recapture data with latent identification
  • Latent multinomial models (LMMs) are a class of models in which observed count data arise as a summary of an unobservable multinomial random variable. Bayesian Markov chain Monte Carlo methods have been well developed for fitting these models. However, one obvious limitation is that model fitting using these methods can take a long time, even for moderately sized data sets. In the first part of this talk, I will introduce a fast maximum likelihood estimation approach to fit LMMs, using an approximate likelihood constructed via the saddlepoint approximation. In the second part, I will introduce some recent applications of the LMM for modelling capture-recapture data that are often collected for wildlife surveys. The LMM can be particularly useful when detected individuals are not identified with certainty.
1 October 2021
Evan Baker (Exeter) - Emulating Stochastic Computer Models (and using Deterministic Models to do so).
  • Using fast statistical models (emulators) to predict the output of slower numerical models (simulators) is, at this point, a fairly well researched idea. It is usually assumed that these slow simulators are deterministic, with the simulation output always being the same if the same input settings are used. However, many simulators are in fact stochastic, with a random internal component leading to noisy outputs. In this talk we will review some general strategies for dealing with these stochastic simulators, and some challenges these models raise. We will also investigate how deterministic simulators can be used to improve stochastic emulators, using a case study example involving the engineering design of buildings.
15 October 2021

Rebecca Shipley and her group (University College London) - Collaborative Healthcare Innovation through Mathematics, EngineeRing and AI
  • Hospitals collect a wealth of physiological data that provide information on patient health. Full use of this data is significantly limited by its complexity and by a limited mechanistic understanding of the relationship between internal physiology and external measurement. Addressing this challenge requires multidisciplinary collaboration between mathematicians developing new biomechanical models, clinicians who measure and interpret the data to treat patients, and statistical and computational scientists to bridge the two-way translation between model output and real-life data. The talk will discuss new methods for relating physiology to real-time data and, finally, for translating these into practice, improving outcomes for patients by supporting clinical decision making.
29 October 2021

Jaline Geraldine (Northwestern University, Illinois, USA) - Mathematical modeling to inform policy: COVID-19 in Illinois
  • In 2020, the US state of Illinois assembled a modeling task force to help inform COVID-19 policy. We show how local modelers built transmission models to capture local trends and make short-term forecasts. We take a step back and ask where the state could have done better in preventing disparities and whether we could have known that we were underprepared for reopening. We describe a sentinel surveillance scheme for early warning of increasing trends and show what worked and what didn’t work in implementing this surveillance. Finally, we present some thoughts on lessons learned in engaging with public health officials during the pandemic.
12 November 2021

Michael Evans (University of Toronto, Canada) - The Concept of Statistical Evidence
  • The concept of statistical evidence has proven to be somewhat elusive in the development of the discipline of Statistics. Still there is a conviction that appropriately collected data contains evidence concerning the answers to questions of scientific interest. We discuss some of the attempts at making the concept of evidence precise and, in particular, present an approach based upon measuring how beliefs change from a priori to a posteriori. Of necessity this is Bayesian in nature as a proper prior is required that reflects beliefs about where the truth lies before the data is observed. Bayesian inference is often criticized for its subjective nature. It is possible, however, to deal with this subjectivity in a scientifically sound manner. In part, this is done by assessing and controlling the bias the prior and model induce into inferences and this depends intrinsically on being clear about statistical evidence. In addition, the model and the prior are falsifiable through model checking and checking for prior-data conflict. Both the assessment of bias and the falsification steps are essentially frequentist in nature so this provides a degree of unity between sometimes conflicting philosophies. This approach to statistical reasoning can be seen as dealing with the inevitable subjectivity required in the choice of ingredients to an analysis so that a statistical analysis can approach the goal of objectivity that is central to scientific work.
26 November 2021

Jiahua Chen (University of British Columbia, Vancouver, Canada) - Distributed Learning of Finite Gaussian Mixtures
  • Advances in information technology have led to extremely large datasets that are often kept in different storage centers. Existing statistical methods must be adapted to overcome the resulting computational obstacles while retaining statistical validity and efficiency. Split-and-conquer approaches have been applied in many areas, including quantile processes, regression analysis, principal eigenspaces, and exponential families. We study split-and-conquer approaches for the distributed learning of finite Gaussian mixtures. We recommend a reduction strategy and develop an effective MM algorithm. The new estimator is shown to be consistent and retains root-n consistency under some general conditions. Experiments based on simulated and real-world data show that the proposed split-and-conquer approach has statistical performance comparable to that of the global estimator based on the full dataset, if the latter is feasible. It can even slightly outperform the global estimator if the model assumption does not match the real-world data. It also has better statistical and computational performance than some existing methods.