## Virtual Seminar Series - Research and teaching in statistical and data sciences

This is the webpage for the Research and Teaching in Statistical and Data Sciences seminar series.

This diverse seminar series will highlight novel advances in methodology and application in statistics and data science, and will take the place of the University of Glasgow Statistics Group seminar during this period of remote working. We welcome all interested attendees at Glasgow and further afield.

Call details will be sent out 30 minutes before the start of each seminar.

### These seminars are recorded. All recordings can be found here.

The dates of the seminars and speakers are as follows:

### Next seminar

Thursday 22 October 16:00-17:00 (Please note that this seminar is on a Thursday and has a later start time)

Charlotte Jones-Todd (University of Auckland)

Title: Modelling systematic effects and latent phenomena in point referenced data.

Abstract: The spatial location and time of events (or objects) is the currency that point process statistics invests to estimate the drivers of the intensity, or rate of occurrence, of those events. Yet, the assumed processes giving rise to the observed data typically fail to represent the full complexity of the driving mechanisms. Ignoring spatial or temporal dependence between events leads to incorrect inference, as does assuming the wrong dependency structure. Latent Gaussian models are a flexible class of model that accounts for dependency structures in a hide-all-ills fashion; the stochastic structures in these models absorb and amalgamate the underlying, unaccounted-for mechanisms leading to the observed data. In this talk I will introduce this class of model and discuss recent work using latent Gaussian fields to model the fluctuations in the data that cannot otherwise be accounted for.

### Future Seminars

Friday 30 October 15:00-16:00

Theresa Smith (University of Bath)

Title: TBC

Abstract: TBC

Friday 6 November 15:00-16:00

Manuele Leonelli (IE University)

Title: TBC

Abstract: TBC

Friday 13 November 15:00-16:00

Glenna Nightingale (University of Edinburgh)

Title: TBC

Abstract: TBC

Friday 20 November 15:30-16:30 (Please note the later start time for this seminar)

Nicole Augustin (University of Edinburgh)

Title: TBC

Abstract: TBC

Friday 27 November 15:00-16:00

Mark Brewer (BIOSS)

Title: TBC

Abstract: TBC

Friday 4 December 15:00-16:00

Ruth King (University of Edinburgh)

Title: TBC

Abstract: TBC

### Previous Seminars

23 April 2020, 10am:

Neil Chada (National University of Singapore)

Title: Advancements of non-Gaussian random fields for statistical inversion

Abstract: Developing informative priors for Bayesian inverse problems is an important direction, which can help quantify information on the posterior. In this talk we introduce a new class of priors for inversion based on $\alpha$-stable sheets, which incorporate multiple known processes such as a Gaussian and Cauchy process. We analyze various convergence properties, which are achieved through the different representations these sheets can take. Other aspects we wish to address are well-posedness of the inverse problem and finite-dimensional approximations. To complement the analysis we provide some connections with machine learning, which will allow us to use sampling-based MCMC schemes. We will conclude the talk with some numerical experiments, highlighting the robustness of the established connection, on various inverse problems arising in regression and PDEs.

14 May 2020, 2pm

Title: Consensus clustering based on pivotal methods

Abstract: Despite its widespread use, one major limitation of the K-means clustering algorithm is its sensitivity to the initial seeding used to produce the final partition. We propose a modified version of the classical approach, which exploits the information contained in a co-association matrix obtained from clustering ensembles. Our proposal is based on the identification of a set of data points (pivotal units) that are representative of the group they belong to. The presented approach can thus be viewed as a possible strategy to perform consensus clustering. The selection of pivotal units was originally employed for solving the so-called label-switching problem in Bayesian estimation of finite mixture models. Different criteria for identifying the pivots are discussed and compared. We investigate the performance of the proposed algorithm via simulation experiments and comparison with other consensus methods available in the literature.
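As a rough illustration of the two ingredients named in this abstract, the sketch below builds a co-association matrix from a small ensemble of partitions and then picks one pivotal unit per group. The three partitions, the function names and the selection rule (largest total co-association with the members of the unit's own group) are assumptions made for this example; the abstract does not say which criteria the speaker compares.

```python
def co_association(partitions):
    """Fraction of partitions in which each pair of units shares a cluster."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[counts[i][j] / len(partitions) for j in range(n)] for i in range(n)]


def pick_pivots(c, reference):
    """Pick one pivot per group of a reference partition.

    Hypothetical criterion: within each group, choose the unit whose total
    co-association with the members of that group is largest.
    """
    pivots = {}
    for group in sorted(set(reference)):
        members = [i for i, lab in enumerate(reference) if lab == group]
        pivots[group] = max(members, key=lambda i: sum(c[i][j] for j in members))
    return pivots


# Made-up ensemble: three partitions of six units. Label values are arbitrary
# in each partition, so only "same cluster or not" is used.
ensemble = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 0],
    [1, 0, 0, 1, 0, 0],
]
c = co_association(ensemble)
pivots = pick_pivots(c, reference=ensemble[0])
print(pivots)
```

Because the matrix depends only on whether two units share a cluster, it is unaffected by label switching across the ensemble. The selected units could then serve as initial seeds for a K-means run.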

21 May 2020, 2pm

Ana Basiri (UCL)

Title: Who Are the "Crowd"? Learning from Large but Patchy Samples

Abstract: This talk will look at the challenges of crowdsourced/self-reporting data, such as missingness and biases in ‘new forms of data’ and consider them as a useful source of data itself. A few applications and examples of these will be discussed, including extracting the 3D map of cities using the patterns of blockage, reflection, and attenuation of the GPS signals (or other similar signals), that are contributed by the volunteers/crowd. In the era of big data, open data, social media and crowdsourced data when “we are drowning in data”, gaps and unavailability, representativeness and bias issues associated with them may indicate some hidden problems or reasons allowing us to understand the data, society and cities better.

4 June 2020, 2pm (BST)

Colin Gillespie (University of Newcastle)

Title: Getting the most out of other people's R sessions.

Abstract: Have you ever wondered how you could hack other people's R sessions? Well, I did, and discovered that it wasn't that hard! In this talk, I discuss a few ways I got people to run arbitrary, and hence very dangerous, R scripts. This is certainly worrying now that we have all moved to working from home.

18 June 2020, 2pm (BST)

Jo Eidsvik (NTNU)

Title: Autonomous Oceanographic Sampling Designs Using Excursion Sets for Multivariate Gaussian random fields

Abstract: Improving and optimizing oceanographic sampling is a crucial task for marine science and maritime management. Faced with limited resources to understand processes in the water-column, the combination of statistics and autonomous robotics provides new opportunities for experimental designs. In this work we develop methods for efficient spatial sampling applied to the mapping of coastal processes by providing informative descriptions of spatial characteristics of ocean phenomena. Specifically, we define a design criterion based on improved characterization of the uncertainty in the excursions of vector-valued Gaussian random fields, and derive tractable expressions for the expected Bernoulli variance reduction in such a framework. We demonstrate how this criterion can be used to prioritize sampling efforts at locations that are ambiguous, making exploration more effective. We use simulations to study the properties of methods and to compare them with state-of-the-art approaches, followed by results from field deployments with an autonomous underwater vehicle as part of a case study mapping the boundary of a river plume. The results demonstrate the potential of combining statistical methods and robotic platforms to effectively inform and execute data-driven environmental sampling.
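To give a feel for why a Bernoulli variance marks "ambiguous" locations, the sketch below works through a much simplified, pointwise version of the idea. It assumes a single Gaussian variable per location rather than the vector-valued fields of the talk, and it computes the Bernoulli variance itself, not the expected variance reduction derived in the work. The candidate locations, means, standard deviations and threshold are all made up.

```python
import math


def excursion_probability(mean, sd, threshold):
    """P(X > threshold) for X ~ N(mean, sd^2)."""
    z = (threshold - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def bernoulli_variance(p):
    """Variance of the indicator of an event with probability p."""
    return p * (1.0 - p)


# Hypothetical predictive (mean, sd) at three candidate sampling locations.
candidates = {"A": (9.0, 1.0), "B": (10.2, 1.0), "C": (13.0, 1.0)}
threshold = 10.0

scores = {
    name: bernoulli_variance(excursion_probability(mean, sd, threshold))
    for name, (mean, sd) in candidates.items()
}
# The location whose excursion probability is closest to 0.5 scores highest.
most_ambiguous = max(scores, key=scores.get)
print(most_ambiguous, scores)
```

The variance p(1 - p) peaks at p = 0.5, so locations whose predicted value sits near the threshold, such as B here, are the ones where it is least clear whether the excursion occurs.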

9 July 2020, 3pm (BST)

Vianey Leos-Barajas (NCSU)

Title: Spatially-coupled hidden Markov models for short-term wind speed forecasting

Abstract: Hidden Markov models (HMMs) provide a flexible framework to model time series data where the observation process, $Y_t$, is taken to be driven by an underlying latent state process, $Z_t$. In this talk, we will focus on discrete-time, finite-state HMMs as they provide a flexible framework that facilitates extending the basic structure in many interesting ways. HMMs can accommodate multivariate processes by (i) assuming that a single state governs the $M$ observations at time $t$, (ii) assuming that each observation process is governed by its own HMM, irrespective of what occurs elsewhere, or (iii) a balance between the two, as in the coupled HMM framework. Coupled HMMs assume that a collection of $M$ observation processes is governed by its respective $M$ state processes. However, the $m$th state process at time $t$, $Z_{m,t}$, depends not only on $Z_{m,t-1}$ but also on the collection of the other state processes, $Z_{-m,t-1}$. We introduce spatially-coupled hidden Markov models whereby the state processes interact according to an imposed spatial structure and the observations are collected at $S$ spatial locations. We outline an application (in progress) to short-term forecasting of wind speed using data collected across multiple wind turbines at a wind farm.
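The three ways of handling multivariate processes listed in this abstract differ in what each state transition is allowed to depend on. Written with subscripts, and assuming first-order Markov state processes (the usual HMM convention, not stated explicitly in the abstract), the transition distributions can be sketched as:

$$
\begin{aligned}
\text{(i) single shared state:}\quad & \Pr(Z_t \mid Z_{t-1}),\\
\text{(ii) independent HMMs:}\quad & \Pr(Z_{m,t} \mid Z_{m,t-1}), && m = 1,\dots,M,\\
\text{(iii) coupled HMM:}\quad & \Pr(Z_{m,t} \mid Z_{m,t-1},\, Z_{-m,t-1}), && m = 1,\dots,M.
\end{aligned}
$$

Here $Z_{-m,t-1}$ denotes the states of all processes other than the $m$th at time $t-1$.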

6 August 2020, 2pm (BST)

Helen Ogden (University of Southampton)

Title: Towards More Flexible Models for Count Data

Abstract: Count data are widely encountered across a range of application areas, including medicine, engineering and ecology. Many of the models used for the statistical analysis of count data are quite simple and make strong assumptions about the data generating process, and it is common to encounter situations in which these models fail to fit data well. I will review various existing models for count data, and describe some simple scenarios where each of these models fails. I will describe current work on an extension to existing Poisson mixture models, and demonstrate the performance of this new class of models in some simple examples.

Please note this seminar will not be recorded

Thursday 17 September 10:00-11:00 (please note this is a Thursday seminar)

Andrew Zammit Mangion (University of Wollongong)

Title: Statistical Machine Learning for Spatio-Temporal Forecasting

Abstract: Conventional spatio-temporal statistical models are well-suited for modelling and forecasting using data collected over short time horizons. However, they are generally time-consuming to fit, and often do not realistically encapsulate temporally-varying dynamics. Here, we tackle these two issues by using a deep convolutional neural network (CNN) in a hierarchical statistical framework, where the CNN is designed to extract process dynamics from the process' most recent behaviour. Once the CNN is fitted, probabilistic forecasting can be done extremely quickly online using an ensemble Kalman filter with no requirement for repeated parameter estimation. We conduct an experiment where we train the model using 13 years of daily sea-surface temperature data in the North Atlantic Ocean. Forecasts are seen to be accurate and calibrated. A key advantage of the approach is that the CNN provides a global prior model for the dynamics that is realistic, interpretable, and computationally efficient to forecast with. We show the versatility of the approach by successfully producing 10-minute nowcasts of weather radar reflectivities in Sydney using the same model that was trained on daily sea-surface temperature data in the North Atlantic Ocean. This is joint work with Christopher Wikle, University of Missouri.

Friday 25 September 15:00-16:00

Ed Hill (University of Warwick)

Title: Predictions of COVID-19 dynamics in the UK: short-term forecasting, analysis of potential exit strategies and impact of contact networks

Abstract: Regarding the future course of the COVID-19 outbreak in the UK, mathematical models have provided, and continue to provide, short and long term forecasts to support evidence-based policymaking. We present a deterministic, age-structured transmission model for SARS-CoV-2 that uses real-time data on confirmed cases requiring hospital care and mortality to provide predictions on epidemic spread in ten regions of the UK. The model captures a range of age-dependent heterogeneities, reduced transmission from asymptomatic infections and is fit to the key epidemic features over time. We illustrate how the model has been used to generate short-term predictions and assess potential lockdown exit strategies. As steps are taken to relax social distancing measures, questions also surround the ramifications on community disease spread of workers returning to the workplace and students returning to university. To study these aspects, we present a network model to capture the transmission of SARS-CoV-2 over overlapping sets of networks in household, social and work/study settings.

Friday 2 October 15:00-16:00

Eleni Matechou (University of Kent)

Title: Environmental DNA as a monitoring tool at a single and multi-species level

Abstract: Environmental DNA (eDNA) is a survey tool with rapidly expanding applications for assessing presence of a wildlife species at surveyed sites. eDNA surveys consist of two stages: stage 1, when a sample is collected from a site, and stage 2, when the sample is analysed in the lab for presence of species' DNA. The methods were originally developed to target particular species (single-species), but can now be used to obtain a list of species at each surveyed site (multi-species/metabarcoding). In this talk, I will present a novel Bayesian model for analysing single-species eDNA data, while accounting for false positive and false negative errors, which are known to be non-negligible, in both stages of eDNA surveys. All model parameters can be modelled as functions of covariates and the proposed methodology allows us to perform efficient Bayesian variable selection that does not require the use of trans-dimensional algorithms. I will also discuss joint species distribution models as the starting point for modelling multi-species eDNA data and will outline the next steps required to obtain a unifying modelling framework for eDNA surveys.

Friday 9 October 15:00-16:00

Daniela Castro Camilo (University of Glasgow)

Title: Bayesian space-time gap filling for inference on extreme hot-spots: an application to Red Sea surface temperatures

Abstract: We develop a method for probabilistic prediction of extreme value hot-spots in a spatio-temporal framework, tailored to big datasets containing important gaps. In this setting, direct calculation of summaries from data, such as the minimum over a space-time domain, is not possible. To obtain predictive distributions for such cluster summaries, we propose a two-step approach. We first model marginal distributions with a focus on accurate modeling of the right tail and then, after transforming the data to a standard Gaussian scale, we estimate a Gaussian space-time dependence model defined locally in the time domain for the space-time subregions where we want to predict. In the first step, we detrend the mean and standard deviation of the data and fit a spatially resolved generalized Pareto distribution to apply a correction of the upper tail. To ensure spatial smoothness of the estimated trends, we either pool data using nearest-neighbor techniques, or apply generalized additive regression modeling. To cope with high space-time resolution of data, the local Gaussian models use a Markov representation of the Matérn correlation function based on the stochastic partial differential equations (SPDE) approach. In the second step, they are fitted in a Bayesian framework through the integrated nested Laplace approximation implemented in R-INLA. Finally, posterior samples are generated to provide statistical inferences through Monte-Carlo estimation. Motivated by the 2019 Extreme Value Analysis data challenge, we illustrate our approach to predict the distribution of local space-time minima in anomalies of Red Sea surface temperatures, using a gridded dataset (11,315 days, 16,703 pixels) with artificially generated gaps. In particular, we show the improved performance of our two-step approach over a purely Gaussian model without tail transformations.

Friday 16 October 15:00-16:00

Daniel Lawson (University of Bristol)

Title: CLARITY - Comparing heterogeneous data using dissimiLARITY

Abstract: Integrating datasets from different disciplines is hard because the data are often qualitatively different in meaning, scale, and reliability. When two datasets describe the same entities, many scientific questions can be phrased around whether the similarities between entities are conserved. Our method, CLARITY, quantifies consistency across datasets, identifies where inconsistencies arise, and aids in their interpretation. We explore three diverse comparisons: Gene Methylation vs Gene Expression, evolution of language sounds vs word use, and country-level economic metrics vs cultural beliefs. The non-parametric approach is robust to noise and differences in scaling, and makes only weak assumptions about how the data were generated. It operates by decomposing similarities into two components: the 'structural' component analogous to a clustering, and an underlying 'relationship' between those structures. This allows a 'structural comparison' between two similarity matrices using their predictability from 'structure'. This presentation describes work presented in arXiv:2006.00077, with accompanying software; the work can be found here.

This seminar series is supported as part of the ICMS Online Mathematical Sciences Seminars.