Statistics and Data Science Seminar

Department of Mathematics and Statistics




The Statistics & Data Science Seminar is hosted by the Department of Mathematics and Statistics and provides a weekly platform for academics and researchers from different domains to present and discuss problems and solutions regarding data collection, management and analysis.


  

Spring 2021 Seminars

Welcome to the Spring 2021 Seminar series! The seminars will take place over Zoom on Thursdays at 2pm CDT. For any questions or requests, please contact Roberto Molinari. The list of speakers for this series can be found in the table below, followed by each seminar's title and abstract as its date approaches.

 

Speaker | Institution | Date
Xinyi Li | Clemson University | Jan. 28 (2pm CDT)
Steven Nixon | Penn State University | Feb. 4 (2pm CDT)
Mucyo Karemera | University of Geneva | Feb. 11 (2pm CDT)
Shuoyang Wang | Auburn University | Feb. 18 (2pm CDT)
Antony Pearson | Auburn University | Feb. 25 (2pm CDT)
Hans-Werner van Wyk | Auburn University | Mar. 4 (2pm CDT)
Michael A. Alcorn | Auburn University | Mar. 11 (2pm CDT)
Andrea Apolloni | CIRAD (France) | Mar. 18 (2pm CDT)
Marco Avella-Medina | Columbia University | Mar. 25 (2pm CDT)
Dave Zhao | University of Illinois at Urbana-Champaign | Apr. 1 (2pm CDT)
Debashis Mondal | Oregon State University | Apr. 8 (2pm CDT)
Mikhail Zhelonkin | Erasmus University Rotterdam | Apr. 15 (2pm CDT)

 



 

 

Xinyi Li


Title: Sparse Learning and Structure Identification for Ultrahigh-Dimensional Image-on-Scalar Regression

Abstract: We consider high-dimensional image-on-scalar regression, where the spatial heterogeneity of covariate effects on imaging responses is investigated via a flexible partially linear spatially varying coefficient model. To tackle the challenges of spatial smoothing over the imaging response’s complex domain, which consists of regions of interest, we approximate the spatially varying coefficient functions via bivariate spline functions over triangulation. We first study estimation when the active constant coefficients and varying coefficient functions are known in advance. We then develop a unified approach for simultaneous sparse learning and model structure identification in the presence of ultrahigh-dimensional covariates. Our method can identify zero, nonzero constant, and spatially varying components correctly and efficiently. The estimators of the constant coefficients and the varying coefficient functions are consistent, and the constant coefficient estimators are additionally asymptotically normal. The method is evaluated by Monte Carlo simulation studies and applied to a dataset provided by the Alzheimer’s Disease Neuroimaging Initiative.

Recording

 

Steven Nixon


Title: Condition Based Maintenance in the Big Data Era

Abstract: Operating and maintaining the equipment necessary to make everything from electricity to toilet paper is a complex undertaking, and the last thing a plant manager needs is a machine breaking down unexpectedly. One strategy for avoiding these expensive shutdowns is condition based maintenance (CBM), the art and science of predicting machinery failures before they happen. By collecting the right data and doing the right analyses, any number of different breakdowns can be detected long before they actually force a machine to stop working. As the internet of things spreads throughout the world's manufacturing facilities, plant managers are turning to statistics and machine learning more and more often to draw conclusions about the condition of their equipment. Applying modern data science to these sources isn’t straightforward: although enormous amounts of data are being generated, they rarely arrive as fully dense, rectangular matrices. Successful deployment requires high levels of clarity to win over skeptical maintenance personnel, and there is very little cross-over between the physics-based maintenance and engineering world and the fast-moving data science community. Successfully bridging these differences, however, would represent a massive leap forward for manufacturing worldwide.

Recording

 

Mucyo Karemera


Title: A General Approach for Simulation-based Bias Correction in High Dimensional Settings

Abstract: An important challenge in statistical analysis lies in controlling the bias of estimators due to the ever-increasing data size and model complexity. Approximate numerical methods and data features like censoring and misclassification often result in analytical and/or computational challenges when implementing standard estimators. As a consequence, consistent estimators may be difficult to obtain, especially in complex and/or high dimensional settings. In this talk, I will present a general simulation-based estimation framework that allows one to construct bias-corrected, consistent estimators. This approach leads, under more general conditions, to stronger bias correction properties compared to alternative methods. Besides its bias correction advantages, the considered method can be used as a simple strategy to construct consistent estimators in settings where alternative methods may be challenging to apply. Moreover, it can be easily implemented and is computationally efficient. These theoretical results will be highlighted with simulation studies of various commonly used models.
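The simulation-based bias-correction idea can be illustrated with a deliberately simple toy case (this is a hedged sketch, not the speaker's method): the variance MLE divides by n and is therefore biased downward by a factor of (n - 1)/n. An iterative-bootstrap-style correction searches for the parameter value whose simulated naive estimates match, on average, the observed naive estimate. All function names and tuning values below are hypothetical.

```python
import random

random.seed(0)

def naive_var(xs):
    # MLE of the variance (divides by n): biased downward by (n - 1) / n.
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def bias_corrected(theta_hat, n, n_sim=2000, n_iter=20):
    """Iterative-bootstrap correction: find theta whose simulated naive
    estimates average out to the observed naive estimate theta_hat."""
    theta = theta_hat
    for _ in range(n_iter):
        sims = [naive_var([random.gauss(0.0, theta ** 0.5) for _ in range(n)])
                for _ in range(n_sim)]
        theta += theta_hat - sum(sims) / n_sim  # shrink the mismatch to zero
    return theta

n = 10
data = [random.gauss(0.0, 2.0) for _ in range(n)]  # true variance = 4
naive = naive_var(data)
corrected = bias_corrected(naive, n)
```

Since the naive estimator undershoots, the corrected value ends up roughly n/(n - 1) times larger, matching the known analytical correction in this toy case; the appeal of the simulation-based framework is that the same recipe applies when no such closed form exists.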

Recording

 

Shuoyang Wang


Title: Estimation of the Mean Function of Functional Data via Deep Neural Networks

Abstract: In this work, we propose a deep neural network-based method to perform nonparametric regression for functional data. The proposed estimators are based on sparsely connected deep neural networks with the ReLU activation function. We provide the convergence rate of the proposed deep neural network estimator in terms of the empirical norm. We discuss how to properly select the architecture parameters by cross-validation. Through Monte Carlo simulation studies, we examine the finite-sample performance of the proposed method. Finally, the proposed method is applied to analyze positron emission tomography images of patients with Alzheimer's disease obtained from the Alzheimer's Disease Neuroimaging Initiative database.
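As a minimal sketch of the underlying building block (not the proposed sparse deep estimator), a one-hidden-layer ReLU network can be trained by full-batch gradient descent to estimate a smooth mean function from noisy observations. The network width, learning rate, and target function below are illustrative choices only.

```python
import math, random

random.seed(1)

# Toy data: noisy observations of a smooth mean function on [0, 1].
xs = [i / 50 for i in range(51)]
ys = [math.sin(2 * math.pi * x) + random.gauss(0, 0.1) for x in xs]

H = 16  # hidden units in a single ReLU layer
w1 = [random.gauss(0, 1) for _ in range(H)]
b1 = [random.gauss(0, 1) for _ in range(H)]
w2 = [random.gauss(0, 0.1) for _ in range(H)]
b2 = 0.0

def predict(x):
    return sum(w2[j] * max(0.0, w1[j] * x + b1[j]) for j in range(H)) + b2

def mse():
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

initial_loss = mse()
lr, n = 0.05, len(xs)
for _ in range(1000):
    # Accumulate full-batch gradients of the mean squared error.
    g1, gb1, g2 = [0.0] * H, [0.0] * H, [0.0] * H
    gb2 = 0.0
    for x, y in zip(xs, ys):
        hid = [max(0.0, w1[j] * x + b1[j]) for j in range(H)]
        err = sum(w2[j] * hid[j] for j in range(H)) + b2 - y
        gb2 += err
        for j in range(H):
            g2[j] += err * hid[j]
            if hid[j] > 0:  # ReLU passes gradient only on its active side
                g1[j] += err * w2[j] * x
                gb1[j] += err * w2[j]
    b2 -= lr * 2 * gb2 / n
    for j in range(H):
        w2[j] -= lr * 2 * g2[j] / n
        w1[j] -= lr * 2 * g1[j] / n
        b1[j] -= lr * 2 * gb1[j] / n

final_loss = mse()
```

The sparsity and depth that drive the convergence-rate results in the talk are omitted here; the sketch only shows the ReLU regression mechanics.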

Recording

 

Antony Pearson


Title: Quantifying Structure within Unstructured Symbolic Data

Abstract: Modern biological research is epitomized by "omics" experiments, which produce millions to billions of symbolic outcomes in the form of reads (i.e., DNA sequences of a few dozen to a few hundred nucleotides). Unfortunately, these intrinsically non-numerical datasets are often highly contaminated, and the possible sources of contamination are usually poorly characterized. The latter contrasts with continuous datasets, where it is often well-justified to assume that the distribution of contaminating samples is Gaussian. To overcome hurdles associated with these data, I will introduce the notion of "latent weights", which measure the largest expected fraction of samples from a contaminated probabilistic source that conforms to a model in a well-structured class of desired models. As proof of concept, I use latent weights to reevaluate a long-standing assumption used in most modern DNA methylation analysis.

Recording

 

Hans-Werner van Wyk


Title: Stochastic Optimization in Data Analysis and Design under Uncertainty

Abstract: Questions of estimation, design, and control in systems subject to uncertainty can often be formulated as deterministic optimization problems, where the quantity to be minimized is some statistic related to the model output. The idea of incorporating statistical sampling efficiently into the optimization loop has been fundamental to the development of feasible optimization methods in large-scale data applications. In this talk I will give a brief overview of stochastic optimization methods, the factors that determine their convergence, and their applications to stochastic tensor decomposition and to the optimal control of uncertain systems governed by partial differential equations.
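The core idea of incorporating sampling into the optimization loop can be sketched with the simplest possible example (a toy illustration, not the talk's PDE-constrained setting): minimizing the risk E[(theta - Z)^2] when only noisy gradients from sampled Z are available, using a Robbins-Monro decaying step size. The distribution and step schedule are hypothetical choices.

```python
import random

random.seed(2)

# Minimize E[(theta - Z)^2] over theta, where Z ~ N(1, 0.5^2); the exact
# minimizer is E[Z] = 1, but only sampled gradients 2 * (theta - z) are used.
theta = 0.0
for k in range(1, 5001):
    z = random.gauss(1.0, 0.5)
    step = 1.0 / k            # Robbins-Monro schedule: sum(step) diverges,
    theta -= step * 2 * (theta - z)  # sum(step^2) converges
```

The step-size schedule is exactly the kind of convergence-determining factor the talk surveys: a constant step would leave theta oscillating around the minimizer, while the 1/k decay averages the noise away.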

Recording

 

Michael A. Alcorn


Title: baller2vec: A Multi-Entity Transformer for Multi-Agent Spatiotemporal Modeling

Abstract: Multi-agent spatiotemporal modeling, e.g., forecasting the trajectories of basketball players during a game, is a challenging task from both an algorithmic design and computational complexity perspective. Recent work has explored the efficacy of traditional deep sequential models in this domain, but these architectures are slow and cumbersome to train, particularly as model size increases. Further, prior attempts to model interactions between agents across time have limitations, such as imposing an order on the agents, or making assumptions about their relationships. In this presentation, I will introduce baller2vec, a multi-entity generalization of the standard Transformer that, with minimal assumptions, can simultaneously and efficiently integrate information across entities and time. To test the effectiveness of baller2vec for multi-agent spatiotemporal modeling, we trained it to perform two different basketball-related tasks: (1) simultaneously forecasting the trajectories of all players on the court and (2) forecasting the trajectory of the ball. Not only does baller2vec learn to perform these tasks well, it also appears to "understand" the game of basketball, encoding idiosyncratic qualities of players in its embeddings, and performing basketball-relevant functions with its attention heads. In addition to discussing some baller2vec results, I will review two fundamental deep learning concepts behind the Transformer architecture: "embeddings" and "attention".
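The "attention" concept reviewed in the talk reduces, in its scaled dot-product form, to a few lines of code: each query scores all keys, the scores are softmax-normalized, and the output is a weighted average of the values. This is a generic textbook sketch with hypothetical dimensions, not the baller2vec multi-entity variant.

```python
import math, random

random.seed(3)

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        weights = softmax(scores)  # one convex combination per query
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

T, d = 4, 8  # e.g., 4 entities, embedding dimension 8 (hypothetical sizes)
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
out = attention(Q, K, V)
```

Because the weights for each query sum to one, every output row is a convex combination of the value rows, which is what lets attention heads mix information across entities and time steps.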

Recording | Slides | Paper | GitHub Repo

 

Andrea Apolloni


Title: Modelling and Predicting National and Regional Animal Mobility in North/West Africa

Abstract: The trade of live animals is one of the main economic activities in most West and North African countries. Due to the absence of infrastructure, animals are sold alive at local markets to traders and then moved to capital or coastal cities, where they are slaughtered and butchered. In general, the consumption and production areas are several hundred kilometers apart. The possibility of providing a reliable picture of livestock mobility in the area is hindered by the fact that few quantitative data are collected. In this talk, I present the results of an analysis of ruminant mobility data provided by veterinary services in West and North African countries. Using gravity models, we found that possible mobility drivers include environmental factors (conditioning the availability of natural resources), commercial factors (demand and market price), economic factors (GDP differences between producer and consumer areas), and social factors such as religious festivities like the Tabaski celebration. To conclude, I will present the application of this approach to two case studies: the diffusion of genetic strains in the area and the risk of bluetongue occurrence in Senegal.
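A basic gravity model of the kind named in the abstract predicts flows that grow with the "masses" of the origin and destination (e.g., herd sizes or demand) and decay with distance. The sketch below uses entirely hypothetical sites, masses, and exponents; the talk's models additionally incorporate environmental, economic, and social covariates.

```python
import math

def gravity_flow(m_i, m_j, dist_km, k=1.0, alpha=1.0, beta=1.0, gamma=2.0):
    # Flow ~ k * m_i^alpha * m_j^beta / distance^gamma (toy parameterization)
    return k * (m_i ** alpha) * (m_j ** beta) / (dist_km ** gamma)

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Hypothetical sites: name -> (herd size, coordinates in km)
sites = {"A": (5000, (0, 0)), "B": (20000, (300, 0)), "C": (8000, (0, 100))}

flows = {(a, b): gravity_flow(sites[a][0], sites[b][0],
                              dist(sites[a][1], sites[b][1]))
         for a in sites for b in sites if a != b}
```

In practice the exponents and the distance-decay term are fitted to observed movement records (here, the veterinary-service data), and extra multiplicative terms capture drivers such as market prices or festival timing.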

Recording

 

Marco Avella-Medina


Title: Differentially Private Inference via Noisy Optimization

Abstract: We propose a general optimization-based framework for computing differentially private M-estimators and a new method for the construction of differentially private confidence regions. Firstly, we show that robust statistics can be used in conjunction with noisy gradient descent and noisy Newton methods in order to obtain optimal private estimators with global linear or quadratic convergence, respectively. We establish global convergence guarantees, under both local strong convexity and self-concordance, showing that our private estimators converge with high probability to a neighborhood of the non-private M-estimators. The radius of this neighborhood is nearly optimal in the sense that it corresponds to the statistical minimax cost of differential privacy up to a logarithmic term. Secondly, we tackle the problem of parametric inference by constructing differentially private estimators of the asymptotic variance of our private M-estimators. This naturally leads to the use of approximate pivotal statistics for the construction of confidence regions and hypothesis testing. We demonstrate the effectiveness of a bias correction that leads to enhanced small-sample empirical performance in simulations. We illustrate the benefits of our methods with synthetic numerical examples and real data.
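The noisy-gradient-descent template can be caricatured on the simplest M-estimation problem, a private mean: clip each per-sample gradient to bound its sensitivity, average, add Gaussian noise, and step. This hedged sketch uses arbitrary clipping and noise levels; it deliberately omits the calibration of the noise scale to an (epsilon, delta) privacy budget and the robustness machinery that the talk develops.

```python
import random

random.seed(4)

# Toy data with true center 5.0 (hypothetical).
data = [random.gauss(5.0, 1.0) for _ in range(500)]
theta, clip, sigma, lr = 0.0, 2.0, 0.05, 0.5

for _ in range(100):
    # Per-sample gradient of (theta - x)^2 / 2 is (theta - x); clip it so a
    # single record's influence on the average is bounded by clip / n.
    grads = [max(-clip, min(clip, theta - x)) for x in data]
    noisy_grad = (sum(grads) / len(data)
                  + random.gauss(0, sigma * clip / len(data)))
    theta -= lr * noisy_grad
```

The clipping step is where robust statistics enters naturally: a bounded gradient is both what differential privacy needs for finite sensitivity and what robustness needs for a bounded influence function.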

Recording | Paper

 

Dave Zhao


Title: Perfect is the Enemy of Good: New Shrinkage Estimators for Genomics

Abstract: Simultaneous estimation problems have a long history in statistics and have become especially common and important in genomics research: modern technologies can simultaneously assay tens of thousands to even millions of genomic features, each of which can introduce an unknown parameter of interest. These applications reveal some conceptual and methodological gaps in the standard empirical Bayes approach to simultaneous estimation. This talk summarizes standard approaches, illustrates some of their difficulties, and introduces an alternative approach based on regression modeling, along with new estimators that can be applied to gene expression denoising, coexpression network reconstruction, and large-scale gene expression imputation.
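The classical starting point for simultaneous estimation is James-Stein shrinkage, which the standard empirical Bayes approaches generalize (the talk's regression-based estimators go further; this sketch only illustrates the baseline phenomenon with simulated, hypothetical data): when many means are estimated at once, shrinking the raw observations beats using them directly.

```python
import random

random.seed(5)

p = 100
mu = [0.0] * p                           # true means (all zero here)
x = [random.gauss(m, 1.0) for m in mu]   # one noisy observation per mean

# Positive-part James-Stein: shrink all observations toward zero by a
# data-driven factor estimated from the overall signal size.
s2 = sum(v * v for v in x)
shrink = max(0.0, 1.0 - (p - 2) / s2)
js = [shrink * v for v in x]

mse_raw = sum((a - m) ** 2 for a, m in zip(x, mu)) / p
mse_js = sum((a - m) ** 2 for a, m in zip(js, mu)) / p
```

With all true means at zero the gain is dramatic, but the James-Stein estimator dominates the raw observations in total risk for any configuration of means once p >= 3, which is what makes shrinkage so attractive for genomics-scale problems.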

Recording

 

Debashis Mondal


Title: H-likelihood Methods in Spatial Statistics

Abstract: Youngjo Lee and John Nelder introduced an important body of literature on h-likelihood methods and hierarchical generalized linear models, which expanded the scope of generalized linear regressions with correlated errors and revived an interest in Charles Roy Henderson's pioneering ideas on mixed linear equations and best linear unbiased predictions. In this talk, I shall present my work on how h-likelihood methods pave the way for a deeper understanding of kriging and residual maximum likelihood estimation in spatial statistics, particularly for models based on conditional and intrinsic auto-regressions, the de Wijs process, and fractional Gaussian fields. In addition, I shall discuss how h-likelihood methods allow for scalable matrix-free computations. The importance of these developments will be emphasized with applications from environmental science. At the end, I will mention some of my ongoing work.

Recording

 

Mikhail Zhelonkin


Title: Robust Estimation of Probit Models with Endogeneity

Abstract: Probit models with endogenous regressors are widely used in economics and other social sciences. Yet, the robustness properties of parametric estimators in these models have not been formally studied. In this paper, we derive the influence functions of the endogenous probit model’s classical estimators (the maximum likelihood and the two-step estimator) and prove their non-robustness to small but harmful deviations from distributional assumptions. We propose a procedure to obtain a robust alternative estimator, prove its asymptotic normality and provide its asymptotic variance. A simple robust test for endogeneity is also constructed. We compare the performance of the robust and classical estimators in Monte Carlo simulations with different types of contamination scenarios. The use of our estimator is illustrated in several empirical applications.
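The non-robustness phenomenon the talk formalizes through influence functions can be seen in miniature with a location example rather than a probit model (a hedged toy illustration with simulated data, not the paper's estimator): an estimator with an unbounded influence function, like the mean, is dragged arbitrarily far by a small contaminated fraction, while a bounded-influence estimator, like the median, barely moves.

```python
import random, statistics

random.seed(6)

# 95 clean observations from the assumed model, plus 5% gross contamination.
clean = [random.gauss(0.0, 1.0) for _ in range(95)]
contaminated = clean + [50.0] * 5

mean_est = statistics.fmean(contaminated)     # unbounded influence function
median_est = statistics.median(contaminated)  # bounded influence function
```

The robust probit estimators in the talk play the role of the median here: they bound each observation's contribution so that small deviations from the distributional assumptions produce only small changes in the estimate.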