2nd Annual GSN Research Conference - List of Abstracts Presented

GSN Research Conference - Oral Presentation Abstracts

Arnab Auddy    
Columbia University    
Latent Variable Modeling via Orthogonally Decomposable Tensors    

Abstract:  As data generating mechanisms grow ever more complex, it becomes necessary to model higher-order interactions among the observed variables, which often manifest through latent variables. In this talk, we will see how orthogonally decomposable tensors provide a unified framework for many such problems. While these tensors are a natural extension of the matrix SVD, they automatically provide much stronger identifiability properties. More interestingly, a small perturbation affects each singular vector in isolation, so their recovery does not depend on the gap between consecutive singular values. In addition to these attractive statistical properties, the methods present intriguing computational considerations. To this end, we will discuss some statistical versus computational trade-offs and describe methods of principal component estimation that attain near-optimal rates.
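
For orientation, the object in question can be written schematically (a sketch, not the speaker's exact statement): an orthogonally decomposable third-order tensor has the form

    $T = \sum_{i=1}^{r} \lambda_i \, v_i \otimes v_i \otimes v_i, \qquad \langle v_i, v_j \rangle = 0 \ \text{for } i \neq j,$

and the gap-free perturbation phenomenon says that, for an observed $T + E$, each singular vector obeys a bound of the form $\|\hat{v}_i - v_i\| \lesssim \|E\|/\lambda_i$, free of the gap $\lambda_i - \lambda_{i+1}$ that the matrix Davis-Kahan bound would require.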

Manqi Cai    
University of Pittsburgh    
Robust and accurate estimation of cellular fraction from tissue omics data via ensemble deconvolution    

Abstract:  Motivation: Tissue-level gene expression and DNA methylation data represent an average across diverse cell types. Differences in cell-type fractions typically confound tissue-level analyses. To extract cell-type-specific (CTS) signals, dozens of computational methods (collectively known as cell-type deconvolution) have been proposed to infer cell-type fractions from tissue-level data. However, these methods produce vastly different results in practice under various settings, and simulation-based benchmarking studies have shown that no computational method is universally best for estimating cell-type fractions.

Results: To achieve a robust estimation of cellular fractions, we proposed Ensemble-learning-based Deconvolution (EnsDeconv), which adopts CTS robust regression to synthesize the results from eleven deconvolution methods, ten reference datasets, five marker gene selection procedures, five data normalizations, and two transformations. Unlike most benchmarking studies based on simulations, we compiled four large real datasets totaling 4,937 tissue samples, with measured cellular fractions and bulk gene expression from different tissue types. Comprehensive evaluations demonstrated that EnsDeconv yields more stable, robust, and accurate fractions than existing methods. We illustrated that EnsDeconv-estimated cellular fractions enable various CTS downstream analyses, such as identifying differential fractions associated with clinical variables. We further extended EnsDeconv to analyze bulk DNA methylation data from thousands of samples.

Availability: EnsDeconv is freely available as an R package from https://github.com/randel/EnsDeconv.
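
To fix ideas, the deconvolution-plus-ensemble computation can be pictured with a toy example. The R sketch below assumes a hypothetical signature matrix and mimics pipeline variability with perturbed reruns aggregated by a median; it is a schematic of the ensemble idea, not the EnsDeconv API.

    # Toy deconvolution plus a median ensemble; schematic only.
    set.seed(1)
    G <- 200; K <- 4
    S <- matrix(rexp(G * K), G, K)            # hypothetical signature: genes x cell types
    p_true <- c(0.4, 0.3, 0.2, 0.1)
    bulk <- as.vector(S %*% p_true) + rnorm(G, sd = 0.1)
    fit_frac <- function(S, bulk) {           # one "method": simplex least squares
      obj <- function(theta) {
        p <- exp(theta) / sum(exp(theta))     # softmax keeps fractions valid
        sum((bulk - S %*% p)^2)
      }
      theta <- optim(rep(0, ncol(S)), obj)$par
      exp(theta) / sum(exp(theta))
    }
    # Perturbed reruns mimic different method/reference/normalization pipelines
    runs <- replicate(20, fit_frac(S, bulk + rnorm(G, sd = 0.05)))
    rbind(truth = p_true, ensemble = apply(runs, 1, median))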

Qingyu Chen    
The Ohio State University    
An Adaptive and Robust Test for Microbial Community Analysis    

Abstract:  In microbiome studies, researchers measure the abundance of each operational taxonomic unit (OTU) and are often interested in testing the association between the microbiota and a clinical outcome while conditioning on certain covariates. Two types of approaches exist for this testing purpose: OTU-level tests, which assess the association between each OTU and the outcome, and community-level tests, which examine the microbial community as a whole. It is of considerable interest to develop methods that enjoy both the flexibility of OTU-level tests and the biological relevance of community-level tests. We propose MiAF, a method that adaptively combines p-values from the OTU-level tests to construct a community-level test. By borrowing the flexibility of OTU-level tests, the proposed method has great potential to generate a series of community-level tests that suit a range of different microbiome profiles, while achieving the desirable high statistical power of community-level testing methods. Using simulation studies and real data applications to a smoker throat microbiome study and an HIV patient stool microbiome study, we demonstrate that MiAF has comparable or better power than methods specifically designed for community-level tests. The proposed method also provides a natural heuristic for taxa selection.
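
The combination idea can be illustrated in a few lines of R. The sketch below uses a generic adaptive-Fisher-style statistic with Monte Carlo calibration; the actual MiAF statistic and its calibration differ in detail.

    # Generic adaptive combination of per-OTU p-values; a sketch, not MiAF.
    set.seed(1)
    adapt_stat <- function(p) {
      s <- cumsum(-2 * log(sort(p)))             # partial sums, most significant first
      min(pchisq(s, df = 2 * seq_along(s), lower.tail = FALSE))  # best truncation
    }
    p_otu <- c(runif(45), rbeta(5, 0.2, 1))      # 45 null OTUs, 5 weak signals
    t_obs <- adapt_stat(p_otu)
    t_null <- replicate(5000, adapt_stat(runif(50)))  # null distribution by simulation
    mean(t_null <= t_obs)                        # community-level p-value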

Kunal Das    
Department Of Statistics, Iowa State University    
Nonparametric Two-Step Estimation For Inhomogeneous Spatial Point Patterns    

Abstract:  A spatial point process is a random pattern of points in $d$-dimensional space. It can be characterized by an intensity function that describes the expected number of events across space. This talk focuses on nonparametric estimation of the intensity function for inhomogeneous 2D spatial point processes, such as Neyman-Scott and Cox processes, which have tractable second-order properties (Ripley's $K$ function). The estimation is based on a two-step procedure. First, the intensity function is estimated using bivariate spline smoothing over a triangulation. Second, the cluster parameters of the point pattern are estimated using minimum contrast methods. To examine the finite-sample performance of the proposed method, we consider simulation studies for the inhomogeneous Thomas process and the log-Gaussian Cox process. Finally, we illustrate how the results can be applied to analyze tropical rainforest data and consumer complaint data from New York City. This is joint work with my advisors, Dr. Zhengyuan Zhu and Dr. Lily Wang.
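
The second (minimum contrast) step can be sketched directly, since the Thomas process has a closed-form $K$ function. The R sketch below assumes an empirical $\hat K$ is already available from the first-step intensity estimate; here it is simulated.

    # Minimum contrast fit of Thomas cluster parameters (kappa, sigma).
    K_thomas <- function(r, kappa, sigma) {
      pi * r^2 + (1 - exp(-r^2 / (4 * sigma^2))) / kappa
    }
    min_contrast <- function(r, Khat) {
      obj <- function(par) {                      # log scale keeps parameters > 0
        sum((Khat^0.25 - K_thomas(r, exp(par[1]), exp(par[2]))^0.25)^2)
      }
      exp(optim(c(0, -2), obj, control = list(maxit = 2000))$par)
    }
    set.seed(1)
    r <- seq(0.01, 0.25, length.out = 50)
    Khat <- K_thomas(r, kappa = 25, sigma = 0.03) * exp(rnorm(50, sd = 0.02))
    min_contrast(r, Khat)                         # estimates of (kappa, sigma)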

Pritam Dey    
Duke University, Department of Statistical Science    
Outlier Detection for Brain Network Data    

Abstract:  It has become routine in neuroscience studies to measure brain networks for different individuals using neuroimaging. These networks are typically expressed as adjacency matrices, with each cell containing a summary of connectivity between a pair of brain regions. There is an emerging statistical literature describing methods for the analysis of such multi-network data. However, there has been essentially no consideration of the important problem of outlier detection. In particular, for certain subjects, the neuroimaging data are of such poor quality that the network cannot be reliably reconstructed. For such subjects, the resulting adjacency matrix may be mostly zero or may exhibit a bizarre pattern not consistent with a functioning brain. These outlying networks may serve as influential points, contaminating subsequent statistical analyses. We propose a simple method for network outlier detection (NOD) relying on an influence measure under a hierarchical generalized linear model for the adjacency matrices. An efficient computational algorithm is described, and our NOD method is illustrated through simulations and an application to data from the UK Biobank.
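
A deliberately simplified stand-in conveys the flavor of the task: score each subject's network against a robust edgewise center and scale, and flag extreme scores. This R sketch is an illustration, not the proposed NOD influence measure.

    # Simplified network-outlier score via robust edgewise standardization.
    set.seed(1)
    n <- 50; V <- 10; E <- V * (V - 1) / 2
    edgewts <- matrix(rnorm(n * E, mean = 1, sd = 0.3), n, E)  # one row per subject
    edgewts[1:2, ] <- 0                     # two failed scans: mostly-zero networks
    z <- scale(edgewts, center = apply(edgewts, 2, median),
               scale = apply(edgewts, 2, mad))
    score <- rowMeans(z^2)
    order(score, decreasing = TRUE)[1:4]    # the two empty networks rank first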

Yusi Fang    
Department of Biostatistics, University of Pittsburgh    
On p-value combination of independent and frequent signals: asymptotic efficiency and Fisher ensemble    

Abstract:  Combining p-values to integrate multiple effects is of long-standing interest in social science and biomedical research. In this paper, we revisit a classical scenario closely related to meta-analysis: combining a relatively small (finite and fixed) number of p-values while the sample size generating each p-value is large (tending to infinity). We evaluate a list of traditional and recently developed modified Fisher's methods to investigate their asymptotic efficiencies and finite-sample numerical performance. We conclude that Fisher's method and the adaptively weighted Fisher method achieve top performance, with complementary advantages across different proportions of true signals. Consequently, we propose an ensemble method, Fisher ensemble, that combines these two top-performing Fisher-related methods using a robust truncated Cauchy ensemble approach.
We show that Fisher ensemble achieves asymptotic Bahadur optimality and integrates the strengths of the Fisher and adaptively weighted Fisher methods in simulations. We then extend Fisher ensemble to a variant with emphasized power for concordant effect-size directions. A transcriptomic meta-analysis application confirms the theoretical and simulation conclusions, generates intriguing biomarker and pathway findings, and demonstrates the strengths and strategy of using the proposed Fisher ensemble methods.
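
The construction can be sketched concretely. The R code below combines Fisher's method with a Monte-Carlo-calibrated adaptive Fisher variant through an (untruncated) Cauchy combination; the paper's truncation and weighting details are omitted.

    # Sketch of a Fisher ensemble; schematic only.
    set.seed(1)
    fisher_p <- function(p) pchisq(-2 * sum(log(p)), df = 2 * length(p),
                                   lower.tail = FALSE)
    afisher_p <- function(p, B = 5000) {             # adaptive Fisher, MC calibrated
      stat <- function(q) min(pchisq(cumsum(-2 * log(sort(q))),
                                     df = 2 * seq_along(q), lower.tail = FALSE))
      mean(replicate(B, stat(runif(length(p)))) <= stat(p))
    }
    cauchy_combine <- function(pvec) {               # ensemble of the two p-values
      pcauchy(mean(tan((0.5 - pvec) * pi)), lower.tail = FALSE)
    }
    p <- c(0.02, 0.2, 0.8, 0.05, 0.6)                # p-values from five studies
    cauchy_combine(c(fisher_p(p), afisher_p(p)))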

Ruochen Huang    
The Ohio State University    
Statistical Issues in Principal Component Score Estimation for Exponential Family PCA    

Abstract:  Most extensions of standard PCA to exponential family data are based on the assumption that the natural parameter matrix can be factorized into two low-rank matrices, namely, the principal component loadings matrix and the scores matrix. The quality of component scores is of great importance for downstream tasks such as clustering and regression. When both loadings and scores are treated as fixed and unknown, they are often estimated jointly through maximum likelihood. However, the joint estimation tends to inflate the magnitude of component scores and degrade their quality when the data dimension is fixed. One possible source of this inflation is the bias of the MLE in generalized linear models. We examine the extent of bias in component scores for logistic PCA with binary data. Through simulation studies, we evaluate the effectiveness of some existing methods for bias reduction in the MLE for logistic regression when the loadings are treated as known or are estimated first from training data. In addition, we compare the quality of component scores from the joint estimation with an alternative formulation of logistic PCA based on the projection of saturated logit parameters.
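
When the loadings are treated as known, each subject's scores are simply the MLE of a logistic regression on the loadings, which makes the inflation easy to reproduce. A small R illustration under arbitrary simulated loadings:

    # Score estimation in logistic PCA with known loadings; sketch of the
    # magnitude inflation, averaged over replications.
    set.seed(1)
    d <- 40; k <- 2
    V <- matrix(rnorm(d * k), d, k) / 2          # fixed "known" loadings
    u_true <- c(1, -0.7)                         # one subject's true scores
    est <- replicate(1000, {
      x <- rbinom(d, 1, plogis(V %*% u_true))    # binary data for the subject
      coef(glm(x ~ 0 + V, family = binomial))
    })
    c(truth = sqrt(sum(u_true^2)),
      mean_mle = mean(sqrt(colSums(est^2))))     # estimated scores inflated in norm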

Yoonji Kim    
Department of Statistics, The Ohio State University    
Sequential Bayesian Registration for Functional Data    

Abstract:  In many modern applications, discretely-observed data may be naturally understood as a set of functions. Functional data often exhibit two confounded sources of variability: amplitude (y-axis) and phase (x-axis). The extraction of amplitude and phase, a process known as registration, is essential in exploring the underlying structure of functional data in a variety of areas, from environmental monitoring to medical imaging. Critically, such data are often gathered sequentially with new functional observations arriving over time. Despite this, most available registration procedures are only applicable to batch learning, leading to inefficient computation. To address these challenges, we introduce a Bayesian framework for sequential registration of functional data, which updates statistical inference as new sets of functions are assimilated. This Bayesian model-based sequential learning approach utilizes sequential Monte Carlo sampling to recursively update the alignment of observed functions while accounting for associated uncertainty. As a result, distributed computing, which is not generally an option in batch learning, significantly reduces computational cost. Simulation studies and comparisons to existing batch learning methods reveal that the proposed approach performs well even when the target posterior distribution has a challenging structure. We apply the proposed method to three real datasets: (1) functions of annual drought intensity near Kaweah River in California, (2) annual sea surface salinity functions near Null Island, and (3) PQRST complexes segmented from an electrocardiogram signal.
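
The recursive updating described above rests on a standard sequential Monte Carlo step: reweight particles by the likelihood of the newly arrived function, and resample when the effective sample size degenerates. A generic R sketch, with scalar particles standing in for the warping parameters of the paper:

    # Generic SMC update; schematic, not the paper's registration sampler.
    smc_update <- function(particles, logw, loglik_new) {
      logw <- logw + sapply(particles, loglik_new)
      w <- exp(logw - max(logw)); w <- w / sum(w)
      if (1 / sum(w^2) < length(particles) / 2) {       # resample if ESS too small
        idx <- sample(seq_along(particles), replace = TRUE, prob = w)
        particles <- particles[idx]; logw <- rep(0, length(particles))
      }
      list(particles = particles, logw = logw)
    }
    set.seed(1)
    state <- list(particles = rnorm(1000), logw = rep(0, 1000))
    state <- smc_update(state$particles, state$logw,    # new data favor a shift of 0.5
                        function(th) dnorm(0.5, mean = th, sd = 0.2, log = TRUE))
    w <- exp(state$logw); sum(state$particles * w / sum(w))  # posterior mean near 0.5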

Rebecca Kurtz-Garcia    
University of California Riverside    
Alternative Time-Average Covariance Matrix Estimation Procedures    

Abstract:  The time-average covariance matrix (TACM) is the variance of the sample mean when the data are serially correlated. Estimation of the TACM is of interest in various fields, such as time series, econometrics, and Markov chain Monte Carlo simulation. Spectral variance (SV) estimators are among the most common estimation methods, but they suffer from a negative bias. An alternative lugsail estimator has been proposed, which uses a linear combination of common SV estimators to induce a positive bias that corrects the issue. With the lugsail estimators, new tuning parameters are introduced to control the induced positive bias. TACM estimators are typically used for hypothesis testing and for constructing confidence regions for parameters. Current methods focus on optimizing TACM estimators according to their mean squared error (MSE). Instead, we can use alternative loss functions that choose the tuning parameters according to the Type I and Type II error rates of the testing procedures, rather than controlling them indirectly through MSE. With these new tuning parameters and alternative loss functions, we can eliminate the bias, control the variance, and obtain a testing-optimal estimator.
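
For a univariate chain the estimators under discussion are easy to write down. The R sketch below implements a Bartlett SV estimator and the lugsail combination $(\Sigma_b - c\,\Sigma_{\lfloor b/r \rfloor})/(1-c)$; the tuning values $(r, c)$ are illustrative, not the talk's proposals.

    # Bartlett SV estimator and its lugsail combination.
    sv_bartlett <- function(x, b) {
      g <- acf(x, lag.max = b, type = "covariance", plot = FALSE)$acf[, 1, 1]
      g[1] + 2 * sum((1 - seq_len(b) / b) * g[-1])   # kernel-weighted autocovariances
    }
    lugsail <- function(x, b, r = 3, c = 0.5) {      # offsets the negative bias
      (sv_bartlett(x, b) - c * sv_bartlett(x, max(1, floor(b / r)))) / (1 - c)
    }
    set.seed(1)
    x <- arima.sim(list(ar = 0.7), n = 5000)         # positively correlated series
    c(truth = 1 / (1 - 0.7)^2,                       # long-run variance of this AR(1)
      bartlett = sv_bartlett(x, 50), lugsail = lugsail(x, 50))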

Dillon Lloyd    
NIEHS    
toxpiR: an R package for the Toxicological Priority Index (ToxPi) framework    

Abstract:  toxpiR is an open-source R package for the Toxicological Priority Index (ToxPi) framework. ToxPi is a decision support tool that enables the transparent integration and visualization of data across disparate information domains. The framework has been used for decision support and hazard assessment by bodies such as the International Agency for Research on Cancer (IARC) and the National Academy of Sciences (NAS). While ToxPi was established to assess multi-factor risk profiles of chemicals, its utility as a visual profiling and communication framework has expanded into public health applications for streaming data. For any application, users collect data and indicate how they should be integrated; ToxPi then transforms the data into visual profiles that communicate the contribution of each factor toward an overall score. The toxpiR package offers new functionality for data handling, recombination, and customization while retaining compatibility with the stand-alone, graphical user interface (GUI) Java application. The package transforms simple input data into package-specific data structures that aid in the storage and organization of models. Once objects are created, built-in functions create highly customizable plots, and the R ecosystem allows for easy data export. The toxpiR package extends the application domain by supporting rapid analysis of massive datasets, facilitating parameter sweeps for model optimization, and providing open-source code for methodological expansion.
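
The arithmetic of a ToxPi-style profile (scale each data slice to [0, 1], then take a weighted sum) fits in a few lines of R. This hand-rolled sketch, with hypothetical slice weights, illustrates the scoring idea and is not the toxpiR data structures or functions:

    # Miniature ToxPi-style scoring; not the toxpiR API.
    set.seed(1)
    raw <- data.frame(tox = rexp(8), expo = runif(8), persist = rlnorm(8))
    w <- c(tox = 0.5, expo = 0.3, persist = 0.2)       # slice weights (sum to 1)
    scaled <- sapply(raw, function(v) (v - min(v)) / (max(v) - min(v)))
    score <- as.vector(scaled %*% w)                   # overall ToxPi-style score
    rank(-score)                                       # priority ranking of 8 chemicals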

Brandon Lumsden    
Clemson University    
Engineering Informed Genomic Prediction of Complex Phenotypes    

Abstract:  In light of advances in sequencing technology, genomic prediction has become a focal point of many plant breeding programs. Existing techniques use secondary phenotypes to improve accuracy and reduce phenotyping costs; however, none has explored the performance gains from incorporating known functional relationships among the multiple responses. To this end, we develop a multi-level genomic prediction model in which intermediate phenotypes, along with known structural relationships, are used to more accurately estimate the distribution of complex phenotypes. Multiple simulations are conducted to compare the engineering-informed technique with standard-practice multivariate linear mixed-effects models. In both settings, model fitting is accomplished via Gibbs sampling, and model performance is assessed via the Kullback-Leibler divergence between the posterior predictive distribution and the true underlying distribution of the primary phenotype. The simulations show that exploiting known nonlinear relationships with intermediate phenotypes results in more accurate posterior predictive distributions of the primary phenotype. We further demonstrate the new technique by performing guided prediction of flexural rigidity in maize crops.
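
A toy two-stage version conveys the "engineering informed" idea: predict the intermediate phenotype from markers, then push it through a known structural relationship (here a hypothetical quadratic), rather than regressing the complex phenotype on markers directly. The talk's Bayesian model is fit by Gibbs sampling; this least-squares R sketch only shows the structure.

    # Engineering-informed prediction, two-stage toy version.
    set.seed(1)
    n <- 300; m <- 20
    X <- matrix(rbinom(n * m, 2, 0.4), n, m)       # marker genotypes coded 0/1/2
    z <- as.vector(X %*% rnorm(m, sd = 0.3)) + rnorm(n, sd = 0.5)  # intermediate
    y <- z^2 + rnorm(n, sd = 0.5)                  # primary phenotype: y = g(z) + e
    direct <- lm(y ~ X)                            # standard: markers -> y
    stage1 <- lm(z ~ X)                            # informed: markers -> z, then g(z)
    c(direct = mean(residuals(direct)^2),
      informed = mean((y - fitted(stage1)^2)^2))   # informed fit is typically tighter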

Alexander Murph    
University of NC at Chapel Hill    
Bayesian Change Point Detection for Mixed Data with Missing Values    

Abstract:  An in-production prediction model must be monitored over time to ensure that its performance does not suffer from drift or abrupt changes in the data. Typically, this is done by comparing the algorithm's predictions with outcome data and ensuring that the algorithm maintains an acceptable level of accuracy over time. However, it is far preferable to learn in real time about major changes in the input data that could affect the model's performance. Thus, there is a great need for robust, real-time monitoring of high-dimensional input data over time. Here we consider the problem of change point detection on high-dimensional longitudinal data with mixed variable types and missing values. We do this by fitting an array of mixture Gaussian graphical models to groupings of temporally homogeneous data, called regimes. The primary goal of this model is to identify when there is a regime change, as this indicates a significant change in the input data distribution. To handle the messy nature of real-world data (mixed continuous/discrete variable types, missing values, and so on), we take a Bayesian latent variable approach. This affords us the flexibility to handle missing values in a principled manner, while simultaneously providing a way to encode discrete and censored values into a continuous framework. We take this approach a step further by encoding the missingness structure, which allows our model to detect major changes in the patterns of missingness, in addition to changes in the structure of the data distributions themselves.
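
A bare-bones version of the monitoring logic in R: model the current regime as Gaussian, track each incoming batch's average predictive log-likelihood, and raise an alarm when it drops sharply. The sketch (which requires the mvtnorm package) omits the mixture graphical models, mixed variable types, and missingness encoding that the talk develops.

    # Crude regime-change alarm via predictive log-likelihood of batches.
    set.seed(1)
    d <- 5
    stream <- rbind(matrix(rnorm(200 * d), 200, d),              # regime 1
                    matrix(rnorm(100 * d, mean = 1.5), 100, d))  # regime 2
    mu <- colMeans(stream[1:100, ]); S <- cov(stream[1:100, ])   # fit on early data
    batch_ll <- sapply(seq(101, 281, by = 10), function(t)
      mean(mvtnorm::dmvnorm(stream[t:(t + 9), ], mu, S, log = TRUE)))
    which(batch_ll < mean(batch_ll[1:5]) - 3)    # alarms fire after the true change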

Subrata Pal    
Iowa State University    
Model-based Personalized Synthetic Magnetic Resonance Imaging    

Abstract:  Synthetic Magnetic Resonance (MR) imaging predicts images at new design parameter settings from a few observed MR scans. Model-based methods, that use both the physical and statistical properties underlying the MR signal and its acquisition, can predict images at any setting from as few as three scans, allowing it to be used in individualized patient- and anatomy-specific contexts. However, the estimation problem in model-based synthetic MR imaging is ill-posed and so regularization, in the form of correlated Gaussian Markov Random Fields, is imposed on the voxel-wise spin-lattice relaxation time, spin-spin relaxation time and the proton density underlying the MR image. 
We develop theoretically sound but computationally practical matrix-free estimation methods for synthetic MR imaging, using a novel profile likelihood approach with a fast underlying Alternating Expectation Conditional Maximization algorithm. Our evaluations demonstrate the excellent ability of our methods to synthesize MR images in a clinical framework, as well as their estimation and prediction accuracy and consistency on both 2D and 3D MR images. We also uniformly outperform a recently developed deep learning method for this problem. An added strength of our model-based approach, also developed and illustrated here, is the accurate estimation of standard errors of regional means or contrasts in the synthesized images, with a matrix-free approach to handle huge sparse matrices. We also provide a publicly available C++ package with R and Python interfaces.
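
The physics underneath the method is the spin-echo signal equation. The R sketch below fits (rho, T1, T2) for a single hypothetical voxel from a few observed scans by nonlinear least squares and then synthesizes a value at a new design setting; five scans are used purely for numerical stability of the toy fit, and the GMRF regularization and matrix-free machinery of the full method are omitted.

    # Voxelwise spin-echo model: fit parameters, predict a new contrast.
    spin_echo <- function(rho, T1, T2, TR, TE) {
      rho * (1 - exp(-TR / T1)) * exp(-TE / T2)
    }
    set.seed(1)
    obs <- data.frame(TR = c(400, 800, 1500, 3000, 5000),
                      TE = c(10, 30, 60, 90, 120))
    obs$signal <- spin_echo(1, 900, 70, obs$TR, obs$TE) + rnorm(5, sd = 0.005)
    fit <- nls(signal ~ spin_echo(rho, T1, T2, TR, TE), data = obs,
               start = list(rho = 0.8, T1 = 1000, T2 = 60))
    predict(fit, newdata = data.frame(TR = 1200, TE = 30))  # synthesized value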

Isaac Quintanilla Salinas    
UC Riverside Statistics Department    
Multilevel Joint Models for Longitudinal and Survival Outcomes    

Abstract:  Motivated by data from dialysis patients in the U.S., we propose a joint modeling framework for longitudinal and survival outcomes. In this population, two outcomes are of interest: hospitalization, a longitudinal binary outcome that is a major source of death risk, and mortality, which is higher in this population than in comparable populations, including Medicare patients with cancer. It is therefore of interest to identify the patient- and dialysis-facility-level risk factors that jointly affect these outcomes. In addition to their higher-order hierarchical structure (longitudinal hospitalization measurements recorded over time and nested within subjects, with subjects further nested within the dialysis facilities where they receive regular care), our data come from a large national database containing information on a large number of facilities (>520), each with 60-160 patients. To accommodate this hierarchical data structure, we depart from the literature on joint modeling of longitudinal and survival data and include multilevel random effects and multilevel covariates at both the patient and facility levels. An approximate Expectation-Maximization algorithm is developed for estimation and inference, in which fully exponential Laplace approximations are employed to address computational challenges. We demonstrate the finite-sample performance of our approach via simulation studies.
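
One computational building block mentioned above, the Laplace approximation used to integrate over a random effect, can be shown in isolation. An R sketch for a single scalar effect; the fully exponential refinement used in the talk is a higher-order version of this idea.

    # Laplace approximation of the integral of exp(h) over a scalar effect.
    laplace <- function(h, interval = c(-20, 20)) {
      opt <- optimize(h, interval, maximum = TRUE)
      b <- opt$maximum
      h2 <- (h(b + 1e-4) - 2 * h(b) + h(b - 1e-4)) / 1e-8   # numerical h''(b)
      exp(opt$objective) * sqrt(2 * pi / -h2)
    }
    # Sanity check on a Gaussian log-density, where the answer is exactly 1
    laplace(function(b) dnorm(b, mean = 1, sd = 2, log = TRUE))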

Ye Tian    
Department of Statistics, Columbia University    
Transfer Learning under High-dimensional Generalized Linear Models    

Abstract:  In this work, we study the transfer learning problem under high-dimensional generalized linear models (GLMs), which aims to improve the fit on target data by borrowing information from useful source data. Given which sources to transfer, we propose a GLM transfer learning algorithm and derive its $\ell_1/\ell_2$-estimation error bounds, as well as a bound for a prediction error measure. The theoretical analysis shows that, when the target and sources are sufficiently close to each other, these bounds improve over those of the classical penalized estimator using only target data, under mild conditions. When it is unknown which sources to transfer, an algorithm-free transferable source detection approach is introduced to detect informative sources, and its detection consistency is proved under the high-dimensional GLM transfer learning setting. We also propose an algorithm to construct confidence intervals for each coefficient component, with corresponding theory. Extensive simulations and a real-data experiment verify the effectiveness of our algorithms. We implement the proposed GLM transfer learning algorithms in a new R package, glmtrans, which is available on CRAN. The paper is under minor revision at JASA.
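
A low-dimensional sketch conveys the two-step transfer idea: pool target and source for a rough fit, then correct it on the target alone through an offset. This uses base-R glm on a toy problem; the glmtrans package implements the actual penalized high-dimensional algorithm, and this is not its API.

    # Two-step transfer for a logistic model, low-dimensional toy version.
    set.seed(1)
    p <- 5
    b_t <- c(1, -1, 0.5, 0, 0); b_s <- b_t + rnorm(p, sd = 0.05)  # similar source
    Xs <- matrix(rnorm(2000 * p), 2000, p); ys <- rbinom(2000, 1, plogis(Xs %*% b_s))
    Xt <- matrix(rnorm(150 * p), 150, p);   yt <- rbinom(150, 1, plogis(Xt %*% b_t))
    w_hat <- coef(glm(c(ys, yt) ~ 0 + rbind(Xs, Xt), family = binomial))  # step 1
    delta <- coef(glm(yt ~ 0 + Xt, family = binomial,
                      offset = as.vector(Xt %*% w_hat)))                  # step 2
    rbind(target_only = coef(glm(yt ~ 0 + Xt, family = binomial)),
          transfer = w_hat + delta)     # transfer estimate is typically closer to b_t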

Adam Tonks    
University of Illinois at Urbana-Champaign    
West Nile Virus Forecasting Using GNNs    

Abstract:  Within Illinois, the Illinois Department of Public Health oversees a program to monitor populations of mosquitoes infected with West Nile virus (WNV). Using the trap data collected from this program to forecast the location and abundance of infected mosquitoes could aid mosquito surveillance and abatement efforts within the state. Although a variety of machine learning methods have previously been applied to similar problems, none has fully accounted for the spatial dimension of such data sets. We show that graph neural networks (GNNs) can perform well with geospatial data collected at individual points, without needing to augment the data set via interpolation. Furthermore, we describe a simple, generalizable method to determine the input graph vertices and edges using k-nearest neighbors, and to incorporate weather time series data. A baseline comparison shows that our model's performance in generating trap-by-trap forecasts exceeds that of a variety of other models, including logistic regression, decision trees, and other neural network architectures.
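
The graph-construction recipe is simple enough to state in code: each trap is a vertex with edges to its k nearest neighbors. An R sketch with made-up coordinates; the GNN would consume this edge list, with weather series attached as vertex features.

    # kNN graph construction from trap coordinates.
    set.seed(1)
    k <- 3
    traps <- cbind(lon = runif(30, -89, -87), lat = runif(30, 41, 42.5))
    D <- as.matrix(dist(traps))
    edges <- do.call(rbind, lapply(seq_len(nrow(traps)), function(i) {
      cbind(from = i, to = order(D[i, ])[2:(k + 1)])   # skip self (distance 0)
    }))
    head(edges)                                        # directed kNN edge list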

Hannah Waddel    
Emory University    
A Systematic Bayesian Integration of Epidemiological and Genetic Data    

Abstract:  The amount of genetic data available from pathogens in infectious disease outbreaks has increased rapidly in recent years, and insights from these data can aid the analysis of disease outbreaks. Genetic sequence data can improve inference for epidemiological quantities such as the transmission network ('who infected whom') and the times of infection. The integration of genetic and epidemiological data to infer the dynamics of a disease outbreak is known as 'phylodynamics' (Grenfell 2004). A 2015 paper by Lau et al. describes a novel Bayesian phylodynamic model that allows imputation of the transmission tree and unobserved pathogen genetic sequences through a full joint likelihood. The model is fit using the Metropolis-Hastings algorithm, for which the authors developed tools specifically for proposing the transmission graph and the sequences transmitted at the time of infection. The authors then use the new framework to quantify the value of pathogen sequence data in phylodynamic inference. An outbreak of foot-and-mouth disease in the UK is analyzed as a real-world example. In a 2018 comparison of various phylodynamic methods, this model was found to infer the transmission network of an outbreak most accurately (Firestone 2018).
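
The fitting engine referenced above is Metropolis-Hastings; a generic accept/reject step is sketched below in R with a scalar state as a stand-in. In Lau et al., the proposals instead act on transmission graphs and imputed sequences.

    # Generic random-walk Metropolis-Hastings step, scalar illustration.
    set.seed(1)
    log_post <- function(th) dnorm(th, mean = 2, sd = 1, log = TRUE)  # stand-in target
    th <- 0; draws <- numeric(5000)
    for (i in seq_along(draws)) {
      prop <- th + rnorm(1, sd = 0.5)                        # random-walk proposal
      if (log(runif(1)) < log_post(prop) - log_post(th)) th <- prop
      draws[i] <- th
    }
    c(mean = mean(draws), sd = sd(draws))                    # approximately (2, 1)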

Yuan Yang    
Clemson University    
Estimation of l0 Norm Penalized Models: A Statistical Treatment    

Abstract:  Fitting penalized models to merge the estimation and model selection problems has become commonplace in statistical practice. Of the various regularization strategies that can be leveraged to this end, the use of the l0 norm to penalize parameter estimation poses the most daunting model fitting task. In fact, this particular strategy requires an end user to solve a non-convex, NP-hard optimization problem regardless of the underlying data model. For this reason, the l0 norm has been woefully underutilized as a regularization strategy. To obviate this difficulty, we propose a strategy for solving such problems that is generally accessible to the statistical community. Our approach can be adapted to solve l0 norm penalized problems across a very broad class of models, can be implemented using existing software, and is computationally efficient. We demonstrate the performance of our method through in-depth numerical experiments and by using it to analyze several prototypical data sets.
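
For concreteness, one widely used computational route to such problems is iterative hard thresholding, sketched below in R for l0-constrained least squares; the talk's actual strategy may differ.

    # Iterative hard thresholding for l0-constrained least squares.
    set.seed(1)
    n <- 200; p <- 50
    X <- matrix(rnorm(n * p), n, p)
    beta <- c(rep(2, 5), rep(0, p - 5))                # 5 active coefficients
    y <- as.vector(X %*% beta + rnorm(n))
    iht <- function(X, y, k, iters = 200) {
      b <- rep(0, ncol(X))
      step <- 1 / max(eigen(crossprod(X), only.values = TRUE)$values)  # safe step
      for (i in seq_len(iters)) {
        b <- b + step * as.vector(crossprod(X, y - X %*% b))   # gradient step
        b[-order(abs(b), decreasing = TRUE)[1:k]] <- 0         # keep k largest
      }
      b
    }
    round(iht(X, y, k = 5)[1:10], 2)      # recovers the five active coefficients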

Yingchao Zhou    
Iowa State University    
Locally adaptive conformal prediction for regression    

Abstract:  Conformal inference constructs distribution-free prediction intervals assuming only i.i.d. data. For the regression model $Y = \mu(X) + \sigma(X)\varepsilon$ with flexible mean and variance functions, we study the statistical accuracy of locally adaptive conformal prediction intervals (C_loc) and investigate the conditions under which C_loc is asymptotically conditionally valid. We extend the current equal-tailed C_loc to handle asymmetric error distributions. We compare the resulting prediction interval, C_asy, to existing conformal prediction intervals and show that C_asy achieves shorter interval length.
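
A split-sample version of the locally adaptive construction fits in a few lines of R: scale residuals by a fitted $\sigma(x)$ so that interval width tracks local noise. The sketch below is the symmetric C_loc idea; the asymmetric C_asy would replace the single quantile with separate lower and upper quantiles.

    # Locally adaptive split conformal prediction interval, symmetric version.
    set.seed(1)
    n <- 2000
    x <- runif(n, -2, 2); y <- sin(2 * x) + (0.2 + 0.4 * abs(x)) * rnorm(n)
    tr <- sample(n, n / 2)                            # split: fit vs. calibrate
    mu_fit <- smooth.spline(x[tr], y[tr])
    sd_fit <- smooth.spline(x[tr], abs(y[tr] - predict(mu_fit, x[tr])$y))
    cal <- setdiff(seq_len(n), tr)
    score <- abs(y[cal] - predict(mu_fit, x[cal])$y) /
             pmax(predict(sd_fit, x[cal])$y, 1e-6)    # locally scaled residuals
    q <- quantile(score, ceiling(0.9 * (length(cal) + 1)) / length(cal))
    x0 <- 1.5                                         # 90% interval at a new point
    predict(mu_fit, x0)$y + c(-1, 1) * q * predict(sd_fit, x0)$y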