Statistics and Probability Seminar Series


See also our calendar for a complete list of the Department of Mathematics and Statistics seminars and events.

Please email Amir Ghassami (ghassami@bu.edu) if you would like to be added to the email list.

Spring 2025 Dates

Monday, January 13, CDS 365

Yuchen Wu (University of Pennsylvania)

Title:

Modern Sampling Paradigms: from Posterior Sampling to Generative AI

Abstract:

Sampling from a target distribution is a recurring theme in statistics and generative artificial intelligence (AI). In statistics, posterior sampling offers a flexible inferential framework, enabling uncertainty quantification, probabilistic prediction, and the estimation of intractable quantities. In generative AI, sampling aims to generate unseen instances that emulate a target population, such as the natural distributions of texts, images, and molecules.

In this talk, I will present my work on designing provably efficient sampling algorithms, addressing challenges in both statistics and generative AI. (1) In the first part, I will focus on posterior sampling for Bayesian sparse regression. In general, such posteriors are high-dimensional and contain many modes, making them challenging to sample from. To address this, we develop a novel sampling algorithm based on decomposing the target posterior into a log-concave mixture of simple distributions, reducing sampling from a complex distribution to sampling from a tractable log-concave one. We establish provable guarantees for our method in a challenging regime that was previously intractable. (2) In the second part, I will describe a training-free acceleration method for diffusion models, which are deep generative models that underpin cutting-edge applications such as AlphaFold, DALL-E, and Sora. Our approach is simple to implement, wraps around any pre-trained diffusion model, and comes with a provable convergence rate that strengthens prior theoretical results. We demonstrate the effectiveness of our method on several real-world image generation tasks.
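
The first part hinges on reducing the problem to sampling from a tractable log-concave distribution. As a hedged illustration of that tractable subproblem only, and not of the speaker's mixture decomposition or guarantees, the sketch below runs the unadjusted Langevin algorithm on a toy log-concave target; the function names and parameters are illustrative.

```python
import numpy as np

def ula_sample(grad_log_p, x0, step=1e-2, n_steps=5000, rng=None):
    """Unadjusted Langevin algorithm: a standard sampler for smooth
    log-concave targets, given the gradient of the log-density."""
    rng = np.random.default_rng(rng)
    x = np.array(x0, dtype=float)
    samples = np.empty((n_steps, x.size))
    for t in range(n_steps):
        noise = rng.standard_normal(x.size)
        x = x + step * grad_log_p(x) + np.sqrt(2.0 * step) * noise
        samples[t] = x
    return samples

# Toy log-concave target: N(mu, I), for which grad log p(x) = -(x - mu).
mu = np.array([1.0, -2.0])
draws = ula_sample(lambda x: -(x - mu), x0=np.zeros(2))
print(draws[1000:].mean(axis=0))  # should be close to mu after burn-in
```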

Lastly, I will outline my vision for bridging the fields of statistics and generative AI, exploring how insights from one domain can drive progress in the other.

Tuesday, January 21, CDS 548

Ye Tian (Columbia University)

Title:

Transfer and Multi-task Learning: Statistical Insights for Modern Data Challenges

Abstract:

Knowledge transfer, a core human ability, has inspired numerous data integration methods in machine learning and statistics. However, data integration faces significant challenges: (1) unknown similarity between data sources; (2) data contamination; (3) high dimensionality; and (4) privacy constraints.
This talk addresses these challenges in three parts across different contexts, presenting both innovative statistical methodologies and theoretical insights.
In Part I, I will introduce a transfer learning framework for high-dimensional generalized linear models that combines a pre-trained Lasso with a fine-tuning step. We provide theoretical guarantees for both estimation and inference, and apply the methods to predict county-level outcomes of the 2020 U.S. presidential election, uncovering valuable insights.
In Part II, I will explore an unsupervised learning setting where task-specific data is generated from a mixture model with heterogeneous mixture proportions. This complements the supervised learning setting discussed in Part I, addressing scenarios where labeled data is unavailable. We propose a federated gradient EM algorithm that is communication-efficient and privacy-preserving, providing estimation error bounds for the mixture model parameters.
In Part III, I will introduce a representation-based multi-task learning framework that generalizes the distance-based similarity notion discussed in Parts I and II. This framework is closely related to modern applications of fine-tuning in image classification and natural language processing. I will discuss how this study enhances our understanding of the effectiveness of fine-tuning and the influence of data contamination on representation multi-task learning.
Finally, I will summarize the talk and briefly introduce my broader research interests.
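
Part I combines a pre-trained Lasso with a fine-tuning step. The sketch below is a minimal, hedged rendering of that two-step idea, assuming a simple "source fit plus sparse correction" scheme with synthetic data; it is a generic illustration, not the estimator, tuning, or inference procedure from the talk.

```python
import numpy as np
from sklearn.linear_model import Lasso

def transfer_lasso(X_src, y_src, X_tgt, y_tgt, lam_src=0.1, lam_tgt=0.05):
    """Schematic two-step transfer learning for sparse linear regression:
    (1) pre-train a Lasso on the large source sample;
    (2) fine-tune by fitting a sparse correction on the small target sample."""
    w_src = Lasso(alpha=lam_src).fit(X_src, y_src).coef_
    resid = y_tgt - X_tgt @ w_src                    # what the source model misses on the target
    delta = Lasso(alpha=lam_tgt).fit(X_tgt, resid).coef_
    return w_src + delta                             # target coefficients = source fit + correction

rng = np.random.default_rng(0)
p = 50
beta_tgt = np.zeros(p); beta_tgt[:5] = 1.0
beta_src = beta_tgt.copy(); beta_src[5:8] = 0.3      # source differs from target by a sparse shift
X_src = rng.standard_normal((500, p)); y_src = X_src @ beta_src + rng.standard_normal(500)
X_tgt = rng.standard_normal((60, p));  y_tgt = X_tgt @ beta_tgt + rng.standard_normal(60)
beta_hat = transfer_lasso(X_src, y_src, X_tgt, y_tgt)
```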

Thursday, January 23, CDS 548

Charles Margossian (Flatiron Institute)

Title:

Markov chain Monte Carlo and variational inference in the age of parallel computation

Abstract:

Probabilistic models describe complex data generating processes and have been applied to a broad range of fields, such as epidemiology, pharmacology, and astrophysics. Inference for probabilistic models poses significant computational challenges, particularly as models grow in complexity and datasets increase in size. Modern hardware, with its parallelization capabilities, offers new opportunities to accelerate statistical inference. However, many traditional methods are not inherently designed for parallel computation. Markov chain Monte Carlo (MCMC), for instance, typically relies on a few long-running chains. I propose an alternative approach: running hundreds or thousands of shorter chains in parallel. To support this paradigm, I introduce the nested “R-hat,” a novel convergence diagnostic tailored for the many-short-chains regime, paving the way for faster and more automated MCMC.
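
For reference, the classic split-R-hat that the nested variant generalizes is easy to compute across an arbitrary number of parallel chains. The sketch below is that standard diagnostic for a single scalar quantity, assuming chains are stored as rows of an array; it is not the nested R-hat itself.

```python
import numpy as np

def split_rhat(chains):
    """Classic split-R-hat for one scalar quantity.  `chains` has shape
    (n_chains, n_draws); each chain is split in half, and between-chain
    variance is compared with within-chain variance."""
    n_chains, n_draws = chains.shape
    half = n_draws // 2
    splits = chains[:, :2 * half].reshape(2 * n_chains, half)
    n = half
    B = n * splits.mean(axis=1).var(ddof=1)          # between-chain variance
    W = splits.var(axis=1, ddof=1).mean()            # within-chain variance
    var_plus = (n - 1) / n * W + B / n
    return np.sqrt(var_plus / W)

# Many short chains targeting the same distribution should give a value near 1.
rng = np.random.default_rng(1)
chains = rng.standard_normal((1024, 50))             # 1024 chains, 50 draws each
print(split_rhat(chains))
```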

Next, I examine variational inference (VI). VI already leverages the parallelization capacities of modern hardware; however, it lacks the theoretical guarantees of MCMC and other statistical methods. I present two key theoretical results: (1) a positive result demonstrating that VI can effectively learn symmetries even under misspecified approximations, and (2) a negative result revealing that factorized (or mean-field) approximations lead to an impossibility theorem, preventing the simultaneous estimation of multiple measures of uncertainty. These findings provide practical guidance for selecting VI’s objective function and approximation family, offering a path toward robust and scalable inference.

Friday, January 24, CDS 548

Yuetian Luo (University of Chicago)

Title:

Challenges and Opportunities in Assumption-free and Robust Inference

Abstract:

With the growing application of data science to complex high-stakes tasks, ensuring the reliability of statistical inference methods has become increasingly critical. This talk considers two key challenges to achieving this goal: model misspecification and data corruption, highlighting their associated difficulties and potential solutions. In the first part, we investigate the problem of distribution-free algorithm risk evaluation, uncovering fundamental limitations on answering such questions with limited amounts of data. To navigate this challenge, we will also discuss how incorporating an assumption about algorithmic stability might help. The second part focuses on constructing robust confidence intervals in the presence of arbitrary data contamination. We show that when the proportion of contamination is unknown, uncertainty quantification incurs a substantial cost, resulting in optimal robust confidence intervals that must be significantly wider.

Monday, January 27, CDS 1646

Kai Tan (Rutgers University)

Title:

Estimating Generalization Error for Iterative Algorithms in High-Dimensional Regression

Abstract:

In the first part of the talk, I will investigate how to estimate the generalization error of iterates produced by iterative algorithms in high-dimensional linear regression. The proposed estimators apply to Gradient Descent, Proximal Gradient Descent, and accelerated methods like FISTA. These estimators are consistent under Gaussian designs and enable the selection of the optimal iteration when the generalization error follows a U-shaped pattern. Simulations on synthetic data demonstrate the practical utility of these methods.
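
To make the U-shaped behavior concrete, the hedged sketch below runs plain gradient descent on a synthetic high-dimensional least-squares problem and tracks the error of each iterate on held-out data; the held-out error is only a naive stand-in for the consistent estimators described in the talk, which do not require a held-out sample.

```python
import numpy as np

def gd_generalization_curve(X_tr, y_tr, X_te, y_te, n_iter=200):
    """Plain gradient descent on least squares, recording the out-of-sample
    error of every iterate along the trajectory."""
    step = 1.0 / np.linalg.norm(X_tr, 2) ** 2        # 1 / L, with L the largest eigenvalue of X'X
    beta = np.zeros(X_tr.shape[1])
    errs = []
    for _ in range(n_iter):
        beta = beta - step * X_tr.T @ (X_tr @ beta - y_tr)
        errs.append(np.mean((X_te @ beta - y_te) ** 2))
    return np.array(errs)

rng = np.random.default_rng(0)
n, p = 200, 500                                      # high-dimensional regime: p > n
beta_true = np.zeros(p); beta_true[:10] = 1.0
X = rng.standard_normal((n + 1000, p))
y = X @ beta_true + rng.standard_normal(n + 1000)
errs = gd_generalization_curve(X[:n], y[:n], X[n:], y[n:])
print("best iteration:", int(errs.argmin()))         # early-stopping point on the U-shaped curve
```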

In the second part of the talk, I will focus on the generalization performance of iterates obtained by Stochastic Gradient Descent (SGD) and its proximal variants in high-dimensional robust regression problems. I will introduce estimators that can precisely track the generalization error of the iterates along the trajectory of the iterative algorithm. These estimators are shown to be consistent under mild conditions that allow the noise to have infinite variance. Extensive simulations confirm the effectiveness of the proposed generalization error estimators.

Tuesday, January 28, CDS 365

Anirban Chatterjee (University of Pennsylvania)

Title:

Kernel and graphical methods for conditional inference

Abstract:

In this talk, we will discuss methods for comparing two conditional distributions using kernels and geometric graphs. Specifically, we propose a new measure of discrepancy between two conditional distributions that can be estimated using the ‘kernel trick’ and nearest-neighbor graphs in nearly linear time. When the two conditional distributions are the same, the estimate has a Gaussian limit and its asymptotic variance has a simple form that can be easily estimated from the data. This leads to a test that attains precise asymptotic level and is universally consistent for detecting differences between two conditional distributions.
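
As a hedged point of reference for the nearest-neighbor ingredient, the sketch below computes a classical unconditional nearest-neighbor two-sample statistic, namely the fraction of k-nearest-neighbor edges that stay within the same sample; it is not the conditional discrepancy measure proposed in the talk.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nn_two_sample_stat(X, Y, k=1):
    """Nearest-neighbor two-sample statistic: the proportion of k-nearest-neighbor
    edges in the pooled sample that connect points from the same sample.  Values
    well above the chance level (about 1/2 for equal sample sizes) suggest the
    two distributions differ."""
    Z = np.vstack([X, Y])
    labels = np.r_[np.zeros(len(X)), np.ones(len(Y))]
    # Ask for k + 1 neighbors because each point's nearest neighbor is itself.
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(Z).kneighbors(Z)
    same = labels[idx[:, 1:]] == labels[:, None]
    return same.mean()

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 2))
Y = rng.standard_normal((300, 2)) + 0.5              # mean shift between the two samples
print(nn_two_sample_stat(X, Y))
```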

Next, we will introduce a resampling-based test using our proposed measure for the conditional goodness-of-fit problem that controls the level in finite samples while maintaining asymptotic consistency with a finite number of resamples. A method to de-randomize the resampling-based test will also be discussed.

These methods can be readily applied to a broad range of problems, ranging from classical nonparametric statistics to modern machine learning. Specifically, we will discuss applications in testing model calibration, regression curve evaluation, and validation of emulators in simulation-based inference.

(Joint work with Ziang Niu and Bhaswar B. Bhattacharya)

Thursday, January 30, Online seminar

Georgia Papadogeorgou (University of Florida)

Title:

Addressing selection bias in cluster randomized experiments via weighting

Abstract:

In cluster randomized experiments, individuals are often recruited after the cluster treatment assignment, and data are typically only available for the recruited sample. Post-randomization recruitment can lead to selection bias, inducing systematic differences between the overall and the recruited populations, and between the recruited intervention and control arms. In this setting, we define causal estimands for the overall and the recruited populations. We prove, under the assumption of ignorable recruitment, that the average treatment effect on the recruited population can be consistently estimated from the recruited sample using inverse probability weighting. In general, we cannot identify the average treatment effect on the overall population. Nonetheless, we show, via a principal stratification formulation, that one can use weighting of the recruited sample to identify treatment effects on two meaningful subpopulations of the overall population: individuals who would be recruited into the study regardless of the assignment, and individuals who would be recruited into the study under treatment but not under control. We develop an estimation strategy and a sensitivity analysis approach for checking the ignorable recruitment assumption. The proposed methods are applied to the ARTEMIS cluster randomized trial, where removing co-payment barriers increases the persistence of P2Y12 inhibitor use among the always-recruited population.
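
Since the estimator for the recruited population is an inverse-probability-weighted contrast, the sketch below shows a schematic Hájek-style version with weights from a logistic recruitment model; the variables, the toy data, and the per-individual treatment indicator are illustrative assumptions, not the ARTEMIS design.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_contrast(y, treat, covariates, recruited):
    """Schematic inverse-probability-weighted treatment contrast on the recruited
    sample.  Recruitment probabilities are estimated by logistic regression on
    covariates and treatment; a normalized (Hajek) estimator is used, and the
    outcome y is only used for recruited individuals."""
    design = np.column_stack([covariates, treat])
    prob = LogisticRegression().fit(design, recruited).predict_proba(design)[:, 1]
    r = recruited.astype(bool)
    w, yr, tr = 1.0 / prob[r], y[r], treat[r]
    mu1 = np.sum(w[tr == 1] * yr[tr == 1]) / np.sum(w[tr == 1])
    mu0 = np.sum(w[tr == 0] * yr[tr == 0]) / np.sum(w[tr == 0])
    return mu1 - mu0

# Toy data: recruitment depends on a covariate and on the treatment arm.
rng = np.random.default_rng(0)
n = 2000
x = rng.standard_normal(n)
treat = rng.integers(0, 2, n)
recruited = rng.random(n) < 1.0 / (1.0 + np.exp(-(0.3 * x + 0.5 * treat)))
y = 1.0 * treat + x + rng.standard_normal(n)         # true treatment effect is 1
print(ipw_contrast(y, treat, x.reshape(-1, 1), recruited))
```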

Thursday, February 13, CDS 365

Murali Haran (Pennsylvania State University)

Title:

Statistical inference when likelihoods are intractable

Abstract:

Many modern scientific problems involve models for which traditional likelihood-based inference is difficult or impossible. Developing reliable and efficient algorithms for inference in such settings is one of the most critical challenges in statistical computing today. In this talk, I will discuss two broad classes of problems that exemplify these challenges. The first class includes models defined by computer simulations, which arise across diverse fields such as climate science, disease modeling, and automotive engineering. The second class involves statistical models with intractable normalizing functions, which appear in areas such as network analysis, spatial models on lattices, gene expression analysis, and permutation models.

I will provide an overview of various algorithms designed to tackle these problems, with a particular focus on a key challenge: how to tune and analyze algorithms for models with intractable normalizing functions. I will introduce a diagnostic tool applicable to both asymptotically exact and asymptotically inexact Monte Carlo methods. In the former, Monte Carlo approximations converge to the desired expectations, whereas in the latter, theoretical guarantees are limited or nonexistent. Finally, I will discuss the practical implications of this work, offering insights into widely used algorithms and their performance. 

Thursday, February 27, CDS 365

Yuguo Chen (University of Illinois at Urbana-Champaign)

Title:

Subsampling Based Community Detection for Large Networks

Abstract:

Large networks are increasingly prevalent in many scientific applications. Statistical analysis of such large networks becomes prohibitive due to exorbitant computation costs and high memory requirements. We develop a subsampling-based divide-and-conquer algorithm for community detection in large networks. This method saves both memory and computation costs significantly, as one needs to store and process only the smaller subnetworks. This method is also parallelizable, which makes it even faster. We derive theoretical upper bounds on the error rate of the algorithm when used with existing community detection algorithms. We demonstrate the effectiveness of the algorithm on simulated and real networks.
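
A minimal sketch of the divide-and-conquer flavor, assuming spectral clustering as the base community detection method: communities are detected on a random induced subnetwork, so the expensive eigendecomposition never touches the full graph. Repeating over several subsamples and stitching the labels together, which the actual algorithm requires, is omitted here.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_communities(A, k):
    """Spectral clustering on an adjacency matrix: k-means on the k leading
    eigenvectors (by magnitude) of A."""
    vals, vecs = np.linalg.eigh(A)
    U = vecs[:, np.argsort(-np.abs(vals))[:k]]
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)

def subsample_communities(A, k, m, rng=None):
    """Run community detection only on a random induced subnetwork of m nodes,
    which is the core computational saving of the subsampling approach."""
    rng = np.random.default_rng(rng)
    nodes = rng.choice(A.shape[0], size=m, replace=False)
    return nodes, spectral_communities(A[np.ix_(nodes, nodes)], k)

# Toy two-block stochastic block model.
rng = np.random.default_rng(0)
n = 600
z = np.repeat([0, 1], n // 2)
P = np.where(z[:, None] == z[None, :], 0.10, 0.02)
A = np.triu((rng.random((n, n)) < P).astype(float), 1)
A = A + A.T
nodes, labels = subsample_communities(A, k=2, m=150)
```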

Thursday, March 6, CDS 548

Organized by BUSCASA

Youssef Marzouk (MIT)

Title:

Scaling up the black box: simulation-based inference, transport, and dimension reduction

Abstract:

Many practical Bayesian inference problems fall into the simulation-based or “likelihood-free” setting, where evaluations of the likelihood function or prior density are intractable; instead one can only draw samples from the joint parameter-data prior. Conditional sampling becomes the key computational challenge in this setting. Transportation of measure offers a unifying framework for tackling this challenge, encompassing an enormous variety of contemporary algorithms. Scaling these algorithms to high dimensional parameters and data, however, requires exploiting the prospect of low-dimensional structure. Our recent work has shown how to identify maximally informative (and informed) low-dimensional projections of the data (and parameters), and obtain error bounds on the resulting posterior approximations, via gradient-based dimension reduction. We then propose a framework, derived from score-matching, to extend these dimension reduction methods to the simulation-based setting where gradients are unavailable. I will showcase examples of this approach in inverse problems, data assimilation, and energy market modeling.
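
For orientation only, the textbook baseline in this likelihood-free setting is rejection ABC, which draws (parameter, data) pairs from the joint prior and keeps the parameters whose simulated data land near the observation. The sketch below shows that classical baseline on a toy Gaussian model; it is unrelated to the transport and score-matching machinery of the talk, and all names are illustrative.

```python
import numpy as np

def rejection_abc(simulate, sample_prior, y_obs, n_sims=20000, quantile=0.01, rng=None):
    """Classical rejection ABC: simulate (theta, y) pairs from the joint prior
    and keep the thetas whose simulated data are closest to y_obs."""
    rng = np.random.default_rng(rng)
    thetas = sample_prior(n_sims, rng)
    ys = np.array([simulate(t, rng) for t in thetas])
    dist = np.linalg.norm(ys - y_obs, axis=1)
    return thetas[dist <= np.quantile(dist, quantile)]

# Toy example: infer the mean of a Gaussian from five observations.
y_obs = np.array([0.9, 1.1, 1.3, 0.7, 1.0])
sample_prior = lambda n, rng: rng.normal(0.0, 3.0, size=n)
simulate = lambda theta, rng: rng.normal(theta, 1.0, size=5)
posterior_draws = rejection_abc(simulate, sample_prior, y_obs)
print(posterior_draws.mean())                        # close to the conjugate posterior mean
```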

Thursday, March 13

Spring Break

Thursday, March 27, CDS 365

Yao Xie (Georgia Institute of Technology)

Title:

TBA

Abstract:

TBA

Thursday, April 10, CDS 365

Andrea Rotnitzky (University of Washington)

Title:

Towards a Unified Theory for Semiparametric Data Fusion with Individual-Level Data

Abstract:

Consider the goal of conducting inference about a smooth finite-dimensional parameter by utilizing individual-level data from various independent sources. Recent advancements have led to the development of a comprehensive theory capable of handling scenarios where different data sources align with, possibly distinct subsets of, conditional distributions of a single factorization of the joint target distribution. While this theory proves effective in many significant contexts, it falls short in certain common data fusion problems, such as two-sample instrumental variable analysis, settings that integrate data from epidemiological studies with diverse designs (e.g., prospective cohorts and retrospective case-control studies), and studies with variables prone to measurement error that are supplemented by validation studies. In this talk, I will extend the aforementioned comprehensive theory to allow for the fusion of individual-level data from sources aligned with conditional distributions that do not correspond to a single factorization of the target distribution. Assuming conditional and marginal distribution alignments, I will discuss universal results that characterize the class of all influence functions of regular asymptotically linear estimators and the efficient influence function of any pathwise differentiable parameter, irrespective of the number of data sources, the specific parameter of interest, or the statistical model for the target distribution. This theory paves the way for machine-learning debiased, semiparametric efficient estimation.

This is joint work with Ellen Graham and Marco Carone.

Previous Speakers

Fall 2024

Zhongyang Li (University of Connecticut)

Devavrat Shah (MIT)

Natesh Pillai (Harvard University)

Pamela Reinagel (UC San Diego)

Bodhisattva Sen (Columbia University)

Susan Murphy (Harvard University)

Luc Rey-Bellet (University of Massachusetts Amherst)

James Murphy (Tufts University)

Pragya Sur (Harvard University)

Spring 2024

Tracy Ke (Harvard University)

Feng Liu (Stevens Institute of Technology)

Rajarshi Mukherjee (Harvard University)

Guido Consonni (Università Cattolica del Sacro Cuore)

Fan Li (Duke University)

Kavita Ramanan (Brown University)

Fall 2023

Cynthia Rush (Columbia University)

James Maclaurin (New Jersey Institute of Technology)

Ruoyu Wu (Iowa State University)

Jonathan Pillow (Princeton University)

Subhabrata Sen (Harvard University)

Le Chen (Auburn University)

Raluca Balan (University of Ottawa)

Eric Tchetgen Tchetgen (University of Pennsylvania)

Tyler VanderWeele (Harvard University)

Jose Zubizarreta (Harvard University)