Statistics and Probability Seminar Series
See also our calendar for a complete list of the Department of Mathematics and Statistics seminars and events.
Please email Amir Ghassami (ghassami@bu.edu) if you would like to be added to the email list.
Spring 2025 Dates
Monday, January 13, CDS 365
Yuchen Wu (University of Pennsylvania)
Title:
Modern Sampling Paradigms: from Posterior Sampling to Generative AI
Abstract:
Sampling from a target distribution is a recurring theme in statistics and generative artificial intelligence (AI). In statistics, posterior sampling offers a flexible inferential framework, enabling uncertainty quantification, probabilistic prediction, as well as the estimation of intractable quantities. In generative AI, sampling aims to generate unseen instances that emulate a target population, such as the natural distributions of texts, images, and molecules.
In this talk, I will present my work on designing provably efficient sampling algorithms, addressing challenges in both statistics and generative AI. (1) In the first part, I will focus on posterior sampling for Bayesian sparse regression. In general, such posteriors are high-dimensional and contain many modes, making them challenging to sample from. To address this, we develop a novel sampling algorithm based on decomposing the target posterior into a log-concave mixture of simple distributions, reducing sampling from a complex distribution to sampling from a tractable log-concave one. We establish provable guarantees for our method in a challenging regime that was previously intractable. (2) In the second part, I will describe a training-free acceleration method for diffusion models, which are deep generative models that underpin cutting-edge applications such as AlphaFold, DALL-E and Sora. Our approach is simple to implement, wraps around any pre-trained diffusion model, and comes with a provable convergence rate that strengthens prior theoretical results. We demonstrate the effectiveness of our method on several real-world image generation tasks.
Lastly, I will outline my vision for bridging the fields of statistics and generative AI, exploring how insights from one domain can drive progress in the other.
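The decomposition described in the abstract reduces the problem to log-concave sampling, for which simple gradient-based samplers come with guarantees. As a generic illustration of that building block (not the speaker's algorithm), the unadjusted Langevin algorithm draws approximate samples from a density proportional to exp(-f(x)) using only the gradient of a convex potential f:

```python
import numpy as np

def langevin_sample(grad_f, x0, step=1e-2, n_steps=5000, rng=None):
    """Unadjusted Langevin algorithm: approximate samples from p(x) ∝ exp(-f(x)).

    grad_f : gradient of the (convex) potential f
    x0     : starting point (d-dimensional array)
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        # Gradient step on f plus injected Gaussian noise.
        x = x - step * grad_f(x) + np.sqrt(2 * step) * noise
    return x

# Toy log-concave target: standard Gaussian, f(x) = ||x||^2 / 2, grad f(x) = x.
samples = np.array([langevin_sample(lambda x: x, np.zeros(2), rng=seed)
                    for seed in range(200)])
print(samples.mean(axis=0))   # close to [0, 0]
print(samples.std(axis=0))    # close to [1, 1]
```

For log-concave targets, discretized Langevin dynamics of this kind admits non-asymptotic mixing guarantees, which is what makes the reduction from a multimodal posterior to a log-concave mixture useful.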
Tuesday, January 21, CDS 548
Ye Tian (Columbia University)
Title:
Transfer and Multi-task Learning: Statistical Insights for Modern Data Challenges
Abstract:
Knowledge transfer, a core human ability, has inspired numerous data integration methods in machine learning and statistics. However, data integration faces significant challenges: (1) unknown similarity between data sources; (2) data contamination; (3) high dimensionality; and (4) privacy constraints.
This talk addresses these challenges in three parts across different contexts, presenting both innovative statistical methodologies and theoretical insights.
In Part I, I will introduce a transfer learning framework for high-dimensional generalized linear models that combines a pre-trained Lasso with a fine-tuning step. We provide theoretical guarantees for both estimation and inference, and apply the methods to predict county-level outcomes of the 2020 U.S. presidential election, uncovering valuable insights.
In Part II, I will explore an unsupervised learning setting where task-specific data is generated from a mixture model with heterogeneous mixture proportions. This complements the supervised learning setting discussed in Part I, addressing scenarios where labeled data is unavailable. We propose a federated gradient EM algorithm that is communication-efficient and privacy-preserving, providing estimation error bounds for the mixture model parameters.
In Part III, I will introduce a representation-based multi-task learning framework that generalizes the distance-based similarity notion discussed in Parts I and II. This framework is closely related to modern applications of fine-tuning in image classification and natural language processing. I will discuss how this study enhances our understanding of the effectiveness of fine-tuning and the influence of data contamination on representation multi-task learning.
Finally, I will summarize the talk and briefly introduce my broader research interests.
Thursday, January 23, CDS 548
Charles Margossian (Flatiron Institute)
Title:
Markov chain Monte Carlo and variational inference in the age of parallel computation
Abstract:
Probabilistic models describe complex data generating processes and have been applied to a broad range of fields, such as epidemiology, pharmacology, and astrophysics. Inference for probabilistic models poses significant computational challenges, particularly as models grow in complexity and datasets increase in size. Modern hardware, with its parallelization capabilities, offers new opportunities to accelerate statistical inference. However, many traditional methods are not inherently designed for parallel computation. Markov chain Monte Carlo (MCMC), for instance, typically relies on a few long-running chains. I propose an alternative approach: running hundreds or thousands of shorter chains in parallel. To support this paradigm, I introduce the nested “R-hat,” a novel convergence diagnostic tailored for the many-short-chains regime, paving the way for faster and more automated MCMC.
Next, I examine variational inference (VI). VI already leverages the parallelization capacities of modern hardware; however, it lacks the theoretical guarantees of MCMC and other statistical methods. I present two key theoretical results: (1) a positive result demonstrating that VI can effectively learn symmetries even under misspecified approximations, and (2) a negative result revealing that factorized (or mean-field) approximations lead to an impossibility theorem, preventing the simultaneous estimation of multiple measures of uncertainty. These findings provide practical guidance for selecting VI’s objective function and approximation family, offering a path toward robust and scalable inference.
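For background, the classical split-R̂ diagnostic compares between- and within-chain variance; a minimal sketch of the standard split-R̂ (not the nested variant introduced in the talk), computed over many short parallel chains, might look like:

```python
import numpy as np

def split_rhat(draws):
    """Classical split R-hat: draws has shape (n_chains, n_draws).

    Each chain is split in half, and R-hat compares between- and
    within-chain variance.  Values near 1 suggest the chains agree.
    """
    n_chains, n_draws = draws.shape
    half = n_draws // 2
    # Split each chain into two halves -> 2 * n_chains pseudo-chains.
    chains = np.concatenate([draws[:, :half], draws[:, half:2 * half]], axis=0)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()   # mean within-chain variance
    B = n * chain_means.var(ddof=1)         # between-chain variance
    var_plus = (n - 1) / n * W + B / n      # pooled variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(0)
# 1000 short chains of 50 iid standard-normal draws: R-hat should be near 1.
good = split_rhat(rng.standard_normal((1000, 50)))
# Chains stuck at different offsets: R-hat should be well above 1.
bad = split_rhat(rng.standard_normal((1000, 50)) + 3 * rng.standard_normal((1000, 1)))
print(good, bad)
```

A difficulty this classical form inherits in the many-short-chains regime is that each chain contributes only a few draws to the within-chain variance estimate, which motivates diagnostics tailored to that setting.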
Friday, January 24, CDS 548
Yuetian Luo (University of Chicago)
Title:
Challenges and Opportunities in Assumption-free and Robust Inference
Abstract:
With the growing application of data science to complex high-stakes tasks, ensuring the reliability of statistical inference methods has become increasingly critical. This talk considers two key challenges to achieving this goal: model misspecification and data corruption, highlighting their associated difficulties and potential solutions. In the first part, we investigate the problem of distribution-free evaluation of an algorithm's risk, uncovering fundamental limitations on what can be inferred with limited amounts of data. To navigate this challenge, we will also discuss how incorporating an assumption about algorithmic stability might help. The second part focuses on constructing robust confidence intervals in the presence of arbitrary data contamination. We show that when the proportion of contamination is unknown, uncertainty quantification incurs a substantial cost, resulting in optimal robust confidence intervals that must be significantly wider.
Monday, January 27, CDS 1646
Kai Tan (Rutgers University)
Title:
Estimating Generalization Error for Iterative Algorithms in High-Dimensional Regression
Abstract:
In the first part of the talk, I will present estimators of the generalization error of the iterates produced by iterative algorithms in high-dimensional linear regression. The estimators apply to Gradient Descent, Proximal Gradient Descent, and accelerated methods such as FISTA. They are consistent under Gaussian designs and enable selection of the optimal iteration when the generalization error follows a U-shaped pattern. Simulations on synthetic data demonstrate the practical utility of these methods.
In the second part of the talk, I will focus on the generalization performance of iterates obtained by Stochastic Gradient Descent (SGD), and their proximal variants in high-dimensional robust regression problems. I will introduce estimators that can precisely track the generalization error of the iterates along the trajectory of the iterative algorithm. These estimators are shown to be consistent under mild conditions that allow the noise to have infinite variance. Extensive simulations confirm the effectiveness of the proposed generalization error estimators.
Tuesday, January 28, CDS 365
Anirban Chatterjee (University of Pennsylvania)
Title:
Kernel and graphical methods for conditional inference
Abstract:
In this talk, we will discuss methods for comparing two conditional distributions using kernels and geometric graphs. Specifically, we propose a new measure of discrepancy between two conditional distributions that can be estimated using the ‘kernel trick’ and nearest-neighbor graphs in nearly linear time. When the two conditional distributions are the same, the estimate has a Gaussian limit and its asymptotic variance has a simple form that can be easily estimated from the data. This leads to a test that attains precise asymptotic level and is universally consistent for detecting differences between two conditional distributions.
Next, we will introduce a resampling-based test using our proposed measure for the conditional goodness-of-fit problem that controls the level in finite samples while maintaining asymptotic consistency with a finite number of resamples. A method to de-randomize the resampling-based test will also be discussed.
These methods can be readily applied to a broad range of problems, ranging from classical nonparametric statistics to modern machine learning. Specifically, we will discuss applications in testing model calibration, regression curve evaluation, and validation of emulators in simulation-based inference.
(Joint work with Ziang Niu and Bhaswar B. Bhattacharya)
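To give a flavor of graph-based two-sample comparison, here is a toy unconditional heuristic (not the conditional discrepancy proposed in the talk): pool the two samples and measure how often a point's nearest neighbor comes from its own sample. Under equal distributions this fraction sits near its chance level; under different distributions it inflates.

```python
import numpy as np

def nn_same_sample_fraction(x, y):
    """Toy nearest-neighbor two-sample heuristic: fraction of pooled points
    whose nearest neighbor belongs to the same sample."""
    pooled = np.vstack([x, y])
    labels = np.array([0] * len(x) + [1] * len(y))
    # Pairwise squared distances; exclude self-matches via the diagonal.
    d2 = ((pooled[:, None, :] - pooled[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    nn = d2.argmin(axis=1)
    return (labels == labels[nn]).mean()

rng = np.random.default_rng(0)
same = nn_same_sample_fraction(rng.standard_normal((300, 2)),
                               rng.standard_normal((300, 2)))
shifted = nn_same_sample_fraction(rng.standard_normal((300, 2)),
                                  rng.standard_normal((300, 2)) + 3.0)
print(same, shifted)   # same near 0.5; shifted close to 1
```

Nearest-neighbor graphs of this kind can be built in nearly linear time with tree- or hashing-based data structures, which is part of what makes graph-based statistics computationally attractive; the brute-force distance matrix above is only for clarity.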
Thursday, January 30, Online seminar
Georgia Papadogeorgou (University of Florida)
Title:
Addressing selection bias in cluster randomized experiments via weighting
Abstract:
In cluster randomized experiments, individuals are often recruited after the cluster treatment assignment, and data are typically only available for the recruited sample. Post-randomization recruitment can lead to selection bias, inducing systematic differences between the overall and the recruited populations, and between the recruited intervention and control arms. In this setting, we define causal estimands for the overall and the recruited populations. We prove, under the assumption of ignorable recruitment, that the average treatment effect on the recruited population can be consistently estimated from the recruited sample using inverse probability weighting. In general, we cannot identify the average treatment effect on the overall population. Nonetheless, we show, via a principal stratification formulation, that one can use weighting of the recruited sample to identify treatment effects on two meaningful subpopulations of the overall population: individuals who would be recruited into the study regardless of the assignment, and individuals who would be recruited into the study under treatment but not under control. We develop an estimation strategy and a sensitivity analysis approach for checking the ignorable recruitment assumption. The proposed methods are applied to the ARTEMIS cluster randomized trial, where removing co-payment barriers increases the persistence of P2Y12 inhibitor use among the always-recruited population.
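Inverse probability weighting itself is a standard tool; a toy sketch (not the trial analysis, and with the recruitment probability assumed known rather than estimated) shows how reweighting recruited individuals by the inverse of their recruitment probability removes selection bias in a simple mean estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.standard_normal(n)             # covariate driving recruitment
y = 2.0 + x + rng.standard_normal(n)   # outcome; population mean is 2.0

# Recruitment probability depends on x, so the recruited sample is biased.
p = 1.0 / (1.0 + np.exp(-x))           # assumed-known recruitment probability
recruited = rng.random(n) < p

naive = y[recruited].mean()            # biased upward: high-x people over-recruited
w = 1.0 / p[recruited]                 # inverse-probability weights
ipw = np.sum(w * y[recruited]) / np.sum(w)   # Hajek-style IPW estimate
print(round(naive, 2), round(ipw, 2))  # ipw recovers ~2.0; naive is biased upward
```

In practice the recruitment probabilities are unknown and must be modeled, which is why the paper's sensitivity analysis for the ignorable recruitment assumption matters.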
Thursday, March 6, CDS 548
Organized by BUSCASA
Youssef Marzouk (MIT)
Title:
Scaling up the black box: simulation-based inference, transport, and dimension reduction
Abstract:
Many practical Bayesian inference problems fall into the simulation-based or “likelihood-free” setting, where evaluations of the likelihood function or prior density are intractable; instead one can only draw samples from the joint parameter-data prior. Conditional sampling becomes the key computational challenge in this setting. Transportation of measure offers a unifying framework for tackling this challenge, encompassing an enormous variety of contemporary algorithms. Scaling these algorithms to high dimensional parameters and data, however, requires exploiting the prospect of low-dimensional structure. Our recent work has shown how to identify maximally informative (and informed) low-dimensional projections of the data (and parameters), and obtain error bounds on the resulting posterior approximations, via gradient-based dimension reduction. We then propose a framework, derived from score-matching, to extend these dimension reduction methods to the simulation-based setting where gradients are unavailable. I will showcase examples of this approach in inverse problems, data assimilation, and energy market modeling.
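To illustrate the spirit of gradient-based dimension reduction in a stylized linear-Gaussian setting (a toy sketch under strong assumptions, not the authors' general method): the leading eigenvectors of the expected outer product of log-likelihood gradients recover the few parameter directions the data actually inform.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 20, 3
G = rng.standard_normal((k, d))   # data depend on only a k-dim projection of x

def grad_loglik(x, y):
    # Gradient of a unit-variance Gaussian log-likelihood y ~ N(Gx, I).
    return G.T @ (y - G @ x)

# Monte Carlo estimate of the gradient outer-product (diagnostic) matrix.
H = np.zeros((d, d))
for _ in range(500):
    x = rng.standard_normal(d)
    y = G @ x + rng.standard_normal(k)
    g = grad_loglik(x, y)
    H += np.outer(g, g) / 500

eigvals = np.linalg.eigvalsh(H)[::-1]   # descending order
print(eigvals[:5])  # only the first k eigenvalues are appreciably nonzero
```

The simulation-based setting discussed in the talk is harder precisely because such likelihood gradients are unavailable, which is where the score-matching-based extension comes in.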
Thursday, March 13
Spring Break
Thursday, March 20, CDS 365
Yuguo Chen (University of Illinois)
Title:
TBA
Abstract:
TBA
Thursday, March 27, CDS 365
Yao Xie (Georgia Institute of Technology)
Title:
TBA
Abstract:
TBA
Thursday, April 3, CDS 365
Qiyang Han (Rutgers University)
Title:
TBA
Abstract:
TBA
Thursday, April 10, CDS 365
Andrea Rotnitzky (University of Washington)
Title:
TBA
Abstract:
TBA
Thursday, April 17, CDS 365
TBA
Thursday, April 24, CDS 365
TBA
Previous Speakers
Fall 2024
Zhongyang Li (University of Connecticut)
Devavrat Shah (MIT)
Natesh Pillai (Harvard University)
Pamela Reinagel (UC San Diego)
Bodhisattva Sen (Columbia University)
Susan Murphy (Harvard University)
Luc Rey-Bellet (University of Massachusetts Amherst)
James Murphy (Tufts University)
Pragya Sur (Harvard University)
Spring 2024
Tracy Ke (Harvard University)
Feng Liu (Stevens Institute of Technology)
Rajarshi Mukherjee (Harvard University)
Guido Consonni (Università Cattolica del Sacro Cuore)
Fan Li (Duke University)
Kavita Ramanan (Brown University)
Fall 2023
Cynthia Rush (Columbia University)
James Maclaurin (New Jersey Institute of Technology)
Ruoyu Wu (Iowa State University)
Jonathan Pillow (Princeton University)
Subhabrata Sen (Harvard University)
Le Chen (Auburn University)
Raluca Balan (University of Ottawa)
Eric Tchetgen Tchetgen (University of Pennsylvania)
Tyler VanderWeele (Harvard University)
Jose Zubizarreta (Harvard University)