research
Papers by category, in reverse chronological order. [*] denotes equal contribution.
preprints
- Efficient Randomized Experiments Using Foundation Models. Piersilvio De Bartolomeis, Javier Abad, Guanbo Wang, Konstantin Donhauser, Raymond M. Duch, Fanny Yang, and Issa J. Dahabreh. arXiv preprint, 2025.
Randomized experiments are the preferred approach for evaluating the effects of interventions, but they are costly and often yield estimates with substantial uncertainty. On the other hand, in silico experiments leveraging foundation models offer a cost-effective alternative that can potentially attain higher statistical precision. However, the benefits of in silico experiments come with a significant risk: statistical inferences are not valid if the models fail to accurately predict experimental responses to interventions. In this paper, we propose a novel approach that integrates the predictions from multiple foundation models with experimental data while preserving valid statistical inference. Our estimator is consistent and asymptotically normal, with asymptotic variance no larger than that of the standard estimator based on experimental data alone. Importantly, these statistical properties hold even when model predictions are arbitrarily biased. Empirical results across several randomized experiments show that our estimator offers substantial precision gains, equivalent to a reduction of up to 20% in the sample size needed to attain the same precision as the standard estimator based on experimental data alone.
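A minimal sketch of the core idea, combining experimental data with possibly biased model predictions while retaining valid inference, assuming a single prediction vector and a known randomization probability (the function name and this particular augmentation are illustrative, not the paper's multi-model estimator):

```python
import numpy as np

def augmented_ate(y, a, f_pred, pi=0.5):
    """Sketch: prediction-augmented ATE estimate in a randomized
    experiment with known treatment probability pi.

    y      : observed experimental outcomes
    a      : binary treatment indicators (0/1)
    f_pred : model predictions of the outcome, possibly biased
    """
    # Horvitz-Thompson per-unit contributions; their mean estimates the ATE.
    psi = a * y / pi - (1 - a) * y / (1 - pi)
    # Augmentation term with mean zero under randomization, so the combined
    # estimator stays consistent even if f_pred is arbitrarily biased.
    h = (a - pi) * f_pred
    # Variance-minimizing weight on the augmentation term.
    lam = np.cov(psi, h, ddof=1)[0, 1] / np.var(h, ddof=1)
    return float(np.mean(psi - lam * h))
```

Because the augmentation term has mean zero by design, any choice of `lam` preserves consistency; optimizing it is what yields the variance reduction.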
- Causal Lifting of Neural Representations: Zero-Shot Generalization for Causal Inferences. Riccardo Cadei, Ilker Demirel*, Piersilvio De Bartolomeis*, Lukas Lindorfer, Sylvia Cremer, Cordelia Schmid, and Francesco Locatello. arXiv preprint, 2025.
Many real-world scientific investigations could scale with the support of trustworthy predictive models that reduce the need for costly data annotations. We focus on causal inferences in a target experiment whose factual outcomes are unlabeled and instead predicted by a model fine-tuned on a similar, labeled experiment. First, we show that factual outcome estimation via Empirical Risk Minimization (ERM) may fail to yield valid causal inferences on the target population, even in a randomized controlled experiment with infinite training samples. We then propose to leverage the observed experimental settings during training to enable generalization to downstream interventional investigations, “Causal Lifting” the predictive model. Concretely, we introduce Deconfounded Empirical Risk Minimization (DERM), a simple new learning procedure that minimizes the risk over a fictitious target population, preventing potential confounding effects. We validate our method on both synthetic and real-world scientific data. Notably, for the first time, we zero-shot generalize causal inferences on the ISTAnt dataset (without annotation) by causally lifting a predictive model trained on our variant of the experiment.
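One plausible way to realize "minimizing the risk over a fictitious target population" is importance weighting that breaks the dependence between the experimental setting and the covariates; the sketch below is an assumption about the general idea, not the paper's DERM objective (`propensity` and the weighting form are hypothetical):

```python
import numpy as np

def deconfounding_weights(t, propensity):
    """Illustrative importance weights for a fictitious population in
    which the experimental setting t is independent of the covariates.

    t          : binary experimental setting per sample
    propensity : estimate of P(T = t_i | X = x_i)
    """
    marginal = t.mean()                       # P(T = 1) from the sample
    p_t = np.where(t == 1, marginal, 1.0 - marginal)
    return p_t / propensity

def weighted_risk(loss_per_sample, weights):
    # Reweighted empirical risk: plug in any per-sample loss.
    return float(np.average(loss_per_sample, weights=weights))
```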
publications
- Doubly robust identification of treatment effects from multiple environments. Piersilvio De Bartolomeis, Julia Kostin, Javier Abad, Yixin Wang, and Fanny Yang. International Conference on Learning Representations (ICLR), 2025.
Practical and ethical constraints often dictate the use of observational data for causal inference, particularly in medicine and social sciences. Yet, observational datasets are prone to confounding, potentially compromising the validity of conclusions. While adjusting for all available covariates is a common corrective strategy, this approach can introduce bias, especially when post-treatment variables are present or some variables remain unobserved—a frequent scenario in practice. Avoiding this bias often requires detailed knowledge of the underlying causal graph, a challenging and often impractical prerequisite. In this work, we propose RAMEN, an algorithm that tackles this challenge by leveraging the heterogeneity of multiple data sources without the need to know the complete causal graph. Notably, RAMEN achieves doubly robust identification: we identify the treatment effect if either the causal parents of the treatment or those of the outcome are observed. Empirical evaluations across synthetic, semi-synthetic, and real-world datasets show that our approach significantly outperforms existing methods.
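RAMEN's algorithm is not reproduced here, but the underlying idea of exploiting heterogeneity can be illustrated: search over candidate adjustment sets and keep those whose adjusted treatment effect is stable across environments. The sketch below (linear models, pairwise z-tests, exhaustive search) is one illustrative reading, not the paper's method:

```python
import itertools
import numpy as np

def effect_in_env(X, t, y, subset):
    """OLS of y on (1, t, X[:, subset]); returns the treatment
    coefficient and its standard error."""
    Z = np.column_stack([np.ones(len(t)), t, X[:, subset]])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    sigma2 = resid @ resid / (len(y) - Z.shape[1])
    cov = np.linalg.inv(Z.T @ Z) * sigma2
    return beta[1], np.sqrt(cov[1, 1])

def invariant_adjustment_sets(envs, n_covs, z_crit=1.96):
    """Keep adjustment sets whose effect estimate is invariant across
    environments. Exhaustive search; exponential in n_covs, so only
    feasible for small problems."""
    keep = []
    all_subsets = itertools.chain.from_iterable(
        itertools.combinations(range(n_covs), k) for k in range(n_covs + 1))
    for subset in all_subsets:
        ests = [effect_in_env(X, t, y, list(subset)) for X, t, y in envs]
        stable = all(abs(b1 - b2) / np.hypot(s1, s2) < z_crit
                     for (b1, s1), (b2, s2) in itertools.combinations(ests, 2))
        if stable:
            keep.append(subset)
    return keep
```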
- Robust integration of external control data in randomized trials. Rickard Karlsson, Guanbo Wang, Piersilvio De Bartolomeis, Jesse H. Krijthe, and Issa J. Dahabreh. Biometrics, 2025.
One approach for increasing the efficiency of randomized trials is the use of "external controls" – individuals who received the control treatment studied in the trial during routine practice or in prior experimental studies. Existing external control methods, however, can be biased if the populations underlying the trial and the external control data are not exchangeable. Here, we characterize a randomization-aware class of treatment effect estimators in the population underlying the trial that remain consistent and asymptotically normal when using external control data, even when exchangeability does not hold. We consider two members of this class of estimators: the well-known augmented inverse probability weighting trial-only estimator, which is the efficient estimator when only trial data are used; and a potentially more efficient member of the class when exchangeability holds and external control data are available, which we refer to as the optimized randomization-aware estimator. To achieve robust integration of external control data in trial analyses, we then propose a combined estimator based on the efficient trial-only estimator and the optimized randomization-aware estimator. We show that the combined estimator is consistent and no less efficient than the most efficient of the two component estimators, whether the exchangeability assumption holds or not. We examine the estimators’ performance in simulations and we illustrate their use with data from two trials of paliperidone extended-release for schizophrenia.
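The efficient trial-only member of the class, the AIPW estimator, has a standard form; here is a minimal sketch with a known randomization probability (the outcome models `mu1_hat` and `mu0_hat` are assumed to be fit separately, e.g., with sample splitting; the optimized randomization-aware and combined estimators are not shown):

```python
import numpy as np

def aipw_trial_only(y, a, mu1_hat, mu0_hat, pi=0.5):
    """Augmented inverse probability weighting (AIPW) estimate of the
    average treatment effect from trial data alone.

    y, a             : outcomes and binary treatment indicators
    mu1_hat, mu0_hat : outcome-model predictions of E[Y | A=1, X]
                       and E[Y | A=0, X]
    pi               : known randomization probability P(A = 1)
    """
    term1 = a * (y - mu1_hat) / pi + mu1_hat
    term0 = (1 - a) * (y - mu0_hat) / (1 - pi) + mu0_hat
    return float(np.mean(term1 - term0))
```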
- Detecting critical treatment effect bias in small subgroups. Piersilvio De Bartolomeis, Javier Abad, Konstantin Donhauser, and Fanny Yang. Conference on Uncertainty in Artificial Intelligence (UAI), 2024.
Randomized trials are considered the gold standard for making informed decisions in medicine, yet they often lack generalizability to the patient populations in clinical practice. Observational studies, on the other hand, cover a broader patient population but are prone to various biases. Thus, before using an observational study for decision-making, it is crucial to benchmark its treatment effect estimates against those derived from a randomized trial. We propose a novel strategy to benchmark observational studies beyond the average treatment effect. First, we design a statistical test for the null hypothesis that the treatment effects, conditioned on a set of relevant features, differ up to some tolerance. We then estimate an asymptotically valid lower bound on the maximum bias strength for any subgroup in the observational study. Finally, we validate our benchmarking strategy in a real-world setting and show that it leads to conclusions that align with established medical knowledge.
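In symbols, the null can be written as follows (the notation here is assumed for illustration): with conditional effects \(\tau_{\mathrm{rct}}(x)\) and \(\tau_{\mathrm{obs}}(x)\) and tolerance \(\delta \ge 0\),

\[
H_0(\delta):\quad \big|\tau_{\mathrm{obs}}(x) - \tau_{\mathrm{rct}}(x)\big| \le \delta \quad \text{for all relevant } x,
\]

and rejecting \(H_0(\delta)\) for increasingly large \(\delta\) yields the lower bound on the maximum bias strength over subgroups.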
- Hidden yet quantifiable: A lower bound for confounding strength using randomized trials. Piersilvio De Bartolomeis*, Javier Abad*, Konstantin Donhauser, and Fanny Yang. International Conference on Artificial Intelligence and Statistics (AISTATS), 2024.
In the era of fast-paced precision medicine, observational studies play a major role in properly evaluating new drugs in clinical practice. Yet, unobserved confounding can significantly compromise causal conclusions from observational data. We propose a novel strategy to quantify unobserved confounding by leveraging randomized trials. First, we design a statistical test to detect unobserved confounding with strength above a given threshold. Then, we use the test to estimate an asymptotically valid lower bound on the unobserved confounding strength. We evaluate the power and validity of our statistical test on several synthetic and semi-synthetic datasets. Further, we show how our lower bound can correctly identify the absence and presence of unobserved confounding in a real-world setting.
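The second step, turning a family of level-\(\alpha\) tests into a lower bound, can be sketched as a simple scan; the test itself is treated as a black box here, and `p_value_at` is a hypothetical stand-in for it:

```python
def confounding_lower_bound(p_value_at, gammas, alpha=0.05):
    """Scan candidate confounding strengths gamma and test
    H0: 'confounding strength <= gamma' at level alpha.

    Because the nulls are nested (larger gamma is a weaker claim), the
    largest rejected gamma is a valid lower bound by test inversion.
    """
    rejected = [g for g in sorted(gammas) if p_value_at(g) < alpha]
    return max(rejected) if rejected else 0.0
```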
- Convex Reinforcement Learning in Finite Trials. Mirco Mutti, Riccardo De Santi, Piersilvio De Bartolomeis, and Marcello Restelli. Journal of Machine Learning Research (JMLR), 2023.
Convex Reinforcement Learning (RL) is a recently introduced framework that generalizes the standard RL objective to any convex (or concave) function of the state distribution induced by the agent’s policy. This framework subsumes several applications of practical interest, such as pure exploration, imitation learning, and risk-averse RL, among others. However, the previous convex RL literature implicitly evaluates the agent’s performance over infinite realizations (or trials), while most applications require excellent performance over a handful of trials, or even just one. To meet this practical demand, we formulate convex RL in finite trials, where the objective is any convex function of the empirical state distribution computed over a finite number of realizations. In this paper, we provide a comprehensive theoretical study of the setting, which includes an analysis of the importance of non-Markovian policies to achieve optimality, as well as a characterization of the computational and statistical complexity of the problem in various configurations.
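A compact way to state the two objectives (the notation is assumed for illustration): writing \(d^\pi\) for the state distribution induced by \(\pi\) and \(\hat d_n^\pi\) for the empirical state distribution over \(n\) realizations,

\[
\zeta_\infty(\pi) = F\big(d^\pi\big), \qquad \zeta_n(\pi) = \mathbb{E}\big[F\big(\hat d_n^\pi\big)\big].
\]

Since \(\mathbb{E}[\hat d_n^\pi] = d^\pi\), Jensen's inequality gives \(F(d^\pi) \le \mathbb{E}[F(\hat d_n^\pi)]\) for convex \(F\); the two objectives coincide in the classic linear (scalar-reward) case and as \(n \to \infty\), but not in general.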
- Challenging Common Assumptions in Convex Reinforcement Learning. Mirco Mutti*, Riccardo De Santi*, Piersilvio De Bartolomeis, and Marcello Restelli. Advances in Neural Information Processing Systems (NeurIPS), 2022.
The classic Reinforcement Learning (RL) formulation concerns the maximization of a scalar reward function. More recently, convex RL has been introduced to extend the RL formulation to all the objectives that are convex functions of the state distribution induced by a policy. Notably, convex RL covers several relevant applications that do not fall into the scalar formulation, including imitation learning, risk-averse RL, and pure exploration. In classic RL, it is common to optimize an infinite trials objective, which accounts for the state distribution instead of the empirical state visitation frequencies, even though the actual number of trajectories is always finite in practice. This is theoretically sound since the infinite trials and finite trials objectives are equivalent and thus lead to the same optimal policy. In this paper, we show that this hidden assumption does not hold in convex RL. In particular, we prove that erroneously optimizing the infinite trials objective in place of the actual finite trials one, as it is usually done, can lead to a significant approximation error. Since the finite trials setting is the default in both simulated and real-world RL, we believe shedding light on this issue will lead to better approaches and methodologies for convex RL, impacting relevant research areas such as imitation learning, risk-averse RL, and pure exploration among others.
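A toy numerical check of this gap, assuming an i.i.d. state process instead of a full MDP (purely illustrative): with negative entropy as the convex objective \(F\), the finite-trials value strictly exceeds the infinite-trials one at short trajectory lengths.

```python
import numpy as np

rng = np.random.default_rng(0)

def neg_entropy(d):
    """Convex F: negative entropy of a distribution (0 log 0 := 0)."""
    d = d[d > 0]
    return float(np.sum(d * np.log(d)))

p = np.array([0.5, 0.5])           # true state distribution
T = 10                             # states visited per trajectory

infinite_trials = neg_entropy(p)   # F(d^pi)

# Finite-trials objective with a single trial (n = 1): E[F(d_hat)].
values = []
for _ in range(20_000):
    visits = rng.choice(2, size=T, p=p)
    d_hat = np.bincount(visits, minlength=2) / T
    values.append(neg_entropy(d_hat))
finite_trials = float(np.mean(values))

# Jensen's inequality: E[F(d_hat)] >= F(E[d_hat]) = F(p) for convex F.
print(infinite_trials, finite_trials)   # approx. -0.693 vs roughly -0.64
```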
workshops
- How robust accuracy suffers from certified training with convex relaxations. Piersilvio De Bartolomeis, Jacob Clarysse, Amartya Sanyal, and Fanny Yang. I Can’t Believe It’s Not Better Workshop (NeurIPS), Oral, 2023.
Adversarial attacks pose significant threats to deploying state-of-the-art classifiers in safety-critical applications. Two classes of methods have emerged to address this issue: empirical defences and certified defences. Although certified defences come with robustness guarantees, empirical defences such as adversarial training enjoy much higher popularity among practitioners. In this paper, we systematically compare the standard and robust error of these two robust training paradigms across multiple computer vision tasks. We show that in most tasks and for both \(\ell_\infty\)-ball and \(\ell_2\)-ball threat models, certified training with convex relaxations suffers from worse standard and robust error than adversarial training. We further explore how the error gap between certified and adversarial training depends on the threat model and the data distribution. In particular, besides the perturbation budget, we identify as important factors the shape of the perturbation set and the implicit margin of the data distribution. We support our arguments with extensive ablations on both synthetic and image datasets.
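For context, the robust error under an \(\ell_\infty\)-ball threat model is typically estimated with a projected gradient attack; a minimal PyTorch sketch of this standard evaluation recipe (not the paper's exact protocol) is:

```python
import torch
import torch.nn.functional as F

def pgd_robust_error(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Fraction of inputs misclassified after a PGD attack constrained
    to the l_inf ball of radius eps around each input."""
    model.eval()
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Ascent step on the loss, then project back into the threat set.
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)
        x_adv = x_adv.clamp(0.0, 1.0)   # keep a valid pixel range
    with torch.no_grad():
        preds = model(x_adv).argmax(dim=1)
    return (preds != y).float().mean().item()
```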
- Enhancing Unit-tests for Invariance Discovery. Piersilvio De Bartolomeis, Antonio Orvieto, and Giambattista Parascandolo. Spurious Correlations, Invariance, and Stability Workshop (ICML), 2022.
Recently, Aubin et al. (2021) proposed a set of linear low-dimensional problems to precisely evaluate different types of out-of-distribution generalization. In this paper, we show that one of these problems can already be solved by established algorithms, simply through better hyperparameter tuning. We then propose an enhanced version of the linear unit-tests. Within the limits of our hyperparameter search and the set of algorithms evaluated, AND-mask is the best-performing algorithm on this new suite of tests. Our findings on synthetic data are further reinforced by experiments on an image classification task where we introduce spurious correlations.