Tao of RWD Blog

Dynamic Treatment Regimes with RL, Part II: Optimizing Pessimism

Abstract

Dynamic treatment regimes (DTRs) estimated from observational data can produce overconfident recommendations when patients fall outside the training distribution. This article gives an overview of an algorithm introduced by Zhou et al. (2022) that applies the pessimism principle, a strategy from offline reinforcement learning that penalizes treatment estimates in proportion to model uncertainty. We present methods for quantifying that uncertainty, including resampling techniques and Bayesian linear basis models. The result is a complete pessimistic DTR algorithm that favors treatments with reliable evidence over those with potentially higher but unverifiable estimated rewards, with theoretical guarantees on regret scaling.
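To make the pessimism principle concrete before the details, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the algorithm of Zhou et al.: it assumes we already have several resampled (for example, bootstrap) estimates of each treatment's reward, and it uses a simple "mean minus c times standard deviation" penalty. The function name `pessimistic_choice`, the penalty weight `c`, and the numbers in `q_samples` are all hypothetical.

```python
import numpy as np

def pessimistic_choice(q_samples, c=1.0):
    """Pick the treatment with the best penalized (lower-bound) estimate.

    q_samples: array of shape (n_resamples, n_treatments), where each row is
    one resampled estimate of every treatment's reward (assumed given).
    c: hypothetical penalty weight; larger values are more pessimistic.
    """
    q_samples = np.asarray(q_samples, dtype=float)
    mean = q_samples.mean(axis=0)          # point estimate per treatment
    std = q_samples.std(axis=0, ddof=1)    # spread across resamples = uncertainty
    lower = mean - c * std                 # penalize in proportion to uncertainty
    return int(np.argmax(lower)), lower

# Made-up estimates for two treatments across four resamples.
# Treatment 0 is stable; treatment 1 has a higher average but varies widely.
q_samples = np.array([
    [1.0, 3.0],
    [1.1, -1.0],
    [0.9, 2.5],
    [1.0, 0.3],
])

greedy = int(np.argmax(q_samples.mean(axis=0)))   # ignores uncertainty
choice, lower = pessimistic_choice(q_samples, c=1.0)
print("greedy:", greedy, "pessimistic:", choice)
```

In this toy data the greedy rule prefers treatment 1 because its average estimate is higher, while the penalized rule prefers treatment 0 because its estimate is far more stable across resamples.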

About the Authors

Yunzhe (Jeff) Zhou, Voleon Group|Researcher in reinforcement learning and causal inference, with a focus on novel methods for identifying optimal treatment strategies in dynamic treatment regimes.

Aimee Harrison|Aimee Harrison (BS, MFA) co-maintains Tao of RWD and works as a product manager supporting real-world evidence study design tools at Navidence.

Andy Wilson|Andy Wilson (PhD, MStat) is Founder and Principal of The Tao of RWD and Adjunct Professor at the University of Utah School of Medicine. Andy bridges cutting-edge causal inference methodology with practical application in regulatory and healthcare settings. With over 100 peer-reviewed publications and a decade of experience in pharmacoepidemiology and real-world evidence, he focuses on helping organizations move beyond correlation to understand true cause-and-effect relationships. He currently teaches PBHLT 7115: Causal Methods in Public Health at the University of Utah and has presented alongside leaders at the FDA, EMA, and the American Causal Inference Conference.
