Dynamic Treatment Regimes with RL, Part II: Optimizing Pessimism
Abstract
Dynamic treatment regimes (DTRs) estimated from observational data can produce overconfident recommendations when patients fall outside the training distribution. This article gives an overview of an algorithm introduced by Zhou et al. (2023) that uses the pessimism principle, a strategy from offline reinforcement learning that penalizes treatment estimates in proportion to model uncertainty. We present methods for quantifying that uncertainty, including resampling techniques and Bayesian linear basis models. The result is a complete pessimistic DTR algorithm that favors treatments with reliable evidence over those with potentially higher but unverifiable estimated rewards, with theoretical guarantees on regret scaling.
About the Authors
Yunzhe (Jeff) Zhou, Voleon Group|Researcher in reinforcement learning and causal inference, with a focus on novel methods for identifying optimal treatment strategies in dynamic treatment regimes.
Aimee Harrison|Aimee Harrison (BS, MFA) co-maintains Tao of RWD, and works as a product manager supporting real-world evidence study design tools at Navidence.
Andy Wilson|Andy Wilson (PhD, MStat), Founder and Principal of The Tao of RWD and Adjunct Professor at the University of Utah School of Medicine. Andy bridges cutting-edge causal inference methodology with practical application in regulatory and healthcare settings. With over 100 peer-reviewed publications and a decade of experience in pharmacoepidemiology and real-world evidence, he focuses on helping organizations move beyond correlation to understand true cause-and-effect relationships. He currently teaches PBHLT 7115: Causal Methods in Public Health at the University of Utah and has presented alongside leaders at the FDA, EMA, and the American Causal Inference Conference.
Section 1
Introduction
Picking Up Where We Left Off
In Part I, we introduced dynamic treatment regimes (DTRs) as sequential treatment strategies that adapt to a patient's evolving condition. Using reinforcement learning (RL) foundations, we showed how to estimate optimal DTRs from observational data via backward induction and Q-learning. We also identified a critical failure mode: when the model encounters patients unlike those in the training data, it extrapolates, producing confident-looking recommendations built on thin evidence.
This article develops the pessimism principle, a strategy from offline RL that proposes a solution to this deficiency by adjusting treatment recommendations to account for the model's uncertainty. When the model is confident, it trusts the estimate. When the model is guessing, it penalizes the recommendation. The result is a DTR that is cautious where it should be and ambitious where the evidence supports it.
This article assumes familiarity with the material from Part I, including DTRs, Q-learning, the value and Q-functions, and the extrapolation problem. If those concepts aren't fresh, we recommend starting with Part I.
A DTR is a sequence of decision rules that adapt treatment to a patient's evolving state — it is an RL policy
A policy maps states to actions; the optimal policy maximizes expected cumulative reward
The value function summarizes how good a state is; the Q-function evaluates each available action
The Bellman equation connects values recursively, enabling backward induction from the final stage
Fitted Q-iteration learns Q-functions from observational data stage-by-stage, working backward to estimate the optimal DTR
Three causal assumptions (consistency, positivity, sequential exchangeability) must hold for observational DTR estimates to be valid
The extrapolation problem: when a patient's features fall outside the training data, the model's predictions become unreliable — confident-looking estimates can mask enormous uncertainty
Section 2
The Pessimism Principle
When Uncertain, Choose the Treatment You're Sure Is At Least Decent
The Problem, Restated
In Part I, we followed an example patient, Homer, who is undergoing multi-stage treatment for sepsis. Homer arrives in the ICU with an unusual combination of features: high lactate, a specific comorbidity profile including cystic fibrosis, and a SOFA score of 8. At intake, clinicians choose an initial fluid and vasopressor regimen based on this profile. Every four hours after that, they reassess his lab values and adjust his treatment. Homer's profile places him in a sparse region of the training data, which is exactly the situation where naive DTR estimation breaks down.
The figure below plots historical sepsis patients in a simplified two-dimensional feature space. Each dot represents a patient (blue for those who received Treatment A, red for Treatment B, a gold star for Homer). The dashed circle marks Homer's neighborhood, indicating patients with similar features whose outcomes inform the model's predictions for him.
Notice how few dots fall within Homer's neighborhood. Only a handful of patients in the training data share his unusual combination of features, and those few mostly received Treatment A. For Treatment B, we have essentially no direct evidence, and the model must extrapolate from patients quite different from Homer. The whisker plots at the bottom of the figure show the range of plausible treatment effects at Homer's location, with a wider interval indicating the model has less evidence and more uncertainty about the true reward. The confidence intervals in the figure quantify this gap. Treatment A's interval is moderate (limited but real evidence), while Treatment B's interval is very wide (the model is guessing).
This demonstrates the crux of the extrapolation problem. A naive optimizer would recommend Treatment B because it has the higher point estimate (the red dot sits further right than the blue). But that estimate is built on extrapolation, not evidence. Should the clinical team trust a recommendation when the model has never seen a patient like Homer receive that treatment?
The Key Idea
The pessimism principle offers Homer's clinical care team an alternative strategy for selecting a treatment policy. Instead of picking the treatment with the highest estimated reward, they subtract a penalty proportional to the model's uncertainty about each estimate. Treatments where the model is extrapolating from sparse data get penalized. Treatments where the model has real evidence retain their value. The system prefers treatments it's confident are at least decent over treatments it hopes might be great but cannot verify.
Recall Homer's clinical team's assistant, Pod, introduced in Part I. After learning about the pessimism principle, Pod decides to put this idea to the test using patient data from the scatter plot above. Pod introduces a pessimism dial that controls how aggressively the model penalizes uncertainty. At zero, Pod trusts the model's point estimates at face value (representing the naive approach from Part I). As Pod turns the dial up, treatments with wide confidence intervals get penalized more heavily. Try the slider below to follow along as Pod adjusts the pessimism level and watches the treatment policy recommendation change.
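[Interactive widget: The Pessimism Principle. Adjust the pessimism level to see how treatment recommendations change. The slider runs from 0 (optimistic, trusting the model) to 2.0 (very cautious). At a pessimism level of 0, Treatment A has est = 2.8, sd = 0.7, and an adjusted score of 2.80; Treatment B has est = 3.8, sd = 1.9, and an adjusted score of 3.80, and is the recommended treatment.]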
Notice the crossover. At low pessimism, Treatment B wins because it has the higher average predicted reward. But as Pod turns the dial up, the penalty for uncertainty grows. Treatment B's penalty grows much faster than Treatment A's, because the model has so much less evidence for it. At some point, Treatment B's adjusted score drops below Treatment A's, and Pod's recommendation flips. The math behind this is straightforward: for each treatment, take the model's predicted reward and subtract the pessimism level multiplied by how uncertain the model is about that prediction. A treatment with a slightly lower average but tight uncertainty can beat a treatment with a higher average but enormous uncertainty. Pod is no longer chasing the highest point estimate, but is instead choosing the treatment whose expected reward holds up even in a pessimistic scenario.
The pessimism principle produces a lower confidence bound on the expected reward. For each treatment, this approach takes the model's predicted reward and subtracts a penalty proportional to the model's uncertainty about that prediction. The more uncertain the model, the larger the penalty, and the lower the adjusted reward. Rather than optimizing the best guess, we optimize a conservative estimate: the reward we're fairly confident the treatment will achieve, even in a pessimistic scenario.
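The dial can be reproduced in a few lines. The sketch below uses the estimates and standard deviations shown in the interactive example (2.8 and 0.7 for Treatment A, 3.8 and 1.9 for Treatment B); the function names and the name `lam` for the pessimism level are illustrative choices, not taken from the source.

```python
# (point estimate, standard deviation) for each treatment, from the example above.
treatments = {"A": (2.8, 0.7), "B": (3.8, 1.9)}

def pessimistic_score(est, sd, lam):
    """Lower confidence bound: estimate minus lam times the standard deviation."""
    return est - lam * sd

def recommend(lam):
    """Return the treatment with the highest pessimistic score at dial setting lam."""
    return max(treatments, key=lambda name: pessimistic_score(*treatments[name], lam))

for lam in (0.0, 0.5, 1.0, 2.0):
    scores = {name: round(pessimistic_score(est, sd, lam), 2)
              for name, (est, sd) in treatments.items()}
    print(f"lam={lam:.1f}  scores={scores}  recommended={recommend(lam)}")

# Dial setting at which the two scores tie: gap in estimates / gap in sds.
crossover = (3.8 - 2.8) / (1.9 - 0.7)
print(f"recommendation flips at lam = {crossover:.2f}")
```

With these numbers the scores tie at a pessimism level of about 0.83, so any dial setting above that favors Treatment A.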
Why Pessimism, Not Optimism?
If you've encountered the phrase "optimism in the face of uncertainty," you may wonder why we're doing the opposite. In online RL, where an agent can experiment and observe outcomes after alternate actions, optimism makes sense. Here, optimism encourages exploration: try the uncertain option, learn from the result, update the model. This is how a robot learns to walk (try, fall, adjust) or how a game-playing agent discovers winning strategies.
Clinical DTR estimation is almost never online. Instead, the model learns from a fixed dataset of historical patient records, with no ability to experiment. Exploration means potentially giving Homer an untested treatment to see what happens, which is an unacceptable risk. In this offline setting, the principled response to uncertainty is caution. We can't learn our way out of the data gaps, but we can make responsible decisions within them.
The pessimism principle, as described in Zhou et al. (2023), formalizes this intuition and provides theoretical guarantees. The pessimistic policy's regret (how much worse it performs compared to the true optimal policy) scales with the actual uncertainty in the data, not with the model's unchecked predictions.
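$\hat{Q}(s, a)$: estimated reward for treatment $a$ in state $s$
$\hat{\sigma}(s, a)$: standard deviation of the prediction (model uncertainty)
$\lambda$: pessimism parameter controlling conservatism
$\hat{\pi}(s) = \arg\max_{a} \left[ \hat{Q}(s, a) - \lambda\, \hat{\sigma}(s, a) \right]$: pessimistic decision rule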
Zhou et al. (2023), "Pessimistic DTR via Bayesian Learning" — formalizes the pessimism principle for DTR estimation and provides theoretical regret guarantees.
Section 3
Measuring Uncertainty
Understanding the Pessimism Penalty
The pessimism penalty is subtracted from the model's predicted reward to produce the conservative estimate. It is the product of two terms: (1) the pessimism parameter that we looked at in Section 2, which the analyst sets, and (2) the standard deviation of the model's prediction, which must be estimated from data. The pessimism parameter controls how much we penalize uncertainty globally. The standard deviation measures how uncertain the model actually is for a specific patient and treatment; it's a different number for every (patient, treatment) pair. This section presents three resampling-based methods to estimate that standard deviation. In Section 4, we will introduce a fundamentally different alternative, Bayesian methods.
Resampling-based Method 1: Bootstrap
Bootstrapping is a general-purpose statistical technique for estimating the variability of any quantity by resampling from observed data, conceptually similar to re-running a study many times with slightly different enrollment each time. In this context, we use it to estimate the standard deviation of the model's predicted reward (the uncertainty term in the pessimism penalty).
First, the existing data is split into a training set and a validation set. From the training data, a new sample of the same size is drawn with replacement. For example, if the training set contains 10,000 patients, the resample also contains 10,000, but some patients appear multiple times and others are left out. The model is fit to this resample, and the process is repeated many times (typically 200–1,000 resamples). Each resample produces a slightly different model.
Now, for any patient of interest, each of those resampled models is used to predict the reward under each available treatment. The spread of those predictions across resamples is the uncertainty estimate for that patient and treatment. Homer could be one such patient, but the procedure applies identically to every individual the clinical team wants to evaluate. A patient with a common profile will show tight, consistent predictions across resamples (low standard deviation, low penalty). Homer, with his unusual features, will show predictions that vary widely depending on which patients happened to land in each resample (high standard deviation, large penalty).
The plot below shows an example where each curve is a model refit from a different bootstrap resample. Where data is abundant, curves agree. Where data is sparse, they diverge. Uncertainty is highest exactly where it matters most: in the sparse-data region where the model is extrapolating.
Bootstrap is conceptually simple and broadly applicable. Its main limitation is computational cost. Fitting the model hundreds of times is expensive, especially for complex models (and each refit must be evaluated across every patient and treatment of interest). It can also underestimate uncertainty for some model classes, particularly deep ensembles and tree-based methods. The next method we will look at, jackknife, takes a more systematic approach.
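As a minimal sketch of the procedure, the code below bootstraps a deliberately simple model: a straight line fit by least squares to synthetic data. The data, the model, and the two query points are assumptions made for illustration, and the train/validation split is omitted for brevity. The query at 0.0 stands in for a common profile and the query at 4.0 for a patient far from the bulk of the training data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data (assumption): one feature, concentrated near 0.
n = 300
x = rng.normal(0.0, 1.0, size=n)
y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)

def fit_predict(x_train, y_train, x_new):
    """Fit a straight line by least squares and predict at x_new."""
    slope, intercept = np.polyfit(x_train, y_train, deg=1)
    return intercept + slope * x_new

def bootstrap_sd(x_train, y_train, x_new, n_resamples=500):
    """Standard deviation of the prediction at x_new across bootstrap refits."""
    preds = np.empty(n_resamples)
    for b in range(n_resamples):
        # Draw row indices with replacement; the resample has the original size.
        idx = rng.integers(0, len(x_train), size=len(x_train))
        preds[b] = fit_predict(x_train[idx], y_train[idx], x_new)
    return preds.std(ddof=1)

sd_common = bootstrap_sd(x, y, x_new=0.0)  # well-represented profile
sd_rare = bootstrap_sd(x, y, x_new=4.0)    # far from the bulk of the data
print(f"sd at common profile: {sd_common:.3f}, sd at rare profile: {sd_rare:.3f}")
```

The standard deviation at the rare profile comes out several times larger, which is the quantity the pessimism penalty multiplies.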
Resampling-based Method 2: Jackknife
Jackknife resampling is a systematic leave-one-out technique for estimating the variability of a statistical quantity. Unlike the bootstrap, which introduces randomness through resampling, the jackknife produces the same estimate every time.
First, a single patient is removed from the training data, and the model is refit on the remaining patients. The excluded patient is put back, a second patient is excluded, and the model is refit. This process is repeated until every patient in the training set has been left out once. If the training set contains 10,000 patients, this produces 10,000 slightly different models, each containing 9,999 patients and reflecting what the model would have learned without a single patient's data.
For any patient of interest, each of those refitted models is used to predict the reward under each available treatment. The variation in predictions across refits defines the uncertainty: how much does Homer's predicted reward change when patient #47 is excluded from the training data versus when patient #203 is? If the prediction is stable regardless of which patient is left out, confidence is high. If it shifts meaningfully, the model is sensitive to the composition of the training data, and the estimated uncertainty will be large. As with the bootstrap, this produces a different uncertainty estimate for every (patient, treatment) pair.
The jackknife has two advantages over the bootstrap: it's deterministic (fully reproducible, no fluctuation across runs) and it has well-understood bias correction properties for smooth estimators. However, it makes the cost problem worse. The bootstrap requires hundreds of refits; the jackknife requires one refit per patient in the training set. For a dataset of 10,000 patients with a complex model, that's 10,000 model fits, which is computationally prohibitive in most applied settings. The infinitesimal jackknife avoids this.
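The leave-one-out loop can be sketched the same way. As before, the synthetic data and the straight-line model are assumptions for illustration; the variance formula is the standard jackknife one, (n - 1)/n times the sum of squared deviations of the leave-one-out predictions from their mean.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic training data (assumption): one feature, concentrated near 0.
n = 200
x = rng.normal(0.0, 1.0, size=n)
y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)

def fit_predict(x_train, y_train, x_new):
    """Fit a straight line by least squares and predict at x_new."""
    slope, intercept = np.polyfit(x_train, y_train, deg=1)
    return intercept + slope * x_new

def jackknife_sd(x_train, y_train, x_new):
    """Jackknife standard deviation of the prediction at x_new."""
    n_obs = len(x_train)
    preds = np.empty(n_obs)
    for i in range(n_obs):
        keep = np.arange(n_obs) != i          # drop patient i, keep the rest
        preds[i] = fit_predict(x_train[keep], y_train[keep], x_new)
    var = (n_obs - 1) / n_obs * np.sum((preds - preds.mean()) ** 2)
    return np.sqrt(var)

sd_common = jackknife_sd(x, y, x_new=0.0)
sd_rare = jackknife_sd(x, y, x_new=4.0)
print(f"sd at common profile: {sd_common:.3f}, sd at rare profile: {sd_rare:.3f}")
```

Calling `jackknife_sd` twice on the same data returns the same number, since no random resampling is involved, but the loop performs one refit per training patient.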
Resampling-based Method 3: Infinitesimal Jackknife
The infinitesimal jackknife is an analytic approximation to the jackknife that replaces brute-force refitting with a mathematical shortcut. Rather than fully removing each patient and refitting the model thousands of times, it asks how sensitive the model's prediction is to a tiny change in each patient's weight in the training data. This sensitivity is formalized through influence functions. Influence functions measure how much a model's output would change if a single observation's contribution to the training process were infinitesimally increased or decreased. They approximate what the full jackknife would compute but require only a single model fit.
First, the model is fit to the full training dataset. Then, for any (patient, treatment) pair of interest, the influence of each training observation on that prediction is computed analytically. The spread of these influences gives the uncertainty estimate: the same (patient, treatment)-level standard deviation that the bootstrap and jackknife estimate through repeated refitting, but without the computational burden.
The practical payoff of using infinitesimal jackknife is significant. It works with random forests, gradient-boosted trees, neural networks, and other ensemble methods commonly used for treatment response modeling. The model is fit once, and the uncertainty estimate for every (patient, treatment) pair is extracted in closed form. This computational simplicity is what makes pessimistic DTRs feasible for real clinical datasets with tens of thousands of patients.
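To make the idea concrete, the sketch below swaps in ordinary least squares, a model for which the influence of each observation has a simple closed form. This is an assumption chosen for brevity and is not the tree-based setting of Wager, Hastie, and Efron. For least squares, the derivative of a prediction with respect to observation i's weight is the prediction vector times the inverse Gram matrix times that observation's features times its residual, and the infinitesimal-jackknife variance is the sum of those squared derivatives. A brute-force jackknife is included only to check the approximation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic training data (assumption): one feature plus an intercept.
n = 200
x = rng.normal(0.0, 1.0, size=n)
y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])

# Single fit on the full data.
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

def ij_sd(x_new):
    """Infinitesimal-jackknife standard deviation of the prediction at x_new."""
    x0 = np.array([1.0, x_new])
    # Change in the prediction per unit change in each observation's weight.
    influence = (x0 @ XtX_inv @ X.T) * resid
    return np.sqrt(np.sum(influence ** 2))

def jackknife_sd(x_new):
    """Brute-force leave-one-out jackknife, for comparison only."""
    x0 = np.array([1.0, x_new])
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        coef = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        preds[i] = x0 @ coef
    return np.sqrt((n - 1) / n * np.sum((preds - preds.mean()) ** 2))

print(f"IJ sd at 4.0: {ij_sd(4.0):.3f}, jackknife sd at 4.0: {jackknife_sd(4.0):.3f}")
```

The two estimates agree closely here, but the infinitesimal version needs one fit instead of 200.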
Bootstrap
Resample the data with replacement many times. Each resample gives a slightly different model estimate. The spread of these estimates = uncertainty.
Imagine re-running the study many times.
Jackknife
Leave out one patient at a time, re-fit the model each time. How much does the estimate change? That sensitivity = uncertainty.
Leave one out, see what changes.
Infinitesimal Jackknife
Instead of removing a full patient, down-weight each by an infinitesimal amount. Works with random forests, neural nets, and ensemble models.
The limit of tiny perturbations.
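$\hat{\sigma}^2_{\text{boot}} = \frac{1}{B-1} \sum_{b=1}^{B} \left( \hat{Q}_b - \bar{Q} \right)^2$: bootstrap variance, the variance across the $B$ resampled model fits ($\hat{Q}_b$ is the prediction from resample $b$ and $\bar{Q}$ is their mean)
$\hat{\sigma}^2_{\text{jack}} = \frac{n-1}{n} \sum_{i=1}^{n} \left( \hat{Q}_{(-i)} - \bar{Q}_{(\cdot)} \right)^2$: jackknife variance, the leave-one-out sensitivity
$\hat{Q}_{(-i)}$: prediction with patient $i$ removed
$\bar{Q}_{(\cdot)}$: average across all leave-one-out predictions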
Wager, Hastie, & Efron (2014), "Confidence Intervals for Random Forests" — develops the infinitesimal jackknife for tree-based models.
Efron (2014), "Estimation and Accuracy after Model Selection" — broader treatment of post-selection inference and influence functions.
Section 4
The Bayesian Approach
Instead of One Estimate, Get a Whole Distribution of Possibilities
All three resampling methods share a common logic: fit a model, then run a separate procedure to probe how stable it is. The model itself has no concept of its own uncertainty; that must be estimated through resampling. Bayesian machine learning takes a fundamentally different approach. A Bayesian model learns a full probability distribution over its parameters, so that fitting the model and estimating uncertainty are the same computation. When you make a prediction for a single patient (like Homer), you don't just get a point estimate but a distribution of possible outcomes. The width of that distribution is the uncertainty, exactly the standard deviation that plugs into the pessimism penalty.
The Bayesian Linear Basis Model
Bayesian modeling is a statistical framework in which model parameters are treated not as fixed unknown values but as quantities with their own probability distributions. For example, rather than estimating that a particular treatment improves the SOFA score by exactly 1.5 points, a Bayesian model might estimate that the improvement is distributed around 1.5 with a spread that reflects how confident the model is in that number. In a Bayesian model, the distribution before observing data is called the prior: it encodes what we believe (or assume) before seeing any evidence. After observing data, the prior is updated into a posterior distribution that reflects both the prior beliefs and what the data supports.
The specific Bayesian model used in this framework is the Bayesian Linear Basis Model (BLBM), which extends Bayesian linear regression to capture nonlinear relationships between patient features and outcomes while keeping the posterior simple enough to compute exactly, with no simulation or iterative approximation required. Both the mean prediction and the standard deviation at any (patient, treatment) pair can be extracted directly from the posterior, giving the pessimism penalty everything it needs in a single model fit.
The logic has three steps. First, a prior is specified, which provides a belief about the model's parameters before seeing any data. The prior might encode domain knowledge ("treatment effects for sepsis are likely moderate, not extreme") or simply express broad uncertainty ("we don't have strong expectations"). Second, the data is observed. In this step, the model is trained on the existing data (the recorded trajectories of past sepsis patients, their treatments, and their outcomes). Third, the prior is updated into a posterior distribution using Bayes' rule, which combines what we believed before with what the data supports. This process is only performed once. Unlike the resampling methods, there is no repeated refitting, and the posterior is computed in a single pass over the training data. The uncertainty estimate for every (patient, treatment) pair follows directly from that single pass.
The mechanics of how the posterior is calculated are intuitive. Each possible set of parameter values has some prior plausibility and some compatibility with the observed data (measured by the likelihood: how probable is the data we actually saw, if these parameters were the truth?). Bayes' rule multiplies these together. Parameter values that were plausible and consistent with the data receive high posterior probability. Values that contradict the data are downweighted, even if the prior favored them. The result is a distribution that concentrates on the parameter values the evidence actually supports. Where the training data is abundant, the posterior distribution is concentrated around a narrow range of predictions, indicating high confidence. Where the data is sparse, the posterior spreads across a wide range, indicating that the model cannot distinguish between very different possible outcomes.
For Homer, this distinction plays out directly. Common patient profiles (moderate SOFA, standard comorbidities, well-represented in the training data) produce tight posteriors and reliable predictions. Homer, with his unusual lactate and rare comorbidity profile, produces a wide posterior. The model's uncertainty about his predicted reward under each treatment is large, which is exactly the information the pessimism penalty needs.
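The closed-form update can be sketched with the textbook Bayesian linear basis-function regression from Bishop (2006, Chapter 3). Everything specific below is an assumption for illustration: the synthetic one-dimensional data, the Gaussian radial basis functions and their centers and width, the zero-mean Gaussian prior with precision `alpha`, and a known noise precision `beta`. The source does not specify the BLBM's basis or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data (assumption): observations only between -2 and 2.
x = rng.uniform(-2.0, 2.0, size=100)
y = np.sin(x) + rng.normal(0.0, 0.3, size=100)

centers = np.linspace(-4.0, 6.0, 21)  # hypothetical basis-function centers
width = 1.0                           # hypothetical basis-function width
alpha = 1.0                           # prior precision: weights ~ N(0, I / alpha)
beta = 1.0 / 0.3 ** 2                 # noise precision, assumed known

def basis(x_values):
    """Constant column plus Gaussian radial basis features."""
    x_values = np.atleast_1d(np.asarray(x_values, dtype=float))
    rbf = np.exp(-0.5 * ((x_values[:, None] - centers[None, :]) / width) ** 2)
    return np.column_stack([np.ones(len(x_values)), rbf])

# Posterior over the weights: covariance S and mean m, computed once.
Phi = basis(x)
S = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
m = beta * S @ Phi.T @ y

def posterior_mean_sd(x_new):
    """Posterior mean and standard deviation of the regression function."""
    phi = basis(x_new)
    mean = phi @ m
    sd = np.sqrt(np.einsum("ij,jk,ik->i", phi, S, phi))
    return mean, sd

mean_c, sd_c = posterior_mean_sd(0.0)  # inside the data: tight posterior
mean_r, sd_r = posterior_mean_sd(5.0)  # far outside the data: wide posterior
print(f"sd inside data: {sd_c[0]:.3f}, sd outside data: {sd_r[0]:.3f}")
```

Both numbers come from the same single fit. The query far from the data returns a much wider posterior, which is the behavior described for Homer.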
The widget below simulates this process. Click “Observe Patient” to feed training records into the model one at a time and watch the posterior update. Then toggle between the two profiles to see how the same model, trained on the same data, produces different levels of confidence for different patients.
[Interactive widget: Bayesian Belief Updater. How training data shapes confidence in Homer's predicted reward. Before any training data is observed, the prior reflects pure uncertainty about the treatment's reward.]
Why Use BLBM?
Three practical properties distinguish BLBM from the resampling methods introduced in Section 3.
Coherent uncertainty propagation. In a multi-stage DTR, uncertainty at stage 1 propagates to stage 2: if we're unsure about the best first treatment, that uncertainty affects which second-stage state the patient reaches, which in turn affects the second-stage uncertainty. BLBM handles this naturally, because the posterior is a single probabilistic model that carries uncertainty forward across stages. Resampling methods don't have this property. Each bootstrap or jackknife refit produces a point estimate at each stage, and combining uncertainty across stages requires additional, ad hoc procedures that can lose coherence.
Prior knowledge. If domain experts have beliefs about plausible treatment effect sizes or dangerous parameter regions, the prior can encode them directly. A clinician who knows that vasopressor effects above a certain magnitude are physiologically implausible can build that into the model before seeing any data. The resampling methods have no natural mechanism for incorporating prior knowledge: they operate entirely on the observed data, with no way to express "we believe effects in this range are unlikely."
Computational tractability. The BLBM posterior is available in a single computation, so inference is fast. Bootstrap requires hundreds of refits, jackknife requires one per patient, and infinitesimal jackknife requires a separate influence computation for each (patient, treatment) pair. For multi-stage DTRs, where backward induction iterates the estimation across stages, these costs multiply. BLBM computes the mean and standard deviation at any (patient, treatment) pair directly from the posterior, making it practical to run backward induction across thousands of patients and multiple stages.
Coherent Uncertainty
Uncertainty at stage 1 flows naturally into stage 2 and beyond. The posterior carries it forward across stages — no ad hoc stitching required.
One model, all stages.
Prior Knowledge
Encode domain expertise directly: "vasopressor effects above this magnitude are implausible." Resampling methods have no mechanism for this.
Tell the model what you already know.
Single Computation (Computational Tractability)
BLBM computes mean and uncertainty in one pass. No hundreds of bootstrap refits, no per-patient leave-one-out loops.
One model fit gives both the prediction and its uncertainty.
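$p(w)$: prior distribution over parameters $w$
$p(w \mid \mathcal{D})$: posterior distribution after observing data $\mathcal{D}$
$\hat{Q}(s, a)$: posterior mean prediction (expected reward)
$\hat{\sigma}(s, a)$: posterior standard deviation (uncertainty)
$\Phi$: basis function matrix (features transformed for BLBM)
$\hat{\pi}(s) = \arg\max_{a} \left[ \hat{Q}(s, a) - \lambda\, \hat{\sigma}(s, a) \right]$: pessimistic policy using Bayesian posterior, with pessimism parameter $\lambda$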
Bishop (2006), Pattern Recognition and Machine Learning, Chapter 3 — thorough treatment of Bayesian linear regression and basis function models.
Hahn, Murray, & Carvalho (2020), "Bayesian Regression Tree Models for Causal Inference" — Bayesian methods applied to treatment effect estimation.
Section 5
The Full Algorithm
Piecing Together What We've Learned from DTR, RL, and the Pessimism Parameter
We now have all the pieces for pessimistic DTR. Backward induction from Part I tells us how to structure optimization, the pessimism principle from Section 2 tells us how to handle uncertainty, and the Bayesian approach from Section 4 gives us a principled uncertainty estimate. This section assembles these pieces into a complete algorithm, then shows how the algorithm can be applied step by step to determine a patient's treatment rules.
The Algorithm
Step 1: Fit the BLBM
Fit the BLBM to the observed patient trajectories to obtain the posterior distribution over model parameters.
Step 2: Extract the Estimated Q-Value and Standard Deviation
Extract the estimated Q-value and standard deviation from the posterior at every (patient, treatment) pair of interest. The estimated Q-value is the posterior mean: the model's best prediction of the expected reward for a given patient receiving a given treatment. The standard deviation measures how uncertain the model is about that prediction.
Step 3: Construct the Pessimistic Q-Value
Construct the pessimistic Q-value for each (patient, treatment) pair by subtracting the pessimism parameter times the standard deviation from the estimated Q-value. This is the lower confidence bound: the expected reward we are fairly confident the treatment will achieve, even accounting for the model's uncertainty. Treatments where the model is guessing see their Q-values pulled down. Treatments where the model is confident remain close to their original estimates.
Step 4: Apply Backward Induction Using the Pessimistic Q-Values
Apply backward induction using the pessimistic Q-values, following the same procedure from Part I. Start at the final stage: for each patient state, select the treatment with the highest pessimistic Q-value. Then work backward through earlier stages, computing the pessimistic value function at each stage (the expected reward under the pessimistic policy from that point forward). At each earlier stage, the pessimistic Q-values incorporate the pessimistic value of the stages that follow.
Step 5: Output the Pessimistic Optimal Policy
At each stage and state, the policy recommends the treatment with the highest pessimistic Q-value. The result is conservative by design: it may not achieve the absolute highest expected reward, but it avoids catastrophic recommendations driven by overconfident extrapolation in data-sparse regions.
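The five steps can be strung together in a compact two-stage sketch. This is an illustration under stated assumptions, not the implementation from Zhou et al.: the trajectories are synthetic, treatments are binary, the basis (intercept, state, treatment, interaction) is hypothetical, and the prior precision `alpha`, noise precision `beta`, and pessimism parameter `lam` are fixed by hand. It also treats the pessimistic stage-2 value as a fixed regression target at stage 1, which is a simplification.

```python
import numpy as np

rng = np.random.default_rng(4)
n, lam = 500, 1.0         # lam: pessimism parameter (illustrative value)
alpha, beta = 1.0, 4.0    # assumed prior and noise precisions

# Synthetic two-stage trajectories (assumption): states, treatments, rewards.
s1 = rng.normal(size=n)
a1 = rng.integers(0, 2, size=n)
r1 = 0.5 * s1 + 0.3 * a1 + rng.normal(scale=0.5, size=n)
s2 = 0.6 * s1 + 0.4 * a1 + rng.normal(scale=0.5, size=n)
a2 = rng.integers(0, 2, size=n)
r2 = s2 + a2 * (1.0 - s2) + rng.normal(scale=0.5, size=n)

def features(s, a):
    """Hypothetical basis: intercept, state, treatment, and their interaction."""
    s = np.asarray(s, dtype=float)
    a = np.broadcast_to(np.asarray(a, dtype=float), s.shape)
    return np.column_stack([np.ones_like(s), s, a, s * a])

def fit_posterior(Phi, y):
    """Step 1: closed-form Gaussian posterior (mean vector, covariance matrix)."""
    S = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
    return beta * S @ Phi.T @ y, S

def pessimistic_q(posterior, s, a):
    """Steps 2-3: posterior mean minus lam times posterior standard deviation."""
    m, S = posterior
    Phi = features(s, a)
    sd = np.sqrt(np.einsum("ij,jk,ik->i", Phi, S, Phi))
    return Phi @ m - lam * sd

def pessimistic_value(posterior, s):
    """Highest pessimistic Q-value over both treatments, and which one it is."""
    q = np.column_stack([pessimistic_q(posterior, s, a) for a in (0, 1)])
    return q.max(axis=1), q.argmax(axis=1)

# Step 4: backward induction. Fit the final stage first, then fit stage 1
# to the stage-1 reward plus the pessimistic stage-2 value.
post2 = fit_posterior(features(s2, a2), r2)
v2, _ = pessimistic_value(post2, s2)
post1 = fit_posterior(features(s1, a1), r1 + v2)

def policy(stage, s):
    """Step 5: recommend the treatment with the highest pessimistic Q-value."""
    post = post1 if stage == 1 else post2
    return pessimistic_value(post, np.atleast_1d(s))[1]

print("stage-2 recommendation at s = -1:", policy(2, -1.0)[0])
print("stage-2 recommendation at s = 3:", policy(2, 3.0)[0])
```

In the simulated stage-2 reward, treatment 1 helps only when the state is below 1, and the learned policy recovers that pattern at the two query states.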
Homer's Case: A Worked Example
Homer's clinical team has access to observational records from 500 sepsis patients admitted to their ICU over the previous two years. Each patient record contains states, actions, and outcomes across two stages. The dashboard applies the pessimistic DTR algorithm to help the team make an informed treatment decision for Homer.
Steps 1–2: Fit the BLBM and Extract Estimates
The model is trained on all 500 trajectories, then queried with Homer's stage-1 state (SOFA = 8, lactate = 8.4, age = 62, cystic fibrosis) and each available treatment. Treatment A (Standard) has an estimated Q-value of 6.1 with standard deviation 0.9. Treatment B (Aggressive) has an estimated Q-value of 7.3 with standard deviation 3.2.
Treatment B has the higher estimated Q-value, but its standard deviation is roughly 3.6 times Treatment A's (3.2 versus 0.9). Few patients in the training set share Homer's combination of high lactate and rare comorbidity, and most of those received Treatment A. The model's estimate for Treatment B under Homer's profile is largely extrapolation.
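Step 3: Construct the Pessimistic Q-Values
Each estimate is reduced by the pessimism parameter times its standard deviation. Because Treatment B's standard deviation is so much larger, its penalty is also much larger, and the pessimism penalty eliminates Treatment B's apparent advantage.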
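The comparison can be written out with the numbers from Steps 1–2. The notation is an assumption made here: $\lambda$ for the pessimism parameter and $\tilde{Q}$ for the pessimistic Q-value. Because the value of $\lambda$ used for Homer is not stated, the derivation solves for the threshold at which the ranking flips, and the last line uses $\lambda = 1$ purely as an illustrative setting.

```latex
\begin{aligned}
\tilde{Q}(A) &= 6.1 - 0.9\,\lambda, \qquad \tilde{Q}(B) = 7.3 - 3.2\,\lambda \\
\tilde{Q}(A) > \tilde{Q}(B)
  &\iff 6.1 - 0.9\,\lambda > 7.3 - 3.2\,\lambda
  \iff 2.3\,\lambda > 1.2
  \iff \lambda > \tfrac{1.2}{2.3} \approx 0.52 \\
\lambda = 1: \quad \tilde{Q}(A) &= 5.2, \qquad \tilde{Q}(B) = 4.1
\end{aligned}
```

Any pessimism parameter above roughly 0.52 therefore ranks Treatment A first, while smaller values leave Treatment B ahead.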
Steps 4–5: Backward Induction and the Recommendation
The pessimistic policy recommends Treatment A. The naive policy (pessimism parameter = 0) would have recommended Treatment B based on the higher estimated Q-value alone. The pessimistic approach recognizes that the model's confidence in Treatment B is low and defaults to the treatment that is well-supported by the data.
The large standard deviation for Treatment B also provides a secondary signal: the evidence base for patients like Homer is thin. The recommendation is to follow Treatment A, but to monitor closely, and to consider Homer a strong candidate for enrollment in a future adaptive trial that could fill the data gap.
Importantly, this is not a universal override. For a patient with a common profile and abundant training data, both protocols would have low standard deviations, the pessimism penalty would be small, and the pessimistic policy would agree with the naive policy. Pessimism only changes the recommendation when uncertainty is high enough to alter the ranking.
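$\tilde{Q}(s, a) = \hat{Q}(s, a) - \lambda\, \hat{\sigma}(s, a)$: pessimistic Q-value (lower confidence bound), where $\hat{Q}$ is the estimated Q-value, $\hat{\sigma}$ its standard deviation, and $\lambda$ the pessimism parameter
$\hat{\pi}(s) = \arg\max_{a} \tilde{Q}(s, a)$: pessimistic policy: pick highest lower bound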
Section 6
Bigger Picture
Frontiers and Limitations
What the Pessimism Framework Provides
The pessimistic DTR framework offers three practical contributions.
Safety guarantees. Under standard regularity conditions, the pessimistic policy provably limits worst-case regret: the gap between its performance and the true optimal policy scales with the actual uncertainty in the data, not with the model's unchecked predictions. A clinician using this framework can trust that recommendations are anchored in evidence, even for unusual patients.
Interpretable caution. The pessimism coefficient gives clinicians a transparent mechanism for adjusting the level of conservatism. It doesn't hide uncertainty inside a black box. If a hospital's risk tolerance is low — say, for a fragile patient population — they can increase the coefficient. If the data is rich and well-curated, they might reduce it. The tradeoff between ambition and safety is explicit and adjustable.
Model flexibility. The pessimism framework works with any model that provides uncertainty estimates. We have presented it with the BLBM, but the same logic extends to Bayesian neural networks, Gaussian processes, and ensemble methods paired with bootstrap or infinitesimal jackknife uncertainty. The pessimism formula is agnostic to the source of the uncertainty estimate.
Limitations
The framework is not without constraints, and honesty about them matters.
Conservatism can be excessive. If all available treatments have high uncertainty, because the patient is in a genuinely novel region of the state space, the pessimistic policy tends to default to whichever option has the lowest uncertainty, even if that option is mediocre. Being safe is not helpful if you are safely recommending an ineffective treatment. In extreme cases, pessimism can reduce to inaction.
Model misspecification. The Bayesian posterior is only as good as the model it describes. If the BLBM's basis functions cannot capture the true relationship between patient features, treatments, and outcomes, the uncertainty estimates may be miscalibrated — too narrow in some regions, too wide in others. A model that is confidently wrong is more dangerous than one that is honestly uncertain. In practice, model checking and validation against held-out data are essential complements to the Bayesian framework. Notably, the prior's influence on the posterior is strongest in exactly the data-sparse regions where the pessimism penalty is largest. This means the choice of prior can affect treatment recommendations for the patients who need the most caution. Sensitivity analysis across reasonable prior specifications is essential in applied settings.
Positivity remains necessary. The pessimism penalty partially compensates for weak positivity: it down-weights recommendations in data-sparse regions, which is often where positivity is weakest. But it does not replace the identification assumption. If a treatment was never observed for a patient subgroup, no amount of pessimism can estimate what would have happened. The causal assumptions from Part I (consistency, positivity, sequential exchangeability) must still hold for the framework to produce valid results.
Uncertainty compounds across stages. In DTRs with more than two stages, uncertainty at each stage propagates backward through the induction. By the time the algorithm reaches stage 1, the pessimism penalty incorporates uncertainty from every subsequent stage. The cumulative penalty can become very large, making the algorithm overly conservative at early decision points. This is a practical challenge for DTRs with many stages, such as chronic disease management over months or years.
[Interactive figure: Uncertainty Across Stages. How pessimism compounds in multi-stage DTRs.]
For a typical patient, adding stages barely increases the penalty. For Homer, whose data is sparse, the penalty compounds rapidly: by 5 stages it's so large that every treatment looks bad.
The vocabulary mapping between causal inference and RL reflects a genuine mathematical equivalence that enables researchers in both fields to build on each other's progress. The pessimism principle is one example: a tool from offline RL that addresses the fundamental problem of distribution shift in causal DTR estimation.