A Unified Framework for Sequential Clinical Decision-Making Using Causal Inference and Reinforcement Learning
Based on the book Optimal Control Using Causal Agents
Abstract
Dynamic treatment regimes (DTRs) describe sequenced treatment strategies that depend on individual features and adapt to patients' evolving conditions. In the first of two articles exploring DTRs, we bridge causal inference and reinforcement learning (RL), showing that a DTR is an RL policy and that finding the optimal DTR is equivalent to solving for the optimal policy. We introduce core RL concepts, including policies, value functions, and Q-learning, and map them to familiar causal and clinical language. Using a grid-based robot demonstration, we build intuition for backward induction and Q-function estimation from observational data. We conclude by identifying a critical limitation: model unreliability under distribution shift, motivating the pessimism principle developed in Part II.
About the Authors
MaryLena Bleile|Researcher specializing in causal inference and reinforcement learning. Her work on bridging these fields inspired our reformulation of the Causal Navigator to emphasize the foundational causal inference engine developed by Bareinboim and collaborators, a unified framework for understanding when and how causal effects can be identified from observational data.
Aimee Harrison|Aimee Harrison (BS, MFA) co-maintains Tao of RWD and works as a product manager supporting real-world evidence study design tools at Navidence.
Andy Wilson|Andy Wilson (PhD, MStat), Founder and Principal of The Tao of RWD and Adjunct Professor at the University of Utah School of Medicine. Andy bridges cutting-edge causal inference methodology with practical application in regulatory and healthcare settings. With over 100 peer-reviewed publications and a decade of experience in pharmacoepidemiology and real-world evidence, he focuses on helping organizations move beyond correlation to understand true cause-and-effect relationships. He currently teaches PBHLT 7115: Causal Methods in Public Health at the University of Utah and has presented alongside leaders at the FDA, EMA, and the American Causal Inference Conference.
Section 1
Introduction
Where Causal Methods Meet Reinforcement Learning
Imagine Homer arrives in the emergency department with suspected sepsis. His blood pressure is dropping. His lactate is elevated. The clinical team has to act fast. They assess Homer's severity using the Sequential Organ Failure Assessment (SOFA) score (which tracks dysfunction across organ systems).
Based on Homer's score at intake, the team makes their first treatment decision to start Homer on a combination of IV fluids and vasopressors. Every four hours after, they take new labs and reassess his score. If Homer has positively responded to treatment, they decide either to stay the course or reduce treatment; if he is not responding to treatment, they have to make the decision to escalate the vasopressors or switch the fluid strategy. Each decision the clinical team makes is based not just on Homer's current state, but also on the treatment he received in all prior rounds of treatment and how he responded.
This back-and-forth sequence (observe → decide → observe → decide) forms the essence of a dynamic treatment regime (DTR). Formally, a DTR is a sequence of treatment rules that adapt to a patient's evolving condition. The goal of personalized medicine is to find the optimal DTR, or the strategy that gives patients the best chance of recovery based on their unique characteristics.
DTRs arise in many subfields of medicine: ICU management (as in Homer's sepsis example; see also Zhang et al. 2020), addiction treatment (when to switch from medication to cognitive behavioral therapy; see Chakraborty 2011), cancer care (adaptive dosing based on tumor response; see Gluzman et al. 2020), and many other settings where treatment unfolds in stages and adapts to patient response.
In this two-part article, we introduce DTRs and develop a principled strategy for estimating safe treatment policies under uncertainty. Our inquiry sits at the intersection of two fields: (1) causal inference, which provides the framework for drawing valid conclusions from observational data, and (2) reinforcement learning (RL), which provides computational tools for sequential decision-making under uncertainty. We highlight the growing body of work on hybrid causal RL systems by emphasizing that a DTR is an RL policy, and by illustrating how the optimal DTR is a type of optimal RL policy. These explicit conceptual maps allow researchers in each field to directly leverage the other's tools and theoretical guarantees.
In Part I, we introduce RL foundations, formalize DTRs, and show how to learn optimal treatment strategies from observational data. To ground this synthesis, we borrow comparative approaches established more completely in MaryLena Bleile's Optimal Control Using Causal Agents. We highlight the problem of out-of-distribution sampling, a critical limitation that must be addressed when estimating a DTR: when our model encounters patients unlike those in the training data, its treatment recommendations become dangerously unreliable. In Part II, we develop a DTR adaptation of the pessimism principle (a strategy from offline RL for making safe treatment recommendations under uncertainty) and present a full DTR algorithm applying the pessimism principle using machine learning (ML). Our approach borrows from Yunzhe Zhou's work, in particular Optimizing Pessimism in Dynamic Treatment Regimes with Uncertainty Quantification: A Bayesian Learning Approach.
These articles assume familiarity with causal inference concepts (confounding, potential outcomes, the distinction between association and causation) but not with ML (or RL). If you don't yet have a background in causal inference methods, you can still follow the core ideas in this article, but we recommend keeping a causal inference reference nearby as foundational concepts arise. We also recommend readers review the freely available chapters in Optimal Control Using Causal Agents to deepen their foundations across both fields.
The RL decision-making framework consists of an agent that interacts with an environment over time. At each time step, the agent observes the current state of the environment, takes an action, and receives a reward. The environment then transitions to a new state, and the cycle repeats.
For Homer's sepsis case: the clinical team is the agent. Homer's physiology (including his SOFA score) is the state, which evolves over time. The treatment choice (including which fluids and which vasopressors) is the action. Homer's health outcome (how well his organs recover) is the reward. The environment is everything that governs how Homer's body responds to treatment, which the clinical team can observe but not fully control.
Policy
A policy is a rule that tells the agent which action to take in each state. A policy can be deterministic (“if SOFA > 8, start norepinephrine”) or stochastic (“start norepinephrine with probability 0.7 if SOFA > 8”). If you've worked with treatment assignment rules, exposure protocols, or clinical guidelines, you've already encountered policies. The RL vocabulary just gives them a generalized name.
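The two example rules above can be written as small functions. This is a minimal sketch, not clinical guidance: the "no vasopressor" fallback for SOFA ≤ 8 is an assumption added so the functions always return an action, and the function names are hypothetical.

```python
import random

def deterministic_policy(sofa: int) -> str:
    """Always returns the same action for a given state."""
    # Fallback action for SOFA <= 8 is an assumption for illustration.
    return "start norepinephrine" if sofa > 8 else "no vasopressor"

def stochastic_policy(sofa: int, rng: random.Random) -> str:
    """Returns an action drawn from a state-dependent distribution."""
    p_start = 0.7 if sofa > 8 else 0.0
    return "start norepinephrine" if rng.random() < p_start else "no vasopressor"

rng = random.Random(0)  # seeded so the run is reproducible
draws = [stochastic_policy(10, rng) for _ in range(10_000)]
share = draws.count("start norepinephrine") / len(draws)

print(deterministic_policy(10), "|", deterministic_policy(5))
print(f"share started under the stochastic policy: {share:.2f}")
```

The deterministic rule is a special case of the stochastic one in which every probability is 0 or 1.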
Value Function and Q-Function
Here's where RL introduces a concept that doesn't have an exact twin in standard causal inference: the value function. The value function asks: if I'm in state s and I follow policy π from here on, what's my expected total reward? It summarizes how “good” a state is under a given policy. For example, if Homer's SOFA score has dropped to 4 at the first reassessment point, the value function tells us how good that situation is for his overall health outcomes over time, not just at that isolated assessment point. Under the optimal policy, the value function accounts for the best treatment decisions we are able to make from that point on.
The Q-function asks a more specific question: if I'm in state s, I take action a right now, and then follow policy π afterward, what's my expected total reward? Unlike the value function, the Q-function is defined over all possible actions, not just the one the current policy selects. It allows us to evaluate the relative outcome of alternative choices at any state. For example, if we want to consider the relative benefits of continuing or stopping Homer's IV fluids at the third assessment point, the Q-function can tell us which choice leads to a better expected outcome for this state, assuming we'll act optimally from the next stage on.
The value function and the Q-function are connected recursively; that is, the value of a state depends on the value of the future states you can reach from it. This recursive relationship is defined through the Bellman equation, which is one of the critical pieces of mathematical machinery that makes the whole framework computationally tractable. We'll see it in action in section 4 when we compute values on an example grid.
Optimal Policy
The optimal policy is the policy that maximizes expected reward from every state. It allows us to determine the treatment strategy that gives the best expected outcome. Finding the optimal policy is the central goal of both RL and DTR estimation.
A Shared Vocabulary
If you're coming from causal inference, you may notice that the components of RL map onto concepts you already know. Below we offer an overview of the translation, which we will return to in more detail in section 7.
Reinforcement Learning | Causal Inference | Clinical Language
Policy (π) | Treatment regime / rule | Clinical protocol
State (s) | Time-varying covariates | Patient's current condition
Action (a) | Treatment / intervention | Treatment decision
Reward (r) | Outcome (potential) | Health outcome
Optimal policy (π*) | Optimal DTR | Best treatment strategy
Q-function Q(s,a), evaluated at s,a | Conditional treatment effect of a given s | Expected outcome given treatment
RL and causal DTR estimation are two sets of terminology and strategy formalizing a principled approach to the decision-making sequence: observe a state, choose an action, optimize an outcome. Hence, it is possible to import tools from offline RL into clinical treatment optimization, such as the pessimism approach we'll discuss in Part II.
s, a, r: state, action, reward
K: number of treatment stages
π(s): deterministic policy in state s
π(a | s): probability of action a in state s
V^π(s): value of state s under π
Q^π(s, a): value of action a in state s
π*: optimal policy
Section 3
Dynamic Treatment Regime Primer
A Foundation for Personalized Medicine
Overview
Now that we have the RL vocabulary, we can precisely define a DTR as a sequence of decision rules that maps a patient's evolving state and treatment history to the next action. In the language of RL, a DTR is a policy that operates over multiple time steps.
An individual's history at stage k is the record of all states and actions observed up to that point. The decision rule at stage k takes the individual's history at stage k as input and outputs an action. The optimal DTR is the sequence of decision rules that maximizes expected reward.
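As a rough sketch of these definitions, a history can be stored as a list of past (state, action) pairs plus the current state, and a DTR as one decision rule per stage. The thresholds, action labels, and names (`History`, `run_regime`) are hypothetical and chosen only to mirror Homer's example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

State = dict   # e.g. {"sofa": 8}
Action = str

@dataclass
class History:
    """All (state, action) pairs observed so far, plus the current state."""
    past: List[Tuple[State, Action]] = field(default_factory=list)
    current: State = field(default_factory=dict)

DecisionRule = Callable[[History], Action]

def rule_stage1(h: History) -> Action:
    # Hypothetical threshold, for illustration only
    return "fluids+vasopressors" if h.current["sofa"] >= 8 else "fluids"

def rule_stage2(h: History) -> Action:
    prev_state, prev_action = h.past[-1]
    improved = h.current["sofa"] < prev_state["sofa"]
    # Stay the course if the patient responded, otherwise escalate
    return prev_action if improved else "escalate vasopressors"

def run_regime(rules: List[DecisionRule], states: List[State]) -> List[Action]:
    """Apply one rule per stage to a sequence of observed states."""
    h = History()
    actions = []
    for rule, s in zip(rules, states):
        h.current = s
        a = rule(h)
        actions.append(a)
        h.past.append((s, a))
    return actions

dtr = [rule_stage1, rule_stage2]
responder = run_regime(dtr, [{"sofa": 8}, {"sofa": 6}])
non_responder = run_regime(dtr, [{"sofa": 8}, {"sofa": 10}])
print(responder)
print(non_responder)
```

Note that the stage-2 rule reads both the previous state and the previous action, which is what distinguishes a DTR from a rule that looks only at the current state.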
In practice, the clinical team doesn't know how each treatment will affect a given patient's trajectory. To estimate the optimal DTR, they rely on decision rules learned from observational data (the recorded trajectories of prior patients). The dataset contains each prior patient's states, actions, and outcomes across all observed stages. For example, Homer's clinical team could learn decision rules from the records of all sepsis patients previously admitted to that ICU, including their SOFA scores, treatments, and recovery outcomes at each assessment point.
Causal Assumptions
DTR estimation from observational data relies on three identification assumptions: consistency, positivity, and sequential exchangeability. If you've worked in causal inference, these conditions will be familiar, since they are the sequential versions of standard assumptions for estimating treatment effects. If the consistency, positivity, and exchangeability assumptions hold, then causal identification theory guarantees that the DTR estimated from observational data approaches the same optimal policy we'd find in a sequential randomized trial.
Consistency: Or, well-defined interventions
Consistency requires that the treatment a patient actually received is the same treatment we're modeling. That is, if the data are consistent, then there cannot be multiple versions of the “give vasopressors” action such that different versions might lead to different clinical outcomes. In potential outcomes language: consistency requires that the observed outcome under treatment a equals the potential outcome Y(a).
Positivity: Or, overlap
Positivity requires that for every combination of patient state and treatment history, there's a nonzero probability of receiving each treatment option. No patient subgroup is deterministically assigned one treatment. This can fail when protocols mandate specific treatments for extreme states. For example, if SOFA > 12 always triggers maximum vasopressor support, we have no data on alternative treatments for those patients: In this case, the positivity assumption is violated.
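A simple empirical overlap check makes the SOFA > 12 example concrete. This is a sketch on hypothetical records, with a hypothetical helper `positivity_violations`. Finding every treatment in every stratum does not prove positivity, but finding a stratum with a missing treatment is a clear warning sign.

```python
from collections import defaultdict

def positivity_violations(records, actions):
    """Return {stratum: missing_actions} for strata where some action never occurs."""
    seen = defaultdict(set)
    for stratum, action in records:
        seen[stratum].add(action)
    return {s: sorted(set(actions) - a) for s, a in seen.items() if set(actions) - a}

def sofa_band(sofa):
    return "SOFA>12" if sofa > 12 else "SOFA<=12"

# Hypothetical records: (SOFA score, treatment received)
raw = [(5, "fluids"), (7, "vasopressors"), (9, "fluids"), (11, "vasopressors"),
       (13, "vasopressors"), (14, "vasopressors"), (15, "vasopressors")]
records = [(sofa_band(s), t) for s, t in raw]

violations = positivity_violations(records, ["fluids", "vasopressors"])
print(violations)
```

In this toy dataset no patient with SOFA > 12 received fluids alone, so the data cannot say how such patients would fare under that option.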
Sequential exchangeability: Or, no unmeasured time-varying confounding
Sequential exchangeability requires that at each decision point, treatment assignment is independent of future potential outcomes, conditional on the observed history. This is the sequential, time-varying version of the “no unmeasured confounders” assumption. Sequential exchangeability is the hardest assumption to verify. Unfortunately, it is also the most consequential if violated: unobserved factors that influence both treatment choice and outcome can induce strong bias in the estimated DTR, which can cause it to fail catastrophically on new observations.
d_k: decision rule at stage k
H_k: history up to stage k
(s_1, a_1, …, s_K, a_K): observed trajectory
a_k = d_k(H_k): rule maps history to action
d*: optimal DTR maximizing expected reward
D: dataset of patient trajectories
DTR estimation from observational data rests on the sequential versions of the standard causal identification assumptions: consistency, positivity, and sequential exchangeability. When these hold, the DTR estimated from observational data targets the same optimal policy we'd find in a perfectly conducted sequential randomized trial.
Section 4
The Robot's Quest
Gamified Dynamic Treatment Regimes
Now that we've defined DTRs formally, let's build intuition. We'll start with a small example you can verify by hand, then give you the controls for a more complex version.
A Simple Grid
Meet our robot, Pod (short for Podalirius, Homer's physician). Pod lives on a 2×3 grid and needs to reach the goal (the money bag) from a fixed starting position. It can move one step at a time (up, down, left, right) and wants to reach the goal in as few steps as possible.
[Figure: 2×3 Grid: Worked Example. Pod navigates from start (top-left) to goal (bottom-right). Reward: +10 at goal, −1 per step. Panels: The Grid, Value Function, Q-Values, Optimal Policy.]
Here's the setup: There are 6 cells (states) and 2 to 3 possible moves (actions) at each cell, since the grid's edges limit which directions are available. Pod receives +10 for reaching the goal and −1 for every step along the way, so a longer path means a lower total reward. The numbers overlaid on each cell are the value function (the expected total reward from that state, assuming optimal play from there on). The third panel shows the Q-values for each available action at each cell.
To determine the optimal path, we need to calculate the value of each cell. We do this by starting from the goal and working backward. The goal cell has value 10 (if you're already there, there's nowhere left to go, so you simply collect the reward).
Next, look at the cell immediately to the left of the goal. The best move is to step right into the goal, earning 10 on the next move minus 1 for the step. So that cell has value 9. Continue this backward through the grid. Each cell's value equals the value of the best neighboring cell minus the step cost. Two steps from the goal? Value = 8. Three steps? Value = 7.
The Q-function gives us the expected reward of taking a specific action when located in a specific cell. Suppose, for example, that Pod is sitting at the top middle cell. From here, Pod can move right (reaching a cell worth 9, so Q(top middle, right) = 9 − 1 = 8), down (reaching a cell worth 9, so Q(top middle, down) = 9 − 1 = 8), or left (reaching a cell worth 7, so Q(top middle, left) = 7 − 1 = 6). Note that the maximum Q-value is 8, which can be achieved by choosing either a = right or a = down. Thus, the policy says: go right or down, both are equally good. Notice that the cell's value (8) equals the highest Q-value among available actions. That's the relationship: the value at a state s equals the maximum Q-value at that state, V(s) = max_a Q(s, a). The value function summarizes the best option; the Q-function shows you why it's best by laying out all the options.
At every cell, the optimal action is the one that leads to the highest-value neighbor. Draw an arrow for each cell pointing toward the best next step, and you have the optimal policy (the best strategy from every possible starting position).
Bellman Equations
The recursive structure we just used is formally known as the Bellman equation: V(s) = max_a [ r(s, a) + V(s′) ], where s′ is the state that action a leads to from state s. This equation says: the value of a state is the best immediate reward plus the value of wherever that action takes you. It's what makes backward induction work: you don't need to enumerate every possible path; you just need to solve one step at a time, starting from the end.
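The worked example can be checked in a few lines. The sketch below applies repeated Bellman backups to the 2×3 grid described above (goal worth 10, −1 per step), then derives the Q-values and the optimal actions. The coordinate convention (row 0 is the top row, the goal is at row 1, column 2) is an assumption made for the code.

```python
ROWS, COLS = 2, 3
GOAL = (1, 2)                      # bottom-right cell
STEP_COST, GOAL_VALUE = -1, 10
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def neighbors(cell):
    """Yield (action, next_cell) for moves that stay on the grid."""
    r, c = cell
    for name, (dr, dc) in MOVES.items():
        nr, nc = r + dr, c + dc
        if 0 <= nr < ROWS and 0 <= nc < COLS:
            yield name, (nr, nc)

cells = [(r, c) for r in range(ROWS) for c in range(COLS)]
V = {cell: float("-inf") for cell in cells}
V[GOAL] = GOAL_VALUE

# Repeated Bellman backups: values spread outward from the goal
for _ in range(len(cells)):
    for cell in cells:
        if cell != GOAL:
            V[cell] = max(STEP_COST + V[nxt] for _, nxt in neighbors(cell))

# Q(s, a) = step cost + value of the cell that action a leads to
Q = {cell: {a: STEP_COST + V[nxt] for a, nxt in neighbors(cell)}
     for cell in cells if cell != GOAL}

# Optimal actions: every action whose Q-value equals the cell's value
policy = {cell: [a for a, q in qs.items() if q == V[cell]] for cell, qs in Q.items()}

print("V(top-left) =", V[(0, 0)])
print("Q(top-middle) =", Q[(0, 1)])
print("best moves from top-middle:", policy[(0, 1)])
```

The start cell comes out at 7, and the top middle cell has Q-values of 8 (right), 8 (down), and 6 (left), matching the hand calculation.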
Now Try It Yourself
The 2×3 grid is simple enough to solve in your head. Real treatment decisions aren't. In the interactive game below, Pod has been sent by the Achaeans on a voyage to find buried treasure with medicine and gold. It navigates a 3×3 grid with a randomized start and goal. Each move (action) costs it 10 talents of gold (or more if Pod bumps into a pirate). Pod's position is the state. Acquisition of the treasure bag is the outcome we're optimizing for. To find the optimal move, one might plan with Q-functions, or simply test every possible path to see what works (trial and error).
Start on Calm Seas mode. Here there are no pirates, just Pod and the treasure. Each move costs 10 points, so the optimal strategy is the shortest path. Guide Pod to the goal and see how many points you end up with.
[Interactive game: Pod's Buried Treasure. A 3×3 grid with cells labeled (0,0) through (2,2) and a running score that starts at 0 pts. Guide Pod to the treasure. Each move costs 10 pts.]
Now switch to Pirates mode. Pirates symbolize adverse outcomes that Pod will want to avoid (but may not be able to). In Pirates mode, each obstacle costs a fixed 15 points. The optimal policy will likely be to route around danger, even when the direct path is shorter. This is exactly what happens in clinical decision-making: a more aggressive treatment might promise faster improvement, but if it carries serious risks, a cautious, longer path may be optimal.
Finally, the bravest heroes may attempt the treasure quest on Dangerous Waters, where each pirate incurs a different cost: some pirates take more money than others. (A competent pirate who takes a great deal of money is a greater risk than a less competent pirate who cannot find your secret stash.) Now the optimal path depends on balancing path length against which pirates are most costly to hit. The practical analogue is a clinical situation where risk cannot be avoided in treatment, for example when the most efficacious treatments carry the highest risk of side effects. In these scenarios, determining the best treatment plan requires the policymaker to balance the relative severity of each risk.
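The trade-off between path length and hazard cost can be sketched with the same Bellman logic, applied to costs instead of rewards. Everything specific here is an assumption for illustration: the start and treasure positions, the single pirate at (1, 0), the 30-point pirate, and the choice to charge the pirate penalty on top of the 10-point move when Pod enters the pirate's cell. The game's actual layout is randomized.

```python
def cost_to_go(size, goal, move_cost, pirates):
    """Minimum total cost from every cell to the goal (Bellman backups on costs)."""
    cells = [(x, y) for x in range(size) for y in range(size)]
    J = {c: float("inf") for c in cells}
    J[goal] = 0
    for _ in range(len(cells)):            # enough sweeps for a grid this small
        for (x, y) in cells:
            if (x, y) == goal:
                continue
            options = []
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nxt = (x + dx, y + dy)
                if nxt in J:
                    # Pay for the move, plus a penalty if a pirate waits in the next cell
                    options.append(move_cost + pirates.get(nxt, 0) + J[nxt])
            J[(x, y)] = min(options)
    return J

START, TREASURE = (0, 0), (2, 0)           # hypothetical layout
mild = cost_to_go(3, TREASURE, 10, {(1, 0): 15})
costly = cost_to_go(3, TREASURE, 10, {(1, 0): 30})

print("15-point pirate on the direct path:", mild[START])
print("30-point pirate on the direct path:", costly[START])
```

With the 15-point pirate, pushing straight through (35) beats the four-move detour (40); with the 30-point pirate, the detour wins. In this layout, the best route depends on how costly the hazard is, not only on whether one exists.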
Back to Homer
Let us return, momentarily, to the treatment of Homer, our sepsis patient. His starting position is his initial presentation: SOFA = 8, elevated lactate, compromised kidneys. The goal is organ recovery, indicated by a normalized SOFA, improving labs, and a path towards discharge. Each move is a treatment adjustment (change fluid rates, add or adjust vasopressors). The pirates represent the clinical hazards along the way (organ toxicity, adverse drug reactions, fluid overload). The grid represents the treatment landscape (which options are available and which are dangerous). The available treatment options and their relative risk are shaped by Homer's physiology.
When Pod compares Q-values at each cell to determine the optimal path to get to the buried treasure, it's doing exactly what a DTR does at any decision point (evaluating the expected outcome of each available action, and accounting for everything that will follow). The difference is that Pod knows the grid perfectly. Homer's clinical team doesn't. Instead, they must estimate V and Q from prior patients' records.
V(s): value of state s
V(s) = max_a [ r(s, a) + V(s′) ]: best reward now plus value of next state
Q(s, a): value of taking action a in state s
Section 5
Finding the Best Strategy
Approaches to Offline Reinforcement Learning for DTR
Pod knew its grid perfectly: every cell, every pirate, every reward. A clinician doesn't have that luxury. We don't have a map of a new patient's future clinical trajectory, but we may have observational data to learn from. In ML, this setting is known as offline RL: learning a treatment policy from a fixed dataset, with no ability to experiment or collect new data. It's the standard setting for clinical DTR estimation, where we have patient records, not a laboratory.
Can Pod learn an optimal DTR for Homer from prior patients' records?
Single Stage: Pick the Best Treatment
Let us simplify, momentarily, by considering only a single decision point in Homer's treatment. Homer arrives with a set of features (SOFA score, lactate, comorbidity burden). The clinical team has two first-line options: IV fluids or vasopressors. Which treatment will maximize Homer's outcome?
Imagine we have records from 25 prior sepsis patients admitted to the same ICU. Each patient had their severity assessed, they received one of two treatments, and we recorded their recovery outcome. This is our training data.
An ML model can learn the mapping from patient features and treatment to expected outcome. Once trained, for each new patient, the model predicts the expected recovery under each treatment and recommends the one with the higher predicted reward. In the simulator below, study the training data, then try your hand at being the clinician: adjust the patient's features with the sliders, choose a treatment, and see whether you picked the optimal one. Can you spot the pattern before the model reveals it?
Training Data: 25 Prior ICU Patients
Recovery scores range from roughly −1 to +1. Higher is better: positive values indicate improvement, negative values indicate decline.
Pt | SOFA | Lactate | Comorbidity | Treatment | Recovery
1 | 9 | 3.9 | 4 | Vasopressors | +0.05
2 | 4 | 5.2 | 4 | IV Fluids | +0.27
3 | 11 | 2.8 | 1 | Vasopressors | −0.28
4 | 0 | 4.0 | 4 | IV Fluids | +0.06
5 | 4 | 1.0 | 1 | Vasopressors | −0.48
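To show the mechanics of "predict the outcome under each treatment, then recommend the higher one", here is a minimal sketch that uses only the five rows displayed above and a one-nearest-neighbor lookup within each treatment arm. The nearest-neighbor rule and the unscaled distance are stand-ins chosen for brevity; the article does not specify which model the simulator uses, so the output illustrates the procedure rather than the simulator's answer.

```python
# The five displayed rows: (SOFA, lactate, comorbidity, treatment, recovery)
rows = [
    (9, 3.9, 4, "Vasopressors", 0.05),
    (4, 5.2, 4, "IV Fluids", 0.27),
    (11, 2.8, 1, "Vasopressors", -0.28),
    (0, 4.0, 4, "IV Fluids", 0.06),
    (4, 1.0, 1, "Vasopressors", -0.48),
]

def predict(features, treatment):
    """Predicted recovery = outcome of the most similar prior patient on that treatment."""
    arm = [r for r in rows if r[3] == treatment]
    # Squared Euclidean distance on raw (unscaled) features: a simplification
    nearest = min(arm, key=lambda r: sum((a - b) ** 2 for a, b in zip(r[:3], features)))
    return nearest[4]

def recommend(features):
    """Return the treatment with the higher predicted recovery, plus both predictions."""
    preds = {t: predict(features, t) for t in ("IV Fluids", "Vasopressors")}
    return max(preds, key=preds.get), preds

# New patient: SOFA 7, lactate 4.0 mmol/L, comorbidity index 2
best, preds = recommend((7, 4.0, 2))
print(best, preds)
```

A real analysis would use all 25 patients, scale the features, and use a model that can smooth over noise; with five rows, each prediction rests on a single prior patient.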
[Interactive simulator: Treatment Choice Simulator. A model has been trained on the 25 patients above. A new patient has arrived; study their features, then pick the treatment you think will lead to better recovery. Patient features (adjustable): SOFA Score 7, Lactate 4.0 mmol/L, Comorbidity Index 2. Choose a treatment.]
Multi-Stage: Working Backwards with Q-Learning
We have established that given a reasonable set of prior data, Pod can learn a DTR for a single decision point. But Homer's treatment unfolds over multiple stages, involving many assessment points over his hospital stay. To address the multi-stage problem, we will start by assuming Homer's treatment happens in exactly two stages: the initial decision, then a reassessment and adjustment at 4–6 hours. In this two-stage scenario, we can't optimize each stage independently, because the first treatment affects the second state. The solution is backward induction, just like we used in Pod's game, but applied to learned models instead of a known environment.
Start by considering only second-stage observations from the dataset. For each prior patient, look at the features, treatment, and outcomes for their second assessment point. Use this reduced dataset to learn which stage-2 treatment maximizes the final outcome. This gives us an estimated Q-function for stage 2; it tells us the predicted reward for each (state, action) pair at the second decision point.
Now that we know approximately how good each stage-2 state is, we can fold that into the first decision. We return to each patient's first assessment point and replace the observed final outcome with a pseudo-outcome: the predicted reward under the best stage-2 treatment for the state that patient actually reached. We then build a stage-1 Q-function on these pseudo-outcomes, so that it predicts not just the immediate effect of the first treatment, but the total reward including the best stage-2 response. From this, we can pick the first treatment for Homer that maximizes his total (two-stage) value.
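The two-stage procedure can be sketched with the simplest possible "model": the mean outcome within each (state, action) group. The eight trajectories below are hypothetical, as are the state and action labels. A real analysis would use a regression or ML model over continuous features instead of group means.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trajectories: (s1, a1, s2, a2, final_outcome)
data = [
    ("severe", "vaso",   "improved", "continue",  0.8),
    ("severe", "vaso",   "improved", "escalate",  0.4),
    ("severe", "vaso",   "worse",    "continue", -0.4),
    ("severe", "vaso",   "worse",    "escalate",  0.2),
    ("severe", "fluids", "improved", "continue",  0.6),
    ("severe", "fluids", "worse",    "continue", -0.6),
    ("severe", "fluids", "worse",    "escalate",  0.0),
    ("severe", "fluids", "worse",    "escalate",  0.2),
]

def fit_q(samples):
    """Tabular stand-in for a model: mean target for each (state, action) pair."""
    groups = defaultdict(list)
    for state, action, target in samples:
        groups[(state, action)].append(target)
    return {key: mean(vals) for key, vals in groups.items()}

def best_value(q, state):
    """Value and label of the best observed action in a state."""
    options = {a: v for (s, a), v in q.items() if s == state}
    action = max(options, key=options.get)
    return options[action], action

# Stage 2: model the final outcome as a function of (s2, a2)
q2 = fit_q([(s2, a2, y) for _, _, s2, a2, y in data])

# Pseudo-outcome: predicted outcome if stage 2 is handled optimally
pseudo = [(s1, a1, best_value(q2, s2)[0]) for s1, a1, s2, _, _ in data]

# Stage 1: model the pseudo-outcome as a function of (s1, a1)
q1 = fit_q(pseudo)

print("best stage-2 action if worse:", best_value(q2, "worse"))
print("best stage-1 action if severe:", best_value(q1, "severe"))
```

In this toy data, vasopressors look better at stage 1 mainly because more of those patients reached the "improved" state, from which the best stage-2 option has a higher predicted outcome. That is the sense in which the first decision is credited with what it makes possible later.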
The interactive step overview below shows how this procedure works in practice. Detailed mathematical instruction is outside the scope of this article, but Pod's interactive DTR dashboard can give you a flavor of the calculations involved.
[Interactive overview: Fitted Q-Iteration, backward induction for DTR estimation. First step: We start at the end because Stage 2 has no future to worry about. Using each patient's second-assessment data, we learn which Stage-2 treatment maximizes recovery. Illustration: Pod considers the relative benefits and risks of each treatment for Homer.]
We illustrated fitted Q-iteration with two stages, but the procedure generalizes directly. With K stages, start at Stage K, fit the terminal Q-function, then work backward through Stages K − 1, …, 1, constructing pseudo-outcomes at each step. For Homer, this means the same backward-induction logic can handle daily reassessments over an entire ICU stay, not just two time points.
This is Q-learning applied to DTR estimation: backward induction using estimated Q-functions fitted from observational data. And it works beautifully when the model's predictions are reliable. But what happens when our model encounters a patient unlike anyone in the training data?
estimated Q-function (expected cumulative reward)
greedy policy: choose the action with highest estimated value
Section 6
When Things Go Wrong
Handling Predictions with Sparse Data
Our ML model only knows about patients it's seen in the training data. When it encounters a new patient whose features fall outside the observed range, it has to extrapolate (predict beyond the range of observed data). Extrapolation is like guessing in the dark; it produces high-variance predictions. Even when the model is uncertain, it might still recommend an aggressive treatment whose predicted benefit carries a wide confidence interval. This is especially dangerous in healthcare, where a wrong decision can cause real harm.
Think about a scatter plot of features (lab values, comorbidities) for past sepsis patients who received only vasopressors (treatment A, blue) versus only IV fluids (treatment B, red). In regions where both treatments have lots of data points, the model's predictions are grounded in evidence. The model is confident, and it should be. But in regions where data is sparse (for example, patients with very high lactate combined with a rare comorbidity) the predictions are based on distant extrapolation from dissimilar patients. The model might still produce a confident-looking recommendation, but that confidence is an illusion.
Suppose that Homer presents with lactate of 8.4 mmol/L and cystic fibrosis, and that he is one of the first patients with cystic fibrosis we have encountered. We may have a dataset of sepsis patients to consider, but so far it includes only two or three other people with cystic fibrosis, so this profile places him in a region of the feature space where the training data is thin. The model can still predict Homer's reward for each treatment option. However, because the model has seen very few patients like Homer, and the role of cystic fibrosis within the causal network is not yet known, its estimates carry enormous uncertainty. The problem is that the DTR methods we've established so far treat those shaky predictions with the same confidence as well-supported ones. In offline RL, this is called the distribution shift problem.
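A back-of-the-envelope comparison shows how similar-looking point estimates can hide very different amounts of evidence. The recovery scores below are hypothetical, and the standard error of a group mean is used only as a simple stand-in for the model's predictive uncertainty.

```python
from math import sqrt
from statistics import mean, stdev

def mean_and_se(outcomes):
    """Sample mean, its standard error, and the sample size (SE is undefined for n < 2)."""
    n = len(outcomes)
    se = stdev(outcomes) / sqrt(n) if n >= 2 else float("inf")
    return mean(outcomes), se, n

# Hypothetical recovery scores under the same treatment
common_profile = [0.31, 0.25, 0.40, 0.28, 0.35, 0.22, 0.38, 0.30, 0.27, 0.33]
rare_profile = [0.9, -0.4]   # two prior patients with a rare comorbidity

m_common, se_common, n_common = mean_and_se(common_profile)
m_rare, se_rare, n_rare = mean_and_se(rare_profile)

print(f"common profile: n={n_common}, mean={m_common:.2f}, standard error={se_common:.2f}")
print(f"rare profile:   n={n_rare}, mean={m_rare:.2f}, standard error={se_rare:.2f}")
```

Both groups have a positive average recovery, so a method that compares point estimates alone would treat them as similarly promising, even though the rare-profile estimate rests on two patients and has a far larger standard error.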
Resolving this issue requires a principled way to account for uncertainty, so that we can be cautious when the model is guessing. In Part II, we develop a method for this exact problem. The pessimism principle penalizes treatment recommendations in proportion to the model's uncertainty about them. The idea, in a sentence: when you don't know, choose the treatment you're most confident is at least decent. We'll show how to measure that uncertainty, how to incorporate it into the optimization, and what the full algorithm looks like.
We wish to re-emphasize that DTR models are estimated from finite, observational data; the Q-functions that these DTRs are based on can be wrong. Any model can be biased by unmeasured confounders, degraded by sparse regions of the feature space, or simply misspecified. Every estimate carries uncertainty, and that uncertainty compounds across stages. A DTR is a strategy for informing real physicians, not replacing them. Methods like fitted Q-iteration produce structured, transparent summaries of existing evidence, which a clinician may subsequently interrogate, challenge, and weigh against their own experience. The math formalizes a question (“Given what we've seen before, which treatment sequence looks best?”); the physician decides whether the answer applies to the patient in front of them. Our goal is to understand computational techniques that can provide physicians with the best possible information, given the constraints of observational data, to inform their decisions. Perturbative approaches that compare the suggestions made by different Q-functions (as in Bleile 2023, and Bleile, Dias, Ageueusop et al 2025) can also improve the robustness of these analyses.
prediction variance of estimated Q-function
state outside training support; variance grows with distance from observed data
point estimates mask uncertainty in sparse regions
Section 7
Bigger Picture
Translating between Causal Methods and Reinforcement Learning
Throughout this article, we've been translating between two vocabularies. The DTR framework sits at the intersection of causal inference and RL, two fields that developed in parallel for decades, solving versions of the same problem with different tools and different terminology. In earlier sections, we gave an overview of Bleile's comparative framework. As we conclude our introduction to DTR using RL, we return to those concepts.
Causal inference researchers built the identification theory, which asks: under what assumptions can we draw valid causal conclusions from observational data? The answers (consistency, positivity, sequential exchangeability) tell us when DTR methods work. RL researchers built the optimization machinery, which asks: given a sequential decision problem, how do we find the best policy? The answers (Q-learning, backward induction, policy evaluation) tell us how to compute the DTR solution. Causal identification theory validates the estimand, while RL algorithms compute the estimator. Neither field alone is sufficient; together, they provide both the theoretical foundation and the computational tools for optimizing sequential treatment decisions.
Below we reproduce the table shown in section 2 to review the explicit concept mapping. If you come from RL, read left to right. If you come from causal inference, start from the middle column. The last column is the shared clinical language that both fields are ultimately trying to serve.
Reinforcement Learning | Causal Inference | Clinical Language
Policy (π) | Treatment regime / rule | Clinical protocol
State (s) | Time-varying covariates | Patient's current condition
Action (a) | Treatment / intervention | Treatment decision
Reward (r) | Outcome (potential) | Health outcome
Optimal policy (π*) | Optimal DTR | Best treatment strategy
Q-function Q(s,a), evaluated at s,a | Conditional treatment effect of a given s | Expected outcome given treatment
The power of this translation is that advances in one field immediately benefit the other. Causal sensitivity analysis tells RL researchers when their offline estimates might be biased. RL optimization algorithms give causal researchers scalable tools for complex sequential problems. And both fields contribute to the clinical goal: finding treatment strategies that actually work for patients like Homer.
Causal identification theory validates the estimand. RL algorithms compute the estimator. Neither field alone is sufficient; together they provide both the theoretical foundation and the computational tools for optimizing sequential treatment decisions.
Section 8
Test Yourself
A Short Quiz on the Foundations of DTR and RL
Questions to check your understanding of Part I material.