An Introduction to Directed Acyclic Graphs

Malcolm Barrett

2024-01-23

A quick note on terminology: I use the terms confounding and selection bias below, the terms of choice in epidemiology. The terms, however, depend on the field. In some fields, confounding is referred to as omitted variable bias or selection bias. Selection bias also sometimes means variable selection bias, a related issue that concerns misspecified models.

library(ggdag)
library(ggplot2)

# set the theme of all DAGs to `theme_dag()`
theme_set(theme_dag())

Directed Acyclic Graphs

A DAG displays assumptions about the relationships between variables (often called nodes in the context of graphs). The assumptions we make take the form of lines (or edges) going from one node to another. These edges are directed, which means that they have a single arrowhead indicating the direction of their effect. Here’s a simple DAG where we assume that x affects y:

dagify(y ~ x) %>%
  ggdag()

You also sometimes see edges that look bi-directed, like this:

dagify(y ~ ~x) %>%
  ggdag()

But this is actually shorthand for an unmeasured cause of the two variables (in other words, unmeasured confounding):

# canonicalize the DAG: add the latent variable to the graph
dagify(y ~ ~x) %>%
  ggdag_canonical()

A DAG is also acyclic, which means that there are no feedback loops; a variable can’t be its own descendant. The above are all DAGs because they are acyclic, but this is not:

dagify(
  y ~ x,
  x ~ a,
  a ~ y
) %>%
  ggdag()
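
Acyclicity can also be checked programmatically rather than by eye. The following is a minimal sketch, assuming that `dagify()` returns a dagitty object (so functions from the dagitty package can be applied to it) and that `dagitty::isAcyclic()` is available in the installed version; the object names `acyclic_dag` and `cyclic_graph` are made up for illustration.

```r
library(ggdag)

# the simple x -> y graph from above
acyclic_dag <- dagify(y ~ x)

# the graph with a feedback loop: x -> y -> a -> x
cyclic_graph <- dagify(
  y ~ x,
  x ~ a,
  a ~ y
)

# isAcyclic() reports whether the graph is free of directed cycles
dagitty::isAcyclic(acyclic_dag)
dagitty::isAcyclic(cyclic_graph)
```

The first call should report that the graph is acyclic and the second that it is not, which matches the distinction drawn above.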

Structural Causal Graphs

ggdag is more specifically concerned with structural causal models (SCMs): DAGs that portray causal assumptions about a set of variables. Beyond being a useful way to conceptualize the problem we’re working on (which they are), causal DAGs let us lean on the well-developed links between graphical causal paths and statistical associations. Causal DAGs are mathematically grounded, but they are also consistent and easy to understand. Thus, when we’re assessing the causal effect of an exposure on an outcome, drawing our assumptions in the form of a DAG can help us pick the right model without having to know much about the math behind it. Another way to think about DAGs is as non-parametric structural equation models (SEMs): we are explicitly laying out paths between variables, but in the case of a DAG, it doesn’t matter what form the relationship between two variables takes, only its direction. The rules underpinning DAGs are the same whether the relationship is a simple, linear one or a more complicated function.

Relationships between variables

Let’s say we’re looking at the relationship between smoking and cardiac arrest. We might assume that smoking causes changes in cholesterol, which causes cardiac arrest:

smoking_ca_dag <- dagify(cardiacarrest ~ cholesterol,
  cholesterol ~ smoking + weight,
  smoking ~ unhealthy,
  weight ~ unhealthy,
  labels = c(
    "cardiacarrest" = "Cardiac\n Arrest",
    "smoking" = "Smoking",
    "cholesterol" = "Cholesterol",
    "unhealthy" = "Unhealthy\n Lifestyle",
    "weight" = "Weight"
  ),
  latent = "unhealthy",
  exposure = "smoking",
  outcome = "cardiacarrest"
)

ggdag(smoking_ca_dag, text = FALSE, use_labels = "label")

The path from smoking to cardiac arrest is directed: smoking causes cholesterol to rise, which then increases risk for cardiac arrest. Cholesterol is an intermediate variable between smoking and cardiac arrest. Directed paths are also chains, because each node causes the next. Let’s say we also assume that weight causes cholesterol to rise and thus increases risk of cardiac arrest. Now there’s another chain in the DAG: from weight to cardiac arrest. However, this chain is indirect, at least as far as the relationship between smoking and cardiac arrest goes.

We also assume that a person who smokes is more likely to be someone who engages in other unhealthy behaviors, such as overeating. On the DAG, this is portrayed as a latent (unmeasured) node, called unhealthy lifestyle. Having a predilection towards unhealthy behaviors leads to both smoking and increased weight. Here, the relationship between smoking and weight is through a forked path (weight <- unhealthy lifestyle -> smoking) rather than a chain; because they have a mutual parent, smoking and weight are associated (in real life, there’s probably a more direct relationship between the two, but we’ll ignore that for simplicity).

Forks and chains are two of the three main types of paths:

  1. Chains
  2. Forks
  3. Inverted forks (paths with colliders)

An inverted fork is when two arrowheads meet at a node, which we’ll discuss shortly.

There are also common ways of describing the relationships between nodes: parents, children, ancestors, descendants, and neighbors (there are a few others, as well, but they refer to less common relationships). Parents and children refer to direct relationships; ancestors and descendants can be anywhere along the path to or from a node, respectively. Here, smoking and weight are both parents of cholesterol, while smoking and weight are both children of an unhealthy lifestyle. Cardiac arrest is a descendant of an unhealthy lifestyle, which is in turn an ancestor of all other nodes in the graph.
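
These relationships can be queried directly from the graph. The sketch below rebuilds the smoking DAG without labels so that it runs on its own, and assumes that the dagitty functions `parents()`, `children()`, `descendants()`, and `ancestors()` accept the object returned by `dagify()`. As far as I understand dagitty's convention, `descendants()` and `ancestors()` include the queried node itself in their result.

```r
library(ggdag)

# same structure as the smoking DAG above, minus the plotting labels
smoking_ca_dag <- dagify(
  cardiacarrest ~ cholesterol,
  cholesterol ~ smoking + weight,
  smoking ~ unhealthy,
  weight ~ unhealthy,
  latent = "unhealthy",
  exposure = "smoking",
  outcome = "cardiacarrest"
)

# direct relationships
dagitty::parents(smoking_ca_dag, "cholesterol")
dagitty::children(smoking_ca_dag, "unhealthy")

# relationships anywhere along a directed path
dagitty::descendants(smoking_ca_dag, "unhealthy")
dagitty::ancestors(smoking_ca_dag, "cardiacarrest")
```

ggdag also provides plotting counterparts such as `ggdag_parents()` and `ggdag_descendants()` for highlighting the same relationships on the graph.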

So, in studying the causal effect of smoking on cardiac arrest, where does this DAG leave us? We only want to estimate the effect along the directed path from smoking to cardiac arrest, but there also exists an indirect, or back-door, path. This is confounding. Judea Pearl, who developed much of the theory of causal graphs, said that confounding is like water in a pipe: it flows freely in open pathways, and we need to block it somewhere along the way. We don’t necessarily need to block the water at multiple points along the same back-door path, although we may have to block more than one path. We often talk about confounders, but really we should talk about confounding, because it is about the pathway more than any particular node along the path.

Chains and forks are open pathways, so in a DAG where nothing is conditioned upon, any back-door paths must be one of the two. In addition to the directed pathway to cardiac arrest, there’s also an open back-door path through the forked path at unhealthy lifestyle and on from there through the chain to cardiac arrest:

ggdag_paths(smoking_ca_dag, text = FALSE, use_labels = "label", shadow = TRUE)
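
The same paths can be listed as text instead of plotted. This is a minimal sketch, assuming `dagitty::paths()` can be applied to the object returned by `dagify()`; the DAG is rebuilt without labels so the block is self-contained, and the names `all_paths` and `causal_only` are illustrative.

```r
library(ggdag)

smoking_ca_dag <- dagify(
  cardiacarrest ~ cholesterol,
  cholesterol ~ smoking + weight,
  smoking ~ unhealthy,
  weight ~ unhealthy,
  latent = "unhealthy",
  exposure = "smoking",
  outcome = "cardiacarrest"
)

# every path between exposure and outcome, regardless of arrow direction
all_paths <- dagitty::paths(smoking_ca_dag, from = "smoking", to = "cardiacarrest")
all_paths$paths # the paths, written out with arrows
all_paths$open  # whether each path is open when nothing is conditioned on

# only the directed (causal) paths from exposure to outcome
causal_only <- dagitty::paths(
  smoking_ca_dag,
  from = "smoking",
  to = "cardiacarrest",
  directed = TRUE
)
causal_only$paths
```

With nothing conditioned on, both the directed path and the back-door path through unhealthy lifestyle and weight should be reported as open.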

We need to account for this back-door path in our analysis. There are many ways to go about that (stratification, including the variable in a regression model, matching, inverse probability weighting), all with pros and cons. But each strategy must include a decision about which variables to account for. Many analysts take the strategy of putting in all possible confounders. This can be bad news, because adjusting for colliders and mediators can introduce bias, as we’ll discuss shortly. Instead, we’ll look at minimally sufficient adjustment sets: sets of covariates that, when adjusted for, block all back-door paths, but include no more and no less than necessary. There can be many minimally sufficient sets, and if you remove even one variable from a given set, a back-door path will open. Some DAGs, like the first one in this vignette (x -> y), have no back-door paths to close, so the minimally sufficient adjustment set is empty (sometimes written as “{}”). Others, like the cyclic graph above, or DAGs with important variables that are unmeasured, cannot produce any sets sufficient to close back-door paths.

For the smoking-cardiac arrest question, there is a single set with a single variable: {weight}. Accounting for weight will give us an unbiased estimate of the relationship between smoking and cardiac arrest, assuming our DAG is correct. We do not need to (or want to) control for cholesterol, however, because it’s an intermediate variable between smoking and cardiac arrest; controlling for it blocks the path between the two, which will then bias our estimate (see below for more on mediation).
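
A small simulation can make this concrete. The sketch below uses base R only and is not part of ggdag. It assumes, purely for illustration, that every variable is continuous and every relationship is linear, with made-up coefficients chosen so that the true total effect of smoking on cardiac arrest is 0.5.

```r
set.seed(2024)
n <- 10000

# simulate data following the arrows of the smoking DAG (illustrative values)
unhealthy <- rnorm(n)                                   # latent common cause
smoking <- unhealthy + rnorm(n)
weight <- unhealthy + rnorm(n)
cholesterol <- 0.5 * smoking + 0.5 * weight + rnorm(n)
cardiacarrest <- cholesterol + rnorm(n)                 # total effect of smoking: 0.5 * 1

# no adjustment: the back-door path through unhealthy and weight stays open
unadjusted <- coef(lm(cardiacarrest ~ smoking))[["smoking"]]

# adjusting for weight closes the back-door path
adjusted <- coef(lm(cardiacarrest ~ smoking + weight))[["smoking"]]

# additionally adjusting for the mediator blocks the causal path itself
overadjusted <- coef(lm(cardiacarrest ~ smoking + weight + cholesterol))[["smoking"]]

c(unadjusted = unadjusted, adjusted = adjusted, overadjusted = overadjusted)
```

Under these assumptions the unadjusted estimate should land near 0.75 (biased upward by confounding), the weight-adjusted estimate near the true 0.5, and the estimate that also controls for cholesterol near 0, because conditioning on the mediator removes the effect we are trying to measure.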

ggdag_adjustment_set(smoking_ca_dag, text = FALSE, use_labels = "label", shadow = TRUE)
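
The adjustment set can also be retrieved as data rather than as a plot. This is a minimal sketch, assuming `dagitty::adjustmentSets()` and `dagitty::paths()` accept the object returned by `dagify()`; the DAG is rebuilt without labels so the block runs on its own, and `sets` and `after_adjusting` are illustrative names.

```r
library(ggdag)

smoking_ca_dag <- dagify(
  cardiacarrest ~ cholesterol,
  cholesterol ~ smoking + weight,
  smoking ~ unhealthy,
  weight ~ unhealthy,
  latent = "unhealthy",
  exposure = "smoking",
  outcome = "cardiacarrest"
)

# minimally sufficient adjustment sets for the total effect;
# latent variables such as `unhealthy` are not eligible for adjustment
sets <- dagitty::adjustmentSets(
  smoking_ca_dag,
  exposure = "smoking",
  outcome = "cardiacarrest"
)
sets

# re-check the paths while conditioning on weight: the back-door path
# should now be closed, leaving only the directed path open
after_adjusting <- dagitty::paths(
  smoking_ca_dag,
  from = "smoking",
  to = "cardiacarrest",
  Z = "weight"
)
after_adjusting$paths
after_adjusting$open
```

Querying the paths again with `Z = "weight"` is a quick way to confirm that an adjustment set does what it claims before fitting any model.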

More complicated DAGs w