*A quick note on terminology: I use the terms confounding and
selection bias below, the terms of choice in epidemiology. The terms,
however, depend on the field. In some fields, confounding is referred to
as omitted variable bias or selection bias. Selection bias also
sometimes refers to* variable *selection bias, a related issue
that refers to misspecified models.*

A DAG displays assumptions about the relationship between variables
(often called nodes in the context of graphs). The assumptions we make
take the form of lines (or edges) going from one node to another. These
edges are *directed*, which means that each has a single
arrowhead indicating the direction of the effect. Here’s a simple DAG where we assume
that *x* affects *y*:
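A minimal sketch of how this DAG could be specified and drawn, assuming the `ggdag` package is installed. The object name `simple_dag` is only illustrative:

```r
library(ggdag)

# Formula syntax: the left-hand side is caused by the right-hand side,
# so y ~ x encodes the edge x -> y
simple_dag <- dagify(y ~ x)

# Draw the DAG with the default layout
ggdag(simple_dag)
```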

You also sometimes see edges that look bi-directed, like this:
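In `dagify()`, an edge of this kind can be written with a double tilde. A brief sketch, with `bidirected_dag` as an illustrative name:

```r
library(ggdag)

# y ~~ x declares a bi-directed edge between x and y
bidirected_dag <- dagify(y ~~ x)

# Draw the graph with the bi-directed edge
ggdag(bidirected_dag)
```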

But this is actually shorthand for an unmeasured cause of the two variables (in other words, unmeasured confounding):

```
# Canonicalize the DAG: add the latent variable to the graph
library(ggdag)

dagify(y ~~ x) %>%
  ggdag_canonical()
```

A DAG is also *acyclic*, which means that there are no
feedback loops; a variable can’t be its own descendant. The above are
all DAGs because they are acyclic, but this is not:
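A small sketch of such a cyclic graph, using the node names `x`, `a`, and `y` purely for illustration. The call to `dagitty::isAcyclic()` is an addition here, included only to confirm the feedback loop programmatically:

```r
library(ggdag)

# x -> y -> a -> x forms a feedback loop, so this graph is not a DAG
cyclic_graph <- dagify(
  y ~ x,
  x ~ a,
  a ~ y
)

# The graph can still be drawn, even though it is not a valid DAG
ggdag(cyclic_graph)

# Check for cycles using the underlying dagitty object
dagitty::isAcyclic(cyclic_graph)
```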

`ggdag` is more specifically concerned with structural
causal models (SCMs): DAGs that portray causal assumptions about a set
of variables. Beyond being useful conceptions of the problem we’re
working on (which they are), causal DAGs also allow us to lean on the
well-developed links between graphical causal paths and statistical
associations. Causal DAGs are mathematically grounded, but they are also
consistent and easy to understand. Thus, when we’re assessing the causal
effect between an exposure and an outcome, drawing our assumptions in
the form of a DAG can help us pick the right model without having to
know much about the math behind it. Another way to think about DAGs is
as non-parametric structural equation models (SEM): we are explicitly
laying out paths between variables, but in the case of a DAG, it doesn’t
matter what form the relationship between two variables takes, only its
direction. The rules underpinning DAGs are consistent whether the
relationship is a simple, linear one, or a more complicated
function.

Let’s say we’re looking at the relationship between smoking and cardiac arrest. We might assume that smoking causes changes in cholesterol, which causes cardiac arrest:

```
library(ggdag)

smoking_ca_dag <- dagify(
  cardiacarrest ~ cholesterol,
  cholesterol ~ smoking + weight,
  smoking ~ unhealthy,
  weight ~ unhealthy,
  labels = c(
    "cardiacarrest" = "Cardiac\n Arrest",
    "smoking" = "Smoking",
    "cholesterol" = "Cholesterol",
    "unhealthy" = "Unhealthy\n Lifestyle",
    "weight" = "Weight"
  ),
  latent = "unhealthy",
  exposure = "smoking",
  outcome = "cardiacarrest"
)

ggdag(smoking_ca_dag, text = FALSE, use_labels = "label")
```

The path from smoking to cardiac arrest is *directed*: smoking
causes cholesterol to rise, which then increases risk for cardiac
arrest. Cholesterol is an intermediate variable between smoking and
cardiac arrest. Directed paths are also chains, because each node causes
the next. Let’s say we also assume that weight causes cholesterol to
rise and thus increases risk of cardiac arrest. Now there’s another
chain in the DAG: from weight to cardiac arrest. However, this chain is
*indirect*, at least as far as the relationship between smoking
and cardiac arrest goes.

We also assume that a person who smokes is more likely to be someone who engages in other unhealthy behaviors, such as overeating. On the DAG, this is portrayed as a latent (unmeasured) node, called unhealthy lifestyle. Having a predilection towards unhealthy behaviors leads to both smoking and increased weight. Here, the relationship between smoking and weight is through a forked path (weight <- unhealthy lifestyle -> smoking) rather than a chain; because they have a mutual parent, smoking and weight are associated (in real life, there’s probably a more direct relationship between the two, but we’ll ignore that for simplicity).
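One way to check this fork on the graph itself is with d-connection queries from the `dagitty` package, which `ggdag` builds on. This is only a sketch: it rebuilds the DAG without labels so that it runs on its own, and it uses `dagitty::dconnected()` and `dagitty::dseparated()` as a graphical stand-in for "associated" and "not associated":

```r
library(ggdag)

# Same structure as the smoking DAG, without plot labels
smoking_ca_dag <- dagify(
  cardiacarrest ~ cholesterol,
  cholesterol ~ smoking + weight,
  smoking ~ unhealthy,
  weight ~ unhealthy,
  latent = "unhealthy",
  exposure = "smoking",
  outcome = "cardiacarrest"
)

# With nothing conditioned on, the fork through unhealthy lifestyle
# leaves smoking and weight d-connected
fork_open <- dagitty::dconnected(smoking_ca_dag, "smoking", "weight")

# Conditioning on the shared parent blocks the fork; the only other path
# (smoking -> cholesterol <- weight) meets at a collider and is closed
fork_blocked <- dagitty::dseparated(
  smoking_ca_dag, "smoking", "weight", "unhealthy"
)

fork_open
fork_blocked
```

In practice, unhealthy lifestyle is latent in this DAG, so the second query is a statement about the graph rather than something we could do with measured data.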

Forks and chains are two of the three main types of paths:

- Chains
- Forks
- Inverted forks (paths with colliders)

An inverted fork is when two arrowheads meet at a node, which we’ll discuss shortly.

There are also common ways of describing the relationships between nodes: parents, children, ancestors, descendants, and neighbors (there are a few others, as well, but they refer to less common relationships). Parents and children refer to direct relationships; ancestors and descendants can be anywhere along the path to or from a node, respectively. Here, smoking and weight are both parents of cholesterol, while smoking and weight are both children of an unhealthy lifestyle. Cardiac arrest is a descendant of an unhealthy lifestyle, which is in turn an ancestor of all other nodes in the graph.
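These relationships can also be queried directly. The sketch below rebuilds the smoking DAG without labels so it is self-contained, then uses the `dagitty` helpers `parents()`, `children()`, and `descendants()`; the result names are illustrative. Note that `dagitty::descendants()` includes the starting node itself in its result.

```r
library(ggdag)

# Same structure as the smoking DAG, without plot labels
smoking_ca_dag <- dagify(
  cardiacarrest ~ cholesterol,
  cholesterol ~ smoking + weight,
  smoking ~ unhealthy,
  weight ~ unhealthy,
  latent = "unhealthy",
  exposure = "smoking",
  outcome = "cardiacarrest"
)

# Direct relationships
parents_of_cholesterol <- dagitty::parents(smoking_ca_dag, "cholesterol")
children_of_unhealthy <- dagitty::children(smoking_ca_dag, "unhealthy")

# Anything downstream of unhealthy lifestyle (plus the node itself)
descendants_of_unhealthy <- dagitty::descendants(smoking_ca_dag, "unhealthy")

# ggdag can highlight the same relationship on the plot
ggdag_parents(smoking_ca_dag, "cholesterol")
```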

So, in studying the causal effect of smoking on cardiac arrest, where
does this DAG leave us? We only want to know the directed path from
smoking to cardiac arrest, but there also exists an indirect, or
back-door, path. This is confounding. Judea Pearl, who developed much of
the theory of causal graphs, said that confounding is like water in a
pipe: it flows freely in open pathways, and we need to block it
somewhere along the way. We don’t necessarily need to block the water at
multiple points along the same back-door path, although we may have to
block more than one path. We often talk about *confounders*, but
really we should talk about *confounding*, because it is about
the pathway more than any particular node along the path.

Chains and forks are open pathways, so in a DAG where nothing is conditioned upon, any back-door paths must be one of the two. In addition to the directed pathway to cardiac arrest, there’s also an open back-door path through the forked path at unhealthy lifestyle and on from there through the chain to cardiac arrest:
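A sketch of how these two paths could be listed and plotted, assuming the same DAG structure as above (rebuilt here without labels so the block runs on its own). `dagitty::paths()` returns each path between the two nodes along with whether it is open:

```r
library(ggdag)

# Same structure as the smoking DAG, without plot labels
smoking_ca_dag <- dagify(
  cardiacarrest ~ cholesterol,
  cholesterol ~ smoking + weight,
  smoking ~ unhealthy,
  weight ~ unhealthy,
  latent = "unhealthy",
  exposure = "smoking",
  outcome = "cardiacarrest"
)

# List every path between exposure and outcome, ignoring edge direction
smoking_paths <- dagitty::paths(
  smoking_ca_dag,
  from = "smoking",
  to = "cardiacarrest"
)

smoking_paths$paths # the paths, written out with arrows
smoking_paths$open  # whether each path is open with nothing adjusted for

# Plot each open path in its own panel, with the rest of the DAG shadowed
ggdag_paths(smoking_ca_dag, shadow = TRUE)
```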

We need to account for this back-door path in our analysis. There are
many ways to go about that (stratification, including the variable in a
regression model, matching, inverse probability weighting), all with pros
and cons. But each strategy must include a decision about *which*
variables to account for. Many analysts take the strategy of putting in
all possible confounders. This can be bad news, because adjusting for
colliders and mediators can introduce bias, as we’ll discuss shortly.
Instead, we’ll look at *minimally sufficient* adjustment sets:
sets of covariates that, when adjusted for, block all back-door paths,
but include no more and no fewer variables than necessary. That means there can be
many minimally sufficient sets, and if you remove even one variable from
a given set, a back-door path will open. Some DAGs, like the first one
in this vignette (x -> y), have no back-door paths to close, so the
minimally sufficient adjustment set is empty (sometimes written as
“{}”). Others, like the cyclic graph above, or DAGs with important
variables that are unmeasured, cannot produce any sets sufficient to
close back-door paths.

For the smoking-cardiac arrest question, there is a single set with a single variable: {weight}. Accounting for weight will give us an unbiased estimate of the relationship between smoking and cardiac arrest, assuming our DAG is correct. We do not need to (or want to) control for cholesterol, however, because it’s an intermediate variable between smoking and cardiac arrest; controlling for it blocks the path between the two, which will then bias our estimate (see below for more on mediation).
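The adjustment set can be computed rather than read off the graph by hand. The following sketch rebuilds the DAG without labels so it is self-contained, then asks `dagitty::adjustmentSets()` for the minimal sets (its default) and plots them with `ggdag_adjustment_set()`. The name `smoking_sets` is illustrative:

```r
library(ggdag)

# Same structure as the smoking DAG, without plot labels
smoking_ca_dag <- dagify(
  cardiacarrest ~ cholesterol,
  cholesterol ~ smoking + weight,
  smoking ~ unhealthy,
  weight ~ unhealthy,
  latent = "unhealthy",
  exposure = "smoking",
  outcome = "cardiacarrest"
)

# Minimal sufficient adjustment sets for the total effect of smoking;
# latent variables such as unhealthy lifestyle are never included
smoking_sets <- dagitty::adjustmentSets(
  smoking_ca_dag,
  exposure = "smoking",
  outcome = "cardiacarrest"
)

smoking_sets

# Plot the DAG with the adjusted variables marked
ggdag_adjustment_set(smoking_ca_dag)
```

Because unhealthy lifestyle is declared latent, it cannot appear in any set, which is why weight is the variable that closes the back-door path here.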

More complicated DAGs w