- Introduction
- System requirements and installation troubleshooting
- Data required for a NetRep analysis
- Running the permutation procedure to test module preservation
- The module preservation statistics
- Visualising network modules
- Calculating the network properties of a module
- Managing memory with large datasets
- Running NetRep on a cluster
- Optimising runtime

The **NetRep** package provides functions for assessing
the preservation of network modules across datasets.

This type of analysis is suitable where networks can be meaningfully
inferred from multiple datasets. These include gene coexpression
networks, protein-protein interaction networks, and microbial
co-occurence networks. Modules within these networks consist of groups
of nodes that are particularly interesting: for example a group of
tightly connected genes associated with a disease, groups of genes
annotated with the same term in the Gene Ontology database, or groups of
interacting microbial species, *i.e.* communities.

Application of this method can answer questions such as:

- Do the relationships between genes in a module replicate in an independent cohort?
- Are these gene coexpression modules preserved across tissues or are they tissue specific?
- Are these modules conserved across species?
- Are microbial communities preseved across multiple spatial locations?

A typical workflow for a **NetRep** analysis will
usually contain the following steps, usually as separate scripts.

- Calculate the correlation structure and network edges in each dataset using some network inference algorithm.
- Load these matrices into R and set up the input lists for
**NetRep**’s functions. - Run the permutation test procedure to determine which modules are preserved in your test dataset(s).
- Visualise your modules of interest.
- Calculate the network properties in your modules of interest for downstream analyses.

**NetRep** and its dependencies require several third
party libraries to be installed. If not found, installation of the
package will fail.

**NetRep** requires:

- A compiler with
`C++11`

support for the`<thread>`

libary. - A
`fortran`

compiler. `BLAS`

and`LAPACK`

libraries.

The following sections provide operating system specific advice for
getting **NetRep** working if installation through R
fails.

The necessary `fortran`

and `C++11`

compilers
are provided with the `Xcode`

application and subsequent
installation of `Command line tools`

. The most recent version
of OSX should prompt you to install these tools when installing the
`devtools`

package from RStudio. Those with older versions of
OSX should be able to install these tools by typing the following
command into their Terminal application:
`xcode-select --install`

.

Some users on OSX Mavericks have reported that even after this step
they receive errors relating to `-lgfortran`

or
`-lquadmath`

. This is reportedly solved by installing the
version of `gfortran`

used to compile the R binary for OSX:
`gfortran-4.8.2`

. This can be done using the following
commands in your `Terminal`

application:

For Windows users **NetRep** requires R version 3.3.0 or
later. The necessary `fortran`

and `C++11`

compilers are provided with the `Rtools`

program. We
recommend installation of `NetRep`

through
`RStudio`

, which should prompt the user and install these
tools when running
`devtools::install_github("InouyeLab/NetRep")`

. You may need
to run this command again after `Rtools`

finishes
installing.

If installation fails when compiling **NetRep** at
`permutations.cpp`

with an error about
`namespace thread`

, you will need to install a newer version
of your compiler that supports this `C++11`

feature. We have
found that this works on versions of `gcc`

as old as
`gcc-4.6.3`

.

If installation fails prior to this step it is likely that you will
need to install the necessary compilers and libraries, then reinstall R.
For `C++`

and `fortran`

compilers we recommend
installing `g++`

and `gfortran`

from the
appropriate package manager for your operating system
(e.g. `apt-get`

for Ubuntu). `BLAS`

and
`LAPACK`

libraries can be installed by installing
`libblas-dev`

and `liblapack-dev`

. Note that these
libraries **must** be installed prior to installation of
R.

Any **NetRep** analysis requires the following data to
be provided and pre-computed for each dataset:

- An adjacency matrix whose entries indicate the strength of the relationship between nodes.
- A matrix whose entries contain the correlation coefficient between each pair of nodes in the network.
- a vector containing the module/group label for each node in the network for each discovery dataset.
- Optionally, a “data matrix”, which contains the data used to
calculate the correlation structure and infer the network,
*e.g.*gene expression data.

There are many different approaches to network inference and module detection. For gene expression data, we recommend using Weighted Gene Coexpression Network Analysis through the WGCNA package. For microbial abundance data we recommend the Python program SparCC. Microbial communities (modules) can then be defined as any group of significantly co-occuring microbes.

For this vignette, we will use gene expression data simulated for two
independent cohorts. The *discovery* dataset was simulated to
contain four modules of varying size, two of which (Modules 1 and 4)
replicate in the *test* dataset.

Details of the simulation are provided in the documentation for the
package data (see `help("NetRep-data")`

).

This data is provided with the **NetRep** package:

This command loads seven objects into the R session:

`discovery_data`

: a matrix with 150 columns (genes) and 30 rows (samples) whose entries correspond to the expression level of each gene in each sample in the discovery dataset.`discovery_correlation`

: a matrix with 150 columns and 150 rows containing the correlation-coefficients between each pair of genes calculated from the`discovery_data`

matrix.`discovery_network`

: a matrix with 150 columns and 150 rows containing the network edge weights encoding the interaction strength between each pair of genes in the discovery dataset.`module_labels`

: a named vector with 150 entries containing the module assignment for each gene as identified in the discovery dataset. Here, we’ve given genes that are not part of any module/group the label “0”.`test_data`

: a matrix with 150 columns (genes) and 30 rows (samples) whose entries correspond to the expression level of each gene in each sample in the test dataset.`test_correlation`

: a matrix with 150 columns and 150 rows containing the correlation-coefficients between each pair of genes calculated from the`test_data`

matrix.`test_network`

: a matrix with 150 columns and 150 rows containing the network edge weights encoding the interaction strength between each pair of genes in the test dataset.

Next, we will combine these objects into list structures. All
functions in the **NetRep** package take the following
arguments:

`network`

: a list of interaction networks, one for each dataset.`data`

: a list of data matrices used to infer those networks, one for each dataset.`correlation`

: a list of matrices containing the pairwise correlation coefficients between variables/nodes in each dataset.`moduleAssignments`

: a list of vectors, one for each*discovery*dataset, containing the module assignments for each node in that dataset.`modules`

: a list of vectors, one vector for each*discovery*dataset, containing the names of the modules from that dataset to run the function on.`discovery`

: a vector indicating the names or indices to use as the*discovery*datasets in the`network`

,`data`

,`correlation`

,`moduleAssignments`

, and`modules`

arguments.`test`

: a list of vectors, one vector for each*discovery*dataset, containing the names or indices of the`network`

,`data`

, and`correlation`

argument lists to use as the*test*dataset(s) for the analysis of each*discovery*dataset.

Each of these lists may contain any number of datasets. The names
provided to each list are used by the `discovery`

and
`test`

arguments to determine which datasets to compare. More
than one dataset can be specified in each of these arguments, for
example when performing a pairwise analysis of gene coexpression modules
identified in multiple tissues.

Typically we would put the code that reads in our data and sets up the input lists in its own script. This loading script can then be called from our scripts where we calculate the module preservation, visualise our networks, and calculate the network properties:

```
# Read in the data:
data("NetRep")
# Set up the input data structures for NetRep. We will call these datasets
# "cohort1" and "cohort2" to avoid confusion with the "discovery" and "test"
# arguments in NetRep's functions:
data_list <- list(cohort1=discovery_data, cohort2=test_data)
correlation_list <- list(cohort1=discovery_correlation, cohort2=test_correlation)
network_list <- list(cohort1=discovery_network, cohort2=test_network)
# We do not need to set up a list for the 'moduleAssignments', 'modules', or
# 'test' arguments because there is only one "discovery" dataset.
```

We will call these “cohort1” and “cohort2” to avoid confusion with
the arguments “discovery” and “test” common to **NetRep**’s
functions.

Now we will use **NetRep** to permutation test whether
the network topology of each module is preserved in our test dataset
using the `modulePreservation`

function. This function
calculates seven module preservation statistics for each module (more on
these later), then performs a permutation procedure in the test dataset
to determine whether these statistics are significant.

We will run 10,000 permutations, and split calculation across 2
threads so that calculations are run in parallel. By default,
`modulePreservaton`

will test the preservation of all
modules, excluding the network background which is assumed to have the
label “0”. This of course can be changed: there are many more arguments
than shown here which control how `modulePreservation`

runs.
See `help("modulePreservation")`

for a full list of
arguments.

```
# Assess the preservation of modules in the test dataset.
preservation <- modulePreservation(
network=network_list, data=data_list, correlation=correlation_list,
moduleAssignments=module_labels, discovery="cohort1", test="cohort2",
nPerm=10000, nThreads=2
)
```

```
## [2023-08-19 15:45:15 BST] Validating user input...
## [2023-08-19 15:45:15 BST] Checking matrices for problems...
## [2023-08-19 15:45:15 BST] Input ok!
## [2023-08-19 15:45:15 BST] Calculating preservation of network subsets from dataset "cohort1" in
## dataset "cohort2".
## [2023-08-19 15:45:15 BST] Pre-computing network properties in dataset "cohort1"...
## [2023-08-19 15:45:15 BST] Calculating observed test statistics...
## [2023-08-19 15:45:15 BST] Generating null distributions from 10000 permutations using 2
## threads...
##
##
0% completed.
69% completed.
100% completed.
##
## [2023-08-19 15:45:17 BST] Calculating P-values...
## [2023-08-19 15:45:17 BST] Collating results...
## [2023-08-19 15:45:17 BST] Done!
```

The results returned by `modulePreservation`

for each
dataset comparison are a list containing seven elements:

`nulls`

the null distribution for each statistic and module generated by the permutation procedure.`observed`

the observed value of each module preservation statistic for each module.`p.values`

the p-values for each module preservation statistic for each module.`nVarsPresent`

the number of variables in the*discovery*dataset that had corresponding measurements in the*test*dataset.`propVarsPresent`

the proportion of nodes in each module that had corresponding measurements in the*test*dataset.`totalSize`

the total number of nodes in the*discovery*network.`alternative`

the alternate hypothesis used in the test (e.g. “the module preservation statistics are higher than expected by chance”).

If the *test* dataset has also had module discovery performed
in it, a contigency table tabulating the overlap in module content
between the two datasets is returned.

Let’s take a look at our results:

```
## avg.weight coherence cor.cor cor.degree cor.contrib avg.cor avg.contrib
## 1 0.161069393 0.6187688 0.78448573 0.90843993 0.8795006 0.550004272 0.76084777
## 2 0.001872928 0.1359063 0.17270312 -0.03542772 0.5390504 0.034040922 0.23124826
## 3 0.001957475 0.1263280 0.01121223 -0.17179855 -0.1074944 -0.007631867 0.05412794
## 4 0.046291489 0.4871179 0.32610667 0.68122446 0.5251965 0.442614173 0.68239136
```

```
## avg.weight coherence cor.cor cor.degree cor.contrib avg.cor avg.contrib
## 1 0.00009999 0.00009999 0.00009999 0.00009999 0.00009999 0.00009999 0.00009999
## 2 0.97960204 0.96560344 0.00799920 0.56514349 0.00249975 0.01609839 0.00549945
## 3 0.98940106 0.98560144 0.42195780 0.81091891 0.71872813 0.99330067 0.87881212
## 4 0.00009999 0.00009999 0.00009999 0.00009999 0.00079992 0.00009999 0.00009999
```

For now, we will consider all statistics equally important, so we will consider a module to be preserved in “cohort2” if all the statistics have a permutation test P-value < 0.01:

```
## 1 2 3 4
## 0.00009999 0.97960204 0.99330067 0.00079992
```

Only modules 1 and 4 are reproducible at this significance threshold.

So what do these statistics measure? Let’s take a look at the network topology of Module 1 in the discovery dataset, “cohort1”:

From top to bottom, the plot shows:

- A heatmap of the correlation coefficients between nodes in the module.
- A heatmap of the network edge weights between nodes in the module.
- The scaled weighted degree for each node: this is the sum of each node’s connections to all other nodes in the module, normalised to the most connected node. This is a relative measure of how connected each node is within the module.
- The contribution of each node to the module: this is the correlation between each node and the module’s summary profile.
- A heatmap of the measurements of each node in the module across samples in the dataset (y-axis)
- To the left, the module’s summary profile: a set of observations
that best summarise the measurements across all nodes for each sample.
This is calculated as the first eigenvector of a principal component
analysis:
*i.e.*the linear combination of nodes that explains the greatest portion of the variance in the module’s data.

Now, let’s take a look at the topology of Module 1 in the discovery and the test datasets side by side along with the module preservation statistics:

There are seven module preservation statistics:

**‘cor.cor’**measures the*concordance of the correlation structure*: or, how similar the correlation heatmaps are between the two datasets.**‘avg.cor’**measures the*average magnitude of the correlation coefficients*of the module in the test dataset: or, how tightly correlated the module is on average in the test dataset. This score is penalised where the correlation coefficients change in sign between the two datasets.**‘avg.weight’**measures the*average magnitude of edge weights*in the test dataset: or how connected nodes in the module are to each other on average.**‘cor.degree’**measures the*concordance of the weighted degree*of nodes between the two datasets: or, whether the nodes that are most strongly connected in the discovery dataset remain the most strongly connected in the test dataset.**‘cor.contrib’**measures the*concordance of the node contribution*between the two datasets: this measures whether the module’s summary profile summarises the data in the same way in both datasets.**‘avg.contrib’**measures the*average magnitude of the node contribution*in the test dataset: this is a measure of how coherent the data is in the test dataset. This score is penalised where the node contribution changes in sign between the two datasets: for example, where a gene is differentially expressed between the two datasets.**‘coherence’**measures the proportion of variance in the module data explained by the module’s summary profile vector in the test dataset.

A permutation procedure is necessary to determine whether the value
of each statistic is significant: *e.g.* whether they are higher
than expected by chance, *i.e.* when measuring the statistics
between the module in the discovery dataset, and random sets of nodes in
the test dataset.

By default, the permutation procedure will sample from only nodes
that are present in both datasets. This is appropriate where the
assumption is that any nodes that are present in the test dataset but
not the discovery dataset are unobserved in the discovery dataset:
*i.e.* they may very well fall in one of your modules of
interest. This is appropriate for microarray data. Alternatively, you
may set `null="all"`

, in which case the permutation procedure
will sample from all variables in the test dataset. This is appropriate
where the variable can be assumed not present in the discovery dataset:
for example microbial abundance or RNA-seq data.

You can also test whether these statistics are smaller than expected
by chance by changing the alternative hypothesis in the
`modulePreservation`

function
(e.g. `alternative="lower"`

).

The module preservation statistics that **NetRep**
calculates were designed for weighted gene coexpression networks. These
are *complete* networks: every gene is connected to every other
gene with an edge weight of varying strength. Modules within these
networks are groups of genes that are tightly connected or
coexpressed.

For other types of networks, some statistics may be more suitable than others when assessing module preservation. Here, we provide some guidelines and pitfalls to be aware of when interpreting the network properties and module preservation statistics in other types of networks.

Sparse networks are networks where many edges have a “0” value: that is, networks where many nodes have no connection to each other. Typically these are networks where edges are defined as present if the relationship between nodes passes some pre-defined cut-off value, for example where genes are significantly correlated, or where the correlation between microbe presence and absence is significant. In these networks, edges may simply indicate presence or absence, or they may also carry a weight indicating the strength of the relationship.

For networks with unweighted edges, the *average edge weight*
(**‘avg.weight’**) measures the proportion of nodes that
are connected to each other. The *weighted degree* simply becomes
the node *degree*: the number of connections each node has to any
other node in the module.

If the network is sparse the permutation tests for the
*correlation of weighted degree* may be underpowered. Entries in
the null distribution will be `NA`

where there were no edges
between any nodes in the permuted module. This is because the
*weighted degree* will be 0 for all nodes, and the correlation
coefficient cannot be calculated between two vectors if all entries are
the same in either vector. This reduces the effective number of
permutations for that test: the permutation P-values will be calculated
ignoring the `NA`

entries, and the
`modulePreservation`

function will generate a warning.

You may wish to consider `NA`

entries where there were no
edges as 0 when calculating the permutation test P-values. Note that an
`NA`

entry does not necessarily mean that all edges in the
permuted module were 0: it can also mean that all edges are present and
have identical weights. To distinguish between these cases you should
check whether the `avg.weight`

is also 0.

The following code snippet shows how to identify these entries in the null distribution, replace them with zeros, and recalculate the permutation test P-values:

```
# Handling NA entries in the 'cor.degree' null distribution for sparse networks
# Get the entries in the null distribution where there were no edges in the
# permuted module
na.entries <- which(is.na(preservation$nulls[,'cor.degree',]))
no.edges <- which(preservation$nulls[,'avg.weight',][na.entries] == 0)
# Set those entries to 0
preservation$nulls[,'cor.degree',][no.edges] <- 0
# Recalculate the permutation test p-values
preservation$p.values <- permutationTest(
preservation$nulls, preservation$observed, preservation$nVarsPresent,
preservation$totalSize, preservation$alternative
)
```

For networks where the edges are directed, the user should be aware
that the *weighted degree* is calculated as the column sum of the
module within the supplied `network`

matrix. This usually
means that the result will be the *in*-degree: the number and
combined weight of edges ending in each node. To calculate the
*out*-degree you will need to transpose the matrix supplied to
the `network`

argument (*i.e.* using the
`t()`

function).

Note that directed networks are typically sparse, and have the same pitfalls as sparse networks described above.

Sparse data is data where many entries are zero. Examples include microbial abundance data: where most microbes are present in only a few samples.

Users should be aware that the *average node contribution*
(**‘avg.contrib’**), *concordance of node
contribution* (**‘cor.contrib’**), and the *module
coherence* (**‘coherence’**) will be systematically
underestimated. They are all calculated from the *node
contribution*, which measures the Pearson correlation coefficient
between each node and the *module summary*. Pearson correlation
coefficinets are inappropriate when data is sparse: their value will be
underestimated when calculated between two vectors where many
observations in either vector are equal to 0. However, this should not
affect the permutation test P-values since observations in their null
distributions will be similarly underestimated.

The biggest problem with sparse data is how to handle variables where
all observations are zero in either dataset. These will result in
`NA`

values for their *node contribution* to a module
(or permuted module). These will be ignored by the *average node
contribution* (**‘avg.contrib’**), *concordance of
node contribution* (**‘cor.contrib’**), and *module
coherence* (**‘coherence’**) statistics: which only
take complete cases. This is problematic if many nodes have
`NA`

values, since observations in their null distributions
will be for permuted modules of different sizes.

Their are two approaches to dealing with this issue:

- Filtering both datasets to contain only variables which are present in both datasets. For examples, microbes that are abundant in at least one sample in both datasets.
- Setting observations that are zero to a very small randomly
generated number. The goal is for
*node contribution*values to be close to 0 where they would otherwise be set to`NA`

. For microbial abundance data we recommend generating numbers between 0 and 1/the number of samples: the noise values should be small enough that the do not change the*node contribution*for microbes which are present in one or more samples.

For the latter, code to generate noise would look something like:

Proportional data is data where the sum of measurements across each sample is equal to 1. Examples of this include RNA-seq data and microbial abundance read data.

Users should be aware that the *average node contribution*
(**‘avg.controb’**), *concordance of node
contribution* (**‘cor.contrib’**), and the *module
coherence* (**‘coherence’**) will be systematically
overestimated. They are all calculated from the *node
contribution*, which measures the Pearson correlation coefficient
between each node and the *module summary*. Pearson correlation
coefficients are overestimated when calculated on proportional data.
This should not affect the permutation test P-values since the null
distribution observations will be similarly overestimated.

Users should also be aware of this when calculating the correlation
structure between all nodes for the `correlation`

matrix
input, and use an appropriate method for calculating these
relationships.

Homogenous modules are modules where all nodes are similarly correlated or similarly connected: differences in edge weights, correlation coefficients, and node contributions are due to noise.

For these modules, the *concordance of correlation*
(**‘cor.cor’**), *concordance of node contribution*
(**‘cor.contrib’**), and *correlation of weighted
degree* (**‘cor.degree’**) may be small, with large
permutation test P-values, even where a module is preserved, due to
irrelevant changes in node rank for each property between the discovery
and test datasets.

These statistics should be considered in the context of their
“average” counterparts: the *average correlation coefficient*
(**‘avg.cor’**), *average node contribution*
(**‘avg.contrib’**) and *average edge weight*
(**‘avg.weight’**). If these are high, with significant
permutation test P-values, and the *module coherence* is high,
then the module should be investigated further.

Module homogeneity can be investigated through plotting their network topology in both datasets (see next section). In our experience, the smaller the module, the more likely it is to be topologically homogenous.

The module preservation statistics break down for modules with less
than four nodes. The number of nodes is effectively the sample size when
calculating the value of a module preservation statistic. If you wish to
use **NetRep** to analyse these modules, you should use
only the *average edge weight* (**‘avg.weight’**),
*module coherence* (**‘coherence’**), *average
node contribution* (**‘avg.contrib’**), and *average
correlation coefficient* (**‘avg.cor’**)
statistics.

We can visualise the network topology of our modules using the
`plotModule`

function. It takes the same input data as the
`modulePreservation`

function:

`network`

: a list of network adjacency matrices, one for each dataset.`correlation`

: a list of matrices containing the correlation coefficients between nodes.`data`

: a list of data matrices used to infer the`network`

and`correlation`

matrices.`moduleAssignments`

: a list of vectors, one for each*discovery*dataset, containing the module labels for each node.`modules`

: the modules we want to plot.`discovery`

: the dataset the modules were identified in.`test`

: the dataset we want to plot the modules in.

First, let’s look at the four modules in the *discovery*
dataset:

```
plotModule(
data=data_list, correlation=correlation_list, network=network_list,
moduleAssignments=module_labels, modules=c(1,2,3,4),
discovery="cohort1", test="cohort1"
)
```

```
## [2023-08-19 15:45:17 BST] Validating user input...
## [2023-08-19 15:45:17 BST] Checking matrices for problems...
## [2023-08-19 15:45:17 BST] User input ok!
## [2023-08-19 15:45:17 BST] Calculating network properties of network subsets from dataset "cohort1"
## in dataset "cohort1"...
## [2023-08-19 15:45:17 BST] Ordering nodes...
## [2023-08-19 15:45:17 BST] Ordering samples...
## [2023-08-19 15:45:17 BST] Ordering samples...
## [2023-08-19 15:45:17 BST] rendering plot components...
## [2023-08-19 15:45:18 BST] Done!
```