RAMEN
Erick I. Navarro-Delgado
The University of British Columbiaerick.navarrodelgado@bcchr.ca
RAMEN.Rmd
Introduction
Regional Association of Methylome variability with the Exposome and geNome (RAMEN) is an R package whose goal is to integrate genomic, methylomic and exposomic data to model the contribution of genetics (G) and the environment (E) to DNA methylation (DNAme) variability. RAMEN identifies Variable Methylated Regions (VMRs) in microarray DNAme data and then, using genotype and environmental data, it identifies which of the following models better explains this variability in regions across the methylome:
Model | Name | Abbreviation |
---|---|---|
DNAme ~ G + covars | Genetics | G |
DNAme ~ E + covars | Environmental exposure | E |
DNAme ~ G + E + covars | Additive | G+E |
DNAme ~ G + E + G*E + covars | Interaction | GxE |
where G variables are represented by SNPs, E variables by environmental exposures, and where covars are concomitant variables (i.e. variables that are adjusted for in the model and not of interest in the study such as cell type proportion, age, etc.).
The main gene-environment interaction modeling pipeline is conducted though six core functions:
-
findVMRs()
identifies Variable Methylated Regions (VMRs) in microarrays -
summarizeVMRs()
summarizes the regional methylation state of each VMR -
findCisSNPs()
identifies the SNPs in cis of each VMR -
selectVariables()
conducts a LASSO-based variable selection strategy to identify potentially relevant cis SNPs and environmental variables -
lmGE()
fits linear single-variable genetic (G) and environmental (E), and pairwise additive (G+E) and interaction (GxE) linear models and select the best explanatory model per VMR. -
nullDistGE()
simulates a delta R squared null distribution of G and E effects on DNAme variability. Useful for filtering out poor-performing best explanatory models selected by lmGE().
These functions are compatible with parallel computing, which is recommended due to the computationally intensive tasks conducted by the package.
In addition to the standard gene-environment interaction modeling pipeline, RAMEN can be useful for other DNAme analyses (see Variations to the standard workflow), such as reducing the tested sites in Epigenome Wide Association Studies, grouping DNAme probes into regions, identifying SNPs near a probe, etc.
Citation
The manuscript detailing RAMEN and its use is currently under preparation. For more information about this please contact Erick I. Navarro-Delgado at erick.navarrodelgado@bcchr.ca.
Gene-environment interaction analysis
The main purpose of the RAMEN package is to conduct a methylome-wide analysis to identify which model (G, E, G+E or GxE) better explains the variability across the genome. In this vignette, we will illustrate how to use the package.
To conduct this analysis, the following cleaned data sets (i.e. after quality and exploratory data analysis checks) from a cohort are required:
- DNAme data
- DNAme array manifest
- Genotyping data
- Genotype information
- Environmental exposure data
- Concomitant variables data
Once that we have that data, the overview of the pipeline is the following:
where:
- DNAme data is grouped into VMRs, and then the DNAme state per individual is summarized in each region
- Using the identified VMRs and the genomic information, we identify the SNPs in cis for each VMR
- Both the cis SNPs and the exposome data are subjected to the variable selection stage
- The selected variables (SNPs a and Es) enter the modelling stage, which outputs one single winning model per VMR
- The thresholds obtained from the simulated null distribution are used to remove winning models which performance are likely to be due to chance.
In the following sections we will go through each of these steps and guide the user regarding the recommended parameters to use in each function of the package. For illustration purposes, we provide small toy data sets that do not intend to simulate a real biological phenomenon. These data sets are already available in the RAMEN package.
#Load the packages used throughout the vignette
library(RAMEN)
library(dplyr)
library(ggplot2)
library(tidyr)
Identify VMRs and summarize their methylation state
The first step of the pipeline is to identify the Variable Methylated Regions(VMRs) in the data set. You might be wondering “What is a VMR and why do we use them instead of DNAme levels from each CpG site?”. We use regions because it is well established that nearby CpG sites are very likely to share a similar DNAme profile and therefore work as functional units. Then, from a statistical point of view, testing separately proximal CpGs that are part of the same unit is redundant. On the other side, we use only variable regions because we are interested in the units that display a high level of variability; in other words, in non-variant sites there is no variability left to be explained by genetics or environment. So, in conclusion, we use VMRs to increase our power and reduce the multiple hypothesis testing burden by grouping probes that are likely to work as a biological unit, and by only focusing in the set of regions that are of interest of this study.
RAMEN identifies 2 categories of VMRs:
- Canonical VMRs: Group of Highly Variable Probes that are proximal and correlated. Highly Variable Probes are defined as probes above a specific variance percentile threshold specified by the user (the default is 90th percentile). The proximity distance and pearson correlation threshold is specified by the user, and the defaults are 1 kilobase and 0.15 respectively. For guidance on which correlation threshold to use, we recommend checking the Supplementary Figure 1 of the CoMeBack R package (Gatev et al., 2020) where a simulation to empirically determine a default guidance specification for a correlation threshold parameter dependent on sample size is done.
- Non canonical VMRs: Regions that are composed of Highly Variable Probes that have no nearby probes measured in the array (according to the distance parameter specified by the user). This category was created to take into account the characteristics of the DNAme microarray plataform, which has probes covering non-homogenelously the genome. This is specially important for microarrays such as the EPIC array which has a high number of probes in regulatory regions that are represented by a single probe. Furthermore, these probes have been shown to be good representatives of the methylation state of its surroundings (Pidsley et al., 2016). By creating this category, we recover those informative HVPs that would otherwise be excluded from the analysis because of working with the canonical VMR definition in the array context.
The first step is to identify Variable Methylated
Regions(VMRs) using the RAMEN::findVMRs()
function. This function uses GenomicRanges::reduce() to group the
regions, which is strand-sensitive. In the Illumina microarrays, the
MAPINFO for all the probes is usually provided as for the + strand. If
you are using this array, we recommend to first convert the strand of
all the probes to “+”.
For this step, we recommend users to use M-values because its use is more appropiate for statistical analyses (see Pan Du, et al., 2010, BMC Bioinformatics)
#We need to modify the RAMEN::test_array_manifest object by assigning to
#row names to the probe ID column; it was saved this way because storing
#the TargetID as row names reduced significantly the size of the data set.
test_array_manifest_final = RAMEN::test_array_manifest %>%
tibble::rownames_to_column(var = "TargetID")
VMRs = RAMEN::findVMRs(array_manifest = test_array_manifest_final,
methylation_data = RAMEN::test_methylation_data,
cor_threshold = 0,
var_method = "variance",
var_threshold_percentile = 0.9,
max_distance = 1000)
#> Identifying Highly Variable Probes...
#> Identifying non canonical Variable Methylated Regions...
#> Identifying canonical Variable Methylated Regions...
#> Applying correlation filter to canonical Variable Methylated Regions...
#> Warning: executing %dopar% sequentially: no parallel backend registered
#Take a look at the resulting object
dplyr::glimpse(VMRs)
#> List of 4
#> $ var_score_threshold : Named num 13.5
#> ..- attr(*, "names")= chr "90%"
#> $ highly_variable_probes:'data.frame': 300 obs. of 2 variables:
#> ..$ TargetID : chr [1:300] "cg06187584" "cg09872009" "cg05437132" "cg26689799" ...
#> ..$ var_score: num [1:300] 17.1 14.8 15.2 13.7 17.1 ...
#> $ canonical_VMRs :Formal class 'GRanges' [package "GenomicRanges"] with 7 slots
#> .. ..@ seqnames :Formal class 'Rle' [package "S4Vectors"] with 4 slots
#> .. ..@ ranges :Formal class 'IRanges' [package "IRanges"] with 6 slots
#> .. ..@ strand :Formal class 'Rle' [package "S4Vectors"] with 4 slots
#> .. ..@ seqinfo :Formal class 'Seqinfo' [package "GenomeInfoDb"] with 4 slots
#> .. ..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
#> .. ..@ elementType : chr "ANY"
#> .. ..@ metadata : list()
#> $ non_canonical_VMRs :Formal class 'GRanges' [package "GenomicRanges"] with 7 slots
#> .. ..@ seqnames :Formal class 'Rle' [package "S4Vectors"] with 4 slots
#> .. ..@ ranges :Formal class 'IRanges' [package "IRanges"] with 6 slots
#> .. ..@ strand :Formal class 'Rle' [package "S4Vectors"] with 4 slots
#> .. ..@ seqinfo :Formal class 'Seqinfo' [package "GenomeInfoDb"] with 4 slots
#> .. ..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots
#> .. ..@ elementType : chr "ANY"
#> .. ..@ metadata : list()
As we can see, RAMEN::findVMRs()
returns a list with
four elements:
- var_score_threshold: The MAD-score or variance threshold used to define Highly Variable Probes.
- highly_variable_probes: a data frame with the probes that passed the variability score threshold imposed by the user, and their variability score (MAD score or variance).
- canonical_VMRs: a GRanges object with strict candidate VMRs - regions composed of two or more contiguous, correlated and proximal Highly Variable Probes; thresholds depend on the ones specified by the user)
- non_canonical_VMRs: a GRanges object with highly variable probes without neighboring CpGs measured in max_distance on the array. Category created to take into acccount the Illumina array design of single probes capturing the methylation state of regulatory regions.
Furthermore, we can see the following warning message in the chunk above:
#> Warning: executing %dopar% sequentially: no parallel backend registered
This is printed in the screen just to warn us that
RAMEN::findVMRs()
is running sequentially. RAMEN supports
parallel computing for increased speed, which is really important when
working with real data sets that tend to contain information from the
whole genome. To do so, you have to set the parallel backend in your R
session BEFORE running the function (e.g.,
doFuture::registerDoFuture()
) and then the evaluation
strategy (e.g., future::plan(multisession)
). After that,
the function can be run normally. When working with big datasets, the
parallel backend might throw an error if you exceed the maximum allowed
size of globals exported for future expression. This can be fixed by
increasing the allowed size (e.g. running
options(future.globals.maxSize= +Inf)
)
Finally, we will convert the output of RAMEN::findVMRs()
to a data frame, which is an object that we can easily use to produce
plots and explore the results, and the object that is needed for the
following parts of the pipeline.
VMRs_df = as.data.frame(VMRs[["canonical_VMRs"]]) %>%
rbind(as.data.frame(VMRs[["non_canonical_VMRs"]])) %>%
dplyr::select( -c(width.1,strand))
head(VMRs_df)
#> seqnames start end width n_VMPs probes median_correlation
#> 1 21 10990119 10990903 785 2 cg098720.... 0.6099180
#> 2 21 11109021 11109336 316 2 cg007508.... 0.6261681
#> 3 21 15456605 15456789 185 2 cg195094.... 0.8595934
#> 4 21 15588038 15588830 793 2 cg023649.... 0.6829510
#> 5 21 15955548 15956048 501 3 cg147721.... 0.7548517
#> 6 21 16436438 16437768 1331 3 cg116304.... 0.7124538
With the VMRs as a data frame, we can explore our data set using ggplot, such as the following example:
VMRs_df %>%
dplyr::filter(width > 1) %>% #Only plot canonical VMRs
ggplot2::ggplot(aes(x = width))+
ggplot2::geom_histogram(binwidth = 50, fill = "#BAB4D8")+
ggplot2::theme_classic()+
ggplot2::ggtitle("Canonical VMRs width (bp)")
Next, we want to summarize the DNAme level of each VMR per
individual. To do this, we use RAMEN::summarizeVMRs()
. For
non canonical VMRs, there is nothing to summarize, so the DNAme level of
the corresponding probe is returned. For canonical VMRs, the median
DNAme level of the region is returned per individual as the
representative value.
summarized_methyl_VMR = RAMEN::summarizeVMRs(VMRs_df = VMRs_df,
methylation_data = test_methylation_data)
# Look at the resulting object
summarized_methyl_VMR[1:5,1:5]
#> 1 2 3 4 5
#> ID1 4.935942 2.853168 3.219038 5.384813 3.320883317
#> ID2 1.879166 2.699689 3.952487 2.927166 1.934283315
#> ID3 3.311818 1.078262 1.153573 4.690962 9.238518404
#> ID4 6.558106 4.683173 1.688233 3.448903 0.007576317
#> ID5 2.899969 4.930614 2.243714 5.788151 2.843904388
The result is a data frame of VMR IDs as columns and individual IDs as rows.
Identify cis SNPs
After identifying the VMRs, we recommend to use only SNPs in cis of each VMR, since genetic variants that associate with DNAme changes tend to be more abundant in the surroundings of the corresponding DNAme site (McClay et al., 2015). Also, the effect sizes of mQTLs (genetic variants associated with DNAme changes) are stronger in cis SNPs compared to trans SNPs. Then, by restricting the analysis to cis SNPs, we greatly reduce the number of variables while keeping most of the important ones.
There is not a clear consensus on how close a SNP has to be from a DNAme site to be considered cis - the distance threshold tend to go from few kb to 1 megabase. We recommend to use a 1 Mb window to cast a wide net to catch potentially relevant SNPs.
VMRs_df = RAMEN::findCisSNPs(VMRs_df = VMRs_df,
genotype_information = RAMEN::test_genotype_information,
distance = 1e+06)
#> Important: please make sure that the positions of the VMR data frame and the ones in the genotype information are from the same genome build.
#Take a look at the result
dplyr::glimpse(VMRs_df)
#> Rows: 131
#> Columns: 10
#> $ VMR_index <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", …
#> $ seqnames <fct> 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21,…
#> $ start <int> 10990119, 11109021, 15456605, 15588038, 15955548, 1…
#> $ end <int> 10990903, 11109336, 15456789, 15588830, 15956048, 1…
#> $ width <int> 785, 316, 185, 793, 501, 1331, 888, 977, 573, 61, 5…
#> $ n_VMPs <int> 2, 2, 2, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
#> $ probes <list> <"cg09872009", "cg05437132">, <"cg00750806", "cg12…
#> $ median_correlation <dbl> 0.6099180, 0.6261681, 0.8595934, 0.6829510, 0.75485…
#> $ surrounding_SNPs <int> 1, 1, 610, 667, 726, 794, 757, 529, 709, 788, 829, …
#> $ SNP <list> "21:10873592:G:A", "21:10873592:G:A", <"21:1459574…
We can see that the resulting VMRs_df object is almost exactly the same, but with two new columns (surrounding_SNPs and SNP) that contain information about how many SNPs were found in cis and what are their IDs according to the genotype data that we have.
It is important to highlight the following characteristics of the VMRs_df object:
- The VMR_index column is a character vector. This column corresponds to the unique identifier of each VMR in our data set. It is important to keep it as a character and be careful with it not being converted to numeric, which can happen if you save the VMRs_df object as a table, read it, and use that second object in the rest of the pipeline.
- The columns probes and SNP contain lists. This structure is really important for the rest of the analysis and columns containing lists will keep appearing in other function outputs. If you want to know the recommended way to save and load these objects, please check the Frequently Asked Questions.
We can also explore the resulting object through plots such as the following:
VMRs_df %>%
dplyr::mutate(type = case_when(n_VMPs == 1 ~ "non canonical", #non canonical VMRs have only 1 probe by definition
TRUE ~ "canonical"),
surrounding_SNPs = case_when( surrounding_SNPs > 3000 ~ 3000,
TRUE ~ surrounding_SNPs)) %>%
ggplot2::ggplot(aes(x = surrounding_SNPs)) +
ggplot2::geom_density() +
ggplot2::facet_grid("type") +
ggplot2::theme_classic()
Conduct variable selection on genome and exposome variables
The following stage in the pipeline is to screen the available
variables in our environmental and cis SNPs data sets to
identify the potentially relevant ones. This is achieved with the
RAMEN::selectVariables()
function. This function uses a
data-driven approach based on LASSO, which is an embedded variable
selection method commonly used in machine learning.
In a nutshell, LASSO penalizes models that are more complex (i.e., that contain more variables) in favor of simpler models (i.e. that contain less variables), but not at the expense of reducing predictive power. Using LASSO’s variable screening property (i.e., with high probability, the LASSO estimated model includes the substantial covariates and drops the redundant ones) this function is intended to select genotype and environmental variables in each Variable Methylated Region (VMR) with potential relevance in the presence of the user-specified concomitant variables (which are known DNAme confounders such as age, cell type proportion, etc.). For more information about the method, we encourage the users to read the documentation of the function, and for further information about LASSO we direct readers to Bühlmann and Van de Geer, 2011.
Overall, conducting our variable selection strategy reduces the overall computational time and improves the modeling performance by:
- Reducing the universe of variables that will be used to fit models in the following stage (G/E/G+E/GxE model fitting and comparison)
- Removing redundant variables, which are highly expected in genetic and environmental data sets with a high number of variables
- Limiting the interactions terms to scenarios where both the G and E main effects were selected to be potentially relevant, which can be think of as an interaction variable selection using a weak hierarchy
- Using LASSO, a method with good variable selection performance and scalability
Please make sure that your data has no NAs, since the LASSO implementation we use in RAMEN does not support missing values. If your data has missing values, consider handling them.
selected_variables = RAMEN::selectVariables(
VMRs_df = VMRs_df,
genotype_matrix = RAMEN::test_genotype_matrix,
environmental_matrix= RAMEN::test_environmental_matrix,
covariates = RAMEN::test_covariates,
summarized_methyl_VMR = summarized_methyl_VMR,
seed = 1
)
#> Loading required package: foreach
#> Loading required package: rngtools
Since LASSO makes use of Random Number Generation, setting a seed is highly encouraged for result’s reproducibility using the seed argument. As a note, setting a seed inside of this function modifies the seed globally (which is R’s default behavior).
The output of RAMEN::selectVariables()
is an object with
the VMR index, and the G and E variables selected for each VMR.
dplyr::glimpse(selected_variables)
#> Rows: 131
#> Columns: 3
#> $ VMR_index <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11"…
#> $ selected_genot <list> "21:10873592:G:A", "21:10873592:G:A", <"21:15450190:C:…
#> $ selected_env <list> <"E43", "E3", "E5", "E7", "E24", "E25", "E28", "E35", …
We can see how using RAMEN::selectVariables()
reduces
the number of variables (originally 100 environmental variables and 785
SNPs per VMR on average as seen in Figure @ref(fig:cissnps)).
selected_variables %>%
dplyr::left_join(VMRs_df %>%
select(c(VMR_index,n_VMPs)),
by = "VMR_index") %>%
dplyr::transmute(VMR_index = VMR_index,
VMR_type = case_when(n_VMPs > 1 ~ "canonical",
n_VMPs == 1 ~ "non canonical"),
genotype = lengths(selected_genot),
environment = lengths(selected_env)) %>%
tidyr::pivot_longer(-c(VMR_index, VMR_type)) %>%
dplyr::rename(group = name,
variables = value) %>%
ggplot2::ggplot(aes(x = VMR_type, y = variables)) +
ggplot2::geom_violin() +
ggplot2::geom_boxplot(width=0.1, outlier.shape=NA) +
ggplot2::facet_wrap(~group)+
ggplot2::ggtitle("Selected variables") +
ggplot2::theme_classic()
It is also expected in real data to have VMRs where no SNP and/or no environmental variables were selected, since not all the DNAme sites in the genome are expected to show an association with the genetic variation or environmental exposures data sets that are captured in a study. The proportion of VMRs under these scenarios will depend on the data sets.
Author’s note about variables interpretation
LASSO variable selection is not consistent when there is multicollinearity in the data (i.e., correlation between variables), which is expected due to the high amount of G and E variables that are present in studies of this kind. This means that if you were to run LASSO several times, and two variables were to be highly correlated, the method would select one and drop the other one at random. This is not a problem with the pipeline because the main conclusion per VMR is whether the DNAme is better explained by G and/or E components. As an example, if a VMR is better explained by SNP1 and SNP2, which are both highly correlated one with the other, LASSO will randomly pick SNP1 OR SNP2 (because they are relevant but they provide redundant information); if we were to fit a model with SNP1 or SNP2 in the following stage, the winning model would still be G. In other words, the main goal of the pipeline is to know whether the VMR’s DNAme is better explained by G and/or E. The user is therefore warned to be cautious not to over-interpret the individual selected variables. Selected variables might be used as hypothesis generators of associations, keeping in mind that the selected variable might be representing other variables in the data set that provide similar information.
Identify the best explanatory model (G/E/G+E/GxE) per VMR
Now that we have selected the list of potentially relevant G and E
variables, we will fit the models mentioned in Table
@ref(tab:modelstable) using the RAMEN::lmGE()
function.
This function fits, for each VMR, G and E models with all of the
variables selected, as well as all their possible pairwise combinations
of G+E and GxE.
After fitting this model, the best model per group (group = G, E, G+E or GxE) is selected using Akaike Information Criterion (AIC) or Bayesian Information Criterion (BIC). We recommend using AIC because BIC assumes that the true model is in the set of compared models. Since this function fits models with individual variables, and we assume that DNAme variability is more likely to be influenced by more than one single SNP/environmental exposure at a time, we hypothesize that in most cases, the true model will not be in the set of compared models. Also, AIC excels in situations where all models in the model space are “incorrect”, and AIC is preferentially used in cases where the true underlying function is unknown and our selected model could belong to a very large class of functions where the relationship could be pretty complex. It is worth mentioning however that, both metrics tend to pick the same model in a large number of scenarios. We suggest the users to read Arijit Chakrabarti & Jayanta K. Ghosh, 2011 for further information about the difference between these metrics. After selecting the best model per group (G,E,G+E pr GxE), the model with the lowest AIC or BIC will be declared as the winning model.
Additionally, RAMEN::lmGE()
conducts a variance
decomposition analysis, so that the relative R2 contribution of each of
the variables of interest (G, E and GxE) is reported. This decomposition
is done using the relaimpo R
package, using the Lindeman, Merenda and Gold (lmg) method, which is
based on the heuristic approach of averaging the relative R contribution
of each variable over all input orders in the linear model.
lmge_res = RAMEN::lmGE(
selected_variables = selected_variables,
summarized_methyl_VMR = summarized_methyl_VMR,
genotype_matrix = RAMEN::test_genotype_matrix,
environmental_matrix = RAMEN::test_environmental_matrix,
covariates = RAMEN::test_covariates,
model_selection = "AIC"
)
# Check the output
dplyr::glimpse(lmge_res)
#> Rows: 128
#> Columns: 13
#> $ VMR_index <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11…
#> $ model_group <chr> "G+E", "G+E", "GxE", "GxE", "G+E", "GxE", "G+E", "GxE"…
#> $ variables <list> <"21:10873592:G:A", "E43">, <"21:10873592:G:A", "E43"…
#> $ tot_r_squared <dbl> 0.5527507, 0.5115764, 0.6075657, 0.6794845, 0.7680461,…
#> $ g_r_squared <dbl> 0.1996156, 0.2095645, 0.3064024, 0.2738786, 0.3981840,…
#> $ e_r_squared <dbl> 0.34206198, 0.27143559, 0.15293109, 0.20266291, 0.2769…
#> $ gxe_r_squared <dbl> NA, NA, 0.1330680, 0.1126333, NA, 0.1432113, NA, 0.108…
#> $ AIC <dbl> 144.6841, 148.7250, 145.6987, 135.7819, 129.9252, 123.…
#> $ second_winner <chr> "GxE", "GxE", "G+E", "G+E", "GxE", "G+E", "GxE", "G+E"…
#> $ delta_aic <dbl> 1.04970268, 1.34557464, 1.01022314, 3.88212651, 0.0761…
#> $ delta_r_squared <dbl> -0.013945293, -0.010539189, 0.041420496, 0.069427874, …
#> $ basal_AIC <dbl> 164.4893, 165.2905, 167.3019, 161.0771, 166.8374, 159.…
#> $ basal_rsquared <dbl> 0.011073059, 0.030576374, 0.015164242, 0.090309667, 0.…
The output of RAMEN::lmGE()
is a data frame with the
following 13 columns:
- VMR_index: The index of the respective VMR
- model_group: The selected winning model (G, E, G+E or GxE)
- variables: The variable(s) that are present in the winning model (excluding the covariates, which are included in all the models)
- tot_r_squared: total R squared of the winning model
- g_r_squared: Estimated R2 allocated to the G component in the winning model, if applicable
- e_r_squared: Estimated R2 allocated to the E in the winning model, if applicable.
- gxe_r_squared: Estimated R2 allocated to the interaction in the winning model (GxE), if applicable.
- AIC/BIC: AIC or BIC metric from the best model in each VMR (depending on the option specified in the argument model_selection).
- second_winner: The second group that possesses the next best model after the winning one (i.e., G, E, G+E or GxE). This column may have NA if the variables in selected_variables correspond only to one group (G or E), so that there is no other model groups to compare to.
- delta_aic/delta_bic: The difference of AIC or BIC (depending on the option specified in the argument model_selection) of the winning model and the best model from the second_winner group (i.e., G, E, G+E or GxE). This column may have NA if the variables in selected_variables correspond only to one group (G or E), so that there is no other groups to compare to.
- delta_r_squared: The R2 of the winning model - R2 of the second winner model. This column may have NA if the variables in selected_variables correspond only to one group (G or E), so that there is no other groups to compare to.
- basal_AIC/basal_BIC: AIC or BIC of the basal model (i.e., model fitted only with the concomitant variables specified in the covariates argument)
- basal_rsquared: The R2 of the basal model (i.e., model fitted only with the concomitant variables specified in the covariates argument)
Remove poor performing winning models
The core pipeline from the RAMEN package identifies the best
explanatory model per VMR. However, despite these models being winners
in comparison to models including any other G/E variable(s) in the
dataset, some winning models might perform no better than what we would
expect by chance. Therefore, The last step of the pipeline after
identifying the winning models per VMR is to compute a null distribution
to remove the winning models that are very likely to be winners by
chance. To do so, we use RAMEN::nullDistGE()
.
The goal of RAMEN::nullDistGE()
is to create a
distribution of increase in R2 after including the SNP/E/SNP*E variables
under the null hypothesis of G and E having no associations with DNAme.
The null distribution is obtained through shuffling the G and E
variables in a given dataset and conducting the variable selection and
G/E model selection. That way, we can simulate how much additional
variance would be explained by the models defined as winners by the
RAMEN methodology in a scenario where the G and E associations with
DNAme are randomized. This distribution can be then used to filter out
winning models in the non-shuffled dataset that do not add more to the
explained variance of the basal model than what randomized data do.
For clarification, please note that in this vignette when we refer to SNP*E, we are referring to the interaction term that is present in the the interaction model (i.e. interaction variable in the GxE model).
Under the assumption that after adjusting for the concomitant variables all VMRs across the genome follow the same behavior regarding an increment of explained variance with randomized G and E data, we can pool the delta R squared values from all VMRs to create a null distribution taking advantage of the high number of VMRs in the dataset. This assumption decreases significantly the number of permutations required to create a null distribution and reduces the computational time. For further information please read the RAMEN paper (in preparation).
RAMEN::nullDistGE()
simulates the delta R squared
distribution under the null hypothesis of G and E having no association
with DNA methylation (DNAme) variability through a permutation analysis.
To do so, this function shuffles the G and E variables in the dataset,
which is followed by a the variable selection and modelling steps with
selectVariables() and lmGE().These steps are repeated several times as
indicated in the permutations parameter. In other words, by
using shuffled G and E data, we simulate the increase of R2 that would
be observed in random data using the RAMEN methodology.
# Compute the null distribution
null_dist = RAMEN::nullDistGE(
VMRs_df = VMRs_df,
genotype_matrix = RAMEN::test_genotype_matrix,
environmental_matrix = RAMEN::test_environmental_matrix,
summarized_methyl_VMR = summarized_methyl_VMR,
permutations = 5,
covariates = RAMEN::test_covariates,
seed = 1,
model_selection = "AIC"
)
#Take a look at the object
head(null_dist)
#> VMR_index tot_r_squared model_group R2_difference AIC_difference permutation
#> 1 1 0.2606017 E 0.2495286 157.7547 1
#> 2 2 0.2976822 E 0.2671059 157.5905 1
#> 3 3 0.6086939 G+E 0.5935297 143.5972 1
#> 4 4 0.3777878 E 0.2874781 151.5921 1
#> 5 6 0.6133220 G+E 0.5269145 137.8163 1
#> 6 7 0.3451632 E 0.2497069 155.1677 1
The output is a data frame where the most useful column for our purpose is R2_difference, which corresponds to the increase in R squared obtained by including the G/E variable(s) from the winning model (i.e., the R squared difference between the winning model and the model only with the concomitant variables specified in covariates; tot_r_squared - basal_rsquared in the lmGE output)
We recommend to use two different thresholds for the winning models depending of whether they are marginal (G or E) or joint models (G+E or GxE). The reason for this is that they have different R2_difference distributions. E and G models have a lower mean R2_difference because they have a single shuffled term in the model (SNP or E). In comparison, joint models have a higher mean R2_difference because they have two or three shuffled terms (SNP, E and SNP*E), which just by chance increases their probability of having a higher R2_difference.
# See the distribution of R2_difference across different winning models
null_dist %>%
ggplot2::ggplot(aes(x = R2_difference)) +
ggplot2::geom_histogram() +
ggplot2::facet_grid("model_group") +
ggplot2::theme_classic()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We suggest using the 95th percentile of those distributions as a threshold to remove bad performing winning models found in our observed data.
# Get a cutoff of the 95th percentile of the null distribution for single and joint models
cutoff_single = quantile(null_dist %>%
filter(model_group %in% c("G","E")) %>%
pull(R2_difference),
0.95)
cutoff_joint = quantile(null_dist %>%
filter(model_group %in% c("G+E","GxE")) %>%
pull(R2_difference),
0.95)
#Filter out bad performing winning models.
filtered_res = lmge_res %>%
dplyr::mutate(r2_difference_basal = tot_r_squared - basal_rsquared,
pass_cutoff_threshold = case_when(model_group %in% c("G", "E") ~ r2_difference_basal > cutoff_single,
model_group %in% c("G+E", "GxE") ~ r2_difference_basal > cutoff_joint)) %>%
dplyr::filter(pass_cutoff_threshold) %>% #Filter based on the cutoff threshold
dplyr::select(-pass_cutoff_threshold) #Drop temporary column
#Check the final results
dplyr::glimpse(filtered_res)
#> Rows: 7
#> Columns: 14
#> $ VMR_index <chr> "5", "6", "28", "39", "96", "98", "106"
#> $ model_group <chr> "G+E", "GxE", "E", "GxE", "G", "G", "GxE"
#> $ variables <list> <"21:15671097:T:A", "E43">, <"21:15671097:T:A", "E…
#> $ tot_r_squared <dbl> 0.7680461, 0.7769276, 0.5555035, 0.7097936, 0.6431…
#> $ g_r_squared <dbl> 0.39818402, 0.38220942, NA, 0.09101486, 0.6398122…
#> $ e_r_squared <dbl> 0.2769884, 0.1650993, 0.5244324, 0.2406247, NA, NA…
#> $ gxe_r_squared <dbl> NA, 0.1432113, NA, 0.3522698, NA, NA, 0.1515714
#> $ AIC <dbl> 129.9252, 123.3999, 145.1528, 147.3162, 140.8549, …
#> $ second_winner <chr> "GxE", "G+E", NA, "G+E", "G+E", NA, "G+E"
#> $ delta_aic <dbl> 0.07612103, 3.90964213, NA, 3.27939276, 0.34990027…
#> $ delta_r_squared <dbl> -0.01440811, 0.04856945, NA, 0.05583976, -0.019097…
#> $ basal_AIC <dbl> 166.8374, 159.6965, 166.5303, 177.6443, 169.6680, …
#> $ basal_rsquared <dbl> 0.092873702, 0.086407501, 0.031071178, 0.025884145…
#> $ r2_difference_basal <dbl> 0.6751724, 0.6905201, 0.5244324, 0.6839094, 0.6398…
We can see that the final data set in this example dropped almost all of the VMRs we had. This is something expected since we are working with toy data coming from random sampling.
We recommend the users of the package to include the number of VMRs
where we could not find a conclusive best model in the final results
report (either because no variables were selected with
RAMEN::selectVariables()
or because they did not pass the
R2_difference threshold obtained with
RAMEN::nullDistGE()
).
# Include the VMRs with no winning model in the final results object
final_res = VMRs_df %>%
dplyr::left_join(filtered_res,
by = "VMR_index")
# Plot final results
final_res %>%
dplyr::group_by(model_group) %>%
dplyr::summarise(count = n()) %>%
ggplot2::ggplot(aes(x = model_group, y = count)) +
ggplot2::geom_col() +
ggplot2::xlab("Best explanatory model") +
ggplot2::ylab("VMRs") +
ggplot2::theme_classic()
So, we can see that for this toy example, we got the following results:
- VMRs better explained by a G model: 2
- VMRs better explained by a E model: 1
- VMRs better explained by a G+E model: 1
- VMRs better explained by a GxE model: 3
- VMRs with no conclusive explanatory model: 124
Author’s note about model interpretation
For model simplicity, each winning model have a single E and/or G variable (and its interaction term when applicable). That means that in a scenario where a given VMR is under the influence of 2 or more Es or Gs, only the one that better explains the VMR’s DNAme alone will be selected. In other words, if a VMR in reality is influenced by e.g. folate intake and smoking, and we have information about both environmental exposures, the best model (E) will have only folate intake (in case that is the variable that better explains DNAme variability in that region alone). So, interpreting this as the VMR not being potentially under the influence of smoking might not be correct. We recommend the user to check the total R2 of the winning model to explore the remaining variance that is not explained by the winning model.
We also stress that interpretation of individual variables should be done with caution and used only as exploration and research hypothesis generation. Please see Author’s note about variables interpretation) where we advice against over-interpretation of the selected variables; the same logic applies to the variables present in the winning models.
Variations to the standard workflow
Besides using RAMEN for completing the analysis mentioned above, the outputs of the package’s function can help users in other DNAme analyses, such as:
- Reduction of tests prior to an EWAS or differential methylation
analysis with
RAMEN::findVMRs()
(i.e., conducting the analysis on identified VMRs which 1) reduces redundant tests by grouping nearby correlated CpGs, and 2) avoids tests in non-variant regions). This can help to reduce the multiple hypothesis testing burden. - Summarize a DNAme region of interest with
RAMEN::summarizeVMRs()
- Easily conduct variable selection in high-dimensional data sets to
identify potentially relevant variables from one or two independent data
sets with
RAMEN::selectVariables()
. - Fit additive and interaction models given two sets of variables of
interest (not limited to G and E) and select the best explanatory model
for DNAme data with
RAMEN::selectVariables()
andRAMEN::lmGE()
(e.g. exploring the interaction between two environmental dimensions and their contribution to DNAme variability). - Quickly identify SNPs in cis of CpG probes with
RAMEN::findCisSNPs()
(e.g. for cis mQTL analyses) - Get the median correlation of probes in regions of interest (with
RAMEN::medCorVMR()
).
Frequently Asked Questions
How can I save the RAMEN data frames that have columns with lists as observations?
Saving the data frames produced by RAMEN might seem difficult because it has lists as observations in several columns, which is not supported by some read/write functions. We suggest two options:
- Use
fwrite()
andfread()
from the data.table package (recommended because of time and storage space).
# Example for saving and reading selected_variables object
data.table::fwrite(selected_variables, file = "path/selected_variables.csv")
# Read the csv file and make lists the elements in the required columns
selected_variables = fread("path/selected_variables.csv", data.table = FALSE) %>%
mutate(selected_genot = str_split(selected_genot, pattern = "\\|"), # fwrite saves lists as strings separated by |
selected_env =str_split(selected_env, pattern = "\\|"),
VMR_index = as.character(VMR_index))
- Save files as .rds
Session info
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.5 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] doRNG_1.8.6 rngtools_1.5.2 foreach_1.5.2 tidyr_1.3.1
#> [5] ggplot2_3.5.1 dplyr_1.1.4 RAMEN_1.0.0 knitr_1.48
#> [9] BiocStyle_2.32.1
#>
#> loaded via a namespace (and not attached):
#> [1] gtable_0.3.5 shape_1.4.6.1 xfun_0.48
#> [4] bslib_0.8.0 lattice_0.22-6 vctrs_0.6.5
#> [7] tools_4.4.1 generics_0.1.3 stats4_4.4.1
#> [10] parallel_4.4.1 tibble_3.2.1 fansi_1.0.6
#> [13] highr_0.11 pkgconfig_2.0.3 Matrix_1.7-0
#> [16] desc_1.4.3 S4Vectors_0.42.1 lifecycle_1.0.4
#> [19] GenomeInfoDbData_1.2.12 compiler_4.4.1 farver_2.1.2
#> [22] stringr_1.5.1 textshaping_0.4.0 munsell_0.5.1
#> [25] mitools_2.4 codetools_0.2-20 survey_4.4-2
#> [28] GenomeInfoDb_1.40.1 htmltools_0.5.8.1 sass_0.4.9
#> [31] yaml_2.3.10 glmnet_4.1-8 pillar_1.9.0
#> [34] pkgdown_2.1.1 jquerylib_0.1.4 MASS_7.3-60.2
#> [37] cachem_1.1.0 iterators_1.0.14 boot_1.3-30
#> [40] tidyselect_1.2.1 digest_0.6.37 stringi_1.8.4
#> [43] purrr_1.0.2 bookdown_0.40 labeling_0.4.3
#> [46] splines_4.4.1 fastmap_1.2.0 grid_4.4.1
#> [49] colorspace_2.1-1 cli_3.6.3 magrittr_2.0.3
#> [52] survival_3.6-4 utf8_1.2.4 corpcor_1.6.10
#> [55] withr_3.0.1 scales_1.3.0 UCSC.utils_1.0.0
#> [58] rmarkdown_2.28 XVector_0.44.0 httr_1.4.7
#> [61] ragg_1.3.3 evaluate_1.0.0 GenomicRanges_1.56.1
#> [64] IRanges_2.38.1 rlang_1.1.4 Rcpp_1.0.13
#> [67] DBI_1.2.3 glue_1.8.0 BiocManager_1.30.25
#> [70] BiocGenerics_0.50.0 jsonlite_1.8.9 R6_2.5.1
#> [73] relaimpo_2.2-7 systemfonts_1.1.0 fs_1.6.4
#> [76] zlibbioc_1.50.0