This function simulates the delta R squared distribution under the null hypothesis of G and E having no association with DNA methylation (DNAme) variability through a permutation analysis. To do so, it shuffles the G and E variables in the dataset and then runs the variable selection and modelling steps with selectVariables() and lmGE(). These steps are repeated as many times as indicated in the permutations parameter. By using shuffled G and E data, we simulate the increase in R2 that would be observed in random data using the RAMEN methodology.

Usage

nullDistGE(
  VMRs_df,
  genotype_matrix,
  environmental_matrix,
  summarized_methyl_VMR,
  permutations = 10,
  covariates = NULL,
  seed = NULL,
  model_selection = "AIC"
)

Arguments

VMRs_df

A data frame converted from a GRanges object. Recommended to use the output of RAMEN::findCisSNPs(). Must have one VMR per row, and contain the following columns: "VMR_index" (a unique ID for each VMR in VMRs_df AS CHARACTERS) and "SNP" (a list column in which each element contains the names of the SNPs surrounding the corresponding VMR). The SNPs contained in the "SNP" column must be present in the object indicated in the genotype_matrix argument, and VMRs_df must contain all the VMRs present in summarized_methyl_VMR. VMRs with no surrounding SNPs must have an empty list in the SNP column (either list(NULL), list(NA), list("") or list(character(0))).
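As an illustration of the required layout, the following minimal sketch builds a toy VMRs_df by hand in base R. The SNP names and the three VMRs are made up, and in practice the output of RAMEN::findCisSNPs() is the recommended input.

```r
# Toy VMRs_df with the two required columns (all values are hypothetical).
VMRs_df <- data.frame(VMR_index = as.character(1:3), stringsAsFactors = FALSE)

# "SNP" is a list column: one element per VMR, holding the surrounding SNP names.
VMRs_df$SNP <- list(
  c("rs1", "rs2"),   # VMR 1 has two surrounding SNPs
  "rs3",             # VMR 2 has one
  character(0)       # VMR 3 has none, stored as an empty element
)

# Basic checks on the structure described above.
stopifnot(is.character(VMRs_df$VMR_index), is.list(VMRs_df$SNP))
n_snps <- lengths(VMRs_df$SNP)
```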

genotype_matrix

A matrix of number-encoded genotypes. Columns must correspond to samples, and rows to SNPs. We suggest using a gene-dosage model, which would encode the SNPs ordinally depending on the genotype allele charge, such as 2 (AA), 1 (AB) and 0 (BB). The column names must correspond with individual IDs.
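The gene-dosage encoding suggested above can be sketched in base R as follows. The genotype calls, SNP names and individual IDs are hypothetical, and the A/B labels simply follow the example in the argument description.

```r
# Hypothetical genotype calls: rows are SNPs, columns are individuals.
calls <- matrix(
  c("AA", "AB", "BB",
    "AB", "AB", "AA"),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("rs1", "rs2"), c("ind1", "ind2", "ind3"))
)

# Ordinal dosage encoding: 2 (AA), 1 (AB), 0 (BB).
dosage <- c(AA = 2, AB = 1, BB = 0)

# Look up each call by name and restore the SNP x individual shape.
genotype_matrix <- matrix(
  unname(dosage[as.vector(calls)]),
  nrow = nrow(calls),
  dimnames = dimnames(calls)
)
```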

environmental_matrix

A matrix of environmental variables. Only numeric values are supported. In case of factor variables, it is recommended to encode them as numbers or re-code them into dummy variables if there are more than two levels. Columns must correspond to environmental variables and rows to individuals. Row names must be the individual IDs.
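One possible way to re-code a factor with more than two levels into dummy variables is base R's model.matrix(). The sketch below uses made-up variables (smoking, pm25) and individual IDs; it is only one of several valid encodings.

```r
# Hypothetical environmental data with one three-level factor and one numeric variable.
env <- data.frame(
  smoking = factor(c("never", "former", "current", "never")),
  pm25 = c(8.1, 12.4, 10.0, 9.3),
  row.names = c("ind1", "ind2", "ind3", "ind4")
)

# model.matrix() expands the factor into 0/1 dummy columns, using the first
# level ("current") as reference; the intercept column is dropped.
environmental_matrix <- model.matrix(~ smoking + pm25, data = env)[, -1]
```

The result is a numeric matrix with individuals in rows and row names equal to the individual IDs, which is the layout this argument expects.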

summarized_methyl_VMR

A data frame containing each individual's VMR summarized region methylation. It is suggested to use the output of RAMEN::summarizeVMRs(). Rows must correspond to individuals, and columns to VMRs. The column names must correspond to the index of each VMR and must match the values in VMRs_df$VMR_index. The row names must correspond to the sample IDs and must match the IDs of the other matrices.
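Because the inputs use different orientations (samples in the columns of genotype_matrix, but in the rows of the other objects), a quick consistency check before running the function can be helpful. The following sketch uses toy objects with hypothetical IDs and is not part of the RAMEN package; whether the function also requires identical ordering is not stated here, so only ID membership is checked.

```r
ids <- c("ind1", "ind2", "ind3")   # hypothetical individual IDs
VMR_index <- c("1", "2")           # stands in for VMRs_df$VMR_index

genotype_matrix <- matrix(c(2, 1, 0, 1, 1, 2), nrow = 2, byrow = TRUE,
                          dimnames = list(c("rs1", "rs2"), ids))
environmental_matrix <- matrix(c(8.1, 12.4, 10.0), ncol = 1,
                               dimnames = list(ids, "pm25"))
summarized_methyl_VMR <- data.frame(`1` = c(0.2, 0.5, 0.4),
                                    `2` = c(0.7, 0.6, 0.9),
                                    row.names = ids, check.names = FALSE)

# Same individuals in every object?
ids_match <- setequal(colnames(genotype_matrix), rownames(environmental_matrix)) &&
  setequal(rownames(environmental_matrix), rownames(summarized_methyl_VMR))

# Every summarized VMR present in the VMR index?
vmrs_match <- all(colnames(summarized_methyl_VMR) %in% VMR_index)
```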

permutations

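Number of times the G and E variables are shuffled and the variable selection and modelling steps are repeated. Default is 10.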

covariates

A matrix containing the covariates (i.e., concomitant variables / variables that are not the ones you are interested in) that will be adjusted for in the final GxE models (e.g., cell type proportions, age, etc.). Each column should correspond to a covariate and each row to an individual. Row names must correspond to the individual IDs.

seed

An integer number that initializes a pseudo-random number generator. Random numbers in this function are created during the lambda cross validation and the LASSO stages. Setting a seed is highly encouraged for result reproducibility. Please note that setting a seed in this function modifies the seed globally.
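Since the seed is modified globally, callers who rely on their own random number stream may want to save and restore the generator state around the call. The sketch below shows one way to do this in base R; the inner set.seed(1) only stands in for a call such as nullDistGE(..., seed = 1), which is not executed here.

```r
set.seed(123)                                        # the caller's own stream
old_seed <- get(".Random.seed", envir = globalenv()) # snapshot of the global RNG state

# Stand-in for a function call that sets a seed globally and draws random numbers.
set.seed(1)
draws <- runif(3)

# Restore the caller's RNG state.
assign(".Random.seed", old_seed, envir = globalenv())
after_restore <- runif(1)

# Reference: the value the caller would have drawn without the interruption.
set.seed(123)
reference <- runif(1)
```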

model_selection

Which metric to use to select the best model for each VMR. Supported options are "AIC" or "BIC". More information about which one to use can be found in the Details section.

Value

A data frame with the following columns:

  • VMR_index: The unique ID of the VMR.

  • model_group: The group to which the winning model belongs (i.e., G, E, G+E or GxE)

  • tot_r_squared: R squared of the winning model

  • R2_difference: the increase in R squared obtained by including the G/E variable(s) from the winning model (i.e., the R squared difference between the winning model and the model only with the concomitant variables specified in covariates; tot_r_squared - basal_rsquared in the lmGE output)

  • AIC_difference/BIC_difference: the AIC/BIC difference between the winning model and the model only with the concomitant variables specified in covariates (i.e., AIC - basal_AIC or BIC - basal_BIC in the lmGE output)

Details

The core pipeline from the RAMEN package identifies the best explanatory model per VMR. However, despite these models being winners in comparison to models including any other G/E variable(s) in the dataset, some winning models might perform no better than what we would expect by chance. Therefore, the goal of this function is to create a distribution of the increase in R2 under the null hypothesis of G and E having no associations with DNAme. The null distribution is obtained by shuffling the G and E variables in a given dataset and then conducting the variable selection and G/E model selection. That way, we can simulate how much additional variance would be explained by the models defined as winners by the RAMEN methodology in a scenario where the G and E associations with DNAme are randomized. This distribution can then be used to filter out winning models in the non-shuffled dataset that do not add more to the explained variance of the basal model than randomized data do.
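The idea of a permutation-based delta R squared can be illustrated with a minimal base R sketch. This is a simplification under stated assumptions: it uses a single simulated E variable, one covariate and plain lm() fits, and it omits the variable selection and G/E model selection that selectVariables() and lmGE() perform. All object names and simulated values are hypothetical.

```r
set.seed(1)
n <- 50
covariate <- rnorm(n)                       # concomitant variable
env <- rnorm(n)                             # hypothetical E variable
methylation <- 0.5 * covariate + rnorm(n)   # simulated DNAme with no true E effect

# Increase in R squared when adding x to the basal (covariate-only) model.
delta_r2 <- function(y, x, z) {
  basal <- summary(lm(y ~ z))$r.squared
  full <- summary(lm(y ~ z + x))$r.squared
  full - basal
}

# Shuffle E to break any association with DNAme, then recompute delta R squared.
# Ten repetitions mirror the default permutations = 10.
null_delta <- replicate(10, delta_r2(methylation, sample(env), covariate))
```

Each element of null_delta is the gain in explained variance obtained from a randomized E variable, which is the quantity the null distribution is built from.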

Under the assumption that, after adjusting for the concomitant variables, all VMRs across the genome follow the same behavior regarding an increment of explained variance with randomized G and E data, we can pool the delta R squared values from all VMRs to create a null distribution, taking advantage of the high number of VMRs in the dataset. This assumption significantly decreases the number of permutations required to create a null distribution and reduces the computational time. For further information please read the RAMEN paper (in preparation).
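As a sketch of how the pooled null distribution might be used for filtering, the example below takes a cutoff from the pooled R2_difference values and keeps only the observed winning models that exceed it. The data are made up, and the 95th percentile is an assumption chosen for illustration, not a threshold prescribed by this documentation.

```r
# Hypothetical pooled null values (R2_difference across all VMRs and permutations).
null_dist <- data.frame(
  R2_difference = c(0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10)
)

# Hypothetical winning models from the non-shuffled dataset.
observed <- data.frame(
  VMR_index = c("1", "2", "3"),
  R2_difference = c(0.02, 0.12, 0.30),
  stringsAsFactors = FALSE
)

# Assumed cutoff: 95th percentile of the pooled null distribution.
threshold <- quantile(null_dist$R2_difference, probs = 0.95, names = FALSE)

# Keep winning models that explain more additional variance than the cutoff.
kept <- observed[observed$R2_difference > threshold, ]
```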