| Title: | Sparse Marginal Epistasis Test |
| Version: | 0.0.2 |
| URL: | https://github.com/lcrawlab/sme,https://lcrawlab.github.io/sme/ |
| BugReports: | https://github.com/lcrawlab/sme/issues |
| Description: | The Sparse Marginal Epistasis Test is a computationally efficient genetics method which detects statistical epistasis in complex traits; see Stamp et al. (2025, <doi:10.1101/2025.01.11.632557>) for details. |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.2 |
| LinkingTo: | BH, HighFive, Rcpp, RcppEigen, Rhdf5lib, testthat |
| Imports: | dplyr, genio, logging, mvMAPIT, Rcpp, tidyr |
| Suggests: | GenomicRanges, ggplot2, knitr, rmarkdown, testthat (≥3.0.0), xml2 |
| Config/testthat/edition: | 3 |
| SystemRequirements: | GNU make |
| VignetteBuilder: | knitr |
| Depends: | R (≥ 4.4.0) |
| LazyData: | true |
| biocViews: | GenomeWideAssociation, Epistasis, Genetics, SNP,LinearMixedModel |
| NeedsCompilation: | yes |
| Packaged: | 2025-08-22 12:25:10 UTC; jds |
| Author: | Julian Stamp |
| Maintainer: | Julian Stamp <julian.d.stamp@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2025-08-22 12:50:02 UTC |
Estimate Memory Requirements for SME Routine
Description
This function provides an approximate estimate of the memory requirements(in gigabytes) for running the Sparse Marginal Epistasis (SME) routinebased on input parameters such as the number of samples, SNPs, and other configurations.
Usage
approximate_memory_requirements( n_samples, n_snps, n_blocks, n_randvecs, chunksize)Arguments
n_samples | Integer. The number of samples in the dataset. |
n_snps | Integer. The total number of SNPs in the dataset. |
n_blocks | Integer. The number of genotype blocks used to partition SNPs.Affects the size of encoded genotype segments. |
n_randvecs | Integer. The number of random vectors used for stochastictrace estimation. Affects memory for operations involving random vectors. |
chunksize | Integer. The number of focal SNPs processed per chunk. |
Details
The function calculates memory usage by summing the contributions fromvarious components used in the SME routine, including:
Variance component estimates (
vc_estimates)Phenotype-related matrices
Random vector-based computations
Genotype objects and block statistics
Gene-by-gene interaction masks
The estimated memory requirement is derived from the data dimensionsand operational needs, and it provides a guideline for configuring resourcesfor the analysis.
Value
Numeric. The approximate memory requirement (in gigabytes) for theSME routine.
Examples
n_samples <- 1e5n_snps <- 1e6n_blocks <- 100n_randvecs <- 100chunksize <- 10approximate_memory_requirements(n_samples, n_snps, n_blocks, n_randvecs, chunksize)Create an HDF5 File
Description
This function creates a new, empty HDF5 file at the specified location.
Usage
create_hdf5_file(hdf5_file)Arguments
hdf5_file | A character string specifying the path and name of the HDF5file to be created. |
Value
No return value; the function creates the HDF5 file at the specifiedlocation.
Examples
# Create an empty HDF5 filehdf5_file <- tempfile()create_hdf5_file(hdf5_file)Simulated Dataset for Genome-Wide Interaction Analysis
Description
getting_started is a simulated dataset created to demonstrate the use ofthesme() function for genome-wide interaction analyses. It containsresults from a simulated analysis involving additive genetic effects andgene-by-gene (GxG) interactions.
Usage
data("getting_started")Format
A list with results fromsme(), including the following components:
summaryA data frame summarizing the analysis results, includingp-values for SNP associations (
p).pveA data frame containing the per SNP variance componentestimates normalized to phenotypic variance explained (PVE).
vcA data frame containing the per SNP variance componentestimates.
gxg_snpsA vector containing the indices of the SNPs assigned tohave epistatic interactions in the trait simulations.
Details
The dataset was generated as follows:
Genotype Simulation:Genotype data for 5000 individuals and 6,000 SNPs was simulated withsynthetic allele counts.
Phenotype Simulation:Phenotypic values were simulated with an additive heritability of 0.3 and aGxG interaction heritability of 0.25. A set of 100 SNPs were selected foradditive effects, and two groups of 5 SNPs each were used for GxGinteractions.
PLINK-Compatible Files:The simulated data was saved in PLINK-compatible
.bed,.fam,and.bimfiles.Interaction Analysis:The
sme()function was used to perform genome-wide interaction analyseson a subset of SNP indices, including the GxG SNP groups and 100 additionaladditive SNPs. Memory-efficient computation parameters(e.g.,chun_ksize,n_randvecs, andn_blocks) were applied.
Key Parameters
Additive Heritability: 0.3
GxG Heritability: 0.25
Number of Samples: 5000
Number of SNPs: 6,000
Selected Additive SNPs: 100
Selected GxG SNP Groups:
Group 1: 5 SNPs
Group 2: 5 SNPs
Source
data-raw/getting_started.R
See Also
Examples
data("getting_started")head(getting_started$summary)Read Dataset from an HDF5 File
Description
This function reads a dataset from an existing HDF5 file.
Usage
read_hdf5_dataset(file_name, dataset_name)Arguments
file_name | A character string specifying the path to the HDF5 file. |
dataset_name | A character string specifying the name of the datasetwithin the HDF5 file to read. |
Value
The content of the dataset from the HDF5 file, typically in the formof an R object.
Examples
data_to_write <- 1:10# Create an empty HDF5 filehdf5_file <- tempfile()create_hdf5_file(hdf5_file)# Write new data to a dataset in the HDF5 filewrite_hdf5_dataset(hdf5_file, "group/dataset", data_to_write)# Read a dataset from an HDF5 filehdf5_data <- read_hdf5_dataset(hdf5_file, "group/dataset")print(hdf5_data)Simulate Quantitative Traits from PLINK Genotypes
Description
This function simulates a quantitative trait based on additive and epistaticgenetic effects using genotype data from a PLINK dataset. The simulated traitis saved to a specified output file in a phenotype format compatible withPLINK.
Usage
simulate_traits( plink_file, output_file, additive_heritability, gxg_heritability, additive_indices, gxg_indices_1, gxg_indices_2, log_level = "WARNING")Arguments
plink_file | Character. Path to the PLINK dataset (without fileextension). The function will append |
output_file | Character. Path to the output file where the simulatedtrait will be saved. |
additive_heritability | Numeric. A value between 0 and 1 specifying theproportion of trait variance due to additive genetic effects. |
gxg_heritability | Numeric. A value between 0 and 1 specifying theproportion of trait variance due to gene-by-gene (epistatic) interactions.The sum of |
additive_indices | Integer vector. Indices of SNPs contributing toadditive genetic effects. |
gxg_indices_1 | Integer vector. Indices of SNPs in the first group forepistatic interactions. |
gxg_indices_2 | Integer vector. Indices of SNPs in the second group forepistatic interactions. |
log_level | Character. Logging level for messages(e.g., "DEBUG", "INFO", "WARNING"). Default is "WARNING". |
Details
The function uses the following components to simulate the trait:
Additive genetic effects: Determined by
additive_indicesand thespecifiedadditive_heritability.Epistatic interactions: Simulated using pairs of SNPs from
gxg_indices_1andgxg_indices_2, contributing to thegxg_heritability.Environmental effects: Any remaining variance not explained by geneticeffects is assigned to random environmental noise.
The output file is in PLINK-compatible phenotype format with three columns:Family ID (FID), Individual ID (IID), and the simulated trait (TRAIT).
Value
None. The simulated trait is written to the specifiedoutput_file.
Examples
plink_file <- gsub("\\.bed", "", system.file("testdata", "test.bed", package = "smer"))out_file <- tempfile()additive_heritability <- 0.3gxg_heritability <- 0.1additive_snps <- sort(sample(1:100, 50, replace = FALSE))gxg_group_1 <- sort(sample(additive_snps, 10, replace = FALSE))gxg_group_2 <- sort(sample(setdiff(additive_snps, gxg_group_1), 10, replace = FALSE))n_samples <- 200simulate_traits( plink_file, out_file, additive_heritability, gxg_heritability, additive_snps, gxg_group_1, gxg_group_2)from_file <- read.table(out_file, header = TRUE)head(from_file)Sparse Marginal Epistasis Test (SME)
Description
SME fits a linear mixed model in order to test for marginal epistasis. Itconcentrates the scans for epistasis to regions of the genome that have knownfunctional enrichment for a trait of interest.
Usage
sme( plink_file, pheno_file, mask_file = NULL, gxg_indices = NULL, chunk_size = NULL, n_randvecs = 10, n_blocks = 100, n_threads = 1, gxg_h5_group = "gxg", ld_h5_group = "ld", rand_seed = -1, log_level = "WARNING")Arguments
plink_file | Character. File path to the PLINK dataset(without *.bed extension).The function will append |
pheno_file | Character. File path to a phenotype file in PLINK format.The file should contain exactly one phenotype column. |
mask_file | Character or NULL. File path to an HDF5 file specifyingper-SNP masks for gene-by-gene interaction tests. This file informs whichSNPs are tested for marginal epistasis. Defaults to |
gxg_indices | Integer vector or NULL. List of indices corresponding toSNPs to test for marginal epistasis.If |
chunk_size | Integer or NULL. Number of SNPs processed per chunk.This influences memoryusage and can be left |
n_randvecs | Integer. Number of random vectors used for stochastic traceestimation.Higher values yield more accurate estimates but increase computationalcost. Default is 10. |
n_blocks | Integer. Number of blocks into which SNPs are divided forprocessing.This parameter affects memory requirements. Default is 100. |
n_threads | Integer. Number of threads for OpenMP parallel processing.Default is 1. |
gxg_h5_group | Character. Name of the HDF5 group within the mask filecontaining gene-by-geneinteraction masks. SNPs in this group will be included in the gene-by-geneinteractions. Defaults to "gxg". |
ld_h5_group | Character. Name of the HDF5 group within the mask filecontaining linkage disequilibriummasks. SNPs in this group are excluded from analysis. Defaults to "ld". |
rand_seed | Integer. Seed for random vector generation. If |
log_level | Character. Logging level for messages. Must be in uppercase(e.g., "DEBUG", "INFO", "WARNING", "ERROR"). Default is "WARNING". |
Details
This function integrates PLINK-formatted genotype and phenotype data toperform marginal epistasis tests on a set of SNPs. Using stochastic traceestimation, the method computes variance components for gene-by-geneinteraction and genetic relatedness using the MQS estimator. The process isparallelized using OpenMP whenn_threads > 1.
The memory requirements and computation time scaling can be optimized throughthe parameterschunk_size,n_randvecs, andn_blocks.
Mask Format Requirements
The mask file format is an HDF5 file used for storing index data forthe masking process. This format supports data retrieval by index.Below are the required groups and datasets within the HDF5 file:
The required group names can be configured as input parameters.The defaults are described below.
Groups:
ld: Stores SNPs in LD with the focal SNP. These SNPs will beexcluded.gxg: Stores indices of SNPs that the marginal epistasis test is conditioned on. These SNPs will beincluded.
Datasets:
ld/<j>: For each focal SNP<j>, this dataset contains indices of SNPsin the same LD block as that SNP. These SNPs will beexcluded from the gene-by-gene interaction covariance matrix.gxg/<j>: For each focal SNP<j>, this dataset contains indices of SNPs toinclude in thethe gene-by-gene interaction covariance matrix for focal SNP<j>.
Important: All indices in the mask file data arezero-based, matching the zero-based indices of the PLINK.bim file.
Value
A list containing:
summary: A tibble summarizing results for each tested SNP, including:id: Variant ID.index: Index of the SNP in the dataset.chromosome: Chromosome number.position: Genomic position of the SNP.p: P value for the gene-by-gene interaction test.pve: Proportion of variance explained (PVE) by gene-by-gene interactions.vc: Variance component estimate.se: Standard error of the variance component.
pve: A long-format tibble of PVE for all variance components.vc_estimate: A long-format tibble of variance component estimates.vc_se: A long-format tibble of standard errors for variance components.average_duration: Average computation time per SNP.
References
Stamp, J., Pattillo Smith, S., Weinreich, D., & Crawford, L. (2025).Sparse modeling of interactions enables fast detection of genome-wideepistasis in biobank-scale studies. bioRxiv, 2025.01.11.632557.
Stamp, J., DenAdel, A., Weinreich, D., & Crawford, L. (2023).Leveraging the genetic correlation between traits improves the detection ofepistasis in genome-wide association studies.G3: Genes, Genomes, Genetics, 13(8), jkad118.
Crawford, L., Zeng, P., Mukherjee, S., & Zhou, X. (2017).Detecting epistasis with the marginal epistasis test in genetic mappingstudies of quantitative traits. PLoS genetics, 13(7), e1006869.
Examples
plink_file <- gsub("\\.bed", "", system.file("testdata", "test.bed", package="smer"))pheno_file <- system.file("testdata", "test_h2_0.5.pheno", package="smer")mask_file <- ""# Parameter inputschunk_size <- 10n_randvecs <- 10n_blocks <- 10n_threads <- 1# 1-based Indices of SNPs to be analyzedn_snps <- 100snp_indices <- 1:n_snpssme_result <- sme( plink_file, pheno_file, mask_file, snp_indices, chunk_size, n_randvecs, n_blocks, n_threads)head(sme_result$summary)Write Data to an HDF5 Dataset
Description
This function writes new data to an existing HDF5 file. If the datasetalready exists, it will be replaced with the new data.
Usage
write_hdf5_dataset(file_name, dataset_name, new_data)Arguments
file_name | A character string specifying the path to the HDF5 file. |
dataset_name | A character string specifying the name of the datasetto be written in the HDF5 file. |
new_data | The new data to write to the dataset. The data must becompatible with the dataset's structure. |
Value
No return value; the function modifies the specified dataset in theHDF5 file.
Examples
data_to_write <- 1:10# Create an empty HDF5 filehdf5_file <- tempfile()create_hdf5_file(hdf5_file)# Write new data to a dataset in the HDF5 filewrite_hdf5_dataset(hdf5_file, "group/dataset", data_to_write)