BGData: A Suite of Packages for Analysis of Big Genomic Data

BGData (Grueneberg & de los Campos, 2019) is an R package that provides scalable and efficient computational methods for analyzing large genomic datasets, e.g., running genome-wide association studies (GWAS) or computing genomic relationship matrices (G matrices). It also contains a container class called BGData that holds genotypes, sample information, and variant information.

Modern genomic datasets are big (large n), high-dimensional (large p), and multi-layered. The challenges that need to be addressed are memory requirements and computational demands. Our goal is to develop software that will enable researchers to carry out analyses with big genomic data within the R environment.

We have identified several approaches to tackle those challenges within R:

File-backed matrices: The data is stored on the hard drive and users can read in smaller chunks when they are needed.
Linked arrays: For very large datasets a single file-backed array may not be enough or convenient. A linked array is an array whose content is distributed over multiple file-backed nodes.
Multiple dispatch: Methods are presented to users so that they can treat these arrays pretty much as if they were RAM arrays.
Multi-level parallelism: Exploit multi-core and multi-node computing.
Inputs: Users can create these arrays from standard formats (e.g., PLINK .bed).
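The linked-array idea from the list above can be sketched in base R. This is a conceptual illustration only, not how LinkedMatrix is implemented: the nodes here are ordinary in-memory matrices rather than file-backed ones, and `locate_column` and `get_column` are hypothetical helpers invented for this example.

```r
# Conceptual sketch: a "linked" matrix as a list of column blocks (nodes).
# In a real setting each node would be file-backed; here they live in RAM.
set.seed(1)
nodes <- list(
  matrix(rbinom(5 * 3, 2, 0.3), nrow = 5),
  matrix(rbinom(5 * 4, 2, 0.3), nrow = 5),
  matrix(rbinom(5 * 2, 2, 0.3), nrow = 5)
)

# Hypothetical helper: map a global column index to (node, local column).
locate_column <- function(nodes, j) {
  ends <- cumsum(vapply(nodes, ncol, integer(1)))
  node <- which(j <= ends)[1]
  offset <- c(0L, ends)[node]
  c(node = node, local = j - offset)
}

# Hypothetical helper: fetch one global column without binding the nodes.
get_column <- function(nodes, j) {
  loc <- locate_column(nodes, j)
  nodes[[loc[["node"]]]][, loc[["local"]]]
}

# Binding everything is done here only to check the result;
# at scale this is exactly what a linked array avoids.
full <- do.call(cbind, nodes)
identical(get_column(nodes, 5), full[, 5])
```

The point of the design is that an index lookup touches a single node, so the full matrix never has to exist in one piece.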

The BGData package is an umbrella package that comprises several packages: BEDMatrix, LinkedMatrix, and symDMatrix.

Examples

Loading the package

Load the BGData package:

library(BGData)

Inspecting the example dataset

The inst/extdata folder contains example files that were generated from the 250k SNP and phenotype data in Atwell et al. (2010). Only the first 300 SNPs of chromosomes 1, 2, and 3 were included to keep the size of the example dataset small enough for CRAN. PLINK was used to convert the data to .bed and .raw files. FT10 has been chosen as a phenotype and is provided as an alternate phenotype file. The file is intentionally shuffled to demonstrate that the additional phenotypes are put in the same order as the rest of the phenotypes.

path <- system.file("extdata", package = "BGData")
list.files(path)
#>  [1] "chr1.bed"  "chr1.bim"  "chr1.fam"  "chr1.raw"  "chr2.bed"  "chr2.bim"
#>  [7] "chr2.fam"  "chr2.raw"  "chr3.bed"  "chr3.bim"  "chr3.fam"  "chr3.raw"
#> [13] "pheno.txt"

Loading example dataset

Loading individual PLINK .bed files

Load the .bed file for chromosome 1 (chr1.bed) using the BEDMatrix package:

chr1 <- BEDMatrix(paste0(path, "/chr1.bed"))
#> Extracting number of individuals and rownames from .fam file...
#> Extracting number of markers and colnames from .bim file...

BEDMatrix objects behave similarly to regular matrices:

dim(chr1)
#> [1] 199 300
rownames(chr1)[1:10]
#> [1] "5837_5837" "6008_6008" "6009_6009" "6016_6016" "6040_6040" "6042_6042"
#> [7] "6043_6043" "6046_6046" "6064_6064" "6074_6074"
colnames(chr1)[1:10]
#> [1] "snp1_T"  "snp2_G"  "snp3_A"  "snp4_T"  "snp5_G"  "snp6_T"  "snp7_C"
#> [8] "snp8_C"  "snp9_C"  "snp10_G"
chr1["6008_6008", "snp5_G"]
#> [1] 0

Linking multiple BEDMatrix objects together

Load the other two .bed files:

chr2 <- BEDMatrix(paste0(path, "/chr2.bed"))
#> Extracting number of individuals and rownames from .fam file...
#> Extracting number of markers and colnames from .bim file...
chr3 <- BEDMatrix(paste0(path, "/chr3.bed"))
#> Extracting number of individuals and rownames from .fam file...
#> Extracting number of markers and colnames from .bim file...

Combine the BEDMatrix objects by columns using the LinkedMatrix package to avoid the inconvenience of having three separate matrices:

wg <- ColumnLinkedMatrix(chr1, chr2, chr3)

Just like BEDMatrix objects, LinkedMatrix objects also behave similarly to regular matrices:

dim(wg)
#> [1] 199 900
rownames(wg)[1:10]
#> [1] "5837_5837" "6008_6008" "6009_6009" "6016_6016" "6040_6040" "6042_6042"
#> [7] "6043_6043" "6046_6046" "6064_6064" "6074_6074"
colnames(wg)[1:10]
#> [1] "snp1_T"  "snp2_G"  "snp3_A"  "snp4_T"  "snp5_G"  "snp6_T"  "snp7_C"
#> [8] "snp8_C"  "snp9_C"  "snp10_G"
wg["6008_6008", "snp5_G"]
#> [1] 0

Creating a BGData object

BGData objects can be created from individual BEDMatrix objects or from a collection of BEDMatrix objects combined into a LinkedMatrix object using the as.BGData() function. This will read the .fam and .bim files that come with the .bed files. The alternatePhenotypeFile parameter points to the file that contains the FT10 phenotype:

bg <- as.BGData(wg, alternatePhenotypeFile = paste0(path, "/pheno.txt"))
#> Extracting phenotypes from .fam file, assuming that the .fam file of the first BEDMatrix instance is representative of all the other nodes...
#> Extracting map from .bim files...
#> Merging alternate phenotype file...

The bg object will use the LinkedMatrix object as genotypes, the .fam file augmented by the FT10 phenotype as sample information, and the .bim file as variant information.

str(bg)
#> Formal class 'BGData' [package "BGData"] with 3 slots
#>   ..@ geno :Formal class 'ColumnLinkedMatrix' [package "LinkedMatrix"] with 1 slot
#>   .. .. ..@ .Data:List of 3
#>   .. .. .. ..$ :BEDMatrix: 199 x 300 [/home/agrueneberg/.pkgs/R/BGData/extdata/chr1.bed]
#>   .. .. .. ..$ :BEDMatrix: 199 x 300 [/home/agrueneberg/.pkgs/R/BGData/extdata/chr2.bed]
#>   .. .. .. ..$ :BEDMatrix: 199 x 300 [/home/agrueneberg/.pkgs/R/BGData/extdata/chr3.bed]
#>   ..@ pheno:'data.frame':       199 obs. of  7 variables:
#>   .. ..$ FID      : int [1:199] 5837 6008 6009 6016 6040 6042 6043 6046 6064 6074 ...
#>   .. ..$ IID      : int [1:199] 5837 6008 6009 6016 6040 6042 6043 6046 6064 6074 ...
#>   .. ..$ PAT      : int [1:199] 0 0 0 0 0 0 0 0 0 0 ...
#>   .. ..$ MAT      : int [1:199] 0 0 0 0 0 0 0 0 0 0 ...
#>   .. ..$ SEX      : int [1:199] 0 0 0 0 0 0 0 0 0 0 ...
#>   .. ..$ PHENOTYPE: int [1:199] -9 -9 -9 -9 -9 -9 -9 -9 -9 -9 ...
#>   .. ..$ FT10     : num [1:199] 57 60 98 75 71 56 90 93 96 91 ...
#>   ..@ map  :'data.frame':       900 obs. of  6 variables:
#>   .. ..$ chromosome        : int [1:900] 1 1 1 1 1 1 1 1 1 1 ...
#>   .. ..$ snp_id            : chr [1:900] "snp1" "snp2" "snp3" "snp4" ...
#>   .. ..$ genetic_distance  : int [1:900] 0 0 0 0 0 0 0 0 0 0 ...
#>   .. ..$ base_pair_position: int [1:900] 657 3102 4648 4880 5975 6063 6449 6514 6603 6768 ...
#>   .. ..$ allele_1          : chr [1:900] "T" "G" "A" "T" ...
#>   .. ..$ allele_2          : chr [1:900] "C" "A" "C" "C" ...

Saving a BGData object

A BGData object can be saved like any other R object using the save function:

save(bg,file="BGData.RData")

Loading a BGData object

The genotypes in a BGData object can be of various types, some of which need to be initialized in a particular way. The load.BGData function takes care of reloading a saved BGData object properly:

load.BGData("BGData.RData")
#> Loaded objects: bg

Summarizing data

Use chunkedApply to apply a function chunk by chunk, for example to count missing values per variant:

countNAs <- chunkedApply(X = geno(bg), MARGIN = 2, FUN = function(x) sum(is.na(x)))
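
To make the chunking idea concrete, here is a minimal base R sketch of column-wise processing in chunks. It assumes nothing about how chunkedApply is implemented: `chunked_col_apply` and its `chunk_size` argument are hypothetical names, and the genotypes are simulated.

```r
# Hypothetical helper illustrating chunk-wise column processing;
# it is not the implementation of chunkedApply.
chunked_col_apply <- function(X, FUN, chunk_size = 100) {
  starts <- seq(1, ncol(X), by = chunk_size)
  res <- lapply(starts, function(s) {
    cols <- s:min(s + chunk_size - 1, ncol(X))
    chunk <- X[, cols, drop = FALSE]  # only this chunk is read at a time
    apply(chunk, 2, FUN)
  })
  unlist(res)
}

# Simulated genotypes (0/1/2) with some missing values injected.
set.seed(2)
X <- matrix(rbinom(20 * 250, 2, 0.3), nrow = 20)
X[sample(length(X), 100)] <- NA

count_nas <- chunked_col_apply(X, function(x) sum(is.na(x)), chunk_size = 64)
head(count_nas)
```

With a file-backed matrix, the subsetting step is where data would be read from disk, so peak memory is bounded by the chunk size rather than by the number of variants.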

Use the summarize function to calculate minor allele frequencies and frequency of missing values:

summarize(geno(bg))
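
For orientation, the quantities named above can be computed by hand in base R. This sketch assumes genotypes coded as 0/1/2 copies of the counted allele and uses simulated data; the column names are illustrative and may not match what summarize returns.

```r
# Simulated 0/1/2 genotypes with a few missing values.
set.seed(3)
X <- matrix(rbinom(30 * 10, 2, 0.25), nrow = 30)
X[sample(length(X), 15)] <- NA

freq_na <- colMeans(is.na(X))                  # share of missing calls per variant
allele_freq <- colMeans(X, na.rm = TRUE) / 2   # frequency of the counted allele
maf <- pmin(allele_freq, 1 - allele_freq)      # minor allele frequency

stats <- data.frame(freq_na, allele_freq, maf)
head(stats)
```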

Running GWASes with different regression methods

A data structure for genomic data is useful when defining methods that act on both phenotype and genotype information. We have implemented a GWAS function that supports various regression methods. The formula takes phenotypes from the sample information of the BGData object and inserts one marker at a time.

gwas <- GWAS(formula = FT10 ~ 1, data = bg)
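
The "one marker at a time" idea can be illustrated with plain lm on simulated data. This is a sketch under stated assumptions, not the GWAS function's implementation: the data, the effect placed on marker 7, and the name `single_marker` are all made up for the example.

```r
# Simulated genotypes and a phenotype with an effect on marker 7.
set.seed(4)
n <- 100
p <- 20
X <- matrix(rbinom(n * p, 2, 0.3), nrow = n)
y <- 50 + 3 * X[, 7] + rnorm(n)

# Fit y ~ 1 + marker for one marker at a time and keep the marker's
# coefficient row (estimate, standard error, t value, p-value).
single_marker <- t(sapply(seq_len(p), function(j) {
  fit <- lm(y ~ X[, j])
  summary(fit)$coefficients[2, ]
}))

# Index of the marker with the smallest p-value.
which.min(single_marker[, "Pr(>|t|)"])
```

Each row of the result corresponds to one marker, which mirrors the usual shape of single-marker association output.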

Generating the G Matrix

G <- getG(geno(bg))
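
As background, one common definition of a genomic relationship matrix is the cross-product of centered and scaled genotypes divided by the number of markers, $G = ZZ^\top / p$. The base R sketch below follows that definition on simulated data without missing values. It is an assumption that this matches what getG computes by default; centering, scaling, and imputation choices may differ.

```r
# Simulated 0/1/2 genotypes: n individuals, p markers.
set.seed(5)
n <- 50
p <- 200
X <- matrix(rbinom(n * p, 2, 0.3), nrow = n)
X <- X[, apply(X, 2, var) > 0, drop = FALSE]  # drop monomorphic markers

Z <- scale(X, center = TRUE, scale = TRUE)  # center and scale each marker
G_sketch <- tcrossprod(Z) / ncol(Z)         # n x n relationship matrix

dim(G_sketch)
```

For large file-backed genotypes, the same cross-product can be accumulated over column chunks, since $ZZ^\top$ is a sum of per-chunk cross-products.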

Installation

Install the stable version from CRAN:

install.packages("BGData")

Alternatively, install the development version from GitHub:

# install.packages("remotes")
remotes::install_github("QuantGen/BGData")

Documentation

Further documentation can be found on RDocumentation.

Contributing

Issue Tracker: https://github.com/QuantGen/BGData/issues
Source Code: https://github.com/QuantGen/BGData
