Genomic Data Retrieval with R
ropensci/biomartr
This package was born out of my own frustration while trying to automate the genomic data retrieval process and to create computationally reproducible scripts for large-scale genomics studies. Since I couldn't find easy-to-use and fully reproducible software libraries, I sat down and tried to implement a framework that would enable anyone to automate and standardize the genomic data retrieval process. I hope that this package is useful to others as well and that it helps to promote reproducible research in genomics studies.
I happily welcome anyone who wishes to contribute to this project :) Just drop me an email.
Please find the detailed documentation here.
Please cite biomartr if it was helpful for your research. This will allow me to continue maintaining this project in the future.
Drost HG, Paszkowski J. Biomartr: genomic data retrieval with R. Bioinformatics (2017) 33(8): 1216-1217. doi:10.1093/bioinformatics/btw821.
The vastly growing number of sequenced genomes allows us to perform a new type of biological research. Using a comparative approach, these genomes provide us with new insights on how biological information is encoded on the molecular level and how this information changes over evolutionary time.
The first step, however, of any genome-based study is to retrieve genomes and their annotation from databases. To automate the retrieval process of this information on a meta-genomic scale, the biomartr package provides interface functions for genomic sequence retrieval and functional annotation retrieval. The major aim of biomartr is to facilitate computational reproducibility and large-scale handling of genomic data for (meta-)genomic analyses. In addition, biomartr aims to address the genome version crisis. With biomartr, users can now control and be informed about the genome versions they retrieve automatically. Many large-scale genomics studies lack this information, and thus reproducibility and data interpretation become nearly impossible when documentation of genome version information gets neglected.
In detail, biomartr automates genome, proteome, CDS, RNA, Repeats, GFF/GTF (annotation), genome assembly quality, and metagenome project data retrieval from the major biological databases such as
- NCBI RefSeq
- NCBI Genbank
- ENSEMBL
- ENSEMBLGENOMES (as of April 2019, ENSEMBL and ENSEMBLGENOMES were joined; see details here)
- UniProt

Furthermore, an interface to the Ensembl Biomart database allows users to retrieve functional annotation for genomic loci using a novel and organism centric search strategy. In addition, users can download entire databases such as

- NCBI RefSeq
- NCBI nr
- NCBI nt
- NCBI Genbank
- ENSEMBL

with only one command.
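As a rough illustration of such a one-command database download, the sketch below wraps `download.database.all()` from the function list of this package. The wrapper name `download_ncbi_database()` is hypothetical, and the `db` and `path` arguments are assumed to name the database and the local target folder.

```r
# Hypothetical wrapper (not part of biomartr): download every chunk of one
# NCBI database into a local folder.
# Assumption: download.database.all() takes the database name via `db`
# and the target folder via `path`.
download_ncbi_database <- function(db = "nr", path = db) {
  # validate the input before touching the network
  stopifnot(is.character(db), length(db) == 1, nzchar(db))
  if (!requireNamespace("biomartr", quietly = TRUE)) {
    stop("Please install the biomartr package first.")
  }
  biomartr::download.database.all(db = db, path = path)
}

# Usage (downloads a very large amount of data, so it is not run here):
# download_ncbi_database("nr")
```

Before starting a download of this size, it may be worth checking with `listNCBIDatabases()` which files belong to the database.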
The main difference between the biomaRt package and the biomartr package is that biomartr extends the functional annotation retrieval procedure of biomaRt and in addition provides useful retrieval functions for genomes, proteomes, coding sequences, GFF files, RNA sequences, Repeat Masker annotation files, and functions for the retrieval of entire databases such as NCBI nr.
Please consult the Tutorials section for more details.
In the context of functional annotation retrieval, the biomartr package allows users to screen available marts using only the scientific name of an organism of interest instead of first searching for marts and datasets which support a particular organism of interest (which is required when using the biomaRt package). Furthermore, biomartr allows you to search for particular topics when searching for attributes and filters. I am aware that the similar naming of the packages is unfortunate, but it arose for historical reasons (please find a detailed explanation here: https://github.com/ropensci/biomartr/blob/master/FAQs.md and here: #11).
I also dedicated an entire vignette to comparing the biomaRt and biomartr package functionality in the context of Functional Annotation (where their functionality overlaps, which comprises only about 20% of the overall functionality of the biomartr package).
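The organism centric search described above might look roughly like the following sketch. It uses `organismBM()`, `organismAttributes()`, and `organismFilters()` from the function list of this package; the wrapper name `find_annotation_options()` is hypothetical, and the `topic` argument is assumed to narrow the attribute and filter search to a keyword.

```r
# Hypothetical helper (not part of biomartr): collect the marts, datasets,
# attributes, and filters that BioMart offers for a single organism.
# Assumption: organismAttributes() and organismFilters() accept a `topic`
# argument that restricts the result to a keyword such as "id".
find_annotation_options <- function(organism, topic = NULL) {
  # validate the input before touching the network
  stopifnot(is.character(organism), length(organism) == 1, nzchar(organism))
  if (!requireNamespace("biomartr", quietly = TRUE)) {
    stop("Please install the biomartr package first.")
  }
  list(
    marts      = biomartr::organismBM(organism = organism),
    attributes = biomartr::organismAttributes(organism, topic = topic),
    filters    = biomartr::organismFilters(organism, topic = topic)
  )
}

# Usage (requires an internet connection, so it is not run here):
# opts <- find_annotation_options("Homo sapiens", topic = "id")
# head(opts$attributes)
```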
I truly value your opinion and improvement suggestions. Hence, I would be extremely grateful if you could take this 1 minute and 3 question survey (https://goo.gl/forms/Qaoxxjb1EnNSLpM02) so that I can learn how to improve biomartr in the best possible way. Many many thanks in advance.
The biomartr package relies on some Bioconductor tools and thus requires installation of the following packages:

    # Install core Bioconductor packages
    if (!requireNamespace("BiocManager"))
        install.packages("BiocManager")
    BiocManager::install()

    # Install package dependencies
    BiocManager::install("Biostrings")
    BiocManager::install("biomaRt")

Now users can install biomartr from CRAN:

    # install biomartr 1.0.11 from CRAN
    install.packages("biomartr", dependencies = TRUE)

    # install the developer version containing the newest features
    BiocManager::install("ropensci/biomartr")

With an activated Bioconda channel (see 2. Set up channels), install with:

    conda install r-biomartr

and update with:

    conda update r-biomartr

or use the docker container:

    docker pull quay.io/biocontainers/r-biomartr:<tag>

(check r-biomartr/tags for valid values for <tag>)
The automated retrieval of collections (= Genome, Proteome, CDS, RNA, GFF, Repeat Masker, AssemblyStats files) ensures that the genome file of an organism matches the CDS, proteome, RNA, GFF, etc. files and that all of them were generated using the same genome assembly version. One reason why genomics studies fail in computational and biological reproducibility is that it is not clear whether the CDS, proteome, RNA, GFF, etc. files used in a proposed analysis were generated using the same genome assembly file denoting the same genome assembly version. To avoid this seemingly trivial mistake, we encourage users to retrieve genome file collections using the biomartr function getCollection() and attach the corresponding output as Supplementary Data to the respective genomics study to ensure computational and biological reproducibility.

    # download collection for Saccharomyces cerevisiae
    biomartr::getCollection(db = "refseq", organism = "Saccharomyces cerevisiae")

Internally, the getCollection() function will now generate a folder named refseq/Collection/Saccharomyces_cerevisiae and will store all genome and annotation files for Saccharomyces cerevisiae in the same folder. In addition, the exact genome and annotation version will be logged in the doc folder.
Internally, a text file named doc_Saccharomyces_cerevisiae_db_refseq.txt is generated. The information stored in this log file is structured as follows:

    File Name: Saccharomyces_cerevisiae_assembly_stats_refseq.txt
    Organism Name: Saccharomyces_cerevisiae
    Database: NCBI refseq
    URL: ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_assembly_stats.txt
    Download_Date: Wed Jun 27 15:21:51 2018
    refseq_category: reference genome
    assembly_accession: GCF_000146045.2
    bioproject: PRJNA128
    biosample: NA
    taxid: 559292
    infraspecific_name: strain=S288C
    version_status: latest
    release_type: Major
    genome_rep: Full
    seq_rel_date: 2014-12-17
    submitter: Saccharomyces Genome Database

In an ideal world this reference file could then be included as supplementary information in any life science publication that relies on genomic information so that reproducibility of experiments and analyses becomes achievable.
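Since the log file is a plain list of `Key: value` lines, it can be read back into R with a few lines of base R, for example to copy the assembly accession into a methods section. The function `parse_doc_log()` below is a hypothetical helper and not part of biomartr; it is a minimal sketch that assumes every entry follows the format shown above.

```r
# Hypothetical helper (not part of biomartr): parse log lines of the form
# "Key: value" into a named list.
# Only the first ": " on each line is treated as the separator, so values
# that contain colons themselves (e.g. the download time) stay intact.
parse_doc_log <- function(lines) {
  lines <- lines[nzchar(trimws(lines))]          # drop empty lines
  sep   <- regexpr(": ", lines, fixed = TRUE)    # position of first ": "
  keep  <- sep > 0                               # ignore lines without a key
  keys   <- trimws(substr(lines[keep], 1, sep[keep] - 1))
  values <- trimws(substr(lines[keep], sep[keep] + 2, nchar(lines[keep])))
  stats::setNames(as.list(values), keys)
}

# A few lines taken from the example log shown above
log_lines <- c(
  "File Name: Saccharomyces_cerevisiae_assembly_stats_refseq.txt",
  "Organism Name: Saccharomyces_cerevisiae",
  "Download_Date: Wed Jun 27 15:21:51 2018",
  "assembly_accession: GCF_000146045.2"
)
log_info <- parse_doc_log(log_lines)
log_info[["assembly_accession"]]
```

For a log file on disk, the same helper could be applied to the output of `readLines()` on that file.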
Download all mammalian vertebrate genomes from NCBI RefSeq via:

    # download all mammalian vertebrate genomes
    meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "genome")

All genomes are stored in a folder named according to the kingdom, in this case vertebrate_mammalian. Alternatively, users can specify the out.folder argument to define a custom output folder path.
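According to the function overview of this package, `meta.retrieval()` can also be restricted to a taxonomic subgroup within a kingdom. The sketch below shows how that might look; the wrapper name `retrieve_group_genomes()` is hypothetical, and it assumes that `getGroups()` returns the group names available for a kingdom and that `meta.retrieval()` accepts them via a `group` argument.

```r
# Hypothetical wrapper (not part of biomartr): download all genomes of one
# taxonomic subgroup after checking that the subgroup is listed.
# Assumptions: getGroups(db, kingdom) returns a character vector of group
# names, and meta.retrieval() accepts one of them via `group`.
retrieve_group_genomes <- function(kingdom, group, db = "refseq") {
  # validate the input before touching the network
  stopifnot(is.character(kingdom), length(kingdom) == 1,
            is.character(group), length(group) == 1)
  if (!requireNamespace("biomartr", quietly = TRUE)) {
    stop("Please install the biomartr package first.")
  }
  available_groups <- biomartr::getGroups(db = db, kingdom = kingdom)
  if (!group %in% available_groups) {
    stop("Group '", group, "' is not listed for kingdom '", kingdom, "'.")
  }
  biomartr::meta.retrieval(kingdom = kingdom, group = group,
                           db = db, type = "genome")
}

# Usage (starts a large download, so it is not run here):
# retrieve_group_genomes("bacteria", "Gammaproteobacteria")
```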
Please find all FAQs here.
I would be very happy to learn more about potential improvements of the concepts and functions provided in this package.
Furthermore, in case you find some bugs or need additional (more flexible) functionality of parts of this package, please let me know:
https://github.com/HajkD/biomartr/issues
Getting Started with biomartr:
- NCBI Database Retrieval
- Genomic Sequence Retrieval
- Meta-Genome Retrieval
- Functional Annotation
- BioMart Examples
Users can also read the tutorials within Posit (formerly RStudio):

    # load the biomartr package
    library(biomartr)

    # look for all tutorials (vignettes) available in the biomartr package
    # this will open your web browser
    browseVignettes("biomartr")

The current status of the package as well as a detailed history of the functionality of each version of biomartr can be found in the NEWS section.
Some bug fixes or new functionality will not be available on CRAN yet, but in the developer version here on GitHub. To download and install the most recent version of biomartr run:

    # install the current version of biomartr on your system
    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
    BiocManager::install("ropensci/biomartr")

- meta.retrieval(): Perform Meta-Genome Retrieval from NCBI of species belonging to the same kingdom of life or to the same taxonomic subgroup
- meta.retrieval.all(): Perform Meta-Genome Retrieval from NCBI of the entire kingdom of life
- getMetaGenomes(): Retrieve metagenomes from NCBI Genbank
- getMetaGenomeAnnotations(): Retrieve annotation *.gff files for metagenomes from NCBI Genbank
- listMetaGenomes(): List available metagenomes on NCBI Genbank
- getMetaGenomeSummary(): Helper function to retrieve the assembly_summary.txt file from NCBI genbank metagenomes
- clean.retrieval(): Format meta.retrieval output

- listGenomes(): List all genomes available on NCBI and ENSEMBL servers
- listKingdoms(): List the number of available species per kingdom of life on NCBI and ENSEMBL servers
- listGroups(): List the number of available species per group on NCBI and ENSEMBL servers
- getKingdoms(): Retrieve available kingdoms of life
- getGroups(): Retrieve available groups for a kingdom of life
- is.genome.available(): Check genome availability on NCBI and ENSEMBL servers
- getCollection(): Retrieve a Collection: Genome, Proteome, CDS, RNA, GFF, Repeat Masker, AssemblyStats
- getGenome(): Download a specific genome stored on NCBI and ENSEMBL servers
- getGenomeSet(): Genome Retrieval of multiple species
- getProteome(): Download a specific proteome stored on NCBI and ENSEMBL servers
- getProteomeSet(): Proteome Retrieval of multiple species
- getCDS(): Download a specific CDS file (genome) stored on NCBI and ENSEMBL servers
- getCDSSet(): CDS Retrieval of multiple species
- getRNA(): Download a specific RNA file stored on NCBI and ENSEMBL servers
- getRNASet(): RNA Retrieval of multiple species
- getGFF(): Genome Annotation Retrieval from NCBI (*.gff) and ENSEMBL (*.gff3) servers
- getGTF(): Genome Annotation Retrieval (*.gtf) from ENSEMBL servers
- getRepeatMasker(): Repeat Masker TE Annotation Retrieval
- getAssemblyStats(): Genome Assembly Stats Retrieval from NCBI
- getKingdomAssemblySummary(): Helper function to retrieve the assembly_summary.txt files from NCBI for all kingdoms
- getMetaGenomeSummary(): Helper function to retrieve the assembly_summary.txt files from NCBI genbank metagenomes
- getSummaryFile(): Helper function to retrieve the assembly_summary.txt file from NCBI for a specific kingdom
- getENSEMBLInfo(): Retrieve ENSEMBL info file
- getGENOMEREPORT(): Retrieve GENOME_REPORTS file from NCBI

- read_genome(): Import genomes as Biostrings or data.table object
- read_proteome(): Import proteome as Biostrings or data.table object
- read_cds(): Import CDS as Biostrings or data.table object
- read_gff(): Import GFF file
- read_rna(): Import RNA file
- read_rm(): Import Repeat Masker output file
- read_assemblystats(): Import Genome Assembly Stats File
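The retrieval and import functions listed above are meant to be combined. The following sketch checks availability, downloads a genome, and imports it; the wrapper name `fetch_genome()` is hypothetical, and it assumes that `getGenome()` returns the path of the downloaded file, which is then passed to `read_genome()`.

```r
# Hypothetical wrapper (not part of biomartr): check availability, download
# a genome, and import it into R.
# Assumptions: is.genome.available() returns TRUE/FALSE and getGenome()
# returns the path of the downloaded file.
fetch_genome <- function(organism, db = "refseq",
                         path = file.path("_ncbi_downloads", "genomes")) {
  # validate the input before touching the network
  stopifnot(is.character(organism), length(organism) == 1, nzchar(organism))
  if (!requireNamespace("biomartr", quietly = TRUE)) {
    stop("Please install the biomartr package first.")
  }
  if (!isTRUE(biomartr::is.genome.available(db = db, organism = organism))) {
    stop("No genome found for '", organism, "' in database '", db, "'.")
  }
  genome_file <- biomartr::getGenome(db = db, organism = organism, path = path)
  biomartr::read_genome(file = genome_file)
}

# Usage (requires an internet connection, so it is not run here):
# yeast_genome <- fetch_genome("Saccharomyces cerevisiae")
```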
- listNCBIDatabases(): Retrieve a List of Available NCBI Databases for Download
- download.database(): Download a NCBI database to your local hard drive
- download.database.all(): Download a complete NCBI Database such as e.g. NCBI nr to your local hard drive

- biomart(): Main function to query the BioMart database
- getMarts(): Retrieve All Available BioMart Databases
- getDatasets(): Retrieve All Available Datasets for a BioMart Database
- getAttributes(): Retrieve All Available Attributes for a Specific Dataset
- getFilters(): Retrieve All Available Filters for a Specific Dataset
- organismBM(): Function for organism specific retrieval of available BioMart marts and datasets
- organismAttributes(): Function for organism specific retrieval of available BioMart attributes
- organismFilters(): Function for organism specific retrieval of available BioMart filters
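Once a mart, dataset, attribute set, and filter have been chosen with the functions above, a query with `biomart()` might look like the sketch below. The wrapper name `lookup_peptide_ids()` is hypothetical, and the mart, dataset, attribute, and filter names are assumptions about the current Ensembl BioMart for human that should be verified with `getMarts()`, `getDatasets()`, `getAttributes()`, and `getFilters()`.

```r
# Hypothetical wrapper (not part of biomartr): map human gene symbols to
# Ensembl gene and peptide IDs through biomart().
# Assumptions: the mart "ENSEMBL_MART_ENSEMBL", the dataset
# "hsapiens_gene_ensembl", and the attribute/filter names used below are
# available on the BioMart server that is being queried.
lookup_peptide_ids <- function(gene_symbols) {
  # validate the input before touching the network
  stopifnot(is.character(gene_symbols), length(gene_symbols) > 0)
  if (!requireNamespace("biomartr", quietly = TRUE)) {
    stop("Please install the biomartr package first.")
  }
  biomartr::biomart(
    genes      = gene_symbols,
    mart       = "ENSEMBL_MART_ENSEMBL",
    dataset    = "hsapiens_gene_ensembl",
    attributes = c("ensembl_gene_id", "ensembl_peptide_id"),
    filters    = "hgnc_symbol"
  )
}

# Usage (requires an internet connection, so it is not run here):
# lookup_peptide_ids(c("TP53", "BRCA1"))
```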
- getGO(): Function to retrieve GO terms for a given set of genes

    # On Windows, this won't work - see ?build_github_devtools
    devtools::install_github("HajkD/biomartr", build_vignettes = TRUE, dependencies = TRUE)

    # When working with Windows, first you need to install Rtools, the
    # Windows build toolchain provided on CRAN (it is a separate installer,
    # not an R package, so install.packages() will not install it).
    # Afterwards you can install devtools -> install.packages("devtools")
    # and then you can run:
    devtools::install_github("HajkD/biomartr", build_vignettes = TRUE, dependencies = TRUE)

    # and then call it from the library
    library("biomartr", lib.loc = "C:/Program Files/R/R-3.1.1/library")

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.