Ensembl VEP can use a variety of annotation sources to retrieve the transcript models used to predict consequence types.
Data from VCF, BED and bigWig files can also be incorporated by Ensembl VEP's Custom annotation feature.
Using a cache (--cache) is the fastest and most efficient way to use Ensembl VEP, as in most cases only a single initial network connection is made and most data is read from local disk. Use offline mode to eliminate all network connections for speed and/or privacy.
We strongly recommend that you download/use the cache version which corresponds to your Ensembl VEP installation,
i.e. cache version 114 should be used with the Ensembl VEP tool version 114.
This is mainly because the cache (both data content and structure) is regenerated for every Ensembl release to reflect that release's data and API updates. The cache data format may therefore differ between versions and be incompatible with a newer version of the Ensembl VEP tool.
Cache files are created for every species for each Ensembl release. They can be automatically downloaded and configured using INSTALL.pl.
If interested in RefSeq transcripts you may download an alternate cache file (e.g. homo_sapiens_refseq), or a merged file of RefSeq and Ensembl transcripts (e.g. homo_sapiens_merged); remember to specify --refseq or --merged when running Ensembl VEP to use the relevant cache. See documentation for full details.
It is also simple to download and set up caches without using the installer. By default, Ensembl VEP searches for caches in $HOME/.vep; to use a different directory when running Ensembl VEP, use --dir_cache.
Indexed cache (https://ftp.ensembl.org/pub/release-114/variation/indexed_vep_cache/)
Essential for human and other species with large sets of variant data - requires Bio::DB::HTS (set up by INSTALL.pl) or tabix, e.g.:
cd $HOME/.vep
curl -O https://ftp.ensembl.org/pub/release-114/variation/indexed_vep_cache/homo_sapiens_vep_114_GRCh38.tar.gz
tar xzf homo_sapiens_vep_114_GRCh38.tar.gz
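As a quick sanity check after unpacking, the sketch below builds the directory where a cache for a given species, release and assembly would be expected, and reports whether it exists. It assumes the `<dir>/<species>/<version>_<assembly>` layout that appears in the synonyms path `~/.vep/homo_sapiens/114_GRCh38/` elsewhere in this document. The function names `vep_cache_path` and `vep_cache_present` are hypothetical helpers, not part of Ensembl VEP.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Ensembl VEP).
# Assumed layout: <cache_dir>/<species>/<version>_<assembly>

# Print the directory where the cache is expected to live.
vep_cache_path() {
    species=$1
    version=$2
    assembly=$3
    cache_dir=${4:-$HOME/.vep}   # default search location used by Ensembl VEP
    printf '%s/%s/%s_%s\n' "$cache_dir" "$species" "$version" "$assembly"
}

# Succeed (exit status 0) only if that directory exists.
vep_cache_present() {
    [ -d "$(vep_cache_path "$@")" ]
}

# Example: look for the human GRCh38 cache for release 114.
if vep_cache_present homo_sapiens 114 GRCh38; then
    echo "cache found: $(vep_cache_path homo_sapiens 114 GRCh38)"
else
    echo "cache missing: $(vep_cache_path homo_sapiens 114 GRCh38)"
fi
```

Passing a fourth argument checks a non-default location, mirroring what --dir_cache does when running Ensembl VEP.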
FTP directories with indexed cache data:
Ensembl: | Vertebrates |
---|---|
Ensembl Genomes: | Bacteria|Fungi|Metazoa|Plants|Protists |
NB: When using Ensembl Genomes caches, you should use the --cache_version option to specify the relevant Ensembl Genomes version number, as these differ from the concurrent Ensembl VEP version numbers.
Ensembl VEP caches are also available for Human Pangenome Reference Consortium (HPRC) data at the Ensembl HPRC data page. Click here for more information on how to annotate variants on HPRC assemblies.
The data content of Ensembl VEP caches varies by species. This table shows the contents of the default human cache files in release 114.
Source | Version (GRCh38) | Version (GRCh37) |
---|---|---|
Ensembl database version | 114 | 114 |
Genome assembly | GRCh38.p14 | GRCh37.p13 |
MANE Version | v1.4 | n/a |
GENCODE | 48 | 19 |
RefSeq | GCF_000001405.40-RS_2023_10 (GCF_000001405.40_GRCh38.p14_genomic.gff) | 105.20220307 (GCF_000001405.25_GRCh37.p13_genomic.gff) |
Regulatory build | 1.0 | 1.0 |
PolyPhen-2 | 2.2.3 | 2.2.2 |
SIFT | 6.2.1 | 5.2.2 |
dbSNP | 156 | 156 |
COSMIC | 100 | 98 |
HGMD-PUBLIC | 2020.4 | 2020.4 |
ClinVar | 2024-09 | 2023-06 |
1000 Genomes | Phase 3 (remapped) | Phase 3 |
gnomAD exomes | v4.1 | v4.1 |
gnomAD genomes | v4.1 | v4.1 |
The cache stores the following information:
The cache does not store any information pertaining to, and therefore cannot be used for, the following:
Enabling one of these options with --cache will cause Ensembl VEP to warn you in its status output with something like the following:
2011-06-16 16:24:51 - INFO: Database will be accessed when using --hgvs
Here, existing variants refers to those variants that have been loaded into the Ensembl variation database from accessioning resources. For example, for human, you can see the sources of data in the above table. We load variants from accessioning resources such as dbSNP, COSMIC, and HGMD-PUBLIC.
Note that gnomAD is not a variant accessioning body. This means that any gnomAD variants that have not been accessioned will not be available in the cache. For example, gnomAD v4.1 was released in April 2024, but its variants will not be available in the cache until they have been submitted to dbSNP for accessioning and made available in a dbSNP release. If you run the variant 5-32100960-ATAAG-A using the 113 cache, you will not get any frequency information because the variant had not been accessioned at the time of the Ensembl 113 release:
./vep --id "5 32100960 . ATAAG A" --af_gnomadg --af_gnomade --check_existing --cache --cache_version 113 --fasta genome.fa.gz
#Uploaded_variation	Location	Allele	Gene	Feature	Feature_type	Consequence	cDNA_position	CDS_position	Protein_position	Amino_acids	Codons	Existing_variation	Extra
5_32100961_TAAG/-	5:32100961-32100964	-	ENSG00000133401	ENST00000397559	Transcript	splice_donor_variant,non_coding_transcript_exon_variant	95-?	-	-	-	-	-	IMPACT=HIGH;STRAND=1
Alternative: in such cases you can supply a gnomAD VCF file with the --custom option.
When using the public database servers, Ensembl VEP requests transcript and variation data that overlap the loci in your input file. As such, these coordinates are transmitted over the network to a public server, which may not be appropriate for the analysis of sensitive or private data.
Only the coordinates are transmitted to the server; no other information is sent.
To run in offline mode, which makes no network connections, use the flag --offline.
The limitations described above apply absolutely when using offline mode. For example, if you specify --offline and --format id, Ensembl VEP will report an error and refuse to run:
ERROR: Cannot use ID format in offline mode
All other features, including the ability to use custom annotations and plugins, are accessible in offline mode.
Ensembl VEP can use transcript annotations defined in GFF or GTF files. The files must be bgzipped and indexed with tabix, and a FASTA file containing the genomic sequence is required in order to generate transcript models. This allows you to annotate variants from any species and assembly with these data.
Your GFF or GTF file must be sorted in chromosomal order. Ensembl VEP does not use header lines so it is safe to remove them.
grep -v "^#" data.gff | sort -k1,1 -k4,4n -k5,5n -t$'\t' | bgzip -c > data.gff.gz
tabix -p gff data.gff.gz
./vep -i input.vcf --gff data.gff.gz --fasta genome.fa.gz
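To confirm that an uncompressed GFF/GTF is already in the order produced by the sort command above, a check along the following lines may help. This is a minimal sketch: `gff_is_sorted` is a hypothetical helper, not part of Ensembl VEP. It assumes a `sort` that supports the `-s` option (GNU and BSD do) and forces the C locale, so a file sorted under a different locale could be flagged.

```shell
#!/bin/sh
# Hypothetical helper (not part of Ensembl VEP): succeed if the non-comment
# lines of a plain-text GFF/GTF are ordered by sequence name, start and end.
gff_is_sorted() {
    tab=$(printf '\t')
    # -c checks the order without writing output; -s (a GNU/BSD extension,
    # assumed available) compares on the listed keys only, so ties pass.
    grep -v '^#' "$1" |
        LC_ALL=C sort -c -s -t "$tab" -k1,1 -k4,4n -k5,5n 2>/dev/null
}

# Example on a small made-up file.
sample=$(mktemp)
printf '1\tEnsembl\tgene\t1000\t5000\t.\t+\t.\tID=gene1\n' >  "$sample"
printf '1\tEnsembl\texon\t1200\t1300\t.\t+\t.\tID=exon1\n' >> "$sample"
if gff_is_sorted "$sample"; then
    echo "sorted"
else
    echo "needs sorting"
fi
rm -f "$sample"
```

The check is stricter than strictly necessary for indexing, since it also requires sequence names to be in byte order; treat a failure as a prompt to re-run the sort rather than as proof that the file is unusable.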
You may use any number of GFF/GTF files in this way, providing they refer to the same genome. You may also use them in concert with annotations from a cache or database source; annotations are distinguished by the SOURCE field in the output.
GFF file
Example of command line with GFF, using flag --gff:
./vep -i input.vcf --cache --gff data.gff.gz --fasta genome.fa.gz
NOTE: If you wish to customise the name of the GFF as it appears in the SOURCE field and Ensembl VEP output header, use the longer --custom annotation form:
--custom file=data.gff.gz,short_name=frequency,format=gff
GTF file
Example of command line with GTF, using flag --gtf:
./vep -i input.vcf --cache --gtf data.gtf.gz --fasta genome.fa.gz
NOTE: If you wish to customise the name of the GTF as it appears in the SOURCE field and Ensembl VEP output header, use the longer --custom annotation form:
--custom file=data.gtf.gz,short_name=frequency,format=gtf
Ensembl VEP has been tested on GFF files generated by Ensembl and NCBI (RefSeq). Due to inconsistency in the GFF specification and adherence to it, not all GFF files will be compatible with Ensembl VEP and not all transcript biotypes may be supported. Additionally, Ensembl VEP does not support GFF files with embedded FASTA sequence.
Column "type" (3rd column):
The following entity/feature types are supported by Ensembl VEP.
Lines of other types will be ignored; if this leads to an incomplete transcript model, the whole transcript model may be discarded. If unsupported types are used you will see a warning like the following:
WARNING:Ignoring 'five_prime_utr' feature_type from Homo_sapiens.GRCh38.111.gtf.gz GFF/GTF file. This feature_type is not supported in Ensembl VEP.
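Before running Ensembl VEP, it can be useful to see which feature types a file actually contains, so that any ignored types do not come as a surprise. The sketch below tallies the 3rd column of a plain-text GFF/GTF. `feature_type_counts` is a hypothetical helper, not part of Ensembl VEP, and it makes no judgement about which types are supported.

```shell
#!/bin/sh
# Hypothetical helper (not part of Ensembl VEP): count how often each feature
# type (3rd column) occurs in a plain-text GFF/GTF, skipping comment lines.
feature_type_counts() {
    awk -F '\t' '
        /^#/ { next }
        NF >= 3 { n[$3]++ }
        END { for (t in n) printf "%s\t%d\n", t, n[t] }
    ' "$1" | LC_ALL=C sort
}

# Example on a small made-up file.
sample=$(mktemp)
printf '1\tEnsembl\tgene\t1000\t5000\t.\t+\t.\tgene_id "gene1";\n'            >  "$sample"
printf '1\tEnsembl\tfive_prime_utr\t1100\t1199\t.\t+\t.\tgene_id "gene1";\n' >> "$sample"
printf '1\tEnsembl\texon\t1200\t1300\t.\t+\t.\tgene_id "gene1";\n'           >> "$sample"
feature_type_counts "$sample"
rm -f "$sample"
```

For a bgzipped file, decompress to a temporary file first (for example with `gzip -dc data.gff.gz`), since bgzip output is gzip-compatible.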
Expected parameters in the 9th column:
Only required for the genes and transcripts entities.
- Entities in the GFF are expected to be linked using a key named "parent" or "Parent" in the attributes (9th) column of the GFF.
- Unlinked entities (i.e. those with no parents or children) are discarded.
- Sibling entities (those that share the same parent) may have overlapping coordinates, e.g. for exon and CDS entities.
Transcripts require a Sequence Ontology biotype to be defined in order to be used.
The simplest way to define this is using an attribute named "biotype" on the transcript entity. Other configurations are supported in order for Ensembl VEP to use GFF files from NCBI and other sources.
Here is an example:
##gff-version 3.2.1
##sequence-region 1 1 10000
1	Ensembl	gene	1000	5000	.	+	.	ID=gene1;Name=GENE1
1	Ensembl	transcript	1100	4900	.	+	.	ID=transcript1;Name=GENE1-001;Parent=gene1;biotype=protein_coding
1	Ensembl	exon	1200	1300	.	+	.	ID=exon1;Name=GENE1-001_1;Parent=transcript1
1	Ensembl	exon	1500	3000	.	+	.	ID=exon2;Name=GENE1-001_2;Parent=transcript1
1	Ensembl	exon	3500	4000	.	+	.	ID=exon3;Name=GENE1-001_2;Parent=transcript1
1	Ensembl	CDS	1300	3800	.	+	.	ID=cds1;Name=CDS0001;Parent=transcript1
The following GTF entity types will be extracted:
Entities are linked by an attribute named for the parent entity type, e.g. exon is linked to transcript by transcript_id, transcript is linked to gene by gene_id.
Transcript biotypes are defined in attributes named "biotype", "transcript_biotype" or "transcript_type". If none of these exist, Ensembl VEP will attempt to interpret the source field (2nd column) of the GTF as the biotype.
Here is an example:
1	Ensembl	gene	1000	5000	.	+	.	gene_id "gene1"; gene_name "GENE1";
1	Ensembl	transcript	1100	4900	.	+	.	gene_id "gene1"; transcript_id "transcript1"; gene_name "GENE1"; transcript_name "GENE1-001"; transcript_biotype "protein_coding";
1	Ensembl	exon	1200	1300	.	+	.	gene_id "gene1"; transcript_id "transcript1"; exon_number "exon1"; exon_id "GENE1-001_1";
1	Ensembl	exon	1500	3000	.	+	.	gene_id "gene1"; transcript_id "transcript1"; exon_number "exon2"; exon_id "GENE1-001_2";
1	Ensembl	exon	3500	4000	.	+	.	gene_id "gene1"; transcript_id "transcript1"; exon_number "exon3"; exon_id "GENE1-001_2";
1	Ensembl	CDS	1300	3800	.	+	.	gene_id "gene1"; transcript_id "transcript1"; exon_number "exon2"; ccds_id "CDS0001";
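Because Ensembl VEP falls back to the source column when none of the biotype attributes is present, it may be worth listing the transcripts for which that fallback would apply. The following is a minimal sketch: `gtf_transcripts_without_biotype` is a hypothetical helper, not part of Ensembl VEP, and it assumes transcript lines use the type "transcript" as in the example above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Ensembl VEP): list GTF transcript lines
# that carry none of the attributes biotype, transcript_biotype or
# transcript_type. Prints the transcript_id and the source (2nd) column,
# which is the value that would then be interpreted as the biotype.
gtf_transcripts_without_biotype() {
    awk -F '\t' '
        /^#/ { next }
        $3 == "transcript" &&
        (" " $9) !~ /[; ](biotype|transcript_biotype|transcript_type) "/ {
            id = "unknown"
            if (match($9, /transcript_id "[^"]*"/))
                id = substr($9, RSTART + 15, RLENGTH - 16)
            printf "%s\t%s\n", id, $2
        }
    ' "$1"
}

# Example on a small made-up file: only the second transcript is reported.
sample=$(mktemp)
printf '1\thavana\ttranscript\t1100\t4900\t.\t+\t.\tgene_id "gene1"; transcript_id "t1"; transcript_biotype "protein_coding";\n' >  "$sample"
printf '1\tprotein_coding\ttranscript\t6000\t7000\t.\t+\t.\tgene_id "gene2"; transcript_id "t2"; gene_biotype "protein_coding";\n' >> "$sample"
gtf_transcripts_without_biotype "$sample"
rm -f "$sample"
```

The leading space added before the attribute column lets the pattern require a separator in front of each attribute name, so an attribute such as gene_biotype is not mistaken for biotype.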
If the chromosome names used in your GFF/GTF differ from those used in the FASTA or your input VCF, you may see warnings like this when running Ensembl VEP:
WARNING: Chromosome 21 not found in annotation sources or synonyms on line 160
To circumvent this you may provide Ensembl VEP with a synonyms file. A synonyms file is included in Ensembl VEP's cache files, so if you have one of these for your species you can use it as follows:
./vep -i input.vcf --cache --gff data.gff.gz --fasta genome.fa.gz --synonyms ~/.vep/homo_sapiens/114_GRCh38/chr_synonyms.txt
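To see which names actually differ, the sequence names in the input file can be compared with those in the annotation file. The sketch below works on plain-text, tab-delimited files; `seq_names` and `names_missing_from` are hypothetical helpers, not part of Ensembl VEP.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Ensembl VEP).

# Print the unique sequence names (1st column) of a tab-delimited file,
# skipping header and comment lines that start with '#'.
seq_names() {
    grep -v '^#' "$1" | cut -f1 | LC_ALL=C sort -u
}

# Print names used in the first file but absent from the second.
names_missing_from() {
    a=$(mktemp) && b=$(mktemp) || return 1
    seq_names "$1" > "$a"
    seq_names "$2" > "$b"
    LC_ALL=C comm -23 "$a" "$b"
    rm -f "$a" "$b"
}

# Example on small made-up files: the VCF uses "chr21", the GFF uses "21".
vcf=$(mktemp)
gff=$(mktemp)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\n' > "$vcf"
printf 'chr21\t25587758\t.\tG\tA\n' >> "$vcf"
printf '21\tEnsembl\tgene\t25000000\t26000000\t.\t+\t.\tID=gene1\n' > "$gff"
names_missing_from "$vcf" "$gff"
rm -f "$vcf" "$gff"
```

Any name printed here is a candidate for a warning like the one above; compressed inputs would need decompressing first (for example with `gzip -dc`).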
Using a GFF or GTF file as the gene annotation source limits access to some auxiliary information available when using a cache. Currently most external reference data such as gene symbols, transcript identifiers and protein domains are inaccessible when using only a GFF/GTF file.
Ensembl VEP's flexibility does allow some annotation types to be replaced. The following table illustrates some examples and alternative means to retrieve equivalent data.
Data type | Alternative |
---|---|
SIFT and PolyPhen-2 predictions (--sift, --polyphen) | Use the PolyPhen_SIFT plugin |
Co-located variants (--check_existing, --af* flags) | A couple of options are available: |
Regulatory consequences (--regulatory) | Add --cache to use regulatory features in the cache.* |
* Note this will also instruct Ensembl VEP to annotate input variants against transcript models retrieved from the cache as well as those from the GFF/GTF file. It is possible to use --transcript_filter to include only the transcripts from your GFF/GTF file:
./vep -i input.vcf --cache --custom file=data.gff.gz,short_name=myGFF,format=gff --fasta genome.fa.gz --transcript_filter "_source_cache is myGFF"
By pointing Ensembl VEP to a FASTA file (or directory containing several files), it is possible to retrieve reference sequence locally when using --cache or --offline. This enables Ensembl VEP to:
FASTA files from Ensembl can be set up using the installer; files set up using the installer are automatically detected when using --cache or --offline; you should not need to use --fasta to manually specify them.
The following plugins do require the FASTA file to be explicitly passed as a command line argument (i.e. --fasta /VEP_DIR/your_downloaded.fasta)
To enable this, Ensembl VEP uses one of two modules:
The first time you run Ensembl VEP with a specific FASTA file, an index will be built. This can take a few minutes, depending on the size of the FASTA file and the speed of your system. On subsequent runs the index does not need to be rebuilt (if the FASTA file has been modified, Ensembl VEP will force a rebuild of the index).
FASTA FTP directories
Suitable reference FASTA files are available to download from the Ensembl FTP server. See the Downloads page for details.
You should preferably use the installer as described above to fetch these files; manual instructions are provided for reference. In most cases it is best to download the single large "primary_assembly" file for your species. You should use the unmasked (without _rm or _sm in the name) sequences.
Note that Ensembl VEP requires that the file be either unzipped (Bio::DB::Fasta) or unzipped and then recompressed with bgzip (Bio::DB::HTS::Faidx) to run; when unzipped these files can be very large (25GB for human). An example set of commands for setting up the data for human follows:
curl -O https://ftp.ensembl.org/pub/release-114/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gzip -d Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
bgzip Homo_sapiens.GRCh38.dna.primary_assembly.fa
./vep -i input.vcf --offline --hgvs --fasta Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
Ensembl VEP can use remote or local database servers to retrieve annotations.
By default, Ensembl VEP is configured to connect to the public MySQL instance at ensembldb.ensembl.org. If you are in the USA (or geographically closer to the east coast of the USA than to the Ensembl data centre in Cambridge, UK), a mirror server is available at useastdb.ensembl.org. To use the mirror, use the flag --host useastdb.ensembl.org
Data for Ensembl Genomes species (e.g. plants, fungi, microbes) is available through a different public MySQL server. The appropriate connection parameters can be automatically loaded by using the flag --genomes
If you have a very small data set (100s of variants), using the public database servers should provide adequate performance. If you have larger data sets, or wish to use Ensembl VEP in a batch manner, consider one of the alternatives below.
It is possible to set up a local MySQL mirror with the databases for your species of interest installed. For instructions on installing a local mirror, see here. You will need a MySQL server that you can connect to from the machine where you will run Ensembl VEP (this can be the same machine). For most annotation functionality, you will only need the Core database (e.g. homo_sapiens_core_114_38) installed. In order to find co-located variants or to use SIFT or PolyPhen-2, it is also necessary to install the relevant variation database (e.g. homo_sapiens_variation_114_38).
Note that unless you have custom data to insert in the database, in most cases it will be much more efficient to use a pre-built cache in place of a local database.
To connect to your mirror, you can either set the connection parameters using --host, --port, --user and --password, or use a registry file. Registry files contain all the connection parameters for your database, as well as any species aliases you wish to set up:
use Bio::EnsEMBL::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Variation::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Registry;

# Connection to the core database
Bio::EnsEMBL::DBSQL::DBAdaptor->new(
  '-species' => "Homo_sapiens",
  '-group'   => "core",
  '-port'    => 5306,
  '-host'    => 'ensembldb.ensembl.org',
  '-user'    => 'anonymous',
  '-pass'    => '',
  '-dbname'  => 'homo_sapiens_core_114_38'
);

# Connection to the variation database
Bio::EnsEMBL::Variation::DBSQL::DBAdaptor->new(
  '-species' => "Homo_sapiens",
  '-group'   => "variation",
  '-port'    => 5306,
  '-host'    => 'ensembldb.ensembl.org',
  '-user'    => 'anonymous',
  '-pass'    => '',
  '-dbname'  => 'homo_sapiens_variation_114_38'
);

# Species alias
Bio::EnsEMBL::Registry->add_alias("Homo_sapiens", "human");
For more information on the registry and registry files, see here.
ADVANCED The cache consists of compressed files containing listrefs of serialised objects. These objects are initially created from the database as if using the Ensembl API normally. In order to reduce the size of the cache and allow the serialisation to occur, some changes are made to the objects before they are dumped to disk. This means that they will not behave in exactly the same way as an object retrieved from the database when writing, for example, a plugin that uses the cache.
The following hash keys are deleted from each transcript object:
$transcript->{_variation_effect_feature_cache}->{mapper}
As mentioned above, a special hash key "_variation_effect_feature_cache" is created on the transcript object and used to cache things used by Ensembl VEP in predicting consequences, things which might otherwise have to be fetched from the database. Some of these are stored in place of equivalent keys that are deleted as described above. The following keys and data are stored:
$transcript->translateable_seq
$transcript->translate->seq
$transcript->translation->get_all_ProteinFeatures
Each protein feature is stripped of all keys but: start, end, analysis, hseqname
$transcript->slice->get_all_Attributes('codon_table')->[0]
$protein_function_prediction_matrix_adaptor->fetch_by_analysis_translation_md5('sift', md5_hex($transcript->{_variation_effect_feature_cache}->{peptide}))
Similarly, some further data is cached directly on the transcript object under the following keys: