Virus report
Virus record metadata
The downloaded virus package contains a virus data report in JSON Lines format in the file:
ncbi_dataset/data/data_report.jsonl
Each line of the virus data report file is a hierarchical JSON object that represents a single virus record. The schema of the virus record is defined in the tables below, where each row describes a single field in the report or a sub-structure, which is a collection of fields. The outermost structure of the report is VirusAssembly.
Table fields that include a Table Field Mnemonic can be used with the dataformat command-line tool's --fields option.
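Because each line of the report is a standalone JSON object, the file can be read one record at a time. The following is a minimal sketch, not an official client: the helper names `iter_virus_records` and `summarize` are hypothetical, and the fields picked for the summary are just a sample of those listed in the tables below.

```python
import json
from pathlib import Path


def iter_virus_records(report_path):
    """Yield one dict per non-empty line of a JSON Lines virus data report."""
    with Path(report_path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines between records
                yield json.loads(line)


def summarize(record):
    """Pick a few top-level and nested fields from one VirusAssembly record."""
    # Sub-structures may be absent from a record, so fall back to an empty dict.
    virus = record.get("virus", {})
    return {
        "accession": record.get("accession"),
        "length": record.get("length"),
        "completeness": record.get("completeness"),
        "virus_name": virus.get("organismName"),
    }


# Usage (the path is the one given above, relative to the unpacked package):
# for rec in iter_virus_records("ncbi_dataset/data/data_report.jsonl"):
#     print(summarize(rec))
```

Reading line by line keeps memory use flat even for packages with many records.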
Sample report
{"accession":"NC_045512.2","bioprojects": ["PRJNA485481" ],"completeness":"COMPLETE","geneCount":11,"host": {"lineage": [ {"name":"cellular organisms","taxId":131567 }, {"name":"Eukaryota","taxId":2759 }, {"name":"Opisthokonta","taxId":33154 }, {"name":"Metazoa","taxId":33208 }, {"name":"Eumetazoa","taxId":6072 }, {"name":"Bilateria","taxId":33213 }, {"name":"Deuterostomia","taxId":33511 }, {"name":"Chordata","taxId":7711 }, {"name":"Craniata","taxId":89593 }, {"name":"Vertebrata","taxId":7742 }, {"name":"Gnathostomata","taxId":7776 }, {"name":"Teleostomi","taxId":117570 }, {"name":"Euteleostomi","taxId":117571 }, {"name":"Sarcopterygii","taxId":8287 }, {"name":"Dipnotetrapodomorpha","taxId":1338369 }, {"name":"Tetrapoda","taxId":32523 }, {"name":"Amniota","taxId":32524 }, {"name":"Mammalia","taxId":40674 }, {"name":"Theria","taxId":32525 }, {"name":"Eutheria","taxId":9347 }, {"name":"Boreoeutheria","taxId":1437010 }, {"name":"Euarchontoglires","taxId":314146 }, {"name":"Primates","taxId":9443 }, {"name":"Haplorrhini","taxId":376913 }, {"name":"Simiiformes","taxId":314293 }, {"name":"Catarrhini","taxId":9526 }, {"name":"Hominoidea","taxId":314295 }, {"name":"Hominidae","taxId":9604 }, {"name":"Homininae","taxId":207598 }, {"name":"Homo","taxId":9605 }, {"name":"Homo sapiens","taxId":9606 } ],"organismName":"Homo sapiens","taxId":9606 },"isAnnotated":true,"isolate": {"collectionDate":"2019-12","name":"Wuhan-Hu-1" },"length":29903,"location": {"geographicLocation":"China","geographicRegion":"Asia" },"maturePeptideCount":26,"nucleotide": {"sequenceHash":"A926D55E" },"proteinCount":12,"releaseDate":"2020-01-13T00:00:00Z","sourceDatabase":"RefSeq","submitter": {"affiliation":"National Center for Biotechnology Information, NIH","country":"USA","names": 
["Wu,F.","Zhao,S.","Yu,B.","Chen,Y.M.","Wang,W.","Song,Z.G.","Hu,Y.","Tao,Z.W.","Tian,J.H.","Pei,Y.Y.","Yuan,M.L.","Zhang,Y.L.","Dai,F.H.","Liu,Y.","Wang,Q.M.","Zheng,J.J.","Xu,L.","Holmes,E.C.","Zhang,Y.Z.","Baranov,P.V.","Henderson,C.M.","Anderson,C.B.","Gesteland,R.F.","Atkins,J.F.","Howard,M.T.","Robertson,M.P.","Igel,H.","Baertsch,R.","Haussler,D.","Ares,M. Jr.","Scott,W.G.","Williams,G.D.","Chang,R.Y.","Brian,D.A.","Chen,Y.-M.","Song,Z.-G.","Tao,Z.-W.","Tian,J.-H.","Pei,Y.-Y.","Zhang,Y.-L.","Dai,F.-H.","Wang,Q.-M.","Zheng,J.-J.","Zhang,Y.-Z." ] },"updateDate":"2020-07-18T00:00:00Z","virus": {"lineage": [ {"name":"Viruses","taxId":10239 }, {"name":"Riboviria","taxId":2559587 }, {"name":"Orthornavirae","taxId":2732396 }, {"name":"Pisuviricota","taxId":2732408 }, {"name":"Pisoniviricetes","taxId":2732506 }, {"name":"Nidovirales","taxId":76804 }, {"name":"Cornidovirineae","taxId":2499399 }, {"name":"Coronaviridae","taxId":11118 }, {"name":"Orthocoronavirinae","taxId":2501931 }, {"name":"Betacoronavirus","taxId":694002 }, {"name":"Sarbecovirus","taxId":2509511 }, {"name":"Betacoronavirus pandemicum","taxId":3418604 }, {"name":"Severe acute respiratory syndrome coronavirus 2","taxId":2697049 } ],"organismName":"Severe acute respiratory syndrome coronavirus 2","pangolinClassification":"B","taxId":2697049 }}
VirusAssembly Structure
| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
|---|---|---|---|---|---|
| accession | accession | Accession | string | The accession.version of the viral nucleotide sequence. Includes both GenBank and RefSeq accessions | NC_045512.2 |
| isAnnotated | is-annotated | Is Annotated | bool | The viral genome has been annotated by either the submitter (GenBank) or by NCBI (RefSeq) | |
| isolate | isolate- | Isolate | Isolate | | |
| sourceDatabase | sourcedb | Source database | string | Indicates whether the viral nucleotide record comes from a GenBank submitter or from NCBI-derived curation (RefSeq) | RefSeq, GenBank |
| proteinCount | protein-count | Protein count | uint32 | The total count of annotated proteins including both proteins and polyproteins but not processed mature peptides | |
| host | host- | Host | Organism | Taxon from which the virus sample was isolated | |
| virus | virus- | Virus | Organism | Viral taxon | |
| bioprojects (repeated) | bioprojects | BioProjects | string | Associated BioProject accessions, when available | PRJNA485481 |
| location | geo- | Geographic | VirusAssembly.CollectionLocation | | |
| updateDate | update-date | Update date | string | Date the viral nucleotide accession was last updated in NCBI Virus | |
| releaseDate | release-date | Release date | string | Date the viral nucleotide accession was first released in NCBI Virus | |
| completeness | completeness | Completeness | VirusAssembly.Completeness | Indicates whether the viral nucleotide sequence represents a complete or partial genome | |
| length | length | Length | uint32 | Length of the viral nucleotide sequence | |
| geneCount | gene-count | Gene count | uint32 | Total count of genes annotated on the viral nucleotide sequence | |
| maturePeptideCount | matpeptide-count | Mature peptide count | uint32 | Total count of processed mature peptides annotated on the viral nucleotide sequence | |
| biosample | biosample-acc | BioSample accession | string | Associated BioSample accessions | SAMN15394129 |
| molType | mol-type | Molecule type | string | ICTV (International Committee on Taxonomy of Viruses) viral classification based on nucleic acid composition, strandedness and method of replication | |
| nucleotide | | | SeqRangeSetFasta | The whole genomic nucleotide record of the CDS feature. | |
| purposeOfSampling | purpose-of-sampling | Purpose of Sampling | PurposeOfSampling | SARS-CoV-2 only. Indicates whether the sequence was collected randomly for epidemiology studies | |
| sraAccessions (repeated) | sra-accs | SRA Accessions | string | SRA accessions linked to the GenBank genome | |
| submitter | submitter- | Submitter | VirusAssembly.SubmitterInfo | Name, affiliation, and country of the submitter(s) | |
| labHost | lab-host | Lab Host | string | This sequence is from viruses passaged in this host | |
| isLabHost | is-lab-host | Is Lab Host | bool | If true, this sequence is from viruses passaged in a laboratory | |
| isVaccineStrain | is-vaccine-strain | Is Vaccine Strain | bool | If true, this sequence is derived from a virus used as a vaccine or potential vaccine | |
| segment | segment | Segment | string | The virus segment | |
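Several mnemonics in the table above end in a hyphen (`isolate-`, `host-`, `virus-`, `geo-`, `submitter-`), which suggests they act as prefixes for the mnemonics of the corresponding sub-structure. The sketch below flattens a nested record under that assumption. The composed names (for example `geo-location`) are inferred from the tables in this document and are not confirmed here, and `flatten_record` is a hypothetical helper, so check the names against dataformat's own field list before relying on them.

```python
# JSON field -> prefix mnemonic, copied from the VirusAssembly table.
PREFIXES = {
    "isolate": "isolate-",
    "host": "host-",
    "virus": "virus-",
    "location": "geo-",
    "submitter": "submitter-",
}

# JSON key -> mnemonic, copied from the Organism table.
ORGANISM = {
    "taxId": "tax-id",
    "organismName": "name",
    "commonName": "common-name",
    "pangolinClassification": "pangolin",
}

# Sub-structure mnemonics, copied from the other structure tables.
SUB_MNEMONICS = {
    "isolate": {"name": "lineage", "source": "lineage-source",
                "collectionDate": "collection-date"},
    "host": ORGANISM,
    "virus": ORGANISM,
    "location": {"geographicLocation": "location",
                 "geographicRegion": "region", "usaState": "state"},
    "submitter": {"names": "names", "affiliation": "affiliation",
                  "country": "country"},
}


def flatten_record(record):
    """Map nested sub-structure values to assumed prefix+mnemonic names."""
    flat = {}
    for field, prefix in PREFIXES.items():
        sub = record.get(field) or {}
        for json_key, mnemonic in SUB_MNEMONICS[field].items():
            if json_key in sub:
                flat[prefix + mnemonic] = sub[json_key]
    return flat
```

Applied to the sample report, this would map `location.geographicRegion` to a column keyed `geo-region` with the value `Asia`.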
InfraspecificNames Structure
| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
|---|---|---|---|---|---|
| breed | breed | Breed | string | A homogenous group of animals within a domesticated species | Hereford, boxer |
| cultivar | cultivar | Cultivar | string | A variety of plant within a species produced and maintained by cultivation | B73 |
| ecotype | ecotype | Ecotype | string | A population or subspecies occupying a distinct habitat | Alpine |
| isolate | isolate | Isolate | string | The individual isolate from which the sequences in the genome assembly were derived | L1 Dominette 01449 registration number 42190680, Pmale09 |
| sex | sex | Sex | string | Male or female | female |
| strain | strain | Strain | string | A genetic variant, subtype or culture within a species | SE11 |
Isolate Structure
| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
|---|---|---|---|---|---|
| name | lineage | Lineage | string | BioSample harmonized attribute names https://www.ncbi.nlm.nih.gov/biosample/docs/attributes/ | |
| source | lineage-source | Lineage source | string | Source material from which the viral specimen was isolated | blood, feces, lung |
| collectionDate | collection-date | Collection date | string | The collection date for the sample from which the viral nucleotide sequence was derived | |
LineageOrganism Structure
| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
|---|---|---|---|---|---|
| taxId | coming soon | coming soon | uint32 | NCBI Taxonomy identifier | 11118 |
| name | coming soon | coming soon | string | Scientific name | Coronaviridae |
Organism Structure
| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
|---|---|---|---|---|---|
| taxId | tax-id | Taxonomic ID | uint32 | NCBI Taxonomy identifier | 9606, 2697049 |
| organismName | name | Name | string | Scientific name | Homo sapiens, Severe acute respiratory syndrome coronavirus 2 |
| commonName | common-name | Common Name | string | Common name | human, pangolin, MERS, SARS2 |
| lineage (repeated) | | | LineageOrganism | Lineage ordered from superkingdom level to increasingly more specific taxonomic entries | |
| pangolinClassification | pangolin | Pangolin Classification | string | | B.1.1.7 |
| infraspecificNames | infraspecific- | Infraspecific Names | InfraspecificNames | | |
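Since the `lineage` list runs from the broadest taxon down to the most specific one, checking whether an organism falls under a given taxon reduces to a membership test on the listed `taxId` values. A minimal sketch, with hypothetical helper names `lineage_names` and `belongs_to`:

```python
def lineage_names(organism):
    """Return lineage names in the order given (broadest taxon first)."""
    return [entry.get("name") for entry in organism.get("lineage", [])]


def belongs_to(organism, tax_id):
    """True if tax_id is the organism's own taxId or appears in its lineage."""
    if organism.get("taxId") == tax_id:
        return True
    return any(entry.get("taxId") == tax_id
               for entry in organism.get("lineage", []))


# Abbreviated "virus" sub-structure taken from the sample report.
virus = {
    "lineage": [
        {"name": "Viruses", "taxId": 10239},
        {"name": "Coronaviridae", "taxId": 11118},
        {"name": "Severe acute respiratory syndrome coronavirus 2",
         "taxId": 2697049},
    ],
    "organismName": "Severe acute respiratory syndrome coronavirus 2",
    "taxId": 2697049,
}

print(belongs_to(virus, 11118))  # Coronaviridae is in the lineage
```

The same helpers work on the `host` sub-structure, which uses the same Organism schema.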
Range Structure
A 1-based range on a sequence record.
| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
|---|---|---|---|---|---|
| begin | start | Start | uint64 | Sequence start position | |
| end | stop | Stop | uint64 | Sequence stop position | |
| orientation | orientation | Orientation | Orientation | Direction relative to the genome | |
| order | order | Order | uint32 | The position of this sequence in a group of sequences | |
| ribosomalSlippage | coming soon | coming soon | int32 | When ribosomal slippage is desired, fill out slippage amount between this and previous range. | |
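Because Range coordinates are 1-based while Python slices are 0-based, extracting a sub-sequence needs an off-by-one adjustment. The sketch below rests on two assumptions that this document does not state: that `end` is inclusive, and that a `minus` orientation means the reverse complement is wanted. It also accepts `begin` and `end` as strings, since the protocol buffers JSON mapping writes 64-bit integers as decimal strings. `extract_range` is a hypothetical helper.

```python
# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_range(sequence, rng):
    """Return the sub-sequence described by a Range-like dict.

    Assumes 1-based coordinates with an inclusive end (an assumption).
    """
    begin = int(rng["begin"])  # int() accepts both "266" and 266
    end = int(rng["end"])
    if begin < 1 or end < begin or end > len(sequence):
        raise ValueError(f"range {begin}..{end} does not fit the sequence")
    # 1-based inclusive [begin, end] -> 0-based half-open [begin - 1, end)
    sub = sequence[begin - 1:end]
    if rng.get("orientation") == "minus":
        # Assumed meaning of minus: reverse complement of the plus strand.
        sub = sub.translate(_COMPLEMENT)[::-1]
    return sub


print(extract_range("ACGTAC", {"begin": "2", "end": "4"}))  # CGT
```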
SeqRangeSetFasta Structure
| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
|---|---|---|---|---|---|
| seqId | seq-id | Sequence ID | string | Seq_id may include location info in addition to a sequence accession | |
| accessionVersion | accession | Accession | string | Accession and version of the viral nucleotide sequence | |
| title | title | Title | string | | |
| sequenceHash | hash | Hash | string | Unique identifier for identical sequences | |
| range (repeated) | range- | Range | Range | Series of intervals on above accession_version | |
VirusAssembly.CollectionLocation Structure
| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
|---|---|---|---|---|---|
| geographicLocation | location | Location | string | Country of virus specimen collection | USA, France |
| geographicRegion | region | Region | string | Region of virus specimen collection | Asia, North America |
| usaState | state | State | string | Two-letter abbreviation of the state of the virus specimen collection (if United States) | NY, VA |
VirusAssembly.SubmitterInfo Structure
| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
|---|---|---|---|---|---|
| names (repeated) | names | Names | string | List of submitters or authors of the virus assembly | Jane D, John S |
| affiliation | affiliation | Affiliation | string | The submitter’s organization and/or institution | Centers for Disease Control and Prevention, Respiratory Viruses Branch, Division of Viral Diseases; Public Health Directorate, Communicable Disease Laboratory |
| country | country | Country | string | The country representing the submitter’s affiliation | USA, China |
Orientation Enumeration
| Name | Number | Description |
|---|---|---|
| none | 0 | |
| plus | 1 | |
| minus | 2 | |
PurposeOfSampling Enumeration
| Name | Number | Description |
|---|---|---|
| PURPOSE_OF_SAMPLING_UNKNOWN | 0 | |
| PURPOSE_OF_SAMPLING_BASELINE_SURVEILLANCE | 1 | |
VirusAssembly.Completeness Enumeration
| Name | Number | Description |
|---|---|---|
| UNKNOWN | 0 | |
| COMPLETE | 1 | |
| PARTIAL | 2 | |
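The three enumerations above can be mirrored in client code so that values are validated instead of compared as raw strings. The sample report carries the enum by name (`"completeness":"COMPLETE"`). The sketch below also accepts the number as a convenience, which is an assumption about what other producers might emit. `parse_enum` is a hypothetical helper.

```python
from enum import IntEnum


class Orientation(IntEnum):
    none = 0
    plus = 1
    minus = 2


class PurposeOfSampling(IntEnum):
    PURPOSE_OF_SAMPLING_UNKNOWN = 0
    PURPOSE_OF_SAMPLING_BASELINE_SURVEILLANCE = 1


class Completeness(IntEnum):
    UNKNOWN = 0
    COMPLETE = 1
    PARTIAL = 2


def parse_enum(enum_cls, value):
    """Convert a JSON enum value (name or number) into an enum member."""
    if isinstance(value, str):
        return enum_cls[value]  # lookup by name; raises KeyError if unknown
    return enum_cls(value)      # lookup by number; raises ValueError if unknown


print(parse_enum(Completeness, "COMPLETE").name)  # COMPLETE
```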
Scalar Value Types
| Protocol buffers type | Notes | C++ | Python | Java | Go |
|---|---|---|---|---|---|
| double | | double | float | double | float64 |
| float | | float | float | float | float32 |
| int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 |
| int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | int/long | long | int64 |
| uint32 | Uses variable-length encoding. | uint32 | int/long | int | uint32 |
| uint64 | Uses variable-length encoding. | uint64 | int/long | long | uint64 |
| sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 |
| sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | int/long | long | int64 |
| fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 |
| fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | int/long | long | uint64 |
| sfixed32 | Always four bytes. | int32 | int | int | int32 |
| sfixed64 | Always eight bytes. | int64 | int/long | long | int64 |
| bool | | bool | boolean | boolean | bool |
| string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | str/unicode | String | string |
| bytes | May contain any arbitrary sequence of bytes. | string | str | ByteString | []byte |
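The notes in this table describe the protocol buffers binary wire format rather than the JSON report. As an illustration of why int32 is called inefficient for negative numbers while sint32 is not, the sketch below hand-rolls the base-128 varint and ZigZag encodings. It is a teaching aid with hypothetical function names, not a replacement for a protobuf library.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint (low group first)."""
    if value < 0:
        raise ValueError("varint input must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F      # take the low 7 bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more groups follow: set continuation bit
        else:
            out.append(byte)
            return bytes(out)


def encode_int32(value):
    """int32 on the wire: negatives are sign-extended to 64 bits first."""
    return encode_varint(value & 0xFFFFFFFFFFFFFFFF)


def zigzag32(value):
    """Map signed 32-bit ints to unsigned so small magnitudes stay small."""
    return ((value << 1) ^ (value >> 31)) & 0xFFFFFFFF


def encode_sint32(value):
    """sint32 on the wire: ZigZag first, then varint."""
    return encode_varint(zigzag32(value))


print(len(encode_int32(-1)), len(encode_sint32(-1)))  # 10 1
```

The same arithmetic explains the fixed32 note: a varint needs five bytes once a value reaches 2^28, one more than the four bytes fixed32 always uses.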