Front Genet. 2021 Apr 27:12:656334.
doi: 10.3389/fgene.2021.656334. eCollection 2021.

A de novo Full-Length mRNA Transcriptome Generated From Hybrid-Corrected PacBio Long-Reads Improves the Transcript Annotation and Identifies Thousands of Novel Splice Variants in Atlantic Salmon

Sigmund Ramberg et al. Front Genet.

Abstract

Atlantic salmon (Salmo salar) is a major species produced in world aquaculture and an important vertebrate model organism for studying the process of rediploidization following whole genome duplication events (Ss4R, 80 mya). The current Salmo salar transcriptome is largely generated from genome-sequence-based in silico predictions supported by ESTs and short-read sequencing data. However, recent progress in long-read sequencing technologies now allows for full-length transcript sequencing from single RNA molecules. This study provides a de novo full-length mRNA transcriptome from liver, head-kidney and gill materials. A pipeline was developed based on Iso-seq sequencing of long-reads on the PacBio platform (HQ reads) followed by error-correction of the HQ reads by short-reads from the Illumina platform. The pipeline successfully processed more than 1.5 million long-reads and more than 900 million short-reads into error-corrected HQ reads. A surprisingly high percentage (32%) represented expressed interspersed repeats, while the remainder were processed into 71 461 full-length mRNAs from 23 071 loci. Each transcript was supported by several single-molecule long-read sequences and at least three short-reads, assuring a high sequence accuracy. On average, each gene was represented by three isoforms. Comparisons to the current Atlantic salmon transcripts in the RefSeq database showed that the long-read transcriptome validated 25% of all known transcripts, while the remaining full-length transcripts were novel isoforms, but few were transcripts from novel genes. A comparison to the current genome assembly indicates that the long-read transcriptome may aid in improving transcript annotation as well as provide long-read linkage information useful for improving the genome assembly.
More than 80% of transcripts were assigned GO terms, and thousands of transcripts were from genes or splice-variants expressed in an organ-specific manner, demonstrating that hybrid error-corrected long-read transcriptomes may be applied to study genes and splice-variants expressed in certain organs or conditions (e.g., challenge materials). In conclusion, this is the single largest contribution of full-length mRNAs in Atlantic salmon. The results will be of great value to salmon genomics research, and the pipeline outlined may be applied to generate additional de novo transcriptomes in Atlantic salmon or applied for similar projects in other species.

Keywords: Atlantic salmon; Illumina sequencing; PacBio Iso-seq; full-length mRNA; hybrid error correction; transcriptome.

Copyright © 2021 Ramberg, Høyheim, Østbye and Andreassen.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

FIGURE 1
The PacBio Isoseq3 pipeline for processing SMRT-sequencing data. Each Zero-Mode Waveguide (ZMW) provides information from a single DNA polymerase, which sequences each cDNA-SMRTBell adapter repeatedly. Consensus: the CCS program generates a consensus sequence for each read that contains a complete repeated insert-adapter complex. Demultiplex: lima filters away sequences with unwanted primer combinations, trims away the adapter sequences, and orients the reads in the 5′→3′ orientation. Refine: the refine program filters away concatemers and sequences without polyA tails of at least 20 bp. Finally, it trims the polyA tails from the remaining sequences. Cluster: Isoseq cluster performs conservative clustering of sequences and uses partial order alignment to generate a consensus sequence for each cluster. The output is classified as High Quality or Low Quality based on the predicted accuracy. The final outputs are high-quality and low-quality sequences in fastq format. This figure is used with permission from Pacific Biosciences.
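The Refine step in the Figure 1 caption (discard reads lacking a polyA tail of at least 20 bp, then trim the tail) can be illustrated with a minimal Python sketch. This is not the Isoseq3 implementation, which is more tolerant of sequencing errors within the tail; the function name `trim_polya` and the exact-match tail detection are assumptions made for illustration only.

```python
from typing import Optional


def trim_polya(seq: str, min_tail: int = 20) -> Optional[str]:
    """Trim a 3' poly(A) tail, or return None if the tail is too short.

    Simplified illustration: the tail is taken to be the uninterrupted run
    of A bases at the 3' end of the read (an assumption; real tools allow
    occasional mismatches inside the tail).
    """
    stripped = seq.rstrip("Aa")
    tail_len = len(seq) - len(stripped)
    if tail_len < min_tail:
        return None  # read would be filtered away
    return stripped


if __name__ == "__main__":
    print(trim_polya("ACGTTGCA" + "A" * 25))  # tail long enough: trimmed read
    print(trim_polya("ACGTTGCA" + "A" * 5))   # tail too short: None
```

Note that a genuine terminal A of the transcript is indistinguishable from the tail in this simplification and is trimmed along with it.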
FIGURE 2
Overview of the analysis pipeline from processing of sequences up to a non-redundant Error Corrected High Quality transcriptome. The PacBio SMRT High Quality reads were the input from the PacBio platform. The Illumina reads were first trimmed using cutadapt to remove the adapter sequences. Subsequently, they were used to generate a De Bruijn graph for LoRDEC to error-correct the High Quality reads on a sample-by-sample basis. In-house Python script: The error-corrected reads were filtered based on degree of Illumina support and coverage of the High Quality reads. RepeatMasker was used to identify and remove reads containing known Long Interspersed Repeats. Sequences that could be mapped accurately to the Salmo salar or Salmo trutta genome were clustered using cdna_Cupcake, while the remaining sequences were instead clustered using Cogent. All the reads were additionally clustered using CD-HIT prior to annotation. The final output was a non-redundant Error Corrected High Quality transcriptome.
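The in-house filtering script mentioned in the Figure 2 caption is not reproduced on this page. As a rough sketch of how short-read coverage of a corrected read might be summarized, the Python below assumes the LoRDEC output convention in which bases supported by the short-read graph are written in uppercase and unsupported bases in lowercase. The function name `coverage_summary` and the definition of an "internal gap" (a lowercase run flanked by uppercase bases on both sides) are illustrative assumptions, not the authors' code.

```python
import re
from typing import Tuple


def coverage_summary(read: str) -> Tuple[float, bool]:
    """Return (fraction of uppercase bases, whether an internal gap exists).

    Assumption: uppercase = short-read-supported, lowercase = unsupported.
    An internal gap is an unsupported stretch with supported sequence on
    both sides; unsupported read ends do not count as internal gaps.
    """
    if not read:
        return 0.0, False
    covered = sum(1 for base in read if base.isupper())
    has_internal_gap = re.search(r"[A-Z][a-z]+[A-Z]", read) is not None
    return covered / len(read), has_internal_gap


if __name__ == "__main__":
    for read in ("ACGTacGT", "acGTAC", "ACGTACGT"):
        print(read, coverage_summary(read))
```

A filter could then keep reads whose coverage exceeds a chosen threshold and that lack internal gaps; the thresholds actually used are described in the full article, not here.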
FIGURE 3
Overview of the annotation process. Transcripts that were clustered using the Salmo salar or Salmo trutta genomes were characterized against the genome annotation for the corresponding species using SQANTI3. All the sequences were also used for open reading frame prediction using TransDecoder. The sequences predicted to contain a complete coding sequence were searched with BLAST against the RefSeq protein database and searched against the InterPro database to retrieve gene names and functional annotation. The reads were filtered based on the SQANTI classification, open reading frame prediction and support in the PacBio sequencing data. Information from the structural classification path and the functional annotation path was added to the final filtered mRNA transcriptome.
FIGURE 4
Distribution of coverage by LoRDEC for sequences with (orange bars) and without (orange bars) internal gaps in the coverage interval 75–100%.
FIGURE 5
The final full-length mRNA transcriptome distributed by transcript length. Each column shows the number of transcripts falling within the given 500 bp length interval.
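The binning behind a length distribution such as the one in Figure 5 can be sketched in a few lines of Python. The function name `length_histogram` and the choice of half-open intervals (a 500 bp transcript falls in the 500–999 bin) are assumptions for illustration; the caption does not state how interval boundaries were handled.

```python
from collections import Counter
from typing import Dict, Iterable


def length_histogram(lengths: Iterable[int], bin_width: int = 500) -> Dict[int, int]:
    """Count transcripts per length interval, keyed by the interval start.

    Each length n is assigned to the half-open interval
    [k * bin_width, (k + 1) * bin_width) with k = n // bin_width.
    """
    counts = Counter((n // bin_width) * bin_width for n in lengths)
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Toy lengths, not data from the study.
    for start, count in length_histogram([120, 499, 500, 1250, 3010]).items():
        print(f"{start}-{start + 499} bp: {count}")
```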
FIGURE 6
Distribution of Full-Length Non-Concatemer support in the final mRNA dataset. Each column shows the number of transcripts in the final mRNA transcriptome supported by the given number of Full-Length Non-Concatemer reads.
FIGURE 7
Pie chart showing the distribution of BLAST results when searching all Salmo salar RefSeq mRNAs against the final full-length mRNA dataset. Blue are identical isoforms, orange are significant hits, gray are non-matching RefSeq mRNAs.
FIGURE 8
Pie chart showing the distribution of BLAST results when searching the final full-length mRNA dataset against all Salmo salar RefSeq mRNAs. Blue are identical isoforms, orange are significant hits, gray are non-matching novel mRNAs.
FIGURE 9
Venn diagram illustrating the number of identical isoforms (Full Splice-Match, FSM) shared among the mRNAs in the genome annotation and the final full-length mRNA transcriptome. No FSM represents isoforms in the genome annotation with no identical match in the final full-length mRNA transcriptome. Novel mRNAs refers to transcript isoforms in the final mRNA dataset without an identical match to the sequences annotated to the Salmo salar genome assembly.
FIGURE 10
Distribution of number of predicted Gene Ontology terms. Each column shows the number of transcripts falling within the given interval of Gene Ontology terms identified for the transcript.
FIGURE 11
Multilevel Gene Ontology chart, gill. The pie chart shows the most specific Gene Ontology terms occurring in at least 144 gill-specific transcripts in a non-redundant way (see also Materials and Methods “Functional Annotation”).
FIGURE 12
Multilevel Gene Ontology chart, head-kidney. The pie chart shows the most specific Gene Ontology terms occurring in at least 98 head-kidney-specific transcripts in a non-redundant way (see also Materials and Methods “Functional Annotation”).
FIGURE 13
Multilevel Gene Ontology chart, liver. The pie chart shows the most specific Gene Ontology terms occurring in at least 116 liver-specific transcripts in a non-redundant way (see also Materials and Methods “Functional Annotation”).