Front Genet. 2021 Apr 27;12:656334.

doi: 10.3389/fgene.2021.656334. eCollection 2021.

A de novo Full-Length mRNA Transcriptome Generated From Hybrid-Corrected PacBio Long-Reads Improves the Transcript Annotation and Identifies Thousands of Novel Splice Variants in Atlantic Salmon

Sigmund Ramberg¹, Bjørn Høyheim², Tone-Kari Knutsdatter Østbye³, Rune Andreassen¹

PMID: 33986770
PMCID: PMC8110904
DOI: 10.3389/fgene.2021.656334

Sigmund Ramberg et al. Front Genet. 2021.

Affiliations

¹ Department of Life Sciences and Health, Faculty of Health Sciences, OsloMet - Oslo Metropolitan University, Oslo, Norway.
² Department of Preclinical Sciences and Pathology, Faculty of Veterinary Medicine, Norwegian University of Life Sciences, Ås, Norway.
³ Nofima (Norwegian Institute of Food, Fisheries and Aquaculture Research), Ås, Norway.


Abstract

Atlantic salmon (Salmo salar) is a major species produced in world aquaculture and an important vertebrate model organism for studying the process of rediploidization following whole genome duplication events (Ss4R, 80 mya). The current Salmo salar transcriptome is largely generated from genome-sequence-based in silico predictions supported by ESTs and short-read sequencing data. However, recent progress in long-read sequencing technologies now allows for full-length transcript sequencing from single RNA molecules. This study provides a de novo full-length mRNA transcriptome from liver, head-kidney and gill materials. A pipeline was developed based on Iso-seq sequencing of long-reads on the PacBio platform (HQ reads) followed by error-correction of the HQ reads by short-reads from the Illumina platform. The pipeline successfully processed more than 1.5 million long-reads and more than 900 million short-reads into error-corrected HQ reads. A surprisingly high percentage (32%) represented expressed interspersed repeats, while the remaining reads were processed into 71 461 full-length mRNAs from 23 071 loci. Each transcript was supported by several single-molecule long-read sequences and at least three short-reads, assuring a high sequence accuracy. On average, each gene was represented by three isoforms. Comparisons to the current Atlantic salmon transcripts in the RefSeq database showed that the long-read transcriptome validated 25% of all known transcripts, while the remaining full-length transcripts were novel isoforms, but few were transcripts from novel genes. A comparison to the current genome assembly indicates that the long-read transcriptome may aid in improving transcript annotation as well as provide long-read linkage information useful for improving the genome assembly.
More than 80% of transcripts were assigned GO terms, and thousands of transcripts were from genes or splice-variants expressed in an organ-specific manner, demonstrating that hybrid error-corrected long-read transcriptomes may be applied to study genes and splice-variants expressed in certain organs or conditions (e.g., challenge materials). In conclusion, this is the single largest contribution of full-length mRNAs in Atlantic salmon. The results will be of great value to salmon genomics research, and the pipeline outlined may be applied to generate additional de novo transcriptomes in Atlantic salmon or applied for similar projects in other species.

Keywords: Atlantic salmon; Illumina sequencing; PacBio Iso-seq; full-length mRNA; hybrid error correction; transcriptome.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

**FIGURE 1**
The PacBio Isoseq3 pipeline for processing SMRT-sequencing data. Each Zero-Mode Waveguide (ZMW) provides information from a single DNA polymerase, which sequences each cDNA-SMRTBell adapter repeatedly. Consensus: the CCS program generates a consensus sequence for each read that contains a complete repeated insert-adapter complex. Demultiplex: lima filters away sequences with unwanted primer combinations, trims away the adapter sequences, and orients the reads in the 5′→3′ orientation. Refine: the refine program filters away concatemers and sequences without polyA tails of at least 20 bp. Finally, it trims the polyA tails from the remaining sequences. Cluster: Isoseq cluster performs conservative clustering of sequences and uses partial order alignment to generate a consensus sequence for each cluster. The output is classified as High Quality or Low Quality based on the predicted accuracy. The final outputs are High Quality and Low Quality sequences in fastq format. This figure is used with permission from Pacific Biosciences.
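
The polyA rule in the Refine step can be illustrated with a minimal sketch. This is not the Isoseq3 implementation: the function name `refine_polya` is hypothetical, and the sketch assumes an exact, uninterrupted run of A bases at the 3′ end, whereas the real tool may tolerate sequencing errors in the tail.

```python
def refine_polya(seq, min_tail=20):
    """Trim the 3' polyA tail; return None if the tail is shorter than min_tail.

    Hypothetical helper: only exact, uninterrupted A runs count as tail.
    """
    tail_len = len(seq) - len(seq.rstrip("Aa"))  # length of the trailing A run
    if tail_len < min_tail:
        return None  # read is filtered away
    return seq[: len(seq) - tail_len]  # read is kept, tail removed


if __name__ == "__main__":
    print(refine_polya("ACGTTGCC" + "A" * 25))  # tail long enough: trimmed read
    print(refine_polya("ACGTTGCC" + "A" * 5))   # tail too short: None
```

Note that any A bases immediately upstream of the tail are trimmed as well, since this sketch cannot tell them apart from the tail itself.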

**FIGURE 2**
Overview of the analysis pipeline, from processing of sequences up to a non-redundant Error Corrected High Quality transcriptome. The PacBio SMRT High Quality reads were the input from the PacBio platform. The Illumina reads were first trimmed using cutadapt to remove the adapter sequences. Subsequently, they were used to generate a De Bruijn graph for LoRDEC to error-correct the High Quality reads on a sample-by-sample basis. In-house Python script: the error-corrected reads were filtered based on the degree of Illumina support and coverage of the High Quality reads. RepeatMasker was used to identify and remove reads containing known Long Interspersed Repeats. Sequences that could be mapped accurately to the Salmo salar or Salmo trutta genome were clustered using cDNA_Cupcake, while the remaining sequences were instead clustered using Cogent. All the reads were additionally clustered using CD-HIT prior to annotation. The final output was a non-redundant Error Corrected High Quality transcriptome.
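
The coverage-based filtering step might be sketched as follows. This is not the authors' in-house script. It assumes the LoRDEC output convention in which bases supported by the short-read graph are written in uppercase and unsupported bases in lowercase; the function names and the 0.75 threshold are hypothetical choices for illustration only.

```python
import re


def lordec_coverage(read):
    """Return (fraction of uppercase bases, True if an internal lowercase gap exists).

    Assumption: uppercase = short-read-supported, lowercase = unsupported.
    """
    if not read:
        return 0.0, False
    supported = sum(1 for base in read if base.isupper())
    coverage = supported / len(read)
    # An internal gap is a lowercase run flanked by uppercase bases on both sides.
    has_internal_gap = re.search(r"[A-Z][a-z]+[A-Z]", read) is not None
    return coverage, has_internal_gap


def keep_read(read, min_coverage=0.75):
    """Hypothetical filter: keep reads whose supported fraction meets the threshold."""
    coverage, _ = lordec_coverage(read)
    return coverage >= min_coverage


if __name__ == "__main__":
    for read in ("ACGTacgtACGT", "acACGTAC"):
        print(read, lordec_coverage(read), keep_read(read))
```

Separating the coverage fraction from the internal-gap flag lets reads with unsupported ends be treated differently from reads with an unsupported stretch in the middle.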

**FIGURE 3**
Overview of the annotation process. Transcripts that were clustered using the Salmo salar or Salmo trutta genomes were characterized against the genome annotation for the corresponding species using SQANTI3. All the sequences were also used for open reading frame prediction using TransDecoder. The sequences predicted to contain a complete coding sequence were searched with BLAST against the RefSeq protein database and searched through the InterPro database to retrieve gene names and functional annotation. The reads were filtered based on the SQANTI classification, open reading frame prediction and support in the PacBio sequencing data. Information from the structural classification path and the functional annotation path was added to the final filtered mRNA transcriptome.

**FIGURE 4**
Distribution of coverage by LoRDEC for sequences with (orange bars) and without (orange bars) internal gaps in the coverage interval 75–100%.

**FIGURE 5**
The final full-length mRNA transcriptome distributed by transcript length. Each column shows the number of transcripts falling within the given 500 bp length interval.
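
The 500 bp binning behind such a length distribution can be sketched as below. The function name `length_histogram` is hypothetical, and the bin boundaries (0–499, 500–999, and so on) are an assumption; the figure may place its interval edges differently.

```python
from collections import Counter


def length_histogram(lengths, bin_width=500):
    """Count transcripts per length interval; bins start at multiples of bin_width."""
    counts = Counter((length // bin_width) * bin_width for length in lengths)
    return {
        f"{start}-{start + bin_width - 1}": counts[start]
        for start in sorted(counts)
    }


if __name__ == "__main__":
    # Made-up transcript lengths, for illustration only.
    print(length_histogram([120, 499, 500, 1499, 2600]))
```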

**FIGURE 6**
Distribution of Full-Length Non-Concatemer support in the final mRNA dataset. Each column shows the number of transcripts in the final mRNA transcriptome supported by the given number of Full-Length Non-Concatemer reads.

**FIGURE 7**
Pie chart showing the distribution of BLAST results when searching all Salmo salar RefSeq mRNAs against the final full-length mRNA dataset. Blue are identical isoforms, orange are significant hits, gray are non-matching RefSeq mRNAs.

**FIGURE 8**
Pie chart showing the distribution of BLAST results when searching the final full-length mRNA dataset against all Salmo salar RefSeq mRNAs. Blue are identical isoforms, orange are significant hits, gray are non-matching novel mRNAs.

**FIGURE 9**
Venn diagram illustrating the number of identical isoforms (Full Splice Match, FSM) shared among the mRNAs in the genome annotation and the final full-length mRNA transcriptome. No FSM represents isoforms in the genome annotation with no identical match in the final full-length mRNA transcriptome. Novel mRNAs refers to transcript isoforms in the final mRNA dataset without an identical match to the sequences annotated to the Salmo salar genome assembly.

**FIGURE 10**
Distribution of number of predicted Gene Ontology terms. Each column shows the number of transcripts falling within the given interval of Gene Ontology terms identified for the transcript.

**FIGURE 11**
Multilevel Gene Ontology chart, gill. The pie chart shows the most specific Gene Ontology terms occurring in at least 144 gill-specific transcripts in a non-redundant way (see also Materials and Methods “Functional Annotation”).

**FIGURE 12**
Multilevel Gene Ontology chart, head-kidney. The pie chart shows the most specific Gene Ontology terms occurring in at least 98 head-kidney-specific transcripts in a non-redundant way (see also Materials and Methods “Functional Annotation”).

**FIGURE 13**
Multilevel Gene Ontology chart, liver. The pie chart shows the most specific Gene Ontology terms occurring in at least 116 liver-specific transcripts in a non-redundant way (see also Materials and Methods “Functional Annotation”).

Cited by

Full-length transcriptome from different life stages of cobia (Rachycentron canadum, Rachycentridae).
Ebeneezar S, Krupesha Sharma SR, Vijayagopal P, Sebastian W, Sajina KA, Tamilmani G, Sakthivel M, Rameshkumar P, Anikuttan KK, Varghese E, Linga Prabu D, Jeena NS, Sumithra TG, Gayathri S, Iyyapparaja Narasimapallavan G, Gopalakrishnan A. Sci Data. 2023 Feb 16;10(1):97. doi: 10.1038/s41597-022-01907-0. PMID: 36797271. Free PMC article.
Long-read isoform sequencing reveals tissue-specific isoform expression between active and hibernating brown bears (Ursus arctos).
Tseng E, Underwood JG, Evans Hutzenbiler BD, Trojahn S, Kingham B, Shevchenko O, Bernberg E, Vierra M, Robbins CT, Jansen HT, Kelley JL. G3 (Bethesda). 2022 Mar 4;12(3):jkab422. doi: 10.1093/g3journal/jkab422. PMID: 35100340. Free PMC article.
Differential Expression of miRNAs and Their Predicted Target Genes Indicates That Gene Expression in Atlantic Salmon Gill Is Post-Transcriptionally Regulated by miRNAs in the Parr-Smolt Transformation and Adaptation to Sea Water.
Shwe A, Krasnov A, Visnovska T, Ramberg S, Østbye TK, Andreassen R. Int J Mol Sci. 2022 Aug 8;23(15):8831. doi: 10.3390/ijms23158831. PMID: 35955964. Free PMC article.
PacBio Iso-Seq Improves the Rainbow Trout Genome Annotation and Identifies Alternative Splicing Associated With Economically Important Phenotypes.
Ali A, Thorgaard GH, Salem M. Front Genet. 2021 Jul 15;12:683408. doi: 10.3389/fgene.2021.683408. eCollection 2021. PMID: 34335690. Free PMC article.
MicroSalmon: A Comprehensive, Searchable Resource of Predicted MicroRNA Targets and 3'UTR Cis-Regulatory Elements in the Full-Length Sequenced Atlantic Salmon Transcriptome.
Ramberg S, Andreassen R. Noncoding RNA. 2021 Sep 22;7(4):61. doi: 10.3390/ncrna7040061. PMID: 34698276. Free PMC article.
