NakarinP/longreadvqsPublic

NotificationsYou must be signed in to change notification settings
Fork0
Star0

Toolkit for Viral Quasispecies Comparison from Long-Read Sequencing performing variety of viral quasispecies diversity analyses based on long-read sequence alignment.

Star

Notifications

You must be signed in to change notification settings

Branches Tags

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 92 Commits
R		R
inst		inst
man		man
sim_data		sim_data
vignettes		vignettes
.DS_Store		.DS_Store
.Rbuildignore		.Rbuildignore
.Rhistory		.Rhistory
.gitattributes		.gitattributes
.gitignore		.gitignore
CRAN-SUBMISSION		CRAN-SUBMISSION
DESCRIPTION		DESCRIPTION
NAMESPACE		NAMESPACE
NEWS.md		NEWS.md
README.md		README.md
cran-comments.md		cran-comments.md
longreadvqs.Rproj		longreadvqs.Rproj

Repository files navigation

longreadvqs

Toolkit for Viral Quasispecies Comparison from Long-Read Sequencing performing variety of viral quasispecies diversity analyses based on long-read sequence alignment. Main functions include 1) sequencing error/noise minimization and read sampling, 2) Single nucleotide variant (SNV) profiles comparison, and 3) viral quasispecies profiles comparison and visualization.

Installation

'longreadvqs' is a package available on CRAN and can be installed easily in R:

install.packages("longreadvqs")

If you have a problem installingQSutils, a key dependency of this package, please do the following:

install.packages("devtools")library(devtools)install_git("https://github.com/VHIRHepatiques/QSutils")

Background

Viral quasispecies (VQS) refers to a genetically closely relatedpopulation of viral variants. Measurement of VQS diversity, especiallyin highly mutating RNA viruses, is vital for understanding theirmicro-evolution.Long-read sequencing facilitates retrievingfull-length sequences of specific genes or genomes from each variant,yet challenges like sequencing errors and low coverage depth hinderdiversity metric computation and comparison. Here, we presentmethodologies to overcome these challenges, enabling comparisons of VQSdiversity.

Individual sample preparation

Sequencing error/noise minimization and down-sampling

Sequencing errors create plenty of artificial single nucleotide variants(SNVs), leading to an overestimated number of singleton haplotypes(haplotype = group of identical reads, singleton haplotype = haplotypewith only a single read). Our package aims to minimize potentialsequencing errors and potential noises from rare mutations along longreads defined as SNV(s) or nucleotide base(s) with a frequency less thann % at each nucleotide position. Such problematic base(s) will be replacedwith the consensus or majority base of that particular position.

First, load the “longreadvqs” package and a FASTA file of example readalignment (sample1.fasta). To choose the % cut-off for noiseminimization, use the “pctopt” function that shows % singletonhaplotypes in the read alignment across different % cut-offs. Then, usethe “vqssub” function for noise minimization, in this case, we choose a10% cut-off that decreases singleton haplotypes to less than 10% of allreads. This function will give a summary of VQS diversity metricscalculated by the “QSutils” package (Gregori, et al.,2016) after noiseminimization.

library(longreadvqs)#Load sample 1.sample1<- system.file("extdata","sample1.fasta",package="longreadvqs")

#Check which % cut-off could effectively minimize noises by assessing % singleton haplotypes.library(ggplot2)x<- pctopt(sample1,pctsing=0,label="sample1")ggplot(x, aes(x=pct,y=pctsingleton))+ geom_line()+ geom_point()

#VQS diversity metrics of noise minimized (10% cut-off) read alignmentvqssub(sample1,pct=10,label="sample1")#>     label  method samplingwhen pct fulldepth depth haplotypes nsingleton#> 1 sample1 conbase        after  10       311   311         47         17#>   pctsingleton polymorph mutations  shannon norm_shannon gini_simpson      FAD#> 1     5.466238         6       116 2.917347    0.7577235    0.8819625 6.475025#>           Mfe         Pie         Mfm         Pim#> 1 0.002580953 0.002994924 6.79165e-09 0.001803544

To compare VQS diversity across samples, it’s important to standardizethe read alignment depth. Since sequencing results can vary, we performdown-sampling after minimizing noises with the “vqssub” function. In ourcase, for sample1, sample2, and sample3 with original depths of 311,316, and 655, respectively, we randomly down-sample them to a commondepth of 300.

#Load samples 2 and 3.sample2<- system.file("extdata","sample2.fasta",package="longreadvqs")sample3<- system.file("extdata","sample3.fasta",package="longreadvqs")#Noise minimization (10% cut-off) and down-sampling (depth of 300)a<- vqssub(sample1,pct=10,samsize=300,label="sample1")b<- vqssub(sample2,pct=10,samsize=300,label="sample2")c<- vqssub(sample3,pct=10,samsize=300,label="sample3")#Compare VQS diversity across three samples after noise minimization and down-sampling.rbind(a,b,c)#>     label  method samplingwhen pct fulldepth depth haplotypes nsingleton#> 1 sample1 conbase        after  10       311   300         44         16#> 2 sample2 conbase        after  10       316   300         56         24#> 3 sample3 conbase        after  10       655   300         28          4#>   pctsingleton polymorph mutations  shannon norm_shannon gini_simpson      FAD#> 1     5.333333         6       108 2.875905    0.7599792    0.8799554 5.626911#> 2     8.000000         7       135 3.126571    0.7767200    0.8977035 9.861366#> 3     1.333333         5        53 2.670116    0.8013062    0.8943813 1.845056#>           Mfe         Pie          Mfm         Pim#> 1 0.002548420 0.002974055 7.016045e-09 0.001782496#> 2 0.002457405 0.003201742 6.025138e-09 0.001891865#> 3 0.002147954 0.002440550 5.532911e-09 0.001636262

We can also repeatedly down-sample the alignments and check theconsistency of each VQS diversity metric at the different depths likethe following example.

#Randomly down-sample from depth of 655 to 300, and 100 (10 iterations for each sample size).set.seed(123)c_full<- vqssub(sample3,pct=10,label="full_depth")#non-sampled alignmentc_300<- vqsresub(sample3,iter=10,pct=10,samsize=300,label="depth_300")#down-sample to depth of 300c_100<- vqsresub(sample3,iter=10,pct=10,samsize=100,label="depth_100")#down-sample to depth of 100all_c<- rbind(c_full,c_300,c_100)#combine data#Load packages for visualization.library(cowplot)all_c$depth<- as.character(all_c$depth)depthorder<- c("655","300","100")#Plot box-plots of three VQS metrics variation through different sampling depths.haplotypes<- ggplot(all_c, aes(x=factor(depth,level=depthorder),y=haplotypes,group= interaction(depth,label),fill=label))+ geom_boxplot()+   xlab("depth")+ theme(legend.position="none")shannon<- ggplot(all_c, aes(x=factor(depth,level=depthorder),y=shannon,group= interaction(depth,label),fill=label))+ geom_boxplot()+   xlab("depth")+ theme(legend.position="none")gini_simpson<- ggplot(all_c, aes(x=factor(depth,level=depthorder),y=gini_simpson,group= interaction(depth,label),fill=label))+ geom_boxplot()+   xlab("depth")+ theme(legend.position="none")

plot_grid(haplotypes,shannon,gini_simpson,nrow=1)

From this plot, the number of haplotypes obviously decreases as thedepth decreases, while Shannon entropy and the Gini-Simpson index remainrelatively stable with high variation at a depth of 100.

VQS data preparation for in-depth comparison

The previous “vqssub” function just summarizes key VQS diversity metricsafter noise minimization and down-sampling. However, these metrics donot provide us with comprehensive data for further comparative analyses,such as SNV positions and full read alignment after down-sampling. The“vqsassess” function will generate a complete VQS profile for eachsample, and the outputs can be used for other functions.

In this example, we repeat noise minimization and down-sampling forsamples 1-3 but apply the “vqsassess” function instead of the “vqssub”function. We then compare the SNV profiles between these samples usingthe “snvcompare” function.

#Generate complete VQS data for further comparison for each sample.s1<- vqsassess(sample1,pct=10,samsize=300,label="sample1")s2<- vqsassess(sample2,pct=10,samsize=300,label="sample2")s3<- vqsassess(sample3,pct=10,samsize=300,label="sample3")

#Compare SNV profile between three samples.snvcompare(samplelist=list(s1,s2,s3),ncol=1)

This plot shows that SNV positions are well-aligned between threesamples. In some cases, SNV plot can highlight sequencing errorconcentrating at a particular range in read alignment like theadditional sample added to the following example.

#Load sample 4.sample4<- system.file("extdata","mock.fasta",package="longreadvqs")s4<- vqsassess(sample4,pct=10,samsize=300,label="sample4")

#SNV profile of sample 4s4$snv

The plot above captures atypical SNVs towards the 3’ end of the sample 4alignment, which is potentially caused by sequencing error. We canaddress this issue using the “vqscustompct” function, which allows us toadjust the % cut-off for noise minimization at specific range(s) in thealignment. In the case of sample 4, we increase the % cut-off fromposition 972 onward, raising it from 10% to 30%.

#Use "vqscustompct" function to increase % cut-off after position 972 to 30%.s4_fix<- vqscustompct(sample4,pct=10,brkpos= c("1:971","972:982"),lspct= c(10,30),samsize=300,label="sample4")

#SNV profile of sample 4 after % cut-off adjustments4_fix$snv

Comparing VQS diversity across multiple samples

Once we have all samples prepared, we can combine them to not onlycompare within-sample diversity but also explore between-samplerelationships. Using the “vqscompare” function, noise-minimizeddown-sampled reads from all samples are pooled to identify 1) identicalhaplotypes and 2) genetically closely related haplotypes betweensamples. In this example, we combine the previously prepared outputs ofsamples 1 to 4 into the “vqscompare” function, allowing it to generate asummary comparative plot.

#List outputs from "vqsassess" or "vqscustompct" that we want to compare into "vqscompare" function.#Set the number of new OTU groups based on k-means clustering to 10 groups (kmeans.n = 10).set.seed(1234)comp<- vqscompare(samplelist=list(s1,s2,s3,s4_fix),lab_name="Sample",kmeans.n=10)

#The most important output of the "vqscompare" function is the summary plot.comp$summaryplot

The summary plot comprises four main subplots. First, the haplotype plotat the top left displays all unique haplotypes and their proportionsacross all samples. Second, the operational taxonomic unit (OTU) plot atthe bottom left groups genetically closely related haplotypes into newOTUs through k-means clustering on a multidimensional scale (MDS) of allsamples’ pairwise SNV distances. Third, the MDS plot at the top rightvisualizes the genetic relationships and clustering scheme of the fivelargest OTUs and haplotypes belonging to each OTU. Fourth, the MDS plotat the bottom right shows all classified OTUs.

In our example, samples 1 to 4 represent read alignments of theinfluenza A virus (IAV)’s M segment, gathered from patients in the sameward (Williams, et al., 2023).Based on our comparison, patients of samples 1 and 2 likely contractedthe same virus, as evidenced by the similarity in both VQS haplotypeprofiles and OTU profiles. Meanwhile, patients of samples 3 and 4 appearto have been infected with distinct sets of viruses, differing from eachother and from those in samples 1 and 2. Despite these distinctions, theMDS plots indicate that the genetic distances between all samples arenot significantly far apart. This suggests that these diversities maynot be adequately captured at the consensus gene level.

Citation

https://doi.org/10.21203/rs.3.rs-4637890/v1

About

Toolkit for Viral Quasispecies Comparison from Long-Read Sequencing performing variety of viral quasispecies diversity analyses based on long-read sequence alignment.

Releases

No releases published

Packages

No packages published

Languages

R100.0%

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

longreadvqs

Installation

Background

Individual sample preparation

Sequencing error/noise minimization and down-sampling

VQS data preparation for in-depth comparison

Comparing VQS diversity across multiple samples

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages

Uh oh!

Languages

Movatterモバイル変換

NakarinP/longreadvqs

Folders and files

Latest commit

History

Repository files navigation

longreadvqs

Installation

Background

Individual sample preparation

Sequencing error/noise minimization and down-sampling

VQS data preparation for in-depth comparison

Comparing VQS diversity across multiple samples

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages0

Uh oh!

Languages

Packages