- Notifications
You must be signed in to change notification settings - Fork0
NakarinP/longreadvqs
Folders and files
| Name | Name | Last commit message | Last commit date | |
|---|---|---|---|---|
Repository files navigation
Toolkit for Viral Quasispecies Comparison from Long-Read Sequencing performing variety of viral quasispecies diversity analyses based on long-read sequence alignment. Main functions include 1) sequencing error/noise minimization and read sampling, 2) Single nucleotide variant (SNV) profiles comparison, and 3) viral quasispecies profiles comparison and visualization.
'longreadvqs' is a package available on CRAN and can be installed easily in R:
install.packages("longreadvqs")If you have a problem installingQSutils, a key dependency of this package, please do the following:
install.packages("devtools")library(devtools)install_git("https://github.com/VHIRHepatiques/QSutils")Viral quasispecies (VQS) refers to a genetically closely relatedpopulation of viral variants. Measurement of VQS diversity, especiallyin highly mutating RNA viruses, is vital for understanding theirmicro-evolution.Long-read sequencing facilitates retrievingfull-length sequences of specific genes or genomes from each variant,yet challenges like sequencing errors and low coverage depth hinderdiversity metric computation and comparison. Here, we presentmethodologies to overcome these challenges, enabling comparisons of VQSdiversity.
Sequencing errors create plenty of artificial single nucleotide variants(SNVs), leading to an overestimated number of singleton haplotypes(haplotype = group of identical reads, singleton haplotype = haplotypewith only a single read). Our package aims to minimize potentialsequencing errors and potential noises from rare mutations along longreads defined as SNV(s) or nucleotide base(s) with a frequency less thann % at each nucleotide position. Such problematic base(s) will be replacedwith the consensus or majority base of that particular position.
First, load the “longreadvqs” package and a FASTA file of example readalignment (sample1.fasta). To choose the % cut-off for noiseminimization, use the “pctopt” function that shows % singletonhaplotypes in the read alignment across different % cut-offs. Then, usethe “vqssub” function for noise minimization, in this case, we choose a10% cut-off that decreases singleton haplotypes to less than 10% of allreads. This function will give a summary of VQS diversity metricscalculated by the “QSutils” package (Gregori, et al.,2016) after noiseminimization.
library(longreadvqs)#Load sample 1.sample1<- system.file("extdata","sample1.fasta",package="longreadvqs")
#Check which % cut-off could effectively minimize noises by assessing % singleton haplotypes.library(ggplot2)x<- pctopt(sample1,pctsing=0,label="sample1")ggplot(x, aes(x=pct,y=pctsingleton))+ geom_line()+ geom_point()
#VQS diversity metrics of noise minimized (10% cut-off) read alignmentvqssub(sample1,pct=10,label="sample1")#> label method samplingwhen pct fulldepth depth haplotypes nsingleton#> 1 sample1 conbase after 10 311 311 47 17#> pctsingleton polymorph mutations shannon norm_shannon gini_simpson FAD#> 1 5.466238 6 116 2.917347 0.7577235 0.8819625 6.475025#> Mfe Pie Mfm Pim#> 1 0.002580953 0.002994924 6.79165e-09 0.001803544
To compare VQS diversity across samples, it’s important to standardizethe read alignment depth. Since sequencing results can vary, we performdown-sampling after minimizing noises with the “vqssub” function. In ourcase, for sample1, sample2, and sample3 with original depths of 311,316, and 655, respectively, we randomly down-sample them to a commondepth of 300.
#Load samples 2 and 3.sample2<- system.file("extdata","sample2.fasta",package="longreadvqs")sample3<- system.file("extdata","sample3.fasta",package="longreadvqs")#Noise minimization (10% cut-off) and down-sampling (depth of 300)a<- vqssub(sample1,pct=10,samsize=300,label="sample1")b<- vqssub(sample2,pct=10,samsize=300,label="sample2")c<- vqssub(sample3,pct=10,samsize=300,label="sample3")#Compare VQS diversity across three samples after noise minimization and down-sampling.rbind(a,b,c)#> label method samplingwhen pct fulldepth depth haplotypes nsingleton#> 1 sample1 conbase after 10 311 300 44 16#> 2 sample2 conbase after 10 316 300 56 24#> 3 sample3 conbase after 10 655 300 28 4#> pctsingleton polymorph mutations shannon norm_shannon gini_simpson FAD#> 1 5.333333 6 108 2.875905 0.7599792 0.8799554 5.626911#> 2 8.000000 7 135 3.126571 0.7767200 0.8977035 9.861366#> 3 1.333333 5 53 2.670116 0.8013062 0.8943813 1.845056#> Mfe Pie Mfm Pim#> 1 0.002548420 0.002974055 7.016045e-09 0.001782496#> 2 0.002457405 0.003201742 6.025138e-09 0.001891865#> 3 0.002147954 0.002440550 5.532911e-09 0.001636262
We can also repeatedly down-sample the alignments and check theconsistency of each VQS diversity metric at the different depths likethe following example.
#Randomly down-sample from depth of 655 to 300, and 100 (10 iterations for each sample size).set.seed(123)c_full<- vqssub(sample3,pct=10,label="full_depth")#non-sampled alignmentc_300<- vqsresub(sample3,iter=10,pct=10,samsize=300,label="depth_300")#down-sample to depth of 300c_100<- vqsresub(sample3,iter=10,pct=10,samsize=100,label="depth_100")#down-sample to depth of 100all_c<- rbind(c_full,c_300,c_100)#combine data#Load packages for visualization.library(cowplot)all_c$depth<- as.character(all_c$depth)depthorder<- c("655","300","100")#Plot box-plots of three VQS metrics variation through different sampling depths.haplotypes<- ggplot(all_c, aes(x=factor(depth,level=depthorder),y=haplotypes,group= interaction(depth,label),fill=label))+ geom_boxplot()+ xlab("depth")+ theme(legend.position="none")shannon<- ggplot(all_c, aes(x=factor(depth,level=depthorder),y=shannon,group= interaction(depth,label),fill=label))+ geom_boxplot()+ xlab("depth")+ theme(legend.position="none")gini_simpson<- ggplot(all_c, aes(x=factor(depth,level=depthorder),y=gini_simpson,group= interaction(depth,label),fill=label))+ geom_boxplot()+ xlab("depth")+ theme(legend.position="none")
plot_grid(haplotypes,shannon,gini_simpson,nrow=1)
From this plot, the number of haplotypes obviously decreases as thedepth decreases, while Shannon entropy and the Gini-Simpson index remainrelatively stable with high variation at a depth of 100.
The previous “vqssub” function just summarizes key VQS diversity metricsafter noise minimization and down-sampling. However, these metrics donot provide us with comprehensive data for further comparative analyses,such as SNV positions and full read alignment after down-sampling. The“vqsassess” function will generate a complete VQS profile for eachsample, and the outputs can be used for other functions.
In this example, we repeat noise minimization and down-sampling forsamples 1-3 but apply the “vqsassess” function instead of the “vqssub”function. We then compare the SNV profiles between these samples usingthe “snvcompare” function.
#Generate complete VQS data for further comparison for each sample.s1<- vqsassess(sample1,pct=10,samsize=300,label="sample1")s2<- vqsassess(sample2,pct=10,samsize=300,label="sample2")s3<- vqsassess(sample3,pct=10,samsize=300,label="sample3")
#Compare SNV profile between three samples.snvcompare(samplelist=list(s1,s2,s3),ncol=1)
This plot shows that SNV positions are well-aligned between threesamples. In some cases, SNV plot can highlight sequencing errorconcentrating at a particular range in read alignment like theadditional sample added to the following example.
#Load sample 4.sample4<- system.file("extdata","mock.fasta",package="longreadvqs")s4<- vqsassess(sample4,pct=10,samsize=300,label="sample4")
#SNV profile of sample 4s4$snv
The plot above captures atypical SNVs towards the 3’ end of the sample 4alignment, which is potentially caused by sequencing error. We canaddress this issue using the “vqscustompct” function, which allows us toadjust the % cut-off for noise minimization at specific range(s) in thealignment. In the case of sample 4, we increase the % cut-off fromposition 972 onward, raising it from 10% to 30%.
#Use "vqscustompct" function to increase % cut-off after position 972 to 30%.s4_fix<- vqscustompct(sample4,pct=10,brkpos= c("1:971","972:982"),lspct= c(10,30),samsize=300,label="sample4")
#SNV profile of sample 4 after % cut-off adjustments4_fix$snv
Once we have all samples prepared, we can combine them to not onlycompare within-sample diversity but also explore between-samplerelationships. Using the “vqscompare” function, noise-minimizeddown-sampled reads from all samples are pooled to identify 1) identicalhaplotypes and 2) genetically closely related haplotypes betweensamples. In this example, we combine the previously prepared outputs ofsamples 1 to 4 into the “vqscompare” function, allowing it to generate asummary comparative plot.
#List outputs from "vqsassess" or "vqscustompct" that we want to compare into "vqscompare" function.#Set the number of new OTU groups based on k-means clustering to 10 groups (kmeans.n = 10).set.seed(1234)comp<- vqscompare(samplelist=list(s1,s2,s3,s4_fix),lab_name="Sample",kmeans.n=10)
#The most important output of the "vqscompare" function is the summary plot.comp$summaryplot
The summary plot comprises four main subplots. First, the haplotype plotat the top left displays all unique haplotypes and their proportionsacross all samples. Second, the operational taxonomic unit (OTU) plot atthe bottom left groups genetically closely related haplotypes into newOTUs through k-means clustering on a multidimensional scale (MDS) of allsamples’ pairwise SNV distances. Third, the MDS plot at the top rightvisualizes the genetic relationships and clustering scheme of the fivelargest OTUs and haplotypes belonging to each OTU. Fourth, the MDS plotat the bottom right shows all classified OTUs.
In our example, samples 1 to 4 represent read alignments of theinfluenza A virus (IAV)’s M segment, gathered from patients in the sameward (Williams, et al., 2023).Based on our comparison, patients of samples 1 and 2 likely contractedthe same virus, as evidenced by the similarity in both VQS haplotypeprofiles and OTU profiles. Meanwhile, patients of samples 3 and 4 appearto have been infected with distinct sets of viruses, differing from eachother and from those in samples 1 and 2. Despite these distinctions, theMDS plots indicate that the genetic distances between all samples arenot significantly far apart. This suggests that these diversities maynot be adequately captured at the consensus gene level.
About
Toolkit for Viral Quasispecies Comparison from Long-Read Sequencing performing variety of viral quasispecies diversity analyses based on long-read sequence alignment.
Resources
Uh oh!
There was an error while loading.Please reload this page.
Stars
Watchers
Forks
Releases
Packages0
Uh oh!
There was an error while loading.Please reload this page.






