
High-throughput sequencing (HTS) enables unprecedented resolution in transcript quantification, at the cost of magnifying the impact of technical noise. Consistently reducing random background noise to capture functionally meaningful biological signals remains challenging. Intrinsic sequencing variability introduces low-level expression variations that can obscure patterns in downstream analyses.

The noisyR package comprises an end-to-end pipeline for quantifying and removing technical noise from HTS datasets. The three main pipeline steps are [i] similarity calculation across samples, [ii] noise quantification, and [iii] noise removal. Each step can be finely tuned using hyperparameters; optimal, data-driven values for these parameters are also determined.

Preprint: https://www.biorxiv.org/content/10.1101/2021.01.17.427026v2

Workflow diagram of the noisyR pipeline

Similarity calculation across samples

For the sample-similarity calculation, two approaches are available:

The count matrix approach uses the original, un-normalised expression matrix, as provided after alignment and feature quantification. Each sample is processed individually; only the relative expressions across samples are compared. Relying on the hypothesis that the majority of genes are not differentially expressed (DE), most of the evaluations are expected to point towards a high similarity across samples. Choosing from a collection of >45 similarity metrics, users can select a measure to assess the localised consistency in expression across samples. A sliding window approach is used to compare the similarity of ranks or abundances for the selected features between samples. The window length is a hyperparameter, which can be user-defined or inferred from the data.
The transcript approach uses as input the alignment files derived from read-mappers (in BAM format). For each sample and each exon, the point-to-point similarity of expression across the transcript is calculated across samples in a pairwise all-versus-all comparison.
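
The sliding window idea behind the count matrix approach can be illustrated with a minimal Python sketch. It is not the noisyR implementation: the function names `pearson` and `window_similarity`, the `step` parameter, and the choice to average the Pearson correlation over all other samples are assumptions made for illustration only.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def window_similarity(sample, others, window_length, step=1):
    """Slide a window over the genes of `sample`, ranked by abundance.

    sample: list of counts, one per gene.
    others: list of count lists for the remaining samples (same gene order).
    Returns (mean abundance in window, mean similarity to the other samples)
    for each window position. Hypothetical helper, for illustration only.
    """
    order = sorted(range(len(sample)), key=lambda i: sample[i])
    points = []
    for start in range(0, len(order) - window_length + 1, step):
        idx = order[start:start + window_length]
        xs = [sample[i] for i in idx]
        sims = [pearson(xs, [other[i] for i in idx]) for other in others]
        points.append((sum(xs) / window_length, sum(sims) / len(sims)))
    return points

# Toy data: the second sample is an exact multiple of the first,
# so every window is perfectly correlated.
toy_points = window_similarity([1, 2, 3, 4, 5, 6], [[2, 4, 6, 8, 10, 12]], window_length=3)
```

Each returned pair is one point of an expression-similarity relation: windows at low abundance are expected to show lower similarity than windows at high abundance.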

The output formats for the two approaches correspond to their inputs: a denoised count matrix for the count matrix approach, and denoised BAM files for the transcript approach.

Main functions: calculate_expression_similarity_counts(), calculate_expression_similarity_transcript()

Supporting functions: get_methods_correlation_distance(), optimise_window_length(), calculate_expression_profile()

Input preparation functions: cast_matrix_to_numeric(), cast_gtf_to_genes()

Noise quantification

The noise quantification step uses the expression-similarity relation calculated in step i to determine the noise threshold, that is, the expression level below which a gene's expression is considered noisy.

For example, if a similarity threshold is used as input, then the corresponding expression from a (smoothed) expression-similarity line plot is selected as the noise threshold for each sample. The shape of the distribution can vary across experiments; we provide functionality for different thresholds and recommend choosing the one that results in the lowest variance in the noise thresholds across samples.
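
As a rough illustration of this step, the Python sketch below smooths an expression-similarity line with a moving average, takes the lowest abundance at which the smoothed similarity reaches the similarity threshold, and then picks the thresholding method with the lowest variance across samples. The helper names, the moving-average smoother, and the "first crossing" rule are assumptions for the example, not the package's actual procedure.

```python
from statistics import variance

def moving_average(values, k):
    """Centred moving average over up to 2*k+1 points, truncated at the ends."""
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - k), min(len(values), i + k + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

def noise_threshold(points, similarity_threshold, k=1):
    """points: (abundance, similarity) pairs sorted by abundance.

    Returns the lowest abundance whose smoothed similarity reaches the
    threshold, or None if it is never reached. Hypothetical helper.
    """
    smoothed = moving_average([s for _, s in points], k)
    for (abundance, _), s in zip(points, smoothed):
        if s >= similarity_threshold:
            return abundance
    return None

def most_consistent_method(thresholds_by_method):
    """Pick the method whose per-sample thresholds have the lowest variance."""
    return min(thresholds_by_method, key=lambda m: variance(thresholds_by_method[m]))

line = [(1, 0.1), (2, 0.2), (3, 0.9), (4, 0.95), (5, 0.97)]
raw_threshold = noise_threshold(line, 0.8, k=0)       # no smoothing
smooth_threshold = noise_threshold(line, 0.8, k=1)    # 3-point smoothing
best = most_consistent_method({"method_a": [10, 11, 9], "method_b": [5, 20, 1]})
```

In this toy example, smoothing moves the threshold to a higher abundance because the sharp rise in similarity is averaged with the low-similarity point before it.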

Options are also available for smoothing, or for summarising the observations in a box plot and selecting the minimum abundance for which the interquartile range (or median, or 5-95% range) is consistently above the similarity threshold. As a general rule (due to the number of observations), we recommend using smoothing with the count matrix approach, and the box plot representation with the transcript approach.
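
The box plot variant can be sketched in the same spirit. The Python example below groups the (abundance, similarity) observations into equal-sized abundance bins, and returns the lowest abundance from which the lower quartile of every bin stays at or above the similarity threshold. The binning scheme, the function name `boxplot_threshold`, and reading "consistently above" as "for this bin and all higher-abundance bins" are assumptions for illustration.

```python
from statistics import quantiles

def boxplot_threshold(points, similarity_threshold, n_bins=5):
    """points: (abundance, similarity) pairs, in any order.

    Observations are sorted by abundance and split into n_bins groups.
    Walking down from the highest-abundance bin, bins are accepted while
    their lower quartile is at or above the threshold; the lowest abundance
    of the last accepted bin is returned (None if no bin qualifies).
    Hypothetical helper, not part of noisyR.
    """
    pts = sorted(points)
    size = len(pts) // n_bins
    if size < 2:
        raise ValueError("need at least two observations per bin")
    bins = [pts[i * size:(i + 1) * size] for i in range(n_bins - 1)]
    bins.append(pts[(n_bins - 1) * size:])  # last bin takes any remainder
    threshold = None
    for group in reversed(bins):
        lower_quartile = quantiles([s for _, s in group], n=4, method="inclusive")[0]
        if lower_quartile < similarity_threshold:
            break
        threshold = group[0][0]
    return threshold

observations = [(1, 0.1), (2, 0.2), (3, 0.3), (4, 0.2), (5, 0.9),
                (6, 0.95), (7, 0.92), (8, 0.96), (9, 0.97), (10, 0.99)]
box_threshold = boxplot_threshold(observations, 0.8, n_bins=5)
```

Replacing the lower quartile with the median or the 5th percentile would give the median-based and 5-95% range variants mentioned above.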

Indicative plots of the Pearson correlation calculated on windows of increasing average abundance for the count matrix-based noise removal approach (left) and per exon for the transcript-based noise removal approach (right).

Main function: calculate_noise_threshold_base()

Supporting functions: get_methods_calculate_noise_threshold(), calculate_first_minimum_density()

Visualisation functions: plot_expression_similarity(), calculate_noise_threshold_method_statistics()

Noise removal

The third step uses the noise threshold calculated in step ii to remove noise from the count matrix or BAM files.

For the count matrix approach, genes whose expression is below the noise threshold in every sample are removed; the average noise threshold is then calculated and added to every remaining entry. This ensures that the fold-changes observed by downstream analyses are not biased by low expression, while still preserving the structure and relative expression levels in the data.
For the transcript approach, genes are removed from the BAM files if the expression of all their exons is below the noise thresholds for every sample. The removal is done at gene level to avoid scenarios that are not biologically possible.
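
The count matrix rule described above can be written out as a short Python sketch. The function name `denoise_counts` and the dictionary-based matrix layout are hypothetical and chosen only to keep the example self-contained; the package's own entry point for this step is remove_noise_from_matrix().

```python
def denoise_counts(counts, thresholds):
    """counts: dict mapping gene name -> list of counts, one per sample.
    thresholds: list of per-sample noise thresholds, in the same sample order.

    Genes below the noise threshold in every sample are dropped; the mean
    threshold is then added to every remaining entry. Illustrative helper only.
    """
    mean_threshold = sum(thresholds) / len(thresholds)
    denoised = {}
    for gene, row in counts.items():
        if all(c < t for c, t in zip(row, thresholds)):
            continue  # noisy in every sample: remove the gene
        denoised[gene] = [c + mean_threshold for c in row]
    return denoised

toy_counts = {"g1": [1, 2], "g2": [50, 3], "g3": [100, 200]}
toy_denoised = denoise_counts(toy_counts, thresholds=[10, 20])
```

Here g1 is dropped because it falls below the threshold in both samples, while g2 is kept because it exceeds the threshold in the first sample. Adding the mean threshold (15 in this toy case) to the remaining entries damps fold-changes driven by very low counts.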

Main functions: remove_noise_from_matrix(), remove_noise_from_bams()

Required packages

CRAN (use install.packages())

utils
grDevices
tibble
dplyr
magrittr
ggplot2
philentropy
doParallel
foreach

Bioconductor (use BiocManager::install())

preprocessCore
IRanges
GenomicRanges
Rsamtools