High-throughput sequencing enables unprecedented resolution in transcript quantification, at the cost of magnifying the impact of technical noise. Consistently reducing random background noise to capture functionally meaningful biological signals remains challenging. Intrinsic sequencing variability introduces low-level expression variations that can obscure patterns in downstream analyses.
The noisyR package comprises an end-to-end pipeline for quantifying and removing technical noise from HTS datasets. The three main pipeline steps are [i] similarity calculation across samples, [ii] noise quantification, and [iii] noise removal. Each step can be finely tuned using hyperparameters; optimal, data-driven values for these parameters are also determined.
Preprint: https://www.biorxiv.org/content/10.1101/2021.01.17.427026v2

Workflow diagram of the noisyR pipeline
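For the sample-similarity calculation, two approaches are available: the count matrix approach, which takes a count matrix as input, and the transcript approach, which takes BAM files as input.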
The output formats for the two approaches correspond to their inputs: a denoised count matrix for the count matrix approach, and denoised BAM files for the transcript approach.
Main functions: calculate_expression_similarity_counts(), calculate_expression_similarity_transcript()
Supporting functions: get_methods_correlation_distance(), optimise_window_length(), calculate_expression_profile()
Input preparation functions: cast_matrix_to_numeric(), cast_gtf_to_genes()
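The idea behind the count matrix approach can be illustrated with a small, language-agnostic sketch in Python. It is not the noisyR implementation and does not use the noisyR API: the function names `pearson` and `window_similarity`, the `window` and `step` parameters, and the choice to compare one sample against a single other sample are all assumptions made for illustration. Genes are sorted by abundance, and the Pearson correlation is computed on sliding windows of increasing average abundance.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def window_similarity(sample, other, window=4, step=2):
    """Sort genes by abundance in `sample`, then correlate each sliding window
    of `sample` with the same genes in `other`.

    Returns a list of (mean abundance in window, correlation) pairs.
    `window` and `step` are hypothetical tuning parameters for this sketch.
    """
    order = sorted(range(len(sample)), key=lambda i: sample[i])
    points = []
    for start in range(0, len(order) - window + 1, step):
        idx = order[start:start + window]
        x = [sample[i] for i in idx]
        y = [other[i] for i in idx]
        points.append((sum(x) / window, pearson(x, y)))
    return points


# Toy data: low-abundance genes disagree between samples, high-abundance genes agree.
sample_a = [0, 1, 2, 3, 10, 20, 30, 40]
sample_b = [3, 0, 2, 1, 11, 19, 31, 42]
points = window_similarity(sample_a, sample_b, window=4, step=4)
```

On this toy input the low-abundance window shows a negative correlation while the high-abundance window is strongly correlated, which is the pattern the noise quantification step relies on.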
The noise quantification step uses the expression-similarity relation calculated in step i to determine the noise threshold, i.e. the level below which gene expression is considered noisy.
For example, if a similarity threshold is used as input, then the corresponding expression from a (smoothed) expression-similarity line plot is selected as the noise threshold for each sample. The shape of the distribution can vary across experiments; we provide functionality for different thresholds and recommend choosing the one that results in the lowest variance in the noise thresholds across samples.
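A minimal sketch of this selection, written in Python rather than with the noisyR API, might look as follows. The moving-average smoother, the rule of taking the first abundance at which the smoothed similarity reaches the threshold, the default of 0.25, and the method names `"method_a"` and `"method_b"` are all assumptions for illustration; the package may use different smoothing and crossing rules.

```python
from statistics import pvariance


def moving_average(values, width=3):
    """Centred moving average; the window shrinks at both ends."""
    half = width // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half):i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def line_plot_threshold(abundance, similarity, similarity_threshold=0.25, width=3):
    """Return the first abundance at which the smoothed similarity reaches
    the similarity threshold, or None if it never does."""
    smoothed = moving_average(similarity, width)
    for a, s in zip(abundance, smoothed):
        if s >= similarity_threshold:
            return a
    return None


def pick_most_stable(thresholds_by_method):
    """Given {method name: [noise threshold per sample]}, return the method
    whose thresholds have the lowest variance across samples."""
    return min(thresholds_by_method, key=lambda m: pvariance(thresholds_by_method[m]))


# One sample: similarity rises with abundance.
threshold = line_plot_threshold([1, 2, 3, 4, 5], [0.0, 0.1, 0.2, 0.6, 0.9])

# Hypothetical per-sample thresholds produced by two candidate methods.
best = pick_most_stable({"method_a": [3, 9, 15], "method_b": [5, 6, 5]})
```

Here `pick_most_stable` mirrors the recommendation above: among candidate methods, prefer the one whose per-sample noise thresholds vary least.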
Options are also available for smoothing, or for summarising the observations in a box plot and selecting the minimum abundance for which the interquartile range (or median, or 5-95% range) is consistently above the similarity threshold. As a general rule (due to the number of observations), we recommend using smoothing with the count matrix approach, and the box plot representation with the transcript approach.
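The box plot variant can be sketched in the same spirit. This Python example is an illustration under stated assumptions, not the package's code: the function name `boxplot_threshold` and the input layout are hypothetical, and "consistently above" is interpreted here as the lower quartile staying above the similarity threshold for the chosen abundance bin and every higher one.

```python
from statistics import quantiles


def boxplot_threshold(bins, similarity_threshold=0.25):
    """bins: list of (abundance, [similarity observations]) sorted by
    increasing abundance, each with at least two observations.

    Walk down from the most abundant bin and keep the smallest abundance
    from which every bin's lower quartile stays above the threshold.
    Returns None if even the most abundant bin fails.
    """
    chosen = None
    for abundance, sims in reversed(bins):
        lower_quartile = quantiles(sims, n=4)[0]
        if lower_quartile <= similarity_threshold:
            break
        chosen = abundance
    return chosen


bins = [
    (1, [0.0, 0.1, 0.2, 0.3]),
    (2, [0.3, 0.4, 0.5, 0.6]),   # passes on its own ...
    (3, [0.1, 0.2, 0.5, 0.6]),   # ... but this higher bin fails
    (4, [0.5, 0.6, 0.7, 0.8]),
    (5, [0.7, 0.8, 0.9, 0.95]),
]
noise_threshold = boxplot_threshold(bins)
```

The bin at abundance 2 is not selected even though its interquartile range clears the threshold, because a more abundant bin does not; the median or 5-95% range variants would swap in a different summary statistic for `lower_quartile`.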

Indicative plots of the Pearson correlation calculated on windows of increasing average abundance for the count matrix-based noise removal approach (left) and per exon for the transcript-based noise removal approach (right).
Main function: calculate_noise_threshold_base()
Supporting functions: get_methods_calculate_noise_threshold(), calculate_first_minimum_density()
Visualisation functions: plot_expression_similarity(), calculate_noise_threshold_method_statistics()
The third step uses the noise threshold calculated in step ii to remove noise from the count matrix or BAM files.
Main functions: remove_noise_from_matrix(), remove_noise_from_bams()
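For the count matrix case, one plausible removal strategy is sketched below in Python. The text above does not spell out the exact rule, so this is an assumption for illustration: genes whose counts fall below the noise threshold in every sample are dropped, and the threshold is then added to the remaining counts so that low values are lifted off the noise floor. The function name `remove_noise` and the dictionary layout are hypothetical and unrelated to the noisyR API.

```python
def remove_noise(matrix, noise_threshold):
    """matrix: {gene: [count per sample]}.

    Assumed rule for this sketch: drop genes that are below the noise
    threshold in every sample, then add the threshold to all remaining counts.
    """
    kept = {}
    for gene, counts in matrix.items():
        if all(c < noise_threshold for c in counts):
            continue  # noisy in every sample: remove the gene
        kept[gene] = [c + noise_threshold for c in counts]
    return kept


counts = {
    "g1": [0, 2, 1],        # below the threshold everywhere
    "g2": [50, 3, 70],      # noisy in one sample only, so it is kept
    "g3": [100, 120, 90],
}
denoised = remove_noise(counts, noise_threshold=10)
```

A single threshold is used here for simplicity; with per-sample thresholds, as produced in step ii, the same logic would be applied column by column.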