| Title: | Bottom-Up Proteomics and LiP-MS Quality Control and Data Analysis Tools |
| Version: | 0.9.1 |
| Description: | Useful functions and workflows for proteomics quality control and data analysis of both limited proteolysis-coupled mass spectrometry (LiP-MS) (Feng et al. (2014) <doi:10.1038/nbt.2999>) and regular bottom-up proteomics experiments. Data generated with search tools such as 'Spectronaut', 'MaxQuant' and 'Proteome Discoverer' can be used easily thanks to the flexibility of the functions. |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| LazyData: | true |
| Imports: | rlang, dplyr, stringr, magrittr, data.table, janitor, progress, purrr, tidyr, ggplot2, forcats, tibble, plotly, ggrepel, utils, grDevices, curl, readr, lifecycle, httr, methods, R.utils, stats |
| RoxygenNote: | 7.3.2 |
| Suggests: | testthat, covr, knitr, rmarkdown, shiny, r3dmol, proDA, limma, dendextend, pheatmap, heatmaply, furrr, future, parallel, seriation, drc, igraph, stringi, STRINGdb, iq, scales, farver, ggforce, xml2, jsonlite |
| Depends: | R (≥ 4.0) |
| URL: | https://github.com/jpquast/protti, https://jpquast.github.io/protti/ |
| BugReports: | https://github.com/jpquast/protti/issues |
| VignetteBuilder: | knitr |
| NeedsCompilation: | no |
| Packaged: | 2024-10-21 22:14:07 UTC; jan-philippquast |
| Author: | Jan-Philipp Quast |
| Maintainer: | Jan-Philipp Quast <quast@imsb.biol.ethz.ch> |
| Repository: | CRAN |
| Date/Publication: | 2024-10-21 22:40:02 UTC |
Analyse protein interaction network for significant hits
Description
The STRING database provides a resource for known and predicted protein-protein interactions. The types of interaction include direct (physical) and indirect (functional) interactions. Through the R package STRINGdb this resource is provided to R users. This function provides a convenient wrapper for STRINGdb functions that allows easy use within the protti pipeline.
Usage
analyse_functional_network(
  data,
  protein_id,
  string_id,
  organism_id,
  version = "12.0",
  score_threshold = 900,
  binds_treatment = NULL,
  halo_color = NULL,
  plot = TRUE
)
Arguments
data | a data frame that contains significantly changing proteins (STRINGdb is only able to plot 400 proteins at a time, so do not provide more for network plots). Information about treatment binding can be provided and will be displayed as colorful halos around the proteins in the network. |
protein_id | a character column in the |
string_id | a character column in the |
organism_id | a numeric value specifying an organism ID (NCBI taxon-ID). Examples: H. sapiens: 9606, S. cerevisiae: 4932, E. coli: 511145. |
version | a character value that specifies the version of STRINGdb to be used. Default is 12.0. |
score_threshold | a numeric value specifying the interaction score threshold, which, based on STRING, has to be between 0 and 1000. A score closer to 1000 indicates a higher confidence for the interaction. The default value is 900. |
binds_treatment | a logical column in the |
halo_color | optional, character value with a color hex-code. This is the color of the halo of proteins that bind the treatment. |
plot | a logical that indicates whether the result should be plotted or returned as a table. |
Value
A network plot displaying interactions of the provided proteins. If binds_treatment was provided, halos around the proteins show which proteins interact with the treatment. If plot = FALSE, a data frame with interaction information is returned.
Examples
# Create example data
data <- data.frame(
  uniprot_id = c(
    "P0A7R1", "P02359", "P60624",
    "P0A7M2", "P0A7X3", "P0AGD3"
  ),
  xref_string = c(
    "511145.b4203;", "511145.b3341;", "511145.b3309;",
    "511145.b3637;", "511145.b3230;", "511145.b1656;"
  ),
  is_known = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)
)
# Perform network analysis
network <- analyse_functional_network(
  data,
  protein_id = uniprot_id,
  string_id = xref_string,
  organism_id = 511145,
  binds_treatment = is_known,
  plot = TRUE
)
network
Perform ANOVA
Description
Performs an ANOVA statistical test
Usage
anova_protti(data, grouping, condition, mean_ratio, sd, n)
Arguments
data | a data frame containing at least the input variables. |
grouping | a character column in the |
condition | a character or numeric column in the |
mean_ratio | a numeric column in the |
sd | a numeric column in the |
n | a numeric column in the |
Value
A data frame that contains the within group error (ms_group) and the between group error (ms_error), F statistic and p-values.
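As a rough illustration of how a one-way ANOVA can be computed from summary statistics alone (group means, standard deviations and replicate numbers), the base R sketch below derives the between-group and within-group mean squares, the F statistic and the p-value. The helper anova_from_summary and the names of its return values are hypothetical and not part of protti; how they map onto the ms_group and ms_error columns is not asserted here.

```r
# Hypothetical helper (not part of protti): one-way ANOVA from summary statistics.
anova_from_summary <- function(mean, sd, n) {
  k <- length(mean) # number of groups
  grand_mean <- sum(n * mean) / sum(n) # replicate-weighted overall mean
  # Between-group mean square: spread of group means around the grand mean
  ms_between <- sum(n * (mean - grand_mean)^2) / (k - 1)
  # Within-group mean square: pooled variance of the groups
  ms_within <- sum((n - 1) * sd^2) / (sum(n) - k)
  f_statistic <- ms_between / ms_within
  p_value <- stats::pf(
    f_statistic,
    df1 = k - 1,
    df2 = sum(n) - k,
    lower.tail = FALSE
  )
  list(
    ms_between = ms_between,
    ms_within = ms_within,
    f_statistic = f_statistic,
    p_value = p_value
  )
}

# Summary statistics for one precursor measured in three conditions
result_a <- anova_from_summary(
  mean = c(10, 12, 20),
  sd = c(2, 1, 1.5),
  n = c(4, 4, 4)
)
result_a
```

Working from summary statistics is convenient when only means, standard deviations and replicate counts are available, which is the kind of input the arguments above describe.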
Examples
data <- data.frame(
  precursor = c("A", "A", "A", "B", "B", "B"),
  condition = c("C1", "C2", "C3", "C1", "C2", "C3"),
  mean = c(10, 12, 20, 11, 12, 8),
  sd = c(2, 1, 1.5, 1, 2, 4),
  n = c(4, 4, 4, 4, 4, 4)
)
anova_protti(
  data,
  grouping = precursor,
  condition = condition,
  mean_ratio = mean,
  sd = sd,
  n = n
)
Assignment of missingness types
Description
The type of missingness (missing at random, missing not at random) is assigned based on the comparison of a reference condition and every other condition.
Usage
assign_missingness(
  data,
  sample,
  condition,
  grouping,
  intensity,
  ref_condition = "all",
  completeness_MAR = 0.7,
  completeness_MNAR = 0.2,
  retain_columns = NULL
)
Arguments
data | a data frame containing at least the input variables. |
sample | a character column in the |
condition | a character or numeric column in the |
grouping | a character column in the |
intensity | a numeric column in the |
ref_condition | a character vector providing the condition that is used as a reference for missingness determination. Instead of providing one reference condition, "all" can be supplied, which will create all pairwise condition pairs. By default |
completeness_MAR | a numeric value that specifies the minimal degree of data completeness to be considered as MAR. Value has to be between 0 and 1, default is 0.7. It is multiplied with the number of replicates and then adjusted downward. The resulting number is the minimal number of observations for each condition to be considered as MAR. This number is always at least 1. |
completeness_MNAR | a numeric value that specifies the maximal degree of data completeness to be considered as MNAR. Value has to be between 0 and 1, default is 0.20. It is multiplied with the number of replicates and then adjusted downward. The resulting number is the maximal number of observations for one condition to be considered as MNAR when the other condition is complete. |
retain_columns | a vector that indicates columns that should be retained from the input data frame. Default is not retaining additional columns. |
Value
A data frame that contains the reference condition paired with each treatment condition. The comparison column contains the comparison name for the specific treatment/reference pair. The missingness column reports the type of missingness.
"complete": No missing values for every replicate of this reference/treatment pair for the specific grouping variable.
"MNAR": Missing not at random. All replicates of either the reference or treatment condition have missing values for the specific grouping variable.
"MAR": Missing at random. At least n-1 replicates have missing values for the reference/treatment pair for the specific grouping variable.
NA: The comparison is not complete enough to fall into any other category. It will not be imputed if imputation is performed. For statistical significance testing these comparisons are filtered out after the test and prior to p-value adjustment. This can be prevented by setting filter_NA_missingness = FALSE in the calculate_diff_abundance() function.
The type of missingness has an influence on the way values are imputed if imputation is performed subsequently using the impute() function. How each type of missingness is specifically imputed can be found in the function description. The type of missingness assigned to a comparison does not have any influence on the statistical test in the calculate_diff_abundance() function.
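The completeness arguments translate into observation counts as described in the Arguments section. The following base R sketch shows that arithmetic for a given number of replicates. It assumes that "adjusted downward" means rounding down with floor(); the helper missingness_thresholds is hypothetical and not part of protti.

```r
# Hypothetical helper (not part of protti). Assumption: "adjusted downward"
# is interpreted as floor().
missingness_thresholds <- function(n_replicates,
                                   completeness_MAR = 0.7,
                                   completeness_MNAR = 0.2) {
  list(
    # Minimal number of observations per condition for MAR, at least 1
    min_obs_MAR = max(floor(completeness_MAR * n_replicates), 1),
    # Maximal number of observations in one condition for MNAR
    max_obs_MNAR = floor(completeness_MNAR * n_replicates)
  )
}

# Four replicates with the default completeness settings
thresholds <- missingness_thresholds(n_replicates = 4)
thresholds
```

Under these assumptions, four replicates with the defaults require at least two observations per condition for MAR and allow no observations in the incomplete condition for MNAR.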
Examples
set.seed(123) # Makes example reproducible
# Create example data
data <- create_synthetic_data(
  n_proteins = 10,
  frac_change = 0.5,
  n_replicates = 4,
  n_conditions = 2,
  method = "effect_random",
  additional_metadata = FALSE
)
head(data, n = 24)
# Assign missingness information
data_missing <- assign_missingness(
  data,
  sample = sample,
  condition = condition,
  grouping = peptide,
  intensity = peptide_intensity_missing,
  ref_condition = "all",
  retain_columns = c(protein)
)
head(data_missing, n = 24)
Assign peptide type
Description
Based on the preceding and C-terminal amino acid, the peptide type of a given peptide is assigned. Peptides with preceding and C-terminal lysine or arginine are considered fully-tryptic. If a peptide is located at the N- or C-terminus of a protein and otherwise fulfills the criterion to be fully-tryptic, it is also considered fully-tryptic. Peptides that only fulfill the criterion on one terminus are semi-tryptic peptides. Lastly, peptides that do not fulfill the criterion at either terminus are non-tryptic peptides.
Usage
assign_peptide_type(
  data,
  aa_before = aa_before,
  last_aa = last_aa,
  aa_after = aa_after
)
Arguments
data | a data frame containing at least information about the preceding and C-terminal amino acids of peptides. |
aa_before | a character column in the |
last_aa | a character column in the |
aa_after | a character column in the |
Value
A data frame that contains the input data and an additional column with the peptide type information.
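The classification rule from the Description can be sketched in base R as follows. This is an illustration only, not the package implementation: the helper classify_peptide_type is hypothetical, and it assumes that a protein terminus is encoded as NA in aa_before or aa_after.

```r
# Hypothetical helper (not part of protti). Assumption: NA in aa_before or
# aa_after marks a peptide at the protein N- or C-terminus.
classify_peptide_type <- function(aa_before, last_aa, aa_after) {
  tryptic_residues <- c("K", "R")
  # N-terminal side: preceded by K/R or located at the protein N-terminus
  n_term_ok <- is.na(aa_before) | aa_before %in% tryptic_residues
  # C-terminal side: ends with K/R or located at the protein C-terminus
  c_term_ok <- is.na(aa_after) | last_aa %in% tryptic_residues
  # 0, 1 or 2 fulfilled termini map onto the three peptide types
  n_fulfilled <- n_term_ok + c_term_ok
  c("non-tryptic", "semi-tryptic", "fully-tryptic")[n_fulfilled + 1]
}

peptide_types <- classify_peptide_type(
  aa_before = c("K", "S", "T", NA),
  last_aa = c("R", "K", "Y", "K"),
  aa_after = c("T", "R", "T", "A")
)
peptide_types
```

Counting the fulfilled termini keeps the three cases in one lookup instead of nested conditions.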
Examples
data <- data.frame(
  aa_before = c("K", "S", "T"),
  last_aa = c("R", "K", "Y"),
  aa_after = c("T", "R", "T")
)
assign_peptide_type(data, aa_before, last_aa, aa_after)
Barcode plot
Description
Plots a "barcode plot": a vertical line for each identified peptide. Peptides can be colored based on an additional variable. Differential abundance can also be displayed.
Usage
barcode_plot(
  data,
  start_position,
  end_position,
  protein_length,
  coverage = NULL,
  colouring = NULL,
  fill_colour_gradient = protti::mako_colours,
  fill_colour_discrete = c("#999999", protti::protti_colours),
  protein_id = NULL,
  facet = NULL,
  facet_n_col = 4,
  cutoffs = NULL
)
Arguments
data | a data frame containing differential abundance, start and end peptide or precursor positions and protein length. |
start_position | a numeric column in the data frame containing the start positions for each peptide or precursor. |
end_position | a numeric column in the data frame containing the end positions for each peptide or precursor. |
protein_length | a numeric column in the data frame containing the length of the protein. |
coverage | optional, numeric column in the data frame containing coverage in percent. Will appear in the title of the barcode if provided. |
colouring | optional, column in the data frame containing information by which peptides or precursors should be colored. |
fill_colour_gradient | a vector that contains colours that should be used to create a colour gradient for the barcode plot bars if the |
fill_colour_discrete | a vector that contains colours that should be used to fill the barcode plot bars if the |
protein_id | optional, column in the data frame containing protein identifiers. Required if only one proteinshould be plotted and the data frame contains only information for this protein. |
facet | optional, column in the data frame containing information by which data should be faceted. This can be protein identifiers. Only 20 proteins are plotted at a time, the rest is ignored. If more should be plotted, a mapper over a subsetted data frame should be created. |
facet_n_col | a numeric value that specifies the number of columns the faceted plot should have if a column name is provided to facet. The default is 4. |
cutoffs | optional argument specifying the log2 fold change and significance cutoffs used for highlighting peptides. If this argument is provided, colouring information will be overwritten with peptides that fulfill this condition. The cutoff should be provided in a vector of the form c(diff = 2, pval = 0.05). The name of the cutoff should reflect the column name that contains this information (log2 fold changes, p-values or adjusted p-values). |
Value
A barcode plot is returned.
Examples
data <- data.frame(
  start = c(5, 40, 55, 130, 181, 195),
  end = c(11, 51, 60, 145, 187, 200),
  length = rep(200, 6),
  pg_protein_accessions = rep("Protein 1", 6),
  diff = c(1, 2, 5, 2, 1, 1),
  pval = c(0.1, 0.01, 0.01, 0.2, 0.2, 0.01)
)
barcode_plot(
  data,
  start_position = start,
  end_position = end,
  protein_length = length,
  facet = pg_protein_accessions,
  cutoffs = c(diff = 2, pval = 0.05)
)
Calculate scores for each amino acid position in a protein sequence
Description
Calculate a score for each amino acid position in a protein sequence based on the product of the -log10(adjusted p-value) and the absolute log2(fold change) per peptide covering this amino acid. In detail, all the peptides are aligned along the sequence of the corresponding protein, and the average score per amino acid position is computed. In a limited proteolysis coupled to mass spectrometry (LiP-MS) experiment, the score makes it possible to prioritize and narrow down structurally affected regions.
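The scoring rule described above can be sketched in base R for a single protein. This is a minimal illustration with made-up values, not the package implementation; the object names peptides and aa_scores are hypothetical, and returning NA for uncovered positions is an assumption of this sketch.

```r
# Made-up peptide-level results for one protein
peptides <- data.frame(
  diff = c(2, -3, 1),
  adj_pval = c(0.001, 0.01, 0.2),
  start = c(1, 3, 5),
  end = c(6, 8, 10)
)

# Peptide score: -log10(adjusted p-value) times absolute log2 fold change
peptides$score <- -log10(peptides$adj_pval) * abs(peptides$diff)

# Average the scores of all peptides covering each amino acid position
positions <- seq(min(peptides$start), max(peptides$end))
aa_scores <- data.frame(
  position = positions,
  score = vapply(positions, function(pos) {
    covering <- peptides$start <= pos & peptides$end >= pos
    # Assumption of this sketch: uncovered positions get NA
    if (any(covering)) mean(peptides$score[covering]) else NA_real_
  }, numeric(1))
)
aa_scores
```

Positions covered by several peptides receive the mean of their scores, so a single strongly changing peptide is diluted where weakly changing peptides overlap it.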
Usage
calculate_aa_scores(
  data,
  protein,
  diff = diff,
  adj_pval = adj_pval,
  start_position,
  end_position,
  retain_columns = NULL
)
Arguments
data | a data frame containing at least the input columns. |
protein | a character column in the data frame containing the protein identifier or name. |
diff | a numeric column in the |
adj_pval | a numeric column in the |
start_position | a numeric column |
end_position | a numeric column in the data frame containing the end position of a peptide or precursor. |
retain_columns | a vector indicating if certain columns should be retained from the input data frame. Default is not retaining additional columns. |
Value
A data frame that contains the aggregated scores per amino acid position, which makes it possible to draw fingerprints for each individual protein.
Author(s)
Patrick Stalder
Examples
data <- data.frame(
  pg_protein_accessions = c(rep("protein_1", 10)),
  diff = c(2, -3, 1, 2, 3, -3, 5, 1, -0.5, 2),
  adj_pval = c(0.001, 0.01, 0.2, 0.05, 0.002, 0.5, 0.4, 0.7, 0.001, 0.02),
  start = c(1, 3, 5, 10, 15, 25, 28, 30, 41, 51),
  end = c(6, 8, 10, 16, 23, 35, 35, 35, 48, 55)
)
calculate_aa_scores(
  data,
  protein = pg_protein_accessions,
  diff = diff,
  adj_pval = adj_pval,
  start_position = start,
  end_position = end
)
Calculate differential abundance between conditions
Description
Performs differential abundance calculations and statistical hypothesis tests on data frames with protein, peptide or precursor data. Different methods for statistical testing are available.
Usage
calculate_diff_abundance(
  data,
  sample,
  condition,
  grouping,
  intensity_log2,
  missingness = missingness,
  comparison = comparison,
  mean = NULL,
  sd = NULL,
  n_samples = NULL,
  ref_condition = "all",
  filter_NA_missingness = TRUE,
  method = c("moderated_t-test", "t-test", "t-test_mean_sd", "proDA"),
  p_adj_method = "BH",
  retain_columns = NULL
)
Arguments
data | a data frame containing at least the input variables that are required for the selected method. Ideally the output of |
sample | a character column in the |
condition | a character or numeric column in the |
grouping | a character column in the |
intensity_log2 | a numeric column in the |
missingness | a character column in the |
comparison | a character column in the |
mean | a numeric column in the |
sd | a numeric column in the |
n_samples | a numeric column in the |
ref_condition | optional, character value providing the condition that is used as a reference for differential abundance calculation. Only required for |
filter_NA_missingness | a logical value, default is |
method | a character value, specifies the method used for statistical hypothesis testing. Methods include Welch test ( |
p_adj_method | a character value, specifies the p-value correction method. Possible methods are c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none"). Default method is |
retain_columns | a vector indicating if certain columns should be retained from the input data frame. Default is not retaining additional columns. |
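The method names listed for p_adj_method match those accepted by the base R function stats::p.adjust(). The sketch below, with made-up p-values, shows a Benjamini-Hochberg adjustment in which NA values are left out of the correction. Whether the package calls p.adjust() internally is not stated in this document and is an assumption here.

```r
# Made-up p-values; NA stands for a comparison without a test result
pval <- c(0.01, 0.04, NA, 0.03)

# By default p.adjust() only counts the non-NA p-values
# when computing the correction
adj_pval <- stats::p.adjust(pval, method = "BH")
adj_pval
```

Because the NA entry is excluded, the correction is based on three tests rather than four.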
Value
A data frame that contains differential abundances (diff), p-values (pval) and adjusted p-values (adj_pval) for each protein, peptide or precursor (depending on the grouping variable) and the associated treatment/reference pair. Depending on the method the data frame contains additional columns:
"t-test": The std_error column contains the standard error of the differential abundances. n_obs contains the number of observations for the specific protein, peptide or precursor (depending on the grouping variable) and the associated treatment/reference pair.
"t-test_mean_sd": Columns labeled as control refer to the second condition of the comparison pairs. Treated refers to the first condition. The mean_control and mean_treated columns contain the means for the reference and treatment condition, respectively. The sd_control and sd_treated columns contain the standard deviations for the reference and treatment condition, respectively. The n_control and n_treated columns contain the numbers of samples for the reference and treatment condition, respectively. The std_error column contains the standard error of the differential abundances. t_statistic contains the t-statistic for the t-test.
"moderated_t-test": CI_2.5 and CI_97.5 contain the 2.5% and 97.5% confidence interval borders for differential abundances. avg_abundance contains average abundances for treatment/reference pairs (mean of the two group means). t_statistic contains the t-statistic for the t-test. B: The B-statistic is the log-odds that the protein, peptide or precursor (depending on grouping) has a differential abundance between the two groups. Suppose B = 1.5. The odds of differential abundance are exp(1.5) = 4.48, i.e., about four and a half to one. The probability that there is a differential abundance is 4.48/(1+4.48) = 0.82, i.e., the probability is about 82% that this group is differentially abundant. A B-statistic of zero corresponds to a 50-50 chance that the group is differentially abundant. n_obs contains the number of observations for the specific protein, peptide or precursor (depending on the grouping variable) and the associated treatment/reference pair.
"proDA": The std_error column contains the standard error of the differential abundances. avg_abundance contains average abundances for treatment/reference pairs (mean of the two group means). t_statistic contains the t-statistic for the t-test. n_obs contains the number of observations for the specific protein, peptide or precursor (depending on the grouping variable) and the associated treatment/reference pair.
For all methods except "proDA", the p-value adjustment is performed only on the proportion of data that contains a p-value that is not NA. For "proDA" the p-value adjustment is either performed on the complete dataset (filter_NA_missingness = TRUE) or on the subset of the dataset with missingness that is not NA (filter_NA_missingness = FALSE).
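The conversion from a B-statistic (log-odds) to a probability, as in the B = 1.5 example for the moderated t-test, can be written as a short base R sketch. The helper b_to_probability is hypothetical and not part of protti.

```r
# Hypothetical helper (not part of protti): log-odds to probability
b_to_probability <- function(b) {
  odds <- exp(b) # odds of differential abundance
  odds / (1 + odds) # probability of differential abundance
}

prob_b_1_5 <- b_to_probability(1.5)
prob_b_0 <- b_to_probability(0)
c(prob_b_1_5, prob_b_0)
```

The same transformation is available in base R as stats::plogis().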
Examples
set.seed(123) # Makes example reproducible
# Create synthetic data
data <- create_synthetic_data(
  n_proteins = 10,
  frac_change = 0.5,
  n_replicates = 4,
  n_conditions = 2,
  method = "effect_random",
  additional_metadata = FALSE
)
# Assign missingness information
data_missing <- assign_missingness(
  data,
  sample = sample,
  condition = condition,
  grouping = peptide,
  intensity = peptide_intensity_missing,
  ref_condition = "all",
  retain_columns = c(protein, change_peptide)
)
# Calculate differential abundances
# Using "moderated_t-test" and "proDA" improves
# true positive recovery progressively
diff <- calculate_diff_abundance(
  data = data_missing,
  sample = sample,
  condition = condition,
  grouping = peptide,
  intensity_log2 = peptide_intensity_missing,
  missingness = missingness,
  comparison = comparison,
  method = "t-test",
  retain_columns = c(protein, change_peptide)
)
head(diff, n = 10)
Perform gene ontology enrichment analysis
Description
Analyses enrichment of gene ontology terms associated with proteins in the fraction of significant proteins compared to all detected proteins. A two-sided Fisher's exact test is performed to test significance of enrichment or depletion. GO annotations can be provided to this function either through UniProt (go_annotations_uniprot) or through a table obtained with fetch_go in the go_data argument; alternatively, GO annotations are fetched automatically by the function when ontology_type and organism_id are provided.
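To illustrate the kind of test described above, the following base R sketch runs a two-sided Fisher's exact test on a 2x2 contingency table for a single GO term. The counts are made up for illustration, and the table layout is an assumption of this sketch rather than the internal representation used by the function.

```r
# Made-up counts for one GO term among 1000 detected proteins
contingency <- matrix(
  c(
    20, 30, # annotated with the term: significant, not significant
    80, 870 # not annotated with the term: significant, not significant
  ),
  nrow = 2,
  dimnames = list(
    significance = c("significant", "not_significant"),
    annotation = c("in_term", "not_in_term")
  )
)

# Two-sided test detects both enrichment and depletion of the term
term_test <- stats::fisher.test(contingency, alternative = "two.sided")
term_test$p.value
term_test$estimate # odds ratio; values above 1 point towards enrichment
```

In an analysis across many terms, the resulting p-values would then be adjusted for multiple testing.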
Usage
calculate_go_enrichment(
  data,
  protein_id,
  is_significant,
  group = NULL,
  y_axis_free = TRUE,
  facet_n_col = 2,
  go_annotations_uniprot = NULL,
  ontology_type,
  organism_id = NULL,
  go_data = NULL,
  plot = TRUE,
  plot_style = "barplot",
  plot_title = "Gene ontology enrichment of significant proteins",
  barplot_fill_colour = c("#56B4E9", "#E76145"),
  heatmap_fill_colour = protti::mako_colours,
  heatmap_fill_colour_rev = TRUE,
  label = TRUE,
  enrichment_type = "all",
  replace_long_name = TRUE,
  label_move_frac = 0.2,
  min_n_detected_proteins_in_process = 1,
  plot_cutoff = "adj_pval top10"
)
Arguments
data | a data frame that contains at least the input variables. |
protein_id | a character column in the |
is_significant | a logical column in the |
group | optional, character column in the |
y_axis_free | a logical value that specifies if the y-axis of the plot should be "free" for each facet if a grouping variable is provided. Default is |
facet_n_col | a numeric value that specifies the number of columns the faceted plot should have if a column name is provided to group. The default is 2. |
go_annotations_uniprot | recommended, a character column in the |
ontology_type | optional, character value specifying the type of ontology that should be used. Possible values are molecular function (MF), biological process (BP), cellular component (CC). This argument is not required if GO annotations are provided from UniProt in |
organism_id | optional, character value specifying an NCBI taxonomy identifier of an organism (TaxId). Possible inputs include only: "9606" (Human), "559292" (Yeast) and "83333" (E. coli). Is only necessary if GO data is not provided either by |
go_data | Optional, a data frame that can be obtained with |
plot | a logical argument indicating whether the result should be plotted or returned as a table. |
plot_style | a character argument that specifies the plot style. Can be either "barplot" (default) or "heatmap". The "heatmap" plot is especially useful for the comparison of multiple groups. We recommend, however, that you use it only with |
plot_title | a character value that specifies the title of the plot. The default is "Gene ontology enrichment of significant proteins". |
barplot_fill_colour | a vector that contains two colours that should be used as the fill colours for deenriched and enriched GO terms, respectively. If |
heatmap_fill_colour | a vector that contains colours that should be used to create the gradient in the heatmap plot. Default is |
heatmap_fill_colour_rev | a logical value that specifies if the provided colours in |
label | a logical argument indicating whether labels should be added to the plot.Default is TRUE. |
enrichment_type | a character argument that is either "all", "enriched" or "deenriched". This determines if the enrichment analysis should be performed in order to check for both enrichment and deenrichment or only one of the two. This affects the statistics performed and therefore also the displayed plot. |
replace_long_name | a logical argument that specifies if GO term names above 50 characters should be replaced by the GO ID instead for the plot. This ensures that the plotting area doesn't become too small due to the long name. The default is |
label_move_frac | a numeric argument between 0 and 1 that specifies which labels should be moved outside of the bar. The default is 0.2, which means that the labels of all bars that have a size of 20% or less of the largest bar are moved to the right of the bar. This prevents labels from overlapping with the bar boundaries. |
min_n_detected_proteins_in_process | a numeric argument that specifies the minimum number of detected proteins required for a GO term to be displayed in the plot. The default is 1, meaning no filtering of the plotted data is performed. This argument does not affect any computations or the returned data if |
plot_cutoff | a character value indicating if the plot should contain the top n (e.g. top10) most significant GO terms (p-value or adjusted p-value), or if a significance cutoff should be used to determine the number of GO terms in the plot. This information should be provided with the type first followed by the threshold separated by a space. Examples are |
Value
A bar plot or heatmap (depending on plot_style). By default the bar plot displays negative log10 adjusted p-values for the top 10 enriched or deenriched gene ontology terms. Alternatively, plot cutoffs can be chosen individually with the plot_cutoff argument. Bars are colored according to the direction of the enrichment (enriched or deenriched). If a heatmap is returned, terms are organised on the y-axis, while the colour of each tile represents the negative log10 adjusted p-value (default). If a group column is provided, the x-axis contains all groups. If plot = FALSE, a data frame is returned. P-values are adjusted with the Benjamini-Hochberg method.
Examples
# Load libraries
library(dplyr)
library(stringr)
# Create example data
# Contains artificial de-enrichment for ribosomes.
uniprot_go_data <- fetch_uniprot_proteome(
  organism_id = 83333,
  columns = c("accession", "go_f")
)
if (!is(uniprot_go_data, "character")) {
  data <- uniprot_go_data %>%
    mutate(significant = c(
      rep(TRUE, 1000),
      rep(FALSE, n() - 1000)
    )) %>%
    mutate(significant = ifelse(
      str_detect(go_f, pattern = "ribosome"),
      FALSE,
      significant
    )) %>%
    mutate(group = c(
      rep("A", 500),
      rep("B", 500),
      rep("A", (n() - 1000) / 2),
      rep("B", round((n() - 1000) / 2))
    ))
  # Plot gene ontology enrichment
  calculate_go_enrichment(
    data,
    protein_id = accession,
    go_annotations_uniprot = go_f,
    is_significant = significant,
    plot = TRUE,
    plot_cutoff = "pval 0.01"
  )
  # Plot gene ontology enrichment with group
  calculate_go_enrichment(
    data,
    protein_id = accession,
    go_annotations_uniprot = go_f,
    is_significant = significant,
    group = group,
    facet_n_col = 1,
    plot = TRUE,
    plot_cutoff = "pval 0.01"
  )
  # Plot gene ontology enrichment with group in a heatmap plot
  calculate_go_enrichment(
    data,
    protein_id = accession,
    group = group,
    go_annotations_uniprot = go_f,
    is_significant = significant,
    min_n_detected_proteins_in_process = 15,
    plot = TRUE,
    label = TRUE,
    plot_style = "heatmap",
    enrichment_type = "enriched",
    plot_cutoff = "pval 0.01"
  )
  # Calculate gene ontology enrichment
  go_enrichment <- calculate_go_enrichment(
    data,
    protein_id = accession,
    go_annotations_uniprot = go_f,
    is_significant = significant,
    plot = FALSE
  )
  head(go_enrichment, n = 10)
}
Sampling of values for imputation
Description
calculate_imputation is a helper function that is used in the impute function. Depending on the type of missingness and method, it samples values from a normal distribution that can be used for the imputation. Note: The input intensities should be log2 transformed.
Usage
calculate_imputation(
  min = NULL,
  noise = NULL,
  mean = NULL,
  sd,
  missingness = c("MNAR", "MAR"),
  method = c("ludovic", "noise"),
  skip_log2_transform_error = FALSE
)
Arguments
min | a numeric value specifying the minimal intensity value of the precursor/peptide. Is only required if |
noise | a numeric value specifying a noise value for the precursor/peptide. Is only required if |
mean | a numeric value specifying the mean intensity value of the condition with missing values for a given precursor/peptide. Is only required if |
sd | a numeric value specifying the mean of the standard deviation of all conditions for a given precursor/peptide. |
missingness | a character value specifying the missingness type of the data, which determines how values for imputation are sampled. This can be |
method | a character value specifying the method to be used for imputation. For |
skip_log2_transform_error | a logical value. If FALSE, a check is performed to validate that input values are log2 transformed. If input values are > 40, the check fails and an error is returned. |
Value
A value sampled from a normal distribution with the input parameters. Method specifics are applied to input parameters prior to sampling.
Perform KEGG pathway enrichment analysis
Description
Analyses enrichment of KEGG pathways associated with proteins in the fraction of significant proteins compared to all detected proteins. A Fisher's exact test is performed to test significance of enrichment.
Usage
calculate_kegg_enrichment(
  data,
  protein_id,
  is_significant,
  pathway_id = pathway_id,
  pathway_name = pathway_name,
  plot = TRUE,
  plot_cutoff = "adj_pval top10"
)
Arguments
data | a data frame that contains at least the input variables. |
protein_id | a character column in the |
is_significant | a logical column in the |
pathway_id | a character column in the |
pathway_name | a character column in the |
plot | a logical value indicating whether the result should be plotted or returned as a table. |
plot_cutoff | a character value indicating if the plot should contain the top 10 most significant pathways (p-value or adjusted p-value), or if a significance cutoff should be used to determine the number of pathways in the plot. This information should be provided with the type first followed by the threshold separated by a space. Examples are |
Value
A bar plot displaying negative log10 adjusted p-values for the top 10 enriched pathways. Bars are coloured according to the direction of the enrichment. If plot = FALSE, a data frame is returned.
Examples
# Load libraries
library(dplyr)
set.seed(123) # Makes example reproducible
# Create example data
kegg_data <- fetch_kegg(species = "eco")
if (!is.null(kegg_data)) { # only proceed if information was retrieved
  data <- kegg_data %>%
    group_by(uniprot_id) %>%
    mutate(significant = rep(
      sample(
        x = c(TRUE, FALSE),
        size = 1,
        replace = TRUE,
        prob = c(0.2, 0.8)
      ),
      n = n()
    ))
  # Plot KEGG enrichment
  calculate_kegg_enrichment(
    data,
    protein_id = uniprot_id,
    is_significant = significant,
    pathway_id = pathway_id,
    pathway_name = pathway_name,
    plot = TRUE,
    plot_cutoff = "pval 0.05"
  )
  # Calculate KEGG enrichment
  kegg <- calculate_kegg_enrichment(
    data,
    protein_id = uniprot_id,
    is_significant = significant,
    pathway_id = pathway_id,
    pathway_name = pathway_name,
    plot = FALSE
  )
  head(kegg, n = 10)
}
Label-free protein quantification
Description
Determines relative protein abundances from ion quantification. Only proteins with at least three peptides are considered for quantification. The three peptide rule applies for each sample independently.
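The per-sample peptide rule can be illustrated with a short base R sketch that counts distinct peptides per sample and protein and keeps only combinations reaching the threshold of three. It shows the filtering idea only and is not the package implementation; the object names peptide_data, peptide_counts and quantifiable are hypothetical.

```r
# Made-up identifications: P1 has four distinct peptides per sample, P2 has two
peptide_data <- data.frame(
  sample = c(rep("S1", 6), rep("S2", 6), rep("S1", 2), rep("S2", 2)),
  protein_id = c(rep("P1", 12), rep("P2", 4)),
  peptide = c(rep(c("A", "A", "B", "B", "C", "D"), 2), rep(c("E", "F"), 2))
)

# Count distinct peptides for every sample/protein combination
peptide_counts <- aggregate(
  peptide ~ sample + protein_id,
  data = peptide_data,
  FUN = function(x) length(unique(x))
)
names(peptide_counts)[names(peptide_counts) == "peptide"] <- "n_peptides"

# Keep combinations with at least three distinct peptides
quantifiable <- peptide_counts[peptide_counts$n_peptides >= 3, ]
quantifiable
```

Because the count is made per sample, a protein can pass the rule in one sample and fail it in another.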
Usage
calculate_protein_abundance(
  data,
  sample,
  protein_id,
  precursor,
  peptide,
  intensity_log2,
  min_n_peptides = 3,
  method = "sum",
  for_plot = FALSE,
  retain_columns = NULL
)
Arguments
data | a data frame that contains at least the input variables. |
sample | a character column in the |
protein_id | a character column in the |
precursor | a character column in the |
peptide | a character column in the |
intensity_log2 | a numeric column in the |
min_n_peptides | an integer specifying the minimum number of peptides required for a protein to be included in the analysis. The default value is 3, which means proteins with fewer than three unique peptides will be excluded from the analysis. |
method | a character value specifying with which method protein quantities should be calculated. Possible options include |
for_plot | a logical value indicating whether the result should be only protein intensities or protein intensities together with precursor intensities that can be used for plotting using |
retain_columns | a vector indicating if certain columns should be retained from the input data frame. Default is not retaining additional columns. |
Value
If for_plot = FALSE, protein abundances are returned; if for_plot = TRUE, precursor intensities are also returned in a data frame. The latter output is ideal for plotting with peptide_profile_plot() and can be filtered to only include protein abundances.
Examples
# Create example data
data <- data.frame(
  sample = c(
    rep("S1", 6),
    rep("S2", 6),
    rep("S1", 2),
    rep("S2", 2)
  ),
  protein_id = c(
    rep("P1", 12),
    rep("P2", 4)
  ),
  precursor = c(
    rep(c("A1", "A2", "B1", "B2", "C1", "D1"), 2),
    rep(c("E1", "F1"), 2)
  ),
  peptide = c(
    rep(c("A", "A", "B", "B", "C", "D"), 2),
    rep(c("E", "F"), 2)
  ),
  intensity = c(
    rnorm(n = 6, mean = 15, sd = 2),
    rnorm(n = 6, mean = 21, sd = 1),
    rnorm(n = 2, mean = 15, sd = 1),
    rnorm(n = 2, mean = 15, sd = 2)
  )
)
data
# Calculate protein abundances
protein_abundance <- calculate_protein_abundance(
  data,
  sample = sample,
  protein_id = protein_id,
  precursor = precursor,
  peptide = peptide,
  intensity_log2 = intensity,
  method = "sum",
  for_plot = FALSE
)
protein_abundance
# Calculate protein abundances and retain precursor
# abundances that can be used in a peptide profile plot
complete_abundances <- calculate_protein_abundance(
  data,
  sample = sample,
  protein_id = protein_id,
  precursor = precursor,
  peptide = peptide,
  intensity_log2 = intensity,
  method = "sum",
  for_plot = TRUE
)
complete_abundances
Protein sequence coverage
Description
Calculate sequence coverage for each identified protein.
Usage
calculate_sequence_coverage(data, protein_sequence, peptides)
Arguments
data | a data frame containing at least the protein sequence and the identified peptides as columns. |
protein_sequence | a character column in the |
peptides | a character column in the |
Value
A new column in the data data frame containing the calculated sequence coverage for each identified protein.
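As an illustration of what such a coverage value means, the following base R sketch computes the percentage of residues covered by at least one peptide. It assumes coverage is defined this way and that peptides are matched as exact substrings; the helper sequence_coverage() is hypothetical and not part of the package.

```r
# Hypothetical helper: percentage of residues covered by at least one peptide
sequence_coverage <- function(protein_sequence, peptides) {
  covered <- rep(FALSE, nchar(protein_sequence))
  for (pep in unique(peptides)) {
    # Start positions of exact (non-regex) matches; -1 means no match
    starts <- gregexpr(pep, protein_sequence, fixed = TRUE)[[1]]
    for (s in starts[starts > 0]) {
      covered[s:(s + nchar(pep) - 1)] <- TRUE
    }
  }
  100 * mean(covered)
}

# Peptides "abc" and "jklmn" cover 8 of 16 residues
coverage <- sequence_coverage("abcdefghijklmnop", c("abc", "jklmn"))
coverage
```

Overlapping peptides are only counted once because coverage is tracked per residue rather than per peptide.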
Examples
data <- data.frame(
  protein_sequence = c("abcdefghijklmnop", "abcdefghijklmnop"),
  pep_stripped_sequence = c("abc", "jklmn")
)

calculate_sequence_coverage(
  data,
  protein_sequence = protein_sequence,
  peptides = pep_stripped_sequence
)

Check treatment enrichment
Description
Check for an enrichment of proteins interacting with the treatment in significantly changing proteins as compared to all proteins.
Usage
calculate_treatment_enrichment(
  data,
  protein_id,
  is_significant,
  binds_treatment,
  group = NULL,
  treatment_name,
  plot = TRUE,
  fill_colours = protti::protti_colours,
  fill_by_group = FALSE,
  facet_n_col = 2
)
Arguments
data | a data frame that contains at least the input variables. |
protein_id | a character column in the |
is_significant | a logical column in the |
binds_treatment | a logical column in the |
group | optional, character column in the |
treatment_name | a character value that indicates the treatment name. It will be included in the plot title. |
plot | a logical value indicating whether the result should be plotted or returned as a table. |
fill_colours | a character vector that specifies the fill colours of the plot. |
fill_by_group | a logical value that specifies if the bars in the plot should be filled by group if the group argument is provided. Default is |
facet_n_col | a numeric value that specifies the number of columns the facet plot should have if a |
Value
A bar plot displaying the percentage of all detected proteins and all significant proteins that bind to the treatment. A Fisher's exact test is performed to calculate the significance of the enrichment in significant proteins compared to all proteins. The result is reported as a p-value. If plot = FALSE a contingency table in long format is returned.
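The kind of test described here can be sketched with base R alone. The sketch assumes a 2 x 2 contingency table of significance against treatment binding and a two-sided stats::fisher.test(); how the function builds its table internally is not stated in this manual, so treat the details as assumptions.

```r
# Toy data: 20 significant and 30 non-significant proteins
significant <- c(rep(TRUE, 20), rep(FALSE, 30))
binds_treatment <- c(rep(TRUE, 10), rep(FALSE, 10), rep(TRUE, 5), rep(FALSE, 25))

# 2 x 2 contingency table of significance versus treatment binding
contingency <- table(
  is_significant = significant,
  binds_treatment = binds_treatment
)

# Fisher's exact test on the contingency table (two-sided by default)
fisher_result <- stats::fisher.test(contingency)

# Percentages analogous to the two bars of the plot
pct_all <- 100 * mean(binds_treatment)
pct_significant <- 100 * mean(binds_treatment[significant])

contingency
fisher_result$p.value
```

Here 30% of all proteins but 50% of the significant proteins bind the treatment, which is the difference the test evaluates.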
Examples
# Create example data
data <- data.frame(
  protein_id = c(paste0("protein", 1:50)),
  significant = c(
    rep(TRUE, 20),
    rep(FALSE, 30)
  ),
  binds_treatment = c(
    rep(TRUE, 10),
    rep(FALSE, 10),
    rep(TRUE, 5),
    rep(FALSE, 25)
  ),
  group = c(
    rep("A", 5),
    rep("B", 15),
    rep("A", 15),
    rep("B", 15)
  )
)

# Plot treatment enrichment
calculate_treatment_enrichment(
  data,
  protein_id = protein_id,
  is_significant = significant,
  binds_treatment = binds_treatment,
  treatment_name = "Rapamycin",
  plot = TRUE
)

# Plot treatment enrichment by group
calculate_treatment_enrichment(
  data,
  protein_id = protein_id,
  group = group,
  is_significant = significant,
  binds_treatment = binds_treatment,
  treatment_name = "Rapamycin",
  plot = TRUE,
  fill_by_group = TRUE
)

# Calculate treatment enrichment
enrichment <- calculate_treatment_enrichment(
  data,
  protein_id = protein_id,
  is_significant = significant,
  binds_treatment = binds_treatment,
  plot = FALSE
)

enrichment

Protein abundance correction for LiP-data
Description
Performs the correction of LiP-peptides for changes in protein abundance and calculates their significance using a t-test. This function was implemented based on the MSstatsLiP package developed by the Vitek lab.
Usage
correct_lip_for_abundance(
  lip_data,
  trp_data,
  protein_id,
  grouping,
  comparison = comparison,
  diff = diff,
  n_obs = n_obs,
  std_error = std_error,
  p_adj_method = "BH",
  retain_columns = NULL,
  method = c("satterthwaite", "no_df_approximation")
)
Arguments
lip_data | a data frame containing at least the input variables. Ideally, the result from the |
trp_data | a data frame containing at least the input variables minus the grouping column. Ideally, the result from the |
protein_id | a character column in the |
grouping | a character column in the |
comparison | a character column in the |
diff | a numeric column in the |
n_obs | a numeric column in the |
std_error | a numeric column in the |
p_adj_method | a character value, specifies the p-value correction method. Possible methods are c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none"). Default method is |
retain_columns | a vector indicating if certain columns should be retained from the input data frame. Default is not retaining additional columns |
method | a character value, specifies the method used to estimate the degrees of freedom. Possible methods are c("satterthwaite", "no_df_approximation"). |
Value
a data frame containing corrected differential abundances (adj_diff), adjusted standard errors (adj_std_error), degrees of freedom (df), p-values (pval) and adjusted p-values (adj_pval).
Author(s)
Aaron Fehr
Examples
# Load libraries
library(dplyr)

# Load example data and simulate tryptic data by summing up precursors
data <- rapamycin_10uM

data_trp <- data %>%
  dplyr::group_by(pg_protein_accessions, r_file_name) %>%
  dplyr::mutate(pg_quantity = sum(fg_quantity)) %>%
  dplyr::distinct(
    r_condition,
    r_file_name,
    pg_protein_accessions,
    pg_quantity
  )

# Calculate differential abundances for LiP and Trp data
diff_lip <- data %>%
  dplyr::mutate(fg_intensity_log2 = log2(fg_quantity)) %>%
  assign_missingness(
    sample = r_file_name,
    condition = r_condition,
    intensity = fg_intensity_log2,
    grouping = eg_precursor_id,
    ref_condition = "control",
    retain_columns = "pg_protein_accessions"
  ) %>%
  calculate_diff_abundance(
    sample = r_file_name,
    condition = r_condition,
    grouping = eg_precursor_id,
    intensity_log2 = fg_intensity_log2,
    comparison = comparison,
    method = "t-test",
    retain_columns = "pg_protein_accessions"
  )

diff_trp <- data_trp %>%
  dplyr::mutate(pg_intensity_log2 = log2(pg_quantity)) %>%
  assign_missingness(
    sample = r_file_name,
    condition = r_condition,
    intensity = pg_intensity_log2,
    grouping = pg_protein_accessions,
    ref_condition = "control"
  ) %>%
  calculate_diff_abundance(
    sample = r_file_name,
    condition = r_condition,
    grouping = pg_protein_accessions,
    intensity_log2 = pg_intensity_log2,
    comparison = comparison,
    method = "t-test"
  )

# Correct for abundance changes
corrected <- correct_lip_for_abundance(
  lip_data = diff_lip,
  trp_data = diff_trp,
  protein_id = pg_protein_accessions,
  grouping = eg_precursor_id,
  retain_columns = c("missingness"),
  method = "satterthwaite"
)

head(corrected, n = 10)

Creates a mass spectrometer queue for Xcalibur
Description
This function creates a measurement queue for sample acquisition for the software Xcalibur. All possible combinations of the provided information will be created to make file and sample names.
Usage
create_queue(
  date = NULL,
  instrument = NULL,
  user = NULL,
  measurement_type = NULL,
  experiment_name = NULL,
  digestion = NULL,
  treatment_type_1 = NULL,
  treatment_type_2 = NULL,
  treatment_dose_1 = NULL,
  treatment_dose_2 = NULL,
  treatment_unit_1 = NULL,
  treatment_unit_2 = NULL,
  n_replicates = NULL,
  number_runs = FALSE,
  organism = NULL,
  exclude_combinations = NULL,
  inj_vol = NA,
  data_path = NA,
  method_path = NA,
  position_row = NA,
  position_column = NA,
  blank_every_n = NULL,
  blank_position = NA,
  blank_method_path = NA,
  blank_inj_vol = 1,
  export = FALSE,
  export_to_queue = FALSE,
  queue_path = NULL
)
Arguments
date | optional, character value indicating the start date of the measurements. |
instrument | optional, character value indicating the instrument initials. |
user | optional, character value indicating the user name. |
measurement_type | optional, character value indicating the measurement type of the samples (e.g. "DIA", "DDA", "library" etc.). |
experiment_name | optional, character value indicating the name of the experiment. |
digestion | optional, character vector indicating the digestion types used in this experiment (e.g. "LiP" and/or "tryptic control"). |
treatment_type_1 | optional, character vector indicating the name of the treatment. |
treatment_type_2 | optional, character vector indicating the name of a second treatment that was combined with the first treatment. |
treatment_dose_1 | optional, numeric vector indicating the doses used for treatment 1. These can be concentrations or times etc. |
treatment_dose_2 | optional, numeric vector indicating the doses used for treatment 2. These can be concentrations or times etc. |
treatment_unit_1 | optional, character vector indicating the unit of the doses for treatment 1 (e.g. min, mM, etc.). |
treatment_unit_2 | optional, character vector indicating the unit of the doses for treatment 2 (e.g. min, mM, etc.). |
n_replicates | optional, a numeric value indicating the number of replicates used per sample. |
number_runs | a logical that specifies if file names should be numbered from 1:n instead of adding experiment information. Default is FALSE. |
organism | optional, character value indicating the name of the organism used. |
exclude_combinations | optional, list of lists that contains vectors of treatment types and treatment doses of which combinations should be excluded from the final queue. |
inj_vol | a numeric value indicating the volume used for injection in microliter. Will be |
data_path | a character value indicating the file path where the MS raw data should be saved. Backslashes should be escaped by another backslash. Will be |
method_path | a character value indicating the file path of the MS acquisition method. Backslashes should be escaped by another backslash. Will be |
position_row | a character vector that contains row positions that can be used for the samples (e.g. c("A", "B")). If the number of specified rows and columns does not equal the total number of samples, positions will be repeated. |
position_column | a character vector that contains column positions that can be used for the samples (e.g. 8). If the number of specified rows and columns does not equal the total number of samples, positions will be repeated. |
blank_every_n | optional, numeric value that specifies in which intervals a blank sample should be inserted. |
blank_position | a character value that specifies the plate position of the blank. Will be |
blank_method_path | a character value that specifies the file path of the MS acquisition method of the blank. Backslashes should be escaped by another backslash. Will be |
blank_inj_vol | a numeric value that specifies the injection volume of the blank sample. Will be |
export | a logical value that specifies if the queue should be exported from R and saved as a .csv file. Default is FALSE. Further options for export can be adjusted with the |
export_to_queue | a logical value that specifies if the resulting queue should be appended to an already existing queue. If false result will be saved as |
queue_path | optional, a character value that specifies the file path to a queue file to which the generated queue should be appended if |
Value
If export_to_queue = FALSE a file named queue.csv will be returned that contains the generated queue. If export_to_queue = TRUE, the resulting generated queue will be appended to an already existing queue that needs to be specified either interactively or through the argument queue_path.
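The idea of building all combinations and then excluding some of them can be sketched with base::expand.grid(). This is only an illustration under assumptions: the column names, the name pattern of the samples and the exclusion logic are hypothetical and do not reproduce the file names the function generates.

```r
# All combinations of a reduced set of experiment descriptors
combinations <- expand.grid(
  digestion = c("LiP", "tryptic control"),
  treatment_type_1 = c("EDTA", "H2O"),
  treatment_dose_1 = c(10, 30, 60),
  replicate = 1:4,
  stringsAsFactors = FALSE
)

# Hypothetical exclusion: drop H2O samples at the 10 and 30 min doses
excluded <- combinations$treatment_type_1 == "H2O" &
  combinations$treatment_dose_1 %in% c(10, 30)
queue <- combinations[!excluded, ]

# Assemble an illustrative sample name from the remaining descriptors
queue$sample_name <- paste(
  queue$digestion,
  queue$treatment_type_1,
  paste0(queue$treatment_dose_1, "min"),
  queue$replicate,
  sep = "_"
)

nrow(combinations)
nrow(queue)
```

With these inputs 48 combinations are generated and 16 of them are excluded, leaving 32 queue entries.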
Examples
create_queue(
  date = c("200722"),
  instrument = c("EX1"),
  user = c("jquast"),
  measurement_type = c("DIA"),
  experiment_name = c("JPQ031"),
  digestion = c("LiP", "tryptic control"),
  treatment_type_1 = c("EDTA", "H2O"),
  treatment_type_2 = c("Zeba", "unfiltered"),
  treatment_dose_1 = c(10, 30, 60),
  treatment_unit_1 = c("min"),
  n_replicates = 4,
  number_runs = FALSE,
  organism = c("E. coli"),
  exclude_combinations = list(list(
    treatment_type_1 = c("H2O"),
    treatment_type_2 = c("Zeba", "unfiltered"),
    treatment_dose_1 = c(10, 30)
  )),
  inj_vol = c(2),
  data_path = "D:\\2007_Data",
  method_path = "C:\\Xcalibur\\methods\\DIA_120min",
  position_row = c("A", "B", "C", "D", "E", "F"),
  position_column = 8,
  blank_every_n = 4,
  blank_position = "1-V1",
  blank_method_path = "C:\\Xcalibur\\methods\\blank"
)

Creates a contact map of all atoms from a structure file
Description
Creates a contact map of a subset or of all atom or residue distances in a structure or AlphaFold prediction file. Contact maps are a useful tool for the identification of protein regions that are in close proximity in the folded protein. Additionally, regions that are interacting closely with a small molecule or metal ion can be easily identified without the need to open the structure in programs such as PyMOL or ChimeraX. For large datasets (more than 40 contact maps) it is recommended to use the parallel_create_structure_contact_map() function instead, regardless of whether maps should be created in parallel or sequentially.
Usage
create_structure_contact_map(
  data,
  data2 = NULL,
  id,
  chain = NULL,
  auth_seq_id = NULL,
  distance_cutoff = 10,
  pdb_model_number_selection = c(0, 1),
  return_min_residue_distance = TRUE,
  show_progress = TRUE,
  export = FALSE,
  export_location = NULL,
  structure_file = NULL
)
Arguments
data | a data frame containing at least a column with PDB ID information of which the name can be provided to the |
data2 | optional, a data frame that contains a subset of regions for which distances to regions provided in the |
id | a character column in the |
chain | optional, a character column in the |
auth_seq_id | optional, a character (or numeric) column in the |
distance_cutoff | a numeric value specifying the distance cutoff in Angstrom. All values for pairwise comparisons are calculated but only values smaller than this cutoff will be returned in the output. If a cutoff of e.g. 5 is selected then only residues with a distance of 5 Angstrom and less are returned. Using a small value can reduce the size of the contact map drastically and is therefore recommended. The default value is 10. |
pdb_model_number_selection | a numeric vector specifying which models from the structure files should be considered for contact maps. E.g. NMR models often have many models in one file. The default for this argument is c(0, 1). This means the first model of each structure file is selected for contact map calculations. For AlphaFold predictions the model number is 0 (only .pdb files), therefore this case is also included here. |
return_min_residue_distance | a logical value that specifies if the contact map should be returned for all atom distances or the minimum residue distances. Minimum residue distances are smaller in size. If atom distances are not strictly needed it is recommended to set this argument to TRUE. The default is TRUE. |
show_progress | a logical value that specifies if a progress bar will be shown (default is TRUE). |
export | a logical value that indicates if contact maps should be exported as ".csv". The name of the file will be the structure ID. Default is |
export_location | optional, a character value that specifies the path to the location in which the contact map should be saved if |
structure_file | optional, a character value that specifies the path to the location and name of a structure file in ".cif" or ".pdb" format for which a contact map should be created. All other arguments can be provided as usual with the exception of the |
Value
A list of contact maps for each PDB or UniProt ID provided in the input is returned. If the export argument is TRUE, each contact map will be saved as a ".csv" file in the current working directory or the location provided to the export_location argument.
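How a distance cutoff and minimum residue distances relate to atom coordinates can be shown with a small base R sketch. The coordinates are made up, same-residue pairs are dropped as an assumption, and the column names are hypothetical; the function's actual output columns are not reproduced here.

```r
# Hypothetical atom coordinates in Angstrom for three residues
atoms <- data.frame(
  residue = c(1, 1, 2, 3),
  x = c(0, 1, 4, 20),
  y = c(0, 0, 0, 0),
  z = c(0, 0, 0, 0)
)
distance_cutoff <- 10

# Pairwise Euclidean distances between all atoms
distance_matrix <- as.matrix(stats::dist(atoms[, c("x", "y", "z")]))

# Long format: one row per ordered atom pair
pairs <- expand.grid(atom1 = seq_len(nrow(atoms)), atom2 = seq_len(nrow(atoms)))
pairs$distance <- distance_matrix[cbind(pairs$atom1, pairs$atom2)]
pairs$residue1 <- atoms$residue[pairs$atom1]
pairs$residue2 <- atoms$residue[pairs$atom2]

# Keep pairs of different residues that lie within the cutoff (assumption)
pairs <- pairs[pairs$residue1 != pairs$residue2 & pairs$distance <= distance_cutoff, ]

# Minimum atom distance per residue pair
min_residue_distance <- aggregate(distance ~ residue1 + residue2, data = pairs, FUN = min)

min_residue_distance
```

Collapsing atom pairs to one minimum distance per residue pair is what makes the residue-level map much smaller than the atom-level map.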
Examples
# Create example data
data <- data.frame(
  pdb_id = c("6NPF", "1C14", "3NIR"),
  chain = c("A", "A", NA),
  auth_seq_id = c("1;2;3;4;5;6;7", NA, NA)
)

# Create contact map
contact_maps <- create_structure_contact_map(
  data = data,
  id = pdb_id,
  chain = chain,
  auth_seq_id = auth_seq_id,
  return_min_residue_distance = TRUE
)

str(contact_maps[["3NIR"]])

contact_maps

Creates a synthetic limited proteolysis proteomics dataset
Description
This function creates a synthetic limited proteolysis proteomics dataset that can be used to test functions while knowing the ground truth.
Usage
create_synthetic_data(
  n_proteins,
  frac_change,
  n_replicates,
  n_conditions,
  method = "effect_random",
  concentrations = NULL,
  median_offset_sd = 0.05,
  mean_protein_intensity = 16.88,
  sd_protein_intensity = 1.4,
  mean_n_peptides = 12.75,
  size_n_peptides = 0.9,
  mean_sd_peptides = 1.7,
  sd_sd_peptides = 0.75,
  mean_log_replicates = -2.2,
  sd_log_replicates = 1.05,
  effect_sd = 2,
  dropout_curve_inflection = 14,
  dropout_curve_sd = -1.2,
  additional_metadata = TRUE
)
Arguments
n_proteins | a numeric value that specifies the number of proteins in the synthetic dataset. |
frac_change | a numeric value that specifies the fraction of proteins that has a peptide changing in abundance. So far only one peptide per protein is changing. |
n_replicates | a numeric value that specifies the number of replicates per condition. |
n_conditions | a numeric value that specifies the number of conditions. |
method | a character value that specifies the method type for the random sampling of significantly changing peptides. If |
concentrations | a numeric vector of length equal to the number of conditions, only needs to be specified if |
median_offset_sd | a numeric value that specifies the standard deviation of normal distribution that is used for sampling of inter-sample-differences. Default is 0.05. |
mean_protein_intensity | a numeric value that specifies the mean of the protein intensity distribution. Default: 16.88. |
sd_protein_intensity | a numeric value that specifies the standard deviation of the protein intensity distribution. Default: 1.4. |
mean_n_peptides | a numeric value that specifies the mean number of peptides per protein. Default: 12.75. |
size_n_peptides | a numeric value that specifies the dispersion parameter (the shape parameter of the gamma mixing distribution). Can be theoretically calculated as |
mean_sd_peptides | a numeric value that specifies the mean of peptide intensity standard deviations within a protein. Default: 1.7. |
sd_sd_peptides | a numeric value that specifies the standard deviation of peptide intensity standard deviation within a protein. Default: 0.75. |
mean_log_replicates, sd_log_replicates | a numeric value that specifies the |
effect_sd | a numeric value that specifies the standard deviation of a normal distribution around |
dropout_curve_inflection | a numeric value that specifies the intensity inflection point of a probabilistic dropout curve that is used to sample intensity dependent missing values. This argument determines how many missing values there are in the dataset. Default: 14. |
dropout_curve_sd | a numeric value that specifies the standard deviation of the probabilistic dropout curve. Needs to be negative to sample a dropout towards low intensities. Default: -1.2. |
additional_metadata | a logical value that determines if metadata such as protein coverage, missed cleavages and charge state should be sampled and added to the list. |
Value
A data frame that contains complete peptide intensities and peptide intensities with values that were created based on a probabilistic dropout curve.
Examples
create_synthetic_data(
  n_proteins = 10,
  frac_change = 0.1,
  n_replicates = 3,
  n_conditions = 2
)

# determination of mean_n_peptides and size_n_peptides parameters based on real data (count)
# example peptide count per protein
count <- c(6, 3, 2, 0, 1, 0, 1, 2, 2, 0)
theta <- c(mu = 1, k = 1)
negbinom <- function(theta) {
  -sum(stats::dnbinom(count, mu = theta[1], size = theta[2], log = TRUE))
}
fit <- stats::optim(theta, negbinom)
fit

# determination of mean_log_replicates and sd_log_replicates parameters
# based on real data (standard_deviations)
# example standard deviations of replicates
standard_deviations <- c(0.61, 0.54, 0.2, 1.2, 0.8, 0.3, 0.2, 0.6)
theta2 <- c(meanlog = 1, sdlog = 1)
lognorm <- function(theta2) {
  -sum(stats::dlnorm(standard_deviations, meanlog = theta2[1], sdlog = theta2[2], log = TRUE))
}
fit2 <- stats::optim(theta2, lognorm)
fit2

Calculate differential abundance between conditions
Description
This function was deprecated due to its name changing to calculate_diff_abundance().
Usage
diff_abundance(...)
Value
A data frame that contains differential abundances (diff), p-values (pval) and adjusted p-values (adj_pval) for each protein, peptide or precursor (depending on the grouping variable) and the associated treatment/reference pair. Depending on the method the data frame contains additional columns:
"t-test": The std_error column contains the standard error of the differential abundances. n_obs contains the number of observations for the specific protein, peptide or precursor (depending on the grouping variable) and the associated treatment/reference pair.
"t-test_mean_sd": Columns labeled as control refer to the second condition of the comparison pairs. Treated refers to the first condition. mean_control and mean_treated columns contain the means for the reference and treatment condition, respectively. sd_control and sd_treated columns contain the standard deviations for the reference and treatment condition, respectively. n_control and n_treated columns contain the numbers of samples for the reference and treatment condition, respectively. The std_error column contains the standard error of the differential abundances. t_statistic contains the t_statistic for the t-test.
"moderated_t-test": CI_2.5 and CI_97.5 contain the 2.5% and 97.5% confidence interval borders for differential abundances. avg_abundance contains average abundances for treatment/reference pairs (mean of the two group means). t_statistic contains the t_statistic for the t-test. B: The B-statistic is the log-odds that the protein, peptide or precursor (depending on grouping) has a differential abundance between the two groups. Suppose B=1.5. The odds of differential abundance is exp(1.5)=4.48, i.e., about four and a half to one. The probability that there is a differential abundance is 4.48/(1+4.48)=0.82, i.e., the probability is about 82% that this group is differentially abundant. A B-statistic of zero corresponds to a 50-50 chance that the group is differentially abundant. n_obs contains the number of observations for the specific protein, peptide or precursor (depending on the grouping variable) and the associated treatment/reference pair.
"proDA": The std_error column contains the standard error of the differential abundances. avg_abundance contains average abundances for treatment/reference pairs (mean of the two group means). t_statistic contains the t_statistic for the t-test. n_obs contains the number of observations for the specific protein, peptide or precursor (depending on the grouping variable) and the associated treatment/reference pair.
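The B-statistic arithmetic given for the "moderated_t-test" method can be checked directly in R. The variable names below are chosen for illustration only.

```r
# B-statistic (log-odds of differential abundance) from the text above
b_statistic <- 1.5

# Convert log-odds to odds and then to a probability
odds <- exp(b_statistic)
probability <- odds / (1 + odds)

round(odds, 2)
round(probability, 2)

# The same conversion is available as the logistic function in base R
stats::plogis(b_statistic)
```

A B-statistic of zero gives odds of 1 and therefore a probability of 0.5, which matches the 50-50 statement above.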
Dose response curve helper function
Description
This function performs the four-parameter dose response curve fit. It is the helper function for the fit in the fit_drc_4p function.
Usage
drc_4p(data, response, dose, log_logarithmic = TRUE, pb = NULL)
Arguments
data | a data frame that contains at least the dose and response column the model should be fitted to. |
response | a numeric column that contains the response values. |
dose | a numeric column that contains the dose values. |
log_logarithmic | a logical value indicating if a logarithmic or log-logarithmic model is fitted. If response values form a symmetric curve for non-log transformed dose values, a logarithmic model instead of a log-logarithmic model should be used. Usually biological dose response data has a log-logarithmic distribution, which is the reason this is the default. Log-logarithmic models are symmetric if dose values are log transformed. |
pb | progress bar object. This is only necessary if the function is used in an iteration. |
Value
An object of class drc. If no fit was performed a character vector with content "no_fit".
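The difference between the two model types selected by log_logarithmic can be made concrete with a short sketch. It assumes the common four-parameter parameterization used by the drc package (slope b, lower limit c, upper limit d, inflection point e); the function names ll4() and l4() are hypothetical and no fitting is performed here.

```r
# Log-logistic model: symmetric when plotted against log(dose)
ll4 <- function(dose, b, c, d, e) {
  c + (d - c) / (1 + exp(b * (log(dose) - log(e))))
}

# Logistic model: symmetric when plotted against untransformed dose
l4 <- function(dose, b, c, d, e) {
  c + (d - c) / (1 + exp(b * (dose - e)))
}

dose <- c(1, 10, 50, 100, 500, 1000, 5000)

# Illustrative parameters: response falls from 20 to 10 with inflection at dose 50
response_ll4 <- ll4(dose, b = 1, c = 10, d = 20, e = 50)
response_l4 <- l4(dose, b = 0.01, c = 10, d = 20, e = 50)

data.frame(dose, response_ll4, response_l4)
```

In both models the response at the inflection point e lies halfway between the lower and upper limits; only the dose scale on which the curve is symmetric differs.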
Plotting of four-parameter dose response curves
Description
Function for plotting four-parameter dose response curves for each group (precursor, peptide or protein), based on output from fit_drc_4p function.
Usage
drc_4p_plot(
  data,
  grouping,
  response,
  dose,
  targets,
  unit = "uM",
  y_axis_name = "Response",
  facet_title_size = 15,
  facet = TRUE,
  scales = "free",
  x_axis_scale_log10 = TRUE,
  x_axis_limits = c(NA, NA),
  colours = NULL,
  export = FALSE,
  export_height = 25,
  export_width = 30,
  export_name = "dose-response_curves"
)
Arguments
data | a data frame that is obtained by calling the |
grouping | a character column in the |
response | a numeric column in a nested data frame called |
dose | a numeric column in a nested data frame called |
targets | a character vector that specifies the names of the precursors, peptides or proteins (depending on |
unit | a character value specifying the unit of the concentration. |
y_axis_name | a character value specifying the name of the y-axis of the plot. |
facet_title_size | a numeric value that specifies the size of the facet title. Default is 15. |
facet | a logical value that indicates if plots should be summarised into facets of 20 plots. This is recommended for many plots. |
scales | a character value that specifies if the scales in faceted plots (if more than one target was provided) should be |
x_axis_scale_log10 | a logical value that indicates if the x-axis scale should be log10 transformed. |
x_axis_limits | a numeric vector of length 2, defining the lower and upper x-axis limit. The default is |
colours | a character vector containing at least three colours. The first is used for the points, the second for the confidence interval and the third for the curve. By default the first two protti colours are used for the points and confidence interval and the curve is black. |
export | a logical value that indicates if plots should be exported as PDF. The output directory will be the current working directory. The name of the file can be chosen using the |
export_height | a numeric value that specifies the plot height in inches for an exported plot.The default is |
export_width | a numeric value that specifies the plot width in inches for an exported plot. The default is |
export_name | a character value providing the name of the exported file if |
Value
If targets = "all" a list containing plots for every unique identifier in the grouping variable is created. Otherwise a plot for the specified targets is created with maximally 20 facets.
Examples
set.seed(123) # Makes example reproducible

# Create example data
data <- create_synthetic_data(
  n_proteins = 2,
  frac_change = 1,
  n_replicates = 3,
  n_conditions = 8,
  method = "dose_response",
  concentrations = c(0, 1, 10, 50, 100, 500, 1000, 5000),
  additional_metadata = FALSE
)

# Perform dose response curve fit
drc_fit <- fit_drc_4p(
  data = data,
  sample = sample,
  grouping = peptide,
  response = peptide_intensity_missing,
  dose = concentration,
  retain_columns = c(protein)
)

str(drc_fit)

# Plot dose response curves
if (!is.null(drc_fit)) {
  drc_4p_plot(
    data = drc_fit,
    grouping = peptide,
    response = peptide_intensity_missing,
    dose = concentration,
    targets = c("peptide_2_1", "peptide_2_3"),
    unit = "pM"
  )
}

Extract metal-binding protein information from UniProt
Description
Information on metal binding proteins is extracted from UniProt data retrieved with fetch_uniprot as well as QuickGO data retrieved with fetch_quickgo.
Usage
extract_metal_binders(
  data_uniprot,
  data_quickgo,
  data_chebi = NULL,
  data_chebi_relation = NULL,
  data_eco = NULL,
  data_eco_relation = NULL,
  show_progress = TRUE
)
Arguments
data_uniprot | a data frame containing at least the |
data_quickgo | a data frame containing molecular function gene ontology information for at least the proteins of interest. This data should be obtained by calling |
data_chebi | optional, a data frame that can be manually obtained with |
data_chebi_relation | optional, a data frame that can be manually obtained with |
data_eco | optional, a data frame that contains evidence and conclusion ontology data that can be obtained by calling |
data_eco_relation | optional, a data frame that contains relational evidence and conclusion ontology data that can be obtained by calling |
show_progress | a logical value that specifies if progress will be shown (default is TRUE). |
Value
A data frame containing information on protein metal binding state. It contains the following columns:
accession: UniProt protein identifier.
most_specific_id: ChEBI ID that is most specific for the position after combining information from all sources. Can be multiple IDs separated by "," if a position appears multiple times due to multiple fitting IDs.
most_specific_id_name: The name of the ID in the most_specific_id column. This information is based on ChEBI.
ligand_identifier: A ligand identifier that is unique per ligand per protein. It consists of the ligand ID and ligand name. The ligand ID counts the number of ligands of the same type per protein.
ligand_position: The amino acid position of the residue interacting with the ligand.
binding_mode: Contains information about the way the amino acid residue interacts with the ligand. If it is "covalent" then the residue is not in contact with the metal directly but only the cofactor that binds the metal.
metal_function: Contains information about the function of the metal. E.g. "catalytic".
metal_id_part: Contains a ChEBI ID that identifies the metal part of the ligand. This is always the metal atom.
metal_id_part_name: The name of the ID in the metal_id_part column. This information is based on ChEBI.
note: Contains notes associated with information based on cofactors.
chebi_id: Contains the original ChEBI IDs the information is based on.
source: Contains the sources of the information. This can consist of "binding", "cofactor", "catalytic_activity" and "go_term".
eco: If there is evidence the annotation is based on it is annotated with an ECO ID, which is split by source.
eco_type: The ECO identifier can fall into the "manual_assertion" group for manually curated annotations or the "automatic_assertion" group for automatically generated annotations. If there is no evidence it is annotated as "automatic_assertion". The information is split by source.
evidence_source: The original sources (e.g. literature, PDB) of evidence annotations split by source.
reaction: Contains information about the chemical reaction catalysed by the protein that involves the metal. Can contain the EC ID, Rhea ID, direction specific Rhea ID, direction of the reaction and evidence for the direction.
go_term: Contains gene ontology terms if there are any metal related ones associated with the annotation.
go_name: Contains gene ontology names if there are any metal related ones associated with the annotation.
assigned_by: Contains information about the source of the gene ontology term assignment.
database: Contains information about the source of the ChEBI annotation associated with gene ontology terms.
For each protein identifier the data frame contains information on the bound ligand as well as on its position if it is known. Since information about metal ligands can come from multiple sources, additional information (e.g. evidence) is nested in the returned data frame. In order to unnest the relevant information the following steps have to be taken: It is possible that there are multiple IDs in the "most_specific_id" column. This means that one position cannot be uniquely attributed to one specific ligand even with the same ligand_identifier. Apart from the "most_specific_id" column, in which those instances are separated by ",", in other columns the relevant information is separated by "||". Then information should be split based on the source (not the source column, that one can be removed from the data frame). There are certain columns associated with specific sources (e.g. go_term is associated with the "go_term" source). Values of columns not relevant for a certain source should be replaced with NA. Since a most_specific_id can have multiple chebi_ids associated with it we need to unnest the chebi_id column and associated columns in which information is separated by "|". Afterwards evidence and additional information can be unnested by first splitting data for ";;" and then for ";".
Examples
# Create example data
uniprot_ids <- c("P00393", "P06129", "A0A0C5Q309", "A0A0C9VD04")

## UniProt data
data_uniprot <- fetch_uniprot(
  uniprot_ids = uniprot_ids,
  columns = c(
    "ft_binding",
    "cc_cofactor",
    "cc_catalytic_activity"
  )
)

## QuickGO data
data_quickgo <- fetch_quickgo(
  id_annotations = uniprot_ids,
  ontology_annotations = "molecular_function"
)

## ChEBI data (2 and 3 star entries)
data_chebi <- fetch_chebi(stars = c(2, 3))
data_chebi_relation <- fetch_chebi(relation = TRUE)

## ECO data
eco <- fetch_eco()
eco_relation <- fetch_eco(return_relation = TRUE)

# Extract metal binding information
metal_info <- extract_metal_binders(
  data_uniprot = data_uniprot,
  data_quickgo = data_quickgo,
  data_chebi = data_chebi,
  data_chebi_relation = data_chebi_relation,
  data_eco = eco,
  data_eco_relation = eco_relation
)

metal_info

Fetch AlphaFold aligned error
Description
Fetches the aligned error for AlphaFold predictions for provided proteins. The aligned error is useful for assessing inter-domain accuracy. In detail it represents the expected position error at residue x (scored residue), when the predicted and true structures are aligned on residue y (aligned residue).
Usage
fetch_alphafold_aligned_error(
  uniprot_ids = NULL,
  error_cutoff = 20,
  timeout = 30,
  max_tries = 1,
  return_data_frame = FALSE,
  show_progress = TRUE
)
Arguments
uniprot_ids | a character vector of UniProt identifiers for which predictions should be fetched. |
error_cutoff | a numeric value specifying the maximum position error (in Angstroms) that should be retained. Setting this value to a low number reduces the size of the retrieved data. Default is 20. |
timeout | a numeric value specifying the time in seconds until the download times out. The default is 30 seconds. |
max_tries | a numeric value that specifies the number of times the function tries to download the data in case an error occurs. The default is 1. |
return_data_frame | a logical value; if |
show_progress | a logical value; if |
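The max_tries behaviour can be pictured as a simple retry loop. The sketch below is an assumption about the general pattern, not the package's implementation; try_with_retries and flaky are hypothetical names, and the "download" is a toy function that fails twice before succeeding.

```r
# Hypothetical helper: call `f` up to `max_tries` times, return the first success
try_with_retries <- function(f, max_tries = 1) {
  last_error <- NULL
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)
    }
    last_error <- result
  }
  stop("All ", max_tries, " attempts failed: ", conditionMessage(last_error))
}

# Toy "download" that fails twice before succeeding
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("temporary failure")
  "data"
}

try_with_retries(flaky, max_tries = 5)
```

With max_tries = 1 the same toy download would stop with an error after the first failed attempt.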
Value
A list that contains aligned errors for AlphaFold predictions. If return_data_frame is TRUE, a data frame with this information is returned instead. The data frame contains the following columns:
scored_residue: The error for this position is calculated based on the alignment to the aligned residue.
aligned_residue: The residue that is aligned for the calculation of the error of the scored residue.
error: The predicted aligned error computed by AlphaFold.
accession: The UniProt protein identifier.
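A long-format table with these columns can be reshaped into a residue-by-residue matrix for plotting or inspection. This is a minimal sketch on made-up values; the accession is a placeholder and the threshold of 1 is an arbitrary choice for illustration.

```r
# Toy data in the long format described above (values are made up)
aligned_error <- data.frame(
  scored_residue = c(1, 1, 2, 2),
  aligned_residue = c(1, 2, 1, 2),
  error = c(0.5, 4.2, 3.9, 0.4),
  accession = "P00000" # hypothetical placeholder identifier
)

# Reshape to a matrix (rows: scored residue, columns: aligned residue)
n <- max(aligned_error$scored_residue, aligned_error$aligned_residue)
error_matrix <- matrix(NA_real_, nrow = n, ncol = n)
error_matrix[cbind(aligned_error$scored_residue, aligned_error$aligned_residue)] <-
  aligned_error$error

# Keep only low-error residue pairs, analogous to using a low error_cutoff
confident <- aligned_error[aligned_error$error <= 1, ]
```

Pairs that were removed by a cutoff before reshaping simply stay NA in the matrix.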
Examples
aligned_error <- fetch_alphafold_aligned_error(
  uniprot_ids = c("F4HVG8", "O15552"),
  error_cutoff = 5,
  return_data_frame = TRUE
)
head(aligned_error, n = 10)
Fetch AlphaFold prediction
Description
Fetches atom level data for AlphaFold predictions either for selected proteins or whole organisms.
Usage
fetch_alphafold_prediction(
  uniprot_ids = NULL,
  organism_name = NULL,
  version = "v4",
  timeout = 3600,
  max_tries = 5,
  return_data_frame = FALSE,
  show_progress = TRUE
)
Arguments
uniprot_ids | optional, a character vector of UniProt identifiers for which predictions should be fetched. This argument is mutually exclusive to the |
organism_name | optional, a character value providing the name of an organism for which all available AlphaFold predictions should be retrieved. The name should be the capitalised scientific species name (e.g. "Homo sapiens"). Note: Some organisms contain a lot of predictions which might take a considerable amount of time and memory to fetch. Therefore, you should be sure that your system can handle fetching predictions for these organisms. This argument is mutually exclusive to the |
version | a character value that specifies the AlphaFold version that should be used. This is regularly updated by the database. We always try to make the current version the default version. Available versions can be found here: https://ftp.ebi.ac.uk/pub/databases/alphafold/ |
timeout | a numeric value specifying the time in seconds until the download of an organism archive times out. The default is 3600 seconds. |
max_tries | a numeric value that specifies the number of times the function tries to download the data in case an error occurs. The default is 5. This only applies if |
return_data_frame | a logical value that specifies if true, a data frame instead of a list is returned. It is recommended to only use this if information for few proteins is retrieved. Default is FALSE. |
show_progress | a logical value that specifies if true, a progress bar will be shown. Default is TRUE. |
Value
A list that contains atom level data for AlphaFold predictions. If return_data_frame is TRUE, a data frame with this information is returned instead. The data frame contains the following columns:
label_id: Uniquely identifies every atom in the prediction following the standardised convention for mmCIF files.
type_symbol: The code used to identify the atom species representing this atom type. This code is the element symbol.
label_atom_id: Uniquely identifies every atom for the given residue following the standardised convention for mmCIF files.
label_comp_id: A chemical identifier for the residue. This is the three-letter code for the amino acid.
label_asym_id: Chain identifier following the standardised convention for mmCIF files. Since every prediction only contains one protein this is always "A".
label_seq_id: Uniquely and sequentially identifies residues for each protein. The numbering corresponds to the UniProt amino acid positions.
x: The x coordinate of the atom.
y: The y coordinate of the atom.
z: The z coordinate of the atom.
prediction_score: Contains the prediction score for each residue.
auth_seq_id: Same as label_seq_id, but of type character.
auth_comp_id: Same as label_comp_id.
auth_asym_id: Same as label_asym_id.
uniprot_id: The UniProt identifier of the predicted protein.
score_quality: Score annotations.
Examples
alphafold <- fetch_alphafold_prediction(
  uniprot_ids = c("F4HVG8", "O15552"),
  return_data_frame = TRUE
)
head(alphafold, n = 10)
Fetch ChEBI database information
Description
Fetches information from the ChEBI database.
Usage
fetch_chebi(relation = FALSE, stars = c(3), timeout = 60)
Arguments
relation | a logical value that indicates if ChEBI Ontology data will be returned instead of the main compound data. This data can be used to check the relations of ChEBI IDs to each other. Default is FALSE. |
stars | a numeric vector indicating the "star" level (confidence) for which entries should be retrieved (Possible levels are 1, 2 and 3). Default is |
timeout | a numeric value specifying the time in seconds until the download of an organism archive times out. The default is 60 seconds. |
Value
A data frame that contains information about each molecule in the ChEBI database.
Examples
chebi <- fetch_chebi()
head(chebi)
Fetch evidence & conclusion ontology
Description
Fetches all evidence & conclusion ontology (ECO) information from the QuickGO EBI database. The ECO project is maintained through a public GitHub repository.
Usage
fetch_eco(
  return_relation = FALSE,
  return_history = FALSE,
  show_progress = TRUE
)
Arguments
return_relation | a logical value that indicates if relational information should be returned instead of the main descriptive information. This data can be used to check the relations of ECO terms to each other. Default is FALSE. |
return_history | a logical value that indicates if the entry history of an ECO term should be returned instead of the main descriptive information. Default is FALSE. |
show_progress | a logical value that indicates if a progress bar will be shown. Default is TRUE. |
Details
According to the GitHub repository ECO is defined as follows:
"The Evidence & Conclusion Ontology (ECO) describes types of scientific evidence within the biological research domain that arise from laboratory experiments, computational methods, literature curation, or other means. Researchers use evidence to support conclusions that arise out of scientific research. Documenting evidence during scientific research is essential, because evidence gives us a sense of why we believe what we think we know. Conclusions are asserted as statements about things that are believed to be true, for example that a protein has a particular function (i.e. a protein functional annotation) or that a disease is associated with a particular gene variant (i.e. a phenotype-gene association). A systematic and structured (i.e. ontological) classification of evidence allows us to store, retrieve, share, and compare data associated with that evidence using computers, which are essential to navigating the ever-growing (in size and complexity) corpus of scientific information."
More information can be found in their publication (doi:10.1093/nar/gky1036).
Value
A data frame that contains descriptive information about each ECO term in the EBI database. If either return_relation or return_history is set to TRUE, the respective information is returned instead of the usual output.
Examples
eco <- fetch_eco()
head(eco)
Fetch gene ontology information from geneontology.org
Description
Fetches gene ontology data from geneontology.org for the provided organism ID.
Usage
fetch_go(organism_id)
Arguments
organism_id | a character value specifying the NCBI taxonomy identifier of an organism (TaxId). Possible inputs include only: "9606" (Human), "559292" (Yeast) and "83333" (E. coli). |
Value
A data frame that contains gene ontology mappings to UniProt or SGD IDs. The original file is a .GAF file. A detailed description of all columns can be found here: http://geneontology.org/docs/go-annotation-file-gaf-format-2.1/
Examples
go <- fetch_go("9606")
head(go)
Fetch KEGG pathway data from KEGG
Description
Fetches gene IDs and corresponding pathway IDs and names for the provided organism.
Usage
fetch_kegg(species)
Arguments
species | a character value providing an abbreviated species name. "hsa" for human, "eco" for E. coli and "sce" for S. cerevisiae. Additional possible names can be found for eukaryotes and for prokaryotes. |
Value
A data frame that contains gene IDs with corresponding pathway IDs and names for a selected organism.
Examples
kegg <- fetch_kegg(species = "hsa")
head(kegg)
Fetch structural information about protein-metal binding from MetalPDB
Description
Fetches information about protein-metal binding sites from the MetalPDB database. A complete list of different possible search queries can be found on their website.
Usage
fetch_metal_pdb(
  id_type = "uniprot",
  id_value,
  site_type = NULL,
  pfam = NULL,
  cath = NULL,
  scop = NULL,
  representative = NULL,
  metal = NULL,
  ligands = NULL,
  geometry = NULL,
  coordination = NULL,
  donors = NULL,
  columns = NULL,
  show_progress = TRUE
)
Arguments
id_type | a character value that specifies the type of the IDs provided to |
id_value | a character vector supplying IDs that are of the ID type that was specified in |
site_type | optional, a character value that specifies a nuclearity for which information should be retrieved. The specific nuclearity can be supplied as e.g. "tetranuclear". |
pfam | optional, a character value that specifies a Pfam domain for which information should be retrieved. The domain can be specified as e.g. "Carb_anhydrase". |
cath | optional, a character value that specifies a CATH ID for which information should be retrieved. The ID can be specified as e.g. "3.10.200.10". |
scop | optional, a character value that specifies a SCOP ID for which information should be retrieved. The ID can be specified as e.g. "b.74.1.1". |
representative | optional, a logical that indicates if only information of representative sites of a family should be retrieved. A representative site is a site selected to represent a cluster of equivalent sites. The selection is done by choosing the PDB structure with the best X-ray resolution among those containing the sites in the cluster. NMR structures are generally discarded in favor of X-ray structures, unless all the sites in the cluster are found in NMR structures. If it is |
metal | optional, a character value that specifies a metal for which information should be retrieved. The metal can be specified as e.g. "Zn". |
ligands | optional, a character value that specifies a metal ligand residue for which information should be retrieved. The ligand can be specified as e.g. "His". |
geometry | optional, a character value that specifies a metal site geometry for which information should be retrieved. The geometry can be specified here based on the three letter code for geometries provided on their website. |
coordination | optional, a character value that specifies a coordination number for which information should be retrieved. The number can be specified as e.g. "3". |
donors | optional, a character value that specifies a metal ligand atom for which information should be retrieved. The atom can be specified as e.g. "S" for sulfur. |
columns | optional, a character vector that specifies specific columns that should be retrieved based on the MetalPDB website. If nothing is supplied here, all possible columns will be retrieved. |
show_progress | logical, if true, a progress bar will be shown. Default is TRUE. |
Value
A data frame that contains information about protein-metal binding sites. The data frame contains some columns that might not be self explanatory.
auth_id_metal: Unique structure atom identifier of the metal, which is provided by the author of the structure in order to match the identification used in the publication that describes the structure.
auth_seq_id_metal: Residue identifier of the metal, which is provided by the author of the structure in order to match the identification used in the publication that describes the structure.
pattern: Metal pattern for each metal bound by the structure.
is_representative: A representative site is a site selected to represent a cluster of equivalent sites. The selection is done by choosing the PDB structure with the best X-ray resolution among those containing the sites in the cluster. NMR structures are generally discarded in favor of X-ray structures, unless all the sites in the cluster are found in NMR structures.
auth_asym_id_ligand: Chain identifier of the metal-coordinating ligand residues, which is provided by the author of the structure in order to match the identification used in the publication that describes the structure.
auth_seq_id_ligand: Residue identifier of the metal-coordinating ligand residues, which is provided by the author of the structure in order to match the identification used in the publication that describes the structure.
auth_id_ligand: Unique structure atom identifier of the metal-coordinating ligand residues, which is provided by the author of the structure in order to match the identification used in the publication that describes the structure.
auth_atom_id_ligand: Unique residue specific atom identifier of the metal-coordinating ligand residues, which is provided by the author of the structure in order to match the identification used in the publication that describes the structure.
Examples
head(fetch_metal_pdb(id_value = c("P42345", "P00918")))
fetch_metal_pdb(id_type = "pdb", id_value = c("1g54"), metal = "Zn")
Fetch protein disorder and mobility information from MobiDB
Description
Fetches information about disordered and flexible protein regions from MobiDB.
Usage
fetch_mobidb(
  uniprot_ids = NULL,
  organism_id = NULL,
  show_progress = TRUE,
  timeout = 60,
  max_tries = 2
)
Arguments
uniprot_ids | optional, a character vector of UniProt identifiers for which information should be fetched. This argument is mutually exclusive to the |
organism_id | optional, a character value providing the NCBI taxonomy identifier (TaxId) of an organism for which all available information should be retrieved. This argument is mutually exclusive to the |
show_progress | a logical value; if |
timeout | a numeric value specifying the time in seconds until the download of an organism archive times out. The default is 60 seconds. |
max_tries | a numeric value that specifies the number of times the function tries to download the data in case an error occurs. The default is 2. |
Value
A data frame that contains start and end positions for disordered and flexible protein regions. The feature column contains information on the source of this annotation. More information on the source can be found here.
Examples
fetch_mobidb(
  uniprot_ids = c("P0A799", "P62707")
)
Fetch structure information from RCSB
Description
Fetches structure metadata from RCSB. If you want to retrieve atom data such as positions, use the function fetch_pdb_structure().
Usage
fetch_pdb(pdb_ids, batchsize = 100, show_progress = TRUE)
Arguments
pdb_ids | a character vector of PDB identifiers. |
batchsize | a numeric value that specifies the number of structures to be processed in a single query. Default is 100. |
show_progress | a logical value that indicates if a progress bar will be shown. Default is TRUE. |
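To illustrate what processing IDs in batches means, the sketch below splits a vector of identifiers into chunks of at most batchsize elements. It is only an assumption about the general idea, not the package's internal code; the batch size of 2 is chosen to keep the toy example small.

```r
pdb_ids <- c("6HG1", "1E9I", "6D3Q", "4JHW", "1G54")
batchsize <- 2

# Assign each ID to a batch index: 1, 1, 2, 2, 3
batch_index <- ceiling(seq_along(pdb_ids) / batchsize)

# A list with one character vector of IDs per batch
batches <- split(pdb_ids, batch_index)
```

Each list element could then be sent as one query, so five IDs with a batch size of two result in three queries.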
Value
A data frame that contains structure metadata for the PDB IDs provided. The data frame contains some columns that might not be self explanatory.
auth_asym_id: Chain identifier provided by the author of the structure in order to match the identification used in the publication that describes the structure.
label_asym_id: Chain identifier following the standardised convention for mmCIF files.
entity_beg_seq_id, ref_beg_seq_id, length, pdb_sequence: entity_beg_seq_id is a position in the structure sequence (pdb_sequence) that matches the position given in ref_beg_seq_id, which is a position within the protein sequence (not included in the data frame). length identifies the stretch of sequence for which positions match accordingly between structure and protein sequence. entity_beg_seq_id is a residue ID based on the standardised convention for mmCIF files.
auth_seq_id: Residue identifier provided by the author of the structure in order to match the identification used in the publication that describes the structure. This character vector has the same length as the pdb_sequence and each position is the identifier for the matching amino acid position in pdb_sequence. The contained values are not necessarily numbers and the values do not have to be positive.
modified_monomer: Is composed of first the composition ID of the modification, followed by the label_seq_id position. In parentheses are the parent monomer identifiers as they appear in the sequence.
ligand_*: Any column starting with the ligand_* prefix contains information about the position, identity and donors for ligand binding sites. If there are multiple entities of ligands they are separated by "|". Specific donor level information is separated by ";".
secondar_structure: Contains information about helix and sheet secondary structure elements. Individual regions are separated by ";".
unmodeled_structure: Contains information about unmodeled or partially modeled regions in the model. Individual regions are separated by ";".
auth_seq_id_original: In some cases the sequence positions do not match the number of residues in the sequence either because positions are missing or duplicated. This always coincides with modified residues, however does not always occur when there is a modified residue in the sequence. This column contains the original auth_seq_id information that does not have these positions corrected.
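The relationship between entity_beg_seq_id, ref_beg_seq_id and length can be expressed as a simple offset calculation. The helper below is hypothetical (map_to_structure and its argument len are illustrative names, not part of the package) and uses made-up values: a structure whose first residue corresponds to protein position 25 and that covers 100 residues.

```r
# Hypothetical helper: map protein (UniProt) positions onto the structure sequence.
# Positions outside the matched stretch are returned as NA.
map_to_structure <- function(uniprot_pos, ref_beg_seq_id, entity_beg_seq_id, len) {
  ref_end <- ref_beg_seq_id + len - 1
  inside <- uniprot_pos >= ref_beg_seq_id & uniprot_pos <= ref_end
  ifelse(inside, uniprot_pos - ref_beg_seq_id + entity_beg_seq_id, NA_real_)
}

# Toy values: protein position 25 corresponds to structure position 1
mapped <- map_to_structure(
  uniprot_pos = c(10, 25, 30, 124, 125),
  ref_beg_seq_id = 25,
  entity_beg_seq_id = 1,
  len = 100
)
mapped
```

In this toy case positions 10 and 125 fall outside the matched stretch (25 to 124) and therefore map to NA.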
Examples
pdb <- fetch_pdb(pdb_ids = c("6HG1", "1E9I", "6D3Q", "4JHW"))
head(pdb)
Fetch PDB structure atom data from RCSB
Description
Fetches atom data for a PDB structure from RCSB. If you want to retrieve metadata about PDB structures, use the function fetch_pdb(). The information retrieved is based on the .cif file of the structure, which may vary from the .pdb file.
Usage
fetch_pdb_structure(pdb_ids, return_data_frame = FALSE, show_progress = TRUE)
Arguments
pdb_ids | a character vector of PDB identifiers. |
return_data_frame | a logical value that indicates if a data frame instead of a list is returned. It is recommended to only use this if not many pdb structures are retrieved. Default is FALSE. |
show_progress | a logical value that indicates if a progress bar will be shown. Default is TRUE. |
Value
A list that contains atom data for each PDB structure provided. If return_data_frame is TRUE, a data frame with this information is returned instead. The data frame contains the following columns:
label_id: Uniquely identifies every atom in the structure following the standardised convention for mmCIF files. Example values: "5", "C12", "Ca3g28", "Fe3+17", "H*251", "boron2a", "C a phe 83 a 0", "Zn Zn 301 A 0".
type_symbol: The code used to identify the atom species representing this atom type. Normally this code is the element symbol. The code may be composed of any character except an underscore with the additional proviso that digits designate an oxidation state and must be followed by a + or - character. Example values: "C", "Cu2+", "H(SDS)", "dummy", "FeNi".
label_atom_id: Uniquely identifies every atom for the given residue following the standardised convention for mmCIF files. Example values: "CA", "HB1", "CB", "N".
label_comp_id: A chemical identifier for the residue. For protein polymer entities, this is the three-letter code for the amino acid. For nucleic acid polymer entities, this is the one-letter code for the base. Example values: "ala", "val", "A", "C".
label_asym_id: Chain identifier following the standardised convention for mmCIF files. Example values: "1", "A", "2B3".
entity_id: Records details about the molecular entities that are present in the crystallographic structure. Usually all different types of molecular entities such as polymer entities, non-polymer entities or water molecules are numbered once for each structure. Each type of non-polymer entity has its own number. Thus, the highest number in this column represents the number of different molecule types in the structure.
label_seq_id: Uniquely and sequentially identifies residues for each label_asym_id. This is always a number and the sequence of numbers always progresses in increasing numerical order.
x: The x coordinate of the atom.
y: The y coordinate of the atom.
z: The z coordinate of the atom.
site_occupancy: The fraction of the atom type present at this site.
b_iso_or_equivalent: Contains the B-factor or isotopic atomic displacement factor for each atom.
formal_charge: The net integer charge assigned to this atom. This is the formal charge assignment normally found in chemical diagrams. It is currently only assigned in a small subset of structures.
auth_seq_id: An alternative residue identifier (label_seq_id) provided by the author of the structure in order to match the identification used in the publication that describes the structure. This does not need to be numeric and is therefore of type character.
auth_comp_id: An alternative chemical identifier (label_comp_id) provided by the author of the structure in order to match the identification used in the publication that describes the structure.
auth_asym_id: An alternative chain identifier (label_asym_id) provided by the author of the structure in order to match the identification used in the publication that describes the structure.
pdb_model_number: The PDB model number.
pdb_id: The protein database identifier for the structure.
Examples
pdb_structure <- fetch_pdb_structure(
  pdb_ids = c("6HG1", "1E9I", "6D3Q", "4JHW"),
  return_data_frame = TRUE
)
head(pdb_structure, n = 10)
Fetch information from the QuickGO API
Description
Fetches gene ontology (GO) annotations, terms or slims from the QuickGO EBI database. Annotations can be retrieved for specific UniProt IDs or NCBI taxonomy identifiers. When terms are retrieved, a complete list of all GO terms is returned. For the generation of a slim dataset you can provide GO IDs that should be considered. A slim dataset is a subset GO dataset that considers all child terms of the supplied IDs.
Usage
fetch_quickgo(
  type = "annotations",
  id_annotations = NULL,
  taxon_id_annotations = NULL,
  ontology_annotations = "all",
  go_id_slims = NULL,
  relations_slims = c("is_a", "part_of", "regulates", "occurs_in"),
  timeout = 1200,
  max_tries = 2,
  show_progress = TRUE
)
Arguments
type | a character value that indicates if gene ontology terms, annotations or slims should be retrieved. The possible values therefore include "annotations", "terms" and "slims". If annotations are retrieved, the maximum number of results is 2,000,000. |
id_annotations | an optional character vector that specifies UniProt IDs for which GO annotations should be retrieved. This argument should only be provided if annotations are retrieved. |
taxon_id_annotations | an optional character value that specifies the NCBI taxonomy identifier (TaxId) for an organism for which GO annotations should be retrieved. This argument should only be provided if annotations are retrieved. |
ontology_annotations | an optional character value that specifies the ontology that should be retrieved. This can either have the values "all", "molecular_function", "biological_process" or "cellular_component". This argument should only be provided if annotations are retrieved. |
go_id_slims | an optional character vector that specifies gene ontology IDs (e.g. GO:0046872) for which a slim GO set should be generated. This argument should only be provided if slims are retrieved. |
relations_slims | an optional character vector that specifies the relations of GO IDs that should be considered for the generation of the slim dataset. This argument should only be provided if slims are retrieved. |
timeout | a numeric value specifying the time in seconds until the download times out. The default is 1200 seconds. |
max_tries | a numeric value that specifies the number of times the function tries to download the data in case an error occurs. The default is 2. |
show_progress | a logical value that indicates if a progress bar will be shown. Default is TRUE. |
Value
A data frame that contains descriptive information about gene ontology annotations, terms or slims depending on what the input "type" was.
Examples
# Annotations
# (full argument names are used here instead of relying on partial matching)
annotations <- fetch_quickgo(
  type = "annotations",
  id_annotations = c("P63328", "Q4FFP4"),
  ontology_annotations = "molecular_function"
)
head(annotations)
# Terms
terms <- fetch_quickgo(type = "terms")
head(terms)
# Slims
slims <- fetch_quickgo(
  type = "slims",
  go_id_slims = c("GO:0046872", "GO:0051540")
)
head(slims)
Fetch protein data from UniProt
Description
Fetches protein metadata from UniProt.
Usage
fetch_uniprot(
  uniprot_ids,
  columns = c(
    "protein_name", "length", "sequence", "gene_names", "xref_geneid",
    "xref_string", "go_f", "go_p", "go_c", "cc_interaction", "ft_act_site",
    "ft_binding", "cc_cofactor", "cc_catalytic_activity", "xref_pdb"
  ),
  batchsize = 200,
  max_tries = 10,
  timeout = 20,
  show_progress = TRUE
)
Arguments
uniprot_ids | a character vector of UniProt accession numbers. |
columns | a character vector of metadata columns that should be imported from UniProt (all possible columns can be found here). For cross-referenced databases provide the database name with the prefix "xref_", e.g. |
batchsize | a numeric value that specifies the number of proteins processed in a single query. Default and max value is 200. |
max_tries | a numeric value that specifies the number of times the function tries to download the data in case an error occurs. The default is 10. |
timeout | a numeric value that specifies the maximum request time per try. Default is 20 seconds. |
show_progress | a logical value that determines if a progress bar will be shown. Default is TRUE. |
Value
A data frame that contains all protein metadata specified in columns for the proteins provided. The input_id column contains the provided UniProt IDs. If an invalid ID was provided that contains a valid UniProt ID, the valid portion of the ID is still fetched and present in the accession column, while the input_id column contains the original not completely valid ID.
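How a valid accession can be recovered from a not completely valid ID can be sketched with a regular expression. The pattern below is the accession format published by UniProt; whether the package extracts the valid portion in exactly this way is an assumption made for illustration.

```r
# UniProt accession pattern as published in the UniProt help pages
uniprot_pattern <- "[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"

input_id <- c("P02545", "P02545;P20700")

# Find the first accession-like substring in each input
match_pos <- regexpr(uniprot_pattern, input_id)
accession <- rep(NA_character_, length(input_id))
accession[match_pos > 0] <- regmatches(input_id, match_pos)

data.frame(input_id = input_id, accession = accession)
```

Inputs without any accession-like substring keep NA in the accession column of this toy table.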
Examples
fetch_uniprot(c("P36578", "O43324", "Q00796"))
# Not completely valid ID
fetch_uniprot(c("P02545", "P02545;P20700"))
Fetch proteome data from UniProt
Description
Fetches proteome data from UniProt for the provided organism ID.
Usage
fetch_uniprot_proteome(
  organism_id,
  columns = c("accession"),
  reviewed = TRUE,
  timeout = 120,
  max_tries = 5
)
Arguments
organism_id | a numeric value that specifies the NCBI taxonomy identifier (TaxId) for an organism. |
columns | a character vector of metadata columns that should be imported from UniProt (all possible columns can be found here). For cross-referenced databases provide the database name with the prefix "xref_", e.g. |
reviewed | a logical value that determines if only reviewed protein entries will be retrieved. |
timeout | a numeric value specifying the time in seconds until the download times out. The default is 120 seconds. |
max_tries | a numeric value that specifies the number of times the function tries to download the data in case an error occurs. The default is 5. |
Value
A data frame that contains all protein metadata specified in columns for the organism of choice.
Examples
head(fetch_uniprot_proteome(9606))
Data filtering based on coefficients of variation (CV)
Description
Filters the input data based on precursor, peptide or protein intensity coefficients of variation. The function should be used to ensure that only robust measurements and quantifications are used for data analysis. It is advised to use the function after inspection of raw values (quality control) and median normalisation. Generally, the function calculates CVs of each peptide, precursor or protein for each condition and removes peptides, precursors or proteins whose CV is below the cutoff in fewer than the (user-defined) required number of conditions. Since the user-defined cutoff is fixed and does not depend on the number of conditions that have detected values, the function might bias for data completeness.
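The filtering logic can be sketched in base R on toy data. This is not the package's implementation; in particular, computing the CV on back-transformed (linear scale) intensities is an assumption, and all column names and values below are made up.

```r
toy <- data.frame(
  peptide = rep(c("pepA", "pepB"), each = 6),
  condition = rep(rep(c("ctrl", "treated"), each = 3), times = 2),
  log2_intensity = c(
    20.0, 20.1, 19.9, 21.0, 21.1, 20.9, # pepA: tight replicates in both conditions
    20.0, 22.0, 18.0, 21.0, 21.1, 20.9  # pepB: noisy replicates in "ctrl"
  )
)

# CV per peptide and condition, calculated on the linear scale
cv <- aggregate(
  log2_intensity ~ peptide + condition,
  data = toy,
  FUN = function(x) sd(2^x) / mean(2^x)
)
names(cv)[3] <- "cv"

# Count, per peptide, the conditions in which the CV is below the cutoff
cv_limit <- 0.25
min_conditions <- 2
n_pass <- tapply(cv$cv < cv_limit, cv$peptide, sum)

# Retain peptides that pass in at least min_conditions conditions
keep <- names(n_pass)[n_pass >= min_conditions]
toy_filtered <- toy[toy$peptide %in% keep, ]
```

Here pepB is removed because its CV is below the cutoff in only one of the two required conditions.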
Usage
filter_cv(
  data,
  grouping,
  condition,
  log2_intensity,
  cv_limit = 0.25,
  min_conditions,
  silent = FALSE
)
Arguments
data | a data frame that contains at least the input variables. |
grouping | a character column in the |
condition | a character or numeric column in the |
log2_intensity | a numeric column in the |
cv_limit | optional, a numeric value that specifies the CV cutoff that will be applied. Default is 0.25. |
min_conditions | a numeric value that specifies the minimum number of conditions for which grouping CVs should be below the cutoff. |
silent | a logical value that specifies if a message with the number of filtered out conditions should be returned. Default is FALSE. |
Value
The CV filtered data frame.
Examples
set.seed(123) # Makes example reproducible
# Create synthetic data
data <- create_synthetic_data(
  n_proteins = 50,
  frac_change = 0.05,
  n_replicates = 3,
  n_conditions = 2,
  method = "effect_random",
  additional_metadata = FALSE
)
# Filter coefficients of variation
data_filtered <- filter_cv(
  data = data,
  grouping = peptide,
  condition = condition,
  log2_intensity = peptide_intensity_missing,
  cv_limit = 0.25,
  min_conditions = 2
)
Find all sub IDs of an ID in a network
Description
For a given ID, find all sub IDs and their sub IDs etc. The type of relationship can be selected too. This is a helper function for other functions.
Usage
find_all_subs(
  data,
  ids,
  main_id = id,
  type = type,
  accepted_types = "is_a",
  exclude_parent_id = FALSE
)
Arguments
data | a data frame that contains relational information on IDs (main_id) their sub IDs (sub_id) and their relationship (type). For ChEBI this data frame can be obtained by calling |
ids | a character vector of IDs for which sub IDs should be searched. |
main_id | a character or integer column containing IDs. Default is |
type | a character column that contains the type of interactions. Default is |
accepted_types | a character vector containing the accepted_types of relationships that should be considered for the search. It is possible to use "all" relationships. The default type is "is_a". A list of possible relationships for e.g. ChEBI IDs can be found here. |
exclude_parent_id | a logical value that specifies if the parent ID should be excluded from the returned list. Default is FALSE. |
Value
A list of character vectors containing the provided ID and all of its sub IDs. It contains one element per input ID.
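The search for sub IDs and their sub IDs amounts to a traversal of the relation table. The sketch below uses a hypothetical helper (collect_subs) and a toy table whose column names id, sub_id and type are illustrative; it is not the package's implementation.

```r
# Toy relation table (made-up IDs; column names are illustrative)
relations <- data.frame(
  id = c("A", "A", "B", "C", "A"),
  sub_id = c("B", "C", "D", "E", "F"),
  type = c("is_a", "is_a", "is_a", "is_a", "has_part")
)

# Hypothetical helper: breadth-first collection of all descendants of one ID
collect_subs <- function(relations, start_id, accepted_types = "is_a") {
  edges <- relations[relations$type %in% accepted_types, ]
  found <- start_id
  frontier <- start_id
  while (length(frontier) > 0) {
    children <- edges$sub_id[edges$id %in% frontier]
    frontier <- setdiff(children, found) # skip IDs that were already visited
    found <- c(found, frontier)
  }
  found
}

subs_of_a <- collect_subs(relations, "A")
subs_of_a
```

"F" is not part of the result because it is only connected through a "has_part" relationship, which is not among the accepted types in this call.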
Find ChEBI IDs for name patterns
Description
Search for ChEBI IDs that match a specific name pattern. A list of corresponding ChEBI IDs is returned.
Usage
find_chebis(chebi_data, pattern)
Arguments
chebi_data | a data frame that contains at least information on ChEBI IDs (id) and their names (name). This data frame can be obtained by calling |
pattern | a character vector that contains names or name patterns of molecules. Name patterns can be for example obtained with the |
Value
A list of character vectors containing ChEBI IDs that have a name matching the supplied pattern. It contains one element per pattern.
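The shape of the returned list can be illustrated with base R pattern matching. The table below is a made-up stand-in for ChEBI data (IDs and names are not real entries), and the matching via grepl() is an assumption about the general approach.

```r
# Toy stand-in for ChEBI data (IDs and names are made up for illustration)
chebi_toy <- data.frame(
  id = c("1", "2", "3", "4"),
  name = c("zinc(2+)", "zinc atom", "iron(3+)", "water")
)
patterns <- c("^zinc", "iron")

# One list element per pattern, each holding the IDs with a matching name
matches <- lapply(patterns, function(p) chebi_toy$id[grepl(p, chebi_toy$name)])
names(matches) <- patterns
matches
```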
Find peptide location
Description
The position of the given peptide sequence is searched within the given protein sequence. In addition the last amino acid of the peptide and the amino acid right before are reported.
Usage
find_peptide(data, protein_sequence, peptide_sequence)
Arguments
data | a data frame that contains at least the protein and peptide sequence. |
protein_sequence | a character column in the |
peptide_sequence | a character column in the |
Value
A data frame that contains the input data and four additional columns with peptide start and end position, the last amino acid and the amino acid before the peptide.
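How these four values relate to each other can be sketched in base R. The helper `locate_peptide()` and its output column names are hypothetical and only illustrate the idea; they are not taken from the package.

```r
# Hypothetical helper (not part of protti): locate each peptide in its
# protein and report start, end, the residue before and the last residue.
locate_peptide <- function(protein_sequence, peptide_sequence) {
  # First literal (non-regex) match of each peptide in its protein
  start <- mapply(
    function(protein, peptide) as.integer(regexpr(peptide, protein, fixed = TRUE)),
    protein_sequence, peptide_sequence,
    USE.NAMES = FALSE
  )
  start[start == -1L] <- NA_integer_ # peptide not found
  end <- start + nchar(peptide_sequence) - 1L

  aa_before <- substr(protein_sequence, start - 1L, start - 1L)
  # A peptide at the protein N-terminus has no preceding residue
  aa_before[!is.na(aa_before) & aa_before == ""] <- NA_character_

  data.frame(
    start = start,
    end = end,
    aa_before = aa_before,
    last_aa = substr(protein_sequence, end, end),
    stringsAsFactors = FALSE
  )
}

location <- locate_peptide("abcdefg", "cde")
```

For the toy input, the peptide "cde" spans positions 3 to 5, is preceded by "b" and ends with "e".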
Examples
# Create example data
data <- data.frame(
  protein_sequence = c("abcdefg"),
  peptide_sequence = c("cde")
)

# Find peptide
find_peptide(
  data = data,
  protein_sequence = protein_sequence,
  peptide_sequence = peptide_sequence
)
Finds peptide positions in a PDB structure based on positional matching
Description
Finds peptide positions in a PDB structure. Often positions of peptides in UniProt and a PDB structure are different due to different lengths of structures. This function maps a peptide based on its UniProt positions onto a PDB structure. This method is superior to sequence alignment of the peptide to the PDB structure sequence, since it can also match the peptide if there are truncations or mismatches. This function also provides an easy way to check if a peptide is present in a PDB structure.
Usage
find_peptide_in_structure(
  peptide_data,
  peptide,
  start,
  end,
  uniprot_id,
  pdb_data = NULL,
  retain_columns = NULL
)
Arguments
peptide_data | a data frame containing at least the input columns to this function. |
peptide | a character column in the |
start | a numeric column in the |
end | a numeric column in the |
uniprot_id | a character column in the |
pdb_data | optional, a data frame containing data obtained with |
retain_columns | a vector indicating if certain columns should be retained from the input data frame. Default is not retaining additional columns |
Value
A data frame that contains peptide positions in the corresponding PDB structures. If a peptide is not found in any structure or no structure is associated with the protein, the data frame contains NA values for the output columns. The data frame contains the following and additional columns:
auth_asym_id: Chain identifier provided by the author of the structure in order to match the identification used in the publication that describes the structure.
label_asym_id: Chain identifier following the standardised convention for mmCIF files.
peptide_seq_in_pdb: The sequence of the peptide mapped to the structure. If the peptide only maps partially, then only the part of the sequence that maps on the structure is returned.
fit_type: The fit type is either "partial" or "fully" and it indicates if the complete peptide or only part of it was found in the structure.
label_seq_id_start: Contains the first residue position of the peptide in the structure following the standardised convention for mmCIF files.
label_seq_id_end: Contains the last residue position of the peptide in the structure following the standardised convention for mmCIF files.
auth_seq_id_start: Contains the first residue position of the peptide in the structure based on the alternative residue identifier provided by the author of the structure in order to match the identification used in the publication that describes the structure. This does not need to be numeric and is therefore of type character.
auth_seq_id_end: Contains the last residue position of the peptide in the structure based on the alternative residue identifier provided by the author of the structure in order to match the identification used in the publication that describes the structure. This does not need to be numeric and is therefore of type character.
auth_seq_id: Contains all positions (separated by ";") of the peptide in the structure based on the alternative residue identifier provided by the author of the structure in order to match the identification used in the publication that describes the structure. This does not need to be numeric and is therefore of type character.
n_peptides: The number of peptides from one protein that were searched for within the current structure.
n_peptides_in_structure: The number of peptides from one protein that were found within the current structure.
Examples
# Create example data
peptide_data <- data.frame(
  uniprot_id = c("P0A8T7", "P0A8T7", "P60906"),
  peptide_sequence = c(
    "SGIVSFGKETKGKRRLVITPVDGSDPYEEMIPKWRQLNV",
    "NVFEGERVER",
    "AIGEVTDVVEKE"
  ),
  start = c(1160, 1197, 55),
  end = c(1198, 1206, 66)
)

# Find peptides in protein structure
peptide_in_structure <- find_peptide_in_structure(
  peptide_data = peptide_data,
  peptide = peptide_sequence,
  start = start,
  end = end,
  uniprot_id = uniprot_id
)

head(peptide_in_structure, n = 10)
Fitting four-parameter dose response curves
Description
Function for fitting four-parameter dose response curves for each group (precursor, peptide or protein). In addition it can annotate data based on completeness, the completeness distribution and statistical testing using ANOVA. Filtering by the function is only performed based on completeness if selected.
Usage
fit_drc_4p(
  data,
  sample,
  grouping,
  response,
  dose,
  filter = "post",
  replicate_completeness = 0.7,
  condition_completeness = 0.5,
  n_replicate_completeness = NULL,
  n_condition_completeness = NULL,
  complete_doses = NULL,
  anova_cutoff = 0.05,
  correlation_cutoff = 0.8,
  log_logarithmic = TRUE,
  include_models = FALSE,
  retain_columns = NULL
)
Arguments
Details
If data filtering options are selected, data is annotated based on multiple criteria. If "post" is selected the data is annotated based on completeness, the completeness distribution, the adjusted ANOVA p-value cutoff and a correlation cutoff. Completeness of features is determined based on the n_replicate_completeness and n_condition_completeness arguments. The completeness distribution determines if there is a distribution of not random missingness of data along the dose. For this it is checked if half of a feature's values (+/-1 value) pass the replicate completeness criteria and half do not pass it. In order to fall into this category, the values that fulfill the completeness cutoff and the ones that do not fulfill it need to be consecutive, meaning located next to each other based on their concentration values. Furthermore, the values that do not pass the completeness cutoff need to be lower in intensity. Lastly, the difference between the two groups is tested for statistical significance using a Welch's t-test and a cutoff of p <= 0.1 (we want to mainly discard curves that falsely fit the other criteria but that have clearly non-significant differences in mean). This allows curves to be considered that have missing values in half of their observations due to a decrease in intensity. It can be thought of as conditions that are missing not at random (MNAR). It is often the case that those entities do not have a significant p-value since half of their conditions are not considered due to data missingness. The ANOVA test is performed on the features by concentration. If it is significant it is likely that there is some response. However, this test would also be significant even if there is one outlier concentration, so it should only be used in combination with other cutoffs to determine if a feature is significant.
The passed_filter column is TRUE for all the features that pass the above mentioned criteria and that have a correlation greater than the cutoff (default is 0.8) and the adjusted ANOVA p-value below the cutoff (default is 0.05).
The final list is ranked based on a score calculated on entities that pass the filter. The score is the negative log10 of the adjusted ANOVA p-value scaled between 0 and 1 and the correlation scaled between 0 and 1 summed up and divided by 2. Thus, the highest score an entity can have is 1 with both the highest correlation and the lowest adjusted p-value. The rank corresponds to this score. Please note that entities with MNAR conditions might have a lower score due to the missing or non-significant ANOVA p-value. If no score could be calculated the usual way these cases receive a score of 0. You should have a look at curves that are TRUE for dose_MNAR in more detail.
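The score described above can be sketched in a few lines of base R. This is an illustration only: the helpers `rescale01()` and `score_fits()` are hypothetical, and min-max scaling is assumed for "scaled between 0 and 1" because the text does not state which scaling is used.

```r
# Hypothetical helper: min-max scaling to the range 0 to 1 (assumed scaling)
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Hypothetical helper: mean of the scaled -log10 adjusted p-value and the
# scaled correlation, ranked from the highest to the lowest score.
score_fits <- function(correlation, anova_adj_pval) {
  score <- (rescale01(-log10(anova_adj_pval)) + rescale01(correlation)) / 2
  score[is.na(score)] <- 0 # no score could be calculated
  data.frame(
    correlation = correlation,
    anova_adj_pval = anova_adj_pval,
    score = score,
    rank = rank(-score, ties.method = "min")
  )
}

fits <- score_fits(
  correlation = c(0.95, 0.85, 0.90),
  anova_adj_pval = c(0.001, 0.04, 0.01)
)
```

In this toy input the first entity has both the highest correlation and the lowest adjusted p-value, so it receives the maximum score of 1 and rank 1.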
If the "pre" option is selected for the filter argument then the data is filtered for completeness prior to curve fitting and the ANOVA test. Otherwise annotation is performed exactly as mentioned above. We recommend the "pre" option because it leaves you with not only the likely hits of your treatment, but also with rather high confidence true negative results. This is because the filtered data has a high degree of completeness, making it unlikely that a real dose-response curve is missed due to data missingness.
Please note that in general, curves are only fitted if there are at least 5 conditions with data points present to ensure that there is potential for a good curve fit. This is done independent of the selected filtering option.
Value
If include_models = FALSE a data frame is returned that contains correlations of predicted to measured values as a measure of the goodness of the curve fit, an associated p-value and the four parameters of the model for each group. Furthermore, input data for plots is returned in the columns plot_curve (curve and confidence interval) and plot_points (measured points). If include_models = TRUE, a list is returned that contains:
fit_objects: The fit objects of type drc for each group.
correlations: The correlation data frame described above.
Examples
# Load libraries
library(dplyr)

set.seed(123) # Makes example reproducible

# Create example data
data <- create_synthetic_data(
  n_proteins = 2,
  frac_change = 1,
  n_replicates = 3,
  n_conditions = 8,
  method = "dose_response",
  concentrations = c(0, 1, 10, 50, 100, 500, 1000, 5000),
  additional_metadata = FALSE
)

# Perform dose response curve fit
drc_fit <- fit_drc_4p(
  data = data,
  sample = sample,
  grouping = peptide,
  response = peptide_intensity_missing,
  dose = concentration,
  n_replicate_completeness = 2,
  n_condition_completeness = 5,
  retain_columns = c(protein, change_peptide)
)

glimpse(drc_fit)

head(drc_fit, n = 10)
Perform gene ontology enrichment analysis
Description
This function was deprecated due to its name changing to
calculate_go_enrichment().
Usage
go_enrichment(...)
Value
A bar plot displaying negative log10 adjusted p-values for the top 10 enriched or depleted gene ontology terms. Alternatively, plot cutoffs can be chosen individually with the plot_cutoff argument. Bars are coloured according to the direction of the enrichment (enriched or depleted). If plot = FALSE, a data frame is returned. P-values are adjusted with Benjamini-Hochberg.
Imputation of missing values
Description
impute calculates imputation values for missing data depending on the selected method.
Usage
impute(
  data,
  sample,
  grouping,
  intensity_log2,
  condition,
  comparison = comparison,
  missingness = missingness,
  noise = NULL,
  method = "ludovic",
  skip_log2_transform_error = FALSE,
  retain_columns = NULL
)
Arguments
data | a data frame that is ideally the output from the |
sample | a character column in the |
grouping | a character column in the |
intensity_log2 | a numeric column in the |
condition | a character or numeric column in the |
comparison | a character column in the |
missingness | a character column in the |
noise | a numeric column in the |
method | a character value that specifies the method to be used for imputation. For |
skip_log2_transform_error | a logical value that determines if a check is performed to validate that input values are log2 transformed. If input values are > 40 the test is failed and an error is returned. |
retain_columns | a vector that indicates columns that should be retained from the input data frame. Default is not retaining additional columns |
Value
A data frame that contains an imputed_intensity and imputed column in addition to the required input columns. The imputed column indicates if a value was imputed. The imputed_intensity column contains imputed intensity values for previously missing intensities.
Examples
set.seed(123) # Makes example reproducible

# Create example data
data <- create_synthetic_data(
  n_proteins = 10,
  frac_change = 0.5,
  n_replicates = 4,
  n_conditions = 2,
  method = "effect_random",
  additional_metadata = FALSE
)

head(data, n = 24)

# Assign missingness information
data_missing <- assign_missingness(
  data,
  sample = sample,
  condition = condition,
  grouping = peptide,
  intensity = peptide_intensity_missing,
  ref_condition = "all",
  retain_columns = c(protein, peptide_intensity)
)

head(data_missing, n = 24)

# Perform imputation
data_imputed <- impute(
  data_missing,
  sample = sample,
  grouping = peptide,
  intensity_log2 = peptide_intensity_missing,
  condition = condition,
  comparison = comparison,
  missingness = missingness,
  method = "ludovic",
  retain_columns = c(protein, peptide_intensity)
)

head(data_imputed, n = 24)
Perform KEGG pathway enrichment analysis
Description
This function was deprecated due to its name changing to
calculate_kegg_enrichment().
Usage
kegg_enrichment(...)
Value
A bar plot displaying negative log10 adjusted p-values for the top 10 enriched pathways. Bars are coloured according to the direction of the enrichment. If plot = FALSE, a data frame is returned.
Mako colour scheme
Description
A perceptually uniform colour scheme originally created for the Seaborn python package.
Usage
mako_colours
Format
A vector containing 256 colours
Source
created for the Seaborn statistical data visualization package for Python
Maps peptides onto a PDB structure or AlphaFold prediction
Description
Peptides are mapped onto PDB structures or AlphaFold predictions based on their positions. This is accomplished by replacing the B-factor information in the structure file with values that allow highlighting of peptides, protein regions or amino acids when the structure is coloured by B-factor. In addition to simply highlighting peptides, protein regions or amino acids, a continuous variable such as fold changes associated with them can be mapped onto the structure as a colour gradient.
Usage
map_peptides_on_structure(
  peptide_data,
  uniprot_id,
  pdb_id,
  chain,
  auth_seq_id,
  map_value,
  file_format = ".cif",
  scale_per_structure = TRUE,
  export_location = NULL,
  structure_file = NULL,
  show_progress = TRUE
)
Arguments
peptide_data | a data frame that contains the input columns to this function. If structure or prediction files should be fetched automatically, please provide column names to the following arguments: uniprot_id, pdb_id, chain, auth_seq_id, map_value. If no PDB structure for a protein is available the |
uniprot_id | a character column in the |
pdb_id | a character column in the |
chain | a character column in the |
auth_seq_id | optional, a character (or numeric) column in the |
map_value | a numeric column in the |
file_format | a character vector containing the file format of the structure that will be fetched from the database for the PDB identifiers provided in the |
scale_per_structure | a logical value that specifies if scaling should be performed for each structure independently (TRUE) or over the whole data set (FALSE). The default is TRUE, which scales the scores of each structure independently so that each structure has a score range from 50 to 100. |
export_location | optional, a character argument specifying the path to the location in which the fetched and altered structure files should be saved. If left empty, they will be saved in the current working directory. The location should be provided in the following format "folderA/folderB". |
structure_file | optional, a character argument specifying the path to the location and name of a structure file in ".cif" or ".pdb" format. If a structure is provided the |
show_progress | a logical, if |
Value
The function exports a modified ".pdb" or ".cif" structure file. B-factors have been replaced with scaled (50-100) values provided in the map_value column.
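The difference between scaling per structure and scaling over the whole data set can be sketched in base R. This is a minimal illustration that assumes linear min-max scaling to the range 50 to 100; the helper `scale_to_bfactor()` and the toy structure IDs are hypothetical, and the package's exact scaling may differ.

```r
# Hypothetical helper: linearly rescale values to the range lower..upper
scale_to_bfactor <- function(values, lower = 50, upper = 100) {
  rng <- range(values, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    # All values identical: assign the upper bound (arbitrary choice)
    return(rep(upper, length(values)))
  }
  lower + (values - rng[1]) / (rng[2] - rng[1]) * (upper - lower)
}

# Toy data with made-up structure IDs
peptides <- data.frame(
  pdb_id = c("X1", "X1", "X1", "X2", "X2"),
  map_value = c(-2, 0, 2, 10, 30),
  stringsAsFactors = FALSE
)

# Comparable to scale_per_structure = TRUE: each structure spans 50 to 100
per_structure <- ave(peptides$map_value, peptides$pdb_id, FUN = scale_to_bfactor)

# Comparable to scale_per_structure = FALSE: one shared scale for all structures
whole_data_set <- scale_to_bfactor(peptides$map_value)
```

With a shared scale, values can be compared between structures, while per-structure scaling uses the full colour range within each structure.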
Examples
# Load libraries
library(dplyr)

# Create example data
peptide_data <- data.frame(
  uniprot_id = c("P0A8T7", "P0A8T7", "P60906"),
  peptide_sequence = c(
    "SGIVSFGKETKGKRRLVITPVDGSDPYEEMIPKWRQLNV",
    "NVFEGERVER",
    "AIGEVTDVVEKE"
  ),
  start = c(1160, 1197, 55),
  end = c(1198, 1206, 66),
  map_value = c(70, 100, 100)
)

# Find peptide positions in structures
positions_structure <- find_peptide_in_structure(
  peptide_data = peptide_data,
  peptide = peptide_sequence,
  start = start,
  end = end,
  uniprot_id = uniprot_id,
  retain_columns = c(map_value)
) %>%
  filter(pdb_ids %in% c("6UU2", "2EL9"))

# Map peptides on structures
# You can determine the preferred output location
# with the export_location argument. Currently it
# is saved in the working directory.
map_peptides_on_structure(
  peptide_data = positions_structure,
  uniprot_id = uniprot_id,
  pdb_id = pdb_ids,
  chain = auth_asym_id,
  auth_seq_id = auth_seq_id,
  map_value = map_value,
  file_format = ".pdb",
  export_location = getwd()
)
Intensity normalisation
Description
This function was deprecated due to its name changing to
normalise(). The normalisation method in the new function needs to be provided as an argument.
Usage
median_normalisation(...)
Value
A data frame with a column called normalised_intensity_log2 containing the normalised intensity values.
List of metal-related ChEBI IDs in UniProt
Description
A list that contains all ChEBI IDs that appear in UniProt and that contain either a metal atom in their formula or that do not have a formula but the ChEBI term is related to metals. This was last updated on the 19/02/24.
Usage
metal_chebi_uniprot
Format
A data.frame containing information retrieved from ChEBI using fetch_chebi(stars = c(2, 3)), filtered using symbols in the metal_list and manual annotation of metal related ChEBI IDs that do not contain a formula.
Source
UniProt (cc_cofactor, cc_catalytic_activity, ft_binding) and ChEBI
Molecular function gene ontology metal subset
Description
A subset of molecular function gene ontology terms related to metals that was created using the slimming process provided by the QuickGO EBI database. This was last updated on the 19/02/24.
Usage
metal_go_slim_subset
Format
A data.frame containing a slim subset of molecular function gene ontology terms that are related to metal binding. The slims_from_id column contains all IDs relevant in this subset while the slims_to_ids column contains the starting IDs. If ChEBI IDs have been annotated manually this is indicated in the database column.
Source
QuickGO and ChEBI
List of metals
Description
A list of all metals and metalloids in the periodic table.
Usage
metal_list
Format
A data.frame containing the columns atomic_number, symbol, name, type, chebi_id.
Source
https://en.wikipedia.org/wiki/Metal and https://en.wikipedia.org/wiki/Metalloid
Analyse protein interaction network for significant hits
Description
This function was deprecated due to its name changing to
analyse_functional_network().
Usage
network_analysis(...)
Value
A network plot displaying interactions of the provided proteins. If binds_treatment was provided, halos around the proteins show which proteins interact with the treatment. If plot = FALSE a data frame with interaction information is returned.
Intensity normalisation
Description
Performs normalisation on intensities. For median normalisation the normalised intensity is the original intensity minus the run median plus the global median. This is also the way it is implemented in the Spectronaut search engine.
Usage
normalise(data, sample, intensity_log2, method = "median")
Arguments
data | a data frame containing at least sample names and intensity values. Please note that if the data frame is grouped, the normalisation will be computed by group. |
sample | a character column in the |
intensity_log2 | a numeric column in the |
method | a character value specifying the method to be used for normalisation. Default is "median". |
Value
A data frame with a column called normalised_intensity_log2 containing the normalised intensity values.
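The median normalisation formula from the description (original intensity minus the run median plus the global median) can be written out in base R. This is a sketch rather than the package's code: the helper `median_normalise()` is hypothetical, and taking the global median as the median of the run medians is an assumption.

```r
# Hypothetical helper: median normalisation of log2 intensities
median_normalise <- function(intensity_log2, sample) {
  # Median of each run, repeated for every observation of that run
  run_median <- ave(
    intensity_log2, sample,
    FUN = function(x) median(x, na.rm = TRUE)
  )
  # Assumption: the global median is the median of the run medians
  global_median <- median(
    tapply(intensity_log2, sample, median, na.rm = TRUE),
    na.rm = TRUE
  )
  intensity_log2 - run_median + global_median
}

data <- data.frame(
  r_file_name = c("s1", "s2", "s3", "s1", "s2", "s3"),
  intensity_log2 = c(18, 19, 17, 20, 21, 19)
)

data$normalised_intensity_log2 <- median_normalise(
  data$intensity_log2,
  data$r_file_name
)
```

After normalisation all three runs share the same median (19), so the systematic shift between runs is removed while the spread within each run is kept.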
Examples
data <- data.frame(
  r_file_name = c("s1", "s2", "s3", "s1", "s2", "s3"),
  intensity_log2 = c(18, 19, 17, 20, 21, 19)
)

normalise(data,
  sample = r_file_name,
  intensity_log2 = intensity_log2,
  method = "median"
)
Creates a contact map of all atoms from a structure file (using parallel processing)
Description
This function is a wrapper around create_structure_contact_map() that allows the use of all system cores for the creation of contact maps. Alternatively, it can be used for sequential processing of large datasets. The benefit of this function over create_structure_contact_map() is that it processes contact maps in batches, which is recommended for large datasets. If used for parallel processing it should only be used on systems that have enough memory available. Workers can either be set up manually before running the function with future::plan(multisession) or automatically by the function (maximum number of workers is 12 in this case). If workers are set up manually the processing_type argument should be set to "parallel manual". In this case workers can be terminated after completion with future::plan(sequential).
Usage
parallel_create_structure_contact_map(
  data,
  data2 = NULL,
  id,
  chain = NULL,
  auth_seq_id = NULL,
  distance_cutoff = 10,
  pdb_model_number_selection = c(0, 1),
  return_min_residue_distance = TRUE,
  export = FALSE,
  export_location = NULL,
  split_n = 40,
  processing_type = "parallel"
)
Arguments
data | a data frame containing at least a column with PDB ID information of which the name can be provided to the |
data2 | optional, a data frame that contains a subset of regions for which distances to regions provided in the |
id | a character column in the |
chain | optional, a character column in the |
auth_seq_id | optional, a character (or numeric) column in the |
distance_cutoff | a numeric value specifying the distance cutoff in Angstrom. All values for pairwise comparisons are calculated but only values smaller than this cutoff will be returned in the output. If a cutoff of e.g. 5 is selected then only residues with a distance of 5 Angstrom and less are returned. Using a small value can reduce the size of the contact map drastically and is therefore recommended. The default value is 10. |
pdb_model_number_selection | a numeric vector specifying which models from the structure files should be considered for contact maps. E.g. NMR structures often have many models in one file. The default for this argument is c(0, 1). This means the first model of each structure file is selected for contact map calculations. For AlphaFold predictions the model number is 0 (only .pdb files), therefore this case is also included here. |
return_min_residue_distance | a logical value that specifies if the contact map should be returned for all atom distances or the minimum residue distances. Minimum residue distances are smaller in size. If atom distances are not strictly needed it is recommended to set this argument to TRUE. The default is TRUE. |
export | a logical value that indicates if contact maps should be exported as ".csv". The name of the file will be the structure ID. Default is |
export_location | optional, a character value that specifies the path to the location in which the contact map should be saved if |
split_n | a numeric value that specifies the number of structures that should be included in each batch. Default is 40. |
processing_type | a character value that is either "parallel" for parallel processing or "sequential" for sequential processing. Alternatively it can also be "parallel manual"; in this case you have to set up the number of cores on your own using the |
Value
A list of contact maps for each PDB or UniProt ID provided in the input is returned. If the export argument is TRUE, each contact map will be saved as a ".csv" file in the current working directory or the location provided to the export_location argument.
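The batching controlled by split_n can be pictured with a small base R sketch. It only illustrates the idea of splitting structure IDs into batches and collecting one result per ID; the helper `split_into_batches()` is hypothetical, the per-structure step is a placeholder, and whether the package batches unique IDs in exactly this way is an assumption.

```r
# Hypothetical helper: split unique structure IDs into batches of split_n
split_into_batches <- function(ids, split_n = 40) {
  unique_ids <- unique(ids)
  batch_index <- ceiling(seq_along(unique_ids) / split_n)
  unname(split(unique_ids, batch_index))
}

batches <- split_into_batches(c("6NPF", "1C14", "3NIR", "6NPF"), split_n = 2)

# Process one batch at a time; the inner function is a placeholder for the
# actual contact map calculation of a single structure.
results <- unlist(
  lapply(batches, function(batch) {
    setNames(
      lapply(batch, function(id) paste("contact map for", id)),
      batch
    )
  }),
  recursive = FALSE
)
```

Processing in batches limits how many structures are held in memory at once, which is the benefit described for large datasets.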
Examples
## Not run:
# Create example data
data <- data.frame(
  pdb_id = c("6NPF", "1C14", "3NIR"),
  chain = c("A", "A", NA),
  auth_seq_id = c("1;2;3;4;5;6;7", NA, NA)
)

# Create contact map
contact_maps <- parallel_create_structure_contact_map(
  data = data,
  id = pdb_id,
  chain = chain,
  auth_seq_id = auth_seq_id,
  split_n = 1
)

str(contact_maps[["3NIR"]])

contact_maps
## End(Not run)
Fitting four-parameter dose response curves (using parallel processing)
Description
This function is a wrapper around fit_drc_4p that allows the use of all system cores for model fitting. It should only be used on systems that have enough memory available. Workers can either be set up manually before running the function with future::plan(multisession) or automatically by the function (maximum number of workers is 12 in this case). If workers are set up manually the number of cores should be provided to n_cores. Workers can be terminated after completion with future::plan(sequential). Unlike the non-parallel function, this function cannot export the individual fit objects, as they are too large for efficient export from the workers.
Usage
parallel_fit_drc_4p(
  data,
  sample,
  grouping,
  response,
  dose,
  filter = "post",
  replicate_completeness = 0.7,
  condition_completeness = 0.5,
  n_replicate_completeness = NULL,
  n_condition_completeness = NULL,
  complete_doses = NULL,
  anova_cutoff = 0.05,
  correlation_cutoff = 0.8,
  log_logarithmic = TRUE,
  retain_columns = NULL,
  n_cores = NULL
)
Arguments
Details
If data filtering options are selected, data is annotated based on multiple criteria. If "post" is selected the data is annotated based on completeness, the completeness distribution, the adjusted ANOVA p-value cutoff and a correlation cutoff. Completeness of features is determined based on the n_replicate_completeness and n_condition_completeness arguments. The completeness distribution determines if there is a distribution of not random missingness of data along the dose. For this it is checked if half of a feature's values (+/-1 value) pass the replicate completeness criteria and half do not pass it. In order to fall into this category, the values that fulfill the completeness cutoff and the ones that do not fulfill it need to be consecutive, meaning located next to each other based on their concentration values. Furthermore, the values that do not pass the completeness cutoff need to be lower in intensity. Lastly, the difference between the two groups is tested for statistical significance using a Welch's t-test and a cutoff of p <= 0.1 (we want to mainly discard curves that falsely fit the other criteria but that have clearly non-significant differences in mean). This allows curves to be considered that have missing values in half of their observations due to a decrease in intensity. It can be thought of as conditions that are missing not at random (MNAR). It is often the case that those entities do not have a significant p-value since half of their conditions are not considered due to data missingness. The ANOVA test is performed on the features by concentration. If it is significant it is likely that there is some response. However, this test would also be significant even if there is one outlier concentration, so it should only be used in combination with other cutoffs to determine if a feature is significant.
The passed_filter column is TRUE for all the features that pass the above mentioned criteria and that have a correlation greater than the cutoff (default is 0.8) and the adjusted ANOVA p-value below the cutoff (default is 0.05).
The final list is ranked based on a score calculated on entities that pass the filter. The score is the negative log10 of the adjusted ANOVA p-value scaled between 0 and 1 and the correlation scaled between 0 and 1 summed up and divided by 2. Thus, the highest score an entity can have is 1 with both the highest correlation and the lowest adjusted p-value. The rank corresponds to this score. Please note that entities with MNAR conditions might have a lower score due to the missing or non-significant ANOVA p-value. If no score could be calculated the usual way these cases receive a score of 0. You should have a look at curves that are TRUE for dose_MNAR in more detail.
If the "pre" option is selected for the filter argument then the data is filtered for completeness prior to curve fitting and the ANOVA test. Otherwise annotation is performed exactly as mentioned above. We recommend the "pre" option because it leaves you with not only the likely hits of your treatment, but also with rather high confidence true negative results. This is because the filtered data has a high degree of completeness, making it unlikely that a real dose-response curve is missed due to data missingness.
Please note that in general, curves are only fitted if there are at least 5 conditions with data points present to ensure that there is potential for a good curve fit. This is done independent of the selected filtering option.
Value
A data frame is returned that contains correlations of predicted to measured values as a measure of the goodness of the curve fit, an associated p-value and the four parameters of the model for each group. Furthermore, input data for plots is returned in the columns plot_curve (curve and confidence interval) and plot_points (measured points).
Examples
## Not run:
# Load libraries
library(dplyr)

set.seed(123) # Makes example reproducible

# Create example data
data <- create_synthetic_data(
  n_proteins = 2,
  frac_change = 1,
  n_replicates = 3,
  n_conditions = 8,
  method = "dose_response",
  concentrations = c(0, 1, 10, 50, 100, 500, 1000, 5000),
  additional_metadata = FALSE
)

# Perform dose response curve fit
drc_fit <- parallel_fit_drc_4p(
  data = data,
  sample = sample,
  grouping = peptide,
  response = peptide_intensity_missing,
  dose = concentration,
  n_replicate_completeness = 2,
  n_condition_completeness = 5,
  retain_columns = c(protein, change_peptide)
)

glimpse(drc_fit)

head(drc_fit, n = 10)
## End(Not run)
Peptide abundance profile plot
Description
Creates a plot of peptide abundances across samples. This is helpful to investigate effects of peptide and protein abundance changes in different samples and conditions.
Usage
peptide_profile_plot(
  data,
  sample,
  peptide,
  intensity_log2,
  grouping,
  targets,
  complete_sample = FALSE,
  protein_abundance_plot = FALSE,
  interactive = FALSE,
  export = FALSE,
  export_name = "peptide_profile_plots"
)
Arguments
data | a data frame that contains at least the input variables. |
sample | a character column in the |
peptide | a character column in the |
intensity_log2 | a numeric column in the |
grouping | a character column in the |
targets | a character vector that specifies elements of the grouping column which should be plotted. This can also be |
complete_sample | a logical value that indicates if samples that are completely missing for a given protein should be shown on the x-axis of the plot anyway. The default value is |
protein_abundance_plot | a logical value. If the input for this plot comes directly from |
interactive | a logical value that indicates whether the plot should be interactive (default is FALSE). If this is TRUE only one target can be supplied to the function. Interactive plots cannot be exported either. |
export | a logical value that indicates if plots should be exported as PDF. The output directory will be the current working directory. The name of the file can be chosen using the |
export_name | a character vector that provides the name of the exported file if |
Value
A list of peptide profile plots.
Examples
# Create example data
data <- data.frame(
  sample = c(
    rep("S1", 6),
    rep("S2", 6),
    rep("S1", 2),
    rep("S2", 2)
  ),
  protein_id = c(
    rep("P1", 12),
    rep("P2", 4)
  ),
  precursor = c(
    rep(c("A1", "A2", "B1", "B2", "C1", "D1"), 2),
    rep(c("E1", "F1"), 2)
  ),
  peptide = c(
    rep(c("A", "A", "B", "B", "C", "D"), 2),
    rep(c("E", "F"), 2)
  ),
  intensity = c(
    rnorm(n = 6, mean = 15, sd = 2),
    rnorm(n = 6, mean = 21, sd = 1),
    rnorm(n = 2, mean = 15, sd = 1),
    rnorm(n = 2, mean = 15, sd = 2)
  )
)

# Calculate protein abundances and retain precursor
# abundances that can be used in a peptide profile plot
complete_abundances <- calculate_protein_abundance(
  data,
  sample = sample,
  protein_id = protein_id,
  precursor = precursor,
  peptide = peptide,
  intensity_log2 = intensity,
  method = "sum",
  for_plot = TRUE
)

# Plot protein abundance profile
# protein_abundance_plot can be set to
# FALSE to also colour precursors
peptide_profile_plot(
  data = complete_abundances,
  sample = sample,
  peptide = precursor,
  intensity_log2 = intensity,
  grouping = protein_id,
  targets = c("P1"),
  protein_abundance_plot = TRUE
)
Assign peptide type
Description
This function was deprecated due to its name changing to
assign_peptide_type().
Usage
peptide_type(...)
Value
A data frame that contains the input data and an additional column with the peptide type information.
Plot four-parameter dose-response curves
Description
This function was deprecated due to its name changing to
drc_4p_plot().
Usage
plot_drc_4p(...)
Value
If targets = "all", a list containing plots for every unique identifier in the grouping variable is created. Otherwise a plot for the specified targets is created with at most 20 facets.
Peptide abundance profile plot
Description
This function was deprecated due to its name changing to
peptide_profile_plot().
Usage
plot_peptide_profiles(...)
Value
A list of peptide profile plots.
Plot histogram of p-value distribution
Description
This function was deprecated due to its name changing to
pval_distribution_plot().
Usage
plot_pval_distribution(...)
Value
A histogram plot that shows the p-value distribution.
Predict protein domains of AlphaFold predictions
Description
Uses the predicted aligned error (PAE) of AlphaFold predictions to find possible protein domains. A graph-based community clustering algorithm (Leiden clustering) is used on the predicted error (distance) between residues of a protein in order to infer pseudo-rigid groups in the protein. This is useful, for example, for finding out which parts of a protein prediction are likely in a fixed relative position to each other and which might have varying distances. This function is based on Python code written by Tristan Croll. The original code can be found on his GitHub page.
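The idea of grouping residues by low pairwise PAE can be sketched in base R. This is a deliberately simplified illustration and not the package's implementation: it swaps Leiden clustering for plain connected components on a thresholded PAE matrix, and the toy matrix values are made up.

```r
# Toy PAE matrix for 6 residues: two blocks with low error inside each block
# (hypothetical values, not real AlphaFold output)
pae <- matrix(20, nrow = 6, ncol = 6)
pae[1:3, 1:3] <- 2
pae[4:6, 4:6] <- 2
diag(pae) <- 0

pae_cutoff <- 5

# Two residues are connected if the error is below the cutoff in both directions
adjacent <- pae < pae_cutoff & t(pae) < pae_cutoff

# Connected components by label spreading: every residue repeatedly takes the
# smallest label among its neighbours (itself included) until nothing changes
domain <- seq_len(nrow(pae))
repeat {
  new_domain <- vapply(
    seq_along(domain),
    function(i) min(domain[adjacent[i, ]]),
    integer(1)
  )
  if (identical(new_domain, domain)) break
  domain <- new_domain
}

data.frame(residue = seq_along(domain), domain = domain)
```

Connected components cannot split weakly linked groups the way a resolution-controlled community detection can, so this only conveys the thresholding step.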
Usage
predict_alphafold_domain(
  pae_list,
  pae_power = 1,
  pae_cutoff = 5,
  graph_resolution = 1,
  return_data_frame = FALSE,
  show_progress = TRUE
)
Arguments
pae_list | a list of proteins that contains aligned errors for their AlphaFold predictions. This list can be retrieved with the |
pae_power | a numeric value, each edge in the graph will be weighted proportional to ( |
pae_cutoff | a numeric value, graph edges will only be created for residue pairs with |
graph_resolution | a numeric value that regulates how aggressive the clustering algorithm is. Smaller values lead to larger clusters. The value should be larger than zero, and values larger than 5 are unlikely to be useful. Higher values lead to stricter (i.e. smaller) clusters. The value is provided to the Leiden clustering algorithm of the |
return_data_frame | a logical value; if |
show_progress | a logical value that specifies if a progress bar will be shown. Default is |
Value
A list of the provided proteins that contains domain assignments for each residue. If return_data_frame is TRUE, a data frame with this information is returned instead. The data frame contains the following columns:
residue: The protein residue number.
domain: A numeric value representing a distinct predicted domain in the protein.
accession: The UniProt protein identifier.
Examples
# Fetch aligned errors
aligned_error <- fetch_alphafold_aligned_error(
  uniprot_ids = c("F4HVG8", "O15552"),
  error_cutoff = 4
)

# Predict protein domains
af_domains <- predict_alphafold_domain(
  pae_list = aligned_error,
  return_data_frame = TRUE
)

head(af_domains, n = 10)
Colour scheme for protti
Description
A colour scheme for protti that contains 100 colours.
Usage
protti_colours
Format
A vector containing 100 colours
Source
Dina's imagination.
Structural analysis example data
Description
Example data used for the vignette about structural analysis. The data was obtained from Cappelletti et al. 2021 (doi:10.1016/j.cell.2020.12.021) and corresponds to two separate experiments. Both experiments were limited proteolysis coupled to mass spectrometry (LiP-MS) experiments conducted on purified proteins. The first protein is phosphoglycerate kinase 1 (pgk) and it was treated with 25 mM 3-phosphoglyceric acid (3PG). The second protein is phosphoenolpyruvate-protein phosphotransferase (ptsI) and it was treated with 25 mM fructose 1,6-bisphosphate (FBP). From both experiments only peptides belonging to either protein were used for this data set. The ptsI data set contains precursor level data while the pgk data set contains peptide level data. The pgk data can be obtained from supplementary table 3 from the tab named "pgk+3PG". The ptsI data is only included as raw data and was analysed using the functions of this package.
Usage
ptsi_pgk
Format
A data frame containing differential abundances and adjusted p-values for peptides/precursors of two proteins.
Source
Cappelletti V, Hauser T, Piazza I, Pepelnjak M, Malinovska L, Fuhrer T, Li Y, Dörig C, Boersema P, Gillet L, Grossbach J, Dugourd A, Saez-Rodriguez J, Beyer A, Zamboni N, Caflisch A, de Souza N, Picotti P. Dynamic 3D proteomes reveal protein functional alterations at high resolution in situ. Cell. 2021 Jan 21;184(2):545-559.e22. doi:10.1016/j.cell.2020.12.021. Epub 2020 Dec 23. PMID: 33357446; PMCID: PMC7836100.
Plot histogram of p-value distribution
Description
Plots the distribution of p-values derived from any statistical test as a histogram.
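As a rough base R illustration of what such a histogram summarises, the sketch below bins uniformly distributed p-values. The bin width of 0.05 is an assumption made for this example and is not taken from the package.

```r
set.seed(123) # Makes example reproducible

# Simulated p-values under the null hypothesis (uniform on [0, 1])
pval <- runif(n = 1000)

# Count p-values in 20 equally wide bins; plot = FALSE returns the counts only
bins <- hist(pval, breaks = seq(0, 1, by = 0.05), plot = FALSE)
bin_counts <- bins$counts

# Under the null the counts should be roughly flat across bins
bin_counts
```

An enrichment of counts in the lowest bins relative to this flat background is the usual sign that a data set contains true effects.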
Usage
pval_distribution_plot(data, grouping, pval, facet_by = NULL)
Arguments
data | a data frame that contains at least grouping identifiers (precursor, peptide or protein) and p-values derived from any statistical test. |
grouping | a character column in the |
pval | a numeric column in the |
facet_by | optional, a character column that contains information by which the data should be faceted into multiple plots. |
Value
A histogram plot that shows the p-value distribution.
Examples
set.seed(123) # Makes example reproducible

# Create example data
data <- data.frame(
  peptide = paste0("peptide", 1:1000),
  pval = runif(n = 1000)
)

# Plot p-values
pval_distribution_plot(
  data = data,
  grouping = peptide,
  pval = pval
)
Check charge state distribution
Description
Calculates the charge state distribution for each sample (by count or intensity).
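The difference between the two methods can be sketched in base R. The helper `charge_share()` and the toy data are hypothetical, and the sketch simply drops rows with missing intensity for both methods, which is a simplification of the documented behaviour.

```r
# Toy data: four peptides in two samples (made-up values)
df <- data.frame(
  sample = rep(c("S1", "S2"), each = 4),
  peptide = rep(c("A", "B", "C", "D"), 2),
  charge = c(2, 2, 3, 4, 2, 3, 3, 3),
  intensity = c(10, 30, 40, 20, 5, 5, 5, NA)
)

# Percentage share of each charge state per sample, by count or by intensity
charge_share <- function(df, method = c("count", "intensity")) {
  method <- match.arg(method)
  df <- df[!is.na(df$intensity), ] # keep quantified IDs only
  weight <- if (method == "count") rep(1, nrow(df)) else df$intensity
  totals <- tapply(weight, list(df$sample, df$charge), sum)
  totals[is.na(totals)] <- 0 # charge states absent from a sample
  round(100 * totals / rowSums(totals), 1)
}

by_count <- charge_share(df, "count")
by_intensity <- charge_share(df, "intensity")
```

In sample S1 the doubly charged peptides make up 50% of the counts but only 40% of the intensity, which is why the two methods can tell different stories.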
Usage
qc_charge_states(
  data,
  sample,
  grouping,
  charge_states,
  intensity = NULL,
  remove_na_intensities = TRUE,
  method = "count",
  plot = FALSE,
  interactive = FALSE
)
Arguments
data | a data frame that contains at least sample names, peptide or precursor identifiers and charge states for each peptide or precursor. |
sample | a character or factor column in the |
grouping | a character column in the |
charge_states | a character or numeric column in the |
intensity | a numeric column in the |
remove_na_intensities | a logical value that specifies if sample/grouping combinations with intensities that are NA (not quantified IDs) should be dropped from the data frame for the analysis of charge states. Default is TRUE since we are usually interested in quantifiable peptides. This is only relevant for method = "count". |
method | a character value that indicates the method used for evaluation. "count" calculates the charge state distribution based on counts of the corresponding peptides or precursors in the charge state group, "intensity" calculates the percentage of precursors or peptides in each charge state group based on the corresponding intensity values. |
plot | a logical value that indicates whether the result should be plotted. |
interactive | a logical value that specifies whether the plot should be interactive (default is FALSE). |
Value
A data frame that contains the calculated percentage made up by the sum of either all counts or intensities of peptides or precursors of the corresponding charge state (depending on which method is chosen).
Examples
# Load libraries
library(dplyr)

set.seed(123) # Makes example reproducible

# Create example data
data <- create_synthetic_data(
  n_proteins = 100,
  frac_change = 0.05,
  n_replicates = 3,
  n_conditions = 2,
  method = "effect_random"
) %>%
  mutate(intensity_non_log2 = 2^peptide_intensity_missing)

# Calculate charge percentages
qc_charge_states(
  data = data,
  sample = sample,
  grouping = peptide,
  charge_states = charge,
  intensity = intensity_non_log2,
  method = "intensity",
  plot = FALSE
)

# Plot charge states
qc_charge_states(
  data = data,
  sample = sample,
  grouping = peptide,
  charge_states = charge,
  intensity = intensity_non_log2,
  method = "intensity",
  plot = TRUE
)
Percentage of contaminants per sample
Description
Calculates the percentage of contaminating proteins as the share of total intensity.
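The share of total intensity can be worked out by hand in base R. The data below are made up and the sketch only illustrates the arithmetic, not the function's internals.

```r
# Toy data: four proteins in two samples, two of them flagged as contaminants
df <- data.frame(
  sample = rep(c("S1", "S2"), each = 4),
  protein = rep(c("P1", "P2", "P3", "P4"), 2),
  is_contaminant = rep(c(TRUE, FALSE, FALSE, TRUE), 2),
  intensity = c(10, 60, 20, 10, 5, 70, 20, 5)
)

# Total intensity and contaminant intensity per sample
total <- tapply(df$intensity, df$sample, sum)
contaminant <- tapply(df$intensity * df$is_contaminant, df$sample, sum)

# Percentage of the total intensity that comes from contaminants
contaminant_percent <- 100 * contaminant / total
contaminant_percent
```

Here contaminants account for 20% of the intensity in S1 and 10% in S2.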
Usage
qc_contaminants(
  data,
  sample,
  protein,
  is_contaminant,
  intensity,
  n_contaminants = 5,
  plot = TRUE,
  interactive = FALSE
)
Arguments
data | a data frame that contains at least the input variables. |
sample | a character or factor column in the |
protein | a character column in the |
is_contaminant | a logical column that indicates if the protein is a contaminant. |
intensity | a numeric column in the |
n_contaminants | a numeric value that indicates how many contaminants should be displayed individually. The rest is combined to a group called "other". The default is 5. |
plot | a logical value that indicates if a plot is returned. If FALSE a table is returned. |
interactive | a logical value that indicates if the plot is made interactive using the R package |
Value
A bar plot that displays the percentage of contaminating proteins over all samples. If plot = FALSE a data frame is returned.
Examples
data <- data.frame(
  sample = c(rep("sample_1", 10), rep("sample_2", 10)),
  leading_razor_protein = c(rep(c("P1", "P1", "P1", "P2", "P2", "P2", "P2", "P3", "P3", "P3"), 2)),
  potential_contaminant = c(rep(c(rep(TRUE, 7), rep(FALSE, 3)), 2)),
  intensity = c(rep(1, 2), rep(4, 4), rep(6, 4), rep(2, 3), rep(3, 5), rep(4, 2))
)

qc_contaminants(
  data,
  sample = sample,
  protein = leading_razor_protein,
  is_contaminant = potential_contaminant,
  intensity = intensity
)
Check CV distribution
Description
Calculates and plots the coefficients of variation for the selected grouping.
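For orientation, the coefficient of variation is the standard deviation divided by the mean, expressed here in percent. The base R sketch below computes it per peptide and condition on made-up, non-log2 intensities; the helper `cv_percent()` is hypothetical.

```r
# Toy data: two peptides, two conditions, three replicates each
df <- data.frame(
  peptide = rep(c("A", "B"), each = 6),
  condition = rep(rep(c("control", "treated"), each = 3), 2),
  intensity = c(90, 100, 110, 95, 100, 105, 50, 100, 150, 80, 100, 120)
)

# CV in percent; intensities must not be log transformed
cv_percent <- function(x) 100 * sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

# One CV per peptide and condition
cvs <- aggregate(intensity ~ peptide + condition, data = df, FUN = cv_percent)
names(cvs)[3] <- "cv"

# Median CV per condition
median_cv <- tapply(cvs$cv, cvs$condition, median)
median_cv
```

The per-group CVs are 10% and 50% for the control and 5% and 20% for the treated condition, giving medians of 30% and 12.5%.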
Usage
qc_cvs(
  data,
  grouping,
  condition,
  intensity,
  plot = TRUE,
  plot_style = "density",
  max_cv = 200
)
Arguments
data | a data frame containing at least peptide, precursor or protein identifiers, information on conditions and intensity values for each peptide, precursor or protein. |
grouping | a character column in the |
condition | a character or factor column in the |
intensity | a numeric column in the |
plot | a logical value that indicates whether the result should be plotted. |
plot_style | a character value that indicates the plotting style. |
max_cv | a numeric value that specifies the maximum percentage of CVs that should be included in the returned plot. The default value is |
Value
Either a data frame with the median CVs in % or a plot showing the distribution of the CVs is returned.
Examples
# Load libraries
library(dplyr)

set.seed(123) # Makes example reproducible

# Create example data
data <- create_synthetic_data(
  n_proteins = 100,
  frac_change = 0.05,
  n_replicates = 3,
  n_conditions = 2,
  method = "effect_random"
) %>%
  mutate(intensity_non_log2 = 2^peptide_intensity_missing)

# Calculate coefficients of variation
qc_cvs(
  data = data,
  grouping = peptide,
  condition = condition,
  intensity = intensity_non_log2,
  plot = FALSE
)

# Plot coefficients of variation
# Different plot styles are available
qc_cvs(
  data = data,
  grouping = peptide,
  condition = condition,
  intensity = intensity_non_log2,
  plot = TRUE,
  plot_style = "violin"
)
Data completeness
Description
Calculates the percentage of data completeness, that is, the percentage of all detected precursors that is present in each sample.
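A minimal base R sketch of this calculation on made-up data follows. It assumes that a precursor counts as detected when its intensity is not NA in at least one sample.

```r
# Toy data: four precursors in three samples, NA means not quantified
df <- data.frame(
  sample = rep(c("S1", "S2", "S3"), each = 4),
  precursor = rep(c("A", "B", "C", "D"), 3),
  intensity = c(1, 2, 3, 4, 1, NA, 3, NA, NA, NA, NA, 4)
)

# Keep quantified observations only
detected <- df[!is.na(df$intensity), ]

# Number of precursors detected anywhere in the experiment
n_total <- length(unique(detected$precursor))

# Number of precursors detected in each sample
n_per_sample <- tapply(
  detected$precursor,
  detected$sample,
  function(x) length(unique(x))
)

completeness <- 100 * n_per_sample / n_total
completeness
```

With four precursors detected overall, the samples reach 100%, 50% and 25% completeness.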
Usage
qc_data_completeness(
  data,
  sample,
  grouping,
  intensity,
  digestion = NULL,
  plot = TRUE,
  interactive = FALSE
)
Arguments
data | a data frame containing at least the input variables. |
sample | a character or factor column in the |
grouping | a character column in the |
intensity | a numeric column in the |
digestion | optional, a character column in the |
plot | a logical value that indicates whether the result should be plotted. |
interactive | a logical value that specifies whether the plot should be interactive (default is FALSE). |
Value
A bar plot that displays the percentage of data completeness over all samples. If plot = FALSE a data frame is returned. If interactive = TRUE, the plot is interactive.
Examples
set.seed(123) # Makes example reproducible

# Create example data
data <- create_synthetic_data(
  n_proteins = 100,
  frac_change = 0.05,
  n_replicates = 3,
  n_conditions = 2,
  method = "effect_random"
)

# Determine data completeness
qc_data_completeness(
  data = data,
  sample = sample,
  grouping = peptide,
  intensity = peptide_intensity_missing,
  plot = FALSE
)

# Plot data completeness
qc_data_completeness(
  data = data,
  sample = sample,
  grouping = peptide,
  intensity = peptide_intensity_missing,
  plot = TRUE
)
Check number of precursor, peptide or protein IDs
Description
Returns a plot or table of the number of IDs for each sample. The default settings remove grouping variables without quantitative information (intensity is NA). These will not be counted as IDs.
Usage
qc_ids(
  data,
  sample,
  grouping,
  intensity,
  remove_na_intensities = TRUE,
  condition = NULL,
  title = "ID count per sample",
  plot = TRUE,
  interactive = FALSE
)
Arguments
data | a data frame containing at least sample names and precursor/peptide/protein IDs. |
sample | a character or factor column in the |
grouping | a character column in the |
intensity | a numeric column in the |
remove_na_intensities | a logical value that specifies if sample/grouping combinations with intensities that are NA (not quantified IDs) should be dropped from the data frame. Default is TRUE since we are usually interested in the number of quantifiable IDs. |
condition | optional, a column in the |
title | optional, a character value that specifies the plot title (default is "ID count per sample"). |
plot | a logical value that indicates whether the result should be plotted. |
interactive | a logical value that specifies whether the plot should be interactive (default is FALSE). |
Value
A bar plot in which each bar represents one sample and its height corresponds to the number of IDs (if plot = TRUE). If plot = FALSE a table with ID counts is returned.
Examples
set.seed(123) # Makes example reproducible

# Create example data
data <- create_synthetic_data(
  n_proteins = 100,
  frac_change = 0.05,
  n_replicates = 3,
  n_conditions = 2,
  method = "effect_random"
)

# Calculate number of identifications
qc_ids(
  data = data,
  sample = sample,
  grouping = peptide,
  intensity = peptide_intensity_missing,
  condition = condition,
  plot = FALSE
)

# Plot number of identifications
qc_ids(
  data = data,
  sample = sample,
  grouping = peptide,
  intensity = peptide_intensity_missing,
  condition = condition,
  plot = TRUE
)
Check intensity distribution per sample and overall
Description
Plots the overall or sample-wise distribution of all peptide intensities as a histogram, boxplot or violin plot.
Usage
qc_intensity_distribution(
  data,
  sample = NULL,
  grouping,
  intensity_log2,
  plot_style
)
Arguments
data | a data frame that contains at least sample names, grouping identifiers (precursor, peptide or protein) and log2 transformed intensities for each grouping identifier. |
sample | an optional character or factor column in the |
grouping | a character column in the |
intensity_log2 | a numeric column in the |
plot_style | a character value that indicates the plot type. This can be either "histogram", "boxplot" or "violin". Plot style "boxplot" and "violin" can only be used if a sample column is provided. |
Value
A histogram, boxplot or violin plot that shows the intensity distribution over all samples or by sample.
Examples
set.seed(123) # Makes example reproducible

# Create example data
data <- create_synthetic_data(
  n_proteins = 100,
  frac_change = 0.05,
  n_replicates = 3,
  n_conditions = 2,
  method = "effect_random"
)

# Plot intensity distribution
# The plot style can be changed
qc_intensity_distribution(
  data = data,
  sample = sample,
  grouping = peptide,
  intensity_log2 = peptide_intensity_missing,
  plot_style = "boxplot"
)
Median run intensities
Description
Median intensities per run are returned either as a plot or a table.
Usage
qc_median_intensities(
  data,
  sample,
  grouping,
  intensity,
  plot = TRUE,
  interactive = FALSE
)
Arguments
data | a data frame that contains at least the input variables. |
sample | a character or factor column in the |
grouping | a character column in the |
intensity | a numeric column in the |
plot | a logical value that indicates whether the result should be plotted. |
interactive | a logical value that specifies whether the plot should be interactive (default is FALSE). |
Value
A plot that displays median intensity over all samples. If plot = FALSE a data frame containing median intensities is returned.
Examples
set.seed(123) # Makes example reproducible

# Create example data
data <- create_synthetic_data(
  n_proteins = 100,
  frac_change = 0.05,
  n_replicates = 3,
  n_conditions = 2,
  method = "effect_random"
)

# Calculate median intensities
qc_median_intensities(
  data = data,
  sample = sample,
  grouping = peptide,
  intensity = peptide_intensity_missing,
  plot = FALSE
)

# Plot median intensities
qc_median_intensities(
  data = data,
  sample = sample,
  grouping = peptide,
  intensity = peptide_intensity_missing,
  plot = TRUE
)
Check missed cleavages
Description
Calculates the percentage of missed cleavages for each sample (by count or intensity). The default settings remove grouping variables without quantitative information (intensity is NA). These will not be used for the calculation of missed cleavage percentages.
Usage
qc_missed_cleavages(
  data,
  sample,
  grouping,
  missed_cleavages,
  intensity,
  remove_na_intensities = TRUE,
  method = "count",
  plot = FALSE,
  interactive = FALSE
)
Arguments
data | a data frame containing at least sample names, peptide or precursor identifiers and missed cleavage counts for each peptide or precursor. |
sample | a character or factor column in the |
grouping | a character column in the |
missed_cleavages | a numeric column in the |
intensity | a numeric column in the |
remove_na_intensities | a logical value that specifies if sample/grouping combinations with intensities that are NA (not quantified IDs) should be dropped from the data frame for analysis of missed cleavages. Default is TRUE since we are usually interested in quantifiable peptides. This is only relevant for method = "count". |
method | a character value that indicates the method used for evaluation. "count" calculates the percentage of missed cleavages based on counts of the corresponding peptide or precursor, "intensity" calculates the percentage of missed cleavages by intensity of the corresponding peptide or precursor. |
plot | a logical value that indicates whether the result should be plotted. |
interactive | a logical value that specifies whether the plot should be interactive (default is FALSE). |
Value
A data frame that contains the calculated percentage made up by the sum of all peptides or precursors containing the corresponding number of missed cleavages.
Examples
library(dplyr)

set.seed(123) # Makes example reproducible

# Create example data
data <- create_synthetic_data(
  n_proteins = 100,
  frac_change = 0.05,
  n_replicates = 3,
  n_conditions = 2,
  method = "effect_random"
) %>%
  mutate(intensity_non_log2 = 2^peptide_intensity_missing)

# Calculate missed cleavage percentages
qc_missed_cleavages(
  data = data,
  sample = sample,
  grouping = peptide,
  missed_cleavages = n_missed_cleavage,
  intensity = intensity_non_log2,
  method = "intensity",
  plot = FALSE
)

# Plot missed cleavages
qc_missed_cleavages(
  data = data,
  sample = sample,
  grouping = peptide,
  missed_cleavages = n_missed_cleavage,
  intensity = intensity_non_log2,
  method = "intensity",
  plot = TRUE
)
Plot principal component analysis
Description
Plots a principal component analysis based on peptide or precursor intensities.
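The kind of calculation behind such a plot can be sketched with `stats::prcomp()` on a samples-by-peptides matrix of simulated log2 intensities. Whether the package uses `prcomp()` internally, and how it treats missing values, is not stated here, so treat both as assumptions of this sketch.

```r
set.seed(123) # Makes example reproducible

# Simulated log2 intensities: 6 samples (rows) x 50 peptides (columns)
samples <- c("ctrl_1", "ctrl_2", "ctrl_3", "trt_1", "trt_2", "trt_3")
n_pep <- 50
m <- matrix(
  rnorm(length(samples) * n_pep, mean = 20, sd = 1),
  nrow = length(samples),
  dimnames = list(samples, paste0("peptide", seq_len(n_pep)))
)

# Shift ten peptides in the treated samples to mimic a treatment effect
m[4:6, 1:10] <- m[4:6, 1:10] + 3

# PCA on centred data; prcomp() needs a matrix without missing values
pca <- prcomp(m, center = TRUE, scale. = FALSE)

# Scores for the two components that would be plotted
scores <- pca$x[, c("PC1", "PC2")]

# Percentage of variance explained per component (the scree plot values)
explained <- 100 * pca$sdev^2 / sum(pca$sdev^2)
```

The `explained` vector corresponds to what a scree plot shows, while `scores` holds one point per sample for the PC1 versus PC2 plot.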
Usage
qc_pca(
  data,
  sample,
  grouping,
  intensity,
  condition,
  components = c("PC1", "PC2"),
  digestion = NULL,
  plot_style = "pca"
)
Arguments
data | a data frame that contains sample names, peptide or precursor identifiers, corresponding intensities and a condition column indicating e.g. the treatment. |
sample | a character column in the |
grouping | a character column in the |
intensity | a numeric column in the |
condition | a numeric or character column in the |
components | a character vector indicating the two components that should be displayed in the plot. By default these are PC1 and PC2. You can provide these using a character vector of the form c("PC1", "PC2"). |
digestion | optional, a character column in the |
plot_style | a character value that specifies what plot should be returned. If |
Value
A principal component analysis plot showing PC1 and PC2. If plot_style = "scree", a scree plot for all dimensions is returned.
Examples
set.seed(123) # Makes example reproducible

# Create example data
data <- create_synthetic_data(
  n_proteins = 100,
  frac_change = 0.05,
  n_replicates = 3,
  n_conditions = 2
)

# Plot scree plot
qc_pca(
  data = data,
  sample = sample,
  grouping = peptide,
  intensity = peptide_intensity_missing,
  condition = condition,
  plot_style = "scree"
)

# Plot principal components
qc_pca(
  data = data,
  sample = sample,
  grouping = peptide,
  intensity = peptide_intensity_missing,
  condition = condition
)
Peak width over retention time
Description
Plots the median precursor elution peak width, binned in one-minute intervals, over retention time for each sample.
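A base R sketch of the binning on made-up data follows. Taking the peak width as end minus start and defining bins with `floor()` on the apex retention time are assumptions of this example.

```r
# Toy data: five precursors in each of two samples (retention times in minutes)
df <- data.frame(
  sample = rep(c("S1", "S2"), each = 5),
  retention_time = c(1.2, 1.8, 2.5, 3.1, 3.9, 1.1, 1.5, 2.2, 2.9, 3.4),
  rt_start = c(1.0, 1.5, 2.25, 2.9, 3.7, 1.0, 1.3, 2.0, 2.7, 3.2),
  rt_end = c(1.4, 2.1, 2.75, 3.3, 4.3, 1.2, 1.7, 2.4, 3.1, 3.6)
)

# Peak width from elution start and end
df$peak_width <- df$rt_end - df$rt_start

# One-minute bins based on the apex retention time
df$rt_bin <- floor(df$retention_time)

# Median peak width per sample and bin
binned <- aggregate(peak_width ~ sample + rt_bin, data = df, FUN = median)
binned
```

Each row of `binned` is one point of the line that would be drawn for a sample.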
Usage
qc_peak_width(
  data,
  sample,
  intensity,
  retention_time,
  peak_width = NULL,
  retention_time_start = NULL,
  retention_time_end = NULL,
  remove_na_intensities = TRUE,
  interactive = FALSE
)
Arguments
data | a data frame containing at least sample names and protein IDs. |
sample | a character column in the |
intensity | a numeric column in the |
retention_time | a numeric column in the |
peak_width | a numeric column in the |
retention_time_start | a numeric column in the |
retention_time_end | a numeric column in the |
remove_na_intensities | a logical value that specifies if sample/grouping combinations with intensities that are NA (not quantified IDs) should be dropped from the data frame. Default is TRUE since we are usually interested in the peak width of quantifiable data. |
interactive | a logical value that specifies whether the plot should be interactive (default is FALSE). |
Value
A line plot displaying the median precursor elution peak width, binned in one-minute intervals, over retention time for each sample.
Examples
data <- data.frame(
  r_file_name = c(rep("sample_1", 10), rep("sample2", 10)),
  fg_quantity = c(rep(2000, 20)),
  eg_mean_apex_rt = c(rep(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 2)),
  eg_start_rt = c(0.5, 1, 3, 4, 5, 6, 7, 7.5, 8, 9, 1, 2, 2, 3, 4, 5, 5, 8, 9, 9),
  eg_end_rt = c(
    1.5, 2, 3.1, 4.5, 5.8, 6.6, 8, 8, 8.4, 9.1, 3, 2.2,
    4, 3.4, 4.5, 5.5, 5.6, 8.3, 10, 12
  )
)

qc_peak_width(
  data,
  sample = r_file_name,
  intensity = fg_quantity,
  retention_time = eg_mean_apex_rt,
  retention_time_start = eg_start_rt,
  retention_time_end = eg_end_rt
)
Check peptide type percentage share
Description
Calculates the percentage share of each peptide type (fully-tryptic, semi-tryptic, non-tryptic) for each sample.
Usage
qc_peptide_type(
  data,
  sample,
  peptide,
  pep_type,
  intensity,
  remove_na_intensities = TRUE,
  method = "count",
  plot = FALSE,
  interactive = FALSE
)
Arguments
data | a data frame that contains at least the input columns. |
sample | a character or factor column in the |
peptide | a character column in the |
pep_type | a character column in the |
intensity | a numeric column in the |
remove_na_intensities | a logical value that specifies if sample/peptide combinations with intensities that are NA (not quantified IDs) should be dropped from the data frame for analysis of peptide type distributions. Default is TRUE since we are usually interested in the peptide type distribution of quantifiable IDs. This is only relevant for method = "count". |
method | a character value that indicates the method used for evaluation. |
plot | a logical value that indicates whether the result should be plotted. |
interactive | a logical value that indicates whether the plot should be interactive. |
Value
A data frame that contains the calculated percentage shares of each peptide type per sample. The count column contains the number of peptides with a specific type. The peptide_type_percent column contains the percentage share of a specific peptide type.
Examples
# Load libraries
library(dplyr)

set.seed(123) # Makes example reproducible

# Create example data
data <- create_synthetic_data(
  n_proteins = 100,
  frac_change = 0.05,
  n_replicates = 3,
  n_conditions = 2,
  method = "effect_random"
) %>%
  mutate(intensity_non_log2 = 2^peptide_intensity_missing)

# Determine peptide type percentages
qc_peptide_type(
  data = data,
  sample = sample,
  peptide = peptide,
  pep_type = pep_type,
  intensity = intensity_non_log2,
  method = "intensity",
  plot = FALSE
)

# Plot peptide type
qc_peptide_type(
  data = data,
  sample = sample,
  peptide = peptide,
  pep_type = pep_type,
  intensity = intensity_non_log2,
  method = "intensity",
  plot = TRUE
)
Proteome coverage per sample and total
Description
Calculates the proteome coverage for each sample and for all samples combined. In other words, the fraction of detected proteins relative to all proteins in the proteome is calculated.
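The arithmetic can be illustrated in base R with a fixed proteome size. The value of 4000 proteins and the toy identifiers below are hypothetical; the function itself takes the proteome from the given organism rather than from a constant.

```r
# Hypothetical proteome size, used only for this illustration
proteome_size <- 4000

# Toy data: sample A detects proteins 1 to 1000, sample B proteins 501 to 2500
df <- data.frame(
  sample = c(rep("A", 1000), rep("B", 2000)),
  protein_id = c(1:1000, 501:2500)
)

# Number of distinct proteins per sample and over all samples combined
n_detected <- tapply(df$protein_id, df$sample, function(x) length(unique(x)))
n_total <- length(unique(df$protein_id))

# Coverage in percent of the proteome
coverage_percent <- 100 * c(n_detected, total = n_total) / proteome_size
coverage_percent
```

The combined coverage (62.5%) is lower than the sum of the per-sample values (25% and 50%) because proteins 501 to 1000 are detected in both samples.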
Usage
qc_proteome_coverage(
  data,
  sample,
  protein_id,
  organism_id,
  reviewed = TRUE,
  plot = TRUE,
  interactive = FALSE
)
Arguments
data | a data frame that contains at least sample names and protein IDs. |
sample | a character column in the |
protein_id | a character or numeric column in the |
organism_id | a numeric value that specifies a NCBI taxonomy identifier (TaxId) of the organism used. Human: 9606, S. cerevisiae: 559292, E. coli: 83333. |
reviewed | a logical value that determines if only reviewed protein entries will be considered as the full proteome. Default is TRUE. |
plot | a logical value that specifies whether the result should be plotted. |
interactive | a logical value that indicates whether the plot should be interactive (default is FALSE). |
Value
A bar plot showing the percentage of the proteome detected and undetected in total and for each sample. If plot = FALSE a data frame containing the numbers is returned.
Examples
# Create example data
proteome <- data.frame(id = 1:4518)
data <- data.frame(
  sample = c(rep("A", 101), rep("B", 1000), rep("C", 1000)),
  protein_id = c(proteome$id[1:100], proteome$id[1:1000], proteome$id[1000:2000])
)

# Calculate proteome coverage
qc_proteome_coverage(
  data = data,
  sample = sample,
  protein_id = protein_id,
  organism_id = 83333,
  plot = FALSE
)

# Plot proteome coverage
qc_proteome_coverage(
  data = data,
  sample = sample,
  protein_id = protein_id,
  organism_id = 83333,
  plot = TRUE
)
Check ranked intensities
Description
Calculates and plots ranked intensities for proteins, peptides or precursors.
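Ranking and the change of logarithm base can be sketched in base R. Ranking from highest to lowest intensity is an assumption of this example, and the values are made up.

```r
# Toy data: log2 intensities of four peptides
df <- data.frame(
  peptide = c("A", "B", "C", "D"),
  intensity_log2 = c(20, 25, 15, 22)
)

# Convert from log2 to log10: log10(2^x) = x * log10(2)
df$intensity_log10 <- df$intensity_log2 * log10(2)

# Rank 1 is the most intense peptide
df$rank <- rank(-df$intensity_log2, ties.method = "first")

ranked <- df[order(df$rank), ]
ranked
```

Plotting `intensity_log10` against `rank` gives the typical ranked intensity curve that shows the dynamic range of a data set.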
Usage
qc_ranked_intensities(
  data,
  sample,
  grouping,
  intensity_log2,
  facet = FALSE,
  plot = FALSE,
  y_axis_transformation = "log10",
  interactive = FALSE
)
Arguments
data | a data frame that contains at least sample names, grouping identifiers (precursor, peptide or protein) and log2 transformed intensities for each grouping identifier. |
sample | a character column in the |
grouping | a character column in the |
intensity_log2 | a numeric column in the |
facet | a logical value that specifies whether the calculation should be done group wise by sample and if the resulting plot should be faceted by sample. (default is |
plot | a logical value that specifies whether the result should be plotted (default is |
y_axis_transformation | a character value that determines the y-axis transformation. The value is either "log2" or "log10" (default is "log10"). |
interactive | a logical value that specifies whether the plot should be interactive (default is |
Value
A data frame containing the ranked intensities is returned. If plot = TRUE a plot is returned. The intensities are log10 transformed for the plot.
Examples
set.seed(123) # Makes example reproducible

# Create synthetic data
data <- create_synthetic_data(
  n_proteins = 50,
  frac_change = 0.05,
  n_replicates = 4,
  n_conditions = 3,
  method = "effect_random",
  additional_metadata = FALSE
)

# Plot ranked intensities for all samples combined
qc_ranked_intensities(
  data = data,
  sample = sample,
  grouping = peptide,
  intensity_log2 = peptide_intensity,
  plot = TRUE
)

# Plot ranked intensities for each sample separately
qc_ranked_intensities(
  data = data,
  sample = sample,
  grouping = peptide,
  intensity_log2 = peptide_intensity,
  plot = TRUE,
  facet = TRUE
)
Correlation based hierarchical clustering of samples
Description
A correlation heatmap is created that uses hierarchical clustering to determine sample similarity.
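The two ingredients, a sample correlation matrix and hierarchical clustering, can be sketched with base R's `cor()` and `hclust()`. Using `1 - correlation` as the distance and the default linkage are assumptions of this sketch, and the data are simulated.

```r
set.seed(123) # Makes example reproducible

# Simulated log2 intensities of 100 peptides
base <- rnorm(100, mean = 20, sd = 2)
effect <- rnorm(100, mean = 0, sd = 2) # peptide-specific treatment effect

m <- cbind(
  ctrl_1 = base + rnorm(100, sd = 0.2),
  ctrl_2 = base + rnorm(100, sd = 0.2),
  trt_1 = base + effect + rnorm(100, sd = 0.2),
  trt_2 = base + effect + rnorm(100, sd = 0.2)
)

# Pairwise Spearman correlation between samples (columns)
cor_mat <- cor(m, method = "spearman", use = "pairwise.complete.obs")

# Hierarchical clustering on a correlation-based distance
tree <- hclust(as.dist(1 - cor_mat))

# Cut the tree into two groups of samples
clusters <- cutree(tree, k = 2)
clusters
```

Replicates of the same condition end up in the same cluster, which is the pattern one hopes to see in the heatmap of a well-behaved experiment.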
Usage
qc_sample_correlation(
  data,
  sample,
  grouping,
  intensity_log2,
  condition,
  digestion = NULL,
  run_order = NULL,
  method = "spearman",
  interactive = FALSE
)
Arguments
data | a data frame that contains at least the input variables. |
sample | a character column in the |
grouping | a character column in the |
intensity_log2 | a numeric column in the |
condition | a character or numeric column in the |
digestion | optional, a character column in the |
run_order | optional, a character or numeric column in the |
method | a character value that specifies the method to be used for correlation. |
interactive | a logical value that specifies whether the plot should be interactive. Determines if an interactive or static heatmap should be created using |
Value
A correlation heatmap that compares each sample. The dendrogram is sorted by optimal leaf ordering.
Examples
set.seed(123) # Makes example reproducible

# Create example data
data <- create_synthetic_data(
  n_proteins = 100,
  frac_change = 0.05,
  n_replicates = 3,
  n_conditions = 2,
  method = "effect_random"
)

# Create sample correlation heatmap
qc_sample_correlation(
  data = data,
  sample = sample,
  grouping = peptide,
  intensity_log2 = peptide_intensity_missing,
  condition = condition
)
Protein coverage distribution
Description
Plots the distribution of protein coverages in a histogram.
Usage
qc_sequence_coverage(
  data,
  protein_identifier,
  coverage,
  sample = NULL,
  interactive = FALSE
)
Arguments
data | a data frame that contains at least the input variables. |
protein_identifier | a character column in the |
coverage | a numeric column in the |
sample | optional, a character or factor column in the |
interactive | a logical value that specifies whether the plot should be interactive (default is FALSE). |
Value
A protein coverage histogram with 5 percent binwidth. The vertical dotted line indicates the median.
Examples
set.seed(123) # Makes example reproducible

# Create example data
data <- create_synthetic_data(
  n_proteins = 100,
  frac_change = 0.05,
  n_replicates = 3,
  n_conditions = 2,
  method = "effect_random"
)

# Plot sequence coverage
qc_sequence_coverage(
  data = data,
  protein_identifier = protein,
  coverage = coverage
)
Randomise samples in MS queue
Description
This function randomises the order of samples in an MS queue. QC and Blank samples are left in place. It is also possible to randomise only parts of the queue. Before running this, make sure to set a specific seed with the set.seed() function. This ensures that the result of the randomisation is the same if the function is run again.
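The effect of the seed, and of leaving QC and blank runs in place, can be sketched in base R. The helper `randomise_samples()`, the `sample_name` column and the labels "blank" and "QC" are hypothetical stand-ins; how the function itself recognises such runs is not described here.

```r
# Toy queue with blank and QC runs interspersed (hypothetical column name)
queue <- data.frame(
  sample_name = c("blank", "s1", "s2", "s3", "QC", "s4", "s5", "s6", "blank"),
  stringsAsFactors = FALSE
)

# Shuffle only the rows that are not fixed; fixed rows keep their position
randomise_samples <- function(queue, fixed = c("blank", "QC")) {
  is_fixed <- queue$sample_name %in% fixed
  queue$sample_name[!is_fixed] <- sample(queue$sample_name[!is_fixed])
  queue
}

# Setting the same seed before each call gives the same randomised order
set.seed(123)
first <- randomise_samples(queue)

set.seed(123)
second <- randomise_samples(queue)

identical(first, second)
```

Without the `set.seed()` calls the two results would generally differ, which makes a queue impossible to regenerate later.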
Usage
randomise_queue(data = NULL, rows = NULL, export = FALSE)
Arguments
data | optional, a data frame that contains a queue. If not provided a queue file can be chosen interactively. |
rows | optional, a numeric vector that specifies a range of rows for which samples should be randomised. |
export | a logical value that determines if a |
Value
If export = TRUE a "randomised_queue.csv" file will be saved in the working directory. If export = FALSE a data frame that contains the randomised queue is returned.
Examples
queue <- create_queue(
  date = c("200722"),
  instrument = c("EX1"),
  user = c("jquast"),
  measurement_type = c("DIA"),
  experiment_name = c("JPQ031"),
  digestion = c("LiP", "tryptic control"),
  treatment_type_1 = c("EDTA", "H2O"),
  treatment_type_2 = c("Zeba", "unfiltered"),
  treatment_dose_1 = c(10, 30, 60),
  treatment_unit_1 = c("min"),
  n_replicates = 4,
  number_runs = FALSE,
  organism = c("E. coli"),
  exclude_combinations = list(list(
    treatment_type_1 = c("H2O"),
    treatment_type_2 = c("Zeba", "unfiltered"),
    treatment_dose_1 = c(10, 30)
  )),
  inj_vol = c(2),
  data_path = "D:\\2007_Data",
  method_path = "C:\\Xcalibur\\methods\\DIA_120min",
  position_row = c("A", "B", "C", "D", "E", "F"),
  position_column = 8,
  blank_every_n = 4,
  blank_position = "1-V1",
  blank_method_path = "C:\\Xcalibur\\methods\\blank"
)

head(queue, n = 20)

randomised_queue <- randomise_queue(
  data = queue,
  export = FALSE
)

head(randomised_queue, n = 20)
Rapamycin 10 uM example data
Description
Rapamycin example data used for the vignette about binary control/treated data. The data was obtained from Piazza 2020 and corresponds to experiment 18. FKBP1A, the rapamycin-binding protein, and 49 other randomly sampled proteins were used for this example dataset. Furthermore, only the DMSO control and the 10 uM condition were used.
Usage
rapamycin_10uM
Format
A data frame containing peptide level data from a Spectronaut report.
Source
Piazza, I., Beaton, N., Bruderer, R. et al. A machine learning-based chemoproteomic approach to identify drug targets and binding sites in complex proteomes. Nat Commun 11, 4200 (2020). doi:10.1038/s41467-020-18071-x
Rapamycin dose response example data
Description
Rapamycin example data used for the vignette about dose response data. The data was obtained from Piazza 2020 and corresponds to experiment 18. FKBP1A, the rapamycin-binding protein, and 39 other randomly sampled proteins were used for this example dataset. The concentration range includes the following points: 0 (DMSO control), 10 pM, 100 pM, 1 nM, 10 nM, 100 nM, 1 uM, 10 uM and 100 uM.
Usage
rapamycin_dose_response
Format
A data frame containing peptide level data from a Spectronaut report.
Source
Piazza, I., Beaton, N., Bruderer, R. et al. A machine learning-based chemoproteomic approach to identify drug targets and binding sites in complex proteomes. Nat Commun 11, 4200 (2020). doi:10.1038/s41467-020-18071-x
Read, clean and convert
Description
The function uses the very fast fread function from the data.table package. The column names of the resulting data table are made more R-friendly using clean_names from the janitor package. It replaces "." and " " with "_" and converts names to lower case, which is also known as snake_case. In the end the data table is converted to a tibble.
Usage
read_protti(filename, ...)
Arguments
filename | a character value that specifies the path to the file. |
... | additional arguments for the fread function. |
Value
A data frame (with class tibble) that contains the content of the specified file.
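The name cleaning step described above can be approximated in base R. This is a rough sketch only: to_snake_case_sketch is a hypothetical helper that covers just the two replacements named in the description, whereas clean_names from the janitor package applies additional rules, so its output can differ for other inputs.

```r
# Hypothetical base R approximation of the name cleaning step;
# it is not the janitor implementation.
to_snake_case_sketch <- function(x) {
  x <- gsub("[. ]+", "_", x)   # replace runs of "." and " " with "_"
  x <- gsub("^_+|_+$", "", x)  # drop leading and trailing underscores
  tolower(x)                   # convert to lower case
}

raw_names <- c("Protein.Id", "File Name", "Intensity")
clean <- to_snake_case_sketch(raw_names)
clean
```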
Examples
## Not run:
read_protti("folder\\filename")
## End(Not run)
Replace identified positions in protein sequence by "x"
Description
Helper function for the calculation of sequence coverage, replaces identified positions with an"x" within the protein sequence.
Usage
replace_identified_by_x(sequence, positions_start, positions_end)
Arguments
sequence | a character value that contains the protein sequence. |
positions_start | a numeric vector of start positions of the identified peptides. |
positions_end | a numeric vector of end positions of the identified peptides. |
Value
A character vector that contains the modified protein sequence with each identified position replaced by "x".
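A minimal base R sketch of this idea follows. The helper replace_by_x_sketch and the toy sequence are hypothetical and only illustrate how masking identified positions leads to a coverage value. The sketch assumes the sequence contains no "x" characters of its own.

```r
# Hypothetical helper, not protti's implementation: mask every residue
# that lies inside an identified peptide with "x".
replace_by_x_sketch <- function(sequence, positions_start, positions_end) {
  residues <- strsplit(sequence, "")[[1]]
  for (i in seq_along(positions_start)) {
    residues[positions_start[i]:positions_end[i]] <- "x"
  }
  paste(residues, collapse = "")
}

# Toy sequence with two identified peptides (positions 2-4 and 7-8)
masked <- replace_by_x_sketch("MKTAYIAKQR", c(2, 7), c(4, 8))
masked

# Sequence coverage in percent: share of residues replaced by "x"
coverage <- 100 * sum(strsplit(masked, "")[[1]] == "x") / nchar(masked)
coverage
```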
Scaling a vector
Description
scale_protti is used to scale a numeric vector either between 0 and 1 or around a centered value using the standard deviation. If a vector containing only one value or repeatedly the same value is provided, 1 is returned as the scaled value for method = "01" and 0 is returned for method = "center".
Usage
scale_protti(x, method)
Arguments
x | a numeric vector |
method | a character value that specifies the method to be used for scaling. "01" scales the vector between 0 and 1. "center" scales the vector equal to |
Value
A scaled numeric vector.
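The two scaling methods and the constant-vector rule from the description can be sketched in base R as follows. scale_sketch is a hypothetical stand-in, not the package code, and it assumes the input contains no missing values.

```r
# Hypothetical stand-in for the two scaling methods (assumes no NA values).
scale_sketch <- function(x, method) {
  constant <- length(unique(x)) <= 1 # one value or the same value repeated
  if (method == "01") {
    if (constant) return(rep(1, length(x)))
    (x - min(x)) / (max(x) - min(x)) # min maps to 0, max maps to 1
  } else if (method == "center") {
    if (constant) return(rep(0, length(x)))
    (x - mean(x)) / stats::sd(x)     # subtract mean, divide by sd
  } else {
    stop("method must be \"01\" or \"center\"")
  }
}

scaled_01 <- scale_sketch(c(1, 2, 1, 4, 6, 8), method = "01")
scaled_center <- scale_sketch(c(1, 2, 3), method = "center")
```

The early return for constant vectors avoids a division by zero, since both the range and the standard deviation are 0 in that case.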
Examples
scale_protti(c(1, 2, 1, 4, 6, 8), method = "01")

Protein sequence coverage
Description
This function was deprecated due to its name changing to calculate_sequence_coverage().
Usage
sequence_coverage(...)
Value
A new column in the data data frame containing the calculated sequence coverage for each identified protein.
Convert metal names to search pattern
Description
Converts a vector of metal names extracted from the ft_metal column obtained with fetch_uniprot to a pattern that can be used to search for corresponding ChEBI IDs. This is used as a helper function for other functions.
Usage
split_metal_name(metal_names)
Arguments
metal_names | a character vector containing names of metals and metal containing molecules. |
Value
A character vector with metal name search patterns.
Check treatment enrichment
Description
This function was deprecated due to its name changing to calculate_treatment_enrichment().
Usage
treatment_enrichment(...)
Value
A bar plot displaying the percentage of all detected proteins and all significant proteins that bind to the treatment. A Fisher's exact test is performed to calculate the significance of the enrichment in significant proteins compared to all proteins. The result is reported as a p-value. If plot = FALSE a contingency table in long format is returned.
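The test behind this value can be illustrated with stats::fisher.test from base R. The counts below are invented purely for illustration, and the table layout is an assumption rather than the exact structure the function builds internally.

```r
# Hypothetical counts: rows split proteins by significance,
# columns by whether they are known to bind the treatment.
contingency <- matrix(
  c(8, 12,
    20, 460),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    significant = c("yes", "no"),
    binds_treatment = c("yes", "no")
  )
)

# Fisher's exact test on the 2x2 table
result <- stats::fisher.test(contingency)
result$p.value

# Percentages of the kind shown as bars in the plot
pct_all <- 100 * sum(contingency[, "yes"]) / sum(contingency)
pct_sig <- 100 * contingency["yes", "yes"] / sum(contingency["yes", ])
```

In this toy table 5.6 % of all proteins but 40 % of the significant proteins bind the treatment, which is the kind of difference the test evaluates.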
Query from URL
Description
Downloads a data table from a URL. If an error occurs during the query (for example due to no connection) the function waits 3 seconds and tries again. If no result could be obtained after the given number of tries, a message indicating the problem is returned.
Usage
try_query(
  url,
  max_tries = 5,
  silent = TRUE,
  type = "text/tab-separated-values",
  timeout = 60,
  accept = NULL,
  ...
)
Arguments
url | a character value of a URL to the website that contains the table that should be downloaded. |
max_tries | a numeric value that specifies the number of times the function tries to download the data in case an error occurs. Default is 5. |
silent | a logical value that specifies if individual messages are printed after each try that failed. |
type | a character value that specifies the type of data at the target URL. Options are all options that can be supplied to httr::content; these include e.g. "text/tab-separated-values", "application/json" and "text/csv". Default is "text/tab-separated-values". |
timeout | a numeric value that specifies the maximum request time. Default is 60 seconds. |
accept | a character value that specifies the type of data that should be sent by the API if it uses content negotiation. The default is NULL and it should only be set for APIs that use content negotiation. |
... | other parameters supplied to the parsing function used by httr::content. |
Value
A data frame that contains the table from the URL.
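The retry behaviour described above can be sketched without any network access. retry_sketch and flaky are hypothetical names. The real function performs an HTTP request, whereas this sketch accepts any function and only mirrors the try, wait and give-up logic.

```r
# Hypothetical retry wrapper: call fetch() up to max_tries times,
# waiting between failed attempts, and return a message if all fail.
retry_sketch <- function(fetch, max_tries = 5, wait = 3, silent = TRUE) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fetch(), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result) # success: stop retrying
    }
    if (!silent) {
      message("Attempt ", attempt, " failed: ", conditionMessage(result))
    }
    if (attempt < max_tries) Sys.sleep(wait)
  }
  # All attempts failed: return a message instead of raising an error
  paste("No result after", max_tries, "tries:", conditionMessage(result))
}

# Stand-in for a download that fails twice before succeeding
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("no connection")
  data.frame(id = 1:2)
}

result <- retry_sketch(flaky, wait = 0) # wait = 0 keeps the example fast
```

Returning a character message rather than stopping means callers need to check the type of the result before using it as a data frame.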
Perform Welch's t-test
Description
Performs a Welch's t-test and calculates p-values between two groups.
Usage
ttest_protti(mean1, mean2, sd1, sd2, n1, n2, log_values = TRUE)
Arguments
mean1 | a numeric vector that contains the means of group1. |
mean2 | a numeric vector that contains the means of group2. |
sd1 | a numeric vector that contains the standard deviations of group1. |
sd2 | a numeric vector that contains the standard deviations of group2. |
n1 | a numeric vector that contains the number of replicates used for the calculation of each mean and standard deviation of group1. |
n2 | a numeric vector that contains the number of replicates used for the calculation of each mean and standard deviation of group2. |
log_values | a logical value that indicates if values are log transformed. This determines how fold changes are calculated. Default is TRUE. |
Value
A data frame that contains the calculated differences of means, standard error, t statistic and p-values.
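Welch's t-test can be computed from summary statistics alone, which is what the arguments above provide. The sketch below uses the standard Welch formulas with Welch-Satterthwaite degrees of freedom. welch_sketch is a hypothetical helper, and the direction of the difference (mean1 minus mean2) and the output column names are assumptions rather than confirmed package behaviour.

```r
# Hypothetical helper: Welch's t-test from means, sds and group sizes.
welch_sketch <- function(mean1, mean2, sd1, sd2, n1, n2) {
  diff <- mean1 - mean2                    # assumed direction of the difference
  std_error <- sqrt(sd1^2 / n1 + sd2^2 / n2)
  t_statistic <- diff / std_error
  # Welch-Satterthwaite approximation of the degrees of freedom
  df <- (sd1^2 / n1 + sd2^2 / n2)^2 /
    ((sd1^2 / n1)^2 / (n1 - 1) + (sd2^2 / n2)^2 / (n2 - 1))
  # Two-sided p-value from the t distribution
  pval <- 2 * stats::pt(abs(t_statistic), df = df, lower.tail = FALSE)
  data.frame(
    diff = diff, std_error = std_error,
    t_statistic = t_statistic, df = df, pval = pval
  )
}

# Check against stats::t.test() on raw values with the same summary statistics
x <- c(9, 10, 11)     # mean 10, sd 1
y <- c(15, 15.5, 16)  # mean 15.5, sd 0.5
from_summary <- welch_sketch(mean(x), mean(y), stats::sd(x), stats::sd(y), 3, 3)
reference <- stats::t.test(x, y) # Welch is the default (var.equal = FALSE)
```

Because all operations are vectorised, the helper also accepts vectors of means, standard deviations and group sizes.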
Examples
ttest_protti(
  mean1 = 10,
  mean2 = 15.5,
  sd1 = 1,
  sd2 = 0.5,
  n1 = 3,
  n2 = 3
)

Viridis colour scheme
Description
A colour scheme based on the viridis colour scheme from the viridis R package.
Usage
viridis_colours
Format
A vector containing 256 colours
Source
viridis R package, created by Stéfan van der Walt (stefanv) and Nathaniel Smith (njsmith)
Volcano plot
Description
Plots a volcano plot for the given input.
Usage
volcano_plot(
  data,
  grouping,
  log2FC,
  significance,
  method,
  target_column = NULL,
  target = NULL,
  facet_by = NULL,
  facet_scales = "fixed",
  title = "Volcano plot",
  x_axis_label = "log2(fold change)",
  y_axis_label = "-log10(p-value)",
  legend_label = "Target",
  colour = NULL,
  log2FC_cutoff = 1,
  significance_cutoff = 0.01,
  interactive = FALSE
)
Arguments
data | a data frame that contains at least the input variables. |
grouping | a character column in the |
log2FC | a character column in the |
significance | a character column in the |
method | a character value that specifies the method used for the plot. |
target_column | optional, a column required for |
target | optional, a vector required for |
facet_by | optional, a character column that contains information by which the data should be faceted into multiple plots. |
facet_scales | a character value that specifies if the scales should be "free", "fixed", "free_x" or "free_y", if a faceted plot is created. These inputs are directly supplied to the |
title | optional, a character value that specifies the title of the volcano plot. Default is "Volcano plot". |
x_axis_label | optional, a character value that specifies the x-axis label. Default is "log2(fold change)". |
y_axis_label | optional, a character value that specifies the y-axis label. Default is "-log10(p-value)". |
legend_label | optional, a character value that specifies the legend label. Default is "Target". |
colour | optional, a character vector containing colours that should be used to colour points according to the selected method. IMPORTANT: the first value in the vector is the default point colour, the additional values specify colouring of target or significant points. E.g. |
log2FC_cutoff | optional, a numeric value that specifies the log2 transformed fold change cutoff used for the vertical lines, which can be used to assess the significance of changes. Default value is 1. |
significance_cutoff | optional, a character vector that specifies the p-value cutoff used for the horizontal cutoff line, which can be used to assess the significance of changes. The vector can consist solely of one element, which is the cutoff value. In that case the cutoff will be applied directly to the plot. Alternatively, a second element can be provided to the vector that specifies a column in the |
interactive | a logical value that specifies whether the plot should be interactive (default is FALSE). |
Value
Depending on the method used a volcano plot with either highlighted targets (method = "target") or highlighted significant proteins (method = "significant") is returned.
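How the two cutoffs partition the points of a volcano plot can be sketched in base R. volcano_flags_sketch is a hypothetical helper that uses the default cutoff values from the Usage section. Whether the package compares with strict or non-strict inequalities is not stated here, so the strict comparisons below are an assumption.

```r
# Hypothetical helper: compute the y-axis value of a volcano plot and flag
# points that pass both the fold change and the significance cutoff.
volcano_flags_sketch <- function(log2FC, pval,
                                 log2FC_cutoff = 1,
                                 significance_cutoff = 0.01) {
  data.frame(
    log2FC = log2FC,
    neg_log10_p = -log10(pval), # plotted on the y-axis
    significant = abs(log2FC) > log2FC_cutoff & pval < significance_cutoff
  )
}

flags <- volcano_flags_sketch(
  log2FC = c(2.3, 0.3, -4, 1.5),
  pval = c(0.001, 0.7, 0.003, 0.03)
)
flags
```

Only the first and third points pass both cutoffs: the second fails both, and the fourth has a large enough fold change but a p-value above 0.01.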
Examples
set.seed(123) # Makes example reproducible

# Create synthetic data
data <- create_synthetic_data(
  n_proteins = 10,
  frac_change = 0.5,
  n_replicates = 4,
  n_conditions = 3,
  method = "effect_random",
  additional_metadata = FALSE
)

# Assign missingness information
data_missing <- assign_missingness(
  data,
  sample = sample,
  condition = condition,
  grouping = peptide,
  intensity = peptide_intensity_missing,
  ref_condition = "all",
  retain_columns = c(protein, change_peptide)
)

# Calculate differential abundances
diff <- calculate_diff_abundance(
  data = data_missing,
  sample = sample,
  condition = condition,
  grouping = peptide,
  intensity_log2 = peptide_intensity_missing,
  missingness = missingness,
  comparison = comparison,
  method = "t-test",
  retain_columns = c(protein, change_peptide)
)

volcano_plot(
  data = diff,
  grouping = peptide,
  log2FC = diff,
  significance = pval,
  method = "target",
  target_column = change_peptide,
  target = TRUE,
  facet_by = comparison,
  significance_cutoff = c(0.05, "adj_pval")
)

Volcano plot
Description
This function was deprecated due to its name changing to volcano_plot().
Usage
volcano_protti(...)
Value
Depending on the method used a volcano plot with either highlighted targets (method = "target") or highlighted significant proteins (method = "significant") is returned.
Woods' plot
Description
Creates a Woods' plot that plots log2 fold change of peptides or precursors along the protein sequence. The peptides or precursors are located on the x-axis based on their start and end positions. The position on the y-axis displays the fold change. The vertical size (y-axis) of the box representing the peptides or precursors does not have any meaning.
Usage
woods_plot(
  data,
  fold_change,
  start_position,
  end_position,
  protein_length,
  coverage = NULL,
  protein_id,
  targets = "all",
  facet = TRUE,
  colouring = NULL,
  fold_change_cutoff = 1,
  highlight = NULL,
  export = FALSE,
  export_name = "woods_plots"
)
Arguments
data | a data frame that contains differential abundance, start and end peptide or precursor positions, protein length and optionally a variable based on which peptides or precursors should be coloured. |
fold_change | a numeric column in the |
start_position | a numeric column in the |
end_position | a numeric column in the |
protein_length | a numeric column in the |
coverage | optional, a numeric column in the |
protein_id | a character column in the |
targets | a character vector that specifies the identifiers of the proteins (depending on |
facet | a logical value that indicates if plots should be summarised into facets of 20 plots. This is recommended for many plots. Default is TRUE. |
colouring | optional, a character or numeric (discrete or continuous) column in the data frame containing information by which peptides or precursors should be coloured. |
fold_change_cutoff | optional, a numeric value that specifies the log2 fold change cutoff used in the plot. The default value is 1. |
highlight | optional, a logical column that specifies whether specific peptides or precursors should be highlighted with an asterisk. |
export | a logical value that indicates if plots should be exported as PDF. The output directory will be the current working directory. The name of the file can be chosen using the |
export_name | a character vector that provides the name of the exported file if |
Value
A list containing Woods' plots is returned. The plots show peptide or precursor log2 fold changes along the protein sequence.
Examples
# Create example data
data <- data.frame(
  fold_change = c(2.3, 0.3, -0.4, -4, 1),
  pval = c(0.001, 0.7, 0.9, 0.003, 0.03),
  start = c(20, 30, 45, 90, 140),
  end = c(33, 40, 64, 100, 145),
  protein_length = c(rep(150, 5)),
  protein_id = c(rep("P1", 5))
)

# Plot Woods' plot
woods_plot(
  data = data,
  fold_change = fold_change,
  start_position = start,
  end_position = end,
  protein_length = protein_length,
  protein_id = protein_id,
  colouring = pval
)