| Type: | Package |
| Title: | Proteomics Data Analysis and Modeling Tools |
| Version: | 0.2.2 |
| Description: | A comprehensive, user-friendly package for label-free proteomics data analysis and machine learning-based modeling. Data generated from 'MaxQuant' can be easily used to conduct differential expression analysis, build predictive models with top protein candidates, and assess model performance. promor includes a suite of tools for quality control, visualization, missing data imputation (Lazar et al. (2016) <doi:10.1021/acs.jproteome.5b00981>), differential expression analysis (Ritchie et al. (2015) <doi:10.1093/nar/gkv007>), and machine learning-based modeling (Kuhn (2008) <doi:10.18637/jss.v028.i05>). |
| License: | LGPL-2.1 |LGPL-3 [expanded from: LGPL (≥ 2.1)] |
| Encoding: | UTF-8 |
| Language: | en-US |
| RoxygenNote: | 7.3.3 |
| VignetteBuilder: | knitr |
| Suggests: | covr, knitr, rmarkdown, testthat (≥ 3.0.0) |
| Depends: | R (≥ 3.5.0) |
| URL: | https://github.com/caranathunge/promor, https://caranathunge.github.io/promor/ |
| Imports: | reshape2, ggplot2, ggrepel, gridExtra, limma, statmod, pcaMethods, VIM, missForest, caret, kernlab, xgboost, naivebayes, viridis, pROC |
| LazyData: | true |
| Config/testthat/edition: | 3 |
| BugReports: | https://github.com/caranathunge/promor/issues |
| NeedsCompilation: | no |
| Packaged: | 2025-11-11 16:52:29 UTC; caran |
| Author: | Chathurani Ranathunge |
| Maintainer: | Chathurani Ranathunge <caranathunge86@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2025-11-11 22:20:02 UTC |
Compute average intensity
Description
This function computes average intensities across technical replicates for each sample.
Usage
aver_techreps(raw_df)
Arguments
raw_df | A |
Details
aver_techreps assumes that column names in the data frame follow the "Group_UniqueSampleID_TechnicalReplicate" notation. (Use head(raw_df) to see the structure of the raw_df object.)
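As a rough illustration of this naming convention, the base R sketch below averages columns that share a Group_UniqueSampleID prefix. It is not promor's implementation: the toy data, the column names, and the helper name average_techreps_sketch are invented for this example, and ignoring NA values with na.rm = TRUE is an assumption.

```r
# Toy log2 intensities: 2 proteins, 2 samples, 2 technical replicates each.
# Column names follow Group_UniqueSampleID_TechnicalReplicate.
toy <- data.frame(
  A_S1_1 = c(10, 20), A_S1_2 = c(12, NA),
  B_S2_1 = c(30, 40), B_S2_2 = c(32, 44),
  row.names = c("prot1", "prot2")
)

# Hypothetical helper (not part of promor): average the replicate columns
# that share the same Group_UniqueSampleID prefix.
average_techreps_sketch <- function(df) {
  sample_id <- sub("_[^_]+$", "", colnames(df))  # drop the replicate suffix
  out <- sapply(unique(sample_id), function(id) {
    # Assumption: missing replicate values are ignored when averaging
    rowMeans(df[, sample_id == id, drop = FALSE], na.rm = TRUE)
  })
  as.data.frame(out)
}

toy_ave <- average_techreps_sketch(toy)
```

Here toy_ave has one column per sample (A_S1 and B_S2) instead of one per technical replicate.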
Value
A raw_df object of averaged intensities.
Author(s)
Chathurani Ranathunge
See Also
Examples
## Use a data set containing technical replicates to create a raw_df object
raw_df <- create_df(
  prot_groups = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/pg2.txt",
  exp_design = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed2.txt",
  tech_reps = TRUE
)
# Compute average intensities across technical replicates.
rawdf_ave <- aver_techreps(raw_df)
Correlation between technical replicates
Description
This function generates scatter plots to visualize the correlation between a given pair of technical replicates (e.g., 1 vs. 2) for each sample.
Usage
corr_plot(
  raw_df,
  rep_1,
  rep_2,
  save = FALSE,
  file_type = "pdf",
  palette = "viridis",
  text_size = 5,
  n_row = 4,
  n_col = 4,
  dpi = 80,
  file_path = NULL
)
Arguments
raw_df | A |
rep_1 | Numerical. Technical replicate number. |
rep_2 | Numerical. Number of the second technical replicate to compare to |
save | Logical. If |
file_type | File type to save the scatter plots. Default is |
palette | Viridis color palette option for plots. Default is |
text_size | Text size for plot labels, axis labels etc. Default is |
n_row | Numerical. Number of plots to print in a row in a single page. Default is |
n_col | Numerical. Number of plots to print in a column in a single page. Default is |
dpi | Plot resolution. Default is |
file_path | A string containing the directory path to save the file. |
Details
Given a data frame of log-transformed intensities (a raw_df object) and a pair of numbers referring to the technical replicates, corr_plot produces a list of scatter plots showing correlation between the given pair of technical replicates for all the samples provided in the data frame.
Note: n_row * n_col should be equal to the number of samples to display in a single page.
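The quantity these scatter plots visualize can be sketched numerically in base R. The example below computes the Pearson correlation between replicates 1 and 2 for each sample; the toy data and column names are invented, and using cor() with pairwise-complete observations is an assumption rather than a description of what corr_plot does internally.

```r
# Toy log intensities: 4 proteins, 2 samples, 2 technical replicates each.
toy <- data.frame(
  A_S1_1 = c(10, 12, 14, 16), A_S1_2 = c(10.2, 11.9, 14.1, 15.8),
  B_S2_1 = c(20, 22, 24, 26), B_S2_2 = c(26, 20, 24, 22)
)

rep_1 <- 1
rep_2 <- 2

# Sample IDs are the column names without the replicate suffix
samples <- unique(sub("_[^_]+$", "", colnames(toy)))

# One correlation coefficient per sample (replicate rep_1 vs. rep_2)
rep_cor <- sapply(samples, function(s) {
  cor(
    toy[[paste(s, rep_1, sep = "_")]],
    toy[[paste(s, rep_2, sep = "_")]],
    use = "pairwise.complete.obs"
  )
})
```

In this toy data, sample A_S1 has closely matching replicates and sample B_S2 does not, which is the kind of difference the scatter plots are meant to expose.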
Value
A list of ggplot2 plot objects.
Author(s)
Chathurani Ranathunge
See Also
create_df
Examples
## Use a data set containing technical replicates to create a raw_df object
raw_df <- create_df(
  prot_groups = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/pg2.txt",
  exp_design = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed2.txt",
  tech_reps = TRUE
)
## Compare technical replicates 1 vs. 2 for all samples
corr_plot(raw_df, rep_1 = 1, rep_2 = 2)
Suvarna et al 2021 LFQ data (fit object)
Description
An object of class "MArrayLM" from running find_dep on covid_norm_df
Usage
data(covid_fit_df)
Format
An object of class "MArrayLM"
References
https://www.frontiersin.org/articles/10.3389/fphys.2021.652799/full#h3
Suvarna et al 2021 LFQ data (normalized)
Description
A data frame containing normalized LFQ protein intensity data for 230 proteins in 35 samples (a subset of the original data set)
Usage
data(covid_norm_df)
Format
A data frame with 230 rows (proteins) and 35 columns (samples)
References
https://www.frontiersin.org/articles/10.3389/fphys.2021.652799/full#h3
Create a data frame of protein intensities
Description
This function creates a data frame of protein intensities
Usage
create_df(
  prot_groups,
  exp_design,
  input_type = "MaxQuant",
  data_type = "LFQ",
  filter_na = TRUE,
  filter_prot = TRUE,
  uniq_pep = 2,
  tech_reps = FALSE,
  zero_na = TRUE,
  log_tr = TRUE,
  base = 2
)
Arguments
prot_groups | File path to a proteinGroups.txt file produced by MaxQuant or a standard input file containing a quantitative matrix where the proteins or protein groups are indicated by rows and the samples by columns. |
exp_design | File path to a text file containing the experimental design. |
input_type | Type of input file indicated by |
data_type | Type of sample protein intensity data columns to use from the proteinGroups.txt file. Some available options are "LFQ", "iBAQ", "Intensity". Default is "LFQ". User-defined prefixes in the proteinGroups.txt file are also allowed. The |
filter_na | Logical. If |
filter_prot | Logical. If |
uniq_pep | Numerical. Proteins that are identified by this number or fewer unique peptides are filtered out (default is 2). Only applies when |
tech_reps | Logical. Indicate as |
zero_na | Logical. If |
log_tr | Logical. If |
base | Numerical. Logarithm base. Default is 2. |
Details
This function first reads in the proteinGroups.txt file produced by MaxQuant or a standard input file containing a quantitative matrix where the proteins or protein groups are indicated by rows and the samples by columns.
It then reads in the expDesign.txt file provided as exp_design and extracts relevant information from it to add to the data frame. An example of the expDesign.txt file is provided here: https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed1.txt.
First, empty rows and columns are removed from the data frame.
Next, if a proteinGroups.txt file is used, it filters out reverse proteins, proteins that were only identified by site, and potential contaminants.
Then it removes proteins identified with fewer than the number of unique peptides indicated by uniq_pep from the data frame.
Next, it extracts the intensity columns indicated by data_type and the selected protein rows from the data frame.
It then converts missing values (zeros) to NAs.
Finally, the function log transforms the intensity values.
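The last two steps (zero-to-NA conversion and log transformation) can be sketched in base R as follows. The toy matrix is invented, and this is only an illustration of the two operations, not the code create_df runs internally.

```r
# Toy raw intensities; zeros stand for missing measurements.
intens <- matrix(
  c(1024, 0, 4096, 256),
  nrow = 2,
  dimnames = list(c("prot1", "prot2"), c("H_1", "L_1"))
)

# Corresponds to zero_na = TRUE: treat zeros as missing values
intens[intens == 0] <- NA

# Corresponds to log_tr = TRUE with base = 2: NAs stay NA
log_intens <- log(intens, base = 2)
```

Converting zeros to NA before the log step avoids producing -Inf values in the transformed matrix.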
Value
A raw_df object, which is a data frame containing protein intensities. Proteins or protein groups are indicated by rows and samples by columns.
Author(s)
Chathurani Ranathunge
Examples
### Using a proteinGroups.txt file produced by MaxQuant as input.
## Generate a raw_df object with default settings. No technical replicates.
raw_df <- create_df(
  prot_groups = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/pg1.txt",
  exp_design = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed1.txt",
  input_type = "MaxQuant"
)
## Data containing technical replicates
raw_df <- create_df(
  prot_groups = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/pg2.txt",
  exp_design = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed2.txt",
  input_type = "MaxQuant",
  tech_reps = TRUE
)
## Alter the number of unique peptides needed to retain a protein
raw_df <- create_df(
  prot_groups = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/pg1.txt",
  exp_design = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed1.txt",
  input_type = "MaxQuant",
  uniq_pep = 1
)
## Use "iBAQ" values instead of "LFQ" values
raw_df <- create_df(
  prot_groups = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/pg1.txt",
  exp_design = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed1.txt",
  input_type = "MaxQuant",
  data_type = "iBAQ"
)
### Using a universal standard input file instead of MaxQuant output.
raw_df <- create_df(
  prot_groups = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/st.txt",
  exp_design = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed1.txt",
  input_type = "standard"
)
Cox et al 2014 LFQ data (fit object)
Description
An object of class "MArrayLM" from running find_dep on ecoli_norm_df
Usage
data(ecoli_fit_df)
Format
An object of class "MArrayLM"
References
https://europepmc.org/article/MED/24942700#id609082
Cox et al 2014 LFQ data (normalized)
Description
A data frame containing normalized LFQ protein intensity data for 4360 proteins in 6 samples
Usage
data(ecoli_norm_df)
Format
A data frame with 4360 rows (proteins) and 6 columns (samples)
References
https://europepmc.org/article/MED/24942700#id609082
Visualize feature (protein) variation among conditions
Description
This function visualizes protein intensity differences among conditions (classes) using box plots or density distribution plots.
Usage
feature_plot(
  model_df,
  type = "box",
  text_size = 10,
  palette = "viridis",
  n_row,
  n_col,
  save = FALSE,
  file_path = NULL,
  file_name = "Feature_plot",
  file_type = "pdf",
  dpi = 80,
  plot_width = 7,
  plot_height = 7
)
Arguments
model_df | A |
type | Type of plot to generate. Choices are "box" or "density". Default is |
text_size | Text size for plot labels, axis labels etc. Default is |
palette | Viridis color palette option for plots. Default is |
n_row | Number of rows to print the plots. |
n_col | Number of columns to print the plots. |
save | Logical. If |
file_path | A string containing the directory path to save the file. |
file_name | File name to save the plot. Default is |
file_type | File type to save the plot. Default is |
dpi | Plot resolution. Default is |
plot_width | Width of the plot. Default is |
plot_height | Height of the plot. Default is |
Details
This function visualizes condition-wise differences in protein intensity using box plots and/or density plots.
Value
A ggplot2 object
Author(s)
Chathurani Ranathunge
See Also
pre_process, rem_feature
Examples
## Create a model_df object with default settings.
covid_model_df <- pre_process(covid_fit_df, covid_norm_df)
## Feature variation - box plots
feature_plot(covid_model_df, type = "box", n_row = 4, n_col = 2)
## Density plots
feature_plot(covid_model_df, type = "density")
## Change color palette
feature_plot(covid_model_df, type = "density", n_row = 4, n_col = 2, palette = "rocket")
Filter proteins by group level missing data
Description
This function filters out proteins based on missing data at the group level.
Usage
filterbygroup_na(raw_df, set_na = 0.34, filter_condition = "either")
Arguments
raw_df | A |
set_na | The proportion of missing data allowed. Default is 0.34 (one third of the samples in the group). |
filter_condition | If set to |
Details
This function first extracts group or condition information from the raw_df object and assigns samples to their groups.
If filter_condition = "each", it then removes proteins (rows) from the data frame if the proportion of NAs in each group exceeds the threshold indicated by set_na (default is 0.34). This option is more lenient in comparison to filter_condition = "either", where proteins that exceed the missing data threshold in either group are removed from the data frame.
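The difference between the two conditions can be sketched in base R. The example below is an illustration under stated assumptions, not promor's code: the toy data and variable names are invented, and the group label is assumed to be the text before the first underscore in each column name.

```r
# Toy log intensities: 3 proteins, groups H and L with 3 samples each.
toy <- data.frame(
  H_1 = c(21, NA, NA), H_2 = c(22, NA, NA), H_3 = c(23, 25, NA),
  L_1 = c(20, 24, NA), L_2 = c(21, 25, NA), L_3 = c(NA, 26, 27),
  row.names = c("prot1", "prot2", "prot3")
)

# Assumption: the group is the text before the first "_"
groups <- sub("_.*$", "", colnames(toy))

# Proportion of NAs per protein (rows) within each group (columns)
na_prop <- sapply(unique(groups), function(g) {
  rowMeans(is.na(toy[, groups == g, drop = FALSE]))
})

set_na <- 0.34
over <- na_prop > set_na

# "either": drop a protein if ANY group exceeds the threshold
keep_either <- !apply(over, 1, any)
# "each": drop a protein only if ALL groups exceed the threshold
keep_each <- !apply(over, 1, all)

filt_either <- toy[keep_either, ]
filt_each <- toy[keep_each, ]
```

In this toy data, prot2 exceeds the threshold only in group H, so it is dropped under "either" but retained under "each".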
Value
A raw_df object.
Author(s)
Chathurani Ranathunge
See Also
Examples
# Generate a raw_df object with default settings. No technical replicates.
raw_df <- create_df(
  prot_groups = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/pg1.txt",
  exp_design = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed1.txt"
)
## Remove proteins that exceed 34% NAs in either group (default)
rawdf_filt1 <- filterbygroup_na(raw_df)
## Remove proteins that exceed 34% NAs in each group
rawdf_filt2 <- filterbygroup_na(raw_df, filter_condition = "each")
## Proportion of samples with NAs allowed in each group = 0.5
rawdf_filt3 <- filterbygroup_na(raw_df, set_na = 0.5, filter_condition = "each")
Identify differentially expressed proteins between groups
Description
This function performs differential expression analysis on protein intensity data with limma.
Usage
find_dep(
  df,
  save_output = FALSE,
  save_tophits = FALSE,
  file_path = NULL,
  adj_method = "BH",
  cutoff = 0.05,
  lfc = 1,
  n_top = 20
)
Arguments
df | A |
save_output | Logical. If |
save_tophits | Logical. If |
file_path | A string containing the directory path to save the file. |
adj_method | Method used for adjusting the p-values for multiple testing. Default is |
cutoff | Cutoff value for p-values and adjusted p-values. Default is 0.05. |
lfc | Minimum absolute log2-fold change to use as threshold for differential expression. |
n_top | The number of top differentially expressed proteins to save in the "TopHits.txt" file. Default is |
Details
It is important that the data is first log-transformed, ideally,imputed, and normalized before performing differential expression analysis.
save_output saves the complete results table from the differential expression analysis.
save_tophits first subsets the results to those with absolute log fold change of more than 1, performs multiple testing correction with the method specified in adj_method, and outputs the top n_top results based on lowest p-value and adjusted p-value.
If the number of hits with absolute log fold change of more than 1 is less than n_top, find_dep prints only those with absolute log fold change > 1 to "TopHits.txt".
If the file_path is not specified, text files will be saved in a temporary directory.
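The order of operations described for save_tophits (filter by fold change, adjust p-values, rank, keep the top n_top) can be sketched with base R's p.adjust(). The results table below is invented and its column names merely mimic limma-style output; this is a sketch of the described steps, not the code find_dep runs.

```r
# Invented results table: log fold changes and raw p-values for 5 proteins.
res <- data.frame(
  protein = paste0("prot", 1:5),
  logFC   = c(2.5, -1.8, 0.4, 1.2, -0.9),
  P.Value = c(0.001, 0.020, 0.0005, 0.030, 0.010)
)

lfc <- 1
n_top <- 2

# 1. Keep proteins whose absolute log fold change exceeds the threshold
hits <- res[abs(res$logFC) > lfc, ]
# 2. Adjust the p-values of the retained proteins (Benjamini-Hochberg)
hits$adj.P.Val <- p.adjust(hits$P.Value, method = "BH")
# 3. Rank by p-value, then adjusted p-value, and keep the top n_top
hits <- hits[order(hits$P.Value, hits$adj.P.Val), ]
top_hits <- head(hits, n_top)
```

Note that prot3 has the smallest raw p-value but is excluded because its fold change is below the threshold.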
Value
A fit_df object, which is similar to a limma fit object.
Author(s)
Chathurani Ranathunge
References
Ritchie, Matthew E., et al. "limma powers differential expression analyses for RNA-sequencing and microarray studies." Nucleic acids research 43.7 (2015): e47-e47.
See Also
Examples
## Perform differential expression analysis using default settings
fit_df1 <- find_dep(ecoli_norm_df)
## Change p-value and adjusted p-value cutoff
fit_df2 <- find_dep(ecoli_norm_df, cutoff = 0.1)
Heatmap of differentially expressed proteins
Description
This function generates a heatmap to visualize differentially expressed proteins between groups
Usage
heatmap_de(
  fit_df,
  df,
  adj_method = "BH",
  cutoff = 0.05,
  lfc = 1,
  sig = "adjP",
  n_top = 20,
  palette = "viridis",
  text_size = 10,
  save = FALSE,
  file_path = NULL,
  file_name = "HeatmapDE",
  file_type = "pdf",
  dpi = 80,
  plot_height = 7,
  plot_width = 7
)
Arguments
fit_df | A |
df | The |
adj_method | Method used for adjusting the p-values for multiple testing. Default is |
cutoff | Cutoff value for p-values and adjusted p-values. Default is 0.05. |
lfc | Minimum absolute log2-fold change to use as threshold for differential expression. Default is 1. |
sig | Criteria to denote significance. Choices are |
n_top | Number of top hits to include in the heat map. |
palette | Viridis color palette option for plots. Default is |
text_size | Text size for axis text, labels etc. |
save | Logical. If |
file_path | A string containing the directory path to save the file. |
file_name | File name to save the plot. Default is "HeatmapDE". |
file_type | File type to save the plot. Default is |
dpi | Plot resolution. Default is |
plot_height | Height of the plot. Default is 7. |
plot_width | Width of the plot. Default is 7. |
Details
By default the tiles in the heatmap are reordered by intensity valuesalong both axes (x axis = samples, y axis = proteins).
Value
A ggplot2 plot object.
Author(s)
Chathurani Ranathunge
See Also
Examples
## Build a heatmap of differentially expressed proteins using the provided
## example fit_df and norm_df data objects
heatmap_de(covid_fit_df, covid_norm_df)
## Create a heatmap with P-value of 0.05 and log fold change of 1 as
## significance criteria.
heatmap_de(covid_fit_df, covid_norm_df, cutoff = 0.05, sig = "P")
## Visualize the top 30 differentially expressed proteins in the heatmap and
## change the color palette
heatmap_de(covid_fit_df, covid_norm_df, cutoff = 0.05, sig = "P", n_top = 30, palette = "magma")
Visualize missing data
Description
This function visualizes the patterns of missing value occurrence using a heatmap.
Usage
heatmap_na(
  raw_df,
  protein_range,
  sample_range,
  reorder_x = FALSE,
  reorder_y = FALSE,
  x_fun = mean,
  y_fun = mean,
  palette = "viridis",
  label_proteins = FALSE,
  text_size = 10,
  save = FALSE,
  file_type = "pdf",
  file_path = NULL,
  file_name = "Missing_data_heatmap",
  plot_width = 15,
  plot_height = 15,
  dpi = 80
)
Arguments
raw_df | A |
protein_range | The range or subset of proteins (rows) to plot. If not provided, all the proteins (rows) in the data frame will be used. |
sample_range | The range of samples to plot. If not provided, all the samples (columns) in the data frame will be used. |
reorder_x | Logical. If |
reorder_y | Logical. If |
x_fun | Function to reorder samples along the x axis. Possible options are |
y_fun | Function to reorder proteins along the y axis. Possible options are |
palette | Viridis color palette option for plots. Default is |
label_proteins | If |
text_size | Text size for axis labels. Default is |
save | Logical. If |
file_type | File type to save the heatmap. Default is |
file_path | A string containing the directory path to save the file. |
file_name | File name to save the heatmap. Default is |
plot_width | Width of the plot. Default is |
plot_height | Height of the plot. Default is |
dpi | Plot resolution. Default is |
Details
This function visualizes patterns of missing value occurrence using a heatmap. The user can choose to reorder the axes using the available functions (x_fun, y_fun) to better understand the underlying cause of missing data.
Value
A ggplot2 plot object.
Author(s)
Chathurani Ranathunge
See Also
Examples
## Generate a raw_df object with default settings. No technical replicates.
raw_df <- create_df(
  prot_groups = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/pg1.txt",
  exp_design = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed1.txt"
)
## Missing data heatmap with default settings.
heatmap_na(raw_df)
## Missing data heatmap with x and y axes reordered by the mean (default) of
## protein intensity.
heatmap_na(raw_df, reorder_x = TRUE, reorder_y = TRUE)
## Missing data heatmap with x and y axes reordered by the sum of
## protein intensity.
heatmap_na(raw_df, reorder_x = TRUE, reorder_y = TRUE, x_fun = sum, y_fun = sum)
## Missing data heatmap for a subset of the proteins with x and y axes
## reordered by the mean (default) of protein intensity and the y axis
## labeled with protein IDs.
heatmap_na(raw_df, protein_range = 1:30, reorder_x = TRUE, reorder_y = TRUE, label_proteins = TRUE)
Impute missing values
Description
This function imputes missing values using a user-specified imputation method.
Usage
impute_na(
  df,
  method = "minProb",
  tune_sigma = 1,
  q = 0.01,
  maxiter = 10,
  ntree = 20,
  n_pcs = 2,
  seed = NULL
)
Arguments
df | A |
method | Imputation method to use. Default is |
tune_sigma | A scalar used in the |
q | A scalar used in |
maxiter | Maximum number of iterations to be performed when using the |
ntree | Number of trees to grow in each forest when using the |
n_pcs | Number of principal components to calculate when using the |
seed | Numerical. Random number seed. Default is |
Details
Ideally, you should first remove proteins with high levels of missing data using the filterbygroup_na function before running impute_na on the raw_df object or the norm_df object.
The impute_na function imputes missing values using a user-specified imputation method from the available options: minProb, minDet, kNN, RF, and SVD.
Note: Some imputation methods may require that the data be normalized prior to imputation.
Make sure to fix the random number seed with seed for reproducibility.
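To give a feel for what a deterministic minimum-value imputation does, the base R sketch below fills each sample's missing values with a low quantile of that sample's observed values. It assumes that minDet works this way (per-sample q-th quantile); the helper name min_det_sketch and the toy data are invented, and the sketch is not the imputeLCMD or promor implementation.

```r
# Toy log intensities with missing values (rows = proteins, columns = samples).
toy <- data.frame(
  H_1 = c(20, 22, NA, 26),
  L_1 = c(NA, 21, 23, 25)
)

# Hypothetical helper: fill each column's NAs with that column's q-th quantile
# of the observed values (assumed behavior of a minDet-style method).
min_det_sketch <- function(df, q = 0.01) {
  df[] <- lapply(df, function(x) {
    x[is.na(x)] <- quantile(x, probs = q, na.rm = TRUE)
    x
  })
  df
}

imp <- min_det_sketch(toy, q = 0.01)
```

Because this variant is deterministic, repeated runs give the same result; methods that draw random values are the ones for which fixing seed matters.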
Value
An imp_df object, which is a data frame of protein intensities with no missing values.
Author(s)
Chathurani Ranathunge
References
Lazar, Cosmin, et al. "Accounting for the multiple natures of missing values in label-free quantitative proteomics data sets to compare imputation strategies." Journal of proteome research 15.4 (2016): 1116-1125.
See Also
More information on the available imputation methods can be found in their respective packages.
For the minProb and minDet methods, see the imputeLCMD package.
For the Random Forest (RF) method, see missForest.
For the SVD method, see pca from the pcaMethods package.
Examples
## Generate a raw_df object with default settings. No technical replicates.
raw_df <- create_df(
  prot_groups = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/pg1.txt",
  exp_design = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed1.txt"
)
## Impute missing values in the data frame using the default minProb
## method.
imp_df1 <- impute_na(raw_df, seed = 3312)
## Using the kNN method.
imp_df2 <- impute_na(raw_df, method = "kNN", seed = 3312)
## Using the SVD method with n_pcs set to 3.
imp_df3 <- impute_na(raw_df, method = "SVD", n_pcs = 3, seed = 3312)
## Using the minDet method with q set at 0.001.
imp_df4 <- impute_na(raw_df, method = "minDet", q = 0.001, seed = 3312)
## Impute a normalized data set using the kNN method
imp_df5 <- impute_na(ecoli_norm_df, method = "kNN")
Visualize the impact of imputation
Description
This function generates density plots to visualize the impact of missing data imputation on the data.
Usage
impute_plot(
  original,
  imputed,
  global = TRUE,
  text_size = 10,
  palette = "viridis",
  n_row,
  n_col,
  save = FALSE,
  file_path = NULL,
  file_name = "Impute_plot",
  file_type = "pdf",
  plot_width = 7,
  plot_height = 7,
  dpi = 80
)
Arguments
original | A |
imputed | An |
global | Logical. If |
text_size | Text size for plot labels, axis labels etc. Default is |
palette | Viridis color palette option for plots. Default is |
n_row | Used if |
n_col | Used if |
save | Logical. If |
file_path | A string containing the directory path to save the file. |
file_name | File name to save the density plot/s. Default is |
file_type | File type to save the density plot/s. Default is |
plot_width | Width of the plot. Default is |
plot_height | Height of the plot. Default is |
dpi | Plot resolution. Default is |
Details
Given two data frames, one with missing values and the other an imputed data frame (imp_df object) of the same data set, impute_plot generates global or sample-wise density plots to visualize the impact of imputation on the data set.
Note: when the sample-wise option is selected (global = FALSE), n_col and n_row can be used to specify the number of columns and rows to print the plots.
If you choose to specify n_row and n_col, make sure that n_row * n_col matches the total number of samples in the data frame.
Value
A ggplot2 plot object.
Author(s)
Chathurani Ranathunge
Examples
## Generate a raw_df object with default settings. No technical replicates.
raw_df <- create_df(
  prot_groups = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/pg1.txt",
  exp_design = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed1.txt"
)
## Impute missing values in the data frame using the default minProb
## method.
imp_df <- impute_na(raw_df)
## Visualize the impact of missing data imputation with a global density
## plot.
impute_plot(original = raw_df, imputed = imp_df)
## Make sample-wise density plots
impute_plot(raw_df, imp_df, global = FALSE)
## Print plots in user-specified numbers of rows and columns
impute_plot(raw_df, imp_df, global = FALSE, n_col = 2, n_row = 3)
Visualize the effect of normalization
Description
This function visualizes the impact of normalization on the data
Usage
norm_plot(
  original,
  normalized,
  type = "box",
  text_size = 10,
  palette = "viridis",
  save = FALSE,
  file_path = NULL,
  file_name = "Norm_plot",
  file_type = "pdf",
  dpi = 80,
  plot_width = 10,
  plot_height = 7
)
Arguments
original | A |
normalized | A |
type | Type of plot to generate. Choices are "box" or "density". Default is |
text_size | Text size for plot labels, axis labels etc. Default is |
palette | Viridis color palette option for plots. Default is |
save | Logical. If |
file_path | A string containing the directory path to save the file. |
file_name | File name to save the plot. Default is |
file_type | File type to save the plot. Default is |
dpi | Plot resolution. Default is |
plot_width | Width of the plot. Default is |
plot_height | Height of the plot. Default is |
Details
Given two data frames, one with data prior to normalization (original), and the other, after normalization (normalized), norm_plot generates side-by-side plots to visualize the effect of normalization on the protein intensity data.
Value
A ggplot2 plot object.
Author(s)
Chathurani Ranathunge
See Also
create_df, impute_na
Examples
## Generate a raw_df object with default settings. No technical replicates.
raw_df <- create_df(
  prot_groups = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/pg1.txt",
  exp_design = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed1.txt"
)
## Impute missing values in the data frame using the default minProb
## method.
imp_df <- impute_na(raw_df)
## Normalize the imp_df object using the default quantile method
norm_df <- normalize_data(imp_df)
## Visualize normalization using box plots
norm_plot(original = imp_df, normalized = norm_df)
## Visualize normalization using density plots
norm_plot(imp_df, norm_df, type = "density")
Normalize intensity data
Description
This function normalizes data using a user-specified normalization method.
Usage
normalize_data(df, method = "quantile")
Arguments
df | An |
method | Name of the normalization method to use. Choices are |
Details
normalize_data is a wrapper function around the normalizeBetweenArrays function from the limma package. This function normalizes intensity values to achieve consistency among samples.
It assumes that the intensities in the data frame have been log-transformed; therefore, it is important to make sure that create_df was run with log_tr = TRUE (default) when creating the raw_df object.
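For intuition about the default "quantile" method, the base R sketch below implements a textbook quantile normalization on a small complete matrix. It is a simplified illustration: the helper name quantile_norm_sketch and the toy matrix are invented, and limma's normalizeBetweenArrays handles ties and missing values in ways this sketch does not.

```r
# Toy log intensities without missing values (rows = proteins, columns = samples).
x <- cbind(S1 = c(5, 2, 3), S2 = c(4, 1, 6), S3 = c(3, 4, 8))

# Hypothetical helper: textbook quantile normalization for a complete matrix.
quantile_norm_sketch <- function(m) {
  ranks <- apply(m, 2, rank, ties.method = "first")  # rank within each sample
  target <- rowMeans(apply(m, 2, sort))              # mean value at each rank
  out <- apply(ranks, 2, function(r) target[r])      # map ranks back to means
  dimnames(out) <- dimnames(m)
  out
}

x_norm <- quantile_norm_sketch(x)
```

After this step every column of x_norm contains the same set of values, only in a different order, which is what makes the sample distributions directly comparable.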
Value
A norm_df object, which is a data frame of normalized protein intensities.
Author(s)
Chathurani Ranathunge
See Also
create_df, impute_na
See normalizeBetweenArrays in the R package limma for more information on the different normalization methods available.
Examples
## Generate a raw_df object with default settings. No technical replicates.
raw_df <- create_df(
  prot_groups = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/pg1.txt",
  exp_design = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed1.txt"
)
## Impute missing values in the data frame using the default minProb
## method prior to normalization.
imp_df <- impute_na(raw_df)
## Normalize the imp_df object using the default quantile method
norm_df1 <- normalize_data(imp_df)
## Use the cyclicloess method
norm_df2 <- normalize_data(imp_df, method = "cyclicloess")
## Normalize data in the raw_df object prior to imputation.
norm_df3 <- normalize_data(raw_df)
Proteins that are only expressed in a given group
Description
This function outputs a list of proteins that are only expressed (present) in one user-specified group while not expressed (completely absent) in another user-specified group.
Usage
onegroup_only(
  raw_df,
  abs_group,
  pres_group,
  set_na = 0.34,
  save = FALSE,
  file_path = NULL
)
Arguments
raw_df | A |
abs_group | Name of the group in which proteins are not expressed. |
pres_group | Name of the group in which proteins are expressed. |
set_na | The percentage of missing data allowed in |
save | Logical. If |
file_path | A string containing the directory path to save the file. |
Details
Note: The onegroup_only function assumes that column names in the raw_df object follow the "Group_UniqueSampleID" notation. (Use head(raw_df) to check the structure of your raw_df object.)
Given a pair of groups, the onegroup_only function finds proteins that are only expressed in pres_group while completely absent or not expressed in abs_group.
A text file containing majority protein IDs will be saved in a temporary directory if file_path is not specified.
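The selection logic can be sketched in base R as follows. This is an illustration, not promor's code: the toy data are invented, and max_na_pres is a hypothetical tolerance for missing values in the "present" group, introduced here only to make the example concrete.

```r
# Toy log intensities: 3 proteins, groups H and L with 3 samples each.
toy <- data.frame(
  H_1 = c(NA, NA, 21), H_2 = c(NA, NA, 22), H_3 = c(NA, NA, NA),
  L_1 = c(24, NA, 20), L_2 = c(25, NA, 21), L_3 = c(26, 27, 22),
  row.names = c("prot1", "prot2", "prot3")
)

# Group label = text before the first "_" (Group_UniqueSampleID notation)
groups <- sub("_.*$", "", colnames(toy))

# Proportion of missing values per protein within one group
na_prop <- function(df, g) rowMeans(is.na(df[, groups == g, drop = FALSE]))

max_na_pres <- 0.34  # hypothetical tolerance, an assumption of this sketch

absent_in_H <- na_prop(toy, "H") == 1             # completely absent in H
present_in_L <- na_prop(toy, "L") <= max_na_pres  # mostly observed in L

only_in_L <- rownames(toy)[absent_in_H & present_in_L]
```

Here prot2 is absent in H but also mostly missing in L, so only prot1 is reported.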
Value
A list of majority protein IDs.
Author(s)
Chathurani Ranathunge
Examples
# Generate a raw_df object with default settings. No technical replicates.
raw_df <- create_df(
  prot_groups = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/pg1.txt",
  exp_design = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed1.txt"
)
## Find the proteins only expressed in group L, but absent in group H.
onegroup_only(raw_df, abs_group = "H", pres_group = "L")
Model performance plot
Description
This function generates plots to visualize model performance
Usage
performance_plot(
  model_list,
  type = "box",
  text_size = 10,
  palette = "viridis",
  save = FALSE,
  file_path = NULL,
  file_name = "Performance_plot",
  file_type = "pdf",
  plot_width = 7,
  plot_height = 7,
  dpi = 80
)
Arguments
model_list | A |
type | Type of plot to generate. Choices are "box" or "dot". Default is |
text_size | Text size for plot labels, axis labels etc. Default is |
palette | Viridis color palette option for plots. Default is |
save | Logical. If |
file_path | A string containing the directory path to save the file. |
file_name | File name to save the plot. Default is |
file_type | File type to save the plot. Default is |
plot_width | Width of the plot. Default is |
plot_height | Height of the plot. Default is |
dpi | Plot resolution. Default is |
Details
performance_plot uses resampling results from models included in the model_list to generate plots showing model performance.
The default metrics used for classification-based models are "Accuracy" and "Kappa".
These metric types can be changed by providing additional arguments to the train_models function. See train and trainControl for more information.
Value
A ggplot2 object.
Author(s)
Chathurani Ranathunge
See Also
train_models
Examples
## Create a model_df object
covid_model_df <- pre_process(covid_fit_df, covid_norm_df)
## Split the data frame into training and test data sets
covid_split_df <- split_data(covid_model_df)
## Fit models based on the default list of machine learning (ML) algorithms
covid_model_list <- train_models(covid_split_df)
## Generate box plots to visualize performance of different ML algorithms
performance_plot(covid_model_list)
## Generate dot plots
performance_plot(covid_model_list, type = "dot")
## Change color palette
performance_plot(covid_model_list, type = "dot", palette = "inferno")
Pre-process protein intensity data for modeling
Description
This function pre-processes protein intensity data from the top differentially expressed proteins identified with find_dep for modeling.
Usage
pre_process(
  fit_df,
  norm_df,
  sig = "adjP",
  sig_cutoff = 0.05,
  fc = 1,
  n_top = 20,
  find_highcorr = TRUE,
  corr_cutoff = 0.9,
  save_corrmatrix = FALSE,
  file_path = NULL,
  rem_highcorr = TRUE
)
Arguments
fit_df | A |
norm_df | The |
sig | Criteria to denote significance in differential expression.Choices are |
sig_cutoff | Cutoff value for p-values and adjusted p-values in differential expression. Default is |
fc | Minimum absolute log-fold change to use as threshold for differential expression. Default is |
n_top | The number of top hits from |
find_highcorr | Logical. If |
corr_cutoff | A numeric value specifying the correlation cutoff.Default is |
save_corrmatrix | Logical. If |
file_path | A string containing the directory path to save the file. |
rem_highcorr | Logical. If |
Details
This function creates a data frame that contains protein intensities for a user-specified number of top differentially expressed proteins. Using find_highcorr = TRUE, highly correlated proteins can be identified, and can be removed with rem_highcorr = TRUE.
Note: Most models will benefit from reducing correlation between proteins (predictors or features), therefore we recommend removing those proteins at this stage to reduce pairwise correlation.
If no or few proteins meet the significance threshold for differential expression, you may adjust sig, fc, and/or sig_cutoff accordingly to obtain more proteins for modeling.
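To make the idea of removing highly correlated proteins concrete, here is a minimal base R sketch of one possible filtering strategy. It is not necessarily how pre_process works internally. The helper drop_highcorr, the toy data frame, and the protein names are all hypothetical.

```r
# Hypothetical toy intensities: 6 samples (rows) x 4 proteins (columns).
toy <- data.frame(
  protA = c(1, 2, 3, 4, 5, 6),
  protB = c(1.1, 2.0, 3.2, 3.9, 5.1, 6.0),  # nearly collinear with protA
  protC = c(3, 1, 4, 1, 5, 9),
  protD = c(2, 6, 1, 5, 3, 4)
)

# Keep a protein only if its absolute correlation with every protein
# already kept is at or below the cutoff (a simple greedy filter).
drop_highcorr <- function(df, cutoff = 0.9) {
  cm <- abs(cor(df))
  keep <- character(0)
  for (p in colnames(df)) {
    if (length(keep) == 0 || all(cm[p, keep] <= cutoff)) {
      keep <- c(keep, p)
    }
  }
  df[, keep, drop = FALSE]
}

filtered <- drop_highcorr(toy, cutoff = 0.9)
colnames(filtered)
```

A greedy filter like this depends on column order: the first protein in each correlated pair is the one retained.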
Value
A model_df object, which is a data frame of protein intensities with proteins indicated by columns.
Author(s)
Chathurani Ranathunge
See Also
find_dep, normalize_data
Examples
## Create a model_df object with default settings.
covid_model_df1 <- pre_process(fit_df = covid_fit_df, norm_df = covid_norm_df)

## Change the correlation cutoff.
covid_model_df2 <- pre_process(covid_fit_df, covid_norm_df, corr_cutoff = 0.95)

## Change the significance criteria to include more proteins
covid_model_df3 <- pre_process(covid_fit_df, covid_norm_df, sig = "P")

## Change the number of top differentially expressed proteins to include
covid_model_df4 <- pre_process(covid_fit_df, covid_norm_df, sig = "P", n_top = 24)

Remove user-specified proteins (features) from a data frame
Description
This function removes user-specified proteins from a model_df object.
Usage
rem_feature(model_df, rem_protein)
Arguments
model_df | A |
rem_protein | Name of the protein to remove. |
Details
After visualizing protein intensity variation among conditions with feature_plot or after assessing the importance of each protein in models using varimp_plot, you can choose to remove specific proteins (features) from the data frame.
For example, you can choose to remove a protein from the model_df object if the protein does not show distinct patterns of variation among conditions. Such a protein may show mostly overlapping distributions in the feature plots.
Another case would be removing a protein that has very low variable importance in the models built using train_models. You can visualize variable importance using varimp_plot.
Value
A model_df object.
Author(s)
Chathurani Ranathunge
See Also
Examples
covid_model_df <- pre_process(fit_df = covid_fit_df, norm_df = covid_norm_df)

## Remove sp|P22352|GPX3_HUMAN protein from the model_df object
covid_model_df1 <- rem_feature(covid_model_df, rem_protein = "sp|P22352|GPX3_HUMAN")

Remove user-specified samples
Description
This function removes user-specified samples from the data frame.
Usage
rem_sample(raw_df, rem)
Arguments
raw_df | A |
rem | Name of the sample to remove. |
Details
rem_sample assumes that sample names follow the "Group_UniqueSampleID_TechnicalReplicate" notation. (Use head(raw_df) to see the structure of the raw_df object.)
If all the technical replicates representing a sample need to be removed, provide "Group_UniqueSampleID" as rem.
If a specific technical replicate needs to be removed, for example because it shows weak correlation with other technical replicates, you can remove that particular technical replicate by providing "Group_UniqueSampleID_TechnicalReplicate" as rem.
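The two ways of specifying rem can be illustrated with a small base R sketch that matches column names either exactly (one technical replicate) or by their "Group_UniqueSampleID" part (all replicates of a sample). This is not the package's own implementation. The helper match_samples and the column names are hypothetical.

```r
# Hypothetical column names in "Group_UniqueSampleID_TechnicalReplicate" form.
cols <- c("WT_4_1", "WT_4_2", "WT_41_1", "KO_1_1", "KO_1_2")

# TRUE for columns that should be removed for a given value of rem.
match_samples <- function(col_names, rem) {
  # Strip the trailing "_TechnicalReplicate" part to get the sample ID.
  sample_id <- sub("_[^_]+$", "", col_names)
  # Match a single replicate exactly, or every replicate of a sample.
  col_names == rem | sample_id == rem
}

# Remove all technical replicates of "WT_4" ("WT_41_1" is left alone).
cols[!match_samples(cols, "WT_4")]

# Remove only technical replicate 2 of "WT_4".
cols[!match_samples(cols, "WT_4_2")]
```

Comparing whole name components, rather than using a plain prefix match, avoids removing "WT_41" when "WT_4" is requested.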
Value
A raw_df object.
Author(s)
Chathurani Ranathunge
See Also
Examples
## Use a data set containing technical replicates to create a raw_df object
raw_df <- create_df(
  prot_groups = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/pg2.txt",
  exp_design = "https://raw.githubusercontent.com/caranathunge/promor_example_data/main/ed2.txt",
  tech_reps = TRUE
)

# Check the first few rows of the raw_df object
head(raw_df)

## Remove all technical replicates of "WT_4"
raw_df1 <- rem_sample(raw_df, "WT_4")

## Remove only technical replicate number 2 of "WT_4"
raw_df2 <- rem_sample(raw_df, "WT_4_2")

ROC plot
Description
This function generates Receiver Operating Characteristic (ROC) curves to evaluate models.
Usage
roc_plot(
  probability_list,
  split_df,
  ...,
  multiple_plots = TRUE,
  text_size = 10,
  palette = "viridis",
  save = FALSE,
  file_path = NULL,
  file_name = "ROC_plot",
  file_type = "pdf",
  plot_width = 7,
  plot_height = 7,
  dpi = 80
)
Arguments
probability_list | A |
split_df | A |
... | Additional arguments to be passed on to |
multiple_plots | Logical. If |
text_size | Text size for plot labels, axis labels etc. Default is 10. |
palette | Viridis color palette option for plots. Default is "viridis". |
save | Logical. If |
file_path | A string containing the directory path to save the file. |
file_name | File name to save the plot. Default is "ROC_plot". |
file_type | File type to save the plot. Default is "pdf". |
plot_width | Width of the plot. Default is 7. |
plot_height | Height of the plot. Default is 7. |
dpi | Plot resolution. Default is 80. |
Details
roc_plot first uses probabilities generated during test_models to build a ROC object. Next, relevant information is extracted from the ROC object to plot the ROC curves.
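For readers unfamiliar with how class probabilities turn into a ROC curve, the following base R sketch computes the curve's coordinates and the area under it by hand. It only illustrates the concept. roc_plot builds a ROC object instead, and the obs and prob vectors below are made-up values.

```r
# Hypothetical true classes (1 = positive) and predicted probabilities
# of the positive class for 8 test samples.
obs  <- c(1, 1, 1, 0, 1, 0, 0, 0)
prob <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1)

# Sweep the decision threshold from strictest to most lenient.
thresholds <- c(Inf, sort(unique(prob), decreasing = TRUE))

# True positive rate (sensitivity) and false positive rate
# (1 - specificity) at each threshold.
tpr <- sapply(thresholds, function(t) sum(prob >= t & obs == 1) / sum(obs == 1))
fpr <- sapply(thresholds, function(t) sum(prob >= t & obs == 0) / sum(obs == 0))

# Area under the curve by the trapezoidal rule.
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

data.frame(threshold = thresholds, fpr = fpr, tpr = tpr)
auc
```

Plotting tpr against fpr gives the ROC curve; an AUC near 1 indicates good separation of the two classes, while 0.5 corresponds to random guessing.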
Value
A ggplot2 object.
Author(s)
Chathurani Ranathunge
See Also
test_models
Examples
## Create a model_df object
covid_model_df <- pre_process(covid_fit_df, covid_norm_df)

## Split the data frame into training and test data sets
covid_split_df <- split_data(covid_model_df)

## Fit models using the default list of machine learning (ML) algorithms
covid_model_list <- train_models(covid_split_df)

# Test a list of models on a test data set and output class probabilities,
covid_prob_list <- test_models(covid_model_list, covid_split_df, type = "prob")

## Plot ROC curves separately for each ML algorithm
roc_plot(covid_prob_list, covid_split_df)

## Plot all ROC curves in one plot
roc_plot(covid_prob_list, covid_split_df, multiple_plots = FALSE)

## Change color palette
roc_plot(covid_prob_list, covid_split_df, palette = "plasma")

Split the data frame to create training and test data
Description
This function can be used to create balanced splits of the protein intensity data in a model_df object to create training and test data.
Usage
split_data(model_df, train_size = 0.8, seed = NULL)
Arguments
model_df | A |
train_size | The size of the training data set as a proportion of the complete data set. Default is 0.8. |
seed | Numerical. Random number seed. Default is NULL. |
Details
This function splits the model_df object into training and test data sets using random sampling while preserving the original class distribution of the data. Make sure to fix the random number seed with seed for reproducibility.
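A split that preserves the class distribution can be sketched in base R by sampling within each class separately. This is an illustration only, not the code behind split_data. The helper stratified_split, the column name condition, and the toy data are hypothetical.

```r
# Sample the same proportion of rows from every class so that the
# training and test sets keep the original class ratio.
stratified_split <- function(df, class_col, train_size = 0.8, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)  # fix the seed for reproducibility
  idx_by_class <- split(seq_len(nrow(df)), df[[class_col]])
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    n_train <- round(train_size * length(idx))
    idx[sample.int(length(idx), size = n_train)]
  }), use.names = FALSE)
  list(
    training = df[train_idx, , drop = FALSE],
    test = df[-train_idx, , drop = FALSE]
  )
}

# Hypothetical data: 10 samples of class "C" and 20 of class "D".
toy_df <- data.frame(
  condition = rep(c("C", "D"), times = c(10, 20)),
  protA = seq_len(30)
)

parts <- stratified_split(toy_df, "condition", train_size = 0.8, seed = 8314)
table(parts$training$condition)
table(parts$test$condition)
```

Both resulting sets keep the 1:2 ratio of "C" to "D" found in the full data.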
Value
A list of data frames.
Author(s)
Chathurani Ranathunge
See Also
pre_process
Examples
## Create a model_df object
covid_model_df <- pre_process(covid_fit_df, covid_norm_df)

## Split the data frame into training and test data sets using default settings
covid_split_df1 <- split_data(covid_model_df, seed = 8314)

## Split the data frame into training and test data sets with 70% of the
## data in training and 30% in test data sets
covid_split_df2 <- split_data(covid_model_df, train_size = 0.7, seed = 8314)

## Access training data set
covid_split_df1$training

## Access test data set
covid_split_df1$test

Test machine learning models on test data
Description
This function can be used to predict test data using models generated by different machine learning algorithms.
Usage
test_models(
  model_list,
  split_df,
  type = "prob",
  save_confusionmatrix = FALSE,
  file_path = NULL,
  ...
)
Arguments
model_list | A |
split_df | A |
type | Type of output. Set |
save_confusionmatrix | Logical. If |
file_path | A string containing the directory path to save the file. |
... | Additional arguments to be passed on to |
Details
The test_models function uses models obtained from train_models to predict a given test data set. Setting type = "raw" is required to obtain confusion matrices. Setting type = "prob" (default) will output a list of probabilities that can be used to generate ROC curves using roc_plot.
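The difference between the two output types can be illustrated in base R: class probabilities form a data frame with one column per class, while raw predictions are a factor of class labels that can be cross-tabulated against the observed classes. The sketch below derives labels by taking the most probable class, which is an assumption made for illustration and may differ from how a given model produces its raw predictions. All values and names are hypothetical.

```r
# Hypothetical class probabilities for 4 test samples (type = "prob" style).
prob_df <- data.frame(
  C = c(0.9, 0.2, 0.6, 0.3),
  D = c(0.1, 0.8, 0.4, 0.7)
)

# Hypothetical observed classes for the same samples.
obs <- factor(c("C", "D", "D", "D"), levels = c("C", "D"))

# Class labels (type = "raw" style): pick the most probable class per row.
best <- max.col(as.matrix(prob_df), ties.method = "first")
raw_pred <- factor(colnames(prob_df)[best], levels = levels(obs))

# A confusion matrix needs class labels, not probabilities.
conf_mat <- table(Predicted = raw_pred, Observed = obs)
conf_mat
```

This is why confusion matrices require class labels, whereas ROC curves require probabilities.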
Value
probability_list: If type = "prob", a list of data frames containing class probabilities for each method in the model_list will be returned.
prediction_list: If type = "raw", a list of factors containing class predictions for each method will be returned.
Author(s)
Chathurani Ranathunge
See Also
split_df, train_models
Examples
## Create a model_df object
covid_model_df <- pre_process(covid_fit_df, covid_norm_df)

## Split the data frame into training and test data sets
covid_split_df <- split_data(covid_model_df)

## Fit models using the default list of machine learning (ML) algorithms
covid_model_list <- train_models(covid_split_df)

# Test a list of models on a test data set and output class probabilities,
covid_prob_list <- test_models(model_list = covid_model_list, split_df = covid_split_df)

## Not run:
# Save confusion matrices in the working directory and output class predictions
covid_pred_list <- test_models(
  model_list = covid_model_list,
  split_df = covid_split_df,
  type = "raw",
  save_confusionmatrix = TRUE,
  file_path = "."
)
## End(Not run)

Train machine learning models on training data
Description
This function can be used to train models on protein intensity data using different machine learning algorithms.
Usage
train_models(
  split_df,
  resample_method = "repeatedcv",
  resample_iterations = 10,
  num_repeats = 3,
  algorithm_list,
  seed = NULL,
  ...
)
Arguments
split_df | A |
resample_method | The resampling method to use. Default is "repeatedcv". |
resample_iterations | Number of resampling iterations. Default is 10. |
num_repeats | The number of complete sets of folds to compute (For |
algorithm_list | A list of classification or regression algorithms to use. A full list of machine learning algorithms available through the |
seed | Numerical. Random number seed. Default is NULL. |
... | Additional arguments to be passed on to |
Details
The train_models function can be used to first define the control parameters to be used in training models, calculate resampling-based performance measures for models based on a given set of machine-learning algorithms, and output the best model for each algorithm.
In the event that algorithm_list is not provided, a default list of five classification-based machine-learning algorithms will be used for building and training models. Default algorithm_list: "svmRadial", "rf", "glm", "xgbLinear", and "naive_bayes."
Note: Models that fail to build are removed from the output.
Make sure to fix the random number seed with seed for reproducibility.
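To show what repeated cross-validation with the defaults amounts to (10 folds repeated 3 times gives 30 resamples), here is a base R sketch that generates the fold assignments. It is a simplified illustration: the helper make_repeated_folds is hypothetical, the folds are not stratified by class, and the actual resampling is handled by the functions documented under train and trainControl.

```r
# Build `repeats` independent partitions of n samples into k folds.
make_repeated_folds <- function(n, k = 10, repeats = 3, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)  # fix the seed for reproducibility
  lapply(seq_len(repeats), function(r) {
    # Shuffle fold labels so each sample lands in exactly one fold.
    fold_id <- sample(rep(seq_len(k), length.out = n))
    split(seq_len(n), fold_id)
  })
}

# 25 hypothetical training samples, 5 folds, 3 repeats.
folds <- make_repeated_folds(n = 25, k = 5, repeats = 3, seed = 351)

# Each fold is held out once; the model is fit on the remaining samples.
held_out <- folds[[1]][[1]]
fit_on <- setdiff(seq_len(25), held_out)

length(folds)       # number of repeats
lengths(folds[[1]]) # fold sizes in the first repeat
```

Calling the function again with the same seed returns the same fold assignments, which is why fixing seed makes resampling results reproducible.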
Value
A list of class train for each machine-learning algorithm. See train for more information on accessing different elements of this list.
Author(s)
Chathurani Ranathunge
References
Kuhn, Max. "Building predictive models in R using the caret package." Journal of Statistical Software 28 (2008): 1-26.
See Also
pre_process
Examples
## Create a model_df object
covid_model_df <- pre_process(covid_fit_df, covid_norm_df)

## Split the data frame into training and test data sets
covid_split_df <- split_data(covid_model_df, seed = 8314)

## Fit models based on the default list of machine learning (ML) algorithms
covid_model_list1 <- train_models(split_df = covid_split_df, seed = 351)

## Fit models using a user-specified list of ML algorithms.
covid_model_list2 <- train_models(
  covid_split_df,
  algorithm_list = c("svmRadial", "glmboost"),
  seed = 351
)

## Change resampling method and resampling iterations.
covid_model_list3 <- train_models(
  covid_split_df,
  resample_method = "cv",
  resample_iterations = 50,
  seed = 351
)

Variable importance plot
Description
This function visualizes variable importance in models.
Usage
varimp_plot(
  model_list,
  ...,
  type = "lollipop",
  text_size = 10,
  palette = "viridis",
  n_row,
  n_col,
  save = FALSE,
  file_path = NULL,
  file_name = "VarImp_plot",
  file_type = "pdf",
  dpi = 80,
  plot_width = 7,
  plot_height = 7
)
Arguments
model_list | A |
... | Additional arguments to be passed on to |
type | Type of plot to generate. Choices are "bar" or "lollipop." Default is "lollipop". |
text_size | Text size for plot labels, axis labels etc. Default is 10. |
palette | Viridis color palette option for plots. Default is "viridis". |
n_row | Number of rows to print the plots. |
n_col | Number of columns to print the plots. |
save | Logical. If |
file_path | A string containing the directory path to save the file. |
file_name | File name to save the plot. Default is "VarImp_plot". |
file_type | File type to save the plot. Default is "pdf". |
dpi | Plot resolution. Default is 80. |
plot_width | Width of the plot. Default is 7. |
plot_height | Height of the plot. Default is 7. |
Details
varimp_plot produces a list of plots showing variable importance measures calculated from models generated with different machine-learning algorithms.
Note: Variables are ordered by variable importance in descending order, and by default, importance values are scaled to range between 0 and 100. This can be changed by specifying scale = FALSE. See varImp for more information.
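The effect of scaling can be illustrated with a small base R sketch that rescales raw importance values to the 0 to 100 range and orders them in descending order. Min-max rescaling is assumed here for illustration. The importance values and protein names are made up, and varImp remains the authoritative source for how scaling is actually done.

```r
# Hypothetical raw (unscaled) variable importance values.
raw_imp <- c(protA = 2.5, protB = 0.5, protC = 1.5)

# Min-max rescaling: the least important variable maps to 0,
# the most important to 100.
scaled_imp <- (raw_imp - min(raw_imp)) / (max(raw_imp) - min(raw_imp)) * 100

# Order variables by importance, highest first, as in the plots.
sort(scaled_imp, decreasing = TRUE)
```

Scaled values make algorithms easier to compare side by side, while unscaled values retain each algorithm's native importance units.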
Value
A list of ggplot2 objects.
Author(s)
Chathurani Ranathunge
See Also
train_models, rem_feature
Examples
## Create a model_df object
covid_model_df <- pre_process(covid_fit_df, covid_norm_df)

## Split the data frame into training and test data sets
covid_split_df <- split_data(covid_model_df)

## Fit models based on the default list of machine learning (ML) algorithms
covid_model_list <- train_models(covid_split_df)

## Variable importance - lollipop plots
varimp_plot(covid_model_list)

## Bar plots
varimp_plot(covid_model_list, type = "bar")

## Do not scale variable importance values
varimp_plot(covid_model_list, scale = FALSE)

## Change color palette
varimp_plot(covid_model_list, palette = "magma")

Volcano plot
Description
This function generates volcano plots to visualize differentially expressed proteins between groups.
Usage
volcano_plot(
  fit_df,
  adj_method = "BH",
  sig = "adjP",
  cutoff = 0.05,
  lfc = 1,
  line_fc = TRUE,
  line_p = TRUE,
  palette = "viridis",
  text_size = 10,
  label_top = FALSE,
  n_top = 10,
  save = FALSE,
  file_path = NULL,
  file_name = "Volcano_plot",
  file_type = "pdf",
  plot_height = 7,
  plot_width = 7,
  dpi = 80
)
Arguments
fit_df | A |
adj_method | Method used for adjusting the p-values for multiple testing. Default is "BH". |
sig | Criteria to denote significance. Choices are |
cutoff | Cutoff value for p-values and adjusted p-values. Default is 0.05. |
lfc | Minimum absolute log2-fold change to use as threshold for differential expression. |
line_fc | Logical. If |
line_p | Logical. If |
palette | Viridis color palette option for plots. Default is "viridis". |
text_size | Text size for axis text, labels etc. |
label_top | Logical. If |
n_top | The number of top hits to label with protein name when |
save | Logical. If |
file_path | A string containing the directory path to save the file. |
file_name | File name to save the plot. Default is "Volcano_plot." |
file_type | File type to save the plot. Default is "pdf". |
plot_height | Height of the plot. Default is 7. |
plot_width | Width of the plot. Default is 7. |
dpi | Plot resolution. Default is 80. |
Details
Volcano plots show log2-fold change on the x-axis, and based on the significance criteria chosen, either -log10(p-value) or -log10(adjusted p-value) on the y-axis. volcano_plot requires a fit_df object from performing differential expression analysis with find_dep. The user has the option to choose criteria that denote significance.
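The quantities behind a volcano plot can be sketched in base R as follows. This is a hedged illustration with made-up values rather than the code used by volcano_plot. The column names (protein, logFC, P, adjP), the toy data, and the exact form of the threshold comparisons are assumptions.

```r
# Hypothetical differential expression results for 4 proteins.
toy <- data.frame(
  protein = c("p1", "p2", "p3", "p4"),
  logFC = c(2.1, -1.6, 0.4, 1.2),
  P = c(0.0005, 0.004, 0.02, 0.6)
)

# Adjust p-values for multiple testing (Benjamini-Hochberg).
toy$adjP <- p.adjust(toy$P, method = "BH")

# y-axis value when adjusted p-values are the significance criterion.
toy$neg_log10_adjP <- -log10(toy$adjP)

# Flag proteins passing both the significance and fold-change cutoffs.
cutoff <- 0.05
lfc <- 1
toy$significant <- toy$adjP < cutoff & abs(toy$logFC) >= lfc

toy
```

Here p3 is significant by adjusted p-value but falls short of the fold-change threshold, so it would sit in the upper middle of the plot and not be highlighted.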
Value
A ggplot2 plot object.
Author(s)
Chathurani Ranathunge
See Also
Examples
## Create a volcano plot with default settings.
volcano_plot(ecoli_fit_df)

## Change significance criteria and cutoff
volcano_plot(ecoli_fit_df, cutoff = 0.1, sig = "P")

## Label top 30 differentially expressed proteins and
## change the color palette of the plot
volcano_plot(ecoli_fit_df, label_top = TRUE, n_top = 30, palette = "mako")