SlimR: Machine Learning-Assisted, Marker-Based Tool for Single-Cell and Spatial Transcriptomics Annotation
SlimR is an R package designed for annotating single-cell and spatial-transcriptomics (ST) datasets. It supports the creation of a unified marker list, Markers_list, from any of the following sources: the package's built-in curated species-specific cell type and marker reference databases (e.g., 'Cellmarker2', 'PanglaoDB', 'scIBD', 'TCellSI', 'PCTIT'), Seurat objects containing cell label information, or user-provided Excel tables mapping cell types to markers.
SlimR can predict calculation parameters with machine learning algorithms (e.g., 'Random Forest', 'Gradient Boosting', 'Support Vector Machine', 'Ensemble Learning') via Parameter_Calculate(). Based on Markers_list, Celltype_Calculate() calculates the gene expression of the different cell types, predicts annotation information, and calculates the corresponding AUC; Celltype_Annotation() then annotates the object, and Celltype_Verification() verifies the result. SlimR can also calculate the gene expression corresponding to each cell type to generate reference plots for manual annotation (e.g., 'Heat Map', 'Feature Plots', 'Combined Plots').
- Preparation
- Standardized Markers_list Input
- Automated Annotation Workflow
- Semi-Automated Annotation Workflow
- Other Functions Provided by SlimR
- Conclusion
Install SlimR directly from CRAN (stable version, recommended when its version matches the GitHub package version):
install.packages("SlimR")
Note: If you encounter a version mismatch during installation, try switching the CRAN mirror to Global (CDN) or use BiocManager::install("SlimR").
Install SlimR directly from GitHub (development version, recommended when its version is higher than the CRAN package version):
devtools::install_github("Zhaoqing-wang/SlimR")
Note: If the function doesn't work, please run install.packages('devtools') first.
Load the package in your R environment:
library(SlimR)
For Seurat objects with multiple layers in the assay, please run SeuratObject::JoinLayers() first.
# For example, to use the 'RNA' assay of a multi-layered Seurat object:
sce@assays$RNA <- SeuratObject::JoinLayers(sce@assays$RNA)
Important: To ensure accurate annotation, make sure the input Seurat object has gone through the standard processing workflow and that batch effects have been removed.
Note: It is recommended to use the clustree package to determine the appropriate resolution for the input Seurat object.
SlimR requires R (≥ 3.5) and depends on the following packages: cowplot, dplyr, ggplot2, patchwork, pheatmap, readxl, scales, Seurat, tidyr, tools. If installation fails, please install missing dependencies using:
# Install dependencies if needed ('tools' ships with base R, so it is not installed from CRAN):
install.packages(c("cowplot", "dplyr", "ggplot2", "patchwork", "pheatmap", "readxl", "scales", "Seurat", "tidyr"))
SlimR requires a standardized list format, Markers_list, that stores marker information, optional metrics, and the corresponding cell types: list names = cell types (essential), first column = markers (essential), subsequent columns = metrics (can be omitted).
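As an illustration of that layout, here is a minimal hand-built sketch in base R. It assumes each list element is a data frame whose first column is named `marker` (the name used later in `Markers_list_Cellmarker2$`B cell`$marker`); the `score` column and the gene choices are hypothetical and only stand in for an optional metric.

```r
# Hand-built sketch of the Markers_list layout (not generated by SlimR).
# ASSUMPTION: each element is a data frame; the `score` metric is hypothetical.
Markers_list_demo <- list(
  "B cell" = data.frame(
    marker = c("CD79A", "MS4A1", "CD19"),  # first column = markers (essential)
    score  = c(3, 2, 1),                   # later columns = metrics (optional)
    stringsAsFactors = FALSE
  ),
  "T cell" = data.frame(
    marker = c("CD3E", "CD3D"),            # a cell type without any metric
    stringsAsFactors = FALSE
  )
)

# List names carry the cell types.
cell_types <- names(Markers_list_demo)

# Markers of one cell type, accessed the same way as in the SlimR examples.
b_cell_markers <- unique(Markers_list_demo$`B cell`$marker)
print(cell_types)
print(b_cell_markers)
```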
Cellmarker2: A database of cell types and markers covering different species and tissue types.
Reference: Hu et al. (2023) doi:10.1093/nar/gkac947.
Cellmarker2 <- SlimR::Cellmarker2
Cellmarker2_table <- SlimR::Cellmarker2_table
View(Cellmarker2_table)
Markers_list_Cellmarker2 <- Markers_filter_Cellmarker2(
  Cellmarker2,
  species = "Human",
  tissue_class = "Intestine",
  tissue_type = NULL,
  cancer_type = NULL,
  cell_type = NULL
)
Important: Select at least the species and tissue_class parameters to ensure the accuracy of the annotation.
Link: The output Markers_list is usable in sections 3.1, 4.1, 4.2, 4.3, and 5.1. See section 3, Automated Annotation Workflow.
PanglaoDB: A database of cell types and markers covering different species and tissue types.
Reference: Franzén et al. (2019) doi:10.1093/database/baz046.
PanglaoDB <- SlimR::PanglaoDB
PanglaoDB_table <- SlimR::PanglaoDB_table
View(PanglaoDB_table)
Markers_list_panglaoDB <- Markers_filter_PanglaoDB(
  PanglaoDB,
  species_input = 'Human',
  organ_input = 'GI tract'
)
Important: Select the species_input and organ_input parameters to ensure the accuracy of the annotation.
Link: The output Markers_list is usable in sections 3.1, 4.1, 4.2, 4.3, and 5.2. See section 3, Automated Annotation Workflow.
The standard Markers_list can be generated by the built-in Read_seurat_markers() function after obtaining markers with the Seurat::FindAllMarkers() function.
seurat_markers <- Seurat::FindAllMarkers(
  object = sce,
  group.by = "Cell_type",
  only.pos = TRUE
)
Markers_list_Seurat <- Read_seurat_markers(
  seurat_markers,
  sources = "Seurat",
  sort_by = "FSS",
  gene_filter = 20
)
Note: It is recommended to rank markers either with sort_by = "FSS", which uses the 'Feature Significance Score' (FSS, the product of log2FC and expression ratio), or with sort_by = "avg_log2FC".
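To make the FSS ranking concrete, the following base R sketch computes a score as the product of log2FC and an expression ratio on a made-up marker table, then keeps the top genes per cell type. It is an illustration only, not SlimR's internal code: treating the `pct.1` column as the "expression ratio" is an assumption, and the values and the `gene_filter` of 2 are invented for the example.

```r
# Toy marker table shaped like Seurat::FindAllMarkers() output (values are made up).
markers <- data.frame(
  cluster    = c("B cell", "B cell", "B cell", "T cell", "T cell"),
  gene       = c("MS4A1", "CD79A", "ACTB", "CD3E", "CD3D"),
  avg_log2FC = c(3.0, 2.5, 0.2, 2.0, 4.0),
  pct.1      = c(0.90, 0.80, 1.00, 0.50, 0.75),
  stringsAsFactors = FALSE
)

# ASSUMPTION: the "expression ratio" is taken to be pct.1 here.
markers$FSS <- markers$avg_log2FC * markers$pct.1

# Keep the top `gene_filter` genes per cell type, ranked by FSS (highest first).
gene_filter <- 2
top_by_cluster <- lapply(split(markers, markers$cluster), function(df) {
  df <- df[order(df$FSS, decreasing = TRUE), ]
  head(df[, c("gene", "avg_log2FC", "pct.1", "FSS")], gene_filter)
})
print(top_by_cluster)
```

A broadly expressed gene such as ACTB in the toy table has a high expression ratio but a small fold change, so it drops out of the ranking.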
For large datasets, the presto::wilcoxauc() function can be used to speed up the computation (an alternative that is about 10x faster at the cost of some accuracy).
seurat_markers <- dplyr::filter(
  presto::wilcoxauc(
    X = sce,
    group_by = "Cell_type",
    seurat_assay = "RNA"
  ),
  padj < 0.05,
  logFC > 0.5
)
Markers_list_Seurat <- Read_seurat_markers(
  seurat_markers,
  sources = "presto",
  sort_by = "FSS",
  gene_filter = 20
)
Important: This feature depends on the presto package. If prompted to install it, first run devtools::install_github('immunogenomics/presto').
Note: It is recommended to rank markers either with sort_by = "logFC" or with sort_by = "FSS", which uses the 'Feature Significance Score' (FSS, the product of log2FC and expression ratio).
Link: The output Markers_list is usable in sections 3.1, 4.1, 4.2, 4.3, and 5.3. See section 3, Automated Annotation Workflow.
Format Requirements:
Each sheet name = cell type (essential)
First row = column headers (essential)
First column = markers (essential)
Subsequent columns = metrics (can be omitted)
Markers_list_Excel <- Read_excel_markers("D:/Laboratory/Marker_load.xlsx")
Important: If the input Excel file has no header row (the "First row" above), set the parameter has_colnames = FALSE in Read_excel_markers().
Link: The output Markers_list is usable in sections 3.1, 4.1, 4.2, 4.3, and 5.4. See section 3, Automated Annotation Workflow.
scIBD: The Human Intestinal Cell Database (Inflammatory Bowel Disease).
Reference: Nie et al. (2023) doi:10.1038/s43588-023-00464-9.
Markers_list_scIBD <- SlimR::Markers_list_scIBD
Important: This list is for human intestinal annotation only. Make sure the input Seurat object contains human intestinal data to ensure the accuracy of the labeling.
Note: Markers_list_scIBD was generated as described in section 2.3.2, with the parameters sort_by = "logFC" and gene_filter = 20.
Link: The output Markers_list is usable in sections 3.1, 4.1, 4.2, 4.3, and 5.3. See section 3, Automated Annotation Workflow.
TCellSI: A database of T cell markers of different subtypes.
Reference: Yang et al. (2024) doi:10.1002/imt2.231.
Markers_list_TCellSI <- SlimR::Markers_list_TCellSI
Important: This list is only for annotating T cell subsets. Make sure the input Seurat object contains only T cell subsets to ensure the accuracy of the labeling.
Note: Markers_list_TCellSI was generated as described in section 2.4.
Link: The output Markers_list is usable in sections 3.1, 4.1, 4.2, 4.3, and 5.4. See section 3, Automated Annotation Workflow.
PCTIT: A list of T cell subtype markers from the article "Pan-cancer single cell landscape of tumor-infiltrating T cells".
Reference: L. Zheng et al. (2021) doi:10.1126/science.abe6474.
Markers_list_PCTIT <- SlimR::Markers_list_PCTIT
Important: This list is only for annotating T cell subsets. Make sure the input Seurat object contains only T cell subsets to ensure the accuracy of the labeling.
Note: Markers_list_PCTIT was generated as described in section 2.4.
Link: The output Markers_list is usable in sections 3.1, 4.1, 4.2, 4.3, and 5.4. See section 3, Automated Annotation Workflow.
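Whichever source produced the Markers_list, a quick structural check before annotation can catch an unnamed or empty entry early. The helper below is a hypothetical sketch and not part of SlimR; it assumes every entry is a data frame with markers in its first column, as in the standardized format described earlier.

```r
# Hypothetical helper (not a SlimR function): returns a character vector of
# problems found in a Markers_list-style object; an empty vector means it passed.
check_markers_list <- function(markers_list) {
  problems <- character(0)
  nm <- names(markers_list)
  if (is.null(nm) || any(is.na(nm)) || any(nm == "")) {
    problems <- c(problems, "every list element needs a cell type name")
  }
  for (i in seq_along(markers_list)) {
    entry <- markers_list[[i]]
    has_name <- !is.null(nm) && !is.na(nm[i]) && nm[i] != ""
    label <- if (has_name) nm[i] else paste0("element ", i)
    if (!is.data.frame(entry) || ncol(entry) < 1) {
      problems <- c(problems, paste0(label, ": expected a table with markers in the first column"))
    } else if (nrow(entry) == 0) {
      problems <- c(problems, paste0(label, ": no markers listed"))
    }
  }
  problems
}

# A well-formed toy list passes; a list with an unnamed, empty entry does not.
good <- list("B cell" = data.frame(marker = c("CD79A", "MS4A1")))
bad  <- list("B cell" = data.frame(marker = "CD19"),
             data.frame(marker = character(0)))
print(check_markers_list(good))
print(check_markers_list(bad))
```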
SlimR integrates multiple machine learning algorithms (e.g., Random Forest, Gradient Boosting, Support Vector Machine, and Ensemble Learning) to automatically determine the optimal min_expression and specificity_weight parameters used in section 3.2 for calculating the probability of cell types.
# Basic usage with a fixed set of genes
SlimR_params <- Parameter_Calculate(
  seurat_obj = sce,
  features = c("CD3E", "CD4", "CD8A"),
  assay = "RNA",
  cluster_col = "seurat_clusters",
  method = "ensemble",
  n_models = 3,
  return_model = FALSE,
  verbose = TRUE
)
# Custom usage: use the genes corresponding to a specific cell type in 'Markers_list' as input
SlimR_params <- Parameter_Calculate(
  seurat_obj = sce,
  features = unique(Markers_list_Cellmarker2$`B cell`$marker),
  assay = "RNA",
  cluster_col = "seurat_clusters",
  method = "rf",
  return_model = FALSE,
  verbose = TRUE
)
Important: This step is optional. You can skip directly to section 3.2 and calculate cell type probabilities with the default parameters.
Note: The method parameter of Parameter_Calculate() selects the machine learning model (for example, method = "rf"). Available methods: rf (Random Forest), gbm (Gradient Boosting), svm (Support Vector Machine), or ensemble (Ensemble Learning; default).
Uses Markers_list to calculate probabilities and prediction results, optionally calculate the corresponding AUC, and optionally generate heat map and ROC plots for cell annotation.
SlimR_anno_result <- Celltype_Calculate(
  seurat_obj = sce,
  gene_list = Markers_list,
  species = "Human",
  cluster_col = "seurat_clusters",
  assay = "RNA",
  min_expression = 0.1,
  specificity_weight = 3,
  threshold = 0.8,
  compute_AUC = TRUE,
  plot_AUC = TRUE,
  AUC_correction = TRUE,
  colour_low = "navy",
  colour_high = "firebrick3"
)
If you have run the Parameter_Calculate() function in section 3.1 above, you can pass min_expression = SlimR_params$min_expression and specificity_weight = SlimR_params$specificity_weight to Celltype_Calculate().
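One way to keep a script working whether or not the optional tuning step was run is to resolve the two parameters up front. The helper below is a hypothetical sketch and not part of SlimR; it assumes the fallback values 0.1 and 3 taken from the example call above, and that the tuning result is a list with `min_expression` and `specificity_weight` elements.

```r
# Hypothetical helper (not a SlimR function): prefer tuned values when present,
# otherwise fall back to the values used in the example call above.
resolve_params <- function(SlimR_params = NULL) {
  defaults <- list(min_expression = 0.1, specificity_weight = 3)
  if (is.null(SlimR_params)) {
    return(defaults)
  }
  # Only the two fields of interest are taken over; anything else is ignored.
  keep <- intersect(names(SlimR_params), names(defaults))
  utils::modifyList(defaults, SlimR_params[keep])
}

# Stand-in for a tuning result (values invented for the example).
params <- resolve_params(list(min_expression = 0.25, specificity_weight = 2))
print(params)

# The resolved values would then be passed on, for example:
# SlimR_anno_result <- Celltype_Calculate(
#   seurat_obj = sce, gene_list = Markers_list, species = "Human",
#   cluster_col = "seurat_clusters", assay = "RNA",
#   min_expression = params$min_expression,
#   specificity_weight = params$specificity_weight
# )
```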
Important: The parameter cluster_col must be exactly the same in Celltype_Calculate() and Celltype_Annotation() to avoid false matches.
Note: Setting AUC_correction = TRUE takes a little longer to compute (~20% longer than only setting plot_AUC = TRUE; ~40% longer than only setting compute_AUC = TRUE), but it is recommended because correcting the predicted cell types this way gives more accurate predictions. The lower the threshold parameter, the more alternative cell types the AUC step checks, and the longer the run time.
Error handling: If you encounter the error message "Error in .rowNamesDF<-: ! duplicate 'row.names' are not allowed" when running Celltype_Calculate(), make the row names unique with base::make.unique() first. (Alternative)
rownames(sce) <- base::make.unique(rownames(sce))
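The effect of base::make.unique() can be checked on a small vector before applying it to a real object. The gene names below are placeholders chosen for the example.

```r
# Placeholder feature names containing a repeated entry.
genes <- c("TP53", "MT-CO1", "TP53", "ACTB", "TP53")

# Which names are duplicated?
duplicated_genes <- unique(genes[duplicated(genes)])

# make.unique() keeps the first occurrence and appends .1, .2, ... to repeats.
unique_genes <- make.unique(genes)
print(duplicated_genes)
print(unique_genes)
```

Because repeats are renamed (for example to TP53.1), markers in Markers_list will only match the first occurrence of a duplicated gene name.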
Check the annotation probability between the clusters in the input cluster_col column and the cell types in Markers_list with the following code.
print(SlimR_anno_result$Heatmap_plot)
Note: If the heat map is not generated properly, please run library(pheatmap) first.
Cell type information results predicted by SlimR can be viewed with the following code.
View(SlimR_anno_result$Prediction_results)
Furthermore, the ROC curves and AUC values for the clusters in cluster_col and their predicted cell types can be viewed with the following code.
print(SlimR_anno_result$AUC_plot)
Important: This feature depends on the parameter plot_AUC = TRUE.
Note: If the ROC plot is not generated properly, please run library(ggplot2) first.
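For readers less familiar with AUC, the sketch below shows a generic rank-based (Mann-Whitney) AUC in base R: the probability that a randomly chosen cell inside a cluster scores higher than a randomly chosen cell outside it. This is a textbook formulation on invented numbers, not SlimR's internal computation, and the `auc_rank` helper is hypothetical.

```r
# Generic rank-based AUC (not SlimR's implementation).
# scores: a per-cell marker score; is_positive: TRUE for cells in the cluster.
auc_rank <- function(scores, is_positive) {
  stopifnot(length(scores) == length(is_positive))
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  if (n_pos == 0 || n_neg == 0) {
    return(NA_real_)  # AUC is undefined without both classes
  }
  r <- rank(scores)   # average ranks, so ties count as half
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Invented example: six cells, three of them in the cluster of interest.
scores     <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.1)
in_cluster <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
auc_value  <- auc_rank(scores, in_cluster)
print(auc_value)
```

A value near 1 means the score separates the cluster well, while a value near 0.5 means it does no better than chance.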
After viewing the list of predicted cell types and the corresponding AUC values, the predicted cell types can be corrected with the following code.
Example 1:
# For example, cluster '15' in 'cluster_col' corresponds to cell type 'Intestinal stem cell'.
SlimR_anno_result$Prediction_results$Predicted_cell_type[
  SlimR_anno_result$Prediction_results$cluster_col == 15
] <- "Intestinal stem cell"
Example 2:
# For example, a predicted cell type with an AUC of 0.5 or less should be labeled 'Unknown'.
SlimR_anno_result$Prediction_results$Predicted_cell_type[
  SlimR_anno_result$Prediction_results$AUC <= 0.5
] <- "Unknown"
After modifying the corresponding predicted cell type, the following code is used to view the updated table of predicted cell types.
View(SlimR_anno_result$Prediction_results)
Important: If you need to correct a cell type, it is strongly recommended to choose from the cell types in SlimR_anno_result$Prediction_results$Alternative_cell_type.
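Both corrections can be tried on a stand-alone toy table before touching a real result. The data frame below only mimics the column names used above (cluster_col, Predicted_cell_type, AUC); its values are invented, and the real Prediction_results table may contain further columns.

```r
# Toy stand-in for SlimR_anno_result$Prediction_results (values are invented).
Prediction_results <- data.frame(
  cluster_col         = c("0", "1", "15"),
  Predicted_cell_type = c("Enterocyte", "Goblet cell", "Paneth cell"),
  AUC                 = c(0.92, 0.45, 0.71),
  stringsAsFactors    = FALSE
)

# Example 1: relabel one specific cluster.
is_cluster_15 <- Prediction_results$cluster_col == "15"
Prediction_results$Predicted_cell_type[is_cluster_15] <- "Intestinal stem cell"

# Example 2: label low-confidence predictions as 'Unknown'.
# The !is.na() guard leaves clusters without an AUC value untouched.
low_auc <- !is.na(Prediction_results$AUC) & Prediction_results$AUC <= 0.5
Prediction_results$Predicted_cell_type[low_auc] <- "Unknown"

print(Prediction_results)
```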
Assigns the SlimR predicted cell types in SlimR_anno_result$Prediction_results$Predicted_cell_type to the Seurat object based on cluster annotations, and stores the results in seurat_obj@meta.data$annotation_col.
sce <- Celltype_Annotation(
  seurat_obj = sce,
  cluster_col = "seurat_clusters",
  SlimR_anno_result = SlimR_anno_result,
  plot_UMAP = TRUE,
  annotation_col = "Cell_type_SlimR"
)
Important: The parameter cluster_col must be exactly the same in Celltype_Calculate() and Celltype_Annotation() to avoid false matches. Likewise, the parameter annotation_col must be exactly the same in Celltype_Annotation() and Celltype_Verification().
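The cluster-to-cell-type assignment is, conceptually, a lookup from each cell's cluster label into the prediction table, which is why the cluster column has to be the same in both steps. The base R sketch below illustrates the idea on invented data; it is not how Celltype_Annotation() is implemented, and the toy tables only mimic the column names used above.

```r
# Toy per-cell metadata and prediction table (values are invented).
meta <- data.frame(
  seurat_clusters  = c("0", "1", "0", "15", "1"),
  row.names        = paste0("cell", 1:5),
  stringsAsFactors = FALSE
)
prediction <- data.frame(
  cluster_col         = c("0", "1", "15"),
  Predicted_cell_type = c("Enterocyte", "Goblet cell", "Intestinal stem cell"),
  stringsAsFactors    = FALSE
)

# Look up each cell's cluster in the prediction table.
idx <- match(as.character(meta$seurat_clusters),
             as.character(prediction$cluster_col))
meta$Cell_type_SlimR <- prediction$Predicted_cell_type[idx]

# Clusters missing from the prediction table would show up here
# (for example, if a different clustering column had been used).
unmatched <- unique(meta$seurat_clusters[is.na(idx)])
print(meta)
print(unmatched)
```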
Verify the annotation using the cell group identity information in seurat_obj@meta.data$annotation_col, with the 'Feature Significance Score' (FSS, the product of log2FC and expression ratio) as the ranking basis for markers.
Celltype_Verification(
  seurat_obj = sce,
  SlimR_anno_result = SlimR_anno_result,
  gene_number = 5,
  assay = "RNA",
  colour_low = "white",
  colour_high = "navy",
  annotation_col = "Cell_type_SlimR"
)
Important: The parameter annotation_col must be exactly the same in Celltype_Annotation() and Celltype_Verification() to avoid false matches.
Note: Cell types listed in SlimR_anno_result$Prediction_results are verified using the marker information from SlimR_anno_result$Expression_list; cell types that are not in that list are validated using the marker information from the function FindMarkers().
Generate a heat map to estimate how likely each cell cluster is to resemble each reference cell type:
Celltype_Annotation_Heatmap(
  seurat_obj = sce,
  gene_list = Markers_list,
  species = "Human",
  cluster_col = "seurat_clusters",
  min_expression = 0.1,
  specificity_weight = 3,
  colour_low = "navy",
  colour_high = "firebrick3"
)
Note: This function has now been incorporated into Celltype_Calculate(), and it is recommended to use Celltype_Calculate() instead.
Generates a per-cell-type expression dot plot with a metric heat map (when metric information exists):
Celltype_Annotation_Features(
  seurat_obj = sce,
  cluster_col = "seurat_clusters",
  gene_list = Markers_list,
  gene_list_type = "Cellmarker2",
  species = "Human",
  save_path = "./SlimR/Celltype_Annotation_Features/",
  colour_low = "white",
  colour_high = "navy",
  colour_low_mertic = "white",
  colour_high_mertic = "navy"
)
Each resulting combined image consists of a dot plot above and, if metric information is present, a heat map below. The dot plot shows the expression level and expression ratio of each marker for the cell type, and the heat map shows the metric values for the same markers.
Generates per-cell-type expression combined plots:
Celltype_Annotation_Combined(
  seurat_obj = sce,
  gene_list = Markers_list,
  species = "Human",
  cluster_col = "seurat_clusters",
  assay = "RNA",
  save_path = "./SlimR/Celltype_Annotation_Combined/",
  colour_low = "white",
  colour_high = "navy"
)
Each generated combined plot shows box plots of the expression levels of the markers for that cell type, with colors corresponding to the average expression levels of the markers.
Functions in sections 5.1, 5.2, 5.3, and 5.4 have been incorporated into Celltype_Annotation_Features(), and it is recommended to use Celltype_Annotation_Features() with the corresponding parameters (for example, gene_list_type = "Cellmarker2") instead. For more information, please refer to section 4.2.
Celltype_annotation_Cellmarker2(
  seurat_obj = sce,
  gene_list = Markers_list_Cellmarker2,
  species = "Human",
  cluster_col = "seurat_clusters",
  assay = "RNA",
  save_path = "./SlimR/Celltype_annotation_Cellmarkers2/",
  colour_low = "white",
  colour_high = "navy",
  colour_low_mertic = "white",
  colour_high_mertic = "navy"
)
Note: To call this function, set the parameter gene_list_type = "Cellmarker2" in the function Celltype_Annotation_Features().
Celltype_annotation_PanglaoDB(
  seurat_obj = sce,
  gene_list = Markers_list_panglaoDB,
  species = "Human",
  cluster_col = "seurat_clusters",
  assay = "RNA",
  save_path = "./SlimR/Celltype_annotation_PanglaoDB/",
  colour_low = "white",
  colour_high = "navy",
  colour_low_mertic = "white",
  colour_high_mertic = "navy"
)
Note: To call this function, set the parameter gene_list_type = "PanglaoDB" in the function Celltype_Annotation_Features().
Celltype_annotation_Seurat(
  seurat_obj = sce,
  gene_list = Markers_list_Seurat,
  species = "Human",
  cluster_col = "seurat_clusters",
  assay = "RNA",
  save_path = "./SlimR/Celltype_annotation_Seurat/",
  colour_low = "white",
  colour_high = "navy",
  colour_low_mertic = "white",
  colour_high_mertic = "navy"
)
Note: To call this function, set the parameter gene_list_type = "Seurat" in the function Celltype_Annotation_Features().
Celltype_annotation_Excel(
  seurat_obj = sce,
  gene_list = Markers_list_Excel,
  species = "Human",
  cluster_col = "seurat_clusters",
  assay = "RNA",
  save_path = "./SlimR/Celltype_annotation_Excel/",
  colour_low = "white",
  colour_high = "navy",
  colour_low_mertic = "white",
  colour_high_mertic = "navy"
)
Note: To call this function, set the parameter gene_list_type = "Excel" in the function Celltype_Annotation_Features(). This function also works with a Markers_list that contains either no metric information or metric information generated in other ways.
Thank you for using SlimR. For questions, issues, or suggestions, please submit them in the issue section or discussion section on GitHub (suggested) or send an email (alternative):
Zhaoqing Wang