Disclosure of Invention
The application provides a multi-omics fusion method and system for prognosis evaluation of IDH wild-type glioblastoma, which construct a comprehensive prognosis model for IDH wild-type glioblastoma patients by integrating pathomics, genomics, and transcriptomics features, thereby increasing the prognostic predictive value for glioblastoma.
To solve the above technical problems, in a first aspect, an embodiment of the application provides a multi-omics fusion method for prognosis evaluation of IDH wild-type glioblastoma, comprising the following steps. First, a data set is acquired, the data set comprising a data set of IDH wild-type glioblastoma patients. Pathology images, gene data, and transcription data are obtained from the data set, and omics features are extracted from each of them; the omics features comprise pathomics features, mutation omics features, copy number variation omics features, and transcriptomics features. Next, a single-omics prognosis model is constructed for each type of omics feature; the single-omics prognosis models comprise a pathomics prognosis model, a mutation omics prognosis model, a copy number variation omics prognosis model, and a transcriptomics prognosis model. Then, the omics features of the single-omics prognosis models are screened to obtain multi-omics fusion features, a machine-learning-based multi-omics fusion model is constructed from the multi-omics fusion features, and patients are risk-stratified by the multi-omics fusion model into a high-risk group and a low-risk group. Finally, cross pathways are obtained using weighted gene co-expression network analysis (Weighted Gene Co-expression Network Analysis, WGCNA) and gene set enrichment analysis (Gene Set Enrichment Analysis, GSEA), and the key biological pathways and biological behaviors through which the multi-omics fusion model affects prognosis are determined.
In some exemplary embodiments, obtaining a pathology image based on the data set and extracting pathomics features from the pathology image comprises: scanning tissue sections of a patient's pathology slides with a pathology slide scanner to obtain a full-field digital pathology image (Whole Slide Image, WSI); preprocessing and screening the full-field digital pathology image; and segmenting the preprocessed and screened pathology image and extracting the pathomics features from it.
In some exemplary embodiments, preprocessing and screening the full-field digital pathology image, segmenting the preprocessed and screened pathology image, and extracting the pathomics features comprises: preprocessing the full-field digital pathology image fully automatically with an image search engine; segmenting the pathology image into a plurality of 1024×1024 patches; classifying the patches into tissue patches and non-tissue patches, wherein a patch with a tissue content lower than 85% is regarded as a non-tissue patch; performing K-means cluster analysis on the tissue patches according to their image patterns to obtain a plurality of patch clusters; selecting the optimal patches from each patch cluster; and performing feature extraction on the optimal patches selected from each patch cluster to obtain the pathomics features.
In some exemplary embodiments, performing feature extraction on the optimal patches selected from each patch cluster to obtain the pathomics features comprises: unmixing a representative 1024×1024-pixel image with an unmix colors module to identify the foreground and background of the target tissue; automatically identifying the cell nucleus, the cell body, and the cytoplasm with three adaptive-threshold modules, namely an identify primary objects module, an identify secondary objects module, and an identify tertiary objects module, respectively; analyzing the image in depth with functional modules to obtain object features in the image, wherein the functional modules comprise an object size and shape measurement module, a texture measurement module, a granularity measurement module, an object relationship measurement module, an image area occupied measurement module, an object intensity measurement module, and an object intensity distribution measurement module, and the object features comprise the size, shape, texture, density, distribution, and intensity features of the cell nuclei and cytoplasm; and performing statistical analysis on the measured object features to obtain statistical indexes that comprehensively and objectively describe the features and attributes of the objects in the image, wherein the statistical indexes comprise the mean, the median, and the standard deviation.
In some exemplary embodiments, after the omics features are extracted and before the single-omics prognosis model corresponding to each type of omics feature is constructed, the method further comprises preprocessing the omics features and screening the preprocessed omics features.
In some exemplary embodiments, preprocessing the omics features and screening the preprocessed omics features comprises: applying a logarithmic transformation to the pathomics features and then z-score standardization, so that different features share a unified dimension and scale and the influence of dimension is eliminated, in order to screen out key features for model construction; encoding the mutation omics features as binary variables, with a mutation marked as 1 and no mutation marked as -1, and retaining genes mutated in at least 3 enrolled patients as key features for subsequent analysis; and, for the copy number variation omics features and the transcriptomics features, determining the 50% quantile using the median absolute deviation method to identify and reject abnormal values and thereby improve the robustness of the model, followed by z-score standardization, in order to screen out key features for model construction.
In some exemplary embodiments, constructing the single-omics prognosis model corresponding to each type of omics feature comprises: based on the key features obtained by feature screening of the pathomics features, the mutation omics features, the copy number variation omics features, and the transcriptomics features, combining the key features of each omics into the corresponding single-omics prognosis model, wherein the key features are weighted by ridge regression coefficients to form a linear combination model; and calculating the risk score of each omics from the linear combination model, the risk scores being the Path Score, the Mut Score, the CNV Score, and the FPKM Score.
In some exemplary embodiments, screening the omics features of the single-omics prognosis models to obtain the multi-omics fusion features and constructing the machine-learning-based multi-omics fusion model from the multi-omics fusion features comprises: determining a cutoff value for the risk score of each omics in the training set of the data set, and dividing the patients into a low-risk group and a high-risk group; evaluating the relationship between each omics feature and survival time through survival curve analysis, and evaluating the survival difference between the two groups with the log-rank test; applying the cutoff value to the internal validation set, and calculating the cutoff value of the external validation set from the statistical characteristics of the external validation set; after division into the high-risk and low-risk groups, drawing survival curves and performing the log-rank test; performing univariate Cox analysis on the clinicopathological risk factors and the risk score of each omics to obtain factors that are significant for patient prognosis; entering all factors with P < 0.05 into multivariate Cox analysis to obtain independent risk factors, and evaluating whether the risk score of each omics is an independent risk factor; and screening patients who have all four types of omics feature data, taking the risk score of each omics as a new feature to construct the multi-omics fusion model, and calculating the weighted risk score of the multi-omics fusion model.
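The two-group log-rank comparison mentioned above can be illustrated with a minimal Python sketch. It is not the implementation used in the application; the function name `logrank_test` and the input layout (parallel lists of survival times and event indicators, 1 for death and 0 for censoring) are assumptions made for illustration.

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    # Pool both groups, tagging each record with its group (0 = A, 1 = B).
    records = [(t, e, 0) for t, e in zip(times_a, events_a)]
    records += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in records if e})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        at_risk = [r for r in records if r[0] >= t]
        n = len(at_risk)                                   # patients still at risk
        n_a = sum(1 for r in at_risk if r[2] == 0)         # of which in group A
        d = sum(1 for r in at_risk if r[0] == t and r[1])  # deaths at time t
        d_a = sum(1 for r in at_risk if r[0] == t and r[1] and r[2] == 0)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # Hypergeometric variance of the number of deaths in group A.
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

A small p-value indicates that the survival curves of the high-risk and low-risk groups differ; in practice a validated survival analysis package would normally be used instead of a hand-written routine.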
In some exemplary embodiments, obtaining the cross pathways using weighted gene co-expression network analysis and gene set enrichment analysis and determining the key biological pathways and biological behaviors through which the multi-omics fusion model affects prognosis comprises: obtaining the cross pathways with the two analysis methods, weighted gene co-expression network analysis and gene set enrichment analysis; calculating the gene set variation analysis (Gene Set Variation Analysis, GSVA) score of each pathway; performing Pearson correlation analysis between the GSVA scores and the risk score of the multi-omics fusion model; studying the relationship between each feature value in the risk score of the multi-omics fusion model and the cross biological pathways; and comparing the expression differences on the key biological pathways between the high-risk and low-risk groups divided by the multi-omics fusion model, so as to reveal the biological mechanism behind the multi-omics fusion features and provide potential biomarkers for clinical treatment.
In a second aspect, an embodiment of the application further provides a multi-omics fusion system for prognosis evaluation of IDH wild-type glioblastoma, comprising a data set module, a multi-omics feature module, a single-omics prognosis model module, a multi-omics fusion model module, and a data analysis module connected in sequence. The data set module is used for acquiring a data set, the data set comprising a data set of IDH wild-type glioblastoma patients. The multi-omics feature module is used for obtaining pathology images, gene data, and transcription data from the data set and extracting omics features from each of them, the omics features comprising pathomics features, mutation omics features, copy number variation omics features, and transcriptomics features. The single-omics prognosis model module is used for constructing a single-omics prognosis model for each type of omics feature, the single-omics prognosis models comprising a pathomics prognosis model, a mutation omics prognosis model, a copy number variation omics prognosis model, and a transcriptomics prognosis model. The multi-omics fusion model module is used for screening the omics features of the single-omics prognosis models to obtain multi-omics fusion features, constructing a machine-learning-based multi-omics fusion model from the multi-omics fusion features, and risk-stratifying patients by the multi-omics fusion model into a high-risk group and a low-risk group. The data analysis module is used for performing differential analysis on the high-risk and low-risk groups, obtaining cross pathways using weighted gene co-expression network analysis and gene set enrichment analysis, and determining the key biological pathways and biological behaviors through which the multi-omics fusion model affects prognosis.
The technical scheme provided by the embodiment of the application has at least the following advantages:
An embodiment of the application provides a multi-omics fusion method and system for prognosis evaluation of IDH wild-type glioblastoma. The method comprises the following steps. First, a data set is acquired, the data set comprising a data set of IDH wild-type glioblastoma patients. Pathology images, gene data, and transcription data are obtained from the data set, and omics features are extracted from each of them; the omics features comprise pathomics features, mutation omics features, copy number variation omics features, and transcriptomics features. Next, a single-omics prognosis model is constructed for each type of omics feature; the single-omics prognosis models comprise a pathomics prognosis model, a mutation omics prognosis model, a copy number variation omics prognosis model, and a transcriptomics prognosis model. Then, the omics features of the single-omics prognosis models are screened to obtain multi-omics fusion features, a machine-learning-based multi-omics fusion model is constructed from the multi-omics fusion features, and patients are risk-stratified by the multi-omics fusion model into a high-risk group and a low-risk group. Finally, cross pathways are obtained using weighted gene co-expression network analysis and gene set enrichment analysis, and the key biological pathways and biological behaviors through which the multi-omics fusion model affects prognosis are determined.
According to the multi-omics fusion method for prognosis evaluation of IDH wild-type glioblastoma, on the one hand, a machine-learning-based multi-omics model is established and its risk stratification of glioblastoma is verified. On the other hand, the multi-omics biological analysis uses both WGCNA and GSEA to analyze disease-related biological pathways from different perspectives and to cross-validate the analysis results. The pathways common to the two methods are then taken to reveal the biological basis behind the multi-omics model, which enhances the reliability and accuracy of the enriched pathways. In addition, the multi-omics model provided by the application analyzes the pathomics features, gene information, and transcription information of glioblastoma patients. By further integrating this information, the application studies the biological mechanism of glioblastoma and the related signaling pathways, which helps to better understand the physiological mechanism of the disease.
Detailed Description
As is known from the background art, most existing machine-learning-based glioma prediction models rely on a single omics; the relationships between different omics need further exploration, and the predictive performance and stability of the models need to be improved. In addition, training and validating glioblastoma prediction models requires international multicenter, multi-ethnic data sets with larger sample sizes.
To solve the above technical problems, an embodiment of the application provides a multi-omics fusion method and system for prognosis evaluation of IDH wild-type glioblastoma. The method comprises the following steps. First, a data set is acquired, the data set comprising a data set of IDH wild-type glioblastoma patients. Pathology images, gene data, and transcription data are obtained from the data set, and omics features are extracted from each of them; the omics features comprise pathomics features, mutation omics features, copy number variation omics features, and transcriptomics features. Next, a single-omics prognosis model is constructed for each type of omics feature; the single-omics prognosis models comprise a pathomics prognosis model, a mutation omics prognosis model, a copy number variation omics prognosis model, and a transcriptomics prognosis model. Then, the omics features of the single-omics prognosis models are screened to obtain multi-omics fusion features, a machine-learning-based multi-omics fusion model is constructed from the multi-omics fusion features, and patients are risk-stratified by the multi-omics fusion model into a high-risk group and a low-risk group. Finally, cross pathways are obtained using weighted gene co-expression network analysis and gene set enrichment analysis, and the key biological pathways and biological behaviors through which the multi-omics fusion model affects prognosis are determined.
The application provides a multi-omics fusion method and system for prognosis evaluation of IDH wild-type glioblastoma, which construct a comprehensive prognosis model for IDH wild-type glioblastoma patients by integrating pathomics, genomics, and transcriptomics features, thereby increasing the prognostic predictive value for glioblastoma.
Embodiments of the present application will be described in detail below with reference to the attached drawings. However, it will be understood by those of ordinary skill in the art that in various embodiments of the present application, numerous specific details are set forth in order to provide a thorough understanding of the present application. The claimed application may be practiced without these specific details and with various changes and modifications based on the following embodiments.
Referring to Fig. 1, an embodiment of the present application provides a multi-omics fusion method for prognosis evaluation of IDH wild-type glioblastoma, comprising the following steps:
Step S1, acquiring a data set, wherein the data set comprises a data set of IDH wild-type glioblastoma patients.
Step S2, obtaining pathology images, gene data, and transcription data from the data set, and extracting omics features from the pathology images, the gene data, and the transcription data respectively, wherein the omics features comprise pathomics features, mutation omics features, copy number variation omics features, and transcriptomics features.
Step S3, constructing a single-omics prognosis model for each type of omics feature, wherein the single-omics prognosis models comprise a pathomics prognosis model, a mutation omics prognosis model, a copy number variation omics prognosis model, and a transcriptomics prognosis model.
Step S4, screening the omics features of the single-omics prognosis models to obtain multi-omics fusion features, constructing a machine-learning-based multi-omics fusion model from the multi-omics fusion features, and risk-stratifying patients by the multi-omics fusion model into a high-risk group and a low-risk group.
Step S5, performing differential analysis on the high-risk group and the low-risk group, obtaining cross pathways using weighted gene co-expression network analysis and gene set enrichment analysis, and determining the key biological pathways and biological behaviors through which the multi-omics fusion model affects prognosis.
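The pathway intersection in step S5, together with the correlation between a pathway score and the fusion risk score, can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, the pathway lists stand in for the outputs of the WGCNA and GSEA analyses, and the per-patient pathway scores stand in for GSVA scores computed elsewhere.

```python
import math

def cross_pathways(wgcna_pathways, gsea_pathways):
    """Return the pathways reported by both analyses, in sorted order."""
    return sorted(set(wgcna_pathways) & set(gsea_pathways))

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical pathway names, used only to show the call pattern.
shared = cross_pathways(
    ["pathway_A", "pathway_B", "pathway_C"],
    ["pathway_B", "pathway_C", "pathway_D"],
)
```

Each shared pathway's score can then be passed to `pearson_r` together with the patients' fusion risk scores to rank pathways by the strength of their association with risk.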
The multi-omics fusion method for prognosis evaluation of IDH wild-type glioblastoma provided by the application first preprocesses the pathomics, genomics, and transcriptomics features and randomly divides the patients into a training group and a test group at a ratio of 6:4. Univariate Cox analysis and LightGBM (Light Gradient Boosting Machine) regression analysis are then performed on the training group of each omics to screen out the optimal pathomics, genomics, and transcriptomics features. Subsequently, a prognosis model is established with a ridge regression model (Ridge fit), and a risk score (Risk score) is calculated for each omics to predict patient survival prognosis. An appropriate cutoff value is selected according to the risk score and survival data of each omics, the patients are divided into a high-risk group and a low-risk group, and the correlation between the features of each omics and patient survival time is verified by Kaplan-Meier curve analysis and multivariate Cox regression analysis. External validation is performed with glioblastoma data from the TCGA and CGGA public databases to examine the Kaplan-Meier (K-M) survival curves of the high-risk and low-risk groups divided by the model. The omics risk scores from internal and external validation are taken as new features, multi-omics feature fusion is performed with the ridge regression model, and the risk score of the fusion model (multscore) is calculated. Patients with complete multi-omics data are screened for model reliability verification, Kaplan-Meier curve analysis and multivariate Cox analysis are performed, and a nomogram is drawn from the independent risk factors for personalized prediction of patient survival.
Decision curve analysis (DCA) curves are drawn to verify the clinical utility of the nomogram, and the predictive performance of the model is checked with a calibration curve. Finally, differential analysis is performed on the high-risk and low-risk groups obtained after multi-omics fusion, cross pathways are obtained with the WGCNA and GSEA methods, and the pathways and biological behaviors through which the fusion model affects prognosis are identified and biologically interpreted, so as to strengthen the biological significance of the model.
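The risk-score and stratification steps described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the application's implementation: the coefficient values and feature names are hypothetical placeholders for coefficients that a fitted ridge regression would supply, and the median is assumed as the cutoff because the text only says that an appropriate cutoff value is selected.

```python
from statistics import median

def risk_score(features, coefficients, intercept=0.0):
    """Linear combination of feature values weighted by regression coefficients."""
    return intercept + sum(coefficients[name] * features[name] for name in coefficients)

def stratify(scores, cutoff=None):
    """Label each score 'high' or 'low'; the median cutoff is an assumption."""
    if cutoff is None:
        cutoff = median(scores)
    return ["high" if s > cutoff else "low" for s in scores]

# Hypothetical fusion step: the four single-omics risk scores act as features.
fusion_coefficients = {"path_score": 0.4, "mut_score": 0.2, "cnv_score": 0.1, "fpkm_score": 0.3}
patient = {"path_score": 1.2, "mut_score": -0.5, "cnv_score": 0.3, "fpkm_score": 0.8}
fused = risk_score(patient, fusion_coefficients)
```

A cutoff determined on the training group can be passed explicitly through `cutoff` when the same threshold has to be reused on an internal validation group.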
The application extracts pathomics features with CellProfiler software, integrates genomics and transcriptomics information, and constructs a comprehensive prognosis model for IDH wild-type glioblastoma patients. It further examines the advantages of this prognosis model over traditional prognosis models built on clinical and pathological factors, in terms of incremental predictive value, stability, and reliability. In addition, the application uses paired pathology, whole-exome sequencing (WES), and high-throughput transcriptome sequencing (RNA-seq) data to explore the biological mechanisms behind the multi-omics fusion model. Through integrated analysis of pathological features and gene expression data, multi-omics data fusion reflects disease characteristics better than a single omics, can significantly improve the predictive performance and stability of the model, and better guides personalized clinical decisions. By performing differential analysis on the high-risk and low-risk groups divided by the multi-omics fusion model and obtaining cross pathways with weighted gene co-expression network analysis and gene set enrichment analysis, the biological mechanism of glioblastoma and its related signaling pathways can be studied in depth, with the aim of providing a more accurate reference for formulating treatment strategies and evaluating prognosis for IDH wild-type glioblastoma patients.
The data set acquired in step S1 is derived mainly from patients who underwent surgical resection of a brain tumor in the hospital's neurosurgery department and were diagnosed with glioblastoma according to the WHO CNS5 classification.
The inclusion criteria for the patient cohort are: (1) cases of glioblastoma comprehensively diagnosed by postoperative histopathology and molecular testing; (2) complete and available postoperative follow-up survival data; (3) complete and available HE-stained pathology sections, whole-exome sequencing (WES) data, or RNA sequencing data of the surgically resected specimen; and (4) patient age of at least 18 years.
The exclusion criteria for the patient cohort are: (1) cases without surgical treatment, or with needle biopsy only, at the time of diagnosis; (2) cases not surgically treated in the hospital; (3) poor HE staining of the surgically resected tissue specimen, with quality problems such as artifacts or extensive bleeding; (4) cases lost to postoperative follow-up or whose death was not caused by glioblastoma; and (5) cases with a survival time of less than 2 months.
After the above rigorous screening procedure, patients meeting the criteria were included in the study cohort. Notably, some patients in the study cohort had paired whole-exome sequencing (WES) data or RNA sequencing data. In addition, the application collects detailed clinical and pathological data for each patient through the clinical medical record system, including gender, age, radiotherapy, chemotherapy, tumor grade, preoperative Karnofsky Performance Status (KPS) score, and extent of tumor resection (complete or incomplete). The follow-up data come from telephone follow-up and patient review visits.
In some embodiments, the gene data and transcription data in step S2 are obtained from the data set by WES and RNA sequencing (RNA-seq), respectively. To better understand the molecular mechanism of glioblastoma and provide molecular-level data support for constructing the prognosis evaluation model, the application further tests IDH or 1p/19q status, in addition to routine pathological examination, for some patients whose tumor tissue was surgically resected. For patients who had already undergone IDH or 1p/19q pathology testing, the results are used directly. For patients who had not, the embedded tissue wax blocks are borrowed from the pathology department with the informed consent of the patient, and the corresponding molecular information is detected by Sanger sequencing and fluorescence in situ hybridization (Fluorescence In Situ Hybridization, FISH). In addition, a subset of tumor specimens is selected for WES and RNA-seq. Mutation and copy number variation (Copy Number Variation, CNV) data of glioblastoma patients are obtained by WES. Transcriptomics data, expressed as FPKM (Fragments Per Kilobase of transcript per Million mapped reads), are obtained by RNA-seq to reflect gene expression levels. The mutation data and CNV data are used to analyze the genetic variation characteristics of the tumors. The FPKM data are used to evaluate gene expression patterns and further explore the molecular mechanisms associated with glioblastoma development. Integrating the WES and RNA-seq data provides molecular-level evidence for constructing the multi-omics prognosis evaluation model.
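For reference, the FPKM unit named above follows the standard definition: the number of fragments mapped to a gene, divided by the gene length in kilobases and by the total number of mapped fragments in millions. The short Python sketch below only illustrates that arithmetic; the function and parameter names are chosen for illustration and do not come from any sequencing pipeline.

```python
def fpkm(fragment_count, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million).
    return fragment_count * 1e9 / (gene_length_bp * total_mapped_fragments)

# 100 fragments on a 2,000 bp gene in a library of 10 million mapped fragments.
example = fpkm(100, 2000, 10_000_000)
```

Because the value is normalized by both gene length and sequencing depth, it allows expression levels to be compared across genes and samples.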
In some embodiments, obtaining a pathology image based on the data set in step S2 and extracting pathomics features from the pathology image comprises:
Step S201, scanning tissue sections of the patient's pathology slides with a pathology slide scanner to obtain a full-field digital pathology image.
Step S202, preprocessing and screening the full-field digital pathology image, then segmenting the preprocessed and screened pathology image and extracting the pathomics features from it.
To ensure the quality and consistency of the pathology image data, the application uses standardized equipment and parameters to acquire the full-field digital pathology images (WSI). Surgical specimens from all patients in the study cohort were routinely HE stained; the patients' pathology sections were borrowed from the pathology department and scanned at 20× magnification with a high-resolution digital pathology slide scanner (KF-PRO-120-HI) to obtain the full-field digital pathology images. In addition, glioblastoma WSI data were downloaded from the TCGA public database to verify the consistency and reliability of the pathomics features across different patient populations.
In some embodiments, preprocessing and screening the full-field digital pathology image in step S202, segmenting the preprocessed and screened pathology image, and extracting the pathomics features comprises: preprocessing the full-field digital pathology image fully automatically with an image search engine; segmenting the pathology image into a plurality of 1024×1024 patches; classifying the patches into tissue patches and non-tissue patches, wherein a patch with a tissue content lower than 85% is regarded as a non-tissue patch; performing K-means cluster analysis on the tissue patches according to their image patterns to obtain a plurality of patch clusters; selecting the optimal patches from each patch cluster; and performing feature extraction on the optimal patches selected from each patch cluster to obtain the pathomics features.
Specifically, the application uses an image search engine named "Yottixel" to preprocess the glioblastoma WSI fully automatically. The processing comprises three main steps: first, the WSI is automatically divided into a plurality of 1024×1024 patches; second, the patches are classified into tissue and non-tissue patches, a patch with a tissue content lower than 85% being regarded as a non-tissue patch; and finally, K-means cluster analysis is performed on the tissue patches according to their image patterns. In this way, three different patch clusters (K0, K1, K2) are obtained for each patient, and patches in the same cluster are considered to have similar image characteristics. An experienced pathologist then selects the 5 patches with the best image quality from each patch cluster for further analysis.
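The tiling and tissue-filtering steps can be illustrated with a small Python sketch. It is not Yottixel's implementation: the rule that a pixel counts as tissue when its grayscale value is below 220 is an assumption made for illustration, and the image is represented as a plain list of rows so that the example runs without imaging libraries. The K-means clustering step is omitted.

```python
def tile_image(image, patch_size):
    """Split a 2-D grayscale image (list of rows) into non-overlapping square patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            patch = [row[left:left + patch_size] for row in image[top:top + patch_size]]
            patches.append(((top, left), patch))
    return patches

def tissue_fraction(patch, background_threshold=220):
    """Fraction of pixels darker than the (assumed) background threshold."""
    pixels = [p for row in patch for p in row]
    return sum(1 for p in pixels if p < background_threshold) / len(pixels)

def keep_tissue_patches(patches, min_tissue=0.85):
    """Discard patches whose tissue content is below 85%."""
    return [(pos, patch) for pos, patch in patches if tissue_fraction(patch) >= min_tissue]

# Toy 4x4 "slide" tiled into 2x2 patches (a real WSI would use patch_size=1024).
toy_slide = [
    [100, 100, 255, 255],
    [100, 100, 255, 255],
    [100, 100, 50, 50],
    [100, 255, 50, 50],
]
kept = keep_tissue_patches(tile_image(toy_slide, 2))
```

In the toy slide, the upper-right patch is pure background and the lower-left patch is only 75% tissue, so both fall below the 85% threshold and are discarded.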
In some embodiments, performing feature extraction on the optimal patches selected from each patch cluster to obtain the pathomics features comprises: unmixing a representative 1024×1024-pixel image with the unmix colors (Unmix Colors) module to identify the foreground and background of the target tissue; automatically identifying the cell nucleus, the cell body, and the cytoplasm with three adaptive-threshold modules, namely the identify primary objects module (Identify Primary Objects), the identify secondary objects module (Identify Secondary Objects), and the identify tertiary objects module (Identify Tertiary Objects), respectively; analyzing the image in depth with functional modules to obtain object features in the image, wherein the functional modules comprise an object size and shape measurement module, a texture measurement module, a granularity measurement module, an object relationship measurement module, an image area occupied measurement module, an object intensity measurement module, and an object intensity distribution measurement module, and the object features comprise the size, shape, texture, density, distribution, and intensity features of the cell nuclei and cytoplasm; and performing statistical analysis on the measured object features to obtain statistical indexes that comprehensively and objectively describe the features and attributes of the objects in the image, wherein the statistical indexes comprise the mean, the median, and the standard deviation.
Specifically, when the pathomics features are extracted, CellProfiler software is used to segment the preprocessed and screened pathology images and extract a large number of quantitative pathomics features. The process first unmixes a representative image with the unmix colors module. The cell nucleus, the cell body, and the cytoplasm are then automatically identified by the identify primary objects, identify secondary objects, and identify tertiary objects modules, respectively, each using an adaptive Otsu threshold. The identified objects are then measured by several modules, including "measure object size and shape", "measure texture", "measure granularity", "measure relationships between objects", "measure image area occupied", "measure object intensity", and "measure object intensity distribution". The measurements of these modules are summarized by their mean, median, and standard deviation as the final features, which yields 2382 image features covering the size, shape, texture, pixel intensity distribution, and proximity relationships of the defined cell nuclei and cytoplasm. For each sample, the mean of each feature is taken over the five pathology images in each of the three patch clusters. After the useless features are removed, a total of 6456 useful quantitative pathomics image features are retained, providing data support for subsequent analysis.
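The aggregation described above (per-object measurements summarized by mean, median, and standard deviation, then averaged over the patches of one cluster) can be sketched as follows. This is a minimal illustration, not CellProfiler's export format: the feature naming scheme is hypothetical, and the sample standard deviation is assumed because the text does not say which variant is used.

```python
from statistics import mean, median, stdev

def summarize_objects(per_object):
    """Collapse per-object measurements of one patch into patch-level features."""
    features = {}
    for name, values in per_object.items():
        features[f"{name}_mean"] = mean(values)
        features[f"{name}_median"] = median(values)
        features[f"{name}_std"] = stdev(values)  # sample SD; needs at least 2 objects
    return features

def average_over_patches(patch_features):
    """Average every feature over the patches selected from one cluster."""
    names = patch_features[0].keys()
    return {n: mean([p[n] for p in patch_features]) for n in names}

# Two toy patches, each with three measured nuclei (values are illustrative).
patch_1 = summarize_objects({"nucleus_area": [1.0, 2.0, 3.0]})
patch_2 = summarize_objects({"nucleus_area": [3.0, 4.0, 5.0]})
cluster_features = average_over_patches([patch_1, patch_2])
```

Repeating this for each of the three patch clusters and concatenating the results gives one feature vector per patient.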
In some embodiments, after the omics features are extracted in step S2 and before the single-omics prognosis models corresponding to the omics features are constructed in step S3, the method further comprises preprocessing the omics features and screening the preprocessed omics features.
When preprocessing the multi-omics features, a suitable data preprocessing method is selected for the extracted pathomic features, mutation (Mut) features, copy number variation (CNV) features and FPKM features according to the condition of the data.
In some embodiments, preprocessing the omics features and screening the preprocessed omics features comprises the following. The pathomic features are log-transformed to reduce the dynamic range of the data and the influence of extreme values, and are then z-score standardized so that different features share a uniform dimension and scale and the influence of dimension is eliminated, after which the key features for model construction are screened out. The mutation features are encoded as binary variables, with a mutation marked as 1 and no mutation marked as -1, and genes mutated in at least 3 patients in the cohort are retained as key features for subsequent analysis. For the copy number variation features and the transcriptomic features, the median absolute deviation (MAD) method is used to determine the 50% score so as to identify and reject outliers and improve the robustness of the model, and the z-score method is used for standardization, after which the key features for model construction are screened out.
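The preprocessing steps above can be sketched with NumPy as follows. Several details are assumptions made only for illustration: the base-2 logarithm with a pseudo-count of 1, the reading of the mutation rule as "mutated in at least 3 patients", and the reading of the MAD step as "keep the half of the features with the largest MAD". The function names are hypothetical.

```python
import numpy as np

def log_zscore(x):
    """log2(x + 1) transform, then per-feature z-score (rows = patients)."""
    logged = np.log2(np.asarray(x, dtype=float) + 1.0)
    mean = logged.mean(axis=0)
    std = logged.std(axis=0)
    std[std == 0] = 1.0  # constant features map to zero instead of NaN
    return (logged - mean) / std

def encode_mutations(mutated, min_patients=3):
    """Encode mutation status as 1 / -1 and keep frequently mutated genes."""
    mutated = np.asarray(mutated, dtype=bool)
    keep = mutated.sum(axis=0) >= min_patients  # assumed reading of the rule
    encoded = np.where(mutated, 1, -1)
    return encoded[:, keep], keep

def mad_filter(x, keep_fraction=0.5):
    """Return sorted indices of the features with the largest MAD."""
    x = np.asarray(x, dtype=float)
    median = np.median(x, axis=0)
    mad = np.median(np.abs(x - median), axis=0)
    n_keep = max(1, int(round(x.shape[1] * keep_fraction)))
    order = np.argsort(mad)[::-1]  # largest MAD first
    return np.sort(order[:n_keep])

z = log_zscore([[0.0, 1.0], [3.0, 1.0], [15.0, 1.0]])
kept, mask = encode_mutations([[1, 0], [1, 0], [1, 1], [0, 0]])
selected = mad_filter([[1, 10], [2, 20], [3, 30], [4, 40]])
```

In practice the standardization parameters would be estimated on the training set and reused on the validation sets; the sketch omits that split for brevity.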
The preprocessed omics features are screened with the following strategy. (1) Univariate Cox proportional hazards analysis is performed on the preprocessed features to evaluate the relationship between each feature and patient survival time, and the features significantly associated with prognosis (p < 0.05) are retained. (2) LightGBM, an efficient machine learning algorithm based on the gradient boosting framework, is applied; by setting the objective function, evaluation metric, number of leaves, maximum depth, learning rate, number of boosting rounds, regularization parameters and the like, the generalization ability of the model is improved and the risk of overfitting is reduced, and the 30 features with the greatest influence on survival time are screened out. (3) SHAP (SHapley Additive exPlanations) analysis is used to quantify the contribution of each feature to the model prediction; the global importance of each feature can be assessed from its SHAP values, which improves the interpretability and reliability of the model.
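The idea behind step (3), ranking features by their global SHAP importance, can be illustrated on a simpler model. The sketch below does not use LightGBM or a tree explainer. It swaps in a purely linear model, for which, with features treated as independent, the SHAP value of feature j for sample i has the closed form w_j (x_ij - mean of x_j). The weights and data are hypothetical.

```python
import numpy as np

def linear_shap_values(weights, x):
    """SHAP values of a linear model, assuming independent features."""
    weights = np.asarray(weights, dtype=float)
    x = np.asarray(x, dtype=float)
    return weights * (x - x.mean(axis=0))

def top_features(shap_values, k):
    """Rank features by mean absolute SHAP value (global importance)."""
    importance = np.abs(shap_values).mean(axis=0)
    return np.argsort(importance)[::-1][:k], importance

# Hypothetical model weights and a 3-patient, 3-feature matrix.
weights = [2.0, -1.0, 0.1]
x = np.array([[1.0, 0.0, 5.0], [3.0, 2.0, 5.0], [5.0, 4.0, 5.0]])
shap_values = linear_shap_values(weights, x)
top, importance = top_features(shap_values, k=2)
```

The per-sample SHAP values sum to the difference between that sample's prediction and the mean prediction, which is the additivity property that makes the ranking interpretable.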
In some embodiments, constructing in step S3 the single-omics prognosis models corresponding to the omics features comprises the following. Based on the key features obtained by feature screening of the pathomic features, mutation features, copy number variation features and transcriptomic features, the key features of each omics are combined into the single-omics prognosis model corresponding to that omics; the key features are weighted by the coefficients of ridge regression to form a linear combination model; and the risk score (Risk Score) of each omics is calculated from the linear combination model, namely the Path Score, Mut Score, CNV Score and FPKM Score.
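The linear combination described above amounts to a weighted sum of the key features of one omics. A minimal sketch follows; the feature values and coefficients are hypothetical placeholders, and how the ridge coefficients are fitted is not shown because the text does not specify it.

```python
import numpy as np

def risk_score(features, coefficients):
    """Linear-combination risk score, one value per patient (row)."""
    features = np.asarray(features, dtype=float)
    coefficients = np.asarray(coefficients, dtype=float)
    return features @ coefficients

# Hypothetical standardized key features (3 patients x 2 features) and
# hypothetical ridge coefficients; none of these numbers come from the source.
path_features = [[0.5, -1.0], [1.5, 0.0], [-0.5, 2.0]]
path_coefficients = [0.4, -0.2]
path_score = risk_score(path_features, path_coefficients)
```

The same function would be applied separately to each omics to obtain the Path Score, Mut Score, CNV Score and FPKM Score.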
In some embodiments, step S4, in which the omics features of the single-omics prognosis models are screened to obtain multi-omics fusion features and a machine learning-based multi-omics fusion model is constructed from them, comprises the following. The cut-off value of each omics risk score is determined in the training set of the data set using the R package survminer, and the patients are divided into a low-risk group and a high-risk group. The relationship between each omics feature and overall survival (OS) is evaluated by survival curve analysis, and the survival difference between the two groups is evaluated by the log-rank test. The cut-off value is applied to the internal validation set, whereas the cut-off value of the external validation set is calculated from the statistical characteristics of that set; the patients are divided into high-risk and low-risk groups, and survival curves with log-rank tests are drawn (as shown in FIG. 2B). Univariate Cox analysis is performed on the clinicopathological risk factors and the omics risk scores to obtain the factors that are significant for patient prognosis. All factors with P < 0.05 are then entered into a multivariate Cox analysis to obtain the independent risk factors and to evaluate whether each omics risk score is an independent risk factor. The omics features that are independent risk factors are then combined as the multi-omics fusion features and weighted by coefficients to form the fused linear combination model, from which the fused risk score is calculated.
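The stratification and log-rank comparison described above can be sketched in plain Python. The cut-off is taken as given here, since how survminer selects it is not reproduced. The functions `stratify` and `logrank_test` are hypothetical names, and the survival data are invented for illustration.

```python
import math

def stratify(scores, cutoff):
    """Label patients 1 (high risk) if the score exceeds the cut-off, else 0."""
    return [1 if s > cutoff else 0 for s in scores]

def logrank_test(times, events, groups):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    obs_minus_exp = 0.0
    variance = 0.0
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        at_risk = [(u, e, g) for u, e, g in zip(times, events, groups) if u >= t]
        n = len(at_risk)                                  # patients still at risk
        n1 = sum(1 for _, _, g in at_risk if g == 1)      # of which high risk
        d = sum(1 for u, e, _ in at_risk if e and u == t)  # events at time t
        d1 = sum(1 for u, e, g in at_risk if e and u == t and g == 1)
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0.0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / variance
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical survival times (months), event flags and risk scores.
times = [1, 2, 3, 4, 5, 6]
events = [1, 1, 1, 1, 1, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
groups = stratify(scores, cutoff=0.5)
chi2, p_value = logrank_test(times, events, groups)
```

A small p-value indicates that the survival curves of the high-risk and low-risk groups differ more than would be expected by chance.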
After the fused linear combination model is obtained, the fused risk score is calculated, the cut-off value of the fused risk score is again determined using the R package survminer, and the patients are divided into high-risk and low-risk groups. The relationship between the features of the fusion model and OS is evaluated by survival curves and the log-rank test, and the survival difference between the high-risk and low-risk groups of the fusion model is evaluated (as shown in FIG. 3). Univariate Cox analysis is further performed on the clinicopathological risk factors and the fused multi-omics risk score to obtain the factors that are significant for patient prognosis. All factors with P < 0.05 are then entered into a multivariate Cox analysis to obtain the independent risk factors and to evaluate whether the multi-omics risk score is an independent risk factor. Based on the multivariate Cox analysis, the independent clinicopathological risk factors are included in a clinicopathological nomogram, and a multi-omics fusion-clinical nomogram combines the fused multi-omics risk score with the independent clinicopathological risk factors. The two nomograms are compared to assess the incremental value of the multi-omics fusion model for patient prognosis evaluation. Calibration curves are drawn using the R package rms, the calibration and discrimination performance of the multi-omics fusion-clinical nomogram and the clinicopathological nomogram are compared, and the agreement between observed and predicted survival is compared. The C-index is calculated with the R package survival as an index for evaluating the prognostic model.
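The C-index mentioned above measures how often, among comparable patient pairs, the patient with the higher risk score is the one who dies earlier. The following is a simplified sketch of Harrell's C-index in plain Python. It is not the routine of the R package survival, whose handling of ties may differ, and the data are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored data (simplified tie handling)."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i has an observed event
            # strictly before the follow-up time of patient j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable

# Hypothetical follow-up times, event flags (0 = censored) and risk scores.
times = [1, 2, 3, 4]
events = [1, 1, 0, 1]
c_perfect = concordance_index(times, events, [4, 3, 2, 1])
c_reversed = concordance_index(times, events, [1, 2, 3, 4])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so comparing the C-index of the two nomograms gives a direct measure of discrimination.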
The strengths and weaknesses of the multi-omics fusion model are evaluated using the net reclassification improvement (Net Reclassification Improvement, NRI), and the risk of overfitting of the model is evaluated using the Akaike information criterion (Akaike Information Criterion, AIC). Finally, a decision curve is drawn using the R package rmda to evaluate the practical clinical value of the model.
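For reference, the standard definitions of the two indexes named above are given below; they are general textbook formulas rather than formulas stated in the application. Here k is the number of estimated parameters, \hat{L} is the maximized likelihood of the model, and "up" and "down" denote a patient being reclassified into a higher or lower risk category by the new model relative to the reference model.

```latex
\begin{align}
\mathrm{AIC} &= 2k - 2\ln\hat{L} \\
\mathrm{NRI} &= \bigl[P(\text{up}\mid\text{event}) - P(\text{down}\mid\text{event})\bigr]
              + \bigl[P(\text{down}\mid\text{non-event}) - P(\text{up}\mid\text{non-event})\bigr]
\end{align}
```

A lower AIC favors the model that fits well with fewer parameters, and a positive NRI indicates that the new model reclassifies patients in the correct direction more often than in the wrong one.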
In some embodiments, step S5, in which weighted gene co-expression network analysis and gene set enrichment analysis are used to obtain the intersecting pathways and to determine the key biological pathways and biological behaviors through which the multi-omics fusion model affects prognosis, comprises the following: obtaining the enriched pathways of each method by weighted gene co-expression network analysis (WGCNA) and gene set enrichment analysis (GSEA), respectively; taking the biological pathways shared by the two methods as the intersecting pathways; calculating the GSVA score of each pathway and performing Pearson correlation analysis with the multi-omics fusion model score; studying the relationship between each feature value in the multi-omics fusion model score and the intersecting biological pathways; and comparing the expression differences in the key biological pathways between the high-risk and low-risk groups defined by the multi-omics fusion model, so as to reveal the biological mechanism behind the multi-omics fusion features and provide potential biomarkers for clinical treatment.
WGCNA can be used to analyze gene expression data, identify co-expressed genes, and group them into modules or clusters according to their similar expression patterns. Co-expression network analysis of tumor tissue gene expression data can be performed using the R package WGCNA. The WGCNA workflow is described in detail as follows. (1) Data preprocessing: the data are normalized to eliminate any systematic bias, and low-quality samples are removed. (2) Construction of the gene co-expression network: the correlation matrix is first converted into an adjacency matrix using weighted values of the inter-gene correlations. The adjacency matrix is then transformed into a topological overlap matrix (Topological Overlap Matrix, TOM), which measures the extent to which two genes share neighbors. The TOM is used to construct a network of interconnected gene modules, where each module represents a set of highly co-expressed genes. (3) Module detection: hierarchical clustering is used to identify gene modules in the co-expression network. The TOM-based dissimilarity of each gene pair is calculated first, and genes with similar TOM profiles are then grouped into modules by a clustering algorithm. The resulting dendrogram may be cut at a particular height to create a specified number of gene modules. (4) Module characterization: each gene module has a representative expression profile called the module eigengene (Module Eigengene, ME). It summarizes the gene expression of the module and reflects the expression status of the genes in the module. The ME is the first principal component of the within-module gene expression matrix. Then, based on the GSVA of the samples, the modules significantly correlated with the multi-omics fusion score are determined, with a false discovery rate (False Discovery Rate, FDR) corrected P < 0.05 indicating significance.
To identify the biological pathways and processes enriched within the gene modules, functionally enriched pathways are obtained using the R package "clusterProfiler", with FDR < 0.05 considered statistically significant.
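Steps (2) and (4) of the workflow above can be sketched with NumPy as follows. This is an illustration of the unsigned adjacency, the topological overlap and the module eigengene, not the code of the R package WGCNA. The soft-thresholding power `beta=6` is an assumed value, and the expression matrix is random placeholder data.

```python
import numpy as np

def tom_similarity(expr, beta=6):
    """Unsigned topological overlap matrix for a samples x genes matrix."""
    corr = np.corrcoef(expr, rowvar=False)
    adjacency = np.abs(corr) ** beta        # soft-thresholded adjacency
    np.fill_diagonal(adjacency, 0.0)
    shared = adjacency @ adjacency          # strength via shared neighbours
    connectivity = adjacency.sum(axis=0)    # total connection strength per gene
    k_min = np.minimum.outer(connectivity, connectivity)
    tom = (shared + adjacency) / (k_min + 1.0 - adjacency)
    np.fill_diagonal(tom, 1.0)
    return tom

def module_eigengene(expr_module):
    """First principal component of the standardized module expression.

    Returns one value per sample; the sign of the component is arbitrary.
    """
    z = (expr_module - expr_module.mean(axis=0)) / expr_module.std(axis=0)
    u, _, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, 0]

rng = np.random.default_rng(0)
expr = rng.normal(size=(20, 5))             # placeholder: 20 samples, 5 genes
tom = tom_similarity(expr)
dissimilarity = 1.0 - tom                   # input to hierarchical clustering
eigengene = module_eigengene(expr)
```

The dissimilarity matrix is what step (3) clusters, and the eigengene of each resulting module is what would be correlated with the fusion score.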
GSEA is a method for analyzing gene expression data to identify differentially expressed gene sets. The GSEA workflow is described in detail as follows. (1) Data preprocessing: the data are normalized to eliminate any systematic bias and to scale the expression values; the resulting expression matrix should contain only high-quality samples and genes. (2) Selection of gene sets: gene sets representing particular biological pathways, processes or functions are selected from publicly available databases (KEGG, Reactome, Gene Ontology, Hallmark, PID, BioCarta, WikiPathways). (3) Calculation of the enrichment score: the expression of genes within the gene set is compared with the expression of genes outside the gene set to determine whether the gene set is significantly enriched under specific biological conditions. The enrichment score (Enrichment Score, ES) is calculated to reflect the extent to which the genes in the gene set are over-represented at the top or bottom of the list of genes ranked by differential expression between the two conditions. (4) Statistical significance testing: the sample labels are randomly permuted and the enrichment score is recalculated for each gene set; this is repeated many times to generate a null distribution of enrichment scores, which is used to calculate a normalized enrichment score (Normalized Enrichment Score, NES) for each gene set. FDR < 0.05 and |NES| > 1 are taken to indicate significance. The GSVA values corresponding to the pathways are then calculated, and their correlation with the multi-omics fusion score is evaluated, with FDR < 0.05 indicating statistical significance.
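The enrichment score of step (3) is a weighted running-sum statistic over the ranked gene list. The sketch below illustrates that calculation with NumPy. It omits the permutation test and normalization of step (4), assumes a non-empty gene set that does not cover the whole list, and uses hypothetical gene names and ranking values.

```python
import numpy as np

def enrichment_score(ranked_genes, ranking_metric, gene_set, weight=1.0):
    """Running-sum enrichment score for one gene set.

    `ranked_genes` must already be sorted by `ranking_metric` (descending).
    """
    in_set = np.array([g in gene_set for g in ranked_genes])
    metric = np.abs(np.asarray(ranking_metric, dtype=float)) ** weight
    # Step up at genes in the set, in proportion to their ranking metric.
    hit = np.where(in_set, metric, 0.0)
    hit = hit / hit.sum()
    # Step down by a constant amount at genes outside the set.
    miss = np.where(in_set, 0.0, 1.0 / (~in_set).sum())
    running = np.cumsum(hit - miss)
    # The score is the maximum deviation of the running sum from zero.
    return running[np.argmax(np.abs(running))]

# Hypothetical ranked list: genes sorted by a differential-expression metric.
genes = ["a", "b", "c", "d"]
metric = [4.0, 3.0, -1.0, -2.0]
es_top = enrichment_score(genes, metric, {"a", "b"})
es_bottom = enrichment_score(genes, metric, {"c", "d"})
```

A positive score means the set is concentrated at the top of the ranking and a negative score means it is concentrated at the bottom.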
The enriched pathways of each method are obtained using the WGCNA and GSEA analysis methods, respectively. To make the result more reliable, accurate and reproducible, and to reduce abnormal data and errors, the application takes the pathways shared by the two methods. The GSVA score of each pathway is then calculated, and Pearson correlation analysis is performed with the prognostic multi-omics fusion score. Interpreting the multi-omics fusion features in terms of the related biological pathways reveals the biological mechanism behind them. The differences in biological pathways between the high-risk and low-risk groups defined by the multi-omics fusion features are then examined, providing potential biomarkers for clinical treatment.
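Taking the shared pathways and correlating a pathway score with the fusion score can be sketched as follows. The pathway names, GSVA scores and fusion scores are hypothetical placeholders, not results of the application.

```python
import numpy as np

# Hypothetical sets of significantly enriched pathways from each method.
wgcna_pathways = {"PATHWAY_A", "PATHWAY_B", "PATHWAY_C"}
gsea_pathways = {"PATHWAY_B", "PATHWAY_C", "PATHWAY_D"}

# Pathways supported by both analyses.
crossing_pathways = sorted(wgcna_pathways & gsea_pathways)

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    return float(np.corrcoef(x, y)[0, 1])

# Hypothetical per-patient GSVA scores of one shared pathway and the
# corresponding multi-omics fusion scores.
gsva_scores = [0.1, 0.4, 0.2, 0.8]
fusion_scores = [1.0, 2.5, 1.5, 4.5]
r = pearson_r(gsva_scores, fusion_scores)
```

Restricting the correlation analysis to the intersection means a pathway must be supported by both enrichment methods before it is interpreted.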
On this basis, the multi-omics fusion IDH wild-type glioblastoma prognosis evaluation method mainly comprises four core steps: single-omics feature extraction and preprocessing, single-omics prognosis model construction, multi-omics fusion model development, and bioinformatics analysis of the multi-omics fusion model. During single-omics feature extraction and preprocessing, the pathology, gene and transcription data of the study cohort are collected. Pathomic feature extraction comprises key steps such as tissue section scanning, WSI image segmentation and feature extraction (as shown in FIG. 2A and FIG. 2B), and these features provide the foundation for constructing the subsequent prognosis models.
When the single-omics prognosis models are constructed, the extracted single-omics features are first preprocessed to eliminate noise and inconsistency in the data. The single-omics prognosis models are then constructed from these features, and model performance is verified by rigorous statistical methods to ensure the accuracy and reliability of the models (as shown in FIG. 3). In addition, the application performs interpretability analysis on the key features used to construct the models, so as to reveal the contribution of individual features to the prognosis model.
When the multi-omics fusion model is constructed, the application further integrates the data of the multiple omics on the basis of the single-omics analysis and constructs a multi-omics fusion prognosis model. By jointly considering information from different data sources, the fusion model improves the accuracy of prognosis evaluation and its clinical value, and the fusion model is comprehensively validated (as shown in FIG. 4) to ensure its generalization ability across different patient groups.
Finally, bioinformatics analysis is performed on the multi-omics fusion model. The application carries out in-depth analysis of the differentially expressed genes (RNA sequencing data) between the high-risk and low-risk groups defined by the fusion model. Through GSEA and WGCNA analysis, the application identifies the biological pathways significantly associated with the multi-omics fusion score. This step not only reveals the key biological mechanisms behind the multi-omics fusion model, but also provides important information on potential therapeutic targets for glioblastoma.
Referring to FIG. 4, the embodiment of the application also provides a multi-omics fusion IDH wild-type glioblastoma prognosis evaluation system, which comprises a data set module 101, a multi-omics feature module 102, a single-omics prognosis model module 103, a multi-omics fusion model module 104 and a data analysis module 105 connected in sequence. The data set module 101 is used for acquiring a data set, the data set comprising an IDH wild-type glioblastoma patient data set. The multi-omics feature module 102 is used for obtaining pathology images, gene data and transcription data from the data set and extracting omics features from the pathology images, the gene data and the transcription data respectively, the omics features comprising pathomic features, mutation features, copy number variation features and transcriptomic features. The single-omics prognosis model module 103 is used for constructing, from the omics features, the single-omics prognosis model corresponding to each omics feature, the single-omics prognosis models comprising a pathomic prognosis model, a mutation prognosis model, a copy number variation prognosis model and a transcriptomic prognosis model. The multi-omics fusion model module 104 is used for screening the omics features of the single-omics prognosis models to obtain multi-omics fusion features, constructing a machine learning-based multi-omics fusion model from the multi-omics fusion features, and stratifying the patients by risk based on the multi-omics fusion model into a high-risk group and a low-risk group. The data analysis module 105 is used for obtaining the intersecting pathways by weighted gene co-expression network analysis and gene set enrichment analysis, and for determining the key biological pathways and biological behaviors through which the multi-omics fusion model affects prognosis.
The multi-omics fusion IDH wild-type glioblastoma prognosis evaluation method and system provided by the embodiment of the application have the following features. First, cases are reclassified according to the integrated diagnosis proposed by CNS5, which places greater emphasis on the role of molecular markers in the classification of nervous system tumors, and the molecular diagnosis is determined to be IDH wild-type glioblastoma. A machine learning-based multi-omics model is then developed, and its risk stratification of glioblastoma is validated. Second, the multi-omics biological analysis uses both the WGCNA and GSEA methods to analyze disease-related biological pathways from different perspectives and cross-validates the results. The pathways shared by the two methods are then taken to reveal the biological basis behind the multi-omics model, which enhances the reliability and accuracy of the enriched pathways. Third, the multi-omics model analyzes the pathomic, gene and transcription information of glioblastoma patients. By further integrating this information, the application studies the biological mechanism of glioblastoma and the related signaling pathways, which helps to better understand the physiological mechanism of the disease.
The multi-omics fusion IDH wild-type glioblastoma prognosis evaluation method and system provided by the embodiment of the application also have the following advantages. One previous study, by training a machine learning model on the genomic features of glioblastoma, revealed the correlation between the MYC gene and the treatment prognosis of glioblastoma. Another study combined the proteomics and genomics of gliomas and revealed patterns of glioma development and regulation. These results show that multi-omics fusion analysis has broad application prospects and important significance in clinical and basic research. In the application, a multi-omics fusion model is established that combines the features of pathology, proteomics and genomics to study the biological mechanism and prognosis evaluation of glioblastoma in depth. Further analysis shows that fusing multi-omics information can increase the prognostic value and improve the accuracy and stability of the model.
Previous genomics studies of gliomas have shown that, in glioblastoma, prognostic radiological features are associated with hypoxia pathways. Previous studies focused mainly on the analysis of a single omics, whereas the application applies a multi-omics fusion analysis model to increase the predictive value for glioblastoma. WGCNA is a powerful method for analyzing radiogenomics data; it identifies gene modules co-expressed with the multi-omics fusion model and thereby provides a way to identify the potential molecular mechanisms of the multi-omics fusion model. GSEA is a widely used bioinformatics tool for analyzing gene expression data in the context of predefined gene sets or pathways. Because the two methods differ in emphasis and in their approach to gene enrichment, the application combines the WGCNA and GSEA analysis methods, which improves the reproducibility and reliability of biological pathway identification in the analysis of the multi-omics fusion model.
Through the above technical scheme, the embodiment of the application provides a multi-omics fusion IDH wild-type glioblastoma prognosis evaluation method and system. The method comprises the following steps. First, a data set is acquired, the data set comprising an IDH wild-type glioblastoma patient data set. Pathology images, gene data and transcription data are obtained from the data set, and omics features are extracted from the pathology images, the gene data and the transcription data respectively, the omics features comprising pathomic features, mutation features, copy number variation features and transcriptomic features. Next, the single-omics prognosis model corresponding to each omics feature is constructed from the omics features, the single-omics prognosis models comprising a pathomic prognosis model, a mutation prognosis model, a copy number variation prognosis model and a transcriptomic prognosis model. The omics features of the single-omics prognosis models are then screened to obtain multi-omics fusion features, a machine learning-based multi-omics fusion model is constructed from the multi-omics fusion features, and the patients are stratified by risk based on the multi-omics fusion model into a high-risk group and a low-risk group. Finally, weighted gene co-expression network analysis and gene set enrichment analysis are used to obtain the intersecting pathways and to determine the key biological pathways and biological behaviors through which the multi-omics fusion model affects prognosis.
In the multi-omics fusion IDH wild-type glioblastoma prognosis evaluation method provided by the application, on the one hand, a machine learning-based multi-omics model is established and its risk stratification of glioblastoma is validated. On the other hand, the multi-omics biological analysis uses both the WGCNA and GSEA methods to analyze disease-related biological pathways from different perspectives and cross-validates the results. The pathways shared by the two methods are then taken to reveal the biological basis behind the multi-omics model, which enhances the reliability and accuracy of the enriched pathways. In addition, the multi-omics model provided by the application analyzes the pathomic features, genes and transcription information of glioblastoma patients. By further integrating this information, the application studies the biological mechanism of glioblastoma and the related signaling pathways, which helps to better understand the physiological mechanism of the disease.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples of carrying out the application, and that various changes in form and detail may be made by one skilled in the art without departing from the spirit and scope of the application. The scope of the application is therefore intended to be limited only by the appended claims.