CN117116345A


Info

Publication number: CN117116345A
Application number: CN202310909405.6A
Authority: CN
Inventors: 朱云平; 韩明飞; 陈洨清; 陈涛; 徐小放
Original assignee: Academy of Military Medical Sciences AMMS of PLA
Current assignee: Academy of Military Medical Sciences AMMS of PLA
Priority date: 2023-07-24
Filing date: 2023-07-24
Publication date: 2023-11-24
Anticipated expiration: 2043-07-24
Also published as: CN117116345B

Abstract

The invention discloses a method for constructing a patient survival network based on a gene regulatory network, comprising the following steps: 1) obtaining a gene expression matrix; 2) constructing a gene regulatory network based on the gene expression matrix; 3) deleting edges whose credibility is lower than a set threshold from the gene regulatory network; 4) evaluating the co-expression stability of each gene in the gene regulatory network in each target cancer patient sample; 5) for each gene, ranking the patients by the co-expression stability of the gene in each target cancer patient sample, and performing survival analysis on the survival information of the top T% and bottom T% of patients in that ranking to obtain a log-rank test P value for the gene; then judging from the P value whether the two groups differ statistically in survival time, and retaining the gene if they do; 6) constructing a survival network of the target cancer from the genes retained in step 5) and the edges and genes of the gene regulatory network that connect the retained genes.

Description

Translated from Chinese

A method for constructing a patient survival network based on a gene regulatory network

Technical field

The invention belongs to the fields of molecular biology and systems biology, and relates to a method for constructing a patient survival network based on a gene regulatory network.

Background

In complex disease research, survival analysis is widely used to identify disease markers related to patient survival and prognosis, thereby guiding disease screening, early diagnosis, and personalized medical decisions. Traditional survival analysis has two main steps: first, patients are ranked by the expression level of a specific gene; then the log-rank test is used to evaluate whether survival time differs significantly between the top-ranked and bottom-ranked halves (or quarters) of patients. Genes significantly associated with patient survival are called cancer survival genes, and they are often closely related to cancer development and prognosis. However, traditional survival analysis has two limitations:

1) It is difficult to rank patients accurately and stably by gene expression level. First, substantial individual variation means that the expression level of a gene is not directly comparable across patients; in addition, complex in vivo and in vitro factors make the expression level of a single gene unstable.

2) It is difficult to discover survival-related regulators (transcription factors and small RNAs) from expression levels. First, many regulators (especially miRNAs) are expressed at very low levels in tumor tissue, which makes it difficult to quantify them accurately and to rank patients by their expression; in addition, many regulators affect target gene expression, and thereby cancer progression, through mechanisms other than changes in their own expression level (such as protein structure and the microenvironment).

Genes do not function independently; they interact and cooperate within a complex gene regulatory network (GRN). The edges of a GRN represent a variety of interactions and functional associations, such as physical interactions (DNA-DNA interactions, protein-DNA interactions, protein-protein interactions), genetic interactions (two or more genes associated with the same trait), and participation in the same biological process or signaling pathway. Compared with gene expression levels, a GRN has the following advantages:

1) A GRN reflects the stable functional associations and regulatory architecture of genes across multiple patients and is less affected by individual differences;

2) Compared with the expression level of a single gene, a network composed of multiple genes has a higher data dimension, which reduces the randomness of the results;

3) With a GRN, the expression level of a regulator can be ignored, and the relationship between the regulator and patient survival can instead be inferred in reverse from the regulator's target genes.

In summary, we believe that survival analysis based on a GRN can effectively address the limitations of traditional survival analysis and significantly expand the discovery of cancer prognostic markers.

Summary of the invention

In view of the technical problems in existing survival analysis methods, the purpose of the present invention is to provide a method for constructing a survival network based on a gene regulatory network. The invention gives each GRN node a new attribute, called co-expression stability. Genes connected to each other in a GRN often have similar expression patterns (their expression levels rise and fall together across multiple samples); this phenomenon is called co-expression. Co-expressed genes are often functionally related or involved in the same biological process. Based on this feature, the co-expression stability of a gene in the GRN is measured by the difference in expression between the gene and all of its adjacent genes (Z-Score normalization ensures that the expression levels of different genes are comparable). The smaller the expression difference, the higher the co-expression stability of the gene, and the functional module formed by the gene and its adjacent genes operates normally; the greater the expression difference, the lower the co-expression stability of the gene, and the functional module formed by the gene and its adjacent genes is dysregulated. In summary, the co-expression stability of a gene is closely related to its functional stability. When the co-expression stability of a gene across patients is significantly associated with patient survival time, the gene is considered to play an important role in cancer progression.

Based on the above principles, we established a GRN-based survival analysis strategy. The method takes as input the gene expression data of cancer patients (microarray data, RNA sequencing data, protein mass spectrometry data) and their survival information (obtained from medical records, follow-up surveys, and the like; some large-scale cancer research projects such as TCGA also provide patient survival information). The main analysis steps include GRN construction, co-expression stability assessment, patient ranking, and survival difference assessment.

Step 1) Obtain gene expression data by experimental means or directly from public databases. The data are also called a gene expression matrix: the rows of the matrix represent genes, the columns represent patients, and each value represents the expression level of a gene in a specific patient, either the level of transcribed RNA or the level of translated protein. Experimental means include measuring RNA levels in biological samples by high-throughput sequencing, or measuring protein levels in biological samples by mass spectrometry; public databases include Gene Expression Omnibus (GEO), The Cancer Genome Atlas Program (TCGA), and ArrayExpress.

Step 2) Construct the GRN based on the gene expression matrix. Existing GRN inference methods mainly include clustering algorithms (hierarchical clustering, graph clustering, etc.), machine learning algorithms (Bayesian algorithms, random forests, etc.), and deep learning algorithms (convolutional neural networks, transfer learning, etc.).

Step 3) Optimize the GRN using experimental means or interaction databases. The purpose is to delete edges with low credibility and retain only interactions that have been experimentally verified or are recorded in public databases, thereby ensuring the accuracy of subsequent analysis. Experimental means that can be used to optimize the GRN include: predicting transcription factor-target gene interactions by co-immunoprecipitation, and predicting protein-protein interactions by yeast two-hybrid, fluorescence resonance energy transfer, surface plasmon resonance, mass spectrometry, and related technologies. Interaction databases that can be used to optimize the GRN include: a chromatin interaction database (4DGenome), transcription factor-target gene databases (TRRUST and hTFtarget), small RNA-target gene databases (miRDB and miRTarBase), protein-protein interaction databases (STRING and HuRI), and pathway databases (KEGG and Reactome).

Step 4) Evaluate the co-expression stability of each node in the GRN (each node corresponds to one gene) in different patients. The specific steps are: first obtain the adjacent genes of each gene in the GRN; apply Z-Score normalization to the expression levels of each gene across all patients, so that the expression levels of different genes are comparable; then evaluate the co-expression stability of each gene based on its adjacent genes (see the detailed embodiments for details).

Step 5) For each gene in the gene expression matrix, rank the patients by the co-expression stability of the gene in each patient, take the survival information of the two groups of patients in the top 1/4 and bottom 1/4 of the co-expression stability ranking, perform Kaplan-Meier survival analysis, and obtain the log-rank test P value of the gene; then use the P value to evaluate whether the survival times of the two groups differ statistically. P≤0.05 indicates a statistical difference, meaning that the co-expression stability of the gene significantly affects patient survival time; P>0.05 indicates that the co-expression stability of the gene does not significantly affect patient survival time.

Step 6) Retain the genes with log-rank test P≤0.05, together with the edges in the GRN that connect these genes and the genes joined by those edges, and use the Cytoscape tool to construct the survival network of the target cancer. In this network, the co-expression levels of connected nodes are perturbed between patients, and these perturbations significantly affect patient survival time, which gives the network substantial research value.

Based on the above, the technical solution of the present invention is:

A method for constructing a patient survival network based on a gene regulatory network, the steps of which include:

1) Obtain a gene expression matrix, wherein the rows of the gene expression matrix are genes, the columns are target cancer patient samples, and the element in the mth row and nth column represents the expression level of the mth gene in the nth target cancer patient; obtain the patient survival information corresponding to each target cancer patient sample;

2) Construct a gene regulatory network based on the gene expression matrix;

3) For each edge in the gene regulatory network, delete the edge if its credibility is lower than a set threshold;

4) Evaluate the co-expression stability, in each target cancer patient sample, of each gene in the gene regulatory network optimized in step 3);

5) For each gene in the gene expression matrix, rank the target cancer patient samples by the co-expression stability of the gene in each sample; take the survival information of the target cancer patient samples in the top T% of the co-expression stability ranking as the first group of information, and the survival information of the samples in the bottom T% as the second group of information; perform survival analysis on the first and second groups of information to obtain the log-rank test P value of the gene; then use the P value to determine whether the survival times of the patients in the top T% and bottom T% of samples differ statistically; if there is a statistical difference, retain the gene;

6) Construct the survival network of the target cancer from the genes retained in step 5) and the edges and genes in the gene regulatory network that connect the retained genes.

Further, the co-expression stability of each gene in each target cancer patient sample is obtained as follows: first obtain the adjacent genes of each gene in the gene regulatory network; then obtain the expression levels of each adjacent gene across all patients in the gene expression matrix and apply Z-Score normalization to them; then evaluate the co-expression stability of each gene in each target cancer patient sample based on the Z-Score-normalized adjacent genes of that gene.

Further, the co-expression stability of each gene in each target cancer patient sample is obtained as follows: for each gene g₀ in the gene expression matrix, obtain the M adjacent genes {g₁, …, g_M} of g₀ from the gene regulatory network; normalize g₀ and its adjacent genes, where Z_i = {v_{i,1}, …, v_{i,n}} denotes the expression of the ith gene g_i after Z-Score normalization, i∈[0,M]∩Z, and n denotes the number of target cancer patient samples; when the expression patterns of g₀ and g_i are positively correlated, the co-expression stability of gene g₀ in target cancer patient sample j is given by a first formula, and otherwise by a second formula [both formulas are images in the original publication and are not reproduced in this text], where v_{i,j} denotes the expression level of gene g_i in target cancer patient sample j, j∈{1,…,n}.

Further, if the log-rank test P value of a gene satisfies P≤0.05, the survival times of the patients in the top T% and bottom T% of target cancer patient samples are determined to differ statistically for that gene.

Further, experimental means or interaction databases are used to optimize the gene regulatory network, and edges in the gene regulatory network whose credibility is lower than a set threshold are deleted.

Further, the experimental means include: predicting transcription factor-target gene interactions by co-immunoprecipitation, yeast two-hybrid, fluorescence resonance energy transfer, surface plasmon resonance, and mass spectrometry; the interaction databases include: chromatin interaction databases, transcription factor-target gene databases, small RNA-target gene databases, protein-protein interaction databases, and pathway databases.

A server, characterized in that it includes a memory and a processor, the memory stores a computer program, the computer program is configured to be executed by the processor, and the computer program includes instructions for executing each step of the above method.

A computer-readable storage medium on which a computer program is stored, characterized in that the steps of the above method are implemented when the computer program is executed by a processor.

The invention has the following advantages:

1) It addresses the insufficient stability of traditional survival analysis. Compared with gene expression levels, the topological characteristics of genes in the GRN reflect the physiological state of a patient more stably. First, the GRN reflects the stable functional associations and regulatory architecture of genes across multiple patients, so it is less affected by individual differences; in addition, compared with the expression level of a single gene, a network composed of multiple genes has a higher data dimension, which reduces the randomness of the results.

2) Regulators that drive cancer progression (transcription factors and small RNAs) can be inferred in reverse from survival genes. The survival genes obtained with the new method have different co-expression levels in different patients, and regulators are one of the main causes of target gene co-expression. In other words, whether a regulator is active causes its target genes to have different co-expression levels in different patients, which in turn affects patient survival. Therefore, the transcription factors or small RNAs that target survival genes are closely related to cancer progression and patient survival. The more survival genes a transcription factor or small RNA regulates, the more important its role in cancer and the higher the credibility of that role.
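The reverse inference described in advantage 2) amounts to counting, for each regulator, how many of its targets are survival genes. The following is a minimal sketch of that counting step only; the function name `rank_regulators`, the input layout, and the gene names in the usage note are hypothetical and are not taken from the original publication.

```python
from collections import Counter

def rank_regulators(regulator_targets, survival_genes):
    """Rank regulators by the number of survival genes among their targets.

    regulator_targets: dict mapping a regulator name to an iterable of its
        target genes (hypothetical layout, e.g. parsed from a TF-target table).
    survival_genes: iterable of genes retained by the survival analysis.
    Returns a list of (regulator, count) pairs, highest count first.
    """
    survival = set(survival_genes)
    counts = Counter()
    for regulator, targets in regulator_targets.items():
        hits = len(set(targets) & survival)
        if hits > 0:  # regulators with no survival-gene targets are omitted
            counts[regulator] = hits
    return counts.most_common()
```

For example, with the made-up input `{"TF_A": ["g1", "g2", "g3"], "miR_B": ["g2"], "TF_C": ["g9"]}` and survival genes `["g1", "g2", "g3"]`, the sketch ranks `TF_A` (three survival targets) ahead of `miR_B` (one) and drops `TF_C`.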

Description of drawings

Figure 1 is a flow chart of constructing the gene regulatory network according to the present invention.

Figure 2 is a flow chart of constructing the patient survival network according to the present invention.

Figure 3 is a schematic diagram of the algorithm of the present invention.

Detailed description

The present invention is further described below with reference to the accompanying drawings and specific embodiments.

The process of the present invention is shown in Figures 1 and 2. Assume that the gene expression data of a target cancer contain m different genes and n patient samples, and that the survival information of these n patients has also been obtained.

Step 1) Construct a gene regulatory network of the m genes using bioinformatics methods. Available GRN inference methods include clustering algorithms, machine learning algorithms, and deep learning algorithms. Clustering algorithms include hierarchical clustering (WGCNA), graph clustering (MCL), etc.; machine learning algorithms include Bayesian algorithms (BANJO, CLR), random forests (GENIE3, ReNI), etc.; deep learning algorithms include convolutional neural networks (DeepInsight, DeepFeature), transfer learning (Geneformer), etc.

Step 2) Optimize the GRN using experimental means or interaction databases, retaining the interactions with higher credibility. Experimental means include predicting transcription factor-target gene interactions by co-immunoprecipitation, and predicting protein-protein interactions by yeast two-hybrid, fluorescence resonance energy transfer, surface plasmon resonance, mass spectrometry, and related technologies. Public databases include the chromatin interaction database 4DGenome, the transcription factor-target gene databases TRRUST and hTFtarget, the small RNA-target gene databases miRDB and miRTarBase, the protein-protein interaction databases STRING and HuRI, and the pathway databases KEGG and Reactome.

Step 3) Obtain the adjacent genes of each gene in the GRN. A gene and its adjacent genes can be connected through various interactions, including physical interactions (DNA-DNA interactions, protein-DNA interactions, protein-protein interactions), genetic interactions (two or more genes associated with the same trait), and co-regulation (being targeted by the same transcription factor or small RNA, or participating in the same biological process or signaling pathway).
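Steps 2) and 3) can be illustrated with a minimal sketch. It assumes, purely for illustration, that the inferred GRN is held as a dictionary of undirected edges with a confidence score in [0, 1], that the reference interactions have already been loaded into a set, and that an edge is kept only when it passes both checks. The function names, the data layout, and the default threshold are hypothetical; the original text does not prescribe them.

```python
def prune_grn(edges, supported, min_confidence=0.5):
    """Drop low-credibility edges from an inferred GRN.

    edges: dict mapping (gene_a, gene_b) -> confidence score (assumed layout).
    supported: set of frozenset({gene_a, gene_b}) pairs found in a reference
        interaction resource (assumed to be loaded elsewhere).
    An edge is kept only if its score reaches min_confidence AND the pair is
    present in the reference set (an illustrative rule, not the patent's).
    """
    kept = {}
    for (a, b), confidence in edges.items():
        if confidence >= min_confidence and frozenset((a, b)) in supported:
            kept[(a, b)] = confidence
    return kept

def adjacent_genes(edges):
    """Build an undirected adjacency map: gene -> set of adjacent genes."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency
```

Treating edges as undirected is a simplification; a directed GRN would need separate handling of regulators and targets.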

Step 4) Apply Z-Score normalization to the expression levels of each gene across patients. The purpose is to remove differences in absolute expression level between genes and retain only the relative change of each gene across patients, so that the expression levels of different genes are comparable.
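A minimal sketch of the per-gene Z-Score normalization follows. It assumes the expression matrix is a dictionary from gene name to a list of per-patient values, and it uses the population standard deviation; the original text does not say whether the population or sample form is intended, so that choice, like the function name `zscore_rows`, is an assumption.

```python
from statistics import mean, pstdev

def zscore_rows(matrix):
    """Z-Score normalize each gene (row) across patients (columns).

    matrix: dict mapping gene -> list of expression values, one per patient.
    Genes with zero variance are mapped to all zeros to avoid dividing by 0.
    """
    normalized = {}
    for gene, values in matrix.items():
        mu = mean(values)
        sd = pstdev(values)  # population standard deviation (assumption)
        if sd > 0:
            normalized[gene] = [(v - mu) / sd for v in values]
        else:
            normalized[gene] = [0.0 for _ in values]
    return normalized
```

After this step every gene has mean 0 and unit variance across patients, so a value such as 1.2 means "1.2 standard deviations above this gene's average" regardless of the gene's absolute expression level.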

Step 5) Evaluate the co-expression stability of each gene with respect to its adjacent genes. Co-expression stability expresses how closely a gene is correlated with all of its adjacent genes, and it indirectly reflects the functional stability of the gene. Assume that gene g₀ has M adjacent genes, denoted {g₁, …, g_M}. Z_i = {v_{i,1}, …, v_{i,n}} denotes the expression of gene g_i (i∈[0,M]∩Z) after Z-Score normalization, where n is the number of patients, and j∈{1,…,n} is the patient index. The co-expression stability S_{0,j} of gene g₀ in patient j is computed from one bracketed term per adjacent gene g_i [the formula is an image in the original publication and is not reproduced in this text], where v_{i,j} denotes the expression level of gene g_i in patient j. When the expression patterns of g₀ and g_i are positively correlated (Pearson correlation coefficient greater than 0), the bracketed term is (v_{0,j} − v_{i,j})²; the smaller this value, the more significant the positive correlation between g₀ and g_i. When the expression patterns of g₀ and g_i are negatively correlated (Pearson correlation coefficient less than 0), their expression levels should be one high and one low, which after Z-Score normalization appears as one positive and one negative value. The bracketed term is therefore (v_{0,j} + v_{i,j})²; the smaller this value, the more significant the negative correlation between g₀ and g_i. In summary, the smaller S_{0,j}, the stronger the positive or negative correlation between g₀ and its adjacent genes, and the higher the functional stability of g₀. This yields the co-expression stability of each gene in the gene expression matrix in each patient.
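The per-patient score can be sketched as follows. The text fixes the bracketed terms, (v_{0,j} − v_{i,j})² for positively correlated neighbours and (v_{0,j} + v_{i,j})² for negatively correlated ones, but the formula that combines them is not reproduced here. This sketch therefore assumes a plain average over the M adjacent genes; the original may use a sum, a root, or another aggregate. The function names are hypothetical, and a correlation of exactly zero, which the text does not cover, is treated as negative.

```python
def _correlation_sign(x, y):
    """Return +1.0 if x and y are positively correlated, else -1.0.

    The sign of the Pearson coefficient equals the sign of the covariance,
    so only the covariance is computed. Zero is treated as negative here
    (an assumption; the source text leaves this case unspecified).
    """
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return 1.0 if cov > 0 else -1.0

def stability_scores(z, gene, neighbours):
    """Score S_{0,j} for one gene in every patient j (lower = more stable).

    z: dict mapping gene -> Z-Score normalized values, one per patient.
    neighbours: adjacent genes of `gene` in the pruned GRN.
    Aggregation by mean over neighbours is an ASSUMPTION of this sketch.
    """
    if not neighbours:
        raise ValueError("gene has no adjacent genes in the network")
    n = len(z[gene])
    totals = [0.0] * n
    for nb in neighbours:
        sign = _correlation_sign(z[gene], z[nb])
        for j in range(n):
            # positive correlation: (v0 - vi)^2; negative: (v0 + vi)^2
            totals[j] += (z[gene][j] - sign * z[nb][j]) ** 2
    return [t / len(neighbours) for t in totals]
```

Because the next step only ranks patients by this score, a monotone change in the aggregate (for example a sum instead of a mean) would leave the ranking, and hence the top and bottom groups, unchanged.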

Step 6) Survival analysis based on co-expression stability. For each gene in the gene expression matrix, rank the patients by the co-expression stability of the gene in each patient, perform Kaplan-Meier survival analysis on the two groups of patients in the top 1/4 and bottom 1/4 of the ranking, and obtain the log-rank test P value; then use the P value to evaluate whether the survival times of the two groups differ statistically. P≤0.05 indicates a statistical difference, meaning that the co-expression stability of the gene significantly affects patient survival time; P>0.05 indicates that the co-expression stability of the gene does not significantly affect patient survival time.
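The group split and the log-rank comparison can be sketched in plain Python. This is a textbook two-group log-rank test written out for illustration, with the P value taken from the chi-square distribution with one degree of freedom; it is not taken from the original publication, and a maintained survival-analysis library would normally be preferred in practice. The function names and the event coding (1 = death observed, 0 = censored) are assumptions.

```python
import math

def split_top_bottom(scores, fraction=0.25):
    """Return (indices with the lowest scores, indices with the highest)."""
    order = sorted(range(len(scores)), key=lambda j: scores[j])
    k = max(1, int(len(scores) * fraction))
    return order[:k], order[-k:]

def logrank_p(times_a, events_a, times_b, events_b):
    """Two-group log-rank test P value.

    times_*: follow-up times; events_*: 1 if death observed, 0 if censored.
    """
    all_times = list(times_a) + list(times_b)
    all_events = list(events_a) + list(events_b)
    event_times = sorted({t for t, e in zip(all_times, all_events) if e})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        at_risk_a = sum(1 for x in times_a if x >= t)
        at_risk_b = sum(1 for x in times_b if x >= t)
        deaths_a = sum(1 for x, e in zip(times_a, events_a) if e and x == t)
        deaths_b = sum(1 for x, e in zip(times_b, events_b) if e and x == t)
        n = at_risk_a + at_risk_b
        d = deaths_a + deaths_b
        if n < 2:
            continue  # variance term is undefined with one subject at risk
        observed_minus_expected += deaths_a - d * at_risk_a / n
        variance += d * (at_risk_a / n) * (at_risk_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0
    chi_square = observed_minus_expected ** 2 / variance
    # Survival function of chi-square with 1 degree of freedom
    return math.erfc(math.sqrt(chi_square / 2))
```

A gene would be retained when `logrank_p` for its two groups is at most 0.05, matching the threshold stated in the text. Note that testing many genes at this threshold without multiple-testing correction will retain some genes by chance; the original text does not describe a correction.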

As shown in Figure 3, the upper path represents the traditional survival analysis algorithm, which ranks patients by the expression level of gene g₀ and uses Kaplan-Meier (KM) survival curves to compare the survival of the top and bottom quarters of patients; the lower path represents the new survival analysis algorithm, which estimates the co-expression stability of g₀ from its adjacent genes in the gene regulatory network, ranks patients by the co-expression stability of g₀, and uses KM survival curves to compare the survival of the top and bottom quarters of patients.

Step 7) Construct the cancer survival network. The genes with log-rank test P≤0.05 and the edges in the GRN that connect these genes are retained to form the survival network of the target cancer. In this network, the co-expression levels of connected nodes are perturbed between patients, and these perturbations significantly affect patient survival time, which gives the network substantial research value.
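The final filtering step can be sketched as below. The text can be read in two ways: keep only edges whose two endpoints are both survival genes, or also keep edges that join a survival gene to a non-survival neighbour. The sketch exposes both readings through a flag. The function name, the flag, and the data layout are hypothetical; writing the result to a file for Cytoscape is left out.

```python
def survival_network(edges, p_values, alpha=0.05, require_both=True):
    """Select survival genes and the GRN edges that connect them.

    edges: iterable of (gene_a, gene_b) pairs from the pruned GRN.
    p_values: dict mapping gene -> log-rank test P value.
    require_both: if True, keep an edge only when both endpoints are
        survival genes; if False, one survival endpoint is enough
        (both readings are interpretations of the source text).
    Returns (set of survival genes, list of kept edges).
    """
    survival_genes = {g for g, p in p_values.items() if p <= alpha}
    kept_edges = []
    for a, b in edges:
        hits = (a in survival_genes) + (b in survival_genes)
        if hits == 2 or (hits == 1 and not require_both):
            kept_edges.append((a, b))
    return survival_genes, kept_edges
```

The returned edge list can be saved as a two-column table, a layout that network visualization tools commonly import.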

In summary, to address the shortcomings of traditional survival analysis, the present invention gives GRN nodes a new attribute, co-expression stability, and establishes an association between co-expression stability and patient survival. Notably, the survival genes discovered by the new method act through a different mechanism from those discovered by traditional methods: traditional survival genes affect patient survival through their own expression levels, whereas the survival genes found here affect patient survival through perturbations in the GRN. The significance of the new method therefore lies not in replacing traditional survival analysis but in extending the discovery of cancer survival genes along a new dimension, complementing traditional methods.

As cancer precision medicine research deepens, the discovery of cancer markers has reached a plateau, and there is an urgent need to study the relationship between genes, cancer progression, and patient survival from different dimensions. The GRN-based survival analysis strategy therefore has broad application prospects.

Although specific embodiments of the present invention have been disclosed for illustrative purposes, to help the reader understand the invention and put it into practice, those skilled in the art will understand that various substitutions, changes, and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the present invention should not be limited to the content disclosed in the preferred embodiments, and the scope of protection claimed is defined by the claims.

Claims

1. A method for constructing a patient survival network based on a gene regulation network comprises the following steps:

1) Obtaining a gene expression matrix, wherein the rows of the gene expression matrix are genes, the columns of the gene expression matrix are target cancer patient samples, and the element value of the mth row and the nth column in the gene expression matrix represents the expression level of the mth gene in the nth target cancer patient; acquiring patient survival information corresponding to each target cancer patient sample;

2) Constructing a gene regulation network based on the gene expression matrix;

3) For each edge in the gene regulation network, deleting the edge if the reliability of the edge is lower than a set threshold value;

4) Evaluating the co-expression stability of each gene in the optimized gene regulation network of step 3) in each target cancer patient sample;

5) For each gene in the gene expression matrix, ranking all target cancer patient samples by the co-expression stability of the gene in those samples, taking the survival information of the target cancer patient samples in the top T% of co-expression stability as a first group of information, and taking the survival information of the target cancer patient samples in the bottom T% of co-expression stability as a second group of information; carrying out survival analysis based on the first group of information and the second group of information to obtain a log-rank test value P of the gene; then, based on the log-rank test value P of the gene, judging whether there is a statistical difference in survival time between the target cancer patients in the top T% samples and those in the bottom T% samples; if there is a statistical difference, retaining the gene;

6) Constructing a survival network of the target cancer from the genes retained in step 5) together with the edges and genes of the gene regulation network that connect the retained genes.
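Steps 3) and 6) can be illustrated with a minimal sketch. It is not the claimed implementation: the function names, the edge representation `(gene_u, gene_v, reliability)`, the gene labels `G1` to `G4`, and the scores are all hypothetical. It also assumes one reading of step 6), namely that an edge is kept when at least one of its endpoints is a retained gene.

```python
def filter_edges(edges, threshold):
    """Step 3 (sketch): drop edges whose reliability is below the threshold."""
    return [(u, v, score) for u, v, score in edges if score >= threshold]


def build_survival_network(edges, retained_genes):
    """Step 6 (sketch, assumed reading): keep every edge that touches a
    retained gene, plus the genes at both ends of those edges."""
    retained = set(retained_genes)
    kept_edges = [(u, v) for u, v, _ in edges if u in retained or v in retained]
    nodes = sorted({gene for edge in kept_edges for gene in edge} | retained)
    return nodes, kept_edges


# Toy network with made-up reliability scores.
toy_edges = [("G1", "G2", 0.9), ("G1", "G3", 0.4), ("G3", "G4", 0.8)]
reliable = filter_edges(toy_edges, threshold=0.5)
nodes, survival_edges = build_survival_network(reliable, retained_genes=["G1"])
```

With the toy data, the low-reliability edge G1-G3 is removed first, so the resulting network contains only G1, G2 and the edge between them.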

2. The method of claim 1, wherein the method for obtaining the co-expression stability of each gene in each target cancer patient sample is: firstly, obtaining the adjacent genes of each gene in the gene regulation network; then obtaining, from the gene expression matrix, the expression level of each of the adjacent genes in all patients and performing Z-Score normalization on the obtained expression levels; and then evaluating the co-expression stability of each gene in each target cancer patient sample based on the Z-Score normalized expression levels of the adjacent genes of that gene.
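The normalization part of claim 2 can be sketched as follows. This is only an illustration: the dictionary-based data layout and the names `zscore` and `normalized_neighbourhood` are hypothetical, the population standard deviation is an assumption (the claim does not say which estimator is used), and the stability score itself is not computed here.

```python
from statistics import mean, pstdev


def zscore(values):
    """Z-Score normalize one gene's expression across all patient samples.
    Assumption: population standard deviation; a constant gene maps to zeros."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def normalized_neighbourhood(expression, adjacency, gene):
    """Return Z-Score normalized profiles for a gene and its adjacent genes.

    expression: dict mapping gene name to a list of expression levels,
                one entry per patient sample (hypothetical layout).
    adjacency:  dict mapping gene name to the set of its adjacent genes
                in the gene regulation network.
    """
    genes = [gene] + sorted(adjacency.get(gene, ()))
    return {g: zscore(expression[g]) for g in genes}


# Toy expression values for three patient samples (made up).
expression = {"G1": [1.0, 2.0, 3.0], "G2": [4.0, 4.0, 4.0], "G3": [9.0, 5.0, 1.0]}
adjacency = {"G1": {"G2", "G3"}}
profiles = normalized_neighbourhood(expression, adjacency, "G1")
```

Each normalized profile has zero mean across samples, so the values of a gene and its adjacent genes become directly comparable within a single patient sample.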

3. The method of claim 1, wherein the method for obtaining the co-expression stability of each gene in each target cancer patient sample is: for each gene $g_0$ in the gene expression matrix, obtaining from the gene regulation network the $M$ adjacent genes $\{g_1, \dots, g_M\}$ of $g_0$; for gene $g_0$ and its adjacent genes, $Z_i = \{v_{i,1}, \dots, v_{i,n}\}$ denotes the expression level of the $i$-th gene $g_i$ after Z-Score normalization, where $i \in [0, M] \cap \mathbb{Z}$ and $n$ represents the number of target cancer patient samples; when the expression patterns of $g_0$ and $g_i$ are positively correlated, the co-expression stability of gene $g_0$ in target cancer patient sample $j$ is [formula missing from the source text]; otherwise it is [formula missing from the source text]; wherein $v_{i,j}$ denotes the expression level of gene $g_i$ in target cancer patient sample $j$, $j \in \{1, \dots, n\}$.

4. The method of claim 1, 2 or 3, wherein if the log-rank test value P of a gene is less than or equal to 0.05, it is determined that there is a statistical difference in survival time between the target cancer patients in the top T% target cancer patient samples and those in the bottom T% target cancer patient samples.
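The grouping and test described in step 5) and claim 4 can be sketched with a standard two-group log-rank test. The code below is an illustrative assumption rather than the patented implementation: the function names and the toy survival times are hypothetical, T is assumed to be at most 50 so the two groups do not overlap, and the P value is taken from the chi-square distribution with one degree of freedom.

```python
import math


def split_top_bottom(stability, t_percent):
    """Return sample indices of the top and bottom T% by co-expression stability."""
    order = sorted(range(len(stability)), key=lambda j: stability[j], reverse=True)
    k = max(1, int(len(order) * t_percent / 100))
    return order[:k], order[-k:]


def logrank_p(times_a, events_a, times_b, events_b):
    """Two-group log-rank test. events: 1 = death observed, 0 = censored."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    obs_minus_exp = 0.0
    variance = 0.0
    for t in sorted({u for u, e, _ in data if e}):
        # Numbers at risk and numbers of deaths at time t, per group.
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)
        n_b = sum(1 for u, _, g in data if u >= t and g == 1)
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        d_b = sum(1 for u, e, g in data if u == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0
    chi2 = obs_minus_exp ** 2 / variance
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


# Toy data: short survival in the first group, long survival in the second.
p_value = logrank_p([1, 2, 3, 4, 5], [1] * 5, [6, 7, 8, 9, 10], [1] * 5)
gene_is_retained = p_value <= 0.05
```

In practice, `split_top_bottom` would be applied to one gene's stability scores across all samples, and the survival times and event indicators of the two index sets would then be passed to `logrank_p`.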

5. The method of claim 1, 2 or 3, wherein the gene regulation network is optimized using experimental means or an interaction database, and edges of the gene regulation network having a reliability below the set threshold are deleted.

6. The method of claim 5, wherein the experimental means comprises: predicting transcription factor-target gene interactions based on co-immunoprecipitation, and detecting binding based on yeast two-hybrid, close-range fluorescence resonance, surface plasmon resonance and mass spectrometry; and the interaction database comprises: chromatin interaction databases, transcription factor-target gene databases, microRNA-target gene databases, protein-protein interaction databases, and pathway databases.

7. A server comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the method of any of claims 1 to 6.

8. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any of claims 1 to 6.