CN109585017B


Info

Publication number: CN109585017B
Application number: CN201910101067.7A
Authority: CN
Inventors: 王丽君; 高军晖; 袁卫兰; 龚建兵; 刘慧敏; 林灵; 许骋; 张英霞
Original assignee: Shanghai Biotecan Medical Diagnostics Co ltd; Shanghai Biotecan Biology Medicine Technology Co ltd
Current assignee: Shanghai Biotecan Medical Diagnostics Co ltd; Shanghai Biotecan Biology Medicine Technology Co ltd
Priority date: 2019-01-31
Filing date: 2019-01-31
Publication date: 2023-12-12
Anticipated expiration: 2039-01-31
Also published as: CN109585017A

Abstract

The application provides a risk prediction algorithm model and a device for age-related macular degeneration (AMD). Specifically, the application genotypes 7 related single nucleotide polymorphisms (SNPs), converts the genotypes into odds ratio (OR) values, combines them with 7 items of clinical information, and constructs a risk prediction algorithm model and device using a machine learning method. The application can assist clinicians in predicting AMD in advance and diagnosing it early, and has great clinical significance for reducing the incidence of AMD and improving the disease treatment rate.

Description

Translated from Chinese

Technical Field

The invention relates to the field of medical biological detection, and specifically to a risk prediction algorithm model and device for age-related macular degeneration (AMD).

Background

Age-related macular degeneration (AMD) is the main cause of blindness in the elderly. The disease has a complex etiology related to age, gender, smoking, race, genetics, and other factors. It causes irreversible vision loss, and there is currently no effective treatment for it. AMD has a high incidence rate. Meta-analysis results show that the overall incidence rate of AMD worldwide is 8.01%, and the incidence rates in European, African, and Asian populations are 11.2%, 7.1%, and 6.8% respectively. In China's elderly population, the incidence rates of early AMD and late AMD are 4.7%-9.2% and 0.2%-1.9% respectively. The number of AMD patients worldwide is predicted to reach 196 million by 2020 and 288 million by 2040. As China's population ages more rapidly, AMD shows a clear upward trend.

AMD results from the combined action of environmental and genetic factors, with genetic factors accounting for a high proportion of the risk of the disease, 45-70%. The etiology of AMD is complex, and its pathogenesis is related to both genetic and environmental factors; as noted above, genetic factors account for an important share of the risk. Clearly, if genetic and environmental factors are considered together and combined with routine and auxiliary AMD examinations such as visual acuity, intraocular pressure, fundus examination, fundus fluorescein angiography, and optical tomography, this would greatly improve the accurate diagnosis and effective risk assessment of AMD, and would also benefit the prevention, early detection, and treatment of AMD.

Therefore, there is an urgent need in the field to develop a reliable method for the early prediction and diagnosis of AMD.

Summary of the Invention

The purpose of the present invention is to provide a risk prediction algorithm model and device for age-related macular degeneration (AMD).

In a first aspect of the present invention, a biomarker set is provided, the set including two biomarkers selected from the following group: rs2338104, rs754203, or a combination thereof.

In another preferred embodiment, the biomarker set is a biomarker set for diagnosing macular degeneration (AMD), and further includes biomarkers selected from the following group: rs2284664, rs2071277, rs1999930, rs10490924, rs5749482, or a combination thereof.

In another preferred embodiment, the biomarker set is a biomarker set for diagnosing macular degeneration (AMD), including biomarkers selected from Table A:

Table A

| Identifier | Chromosomal location | Mutated base |
|---|---|---|
| rs2338104 | 12:109457363 | C>G |
| rs754203 | 14:99691630 | A>G |
| rs2284664 | 1:196733395 | C>T |
| rs2071277 | 6:32203906 | T>C |
| rs1999930 | 6:116065971 | C>T |
| rs10490924 | 10:122454932 | G>T |
| rs5749482 | 22:32663679 | C>G |

In another preferred embodiment, the biomarker set is used to diagnose macular degeneration (AMD), or to prepare a kit or reagent, the kit or reagent being used to evaluate a subject's risk of (susceptibility to) AMD or to diagnose AMD in the subject (including early diagnosis and/or auxiliary diagnosis).

In another preferred embodiment, the set includes biomarkers selected from Table B:

Table B

In another preferred embodiment, the set includes biomarkers b1 to b2.

In another preferred embodiment, the set further includes biomarkers b3 to b7.

In another preferred embodiment, the set further includes the biomarkers rs551397, rs800292, rs10737680, rs3753396, rs1410996, rs2284664, rs1065489, or a combination thereof.

In another preferred embodiment, the biomarker or biomarker set is derived from blood, plasma, serum, or oral swab samples.

In another preferred embodiment, each biomarker is detected by PCR.

In another preferred embodiment, fluorescence quantitative PCR is used for DNA fragment amplification and single-base extension.

In another preferred embodiment, the MassARRAY Analyzer 4 system is used to detect the biomarkers.

In another preferred embodiment, the PCR includes qPCR and fluorescence quantitative PCR.

In another preferred embodiment, the set is used for the assessment or diagnosis of AMD risk.

In another preferred embodiment, assessing the subject's AMD risk includes early screening for AMD.

In a second aspect of the present invention, a reagent combination for the assessment or diagnosis of AMD risk is provided, the reagent combination including reagents for detecting each biomarker in the set according to the first aspect of the present invention.

In a third aspect of the present invention, a kit is provided, the kit including the set according to the first aspect of the present invention and/or the reagent combination according to the second aspect of the present invention.

In another preferred embodiment, each biomarker in the set according to the first aspect of the present invention is used as a standard.

In another preferred embodiment, the kit further includes an instruction manual.

In a fourth aspect of the present invention, a use of a biomarker set is provided for preparing a kit, the kit being used for the assessment or diagnosis of AMD risk, wherein the biomarker set includes two biomarkers selected from the following group: rs2338104, rs754203, or a combination thereof.

In another preferred embodiment, when used for the assessment or diagnosis of AMD risk, the biomarker set further includes biomarkers selected from the following group: rs2284664, rs2071277, rs1999930, rs10490924, rs5749482, or a combination thereof.

In another preferred embodiment, the assessment includes the steps of:

(1) providing a sample derived from the subject, and detecting the SNP typing value (i.e., A1 or A2 in Table 2) of each biomarker of the set in the sample;

(2) comparing the site information measured in step (1) with a reference data set;

Preferably, the reference data set includes data for each biomarker in the set derived from AMD patients and healthy controls.

In another preferred embodiment, the sample is selected from the group consisting of blood, plasma, serum, and oral swabs.

In another preferred embodiment, comparing the site information measured in step (1) with a reference data set further includes the step of establishing a supervised machine learning multivariate statistical model to output the probability of disease; preferably, the machine learning model is an Xgboost analysis model.

In another preferred embodiment, if the probability of disease is > 0.5, the subject is determined to be at risk of AMD or to have AMD.

In another preferred embodiment, before step (1), the method further includes the step of processing the sample.

In a fifth aspect of the present invention, a method for assessing or diagnosing a subject's risk of AMD is provided, including the steps of:

(1) providing a sample derived from the subject, and detecting the site information (such as the SNP typing value, i.e., A1 or A2 in Table 2) of each biomarker of the set in the sample;

(2) comparing the typing measured in step (1) with a reference data set;

Preferably, the reference data set includes data for each biomarker in the set derived from AMD patients and healthy controls.

In another preferred embodiment, comparing the data calculated from the typing measured in step (1) with a reference data set further includes the step of establishing a supervised ensemble machine learning model to output the probability of disease; preferably, the machine learning model is an Xgboost analysis model.

In a sixth aspect of the present invention, a method for screening candidate compounds for assessing or diagnosing the risk of AMD is provided, including the steps of:

(1) in a test group, administering a test compound to a subject and detecting the level V1 of each biomarker of the set in a sample derived from the subject; in a control group, administering a blank control (including vehicle) to a subject and detecting the level V2 of each biomarker of the set in a sample derived from the subject;

(2) comparing the level V1 and the level V2 obtained in the previous step to determine whether the test compound is a candidate compound for treating AMD, wherein the set includes two or more biomarkers selected from the following group: rs2338104, rs1999930, rs10490924.

In another preferred embodiment, the subject has AMD.

In another preferred embodiment, if the level V1 of one or more biomarkers selected from subset H is significantly lower than the level V2, the test compound is indicated to be a candidate compound for treating AMD.

In another preferred embodiment, "significantly lower" means that the ratio of level V1 to level V2 is ≤ 0.8, preferably ≤ 0.6, and more preferably ≤ 0.4.
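The ratio criterion above can be sketched as a small grading helper. This is only an illustration of the three cut-offs stated in the text (0.8, 0.6, 0.4); the function name `grade_v1_v2_ratio` and the returned labels are assumptions introduced here, not terms from the patent.

```python
def grade_v1_v2_ratio(v1: float, v2: float) -> str:
    """Grade how far the test-group level V1 falls below the control level V2.

    The cut-offs (0.8, 0.6, 0.4) come from the text; the labels are hypothetical.
    """
    if v2 <= 0:
        raise ValueError("control level V2 must be positive")
    ratio = v1 / v2
    if ratio <= 0.4:
        return "more_preferred"      # ratio <= 0.4
    if ratio <= 0.6:
        return "preferred"           # 0.4 < ratio <= 0.6
    if ratio <= 0.8:
        return "significantly_lower" # 0.6 < ratio <= 0.8
    return "not_significant"         # ratio > 0.8
```

A compound would count as a candidate under the text's criterion whenever the helper returns anything other than `"not_significant"` for at least one biomarker.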

In a seventh aspect of the present invention, a use of a biomarker set is provided for screening candidate compounds for assessing or diagnosing the risk of AMD and/or for evaluating the therapeutic effect of candidate compounds on AMD, wherein the biomarker set includes two biomarkers selected from the following group: rs2338104, rs754203, or a combination thereof.

In another preferred embodiment, the biomarkers further include: rs2284664, rs2071277, rs1999930, rs10490924, rs5749482, or a combination thereof.

In an eighth aspect of the present invention, an early auxiliary screening system for AMD is provided, characterized in that the system includes:

(a) an AMD-related disease feature input module, used to input the AMD-related disease features of a subject;

wherein the AMD-related disease features include two or more items of site information (such as SNP typing values, i.e., A1 or A2 in Table 2) selected from Group A: rs2284664, rs2071277, rs1999930, rs10490924, rs2338104, rs754203, rs5749482, or a combination thereof;

(b) an AMD-related disease discrimination processing module, which scores the input AMD-related disease features according to predetermined criteria to obtain a risk score, and compares the risk score with a risk threshold for AMD-related disease to obtain an auxiliary screening result, wherein a risk score above the risk threshold indicates that the subject's risk of AMD-related disease is higher than that of the normal population; and

(c) an auxiliary screening result output module, used to output the auxiliary screening result.
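The three modules (a) to (c) can be pictured with a minimal sketch. Everything named here (`run_screening`, `format_report`, `ScreeningResult`, the `score_fn` callback) is hypothetical; the text does not specify the scoring rule, so the scorer is passed in as a parameter, and the default threshold of 0.5 is borrowed from the preferred embodiment that uses a disease probability > 0.5.

```python
from dataclasses import dataclass
from typing import Callable, Mapping

# Group A site identifiers listed in the text.
GROUP_A_SNPS = (
    "rs2284664", "rs2071277", "rs1999930", "rs10490924",
    "rs2338104", "rs754203", "rs5749482",
)


@dataclass(frozen=True)
class ScreeningResult:
    risk_score: float
    threshold: float
    elevated_risk: bool


def run_screening(
    features: Mapping[str, float],
    score_fn: Callable[[Mapping[str, float]], float],
    threshold: float = 0.5,
) -> ScreeningResult:
    """Modules (a) and (b): validate the input, score it, compare with the threshold."""
    supplied = [snp for snp in GROUP_A_SNPS if snp in features]
    if len(supplied) < 2:
        raise ValueError("at least two Group A SNP features are required")
    score = float(score_fn(features))  # score_fn stands in for the trained model
    return ScreeningResult(score, threshold, score > threshold)


def format_report(result: ScreeningResult) -> str:
    """Module (c): render the auxiliary screening result as text."""
    verdict = (
        "risk higher than the normal population"
        if result.elevated_risk
        else "risk not above the threshold"
    )
    return (
        f"risk score {result.risk_score:.2f} "
        f"(threshold {result.threshold:.2f}): {verdict}"
    )
```

Passing the scorer as a callback keeps the sketch independent of any particular model; in practice it would presumably wrap the trained Xgboost classifier's probability output.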

In another preferred embodiment, (a) further includes the following AMD-related disease features: age, diabetes status, body mass index (BMI), kidney damage, atherosclerosis, alcohol consumption, and whether the subject is frequently outdoors.

In another preferred embodiment, the subject is a human.

In another preferred embodiment, the subject includes infants, adolescents, or adults.

In another preferred embodiment, in the processing module, the risk scoring process is performed as follows:

In another preferred embodiment, the feature input module includes a sample collection instrument.

In another preferred embodiment, the feature input module is selected from the following group: a MassARRAY Analyzer 4 system typing output module and an Askme module.

In another preferred embodiment, the AMD-related disease discrimination processing module includes a processor and a memory, wherein the memory stores risk threshold data or a model for AMD-related disease based on AMD-related disease features.

In another preferred embodiment, the output module includes a reporting system (such as Askme's reporting system).

It should be understood that, within the scope of the present invention, the technical features described above and the technical features specifically described below (such as in the examples) can be combined with one another to form new or preferred technical solutions. For reasons of space, they are not described one by one here.

Description of Drawings

Figure 1 shows the technical route of the present invention.

Figure 2 shows the experimental steps for gene SNP typing using the MassARRAY Analyzer 4 system.

Figure 3 shows ROC curves for the logistic regression, random forest, Adaboost, and Xgboost classifiers. The training and test sets were randomly split 1000 times, and the ROC curves were plotted from the averaged test-set results. The feature variables include clinical information and site information (SNP+CC).

Figure 4 shows ROC curves plotted from the average test-set predictions of Xgboost over 1000 repetitions of learning and prediction. "CC" denotes feature variables containing only clinical information, "SNP" denotes feature variables containing only SNP sites, and "SNP+CC" denotes feature variables containing both clinical information and site information.

Figure 5 shows the importance scores of the top 10 feature variables output by Xgboost.

Figure 6 shows the relationship between the number of variables and the ROC-AUC score. Feature-importance scores are obtained from the Xgboost model and used to re-optimize and screen the model: feature variables are added one at a time in descending order of importance, and each resulting set is fed into the model for training and testing, to find the number of variables that gives the best test ROC-AUC. In the figure, the best ROC-AUC corresponds to 4 variables, i.e., the four feature variables with the highest importance scores are used as input variables, at which point the ROC-AUC score is highest.

Figure 7 shows the ROC curve obtained with Xgboost as the machine learning model and age, rs754203, rs2338104, and diabetes as variables, plotted from the average of 1000 test-set results.
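The selection procedure described for Figure 6 (rank features by importance, add them one at a time, keep the count with the best test ROC-AUC) can be sketched as follows. The function name `select_top_features` and the `evaluate_auc` callback are assumptions made for illustration; in the described workflow the callback would train and test the Xgboost model over repeated random splits and return the mean test ROC-AUC, a step not reproduced here.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def select_top_features(
    importance: Dict[str, float],
    evaluate_auc: Callable[[Sequence[str]], float],
) -> Tuple[List[str], float]:
    """Return the importance-ranked feature prefix with the highest ROC-AUC.

    importance   -- feature name -> importance score from a fitted model
    evaluate_auc -- hypothetical callback: trains/tests on the given features
                    and returns the resulting ROC-AUC
    """
    # Rank features from most to least important.
    ranked = sorted(importance, key=lambda name: importance[name], reverse=True)
    best_subset: List[str] = []
    best_auc = float("-inf")
    # Grow the feature set one variable at a time, keeping the best prefix.
    for k in range(1, len(ranked) + 1):
        subset = ranked[:k]
        auc = evaluate_auc(subset)
        if auc > best_auc:
            best_subset, best_auc = list(subset), auc
    return best_subset, best_auc
```

Only prefixes of the ranked list are evaluated, so the search costs one model evaluation per feature rather than one per subset.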

Detailed Description

After extensive and in-depth research, the inventors have for the first time developed a risk prediction algorithm model and device for age-related macular degeneration (AMD). The invention uses the odds ratio (OR) values of 7 related SNPs, combines them with 7 items of clinical information, and uses a machine learning method to build the risk prediction algorithm model and device. The invention can assist clinicians in predicting AMD in advance and diagnosing it early, and has great clinical significance for reducing the incidence of AMD and improving the disease treatment rate. The present invention was completed on this basis.

Terms

rs2338104: sequence TGAAAAAGTTCTAAAATTAGATAGT[C/G]GTTATGGCCTCACAACTTGTGAATA, chromosomal position 12:109457363, associated gene KCTD10

rs754203: sequence GTGCTGTCCTGGGGCCCAGGAGCCC[C/T]GGGGGCAAGGCTCTGCCCTGTTGCT, chromosomal position 14:99691630, associated gene CYP46A1

rs2284664: sequence AGAAAAATACCAGTCTCCATAGATC[A/G/T]TAAAGCAAATAGATGGTCTTAAAAT, chromosomal position 1:196733395, associated gene CFH

rs2071277: sequence GGCAGTGACTGATGCAGTGTGTGAC[A/G]TCTAATCTCCCCCATAATTACAGGC, chromosomal position 6:32203906, associated gene NOTCH4

rs1999930: sequence ATAGGACAGATTCTAGATTTTCCTT[A/C/G/T]TGATACAGAGAAATATAAGACATAA, chromosomal position 6:116065971, associated gene FRK

rs10490924: sequence TTTATCACACTCCATGATCCCAGCT[G/T]CTAAAATCCACACTGAGCTCTGCTT, chromosomal position 10:122454932, associated gene ARMS2

rs5749482: sequence TGGGAACTGACTAATACAGCATGTA[C/G]GAACTATGAAATATGAATTGTGTAA, chromosomal position 22:32663679, associated genes LOC105373002, SYN3

Age-related macular degeneration (AMD)

AMD is an aging-related change in the structure of the macular area. It manifests mainly as a decline in the ability of retinal pigment epithelium (RPE) cells to phagocytose and digest the outer segment disc membranes of photoreceptor cells, so that incompletely digested residual bodies of the disc membranes are retained in the basal cytoplasm, are discharged from the cell, and are deposited on Bruch's membrane, forming drusen. The various pathological changes secondary to this lead to macular degeneration, or cause Bruch's membrane to rupture; choroidal capillaries then pass through the ruptured Bruch's membrane into the space beneath the RPE and beneath the retinal neuroepithelium, forming choroidal neovascularization. Because the walls of the new vessels are structurally abnormal, the vessels leak and bleed, triggering a series of secondary pathological changes. AMD mostly occurs in people over the age of 45, its prevalence increases with age, and it is currently an important cause of blindness in the elderly.

Single nucleotide polymorphism (SNP)

The term mainly refers to DNA sequence polymorphism caused by variation of a single nucleotide at the genome level. SNPs are widespread in the human genome, occurring on average once in every … base pairs, with an estimated total of 3 million or more. A SNP is a biallelic marker caused by a transition or transversion of a single base, or by the insertion or deletion of a base. SNPs may lie within gene sequences or in non-coding sequences outside genes.

Xgboost

A boosting-based supervised ensemble learning model composed of multiple associated CART trees. CART is a binary decision tree. At each split, it exhaustively evaluates every threshold of every feature column, uses the Gini index to find the feature column and threshold that reduce impurity the most, and then divides the samples into two branches, feature ≤ threshold and feature > threshold, each containing the samples that satisfy its condition. Splitting continues in the same way until all samples under a branch belong to the same class or a preset stopping condition is reached; if the classes in a final leaf node are not unique, the majority class is taken as the class of that leaf node. Xgboost can be expressed by the following formula:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F$$

where $\hat{y}_i$ is the predicted value, $F$ represents all possible CART trees, and $f_k$ represents a specific CART tree.

The objective function of the model is:

$$Obj(\theta) = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

where $\sum_{i} l(y_i, \hat{y}_i)$ is the sum of the loss functions and $\sum_{k} \Omega(f_k)$ is the regularization term. The point at which $Obj(\theta)$ takes its minimum is the predicted value of the node, and the minimum function value is the minimum loss. Xgboost uses additive training and optimizes the objective function step by step: it first optimizes the first tree, then the second, until all $K$ trees have been optimized.
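The exhaustive CART split search described in this entry can be sketched in a few lines. This follows the Gini-based description given in the text and assumes binary class labels (0 or 1); the function names `gini` and `best_split` are illustrative. Note that the XGBoost library itself scores splits with a gradient-based gain rather than the Gini index, so this is a sketch of the textbook CART step, not of the library's internals.

```python
from typing import List, Optional, Sequence, Tuple


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a set of binary labels (0 or 1)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def best_split(
    rows: Sequence[Sequence[float]], labels: Sequence[int]
) -> Optional[Tuple[int, float, float]]:
    """Try every threshold of every feature column.

    Returns (feature_index, threshold, weighted_gini) for the split with the
    lowest weighted impurity, or None if no split separates the samples.
    """
    n = len(rows)
    best: Optional[Tuple[int, float, float]] = None
    for j in range(len(rows[0])):
        for threshold in sorted({row[j] for row in rows}):
            left: List[int] = [y for row, y in zip(rows, labels) if row[j] <= threshold]
            right: List[int] = [y for row, y in zip(rows, labels) if row[j] > threshold]
            if not left or not right:
                continue  # a split must send samples to both branches
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (j, threshold, score)
    return best
```

Minimizing the weighted impurity of the two branches is equivalent to maximizing the impurity reduction, because the parent node's impurity is the same for every candidate split.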

ROC-AUC

A method for evaluating model accuracy. The ROC curve is the receiver operating characteristic curve, a plot with the false positive rate on the horizontal axis and the true positive rate on the vertical axis; it is a composite index reflecting sensitivity and specificity as continuous variables. AUC is the area under the ROC curve. The ROC-AUC value lies between 0.5 and 1.0; the closer it is to 1, the better the diagnostic performance. Accuracy is low at 0.5-0.7, moderate at 0.7-0.9, and high above 0.9. AUC = 0.5 means the diagnostic method does not work at all and has no diagnostic value. AUC < 0.5 does not correspond to realistic situations and rarely occurs in practice.
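As a small illustration of the definition above, the AUC can be computed without drawing the curve, using the equivalent rank interpretation: it is the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The helper names and the exact handling of the band boundaries in `interpret_auc` are assumptions; the text gives only the ranges.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def interpret_auc(auc: float) -> str:
    """Map an AUC value to the accuracy bands given in the text."""
    if auc > 0.9:
        return "high"
    if auc > 0.7:
        return "moderate"
    if auc > 0.5:
        return "low"
    return "no diagnostic value"
```

The pairwise loop is quadratic in the sample count, which is fine for a sketch; a sort-based rank sum gives the same value for large data sets.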

The main advantages of the present invention include:

1) For the first time in the clinical field, the invention predicts AMD risk values from site information and clinical data, and is suitable for high-throughput sample testing;

2) The invention predicts the risk of developing AMD at a future age and can indicate the effect of changes in lifestyle habits on the risk value, serving as a preventive warning for AMD.

The present invention is further described below with reference to specific examples. It should be understood that these examples are only intended to illustrate the invention and not to limit its scope. Experimental methods for which no specific conditions are given in the following examples generally follow conventional conditions, such as those described in Sambrook et al., Molecular Cloning: A Laboratory Manual (New York: Cold Spring Harbor Laboratory Press, 1989), or the conditions recommended by the manufacturer. Unless otherwise stated, percentages and parts are by weight.

Example 1.

From data for 108 candidate SNP sites, the 7 AMD-related sites required by the algorithm model and device were selected by statistical analysis.

An experimental training group and a control group were recruited for SNP statistics and clinical informatics analysis. Extensive screening identified 108 SNP sites, which are listed in Table 1. The SNP typing data were obtained by the following steps:

1. Sample collection: one of the following two collection methods is used.

a) Blood sample collection: collect 2-4 mL of whole blood in an EDTA anticoagulant tube.

b) Oral swab collection: use a nylon flocked oral swab to scrape the mucosa of the palate and both sides of the subject's oral cavity until the flocked tip of the swab is completely moist, then place the collected swab sample in a test tube containing sample protection solution (1-2 mL) for storage.

2. Sample transportation: add ice packs to the foam box containing the samples for low-temperature transportation.

3. Amplify the DNA fragments and perform single-base extension using 7500 fluorescence quantitative PCR. First, prepare the dye MIX: when preparing the dye, make up enough for a few extra wells, and store it at -20°C once prepared. Second, mark the walls of the centrifuge tubes holding the dye-method and probe-method mixtures so that the two are not confused. Then add the reagents in order: MIXTURE (17 μL), primer 1 (1 μL), and sample (2 μL). Finally, seal the plate with film and load it on the instrument.

4. Perform gene SNP typing using the MassARRAY Analyzer 4 system. The operating steps are shown in Figure 2.

5. AMD-related SNP sites are obtained by genome-wide SNP association analysis (GWAS). The association analysis covers the following assumptions (genetic models):

1) Genotypic model: with A as the minor allele and a as the major allele, the three genotypes (AA, Aa, aa) have different effects.

2) Dominant model: the AA/Aa genotypes and the aa genotype have different effects.

3) Recessive model: the AA genotype and the Aa/aa genotypes have different effects.

4) Allelic model: alleles A and a have different effects.

A chi-square value is calculated under each of these hypotheses, where O is the observed frequency and E is the expected frequency. Taking hypothesis 2) as an example: in the first step, for carriers of the AA or Aa genotype (either one) among normal subjects, the squared difference between the observed and expected frequencies is divided by the expected frequency to give V1; in the second step, the same calculation for AA or Aa carriers among patients gives V2; in the third step, the same method gives V3 for the aa genotype among normal subjects and V4 for the aa genotype among patients. The chi-square value is V1+V2+V3+V4. The p value of the association is derived from the chi-square value, and 14 associated sites were selected with p < 0.05.
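The dominant-model calculation described above can be sketched as follows. This is a minimal illustration, not the applicant's code: it assumes the standard Pearson form (O−E)²/E for each of the four cells, derives the expected counts from the row and column totals of a 2×2 table, and converts the statistic to a p value for 1 degree of freedom. The function names and the example counts are hypothetical.

```python
import math

def chi_square_2x2(case_carrier, case_noncarrier, ctrl_carrier, ctrl_noncarrier):
    """Pearson chi-square for carriers (AA/Aa) vs. non-carriers (aa)
    in controls and cases; returns V1 + V2 + V3 + V4."""
    observed = [[ctrl_carrier, ctrl_noncarrier],
                [case_carrier, case_noncarrier]]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of genotype and disease status.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

def p_value_df1(chi2):
    """Upper-tail p value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical counts: 20 carriers / 10 non-carriers in cases,
# 10 carriers / 20 non-carriers in controls.
chi2 = chi_square_2x2(20, 10, 10, 20)
print(chi2, p_value_df1(chi2))
```

A site would pass the screening step when the returned p value is below 0.05.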

Collinearity exists among these 14 associated sites, and the 7 sites with strong collinearity were removed algorithmically as follows: a sliding window of 50 SNP sites is moved 5 SNPs at a time; within each window, the multiple correlation coefficient R² between one SNP and all the others is computed, together with the variance inflation factor VIF = 1/(1-R²); SNP sites whose VIF exceeds 2 are excluded. This removed the sites rs551397, rs800292, rs10737680, rs3753396, rs1410996, rs2284664, and rs1065489, leaving the sites rs2284664, rs2071277, rs1999930, rs10490924, rs2338104, rs754203, and rs5749482.

The process above yields the required sites, whose information is shown in Table 2.
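The window-based VIF filter can be sketched as below. The patent does not say which tool was used; the parameters (window 50, step 5, VIF threshold 2) resemble PLINK's `--indep 50 5 2` option, but that is an inference. This sketch assumes genotypes are coded as minor-allele counts (0/1/2) in a samples × SNPs NumPy array, and it removes the highest-VIF SNP in a window one at a time, which is one reasonable reading of the description. All function names are hypothetical.

```python
import numpy as np

def vif_scores(G):
    """VIF = 1 / (1 - R^2) for each column, where R^2 comes from
    regressing that column on all other columns of G."""
    G = G - G.mean(axis=0)  # centring removes the need for an intercept
    out = np.empty(G.shape[1])
    for j in range(G.shape[1]):
        y = G[:, j]
        X = np.delete(G, j, axis=1)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        ss_res = float(np.sum((y - X @ beta) ** 2))
        ss_tot = float(np.sum(y ** 2))
        r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
        out[j] = np.inf if r2 >= 1.0 - 1e-12 else 1.0 / (1.0 - r2)
    return out

def prune_by_vif(G, window=50, step=5, threshold=2.0):
    """Return a boolean mask of SNP columns kept after VIF pruning."""
    n_snps = G.shape[1]
    keep = np.ones(n_snps, dtype=bool)
    for start in range(0, n_snps, step):
        idx = [j for j in range(start, min(start + window, n_snps)) if keep[j]]
        while len(idx) > 1:
            vif = vif_scores(G[:, idx])
            worst = int(np.argmax(vif))
            if vif[worst] <= threshold:
                break
            keep[idx[worst]] = False  # drop the most collinear SNP, then re-check
            idx.pop(worst)
    return keep

# Toy data: SNP 0 and SNP 1 are in perfect linkage, SNP 2 is independent.
rng = np.random.default_rng(0)
a = rng.integers(0, 3, size=200)
c = rng.integers(0, 3, size=200)
G = np.column_stack([a, a, c]).astype(float)
print(prune_by_vif(G))
```

On the toy data one of the two duplicated columns is dropped and the independent column is retained.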

6. During data cleaning, the genotype at each SNP site is converted into a numerical value, which is set to the calculated odds ratio (OR) of the corresponding one of the 7 AMD-related sites. The odds of an event are the ratio of the probability that it occurs to the probability that it does not occur; the OR is the ratio of the odds in cases to the odds in controls. The formula is as follows:

OR = (nA/na)/(mA/ma) = (nA×ma)/(mA×na)

Here A is the minor allele, nA is the number of A alleles in the disease group, na is the number of non-A alleles in the disease group, mA is the number of A alleles in the control group, and ma is the number of non-A alleles in the control group. The OR is interpreted as follows:

a) OR > 1 indicates that the frequency of A in the case group is higher than in the control group, i.e. A is associated with a higher risk of disease.

b) OR < 1 indicates that the frequency of A in the case group is lower than in the control group, i.e. A has a protective effect.

c) The more closely the disease is associated with allele A, the larger the odds ratio.
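As a small worked example of the OR formula, the function below takes the four allele counts directly. Because the formula only depends on ratios, allele frequencies can be substituted for counts; using the case and control frequencies listed for rs2338104 in Table 2 (0.4206 and 0.2905) reproduces the listed OR of about 1.773. The function name is illustrative only.

```python
def odds_ratio(n_A, n_a, m_A, m_a):
    """OR = (nA / na) / (mA / ma) = (nA * ma) / (mA * na).

    n_A, n_a: A and non-A allele counts (or frequencies) in cases.
    m_A, m_a: A and non-A allele counts (or frequencies) in controls.
    """
    return (n_A * m_a) / (m_A * n_a)

# rs2338104: minor-allele frequency 0.4206 in cases, 0.2905 in controls.
or_rs2338104 = odds_ratio(0.4206, 1 - 0.4206, 0.2905, 1 - 0.2905)
print(round(or_rs2338104, 3))
```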

Table 1. Initially selected SNP site numbers

(Unified dbSNP identifiers from the National Center for Biotechnology Information (NCBI) database)

Table 2. AMD-related SNP site information obtained by genome-wide SNP association analysis (GWAS)

CHR  SNP         A1  F_A      F_U       A2  CHISQ  P         OR      SE      L95     U95
1    rs2284664   T   0.2687   0.3762    C   4.25   0.03924   0.6091  0.2414  0.3795  0.9777
6    rs2071277   C   0.3672   0.4471    T   7.591  0.022470  0.7175  0.2304  0.4568  1.127
6    rs1999930   T   0.03676  0.004587  C   5.204  0.02253   8.282   1.101   0.9571  71.67
10   rs10490924  T   0.5397   0.4231    G   4.286  0.03842   1.599   0.2273  1.024   2.496
12   rs2338104   G   0.4206   0.2905    C   5.951  0.01471   1.773   0.2359  1.117   2.816
14   rs754203    G   0.2868   0.3773    A   7.925  0.019020  0.6636  0.2352  0.4186  1.052
22   rs5749482   G   0.2353   0.3636    C   6.42   0.01128   0.5385  0.246   0.3325  0.872

The first column, CHR, is the chromosome of the site; the second column is the SNP site number; the third column (A1) is the minor allele; the fourth column, F_A, is the observed frequency of the A1 allele in patients; the fifth column, F_U, is the observed frequency of the A1 allele in healthy subjects; the sixth column (A2) is the other allele, i.e. the major allele; the seventh column, CHISQ, is the chi-square value; the eighth column, P, is the p value derived from the chi-square value; the ninth column, OR, is the odds ratio; and the tenth, eleventh, and twelfth columns are the standard error of the OR (SE) and the lower (L95) and upper (U95) bounds of its 95% confidence interval.
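The relationship between the OR, SE, L95, and U95 columns can be checked with a short calculation. The sketch below assumes that SE is the standard error of ln(OR) and that the interval is the usual Wald interval exp(ln(OR) ± 1.96·SE); the patent does not state this explicitly, but the rs2338104 and rs5749482 rows of Table 2 are consistent with it. The function name is hypothetical.

```python
import math

def or_confidence_interval(or_value, se_log_or, z=1.96):
    """Wald confidence interval for an odds ratio.

    Assumes se_log_or is the standard error on the log-odds scale.
    Returns (lower, upper).
    """
    log_or = math.log(or_value)
    return (math.exp(log_or - z * se_log_or),
            math.exp(log_or + z * se_log_or))

# rs2338104 row of Table 2: OR = 1.773, SE = 0.2359.
lower, upper = or_confidence_interval(1.773, 0.2359)
print(round(lower, 3), round(upper, 3))
```

An interval that excludes 1 corresponds to an association that is significant at roughly the 5% level under this approximation.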

Each genotype is subsequently replaced by a value based on the OR of the minor allele. For example, with A as the minor allele and a as the major allele, a genotype containing one minor allele (Aa) is replaced by the OR value, a genotype containing two minor alleles (AA) is replaced by the square of the OR value, and a genotype without the minor allele (aa) is replaced by 1.
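The substitution rule above amounts to raising the OR to the number of minor-allele copies. A minimal sketch, assuming genotypes are given as two-letter strings such as "CG" (the function name and input format are illustrative, not taken from the patent):

```python
def encode_genotype(genotype, minor_allele, odds_ratio):
    """Replace a genotype by OR ** (number of minor-allele copies):
    aa -> 1, Aa -> OR, AA -> OR squared."""
    copies = genotype.upper().count(minor_allele.upper())
    return odds_ratio ** copies

# rs2338104: minor allele G, OR = 1.773 (Table 2).
for gt in ("CC", "CG", "GG"):
    print(gt, encode_genotype(gt, "G", 1.773))
```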

Example 2.

From the subjects' questionnaires, 13 items of clinical survey data were compiled: age, body mass index (BMI), hypertension, hyperlipidemia, diabetes, kidney damage, whether the subject is often outdoors, whether the subject is vegetarian, whether the subject has never smoked, whether the subject has never drunk alcohol, atherosclerosis, eye surgery, and sex.

Example 3.

Machine learning algorithms can be divided into three categories: supervised learning, unsupervised learning, and semi-supervised learning. Supervised learning uses the correspondence between a set of input data and output data to generate a function that maps inputs to appropriate outputs, as in classification. All sample data in the present invention have been clinically diagnosed and carry classification labels, so model selection is explored among supervised machine learning classification models. Three kinds of input data are used: data containing only the SNP site information of all samples (SNP), data containing only the clinical information of all samples (CC), and combined data containing both SNP sites and clinical information (SNP+CC). The diagnostic result of each sample is used as the output classification label.

The algorithm is constructed in the following steps:

a) Randomly split all data into a training set (75%) and a test set (25%).

b) Build machine learning classifiers. With SNP+CC as input data, logistic regression, random forest, Adaboost, and Xgboost are tried in turn.

c) Tune the parameters by cross-validation and select the parameters with the best score.

d) Validate the results on the test set.

e) Model evaluation. The above steps are repeated 1000 times, and the mean area under the receiver operating characteristic curve (ROC-AUC) on the test set is calculated. Xgboost, which has the highest ROC-AUC score, is selected as the best model (see Figure 3).

f) Feature variable screening. Clinical information (CC), site information (SNP), and combined clinical and site information (SNP+CC) are each used as input data and classified with Xgboost, repeated 1000 times. The mean receiver operating characteristic curves on the test set are shown in Figure 4; SNP+CC has the highest ROC-AUC.

g) Further optimization of feature screening. The Xgboost model provides a feature-importance score for each variable (the importance of the top 10, for example, is shown in Figure 5). The model is refined according to this score: with the variables sorted by score in descending order, the number of variables is increased one at a time to train and test the model, giving the relationship between the number of variables and the ROC-AUC score (see Figure 6). The results show that the model trained and tested on the 4 most important variables (age, rs754203, rs2338104, diabetes) achieves the highest ROC-AUC score on the test set.

h) With Xgboost as the machine learning model and age, rs754203, rs2338104, and diabetes as input variables, the mean ROC-AUC over 1000 repetitions is 0.800±0.06.

i) Store the model for AMD risk prediction on subsequently measured data.

j) Risk value output: the trained model predicts, for the input test data, the probabilities of class 0 (control) and class 1 (AMD). The probability of class 1 is taken as the risk value, and a subject whose risk value exceeds 0.5 is classified as having AMD.
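The evaluation loop in steps a), d), e), and j) can be sketched in plain Python as below. This is an illustration under stated assumptions, not the applicant's implementation: `fit` is a hypothetical callable that stands in for the trained classifier (Xgboost in the patent), taking training rows and labels and returning a function that maps a row to a probability of class 1. Parameter tuning (step c) and feature ranking (step g) are omitted.

```python
import random
from statistics import mean

def roc_auc(labels, scores):
    """Rank-based ROC-AUC: the probability that a random case scores
    higher than a random control (ties count one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def repeated_holdout_auc(X, y, fit, n_repeats=1000, test_frac=0.25, seed=0):
    """Mean test-set ROC-AUC over repeated random 75/25 splits."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    aucs = []
    for _ in range(n_repeats):
        rng.shuffle(idx)
        n_test = max(1, round(test_frac * len(idx)))
        test, train = idx[:n_test], idx[n_test:]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        y_test = [y[i] for i in test]
        if len(set(y_test)) < 2:
            continue  # AUC is undefined when the test set has one class
        aucs.append(roc_auc(y_test, [predict(X[i]) for i in test]))
    return mean(aucs)

def classify(risk, threshold=0.5):
    """Step j): a risk value above the threshold is labelled 1 (AMD)."""
    return 1 if risk > threshold else 0
```

In practice `fit` would wrap the chosen classifier and return its class-1 probability; keeping it as a parameter lets the same loop compare logistic regression, random forest, Adaboost, and Xgboost as in step b).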

All documents mentioned in the present invention are incorporated by reference in this application, as if each document were individually incorporated by reference. In addition, it should be understood that, after reading the above teachings of the present invention, those skilled in the art can make various changes or modifications to the present invention, and these equivalent forms likewise fall within the scope defined by the claims appended to this application.

Sequence Listing

<110> Shanghai Biotecan Biology Medicine Technology Co., Ltd.
      Shanghai Biotecan Medical Diagnostics Co., Ltd.
<120> A risk prediction algorithm model and device for age-related macular degeneration
<130> P2018-2112
<160> 7
<170> SIPOSequenceListing 1.0

<210> 1
<211> 52
<212> DNA
<213> Artificial Sequence
<400> 1
tgaaaaagtt ctaaaattag atagtcggtt atggcctcac aacttgtgaa ta 52

<210> 2
<211> 52
<212> DNA
<213> Artificial Sequence
<400> 2
gtgctgtcct ggggcccagg agcccctggg ggcaaggctc tgccctgttg ct 52

<210> 3
<211> 53
<212> DNA
<213> Artificial Sequence
<400> 3
agaaaaatac cagtctccat agatcagtta aagcaaatag atggtcttaa aat 53

<210> 4
<211> 52
<212> DNA
<213> Artificial Sequence
<400> 4
ggcagtgact gatgcagtgt gtgacagtct aatctccccc ataattacag gc 52

<210> 5
<211> 54
<212> DNA
<213> Artificial Sequence
<400> 5
ataggacaga ttctagattt tccttacgtt gatacagaga aatataagac ataa 54

<210> 6
<211> 52
<212> DNA
<213> Artificial Sequence
<400> 6
tttatcacac tccatgatcc cagctgtcta aaatccacac tgagctctgc tt 52

<210> 7
<211> 52
<212> DNA
<213> Artificial Sequence
<400> 7
tgggaactga ctaatacagc atgtacggaa ctatgaaata tgaattgtgt aa 52

Claims

1. A biomarker set, characterized in that the set comprises 5 biomarkers selected from the following group: rs2338104, rs754203, rs5749482, rs2284664, and rs10490924, wherein:

rs2338104, chromosome position 12:109457363, C at this position is mutated to G;

rs754203, chromosome position 14:99691630, A at this position is mutated to G;

rs5749482, chromosome position 22:32663679, C at this position is mutated to G;

rs2284664, chromosome position 1:196733395, C at this position is mutated to T;

rs10490924, chromosome position 10:122454932, G at this position is mutated to T;

the set further comprises 2 biomarkers selected from the following group: rs2071277 and rs1999930;

wherein rs2071277, chromosome position 6:32203906, T at this position is mutated to C;

rs1999930, chromosome position 6:116065971, C at this position is mutated to T.

3. A kit, characterized in that the kit comprises the set according to claim 1 and/or the reagent combination according to claim 2.

4. Use of a biomarker set, characterized in that the set is used to prepare a kit for assessing or diagnosing the risk of age-related macular degeneration (AMD), wherein the biomarker set comprises 5 biomarkers selected from the following group: rs2338104, rs754203, rs5749482, rs2284664, and rs10490924;

wherein,

the biomarker set further comprises 2 biomarkers selected from the following group: rs2071277 and rs1999930;

(c) an auxiliary screening result output module, the output module being used to output the auxiliary screening result;

wherein,