CN115376610A

Info

Publication number: CN115376610A
Application number: CN202211065685.9A
Authority: CN
Inventors: 黄卫东; 赵琼珍; 赵杰; 梁齐; 李彦奇; 张雪萍; 刘盼盼; 郭婕; 李继洋
Original assignee: Xinjiang Carbon Wisdom Stem Cell Bank Co ltd
Current assignee: Xinjiang Carbon Wisdom Stem Cell Bank Co ltd
Priority date: 2022-09-01
Filing date: 2022-09-01
Publication date: 2022-11-22

Abstract

The invention discloses a batch between-group comparative analysis method for SNP variation rates, comprising the following steps. Case grouping: the medical records are divided into a case group and a control group according to inclusion and exclusion criteria. Within-group SNP variation data integration: VCF files store the variant results obtained by comparing the complete sequencing data of the case group and the control group against the reference genome, and the SNP variation data of the case group and of the control group are integrated separately. Calculating the variant count of each SNP in the integrated data: for each SNP variant, its count within the case group and within the control group is calculated. The invention searches for susceptibility genes for genetic diseases and traits by screening genomic SNP variation rates. Batch between-group comparison of SNP variation rates improves the efficiency of screening disease-risk genes, finds associations between genetic markers and diseases at lower cost and with higher efficiency, and provides further clues to the pathogenesis of complex diseases.

Description

A batch between-group comparative analysis method for SNP variation rates

Technical Field

The invention relates to the technical field of disease detection, and in particular to a batch between-group comparative analysis method for SNP variation rates.

Background

The rapid development of next-generation sequencing technology has brought great advantages in both data throughput and cost. In particular, whole-exome sequencing (WES), a capture-based technique, sequences the functional exonic regions at depth and can detect variants in coding regions more comprehensively. The American College of Medical Genetics and Genomics (ACMG) has issued guidelines on sequence variants. With next-generation sequencing, the range of clinical laboratory tests for hereditary diseases keeps growing and now includes genotyping, single-gene tests, gene panels, exome, genome, transcriptome and epigenetic tests.

Over the past decade, sequencing technology has developed rapidly with the arrival of next-generation high-throughput sequencing. As the technology grows more complex, however, genetic testing keeps facing new challenges in sequence interpretation. Although the ACMG working group has formulated and repeatedly revised standards and guidelines for interpreting sequence variants, a large number of variants of uncertain clinical significance remain, which makes interpretation difficult for clinicians.

No effective solution to the above problems is currently available.

Summary of the Invention

In view of the above technical problems in the related art, the invention proposes a batch between-group comparative analysis method for SNP variation rates, which screens genomic SNP variation rates to find susceptibility genes for genetic diseases and traits and can overcome the above deficiencies of the prior art.

To achieve the above technical purpose, the technical solution of the invention is implemented as follows:

A batch between-group comparative analysis method for SNP variation rates, comprising the following steps:

S1, case grouping: the medical records are divided into a case group and a control group according to inclusion and exclusion criteria;

S2, within-group SNP variation data integration: VCF files store the variant results obtained by comparing the complete sequencing data of the case group and the control group against the reference genome, and the SNP variation data of the case group and of the control group are integrated separately;

S3, calculating the variant count of each SNP in the integrated data: for each SNP variant, its count within the case group and within the control group is calculated;

S4, calculating the variation frequency of each SNP in the integrated data: for each SNP variant, its frequency within the case group and within the control group is calculated;

S5, between-group analysis of differences in SNP variation frequency: the chi-square test is applied in batch to assess whether the variation frequency of each SNP differs significantly between the case group and the control group.

Further, the inclusion and exclusion criteria in step S1 comprise inclusion criteria and exclusion criteria.

Further, the inclusion and exclusion criteria in step S1 take the form of unstructured free text.

Further, the chi-square test in step S5 measures the degree of deviation between the actually observed values of the statistical sample and the theoretically expected values; this degree of deviation determines the magnitude of the chi-square value.
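The text describes the chi-square value in words only. For reference, the standard Pearson form of the statistic is written out below. The symbols are notation introduced here and not taken from the source: O_i and E_i are the observed and expected counts in cell i, and a, b, c, d are assumed to be the numbers of case carriers, case non-carriers, control carriers and control non-carriers of one SNP variant.

```latex
\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}
\qquad\text{which, for a } 2\times 2 \text{ table with cells } a, b, c, d, \text{ reduces to}\qquad
\chi^2 = \frac{(a+b+c+d)\,(ad - bc)^2}{(a+b)\,(c+d)\,(a+c)\,(b+d)}
```

A larger gap between observed and expected counts gives a larger chi-square value and therefore a smaller P value at one degree of freedom.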

Beneficial effects of the invention: the invention searches for susceptibility genes for genetic diseases and traits by screening genomic SNP variation rates. Batch between-group comparison of SNP variation rates improves the efficiency of screening disease-risk genes, finds associations between genetic markers and diseases at lower cost and with higher efficiency, and provides further clues to the pathogenesis of complex diseases.

Brief Description of the Drawings

To illustrate the technical solutions in the embodiments of the invention or in the prior art more clearly, the drawings needed for the embodiments are briefly introduced below. The drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a schematic flow chart of the batch between-group comparative analysis method for SNP variation rates according to an embodiment of the invention.

Detailed Description

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments fall within the scope of protection of the invention.

As shown in FIG. 1, the batch between-group comparative analysis method for SNP variation rates according to an embodiment of the invention comprises steps S1 to S5 set out above.

In one embodiment, the inclusion and exclusion criteria in step S1 comprise inclusion criteria and exclusion criteria, and take the form of unstructured free text.

In one embodiment, the chi-square test in step S5 measures the degree of deviation between the actually observed values of the statistical sample and the theoretically expected values; this degree of deviation determines the magnitude of the chi-square value.

To make the above technical solution easier to understand, it is described in detail below by way of a specific mode of use.

In specific use, the batch between-group comparative analysis method for SNP variation rates according to the invention comprises the following steps:

1) Case grouping

The case group and the control group are formed according to the inclusion and exclusion criteria.

2) Within-group SNP variation data integration

The VCF files store the variant results obtained by comparing the complete sequencing data against the reference genome. The SNP variation data of the case group and of the control group are integrated separately.

3) Calculating the variant count of each SNP in the integrated data

For each SNP variant, its count within the case group and within the control group is calculated.

4) Calculating the variation frequency of each SNP in the integrated data

For each SNP variant, its frequency within the case group and within the control group is calculated.

5) Between-group analysis of differences in SNP variation frequency

The chi-square test is applied in batch to assess whether the variation frequency of each SNP differs significantly between the case group and the control group.
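Steps 2) to 4) can be illustrated with a minimal Python sketch. It rests on assumptions the source does not state: the variant count of a site is taken to be the number of samples carrying at least one non-reference allele, the variation frequency is that count divided by the number of samples in the group, and the input is the text of one merged multi-sample VCF per group. The function names are hypothetical.

```python
from typing import Dict, Iterable, Tuple

SiteKey = Tuple[str, int, str, str]  # (CHROM, POS, REF, ALT)


def carries_variant(sample_field: str) -> bool:
    """True if the GT subfield (first ':'-separated item) has a non-reference allele."""
    genotype = sample_field.split(":", 1)[0]
    alleles = genotype.replace("|", "/").split("/")
    return any(allele not in ("0", ".") for allele in alleles)


def count_carriers(vcf_lines: Iterable[str]) -> Tuple[Dict[SiteKey, int], int]:
    """Return per-site carrier counts and the number of samples in the VCF."""
    counts: Dict[SiteKey, int] = {}
    n_samples = 0
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            n_samples = len(fields) - 9  # 9 fixed columns precede the samples
            continue
        key = (fields[0], int(fields[1]), fields[3], fields[4])
        counts[key] = sum(carries_variant(s) for s in fields[9:])
    return counts, n_samples


def frequencies(counts: Dict[SiteKey, int], n_samples: int) -> Dict[SiteKey, float]:
    """Assumed definition: carriers divided by all samples, missing genotypes included."""
    if n_samples <= 0:
        raise ValueError("the VCF contains no samples")
    return {site: count / n_samples for site, count in counts.items()}
```

Running the two functions once on the case VCF and once on the control VCF yields the per-group counts and frequencies that step 5) compares. Counting alleles instead of carriers would be an equally plausible reading of the text and changes only `carries_variant` and the denominator.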

In a specific implementation, the steps of the batch between-group comparative analysis of SNP variation rates are as follows:

Configure the case (experimental) and control samples, then use this program to compare variant sites between the groups.

Setting: set the sample IDs of the case and control groups and the working folder.
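The source does not show how these settings are stored. One possible shape, given purely as a hypothetical sketch, is a small Python data class that holds the two ID lists and the working folder and rejects obviously inconsistent input. The class and field names are assumptions.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the three values named in the text."""

    case_ids: List[str]
    control_ids: List[str]
    work_dir: Path

    def validate(self) -> None:
        """Raise ValueError if a group is empty or a sample is in both groups."""
        if not self.case_ids or not self.control_ids:
            raise ValueError("both groups need at least one sample ID")
        overlap = set(self.case_ids) & set(self.control_ids)
        if overlap:
            raise ValueError(f"sample IDs present in both groups: {sorted(overlap)}")
```

A call such as `Settings(["C1", "C2"], ["N1", "N2"], Path("work")).validate()` would then precede the steps below.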

1. Obtain the stored sample variant files according to sample ID

The files in the repository are traversed and compared against file names assembled from the sample IDs. This determines whether each sample file exists before the VCF merge.
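A minimal sketch of this lookup follows. The naming scheme (sample ID plus a `.vcf.gz` suffix) and the function name `find_sample_vcfs` are assumptions, since the source does not give the repository layout.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Tuple


def find_sample_vcfs(
    repo_dir: Path, sample_ids: Iterable[str], suffix: str = ".vcf.gz"
) -> Tuple[Dict[str, Path], List[str]]:
    """Map each sample ID to its VCF path and list the IDs with no file."""
    # Walk the repository once and index every matching file by its name.
    available = {p.name: p for p in repo_dir.rglob("*" + suffix) if p.is_file()}
    found: Dict[str, Path] = {}
    missing: List[str] = []
    for sample_id in sample_ids:
        expected_name = sample_id + suffix  # file name assembled from the ID
        if expected_name in available:
            found[sample_id] = available[expected_name]
        else:
            missing.append(sample_id)
    return found, missing
```

Returning the missing IDs separately lets the caller decide whether to stop or to merge only the samples that were found.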

2. Merge the two groups of VCF variant files with BCFtools

Two threads are used to merge the VCF files of the control group and of the case group, one thread per group.
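The source does not give the exact command line. The sketch below assumes a plain `bcftools merge` call with compressed VCF output (`-O z`, `-o`), assumes the inputs are already bgzip-compressed and indexed as `bcftools merge` requires, and runs one call per group on a two-worker thread pool. The function names are hypothetical; the output names follow the result files listed in this document.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Dict, List, Sequence


def merge_command(inputs: Sequence[Path], output: Path) -> List[str]:
    """Build a bcftools merge call that writes a bgzip-compressed VCF."""
    if len(inputs) < 2:
        raise ValueError("bcftools merge needs at least two input files")
    return ["bcftools", "merge", "-O", "z", "-o", str(output), *map(str, inputs)]


def merge_groups(groups: Dict[str, Sequence[Path]], work_dir: Path) -> Dict[str, Path]:
    """Merge each group's VCFs in its own thread, e.g. keys 'case' and 'control'."""
    outputs = {name: work_dir / f"{name}Merge.vcf.gz" for name in groups}

    def run(name: str) -> None:
        # check=True raises CalledProcessError if bcftools exits with an error
        subprocess.run(merge_command(groups[name], outputs[name]), check=True)

    with ThreadPoolExecutor(max_workers=2) as pool:
        list(pool.map(run, groups))  # list() surfaces any worker exception
    return outputs
```

Threads suffice here because the work happens in the external bcftools processes; the Python side only waits for them.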

3. Extract the variant count of each site from the merged variant files

4. Combine the two per-site variant count files, calculate the difference in variant count at each site, and sort by the size of the difference

5. Run the chi-square test on the variant sites and select the sites with P less than 0.5

6. Calculation complete

The following result files are saved in the working directory:

├──caseCounts.csv
├──caseMerge.vcf.gz
├──controlCounts.csv
├──controlMerge.vcf.gz
├──merge.csv
├──merge_sort.csv
└──merge_sort_pv.csv
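Steps 4 and 5 can be sketched in Python as follows. Several points are assumptions and not the program described here: counts are carrier counts per site, the difference is case minus control and is sorted by absolute size, and the test is Pearson's chi-square on a 2x2 table without continuity correction. With one degree of freedom its P value equals erfc(sqrt(chi2 / 2)), so only the standard library is needed. The cutoff defaults to the 0.5 stated in step 5 and is a parameter.

```python
import math
from typing import Dict, List, NamedTuple


class SiteResult(NamedTuple):
    site: str
    case_count: int
    control_count: int
    diff: int
    p_value: float


def chi_square_p(a: int, b: int, c: int, d: int) -> float:
    """P value of Pearson's chi-square test (1 df) on the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # an empty margin makes the test uninformative
    chi2 = n * (a * d - b * c) ** 2 / denom
    return math.erfc(math.sqrt(chi2 / 2.0))


def compare_groups(
    case_counts: Dict[str, int],
    control_counts: Dict[str, int],
    n_case: int,
    n_control: int,
    p_threshold: float = 0.5,  # cutoff as stated in step 5 of the text
) -> List[SiteResult]:
    """Combine counts, sort by absolute difference, keep sites below the cutoff."""
    results: List[SiteResult] = []
    for site in sorted(set(case_counts) | set(control_counts)):
        a = case_counts.get(site, 0)     # case carriers
        c = control_counts.get(site, 0)  # control carriers
        p = chi_square_p(a, n_case - a, c, n_control - c)
        results.append(SiteResult(site, a, c, a - c, p))
    results.sort(key=lambda r: abs(r.diff), reverse=True)
    return [r for r in results if r.p_value < p_threshold]
```

The sorted list before filtering corresponds loosely to `merge_sort.csv` and the filtered list to `merge_sort_pv.csv`. When expected cell counts are small, Fisher's exact test is the usual alternative to the chi-square approximation.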

In summary, with the above technical solution the invention searches for susceptibility genes for genetic diseases and traits by screening genomic SNP variation rates. Batch between-group comparison of SNP variation rates improves the efficiency of screening disease-risk genes, finds associations between genetic markers and diseases at lower cost and with higher efficiency, and provides further clues to the pathogenesis of complex diseases.

The above describes only preferred embodiments of the invention and is not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A batch between-group comparative analysis method for SNP variation rates, characterized by comprising the following steps:

S1, case grouping: dividing the medical records into a case group and a control group according to inclusion and exclusion criteria;

S2, within-group SNP variation data integration: VCF files store the variant results obtained by comparing the complete sequencing data of the case group and the control group against the reference genome, and the SNP variation data of the case group and of the control group are integrated separately;

S3, calculating the variant count of each SNP in the integrated data: calculating, for each SNP variant, its count within the case group and within the control group;

S4, calculating the variation frequency of each SNP in the integrated data: calculating, for each SNP variant, its frequency within the case group and within the control group;

S5, between-group analysis of differences in SNP variation frequency: applying the chi-square test in batch to assess whether the variation frequency of each SNP differs significantly between the case group and the control group.

2. The batch between-group comparative analysis method for SNP variation rates according to claim 1, wherein the inclusion and exclusion criteria in step S1 comprise inclusion criteria and exclusion criteria.

3. The batch between-group comparative analysis method for SNP variation rates according to claim 1, wherein the inclusion and exclusion criteria in step S1 take the form of unstructured free text.

4. The batch between-group comparative analysis method for SNP variation rates according to claim 1, wherein the chi-square test in step S5 measures the degree of deviation between the actually observed values of the statistical sample and the theoretically expected values, and this degree of deviation determines the magnitude of the chi-square value.