CN106326689A

Info

Publication number: CN106326689A
Application number: CN201510358145.3A
Authority: CN
Inventors: 陈伟芬; 余胜; 王莹; 王崇志; 何伟明
Original assignee: BGI Technology Solutions Co Ltd
Current assignee: BGI Technology Solutions Co Ltd
Priority date: 2015-06-25
Filing date: 2015-06-25
Publication date: 2017-01-11

Abstract

The invention discloses a method and device for determining a site subject to selection in a population. The method includes the following steps: acquiring nucleic acid sequencing data of a population sample, wherein the population sample comes from a plurality of individuals of one species and can be divided into 2n first-level subpopulations according to n pairs of preset indices, n being a natural number; performing detection on the nucleic acid sequencing data to acquire population SNP data, wherein the population SNP data includes a plurality of first-level subpopulation SNP data; and comparing the differences in polymorphism between different first-level subpopulations on the basis of the population SNP data to determine an SNP subject to selection, wherein the SNP subject to selection is the site subject to selection. The invention further provides a device and a system for determining a site subject to selection in a population. The method, device and/or system can accurately determine the site subject to selection.

Description

Translated from Chinese

Method and apparatus for determining sites subject to selection in a population

Technical field

The present invention relates to the field of biology, in particular to population genetics, and more particularly to a method and a device for determining a site subject to selection in a population.

Background

With the maturation of next generation sequencing (NGS) technology and the gradual reduction of its cost, research techniques built on it for different purposes keep emerging. RNA-Seq is an NGS-based technique that sequences the transcriptome of a sample, mainly to reveal gene expression patterns in the sample, and it is now widely used. RNA-Seq sequencing data can also be used to detect polymorphic sites, including SNP sites, across the transcribed regions of the whole genome.

Summary of the invention

According to one aspect, the present invention provides a method for determining a site subject to selection in a population, the selection including at least one of artificial selection and natural selection. The method comprises the following steps: (1) obtaining nucleic acid sequencing data of a population sample, wherein the population sample comes from multiple individuals of one species, optionally from the same tissue or the same part of multiple individuals of one species, and the population sample can be divided into 2n first-level subpopulations according to n pairs of predetermined indices, n being a natural number; (2) detecting, based on the nucleic acid sequencing data in (1), to obtain population SNP data, the population SNP data including multiple first-level subpopulation SNP data; (3) comparing, based on the population SNP data in (2), the differences in polymorphism between different first-level subpopulations to determine the SNP subject to selection, the SNP subject to selection being the site subject to selection. In one embodiment of the present invention, the nucleic acid sequencing data is obtained by RNA-Seq and is transcript sequencing data. A predetermined index can be any characteristic that differs between two individual samples. In one embodiment of the present invention, the predetermined index is geographic and/or related to a biological trait; for example, different geographic origins or one or more differing traits can serve as indices for the preliminary division of the population. In one embodiment of the present invention, population structure analysis is performed before or after step (3), including: performing population structure analysis on the population sample based on the population SNP data in (2) to obtain a population structure analysis result; optionally, the population structure analysis includes at least one of constructing a phylogenetic tree, principal component analysis and STRUCTURE analysis. In another embodiment of the present invention, the population sample is further re-divided based on the population structure analysis result, the resulting classification of the population replaces the original first-level subpopulations, and step (3) is then performed to determine the sites subject to selection in the population.

According to another aspect, the present invention provides a method for analyzing population structure based on population transcript data, the method comprising: obtaining nucleic acid sequencing data of a population sample, wherein the population sample comes from multiple individuals of one species, optionally from the same tissue or the same part of multiple individuals of one species, and the population sample can be divided into 2n first-level subpopulations according to n pairs of predetermined indices, n being a natural number; detecting, based on the nucleic acid sequencing data, to obtain population SNP data, the population SNP data including multiple first-level subpopulation SNP data; and comparing, based on the population SNP data, the differences in polymorphism between different first-level subpopulations to determine the SNP subject to selection, and/or performing population structure analysis on the population based on the population SNP data.

According to yet another aspect, the present invention provides a device for determining a site subject to selection in a population, the device being used to implement the method of the first aspect described above. The device includes: a data input unit for inputting data; a data output unit for outputting data; a processor for executing a machine-executable program, wherein executing the machine-executable program includes carrying out the method of the first aspect or of any embodiment; and a storage unit, connected to the data input unit, the data output unit and the processor, for storing data, including the machine-executable program. Those skilled in the art will understand that the machine-executable program can be stored in a storage medium, which may include read-only memory, random access memory, a magnetic disk, an optical disk, and the like.

According to a further aspect, the present invention provides a system for determining a site subject to selection in a population, the system being able to implement all or part of the steps of the method of the first aspect or of any embodiment. The system includes: a sequencing data acquisition device for acquiring nucleic acid sequencing data of a population sample, wherein the population sample comes from multiple individuals of one species, optionally from the same tissue or the same part of multiple individuals of one species, and the population sample can be divided into 2n first-level subpopulations according to n pairs of predetermined indices, n being a natural number; an SNP detection device, connected to the sequencing data acquisition device, for detecting, based on the nucleic acid sequencing data, to obtain population SNP data, the population SNP data including multiple first-level subpopulation SNP data; and a target site determination device, connected to the SNP detection device, for comparing, based on the population SNP data, the differences in polymorphism between different first-level subpopulations to determine the SNP subject to selection, the SNP subject to selection being the site subject to selection.

The method, device and/or system of the present invention can accurately determine the sites subject to selection in a population. The method and/or device focuses on the transcribed regions of the genome, which are of more general importance. From the population transcript data it can obtain gene expression data and reveal the gene expression patterns of the samples, which helps reveal gene expression patterns under differing genetic backgrounds and further extends the scope of population research techniques such as RAD and GBS. It can also obtain population SNP data to reveal population structure and the patterns of population genetic evolution. The method, device and/or system of the present invention can be used to standardize the analysis workflow for population transcriptome resequencing, reduce analysis risk, and complete the analysis of population projects with high efficiency, high quality and high standards.

Description of drawings

The above and/or additional aspects and advantages of the present invention will become apparent and easy to understand from the description of the embodiments in conjunction with the following drawings, wherein:

Fig. 1 is a flow chart of the steps of the method for determining sites subject to selection in a population in one embodiment of the present invention.

Fig. 2 is a flow chart of the steps of the method for determining sites subject to selection in a population in one embodiment of the present invention.

Fig. 3 is a flow chart of the steps of the method for determining sites subject to selection in a population in one embodiment of the present invention.

Fig. 4 is a schematic diagram of the device for determining sites subject to selection in a population in one embodiment of the present invention.

Fig. 5 is a schematic diagram of the system for determining sites subject to selection in a population in one embodiment of the present invention.

Fig. 6 is a schematic diagram of the population genetic structure inferred by Frappe from population SNPs in one embodiment of the present invention.

Fig. 7 is a schematic diagram of a phylogenetic tree inferred by the neighbor-joining method from population SNPs in one embodiment of the present invention.

Fig. 8 is a schematic diagram of the PCA results based on population SNPs in one embodiment of the present invention.

Fig. 9 is a schematic diagram of the results of detecting sites subject to selection from population SNPs with the Arlequin program in one embodiment of the present invention.

Fig. 10 is a schematic diagram of the results of detecting sites subject to selection from population SNPs with the Global FST test program in one embodiment of the present invention.

Fig. 11 is a schematic diagram of the results of detecting sites subject to selection from population SNPs with the BayeScan program in one embodiment of the present invention.

Detailed description

Embodiments of the present invention are described in detail below, and examples of the embodiments are shown in the drawings, in which the same or similar reference numerals denote the same or similar elements, or elements having the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present invention, and should not be construed as limiting it. The terms "first-level", "second-level" and the like are used herein only for convenience of description; they should not be understood as indicating or implying relative importance, nor as implying any order. In the description of the present invention, unless otherwise specified, "multiple" means two or more. Unless otherwise clearly specified and limited, terms such as "connected" should be interpreted broadly: a connection may be fixed, detachable or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediary; and it may be an internal communication between two elements.

According to one embodiment of the present invention, as shown in Fig. 1, the present invention provides a method for determining a site subject to selection in a population, the selection including at least one of artificial selection and natural selection. The method comprises the following steps: S10, obtaining nucleic acid sequencing data of a population sample, wherein the population sample comes from multiple individuals of one species, optionally from the same tissue or the same part of multiple individuals of one species, and the population sample can be divided into 2n first-level subpopulations according to n pairs of predetermined indices, n being a natural number; S20, detecting, based on the nucleic acid sequencing data in S10, to obtain population SNP data, the population SNP data including multiple first-level subpopulation SNP data; S30, comparing, based on the population SNP data in S20, the differences in polymorphism between different first-level subpopulations to determine the SNP subject to selection, the SNP subject to selection being the site subject to selection.

According to one embodiment of the present invention, the nucleic acid sequencing data is obtained by RNA-Seq and is transcript sequencing data. Taking multiple individuals of the same species with different genetic backgrounds as the research object, high-throughput sequencing of transcriptome samples yields, in one pass, population-level polymorphism data for the transcribed regions of the genome of that species, including population SNP data and expression information for all genes/transcripts. These data can be used to address biological questions such as the evolutionary relationships and differences in genetic composition among the individuals studied, gene clusters that co-evolved under a specific selection, sites subject to artificial/natural selection in subpopulations, and functional modules and metabolic pathways whose expression differs significantly between individuals or subpopulations. Moreover, compared with conventional transcriptome resequencing of a small number of samples and with population research techniques such as RAD and GBS, the region studied by the present invention is concentrated on the transcribed regions of the genome and gene expression can be quantified. This helps reveal gene expression patterns under differing genetic backgrounds and further extends the scope of population research techniques such as RAD and GBS.

A predetermined index can be any characteristic that differs between two individual samples. According to one embodiment of the present invention, the predetermined index is geographic and/or related to a biological trait; for example, different geographic origins or one or more differing traits can serve as indices for the preliminary division of the population.

According to one embodiment of the present invention, as shown in Fig. 2, the method further includes, before step S30, S23 population structure analysis, which includes: performing population structure analysis on the population sample based on the population SNP data in S20 to obtain a population structure analysis result; optionally, the population structure analysis includes at least one of constructing a phylogenetic tree, principal component analysis (PCA) and Group Structure analysis.

A phylogenetic tree can be constructed with the neighbor-joining method, or the relationships can be built with the MEGA software (http://www.megasoftware.net). The genotypes at all SNP sites of each sample are concatenated into a sequence, one sequence per individual sample, and these sequences serve as the MEGA input file. MEGA builds the relationship tree from the differences between the individual sample sequences, using one of three methods (Maximum likelihood, Least Squares and Maximum parsimony).
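The concatenation step described above can be sketched as follows. This is a minimal illustration, assuming genotype calls are already available as one IUPAC base per SNP site (heterozygous calls written as degenerate bases); the function name `genotypes_to_fasta`, the sample names and the FASTA layout are hypothetical choices for the example and are not taken from the patent or from MEGA's documentation.

```python
def genotypes_to_fasta(genotypes):
    """Concatenate the SNP calls of each sample into one sequence per sample.

    genotypes: dict mapping a sample name to a list of single-base calls,
    one call per SNP site, in the same site order for every sample.
    """
    lengths = {len(calls) for calls in genotypes.values()}
    if len(lengths) > 1:
        raise ValueError("all samples must cover the same SNP sites")
    # One FASTA record per individual sample.
    records = [f">{sample}\n{''.join(calls)}" for sample, calls in genotypes.items()]
    return "\n".join(records) + "\n"


# Hypothetical calls at three SNP sites; R is the IUPAC code for A/G.
example_calls = {
    "sample1": ["A", "G", "R"],
    "sample2": ["A", "A", "G"],
}
fasta_text = genotypes_to_fasta(example_calls)
```

Keeping the site order identical across samples is what makes the sequences comparable column by column, which is why the sketch rejects inputs of unequal length.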

In statistics, principal component analysis (PCA) is a technique for simplifying a data set by means of a linear transformation. The transformation maps the data into a new coordinate system such that the greatest variance of any projection of the data lies on the first coordinate (the first principal component), the second greatest variance on the second coordinate (the second principal component), and so on. PCA is often used to reduce the dimensionality of a data set while retaining the features that contribute most to its variance. This is done by keeping the low-order principal components and ignoring the high-order ones, since the low-order components tend to preserve the most important aspects of the data. Following the reference A tutorial on Principal Components Analysis (Lindsay I Smith, 2002-02) and the characteristics of the real SNP data in the examples, the SNP data is first converted into a numeric matrix, for example by coding a genotype identical to the reference sequence as 0, the opposite genotype as 2 and a degenerate base as 1, and the matrix is then normalized. A linear vector equation is then constructed by the method introduced above, where i, running from 1 to k, denotes the i-th sample. The equation is solved with an R software package to obtain the matrix a, the first four principal component vectors are extracted according to the data characteristics of the samples, and the clustering of the individuals is displayed with these vectors as coordinate axes.
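The encoding and normalization that precede the decomposition can be sketched as below. This is only an illustration under stated assumptions: the text does not say which normalization is used, so the sketch assumes a per-SNP z-score; the helper names `encode_call` and `normalize_columns` and the toy calls are hypothetical. The decomposition itself (done with R in the text) is not reproduced here.

```python
from statistics import mean, pstdev


def encode_call(call, ref, alt):
    """Code a genotype call: 0 = same as reference, 2 = opposite, 1 = degenerate base."""
    if call == ref:
        return 0
    if call == alt:
        return 2
    return 1


def normalize_columns(matrix):
    """Center and scale each SNP column (assumed normalization: z-score)."""
    normalized_columns = []
    for column in zip(*matrix):
        mu = mean(column)
        sd = pstdev(column)
        if sd == 0:
            # A monomorphic site carries no information; map it to zeros.
            normalized_columns.append([0.0] * len(column))
        else:
            normalized_columns.append([(value - mu) / sd for value in column])
    # Transpose back so that rows are samples again.
    return [list(row) for row in zip(*normalized_columns)]


# Hypothetical data: two SNP sites given as (reference base, alternate base).
sites = [("A", "G"), ("C", "T")]
calls = {"s1": ["A", "T"], "s2": ["R", "C"], "s3": ["G", "C"]}

encoded = [
    [encode_call(call, ref, alt) for call, (ref, alt) in zip(sample_calls, sites)]
    for sample_calls in calls.values()
]
normalized = normalize_columns(encoded)
```

The normalized matrix (samples as rows, SNP sites as columns) is the kind of input a PCA routine would take to produce the leading principal components.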

Group Structure analysis can be performed with the Structure software (http://pritch.bsd.uchicago.edu/software/structure2_1.html), which uses the genotyping data of SNP sites to infer whether distinct populations exist and to determine the population to which each individual belongs. Following the software instructions, the genotype file of the population SNPs is converted into the Structure input format, and up to 50,000 simulations are run under the admixture model; assuming that multiple populations exist, the probability that each individual belongs to each (sub)population is calculated. In this way the individuals can be classified. In one embodiment of the present invention, individuals can be further screened on the basis of this classification: for example, according to the population structure analysis result, the individuals are classified, the sample information of each individual is extracted, and questionable individuals, such as those with an unclear classification or obvious outliers, are removed.
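The classification and screening step can be sketched as follows, assuming the per-individual membership probabilities have already been exported from the structure analysis. The function name `assign_subpopulations`, the data layout and the 0.8 cut-off are hypothetical; the text does not specify how an "unclear classification" is decided.

```python
def assign_subpopulations(membership, min_probability=0.8):
    """Assign each individual to its most probable (sub)population.

    membership: dict mapping a sample name to a list of membership
    probabilities, one per inferred (sub)population.
    min_probability: hypothetical cut-off below which a classification
    is treated as unclear and the individual is set aside.
    """
    assigned = {}
    excluded = []
    for sample, probabilities in membership.items():
        best = max(range(len(probabilities)), key=probabilities.__getitem__)
        if probabilities[best] >= min_probability:
            assigned[sample] = best
        else:
            excluded.append(sample)
    return assigned, excluded


# Hypothetical membership probabilities for two inferred (sub)populations.
membership = {
    "s1": [0.95, 0.05],
    "s2": [0.10, 0.90],
    "s3": [0.55, 0.45],
}
assigned, excluded = assign_subpopulations(membership)
```

In this toy input `s3` is split almost evenly between the two groups, so it is set aside rather than forced into either one.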

According to one embodiment of the present invention, the population sample is further re-divided based on the population structure analysis result, and the new subpopulations so obtained replace the original first-level subpopulations. Step S30 is then performed on the new subpopulations and their SNP data to determine the sites subject to selection in the population. Reclassifying the population/subpopulations according to the population structure analysis result in this way helps determine the sites subject to selection accurately.

According to one embodiment of the present invention, as shown in Fig. 3, the method further includes, after step S30, S23 population structure analysis, which includes: performing population structure analysis on the population sample based on the population SNP data in S20 to obtain a population structure analysis result; optionally, the population structure analysis includes at least one of constructing a phylogenetic tree, principal component analysis (PCA), Group Structure analysis and Frappe detection of population genetic structure.

According to one embodiment of the present invention, the nucleic acid sequencing data of the population sample consists of the nucleic acid sequencing data of each individual sample in the population sample, and the nucleic acid sequencing data of each individual sample is required to be no less than 4G. This facilitates accurate SNP detection, which in turn allows the sites subject to selection to be determined accurately from accurate population SNP data.

According to one embodiment of the present invention, the population sample comes from individuals of the same species with different genetic backgrounds. For population sample analysis, it is recommended that the population sample contain no fewer than 30 individual samples, and that all individuals involved can be divided, according to some index, into at least two subpopulations (the first-level subpopulations) to allow the subsequent difference analysis. According to one embodiment of the present invention, each first-level subpopulation preferably includes at least 10 individual samples, to facilitate the difference analysis. According to one embodiment of the present invention, all individual samples are cultured under the same conditions and then sampled from the same tissue or part to obtain the population sample. This makes population analysis based on the population sample data, including differential gene expression analysis, meaningful: the genetic differences between the individual samples are already one variable, and sampling under identical conditions allows the resulting differentially expressed genes to be explained in terms of genetic differences, whereas additional variables would make the cause of the differential expression ambiguous. For example, if the study population can be divided into saline-alkali-tolerant and saline-alkali-sensitive groups, all individuals grown in the same environment can be treated with the same amount of saline and the root tips sampled at a specific time after treatment (for example, 1 hour). The differentially expressed genes identified by the subsequent population analysis may then help reveal the mechanism of saline-alkali tolerance in the species, and the differential expression can be attributed to differences in genetic background.
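The sample-size recommendations above (no fewer than 30 individuals, at least two first-level subpopulations, at least 10 individuals in each) can be checked before analysis with a small helper. This is an illustrative sketch; the function name `check_sampling_design` and the sample labels are hypothetical.

```python
from collections import Counter


def check_sampling_design(group_of, min_individuals=30, min_per_group=10):
    """Return a list of problems; an empty list means the design meets the recommendations.

    group_of: dict mapping each individual to its first-level subpopulation label.
    """
    problems = []
    sizes = Counter(group_of.values())
    if len(group_of) < min_individuals:
        problems.append(
            f"only {len(group_of)} individuals, need at least {min_individuals}"
        )
    if len(sizes) < 2:
        problems.append("need at least two first-level subpopulations")
    for group, size in sorted(sizes.items()):
        if size < min_per_group:
            problems.append(
                f"subpopulation {group!r} has {size} individuals, "
                f"need at least {min_per_group}"
            )
    return problems


# Hypothetical design: 18 tolerant and 12 sensitive individuals.
design = {f"tolerant_{i}": "tolerant" for i in range(18)}
design.update({f"sensitive_{i}": "sensitive" for i in range(12)})
issues = check_sampling_design(design)
```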

According to one embodiment of the present invention, a first-level subpopulation includes at least one second-level subpopulation; optionally, a second-level subpopulation includes at least 10 individuals. Second-level subpopulations can be obtained by dividing the first-level subpopulations with one or more indices different from those used to divide the population. The method of any embodiment of the present invention can accurately determine the sites subject to selection in the multi-level subpopulations obtained after repeated division.

According to one embodiment of the present invention, comparing the differences in polymorphism between different first-level subpopulations based on the population SNP data to determine the SNP subject to selection includes: based on the population SNP data, using at least two test methods to compare the difference in heterozygosity of the same SNP site between the different first-level subpopulations, and determining SNP sites supported by at least two test methods as SNPs subject to selection; optionally, the test methods include F-statistics, analysis of molecular variation and hierarchical Bayesian methods. In some embodiments of the present invention, two or all three of the Arlequin, Global FST test and BayeScan programs are used, or at least two or all three of the Arlequin, BayeScan and Datacal methods are used, to judge the degree of difference in heterozygosity at the sites compared. When an SNP site is supported by at least two, or all three, of the three test methods, that is, when at least two of the tests find the difference in heterozygosity of the SNP between subpopulations to be significant, the SNP is determined to be a site subject to selection. This favors an accurate determination.
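The "supported by at least two test methods" rule amounts to a simple vote over the per-method result sets. The sketch below assumes each program's significant SNPs have already been collected into a set of identifiers; the function name `consensus_snps` and the SNP identifiers are hypothetical, and no output format of Arlequin, Global FST test or BayeScan is implied.

```python
from collections import Counter


def consensus_snps(significant_by_method, min_support=2):
    """Return the SNPs reported as significant by at least min_support methods."""
    support = Counter()
    for snps in significant_by_method.values():
        # set() guards against a method listing the same SNP twice.
        support.update(set(snps))
    return sorted(snp for snp, votes in support.items() if votes >= min_support)


# Hypothetical per-method results.
significant_by_method = {
    "Arlequin": {"snp1", "snp2"},
    "Global FST test": {"snp2", "snp3"},
    "BayeScan": {"snp1", "snp2"},
}
selected = consensus_snps(significant_by_method)
```

Raising `min_support` to 3 would keep only the SNPs on which all three methods agree, which is the stricter of the two options the text allows.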

According to one embodiment of the present invention, using at least two test methods to compare the difference in heterozygosity of the same SNP site between the different first-level subpopulations, and determining SNP sites supported by at least two test methods as SNPs subject to selection, includes: calculating the heterozygosity difference value of the SNP site between different first-level subpopulations, and determining SNP sites whose heterozygosity difference value is not less than a threshold as sites subject to selection. In one embodiment of the present invention, the heterozygosity difference value is expressed as F_ST (fixation index). F_ST can be used to evaluate the genomic distance and the differences between populations and is an index of the degree of differentiation between populations; it was developed from a special case of the F-test applied by Sewall Wright in 1922. The null hypothesis of F_ST is that, when the populations are not differentiated, the minor allele frequency of a polymorphic site does not differ significantly within and between (sub)populations. There are many ways to calculate F_ST; although the specific calculations differ, the underlying theory is the same, namely the definition given by Hudson (1992): $F_{ST} = (\Pi_{Between} - \Pi_{Within}) / \Pi_{Between}$. Here, Π_Between is obtained by drawing one sample from each of two subpopulations to form a pair, calculating the difference in SNP genotype of the pair, doing the same for all such pairs, and averaging. Π_Within is obtained by drawing two samples from within one subpopulation to form a pair, calculating the difference in SNP genotype of the pair, doing the same for all such pairs, and averaging. If there are two subpopulations, Π_Within can first be calculated for each subpopulation and then accumulated. In this embodiment, combining the structure of the existing subpopulation SNP data with the above principle, the formula is derived as follows:

$F_{ST} = \frac{\Pi_{Between} - \Pi_{Within}}{\Pi_{Between}} = 1 - \frac{\Pi_{Within}}{\Pi_{Between}} = 1 - \frac{\left[\sum_{j} \binom{n_{j}}{2} \sum_{i} 2 \frac{n_{ij}}{n_{ij} - 1} x_{ij} (1 - x_{ij})\right] / \sum_{j} \binom{n_{j}}{2}}{\sum_{i} 2 \frac{n_{i}}{n_{i} - 1} x_{i} (1 - x_{i})},$ where x_ij is the frequency of the minor allele (second base) of SNP site i in subpopulation j, n_ij is the physical position on the chromosome of SNP site i in subpopulation j, and n_j is the total number of SNP sites used for comparative analysis in subpopulation j. In one embodiment of the present invention, the three methods Arlequin, BayeScan and Datacal are used to test the difference in minor allele frequency of a SNP site between subpopulations, with significance thresholds of 0.05, 0.1 and 0.01, respectively.
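The rule that a SNP counts as selected only when at least two tests support it can be illustrated with a minimal sketch. The function name, the input layout and the example scores are hypothetical; treating a score at or below a method's threshold as "significant" is an assumption of this sketch, since the text only gives the thresholds themselves.

```python
# Significance thresholds quoted in the text for the three tests.
THRESHOLDS = {"Arlequin": 0.05, "BayeScan": 0.1, "Datacal": 0.01}


def selected_snps(scores, thresholds=THRESHOLDS, min_support=2):
    """Return SNP ids called significant by at least `min_support` methods.

    `scores` maps method name -> {snp_id: score}. A score at or below the
    method's threshold is treated as significant (an assumption).
    """
    support = {}
    for method, per_snp in scores.items():
        cutoff = thresholds[method]
        for snp_id, score in per_snp.items():
            if score <= cutoff:
                support[snp_id] = support.get(snp_id, 0) + 1
    return sorted(s for s, k in support.items() if k >= min_support)


# Hypothetical scores for three SNPs (not data from the document).
example_scores = {
    "Arlequin": {"snp1": 0.01, "snp2": 0.20, "snp3": 0.04},
    "BayeScan": {"snp1": 0.05, "snp2": 0.08, "snp3": 0.50},
    "Datacal": {"snp1": 0.02, "snp2": 0.005, "snp3": 0.30},
}
print(selected_snps(example_scores))  # ['snp1', 'snp2']
```

Raising `min_support` to 3 would require agreement of all three tests, which is the stricter of the two options the text allows.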

According to an embodiment of the present invention, the present invention provides a method for analyzing population structure based on population transcript data, the method including: obtaining nucleic acid sequencing data of population samples, the population samples coming from multiple individuals of one species and, optionally, from the same tissue or the same body part of those individuals, the population samples being divisible into 2n first-level subpopulations according to n pairs of predetermined indicators, n being a natural number; detecting, on the basis of the nucleic acid sequencing data, to obtain population SNP data, the population SNP data including SNP data of a plurality of first-level subpopulations; and, on the basis of the population SNP data, comparing the differences in polymorphism among the different first-level subpopulations to determine the SNPs subject to selection, and/or performing population structure analysis on the population.

According to an embodiment of the present invention, as shown in FIG. 4, the present invention provides a device 100 for determining sites subject to selection in a population. The device 100 is used to implement the method of the above aspect of the present invention for determining sites subject to selection in a population, and includes: a data input unit 110 for inputting data; a data output unit 120 for outputting data; a processor 130 for executing a machine-executable program, execution of which includes carrying out the method of the above aspect or of any embodiment of the present invention; and a storage unit 140, connected to the data input unit 110, the data output unit 120 and the processor 130, for storing data, including the machine-executable program. Those skilled in the art will understand that the machine-executable program may be stored in a storage medium, which may include a read-only memory, a random access memory, a magnetic disk, an optical disc, or the like.

According to an embodiment of the present invention, as shown in FIG. 5, the present invention provides a system 1000 for determining sites subject to selection in a population. The system can be used to implement all or some of the steps of the method of the above aspect or of any embodiment of the present invention, and includes: a sequencing data acquisition device 1100 for acquiring nucleic acid sequencing data of population samples, the population samples coming from multiple individuals of one species and, optionally, from the same tissue or the same body part of those individuals, the population samples being divisible into 2n first-level subpopulations according to n pairs of predetermined indicators, n being a natural number; a SNP detection device 1200, connected to the sequencing data acquisition device 1100, for detecting, on the basis of the nucleic acid sequencing data, to obtain population SNP data, the population SNP data including SNP data of a plurality of first-level subpopulations; and a target site determination device 1300, connected to the SNP detection device 1200, for comparing, on the basis of the population SNP data, the differences in polymorphism among the different first-level subpopulations so as to determine the SNPs subject to selection, the SNPs subject to selection being the sites subject to selection.

The method, device and/or system of any of the above embodiments of the present invention can accurately determine the sites subject to selection in a population. The method and/or device of the present invention focuses on the transcribed regions of the genome, which are of more general importance. On the basis of the population transcript data obtained, gene expression data can be obtained and the gene expression patterns of the samples revealed, which helps to reveal gene expression patterns under differing genetic backgrounds and further extends the scope of population study techniques such as RAD and GBS. In addition, population SNP data can be obtained to reveal population structure and the patterns of population genetic evolution. The method, device and/or system of the present invention can be used to standardize the analysis workflow of population transcriptome resequencing, reduce analysis risk, and complete the analysis of population projects with high efficiency, high quality and to a high standard.

The method for determining sites subject to selection, and the population project analysis device and/or system, of the present invention are described in detail below with reference to the accompanying drawings and specific sample data. The embodiments described with reference to the drawings are exemplary, serve only to explain the present invention, and are not to be construed as limiting it. Unless otherwise stated, the reagents, sequences (adapters, tags and primers), software and instruments not specifically described in the following embodiments are conventional commercially available products or open source, for example a transcriptome library construction kit purchased from Illumina.

Embodiment 1

Reference sequence, sequencing strategy, sample requirements and other considerations:

i) Reference sequence: a relatively high-quality genome reference sequence is required.

ii) Sequencing strategy: a PE91 strategy is adopted (paired-end sequencing yielding multiple pairs of paired-end reads, each read 91 bp long), and each sample should reach 4G of data after filtering.

iii) Samples should come from individuals of the same species with different genetic backgrounds.

iv) For the whole study population, 30 or more individuals are recommended. In addition, all individuals involved should be divisible, according to some indicator, into two or more subpopulations (to facilitate difference analysis), and each subpopulation preferably contains more than 10 individuals.

v) All samples are cultured under the same conditions and then sampled from the same tissue or body part. The reason is that the genetic differences between samples (the variable) already exist; only when sampling is done under the same conditions can the resulting differentially expressed genes be interpreted in terms of genetic differences. Otherwise, the presence of multiple variables makes the cause of differential expression ambiguous. For example, a study population may be divided into a saline-alkali-tolerant group and a non-tolerant group. All individuals grown in the same environment can be treated with the same dose of salt water, and root tips sampled at a specific time after treatment (for example, 1 hour). The differentially expressed genes subsequently identified may then reveal the mechanism of saline-alkali tolerance in the species, because the differential expression results from differences in genetic background.
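As an illustration only, the sample-size recommendations in item iv) could be checked before sequencing with a sketch like the following. The function name, argument names and the `{sample_id: subpopulation}` input format are hypothetical; the default limits simply restate the recommendations (at least 30 individuals, at least two subpopulations, more than 10 individuals each).

```python
from collections import Counter


def check_design(sample_to_group, min_total=30, min_groups=2, min_per_group=11):
    """Return a list of warnings for a {sample_id: subpopulation} assignment."""
    warnings = []
    sizes = Counter(sample_to_group.values())
    if len(sample_to_group) < min_total:
        warnings.append(f"only {len(sample_to_group)} individuals (< {min_total})")
    if len(sizes) < min_groups:
        warnings.append(f"only {len(sizes)} subpopulation(s) (< {min_groups})")
    for group, size in sorted(sizes.items()):
        if size < min_per_group:
            warnings.append(f"subpopulation {group} has {size} individuals (< {min_per_group})")
    return warnings


# Hypothetical design: 12 individuals, 10 in group A and 2 in group B.
design = {f"s{k}": ("A" if k < 10 else "B") for k in range(12)}
for message in check_design(design):
    print(message)
```

An empty list means the design meets all three recommendations.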

To standardize the analysis workflow of population transcriptome resequencing projects and reduce analysis risk, so that projects are completed with high efficiency, high quality and to a high standard, a population transcriptome resequencing analysis method is proposed here, which mainly includes:

1. Experimental procedure

Total RNA is extracted from the sample and DNA is digested with DNase I. Eukaryotic mRNA is then enriched with magnetic beads carrying Oligo(dT) (for prokaryotes, rRNA is removed with a kit before proceeding to the next step). Fragmentation reagent is added and the mRNA is broken into short fragments at a suitable temperature in a Thermomixer. First-strand cDNA is synthesized using the fragmented mRNA as template, and a second-strand synthesis reaction system is then prepared to synthesize second-strand cDNA. The product is purified and recovered with a kit, the sticky ends are repaired, a base "A" is added to the 3' end of the cDNA and adapters are ligated; fragment size selection is then performed, followed by PCR amplification. After the constructed library passes quality control on an Agilent 2100 Bioanalyzer and an ABI StepOnePlus Real-Time PCR System, it is sequenced on an Illumina HiSeq™ 2000 or another sequencer.

2. Information analysis content

1) Standard RNA-Seq analysis

This includes data filtering, gene expression quantification, identification of differentially expressed genes between groups with GO and KEGG Pathway enrichment analysis, SNP calling and annotation, and so on.

2) Analysis based on population SNP data

Starting from the prediction of the consensus sequence of each sample in the standard RNA-Seq analysis, which is an intermediate step of SNP calling, population-level SNP data are compiled and used for the following analyses:

a. Population structure analysis: this includes phylogenetic tree construction, principal component analysis (PCA) and STRUCTURE analysis. All three reflect the structure of the population, but each has a different emphasis. Phylogenetic tree construction focuses on revealing the evolutionary relationships among individuals in the population; PCA focuses on revealing the main factors underlying differences in genetic background among individuals; STRUCTURE analysis focuses on comparing and quantifying the genetic composition of each individual and showing graphically the similarities and differences in genetic composition among individuals.

b. Detection of sites subject to selection: selection (artificial or natural) usually plays a very important role in population differentiation (the formation of subpopulations). Starting from the SNP data of the subpopulations, the difference in polymorphism between subpopulations (F_ST) can be computed for every site, and sites with significantly different F_ST can be identified by testing. As potential sites subject to selection, these sites can help researchers better understand the process of selection acting on particular subpopulations.

F_ST (fixation index) is mainly used to evaluate the genomic distance between groups and the differences between populations, and is a measure of the degree of differentiation between populations; it was developed as a special case of the F-statistics introduced by Sewall Wright in 1922.

The null hypothesis for F_ST is that, when the populations are not differentiated, the minor allele frequency at a polymorphic site does not differ significantly within and between groups. There are many ways to calculate F_ST; although the specific calculations differ, the underlying theory is the same, namely the definition given by Hudson (1992): $F_{ST} = \frac{\Pi_{Between} - \Pi_{Within}}{\Pi_{Between}},$

where Π_Between is obtained by drawing one sample from each of two populations to form a pair, calculating the difference between the SNP genotypes of the pair, repeating this for all such pairs, and taking the average.

Π_Within is obtained by drawing two samples from a single population to form a pair, calculating the difference between the SNP genotypes of the pair, repeating this for all such pairs, and taking the average.

If there are two populations, Π_Within can first be calculated for each population separately and then accumulated.
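The pairwise definitions of Π_Between and Π_Within can be made concrete with a small sketch. Several choices here are assumptions rather than part of the text: each sample is represented as a string with one genotype code per SNP site, the "difference" of a pair is the number of sites at which the codes differ, and "accumulating" Π_Within over two populations is read as averaging over all within-population pairs pooled together. The sample strings are invented for illustration.

```python
from itertools import combinations, product


def pair_diff(a, b):
    """Number of SNP sites at which two samples carry different genotype codes."""
    return sum(1 for ga, gb in zip(a, b) if ga != gb)


def pi_between(pop_a, pop_b):
    """Average difference over all pairs with one sample from each population."""
    diffs = [pair_diff(a, b) for a, b in product(pop_a, pop_b)]
    return sum(diffs) / len(diffs)


def pi_within(*pops):
    """Average difference over all pairs drawn from within a single population,
    pooled across the populations given (assumed reading of 'accumulated')."""
    diffs = [pair_diff(a, b) for pop in pops for a, b in combinations(pop, 2)]
    return sum(diffs) / len(diffs)


def fst(pop_a, pop_b):
    """F_ST = (Pi_Between - Pi_Within) / Pi_Between for two populations."""
    between = pi_between(pop_a, pop_b)
    return (between - pi_within(pop_a, pop_b)) / between


# Hypothetical genotype strings: four SNP sites, two samples per population.
pop_a = ["AAAA", "AAAT"]
pop_b = ["TTTT", "TTTA"]
print(pi_between(pop_a, pop_b))  # 3.5
print(pi_within(pop_a, pop_b))   # 1.0
print(fst(pop_a, pop_b))
```

With these inputs the between-population pairs differ at 4, 3, 3 and 4 sites, and each within-population pair differs at 1 site, so F_ST is (3.5 - 1) / 3.5.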

3) Additional analysis based on gene expression data

a. Cluster analysis and PCA: on the basis of the gene expression data, the individuals in the population can be clustered and subjected to PCA, showing the differences between individuals at the level of gene expression. These results can be cross-checked and compared with the phylogenetic tree and PCA results built from the SNP data.

b. Co-expression gene network construction and comparison between groups: in many biological processes, multiple genes (co-expression genes) are expressed in a coordinated manner under many conditions in order to carry out particular functions. Starting from the gene expression data of multiple different individuals, many modules of co-expressed genes can be constructed. On this basis, researchers can analyze: i) which co-expressed gene modules are active (expressed at a higher level) under specific conditions, which helps in understanding the gene expression patterns behind those conditions; ii) in which specific individual or individuals each co-expressed gene module is active, which helps in working out the biological functions of some of the modules; iii) the co-expressed gene modules constructed above can also be compared between subpopulations. Comparing individuals at the higher level of co-expressed gene modules can reveal information that cannot be seen in conventional differential gene expression data, which assume that genes are independent of one another and ignore their interactions.

In the above, multiple individuals of the same species with different genetic backgrounds are taken as the study object. By high-throughput sequencing of transcriptome samples, polymorphism data for the transcribed regions of the genome at the population level (population SNPs) and expression information for all genes and transcripts are obtained for the species in a single experiment. This in turn can reveal (i) the evolutionary relationships and differences in genetic composition among the individuals studied, (ii) gene clusters that have co-evolved under a particular selection pressure, (iii) sites subject to artificial or natural selection in the subpopulations, and (iv) functional modules and metabolic pathways whose expression differs significantly between individuals or subpopulations. Compared with conventional transcriptome resequencing of a small number of samples, the method also yields population SNP data, which can be used to reveal population structure, the evolutionary history of the population, the evolutionary relationships of each individual in the population, and potential sites subject to selection. Compared with population study techniques such as RAD and GBS, the method concentrates on the transcribed regions of the genome, which are of more general importance. In addition, the present invention can quantify gene expression, which helps to reveal gene expression patterns under differing genetic backgrounds and further extends the scope of population studies such as RAD and GBS.

Embodiment 2

A detailed example of the step-by-step procedure follows:

1. Conventional transcriptome resequencing workflow

Blood or tissue samples were obtained from giant pandas from different regions, namely Qinling, Minshan, Liangshan, Qionglai and Xiangling, 34 samples in total: 2 from Liangshan, numbered GP37 and GP52 (both blood samples); 7 from Minshan, numbered GP14-19 and GP51 (all blood samples); 8 from Qinling, numbered GP3-8 (blood samples), GP10 (tissue sample) and GP12 (blood sample); 15 from Qionglai, numbered GP2, GP13, GP22-31, GP33 and GP35-36 (all blood samples); and 2 from Xiangling, numbered GP38-39 (both blood samples). Transcriptome nucleic acid extraction, library construction and sequencing were carried out as in the preceding embodiment to obtain sequencing data for each sample. The 34 samples were divided by region into 5 first-level subpopulations.

Data filtering and quality control are completed, and the clean sequencing data (clean data) are aligned to the genome reference sequence, for example with SOAP or BWA using default settings, and SNPs are called for each sample. The clean data are aligned to the gene-set reference sequence, the expression level of each gene is calculated, and differentially expressed genes between groups are identified and subjected to GO and KEGG pathway enrichment analysis. The clean data are aligned to the genome reference sequence again, for example with TopHat or STAR, to predict alternative splicing and novel transcripts, and various statistics are compiled, including raw and filtered data volume, read mapping information, genome coverage and library randomness assessment plots.

2. Calling population SNPs and population evolution analysis based on population SNPs

Starting from the consensus information of each individual relative to the genome reference sequence obtained in the previous step (that is, the cns files output by SOAPsnp), population SNP data are assembled across all individuals, that is, the union of the SNPs of all individual samples is taken as the population SNP data. Population evolution analysis is carried out on the basis of these population SNPs and includes phylogenetic tree construction, principal component analysis, analysis of individual genetic composition, and so on. This workflow requires a few simple configuration files, described as follows:
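Taking the union of per-sample SNPs can be sketched as below. This is not the SOAPsnp cns format: the in-memory layout `{sample_id: {(chromosome, position): genotype}}`, the function name and the placeholder for a sample with no call at a site are all assumptions made for illustration, as are the genotype values.

```python
def merge_population_snps(per_sample, sample_order, missing="-"):
    """Union per-sample SNP calls into one population-level table.

    per_sample: {sample_id: {(chrom, pos): genotype}}
    Returns {(chrom, pos): [genotype for each sample in sample_order]},
    with `missing` where a sample has no call at that site.
    """
    sites = sorted(set().union(*(calls.keys() for calls in per_sample.values())))
    return {
        site: [per_sample[s].get(site, missing) for s in sample_order]
        for site in sites
    }


# Hypothetical calls for two of the sample IDs used in this embodiment.
calls = {
    "GP2": {("chr1", 100): "AG"},
    "GP3": {("chr1", 100): "AA", ("chr1", 250): "CT"},
}
table = merge_population_snps(calls, ["GP2", "GP3"])
print(table)
```

In a real run the consensus file would supply a genotype for every covered position, so the placeholder would only be needed for uncovered sites.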

individual.txt: sample (individual) information file; each line holds the information for one sample in 6 columns, as shown in Table 1.

Table 1

snp.lst: list of population SNP (genotype) files; the format of a population SNP file is shown in Table 2.

Table 2

| Column | Content |
| --- | --- |
| First column | Chromosome number |
| Second column | Allele position |
| Third column | Nucleotide at the corresponding reference sequence position |
| Fourth column | Genotypes of the sequenced samples, separated by spaces, in the same order as in the individual file |
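Reading one record in the layout of Table 2 might look like the following sketch. It assumes that all columns are whitespace-separated and that genotypes are opaque strings; the function name and the example record are hypothetical.

```python
def parse_snp_line(line, sample_ids):
    """Split one population SNP record into the four columns of Table 2.

    Returns (chromosome, position, reference base, {sample_id: genotype}).
    """
    fields = line.split()
    chrom, pos, ref = fields[0], int(fields[1]), fields[2]
    genotypes = fields[3:]
    if len(genotypes) != len(sample_ids):
        raise ValueError(
            f"{chrom}:{pos} has {len(genotypes)} genotypes, "
            f"expected {len(sample_ids)}"
        )
    return chrom, pos, ref, dict(zip(sample_ids, genotypes))


# Hypothetical record with two samples, listed in individual-file order.
record = parse_snp_line("chr1 100 A AG AA", ["GP2", "GP3"])
print(record)
```

The length check enforces the requirement that the genotype order and count match the individual file.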

population.txt: information on the two populations used for the site selection analysis. The first column is the subpopulation name, which may differ from the name used in the individual file; the second column is the abbreviated sample ID, which must appear in the fourth column of the individual file.
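A minimal sketch of loading such a file and enforcing the ID constraint follows. The whitespace-separated layout, the function name and the example lines are assumptions; the sample IDs are taken from the Qinling and Minshan groups of this embodiment.

```python
def read_population(lines, known_ids):
    """Map subpopulation name -> list of sample IDs from population.txt-style lines.

    `known_ids` is the set of abbreviated IDs found in the individual file.
    """
    groups = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        name, sample_id = line.split()[:2]
        if sample_id not in known_ids:
            raise ValueError(f"{sample_id} is not listed in the individual file")
        groups.setdefault(name, []).append(sample_id)
    return groups


# Hypothetical file content.
example_lines = ["Qinling GP3", "Qinling GP4", "", "Minshan GP14"]
groups = read_population(example_lines, {"GP3", "GP4", "GP14"})
print(groups)  # {'Qinling': ['GP3', 'GP4'], 'Minshan': ['GP14']}
```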

*.gff: genome gff file, used in the site selection analysis to identify the gene in which a selected site lies; optional.

1) Calling population SNPs

SOAPsnp is used to detect the SNPs of each sample, and the SNP data of all individual samples are integrated to obtain the population SNP data. Specifically:

The published giant panda genome information (Zhao S, et al. Whole-genome sequencing of giant pandas provides insights into demographic history and local adaptation. Nat Genet. 45(1):67-71 (2013)) is first taken fully into account: the dbSNP data for the panda genome are downloaded from the NCBI website and used as prior probabilities for SOAPsnp and, in accordance with currently established findings, the prior probability of a heterozygous SNP is set to 0.0010 and that of a homozygous SNP to 0.0005. With these parameters set, SOAPsnp is used to compare the filtered data with the panda reference genome, producing CNS files. Because every sample genome contains some regions of low sequencing depth, the genotype likelihood files of all samples are combined and the data of all samples are integrated by maximum likelihood to produce a pseudo-genome containing every site of every sample. The genotype with the highest probability is chosen as the consensus genotype of each sample, and high-quality SNPs are detected using information such as genotype and sequencing depth. Once the consensus sequence of each sample has been obtained, the result is saved in population SNP format, giving the population SNP data.

2) Population evolution analysis

The population SNP results are taken as input and, on that basis, several programs are called in an integrated manner to carry out population evolution analysis, including Tree, PCA, Structure and Frappe analyses, as follows.

The program is named PopuStruct.pl; its parameters are described in Table 3. Note that the population SNP files must correspond to the individual file. Structure takes a long time to run; if time is short, it is advisable to run the population structure analysis with Frappe first to obtain preliminary results.

Table 3

| Parameter | Description |
| --- | --- |
| -indi<s> | Information on each individual in the population; the order of individuals must match the population SNP file. Required. |
| -list<s> | List of population SNP genotype files. Required. |
| -OutDir<s> | Output path; defaults to the current path. |
| -prefix<s> | Prefix for output scripts; default "Pop". |
| -Struct<y/n> | Whether to run population structure analysis with Structure; default "y". |
| -Tree<y/n> | Whether to build a phylogenetic tree; default "y". |
| -Frappe<y/n> | Whether to run population structure analysis with Frappe; default "y". |
| -PCA<y/n> | Whether to run principal component analysis; default "y". |
| -queue<s> | Queue to which jobs are submitted; default bc.q. |
| -project<s> | Value of the -P parameter for job submission; default rdtest. |
| -help | Help information. |
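For illustration, an argument list using only the options of Table 3 could be assembled as follows. The helper name and its keyword arguments are hypothetical, and invoking the script through `perl` is an assumption based on the `.pl` extension; the sketch only builds the list and does not run anything.

```python
def popustruct_command(indi, snp_list, out_dir=".", prefix="Pop",
                       struct=True, tree=True, frappe=True, pca=True):
    """Assemble an argument list from the options documented in Table 3."""
    def yn(flag):
        return "y" if flag else "n"

    return [
        "perl", "PopuStruct.pl",          # interpreter is an assumption
        "-indi", indi, "-list", snp_list,
        "-OutDir", out_dir, "-prefix", prefix,
        "-Struct", yn(struct), "-Tree", yn(tree),
        "-Frappe", yn(frappe), "-PCA", yn(pca),
    ]


# Quick first pass: skip the slow Structure run and rely on Frappe.
cmd = popustruct_command("individual.txt", "snp.lst", struct=False)
print(" ".join(cmd))
```

Skipping Structure in the first pass follows the advice above for cases where time is short.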

输出文件(结果)output file (result)

i)Frappe结果文件和Structure结果文件，可结合excel进行调整和作图。结果如图6所示，图6是Frappe基于群体SNP推测的群体遗传结构示意图，图中，分隔的每块代表一个群体，横坐标代表一个样本，不同分隔块代表K个不同或差异较大的祖先，分析每一个品系的遗传成分中，所具有的每一个假想祖先成分的比例。如果一个样品对应两个不同的分割块，则表示该样品可能是两个亚群之间的中间品种。当K值取得越大时，样品之间的差异性越被放大，分得越细，可根据实际结果来决定K值取到哪就可以完全体现出所有样品的结构关系。图中，K分别取2、3、4和5，可以看出K＝3即将群体分成3个亚群体基本可以完整体现出所有样本的结构关系。i) Frappe result files and Structure result files can be adjusted and drawn in combination with excel. The results are shown in Figure 6. Figure 6 is a schematic diagram of the population genetic structure estimated by Frappe based on the population SNP. In the figure, each separated block represents a population, the abscissa represents a sample, and different separated blocks represent K different or large differences. Ancestry, analyzing the proportion of each imaginary ancestral component in the genetic component of each line. If a sample corresponds to two different divisions, it indicates that the sample may be an intermediate species between the two subpopulations. When the K value is larger, the difference between the samples is more enlarged and the division is finer. It can be determined according to the actual results that the K value can fully reflect the structural relationship of all samples. In the figure, K is 2, 3, 4 and 5 respectively. It can be seen that K=3 means that the group is divided into 3 subgroups, which can basically completely reflect the structural relationship of all samples.

ii)tree结果文件利用mega软件进行调整，结果如图7所示。图7是基于群体SNP采用邻接法推断的系统发生树的示意图，图中，分支距离越近，说明两分支间进化关系越近。对于同一亚群内的样本，应当显示能很好的分在一起或离得不远，通过该图可以说明品种之间的进化关系远近。从图7可看出，该群体可以分成3个亚群体。ii) The tree result file is adjusted using mega software, and the result is shown in Figure 7. Fig. 7 is a schematic diagram of a phylogenetic tree inferred by the neighbor-joining method based on population SNPs. In the figure, the closer the branch distance is, the closer the evolutionary relationship between the two branches is. For the samples in the same subgroup, it should be shown that they can be well grouped together or not far away. This figure can illustrate the evolutionary relationship between varieties. As can be seen from Figure 7, this population can be divided into 3 subpopulations.

iii) The PCA results are plotted with Excel; the result is shown in Figure 8, a schematic diagram of the PCA results based on the population SNPs. Markers of different shapes represent samples from different subpopulations, and each marker represents one sample. The horizontal and vertical coordinates of a point are the elements at that sample's position in the first and second eigenvectors, respectively, and the magnitude of the corresponding eigenvalue represents the proportion of the overall relationship accounted for by that principal component. The figure can be compared with the actual grouping of the samples to judge the quality of the grouping and, in turn, to decide whether the samples should be reclassified to obtain new subpopulations.
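The reading of the ancestry proportions described in item i) can be sketched as follows. This is a minimal illustration only: the sample names, the proportions, and the 0.8 cutoff are hypothetical and are not taken from the Frappe output of this example.

```python
# Hypothetical ancestry proportions for K = 3 (each row sums to 1).
# These values are illustrative and are not data from this document.
ancestry = {
    "sample_A": [0.97, 0.02, 0.01],
    "sample_B": [0.05, 0.90, 0.05],
    "sample_C": [0.48, 0.50, 0.02],
}


def classify(proportions, major_cutoff=0.8):
    """Return (index of the largest ancestry component, admixed flag).

    major_cutoff is an assumed, illustrative threshold: a sample whose
    largest component falls below it is flagged as possibly intermediate
    between subpopulations.
    """
    top = max(range(len(proportions)), key=lambda k: proportions[k])
    return top, proportions[top] < major_cutoff


assignments = {name: classify(p) for name, p in ancestry.items()}
for name, (component, admixed) in assignments.items():
    label = "possibly intermediate" if admixed else "single subpopulation"
    print(f"{name}: component {component}, {label}")
```

A sample such as the hypothetical `sample_C`, split almost evenly between two components, corresponds to the case of a sample spanning two blocks in Figure 6.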

3. Detection of sites under selection

Based on Example 1 and the structure of the population SNP data obtained above, the following formula is derived:

$\begin{matrix} F_{ST} = \frac{\Pi_{Between} - \Pi_{Within}}{\Pi_{Between}} \\ = 1 - \frac{\Pi_{Within}}{\Pi_{Between}} = 1 - \frac{\left[ \underset{j}{\sum} \binom{n_{j}}{2} \underset{i}{\sum} 2 \frac{n_{ij}}{n_{ij} - 1} x_{ij} (1 - x_{ij}) \right] / \underset{j}{\sum} \binom{n_{j}}{2}}{\underset{i}{\sum} 2 \frac{n_{i}}{n_{i} - 1} x_{i} (1 - x_{i})} \end{matrix}$

In the formula above, $x_{ij}$ is the frequency of the minor allele (the second base) of SNP site i in subpopulation j; $n_{ij}$ is the physical position on the chromosome of SNP site i in subpopulation j; and $n_{j}$ is the total number of SNP sites used for comparative analysis in subpopulation j. Based on the population structure analysis above, the variable j is newly set to 3, and the variable i is substituted with the finally determined SNP positions.
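The formula can be evaluated numerically as in the sketch below. Several points are assumptions rather than statements from the text: the code treats $n_{ij}$ and $n_{j}$ as caller-supplied counts greater than 1 (the formula divides by $n_{ij} - 1$ and takes $\binom{n_{j}}{2}$), and because the pooled terms $x_{i}$ and $n_{i}$ are not defined above, it takes $n_{i}$ as the sum of $n_{ij}$ over subpopulations and $x_{i}$ as the $n_{ij}$-weighted mean of $x_{ij}$. The function name `fst` and the input values are hypothetical.

```python
from math import comb


def fst(x, n, n_sub):
    """Evaluate F_ST = 1 - Pi_Within / Pi_Between.

    x[i][j]  : minor-allele frequency of SNP i in subpopulation j
    n[i][j]  : the n_ij term, treated here as a count > 1 (assumption)
    n_sub[j] : the n_j term, treated here as a count >= 2 (assumption)
    """
    sites = range(len(x))
    weights = [comb(nj, 2) for nj in n_sub]

    # Pi_Within: per-subpopulation sums weighted by C(n_j, 2).
    numerator = 0.0
    for j, w in enumerate(weights):
        inner = sum(
            2 * n[i][j] / (n[i][j] - 1) * x[i][j] * (1 - x[i][j])
            for i in sites
        )
        numerator += w * inner
    pi_within = numerator / sum(weights)

    # Pi_Between: pooled n_i and x_i (assumed definitions, see lead-in).
    pi_between = 0.0
    for i in sites:
        n_i = sum(n[i])
        x_i = sum(n[i][j] * x[i][j] for j in range(len(n[i]))) / n_i
        pi_between += 2 * n_i / (n_i - 1) * x_i * (1 - x_i)

    if pi_between == 0:
        raise ValueError("Pi_Between is zero: no polymorphic site supplied")
    return 1 - pi_within / pi_between


# Hypothetical inputs: one SNP, two subpopulations.
fst_fixed = fst([[0.0, 1.0]], [[10, 10]], [10, 10])   # fixed difference
fst_shared = fst([[0.5, 0.5]], [[10, 10]], [10, 10])  # identical frequencies
print(fst_fixed, fst_shared)
```

With identical frequencies in both subpopulations the estimate is close to zero and can be slightly negative, which follows from the small-sample correction factors in the formula.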

The calculation and analysis described above takes the population SNPs as its basis and calls several programs to detect sites between subpopulations that may be under selection. The procedure is named SnpSelect.pl and uses three software methods: Arlequin, BayeScan and Datacal. The parameters of each program, including the threshold settings, are described in Table 4.

perl SnpSelect.pl <snp.list> <individual> <2population.txt> [options]; here the 2population file contains the information on the two subpopulations taking part in the site selection analysis; see the instructions for its format.

Table 4

Output files

i) Arlequin analysis results, shown in Figure 9. Figure 9 shows the results of the Arlequin program detecting sites under selection from the population SNPs. The horizontal axis is the heterozygosity of a given site at the population level, and the vertical axis is the difference in heterozygosity between subpopulations at that site (Fst). The circled points in the upper part are sites under directional selection (q<0.01 or q<0.05), and the circled points in the lower part are sites under balancing selection (q<0.01 or q<0.05).

ii) Global FST test analysis results, shown in Figure 10. Figure 10 shows the results of the Global FST test program detecting sites under selection from the population SNPs. The horizontal axis is the heterozygosity of a given site at the population level, and the vertical axis is the difference in heterozygosity between subpopulations at that site (Fst). Sites whose Fst values fall in the top 1% are regarded as candidate sites, i.e., the points above the horizontal line are the detected sites under selection.

iii) BayeScan analysis results, shown in Figure 11. Figure 11 shows the results of the BayeScan program detecting sites under selection from the population SNPs. The horizontal axis is the heterozygosity of a given site at the population level, and the vertical axis is the base-10 logarithm of the test q value of that site. Sites with q value<0.1 are regarded as candidate sites under selection, i.e., the points to the right of the vertical line in the figure are candidate sites under selection.
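The two numeric cutoffs named in items ii) and iii) can be sketched as follows: a top-1% Fst cutoff and a q value<0.1 filter, with the base-10 logarithm of q computed as it is plotted. The site names and values are hypothetical, and the helper `top_percent_cutoff` is an illustrative name, not part of any of the programs mentioned.

```python
import math


def top_percent_cutoff(values, percent=1):
    """Return the smallest value that still lies in the top `percent` %.

    Integer arithmetic is used for the count so that the number of
    retained sites does not depend on floating-point rounding.
    """
    ranked = sorted(values, reverse=True)
    keep = max(1, len(ranked) * percent // 100)
    return ranked[keep - 1]


# Hypothetical per-site Fst values for 200 sites (not data from this document).
fst_by_site = {f"snp{i}": i / 200 for i in range(200)}
cutoff = top_percent_cutoff(fst_by_site.values(), percent=1)
fst_candidates = {site for site, v in fst_by_site.items() if v >= cutoff}

# Hypothetical q values; sites with q < 0.1 are kept as candidates.
q_by_site = {"snp1": 0.5, "snp198": 0.02, "snp199": 0.001}
q_candidates = {site for site, q in q_by_site.items() if q < 0.1}
log10_q = {site: math.log10(q) for site, q in q_by_site.items()}

print(sorted(fst_candidates), sorted(q_candidates))
```

Ties at the cutoff value are all retained in this sketch, so slightly more than 1% of sites can be returned when several sites share the same Fst.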

Taking Figures 9 to 11 together, in the site selection analysis, sites supported by at least two of the above methods are judged to be the final sites under selection.
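The rule of keeping sites supported by at least two methods can be sketched as a simple vote count. The candidate sets below are hypothetical, and `consensus_sites` is an illustrative helper rather than part of SnpSelect.pl.

```python
from collections import Counter


def consensus_sites(calls_by_method, min_support=2):
    """Return the sites reported by at least `min_support` methods."""
    counts = Counter(
        site
        for sites in calls_by_method.values()
        for site in set(sites)  # count each method at most once per site
    )
    return {site for site, c in counts.items() if c >= min_support}


# Hypothetical candidate sites from each method (not data from this document).
calls = {
    "Arlequin": {"snp7", "snp198", "snp199"},
    "Global FST test": {"snp198", "snp199"},
    "BayeScan": {"snp199", "snp42"},
}
final_sites = consensus_sites(calls, min_support=2)
print(sorted(final_sites))
```

Raising `min_support` to 3 would keep only sites reported by all three methods, a stricter choice than the rule stated above.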

In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described, a person of ordinary skill in the art will understand that various changes, modifications, substitutions and variations can be made to these embodiments without departing from the principle and spirit of the present invention; the scope of the invention is defined by the claims and their equivalents.