KR101879329B1


Info

Publication number: KR101879329B1
Application number: KR1020160073379A
Authority: KR
Inventors: 류근호; 박영준; 류광선; 박현우
Original assignee: 충북대학교 산학협력단
Priority date: 2016-06-13
Filing date: 2016-06-13
Publication date: 2018-07-17
Anticipated expiration: 2036-06-13
Also published as: KR20170140708A

Abstract

Translated from Korean

The present invention relates to a method for simulating RNA-seq expression data for differential gene expression analysis. The method, executed on a computer device, comprises: receiving a gene sample as input; and, if an input gene is not differentially expressed, generating the expression level of the gene from a fixed Poisson distribution.
The RNA-seq gene expression data simulation method of the present invention makes it possible to benchmark the accuracy and reliability of differential expression analysis techniques.

Description


RNA-seq expression data simulation method for differential gene expression analysis, and recording medium thereof

The present invention relates to the generation of simulated RNA-seq expression data, one of the data simulation techniques in bioinformatics, and more particularly to a method for generating simulation data for benchmarking the accuracy and reliability of RNA-seq differential expression analysis techniques.

The rapid advance of next-generation sequencing technology and its falling cost are changing how the complexity and limitations of transcriptomics research are viewed.

In particular, RNA-seq (RNA sequencing), which obtains transcript expression information using deep sequencing, is expected to replace the hybridization-based microarray technology that has been widely used over the past decade.

RNA-seq can be applied to a variety of tasks, such as inferring gene expression, detecting alternative splicing, and discovering novel transcripts, but its main purpose is to detect genes that are differentially expressed between different conditions (for example, a disease state and a normal state).

A typical RNA-seq analysis pipeline for detecting differentially expressed genes is as follows.

First, an RNA sample is converted into cDNA or into RNA fragments with adapters attached to both ends, and the sequences are read by a high-throughput sequencing platform.

Next, the hundreds of thousands of RNA fragment sequences produced by sequencing are mapped to a reference genome or reference transcriptome, gene expression levels are inferred from the number of mapped fragments, and normalization is performed if necessary.

Finally, differentially expressed genes are detected using statistical or data mining techniques.

Recently, new statistical and data mining techniques have been developed for RNA-seq differential expression analysis. One of the most important problems, however, is that the lack of "gold standard" data makes it difficult to evaluate the accuracy and reliability of a technique under development. That is, accuracy and reliability should be evaluated by comparing the genes that are actually differentially expressed with the genes detected by the computational technique, but because no data on the actually differentially expressed genes exist, such an evaluation is difficult.

Korean Patent Laid-Open Publication No. 10-2006-0029597

The present invention has been devised to solve the above problem. Its object is to provide a method for simulating RNA-seq expression data using statistical techniques, in order to overcome the difficulty of comparing RNA-seq differential gene expression analysis techniques that arises from the lack of data containing information on actually differentially expressed genes, and to make the accuracy and reliability of differential expression analysis techniques easy to evaluate.

The objects of the present invention are not limited to the object mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the following description.

To achieve this object, the RNA-seq expression data simulation method of the present invention, executed on a computer device, includes receiving a gene sample as input and, if an input gene is not differentially expressed, generating the expression level of the gene from a fixed Poisson distribution.

If the input gene is differentially expressed, the method further includes generating the expression level of the gene from Poisson distributions that differ for each sample or each replicate (replica).

If the input genes are co-expressed genes, the method further includes generating as many random vectors as there are co-expressed genes using a multivariate normal distribution that satisfies a correlation coefficient matrix, and converting the generated random vectors into vectors that follow Poisson distributions using a cumulative distribution function.

The method further includes a zero-expression gene expression level generation step of selecting arbitrary genes from among the unexpressed genes and assigning a value of 0 to their expression levels.

The method further includes a highly-expressed gene expression level generation step of arbitrarily selecting genes that are neither differentially expressed nor co-expressed and regenerating their expression levels by multiplying the expression level values by an input constant.

In the step of generating the random vector, a p-dimensional random vector is generated using a multivariate normal distribution that satisfies the correlation coefficient matrix and has μ=0 and σ=1. In the step of converting into vectors that follow Poisson distributions, the cumulative normal distribution function is evaluated for each value of the random vector, and for each resulting value the Poisson inverse cumulative probability function LP_i is evaluated using the mean value λ_i, so that the random vector is converted into a vector that follows the Poisson distribution.

LP_i can be expressed by the following equation.

[Equation]

The RNA-seq gene expression data simulation method of the present invention makes it possible to benchmark the accuracy and reliability of differential expression analysis techniques.

In addition, according to the present invention, previously developed differential expression analysis techniques can be compared on the generated simulation data, which overcomes the difficulty of choosing a differential expression analysis technique suited to real data and thus enables analysis tailored to the data.

In addition, according to the present invention, evaluating performance on simulation data can improve the reliability of a new analysis technique.
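
As an illustration of the benchmarking these effects refer to: because the simulation knows which genes were generated as differentially expressed, the output of any detection method can be scored against that known set. The sketch below is a minimal example under assumptions; the function name `benchmark`, the gene identifiers, and the choice of precision and recall as metrics are all hypothetical and are not specified in the patent.

```python
def benchmark(true_de, detected):
    """Compare a method's detected genes with the simulation's known DE genes."""
    true_de, detected = set(true_de), set(detected)
    tp = len(true_de & detected)  # truly DE genes that the method found
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(true_de) if true_de else 0.0
    return {
        "tp": tp,
        "fp": len(detected - true_de),  # reported, but not simulated as DE
        "fn": len(true_de - detected),  # simulated as DE, but missed
        "precision": precision,
        "recall": recall,
    }


# Hypothetical example: four simulated DE genes, three genes reported by a method.
result = benchmark(true_de={"g1", "g2", "g3", "g4"}, detected={"g2", "g3", "g5"})
```

Running the same scoring over several detection methods on the same simulated data set gives a like-for-like comparison of those methods.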

FIG. 1 is a flowchart showing a method for simulating RNA-seq expression data for differential gene expression analysis according to an embodiment of the present invention.
FIG. 2 is pseudocode of an RNA-seq expression data simulation algorithm according to an embodiment of the present invention.

The present invention may be modified in various ways and may have various embodiments, and specific embodiments are illustrated in the drawings and described in detail. This is not intended to limit the present invention to particular embodiments, and the invention should be understood to include all modifications, equivalents, and alternatives falling within its spirit and technical scope.

The terms used in this application are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "comprise" or "have" are intended to specify the presence of the features, numbers, steps, operations, elements, components, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, elements, components, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this application.

In the description with reference to the accompanying drawings, the same components are given the same reference numerals regardless of the figure, and redundant descriptions of them are omitted. In describing the present invention, a detailed description of related known art is omitted where it is judged that it could unnecessarily obscure the gist of the present invention.

The present invention relates to a method for simulating RNA-seq expression data that is executed on a computer device.

In the present invention, the entity that performs the RNA-seq expression data simulation method may be any computer device that carries out the method. That is, a computer, a control unit of a computer, or a processor that carries out the RNA-seq expression data simulation method may be the performing entity.

FIG. 1 is a flowchart showing a method for simulating RNA-seq expression data for differential gene expression analysis according to an embodiment of the present invention.

Referring to FIG. 1, the computer device first receives a gene sample as input and, if an input gene is not differentially expressed (S10), generates the expression level of the gene from a fixed Poisson distribution (S20). For example, where k is an input parameter, the expression levels of the k non-differentially expressed genes are generated from a fixed Poisson distribution, and the same Poisson distribution is used for all samples regardless of condition.

Alternatively, if the input gene is differentially expressed (S20), the expression level of the gene is generated from Poisson distributions that differ for each sample or each replicate (replica) (S22). For example, the expression level of a differentially expressed gene in samples belonging to condition 1 is generated from a Poisson distribution with mean λ₁, and its expression level in samples belonging to condition 2 is generated from a Poisson distribution with mean λ₂.
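
The two cases above can be sketched as follows. This is a minimal, dependency-free illustration rather than the patent's own implementation: the helper names `poisson_sample` and `simulate_gene`, the five-samples-per-condition design, and the numeric means are all assumptions chosen for the example, and the simple inversion sampler is only suitable for moderate Poisson means.

```python
import math
import random


def poisson_sample(lam, rng):
    """Draw one Poisson(lam) count by CDF inversion (suitable for moderate lam)."""
    if not 0 < lam < 700:  # exp(-lam) underflows for very large lam
        raise ValueError("lam must be in (0, 700) for this simple sampler")
    u = rng.random()
    k, p = 0, math.exp(-lam)  # p = P(X = 0)
    cdf = p
    while u > cdf and p > 0.0:
        k += 1
        p *= lam / k  # P(X = k) computed from P(X = k - 1)
        cdf += p
    return k


def simulate_gene(conditions, lam_by_condition, rng):
    """Return one gene's counts, one per sample, given each sample's condition label."""
    return [poisson_sample(lam_by_condition[c], rng) for c in conditions]


rng = random.Random(42)
conditions = [1] * 5 + [2] * 5  # hypothetical design: 5 samples per condition

# Non-differentially expressed gene: one fixed Poisson mean for every sample.
non_de = simulate_gene(conditions, {1: 20.0, 2: 20.0}, rng)

# Differentially expressed gene: mean lambda_1 in condition 1, lambda_2 in condition 2.
de = simulate_gene(conditions, {1: 20.0, 2: 60.0}, rng)
```

Using the same function for both cases makes the distinction explicit: a gene is non-differentially expressed exactly when every condition maps to the same mean.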

Alternatively, if the input genes are co-expressed genes (S30), as many random vectors as there are co-expressed genes are generated using a multivariate normal distribution that satisfies the correlation coefficient matrix (S32).

Then, using a cumulative distribution function, the generated random vectors are converted into vectors that follow Poisson distributions (S34).
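
Steps S32 and S34 can be sketched as below for a single sample: draw a correlated standard normal vector, map each component to a probability with the normal cumulative distribution function, and then map each probability to a count with a Poisson inverse cumulative distribution function. This is a sketch under assumptions, not the patent's code: the helper names, the Cholesky-based normal generator, and the 2×2 correlation matrix are illustrative choices, and the quantile search is intended for moderate Poisson means. The correlation of the resulting counts generally differs somewhat from the correlation supplied for the normal vector.

```python
import math
import random
from statistics import NormalDist


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][m] * low[j][m] for m in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def poisson_quantile(u, lam):
    """Smallest k with P(X <= k) >= u for X ~ Poisson(lam)."""
    k, p = 0, math.exp(-lam)
    cdf = p
    while cdf < u and p > 0.0:
        k += 1
        p *= lam / k
        cdf += p
    return k


def coexpressed_counts(corr, lams, rng):
    """One sample: correlated N(0, 1) vector -> probabilities via Phi -> Poisson counts."""
    low = cholesky(corr)
    eps = [rng.gauss(0.0, 1.0) for _ in lams]  # independent standard normals
    z = [sum(low[i][j] * eps[j] for j in range(i + 1)) for i in range(len(lams))]
    phi = NormalDist().cdf  # cumulative normal distribution function
    return [poisson_quantile(phi(zi), lam) for zi, lam in zip(z, lams)]


rng = random.Random(1)
corr = [[1.0, 0.8], [0.8, 1.0]]  # hypothetical correlation between two co-expressed genes
sample = coexpressed_counts(corr, [10.0, 20.0], rng)
```

Calling `coexpressed_counts` once per sample yields, for each co-expressed gene, a Poisson-distributed count series that is positively correlated with the others.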

In the step of generating the random vector (S32), a p-dimensional random vector is generated using a multivariate normal distribution that satisfies the correlation coefficient matrix and has μ=0 and σ=1. At this time, the correlation matrix R_N is generated automatically by the normal distribution.

In the step of converting into vectors that follow Poisson distributions (S34), the cumulative normal distribution function is evaluated for each value of the random vector, and for each resulting value the Poisson inverse cumulative probability function LP_i is evaluated using the mean value λ_i, so that the random vector is converted into a vector that follows the Poisson distribution.

In the present invention, LP_i can be expressed by the following equation.

[Equation]
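
For orientation, a Poisson inverse cumulative distribution function is conventionally defined as shown below. This is the textbook definition written in assumed notation (z_i for the i-th component of the normal random vector, Φ for the cumulative normal distribution function, F for the Poisson cumulative distribution function); it is not necessarily the form of the patent's own equation.

```latex
F(k;\lambda_i) = \sum_{j=0}^{k} \frac{e^{-\lambda_i}\,\lambda_i^{\,j}}{j!},
\qquad
LP_i = \min\bigl\{\, k \in \{0, 1, 2, \dots\} : F(k;\lambda_i) \ge \Phi(z_i) \,\bigr\}
```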

Real RNA-seq gene expression data contain genes whose expression level is 0. To simulate this, the present invention selects arbitrary genes from among the k unexpressed genes and assigns a value of 0 to their expression levels.

That is, the method includes a zero-expression gene expression level generation step (S40, S42) of selecting arbitrary genes from among the unexpressed genes and assigning a value of 0 to their expression levels.
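
The zero-expression step can be sketched as follows. The function name, the dictionary layout (gene identifier mapped to a list of per-sample counts), the number of genes to zero, and the choice to set the count to 0 in every sample are assumptions made for the example; the text above only states that arbitrary genes are selected from the candidate pool and given the value 0.

```python
import random


def assign_zero_expression(counts, candidate_genes, n_zero, rng):
    """Set the counts to 0 for n_zero genes drawn at random from candidate_genes."""
    chosen = rng.sample(sorted(candidate_genes), n_zero)
    for gene in chosen:
        counts[gene] = [0] * len(counts[gene])  # zero in every sample (assumption)
    return chosen


rng = random.Random(7)
# Hypothetical count table: gene id -> counts for three samples.
counts = {"g1": [12, 9, 15], "g2": [30, 28, 33], "g3": [5, 7, 4], "g4": [50, 61, 55]}
# g1 to g3 stand in for the candidate pool; g4 is excluded from selection.
zeroed = assign_zero_expression(counts, ["g1", "g2", "g3"], n_zero=1, rng=rng)
```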

Real RNA-seq gene expression data also contain genes whose expression levels are markedly higher than those of other genes.

To simulate this, the present invention includes a highly-expressed gene expression level generation step (S50, S52) of arbitrarily selecting genes that are neither differentially expressed nor co-expressed and regenerating their expression levels by multiplying the expression level values by an input constant.

That is, the expression level is regenerated by multiplying the value exp(h) by a constant n, so that the regenerated expression level is n × exp(h). Here, n is an input parameter.
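
A minimal sketch of this step is shown below, assuming the same hypothetical count table layout (gene identifier mapped to per-sample counts). The function name, the gene identifiers, and the value n = 10 are illustrative assumptions; only the rule itself, selecting genes that are neither differentially expressed nor co-expressed and multiplying their expression levels by n, comes from the text.

```python
import random


def assign_high_expression(counts, de_genes, coexpressed_genes, n_high, factor, rng):
    """Multiply the counts of n_high randomly chosen eligible genes by a constant factor."""
    excluded = set(de_genes) | set(coexpressed_genes)
    eligible = sorted(g for g in counts if g not in excluded)
    chosen = rng.sample(eligible, n_high)
    for gene in chosen:
        counts[gene] = [factor * c for c in counts[gene]]  # n x exp(h)
    return chosen


rng = random.Random(11)
counts = {"g1": [12, 9, 15], "g2": [30, 28, 33], "g3": [5, 7, 4], "g4": [50, 61, 55]}
# Hypothetical labels: g4 is differentially expressed, g3 is co-expressed.
boosted = assign_high_expression(counts, de_genes=["g4"], coexpressed_genes=["g3"],
                                 n_high=1, factor=10, rng=rng)
```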

FIG. 2 is pseudocode of an RNA-seq expression data simulation algorithm according to an embodiment of the present invention.

Meanwhile, the RNA-seq expression data simulation method according to an embodiment of the present invention can be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes every kind of recording device in which data readable by a computer system is stored.

Examples of the computer-readable recording medium include ROM, RAM, CD-ROM, magnetic tape, hard disks, floppy disks, removable storage devices, non-volatile memory (flash memory), and optical data storage devices.

The computer-readable recording medium may also be distributed over computer systems connected by a computer communication network, so that the code is stored and executed in a distributed manner.

The present invention has been described above using several preferred embodiments, but these embodiments are illustrative and not restrictive. Those of ordinary skill in the art to which the present invention belongs will understand that various changes and modifications can be made without departing from the spirit of the invention and the scope of rights set forth in the appended claims.

Claims


A method for simulating RNA-seq expression data executed on a computer device, the method comprising:
receiving a gene sample as input;
if an input gene is not differentially expressed, generating the expression level of the gene from a fixed Poisson distribution; and
a zero-expression gene expression level generation step of selecting arbitrary genes from among the unexpressed genes and assigning a value of 0 to their expression levels.

The method of claim 1,
further comprising, if the input gene is differentially expressed, generating the expression level of the gene from Poisson distributions that differ for each sample or each replicate (replica).

The method of claim 1, further comprising:
if the input genes are co-expressed genes, generating as many random vectors as there are co-expressed genes using a multivariate normal distribution that satisfies a correlation coefficient matrix; and
converting the generated random vectors into vectors that follow Poisson distributions using a cumulative distribution function.

Deleted

The method of claim 1,
further comprising a highly-expressed gene expression level generation step of arbitrarily selecting genes that are neither differentially expressed nor co-expressed and regenerating their expression levels by multiplying the expression level values by an input constant.

The method of claim 3,
wherein, in the step of generating the random vector, a p-dimensional random vector is generated using a multivariate normal distribution that satisfies the correlation coefficient matrix and has μ=0 and σ=1, and
in the step of converting into vectors that follow Poisson distributions, the cumulative normal distribution function is evaluated for each value of the random vector, and for each resulting value the Poisson inverse cumulative probability function LP_i is evaluated using the mean value λ_i, so that the random vector is converted into a vector that follows the Poisson distribution.

The method of claim 6,
wherein LP_i can be expressed by the following equation.

[Equation]

A computer-readable recording medium on which is recorded a program capable of causing a computer to execute the method of any one of claims 1 to 3 and claims 5 to 7.