Movatterモバイル変換


[0]ホーム

URL:


CN119832977A - Model for identifying ASm6A modification and application thereof - Google Patents

Model for identifying ASm6A modification and application thereof
Download PDF

Info

Publication number
CN119832977A
CN119832977ACN202411735634.1ACN202411735634ACN119832977ACN 119832977 ACN119832977 ACN 119832977ACN 202411735634 ACN202411735634 ACN 202411735634ACN 119832977 ACN119832977 ACN 119832977A
Authority
CN
China
Prior art keywords
sample
asm6a
peak
snp
modified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411735634.1A
Other languages
Chinese (zh)
Inventor
罗晓彤
谢宇斌
左志向
任间
张印
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sixth Affiliated Hospital of Sun Yat Sen University
Original Assignee
Sixth Affiliated Hospital of Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sixth Affiliated Hospital of Sun Yat Sen UniversityfiledCriticalSixth Affiliated Hospital of Sun Yat Sen University
Priority to CN202411735634.1ApriorityCriticalpatent/CN119832977A/en
Publication of CN119832977ApublicationCriticalpatent/CN119832977A/en
Pendinglegal-statusCriticalCurrent

Links

Classifications

Landscapes

Abstract

The invention discloses a model for identifying ASm6A modification, which can accurately identify ASm6A modification peaks in a sample to be detected based on MeRIP-seq sequencing data of the sample to be detected. When the model provided by the invention is used for identifying ASm6A modified peaks in a sample to be detected, no matter what MeRIP-seq sequencing data of the sample to be detected is used, the error rate of identification is always lower than 10%, which indicates that the model provided by the invention has excellent stability when the ASm6A modified peaks of the sample are identified, and is suitable for identifying MeRIP-seq sequencing data of the sample under different conditions. Compared with the existing ASm6A modification identification method, the AUC value of the ASm6A modification in the sample identification model is obviously improved, and the identification effect is obviously better.

Description

Model for identifying ASm6A modification and application thereof
Technical Field
The invention relates to the technical field of biomedicine, in particular to a model for identifying ASm6A modification and application thereof.
Background
N6-methyladenine (N6-methyladenosine, m 6A) is the most common dynamic reversible RNA modification in eukaryotic cells, and m6A modification forms a precise dynamic regulation and control cycle in the organism through the combined action of a series of methylases, demethylases and methylation recognition proteins, and is widely involved in vital activities such as embryo development, apoptosis, sperm development, immunity, pressure response and the like. Abnormal m6A modification can cause serious pathological changes of organisms, as found in the prior art, hypoxia conditions can induce breast cancer stem cell phenotype through m6A demethylation, and in addition, m6A can be used as a target for inhibiting cancer under specific conditions. Therefore, identifying the m6A precise site and researching the internal regulation mechanism thereof are key bases for elucidating the pathogenesis of related diseases and developing new therapeutic targets.
Currently, it is still very difficult to resolve the precise regulatory mechanism of the m6A modification at the histology level due to the resolution limitations of the experimental methods. The co-immunoprecipitation combined with a new generation sequencing technology (MeRIP-seq) can simultaneously identify tens of thousands of RNA methylation modification signal peaks with the length of 100-200 nt on the transcriptome level, which plays an important role in promoting the physiological function of revealing RNA methylation modification.
With the proposal of the ultraviolet light crosslinking m6A antibody combined immunoprecipitation (m 6Aindividual-nucleotideresolution cross-LINKING AND immunoprecipitation, miCLIP) method, the m6A modification site of single base progress is successfully identified from the whole transcriptome on the experimental level, the high-throughput identification of the m6A modification site of mammals on the single base precision is already carried out in the prior art, however, the identification result obtained by the technology is still not very stable as the technology is still immature, the technology is not widely applied in the m6A research field at present, and MeRIP-seq is still the most main high-throughput research method at present. Thus, improving the quality of the results identifying m6A modified regions from MeRIP-seq data would be the key basis of current m6A studies.
Based on the accurate identification of the m6A modification region of the transcriptome, another important study was aimed at resolving allele-specific modification times. In non-haploid organisms, the expression of certain modification sites or certain genes is more prone to occur on a particular chromosome, which is referred to as allele-specific. Some important vital event control events, such as gene transcription, epigenetic modification, etc., have significant allele specificity, and these allele-specific controls affect the vital activities of cells in many ways, including imprinting control of embryo development, chromosome inactivation, and gene expression control in a specific spatiotemporal context, etc. Whereas, the disorder allele-specific regulatory event can often cause serious organism pathological changes, such as biological dysplasia, cardiovascular and cerebrovascular functional pathological changes, even cancers and the like, so that the identification and research of the allele-specific regulatory event can help researchers to know the occurrence mechanism of various diseases and to mine potential therapeutic targets.
However, due to technical limitations, allele-specific regulatory event studies have been limited to limited genes or chromosome segments for a considerable period of time, and with the development of high throughput sequencing technology, allele-specific regulatory event studies have been developed at the genomic level, sequencing technology has been able to identify single nucleotide polymorphic sites (Single Nucleotide Polymorphisms, SNPs) present on DNA or RNA, and, in non-haploid organisms, chromosomes from maternal and paternal sides generally carry diverse SNP sites, so sequencing reads can be deduced from heterozygous SNP sites on the sequences, thereby identifying regulatory signals from diverse alleles.
Allele-specific modification is considered to be one of the important reasons for the generation of allele-specific expression, and the prior art mainly focuses on DNA level, as the prior art studies on mouse brain tissue find that the methylation level of DNA of more than 1300 alleles has a significant difference on two homologous chromosomes, and furthermore, the prior art also reveals that more than 10% of human genes are regulated by allele-specific DNA methylation. These methylation modifications with allele specificity have a high degree of correlation with SNP sites on the sequence, and there are relatively few studies on allele specificity on RNA modifications.
In contrast to the above allele-specific regulatory times, studies of their allele specificity for m6A modification on the transcriptome have not been reported. In addition, m6A modifies the correct functions of a plurality of metabolism and growth differentiation related genes of a controller in an organism, the prior art finds that the mouse embryonic stem cells can be prevented from differentiating after being knocked out (knockout) METTL, but in drosophila, the homologous protein Ime4 of METTL3 also has a sublethal effect after being knocked out, which can cause the damage of NOTCH signal channels so as to influence the fertility of drosophila, and in addition, the Ime4 gene has an important effect on the meiosis of yeast. This indicates that m6A modification is critical for early sexual organ development and early embryo development, and thus studying whether m6A modification has specificity between alleles is an important entry point for decrypting embryo development mechanisms.
Analyzing m6A modified signals from MeRIP-seq sequencing data mainly depends on a computational biology method, researchers firstly use Fisher's Exact Test method to construct a first peak identification algorithm for MeRIP-seq, the method can achieve higher identification specificity, but neglects the upstream and downstream correlation of m6A signals on transcripts, and lacks accurate modeling of read steps on the whole genome step by step, accuracy and sensitivity are insufficient, while Exome Peak, HEPeak and MetPeak algorithms mainly use Beta binomial distribution to model methylation level of each chromosome window, and introduce Hidden Markov models (Hidden MarkovModel, HMM) to calculate the upstream and downstream correlation of methylation events, but the algorithms must assume MeRIP-seq reads to conform to Beta binomial distribution, so that obvious errors are introduced for samples with low sequencing quality, parameter optimization of HMM models can obtain accurate results based on a certain number of repeated samples, the number of repeated samples can be used for detecting the specific analysis of the sequencing data of the MACseq, and the analysis of the analysis principle is also limited by the fact that the number of the MACseq is very small in 35-35, and the analysis of the sequence has the fact that the number of the sequence of the MACseq is very small.
Therefore, there is an urgent need for a method capable of precisely identifying m6A peak signals based on MeRIP-seq sequencing, thereby achieving accurate identification of allele-specific m 6A.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provide a model for identifying ASm6A modification and application thereof.
It is a first object of the present invention to provide a model for identifying ASm6A modifications.
It is a second object of the present invention to provide the use of the above model for the preparation of a product for the identification of ASm 6A.
A third object of the invention is to provide an ASm6A modification system.
It is a fourth object of the invention to provide a computer device.
A fifth object of the present invention is to provide a computer-readable storage medium.
It is a sixth object of the invention to provide a computer program product.
A seventh object of the present invention is to provide a model for identifying a differential ASm6A of paired samples.
In order to achieve the above object, the present invention is realized by the following means:
The model for identifying ASm6A modification comprises a data acquisition module, a data processing module, a modification identification module and a result output module;
the data acquisition module is used for acquiring MeRIP-seq sequencing data of a sample to be tested;
The data processing module acquires m6A peaks of a sample to be detected based on MeRIP-seq sequencing data of the sample to be detected, acquires SNP locus information covered by each m6A peak, calculates a dominance ratio ρj(m) according to a formula 10 for SNPj(m) covered by each m6A peak, and obtains a logarithmic dominance ratio yj(m) by taking the logarithm;
Equation 10:
Wherein nj(m) is the read count of SNPj(m) in the sample to be tested, x(m)ma,j is the number of reads of the major haplotype on SNPj(m);
the acquisition method of muASE comprises the steps of acquiring SNP loci of a sample to be detected based on RNA-seq sequencing data of the sample to be detected, aiming at SNPj in the sample to be detected, calculating the dominance ratio rhoj of a main haplotype to a secondary haplotype on SNPj according to a formula I, and taking logarithm to obtain a logarithmic dominance ratio yj;
equation 1:
Wherein nj is the read count of SNPj in the sample to be tested, xma,j is the number of reads of the major haplotype on SNPj, and x0,j is 0.5 x nj;
processing the logarithmic dominance ratio yj by combining the unidirectional normal random distribution to obtain processed yj, sampling the processed yj and calculating the average value muASE of posterior distribution;
Processing the logarithmic dominance ratio yj(m) of SNP loci covered by each m6A peak by combining unidirectional normal random distribution to obtain processed yj(m), sampling the processed yj(m), calculating the average value mu(m) of posterior distribution, taking the inverse logarithm of mu(m) to obtain rhoj(m)', and calculating the frequency of a main allele of each m6A peak;
The difference significance analysis module is used for carrying out difference significance analysis on the frequency of the main allele of each m6A peak obtained by the data processing module to obtain a difference significance analysis result of each m6A peak;
the modification recognition module recognizes an m6A peak with difference significance as an ASm6A modified peak and recognizes an m6A peak without difference significance as a non-ASm 6A modified peak based on the difference significance analysis result of each m6A score;
the result output module is used for outputting the result obtained by the modification identification module.
Wherein ASm6A is modified by RNA N6-methyladenosine (Allle-specofoc m A, ASm 6A) with specific Allele, and the method is concretely described in the prior art, namely Genome Res.2023Aug, and 33 (8), namely 1369-1380.doi:10.1101/gr.277704.123.Epub 2023ep 15.
Preferably, in the data processing module, for the processed yj(m), a Metropolis-Hastings sampling method modified based on a Markov chain Monte Carlo algorithm is used for sampling.
Preferably, in the data processing module, the major haplotype at SNPj is the base of the highest base reading at SNPj and the minor haplotype at SNPj is the base of the last high base reading at SNPj.
Preferably, in the data processing module, the frequency of the main allele of each m6A peak is calculated by the method that the frequency of the main allele is = { ρj(m)/(1+ρj(m) }.
The invention also claims the use of a model as described in any of the above for the preparation of a product for the identification of ASm 6A.
The invention also claims an ASm6A modification and identification system, which comprises an acquisition unit, a storage unit and a processing unit, wherein the acquisition unit is used for acquiring MeRIP-seq sequencing data of a sample to be tested;
The storage unit stores program instructions executable by the processing unit;
the processing unit comprises a model as described in any one of the above;
When the program instruction is executed by the processing unit, meRIP-seq sequencing data of the sample to be detected obtained by the obtaining unit is input into any one of the models to obtain ASm6A modification recognition results.
The invention also claims a computer device comprising a memory and a processor, said memory having stored thereon a computer program executable on said processor;
The computer program, when executed by a processor, implements any of the models described above and/or the ASm6A modifier identification system described above.
The invention also claims a computer readable storage medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, implements any of the above-mentioned models and/or the above-mentioned ASm6A modifier identification system.
The invention also claims a computer program product comprising the computer readable storage medium described above.
The invention also claims a model for identifying the difference ASm6A of the paired samples, which is characterized by comprising a data acquisition module, an identification module and a result output module;
the paired samples are from different groups and are derived from sample a and sample B of the same individual;
The data acquisition module is used for acquiring MeRIP-seq sequencing data of the paired samples, and identifying according to any one of the models to obtain ASm6A modified peaks of the sample A and the sample B respectively;
the identification module marks ASm6A modified peaks which overlap by more than 50% between genome coordinates of the sample A and the sample B as m6A modified peaks which exist in the sample A and the sample B simultaneously based on MeRIP-seq sequencing data of the sample A and the sample B obtained by the data acquisition module;
ASm6A modified peaks present in sample a but not in sample B are noted as specific ASm6A modified peaks in sample a;
for ASm6A modified peaks not present in sample a, but present in sample B, noted as non-specific ASm6A modified peaks for sample a;
ASm6A modified peaks for ASm6A that are present in both sample a and sample B, but exhibit different major haplotypes, noted as specific ASm6A modified peaks for sample a;
For ASm6A modified peaks which exist in the sample A and the sample B but show the same main haplotype, judging the difference of the ASm6A modified peaks in the sample A and the sample B by using a hierarchical Bayesian model, wherein the ASm6A modified peak with obvious difference is marked as a specific ASm6A modified peak of the sample A, and the ASm6A modified peak with insignificant difference is marked as a non-specific ASm6A modified peak of the sample A;
The result output module is used for outputting the result obtained by the identification module.
Preferably, the variability is significant with q <0.05, and the variability is not significant with q > 0.05.
Compared with the prior art, the invention has the following beneficial effects:
The invention provides a model for identifying ASm6A modification, which can accurately identify ASm6A modification peaks in a sample to be detected based on MeRIP-seq sequencing data of the sample to be detected. When the model provided by the invention is used for identifying ASm6A modified peaks in a sample to be detected, no matter what MeRIP-seq sequencing data of the sample to be detected is used, the error rate of identification is always lower than 10%, which indicates that the model provided by the invention has excellent stability when the ASm6A modified peaks of the sample are identified, and is suitable for identifying MeRIP-seq sequencing data of the sample under different conditions. Compared with the existing ASm6A modification identification method, the AUC value of the ASm6A modification in the sample identification model is obviously improved, and the identification effect is obviously better.
Drawings
FIG. 1 is a schematic diagram of a model for ASm6A modification recognition as shown in example 2;
FIG. 2 is a graph of average error rate results for data sets of each classification category identified according to the method of example 3, A is the average error rate results for data sets classified for sequencing reads, B is the average error rate results for data sets classified for library sizes, C is the average error rate results for data sets classified for FPKM values for gene expression, D is the average error rate results for data sets classified for the number of SNPs in the m6A peak, E is the average error rate results for data sets classified for the number of biological replicates;
FIG. 3 is a graph of the average AUC results for experimental, control, and control groups 1 and 2 of example 4;
FIG. 4 is a plot of area under ROC curves for experimental, control, 1, and control groups 2 of example 4, A is a plot of AUC results for data sets classifying the number of SNPs in the m6A peak, B is a plot of AUC results for data sets classifying the FPKM values of gene expression, C is a plot of AUC results for data sets classifying the number of biological repeats, D is a plot of AUC results for data sets classifying the sequencing read length, E is a plot of AUC results for data sets classifying the library size;
FIG. 5 is a graph of the results of the IGV visualization in example 4;
FIG. 6 is a graph showing the identification result of the difference ASm6A in example 5.
Detailed Description
The invention will be further described in detail with reference to the drawings and specific examples, which are given solely for the purpose of illustration and are not intended to limit the scope of the invention. The test methods used in the examples described below are conventional methods unless otherwise specified, and the materials, reagents, etc., used are commercially available reagents and materials unless otherwise specified.
Example 1A functional model for identifying genes with significant allele-specific expression
The functional model for identifying genes with significant allele-specific expression includes a data acquisition module, a data processing module, a difference significance analysis module, a prediction module, and a result output module.
The data acquisition module is used for acquiring RNA-seq sequencing data of the sample to be tested.
The data processing module performs the following processing based on RNA-seq sequencing data of a sample to be tested, and specifically comprises the following steps:
Mapping the RNA-seq sequencing data of the sample to be detected to corresponding genomes of the sample to be detected based on the RNA-seq sequencing data of the sample to be detected to obtain SNP locus information of each gene and the reading number of each SNP in the sample to be detected, and marking the base of the highest base reading on one SNP locus as a main haplotype of the locus and the base of the next highest base reading as a secondary haplotype of the locus;
Aiming at the SNPj, calculating the dominance ratio rhoj of the main haplotype to the secondary haplotype on the SNPj according to a formula 1, and taking the dominance ratio according to a formula 2 to obtain a logarithmic dominance ratio yj, wherein the SNPj is the SNP locus in a sample to be detected;
equation 1:
equation 2:
Wherein nj is the read count of SNPj in the sample to be tested, xma,j is the number of reads of the major haplotype on SNPj, x0,j is 0.5 x nj;
processing the logarithmic dominance ratio yj obtained in the formula 2 by combining a single normal Random Effect Model (REM) according to the formulas 3-6 to obtain processed yj;
Equation 3:
Equation 4: θj~N(μ,τ2);
Equation 5: μ to form (- ≡, ++ infinity a) is provided;
Equation 6:
Wherein θj is the mean value of the log-dominance ratio of the sample to be measured, σj2 is the variance of the log-dominance ratio of the sample to be measured;s2=10。
For the processed yj, a Metropolis-Hastings (M-H) sampling method modified based on Markov Chain Monte Carlo (MCMC) algorithm is adopted to sample according to the formula 7, and edge probability is calculated according to the formula 8, then posterior distribution of μ is calculated according to the formula 9, and mean value (μASE) of the posterior distribution of μ is calculated.
Equation 7:p (θ1,…,θj,μ,τ|y1,y2,…,yj);
Equation 8:p (μ, τ|y1,y2,…,yj);
equation 9:p (τ|y1,y2,…,yj).
The difference significance analysis module performs difference significance analysis based on the mean value (muASE) of mu posterior distribution obtained by the data processing module, and specifically comprises the following steps:
(1) Construction of tail distribution data for Major Allele Frequencies (MAFs) under null hypothesis
To distinguish between significant allele-specific events, a Major Allele Frequency (MAF) profile needs to be obtained under null hypothesis conditions, simulated sequencing data is needed to obtain read counts of major and minor alleles of SNPs without significant ASE or ASm6A events due to lack of true MeRIP-seq data meeting the criteria, and the read count profile of a single SNP captured by sequencing is fitted by using a negative binomial profile (NBD).
Assuming that the number of reads of a single allele covering each SNP locus (SNPi) is shown in formula 10;
Equation 10:
Wherein μ is the theoretical number of reads of SNPi in the absence of allele-specific events, equal to 0.5Ni,Ni is the ready count of SNPi, and k is the dispersion parameter.
The acquisition method of the dispersion parameter k comprises the steps of acquiring Whole Genome Sequencing (WGS) data from a 1000-person genome database (1000 genome, https:// www.internationalgenome.org /), counting the number of covered reads Ni of each SNP locus and the number of reads x of alleles in each sample, and integrating Ni and x through a maximum likelihood method to obtain the dispersion parameter k.
The FPKM values for all genes were then collected from the cancer genome map program (TCGA) and fitted using the fitter package in Python, the genes were classified into six classes of genes according to the distribution type of FPKM (as described in prior art de Torrente L,Zimmerman S,Suzuki M,Christopeit M,Greally JM,Mar JC.The shape of gene expression distributions matter:how incorporating distribution shape improves the interpretation of cancer transcriptomic data.BMC Bioinformatics.2020;21:562.), the overall distribution of FPKM for each class of genes was re-fitted, the simulated FPKM for each gene was obtained by sampling from the distribution, and the total number of reads Ni' for each SNP on each gene was calculated based on the simulated FPKM for each gene and the length of the gene.
Based on the dispersion parameter k and the Ni ' obtained by simulation, the negative binomial distribution of the allele reads number of SNPi of each meta-unit (gene) is obtained by combining formula 10, the number of reads of the major and minor alleles is simulated by sampling, the dominance ratio ρi ' is calculated according to formula 1, and MAF on each gene is calculated, wherein MAF= { ρi'/(1+ρi ')}.
Introducing Generalized Pareto Distribution (GPD), estimating tail distribution of MAF' of different types of genes, calculating a statistical significance threshold, taking 94% quantiles as an initial threshold t, and estimating the distribution of MAF larger than the initial threshold t by combining the scale parameter and the shape parameter of the GPD in the formula 11 to obtain tail distribution of MAF values in zero hypothesis data.
Equation 11:
The parameter ψ is estimated by a mixing method and a gradient descent algorithm proposed in the prior art (Wang C,Chen G.A new hybrid estimation method for the generalized pareto distribution.Communications in Statistics-Theory and Methods.2016;45:4285-4294.).
(2) Analysis of Difference significance
And taking the inverse logarithm of the mean value (muASE) of the mu posterior distribution obtained by the data processing module to obtain rhoASE, and calculating to obtain a MAF value (R) corresponding to muASE, wherein MAF= { rhoASE/(1+ρASE) }.
Carrying out difference significance analysis on MAF value (R) corresponding to muASE in combination with formula 12 to obtain a difference significance analysis result;
Equation 12:
where t is 94%, Rsimj represents the MAF value of ASm6A/ASE in the null hypothesis data, and Nsim represents the total number of Rsimj in the simulation data.
The q value was obtained by correcting the value of P by BH, and q value less than 0.05 was noted as having difference significance.
The prediction module obtains a difference significance analysis result based on the difference significance analysis module, and judges whether the gene is an ASE gene, specifically, judges the gene corresponding to the MAF value corresponding to muASE with difference significance (namely q value is less than 0.05) as the ASE gene, and judges the gene corresponding to the MAF value corresponding to muASE without difference significance (namely q value is more than or equal to 0.05) as the non-ASE gene.
The result output module is used for outputting the result obtained by the judging module.
Example 2A model for ASm6A modification recognition
A schematic diagram of a model for ASm6A modification recognition is shown in fig. 1.
The model for ASm6A modification recognition comprises a data acquisition module, a data processing module, a difference significance analysis module, a modification recognition module and a result output module;
The data acquisition module is used for acquiring MeRIP-seq sequencing data of the sample to be tested.
The data processing module performs the following processing based on MeRIP-seq sequencing data of a sample to be tested, and specifically comprises the following steps:
Obtaining m6A peaks in the sample to be detected based on MeRIP-seq sequencing data of the sample to be detected, and obtaining SNP locus information covered by each m6A peak;
for SNPj(m) covered on the m6A peak, calculating the dominance ratio rhoj(m) according to formula 10, and taking the dominance ratio logarithm according to formula 11 to obtain a logarithmic dominance ratio yj(m);
Equation 10:
Equation 11:
Wherein nj(m) is the number of reads count of SNPj(m) in the sample to be tested, x(m)ma,j is the number of reads of the major haplotype on SNPj(m), and μASE is obtained by processing SNPi(m) by the functional model identified as having a significant allele-specific expressed gene as shown in example 1;
Processing the logarithmic dominance ratio yj(m) obtained by the formula 11 in combination with a single normal random distribution (REM) according to the formula 15 of the formula 12 to obtain a processed yj(m);
Equation 12:
Equation 13:
Equation 14: mu(m) to uniformity (- ≡, ++ infinity a) is provided;
equation 15:
wherein θj is the mean of the log-odds ratios and σj2 is the variance of the log-odds ratios;s2=10。
For the processed yj(m), sampling by adopting a Metropolis-Hastings (M-H) sampling method based on the improved Markov Chain Monte Carlo (MCMC) algorithm, calculating the average value mu(m) of posterior distribution, taking the mu(m) as the inverse logarithm to obtain rhoj(m)' (namely the expected value of rhoj(m)), and calculating the frequency (MAF) of a main allele of an M6A peak;
where maf= { ρj(m)/(1+ρj(m)) }.
The difference saliency analysis module differs from the difference saliency analysis module described in example 1 in that the difference saliency analysis was performed according to the (2) difference saliency analysis in the difference saliency analysis module described in example 1 with respect to the frequency (MAF) of the major allele on the m6A peak, resulting in a difference saliency analysis result (q value) of the m6A peak.
The modification identification module identifies an ASm6A modified peak based on the difference significance analysis result of the m6A peak obtained by the difference significance analysis module, identifies the m6A peak with difference significance (namely q value is less than 0.05) as the ASm6A modified peak, and identifies the m6A peak without difference significance (namely q value is more than or equal to 0.05) as the non-ASm 6A modified peak;
the result output module is used for outputting the result obtained by the modification identification module.
Example 3A method for identifying differential ASm6A in paired samples
The paired samples are samples from different groups and from the same individual, such as tumor samples and normal samples for the same patient, and the specific method is as follows:
Considering that the m6A peak overlapped by more than 50% in genome coordinates between the tumor sample and the normal sample is from the same m6A peak (i.e., the m6A peak existing in both the tumor sample and the normal sample), treating the tumor sample and the normal sample in the paired samples by using the model shown in example 2, respectively, obtaining ASm6A modified peaks of the tumor sample and the normal sample, and classifying the tumor sample and the normal sample:
(1) ASm6A peaks present in tumor samples, but not in normal samples, were noted as specific ASm6A modified peaks in tumor samples.
(2) ASm6A peaks present in normal samples, but not in tumor samples, were noted as non-specific ASm6A modified peaks of tumor samples.
(3) ASm6A peaks for different major haplotypes, which were present in both tumor and normal samples, were noted as specific ASm6A modified peaks for tumor samples.
(4) For ASm6A peaks which exist in both tumor samples and normal samples and show the same main haplotype, evaluating the variability by using a hierarchical Bayesian model, regarding the consistent heterozygous SNP sites in the overlapping area of the ASm6A peaks as available sites for downstream analysis, ensuring consistent haplotype construction between the two samples, and calculating the dominance ratio ρjs according to the formula 16 for each SNP site;
equation 16:
Wherein the method comprises the steps ofYtumor,j is the read count of the major allele at SNPj locus in the tumor IP sample, ntumor,j is the ready count at SNPj locus in the tumor IP sample, μtumor,b is the expected value of the major allele log odds ratio of allele-specific expression calculated for the Input sample of the tumor (i.e., general RNA-seq), where ρtumor,j=ρnormal,j under zero assumption;
And calculating corresponding MAF by using rhojs, and carrying out difference significance analysis according to the method shown by the difference significance analysis module in the embodiment 1 to obtain a difference significance analysis result q value, wherein MAF= { rhojs/(1+ρjs) }.
Q value is less than 0.05, and is marked as specific ASm6A modified peak of tumor sample, q value is greater than or equal to 0.05, and is marked as non-specific ASm6A modified peak of tumor sample.
Example 4 evaluation of Performance of model for ASm6A modification recognition
1. Stability assessment of model for ASm6A modification recognition
1. Experimental method
The m6A peak and site of the real MeRIP-seq sequencing data (GEO: GSM 11828594) were used to generate the simulated data and classified based on sequencing read length (< 75bp, > 75bp and <100bp, > 100bp and <150bp, > 150bp and <300bp, > 300bp and > 300 bp), library size (< 10, > 10 and <20, > 20 and <30, > 30 and <40, > 40 and <50, > 50 and <60, > 60 and <70, > 70 and <80, > 80 and <90 and > 90million reads), FPKM values of gene expression (0-5, 5-10 and 10-15, all log2 (FPKM+1)), number of SNPs in m6A peak (1, 2,3, 4 and > 5), and number of biological repeats (1, 2,3, 4 and 6).
In each classification category, 50% of the m6A peaks were randomly assigned to be allele-specific, i.e. true positives for ASm6A (MAF > 0.6), the remaining m6A peaks were labeled as true negatives for ASm6A (maf=0.5), resulting in a dataset for each classification category.
The evaluation was repeated 50 times for each of the classified data sets using the model shown in example 2, and the average error rate in evaluating each classified data set was calculated.
2. Experimental results
The average error rate results when the model shown in example 2 was used to identify the data sets of each classification category are shown in fig. 2, a being the average error rate results for the data sets classified for sequencing reads, B being the average error rate results for the data sets classified for library sizes, C being the average error rate results for the data sets classified for FPKM values for gene expression, D being the average error rate results for the data sets classified for the number of SNPs in the m6A peak, E being the average error rate results for the data sets classified for the number of biological replicates.
The results showed that the average error rate for the data sets of different sequencing read lengths, as identified using the model shown in example 2, did not change significantly, while the average error rate decreased significantly with increasing library size, FPKM values for gene expression, number of SNPs in the m6A peak, and number of biological repeats, especially for library size and FPKM values for gene expression.
However, the error rate of less than 10% was consistently maintained for each dataset, indicating that the model shown in example 2 has excellent stability in identifying ASm6A of the sample, and is suitable for sequencing data under different conditions.
2. Evaluation of models for ASm6A modification recognition
1. Experimental method
Experimental group (M6 rule) the data set of each classification category shown in step one was used to identify using the model shown in example 2 and the area under ROC curve (AUC) of the model shown in example 2 was calculated based on the true positive and true negative results of ASm6A in the data set and the average AUC was calculated.
Control group 1 (ASPRIN) was identified using a default parameter configuration using the dataset of each of the classification categories shown in step one in combination with algorithm ASPRIN developed by ASPRIN et al (prior art::: https:// doi.org/10.1016/j.ajhg.2019.01.018), for which identification any SNP within one m6A peak was identified as having an ASm6A modification by ASPRIN, that m6A peak was classified as an ASm6A modification, otherwise was considered as non-ASm 6A, and the area under ROC curve (AUC) was calculated based on the true positive and true negative results of ASm6A in the dataset, and the average AUC was calculated.
Control group 2 (Cao S) was identified using a default parameter configuration using the data set for each of the classification categories shown in step one, in combination with the algorithm developed by Cao S et al (prior art: genome Res.2023Aug;33 (8): 1369-1380.doi:10.1101/gr.277704.123.Epub 2023 ep 15), for which identification any SNP within one m6A peak was identified as having ASm6A modification by Cao S, that m6A peak was classified as ASm6A modification, otherwise deemed non-ASm 6A, and the area under the ROC curve (AUC) was calculated based on the true positive and true negative results of ASm6A in the data set, and the average AUC was calculated.
2. Experimental results
The average AUC result graphs of the experimental group, the control group 1 and the control group 2 are shown in fig. 3, the area under ROC curve result graphs of the experimental group, the control group 1 and the control group 2 are shown in fig. 4, a is an AUC result graph when the data set classified for the number of SNPs in the m6A peak is identified, B is an AUC result graph when the data set classified for the FPKM value of gene expression is identified, C is an AUC result graph when the data set classified for the number of biological repetitions is identified, D is an AUC result graph when the data set classified for the sequencing read length is identified, and E is an AUC result graph when the data set classified for the library size is identified.
The results show that the average AUC value when the model shown in example 2 is used for authentication is always significantly higher than that of the algorithm developed by ASPRIN et al and that of the algorithm developed by Cao S et al, while the algorithm developed by ASPRIN et al and that developed by Cao S et al show significant changes in AUC values with changes in sequencing conditions during the authentication process, and significantly increase with increases in the number of SNPs in m6A minutes, indicating that errors due to sequencing data noise are difficult to avoid when the data are authenticated for different sequencing conditions.
3. Application of model for ASm6A modification recognition
1. Experimental method
The model shown in example 2 was used to identify and randomly select m6A peaks with significant ASm6A and m6A peaks without significant ASm6A in the identification were visualized using an IGV tool with the MeRIP-seq dataset in the GEO database (GEO: GSE164151, IP sample: GSM4998285, input sample: GSM 4998284) as the dataset to be analyzed.
2. Experimental results
The IGV visualization results are shown in FIG. 5, which shows that ASm6A peaks in the sample can be accurately identified using the model shown in example 2, and that peaks other than ASm6A are not erroneously identified as ASm6A.
Example 5 application of the method for identifying differential ASm6A of paired samples
1. Experimental method
Two pieces of simulated data with the same genotyping were generated using the m6A peak and locus of the authentic MeRIP-seq sequencing data (GEO: GSM 11828594), noted Sample1 and Sample2 respectively, each piece of simulated data randomly assigned 50% of the m6A peak as allele specific, i.e. true positive for ASm6A (MAF > 0.6), the remaining m6A peak labeled true negative for ASm6A (maf=0.5).
ASm6A events in Sample1 and Sample2, respectively, were identified using example 2, and the difference ASm6A between the two data (in Sample1 and Sample 2) was identified as shown in example 3.
2. Experimental results
The identification result diagram of the difference ASm6A is shown in fig. 6, the left result in fig. 6 is the predicted position of ASm6A, including the positions which are not in Sample1 and Sample2, are in Sample1 and Sample2, and the right result in fig. 5 is the position of actual ASm6A, and the result shows that most ASm6A in Sample1 and Sample2 can be accurately identified, and only a small part of ASm6A is identified by error identification.
Finally, it should be noted that the above embodiments are merely for illustrating the technical solution of the present invention and not for limiting the scope of the present invention, and that other various changes and modifications can be made by one skilled in the art based on the above description and the idea, and it is not necessary or exhaustive of all the embodiments. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the invention are desired to be protected by the following claims.

Claims (10)

CN202411735634.1A2024-11-292024-11-29Model for identifying ASm6A modification and application thereofPendingCN119832977A (en)

Priority Applications (1)

Application NumberPriority DateFiling DateTitle
CN202411735634.1ACN119832977A (en)2024-11-292024-11-29Model for identifying ASm6A modification and application thereof

Applications Claiming Priority (1)

Application NumberPriority DateFiling DateTitle
CN202411735634.1ACN119832977A (en)2024-11-292024-11-29Model for identifying ASm6A modification and application thereof

Publications (1)

Publication NumberPublication Date
CN119832977Atrue CN119832977A (en)2025-04-15

Family

ID=95303370

Family Applications (1)

Application NumberTitlePriority DateFiling Date
CN202411735634.1APendingCN119832977A (en)2024-11-292024-11-29Model for identifying ASm6A modification and application thereof

Country Status (1)

CountryLink
CN (1)CN119832977A (en)

Similar Documents

PublicationPublication DateTitle
AU2017228558B2 (en)Noninvasive prenatal molecular karyotyping from maternal plasma
CN103201744B (en)For estimating the method that full-length genome copies number variation
KR102665592B1 (en)Methods and processes for non-invasive assessment of genetic variations
AU2016355983B2 (en)Methods for detecting copy-number variations in next-generation sequencing
US7937225B2 (en)Systems, methods and software arrangements for detection of genome copy number variation
IL249095B2 (en) Detection of subchromosomal aneuploidy in the fetus and variations in the number of copies
JP2022537442A (en) Systems, computer program products and methods using density of single nucleotide mutations to verify copy number variation in human embryos
EP3977459A1 (en)Limit of detection based quality control metric
US20240221954A1 (en)Disease prediction methods and devices, electronic devices, and computer readable storage media
US20180247019A1 (en)Method for determining whether cells or cell groups are derived from same person, or unrelated persons, or parent and child, or persons in blood relationship
US20180300451A1 (en)Techniques for fractional component fragment-size weighted correction of count and bias for massively parallel DNA sequencing
EP3465500B1 (en)Genotyping polyploid loci
WO2024192121A1 (en)White blood cell contamination detection
WO2024140881A1 (en)Method and device for determining fetal dna concentration
CN119832977A (en)Model for identifying ASm6A modification and application thereof
Mayrink et al.Bayesian factor models for the detection of coherent patterns in gene expression data
CN119673271B (en)Method and device for identifying parent source pollution and detecting copy number abnormality by using peak value
Nordlund et al.Computational and statistical analysis of array-based DNA methylation data
Al-Abri et al.Estimating the size of long tandem repeat expansions from short reads with ScatTR
EP4511838A1 (en)Method and system for detecting tumour presence from mapping metrics of free circulating dna fragments
Mansouri et al.PRINCE: accurate approximation of the copy number of tandem repeats
SK882023A3 (en) Methods and system for detecting microsatellite instability from sequenced free circulating DNA
HK1186784A (en)Methods for estimating genome-wide copy number variations
HK1186784B (en)Methods for estimating genome-wide copy number variations

Legal Events

DateCodeTitleDescription
PB01Publication
PB01Publication
SE01Entry into force of request for substantive examination
SE01Entry into force of request for substantive examination

[8]ページ先頭

©2009-2025 Movatter.jp