CN119832977A

Movatterモバイル変換

Info

Publication number: CN119832977A
Application number: CN202411735634.1A
Authority: CN
Inventors: 罗晓彤; 谢宇斌; 左志向; 任间; 张印
Original assignee: Sixth Affiliated Hospital of Sun Yat Sen University
Current assignee: Sixth Affiliated Hospital of Sun Yat Sen University
Priority date: 2024-11-29
Filing date: 2024-11-29
Publication date: 2025-04-15

Abstract

The invention discloses a model for identifying ASm6A modification, which can accurately identify ASm6A modification peaks in a sample to be detected based on MeRIP-seq sequencing data of the sample to be detected. When the model provided by the invention is used for identifying ASm6A modified peaks in a sample to be detected, no matter what MeRIP-seq sequencing data of the sample to be detected is used, the error rate of identification is always lower than 10%, which indicates that the model provided by the invention has excellent stability when the ASm6A modified peaks of the sample are identified, and is suitable for identifying MeRIP-seq sequencing data of the sample under different conditions. Compared with the existing ASm6A modification identification method, the AUC value of the ASm6A modification in the sample identification model is obviously improved, and the identification effect is obviously better.

Description

Model for identifying ASm6A modification and application thereof

Technical Field

The invention relates to the technical field of biomedicine, in particular to a model for identifying ASm6A modification and application thereof.

Background

N6-methyladenine (N6-methyladenosine, m 6A) is the most common dynamic reversible RNA modification in eukaryotic cells, and m6A modification forms a precise dynamic regulation and control cycle in the organism through the combined action of a series of methylases, demethylases and methylation recognition proteins, and is widely involved in vital activities such as embryo development, apoptosis, sperm development, immunity, pressure response and the like. Abnormal m6A modification can cause serious pathological changes of organisms, as found in the prior art, hypoxia conditions can induce breast cancer stem cell phenotype through m6A demethylation, and in addition, m6A can be used as a target for inhibiting cancer under specific conditions. Therefore, identifying the m6A precise site and researching the internal regulation mechanism thereof are key bases for elucidating the pathogenesis of related diseases and developing new therapeutic targets.

Currently, it is still very difficult to resolve the precise regulatory mechanism of the m6A modification at the histology level due to the resolution limitations of the experimental methods. The co-immunoprecipitation combined with a new generation sequencing technology (MeRIP-seq) can simultaneously identify tens of thousands of RNA methylation modification signal peaks with the length of 100-200 nt on the transcriptome level, which plays an important role in promoting the physiological function of revealing RNA methylation modification.

With the proposal of the ultraviolet light crosslinking m6A antibody combined immunoprecipitation (m 6Aindividual-nucleotideresolution cross-LINKING AND immunoprecipitation, miCLIP) method, the m6A modification site of single base progress is successfully identified from the whole transcriptome on the experimental level, the high-throughput identification of the m6A modification site of mammals on the single base precision is already carried out in the prior art, however, the identification result obtained by the technology is still not very stable as the technology is still immature, the technology is not widely applied in the m6A research field at present, and MeRIP-seq is still the most main high-throughput research method at present. Thus, improving the quality of the results identifying m6A modified regions from MeRIP-seq data would be the key basis of current m6A studies.

Based on the accurate identification of the m6A modification region of the transcriptome, another important study was aimed at resolving allele-specific modification times. In non-haploid organisms, the expression of certain modification sites or certain genes is more prone to occur on a particular chromosome, which is referred to as allele-specific. Some important vital event control events, such as gene transcription, epigenetic modification, etc., have significant allele specificity, and these allele-specific controls affect the vital activities of cells in many ways, including imprinting control of embryo development, chromosome inactivation, and gene expression control in a specific spatiotemporal context, etc. Whereas, the disorder allele-specific regulatory event can often cause serious organism pathological changes, such as biological dysplasia, cardiovascular and cerebrovascular functional pathological changes, even cancers and the like, so that the identification and research of the allele-specific regulatory event can help researchers to know the occurrence mechanism of various diseases and to mine potential therapeutic targets.

However, due to technical limitations, allele-specific regulatory event studies have been limited to limited genes or chromosome segments for a considerable period of time, and with the development of high throughput sequencing technology, allele-specific regulatory event studies have been developed at the genomic level, sequencing technology has been able to identify single nucleotide polymorphic sites (Single Nucleotide Polymorphisms, SNPs) present on DNA or RNA, and, in non-haploid organisms, chromosomes from maternal and paternal sides generally carry diverse SNP sites, so sequencing reads can be deduced from heterozygous SNP sites on the sequences, thereby identifying regulatory signals from diverse alleles.

Allele-specific modification is considered to be one of the important reasons for the generation of allele-specific expression, and the prior art mainly focuses on DNA level, as the prior art studies on mouse brain tissue find that the methylation level of DNA of more than 1300 alleles has a significant difference on two homologous chromosomes, and furthermore, the prior art also reveals that more than 10% of human genes are regulated by allele-specific DNA methylation. These methylation modifications with allele specificity have a high degree of correlation with SNP sites on the sequence, and there are relatively few studies on allele specificity on RNA modifications.

In contrast to the above allele-specific regulatory times, studies of their allele specificity for m6A modification on the transcriptome have not been reported. In addition, m6A modifies the correct functions of a plurality of metabolism and growth differentiation related genes of a controller in an organism, the prior art finds that the mouse embryonic stem cells can be prevented from differentiating after being knocked out (knockout) METTL, but in drosophila, the homologous protein Ime4 of METTL3 also has a sublethal effect after being knocked out, which can cause the damage of NOTCH signal channels so as to influence the fertility of drosophila, and in addition, the Ime4 gene has an important effect on the meiosis of yeast. This indicates that m6A modification is critical for early sexual organ development and early embryo development, and thus studying whether m6A modification has specificity between alleles is an important entry point for decrypting embryo development mechanisms.

Analyzing m6A modified signals from MeRIP-seq sequencing data mainly depends on a computational biology method, researchers firstly use Fisher's Exact Test method to construct a first peak identification algorithm for MeRIP-seq, the method can achieve higher identification specificity, but neglects the upstream and downstream correlation of m6A signals on transcripts, and lacks accurate modeling of read steps on the whole genome step by step, accuracy and sensitivity are insufficient, while Exome Peak, HEPeak and MetPeak algorithms mainly use Beta binomial distribution to model methylation level of each chromosome window, and introduce Hidden Markov models (Hidden MarkovModel, HMM) to calculate the upstream and downstream correlation of methylation events, but the algorithms must assume MeRIP-seq reads to conform to Beta binomial distribution, so that obvious errors are introduced for samples with low sequencing quality, parameter optimization of HMM models can obtain accurate results based on a certain number of repeated samples, the number of repeated samples can be used for detecting the specific analysis of the sequencing data of the MACseq, and the analysis of the analysis principle is also limited by the fact that the number of the MACseq is very small in 35-35, and the analysis of the sequence has the fact that the number of the sequence of the MACseq is very small.

Therefore, there is an urgent need for a method capable of precisely identifying m6A peak signals based on MeRIP-seq sequencing, thereby achieving accurate identification of allele-specific m 6A.

Disclosure of Invention

The invention aims to overcome the defects in the prior art and provide a model for identifying ASm6A modification and application thereof.

It is a first object of the present invention to provide a model for identifying ASm6A modifications.

It is a second object of the present invention to provide the use of the above model for the preparation of a product for the identification of ASm 6A.

A third object of the invention is to provide an ASm6A modification system.

It is a fourth object of the invention to provide a computer device.

A fifth object of the present invention is to provide a computer-readable storage medium.

It is a sixth object of the invention to provide a computer program product.

A seventh object of the present invention is to provide a model for identifying a differential ASm6A of paired samples.

In order to achieve the above object, the present invention is realized by the following means:

The model for identifying ASm6A modification comprises a data acquisition module, a data processing module, a modification identification module and a result output module;

the data acquisition module is used for acquiring MeRIP-seq sequencing data of a sample to be tested;

The data processing module acquires m6A peaks of a sample to be detected based on MeRIP-seq sequencing data of the sample to be detected, acquires SNP locus information covered by each m6A peak, calculates a dominance ratio ρ_j^(m) according to a formula 10 for SNP_j^(m) covered by each m6A peak, and obtains a logarithmic dominance ratio y_j^(m) by taking the logarithm;

Equation 10:

Wherein n_j^(m) is the read count of SNP_j^(m) in the sample to be tested, x^(m)_ma,j is the number of reads of the major haplotype on SNP_j^(m);

the acquisition method of mu_ASE comprises the steps of acquiring SNP loci of a sample to be detected based on RNA-seq sequencing data of the sample to be detected, aiming at SNP_j in the sample to be detected, calculating the dominance ratio rho_j of a main haplotype to a secondary haplotype on SNP_j according to a formula I, and taking logarithm to obtain a logarithmic dominance ratio y_j;

equation 1:

Wherein n_j is the read count of SNP_j in the sample to be tested, x_ma,j is the number of reads of the major haplotype on SNP_j, and x_0,j is 0.5 x n_j;

processing the logarithmic dominance ratio y_j by combining the unidirectional normal random distribution to obtain processed y_j, sampling the processed y_j and calculating the average value mu_ASE of posterior distribution;

Processing the logarithmic dominance ratio y_j^(m) of SNP loci covered by each m6A peak by combining unidirectional normal random distribution to obtain processed y_j^(m), sampling the processed y_j^(m), calculating the average value mu^(m) of posterior distribution, taking the inverse logarithm of mu^(m) to obtain rho_j^(m)', and calculating the frequency of a main allele of each m6A peak;

The difference significance analysis module is used for carrying out difference significance analysis on the frequency of the main allele of each m6A peak obtained by the data processing module to obtain a difference significance analysis result of each m6A peak;

the modification recognition module recognizes an m6A peak with difference significance as an ASm6A modified peak and recognizes an m6A peak without difference significance as a non-ASm 6A modified peak based on the difference significance analysis result of each m6A score;

the result output module is used for outputting the result obtained by the modification identification module.

Wherein ASm6A is modified by RNA N6-methyladenosine (Allle-specofoc m A, ASm 6A) with specific Allele, and the method is concretely described in the prior art, namely Genome Res.2023Aug, and 33 (8), namely 1369-1380.doi:10.1101/gr.277704.123.Epub 2023ep 15.

Preferably, in the data processing module, for the processed y_j^(m), a Metropolis-Hastings sampling method modified based on a Markov chain Monte Carlo algorithm is used for sampling.

Preferably, in the data processing module, the major haplotype at SNP_j is the base of the highest base reading at SNP_j and the minor haplotype at SNP_j is the base of the last high base reading at SNP_j.

Preferably, in the data processing module, the frequency of the main allele of each m6A peak is calculated by the method that the frequency of the main allele is = { ρ_j^(m)/(1+ρ_j^(m) }.

The invention also claims the use of a model as described in any of the above for the preparation of a product for the identification of ASm 6A.

The invention also claims an ASm6A modification and identification system, which comprises an acquisition unit, a storage unit and a processing unit, wherein the acquisition unit is used for acquiring MeRIP-seq sequencing data of a sample to be tested;

The storage unit stores program instructions executable by the processing unit;

the processing unit comprises a model as described in any one of the above;

When the program instruction is executed by the processing unit, meRIP-seq sequencing data of the sample to be detected obtained by the obtaining unit is input into any one of the models to obtain ASm6A modification recognition results.

The invention also claims a computer device comprising a memory and a processor, said memory having stored thereon a computer program executable on said processor;

The computer program, when executed by a processor, implements any of the models described above and/or the ASm6A modifier identification system described above.

The invention also claims a computer readable storage medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, implements any of the above-mentioned models and/or the above-mentioned ASm6A modifier identification system.

The invention also claims a computer program product comprising the computer readable storage medium described above.

The invention also claims a model for identifying the difference ASm6A of the paired samples, which is characterized by comprising a data acquisition module, an identification module and a result output module;

the paired samples are from different groups and are derived from sample a and sample B of the same individual;

The data acquisition module is used for acquiring MeRIP-seq sequencing data of the paired samples, and identifying according to any one of the models to obtain ASm6A modified peaks of the sample A and the sample B respectively;

the identification module marks ASm6A modified peaks which overlap by more than 50% between genome coordinates of the sample A and the sample B as m6A modified peaks which exist in the sample A and the sample B simultaneously based on MeRIP-seq sequencing data of the sample A and the sample B obtained by the data acquisition module;

ASm6A modified peaks present in sample a but not in sample B are noted as specific ASm6A modified peaks in sample a;

for ASm6A modified peaks not present in sample a, but present in sample B, noted as non-specific ASm6A modified peaks for sample a;

ASm6A modified peaks for ASm6A that are present in both sample a and sample B, but exhibit different major haplotypes, noted as specific ASm6A modified peaks for sample a;

For ASm6A modified peaks which exist in the sample A and the sample B but show the same main haplotype, judging the difference of the ASm6A modified peaks in the sample A and the sample B by using a hierarchical Bayesian model, wherein the ASm6A modified peak with obvious difference is marked as a specific ASm6A modified peak of the sample A, and the ASm6A modified peak with insignificant difference is marked as a non-specific ASm6A modified peak of the sample A;

The result output module is used for outputting the result obtained by the identification module.

Preferably, the variability is significant with q <0.05, and the variability is not significant with q > 0.05.

Compared with the prior art, the invention has the following beneficial effects:

The invention provides a model for identifying ASm6A modification, which can accurately identify ASm6A modification peaks in a sample to be detected based on MeRIP-seq sequencing data of the sample to be detected. When the model provided by the invention is used for identifying ASm6A modified peaks in a sample to be detected, no matter what MeRIP-seq sequencing data of the sample to be detected is used, the error rate of identification is always lower than 10%, which indicates that the model provided by the invention has excellent stability when the ASm6A modified peaks of the sample are identified, and is suitable for identifying MeRIP-seq sequencing data of the sample under different conditions. Compared with the existing ASm6A modification identification method, the AUC value of the ASm6A modification in the sample identification model is obviously improved, and the identification effect is obviously better.

Drawings

FIG. 1 is a schematic diagram of a model for ASm6A modification recognition as shown in example 2;

FIG. 2 is a graph of average error rate results for data sets of each classification category identified according to the method of example 3, A is the average error rate results for data sets classified for sequencing reads, B is the average error rate results for data sets classified for library sizes, C is the average error rate results for data sets classified for FPKM values for gene expression, D is the average error rate results for data sets classified for the number of SNPs in the m6A peak, E is the average error rate results for data sets classified for the number of biological replicates;

FIG. 3 is a graph of the average AUC results for experimental, control, and control groups 1 and 2 of example 4;

FIG. 4 is a plot of area under ROC curves for experimental, control, 1, and control groups 2 of example 4, A is a plot of AUC results for data sets classifying the number of SNPs in the m6A peak, B is a plot of AUC results for data sets classifying the FPKM values of gene expression, C is a plot of AUC results for data sets classifying the number of biological repeats, D is a plot of AUC results for data sets classifying the sequencing read length, E is a plot of AUC results for data sets classifying the library size;

FIG. 5 is a graph of the results of the IGV visualization in example 4;

FIG. 6 is a graph showing the identification result of the difference ASm6A in example 5.

Detailed Description

The invention will be further described in detail with reference to the drawings and specific examples, which are given solely for the purpose of illustration and are not intended to limit the scope of the invention. The test methods used in the examples described below are conventional methods unless otherwise specified, and the materials, reagents, etc., used are commercially available reagents and materials unless otherwise specified.

Example 1A functional model for identifying genes with significant allele-specific expression

The functional model for identifying genes with significant allele-specific expression includes a data acquisition module, a data processing module, a difference significance analysis module, a prediction module, and a result output module.

The data acquisition module is used for acquiring RNA-seq sequencing data of the sample to be tested.

The data processing module performs the following processing based on RNA-seq sequencing data of a sample to be tested, and specifically comprises the following steps:

Mapping the RNA-seq sequencing data of the sample to be detected to corresponding genomes of the sample to be detected based on the RNA-seq sequencing data of the sample to be detected to obtain SNP locus information of each gene and the reading number of each SNP in the sample to be detected, and marking the base of the highest base reading on one SNP locus as a main haplotype of the locus and the base of the next highest base reading as a secondary haplotype of the locus;

Aiming at the SNP_j, calculating the dominance ratio rho_j of the main haplotype to the secondary haplotype on the SNP_j according to a formula 1, and taking the dominance ratio according to a formula 2 to obtain a logarithmic dominance ratio y_j, wherein the SNP_j is the SNP locus in a sample to be detected;

equation 1:

equation 2:

Wherein n_j is the read count of SNP_j in the sample to be tested, x_ma,j is the number of reads of the major haplotype on SNP_j, x_0,j is 0.5 x n_j;

processing the logarithmic dominance ratio y_j obtained in the formula 2 by combining a single normal Random Effect Model (REM) according to the formulas 3-6 to obtain processed y_j;

Equation 3:

Equation 4: θ_j～N(μ,τ²);

Equation 5: μ to form (- ≡, ++ infinity a) is provided;

Equation 6:

Wherein θ_j is the mean value of the log-dominance ratio of the sample to be measured, σ_j² is the variance of the log-dominance ratio of the sample to be measured;s²=10。

For the processed y_j, a Metropolis-Hastings (M-H) sampling method modified based on Markov Chain Monte Carlo (MCMC) algorithm is adopted to sample according to the formula 7, and edge probability is calculated according to the formula 8, then posterior distribution of μ is calculated according to the formula 9, and mean value (μ_ASE) of the posterior distribution of μ is calculated.

Equation 7:p (θ₁,…,θ_j,μ,τ|y₁,y₂,…,y_j);

Equation 8:p (μ, τ|y₁,y₂,…,y_j);

equation 9:p (τ|y₁,y₂,…,y_j).

The difference significance analysis module performs difference significance analysis based on the mean value (mu_ASE) of mu posterior distribution obtained by the data processing module, and specifically comprises the following steps:

(1) Construction of tail distribution data for Major Allele Frequencies (MAFs) under null hypothesis

To distinguish between significant allele-specific events, a Major Allele Frequency (MAF) profile needs to be obtained under null hypothesis conditions, simulated sequencing data is needed to obtain read counts of major and minor alleles of SNPs without significant ASE or ASm6A events due to lack of true MeRIP-seq data meeting the criteria, and the read count profile of a single SNP captured by sequencing is fitted by using a negative binomial profile (NBD).

Assuming that the number of reads of a single allele covering each SNP locus (SNP_i) is shown in formula 10;

Equation 10:

Wherein μ is the theoretical number of reads of SNP_i in the absence of allele-specific events, equal to 0.5N_i,N_i is the ready count of SNP_i, and k is the dispersion parameter.

The acquisition method of the dispersion parameter k comprises the steps of acquiring Whole Genome Sequencing (WGS) data from a 1000-person genome database (1000 genome, https:// www.internationalgenome.org /), counting the number of covered reads N_i of each SNP locus and the number of reads x of alleles in each sample, and integrating N_i and x through a maximum likelihood method to obtain the dispersion parameter k.

The FPKM values for all genes were then collected from the cancer genome map program (TCGA) and fitted using the fitter package in Python, the genes were classified into six classes of genes according to the distribution type of FPKM (as described in prior art de Torrente L,Zimmerman S,Suzuki M,Christopeit M,Greally JM,Mar JC.The shape of gene expression distributions matter:how incorporating distribution shape improves the interpretation of cancer transcriptomic data.BMC Bioinformatics.2020;21:562.), the overall distribution of FPKM for each class of genes was re-fitted, the simulated FPKM for each gene was obtained by sampling from the distribution, and the total number of reads N_i' for each SNP on each gene was calculated based on the simulated FPKM for each gene and the length of the gene.

Based on the dispersion parameter k and the N_i ' obtained by simulation, the negative binomial distribution of the allele reads number of SNP_i of each meta-unit (gene) is obtained by combining formula 10, the number of reads of the major and minor alleles is simulated by sampling, the dominance ratio ρ_i ' is calculated according to formula 1, and MAF on each gene is calculated, wherein MAF= { ρ_i'/(1+ρ_i ')}.

Introducing Generalized Pareto Distribution (GPD), estimating tail distribution of MAF' of different types of genes, calculating a statistical significance threshold, taking 94% quantiles as an initial threshold t, and estimating the distribution of MAF larger than the initial threshold t by combining the scale parameter and the shape parameter of the GPD in the formula 11 to obtain tail distribution of MAF values in zero hypothesis data.

Equation 11:

The parameter ψ is estimated by a mixing method and a gradient descent algorithm proposed in the prior art (Wang C,Chen G.A new hybrid estimation method for the generalized pareto distribution.Communications in Statistics-Theory and Methods.2016;45:4285-4294.).

(2) Analysis of Difference significance

And taking the inverse logarithm of the mean value (mu_ASE) of the mu posterior distribution obtained by the data processing module to obtain rho_ASE, and calculating to obtain a MAF value (R) corresponding to mu_ASE, wherein MAF= { rho_ASE/(1+ρ_ASE) }.

Carrying out difference significance analysis on MAF value (R) corresponding to mu_ASE in combination with formula 12 to obtain a difference significance analysis result;

Equation 12:

where t is 94%, R_simj represents the MAF value of ASm6A/ASE in the null hypothesis data, and N_sim represents the total number of R_simj in the simulation data.

The q value was obtained by correcting the value of P by BH, and q value less than 0.05 was noted as having difference significance.

The prediction module obtains a difference significance analysis result based on the difference significance analysis module, and judges whether the gene is an ASE gene, specifically, judges the gene corresponding to the MAF value corresponding to mu_ASE with difference significance (namely q value is less than 0.05) as the ASE gene, and judges the gene corresponding to the MAF value corresponding to mu_ASE without difference significance (namely q value is more than or equal to 0.05) as the non-ASE gene.

The result output module is used for outputting the result obtained by the judging module.

Example 2A model for ASm6A modification recognition

A schematic diagram of a model for ASm6A modification recognition is shown in fig. 1.

The model for ASm6A modification recognition comprises a data acquisition module, a data processing module, a difference significance analysis module, a modification recognition module and a result output module;

The data acquisition module is used for acquiring MeRIP-seq sequencing data of the sample to be tested.

The data processing module performs the following processing based on MeRIP-seq sequencing data of a sample to be tested, and specifically comprises the following steps:

Obtaining m6A peaks in the sample to be detected based on MeRIP-seq sequencing data of the sample to be detected, and obtaining SNP locus information covered by each m6A peak;

for SNP_j^(m) covered on the m6A peak, calculating the dominance ratio rho_j^(m) according to formula 10, and taking the dominance ratio logarithm according to formula 11 to obtain a logarithmic dominance ratio y_j^(m);

Equation 10:

Equation 11:

Wherein n_j^(m) is the number of reads count of SNP_j^(m) in the sample to be tested, x^(m)_ma,j is the number of reads of the major haplotype on SNP_j^(m), and μ_ASE is obtained by processing SNP_i^(m) by the functional model identified as having a significant allele-specific expressed gene as shown in example 1;

Processing the logarithmic dominance ratio y_j^(m) obtained by the formula 11 in combination with a single normal random distribution (REM) according to the formula 15 of the formula 12 to obtain a processed y_j^(m);

Equation 12:

Equation 13:

Equation 14: mu^(m) to uniformity (- ≡, ++ infinity a) is provided;

equation 15:

wherein θ_j is the mean of the log-odds ratios and σ_j² is the variance of the log-odds ratios;s²=10。

For the processed y_j^(m), sampling by adopting a Metropolis-Hastings (M-H) sampling method based on the improved Markov Chain Monte Carlo (MCMC) algorithm, calculating the average value mu^(m) of posterior distribution, taking the mu^(m) as the inverse logarithm to obtain rho_j^(m)' (namely the expected value of rho_j^(m)), and calculating the frequency (MAF) of a main allele of an M6A peak;

where maf= { ρ_j^(m)/(1+ρ_j^(m)) }.

The difference saliency analysis module differs from the difference saliency analysis module described in example 1 in that the difference saliency analysis was performed according to the (2) difference saliency analysis in the difference saliency analysis module described in example 1 with respect to the frequency (MAF) of the major allele on the m6A peak, resulting in a difference saliency analysis result (q value) of the m6A peak.

The modification identification module identifies an ASm6A modified peak based on the difference significance analysis result of the m6A peak obtained by the difference significance analysis module, identifies the m6A peak with difference significance (namely q value is less than 0.05) as the ASm6A modified peak, and identifies the m6A peak without difference significance (namely q value is more than or equal to 0.05) as the non-ASm 6A modified peak;

Example 3A method for identifying differential ASm6A in paired samples

The paired samples are samples from different groups and from the same individual, such as tumor samples and normal samples for the same patient, and the specific method is as follows:

Considering that the m6A peak overlapped by more than 50% in genome coordinates between the tumor sample and the normal sample is from the same m6A peak (i.e., the m6A peak existing in both the tumor sample and the normal sample), treating the tumor sample and the normal sample in the paired samples by using the model shown in example 2, respectively, obtaining ASm6A modified peaks of the tumor sample and the normal sample, and classifying the tumor sample and the normal sample:

(1) ASm6A peaks present in tumor samples, but not in normal samples, were noted as specific ASm6A modified peaks in tumor samples.

(2) ASm6A peaks present in normal samples, but not in tumor samples, were noted as non-specific ASm6A modified peaks of tumor samples.

(3) ASm6A peaks for different major haplotypes, which were present in both tumor and normal samples, were noted as specific ASm6A modified peaks for tumor samples.

(4) For ASm6A peaks which exist in both tumor samples and normal samples and show the same main haplotype, evaluating the variability by using a hierarchical Bayesian model, regarding the consistent heterozygous SNP sites in the overlapping area of the ASm6A peaks as available sites for downstream analysis, ensuring consistent haplotype construction between the two samples, and calculating the dominance ratio ρ_j^s according to the formula 16 for each SNP site;

equation 16:

Wherein the method comprises the steps ofY_tumor,j is the read count of the major allele at SNP_j locus in the tumor IP sample, n_tumor,j is the ready count at SNP_j locus in the tumor IP sample, μ_tumor,b is the expected value of the major allele log odds ratio of allele-specific expression calculated for the Input sample of the tumor (i.e., general RNA-seq), where ρ_tumor,j＝ρ_normal,j under zero assumption;

And calculating corresponding MAF by using rho_j^s, and carrying out difference significance analysis according to the method shown by the difference significance analysis module in the embodiment 1 to obtain a difference significance analysis result q value, wherein MAF= { rho_j^s/(1+ρ_j^s) }.

Q value is less than 0.05, and is marked as specific ASm6A modified peak of tumor sample, q value is greater than or equal to 0.05, and is marked as non-specific ASm6A modified peak of tumor sample.

Example 4 evaluation of Performance of model for ASm6A modification recognition

1. Stability assessment of model for ASm6A modification recognition

1. Experimental method

The m6A peak and site of the real MeRIP-seq sequencing data (GEO: GSM 11828594) were used to generate the simulated data and classified based on sequencing read length (< 75bp, > 75bp and <100bp, > 100bp and <150bp, > 150bp and <300bp, > 300bp and > 300 bp), library size (< 10, > 10 and <20, > 20 and <30, > 30 and <40, > 40 and <50, > 50 and <60, > 60 and <70, > 70 and <80, > 80 and <90 and > 90million reads), FPKM values of gene expression (0-5, 5-10 and 10-15, all log2 (FPKM+1)), number of SNPs in m6A peak (1, 2,3, 4 and > 5), and number of biological repeats (1, 2,3, 4 and 6).

In each classification category, 50% of the m6A peaks were randomly assigned to be allele-specific, i.e. true positives for ASm6A (MAF > 0.6), the remaining m6A peaks were labeled as true negatives for ASm6A (maf=0.5), resulting in a dataset for each classification category.

The evaluation was repeated 50 times for each of the classified data sets using the model shown in example 2, and the average error rate in evaluating each classified data set was calculated.

2. Experimental results

The average error rate results when the model shown in example 2 was used to identify the data sets of each classification category are shown in fig. 2, a being the average error rate results for the data sets classified for sequencing reads, B being the average error rate results for the data sets classified for library sizes, C being the average error rate results for the data sets classified for FPKM values for gene expression, D being the average error rate results for the data sets classified for the number of SNPs in the m6A peak, E being the average error rate results for the data sets classified for the number of biological replicates.

The results showed that the average error rate for the data sets of different sequencing read lengths, as identified using the model shown in example 2, did not change significantly, while the average error rate decreased significantly with increasing library size, FPKM values for gene expression, number of SNPs in the m6A peak, and number of biological repeats, especially for library size and FPKM values for gene expression.

However, the error rate of less than 10% was consistently maintained for each dataset, indicating that the model shown in example 2 has excellent stability in identifying ASm6A of the sample, and is suitable for sequencing data under different conditions.

2. Evaluation of models for ASm6A modification recognition

1. Experimental method

Experimental group (M6 rule) the data set of each classification category shown in step one was used to identify using the model shown in example 2 and the area under ROC curve (AUC) of the model shown in example 2 was calculated based on the true positive and true negative results of ASm6A in the data set and the average AUC was calculated.

Control group 1 (ASPRIN) was identified using a default parameter configuration using the dataset of each of the classification categories shown in step one in combination with algorithm ASPRIN developed by ASPRIN et al (prior art::: https:// doi.org/10.1016/j.ajhg.2019.01.018), for which identification any SNP within one m6A peak was identified as having an ASm6A modification by ASPRIN, that m6A peak was classified as an ASm6A modification, otherwise was considered as non-ASm 6A, and the area under ROC curve (AUC) was calculated based on the true positive and true negative results of ASm6A in the dataset, and the average AUC was calculated.

Control group 2 (Cao S) was identified using a default parameter configuration using the data set for each of the classification categories shown in step one, in combination with the algorithm developed by Cao S et al (prior art: genome Res.2023Aug;33 (8): 1369-1380.doi:10.1101/gr.277704.123.Epub 2023 ep 15), for which identification any SNP within one m6A peak was identified as having ASm6A modification by Cao S, that m6A peak was classified as ASm6A modification, otherwise deemed non-ASm 6A, and the area under the ROC curve (AUC) was calculated based on the true positive and true negative results of ASm6A in the data set, and the average AUC was calculated.

2. Experimental results

The average AUC result graphs of the experimental group, the control group 1 and the control group 2 are shown in fig. 3, the area under ROC curve result graphs of the experimental group, the control group 1 and the control group 2 are shown in fig. 4, a is an AUC result graph when the data set classified for the number of SNPs in the m6A peak is identified, B is an AUC result graph when the data set classified for the FPKM value of gene expression is identified, C is an AUC result graph when the data set classified for the number of biological repetitions is identified, D is an AUC result graph when the data set classified for the sequencing read length is identified, and E is an AUC result graph when the data set classified for the library size is identified.

The results show that the average AUC value when the model shown in example 2 is used for authentication is always significantly higher than that of the algorithm developed by ASPRIN et al and that of the algorithm developed by Cao S et al, while the algorithm developed by ASPRIN et al and that developed by Cao S et al show significant changes in AUC values with changes in sequencing conditions during the authentication process, and significantly increase with increases in the number of SNPs in m6A minutes, indicating that errors due to sequencing data noise are difficult to avoid when the data are authenticated for different sequencing conditions.

3. Application of model for ASm6A modification recognition

1. Experimental method

The model shown in example 2 was used to identify and randomly select m6A peaks with significant ASm6A and m6A peaks without significant ASm6A in the identification were visualized using an IGV tool with the MeRIP-seq dataset in the GEO database (GEO: GSE164151, IP sample: GSM4998285, input sample: GSM 4998284) as the dataset to be analyzed.

2. Experimental results

The IGV visualization results are shown in FIG. 5, which shows that ASm6A peaks in the sample can be accurately identified using the model shown in example 2, and that peaks other than ASm6A are not erroneously identified as ASm6A.

Example 5 application of the method for identifying differential ASm6A of paired samples

1. Experimental method

Two pieces of simulated data with the same genotyping were generated using the m6A peak and locus of the authentic MeRIP-seq sequencing data (GEO: GSM 11828594), noted Sample1 and Sample2 respectively, each piece of simulated data randomly assigned 50% of the m6A peak as allele specific, i.e. true positive for ASm6A (MAF > 0.6), the remaining m6A peak labeled true negative for ASm6A (maf=0.5).

ASm6A events in Sample1 and Sample2, respectively, were identified using example 2, and the difference ASm6A between the two data (in Sample1 and Sample 2) was identified as shown in example 3.

2. Experimental results

The identification result diagram of the difference ASm6A is shown in fig. 6, the left result in fig. 6 is the predicted position of ASm6A, including the positions which are not in Sample1 and Sample2, are in Sample1 and Sample2, and the right result in fig. 5 is the position of actual ASm6A, and the result shows that most ASm6A in Sample1 and Sample2 can be accurately identified, and only a small part of ASm6A is identified by error identification.

Finally, it should be noted that the above embodiments are merely for illustrating the technical solution of the present invention and not for limiting the scope of the present invention, and that other various changes and modifications can be made by one skilled in the art based on the above description and the idea, and it is not necessary or exhaustive of all the embodiments. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the invention are desired to be protected by the following claims.

Claims

1. The model for identifying ASm6A modification is characterized by comprising a data acquisition module, a data processing module, a difference significance analysis module, a modification module and a result output module;

equation 1:

2. The model of claim 1, wherein the data processing module performs sampling for the processed y_j^(m) by a Metropolis-Hastings sampling method modified based on a markov chain monte carlo algorithm.

3. The model of claim 1, wherein in the data processing module, the major haplotype at SNP_j is the base of the highest base reading at SNP_j and the minor haplotype at SNP_j is the base of the last high base reading at SNP_j.

4. Use of a model according to any one of claims 1 to 3 for the preparation of a product for the identification of ASm 6A.

5. An ASm6A modification and identification system comprises an acquisition unit, a storage unit and a processing unit, and is characterized in that the acquisition unit is used for acquiring MeRIP-seq sequencing data of a sample to be tested;

The storage unit stores program instructions executable by the processing unit;

the processing unit comprises the model of any one of claims 1-3;

When the program instruction is executed by the processing unit, meRIP-seq sequencing data of the sample to be detected, which is obtained by the obtaining unit, is input into the model of any one of claims 1-3, and an ASm6A modification recognition result is obtained.

6. A computer device comprising a memory and a processor, wherein the memory has stored thereon a computer program executable on the processor;

The computer program, when executed by a processor, implements the model of any one of claims 1-3 and/or the ASm6A modification recognition system of claim 5.

7. A computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the model of any one of claims 1-3 and/or the ASm6A modification recognition system of claim 5.

8. A computer program product, characterized in that the computer program product comprises the computer readable storage medium of claim 7.

9. A model for identifying a difference ASm6A of a paired sample, which is characterized by comprising a data acquisition module, an identification module and a result output module;

the data acquisition module is used for acquiring MeRIP-seq sequencing data of a paired sample, and identifying according to the model of any one of claims 1-3 to respectively obtain ASm6A modified peaks of a sample A and a sample B;

The identification module marks an m6A modified peak with genome coordinates overlapped by more than 50% between the sample A and the sample B as an m6A modified peak which exists in the sample A and the sample B simultaneously based on MeRIP-seq sequencing data of the sample A and the sample B obtained by the data acquisition module;

10. The model of claim 9, wherein the variability is significant with q <0.05 and the variability is not significant with q > 0.05.