Detailed Description
The present invention is further described below through embodiments with reference to the drawings. The embodiments are not intended to limit the invention; those skilled in the art may make various modifications or improvements based on the basic idea of the invention, and these fall within the scope of the invention so long as they do not depart from that basic idea.
siRNA drugs offer advantages that traditional drugs lack when responding to the outbreak of a newly emerged, sudden viral epidemic. Once the sequence of the burst virus has been preliminarily obtained, the design, preliminary screening and efficacy verification of an siRNA drug with a virus-inhibiting effect can be completed in a relatively short time. However, siRNAs obtained in this way generally have off-target effects, which may cause cytotoxicity. Under emergency conditions, there is an urgent need for a method of predicting the cytotoxicity of siRNA that shortens screening time, improves screening efficiency and facilitates emergency drug development, thereby effectively assisting siRNA design.
Herein, the term "burst virus" (also "paroxysmal virus") includes: respiratory viruses, Ebola viruses, Zika viruses, and the like.
In this context, the term "respiratory virus" is known in the art and refers to a large group of viruses that invade the respiratory tract and cause local lesions there, or that enter through the respiratory tract but mainly cause lesions in tissues and organs outside it. Respiratory viruses include influenza viruses in the family Orthomyxoviridae; parainfluenza virus, respiratory syncytial virus, measles virus and mumps virus in the family Paramyxoviridae; and some viruses in other families, such as adenovirus, rubella virus, rhinovirus, coronavirus and reovirus, among others. Statistically, more than 90% of acute respiratory infections are caused by viruses.
Herein, the term "influenza virus" is known in the art. Influenza viruses are divided into three types, A, B and C, and cause influenza (flu for short) in humans and animals (pigs, horses, marine mammals, birds, etc.). Influenza A viruses are the most important cause of human influenza epidemics, recur most frequently, and are the principal pathogens of true global pandemics. Taxonomically, influenza virus belongs to the family Orthomyxoviridae; it causes acute upper respiratory infections and is rapidly transmitted through the air, with periodic pandemics around the world. Influenza virus causes more serious symptoms, such as pneumonia or cardiopulmonary failure, in the elderly, in children with weak immunity and in some patients with immune disorders.
Respiratory viruses also include coronaviruses; a previously unknown coronavirus caused the global SARS crisis. SARS is an infectious disease that first emerged in Guangdong Province, China, in 2002, spread to Southeast Asia and then worldwide, and was gradually brought under control only by mid-2003. Research reports have shown that the SARS coronavirus (SARS-CoV) is the causative agent of Severe Acute Respiratory Syndrome (SARS).
Herein, the term "Ebola virus" (Ebolavirus, EBOV) is known in the art and refers to a virus belonging to the family Filoviridae. The virion is filamentous or rod-shaped, with a diameter of about 100 nm and a length of 300-1500 nm. The virus particle has a helical nucleocapsid and an outer envelope. Its genome is single-stranded negative-sense RNA, about 19 kb in total length, encoding 7 proteins in total. Ebola viruses can currently be divided into five subtypes: Zaire (ZEBOV), Côte d'Ivoire (CEBOV), Sudan (Sudan Ebolavirus, SEBOV), Reston (REBOV) and Bundibugyo (BEBOV). Ebola hemorrhagic fever (EHF) is an acute hemorrhagic infectious disease caused by Ebola virus. It first occurred in 1976 in the Ebola River basin in Zaire (now the Democratic Republic of the Congo) and was named Ebola hemorrhagic fever because it causes systemic hemorrhage in infected patients. Since the 1976 outbreaks in Zaire and Sudan, outbreaks have occurred mainly in central Africa, in countries such as Uganda, the Congo, Gabon, Sudan, Côte d'Ivoire, Liberia and South Africa. The disease is extremely infectious, with a fatality rate as high as 50-88%. Humans are primarily infected by contact with the body fluids, excretions, secretions, and the like of patients or infected animals. The clinical manifestations mainly include fever, hemorrhage and multiple organ damage.
Herein, the term "off-target effect" is known in the art and means that the action of an siRNA is not fully specific: the siRNA may act on genes other than the target gene and non-specifically block their expression, resulting in unintended effects. Off-target effects associated with siRNA fall into three major categories: microRNA (miRNA)-like off-target effects, immune stimulation, and saturation of the RNAi machinery.
It is an object of the present invention to provide a method for establishing a machine learning model for predicting the toxicity of siRNA against a certain class of cells. Another object of the present invention is to provide the use of said method for predicting the toxicity of siRNA against such cells. It is yet another object of the present invention to provide a computer readable medium. It is still another object of the present invention to provide a device for predicting the toxicity of siRNA against a certain type of cell. It is still another object of the present invention to provide a method for predicting the toxicity of siRNA against a certain type of cell.
Method for establishing machine learning model for predicting toxicity of siRNA to certain cell
In a first aspect, the present invention provides a method for establishing a machine learning model for predicting the toxicity of siRNA against a cell of a type, comprising the steps of:
A) providing n siRNAs, wherein n is more than or equal to 2, and the length of the siRNA is 19-29 bp;
B) obtaining an input value and an output value for establishing a machine learning model independently from each siRNA;
wherein the input value of any one siRNA in the n siRNAs is obtained by:
i) respectively comparing the sequence of the siRNA with the sequence of the genome mRNA, and selecting one or more off-target genes which are positioned in the genome mRNA, wherein the off-target genes are genes which can be complementary with the siRNA in the genome mRNA and have the number of mismatched bases less than or equal to 7;
ii) obtaining the off-target weight of each selected off-target gene mRNA independently from the complementary region of the siRNA sequence according to the characteristics of mismatched bases and the secondary structure characteristics of the off-target gene mRNA sequence;
iii) performing omic annotation of each selected off-target gene independently and non-sequentially from ii) using a bioinformatics database to obtain an omic weight for the off-target gene, the omic weight comprising at least one of a protein interaction weight, a signal pathway weight, and a core gene weight for the off-target gene;
iv) calculating a characteristic value of each omic based on the weight of each omic and the off-target weight of all the selected off-target genes, and taking the characteristic value as the input value;
and the output value of that siRNA is obtained by:
performing an experiment on the cell by using the siRNA to obtain a cell survival index in the presence of the siRNA, and taking the cell survival index as an output value;
C) and calculating the input values and the output values of the n siRNAs by a machine learning algorithm, thereby establishing a machine learning model.
The method for establishing the machine learning model is obtained by combining bioinformatics with biological experimental data and calculating through a machine learning algorithm.
Herein, the term "bioinformatics" is known in the art and refers to the science of storing, retrieving and analyzing biological information using the computer as a tool in life science research. In general, bioinformatics combines molecular biology with information technology (especially internet technology). The research materials and results of bioinformatics are biological data of many kinds, its research tool is the computer, and its research methods include searching (collecting and screening), processing (editing, organizing, managing and displaying) and utilizing (calculating, simulating) biological data.
In this context, the term "machine learning" is known in the art and refers to an interdisciplinary field drawing on probability theory, statistics, approximation theory, convex analysis, computational complexity theory, and the like. Machine learning theory is mainly concerned with designing and analyzing algorithms that allow computers to "learn" automatically. Machine learning algorithms are a class of artificial intelligence algorithms that automatically derive rules from data and use those rules to make predictions about unknown data. Because learning algorithms draw heavily on statistical theory, machine learning is particularly closely related to inferential statistics and is also called statistical learning theory. Machine learning can be divided into the following categories: supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, and the like. Supervised learning learns a function from a given set of training data, from which results can be predicted when new data arrive. The training set for supervised learning must include inputs and outputs, also referred to as features and targets. The targets in the training set are labeled by humans. Common supervised learning algorithms include regression analysis and statistical classification. In contrast to supervised learning, the training set in unsupervised learning has no human-labeled results. A common unsupervised learning algorithm is clustering. Semi-supervised learning lies between supervised and unsupervised learning. Reinforcement learning learns what actions to take through observation: each action has an impact on the environment, and the learner makes decisions based on the feedback it observes from the surrounding environment.
In one embodiment of the invention, the machine learning algorithm is preferably a supervised learning algorithm.
The machine learning model of the invention is used for predicting the toxicity of siRNA to certain cells.
The cells referred to by the term "a certain type of cell", as used herein in connection with predicting cytotoxicity, may be human cells or other mammalian cells. When the cell is a human cell, the genomic mRNA is human genomic mRNA. When the cell is a cell of another mammal, the genomic mRNA is that of the mammal. In addition, the term "a certain type of cell" refers to one or more kinds of cells that are functionally identical or related; for example, it may be a cell that can be contacted or infected by a virus, such as a respiratory tract epithelial cell, a gastrointestinal tract epithelial cell, a skin cell, a liver cell, a nerve cell, a lymphocyte, an eye cell, a urinary tract cell, a reproductive tract cell, and the like. When "a certain type of cell" covers multiple kinds of cells, a machine learning model for predicting the toxicity of siRNA may be established separately for each kind.
In this context, the term "siRNA" (small interfering RNA, also called small interfering nucleic acid, or small nucleic acid for short) is known in the art and refers to a short double-stranded nucleic acid carrying a specific gene sequence, which can be 19-29 bp (base pairs) in length (see the literature "McIntyre GJ, Yu YH, Lomas M, Fanning GC. The effects of stem length and core placement on shRNA activity. BMC Mol Biol. 2011 Aug 8; 12:34."). In the siRNA duplex, the strand identical to the targeted sequence of the messenger RNA (mRNA) is referred to as the sense strand, and the complementary strand is the antisense strand. The siRNA includes a 5'-phosphate terminus, a 19 nt double-stranded region, a 3'-hydroxyl terminus, and 2 unpaired nucleotides overhanging at the 3' end, and it directs cleavage of the mRNA. A gene usually contains thousands of bp, whereas an siRNA is a specific sequence typically 21-23 bp in length. The siRNA may be cloned into an siRNA expression vector; it functions by binding to and degrading the messenger ribonucleic acid (mRNA) of a particular target gene in a mammalian cell, thereby "silencing" the expression of the target gene, i.e., "turning off" the function of the gene. The mechanism by which such siRNAs degrade mRNA and thereby block the synthesis of a particular protein is nucleic acid interference (RNAi).
In this context, the term "nucleic acid interference" (or RNA interference, RNAi) is known in the art and refers to an evolutionarily highly conserved phenomenon in which double-stranded RNA (dsRNA) induces efficient and specific degradation of homologous mRNAs. After its discovery, RNAi rapidly became one of the most active areas of biological research: Science listed it among the ten major scientific achievements of 2001 and ranked it first among the ten in 2002; Nature also rated siRNA as one of the most important scientific discoveries of 2002; and in 2006 the two American scientists Fire and Mello, who discovered the RNAi mechanism, were awarded the Nobel Prize in Physiology or Medicine. RNAi technology can specifically eliminate or switch off the expression of specific genes. It is a rapid, effective and specific tool for inhibiting gene expression, and it is widely used for exploring gene function and in gene therapy for viral diseases (mainly AIDS and hepatitis) and malignant tumors. On the one hand, RNAi is a touchstone for testing gene function, and RNAi technology can greatly shorten the time needed to understand the functions and roles of human genes; on the other hand, RNAi technology can yield a novel class of gene drugs that inactivate pathogenic genes, i.e., siRNA drugs.
By way of example, FIG. 8 shows a flow chart of one embodiment of the method of the present invention. In the method of the present invention, n siRNAs are first provided, each siRNA comprising a pair of a sense strand sequence and an antisense strand sequence. The value of n is 2 or more, for example, 10 or more, 15 or more, 20 or more, 100 or more. Those skilled in the art can use the appropriate value of n based on the present invention depending on the circumstances (e.g., balancing between the need for model accuracy or other requirements and the need for time and economic cost control or other requirements).
The n siRNAs can be designed specifically for implementing the method of the present invention, i.e., for creating a machine learning model for predicting the toxicity of siRNAs against a certain type of cell, such as those shown in Tables 1 and 2 in Experimental example 1 of this specification. The n siRNAs can also be candidate siRNA drugs with antiviral effect designed against certain viruses, and the viruses can be burst viruses. For example, the n siRNAs may be designed against a specific respiratory virus or against a specific Ebola virus.
In the method of the present invention, an input value for establishing a machine learning model is further obtained for each siRNA using bioinformatics, and an output value for establishing a machine learning model is obtained using a biological experiment independently and not sequentially from the above-described step of obtaining an input value.
In the method, when the input value for establishing the machine learning model is obtained for each siRNA, in order to preliminarily determine the off-target gene of each siRNA, the sequences of the siRNA and the sequence of the genome mRNA are comprehensively compared, and the number of mismatched bases of the siRNA and the sequence of the genome mRNA is set to be less than or equal to 7, so that a series of off-target genes are comprehensively selected.
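The selection criterion described above (number of mismatched bases less than or equal to 7) can be illustrated with a minimal sketch. It is not the alignment procedure of the invention, which relies on alignment software; it is a naive sliding-window scan that assumes the sense strand is compared directly with mRNA windows of the same length, without gaps. The function names and the toy sequences are hypothetical.

```python
def count_mismatches(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


def find_candidate_sites(sense: str, mrna: str, max_mismatches: int = 7):
    """Return (start, mismatches) for every ungapped mRNA window whose
    mismatch count against the sense strand is within the threshold."""
    k = len(sense)
    sites = []
    for start in range(len(mrna) - k + 1):
        m = count_mismatches(sense, mrna[start:start + k])
        if m <= max_mismatches:
            sites.append((start, m))
    return sites


# Hypothetical 19-nt sense strand and a toy mRNA; the embedded site
# differs from the sense strand at its last two positions.
sense = "GACUUCGAGCAAGAGAUGG"
mrna = "AAAA" + "GACUUCGAGCAAGAGAUCC" + "UUUU"
sites = find_candidate_sites(sense, mrna)
```

A gene would be kept as a candidate off-target gene if at least one window of its mRNA passes the threshold; real alignments may also need gapped matches, which this sketch ignores.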
The genomic mRNA may be human genomic mRNA or the genomic mRNA of another mammal. Other mammals include, but are not limited to: chimpanzee, gorilla, bonobo, guinea pig, pika, squirrel, dog, cat, mouse, rat, and the like.
As used herein, the term "human genome" is art-recognized and refers to the genome of humans (Homo sapiens), consisting of 23 pairs of chromosomes and containing about 3.16 billion DNA base pairs, part of which makes up approximately 20000 to 25000 genes. Sequencing of the entire human genome was completed in 2006, and human genome sequences are publicly available.
Since different degrees of complementarity between the siRNA and the mRNA, and the secondary structure of the mRNA in the complementary region, can lead to different off-target effects, the invention determines the off-target weight of each selected off-target gene from the characteristics of the mismatched bases and the secondary structure characteristics of the off-target gene mRNA sequence in the region where that mRNA is complementary to the siRNA sequence.
Furthermore, the path from an effect on genes to cytotoxicity is a complex biological problem, much like a black box. Off-target effects of siRNA are mainly manifested as degradation of mRNA or inhibition of the translation of mRNA into protein, so the impact of off-target effects is most direct at the protein level. Proteins do not act in isolation: in the various signaling pathways within the cell, upstream proteins often modulate (activate or inhibit) the activity of downstream proteins, primarily by changing the conformation of downstream proteins through the addition or removal of phosphate groups. In addition, among all genes in the human genome, some are essential for humans and are called core genes; more than 1500 core genes are known at present. In order to predict the toxicity of siRNA to a certain type of cell more scientifically and accurately, the method, when establishing the machine learning model, integrates large-scale data such as the proteome, the signal pathway set and/or the core genome, performs omic annotation on each selected off-target gene to obtain the omic weights of the off-target gene, and calculates the characteristic value of each omic from the omic weights and off-target weights of all selected off-target genes.
In the method, when the output value for establishing the machine learning model is obtained for each siRNA, the siRNA is used to carry out an experiment on the cells to obtain the cell survival index in the presence of the siRNA, and the cell survival index is used as the output value. As used herein, the term "cell survival index" refers to the viability of a cell and can be expressed as the ratio of the OD450 value of the cell in the presence of a given siRNA to the OD450 value of that cell in the normal state.
Through the design and conception, the method for establishing the machine learning model for predicting the toxicity of the siRNA to certain cells is more scientific, rigorous and accurate.
In the method, the length of the siRNA is further preferably 19-25bp, more preferably 19-21bp, and still more preferably 21 bp.
The alignment may be performed using alignment software selected from the group consisting of: BLAST, BLAT, or Wise2DBA. When using these programs, default parameters can be used, with some of them adjusted as necessary to obtain a comprehensive alignment result. Taking BLAST as an example (for a description of this software see the literature "Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. BLAST+: architecture and applications. BMC Bioinformatics. 2008, 10:421", the entire contents of which are incorporated herein by reference), in one embodiment of the present invention, default parameters may be used with the expectation value (evalue) set to 1000, so that the software retains all sequences with expectation values less than or equal to 1000.
The BLAT (i.e., "the BLAST-like alignment tool") software is described in the literature "Kent, W James (2002). BLAT - the BLAST-like alignment tool. Genome Research. 12(4): 656-664", the entire contents of which are incorporated herein by reference. In one embodiment of the invention, default parameters may be used when using the BLAT software.
A description of the Wise2DBA software is found in the literature "Jareborg N, Birney E, Durbin R. Comparative analysis of noncoding regions of 77 orthologous mouse and human gene pairs. Genome Research 9:815-824, 1999", the entire contents of which are incorporated herein by reference. In one embodiment of the invention, default parameters may be used when using the Wise2DBA software.
Preferably, for each siRNA, its sense strand and its antisense strand are each aligned separately to the sequence of the genomic mRNA.
Preferably, the characteristic of the mismatched base comprises the number of mismatched bases, optionally including the position of the mismatched base.
Preferably, the secondary structure characteristic of the off-target gene mRNA sequence is the probability that the mRNA itself does not form a secondary structure in the complementary region. The secondary structure of the mRNA in a complementary region can affect the probability that the mRNA binds to a complementary siRNA in that region.
Preferably, for each selected off-target gene, calculating the interference rate of the siRNA to the expression quantity of the off-target gene mRNA according to the characteristics of the mismatched bases, and calculating the product of the interference rate and the probability of not forming secondary structure, thereby obtaining the off-target weight of the off-target gene.
If the mRNA of a particular off-target gene has multiple regions complementary to the same siRNA sequence, an off-target weight is calculated for each complementary region, and the maximum of these values is taken as the off-target weight of that gene.
Different degrees of sequence matching of siRNA and mRNA can lead to different interference rates, for example, the interference rate decreases with increasing number of mismatched bases, and generally, if the number of mismatched bases reaches more than 7, the interference rate of siRNA on mRNA expression amount is negligible. The interference rate of siRNA to mRNA expression can be determined theoretically or by biological experiments.
For example, the following method can be used to determine the interference rate, on the expression level of a given mRNA, of siRNAs having different numbers of mismatched bases relative to that mRNA: detecting the expression level of the given mRNA in a suitable cell (hereinafter, the natural expression level) by qRT-PCR; transfecting siRNAs having different numbers of mismatched bases relative to the given mRNA into the cells separately, and detecting the expression level of the mRNA under each mismatch condition (hereinafter, the interference expression level) by qRT-PCR; and calculating the ratio of each interference expression level to the natural expression level and subtracting that ratio from 1 to obtain the interference rate of the siRNA for each number of mismatched bases.
In addition, the invention applies curve fitting to the interference rates of siRNAs with different numbers of mismatched bases and finds that a nonlinear fitting formula can be obtained, with which the interference rate of siRNAs with different numbers of mismatched bases on the mRNA expression level can be calculated. The interference rate calculated by the fitting formula closely matches the actual interference rate, and the accuracy is good.
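The calculation in the last step above can be written out as a short sketch. It assumes that the two expression levels are already available as positive numbers on the same scale (for example, relative quantities from qRT-PCR); the function name and the example values are hypothetical.

```python
def interference_rate(interference_expression: float,
                      natural_expression: float) -> float:
    """Interference rate = 1 - (interference expression / natural expression)."""
    if natural_expression <= 0:
        raise ValueError("natural expression level must be positive")
    return 1.0 - interference_expression / natural_expression


# Hypothetical example: expression drops to a quarter of the natural level.
rate = interference_rate(0.25, 1.0)
```

With these illustrative numbers, the siRNA would be said to interfere with 75% of the mRNA expression.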
In one embodiment of the present invention, the non-linear fitting formula is: 1) for mismatched bases at the 3' end: $y_{3'} = -0.01316\,x_{3'}^{2} - 0.03245\,x_{3'} + 1.0238$, wherein $x_{3'}$ is the number of mismatched bases at the 3' end and $y_{3'}$ is the interference rate at the 3' end; 2) for mismatched bases at the 5' end: $y_{5'} = -0.01313\,x_{5'}^{2} + 0.03223\,x_{5'} + 0.95513$, wherein $x_{5'}$ is the number of mismatched bases at the 5' end and $y_{5'}$ is the interference rate at the 5' end. The method for obtaining the nonlinear fitting formula of the present invention may be, for example, as described in test example 1 of the present specification. Although the non-linear formula in test example 1 was obtained using the human MGMT gene (O-6-Methylguanine-DNA Methyltransferase) as the off-target gene, the non-linear fitting formula of the present invention is not limited thereto and can be applied to other off-target genes.
Further, the non-linear fitting formula of the present invention can be further optimized according to, for example, the method described in experimental example 1 of the present specification, to improve the accuracy of the coefficients of the non-linear formula.
The expression "calculating the interference rate of the siRNA on the expression level of the off-target gene mRNA based on the characteristics of the mismatched bases" in the present invention refers to the interference rate of the siRNA on the off-target gene as a whole, i.e., $y = y_{3'} \times y_{5'}$. For example, if a particular off-target gene has 2 mismatches at the 3' end of the sense strand and 3 mismatches at the 5' end of the sense strand in the region where it matches the siRNA, then the overall interference rate of the siRNA on that off-target gene is the product of the interference rates at the two ends, i.e., 0.9060 multiplied by 0.9337, which equals 0.8459.
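The worked example above can be checked with a short sketch that evaluates the two fitting formulas and multiplies the results. The coefficients are those given in this specification; the function names are hypothetical, and the sketch does not clamp results to the range 0 to 1.

```python
def interference_3p(x: int) -> float:
    """Fitted interference rate for x mismatched bases at the 3' end."""
    return -0.01316 * x ** 2 - 0.03245 * x + 1.0238


def interference_5p(x: int) -> float:
    """Fitted interference rate for x mismatched bases at the 5' end."""
    return -0.01313 * x ** 2 + 0.03223 * x + 0.95513


def overall_interference(x3: int, x5: int) -> float:
    """Overall interference rate: product of the 3'-end and 5'-end rates."""
    return interference_3p(x3) * interference_5p(x5)


y3 = interference_3p(2)          # about 0.906
y5 = interference_5p(3)          # about 0.934
y = overall_interference(2, 3)   # about 0.846
```

The small difference between this result and the value 0.8459 in the text comes only from rounding of the two intermediate rates.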
In the method of the present invention, the probability that each off-target gene mRNA does not form a secondary structure may be predicted using software selected from the group consisting of: RNAplfold, mfold or RNAstructure. When using these programs, parameters can be set as needed. A description of the RNAplfold software may be found in the literature "Lewis BP, Burge CB, Bartel DP. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell. 2005, 120(1): 15-20.", the entire contents of which are incorporated herein by reference. In one embodiment of the present invention, the RNAplfold software can be used to predict the secondary structure of the mRNA of the whole human genome, and the output can be consolidated into a local database for fast reading and higher computation speed. The RNAplfold parameters may be set to: L = 40, W = 80, u = 25. The probability that the off-target gene mRNA does not form a secondary structure is thereby obtained.
Combining this with the overall interference rate of the siRNA on the off-target gene, the off-target weight of the off-target gene is the product of the probability of not forming a secondary structure and the overall interference rate.
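A minimal sketch of the off-target weight calculation follows. It assumes that, for each complementary region of an off-target gene mRNA, the overall interference rate and the probability of not forming a secondary structure have already been obtained, and it takes the maximum over regions as described earlier for genes with several complementary regions. The function name and the numbers are hypothetical.

```python
def off_target_weight(regions):
    """regions: iterable of (overall_interference_rate, prob_no_structure),
    one pair per complementary region of one off-target gene mRNA.
    Returns the largest per-region product as the weight of the gene."""
    weights = [rate * p_open for rate, p_open in regions]
    if not weights:
        raise ValueError("at least one complementary region is required")
    return max(weights)


# Hypothetical gene with two complementary regions.
regions = [(0.8, 0.5), (0.9, 0.25)]
w = off_target_weight(regions)   # first region gives the larger product
```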
In step iii), the omics weights may be selected from one, both or all of the protein interaction weights, signal pathway weights and core gene weights of off-target genes.
Protein interaction weights can be obtained by omic annotation of each selected off-target gene using, for example, the protein interaction network database "STRING". "STRING" is one of the most authoritative protein interaction network databases in the world today, covering interaction data for known and predicted proteins (see the literature "Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, Minguez P, Doerks T, Stark M, Muller J, Bork P, Jensen LJ, von Mering C. The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res. 39 (Database issue): D561-8.", the entire contents of which are incorporated herein by reference). These interactions include direct physical interactions and indirect functional associations. The data are derived from genomic information, high-throughput biological experiments, conserved co-expression and mining of the literature. STRING integrates these underlying data quantitatively, and within a particular species each pair of interacting proteins is given a weight (in the range of 0-1000) indicating how close the relationship is. If a protein participates in multiple pairs of interactions, its protein interaction weight is the sum of the weights of the interactions in which it participates.
Signal pathway weights can be obtained by annotating each of the selected off-target genes with, for example, the human pathway database "ConsensusPathDB-human" (see the literature "Kamburov A, Pentchev K, Galicka H, Wierling C, Lehrach H, Herwig R. ConsensusPathDB: toward a more complete picture of cell biology. Nucleic Acids Res. 2011, 39 (Database issue): D712-7.", the entire contents of which are incorporated herein by reference). The database covers gene regulation, protein interactions, signal transduction, metabolism, drug targeting, biochemical reactions and other aspects, and is the most complete publicly accessible pathway database to date. For any selected off-target gene, the number of pathways in which it participates can be extracted from the database as the signal pathway weight.
Regarding core gene weights, a research team in molecular genetics at the University of Toronto is known to have switched off 18000 genes (90% of the human genome) one by one using the gene editing technology CRISPR (see the literature "Hart T, Chandrashekhar M, Aregger M, Steinhart Z, Brown KR, MacLeod G, Mis M, Zimmermann M, Fradet-Turcotte A, Sun S, Mero P, Dirks P, Sidhu S, Roth FP, Rissland OS, Durocher D, Angers S, Moffat J. High-Resolution CRISPR Screens Reveal Fitness Genes and Genotype-Specific Cancer Liabilities. Cell. 2015, 163(6)", the entire contents of which are incorporated herein by reference). Genes essential to humans are referred to herein as "core genes". If a selected off-target gene is a core gene, the toxic effect of the siRNA on cells may be greater. For any selected off-target gene, its core gene weight may be set to 1 if it is a core gene, and to 0 if it is not.
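The summation rule for the protein interaction weight can be sketched as follows. The sketch assumes the pairwise interaction weights have already been exported from the database as (protein, protein, weight) records; the record layout, the function name, the protein labels and the scores are all hypothetical and are not taken from STRING.

```python
from collections import defaultdict


def protein_interaction_weights(interactions):
    """interactions: iterable of (protein_a, protein_b, weight) records,
    one per interacting pair. Returns the summed weight per protein."""
    totals = defaultdict(int)
    for a, b, weight in interactions:
        totals[a] += weight   # each pair counts toward both partners
        totals[b] += weight
    return dict(totals)


# Hypothetical pairwise weights on the 0-1000 scale.
interactions = [
    ("geneA", "geneB", 999),
    ("geneA", "geneC", 900),
    ("geneB", "geneC", 700),
]
ppi = protein_interaction_weights(interactions)
```

Here geneA takes part in two interactions, so its protein interaction weight is the sum of the two pair weights.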
In the step iv), the characteristic values of the omics may be selected from one, two or all of proteome characteristic values, signal path set characteristic values and core genome characteristic values, and the proteome characteristic values, the signal path set characteristic values and the core genome characteristic values may each be calculated according to the following a) to c), respectively:
a) calculating the product a 'of the off-target weight of each selected off-target gene and the protein interaction weight thereof, and then calculating the sum of the products a' obtained by all the selected off-target genes respectively to obtain a proteome characteristic value;
b) calculating the product b 'of the off-target weight of each selected off-target gene and the signal path weight thereof, and then calculating the sum of the products b' obtained by all the selected off-target genes respectively to obtain the characteristic value of the signal path group;
c) and calculating the product c 'of the off-target weight of each selected off-target gene and the weight of the core gene, and then calculating the sum of the products c' obtained by all the selected off-target genes respectively to obtain the characteristic value of the core genome.
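Steps a) to c) above can be sketched as three weighted sums over the selected off-target genes. The sketch assumes each off-target gene is represented as a dictionary holding its off-target weight and its three omic weights; the key names, the function name and the numbers are hypothetical.

```python
def omic_feature_values(off_targets):
    """off_targets: list of dicts, one per selected off-target gene.
    Returns (proteome, signal pathway set, core genome) characteristic
    values, each a sum of off-target weight times the omic weight."""
    proteome = sum(g["off_target_weight"] * g["protein_interaction_weight"]
                   for g in off_targets)
    pathway = sum(g["off_target_weight"] * g["signal_pathway_weight"]
                  for g in off_targets)
    core = sum(g["off_target_weight"] * g["core_gene_weight"]
               for g in off_targets)
    return proteome, pathway, core


# Hypothetical off-target genes of one siRNA.
off_targets = [
    {"off_target_weight": 0.5, "protein_interaction_weight": 1000,
     "signal_pathway_weight": 4, "core_gene_weight": 1},
    {"off_target_weight": 0.25, "protein_interaction_weight": 400,
     "signal_pathway_weight": 8, "core_gene_weight": 0},
]
features = omic_feature_values(off_targets)
```

The resulting triple would serve as the input value of that siRNA; an embodiment using only one or two omics would simply use the corresponding entries.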
Preferably, the input values are normalized prior to establishing the machine learning model. The normalization process is to avoid that a certain kind of data affects the establishment of the model because the absolute value is too large. The (value-minimum)/(maximum-minimum) method is usually adopted to map data into the 0-1 interval one by one, which is one of the commonly used classical methods.
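The (value - minimum)/(maximum - minimum) mapping can be sketched as below, applied to one characteristic value across all siRNAs. Returning zeros when all values are equal is an assumption made here to avoid division by zero; the specification does not address that case. The function name and the numbers are hypothetical.

```python
def min_max_normalize(values):
    """Map a list of numbers into the 0-1 interval by min-max scaling."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no information, use 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical proteome characteristic values of three siRNAs.
normalized = min_max_normalize([600, 100, 350])
```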
The output values may also be binarized prior to establishing the machine learning model, but this is not required. A certain cell viability index may be used as a boundary value, and values higher than or equal to the cell viability index are defined as 1, and the remaining values are defined as 0. The cell survival index as a boundary value may be 0.75 or more. For example, with a cell survival index of 0.9 as a boundary value, a value higher than or equal to 0.9 is defined as 1, and the remainder is defined as 0.
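The two preprocessing steps just described can be sketched as follows. This is a minimal illustration, not the inventors' code: the function names are hypothetical, and the handling of a constant input column (all values mapped to 0.0) is an assumption, since the text does not address that case.

```python
def min_max_normalize(values):
    """Map values into the 0-1 interval via (value - minimum) / (maximum - minimum)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Assumption: a constant column carries no information, so map it to 0.0.
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


def binarize(survival_indices, boundary=0.9):
    """Return 1 for indices >= boundary and 0 otherwise."""
    return [1 if index >= boundary else 0 for index in survival_indices]


# Illustrative numbers only, not experimental data.
print(min_max_normalize([2.0, 4.0, 6.0]))
print(binarize([0.95, 0.90, 0.89]))
```

The boundary defaults to 0.9 to match the example in the text; any value of 0.75 or more could be passed instead.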
Preferably, the machine learning algorithm includes a support vector machine, an artificial neural network, a decision tree and a regression model. These machine learning algorithms may be implemented in the C, Perl, Python or R language, and their parameters may be set as needed. For example, when the machine learning model is built using the support vector machine algorithm, the library function svm of R may be used, and its kernel parameter (which determines how the data space is mapped) may be set to linear, polynomial, radial or sigmoid, preferably linear. When the machine learning model is built using an artificial neural network algorithm, the library function neuralnet of R may be used, and its main parameter hidden (i.e., the number of hidden neurons in each layer) can be tuned; it is preferably set to 1.
The established machine learning model can be evaluated using known evaluation methods. The most common is cross validation. For example, 8-fold cross validation, 9-fold cross validation, 10-fold cross validation, etc. may be used.
Preferably, based on the principle of action of siRNA, the selected off-target genes exclude genes for which the region complementary to the siRNA sequence is located only in the 5' untranslated region (UTR) of the mRNA.
The interfering effect of siRNA is embodied in its silencing effect on a gene; if a specific gene is not expressed in a certain cell type in the natural state, the interfering effect of siRNA on that gene can be ignored. Thus, preferably, based on a database of expression profiles of known cell lines, the selected off-target genes do not include genes that are not expressed in the natural (or normal) state in the cell of the certain type. The expression profile database of such cell lines is, for example, "THE HUMAN PROTEIN ATLAS" database (see "Uhlen M, Oksvold P, Fagerberg L, Lundberg E, Jonasson K, Forsberg M, Zwahlen M, Kampf C, Wester K, Hober S, Wernerus H, Björling L, Ponten F. Towards a knowledge-based Human Protein Atlas. Nat Biotechnol. 2010, 28(12):1248-50.", the entire contents of which are incorporated herein by reference), which contains expression data of the protein-encoding genes of common cell lines, validated at both the RNA and the protein level.
In the methods of the present invention, the siRNAs used in experiments on cells can be prepared using methods conventional in the art, including, for example, chemical synthesis, in vitro transcription, siRNA expression vectors, siRNA expression cassettes and the like.
(II) Use of the method of the invention in predicting the toxicity of siRNA to cells
In another aspect, the invention provides the use of the methods of the invention for predicting the toxicity of an siRNA against a cell of a certain type.
(III) Computer readable medium
Another aspect of the invention also provides a computer readable medium, which can be used to build the machine learning model according to the method of the invention, and which comprises the following modules:
a sequence alignment module for performing step i) of the method of the present invention;
an off-target weight calculation module for performing step ii) of the method of the present invention;
omics annotation module for performing step iii) of the method of the invention;
an omics characteristic value calculation module for performing step iv) of the method of the present invention;
a machine learning algorithm calculation module for executing step C) of the method of the present invention.
The computer readable medium may comprise an external data input module for inputting the n siRNA sequences and the corresponding cell viability index, respectively.
By way of example, FIG. 9 illustrates a schematic diagram of one embodiment of a computer-readable medium of the present invention.
(IV) Apparatus for predicting the toxicity of siRNA to a cell of a certain type
Another aspect of the present invention also provides an apparatus for predicting the toxicity of siRNA against a cell of a type, comprising:
1) an input unit for inputting the sequence of the siRNA to be tested;
2) a storage unit for storing the machine learning model established for the cell by the method of the present invention;
3) an execution unit for executing the machine learning model on the sequence of the siRNA;
4) an output unit for displaying the prediction result of the toxicity of the siRNA to the cell.
The apparatus may be an apparatus specially constructed for the purposes of the present invention, or may be a computer.
The input unit is, for example but not limited to, a keyboard, a mouse, a scanner or a touch screen, as is known in the art.
In one aspect of the invention, the storage unit may be any type of memory for storing data and/or software, including electrically programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), virtual storage locations on a network, memory devices, computer-readable media, computer disks, and information-transportable storage devices, or any other type of media suitable for storing the machine learning model.
The output unit includes, but is not limited to, any type of display, a printer.
(V) Method for predicting the toxicity of siRNA to a cell of a certain type
In another aspect, the present invention provides a method for predicting the toxicity of an siRNA against a cell of a type, comprising:
providing a sequence of the siRNA to be detected;
inputting the sequence of the siRNA into the apparatus described above, which executes the machine learning model established for the cell by the method of the present invention, thereby obtaining a prediction result of the toxicity of the siRNA to the cell.
The siRNA to be detected can be a drug candidate for resisting virus (including respiratory viruses, Ebola viruses and the like) infection. Generally such siRNA sequences can be obtained in a manner commonly used in the art. For example, the siRNA sequences to be tested are designed according to siRNA design principles well known in the art using known public or commercial siRNA design tools (e.g., Invitrogen, GenScript, Dharmacon, and/or siDirect, etc.).
An example of the siRNA design principle is: starting 50-100 bases after the gene promoter of a conserved region of the whole gene sequence of, for example, human respiratory virus, a nucleotide sequence of 19-21bp (e.g., 19bp) in the gene sequence is searched for which the following conditions are satisfied: (1) starting with G or C and ending with A or T; (2) at least 5 of the last 7 bases at the end are A or T; (3) avoiding 4 consecutive bases, such as AAAA or CCCC, thereby increasing the complexity of the bases; and/or (4) a GC content of between 30 and 52%.
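The four sequence conditions listed above can be checked mechanically. The sketch below is only one possible reading: the function name is hypothetical, it requires all four conditions at once (the text allows "and/or"), it treats U as equivalent to T, and it expects the 19-21 base target region without any overhang. Locating the search window within the viral gene is not covered.

```python
def passes_design_rules(candidate):
    """Check a 19-21 base candidate against conditions (1) to (4)."""
    sequence = candidate.upper().replace("U", "T")
    if not 19 <= len(sequence) <= 21 or set(sequence) - set("ACGT"):
        return False
    # (1) starts with G or C and ends with A or T
    ends_ok = sequence[0] in "GC" and sequence[-1] in "AT"
    # (2) at least 5 of the last 7 bases are A or T
    tail_ok = sum(base in "AT" for base in sequence[-7:]) >= 5
    # (3) no run of 4 identical consecutive bases
    runs_ok = not any(base * 4 in sequence for base in "ACGT")
    # (4) GC content between 30% and 52%
    gc_fraction = sum(base in "GC" for base in sequence) / len(sequence)
    gc_ok = 0.30 <= gc_fraction <= 0.52
    return ends_ok and tail_ok and runs_ok and gc_ok


# 19-base core of the siRNA4 sense strand (SEQ No.7 without the TT overhang).
print(passes_design_rules("CCAGACAGGUGUUAUGGAA"))
```

If only a subset of the conditions is to be enforced, the final `return` line is the place to relax them.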
The complete gene sequence of the human respiratory virus includes the complete gene sequence of a known human respiratory virus or of a new human respiratory virus. The complete gene sequences of known human respiratory viruses are directly available from the public database GenBank, and the complete gene sequences of new human respiratory viruses can be obtained by isolating the new viral strains, extracting (for example) their RNA, sequencing them and further genotyping them by known methods.
Preferably, in the method of the present invention, the respiratory virus comprises influenza virus, parainfluenza virus, respiratory syncytial virus, measles virus, mumps virus, adenovirus, rubella virus, rhinovirus, coronavirus and/or reovirus; more preferably influenza virus; further preferably influenza A virus; further preferably influenza A virus H1, H3, H5, H7 or H9; further preferably influenza A virus H1N1, H3N2, H5N1, H7N7 or H7N9.
The present invention will be further explained or illustrated below by way of examples, which should not be construed as limiting the scope of the invention.
Examples of the present invention
Taking the establishment of a machine learning model for predicting the toxicity of siRNA against human airway cells as an example, the embodiments of the invention are described as follows:
[ materials used in the test ]
1) Cell culture material
The conventional culture medium was DMEM medium (Gibco, USA) supplemented with 10% (v/v) fetal bovine serum (Hyclone, USA); DMSO is purchased from Sigma-Aldrich, USA.
2) qRT-PCR detection related reagent
The cell total RNA extraction kit, the reverse transcription kit and the fluorescent quantitative PCR kit are purchased from Promega corporation in the United states;
the transfection reagent Lipofectamine 2000 was purchased from Invitrogen, USA; the siRNA sequences were all synthesized by Invitrogen, USA.
3) Cell survival index-related reagents
The CCK-8 kit (containing CCK-8 solution) was purchased from Dojindo Laboratories, Japan.
4) Experiment consumable
The disposable consumables used in the tests were purchased from Corning, USA.
Unless otherwise indicated, the following biological experiments are performed using methods, materials, conditions and equipment conventional in the art.
Test Example 1: Interference rate of siRNA on the mRNA expression level of an off-target gene
Different degrees of sequence matching between siRNA and mRNA cause different interference effects, and the specific weight settings are supported by biological experiment data. The A549 non-small cell lung cancer cell line was selected, together with the human gene MGMT (O-6-Methylguanine-DNA Methyltransferase), which is known to be weakly expressed in the A549 cell line. A weakly expressed gene was chosen because, for a strongly expressed gene, a large dose of siRNA may be required to detect the interfering effect, and a large dose of exogenous siRNA acting on the cell may cause other effects such as immune stimulation and saturation of cellular components.
Four siRNA sequences (each siRNA comprising a pair of sense and antisense strand sequences) were designed for MGMT, as detailed in Table 1. A549 cells were transfected with each siRNA at a transfection concentration of 50 nM, and untransfected blanks were used as controls. After 48 hours of culture in complete medium (10% FBS + 90% DMEM:F12 (1:1)) in an incubator at 37 °C and 5% CO2, the mRNA expression level of MGMT was measured by qRT-PCR; the results are shown in FIG. 1. As shown in FIG. 1, the mRNA expression level of MGMT in the blank group was set to 1, and the mRNA expression levels in the transfection groups were expressed relative to it. The mRNA expression level of MGMT in the siRNA4 group was <10%, i.e., the interfering effect of siRNA4 was >90%, so siRNA4 was determined to be an effective interference sequence. The optimal siRNA transfection concentration was then sought using this effective interference sequence; FIG. 2 shows the transfection concentrations tested. As shown in FIG. 2, the effect was substantially saturated at a transfection concentration of 25 nM, so 25 nM was selected for the subsequent experiments.
Table 1: siRNA designed aiming at MGMT gene
Based on the screened valid interfering sequences, 15 mismatched sequences were synthesized, as shown in Table 2, wherein the underlined parts are the mismatched bases.
Table 2: sequence design of mismatch siRNA against MGMT gene
A549 cells were transfected with each of these siRNAs, and a blank group (untransfected), a negative control group (transfected with a random-sequence siRNA synthesized by Invitrogen, i.e., an siRNA not directed against the MGMT gene) and a positive control group (the highly efficient MGMT knock-down siRNA, siRNA4) were set up. After culturing for 48 hours under the culture conditions described above, the effect of each mismatch-sequence siRNA on the mRNA level of MGMT was examined by qRT-PCR. The results are shown in FIG. 3, in which all mRNA expression levels are expressed relative to the blank control group. It can be seen that as the number of mismatched bases increases, the mRNA expression level also increases, i.e., the interfering effect of the siRNA decreases. The trend is the same whether the mismatched bases are at the 5' end or at the 3' end; only the weight coefficients (interference rates) differ. The interference rate of each siRNA was obtained from the mRNA expression data as follows: the ratio of the expression level of each experimental group to that of the blank control group was calculated, and this ratio was subtracted from 1. The interference rates of these siRNAs were then subjected to curve fitting. Since the mRNA expression level in the negative control group was about 0.6, and the mRNA expression levels in the siRNA10 and siRNA11 groups were also close to 0.6, siRNA10 and siRNA11 were not included in the curve fitting. The fitted curves are shown in FIGS. 4 and 5. The non-linear fit equations for mismatches at the 3' end (FIG. 4) and at the 5' end (FIG. 5) are, respectively: 1) for mismatched bases at the 3' end: $y_{3'} = -0.01316\,x_{3'}^{2} - 0.03245\,x_{3'} + 1.0238$, where $x_{3'}$ is the number of mismatched bases at the 3' end and $y_{3'}$ is the interference rate of the 3' end; 2) for mismatched bases at the 5' end: $y_{5'} = -0.01313\,x_{5'}^{2} + 0.03223\,x_{5'} + 0.95513$, where $x_{5'}$ is the number of mismatched bases at the 5' end and $y_{5'}$ is the interference rate of the 5' end.
The interference rate of the siRNA on an off-target gene is $y = y_{3'} \times y_{5'}$.
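The fitted curves and their product can be evaluated with a few lines of code. This is a minimal sketch using the coefficients reported above; the function names are hypothetical, and no clamping to the 0-1 range is applied because the text does not describe any.

```python
def interference_3_prime(mismatches):
    """Fitted interference rate for a given number of 3'-end mismatches."""
    return -0.01316 * mismatches ** 2 - 0.03245 * mismatches + 1.0238


def interference_5_prime(mismatches):
    """Fitted interference rate for a given number of 5'-end mismatches."""
    return -0.01313 * mismatches ** 2 + 0.03223 * mismatches + 0.95513


def interference_rate(mismatches_3_prime, mismatches_5_prime):
    """Overall interference rate: product of the rates at the two ends."""
    if mismatches_3_prime < 0 or mismatches_5_prime < 0:
        raise ValueError("mismatch counts must be non-negative")
    return (interference_3_prime(mismatches_3_prime)
            * interference_5_prime(mismatches_5_prime))


# One mismatch at the 3' end and five at the 5' end.
print(round(interference_rate(1, 5), 4))
```

For one 3'-end mismatch and five 5'-end mismatches the two factors round to 0.9782 and 0.7880, which are the values used in the ERCC6 example in Example 1.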
Example 1: process for establishing machine learning model for predicting toxicity of siRNA to human respiratory system cells
A. Providing siRNA for establishing machine learning model
The above 16 siRNAs (siRNA 4 in Table 1 and 15 mismatch sequence siRNA5-siRNA19 in Table 2) were used to establish a machine learning model.
B. Separately obtaining input values and output values for building a machine learning model
Wherein the input value of any one siRNA in the 16 siRNAs is obtained by:
i) comparing the siRNA sequence with the human genome mRNA sequence, and further screening off-target genes based on functional annotation and expression profile database
To initially identify the off-target genes of a given siRNA, a database of human genome mRNA sequences was created and localized (i.e., the mRNA sequences were downloaded to hard disk so that subsequent work does not depend on the network), and the siRNA sequences were comprehensively aligned against the human genome mRNA sequences using the BLAST software (version 2.2.31) (see "Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. BLAST+: architecture and applications. BMC Bioinformatics. 2009, 10:421."). To obtain a full set of alignments rather than only highly similar ones, the blastn mode of the BLAST software was chosen. The BLAST parameters were set to commonly used values, specifically: evalue 1000, word size 7, gapopen 5, gapextend 2, penalty 3 and reward 2. In the alignment, the sense and antisense strands of the siRNA are aligned separately.
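The alignment settings above map onto BLAST+ command-line options roughly as follows. This sketch only assembles the argument list and does not run BLAST. The file names and database name are hypothetical, the tabular output format (`-outfmt 6`) is an assumption, and the mismatch penalty is passed as -3 because BLAST+ expects a negative value for `-penalty`.

```python
def build_blastn_command(query_fasta, database, output_path):
    """Assemble a blastn call using the parameter values listed in the text."""
    return [
        "blastn",
        "-task", "blastn",        # full blastn rather than megablast
        "-query", query_fasta,
        "-db", database,
        "-evalue", "1000",
        "-word_size", "7",
        "-gapopen", "5",
        "-gapextend", "2",
        "-penalty", "-3",         # assumption: the text's penalty of 3, with sign
        "-reward", "2",
        "-outfmt", "6",           # assumption: tabular output for later parsing
        "-out", output_path,
    ]


# Sense and antisense strands are aligned separately, so one call per strand.
for strand in ("sense", "antisense"):
    command = build_blastn_command(
        f"sirna4_{strand}.fa", "human_mrna", f"sirna4_{strand}.tsv"
    )
    print(" ".join(command))
```

The resulting list can be handed to `subprocess.run` on a machine where BLAST+ and a local mRNA database are installed.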
Through comparison, a complete preliminary off-target gene list is obtained, and then functional annotation is carried out on the regions matched with the siRNA and the mRNA of each off-target gene, namely, whether the action region of the siRNA is distributed in the 5 'UTR, the 3' UTR or the coding region of the mRNA is marked. Based on the principle of action of siRNA, in the subsequent analysis, only off-target genes whose siRNA match site is located in the 3' UTR and/or coding region of its mRNA are of interest.
Off-target genes that are not themselves expressed in human respiratory cells (e.g., the A549 non-small cell lung cancer cell line) are deleted from the off-target gene list using an expression profile database of known cell lines. The expression profile data for the cell lines were derived from "THE HUMAN PROTEIN ATLAS" database.
A series of off-target genes is thereby selected. More than one hundred off-target genes were obtained for each of the 16 siRNAs; the number of off-target genes for each siRNA is shown in Table 3.
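The two filtering steps described above (match-site region, then expression in the cell line) can be sketched as a single pass over the alignment hits. The data layout, region labels and gene names below are hypothetical; in practice they would come from the annotated BLAST output and the expression profile database.

```python
ALLOWED_REGIONS = {"CDS", "3UTR"}  # coding region or 3' UTR; 5' UTR-only hits are dropped


def select_off_target_genes(hits, expressed_genes):
    """Keep genes with at least one allowed match site that are expressed in the cell line.

    hits: iterable of (gene_name, region) pairs, one per siRNA/mRNA match site.
    expressed_genes: set of gene names expressed in the cell type of interest.
    """
    selected = {
        gene
        for gene, region in hits
        if region in ALLOWED_REGIONS and gene in expressed_genes
    }
    return sorted(selected)


# Illustrative input only.
example_hits = [
    ("GENE1", "5UTR"), ("GENE1", "CDS"),  # kept: has a coding-region site
    ("GENE2", "5UTR"),                    # dropped: 5' UTR only
    ("GENE3", "3UTR"),                    # dropped below: not expressed
]
print(select_off_target_genes(example_hits, {"GENE1", "GENE2"}))
```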
Table 3: statistics of off-target Gene number of siRNA
| Name of siRNA | Number of off-target genes |
| siRNA4 | 138 |
| siRNA5 | 131 |
| siRNA6 | 140 |
| siRNA7 | 124 |
| siRNA8 | 131 |
| siRNA9 | 120 |
| siRNA10 | 134 |
| siRNA11 | 101 |
| siRNA12 | 132 |
| siRNA13 | 136 |
| siRNA14 | 121 |
| siRNA15 | 127 |
| siRNA16 | 129 |
| siRNA17 | 121 |
| siRNA18 | 151 |
| siRNA19 | 151 |
ii) determining off-target weights for selected off-target genes
The off-target gene was weighted based on the interference rate of the curve fitting obtained in test example 1.
For example, if the region in which a specific off-target gene (human ERCC6, Excision Repair Cross-Complementation 6) matches a specific siRNA (e.g., siRNA4, sense strand sequence CCAGACAGGUGUUAUGGAATT (SEQ No.7)) has 1 mismatch at the 3' end of the sense strand and 5 mismatches at the 5' end of the sense strand, the interference rate of the siRNA on the off-target gene as a whole is the product of the interference rates at the two ends, i.e., 0.9782 × 0.7880 = 0.7708.
For the complementary region, the probability that the off-target gene mRNA itself does not form secondary structure is determined using the software RNAplfold (version 2.2.4) (see "Lewis BP, Burge CB, Bartel DP. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell. 2005, 120(1):15-20."). Specifically, the software is used to predict the secondary structure of the mRNAs of the whole human genome, and the relevant text and numerical information is extracted from the output to form a local database, which allows high-speed reading and improves calculation speed. The RNAplfold parameters were set as follows: L = 40, W = 80, u = 25. For example, for the region in which the mRNA of the above off-target gene is complementary to the siRNA sequence, the probability that the off-target gene mRNA does not form secondary structure is 0.5425 and the overall interference rate obtained from the interference rates at the two ends is 0.7708, so the off-target weight of the off-target gene is 0.5425 × 0.7708 = 0.4182.
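The off-target weight is thus a product of two probabilities-like factors. A minimal sketch, with a hypothetical function name and an added range check on the unpaired probability that the text does not mention:

```python
def off_target_weight(unpaired_probability, interference_rate):
    """Off-target weight = P(region forms no secondary structure) x interference rate."""
    if not 0.0 <= unpaired_probability <= 1.0:
        raise ValueError("unpaired_probability must lie between 0 and 1")
    return unpaired_probability * interference_rate


# ERCC6 / siRNA4 example from the text.
print(round(off_target_weight(0.5425, 0.7708), 4))
```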
iii)-iv) performing omics annotation on the selected off-target genes to obtain omics weights, and calculating the omics characteristic values from the omics weights and the off-target weights
(1) Calculating proteomic feature values based on the protein interaction weights and off-target weights of all selected off-target genes
The human links table of the STRING database was localized (i.e., downloaded to the hard disk of the local computer), and the protein names were converted into common gene names for the calculations. For a specific siRNA acting on cells, the possible off-target genes and their off-target weights were determined by the methods described above. FIG. 6 illustrates a simplified example in which an siRNA has seven off-target genes (indicated by circles), each with a schematic off-target weight (the number in the circle). Which of these genes interact, and with what protein interaction weights, is determined from the STRING database. In the example of FIG. 6, three genes are connected by lines because they interact, and the numbers on the lines are the interaction weights. The proteome characteristic value is thus calculated as 0.9 × (280 + 160) + 0.8 × 280 + 0.6 × 160. That is, for each off-target gene: if it is involved in one interaction, its off-target weight is multiplied by the protein interaction weight; if it is involved in several interactions, its off-target weight is multiplied by the sum of the protein interaction weights; and if it is isolated, its effect is ignored. The proteome characteristic values calculated for the off-target genes of each siRNA are shown in Table 4.
Table 4: calculation results of proteome characteristic values of off-target genes of siRNA
| Name of siRNA | Proteome characteristic value |
| siRNA4 | 150.0129 |
| siRNA5 | 135.2095 |
| siRNA6 | 182.5355 |
| siRNA7 | 97.8546 |
| siRNA8 | 102.6913 |
| siRNA9 | 88.8456 |
| siRNA10 | 106.3368 |
| siRNA11 | 65.3091 |
| siRNA12 | 141.7128 |
| siRNA13 | 101.5539 |
| siRNA14 | 82.2402 |
| siRNA15 | 107.9213 |
| siRNA16 | 107.9795 |
| siRNA17 | 122.7014 |
| siRNA18 | 182.3832 |
| siRNA19 | 134.8214 |
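The FIG. 6 calculation described before Table 4 can be reproduced with a short function. This is a sketch under stated assumptions: the gene names G1 to G7 are placeholders, the off-target weights of the four isolated genes are invented for illustration (FIG. 6 values for them are not given in the text), and only interactions between two off-target genes of the same siRNA are counted.

```python
def proteome_feature(off_target_weights, interactions):
    """Sum over genes of off-target weight x summed interaction weights.

    off_target_weights: dict mapping gene name to off-target weight.
    interactions: iterable of (gene_a, gene_b, interaction_weight) triples.
    """
    summed_scores = {gene: 0.0 for gene in off_target_weights}
    for gene_a, gene_b, score in interactions:
        if gene_a in summed_scores and gene_b in summed_scores:
            summed_scores[gene_a] += score
            summed_scores[gene_b] += score
    # Isolated genes keep a summed score of 0 and therefore contribute nothing.
    return sum(
        weight * summed_scores[gene] for gene, weight in off_target_weights.items()
    )


weights = {"G1": 0.9, "G2": 0.8, "G3": 0.6,              # interacting genes of FIG. 6
           "G4": 0.5, "G5": 0.4, "G6": 0.3, "G7": 0.2}   # isolated, weights hypothetical
edges = [("G1", "G2", 280), ("G1", "G3", 160)]
print(round(proteome_feature(weights, edges), 6))
```

The result equals 0.9 × (280 + 160) + 0.8 × 280 + 0.6 × 160 = 716, independent of the weights chosen for the isolated genes.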
(2) Calculating a signal path set eigenvalue based on the signal path weights and off-target weights of all selected off-target genes
The human pathway database ConsensusPathDB-human (version 31) was localized. The off-target weight of each off-target gene is multiplied by the number of pathways in which that gene is involved, and the sum of these products is taken as the signal path set characteristic value; if an off-target gene is isolated (i.e., involved in no known pathway), its influence is ignored. For example, if three off-target genes A, B and C are identified, and the database shows that A participates in 3 known pathways, B participates in 2 known pathways and C is isolated, then the signal path set characteristic value is (off-target weight of A × 3) + (off-target weight of B × 2). The signal path set characteristic values calculated for the off-target genes of each siRNA are shown in Table 5.
Table 5: calculation results of Signal pathway group characteristic values of off-target genes of siRNA
| Name of siRNA | Signal path group characteristic value |
| siRNA4 | 653.2424 |
| siRNA5 | 585.7767 |
| siRNA6 | 742.5516 |
| siRNA7 | 372.6335 |
| siRNA8 | 404.0694 |
| siRNA9 | 416.7108 |
| siRNA10 | 419.1286 |
| siRNA11 | 318.9158 |
| siRNA12 | 717.8643 |
| siRNA13 | 476.5563 |
| siRNA14 | 362.0600 |
| siRNA15 | 368.6291 |
| siRNA16 | 440.0923 |
| siRNA17 | 551.1228 |
| siRNA18 | 837.8167 |
| siRNA19 | 258.3346 |
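The A, B, C example given before Table 5 can be sketched as a weighted count. The off-target weights used here are invented purely so that the arithmetic is easy to follow; the function name is likewise hypothetical.

```python
def signal_path_feature(off_target_weights, pathway_counts):
    """Sum over genes of off-target weight x number of pathways the gene takes part in."""
    return sum(
        weight * pathway_counts.get(gene, 0)  # genes absent from the database count as isolated
        for gene, weight in off_target_weights.items()
    )


weights = {"A": 0.5, "B": 0.25, "C": 0.75}   # hypothetical off-target weights
pathways = {"A": 3, "B": 2}                  # C is isolated
print(signal_path_feature(weights, pathways))  # 0.5 * 3 + 0.25 * 2
```

The core genome characteristic value follows the same pattern with a weight of 1 for core genes and 0 for all other genes.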
(3) Calculating core genome feature values based on core gene weights and off-target weights of all selected off-target genes
More than 1500 core genes have been identified to date. For example, if 4 off-target genes A', B', C' and D' are identified and, according to the known core genes, B' and C' are core genes, then the core genome characteristic value is the off-target weight of B' plus the off-target weight of C'. The core genome characteristic values calculated for the off-target genes of each siRNA are shown in Table 6.
Table 6: calculation results of core genome characteristic values of off-target genes of siRNA
| Name of siRNA | Core genome characteristic value |
| siRNA4 | 7.6147 |
| siRNA5 | 6.6085 |
| siRNA6 | 8.0534 |
| siRNA7 | 5.7126 |
| siRNA8 | 8.5514 |
| siRNA9 | 5.3999 |
| siRNA10 | 5.8094 |
| siRNA11 | 4.6920 |
| siRNA12 | 7.3661 |
| siRNA13 | 5.4217 |
| siRNA14 | 5.7174 |
| siRNA15 | 4.9374 |
| siRNA16 | 6.1381 |
| siRNA17 | 4.1732 |
| siRNA18 | 7.2726 |
| siRNA19 | 4.0797 |
Wherein the output value of any one siRNA in the 16 siRNAs is obtained by:
A549 cells were transfected with each of the above 16 siRNAs (siRNA4 in Table 1 and the 15 mismatch-sequence siRNAs siRNA5-siRNA19 in Table 2), and a blank group (untransfected) and a negative control group (transfected with a random-sequence siRNA synthesized by Invitrogen, i.e., an siRNA not directed against the MGMT gene) were set up. After culturing for 48 hours under the culture conditions described above, 10 μL of CCK-8 solution was added to each well, and the culture plates were incubated in an incubator for 0.5 to 1 hour. The absorbance at 450 nm was measured with a microplate reader, the OD450 data were collected, and the ratio of the OD450 value of each experimental group to that of the blank group was calculated to obtain the cell survival index of each group. The results are shown in FIG. 7.
Comparing FIG. 7 with FIG. 3, it can be seen that the survival index of the cells transfected with each siRNA shows no significant correlation with the mRNA expression level of MGMT, and no clear relationship with the positions or number of mismatches in the siRNA sequence, which indicates that the differences in cell survival index are caused by off-target effects of the siRNA. Beyond the off-target genes themselves, the siRNA also affects other genes, because each off-target gene takes part in complex network interactions at the transcriptome, proteome, pathway and other levels.
C. Establishing machine learning model through machine learning algorithm
(1) Establishing machine learning model through machine learning algorithm ANN
As described above, the proteome characteristic value, the signal path set characteristic value and the core genome characteristic value are obtained for each specific siRNA, and these data are normalized before being used as input values of the machine learning algorithm. The data are mapped into the 0-1 interval one by one using the (value - minimum)/(maximum - minimum) method. The normalized proteome, signal path set and core genome characteristic values are shown in Table 7.
Table 7: normalized proteome, signal pathway set and core genome feature results
The output value data of the machine learning algorithm, that is, the survival index of the cells in the presence of the siRNA, are binarized before use (for example, with a survival index of 0.9 as the boundary value, values higher than or equal to 0.9 are set to 1 and the rest are set to 0). The binarized cell survival indices are shown in Table 8.
Table 8: binary processed cell survival index results
| Name of siRNA | Binarized cell survival index |
| siRNA4 | 1 |
| siRNA5 | 1 |
| siRNA6 | 1 |
| siRNA7 | 1 |
| siRNA8 | 1 |
| siRNA9 | 0 |
| siRNA10 | 0 |
| siRNA11 | 1 |
| siRNA12 | 0 |
| siRNA13 | 0 |
| siRNA14 | 0 |
| siRNA15 | 0 |
| siRNA16 | 0 |
| siRNA17 | 0 |
| siRNA18 | 0 |
| siRNA19 | 0 |
The normalized proteome, signal path set and core genome characteristic values were used as input values, and the binarized cell survival index as the output value, of an artificial neural network (ANN) algorithm, using the library function neuralnet of R. The main adjustable parameter is hidden, which is set to 1 in the preferred embodiment.
The model was evaluated by 8-fold cross validation: the data set was divided into 8 parts, 7 parts were used for training and 1 part for validation in turn, and the average of the 8 results was taken as the estimate of the accuracy of the algorithm. The accuracy of the algorithm reached 56.25%.
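The 8-fold procedure can be sketched independently of the learning algorithm. The code below is an illustration only: the fold assignment (contiguous, unshuffled) is an assumption because the text does not say how the 16 siRNAs were split, and the majority-class "model" is a stand-in baseline, not the neuralnet or svm models of this example.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, validation
        start += size


def cross_validated_accuracy(features, labels, fit, predict, k=8):
    """Average the per-fold validation accuracy over k folds."""
    fold_accuracies = []
    for train, validation in k_fold_indices(len(labels), k):
        model = fit([features[i] for i in train], [labels[i] for i in train])
        hits = sum(predict(model, features[i]) == labels[i] for i in validation)
        fold_accuracies.append(hits / len(validation))
    return sum(fold_accuracies) / len(fold_accuracies)


# Stand-in baseline: always predict the most frequent training label.
def fit_majority(train_features, train_labels):
    return max(set(train_labels), key=train_labels.count)


def predict_majority(model, feature_row):
    return model


folds = list(k_fold_indices(16, 8))
print(len(folds), [len(validation) for _, validation in folds])
```

With 16 samples and 8 equal folds of 2, the averaged accuracy is always a multiple of 1/16, which is consistent with the reported values of 56.25% (9/16) and 62.5% (10/16).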
(2) Establishing machine learning model through machine learning algorithm SVM
As described above, the proteome characteristic value, the signal path set characteristic value and the core genome characteristic value are obtained for each specific siRNA, and these data are normalized before being used as input values of the machine learning algorithm. The data are mapped into the 0-1 interval one by one using the (value - minimum)/(maximum - minimum) method. The results are shown in Table 7.
The output value data of the machine learning algorithm, that is, the survival index of the cells in the presence of the siRNA, are binarized before use (for example, with a survival index of 0.9 as the boundary value, values higher than or equal to 0.9 are set to 1 and the rest are set to 0). The results are shown in Table 8.
The normalized proteome, signal path set and core genome characteristic values were used as input values, and the binarized cell survival index as the output value, of a support vector machine (SVM) algorithm, using the library function svm of R. The main adjustable parameter is kernel, which is preferably set to linear.
The model was evaluated by 8-fold cross validation: the data set was divided into 8 parts, 7 parts were used for training and 1 part for validation in turn, and the average of the 8 results was taken as the estimate of the accuracy of the algorithm. The accuracy of the algorithm reached 62.5%.
In the present embodiment, 16 siRNAs are used (i.e., n = 16); it is understood that the accuracy of the algorithm can be further improved when the sample size of siRNAs is increased.
Example 2: predicting the toxicity of siRNA to human respiratory system cells using a machine learning model
As an example, the machine learning model obtained in Example 1 (specifically, the model established by the machine learning algorithm SVM) was used to predict the toxicity of the above 16 siRNAs to human respiratory cells. The results are shown in Table 9, which lists the values obtained by experiment (the binarized experimental values, i.e., the binarized cell survival indices shown in Table 8) and the values predicted by the machine learning model (predicted values); predicted values marked with underlining differ from the experimental values. The numerical values in Table 9 have the following meaning: with a cell survival index of 0.9 as the boundary value, values of 0.9 or more are 1 and values below 0.9 are 0, i.e., 1 indicates no cytotoxicity and 0 indicates cytotoxicity.
Table 9: Predicted and experimental toxicity of siRNAs to human respiratory cells
| Name of siRNA | Binarized experimental value | Predicted value |
| siRNA4 | 1 | 0 |
| siRNA5 | 1 | 0 |
| siRNA6 | 1 | 1 |
| siRNA7 | 1 | 0 |
| siRNA8 | 1 | 1 |
| siRNA9 | 0 | 0 |
| siRNA10 | 0 | 0 |
| siRNA11 | 1 | 0 |
| siRNA12 | 0 | 0 |
| siRNA13 | 0 | 0 |
| siRNA14 | 0 | 0 |
| siRNA15 | 0 | 0 |
| siRNA16 | 0 | 0 |
| siRNA17 | 0 | 0 |
| siRNA18 | 0 | 0 |
| siRNA19 | 0 | 0 |
As can be seen from the results shown in Table 9, the model established by the method of the present invention can predict those siRNAs that are relatively cytotoxic with high accuracy. In practical application, those siRNAs with a predictive value of 1 (no cytotoxicity) can be selected as candidate drugs.
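The agreement between the two columns of Table 9 can be tallied directly. The lists below are transcribed from Table 9 in the order siRNA4 to siRNA19; the variable and function names are illustrative.

```python
# Binarized values from Table 9, siRNA4 ... siRNA19 (1 = not cytotoxic, 0 = cytotoxic).
experimental = [1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
predicted    = [0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]


def agreement_by_class(experimental_values, predicted_values, label):
    """Fraction of samples with the given experimental label that were predicted correctly."""
    relevant = [p for e, p in zip(experimental_values, predicted_values) if e == label]
    return sum(p == label for p in relevant) / len(relevant)


overall = sum(e == p for e, p in zip(experimental, predicted)) / len(experimental)
toxic_agreement = agreement_by_class(experimental, predicted, 0)
nontoxic_agreement = agreement_by_class(experimental, predicted, 1)
print(overall, toxic_agreement, round(nontoxic_agreement, 3))
```

On these 16 siRNAs, 12 predictions agree with experiment; all 10 experimentally cytotoxic siRNAs are predicted as cytotoxic, while 2 of the 6 non-cytotoxic siRNAs are predicted as non-cytotoxic. Both siRNAs given a predicted value of 1 (siRNA6 and siRNA8) are non-cytotoxic in the experiment.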
SEQUENCE LISTING
<110> Hangzhou Kangwanda technology and technology of medicine
<120> establishment method and application of machine learning model for predicting toxicity of siRNA to certain cell
<130> FI-162414-59:52/C
<160> 38
<170> PatentIn version 3.5
<210> 1
<211> 21
<212> DNA
<213> Artificial sequence
<400> 1
ggaagccuau uuccgugaat t 21
<210> 2
<211> 21
<212> DNA
<213> Artificial sequence
<400> 2
uucacggaaa uaggcuucct t 21
<210> 3
<211> 21
<212> DNA
<213> Artificial sequence
<400> 3
gacaaggauu gugaaaugat t 21
<210> 4
<211> 21
<212> DNA
<213> Artificial sequence
<400> 4
ucauuucaca auccuuguct t 21
<210> 5
<211> 21
<212> DNA
<213> Artificial sequence
<400> 5
auggcuucug gcccaugaat t 21
<210> 6
<211> 21
<212> DNA
<213> Artificial sequence
<400> 6
uucaugggcc agaagccaut t 21
<210> 7
<211> 21
<212> DNA
<213> Artificial sequence
<400> 7
ccagacaggu guuauggaat t 21
<210> 8
<211> 21
<212> DNA
<213> Artificial sequence
<400> 8
uuccauaaca ccugucuggt t 21
<210> 9
<211> 21
<212> DNA
<213> Artificial sequence
<400> 9
ccagacaggu guuauggaut t 21
<210> 10
<211> 21
<212> DNA
<213> Artificial sequence
<400> 10
auccauaaca ccugucuggt t 21
<210> 11
<211> 21
<212> DNA
<213> Artificial sequence
<400> 11
ccagacaggu guuaugguut t 21
<210> 12
<211> 21
<212> DNA
<213> Artificial sequence
<400> 12
aaccauaaca ccugucuggt t 21
<210> 13
<211> 21
<212> DNA
<213> Artificial sequence
<400> 13
ccagacaggu guuaugcuut t 21
<210> 14
<211> 21
<212> DNA
<213> Artificial sequence
<400> 14
aagcauaaca ccugucuggt t 21
<210> 15
<211> 21
<212> DNA
<213> Artificial sequence
<400> 15
ccagacaggu guuauccuut t 21
<210> 16
<211> 21
<212> DNA
<213> Artificial sequence
<400> 16
aaggauaaca ccugucuggt t 21
<210> 17
<211> 21
<212> DNA
<213> Artificial sequence
<400> 17
ccagacaggu guuaaccuut t 21
<210> 18
<211> 21
<212> DNA
<213> Artificial sequence
<400> 18
aagguuaaca ccugucuggt t 21
<210> 19
<211> 21
<212> DNA
<213> Artificial sequence
<400> 19
ccagacaggu guuuaccuut t 21
<210> 20
<211> 21
<212> DNA
<213> Artificial sequence
<400> 20
aagguaaaca ccugucuggt t 21
<210> 21
<211> 21
<212> DNA
<213> Artificial sequence
<400> 21
ccagacaggu guauaccuut t 21
<210> 22
<211> 21
<212> DNA
<213> Artificial sequence
<400> 22
aagguauaca ccugucuggt t 21
<210> 23
<211> 21
<212> DNA
<213> Artificial sequence
<400> 23
gcagacaggu guuauggaat t 21
<210> 24
<211> 21
<212> DNA
<213> Artificial sequence
<400> 24
uuccauaaca ccugucugct t 21
<210> 25
<211> 21
<212> DNA
<213> Artificial sequence
<400> 25
ggagacaggu guuauggaat t 21
<210> 26
<211> 21
<212> DNA
<213> Artificial sequence
<400> 26
uuccauaaca ccugucucct t 21
<210> 27
<211> 21
<212> DNA
<213> Artificial sequence
<400> 27
ggugacaggu guuauggaat t 21
<210> 28
<211> 21
<212> DNA
<213> Artificial sequence
<400> 28
uuccauaaca ccugucacct t 21
<210> 29
<211> 21
<212> DNA
<213> Artificial sequence
<400> 29
ggucacaggu guuauggaat t 21
<210> 30
<211> 21
<212> DNA
<213> Artificial sequence
<400> 30
uuccauaaca ccugugacct t 21
<210> 31
<211> 21
<212> DNA
<213> Artificial sequence
<400> 31
ggucucaggu guuauggaat t 21
<210> 32
<211> 21
<212> DNA
<213> Artificial sequence
<400> 32
uuccauaaca ccugagacct t 21
<210> 33
<211> 21
<212> DNA
<213> Artificial sequence
<400> 33
ggucugaggu guuauggaat t 21
<210> 34
<211> 21
<212> DNA
<213> Artificial sequence
<400> 34
uuccauaaca ccucagacct t 21
<210> 35
<211> 21
<212> DNA
<213> Artificial sequence
<400> 35
ggucuguggu guuauggaat t 21
<210> 36
<211> 21
<212> DNA
<213> Artificial sequence
<400> 36
uuccauaaca ccacagacct t 21
<210> 37
<211> 21
<212> DNA
<213> Artificial sequence
<400> 37
ccagacagca cuuauggaat t 21
<210> 38
<211> 21
<212> DNA
<213> Artificial sequence
<400> 38
uuccauaagu gcugucuggt t 21