l, Ponten F. Towards a knowledge-based Human Protein Atlas. Nat Biotechnol. 2010; 28(12): 1248-50.", the entire contents of which are incorporated herein by reference), contains expression data for the protein-coding genes of common cell lines, validated at both the RNA level and the protein level.

In the methods of the present invention, the siRNAs used in cell experiments can be prepared by methods conventional in the art, including, for example, chemical synthesis, in vitro transcription, siRNA expression vectors, siRNA frameworks and the like.

(II) Application of the method of the invention in predicting the toxicity of siRNA to cells

In another aspect, the invention provides the use of the methods of the invention for predicting the toxicity of an siRNA against a cell of a certain type.

(III) Computer-readable medium

Another aspect of the invention also provides a computer readable medium, which can be used to build the machine learning model according to the method of the invention, and which comprises the following modules:

a sequence alignment module for performing step i) of the method of the present invention;

an off-target weight calculation module for performing step ii) of the method of the present invention;

an omics annotation module for performing step iii) of the method of the present invention;

an omics characteristic value calculation module for performing step iv) of the method of the present invention;

a machine learning algorithm calculation module for executing step C) of the method of the present invention.

The computer readable medium may comprise an external data input module for inputting the n siRNA sequences and the corresponding cell viability index, respectively.

By way of example, FIG. 9 illustrates a schematic diagram of one embodiment of a computer-readable medium of the present invention.

(IV) Device for predicting the toxicity of siRNA to a certain type of cell

Another aspect of the present invention also provides an apparatus for predicting the toxicity of an siRNA to a certain type of cell, comprising:

2) a storage unit for storing the machine learning model established for the cells by the method of the invention;

The apparatus may be an apparatus specially constructed for the purposes of the present invention, or may be a computer.

The input unit includes, but is not limited to, a keyboard, a mouse, a scanner or a touch screen, as is known in the art.

In one aspect of the invention, the storage unit may be any type of memory for storing data and/or software, including electrically programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), virtual storage locations on a network, memory devices, computer-readable media, computer disks, and information-transportable storage devices, or any other type of media suitable for storing the machine learning model.

The output unit includes, but is not limited to, any type of display or a printer.

(V) Method for predicting the toxicity of siRNA to a certain type of cell

In another aspect, the present invention provides a method for predicting the toxicity of an siRNA to a certain type of cell, comprising:

providing the sequence of the siRNA to be tested;

inputting the sequence of the siRNA into the device, and having the device execute the machine learning model established for the cell by the method of the invention, so as to obtain a prediction result of the toxicity of the siRNA to the cell.

The siRNA to be tested may be a drug candidate against viral infection (including infection by respiratory viruses, Ebola virus and the like). Such siRNA sequences can generally be obtained in a manner commonly used in the art. For example, the siRNA sequences to be tested are designed according to siRNA design principles well known in the art, using known public or commercial siRNA design tools (e.g., Invitrogen, GenScript, Dharmacon, and/or siDirect, etc.).

An example of the siRNA design principles is as follows: within a conserved region of the complete gene sequence of, for example, a human respiratory virus, starting 50-100 bases downstream of the gene promoter, a nucleotide sequence of 19-21 bp (e.g., 19 bp) is sought that satisfies the following conditions: (1) it starts with G or C and ends with A or T; (2) at least 5 of the last 7 bases are A or T; (3) it contains no run of 4 identical consecutive bases, such as AAAA or CCCC, which increases the complexity of the bases; and/or (4) its GC content is between 30% and 52%.
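The four design conditions above can be checked mechanically. The following is a minimal sketch, not the procedure used by any of the named design tools; the function name `check_design_rules` and the result keys are hypothetical, and each condition is reported separately because the text joins them with "and/or".

```python
import re


def check_design_rules(candidate: str) -> dict:
    """Report, rule by rule, whether a candidate target sequence meets the
    example design conditions. U is treated as T so RNA or DNA input works."""
    seq = candidate.upper().replace("U", "T")
    if not seq:
        raise ValueError("candidate sequence is empty")
    gc_percent = 100.0 * sum(base in "GC" for base in seq) / len(seq)
    return {
        # 19-21 nucleotides long
        "length_19_to_21": 19 <= len(seq) <= 21,
        # (1) starts with G or C and ends with A or T
        "starts_gc_ends_at": seq[0] in "GC" and seq[-1] in "AT",
        # (2) at least 5 of the last 7 bases are A or T
        "at_rich_end": sum(base in "AT" for base in seq[-7:]) >= 5,
        # (3) no run of 4 identical consecutive bases
        "no_run_of_4": re.search(r"(.)\1{3}", seq) is None,
        # (4) GC content between 30% and 52%
        "gc_30_to_52": 30.0 <= gc_percent <= 52.0,
    }


# Hypothetical 19-mer used only to exercise the checks.
report = check_design_rules("GCAGTTCAGGACTTATTAA")
print(report)
```

A caller that wants all conditions enforced at once can use `all(report.values())`.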

The complete gene sequence of the human respiratory virus includes the complete gene sequence of a known human respiratory virus or of a new human respiratory virus. The complete gene sequences of known human respiratory viruses are directly available from the public database GenBank, and the complete gene sequences of new human respiratory viruses can be obtained by isolating the new viral strains, extracting (for example) their RNA, sequencing it and further genotyping the strains by known methods.

Preferably, in the method of the present invention, the respiratory virus comprises influenza virus, parainfluenza virus, respiratory syncytial virus, measles virus, mumps virus, adenovirus, rubella virus, rhinovirus, coronavirus and/or reovirus; more preferably influenza virus; further preferably influenza A virus; further preferably influenza A virus H1, H3, H5, H7 or H9; further preferably influenza A viruses H1N1, H3N2, H5N1, H7N7 and H7N9.

The present invention will be further explained or illustrated below by way of examples, which should not be construed as limiting the scope of the invention.

Examples of the present invention

Taking the establishment of a machine learning model for predicting the toxicity of siRNA against human airway cells as an example, the embodiments of the invention are described as follows:

[Materials used in the tests]

1) Cell culture material

The conventional culture medium was DMEM (Gibco, USA) supplemented with 10% (v/v) fetal bovine serum (Hyclone, USA); DMSO was purchased from Sigma-Aldrich, USA.

2) qRT-PCR detection reagents

The total cellular RNA extraction kit, the reverse transcription kit and the fluorescent quantitative PCR kit were purchased from Promega, USA;

the transfection reagent Lipofectamine 2000 was purchased from Invitrogen, USA; all siRNA sequences were synthesized by Invitrogen, USA.

3) Cell survival index reagents

The CCK-8 kit (containing CCK-8 solution) was purchased from Dojindo Laboratories, Japan.

4) Experimental consumables

The disposable consumables used in the tests were purchased from Corning, USA.

Unless otherwise indicated, the following biological experiments are performed using methods, materials, conditions and equipment conventional in the art.

Test example 1: Interference rate of siRNA on the mRNA expression level of off-target genes

Different degrees of sequence matching between an siRNA and an mRNA cause different interference effects, so the specific weight settings are supported by biological experimental data. The A549 non-small cell lung cancer cell line was selected, together with the human gene MGMT (O-6-Methylguanine-DNA Methyltransferase), which is known to be weakly expressed in the A549 cell line. A weakly expressed gene was chosen because a strongly expressed gene may require a large dose of siRNA before the interference effect can be detected, and a large dose of exogenous siRNA acting on the cell may cause other effects such as immune stimulation and saturation of cellular components.

Four siRNAs (each comprising a pair of sense and antisense strand sequences) were designed against MGMT, as detailed in Table 1. A549 cells were transfected with each siRNA at a transfection concentration of 50 nM, and untransfected blanks were used as controls. After 48 hours of culture in complete medium (10% FBS + 90% DMEM:F12 (1:1)) in an incubator at 37 °C and 5% CO₂, the mRNA expression level of MGMT was measured by qRT-PCR; the results are shown in FIG. 1. As shown in FIG. 1, the mRNA expression level of MGMT in the blank group was set to 1, and the levels in the transfection groups were expressed relative to it. The mRNA expression level of MGMT in the siRNA4 group was <10%, i.e., the interference effect of siRNA4 was >90%, so siRNA4 was determined to be an effective interference sequence. The optimal siRNA transfection concentration was then sought using this effective interference sequence; FIG. 2 shows the transfection concentrations tested. As shown in FIG. 2, the effect was substantially saturated at 25 nM, so 25 nM was selected for the subsequent experiments.

Table 1: siRNA designed aiming at MGMT gene

Based on the screened valid interfering sequences, 15 mismatched sequences were synthesized, as shown in Table 2, wherein the underlined parts are the mismatched bases.

Table 2: sequence design of mismatch siRNA against MGMT gene

A549 cells were transfected with each of these siRNAs, and a blank group (untransfected), a negative control group (transfected with a random-sequence siRNA synthesized by Invitrogen, i.e., an siRNA not directed against the MGMT gene) and a positive control group (the highly efficient MGMT knockdown siRNA, siRNA4) were set up. After culturing for 48 hours under the culture conditions described above, the effect of each mismatch-sequence siRNA on the MGMT mRNA level was examined by qRT-PCR. The results are shown in FIG. 3, with all mRNA expression levels expressed relative to the blank control group. As the number of mismatched bases increases, the mRNA expression level also increases, i.e., the interference effect of the siRNA decreases. The trend is the same whether the mismatched bases are at the 5' end or at the 3' end; only the weight coefficient (interference rate) differs. The interference rate of each siRNA was obtained from the mRNA expression data as follows: the ratio of the expression level of each experimental group to that of the blank control group was calculated, and this ratio was subtracted from 1. The interference rates of these siRNAs were then subjected to curve fitting. Because the mRNA expression level in the negative control group was about 0.6, and the levels in the siRNA10 and siRNA11 groups were also close to 0.6, these two groups were excluded from the curve fitting. The fitted curves are shown in FIGS. 4 and 5. The non-linear fitting equations for mismatches at the 3' end (FIG. 4) and at the 5' end (FIG. 5) are as follows:

1) for mismatched bases at the 3' end: $y_{3'} = -0.01316\,x_{3'}^{2} - 0.03245\,x_{3'} + 1.0238$, where $x_{3'}$ is the number of mismatched bases at the 3' end and $y_{3'}$ is the interference rate at the 3' end;

2) for mismatched bases at the 5' end: $y_{5'} = -0.01313\,x_{5'}^{2} + 0.03223\,x_{5'} + 0.95513$, where $x_{5'}$ is the number of mismatched bases at the 5' end and $y_{5'}$ is the interference rate at the 5' end.

The interference rate of the siRNA on an off-target gene is $y = y_{3'} \times y_{5'}$.
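The two fitted curves and their product can be evaluated directly. The sketch below uses the coefficients stated above; the function names are hypothetical, and the fitted values are returned as they are, without clamping to the range 0 to 1, since the text does not describe any clamping.

```python
def interference_3prime(mismatches: int) -> float:
    """Fitted interference rate for mismatches at the 3' end (FIG. 4 curve)."""
    return -0.01316 * mismatches**2 - 0.03245 * mismatches + 1.0238


def interference_5prime(mismatches: int) -> float:
    """Fitted interference rate for mismatches at the 5' end (FIG. 5 curve)."""
    return -0.01313 * mismatches**2 + 0.03223 * mismatches + 0.95513


def interference_rate(mismatches_3prime: int, mismatches_5prime: int) -> float:
    """Overall interference rate: product of the rates at the two ends."""
    return interference_3prime(mismatches_3prime) * interference_5prime(mismatches_5prime)


# 1 mismatch at the 3' end and 5 mismatches at the 5' end.
print(round(interference_3prime(1), 4))   # 0.9782
print(round(interference_5prime(5), 4))   # 0.788
print(round(interference_rate(1, 5), 4))  # 0.7708
```

These three values agree with the ERCC6 worked example given later in the description (0.9782, 0.7880 and 0.7708).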

Example 1: process for establishing machine learning model for predicting toxicity of siRNA to human respiratory system cells

A. Providing siRNA for establishing machine learning model

The above 16 siRNAs (siRNA 4 in Table 1 and 15 mismatch sequence siRNA5-siRNA19 in Table 2) were used to establish a machine learning model.

B. Separately obtaining input values and output values for building a machine learning model

Wherein the input value of any one siRNA in the 16 siRNAs is obtained by:

i) comparing the siRNA sequence with the human genome mRNA sequence, and further screening off-target genes based on functional annotation and expression profile database

To initially identify the off-target genes of a given siRNA, a database of human genome mRNA sequences was created and localized (i.e., the mRNA sequences were downloaded to the hard disk so that subsequent work does not depend on the network), and the siRNA sequences were aligned comprehensively against the human genome mRNA sequences with the BLAST software (version 2.2.31) (see the reference "Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. BLAST+: architecture and applications. BMC Bioinformatics. 2009, 10: 421."). To obtain a full alignment rather than only highly similar alignments, the blastn mode of the BLAST software was chosen. The BLAST parameters were set as follows: evalue 1000, word_size 7, gapopen 5, gapextend 2, penalty -3 and reward 2. The sense and antisense strands of each siRNA were aligned separately.
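As an illustration, the alignment settings listed above map onto a BLAST+ `blastn` command line roughly as follows. This is a sketch under assumptions: the file and database names are hypothetical, `-task blastn` is one reading of "blastn" as opposed to the megablast default, the mismatch penalty is passed as -3 because BLAST+ expects a negative value, and the tabular output format (`-outfmt 6`) is a choice made here rather than something stated in the text.

```python
def build_blastn_command(query_fasta: str, database: str, out_path: str) -> list:
    """Assemble a blastn argument list using the parameter values from the text."""
    return [
        "blastn",
        "-task", "blastn",      # assumption: full blastn rather than megablast
        "-query", query_fasta,  # one strand (sense or antisense) per run
        "-db", database,        # local human mRNA database
        "-evalue", "1000",
        "-word_size", "7",
        "-gapopen", "5",
        "-gapextend", "2",
        "-penalty", "-3",       # assumption: sign added, BLAST+ needs a negative penalty
        "-reward", "2",
        "-outfmt", "6",         # assumption: tabular output for later parsing
        "-out", out_path,
    ]


# Hypothetical file names; sense and antisense strands are aligned in separate runs.
for strand in ("sense", "antisense"):
    command = build_blastn_command(f"siRNA4_{strand}.fa", "human_mrna", f"siRNA4_{strand}.tsv")
    print(" ".join(command))
```

With BLAST+ installed and a local database built, each argument list could be passed to `subprocess.run(command, check=True)`.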

Through comparison, a complete preliminary off-target gene list is obtained, and then functional annotation is carried out on the regions matched with the siRNA and the mRNA of each off-target gene, namely, whether the action region of the siRNA is distributed in the 5 'UTR, the 3' UTR or the coding region of the mRNA is marked. Based on the principle of action of siRNA, in the subsequent analysis, only off-target genes whose siRNA match site is located in the 3' UTR and/or coding region of its mRNA are of interest.

Off-target genes that are not themselves expressed in human respiratory system cells (e.g., the A549 non-small cell lung cancer cell line) were deleted from the off-target gene list using an expression profile database of known cell lines. The expression profile data for the cell lines were taken from the "THE HUMAN PROTEIN ATLAS" database.
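The two screening rules (match site in the 3' UTR and/or coding region, and expression in the cell line of interest) amount to a simple filter over the alignment hits. The sketch below is illustrative only: the `Hit` record, the region labels and the gene names are hypothetical, and how hits are annotated with a region is outside its scope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    gene: str
    region: str  # hypothetical labels: "5UTR", "CDS" or "3UTR"


def select_off_targets(hits, expressed_genes):
    """Keep genes with at least one match site in the CDS or 3' UTR
    that are also expressed in the cell line of interest."""
    selected = []
    for hit in hits:
        in_relevant_region = hit.region in {"CDS", "3UTR"}
        if in_relevant_region and hit.gene in expressed_genes and hit.gene not in selected:
            selected.append(hit.gene)
    return selected


# Hypothetical hits and expression set.
hits = [
    Hit("GENE_A", "3UTR"),
    Hit("GENE_B", "5UTR"),  # dropped: match site only in the 5' UTR
    Hit("GENE_C", "CDS"),   # dropped: not expressed in the cell line
    Hit("GENE_A", "CDS"),   # same gene again, listed once
]
print(select_off_targets(hits, expressed_genes={"GENE_A", "GENE_B"}))
```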

A series of off-target genes was thereby selected. More than one hundred off-target genes were obtained for each of the 16 siRNAs; the specific numbers of off-target genes are shown in Table 3.

Table 3: statistics of off-target Gene number of siRNA

Name of siRNA	Number of off-target genes
siRNA4	138
siRNA5	131
siRNA6	140
siRNA7	124
siRNA8	131
siRNA9	120
siRNA10	134
siRNA11	101
siRNA12	132
siRNA13	136
siRNA14	121
siRNA15	127
siRNA16	129
siRNA17	121
siRNA18	151
siRNA19	151

ii) determining off-target weights for selected off-target genes

The off-target gene was weighted based on the interference rate of the curve fitting obtained in test example 1.

For example, suppose that in the region where a specific off-target gene (human ERCC6, Excision Repair Cross-Complementation 6) matches a specific siRNA (e.g., siRNA4, sense strand sequence CCAGACAGGUGUUAUGGAATT (SEQ No. 7)), there is 1 mismatch at the 3' end of the sense strand and there are 5 mismatches at the 5' end of the sense strand. The interference rate of the siRNA on the off-target gene as a whole is then the product of the interference rates at the two ends, i.e., 0.9782 × 0.7880 = 0.7708.

For the complementary region, the probability that the off-target gene mRNA itself does not form secondary structure was determined using the software RNAplfold (version 2.2.4) (see "Lewis BP, Burge CB, Bartel DP. Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell. 2005, 120(1): 15-20."). Specifically, the software was used to predict the secondary structure of the mRNAs of the whole human genome, and the relevant text and numerical information was extracted from the output to form a local database, which allows fast reading and speeds up the calculation. The RNAplfold parameters were set as follows: L = 40, W = 80, u = 25. For example, for the region where the mRNA of the above off-target gene is complementary to the siRNA sequence, the probability that the off-target gene mRNA does not form secondary structure is 0.5425 and the overall interference rate obtained from the interference rates at the two ends is 0.7708, so the off-target weight of the off-target gene is 0.5425 × 0.7708 = 0.4182.
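The off-target weight is therefore a single product of two quantities. A minimal sketch follows; the function name is hypothetical and the range check on the probability is an added safeguard, not part of the described method.

```python
def off_target_weight(prob_unstructured: float, interference_rate: float) -> float:
    """Off-target weight = P(complementary region is free of secondary structure)
    multiplied by the overall interference rate of the siRNA on that gene."""
    if not 0.0 <= prob_unstructured <= 1.0:
        raise ValueError("prob_unstructured must be a probability between 0 and 1")
    return prob_unstructured * interference_rate


# Values from the ERCC6 worked example in the text.
print(round(off_target_weight(0.5425, 0.7708), 4))  # 0.4182
```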

iii)-iv) Performing omics annotation on the selected off-target genes to obtain omics weights, and calculating the omics feature values from the omics weights and the off-target weights

(1) Calculating proteomic feature values based on the protein interaction weights and off-target weights of all selected off-target genes

The human links table of the STRING database was localized (i.e., downloaded to the hard disk of the local computer), and the protein names were converted into common gene names for the calculations. For a specific siRNA acting on cells, the possible off-target genes and their weights were determined by the methods described above. FIG. 6 illustrates a simplified example in which an siRNA has seven off-target genes (indicated by circles), each with a schematic off-target weight (the number in the circle). Which of these genes interact, and the corresponding protein interaction weights, were determined from the information in the STRING database. In the example of FIG. 6, three genes interact; interacting genes are connected by a line, and the number on the line is the interaction weight. The proteomic feature value is thus calculated as 0.9 × (280 + 160) + 0.8 × 280 + 0.6 × 160. That is, for each off-target gene involved in an interaction, its off-target weight is multiplied by the protein interaction weight; if it is involved in multiple interactions, its off-target weight is multiplied by the sum of the weights of those interactions; if the off-target gene is isolated, its effect is ignored. The proteomic feature values calculated for the off-target genes of each siRNA are shown in Table 4.

Table 4: calculation results of proteome characteristic values of off-target genes of siRNA

Name of siRNA	Proteomic feature value
siRNA4	150.0129
siRNA5	135.2095
siRNA6	182.5355
siRNA7	97.8546
siRNA8	102.6913
siRNA9	88.8456
siRNA10	106.3368
siRNA11	65.3091
siRNA12	141.7128
siRNA13	101.5539
siRNA14	82.2402
siRNA15	107.9213
siRNA16	107.9795
siRNA17	122.7014
siRNA18	182.3832
siRNA19	134.8214
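The proteomic feature value calculation illustrated with FIG. 6 can be sketched as a weighted sum over an interaction graph. In this sketch the gene names G1 to G7 are hypothetical, the off-target weights of the four isolated genes are placeholders (the text only gives 0.9, 0.8 and 0.6 for the interacting genes), and interactions are only counted when both partners are selected off-target genes, which is an assumption consistent with the figure description.

```python
def proteomic_feature_value(off_target_weights: dict, interactions: dict) -> float:
    """Sum over off-target genes of (off-target weight x sum of the weights of
    the interactions the gene takes part in). Isolated genes contribute 0."""
    total = 0.0
    for gene, weight in off_target_weights.items():
        linked = sum(
            score
            for pair, score in interactions.items()
            if gene in pair and all(partner in off_target_weights for partner in pair)
        )
        total += weight * linked
    return total


# Seven off-target genes; only G1, G2 and G3 interact, as in the FIG. 6 example.
weights = {"G1": 0.9, "G2": 0.8, "G3": 0.6, "G4": 0.5, "G5": 0.4, "G6": 0.3, "G7": 0.2}
interactions = {("G1", "G2"): 280, ("G1", "G3"): 160}

# Evaluates 0.9 x (280 + 160) + 0.8 x 280 + 0.6 x 160
print(proteomic_feature_value(weights, interactions))
```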

(2) Calculating a signal path set eigenvalue based on the signal path weights and off-target weights of all selected off-target genes

The human pathway set database ConsensusPathDB-human (version 31) was localized. The off-target weight of each off-target gene is multiplied by the number of pathways in which that gene participates, and the sum of these products is taken as the signaling pathway set feature value; if an off-target gene is isolated (participates in no pathway), its effect is ignored. For example, if three off-target genes A, B and C are identified, and the database shows that A participates in 3 known pathways, B participates in 2 known pathways and C is isolated, then the signaling pathway set feature value is (off-target weight of A × 3) + (off-target weight of B × 2). The signaling pathway set feature values calculated for the off-target genes of each siRNA are shown in Table 5.

Table 5: calculation results of Signal pathway group characteristic values of off-target genes of siRNA

Name of siRNA	Signaling pathway set feature value
siRNA4	653.2424
siRNA5	585.7767
siRNA6	742.5516
siRNA7	372.6335
siRNA8	404.0694
siRNA9	416.7108
siRNA10	419.1286
siRNA11	318.9158
siRNA12	717.8643
siRNA13	476.5563
siRNA14	362.0600
siRNA15	368.6291
siRNA16	440.0923
siRNA17	551.1228
siRNA18	837.8167
siRNA19	258.3346
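The pathway-based feature described in this subsection reduces to a weighted count. In the sketch below the function name is hypothetical and the off-target weights assigned to genes A, B and C are placeholders chosen for illustration; only the pathway counts (3, 2 and 0) come from the example in the text.

```python
def pathway_feature_value(off_target_weights: dict, pathway_counts: dict) -> float:
    """Sum over off-target genes of (off-target weight x number of pathways the
    gene participates in). Genes absent from the pathway table count as 0."""
    return sum(
        weight * pathway_counts.get(gene, 0)
        for gene, weight in off_target_weights.items()
    )


# A participates in 3 pathways, B in 2, C in none; the weights are hypothetical.
weights = {"A": 0.5, "B": 0.4, "C": 0.7}
pathway_counts = {"A": 3, "B": 2}

# Evaluates (0.5 x 3) + (0.4 x 2); C is ignored.
print(pathway_feature_value(weights, pathway_counts))
```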

(3) Calculating core genome feature values based on core gene weights and off-target weights of all selected off-target genes

More than 1500 core genes have been identified to date. For example, if 4 off-target genes A', B', C' and D' are identified, of which B' and C' are core genes according to the known core gene set, then the core genome feature value is the off-target weight of B' plus the off-target weight of C'. The core genome feature values calculated for the off-target genes of each siRNA are shown in Table 6.

Table 6: calculation results of core genome characteristic values of off-target genes of siRNA

Name of siRNA	Core genome feature value
siRNA4	7.6147
siRNA5	6.6085
siRNA6	8.0534
siRNA7	5.7126
siRNA8	8.5514
siRNA9	5.3999
siRNA10	5.8094
siRNA11	4.6920
siRNA12	7.3661
siRNA13	5.4217
siRNA14	5.7174
siRNA15	4.9374
siRNA16	6.1381
siRNA17	4.1732
siRNA18	7.2726
siRNA19	4.0797
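The core genome feature described in this subsection is the sum of the off-target weights of those off-target genes that belong to the core gene set. A minimal sketch, with a hypothetical function name and placeholder weights for the genes A' to D' of the example:

```python
def core_genome_feature_value(off_target_weights: dict, core_genes: set) -> float:
    """Sum of the off-target weights of the off-target genes that are core genes."""
    return sum(
        weight for gene, weight in off_target_weights.items() if gene in core_genes
    )


# B' and C' are core genes; the weights are hypothetical.
weights = {"A'": 0.2, "B'": 0.3, "C'": 0.4, "D'": 0.1}
core_genes = {"B'", "C'"}

# Evaluates 0.3 + 0.4
print(core_genome_feature_value(weights, core_genes))
```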

Wherein the output value of any one siRNA in the 16 siRNAs is obtained by:

A549 cells were transfected with each of the above 16 siRNAs (siRNA4 in Table 1 and the 15 mismatch-sequence siRNAs siRNA5-siRNA19 in Table 2), and a blank group (untransfected) and a negative control group (transfected with a random-sequence siRNA synthesized by Invitrogen, i.e., an siRNA not directed against the MGMT gene) were set up. After culturing for 48 hours under the above culture conditions, 10 μL of CCK-8 solution was added to each well and the culture plates were incubated in an incubator for 0.5 to 1 hour. The absorbance at 450 nm was measured with a microplate reader to collect OD450 data, and the ratio of the OD450 value of each experimental group to that of the blank group was calculated to obtain the cell survival index of each group. The results are shown in FIG. 7.

Comparing FIG. 7 with FIG. 3 shows that the survival index of cells transfected with each siRNA has no significant correlation with the mRNA expression level of MGMT, and no clear relationship with the positions and number of mismatches in the siRNA sequence, which indicates that the differences in cell survival index are caused by off-target effects of the siRNAs. Beyond the off-target genes themselves, an siRNA also affects other genes, because each off-target gene has complex network interactions at the level of the transcriptome, the proteome, the pathway set and so on.

C. Establishing machine learning model through machine learning algorithm

(1) Establishing machine learning model through machine learning algorithm ANN

As described above, a proteomic feature value, a signaling pathway set feature value and a core genome feature value are obtained for each specific siRNA, and these data are normalized before being used as input values of the machine learning algorithm. Each value is mapped into the interval 0-1 by the (value - minimum)/(maximum - minimum) method. The normalized proteomic, signaling pathway set and core genome feature values are shown in Table 7.

Table 7: normalized proteome, signal pathway set and core genome feature results
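The (value - minimum)/(maximum - minimum) mapping can be sketched as follows, applied here to the proteomic feature values of Table 4 in the order siRNA4 to siRNA19. The function name is hypothetical, and the error raised when all values are equal is an added safeguard, since the text does not cover that case.

```python
def min_max_normalize(values):
    """Map each value into the interval 0-1 via (value - min) / (max - min)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        raise ValueError("all values are equal; min-max scaling is undefined")
    return [(value - lowest) / (highest - lowest) for value in values]


# Proteomic feature values from Table 4, siRNA4 through siRNA19.
proteomic = [
    150.0129, 135.2095, 182.5355, 97.8546, 102.6913, 88.8456, 106.3368, 65.3091,
    141.7128, 101.5539, 82.2402, 107.9213, 107.9795, 122.7014, 182.3832, 134.8214,
]
normalized = min_max_normalize(proteomic)
for name, value in zip(range(4, 20), normalized):
    print(f"siRNA{name}\t{value:.4f}")
```

The smallest value (siRNA11) maps to 0 and the largest (siRNA6) maps to 1; the same function would be applied separately to the pathway and core genome columns.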

The output data of the machine learning algorithm, i.e., the survival index of cells in the presence of siRNA, are binarized before use (for example, taking a survival index of 0.9 as the boundary value, values greater than or equal to 0.9 are set to 1 and the rest are set to 0). The binarized cell survival indices are shown in Table 8.

Table 8: binary processed cell survival index results

Name of siRNA	Binarized cell survival index
siRNA4	1
siRNA5	1
siRNA6	1
siRNA7	1
siRNA8	1
siRNA9	0
siRNA10	0
siRNA11	1
siRNA12	0
siRNA13	0
siRNA14	0
siRNA15	0
siRNA16	0
siRNA17	0
siRNA18	0
siRNA19	0
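How a binarized value of this kind is derived from raw plate readings can be sketched in two steps: the survival index is the OD450 ratio to the blank group, and the index is then thresholded at 0.9. The function names and the OD450 readings below are hypothetical; the document reports its measured indices only graphically in FIG. 7.

```python
def survival_index(od450_sample: float, od450_blank: float) -> float:
    """Cell survival index: OD450 of the experimental group / OD450 of the blank."""
    if od450_blank <= 0:
        raise ValueError("blank OD450 must be positive")
    return od450_sample / od450_blank


def binarize(index: float, threshold: float = 0.9) -> int:
    """1 (no cytotoxicity) if index >= threshold, otherwise 0 (cytotoxicity)."""
    return 1 if index >= threshold else 0


# Hypothetical OD450 readings, for illustration only.
blank = 1.20
for label, reading in [("sample_1", 1.14), ("sample_2", 0.60)]:
    index = survival_index(reading, blank)
    print(label, round(index, 3), binarize(index))
```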

The normalized proteomic, signaling pathway set and core genome feature values were used as input values, and the binarized cell survival index as the output value, of an artificial neural network (ANN) algorithm, using the library function neuralnet of R. Its main adjustable parameter is hidden, which can be set to 1 in the preferred embodiment.

The model was evaluated by 8-fold cross-validation: the data set was divided into 8 parts, and in turn 7 parts were used for training and 1 part for validation; the mean of the 8 results was taken as the estimate of the accuracy of the algorithm. The accuracy of the algorithm reached 56.25%.

(2) Establishing machine learning model through machine learning algorithm SVM

As described above, a proteomic feature value, a signaling pathway set feature value and a core genome feature value are obtained for each specific siRNA, and these data are normalized before being used as input values of the machine learning algorithm. Each value is mapped into the interval 0-1 by the (value - minimum)/(maximum - minimum) method. The results are shown in Table 7.

The output data of the machine learning algorithm, i.e., the survival index of cells in the presence of siRNA, are binarized before use (for example, taking a survival index of 0.9 as the boundary value, values greater than or equal to 0.9 are set to 1 and the rest are set to 0). The results are shown in Table 8.

The normalized proteomic, signaling pathway set and core genome feature values were used as input values, and the binarized cell survival index as the output value, of a support vector machine (SVM) algorithm, using the library function svm of R. Its main adjustable parameter is kernel, which is preferably set to linear.

The model was evaluated by 8-fold cross-validation: the data set was divided into 8 parts, and in turn 7 parts were used for training and 1 part for validation; the mean of the 8 results was taken as the estimate of the accuracy of the algorithm. The accuracy of the algorithm reached 62.5%.

In the present embodiment, 16 siRNAs (i.e., n = 16) are used; it is understood that the accuracy of the algorithm can be further improved when the sample size of siRNAs is increased.
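The 8-fold cross-validation procedure used for both algorithms can be sketched independently of the classifier. The sketch below is dependency-free Python rather than the R functions named above; the function names are hypothetical, folds are taken as consecutive blocks (the text does not say how samples were assigned to folds), and a majority-class classifier stands in for the ANN or SVM purely as a placeholder.

```python
from collections import Counter


def k_fold_accuracy(features, labels, k, fit, predict):
    """Split the data into k equal consecutive folds; train on k-1 folds,
    validate on the held-out fold, and return the mean of the k accuracies."""
    n = len(labels)
    if n % k != 0:
        raise ValueError("this sketch assumes equally sized folds")
    fold_size = n // k
    accuracies = []
    for fold in range(k):
        held_out = range(fold * fold_size, (fold + 1) * fold_size)
        train_x = [features[i] for i in range(n) if i not in held_out]
        train_y = [labels[i] for i in range(n) if i not in held_out]
        model = fit(train_x, train_y)
        correct = sum(predict(model, features[i]) == labels[i] for i in held_out)
        accuracies.append(correct / fold_size)
    return sum(accuracies) / k


def fit_majority(train_x, train_y):
    """Placeholder model: remember the most frequent training label."""
    return Counter(train_y).most_common(1)[0][0]


def predict_majority(model, sample):
    """Placeholder prediction: always return the remembered label."""
    return model


# Binarized survival indices from Table 8 (siRNA4 to siRNA19); features are
# not used by the placeholder classifier, so dummy values are passed.
labels = [1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
features = [None] * len(labels)
print(k_fold_accuracy(features, labels, 8, fit_majority, predict_majority))
```

In practice `fit` and `predict` would wrap the trained ANN or SVM and `features` would hold the three normalized feature values per siRNA; the placeholder also gives a simple baseline against which reported accuracies can be read.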

Example 2: predicting the toxicity of siRNA to human respiratory system cells using a machine learning model

As an example, the machine learning model obtained in Example 1 (specifically, the model established with the SVM machine learning algorithm) was used to predict the toxicity of the above 16 siRNAs to human respiratory system cells. The results are shown in Table 9, which lists the values obtained by experiment (binarized experimental values, i.e., the binarized cell survival indices shown in Table 8) alongside the values predicted by the machine learning model (predicted values); the predicted values that differ from the experimental values are those for siRNA4, siRNA5, siRNA7 and siRNA11. The values in Table 9 have the following meaning: with a cell survival index of 0.9 as the boundary value, 0.9 or above is recorded as 1 and below 0.9 as 0, i.e., 1 indicates no cytotoxicity and 0 indicates cytotoxicity.

Table 9: effect of siRNA on toxicity to human respiratory cells

Name of siRNA	Binarized experimental value	Predicted value
siRNA4	1	0
siRNA5	1	0
siRNA6	1	1
siRNA7	1	0
siRNA8	1	1
siRNA9	0	0
siRNA10	0	0
siRNA11	1	0
siRNA12	0	0
siRNA13	0	0
siRNA14	0	0
siRNA15	0	0
siRNA16	0	0
siRNA17	0	0
siRNA18	0	0
siRNA19	0	0

As can be seen from the results in Table 9, the model established by the method of the present invention predicts the relatively cytotoxic siRNAs with high accuracy. In practical application, siRNAs with a predicted value of 1 (no cytotoxicity) can be selected as drug candidates.
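The agreement between the two columns of Table 9 can be tallied directly from the table. The sketch below is only a reading aid; the function names are hypothetical, and the per-class figure it computes (the fraction of siRNAs of a given experimental class that the model assigns to that class) is a summary chosen here, not a metric named in the text.

```python
def agreement(experimental, predicted):
    """Fraction of siRNAs whose predicted value equals the experimental value."""
    matches = sum(e == p for e, p in zip(experimental, predicted))
    return matches / len(experimental)


def recall_for_class(experimental, predicted, label):
    """Among siRNAs whose experimental value is `label`, the fraction predicted as `label`."""
    relevant = [p for e, p in zip(experimental, predicted) if e == label]
    return sum(p == label for p in relevant) / len(relevant)


# Columns of Table 9, siRNA4 through siRNA19.
experimental = [1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
predicted = [0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print("overall agreement:", agreement(experimental, predicted))
print("cytotoxic (0) recovered:", recall_for_class(experimental, predicted, 0))
print("non-cytotoxic (1) recovered:", recall_for_class(experimental, predicted, 1))
```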

SEQUENCE LISTING

<110> Hangzhou Kangwanda technology and technology of medicine

<120> establishment method and application of machine learning model for predicting toxicity of siRNA to certain cell

<130> FI-162414-59:52/C

<160> 38

<170> PatentIn version 3.5

<210> 1

<211> 21

<212> DNA

<213> Artificial sequence

<400> 1

ggaagccuau uuccgugaat t 21

<210> 2

<211> 21

<212> DNA

<213> Artificial sequence

<400> 2

uucacggaaa uaggcuucct t 21

<210> 3

<211> 21

<212> DNA

<213> Artificial sequence

<400> 3

gacaaggauu gugaaaugat t 21

<210> 4

<211> 21

<212> DNA

<213> Artificial sequence

<400> 4

ucauuucaca auccuuguct t 21

<210> 5

<211> 21

<212> DNA

<213> Artificial sequence

<400> 5

auggcuucug gcccaugaat t 21

<210> 6

<211> 21

<212> DNA

<213> Artificial sequence

<400> 6

uucaugggcc agaagccaut t 21

<210> 7

<211> 21

<212> DNA

<213> Artificial sequence

<400> 7

ccagacaggu guuauggaat t 21

<210> 8

<211> 21

<212> DNA

<213> Artificial sequence

<400> 8

uuccauaaca ccugucuggt t 21

<210> 9

<211> 21

<212> DNA

<213> Artificial sequence

<400> 9

ccagacaggu guuauggaut t 21

<210> 10

<211> 21

<212> DNA

<213> Artificial sequence

<400> 10

auccauaaca ccugucuggt t 21

<210> 11

<211> 21

<212> DNA

<213> Artificial sequence

<400> 11

ccagacaggu guuaugguut t 21

<210> 12

<211> 21

<212> DNA

<213> Artificial sequence

<400> 12

aaccauaaca ccugucuggt t 21

<210> 13

<211> 21

<212> DNA

<213> Artificial sequence

<400> 13

ccagacaggu guuaugcuut t 21

<210> 14

<211> 21

<212> DNA

<213> Artificial sequence

<400> 14

aagcauaaca ccugucuggt t 21

<210> 15

<211> 21

<212> DNA

<213> Artificial sequence

<400> 15

ccagacaggu guuauccuut t 21

<210> 16

<211> 21

<212> DNA

<213> Artificial sequence

<400> 16

aaggauaaca ccugucuggt t 21

<210> 17

<211> 21

<212> DNA

<213> Artificial sequence

<400> 17

ccagacaggu guuaaccuut t 21

<210> 18

<211> 21

<212> DNA

<213> Artificial sequence

<400> 18

aagguuaaca ccugucuggt t 21

<210> 19

<211> 21

<212> DNA

<213> Artificial sequence

<400> 19

ccagacaggu guuuaccuut t 21

<210> 20

<211> 21

<212> DNA

<213> Artificial sequence

<400> 20

aagguaaaca ccugucuggt t 21

<210> 21

<211> 21

<212> DNA

<213> Artificial sequence

<400> 21

ccagacaggu guauaccuut t 21

<210> 22

<211> 21

<212> DNA

<213> Artificial sequence

<400> 22

aagguauaca ccugucuggt t 21

<210> 23

<211> 21

<212> DNA

<213> Artificial sequence

<400> 23

gcagacaggu guuauggaat t 21

<210> 24

<211> 21

<212> DNA

<213> Artificial sequence

<400> 24

uuccauaaca ccugucugct t 21

<210> 25

<211> 21

<212> DNA

<213> Artificial sequence

<400> 25

ggagacaggu guuauggaat t 21

<210> 26

<211> 21

<212> DNA

<213> Artificial sequence

<400> 26

uuccauaaca ccugucucct t 21

<210> 27

<211> 21

<212> DNA

<213> Artificial sequence

<400> 27

ggugacaggu guuauggaat t 21

<210> 28

<211> 21

<212> DNA

<213> Artificial sequence

<400> 28

uuccauaaca ccugucacct t 21

<210> 29

<211> 21

<212> DNA

<213> Artificial sequence

<400> 29

ggucacaggu guuauggaat t 21

<210> 30

<211> 21

<212> DNA

<213> Artificial sequence

<400> 30

uuccauaaca ccugugacct t 21

<210> 31

<211> 21

<212> DNA

<213> Artificial sequence

<400> 31

ggucucaggu guuauggaat t 21

<210> 32

<211> 21

<212> DNA

<213> Artificial sequence

<400> 32

uuccauaaca ccugagacct t 21

<210> 33

<211> 21

<212> DNA

<213> Artificial sequence

<400> 33

ggucugaggu guuauggaat t 21

<210> 34

<211> 21

<212> DNA

<213> Artificial sequence

<400> 34

uuccauaaca ccucagacct t 21

<210> 35

<211> 21

<212> DNA

<213> Artificial sequence

<400> 35

ggucuguggu guuauggaat t 21

<210> 36

<211> 21

<212> DNA

<213> Artificial sequence

<400> 36

uuccauaaca ccacagacct t 21

<210> 37

<211> 21

<212> DNA

<213> Artificial sequence

<400> 37

ccagacagca cuuauggaat t 21

<210> 38

<211> 21

<212> DNA

<213> Artificial sequence

<400> 38

uuccauaagu gcugucuggt t 21

Claims

1. A method of establishing a machine learning model for predicting the toxicity of siRNA against a cell type, comprising the steps of:

wherein the input value of any one siRNA in the n siRNAs is obtained by:

ii) for the region of each selected off-target gene mRNA that is complementary to the siRNA sequence, independently obtaining the off-target weight of that off-target gene according to the characteristics of the mismatched bases and the secondary structure characteristics of the off-target gene mRNA sequence; wherein the secondary structure characteristic of the off-target gene mRNA sequence is the probability that the mRNA itself does not form secondary structure in the complementary region;

iv) calculating a feature value for each omics based on the respective omics weights and the off-target weights of all selected off-target genes,

and taking the feature values as the input value; and the output value of that siRNA is obtained by:

C) calculating all the input values and the output values of the n siRNAs by a machine learning algorithm so as to establish a machine learning model;

wherein the cell of the certain type is a human cell or other mammalian cell and is capable of being contacted or infected by a virus.

2. The method of claim 1, wherein the characteristic of the mismatched bases comprises the number of mismatched bases.

3. The method of claim 2, wherein the characteristic of the mismatched base further comprises the position of the mismatched base.

4. The method according to claim 1, wherein for each selected off-target gene, the interference rate of the siRNA on the mRNA expression level of the off-target gene is calculated based on the characteristics of the mismatched bases, and the product of the interference rate and the probability of not forming secondary structure is calculated, thereby obtaining the off-target weight of the off-target gene.

5. The method of claim 1, wherein the probability of not forming secondary structure of each off-target gene mRNA is predicted using software selected from the group consisting of: RNAplfold, mfold or RNAstructure.

6. The method of claim 1 wherein the omics eigenvalues comprise at least one of proteomic eigenvalues, signal path set eigenvalues, and core genome eigenvalues, and the proteomic eigenvalues, the signal path set eigenvalues, and the core genome eigenvalues are each calculated according to the following a) through c), respectively:

7. The method of claim 1, wherein the input values are normalized prior to building the machine learning model.

8. The method of claim 1, wherein the machine learning algorithm comprises: a support vector machine, an artificial neural network, a decision tree, or a regression model.

9. The method of claim 1, wherein in step i), the selected off-target genes do not include off-target genes that: the complementary region of its mRNA to the siRNA sequence is located only in its 5' UTR.

10. The method of claim 1, wherein in step i), the off-target genes selected do not include genes that are not expressed in the certain type of cells under normal conditions.

11. Use of the method of any one of claims 1-10 for predicting the toxicity of siRNA against a cell type, wherein the cell type is a human cell or other mammalian cell and is capable of being contacted or infected by a virus.

12. A computer readable medium usable for building the machine learning model according to the method of any one of claims 1-10, and comprising the following modules:

a sequence alignment module for performing step i) of the method of any one of claims 1-10;

an off-target weight calculation module for performing step ii) of the method of any one of claims 1-10;

an omics annotation module for performing step iii) of the method of any one of claims 1-10;

an omic eigenvalue calculation module for performing step iv) of the method of any of claims 1-10;

a machine learning algorithm computation module for performing step C) of the method of any one of claims 1-10.

13. An apparatus for predicting the toxicity of an siRNA to a certain type of cell, comprising:

2) a storage unit for storing a machine learning model established for the type of cells using the method of any one of claims 1-10;

4) an output unit for displaying the prediction result of the toxicity of the siRNA to the cells.

14. A method of predicting the toxicity of an siRNA to a certain type of cell, comprising:

providing the sequence of the siRNA to be tested;

inputting the sequence of the siRNA into the apparatus of claim 13, and having the apparatus execute the machine learning model established for the cell type using the method of any one of claims 1 to 10, thereby obtaining a prediction result of the toxicity of the siRNA to the cell type;