CN119361020B - Method, device and computer equipment for generating antimicrobial peptides based on generative model - Google Patents

Method, device and computer equipment for generating antimicrobial peptides based on generative model

Info

Publication number
CN119361020B
CN119361020B · CN119361020A · Application CN202411908673.7A
Authority
CN
China
Prior art keywords
sequence
antimicrobial peptide
antimicrobial
trained
predicted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411908673.7A
Other languages
Chinese (zh)
Other versions
CN119361020A (en)
Inventor
熊曦妍
林峰
马萧
黄行许
金舒文
曾梓菡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Shanghai AI Innovation Center
Original Assignee
Zhejiang Lab
Shanghai AI Innovation Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab, Shanghai AI Innovation Center
Priority to CN202411908673.7A
Publication of CN119361020A
Application granted
Publication of CN119361020B
Status: Active
Anticipated expiration

Abstract

Translated from Chinese


The present application relates to a method, apparatus and computer equipment for generating antimicrobial peptides based on a generative model, and involves the fields of artificial intelligence and bioinformatics. The method comprises obtaining a set of existing antimicrobial peptide sequences; using a pre-trained generative model to predict novel antimicrobial peptides based on the existing antimicrobial peptide sequences to obtain predicted sequences; inputting the predicted sequences into a pre-trained antimicrobial peptide classifier to determine whether the predicted sequences have antimicrobial activity; the antimicrobial peptide classifier is trained based on an antimicrobial peptide label set and a non-antimicrobial peptide sequence label set; and outputting the predicted sequences having antimicrobial activity as the final target antimicrobial peptide sequence. The method solves the problem of low efficiency in antimicrobial peptide design and generation, and can achieve de novo generation and determination of antimicrobial peptides with the help of a generative model and classifier, thereby improving generation efficiency and accuracy.

Description

Method, device and computer equipment for generating antimicrobial peptides based on a generative model
Technical Field
The present application relates to the technical field of artificial intelligence and bioinformatics, and in particular to a method, device and computer equipment for generating antibacterial peptides based on a generative model.
Background
An antibacterial peptide (Antimicrobial Peptide, AMP) is a short peptide molecule composed of amino acids; owing to its broad-spectrum antibacterial property, it is widely used in anti-infection treatment, food preservation and biomedicine. Existing antibacterial peptide databases contain natural antibacterial peptides and artificially synthesized antibacterial peptides: natural antibacterial peptides are metabolites with antibacterial properties produced by organisms, while artificially synthesized antibacterial peptides are peptides produced with existing synthesis technologies that have the same biochemical properties as natural antibacterial peptides. Because the number of natural antibacterial peptides is limited, researchers have sought to develop a wider variety of antibacterial peptides by artificial synthesis.
However, traditional antibacterial peptide design mostly relies on expert experience and manual trial and error, which leads to long development cycles, high cost and unsatisfactory success rates.
No effective solution has yet been proposed for the problem of low efficiency in antibacterial peptide design and generation in the related art.
Disclosure of Invention
In this embodiment, a method, a device and computer equipment for generating an antibacterial peptide based on a generative model are provided, so as to solve the problem of low antibacterial peptide design efficiency in the related art.
In a first aspect, in this embodiment, there is provided a method for generating an antimicrobial peptide based on a generative model, the method comprising:
acquiring a collection of existing antibacterial peptide sequences;
predicting a novel antibacterial peptide on the basis of the existing antibacterial peptide sequence by utilizing a pre-trained generative model to obtain a predicted sequence;
Inputting the predicted sequence into a pre-trained antibacterial peptide classifier, and judging whether the predicted sequence has antibacterial activity or not, wherein the antibacterial peptide classifier is obtained by training based on an antibacterial peptide tag set and a non-antibacterial peptide sequence tag set;
outputting the predicted sequence with antibacterial activity as a final target antibacterial peptide sequence.
In some embodiments thereof, obtaining a collection of existing antimicrobial peptide sequences comprises:
extracting initial antimicrobial peptide data from a public database;
And cleaning the initial antibacterial peptide data to obtain a collection of the existing antibacterial peptide sequences.
In some of these embodiments, predicting a novel antimicrobial peptide based on the existing antimicrobial peptide sequence using a pre-trained generative model to obtain a predicted sequence comprises:
expanding the set of the existing antibacterial peptide sequences based on a multi-sequence alignment technology to obtain an antibacterial peptide multi-sequence alignment data set;
inputting the antimicrobial peptide multiple sequence alignment dataset into a pre-trained generative model, and generating a predicted sequence based on an arbitrary decoding order.
In some of these embodiments, the pre-trained generative model employs an order-agnostic autoregressive diffusion model.
In some of these embodiments, inputting the predicted sequence into a pre-trained antimicrobial peptide classifier, determining whether the predicted sequence has antimicrobial activity, comprises:
screening the physicochemical properties of the predicted sequence to obtain a screened sequence;
inputting the screened sequence into a pre-trained antibacterial peptide classifier, and outputting a conclusion whether the screened sequence has antibacterial activity.
In some of these embodiments, the method further comprises:
training a binary classification model based on the obtained antibacterial peptide tag set and the obtained non-antibacterial peptide sequence tag set;
and, in the training process of the binary classification model, adjusting parameters of the binary classification model based on ten-fold cross-validation results to obtain the pre-trained antibacterial peptide classifier.
In some of these embodiments, training a classification model based on the obtained set of antibacterial peptide tags and the set of non-antibacterial peptide sequence tags comprises:
Extracting features of the antibacterial peptide tag set and the non-antibacterial peptide sequence tag set to obtain antibacterial peptide features and non-antibacterial peptide features;
Inputting the antibacterial peptide characteristics and the non-antibacterial peptide characteristics into the classification model, and training the classification model.
In a second aspect, in this embodiment, there is provided an antimicrobial peptide generating apparatus based on a generative model, the apparatus comprising:
the existing data acquisition module is used for acquiring a collection of existing antibacterial peptide sequences;
the new sequence prediction module is used for obtaining, by means of a pre-trained generative model, a predicted sequence similar in structure to the existing antibacterial peptide sequences;
The activity verification module is used for inputting the predicted sequence into a pre-trained antibacterial peptide classifier and judging whether the predicted sequence has antibacterial activity or not, wherein the antibacterial peptide classifier is obtained by training based on an antibacterial peptide tag set and a non-antibacterial peptide sequence tag set;
And the target antibacterial peptide output module is used for outputting the predicted sequence with antibacterial activity as a final target antibacterial peptide sequence.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to realize the antibacterial peptide generation method based on the generation model in the first aspect.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the antimicrobial peptide generation method based on the generative model of the first aspect.
Compared with the related art, the antibacterial peptide generation method, device and computer equipment based on the generative model provided in this embodiment solve the problem of low antibacterial peptide design and generation efficiency by acquiring a set of existing antibacterial peptide sequences, obtaining a predicted sequence with the pre-trained generative model, inputting the predicted sequence into a pre-trained antibacterial peptide classifier (trained based on an antibacterial peptide tag set and a non-antibacterial peptide sequence tag set) to judge whether it has antibacterial activity, and outputting the predicted sequence with antibacterial activity as the final target antibacterial peptide sequence. De novo generation and evaluation of antibacterial peptides are thereby realized by means of the generative model and the classifier, improving generation efficiency and accuracy.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the other features, objects, and advantages of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a block diagram showing the hardware configuration of a terminal of an antimicrobial peptide generation method based on a generative model in one embodiment;
FIG. 2 is a flow diagram of an antimicrobial peptide generation method based on a generative model in one embodiment;
FIG. 3 is a schematic diagram of the performance of an antimicrobial peptide classifier in ten-fold cross-validation in one embodiment;
FIG. 4 is a flow chart of a method of generating antimicrobial peptides based on a generative model in a preferred embodiment;
fig. 5 is a block diagram showing the structure of an antimicrobial peptide generation device based on a generative model in one embodiment.
Detailed Description
The present application will be described and illustrated with reference to the accompanying drawings and examples for a clearer understanding of the objects, technical solutions and advantages of the present application.
Unless defined otherwise, technical or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terms "a," "an," "the," "these" and similar terms in this application are not intended to be limiting in number, but may be singular or plural. The terms "comprises," "comprising," "includes," "including," "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, and system, article, or apparatus that comprises a list of steps or modules (units) is not limited to the list of steps or modules (units), but may include other steps or modules (units) not listed or inherent to such process, method, article, or apparatus. The terms "connected," "coupled," and the like in this disclosure are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as used herein means two or more. "and/or" describes the association relationship of the association object, and indicates that three relationships may exist, for example, "a and/or B" may indicate that a exists alone, a and B exist simultaneously, and B exists alone. Typically, the character "/" indicates that the associated object is an "or" relationship. The terms "first," "second," "third," and the like, as referred to in this disclosure, merely distinguish similar objects and do not represent a particular ordering for objects.
The method embodiments provided in the present embodiment may be executed in a terminal, a computer, or similar computing device. For example, the method runs on a terminal, and fig. 1 is a block diagram of the hardware structure of the terminal based on the antimicrobial peptide generating method of the generating model in this embodiment. As shown in fig. 1, the terminal may include one or more (only one is shown in fig. 1) processors 102 and a memory 104 for storing data, wherein the processors 102 may include, but are not limited to, a microprocessor MCU, a programmable logic device FPGA, or the like. The terminal may also include a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those skilled in the art that the structure shown in fig. 1 is merely illustrative and is not intended to limit the structure of the terminal. For example, the terminal may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to the antimicrobial peptide generating method based on the generation model in the present embodiment, and the processor 102 executes the computer program stored in the memory 104 to perform various functional applications and data processing, that is, to implement the above-described method. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. The network includes a wireless network provided by a communication provider of the terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as a NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.
In this embodiment, a method for generating an antimicrobial peptide based on a generative model is provided. Fig. 2 is a flowchart of the method for generating an antimicrobial peptide based on a generative model of this embodiment, and as shown in fig. 2, the flow includes the following steps:
Step S210, obtaining a collection of existing antibacterial peptide sequences.
Specifically, in practical application, the method of acquiring the existing antibacterial peptide sequence according to the embodiment of the present application includes, but is not limited to, acquiring antibacterial peptide data with antibacterial activity from a pre-stored database or crawling antibacterial peptide data from a literature report of a network platform, and performing operations such as merging, deduplication, filtering and the like on the obtained antibacterial peptide data sequence to obtain a set of the existing antibacterial peptide sequence.
Step S220, predicting a novel antibacterial peptide based on the existing antibacterial peptide sequence by using a pre-trained generative model to obtain a predicted sequence.
Specifically, based on multiple sequence alignment, the existing antibacterial peptide sequence set is expanded to obtain an antibacterial peptide multiple sequence alignment dataset, which is input into a pre-trained diffusion model to generate a predicted sequence based on an arbitrary decoding order. In other embodiments, step S220 may also be implemented by training a large pre-trained language model (e.g., the GPT series) on a corpus that includes existing antibacterial peptide sequence information and annotation information such as characteristics, functions and similarities, so that the model learns the syntax, semantics and latent features of antibacterial peptide sequences to obtain a predicted sequence. In other embodiments, candidate antibacterial peptide sequences may be generated by a Long Short-Term Memory network (LSTM), and the candidate sequences may be input to a Transformer for decoding and optimization to obtain an extended antibacterial peptide sequence, i.e., the predicted sequence.
And step S230, inputting the predicted sequence into a pre-trained antibacterial peptide classifier, and judging whether the predicted sequence has antibacterial activity, wherein the antibacterial peptide classifier is obtained by training based on an antibacterial peptide tag set and a non-antibacterial peptide sequence tag set.
Specifically, the antibacterial peptide classifier includes, but is not limited to, learning characteristics of the antibacterial peptide by using a model such as a support vector machine, a neural network, a decision tree, and the like. The antibacterial peptide tag set can be obtained based on the existing antibacterial peptide sequence set, and the non-antibacterial peptide sequence tag set comprises various protein sequences which do not belong to antibacterial peptides.
And step S240, outputting the predicted sequence with antibacterial activity as a final target antibacterial peptide sequence.
Specifically, the target antibacterial peptide sequence can be further processed or analyzed and then applied to different scenarios according to the processing or analysis result. For example, the tumor-inhibiting performance of the target antibacterial peptide sequence can be analyzed for applications in the medical field, or its bactericidal effect can be analyzed and sequences meeting bactericidal standards can be used as preservatives in the food safety field; the method can also be applied to scientific research, agriculture and other fields.
In this embodiment, by acquiring a set of existing antibacterial peptide sequences, obtaining a predicted sequence structurally similar to the existing antibacterial peptide sequences with a pre-trained generative model, inputting the predicted sequence into a pre-trained antibacterial peptide classifier (trained based on an antibacterial peptide tag set and a non-antibacterial peptide sequence tag set) to judge whether it has antibacterial activity, and outputting the predicted sequence with antibacterial activity as the final target antibacterial peptide sequence, the problem of low antibacterial peptide design and generation efficiency is solved; de novo generation and evaluation of antibacterial peptides are realized with the generative model and the classifier, significantly improving antibacterial peptide design efficiency and success rate.
In some embodiments thereof, step S210, obtaining a set of existing antimicrobial peptide sequences, comprises:
Step S211, extracting initial antimicrobial peptide data from the public database.
Step S212, cleaning the initial antibacterial peptide data to obtain a collection of the existing antibacterial peptide sequences.
Specifically, the collection of existing antimicrobial peptide sequences, i.e., the AMP dataset, is assembled from six common databases: APD (Antimicrobial Peptide Database), DADP (Collection of Antimicrobial Peptides), DBAASP (Database of Antimicrobial Activity and Structure of Peptides), DRAMP (Database of Research on Antimicrobial Peptides), YADAMP (Yet Another Database of Antimicrobial Peptides), and dbAMP (Database of Antimicrobial Peptides). The data extracted from these databases are combined, deduplicated, and filtered to remove incomplete or meaningless entries, improving the quality of the AMP dataset (i.e., the collection of existing antimicrobial peptide sequences).
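As a minimal illustration of this merging and cleaning step, the sketch below merges several FASTA exports, deduplicates sequences, and filters out entries with non-standard residues or implausible lengths. The file names, length bounds, and use of Biopython are assumptions made for illustration rather than details taken from the patent.

```python
# Minimal sketch of the AMP dataset cleaning step (assumed file names and thresholds).
from Bio import SeqIO  # pip install biopython

STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

def load_and_clean(fasta_paths, min_len=5, max_len=100):
    """Merge, deduplicate, and filter antimicrobial peptide sequences."""
    seen, cleaned = set(), []
    for path in fasta_paths:
        for record in SeqIO.parse(path, "fasta"):
            seq = str(record.seq).upper().strip()
            if not seq or seq in seen:                 # drop duplicates and empty entries
                continue
            if not set(seq) <= STANDARD_AA:            # drop sequences with non-standard residues
                continue
            if not (min_len <= len(seq) <= max_len):   # drop implausibly short or long entries
                continue
            seen.add(seq)
            cleaned.append(seq)
    return cleaned

if __name__ == "__main__":
    # Hypothetical exports from the public AMP databases listed above.
    amp_sequences = load_and_clean(["apd.fasta", "dramp.fasta", "dbamp.fasta"])
    print(f"{len(amp_sequences)} unique, cleaned AMP sequences")
```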
In some embodiments, step S220, obtaining a predicted sequence structurally similar to the existing antibacterial peptide sequences by using the pre-trained generative model, comprises:
Step S221, expanding the existing antibacterial peptide sequence set based on multiple sequence alignment to obtain an antibacterial peptide multiple sequence alignment dataset.
Specifically, multiple sequence alignment (MSA) is performed on the existing collection of antibacterial peptide sequences (i.e., the AMP dataset) to obtain an antibacterial peptide multiple sequence alignment dataset AMP-MSA carrying evolutionary information. In one embodiment, MSA sequences for AMP-positive (i.e., antimicrobially active) sequences are retrieved with a search tool to obtain the antibacterial peptide multiple sequence alignment dataset AMP-MSA.
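A possible way to load such an alignment and encode it as an integer matrix for the generative model is sketched below; the aligned-FASTA input format and the 21-symbol vocabulary (20 amino acids plus a gap token) are assumptions for illustration.

```python
# Minimal sketch: load an aligned FASTA MSA and encode it as an M x L integer matrix.
import numpy as np
from Bio import AlignIO  # pip install biopython

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"          # 20 amino acids + gap (assumed vocabulary)
TOKEN = {aa: i for i, aa in enumerate(ALPHABET)}

def encode_msa(path):
    """Return the MSA as an (M sequences x L columns) array of token ids."""
    alignment = AlignIO.read(path, "fasta")
    matrix = np.array(
        [[TOKEN.get(ch, TOKEN["-"]) for ch in str(rec.seq).upper()] for rec in alignment],
        dtype=np.int64,
    )
    return matrix

if __name__ == "__main__":
    msa = encode_msa("amp_family.afa")       # hypothetical aligned FASTA file
    print("MSA shape (M, L):", msa.shape)
```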
Step S222, inputting the antibacterial peptide multiple sequence alignment dataset into the pre-trained generative model, and generating a predicted sequence based on an arbitrary decoding order.
Specifically, the generation length is set (for example, 15-35 amino acids), the model is conditioned on the antibacterial peptide multiple sequence alignment dataset, and new sequences are predicted with the generative model.
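One plausible realization of arbitrary-order generation is sketched below: starting from a fully masked sequence of the chosen length, positions are revealed one at a time in a random order, each sampled from the model conditioned on the positions already filled. The `model` interface and the mask token id are assumptions; the patent does not prescribe a particular implementation.

```python
# Minimal sketch of arbitrary-decoding-order generation (assumed model interface).
import torch

def generate(model, length, vocab_size, mask_id, device="cpu"):
    """Fill a fully masked sequence one position at a time in a random order."""
    seq = torch.full((1, length), mask_id, dtype=torch.long, device=device)
    order = torch.randperm(length)                       # random decoding order sigma
    for pos in order:
        with torch.no_grad():
            logits = model(seq)                          # (1, length, vocab) tensor, assumed signature
        probs = torch.softmax(logits[0, pos, :vocab_size], dim=-1)
        seq[0, pos] = torch.multinomial(probs, 1).item()  # sample the amino acid at this position
    return seq[0]

# Usage (hypothetical): a length in the 15-35 range, decoded by the AMP-MSA-trained model.
# peptide_ids = generate(trained_oadm, length=20, vocab_size=20, mask_id=20)
```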
The pre-trained generative model employs an order-agnostic autoregressive diffusion model (OADM). The model is trained on evolutionary multiple sequence alignment (MSA) data, allows sequences to be generated in any order, and extends the generative capacity of traditional autoregressive models.
The OADM diffusion model is trained with the log-likelihood of the generated sequence as the loss function, the log-likelihood being expressed as an expectation over all possible decoding orders:

$$\log p(x) = \mathbb{E}_{\sigma \sim U(S_L)}\left[\sum_{t=1}^{L} \log p\left(x_{\sigma(t)} \mid x_{\sigma(<t)}\right)\right]$$

where $x_{\sigma(t)}$ denotes the amino acid generated at position $t$ of decoding order $\sigma$, $x_{\sigma(<t)}$ denotes all amino acids generated earlier in that order, $\log p(x)$ denotes the log-likelihood of generating sequence $x$, $\mathbb{E}_{\sigma \sim U(S_L)}$ denotes the expectation over all possible decoding orders, $L$ denotes the sequence length, and $S_L$ denotes the set of all possible decoding orders (permutations of the $L$ positions).
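A minimal sketch of how this objective could be estimated during training is given below: for each sequence a random decoding order σ and a random decoding step t are drawn, the not-yet-decoded positions are masked, and the model's cross-entropy on those positions is averaged; averaging over many sampled (σ, t) pairs approximates the expectation above. The masked-prediction interface is an assumption for illustration.

```python
# Minimal sketch of a stochastic estimate of the order-agnostic training objective
# (one random decoding order and one random decoding step per sequence).
import torch
import torch.nn.functional as F

def oadm_step_loss(model, x, mask_id):
    """x: (B, L) amino-acid token ids."""
    B, L = x.shape
    order = torch.argsort(torch.rand(B, L), dim=-1)     # random permutation sigma per sequence
    rank = torch.argsort(order, dim=-1)                 # rank of every position within sigma
    t = torch.randint(0, L, (B, 1))                     # decoding step: positions sigma(<t) are observed
    hidden = rank >= t                                   # positions not yet decoded under sigma
    corrupted = x.masked_fill(hidden, mask_id)
    logits = model(corrupted)                            # (B, L, vocab) tensor, assumed interface
    ce = F.cross_entropy(logits.transpose(1, 2), x, reduction="none")   # (B, L)
    # Average cross-entropy over the hidden positions of each sequence, then over the batch.
    per_seq = (ce * hidden).sum(dim=-1) / hidden.sum(dim=-1).clamp(min=1)
    return per_seq.mean()
```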
During training, the model processes the MSA matrix with alternating axial attention mechanisms, which reduce the row-attention computation complexity to O(ML²) and the column-attention complexity to O(LM²), where M is the number of sequences in the MSA and L is the sequence length, and shares structure across sequences through tied row attention.
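The sketch below shows a simplified alternating row/column attention block over an MSA embedding of shape (batch, M, L, d); it uses plain multi-head attention per axis rather than the tied row attention described above, and the dimensions are placeholders.

```python
# Minimal sketch of alternating axial (row/column) attention over an MSA embedding.
import torch
import torch.nn as nn

class AxialAttentionBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):                      # x: (B, M, L, D)
        B, M, L, D = x.shape
        rows = x.reshape(B * M, L, D)          # row attention: attend over the L positions of each sequence
        rows, _ = self.row_attn(rows, rows, rows)
        x = x + rows.reshape(B, M, L, D)
        cols = x.permute(0, 2, 1, 3).reshape(B * L, M, D)   # column attention: attend over the M sequences
        cols, _ = self.col_attn(cols, cols, cols)
        x = x + cols.reshape(B, L, M, D).permute(0, 2, 1, 3)
        return x

# Usage: AxialAttentionBlock()(torch.randn(2, 8, 30, 64)) returns a tensor of the same shape.
```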
The model is pre-trained with a masked language modeling (MLM) objective, whose loss function is expressed as:

$$\mathcal{L}_{\text{MLM}} = -\sum_{(m,i) \in \text{mask}} \log p\left(x_{m,i} \mid \tilde{X}; \theta\right)$$

where $(m, i)$ denotes a masked position (sequence $m$, column $i$), $p(x_{m,i} \mid \tilde{X}; \theta)$ is the probability of correctly predicting the masked amino acid at position $(m, i)$, $\tilde{X}$ is the masked MSA, $X$ is the MSA before masking (from which the target $x_{m,i}$ is taken), and $\theta$ denotes the parameters of the model.
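A minimal sketch of such an MLM pre-training loss on MSA tokens is shown below; the 15% masking rate and the model interface are assumptions for illustration.

```python
# Minimal sketch of the masked-language-modeling pre-training loss on MSA tokens.
import torch
import torch.nn.functional as F

def mlm_loss(model, tokens, mask_id, mask_prob=0.15):
    """tokens: (B, M, L) integer MSA; masked positions are replaced by mask_id."""
    mask = torch.rand_like(tokens, dtype=torch.float) < mask_prob
    corrupted = tokens.masked_fill(mask, mask_id)
    logits = model(corrupted)                               # (B, M, L, vocab) tensor, assumed interface
    targets = tokens.masked_fill(~mask, -100)               # ignore unmasked positions in the loss
    return F.cross_entropy(logits.flatten(0, 2), targets.flatten(), ignore_index=-100)
```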
In some of these embodiments, based on S230, inputting the predicted sequence into a pre-trained antimicrobial peptide classifier, determining whether the predicted sequence has antimicrobial activity, comprising:
based on S231, the predicted sequence is subjected to physicochemical property screening to obtain a screened sequence.
Based on S232, the screened sequence is input into a pre-trained antimicrobial peptide classifier, and a conclusion is output as to whether the screened sequence has antimicrobial activity.
Specifically, the screening conditions include isoelectric point, positive charge, hydrophobicity, and the like. Physicochemical screening is performed to evaluate whether the sequence generated by the generator has AMP-like (antimicrobial peptide) properties.
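A sketch of such a physicochemical screen is shown below, using Biopython's ProteinAnalysis for the isoelectric point and hydropathy and a simple residue count for net charge; the threshold values are assumptions, since the patent names the properties but not the cut-offs.

```python
# Minimal sketch of the physicochemical screen (assumed threshold values).
from Bio.SeqUtils.ProtParam import ProteinAnalysis  # pip install biopython

def passes_screen(seq, min_pi=8.0, min_net_charge=2, max_gravy=0.5):
    analysis = ProteinAnalysis(seq)
    pi = analysis.isoelectric_point()                            # isoelectric point
    net_charge = sum(seq.count(a) for a in "KR") - sum(seq.count(a) for a in "DE")  # crude net charge
    gravy = analysis.gravy()                                     # grand average of hydropathy
    return pi >= min_pi and net_charge >= min_net_charge and gravy <= max_gravy

candidates = ["GLFDIIKKIAESF", "AAAAAAAAAAAA"]                   # hypothetical generated sequences
screened = [s for s in candidates if passes_screen(s)]
print(screened)
```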
In some of these embodiments, the method further comprises:
Step S250, training a binary classification model based on the obtained antibacterial peptide tag set and the obtained non-antibacterial peptide sequence tag set.
Step S260, in the training process of the binary classification model, adjusting the parameters of the binary classification model based on ten-fold cross-validation results to obtain the pre-trained antibacterial peptide classifier.
Specifically, ten-fold cross-validation divides the data into ten parts; each part in turn serves as the validation set while the remaining parts serve as the training set. Within this process the hyperparameters are kept fixed, and the average training loss and average validation loss of the 10 models are used to measure the quality of the hyperparameters. After satisfactory hyperparameters are found, all data are used as the training set and the final model is trained with those hyperparameters. Cross-validation reduces the randomness introduced by a single train/validation split and improves the generalization ability of the model: by making full use of the available data through multiple splits, it avoids selecting hyperparameters, or models without generalization ability, on the basis of one particular split.
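A minimal sketch of this ten-fold evaluation loop for an XGBoost-based classifier is shown below; the hyperparameter values are placeholders rather than those used in the patent.

```python
# Minimal sketch of ten-fold cross-validation for the XGBoost-based AMP classifier.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score, roc_auc_score
from xgboost import XGBClassifier  # pip install xgboost

def ten_fold_evaluate(X, y, params):
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    f1s, aucs = [], []
    for train_idx, val_idx in skf.split(X, y):
        model = XGBClassifier(**params)
        model.fit(X[train_idx], y[train_idx])
        proba = model.predict_proba(X[val_idx])[:, 1]
        f1s.append(f1_score(y[val_idx], proba >= 0.5))
        aucs.append(roc_auc_score(y[val_idx], proba))
    return np.mean(f1s), np.mean(aucs)

# Usage: compare hyperparameter settings by their average fold metrics,
# then retrain on all data with the best setting.
# best_f1, best_auc = ten_fold_evaluate(X, y, {"n_estimators": 300, "learning_rate": 0.1, "max_depth": 4})
```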
In some of these embodiments, based on step S250, training the classification model based on the obtained antibacterial peptide tag set and the non-antibacterial peptide sequence tag set comprises:
And step S251, extracting features of the antibacterial peptide tag set and the non-antibacterial peptide sequence tag set to obtain antibacterial peptide features and non-antibacterial peptide features.
Step S252, inputting the antibacterial peptide characteristics and the non-antibacterial peptide characteristics into a classification model, and training the classification model.
Specifically, the feature extraction methods include the pseudo K-tuple reduced amino acid composition (PseKRAAC) encoding method and the quasi-sequence-order (QSOrder) encoding method.
After feature extraction, feature selection is performed in a two-step process. First, features are evaluated according to their Pearson correlation coefficient (PCC), calculated as follows:

$$\text{PCC} = \frac{\sum_{i=1}^{N}\left(y_i - u_y\right)\left(\hat{y}_i - u_{\hat{y}}\right)}{\sqrt{\sum_{i=1}^{N}\left(y_i - u_y\right)^2}\sqrt{\sum_{i=1}^{N}\left(\hat{y}_i - u_{\hat{y}}\right)^2}}$$

where $y_i$ represents the true target value, $\hat{y}_i$ the predicted value, $u_y$ and $u_{\hat{y}}$ the means of the true and predicted values respectively, and $N$ the total number of samples. This step ranks the features according to their predictive ability and thereby quantifies their effectiveness. The most effective features are then selected and further evaluated using a multi-branch convolutional neural network with attention mechanism (MBC-Attention) model, yielding the antibacterial and non-antibacterial peptide features used to train the classification model.
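A sketch of the first, correlation-based ranking step is shown below; the top-k cut-off is an assumption for illustration.

```python
# Minimal sketch of Pearson-correlation-based feature ranking (assumed top-k cut-off).
import numpy as np

def rank_features_by_pcc(X, y, top_k=50):
    """X: (n_samples, n_features) feature matrix, y: binary labels (0/1)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    pcc = (Xc * yc[:, None]).sum(axis=0) / denom             # PCC of every feature with the label
    ranked = np.argsort(-np.abs(pcc))                         # rank by absolute correlation
    return ranked[:top_k], pcc[ranked[:top_k]]

# Usage: keep only the top-ranked columns before the MBC-Attention evaluation step.
# top_idx, top_pcc = rank_features_by_pcc(features, labels)
# X_selected = features[:, top_idx]
```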
After the feature engineering process, the feature data are used to train a classification model based on the gradient-boosted decision tree algorithm XGBoost (Extreme Gradient Boosting), where antimicrobial peptide features are labeled 1 and non-antimicrobial peptide features are labeled 0. The XGBoost model optimizes a regularized objective function in order to balance model accuracy and complexity and thereby prevent overfitting. The objective function is defined as:

$$\mathcal{L}(\phi) = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k} \Omega\left(f_k\right)$$

where $n$ represents the number of samples during training, $l$ represents the loss function, $k$ indexes the trees in the XGBoost ensemble, and $\hat{y}_i$ is the prediction for $x_i$, computed as $\hat{y}_i = \sum_{k} f_k(x_i)$. $\Omega(f)$ is the regularization term, $\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$, where $T$ represents the number of leaves in the tree, $w_j$ represents the weight of the $j$-th leaf, $\gamma$ controls the number of leaves, and $\lambda$ controls the L2 norm of the leaf weights.
During the training process, the XGBoost model is constructed additively, optimizing the following objective $\mathcal{L}^{(t)}$ at each iteration $t$:

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega\left(f_t\right)$$

where $n$ represents the number of samples in the training process, $l$ represents the loss function, $\Omega(f_t)$ represents the regularization term at iteration $t$, $\hat{y}_i^{(t-1)}$ is the prediction for sample $x_i$ from the previous round, and $f_t(x_i)$ represents the prediction made by the new tree $f_t$ on sample $x_i$ in the $t$-th iteration.
The objective $\mathcal{L}^{(t)}$ measures the overall loss over all samples in the dataset after the new tree $f_t$ is added to improve the previous round's predictions. The regularization term $\Omega(f_t)$ penalizes the complexity of the new tree $f_t$ to prevent overfitting.
A second-order Taylor expansion is used to approximate $\mathcal{L}^{(t)}$:

$$\mathcal{L}^{(t)} \approx \sum_{i=1}^{n}\left[l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i)\right] + \Omega\left(f_t\right)$$

where $g_i$ and $h_i$ are the first- and second-order gradients of the loss with respect to $\hat{y}_i^{(t-1)}$, respectively.
The XGBoost model adjustments were made based on the F1 score and AUC index, using 10-fold cross validation (k-fold 10) to prevent overfitting. The F1 score is an index that considers both precision and recall for calculating a balance measure of model accuracy, and in particular when dealing with imbalance categories, it is defined as the harmonic mean of precision and recall:
$$\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

where Precision is the fraction of predicted positives that are truly positive, and Recall (also called sensitivity or true positive rate) is the proportion of actual positive cases that the model correctly identifies, calculated as follows:

$$\text{Recall} = \frac{TP}{TP + FN}$$

where $TP$ represents true positives and $FN$ represents false negatives.
The optimal split for each node during training is determined by maximizing the gain:

$$\text{Gain} = \frac{1}{2}\left[\frac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left(\sum_{i \in I} g_i\right)^2}{\sum_{i \in I} h_i + \lambda}\right] - \gamma$$

where $I_L$ and $I_R$ represent the instance sets of the left and right child nodes produced by the split ($I = I_L \cup I_R$), $g_i$ and $h_i$ are the first- and second-order gradients, respectively, $\lambda$ controls the L2 norm of the leaf node weights to prevent overfitting, and $\gamma$ penalizes splits that increase the number of leaf nodes of the tree.
Shrinkage (learning rate $\eta$) is applied to scale the prediction of each tree:

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta\, f_t(x_i)$$

where $\hat{y}_i^{(t)}$ represents the updated prediction for sample $x_i$ after the $t$-th tree is added, $\hat{y}_i^{(t-1)}$ represents the prediction for sample $x_i$ before the $t$-th tree is added, $f_t(x_i)$ represents the output of the new tree $f_t$ for sample $x_i$ in the $t$-th iteration of the training process, and $\eta$ represents the learning rate, which prevents overfitting by reducing the influence of each individual tree.
The best performance model determined by cross-validation is used for subsequent analysis.
Fig. 3 shows the receiver operating characteristic (ROC) curves of the XGBoost-based antimicrobial peptide classifier in ten-fold cross-validation. Each curve is the ROC curve of one fold, depicting the trade-off between true positive rate and false positive rate at different threshold settings. The average ROC curve, indicated by the bold line, summarizes the overall performance of the model, and the area under the curve (AUC) provides a measure of classification accuracy.
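A minimal sketch of how such per-fold ROC curves and their mean could be produced is shown below; the plotting style and model settings are assumptions for illustration.

```python
# Minimal sketch of the per-fold ROC curves summarized in FIG. 3.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_curve, auc
from xgboost import XGBClassifier

def plot_cv_roc(X, y, n_splits=10):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    grid = np.linspace(0, 1, 100)
    tprs = []
    for fold, (tr, va) in enumerate(skf.split(X, y)):
        proba = XGBClassifier().fit(X[tr], y[tr]).predict_proba(X[va])[:, 1]
        fpr, tpr, _ = roc_curve(y[va], proba)
        plt.plot(fpr, tpr, alpha=0.3, label=f"fold {fold} (AUC={auc(fpr, tpr):.2f})")
        tprs.append(np.interp(grid, fpr, tpr))
    plt.plot(grid, np.mean(tprs, axis=0), lw=2, label="mean ROC")   # bold mean curve
    plt.xlabel("False positive rate"); plt.ylabel("True positive rate"); plt.legend()
    plt.show()
```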
The present embodiment is described and illustrated below by way of preferred embodiments.
Fig. 4 is a flowchart of the antibacterial peptide generation method based on the generative model of this preferred embodiment. As shown in fig. 4, the antibacterial peptide generation method based on the generative model provided in this preferred embodiment includes:
S1, constructing an antibacterial peptide dataset, i.e., a set of the existing antibacterial peptide sequences: antibacterial peptide data are extracted from public antibacterial peptide databases, then merged, deduplicated and filtered to delete incomplete or meaningless entries, yielding the antibacterial peptide dataset.
S2, constructing an antibacterial peptide multisequence alignment data set. Based on the multi-sequence alignment technology, the existing collection of the antibacterial peptide sequences is expanded to obtain an antibacterial peptide multi-sequence alignment data set.
S3, inputting the antibacterial peptide multiple sequence alignment dataset into a pre-trained order-agnostic autoregressive diffusion model, and generating a predicted sequence based on an arbitrary decoding order.
S4, screening the physicochemical properties of the predicted sequence to obtain a screened sequence.
S5, training a classification model based on the obtained antibacterial peptide tag set and the obtained non-antibacterial peptide sequence tag set, and, during training of the classification model, adjusting its parameters based on ten-fold cross-validation results to obtain the antibacterial peptide classifier.
S6, inputting the screened sequence into an antibacterial peptide classifier, and judging whether the predicted sequence has antibacterial activity.
S7, outputting the sequence with antibacterial activity as a final target antibacterial peptide sequence.
This preferred embodiment provides a generation-and-screening strategy that combines a diffusion model, multiple sequence alignment data and machine learning optimization tools, significantly improving the design efficiency and success rate of antibacterial peptides and representing an important innovation in bioinformatics and protein engineering.
It should be noted that the steps illustrated in the above-described flow or flow diagrams of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flow diagrams, in some cases, the steps illustrated or described may be performed in an order other than that illustrated herein.
In this embodiment, an antibacterial peptide generating device based on a generating model is further provided, and the device is used for implementing the foregoing embodiments and preferred embodiments, and is not described in detail. The terms "module," "unit," "sub-unit," and the like as used below may refer to a combination of software and/or hardware that performs a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementations in hardware, or a combination of software and hardware, are also possible and contemplated.
Fig. 5 is a block diagram showing the structure of an antimicrobial peptide generating apparatus based on a generative model according to the present embodiment, and as shown in fig. 5, the apparatus includes an existing data acquisition module 51, a new sequence prediction module 52, an activity verification module 53, and a target antimicrobial peptide output module 54.
An existing data acquisition module 51 for acquiring a collection of existing antimicrobial peptide sequences;
A new sequence prediction module 52, configured to predict a new antimicrobial peptide based on an existing antimicrobial peptide sequence by using a pre-trained generative model, to obtain a predicted sequence;
the activity verification module 53 is used for inputting the predicted sequence into a pre-trained antibacterial peptide classifier to judge whether the predicted sequence has antibacterial activity, wherein the antibacterial peptide classifier is obtained by training based on an antibacterial peptide tag set and a non-antibacterial peptide sequence tag set;
the target antibacterial peptide output module 54 is configured to output the predicted sequence having antibacterial activity as a final target antibacterial peptide sequence.
In some embodiments, obtaining the collection of existing antimicrobial peptide sequences includes extracting initial antimicrobial peptide data from a public database and washing the initial antimicrobial peptide data to obtain the collection of existing antimicrobial peptide sequences.
In some embodiments, obtaining a predicted sequence structurally similar to the existing antibacterial peptide sequences with the pre-trained generative model comprises: expanding the existing antibacterial peptide sequence set based on multiple sequence alignment to obtain an antibacterial peptide multiple sequence alignment dataset, inputting the antibacterial peptide multiple sequence alignment dataset into the pre-trained generative model, and generating the predicted sequence based on an arbitrary decoding order.
In some of these embodiments, the pre-trained generative model employs an order-agnostic autoregressive diffusion model.
In some embodiments, inputting the predicted sequence into a pre-trained antimicrobial peptide classifier to determine whether the predicted sequence has antimicrobial activity includes performing a physicochemical screening of the predicted sequence to obtain a screened sequence, inputting the screened sequence into the pre-trained antimicrobial peptide classifier, and outputting a conclusion of whether the screened sequence has antimicrobial activity.
In some embodiments, the method further comprises training the classification model based on the obtained antibacterial peptide tag set and the obtained non-antibacterial peptide sequence tag set, and adjusting parameters of the classification model based on a ten-fold cross-validation result in the training process of the classification model to obtain the pre-trained antibacterial peptide classifier.
In some embodiments, training the classification model based on the obtained antibacterial peptide tag set and the obtained non-antibacterial peptide sequence tag set comprises extracting features of the antibacterial peptide tag set and the non-antibacterial peptide sequence tag set to obtain antibacterial peptide features and non-antibacterial peptide features, inputting the antibacterial peptide features and the non-antibacterial peptide features into the classification model, and training the classification model.
The above-described respective modules may be functional modules or program modules, and may be implemented by software or hardware. For modules implemented in hardware, the modules may be located in the same processor, or may be located in different processors in any combination.
There is also provided in this embodiment a computer device comprising a memory in which a computer program is stored and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
Optionally, the computer device may further include a transmission device and an input/output device, where the transmission device is connected to the processor, and the input/output device is connected to the processor.
It should be noted that, specific examples in this embodiment may refer to examples described in the foregoing embodiments and alternative implementations, and are not described in detail in this embodiment.
In addition, in combination with the antibacterial peptide generation method based on the generation model provided in the above embodiment, a storage medium may be provided in the present embodiment to achieve this. The storage medium has stored thereon a computer program which, when executed by a processor, implements any of the antimicrobial peptide generation methods of the above embodiments based on a generative model.
It should be understood that the specific embodiments described herein are merely illustrative of this application and are not intended to be limiting. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure in accordance with the embodiments provided herein.
It is to be understood that the drawings are merely illustrative of some embodiments of the present application and that it is possible for those skilled in the art to adapt the present application to other similar situations without the need for inventive work. In addition, it should be appreciated that while the development effort might be complex and lengthy, it would nevertheless be a routine undertaking of design, fabrication, or manufacture for those of ordinary skill having the benefit of this disclosure, and thus should not be construed as a departure from the disclosure.
The term "embodiment" in this disclosure means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive. It will be clear or implicitly understood by those of ordinary skill in the art that the embodiments described in the present application can be combined with other embodiments without conflict.
The above examples merely represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the patent claims. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims (8)

Translated from Chinese
1. A method for generating antimicrobial peptides based on a generative model, characterized in that the method comprises: obtaining a set of existing antimicrobial peptide sequences based on a public database of antimicrobial peptides; using a pre-trained generative model, predicting novel antimicrobial peptides based on the existing antimicrobial peptide sequences to obtain predicted sequences; inputting the predicted sequences into a pre-trained antimicrobial peptide classifier to determine whether the predicted sequences have antimicrobial activity, the antimicrobial peptide classifier being trained based on an antimicrobial peptide label set and a non-antimicrobial peptide sequence label set; and outputting the predicted sequences having antimicrobial activity as the final target antimicrobial peptide sequences; wherein using the pre-trained generative model to predict novel antimicrobial peptides based on the existing antimicrobial peptide sequences to obtain predicted sequences comprises: expanding the set of existing antimicrobial peptide sequences based on multiple sequence alignment to obtain an antimicrobial peptide multiple sequence alignment dataset; and inputting the antimicrobial peptide multiple sequence alignment dataset into the pre-trained generative model to generate predicted sequences based on an arbitrary decoding order; wherein the pre-trained generative model adopts an order-agnostic autoregressive diffusion model, which is trained with the log-likelihood of the generated sequence as the loss function, the log-likelihood being expressed as the expectation over all possible decoding orders; and during training, the order-agnostic autoregressive diffusion model processes the multiple sequence alignment matrix with alternating axial attention mechanisms and shares inter-sequence structure through tied row attention.
2. The method for generating antimicrobial peptides based on a generative model according to claim 1, wherein obtaining a set of existing antimicrobial peptide sequences comprises: extracting initial antimicrobial peptide data from public databases; and cleaning the initial antimicrobial peptide data to obtain the set of existing antimicrobial peptide sequences.
3. The method for generating antimicrobial peptides based on a generative model according to claim 1, wherein inputting the predicted sequence into a pre-trained antimicrobial peptide classifier to determine whether the predicted sequence has antimicrobial activity comprises: performing physicochemical property screening on the predicted sequence to obtain a screened sequence; and inputting the screened sequence into the pre-trained antimicrobial peptide classifier and outputting a conclusion as to whether the screened sequence has antimicrobial activity.
4. The method for generating antimicrobial peptides based on a generative model according to claim 1, characterized in that the method further comprises: training a binary classification model based on the obtained antimicrobial peptide label set and the non-antimicrobial peptide sequence label set; and, during training of the binary classification model, adjusting the parameters of the binary classification model based on ten-fold cross-validation results to obtain the pre-trained antimicrobial peptide classifier.
5. The method for generating antimicrobial peptides based on a generative model according to claim 3, wherein training the binary classification model based on the obtained antimicrobial peptide label set and the non-antimicrobial peptide sequence label set comprises: performing feature extraction on the antimicrobial peptide label set and the non-antimicrobial peptide sequence label set to obtain antimicrobial peptide features and non-antimicrobial peptide features; and inputting the antimicrobial peptide features and the non-antimicrobial peptide features into the binary classification model to train the binary classification model.
6. A device for generating antimicrobial peptides based on a generative model, characterized in that the device comprises: an existing data acquisition module, configured to obtain a set of existing antimicrobial peptide sequences based on a public database of antimicrobial peptides; a new sequence prediction module, configured to predict novel antimicrobial peptides based on the existing antimicrobial peptide sequences using a pre-trained generative model to obtain predicted sequences, wherein using the pre-trained generative model to predict novel antimicrobial peptides based on the existing antimicrobial peptide sequences to obtain predicted sequences comprises: expanding the set of existing antimicrobial peptide sequences based on multiple sequence alignment to obtain an antimicrobial peptide multiple sequence alignment dataset, and inputting the antimicrobial peptide multiple sequence alignment dataset into the pre-trained generative model to generate predicted sequences based on an arbitrary decoding order, wherein the pre-trained generative model adopts an order-agnostic autoregressive diffusion model trained with the log-likelihood of the generated sequence as the loss function, the log-likelihood being expressed as the expectation over all possible decoding orders, and during training the order-agnostic autoregressive diffusion model processes the multiple sequence alignment matrix with alternating axial attention mechanisms and shares inter-sequence structure through tied row attention; an activity verification module, configured to input the predicted sequence into a pre-trained antimicrobial peptide classifier to determine whether the predicted sequence has antimicrobial activity, the antimicrobial peptide classifier being trained based on an antimicrobial peptide label set and a non-antimicrobial peptide sequence label set; and a target antimicrobial peptide output module, configured to output the predicted sequence having antimicrobial activity as the final target antimicrobial peptide sequence.
7. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method according to any one of claims 1 to 5 when executing the computer program.
8. A computer-readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 5 are implemented.
CN202411908673.7A · Priority date 2024-12-24 · Filing date 2024-12-24 · Method, device and computer equipment for generating antimicrobial peptides based on generative model · Active · CN119361020B (en)

Priority Applications (1)

Application number: CN202411908673.7A · Priority date: 2024-12-24 · Filing date: 2024-12-24 · Title: Method, device and computer equipment for generating antimicrobial peptides based on generative model

Applications Claiming Priority (1)

Application number: CN202411908673.7A · Priority date: 2024-12-24 · Filing date: 2024-12-24 · Title: Method, device and computer equipment for generating antimicrobial peptides based on generative model

Publications (2)

CN119361020A (en): 2025-01-24
CN119361020B (en): 2025-08-05

Family

ID=94300672

Family Applications (1)

Application number: CN202411908673.7A · Status: Active · Publication: CN119361020B (en) · Priority date: 2024-12-24 · Filing date: 2024-12-24 · Title: Method, device and computer equipment for generating antimicrobial peptides based on generative model

Country Status (1)

Country: CN · Publication: CN119361020B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
CN119479776B (en)* · Priority date: 2025-01-15 · Publication date: 2025-05-09 · Assignee: 湘湖实验室(农业浙江省实验室) · Title: Efficient antibacterial peptide batch design and evaluation method and system

Citations (1)

* Cited by examiner, † Cited by third party
CN115472224A (en)* · Priority date: 2022-09-21 · Publication date: 2022-12-13 · Assignee: 中国科学院深圳先进技术研究院 · Title: Enzyme sequence generation method, device, medium and equipment based on multi-sequence comparison

Family Cites Families (6)

* Cited by examiner, † Cited by third party
CN109326325B (en)* · Priority date: 2018-07-25 · Publication date: 2022-02-18 · Assignee: 郑州云海信息技术有限公司 · Title: Method, system and related assembly for gene sequence comparison
US20220367007A1 (en)* · Priority date: 2019-09-27 · Publication date: 2022-11-17 · Assignee: UAB Biomatter Designs · Title: Method for generating functional protein sequences with generative adversarial networks
CN112614538A (en)* · Priority date: 2020-12-17 · Publication date: 2021-04-06 · Assignee: 厦门大学 · Title: Antibacterial peptide prediction method and device based on protein pre-training characterization learning
US20230129568A1 (en)* · Priority date: 2021-10-21 · Publication date: 2023-04-27 · Assignee: NEC Laboratories America, Inc. · Title: T-cell receptor repertoire selection prediction with physical model augmented pseudo-labeling
CN118430654A (en)* · Priority date: 2023-05-18 · Publication date: 2024-08-02 · Assignee: 浙大宁波理工学院 · Title: Method for generating target antibacterial peptide
CN118506879A (en)* · Priority date: 2024-05-20 · Publication date: 2024-08-16 · Assignee: 西北工业大学 · Title: Identification model construction method, antibacterial peptide generation method, device and storage medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
CN115472224A (en)* · Priority date: 2022-09-21 · Publication date: 2022-12-13 · Assignee: 中国科学院深圳先进技术研究院 · Title: Enzyme sequence generation method, device, medium and equipment based on multi-sequence comparison

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Protein generation with evolutionary diffusion: sequence is all you need; Sarah Alamdari et al.; bioRxiv preprint; 2024-11-04; pp. 1-65 *
Prediction of antimicrobial peptides based on computational methods (基于计算方法的抗菌肽预测); 曹隽喆 et al.; Chinese Journal of Computers (计算机学报); 2017-12-31; Vol. 40, No. 12; pp. 2777-2792 *
曹隽喆 et al. Prediction of antimicrobial peptides based on computational methods. Chinese Journal of Computers, 2017, Vol. 40, No. 12, pp. 2777-2792. *

Also Published As

CN119361020A (en): 2025-01-24


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant

[8]ページ先頭

©2009-2025 Movatter.jp