Background
As globalization accelerates, people and goods move between countries ever more frequently, and the disease burden caused by newly emerging and unidentified pathogens continues to grow, posing unprecedented challenges to human health. For example, SARS, avian influenza, influenza A (H1N1), H7N9 influenza, Ebola virus disease, and monkeypox have all shown the potential for rapid spread and serious consequences. These diseases not only threaten public health but can also cause economic losses and social panic, prompting public health systems in many countries to continually strengthen their pathogen monitoring, rapid identification, and response capabilities. In the face of infectious disease threats, rapid and accurate pathogen detection is particularly important because it provides the basis for effective prevention and control measures.
Currently, traditional techniques for identifying pathogenic microorganisms mainly include morphological observation, cell physiological and biochemical characterization, bacterial culture nutrient typing, gene chips, and automated microbial analysis systems. These methods generally depend on lengthy microbial culture, are highly sensitive to culture conditions, and are prone to false-negative or false-positive results. Furthermore, detection methods based on specific primers, probes, or antibodies (such as antigen-antibody reactions and PCR detection) rely on prior knowledge of the sequences of known microorganisms, which limits their effectiveness against unknown or mutated pathogens. Because of these limitations, it is very difficult to identify a pathogen quickly and accurately at the onset of an infectious disease, which delays disease control and intervention.
The advent of new-generation sequencing technologies, particularly nanopore sequencing, provides a new solution for microbial identification. Nanopore sequencing has several advantages. It requires no culture, which shortens detection time and suits clinical scenarios that demand a rapid response. It requires no prior knowledge, so it can identify unknown and mutated pathogens, which widens the detection range and strengthens the ability to respond to emerging pathogens. It also produces long reads, from which complete genome information can be obtained, making subsequent genome analysis and functional research more accurate.
Metagenomic techniques reveal the genetic information of all microorganisms in a sample, including DNA viruses, bacteria, fungi, and parasites, by directly sequencing all nucleic acids in the sample at high throughput. Because they rely neither on prior culture of microorganisms nor on known sequence information for a particular pathogen, they are particularly useful for rapidly identifying unknown or rare pathogens.
Nanopore sequencing therefore has great value in fields such as metagenomic sequencing, pathogen monitoring, genome sequencing of new species, and epigenetic research, and has become a powerful tool for rapid microbial identification.
Although nanopore sequencing has significant advantages in theory, a series of challenges remains in practice. First, the accuracy of nanopore sequencing is only about 95%, and this high sequencing error rate makes species identification and homology analysis difficult. Second, pathogens must be identified rapidly and accurately from complex biological samples, which requires efficient data analysis algorithms and tools so that actionable results can be produced in a short time. Finally, pathogen data analysis still faces the problem of how to recognize, from the results, pathogenic microorganisms that have never been observed in nature or are not recorded in any database, particularly those causing infectious diseases imported from abroad.
Disclosure of Invention
To address the problems and shortcomings of the prior art, the invention provides a system and a method for identifying and analyzing unknown pathogenic microorganisms based on nanopore sequencing.
The invention solves the above technical problems by the following technical solution:
The invention provides an unknown pathogenic microorganism identification analysis system based on nanopore sequencing, characterized by comprising a microbial database construction and updating module, a microorganism information standardization labeling module, a nanopore sequencing analysis module, a sequencing data quality control module, a host sequencing read removal module, a microbial database comparison annotation module, a same genus sequencing read assembly module, and a microbial database comparison calculation module;
The microbial database construction and updating module is used for acquiring a plurality of pieces of microbial species information from public databases, each piece comprising at least a genome sequence, performing an integrity evaluation on each genome sequence to remove genome sequences of poor quality, removing redundancy from the retained genome sequences according to a similarity principle by identifying and merging genome sequences whose similarity exceeds a set threshold, so that only one genome sequence is retained for each species or each group of similar genome sequences, constructing a microbial database from the microbial species information of the retained genome sequences, and periodically updating the microbial database;
The microorganism information standardization labeling module is used for classifying each piece of microbial species information in the microbial database in order by the ranks of kingdom, phylum, class, order, family, genus, and species, thereby applying a standardized taxonomic label to each entry, and, after all entries have been traversed, issuing a manual correction reminder so that the automatically labeled microbial database is manually corrected;
The nanopore sequencing analysis module is used for extracting all nucleic acids, namely DNA or RNA, from a sample to be tested, subjecting the DNA to end repair, or the RNA to reverse transcription followed by end repair, to obtain a product, performing library preparation and sequencing of the product on a nanopore sequencing platform, and monitoring the sequencing directory in real time during nanopore sequencing, wherein each newly generated sequencing sequence file is taken as a sequencing sequence file to be analyzed, each such file comprising a plurality of sequencing reads, until nanopore sequencing is complete or no new sequencing sequence file has been generated within a set time;
The sequencing data quality control module is used for, for each sequencing sequence file to be analyzed, removing adapter contamination from the sequencing reads in the file, removing low-quality reads, and removing short reads, thereby obtaining the sequencing reads retained in that file;
The host sequencing read removal module is used for aligning each sequencing read retained in each sequencing sequence file to be analyzed against the genome sequence of the host, and extracting the sequencing reads that do not align to the host genome sequence;
The microbial database comparison annotation module is used for comparing each extracted sequencing read with the microbial database, annotating the sequencing read with the species level and the genus level of a species when the similarity between the sequencing read and the genome sequence of that species reaches a set similarity, and removing the sequencing read when the similarity does not reach the set similarity, thereby obtaining each retained sequencing read together with its species-level and genus-level annotation;
The same genus sequencing read assembly module is used for taking all retained sequencing reads annotated to the same genus as a read set, and performing consensus sequence assembly on all sequencing reads in each such read set to obtain a plurality of representative sequencing reads, thereby obtaining the representative sequencing reads for each genus;
The microbial database comparison calculation module is used for comparing each representative sequencing read with the genome sequences in the microbial database, outputting, when a first calculation comparison condition is met, that the representative sequencing read is consistent with the species to which the genome sequence meeting the first calculation comparison condition belongs, outputting, when a second calculation comparison condition is met, that the representative sequencing read is homologous to the species to which the genome sequence meeting the second calculation comparison condition belongs, and outputting, when the representative sequencing read meets neither the first nor the second calculation comparison condition with any genome sequence in the microbial database, that the representative sequencing read has no match in the microbial database.
The invention also provides an unknown pathogenic microorganism identification and analysis method based on nanopore sequencing, which is characterized by comprising the following steps:
S1, acquiring a plurality of pieces of microbial species information from public databases, each piece comprising at least a genome sequence, performing an integrity evaluation on each genome sequence to remove genome sequences of poor quality, removing redundancy from the retained genome sequences according to a similarity principle by identifying and merging genome sequences whose similarity exceeds a set threshold, so that only one genome sequence is retained for each species or each group of similar genome sequences, and constructing a microbial database from the microbial species information of the retained genome sequences;
S2, classifying each piece of microbial species information in the microbial database in order by the ranks of kingdom, phylum, class, order, family, genus, and species, thereby applying a standardized taxonomic label to each entry, and, after all entries have been traversed, issuing a manual correction reminder so that the automatically labeled microbial database is manually corrected;
S3, extracting all nucleic acids, namely DNA or RNA, from a sample to be tested, subjecting the DNA to end repair, or the RNA to reverse transcription followed by end repair, to obtain a product, performing library preparation and sequencing of the product on a nanopore sequencing platform, and monitoring the sequencing directory in real time during nanopore sequencing, wherein each newly generated sequencing sequence file is taken as a sequencing sequence file to be analyzed, each such file comprising a plurality of sequencing reads, until nanopore sequencing is complete or no new sequencing sequence file has been generated within a set time;
S4, for each sequencing sequence file to be analyzed, removing adapter contamination from the sequencing reads in the file, removing low-quality reads, and removing short reads, thereby obtaining the sequencing reads retained in that file;
S5, aligning each sequencing read retained in each sequencing sequence file to be analyzed against the genome sequence of the host, and extracting the sequencing reads that do not align to the host genome sequence;
S6, comparing each extracted sequencing read with the microbial database, annotating the sequencing read with the species level and the genus level of a species when the similarity between the sequencing read and the genome sequence of that species reaches a set similarity, and removing the sequencing read when the similarity does not reach the set similarity, thereby obtaining each retained sequencing read together with its species-level and genus-level annotation;
S7, taking all retained sequencing reads annotated to the same genus as a read set, and performing consensus sequence assembly on all sequencing reads in each such read set to obtain a plurality of representative sequencing reads, thereby obtaining the representative sequencing reads for each genus;
S8, comparing each representative sequencing read with the genome sequences in the microbial database, outputting, when a first calculation comparison condition is met, that the representative sequencing read is consistent with the species to which the genome sequence meeting the first calculation comparison condition belongs, outputting, when a second calculation comparison condition is met, that the representative sequencing read is homologous to the species to which the genome sequence meeting the second calculation comparison condition belongs, and outputting, when the representative sequencing read meets neither the first nor the second calculation comparison condition with any genome sequence in the microbial database, that the representative sequencing read has no match in the microbial database.
The beneficial effects of the invention are as follows:
The invention provides an unknown pathogenic microorganism identification analysis system and method based on nanopore sequencing, in which all nucleic acids in a sample to be tested are extracted, the DNA is subjected to end repair or the RNA is subjected to reverse transcription and end repair, the resulting product undergoes library preparation and sequencing on a nanopore sequencing platform, and unknown pathogenic microorganisms are identified from the sequencing data. The approach is adapted to characteristics of nanopore sequencing such as its high sequencing error rate, identifies microorganisms accurately, and helps discover unknown pathogenic microorganisms.
The invention has the following features: 1. the nanopore sequencing data undergo two rounds of microbial database comparison and one round of genus-level consensus sequence assembly, which reduces the influence of the high error rate of nanopore sequencing and makes the identification of homologous microorganisms more accurate; 2. the analysis cycle is short, and the whole analysis can be completed within an additional 3-5 hours after sequencing finishes; 3. because many unknown pathogens exist in nature and some microorganisms are not included in microbial genome databases, the system can flag unknown pathogenic microorganisms that may be present in a sample.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, this embodiment provides an unknown pathogenic microorganism identification analysis system based on nanopore sequencing, which comprises a microbial database construction and updating module 1, a microorganism information standardization labeling module 2, a nanopore sequencing analysis module 3, a sequencing data quality control module 4, a host sequencing read removal module 5, a microbial database comparison annotation module 6, a same genus sequencing read assembly module 7, a microbial database comparison calculation module 8, and a microbial abundance statistics module 9.
The microbial database construction and updating module 1 is used for acquiring a plurality of pieces of microbial species information from public databases (including NCBI, BV-BRC, EMBL, DDBJ, and the like), each piece comprising at least a genome sequence, performing an integrity evaluation on each genome sequence to remove genome sequences of poor quality, removing redundancy from the retained genome sequences according to a similarity principle by identifying and merging genome sequences whose similarity exceeds a set threshold (for example, 90%), so that only one genome sequence is retained for each species or each group of similar genome sequences, and constructing a microbial database from the microbial species information of the retained genome sequences. Thereafter, when an update of a public database is detected, the module rebuilds the microbial database from the updated public database and calls the microorganism information standardization labeling module to apply standardized taxonomic labels to each piece of microbial species information in the rebuilt microbial database.
In this embodiment, QUAST (software for assessing genome assembly quality, which analyzes the integrity, accuracy, and consistency of an assembly) is used to evaluate the integrity of each genome sequence and remove genome sequences of poor quality, for example genome sequences in which N bases make up more than 5% of the sequence or sequences shorter than 1000 bp.
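The two example criteria above can be illustrated with a minimal Python sketch. It is not QUAST and does not reproduce its report; the function names and the in-memory dictionary of sequences are assumptions made only for illustration, while the 5% and 1000 bp limits come from the text.

```python
def passes_integrity_check(sequence: str,
                           max_n_fraction: float = 0.05,
                           min_length: int = 1000) -> bool:
    """Return True if the sequence is at least min_length bp long and
    its fraction of N bases does not exceed max_n_fraction."""
    if len(sequence) < min_length:
        return False
    n_fraction = sequence.upper().count("N") / len(sequence)
    return n_fraction <= max_n_fraction


def filter_genomes(genomes: dict) -> dict:
    """Keep only the genome sequences that pass the integrity check.

    `genomes` maps a (hypothetical) accession or name to its sequence.
    """
    return {name: seq for name, seq in genomes.items()
            if passes_integrity_check(seq)}
```

A sequence of 1200 bp with no N bases is kept, whereas a 40 bp sequence or a 1000 bp sequence containing 20% N bases is dropped.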
In this embodiment, redundancy is removed from the genome sequences using CD-HIT (a tool for efficient clustering and redundancy removal that processes large-scale nucleic acid or protein sequence data, identifying and merging similar sequences to reduce redundancy in a dataset) with a similarity threshold of 90%, so that only one representative sequence is retained for each species or each group of similar genome sequences, which improves the uniqueness and usability of the microbial database. The constructed microbial database is updated whenever the public databases are updated, so that the downloaded genome sequences remain consistent with the latest research results and genome sequences that have become outdated or redundant as records change are promptly removed from the microbial database.
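The idea of keeping one representative per group of similar sequences can be sketched as a greedy, longest-first pass. This is only an illustration of the principle, not CD-HIT itself: the similarity measure used here (the fraction of a sequence's k-mers that also occur in a representative) is a simplified stand-in chosen so the example stays self-contained, and all function names are hypothetical.

```python
def kmers(seq: str, k: int = 8) -> set:
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_similarity(query: str, representative: str, k: int = 8) -> float:
    """Fraction of the query's k-mers found in the representative.

    A crude stand-in for sequence identity, used for illustration only.
    """
    query_kmers = kmers(query, k)
    if not query_kmers:
        return 0.0
    return len(query_kmers & kmers(representative, k)) / len(query_kmers)


def greedy_dereplicate(genomes: dict, threshold: float = 0.90, k: int = 8) -> dict:
    """Process sequences from longest to shortest; a sequence becomes a new
    representative unless its similarity to an existing representative
    exceeds the threshold, in which case it is treated as redundant."""
    representatives = {}
    ordered = sorted(genomes.items(), key=lambda item: len(item[1]), reverse=True)
    for name, seq in ordered:
        redundant = any(kmer_similarity(seq, rep, k) > threshold
                        for rep in representatives.values())
        if not redundant:
            representatives[name] = seq
    return representatives
```

With a 90% threshold, a shorter sequence that is entirely contained in a longer one is merged into the longer representative, while an unrelated sequence is kept as its own representative.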
The microorganism information standardization labeling module 2 is used for classifying each piece of microbial species information in the microbial database in order by the ranks of kingdom, phylum, class, order, family, genus, and species, thereby applying a standardized taxonomic label to each entry, and, after all entries have been traversed, issuing a manual correction reminder so that the automatically labeled microbial database is manually corrected.
In this embodiment, the microbial annotation information differs between public databases, and some databases contain only the species name and lack further classification information, so the species classification information must be standardized. The microbial species information is standardized with reference to the NCBI Taxonomy classification system, covering kingdom, phylum, class, order, family, genus, and species. For example, the standardized classification of Escherichia coli is: kingdom Bacteria, phylum Proteobacteria, class Gammaproteobacteria, order Enterobacterales, family Enterobacteriaceae, genus Escherichia, species Escherichia coli.
In this embodiment, a script can be written to automate the labeling, annotating each entry according to the genome sequence of the species. If a species name is inconsistent with the name in the database, manual handling is required. With the NCBI database as the reference, the microbial database is thus labeled automatically and then corrected manually.
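A labeling script of this kind might look like the following Python sketch. The small in-memory reference table stands in for a lookup against NCBI Taxonomy and, like the function name, is an assumption made for illustration. Entries whose names cannot be matched are collected for manual correction.

```python
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")

# Hypothetical reference table standing in for an NCBI Taxonomy lookup.
REFERENCE_LINEAGES = {
    "Escherichia coli": (
        "Bacteria", "Proteobacteria", "Gammaproteobacteria",
        "Enterobacterales", "Enterobacteriaceae",
        "Escherichia", "Escherichia coli",
    ),
}


def label_entries(species_names):
    """Attach a standardized lineage to each species name.

    Returns (labeled, needs_manual_review): a dict mapping each matched
    name to its rank-by-rank lineage, and a list of the names that could
    not be matched and therefore need manual correction.
    """
    labeled = {}
    needs_manual_review = []
    for raw_name in species_names:
        name = raw_name.strip()
        lineage = REFERENCE_LINEAGES.get(name)
        if lineage is None:
            needs_manual_review.append(name)
        else:
            labeled[name] = dict(zip(RANKS, lineage))
    return labeled, needs_manual_review
```

After all entries have been traversed, a non-empty review list is the point at which a manual correction reminder would be issued.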
The nanopore sequencing analysis module 3 is used for extracting all nucleic acids (DNA or RNA) from a sample to be tested, subjecting the DNA to end repair, or the RNA to reverse transcription followed by end repair, to obtain a product, performing library preparation (adding the adapters required for nanopore sequencing to both ends of each molecule) and sequencing of the product on a nanopore sequencing platform, and monitoring the sequencing directory in real time during nanopore sequencing. Each newly generated sequencing sequence file is taken as a sequencing sequence file to be analyzed, each such file comprising a plurality of sequencing reads, until nanopore sequencing is complete or no new sequencing sequence file has been generated within a set time.
In this embodiment, during sequencing the nanopore sequencing platform monitors the change in current as a DNA or RNA molecule passes through the nanopore, converts the current change into sequence information in real time, and outputs it in FASTQ format. The sequencing directory (the path where the sequencing files generated during sequencing are stored) is monitored in real time using the watchdog package for Python (a widely used high-level programming language). Whenever a new sequencing sequence file is generated, the system automatically invokes the downstream analysis workflow and performs microbial annotation on the sequences. Sequences are thus analyzed continuously while sequencing is in progress, and each analysis covers only the sequences in the new sequencing file, which contains a plurality of sequences, each with a unique identifier. The nanopore sequencing analysis module exits automatically when sequencing completes (a specific message can be captured when sequencing finishes normally) or when no new sequencing file has been generated for a long time.
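The monitoring loop and its two exit conditions can be sketched as follows. To keep the example dependency-free it polls the directory with the standard library rather than using watchdog's event callbacks, so it illustrates the control flow and is not the embodiment's implementation. The completion marker file name, the `*.fastq` pattern, and the default timeouts are hypothetical.

```python
import time
from pathlib import Path


def watch_sequencing_directory(directory, handle_file,
                               idle_timeout=1800.0, poll_interval=5.0,
                               done_marker="sequencing_done.txt"):
    """Call handle_file(path) once for every new FASTQ file in `directory`.

    Stops when a (hypothetical) completion marker file appears or when no
    new file has been seen for `idle_timeout` seconds. Returns the number
    of files handled. A production version would also need to make sure a
    file has been fully written before it is analyzed.
    """
    directory = Path(directory)
    seen = set()
    last_activity = time.monotonic()
    while True:
        new_files = sorted(p for p in directory.glob("*.fastq") if p not in seen)
        for path in new_files:
            seen.add(path)
            handle_file(path)  # e.g. start the downstream analysis workflow
        if new_files:
            last_activity = time.monotonic()
        if (directory / done_marker).exists():
            break  # sequencing reported as complete
        if time.monotonic() - last_activity >= idle_timeout:
            break  # no new sequencing file within the set time
        time.sleep(poll_interval)
    return len(seen)
```

An event-driven watcher would replace the polling with a callback on file creation, but the exit logic (completion signal or idle timeout) stays the same.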
The sequencing data quality control module 4 is used for, for each sequencing sequence file to be analyzed, removing adapter contamination from the sequencing reads in the file, removing low-quality reads, and removing short reads, thereby obtaining the sequencing reads retained in that file, and then performing a quality check on the retained sequencing reads to confirm that adapter contamination, low-quality reads, and short reads have been effectively removed.
In this embodiment, during quality control of the nanopore sequencing data, Porechop (software for processing nanopore sequencing data that removes adapter sequences from raw sequencing reads) is used to remove adapter contamination, and NanoFilt (software for filtering and quality control of raw nanopore sequencing data) is used to remove low-quality reads and short reads. The sequencing reads retained after quality control are then checked with FastQC (software for evaluating the quality of high-throughput sequencing data) to confirm that adapter contamination, low-quality reads, and short reads have been effectively removed.
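The read-level filtering can be illustrated with a short Python sketch. It does not reproduce Porechop or NanoFilt, and adapter trimming is left out. The quality and length cutoffs are hypothetical, since the text does not state them, and the Phred+33 encoding is assumed.

```python
import math


def mean_phred(quality_string: str, offset: int = 33) -> float:
    """Mean read quality, computed by averaging the per-base error
    probabilities and converting the average back to a Phred score."""
    if not quality_string:
        return 0.0
    error_probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(error_probs) / len(error_probs))


def filter_reads(records, min_quality: float = 7.0, min_length: int = 500):
    """Yield only the (name, sequence, quality) records that are long
    enough and whose mean quality reaches min_quality.

    Both cutoffs are illustrative assumptions, not values from the text.
    """
    for name, seq, qual in records:
        if len(seq) >= min_length and mean_phred(qual) >= min_quality:
            yield name, seq, qual
```

Averaging error probabilities rather than raw Phred scores prevents a few high-quality bases from masking a generally poor read.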
The host sequencing read removal module 5 is used for aligning each sequencing read retained in each sequencing sequence file to be analyzed against the genome sequence of the host, and extracting the sequencing reads that do not align to the host genome sequence.
Sequencing reads include reads derived from the host, and host reads usually account for a large proportion of the data. If they are not removed, they can drown out the microbial signal and make the analysis of microbial community diversity and composition inaccurate. Removing host reads also markedly reduces the data volume, which simplifies subsequent analysis, lowers the computational load, and improves analysis efficiency and accuracy.
In this embodiment, host sequencing reads are removed using minimap2 (software for fast and efficient alignment of long sequences against reference genomes or other sequences) and samtools (software for processing alignment files such as SAM and BAM), run one after the other. The sequencing reads are first aligned to the host genome sequence with minimap2, and the sequencing reads that do not align to the host genome sequence are then extracted with samtools for the subsequent microbial database comparison analysis.
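In SAM output, an unaligned read is marked by bit 0x4 of the FLAG field, and extracting non-host reads amounts to selecting the records with that bit set. The Python sketch below applies the selection to SAM text lines directly. It illustrates the principle and is not a replacement for samtools. The function name is hypothetical, and the exact command-line options used in the embodiment are not stated in the text.

```python
def unmapped_read_names(sam_lines):
    """Return the names of reads whose SAM FLAG has the 0x4 (unmapped) bit set.

    Header lines (starting with '@') and blank lines are skipped.
    """
    names = []
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # column 2 of a SAM record is the bitwise FLAG
        if flag & 0x4:
            names.append(fields[0])  # column 1 is the read name (QNAME)
    return names
```

Reads returned by this function did not align to the host genome and would be passed on to the microbial database comparison.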
The microbial database comparison annotation module 6 is used for comparing each extracted sequencing read with the microbial database, annotating the sequencing read with the species level and the genus level of a species when the similarity between the sequencing read and the genome sequence of that species reaches a set similarity, and removing the sequencing read when the similarity does not reach the set similarity, thereby obtaining each retained sequencing read together with its species-level and genus-level annotation.
In this embodiment, Kraken2 (a fast and accurate microbial classification tool) and Bracken (a tool for microbial classification and abundance estimation, used together with Kraken2) are used for microbial database comparison and species annotation. The host-depleted sequencing reads are compared with the microbial database to annotate bacteria, archaea, eukaryotes, and viruses in the sample to be tested directly at the species and genus levels. This read-based metagenomic species annotation approach is more comprehensive and accurate, avoids the resource consumption of assembly and gene prediction, can annotate more low-abundance species, and greatly improves the accuracy of species annotation and of relative species abundance.
The same genus sequencing read assembly module 7 is used for taking all retained sequencing reads annotated to the same genus as a read set, and performing consensus sequence assembly on all sequencing reads in each such read set to obtain a plurality of representative sequencing reads, thereby obtaining the representative sequencing reads for each genus.
Nanopore sequencing has a high error rate, with an average of up to 5%, so when species are annotated directly from sequencing reads, annotation errors can occur. For example, a sequencing read from one species may be annotated to the genome of a homologous species with a similar genome, which leads to errors in subsequent species analysis. To address this, the system introduces a same-genus read assembly step. Based on the preceding genus-level annotation, the sequencing reads of all species aligned to the same genus are extracted separately, and consensus sequence assembly is then performed on each set with Canu (software designed for long-read data that efficiently generates high-quality genome assemblies), yielding all representative sequencing reads for that genus. A representative read is an assembled sequence in which random sequencing errors have been eliminated, so it better represents the true genome sequence of the species. Annotating microbial species with these representative sequences can therefore give more accurate results.
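The grouping that precedes assembly can be sketched in a few lines of Python. The input format, an iterable of (read identifier, genus) pairs, and the function name are assumptions made for illustration. The assembly of each read set is outside the scope of the sketch.

```python
from collections import defaultdict


def group_reads_by_genus(annotations):
    """Group read identifiers by their genus-level annotation.

    `annotations` is an iterable of (read_id, genus) pairs; the result maps
    each genus to the list of reads annotated to it, in input order.
    """
    read_sets = defaultdict(list)
    for read_id, genus in annotations:
        read_sets[genus].append(read_id)
    return dict(read_sets)
```

Each resulting read set would then be written out and assembled on its own, so that reads from different genera are never mixed in one consensus.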
The microbial database comparison calculation module 8 is used for comparing each representative sequencing read with the genome sequences in the microbial database. When a first calculation comparison condition is met, it outputs that the representative sequencing read is consistent with the species to which the genome sequence meeting the first calculation comparison condition belongs. When a second calculation comparison condition is met, it outputs that the representative sequencing read is homologous to the species to which the genome sequence meeting the second calculation comparison condition belongs. When the representative sequencing read meets neither condition with any genome sequence in the microbial database, it outputs that the representative sequencing read has no match in the microbial database, to alert the user that an unknown pathogen may be present.
The first calculation comparison condition is that the similarity is greater than a first similarity threshold and the coverage is greater than a first coverage threshold. The second calculation comparison condition is that the similarity is greater than or equal to a second similarity threshold and less than or equal to the first similarity threshold, and the coverage is greater than or equal to a second coverage threshold and less than or equal to the first coverage threshold. In this embodiment, the first similarity threshold is 97% and the first coverage threshold is 95%.
Similarity = number of bases in the representative sequencing read that are identical to a given genome sequence in the microbial database / total length of the representative sequencing read.
Coverage = length of the region of the representative sequencing read aligned to a given genome sequence in the microbial database / total length of the representative sequencing read.
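The decision logic can be sketched in Python under stated assumptions. The first thresholds (97% similarity, 95% coverage) are those of this embodiment. The second thresholds (80% and 50%) are placeholders, because the text gives no values for them. The second condition is read here as requiring similarity and coverage to each lie between the second threshold and the first. The function names and output strings are hypothetical.

```python
def similarity(identical_bases: int, read_length: int) -> float:
    """Identical bases divided by the total length of the representative read."""
    return identical_bases / read_length


def coverage(aligned_length: int, read_length: int) -> float:
    """Aligned region length divided by the total length of the representative read."""
    return aligned_length / read_length


def classify(sim: float, cov: float,
             first_sim: float = 0.97, first_cov: float = 0.95,
             second_sim: float = 0.80, second_cov: float = 0.50) -> str:
    """Classify one comparison of a representative read against one genome.

    second_sim and second_cov are illustrative placeholders, not values
    taken from the text.
    """
    if sim > first_sim and cov > first_cov:
        return "consistent"   # first calculation comparison condition
    if second_sim <= sim <= first_sim and second_cov <= cov <= first_cov:
        return "homologous"   # second calculation comparison condition
    return "no match"         # neither condition met for this genome
```

A representative read would be reported as unmatched in the database only if every genome sequence it is compared with yields "no match".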
The microbial abundance statistics module 9 is used for counting, among the species reported as consistent, the number of representative sequencing reads of each species, and calculating the microbial abundance of a given species as follows: microbial abundance = number of representative sequencing reads for that species / sum of the numbers of representative sequencing reads over all species reported as consistent.
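As a small worked illustration of this ratio, the following Python sketch takes one species name per representative read that was reported as consistent and returns the relative abundance of each species. The input format and function name are assumptions made for illustration.

```python
from collections import Counter


def species_abundance(consistent_species):
    """Compute relative abundance from one species name per representative
    read that met the first calculation comparison condition."""
    counts = Counter(consistent_species)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {species: n / total for species, n in counts.items()}
```

For example, three representative reads assigned to one species and one to another give abundances of 0.75 and 0.25.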
As shown in fig. 2, the embodiment also provides a method for identifying and analyzing unknown pathogenic microorganisms based on nanopore sequencing, which comprises the following steps:
Step 101, constructing a microbial database, namely: a plurality of pieces of microbial species information are acquired from a public database, each piece of microbial species information comprising at least a genome sequence; an integrity evaluation is carried out on each genome sequence to remove genome sequences of poor quality; redundant genome sequences are then removed from the retained genome sequences according to a similarity principle, by identifying and merging genome sequences whose similarity is greater than a set threshold; and the microbial database is constructed from the microbial species information of the retained genome sequences, so that after redundancy removal the same species or similar genome sequences are retained only once.
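The redundancy removal in step 101 might look like the greedy sketch below. The source does not specify how genome similarity is measured, so a k-mer Jaccard index is used here purely as a stand-in; the function names, the threshold of 0.9, and k = 4 are all illustrative assumptions.

```python
def kmer_set(seq, k=4):
    """All overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """Jaccard index of two k-mer sets (stand-in similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def remove_redundant(genomes, threshold=0.9, k=4):
    """Greedy de-duplication.

    genomes: dict mapping genome name -> sequence.
    A genome is retained only if its similarity to every genome
    already retained does not exceed the threshold.
    """
    kept = []  # list of (name, k-mer set)
    for name, seq in genomes.items():
        kmers = kmer_set(seq, k)
        if all(jaccard(kmers, other) <= threshold for _, other in kept):
            kept.append((name, kmers))
    return [name for name, _ in kept]
```

A real pipeline would likely rely on a dedicated genome-distance tool rather than exact k-mer sets held in memory; the sketch only shows the "keep one per similar group" logic.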
Step 102, standardized annotation of the microbial information, namely: the microbial species information in the microbial database is classified in sequence according to the rank attributes of kingdom, phylum, class, order, family, genus, and species and is annotated in a standardized manner; after all of the microbial species information has been traversed, a manual correction reminder is issued, so that the automatically annotated microbial database can be corrected manually.
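One way to picture the standardized annotation of step 102 is a fixed, ordered set of ranks that every record is normalized into, with missing ranks collected for the manual correction reminder. The function name and the input format (a plain dictionary per species) are assumptions made for illustration.

```python
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")

def standardize_lineage(raw):
    """Normalize one raw record into the seven standard ranks.

    Returns (lineage, missing): lineage maps every rank to a cleaned name
    or None, and missing lists the ranks that need manual correction.
    """
    lineage = {rank: (raw.get(rank) or "").strip() or None for rank in RANKS}
    missing = [rank for rank in RANKS if lineage[rank] is None]
    return lineage, missing
```

A record with a blank or absent rank is still annotated, but the rank appears in `missing`, which could drive the reminder issued after the traversal.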
After steps 101 and 102, the method further comprises updating the microbial database and standardizing the annotation information, namely: after an update of the public database is detected, the microbial database is reconstructed from the updated public database, and the species of each piece of microbial species information in the reconstructed microbial database is annotated with the standardized classification.
Step 103, nanopore sequencing analysis, namely: total nucleic acid, which is DNA or RNA, is extracted from the sample to be tested; the DNA is subjected to end repair, or the RNA is subjected to reverse transcription and end repair, to obtain a product; library construction and sequencing are performed on the product using a nanopore sequencing platform; during nanopore sequencing, the sequencing directory is monitored in real time, and each newly generated sequencing sequence file is taken as a sequencing sequence file to be analyzed, until nanopore sequencing is completed or no new sequencing sequence file is generated within a set time; each sequencing sequence file to be analyzed comprises a plurality of sequencing reads.
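The real-time directory monitoring of step 103 could be approximated by simple polling, as in the sketch below. The function name, the `*.fastq` file pattern, the timeout values, and the `is_done` callback are assumptions; the source states only that monitoring stops when sequencing is completed or when no new file appears within a set time.

```python
import time
from pathlib import Path

def watch_new_files(directory, idle_timeout=600.0, poll_interval=5.0,
                    pattern="*.fastq", is_done=lambda: False):
    """Yield each newly appearing sequencing file exactly once.

    Stops when is_done() reports that sequencing has finished and no new
    file is pending, or when no new file has appeared for idle_timeout
    seconds. Caveat: a file may still be in the middle of being written
    when it is first seen; a real pipeline would need to guard against that.
    """
    seen = set()
    last_new = time.monotonic()
    while True:
        new_files = sorted(p for p in Path(directory).glob(pattern)
                           if p not in seen)
        for path in new_files:
            seen.add(path)
            yield path  # hand the file to the downstream analysis
        if new_files:
            last_new = time.monotonic()
        elif is_done() or time.monotonic() - last_new >= idle_timeout:
            return
        time.sleep(poll_interval)
```

Because the function is a generator, downstream quality control can start on the first file while sequencing is still producing later ones.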
Step 104, sequencing data quality control, namely: for each sequencing sequence file to be analyzed, adapter contamination, low-quality reads, and short reads are removed from the sequencing reads in the file, so as to obtain the sequencing reads retained in the file; a quality check is then performed on the retained sequencing reads to confirm that adapter contamination, low-quality reads, and short reads have been effectively removed.
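A simplified sketch of the three quality control operations in step 104 is given below. It assumes reads are already parsed into (name, sequence, quality string) tuples with Phred+33 quality encoding, trims only an exact adapter prefix, and uses the arithmetic mean of the Phred scores as the quality measure; the function names, the thresholds, and the adapter sequence are illustrative, not taken from the source.

```python
def mean_phred(quality, offset=33):
    """Arithmetic mean of the Phred scores in a Phred+33 quality string."""
    return sum(ord(c) - offset for c in quality) / len(quality)

def trim_adapter(seq, qual, adapter):
    """Remove an exact adapter prefix, trimming the quality string to match."""
    if adapter and seq.startswith(adapter):
        return seq[len(adapter):], qual[len(adapter):]
    return seq, qual

def quality_control(reads, adapter="", min_quality=7.0, min_length=200):
    """Keep reads that pass adapter trimming, length, and quality filters.

    reads: iterable of (name, sequence, quality_string) tuples.
    """
    kept = []
    for name, seq, qual in reads:
        seq, qual = trim_adapter(seq, qual, adapter)
        if not seq or len(seq) < min_length:
            continue  # short read
        if mean_phred(qual) < min_quality:
            continue  # low-quality read
        kept.append((name, seq, qual))
    return kept
```

The follow-up quality check described in the step could then rerun the same predicates on the retained reads and confirm that none fails.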
Step 105, host sequencing read removal, namely: each sequencing read retained in each sequencing sequence file to be analyzed is aligned against the genome sequence of the host, and the sequencing reads that do not align to the host genome sequence are extracted.
Step 106, each extracted sequencing read is aligned against the microbial database; when the similarity of a sequencing read to the genome sequence of a certain species reaches the set similarity, the sequencing read is annotated with the species level and the genus level of that species; when the similarity does not reach the set similarity, the sequencing read is removed; the retained sequencing reads and their species-level and genus-level annotations are thereby obtained.
Step 107, assembling the sequencing reads of the same genus, namely: all sequencing reads aligned and annotated to the same genus are regarded as one read set, and consensus sequence assembly is performed on all sequencing reads in the read set of that genus to obtain a plurality of representative sequencing reads, so that the representative sequencing reads under each genus are obtained.
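The filtering of step 106 and the grouping that precedes assembly in step 107 could be sketched as follows. The input format (one tuple per alignment hit), the function name, and the 0.9 default for the set similarity are assumptions; the consensus assembly itself is not shown, since it would be delegated to an assembler.

```python
from collections import defaultdict

def group_reads_by_genus(hits, min_similarity=0.9):
    """Build one read set per genus from annotated alignment hits.

    hits: iterable of (read_id, species, genus, similarity) tuples.
    Reads whose similarity does not reach min_similarity are removed.
    """
    read_sets = defaultdict(list)
    for read_id, species, genus, similarity in hits:
        if similarity >= min_similarity:
            read_sets[genus].append(read_id)
    return dict(read_sets)
```

Each resulting read set would then be passed to consensus assembly to produce the representative sequencing reads of that genus.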
Step 108, each representative sequencing read is aligned against the genome sequences in the microbial database and the alignment is evaluated; when the first calculation comparison condition is met, information is output that the representative sequencing read is consistent with the species to which the genome sequence meeting the first calculation comparison condition belongs; when the second calculation comparison condition is met, the homology relationship between the representative sequencing read and the species to which the genome sequence meeting the second calculation comparison condition belongs is output; and when the representative sequencing read meets neither the first nor the second calculation comparison condition with any genome sequence in the microbial database, information is output that the representative sequencing read is not aligned in the microbial database.
The first calculation comparison condition is that the similarity is greater than a first similarity threshold and the coverage is greater than a first coverage threshold. The second calculation comparison condition is defined with a second similarity threshold that is less than or equal to the first similarity threshold and a second coverage threshold that is less than or equal to the first coverage threshold.
Similarity = number of bases of the representative sequencing read that are identical to a genomic sequence in the microbial database / total length of the representative sequencing read. Coverage = length of the region of the representative sequencing read aligned to a genomic sequence in the microbial database / total length of the representative sequencing read.
Step 109, the number of representative sequencing reads of each species identified as species-consistent is counted, and the microbial abundance of a given species is calculated as: microbial abundance = number of representative sequencing reads corresponding to the species / sum of the numbers of representative sequencing reads of all species identified as species-consistent.
The analysis system can rapidly and accurately identify pathogenic microorganisms, including bacteria, fungi, and viruses. First, the random error rate of nanopore sequencing is relatively high, so erroneous alignments readily arise among homologous microorganisms during species identification and lead to errors in subsequent species identification; an analysis workflow was therefore developed specifically to address species alignment errors. Second, the nanopore sequencing platform sequences in real time, and the analysis system captures and analyzes the data as they are produced, which shortens the waiting time between sequencing and analysis; in addition, multiple algorithms and software tools were compared for alignment against the microbial database, so that the analysis workflow is optimized and results can be obtained in a short time. Finally, the analysis system performs microbial analysis at the genus level and examines how the reads align to the species contained in each genus; if an unknown pathogen is present, its reads can still align to other species of the same genus, but with relatively poor alignment results, and the user is thereby prompted that an unknown pathogen may be present.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that these are by way of example only, and the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the principles and spirit of the invention, but such changes and modifications fall within the scope of the invention.