CN119360966B

Movatterモバイル変換

Info

Publication number: CN119360966B
Application number: CN202411915277.7A
Authority: CN
Inventors: 祝坤; 孙玉娥; 杜扬; 黄河
Original assignee: Suzhou University
Current assignee: Suzhou University
Priority date: 2024-12-24
Filing date: 2024-12-24
Publication date: 2025-04-04
Anticipated expiration: 2044-12-24
Also published as: CN119360966A

Abstract

The invention relates to the technical field of computers, and particularly provides a genome assembly method and a genome assembly system based on iterative k-mer decomposition, wherein the method comprises the steps of processing a sequence fragment set according to a sequence length to obtain a frequency set; the method comprises the steps of obtaining a sequence segment set, obtaining a high-frequency set and a low-frequency set, dividing the frequency set according to a preset frequency threshold value, judging the size of the preset length threshold value and the sequence length, taking the low-frequency set as a new sequence segment set if the preset length threshold value is not larger than the sequence length, performing a value reduction operation on the sequence length, processing the new sequence segment set according to the sequence length obtained by the value reduction operation to obtain a new high-frequency set and a new low-frequency set corresponding to the new sequence segment set, if the preset length threshold value is larger than the sequence length, constructing a Debrucine diagram according to the obtained high-frequency sets, and outputting an assembled sequence set according to the Debrucine diagram. The invention effectively reduces the calculation load and improves the assembly speed and the assembly quality.

Description

Genome assembly method and system based on iterative k-mer decomposition

Technical Field

The invention relates to the technical field of computers, in particular to a genome assembly method and system based on iterative k-mer decomposition.

Background

Genome assembly is one of the core tasks in bioinformatics. Genome assembly refers to the splicing of a large number of short sequence fragments (called sequencing reads) obtained by genomic sequencing techniques to reconstruct the complete genomic sequence. Since sequencing reads cover only a partial region of the genome and may contain repeated sequences and sequencing errors, the assembly process usually requires the identification of overlapping relationships by means of algorithms and graph structures (e.g., deblur graphs), optimizing the splice order to accurately restore the complete sequence of the original genome.

In the prior art, the conventional genome assembly method based on k-mer counting has the limitation of fixed k value, and is difficult to combine the assembly precision and the identification capability of a low coverage area. In order to solve the problem of limitation of the fixed k value, there is also a k-mer assembly method by iterating k values from small to large, which can improve assembly accuracy compared to a conventional assembly method. However, the k-mer assembly method by iterating k values from small to large has the problems of high computational complexity, low processing speed and low-frequency k-mer utilization rate.

Disclosure of Invention

The genome assembly method and the genome assembly system based on iterative k-mer decomposition at least solve the problems of high computational complexity, low assembly quality and low assembly speed of the traditional genome assembly method, and the high-frequency set and the low-frequency set are respectively processed in each iteration by gradually iterating the sequence length from large to small and decomposing the sequence fragment set, so that the computational burden is effectively reduced, and the assembly speed and the assembly quality are improved.

The invention provides a genome assembly method based on iterative k-mer decomposition, which comprises the steps of processing a sequence fragment set according to a sequence length to obtain a frequency set, dividing the first sequence length into a maximum sequence length of the input sequence fragment set according to a preset frequency threshold to obtain a high-frequency set and a low-frequency set, judging the sizes of the preset length threshold and the sequence length, taking the low-frequency set as a new sequence fragment set if the preset length threshold is not larger than the sequence length, performing a value reduction operation on the sequence length, processing the new sequence fragment set according to the sequence length obtained by the value reduction operation to obtain a new high-frequency set and a new low-frequency set corresponding to the new sequence fragment set, constructing a Debrucine diagram according to the obtained high-frequency sets if the preset length threshold is larger than the sequence length, and outputting the assembled sequence set according to the Debrucine diagram.

In one embodiment of the invention, a sequence fragment set is processed according to a sequence length to obtain a frequency set, and the frequency set is obtained by extracting each sequence fragment in the sequence fragment set according to the sequence length to obtain a continuous nucleotide sequence and the number of the continuous nucleotide sequences corresponding to each sequence fragment, and establishing a mapping relation between the continuous nucleotide sequences and the number of the continuous nucleotide sequences.

In one embodiment of the invention, the frequency collection is divided according to a preset frequency threshold value to obtain a high-frequency collection and a low-frequency collection, wherein the frequency collection comprises the steps of comparing the preset frequency threshold value with the number of each continuous nucleotide sequence in the frequency collection, confirming that the corresponding continuous nucleotide sequence belongs to the high-frequency collection if the preset frequency threshold value is smaller than the number of the continuous nucleotide sequences, and confirming that the corresponding continuous nucleotide sequence belongs to the low-frequency collection if the preset frequency threshold value is not smaller than the number of the continuous nucleotide sequences.

In one embodiment of the invention, a De-Brujin graph is constructed according to a plurality of obtained high-frequency sets, the De-Brujin graph comprises the steps of selecting an unprocessed high-frequency set from the high-frequency sets to be used as a set to be repaired, obtaining the current sequence length corresponding to the set to be repaired, judging the current sequence length and the maximum sequence length, confirming that the current set to be repaired is an initial repairing set if the current sequence length is equal to the maximum sequence length, transferring to the high-frequency set to be repaired, selecting an unprocessed high-frequency set from the high-frequency sets to be used as the set to be repaired, processing the current set to be repaired if the current sequence length is smaller than the maximum sequence length, obtaining a repairing set, and selecting an unprocessed high-frequency set to be used as the set to be repaired from the high-frequency sets, wherein the sequence length of the current set to be repaired is equal to the maximum sequence length, merging the initial repairing set and the high-frequency set to be repaired, and constructing the high-frequency Brujin graph according to the high-frequency repairing set.

In one embodiment of the invention, the current set to be repaired is processed to obtain a repaired set, and the method comprises the steps of obtaining a mapping relation between each continuous nucleotide sequence and the number of the continuous nucleotide sequences in the set to be repaired, processing the continuous nucleotide sequences according to the mapping relation to obtain a reduced sequence segment, and extracting each reduced sequence segment according to the maximum sequence length to obtain the repaired set.

In one embodiment of the invention, outputting the assembly sequence set according to the Debrucine graph comprises the steps of optimizing the Debrucine graph to obtain an optimized Debrucine graph, traversing the optimized Debrucine graph and outputting the assembly sequence set.

In one embodiment of the invention, the optimization operation includes at least one of merging duplicate paths, merging bubble structures, and culling small branches.

The invention further provides a genome assembly system based on iterative k-mer decomposition, which comprises a processing module, a dividing module, an iterative decomposition module and an assembly module, wherein the processing module is used for processing a sequence fragment set according to a sequence length to obtain a frequency set, the first sequence length is the maximum sequence length of the input sequence fragment set, the dividing module is used for dividing the frequency set according to a preset frequency threshold to obtain a high-frequency set and a low-frequency set, the iterative decomposition module is used for judging the preset length threshold and the sequence length, the iterative decomposition module is used for judging the low-frequency set as a new sequence fragment set when the preset length threshold is not larger than the sequence length, performing a value reduction operation on the new sequence fragment set and transmitting the new sequence fragment set and the sequence length obtained according to the value reduction operation to the processing module, transmitting the obtained high-frequency set to the assembly module when the preset length threshold is larger than the sequence length, and constructing a plurality of assembly graph sets according to the sequence length, and the assembly graph is used for the assembly graph.

In a third aspect, the invention also provides an electronic device comprising a processor, and a memory storing a program comprising instructions that when executed by the processor cause the processor to perform the iterative k-mer decomposition based genome assembly method of any of the above.

In a fourth aspect, the present invention also provides a non-transitory machine readable medium storing computer instructions for causing the computer to perform the genome assembly method based on iterative k-mer decomposition as defined in any one of the above.

Compared with the prior art, the technical scheme of the invention has the following beneficial effects:

According to the genome assembly method and system based on iterative k-mer decomposition, the sequence length is gradually reduced from the largest sequence length of an input sequence fragment set, and the sequence fragment set is processed by using the corresponding sequence length, so that a frequency set and corresponding high-frequency and low-frequency sets are obtained. By adopting a large-to-small processing mode, the calculation speed can be increased while the data set scale is gradually reduced, and the efficient assembly of large-scale genome data is realized, so that the assembly speed is effectively improved.

The high frequency set is then reserved as a reliable set and used for subsequent construction of the Debrucine plot and assembly sequence, while the low frequency set is used as an unreliable set for the next round of iterative processing. First, since only the unreliable sets are decomposed in each iteration, the data volume of each round can be significantly reduced, so that the calculation efficiency is improved and the calculation load is reduced by reducing the data scale. And secondly, the low-frequency set is not directly discarded as the method in the prior art, but is used for the next round of iterative processing and splitting, so that the low-frequency set which is easy to ignore originally is reserved and utilized, the data utilization rate is improved, and the final assembly quality is further improved. And finally, the Debrucine graph can be constructed according to the obtained multiple high-frequency sets, so that the problem of broken traversing paths of the constructed Debrucine graph is effectively prevented, the final output assembly sequence set is longer, and the assembly effect is more excellent.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is evident that the drawings in the following description are only some embodiments of the invention, from which other embodiments can be obtained for a person skilled in the art without inventive effort. In the drawings:

FIG. 1 is a flow chart of a genome assembly method based on iterative k-mer decomposition in a preferred embodiment of the invention.

FIG. 2 is a schematic diagram of a genome assembly system based on iterative k-mer decomposition in a preferred embodiment of the present invention.

Wherein the above figures include the following reference numerals:

10. The device comprises a processing module, a dividing module, an iterative decomposition module, and an assembly module.

Detailed Description

Embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While the invention is susceptible of embodiment in the drawings, it is to be understood that the invention may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided to provide a more thorough and complete understanding of the invention. It should be understood that the drawings and embodiments of the invention are for illustration purposes only and are not intended to limit the scope of the present invention.

In the prior art, the genome assembly method based on k-mer count has the advantages of efficiently processing large-scale data, accurately identifying repeated areas, optimizing memory and using computing resources. The method reduces sequencing noise through frequency screening, and can flexibly adjust k value to adapt to genomes with different complexity, so that the assembly speed is faster, the accuracy is higher, and the method is particularly suitable for rapid assembly tasks of large-scale genomes. Therefore, the k-mer-based genome assembly method is becoming the mainstream.

Wherein k-mer refers to a contiguous nucleotide sequence of length k. By extracting fixed length k-mers in each sequence segment (read), complex genome assembly problems can be translated into k-mer map structure stitching problems. Current k-mer based assembly methods rely primarily on building debluring diagrams that can efficiently process and represent redundant information in large-scale k-mer sets, thereby improving assembly efficiency.

Specifically, during genome assembly based on k-mer counts, first, a set of k-mers of all length k is extracted from the sequencing data, then each k-mer is treated as a node in the graph by constructing Jiande brucine's graph, adjacent k-mers share overlapping regions of length k-1 and are connected by edges in the graph. In this way, the genome assembly problem can be converted to a path traversal problem of the graph, thereby identifying the integrity of the sequence. The Debrucine plot has significant advantages in processing large-scale data, enabling efficient integration of repetitive sequences while reducing storage space.

However, the construction and traversal efficiency of the deblur graph is affected by the magnitude of the k-value. In the traditional assembly method based on fixed k values, the selection of proper k values is important, k-mers with larger k values can more effectively identify and splice long repeated areas, and error connection is reduced, so that assembly accuracy is improved, but identification capability of low coverage areas is poor, and breakage is easy to cause. The k-mers with smaller k values can more comprehensively capture short overlapping areas, are suitable for splicing in low coverage areas, but are easy to generate redundant paths and error splicing, so that the assembly accuracy is reduced. Therefore, the choice of a fixed k value is a technical challenge that is difficult to balance.

In order to solve the limitations of the fixed k-value method, there is a prior art in which genome assembly is performed by a k-mer assembly method of iterating k-values from small to large. I.e. increasing the k value step by step, a small to large deblur plot is constructed. The method first extracts k-mers using smaller k-values and constructs preliminary contigs (i.e., spliced contiguous gene sequence fragments), thereby capturing short repeat regions in the sequence. Subsequently, the k value is gradually increased to eliminate redundant paths and identify longer overlap regions. Finally, the long repeated region in the k-mer spliced genome with larger k value is utilized to generate a more complete genome structure. The method improves the accuracy of genome assembly to a certain extent by re-extracting and splicing k-mers under different k values.

However, the k-mer assembly method by iterating k-values from small to large also has significant drawbacks. First, it is computationally complex. By iterating k-mer assembly methods from small to large k-values, the k-mer count and the Debrucine plot construction need to be independently performed first at smaller k-values, which can result in more complex computation processes due to the larger number of k-mers generated by small k-values. Second, the processing speed is slow. When the k value is replaced each time, a k-mer set with different k values is regenerated, and a corresponding Debrucine graph is constructed, namely, the Debrucine graph is reconstructed and traversed in the iterative process of the different k values, so that the processing speed is slower. Finally, the low frequency k-mer utilization is low. In k-mer assembly methods with iterative k-values from small to large, low-frequency k-mers at different k-values are usually ignored or discarded, whereas low-frequency k-mers in a real dataset may account for a large proportion, and direct discarding may result in partial genome information loss, reducing assembly quality.

In order to solve the shortcomings of the k-mer assembly method with small to large iteration k values in the prior art in terms of calculation overhead, processing speed and data utilization rate, referring to fig. 1, the invention provides a genome assembly method based on iteration k-mer decomposition, which comprises the following steps:

And processing the sequence fragment set according to the sequence length (i.e. k) to obtain a frequency number set.

Dividing the frequency collection according to a preset frequency threshold value to obtain a high-frequency collection and a low-frequency collection.

And judging the preset length threshold value and the sequence length.

If the preset length threshold is not greater than the sequence length, the low-frequency set is used as a new sequence fragment set, the sequence length is subjected to the value reduction operation, and the new sequence fragment set is processed according to the sequence length obtained by the value reduction operation, so that a new high-frequency set and a new low-frequency set corresponding to the new sequence fragment set are obtained.

If the preset length threshold is greater than the sequence length, a Debrucine diagram is constructed according to the obtained multiple high-frequency sets, and an assembled sequence set (namely an assembled contig set) is output according to the Debrucine diagram.

It should be noted that the sequence fragment set includes a plurality of sequence fragments (i.e., a plurality of reads). Each sequence fragment (i.e., read) represents a sequence of genomic sequences, which is about 50 to 300bp in length. After the collection of sequence fragments is obtained, the total number of corresponding sequence fragments, the maximum sequence length (i.e., k_max), and the minimum sequence length (i.e., k_min) are all known amounts.

When processing a sequence fragment set according to sequence length, the first sequence length, i.e. the maximum sequence length, is gradually reduced in sequence length. Each frequency set includes sequence fragments, sequence lengths, and corresponding consecutive nucleotide sequences and number of consecutive nucleotide sequences information. Illustratively, after a new set of frequency numbers is obtained, it is stored in a data structure, such as a hash table. The processing mode from large to small can gradually reduce the size of the data set, simultaneously accelerate the calculation speed, realize the efficient assembly of large-scale genome data and effectively improve the assembly speed.

It should be noted that, the preset frequency threshold is a preset value, and those skilled in the art can set the preset frequency threshold as required. For example, the preset frequency threshold is set to 1, and the number of continuous nucleotide sequences greater than 1 in the frequency set belongs to the high-frequency set, and conversely belongs to the low-frequency set. The high-frequency set is used as a reliable set, and can be reserved and used for subsequent construction of a Debrucine diagram and an assembly sequence. While the low frequency set is used as an unreliable set for the next round of iterative processing.

Specifically, after judging the preset length threshold and the current sequence length, if the preset length threshold is not greater than the current sequence length, the low-frequency set is used as a new sequence fragment set. Preferably, the preset length threshold is set to the minimum sequence length. Meanwhile, the sequence length is subjected to a subtracting operation. Wherein the person skilled in the art can set the value of the decrement depending on the actual requirements. Then, a new sequence length is obtained according to the subtracting operation, and a new sequence fragment set (namely, an old low-frequency set obtained in the last round) is processed to obtain a new frequency set. And then dividing the new frequency collection according to a preset frequency threshold value to obtain a new high-frequency collection and a new low-frequency collection corresponding to the new sequence fragment collection. Also, after new high frequency and low frequency sets are obtained, the high frequency set remains as a reliable set, while the low frequency set remains as an unreliable set for the next round of iterative processing.

By processing the high-frequency set and the low-frequency set respectively, firstly, only the unreliable set is decomposed in each iteration, so that the data volume of each round can be obviously reduced, and the calculation efficiency and the calculation burden are improved by reducing the data scale. And secondly, the low-frequency set is not directly discarded as the method in the prior art, but is used for the next round of iterative processing and splitting, so that the low-frequency set which is easy to ignore originally is reserved and utilized, the data utilization rate is improved, and the final assembly quality is further improved. And finally, the Debrucine graph can be constructed according to the obtained multiple high-frequency sets, so that the problem of broken traversing paths of the constructed Debrucine graph is effectively prevented, the final output assembly sequence set is longer, and the assembly effect is more excellent.

After iteration is performed until the preset length threshold is greater than the sequence length, all sequence fragments of the low-frequency set cannot be extracted, a Debrucine diagram can be constructed according to the obtained high-frequency sets, and an assembled sequence set is output according to the Debrucine diagram.

According to the genome assembly method based on iterative k-mer decomposition, from the maximum sequence length of an input sequence fragment set, the sequence length is gradually reduced from large to small, and the sequence fragment set is processed by using the corresponding sequence length, so that a frequency set, a corresponding high-frequency set and a corresponding low-frequency set are obtained. By adopting a large-to-small processing mode, the calculation speed can be increased while the data set scale is gradually reduced, and the efficient assembly of large-scale genome data is realized, so that the assembly speed is effectively improved.

In some embodiments, the genome assembly method based on iterative k-mer decomposition according to the present invention processes a sequence segment set according to a sequence length to obtain a frequency set, including:

and extracting each sequence fragment in the sequence fragment set according to the sequence length to obtain a continuous nucleotide sequence and the number of the continuous nucleotide sequences corresponding to each sequence fragment.

And establishing a mapping relation between the continuous nucleotide sequences and the number of the continuous nucleotide sequences to obtain a frequency collection.

The sequence length of the continuous nucleotide sequence obtained by extraction, that is, the sequence length used at the time of the current extraction, is also described. Illustratively, the current sequence length is 47, and the sequence length of each successive nucleotide sequence obtained after the corresponding sequence fragment is extracted is also 47. The number of consecutive nucleotide sequences, i.e., the number of consecutive nucleotide sequences that can be extracted from the corresponding sequence fragment after extraction. How to extract sequence fragments according to the sequence length belongs to the prior art and is not described in detail.

Although the sequences used for extraction are consistent in length, the number of consecutive nucleotide sequences and consecutive nucleotide sequences that can be extracted will be different if the sequence fragments in the current sequence fragment set are different. Thus, statistics are required after each sequence segment in the current set of sequence segments is extracted. And establishing a mapping relation between the continuous nucleotide sequences and the number of the continuous nucleotide sequences according to the statistical result to obtain a frequency collection. And then dividing the frequency collection according to a preset frequency threshold value.

It can be imagined that since the sequence length is gradually reduced from the maximum sequence length from large to small, the calculation speed can be increased while the data set size is gradually reduced, and the efficient assembly of large-scale genome data can be realized, so that the assembly speed is effectively improved.

In some embodiments, the genome assembly method based on iterative k-mer decomposition according to the present invention divides a frequency set according to a preset frequency threshold to obtain a high-frequency set and a low-frequency set, including:

comparing the preset frequency threshold value with the number of each continuous nucleotide sequence in the frequency collection.

If the preset frequency threshold value is smaller than the number of the continuous nucleotide sequences, confirming that the corresponding continuous nucleotide sequences belong to a high-frequency set. If the preset frequency threshold is not smaller than the number of the continuous nucleotide sequences, confirming that the corresponding continuous nucleotide sequences belong to a low-frequency set.

In some embodiments, the genome assembly method based on iterative k-mer decomposition constructs a Debrucine diagram according to the obtained plurality of high-frequency sets, including:

And selecting an untreated high-frequency set from the plurality of high-frequency sets as a set to be repaired.

And acquiring the length of the current sequence corresponding to the set to be repaired.

And judging the sizes of the current sequence length and the maximum sequence length.

If the current sequence length is equal to the maximum sequence length, confirming that the current set to be repaired is an initial repair set, turning to select an untreated high-frequency set from a plurality of high-frequency sets as the set to be repaired, and carrying out corresponding treatment.

If the current sequence length is smaller than the maximum sequence length, the current set to be repaired is processed to obtain a repaired set, and the repaired set is transferred to a plurality of high-frequency sets, and an unprocessed high-frequency set is selected as the set to be repaired. Wherein the sequence length of the patch set is equal to the maximum sequence length.

And combining the initial repair set and the repair set to obtain a repair high-frequency set.

And constructing a Debrucine graph according to the patch high-frequency set.

It can be imagined that, when assembling, since the sequence length is iterated from large to small from the maximum sequence length, after the frequency collection is divided according to the preset frequency threshold, the current high-frequency collection can be obtained as the initial repair collection or the collection to be repaired which needs to be processed. And combining each processed repair set with the initial repair set, and treating the step as a repair process. The Debrucine graph is constructed by using the combined repair high-frequency set, so that the problem of broken traversing paths of the Debrucine graph can be effectively prevented, the final output assembly sequence set is longer, and the assembly effect is more excellent.

Further, in some embodiments, the genome assembly method based on iterative k-mer decomposition according to the present invention processes a current set to be repaired to obtain a repaired set, including:

And obtaining the mapping relation between each continuous nucleotide sequence and the number of the continuous nucleotide sequences in the set to be patched.

And processing the continuous nucleotide sequence according to the mapping relation to obtain a reduced sequence fragment.

And extracting each reduced sequence segment according to the maximum sequence length to obtain a patch set.

The mapping relationship between each continuous nucleotide sequence and the number of continuous nucleotide sequences is a known amount, and the sequence fragments can be obtained by extracting the sequence fragments according to the sequence length. The prior art is not repeated as to how to process a continuous nucleotide sequence to obtain a reduced sequence fragment. By the step, the sequence lengths of the patching set and the initial patching set are equal, so that the two sets can be combined to obtain the patching high-frequency set. The Debrucine graph is constructed by using the combined repair high-frequency set, so that the problem of broken traversing paths of the Debrucine graph can be effectively prevented, the final output assembly sequence set is longer, and the assembly effect is more excellent.

The genome assembly method based on iterative k-mer decomposition of the present invention, in some embodiments, outputs an assembly sequence set according to a debluring diagram, comprises:

And carrying out optimization operation on the Debrucine graph to obtain an optimized Debrucine graph.

And traversing the optimized Debrucine graph to output an assembly sequence set.

By means of the optimization operation, the size of the graph can be effectively reduced, and the computing and storage efficiency can be improved. And (3) after traversing the optimized Debrucine graph, outputting an assembly result, namely an assembly sequence set.

Effect verification is performed as follows. Specifically, partial genome sequencing data of E.coli and corresponding reference genomes are downloaded from the public database NCBI, the genome sequencing data are used as an input sequence fragment set, and the corresponding reference genomes are used as a control. The maximum sequence length is set to 47, the minimum sequence length is set to 27, the variation of the sequence length after each iteration is 5, and the preset length threshold is set to 1. According to the genome assembly method based on iterative k-mer decomposition and the k-mer assembly method from small to large iterative k values in the prior art, respectively, an input sequence fragment set is processed, and after assembly is completed, a result and a reference genome of comparison are submitted to an assembly effect evaluation tool QUAST to be subjected to verification evaluation.

Wherein, the k-mer assembly method from small to large iteration k value in the prior art has 721.94bp matching error in every 100,000 bp. The genome assembly method based on iterative k-mer decomposition has 688.63bp matching errors in every 100,000bp, and has simpler calculation process and faster technical processing speed.

The embodiment of the invention also provides a genome assembly system based on iterative k-mer decomposition, which comprises a processing module 10, a dividing module 20, an iterative decomposition module 30 and an assembly module 40.

The processing module 10 is configured to process the sequence segment set according to the sequence length to obtain a frequency number set.

The dividing module 20 is configured to divide the frequency set according to a preset frequency threshold value, so as to obtain a high frequency set and a low frequency set.

The iterative decomposition module 30 is configured to determine a preset length threshold and a sequence length. When the preset length threshold is not greater than the sequence length, the iterative decomposition module 30 takes the low-frequency set as a new sequence segment set, performs a value reduction operation on the sequence length, and transmits the new sequence segment set and the sequence length obtained according to the value reduction operation to the processing module 10. When the preset length threshold is greater than the sequence length, the iterative decomposition module 30 transmits the obtained plurality of high frequency sets to the assembly module 40.

The assembling module 40 is configured to construct a deblur graph according to the obtained multiple high-frequency sets, and output an assembled sequence set according to the deblur graph.

When the processing module 10 processes the sequence segment set according to the sequence length, the first sequence length, i.e. the maximum sequence length, is gradually reduced in sequence length. Each frequency set includes sequence fragments, sequence lengths, and corresponding consecutive nucleotide sequences and number of consecutive nucleotide sequences information. Illustratively, after a new set of frequency numbers is obtained, it is stored in a data structure, such as a hash table. The processing mode from large to small can gradually reduce the size of the data set, simultaneously accelerate the calculation speed, realize the efficient assembly of large-scale genome data and effectively improve the assembly speed.

Specifically, after the iterative decomposition module 30 determines the preset length threshold and the current sequence length, if the preset length threshold is not greater than the current sequence length, the low-frequency set is used as the new sequence segment set. Preferably, the preset length threshold is set to the minimum sequence length. Meanwhile, the sequence length is subjected to a subtracting operation. Wherein the person skilled in the art can set the value of the decrement depending on the actual requirements. The iterative decomposition module 30 then transmits the new sequence length and the new set of sequence segments to the processing module 10 for corresponding processing and obtaining a new set of frequency numbers. Next, the new frequency number set is divided by the dividing module 20 to obtain a new high frequency set and a new low frequency set corresponding to the new sequence segment set. Also, after new high frequency and low frequency sets are obtained, the high frequency set remains as a reliable set, while the low frequency set remains as an unreliable set for the next round of iterative processing.

After the iteration is performed until the preset length threshold is greater than the sequence length, all sequence fragments of the low-frequency set cannot be extracted, at this time, all obtained high-frequency sets can be transmitted to the assembly module 40, the assembly module 40 constructs a Debrucine diagram according to the obtained high-frequency sets, and the assembly sequence set is output according to the Debrucine diagram.

The genome assembly system based on iterative k-mer decomposition gradually reduces the sequence length from large to small from the maximum sequence length of an input sequence fragment set, and processes the sequence fragment set by using the corresponding sequence length to obtain a frequency set and a corresponding high-frequency set and low-frequency set. By adopting a large-to-small processing mode, the calculation speed can be increased while the data set scale is gradually reduced, and the efficient assembly of large-scale genome data is realized, so that the assembly speed is effectively improved.

The invention also provides an electronic device comprising a processor and a memory storing a program. The program comprises instructions that when executed by a processor cause the processor to perform the genome assembly method based on iterative k-mer decomposition described in any of the embodiments above.

The invention also provides a non-transitory machine readable medium storing a computer program, wherein the computer instructions are configured to cause a computer to perform the genome assembly method based on iterative k-mer decomposition according to any one of the embodiments described above.

It should be noted that electronic devices are intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.

A computer program for implementing the methods of embodiments of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of embodiments of the present invention, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable signal medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, or infrared system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

It should be noted that the term "comprising" and its variants as used in the embodiments of the present invention are open-ended, i.e. "including but not limited to". The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment," another embodiment "means" at least one additional embodiment, "and" some embodiments "means" at least some embodiments. The references to "one" or "a plurality" of modifications in the embodiments of the invention are intended to be illustrative rather than limiting, and it will be understood by those skilled in the art that "one or more" is intended to be interpreted as "one or more" unless the context clearly indicates otherwise.

User information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) according to embodiments of the present invention are information and data authorized by a user or sufficiently authorized by parties, and the collection, use and processing of relevant data requires compliance with relevant laws and regulations and standards of relevant countries and regions, and is provided with corresponding operation portals for the user to select authorization or denial.

The steps recited in the method embodiments provided by the embodiments of the present invention may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the invention is not limited in this respect.

The term "embodiment" in this specification means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the invention. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive. The various embodiments in this specification are described in a related manner, with identical and similar parts being referred to each other. In particular, for apparatus, devices, system embodiments, the description is relatively simple as it is substantially similar to method embodiments, see for relevant part of the description of method embodiments.

The above examples merely represent a few embodiments of the present invention, which are described in more detail and are not to be construed as limiting the scope of protection. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of the invention should be assessed as that of the appended claims.

Claims

Translated fromChinese

1.一种基于迭代k-mer分解的基因组组装方法，其特征在于，包括：1. A genome assembly method based on iterative k-mer decomposition, comprising:

根据序列长度处理序列片段集合，得到频数集合；其中，包括根据所述序列长度，提取所述序列片段集合中的每个序列片段，得到每个所述序列片段所对应的连续核苷酸序列和连续核苷酸序列个数；建立所述连续核苷酸序列和所述连续核苷酸序列个数之间的映射关系，得到所述频数集合；首个所述序列长度为输入的所述序列片段集合的最大序列长度；Processing the sequence fragment set according to the sequence length to obtain a frequency set; wherein, the method includes extracting each sequence fragment in the sequence fragment set according to the sequence length to obtain a continuous nucleotide sequence and the number of continuous nucleotide sequences corresponding to each sequence fragment; establishing a mapping relationship between the continuous nucleotide sequence and the number of continuous nucleotide sequences to obtain the frequency set; the first sequence length is the maximum sequence length of the input sequence fragment set;

根据预设频数阈值，划分所述频数集合，得到高频集合和低频集合；According to a preset frequency threshold, the frequency set is divided into a high-frequency set and a low-frequency set;

判断预设长度阈值和所述序列长度的大小；Determine the size of a preset length threshold and the length of the sequence;

若所述预设长度阈值不大于所述序列长度，则将所述低频集合作为新的所述序列片段集合，对所述序列长度进行减值操作，根据减值操作得到的所述序列长度，处理新的所述序列片段集合，以得到新的所述序列片段集合所对应的新的高频集合和低频集合；If the preset length threshold is not greater than the sequence length, taking the low-frequency set as the new sequence fragment set, performing a decrement operation on the sequence length, and processing the new sequence fragment set according to the sequence length obtained by the decrement operation to obtain a new high-frequency set and a low-frequency set corresponding to the new sequence fragment set;

若所述预设长度阈值大于所述序列长度，则根据得到的多个所述高频集合构建德布鲁因图，并根据所述德布鲁因图输出组装序列集合。If the preset length threshold is greater than the sequence length, a de Bruijn graph is constructed according to the obtained multiple high-frequency sets, and an assembled sequence set is output according to the de Bruijn graph.

2.根据权利要求1所述的基于迭代k-mer分解的基因组组装方法，其特征在于，根据预设频数阈值，划分所述频数集合，得到高频集合和低频集合，包括：2. The genome assembly method based on iterative k-mer decomposition according to claim 1, characterized in that the frequency set is divided according to a preset frequency threshold to obtain a high-frequency set and a low-frequency set, comprising:

比较所述预设频数阈值，以及所述频数集合中的每个连续核苷酸序列个数的大小；Comparing the preset frequency threshold and the number of each continuous nucleotide sequence in the frequency set;

若所述预设频数阈值小于所述连续核苷酸序列个数，则确认相应的连续核苷酸序列属于所述高频集合；If the preset frequency threshold is less than the number of the continuous nucleotide sequences, then confirming that the corresponding continuous nucleotide sequence belongs to the high frequency set;

若所述预设频数阈值不小于所述连续核苷酸序列个数，则确认相应的连续核苷酸序列属于所述低频集合。If the preset frequency threshold is not less than the number of the continuous nucleotide sequences, it is confirmed that the corresponding continuous nucleotide sequences belong to the low-frequency set.

3.根据权利要求1所述的基于迭代k-mer分解的基因组组装方法，其特征在于，根据得到的多个所述高频集合构建德布鲁因图，包括：3. The genome assembly method based on iterative k-mer decomposition according to claim 1, characterized in that constructing a de Bruijn graph according to the obtained multiple high-frequency sets comprises:

从多个所述高频集合中，选取一个未处理过的所述高频集合作为待修补集合；Selecting an unprocessed high frequency set from the plurality of high frequency sets as a set to be repaired;

获取所述待修补集合所对应的当前序列长度；Obtaining the current sequence length corresponding to the set to be repaired;

判断所述当前序列长度和所述最大序列长度的大小；Determine the size of the current sequence length and the maximum sequence length;

若所述当前序列长度等于所述最大序列长度，则确认当前的所述待修补集合为初始修补集合，并转至从多个所述高频集合中，选取一个未处理过的所述高频集合作为待修补集合；If the current sequence length is equal to the maximum sequence length, confirming that the current set to be repaired is an initial repair set, and switching to selecting an unprocessed high frequency set from the plurality of high frequency sets as the set to be repaired;

若所述当前序列长度小于所述最大序列长度，则处理当前的所述待修补集合，得到修补集合，并转至从多个所述高频集合中，选取一个未处理过的所述高频集合作为待修补集合；其中，所述修补集合的序列长度与所述最大序列长度相等；If the current sequence length is less than the maximum sequence length, the current set to be repaired is processed to obtain a repair set, and then an unprocessed high frequency set is selected from the multiple high frequency sets as the set to be repaired; wherein the sequence length of the repair set is equal to the maximum sequence length;

将所述初始修补集合和所述修补集合进行合并，得到修补高频集合；Merging the initial patch set and the patch set to obtain a patch high-frequency set;

根据所述修补高频集合构建所述德布鲁因图。The de Bruijn graph is constructed based on the patched high frequency set.

4.根据权利要求3所述的基于迭代k-mer分解的基因组组装方法，其特征在于，处理当前的所述待修补集合，得到修补集合，包括：4. The genome assembly method based on iterative k-mer decomposition according to claim 3, characterized in that processing the current set to be repaired to obtain a repaired set comprises:

获取所述待修补集合中，每个连续核苷酸序列和连续核苷酸序列个数之间的映射关系；Obtaining a mapping relationship between each continuous nucleotide sequence and the number of continuous nucleotide sequences in the to-be-repaired set;

根据所述映射关系处理连续核苷酸序列，得到还原序列片段；Processing the continuous nucleotide sequence according to the mapping relationship to obtain a restored sequence fragment;

根据所述最大序列长度，提取每个所述还原序列片段，得到所述修补集合。According to the maximum sequence length, each of the restored sequence fragments is extracted to obtain the repair set.

5.根据权利要求1所述的基于迭代k-mer分解的基因组组装方法，其特征在于，根据所述德布鲁因图输出组装序列集合，包括：5. The genome assembly method based on iterative k-mer decomposition according to claim 1, characterized in that the assembly sequence set is output according to the de Bruijn graph, comprising:

将所述德布鲁因图进行优化操作，得到优化德布鲁因图；Optimizing the de Bruijn graph to obtain an optimized de Bruijn graph;

将所述优化德布鲁因图进行遍历操作，输出所述组装序列集合。The optimized De Bruijn graph is traversed to output the assembly sequence set.

6.根据权利要求5所述的基于迭代k-mer分解的基因组组装方法，其特征在于：6. The genome assembly method based on iterative k-mer decomposition according to claim 5, characterized in that:

所述优化操作包括合并重复路径、合并气泡结构和剔除小分支中的至少一项。The optimization operation includes at least one of merging repeated paths, merging bubble structures, and removing small branches.

7.一种基于迭代k-mer分解的基因组组装系统，其特征在于，包括：处理模块、划分模块、迭代分解模块、组装模块；7. A genome assembly system based on iterative k-mer decomposition, characterized by comprising: a processing module, a partitioning module, an iterative decomposition module, and an assembly module;

所述处理模块用于根据序列长度处理序列片段集合，得到频数集合；其中，包括根据所述序列长度，提取所述序列片段集合中的每个序列片段，得到每个所述序列片段所对应的连续核苷酸序列和连续核苷酸序列个数；建立所述连续核苷酸序列和所述连续核苷酸序列个数之间的映射关系，得到所述频数集合；首个所述序列长度为输入的所述序列片段集合的最大序列长度；The processing module is used to process the sequence fragment set according to the sequence length to obtain a frequency set; wherein, it includes extracting each sequence fragment in the sequence fragment set according to the sequence length to obtain a continuous nucleotide sequence and the number of continuous nucleotide sequences corresponding to each sequence fragment; establishing a mapping relationship between the continuous nucleotide sequence and the number of continuous nucleotide sequences to obtain the frequency set; the first sequence length is the maximum sequence length of the input sequence fragment set;

所述划分模块用于根据预设频数阈值，划分所述频数集合，得到高频集合和低频集合；The division module is used to divide the frequency set according to a preset frequency threshold to obtain a high-frequency set and a low-frequency set;

所述迭代分解模块用于判断预设长度阈值和所述序列长度的大小；其中，在所述预设长度阈值不大于所述序列长度时，所述迭代分解模块将所述低频集合作为新的所述序列片段集合，对所述序列长度进行减值操作，并将新的所述序列片段集合、以及根据减值操作得到的所述序列长度传输给所述处理模块；在所述预设长度阈值大于所述序列长度时，所述迭代分解模块将得到的多个所述高频集合传输给所述组装模块；The iterative decomposition module is used to determine the size of a preset length threshold and the sequence length; wherein, when the preset length threshold is not greater than the sequence length, the iterative decomposition module uses the low-frequency set as a new set of sequence fragments, performs a subtraction operation on the sequence length, and transmits the new set of sequence fragments and the sequence length obtained according to the subtraction operation to the processing module; when the preset length threshold is greater than the sequence length, the iterative decomposition module transmits the obtained multiple high-frequency sets to the assembly module;

所述组装模块用于根据得到的多个所述高频集合构建德布鲁因图，并根据所述德布鲁因图输出组装序列集合。The assembly module is used to construct a de Bruijn graph according to the obtained multiple high-frequency sets, and output an assembly sequence set according to the de Bruijn graph.

8.一种电子设备，包括：处理器，以及存储程序的存储器，其特征在于，所述程序包括指令，所述指令在由所述处理器执行时使所述处理器执行根据权利要求1至6中任一项所述的基于迭代k-mer分解的基因组组装方法。8. An electronic device comprising: a processor, and a memory for storing a program, wherein the program comprises instructions, which, when executed by the processor, cause the processor to execute the genome assembly method based on iterative k-mer decomposition according to any one of claims 1 to 6.

9.一种存储有计算机指令的非瞬时机器可读介质，其特征在于，所述计算机指令用于使所述计算机执行根据权利要求1至6中任一项所述的基于迭代k-mer分解的基因组组装方法。9. A non-transitory machine-readable medium storing computer instructions, characterized in that the computer instructions are used to enable the computer to execute the genome assembly method based on iterative k-mer decomposition according to any one of claims 1 to 6.