Disclosure of Invention
The genome assembly method and the genome assembly system based on iterative k-mer decomposition at least solve the problems of high computational complexity, low assembly quality and low assembly speed of the traditional genome assembly method, and the high-frequency set and the low-frequency set are respectively processed in each iteration by gradually iterating the sequence length from large to small and decomposing the sequence fragment set, so that the computational burden is effectively reduced, and the assembly speed and the assembly quality are improved.
The invention provides a genome assembly method based on iterative k-mer decomposition, which comprises the steps of processing a sequence fragment set according to a sequence length to obtain a frequency set, dividing the first sequence length into a maximum sequence length of the input sequence fragment set according to a preset frequency threshold to obtain a high-frequency set and a low-frequency set, judging the sizes of the preset length threshold and the sequence length, taking the low-frequency set as a new sequence fragment set if the preset length threshold is not larger than the sequence length, performing a value reduction operation on the sequence length, processing the new sequence fragment set according to the sequence length obtained by the value reduction operation to obtain a new high-frequency set and a new low-frequency set corresponding to the new sequence fragment set, constructing a Debrucine diagram according to the obtained high-frequency sets if the preset length threshold is larger than the sequence length, and outputting the assembled sequence set according to the Debrucine diagram.
In one embodiment of the invention, a sequence fragment set is processed according to a sequence length to obtain a frequency set, and the frequency set is obtained by extracting each sequence fragment in the sequence fragment set according to the sequence length to obtain a continuous nucleotide sequence and the number of the continuous nucleotide sequences corresponding to each sequence fragment, and establishing a mapping relation between the continuous nucleotide sequences and the number of the continuous nucleotide sequences.
In one embodiment of the invention, the frequency collection is divided according to a preset frequency threshold value to obtain a high-frequency collection and a low-frequency collection, wherein the frequency collection comprises the steps of comparing the preset frequency threshold value with the number of each continuous nucleotide sequence in the frequency collection, confirming that the corresponding continuous nucleotide sequence belongs to the high-frequency collection if the preset frequency threshold value is smaller than the number of the continuous nucleotide sequences, and confirming that the corresponding continuous nucleotide sequence belongs to the low-frequency collection if the preset frequency threshold value is not smaller than the number of the continuous nucleotide sequences.
In one embodiment of the invention, a De-Brujin graph is constructed according to a plurality of obtained high-frequency sets, the De-Brujin graph comprises the steps of selecting an unprocessed high-frequency set from the high-frequency sets to be used as a set to be repaired, obtaining the current sequence length corresponding to the set to be repaired, judging the current sequence length and the maximum sequence length, confirming that the current set to be repaired is an initial repairing set if the current sequence length is equal to the maximum sequence length, transferring to the high-frequency set to be repaired, selecting an unprocessed high-frequency set from the high-frequency sets to be used as the set to be repaired, processing the current set to be repaired if the current sequence length is smaller than the maximum sequence length, obtaining a repairing set, and selecting an unprocessed high-frequency set to be used as the set to be repaired from the high-frequency sets, wherein the sequence length of the current set to be repaired is equal to the maximum sequence length, merging the initial repairing set and the high-frequency set to be repaired, and constructing the high-frequency Brujin graph according to the high-frequency repairing set.
In one embodiment of the invention, the current set to be repaired is processed to obtain a repaired set, and the method comprises the steps of obtaining a mapping relation between each continuous nucleotide sequence and the number of the continuous nucleotide sequences in the set to be repaired, processing the continuous nucleotide sequences according to the mapping relation to obtain a reduced sequence segment, and extracting each reduced sequence segment according to the maximum sequence length to obtain the repaired set.
In one embodiment of the invention, outputting the assembly sequence set according to the Debrucine graph comprises the steps of optimizing the Debrucine graph to obtain an optimized Debrucine graph, traversing the optimized Debrucine graph and outputting the assembly sequence set.
In one embodiment of the invention, the optimization operation includes at least one of merging duplicate paths, merging bubble structures, and culling small branches.
The invention further provides a genome assembly system based on iterative k-mer decomposition, which comprises a processing module, a dividing module, an iterative decomposition module and an assembly module, wherein the processing module is used for processing a sequence fragment set according to a sequence length to obtain a frequency set, the first sequence length is the maximum sequence length of the input sequence fragment set, the dividing module is used for dividing the frequency set according to a preset frequency threshold to obtain a high-frequency set and a low-frequency set, the iterative decomposition module is used for judging the preset length threshold and the sequence length, the iterative decomposition module is used for judging the low-frequency set as a new sequence fragment set when the preset length threshold is not larger than the sequence length, performing a value reduction operation on the new sequence fragment set and transmitting the new sequence fragment set and the sequence length obtained according to the value reduction operation to the processing module, transmitting the obtained high-frequency set to the assembly module when the preset length threshold is larger than the sequence length, and constructing a plurality of assembly graph sets according to the sequence length, and the assembly graph is used for the assembly graph.
In a third aspect, the invention also provides an electronic device comprising a processor, and a memory storing a program comprising instructions that when executed by the processor cause the processor to perform the iterative k-mer decomposition based genome assembly method of any of the above.
In a fourth aspect, the present invention also provides a non-transitory machine readable medium storing computer instructions for causing the computer to perform the genome assembly method based on iterative k-mer decomposition as defined in any one of the above.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
According to the genome assembly method and system based on iterative k-mer decomposition, the sequence length is gradually reduced from the largest sequence length of an input sequence fragment set, and the sequence fragment set is processed by using the corresponding sequence length, so that a frequency set and corresponding high-frequency and low-frequency sets are obtained. By adopting a large-to-small processing mode, the calculation speed can be increased while the data set scale is gradually reduced, and the efficient assembly of large-scale genome data is realized, so that the assembly speed is effectively improved.
The high frequency set is then reserved as a reliable set and used for subsequent construction of the Debrucine plot and assembly sequence, while the low frequency set is used as an unreliable set for the next round of iterative processing. First, since only the unreliable sets are decomposed in each iteration, the data volume of each round can be significantly reduced, so that the calculation efficiency is improved and the calculation load is reduced by reducing the data scale. And secondly, the low-frequency set is not directly discarded as the method in the prior art, but is used for the next round of iterative processing and splitting, so that the low-frequency set which is easy to ignore originally is reserved and utilized, the data utilization rate is improved, and the final assembly quality is further improved. And finally, the Debrucine graph can be constructed according to the obtained multiple high-frequency sets, so that the problem of broken traversing paths of the constructed Debrucine graph is effectively prevented, the final output assembly sequence set is longer, and the assembly effect is more excellent.
Detailed Description
Embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While the invention is susceptible of embodiment in the drawings, it is to be understood that the invention may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided to provide a more thorough and complete understanding of the invention. It should be understood that the drawings and embodiments of the invention are for illustration purposes only and are not intended to limit the scope of the present invention.
In the prior art, the genome assembly method based on k-mer count has the advantages of efficiently processing large-scale data, accurately identifying repeated areas, optimizing memory and using computing resources. The method reduces sequencing noise through frequency screening, and can flexibly adjust k value to adapt to genomes with different complexity, so that the assembly speed is faster, the accuracy is higher, and the method is particularly suitable for rapid assembly tasks of large-scale genomes. Therefore, the k-mer-based genome assembly method is becoming the mainstream.
Wherein k-mer refers to a contiguous nucleotide sequence of length k. By extracting fixed length k-mers in each sequence segment (read), complex genome assembly problems can be translated into k-mer map structure stitching problems. Current k-mer based assembly methods rely primarily on building debluring diagrams that can efficiently process and represent redundant information in large-scale k-mer sets, thereby improving assembly efficiency.
Specifically, during genome assembly based on k-mer counts, first, a set of k-mers of all length k is extracted from the sequencing data, then each k-mer is treated as a node in the graph by constructing Jiande brucine's graph, adjacent k-mers share overlapping regions of length k-1 and are connected by edges in the graph. In this way, the genome assembly problem can be converted to a path traversal problem of the graph, thereby identifying the integrity of the sequence. The Debrucine plot has significant advantages in processing large-scale data, enabling efficient integration of repetitive sequences while reducing storage space.
However, the construction and traversal efficiency of the deblur graph is affected by the magnitude of the k-value. In the traditional assembly method based on fixed k values, the selection of proper k values is important, k-mers with larger k values can more effectively identify and splice long repeated areas, and error connection is reduced, so that assembly accuracy is improved, but identification capability of low coverage areas is poor, and breakage is easy to cause. The k-mers with smaller k values can more comprehensively capture short overlapping areas, are suitable for splicing in low coverage areas, but are easy to generate redundant paths and error splicing, so that the assembly accuracy is reduced. Therefore, the choice of a fixed k value is a technical challenge that is difficult to balance.
In order to solve the limitations of the fixed k-value method, there is a prior art in which genome assembly is performed by a k-mer assembly method of iterating k-values from small to large. I.e. increasing the k value step by step, a small to large deblur plot is constructed. The method first extracts k-mers using smaller k-values and constructs preliminary contigs (i.e., spliced contiguous gene sequence fragments), thereby capturing short repeat regions in the sequence. Subsequently, the k value is gradually increased to eliminate redundant paths and identify longer overlap regions. Finally, the long repeated region in the k-mer spliced genome with larger k value is utilized to generate a more complete genome structure. The method improves the accuracy of genome assembly to a certain extent by re-extracting and splicing k-mers under different k values.
However, the k-mer assembly method by iterating k-values from small to large also has significant drawbacks. First, it is computationally complex. By iterating k-mer assembly methods from small to large k-values, the k-mer count and the Debrucine plot construction need to be independently performed first at smaller k-values, which can result in more complex computation processes due to the larger number of k-mers generated by small k-values. Second, the processing speed is slow. When the k value is replaced each time, a k-mer set with different k values is regenerated, and a corresponding Debrucine graph is constructed, namely, the Debrucine graph is reconstructed and traversed in the iterative process of the different k values, so that the processing speed is slower. Finally, the low frequency k-mer utilization is low. In k-mer assembly methods with iterative k-values from small to large, low-frequency k-mers at different k-values are usually ignored or discarded, whereas low-frequency k-mers in a real dataset may account for a large proportion, and direct discarding may result in partial genome information loss, reducing assembly quality.
In order to solve the shortcomings of the k-mer assembly method with small to large iteration k values in the prior art in terms of calculation overhead, processing speed and data utilization rate, referring to fig. 1, the invention provides a genome assembly method based on iteration k-mer decomposition, which comprises the following steps:
And processing the sequence fragment set according to the sequence length (i.e. k) to obtain a frequency number set.
Dividing the frequency collection according to a preset frequency threshold value to obtain a high-frequency collection and a low-frequency collection.
And judging the preset length threshold value and the sequence length.
If the preset length threshold is not greater than the sequence length, the low-frequency set is used as a new sequence fragment set, the sequence length is subjected to the value reduction operation, and the new sequence fragment set is processed according to the sequence length obtained by the value reduction operation, so that a new high-frequency set and a new low-frequency set corresponding to the new sequence fragment set are obtained.
If the preset length threshold is greater than the sequence length, a Debrucine diagram is constructed according to the obtained multiple high-frequency sets, and an assembled sequence set (namely an assembled contig set) is output according to the Debrucine diagram.
It should be noted that the sequence fragment set includes a plurality of sequence fragments (i.e., a plurality of reads). Each sequence fragment (i.e., read) represents a sequence of genomic sequences, which is about 50 to 300bp in length. After the collection of sequence fragments is obtained, the total number of corresponding sequence fragments, the maximum sequence length (i.e., kmax), and the minimum sequence length (i.e., kmin) are all known amounts.
When processing a sequence fragment set according to sequence length, the first sequence length, i.e. the maximum sequence length, is gradually reduced in sequence length. Each frequency set includes sequence fragments, sequence lengths, and corresponding consecutive nucleotide sequences and number of consecutive nucleotide sequences information. Illustratively, after a new set of frequency numbers is obtained, it is stored in a data structure, such as a hash table. The processing mode from large to small can gradually reduce the size of the data set, simultaneously accelerate the calculation speed, realize the efficient assembly of large-scale genome data and effectively improve the assembly speed.
It should be noted that, the preset frequency threshold is a preset value, and those skilled in the art can set the preset frequency threshold as required. For example, the preset frequency threshold is set to 1, and the number of continuous nucleotide sequences greater than 1 in the frequency set belongs to the high-frequency set, and conversely belongs to the low-frequency set. The high-frequency set is used as a reliable set, and can be reserved and used for subsequent construction of a Debrucine diagram and an assembly sequence. While the low frequency set is used as an unreliable set for the next round of iterative processing.
Specifically, after judging the preset length threshold and the current sequence length, if the preset length threshold is not greater than the current sequence length, the low-frequency set is used as a new sequence fragment set. Preferably, the preset length threshold is set to the minimum sequence length. Meanwhile, the sequence length is subjected to a subtracting operation. Wherein the person skilled in the art can set the value of the decrement depending on the actual requirements. Then, a new sequence length is obtained according to the subtracting operation, and a new sequence fragment set (namely, an old low-frequency set obtained in the last round) is processed to obtain a new frequency set. And then dividing the new frequency collection according to a preset frequency threshold value to obtain a new high-frequency collection and a new low-frequency collection corresponding to the new sequence fragment collection. Also, after new high frequency and low frequency sets are obtained, the high frequency set remains as a reliable set, while the low frequency set remains as an unreliable set for the next round of iterative processing.
By processing the high-frequency set and the low-frequency set respectively, firstly, only the unreliable set is decomposed in each iteration, so that the data volume of each round can be obviously reduced, and the calculation efficiency and the calculation burden are improved by reducing the data scale. And secondly, the low-frequency set is not directly discarded as the method in the prior art, but is used for the next round of iterative processing and splitting, so that the low-frequency set which is easy to ignore originally is reserved and utilized, the data utilization rate is improved, and the final assembly quality is further improved. And finally, the Debrucine graph can be constructed according to the obtained multiple high-frequency sets, so that the problem of broken traversing paths of the constructed Debrucine graph is effectively prevented, the final output assembly sequence set is longer, and the assembly effect is more excellent.
After iteration is performed until the preset length threshold is greater than the sequence length, all sequence fragments of the low-frequency set cannot be extracted, a Debrucine diagram can be constructed according to the obtained high-frequency sets, and an assembled sequence set is output according to the Debrucine diagram.
According to the genome assembly method based on iterative k-mer decomposition, from the maximum sequence length of an input sequence fragment set, the sequence length is gradually reduced from large to small, and the sequence fragment set is processed by using the corresponding sequence length, so that a frequency set, a corresponding high-frequency set and a corresponding low-frequency set are obtained. By adopting a large-to-small processing mode, the calculation speed can be increased while the data set scale is gradually reduced, and the efficient assembly of large-scale genome data is realized, so that the assembly speed is effectively improved.
The high frequency set is then reserved as a reliable set and used for subsequent construction of the Debrucine plot and assembly sequence, while the low frequency set is used as an unreliable set for the next round of iterative processing. First, since only the unreliable sets are decomposed in each iteration, the data volume of each round can be significantly reduced, so that the calculation efficiency is improved and the calculation load is reduced by reducing the data scale. And secondly, the low-frequency set is not directly discarded as the method in the prior art, but is used for the next round of iterative processing and splitting, so that the low-frequency set which is easy to ignore originally is reserved and utilized, the data utilization rate is improved, and the final assembly quality is further improved. And finally, the Debrucine graph can be constructed according to the obtained multiple high-frequency sets, so that the problem of broken traversing paths of the constructed Debrucine graph is effectively prevented, the final output assembly sequence set is longer, and the assembly effect is more excellent.
In some embodiments, the genome assembly method based on iterative k-mer decomposition according to the present invention processes a sequence segment set according to a sequence length to obtain a frequency set, including:
and extracting each sequence fragment in the sequence fragment set according to the sequence length to obtain a continuous nucleotide sequence and the number of the continuous nucleotide sequences corresponding to each sequence fragment.
And establishing a mapping relation between the continuous nucleotide sequences and the number of the continuous nucleotide sequences to obtain a frequency collection.
The sequence length of the continuous nucleotide sequence obtained by extraction, that is, the sequence length used at the time of the current extraction, is also described. Illustratively, the current sequence length is 47, and the sequence length of each successive nucleotide sequence obtained after the corresponding sequence fragment is extracted is also 47. The number of consecutive nucleotide sequences, i.e., the number of consecutive nucleotide sequences that can be extracted from the corresponding sequence fragment after extraction. How to extract sequence fragments according to the sequence length belongs to the prior art and is not described in detail.
Although the sequences used for extraction are consistent in length, the number of consecutive nucleotide sequences and consecutive nucleotide sequences that can be extracted will be different if the sequence fragments in the current sequence fragment set are different. Thus, statistics are required after each sequence segment in the current set of sequence segments is extracted. And establishing a mapping relation between the continuous nucleotide sequences and the number of the continuous nucleotide sequences according to the statistical result to obtain a frequency collection. And then dividing the frequency collection according to a preset frequency threshold value.
It can be imagined that since the sequence length is gradually reduced from the maximum sequence length from large to small, the calculation speed can be increased while the data set size is gradually reduced, and the efficient assembly of large-scale genome data can be realized, so that the assembly speed is effectively improved.
In some embodiments, the genome assembly method based on iterative k-mer decomposition according to the present invention divides a frequency set according to a preset frequency threshold to obtain a high-frequency set and a low-frequency set, including:
comparing the preset frequency threshold value with the number of each continuous nucleotide sequence in the frequency collection.
If the preset frequency threshold value is smaller than the number of the continuous nucleotide sequences, confirming that the corresponding continuous nucleotide sequences belong to a high-frequency set. If the preset frequency threshold is not smaller than the number of the continuous nucleotide sequences, confirming that the corresponding continuous nucleotide sequences belong to a low-frequency set.
It should be noted that, the preset frequency threshold is a preset value, and those skilled in the art can set the preset frequency threshold as required. For example, the preset frequency threshold is set to 1, and the number of continuous nucleotide sequences greater than 1 in the frequency set belongs to the high-frequency set, and conversely belongs to the low-frequency set. The high-frequency set is used as a reliable set, and can be reserved and used for subsequent construction of a Debrucine diagram and an assembly sequence. While the low frequency set is used as an unreliable set for the next round of iterative processing.
In some embodiments, the genome assembly method based on iterative k-mer decomposition constructs a Debrucine diagram according to the obtained plurality of high-frequency sets, including:
And selecting an untreated high-frequency set from the plurality of high-frequency sets as a set to be repaired.
And acquiring the length of the current sequence corresponding to the set to be repaired.
And judging the sizes of the current sequence length and the maximum sequence length.
If the current sequence length is equal to the maximum sequence length, confirming that the current set to be repaired is an initial repair set, turning to select an untreated high-frequency set from a plurality of high-frequency sets as the set to be repaired, and carrying out corresponding treatment.
If the current sequence length is smaller than the maximum sequence length, the current set to be repaired is processed to obtain a repaired set, and the repaired set is transferred to a plurality of high-frequency sets, and an unprocessed high-frequency set is selected as the set to be repaired. Wherein the sequence length of the patch set is equal to the maximum sequence length.
And combining the initial repair set and the repair set to obtain a repair high-frequency set.
And constructing a Debrucine graph according to the patch high-frequency set.
It can be imagined that, when assembling, since the sequence length is iterated from large to small from the maximum sequence length, after the frequency collection is divided according to the preset frequency threshold, the current high-frequency collection can be obtained as the initial repair collection or the collection to be repaired which needs to be processed. And combining each processed repair set with the initial repair set, and treating the step as a repair process. The Debrucine graph is constructed by using the combined repair high-frequency set, so that the problem of broken traversing paths of the Debrucine graph can be effectively prevented, the final output assembly sequence set is longer, and the assembly effect is more excellent.
Further, in some embodiments, the genome assembly method based on iterative k-mer decomposition according to the present invention processes a current set to be repaired to obtain a repaired set, including:
And obtaining the mapping relation between each continuous nucleotide sequence and the number of the continuous nucleotide sequences in the set to be patched.
And processing the continuous nucleotide sequence according to the mapping relation to obtain a reduced sequence fragment.
And extracting each reduced sequence segment according to the maximum sequence length to obtain a patch set.
The mapping relationship between each continuous nucleotide sequence and the number of continuous nucleotide sequences is a known amount, and the sequence fragments can be obtained by extracting the sequence fragments according to the sequence length. The prior art is not repeated as to how to process a continuous nucleotide sequence to obtain a reduced sequence fragment. By the step, the sequence lengths of the patching set and the initial patching set are equal, so that the two sets can be combined to obtain the patching high-frequency set. The Debrucine graph is constructed by using the combined repair high-frequency set, so that the problem of broken traversing paths of the Debrucine graph can be effectively prevented, the final output assembly sequence set is longer, and the assembly effect is more excellent.
The genome assembly method based on iterative k-mer decomposition of the present invention, in some embodiments, outputs an assembly sequence set according to a debluring diagram, comprises:
And carrying out optimization operation on the Debrucine graph to obtain an optimized Debrucine graph.
And traversing the optimized Debrucine graph to output an assembly sequence set.
By means of the optimization operation, the size of the graph can be effectively reduced, and the computing and storage efficiency can be improved. And (3) after traversing the optimized Debrucine graph, outputting an assembly result, namely an assembly sequence set.
Effect verification is performed as follows. Specifically, partial genome sequencing data of E.coli and corresponding reference genomes are downloaded from the public database NCBI, the genome sequencing data are used as an input sequence fragment set, and the corresponding reference genomes are used as a control. The maximum sequence length is set to 47, the minimum sequence length is set to 27, the variation of the sequence length after each iteration is 5, and the preset length threshold is set to 1. According to the genome assembly method based on iterative k-mer decomposition and the k-mer assembly method from small to large iterative k values in the prior art, respectively, an input sequence fragment set is processed, and after assembly is completed, a result and a reference genome of comparison are submitted to an assembly effect evaluation tool QUAST to be subjected to verification evaluation.
Wherein, the k-mer assembly method from small to large iteration k value in the prior art has 721.94bp matching error in every 100,000 bp. The genome assembly method based on iterative k-mer decomposition has 688.63bp matching errors in every 100,000bp, and has simpler calculation process and faster technical processing speed.
The embodiment of the invention also provides a genome assembly system based on iterative k-mer decomposition, which comprises a processing module 10, a dividing module 20, an iterative decomposition module 30 and an assembly module 40.
The processing module 10 is configured to process the sequence segment set according to the sequence length to obtain a frequency number set.
The dividing module 20 is configured to divide the frequency set according to a preset frequency threshold value, so as to obtain a high frequency set and a low frequency set.
The iterative decomposition module 30 is configured to determine a preset length threshold and a sequence length. When the preset length threshold is not greater than the sequence length, the iterative decomposition module 30 takes the low-frequency set as a new sequence segment set, performs a value reduction operation on the sequence length, and transmits the new sequence segment set and the sequence length obtained according to the value reduction operation to the processing module 10. When the preset length threshold is greater than the sequence length, the iterative decomposition module 30 transmits the obtained plurality of high frequency sets to the assembly module 40.
The assembling module 40 is configured to construct a deblur graph according to the obtained multiple high-frequency sets, and output an assembled sequence set according to the deblur graph.
It should be noted that the sequence fragment set includes a plurality of sequence fragments (i.e., a plurality of reads). Each sequence fragment (i.e., read) represents a sequence of genomic sequences, which is about 50 to 300bp in length. After the collection of sequence fragments is obtained, the total number of corresponding sequence fragments, the maximum sequence length (i.e., kmax), and the minimum sequence length (i.e., kmin) are all known amounts.
When the processing module 10 processes the sequence segment set according to the sequence length, the first sequence length, i.e. the maximum sequence length, is gradually reduced in sequence length. Each frequency set includes sequence fragments, sequence lengths, and corresponding consecutive nucleotide sequences and number of consecutive nucleotide sequences information. Illustratively, after a new set of frequency numbers is obtained, it is stored in a data structure, such as a hash table. The processing mode from large to small can gradually reduce the size of the data set, simultaneously accelerate the calculation speed, realize the efficient assembly of large-scale genome data and effectively improve the assembly speed.
It should be noted that, the preset frequency threshold is a preset value, and those skilled in the art can set the preset frequency threshold as required. For example, the preset frequency threshold is set to 1, and the number of continuous nucleotide sequences greater than 1 in the frequency set belongs to the high-frequency set, and conversely belongs to the low-frequency set. The high-frequency set is used as a reliable set, and can be reserved and used for subsequent construction of a Debrucine diagram and an assembly sequence. While the low frequency set is used as an unreliable set for the next round of iterative processing.
Specifically, after the iterative decomposition module 30 determines the preset length threshold and the current sequence length, if the preset length threshold is not greater than the current sequence length, the low-frequency set is used as the new sequence segment set. Preferably, the preset length threshold is set to the minimum sequence length. Meanwhile, the sequence length is subjected to a subtracting operation. Wherein the person skilled in the art can set the value of the decrement depending on the actual requirements. The iterative decomposition module 30 then transmits the new sequence length and the new set of sequence segments to the processing module 10 for corresponding processing and obtaining a new set of frequency numbers. Next, the new frequency number set is divided by the dividing module 20 to obtain a new high frequency set and a new low frequency set corresponding to the new sequence segment set. Also, after new high frequency and low frequency sets are obtained, the high frequency set remains as a reliable set, while the low frequency set remains as an unreliable set for the next round of iterative processing.
By processing the high-frequency set and the low-frequency set respectively, firstly, only the unreliable set is decomposed in each iteration, so that the data volume of each round can be obviously reduced, and the calculation efficiency and the calculation burden are improved by reducing the data scale. And secondly, the low-frequency set is not directly discarded as the method in the prior art, but is used for the next round of iterative processing and splitting, so that the low-frequency set which is easy to ignore originally is reserved and utilized, the data utilization rate is improved, and the final assembly quality is further improved. And finally, the Debrucine graph can be constructed according to the obtained multiple high-frequency sets, so that the problem of broken traversing paths of the constructed Debrucine graph is effectively prevented, the final output assembly sequence set is longer, and the assembly effect is more excellent.
After the iteration is performed until the preset length threshold is greater than the sequence length, all sequence fragments of the low-frequency set cannot be extracted, at this time, all obtained high-frequency sets can be transmitted to the assembly module 40, the assembly module 40 constructs a Debrucine diagram according to the obtained high-frequency sets, and the assembly sequence set is output according to the Debrucine diagram.
The genome assembly system based on iterative k-mer decomposition gradually reduces the sequence length from large to small from the maximum sequence length of an input sequence fragment set, and processes the sequence fragment set by using the corresponding sequence length to obtain a frequency set and a corresponding high-frequency set and low-frequency set. By adopting a large-to-small processing mode, the calculation speed can be increased while the data set scale is gradually reduced, and the efficient assembly of large-scale genome data is realized, so that the assembly speed is effectively improved.
The high frequency set is then reserved as a reliable set and used for subsequent construction of the Debrucine plot and assembly sequence, while the low frequency set is used as an unreliable set for the next round of iterative processing. First, since only the unreliable sets are decomposed in each iteration, the data volume of each round can be significantly reduced, so that the calculation efficiency is improved and the calculation load is reduced by reducing the data scale. And secondly, the low-frequency set is not directly discarded as the method in the prior art, but is used for the next round of iterative processing and splitting, so that the low-frequency set which is easy to ignore originally is reserved and utilized, the data utilization rate is improved, and the final assembly quality is further improved. And finally, the Debrucine graph can be constructed according to the obtained multiple high-frequency sets, so that the problem of broken traversing paths of the constructed Debrucine graph is effectively prevented, the final output assembly sequence set is longer, and the assembly effect is more excellent.
The invention also provides an electronic device comprising a processor and a memory storing a program. The program comprises instructions that when executed by a processor cause the processor to perform the genome assembly method based on iterative k-mer decomposition described in any of the embodiments above.
The invention also provides a non-transitory machine readable medium storing a computer program, wherein the computer instructions are configured to cause a computer to perform the genome assembly method based on iterative k-mer decomposition according to any one of the embodiments described above.
It should be noted that electronic devices are intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
A computer program for implementing the methods of embodiments of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of embodiments of the present invention, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable signal medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, or infrared system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
It should be noted that the term "comprising" and its variants as used in the embodiments of the present invention are open-ended, i.e. "including but not limited to". The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment," another embodiment "means" at least one additional embodiment, "and" some embodiments "means" at least some embodiments. The references to "one" or "a plurality" of modifications in the embodiments of the invention are intended to be illustrative rather than limiting, and it will be understood by those skilled in the art that "one or more" is intended to be interpreted as "one or more" unless the context clearly indicates otherwise.
User information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) according to embodiments of the present invention are information and data authorized by a user or sufficiently authorized by parties, and the collection, use and processing of relevant data requires compliance with relevant laws and regulations and standards of relevant countries and regions, and is provided with corresponding operation portals for the user to select authorization or denial.
The steps recited in the method embodiments provided by the embodiments of the present invention may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the invention is not limited in this respect.
The term "embodiment" in this specification means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the invention. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive. The various embodiments in this specification are described in a related manner, with identical and similar parts being referred to each other. In particular, for apparatus, devices, system embodiments, the description is relatively simple as it is substantially similar to method embodiments, see for relevant part of the description of method embodiments.
The above examples merely represent a few embodiments of the present invention, which are described in more detail and are not to be construed as limiting the scope of protection. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of the invention should be assessed as that of the appended claims.