I'm looking for an algorithm that finds short tandem repeats in a genome sequence.
Basically, given a really long string which can only consist of the 4 characters 'ATCG', I need to find short repeats between 2-5 characters long that are next to each other.
ex:TACATGAGATCATGATGATGATGATGGAGCTGTGAGATCwould giveATGATGATG or ATG repeated 3 times
The algorithm needs to scale up to a string of 1 million characters so I'm trying to get as close to linear runtime as possible.
My current algorithm:Since the repeats can be 2-5 characters long, I check the string character by character and see if the Nth character is the same as the N+Xth character, with X being 2 through 5. With a counter for each X that counts sequential matches and resets at a mismatch, we know if there is a repeat when X = the counter. The subsequent repeats can then be checked manually.
- 1What is the problem with your current algorithm?Pete Baughman– Pete Baughman2014-05-09 22:38:55 +00:00CommentedMay 9, 2014 at 22:38
- 1Seems like your current algorithm is already O(n) if X is boundedNiklas B.– Niklas B.2014-05-09 22:46:47 +00:00CommentedMay 9, 2014 at 22:46
- The algorithm itself sounds feasible. If performance is an issue, your should definitely consider (high-level: ) parallelizing it (should be embarassingly parallel), and possibly (low-level: ) keep in mind language-specific performance implications, like that of
Stringobjects vs.char[]arrays, for exampleMarco13– Marco132014-05-09 23:14:52 +00:00CommentedMay 9, 2014 at 23:14 - How fast has it to be? I achieved more than 50 MiB/s with a single thread and the most straight forward implementation on my laptop.Daniel Brückner– Daniel Brückner2014-05-10 00:18:16 +00:00CommentedMay 10, 2014 at 0:18
1 Answer1
You are looking at each character which gives youO(n), since you compare on each character the next (maximum) five characters this gives you a constantc:
var data = get_input();var compare = { `A`, `T`, `G`, `A`, `T` } // or whatevervar MAX_LOOKAHEAD = compare.lengthvar nvar cfor(n = data_array.length; n < size; i++) { // Has runtime O(n) for(c = 0; c < MAX_LOOKAHEAD; c++) { // Maximum O(c) if( compare[c] != data[i+c] ) { break; } else { report( "found match at position " + i ) } }}It is easy to see that this runsO(n*c) times. Sincec is very small it can be ignored - and I think one can not get rid of that constant - which results in a total runtime ofO(n).
The good news:
You can speed this up with parallelization. E.g. you could split this up ink intervals and let multiple threads do the job for you by giving them appropriate start and end indices. This could give you alinear speedup.
If you do that make sure that you treat the intersections as special cases since you could miss a match if your intervals split a match in two.
E.g.n = 50000:
Partition for 4 threads:(n/10000) - 1 = 4. The 5th thread won't have a lot to do since it just handles the intersections which is why we don't need to consider its (in our case tiny) overhead.
1 10000 20000 40000 50000|-------------------|-------------------|-------------------|-------------------|| <- thread 1 -> | <- thread 2 -> | <- thread 3 -> | <- thread 4 -> | |---| |---| |---| |___________________|___________________| | thread 5And this is how it could look like:
var data;var compare = { `A`, `T`, `G`, `A`, `T` };var MAX_LOOKAHEAD = compare.length;thread_function(args[]) { var from = args[0]; var to = args[1]; for(n = from ; n < to ; i++) { for(c = 0; c < MAX_LOOKAHEAD; c++) { if( compare[c] != data[i+c] ) { break; } else { report( "found match at position " + i ) } } }}main() { var data_size = 50000; var thread_count = 4; var interval_size = data_size / ( thread_count + 1) ; var tid[] // This loop starts the threads for us: for( var i = 0; i < thread_count; i++ ) { var args = { interval_size * i, (interval_size * i) + interval_size }; tid.add( create_thread( thread_function, args ) ); } // And this handles the intersections: for( var i = 1; i < thread_count - 1; i++ ) { var args = { interval_size * i, (interval_size * i) + interval_size }; from = (interval_size * i) - compare.length + 1; to = (interval_size * i) + compare.length - 1; for(j = from; j < to ; j++) { for(k = 0; k < MAX_LOOKAHEAD; k++) { if( compare[k] != data[j+k] ) { break; } else { report( "found match at position " + j ) } } } } wait_for_multiple_threads( tid );}