Algorithm for finding continuous repeated sequences

Question 1

I'm looking for an algorithm that finds short tandem repeats in a genome sequence.

Basically, given a really long string which can only consist of the 4 characters 'ATCG', I need to find short repeats between 2-5 characters long that are next to each other.

ex:TACATGAGATCATGATGATGATGATGGAGCTGTGAGATCwould giveATGATGATG or ATG repeated 3 times

The algorithm needs to scale up to a string of 1 million characters so I'm trying to get as close to linear runtime as possible.

My current algorithm:Since the repeats can be 2-5 characters long, I check the string character by character and see if the Nth character is the same as the N+Xth character, with X being 2 through 5. With a counter for each X that counts sequential matches and resets at a mismatch, we know if there is a repeat when X = the counter. The subsequent repeats can then be checked manually.

Question 2

What is the problem with your current algorithm?

Question 3

Seems like your current algorithm is already O(n) if X is bounded

Question 4

The algorithm itself sounds feasible. If performance is an issue, your should definitely consider (high-level: ) parallelizing it (should be embarassingly parallel), and possibly (low-level: ) keep in mind language-specific performance implications, like that ofString objects vs.char[] arrays, for example

Question 5

How fast has it to be? I achieved more than 50 MiB/s with a single thread and the most straight forward implementation on my laptop.

Question 6

You are looking at each character which gives youO(n), since you compare on each character the next (maximum) five characters this gives you a constantc:

var data    = get_input();var compare = { `A`, `T`, `G`, `A`, `T` }         // or whatevervar MAX_LOOKAHEAD = compare.lengthvar nvar cfor(n = data_array.length; n < size; i++) {       // Has runtime O(n)  for(c = 0; c < MAX_LOOKAHEAD; c++) {            // Maximum O(c)    if( compare[c] != data[i+c] ) {      break;    } else {      report( "found match at position " + i )    }  }}

It is easy to see that this runsO(n*c) times. Sincec is very small it can be ignored - and I think one can not get rid of that constant - which results in a total runtime ofO(n).

The good news:

You can speed this up with parallelization. E.g. you could split this up ink intervals and let multiple threads do the job for you by giving them appropriate start and end indices. This could give you alinear speedup.

If you do that make sure that you treat the intersections as special cases since you could miss a match if your intervals split a match in two.

E.g.n = 50000:

Partition for 4 threads:(n/10000) - 1 = 4. The 5th thread won't have a lot to do since it just handles the intersections which is why we don't need to consider its (in our case tiny) overhead.

1                 10000               20000               40000               50000|-------------------|-------------------|-------------------|-------------------|| <-   thread 1  -> | <-   thread 2  -> | <-   thread 3  -> | <-   thread 4  -> |                  |---|               |---|               |---|                                  |___________________|___________________|                                        |                                     thread 5

And this is how it could look like:

var data;var compare = { `A`, `T`, `G`, `A`, `T` };var MAX_LOOKAHEAD = compare.length;thread_function(args[]) {    var from = args[0];    var to   = args[1];    for(n = from ; n < to ; i++) {      for(c = 0; c < MAX_LOOKAHEAD; c++) {        if( compare[c] != data[i+c] ) {          break;        } else {          report( "found match at position " + i )        }      }    }}main() {    var data_size     = 50000;    var thread_count  = 4;    var interval_size = data_size / ( thread_count + 1) ;    var tid[]    // This loop starts the threads for us:    for( var i = 0; i < thread_count; i++ ) {        var args = { interval_size * i, (interval_size * i) + interval_size };        tid.add( create_thread( thread_function, args ) );    }    // And this handles the intersections:    for( var i = 1; i < thread_count - 1; i++ ) {        var args = { interval_size * i, (interval_size * i) + interval_size };        from = (interval_size * i) - compare.length + 1;        to   = (interval_size * i) + compare.length - 1;        for(j = from; j < to ; j++) {            for(k = 0; k < MAX_LOOKAHEAD; k++) {                if( compare[k] != data[j+k] ) {                    break;                } else {                    report( "found match at position " + j )                }            }        }    }    wait_for_multiple_threads( tid );}

Stefan Falk 25.8k62 gold badges225 silver badges422 bronze badges · Accepted Answer · 2014-05-10 00:14:14Z

You are looking at each character which gives youO(n), since you compare on each character the next (maximum) five characters this gives you a constantc:

var data    = get_input();var compare = { `A`, `T`, `G`, `A`, `T` }         // or whatevervar MAX_LOOKAHEAD = compare.lengthvar nvar cfor(n = data_array.length; n < size; i++) {       // Has runtime O(n)  for(c = 0; c < MAX_LOOKAHEAD; c++) {            // Maximum O(c)    if( compare[c] != data[i+c] ) {      break;    } else {      report( "found match at position " + i )    }  }}

It is easy to see that this runsO(n*c) times. Sincec is very small it can be ignored - and I think one can not get rid of that constant - which results in a total runtime ofO(n).

The good news:

You can speed this up with parallelization. E.g. you could split this up ink intervals and let multiple threads do the job for you by giving them appropriate start and end indices. This could give you alinear speedup.

If you do that make sure that you treat the intersections as special cases since you could miss a match if your intervals split a match in two.

E.g.n = 50000:

Partition for 4 threads:(n/10000) - 1 = 4. The 5th thread won't have a lot to do since it just handles the intersections which is why we don't need to consider its (in our case tiny) overhead.

1                 10000               20000               40000               50000|-------------------|-------------------|-------------------|-------------------|| <-   thread 1  -> | <-   thread 2  -> | <-   thread 3  -> | <-   thread 4  -> |                  |---|               |---|               |---|                                  |___________________|___________________|                                        |                                     thread 5

And this is how it could look like:

var data;var compare = { `A`, `T`, `G`, `A`, `T` };var MAX_LOOKAHEAD = compare.length;thread_function(args[]) {    var from = args[0];    var to   = args[1];    for(n = from ; n < to ; i++) {      for(c = 0; c < MAX_LOOKAHEAD; c++) {        if( compare[c] != data[i+c] ) {          break;        } else {          report( "found match at position " + i )        }      }    }}main() {    var data_size     = 50000;    var thread_count  = 4;    var interval_size = data_size / ( thread_count + 1) ;    var tid[]    // This loop starts the threads for us:    for( var i = 0; i < thread_count; i++ ) {        var args = { interval_size * i, (interval_size * i) + interval_size };        tid.add( create_thread( thread_function, args ) );    }    // And this handles the intersections:    for( var i = 1; i < thread_count - 1; i++ ) {        var args = { interval_size * i, (interval_size * i) + interval_size };        from = (interval_size * i) - compare.length + 1;        to   = (interval_size * i) + compare.length - 1;        for(j = from; j < to ; j++) {            for(k = 0; k < MAX_LOOKAHEAD; k++) {                if( compare[k] != data[j+k] ) {                    break;                } else {                    report( "found match at position " + j )                }            }        }    }    wait_for_multiple_threads( tid );}

Movatterモバイル変換

Collectives™ on Stack Overflow

Algorithm for finding continuous repeated sequences

1 Answer1

Comments

Your Answer

Sign up orlog in

Post as a guest

Linked

Related

Hot Network Questions

Subscribe to RSS