Posted onMar 15, 2024

Faster tetranucleotide (k-mer) frequencies!

#perl #bioinformatics #profiling #softwaredevelopment

I sawJennifer's post about re-writing herperl scripts in python and how she saw a 2.5 times improvement.

How could this be? My favourite language can't be that slow.
It must be programmer error.

I have an interest in Perl and Science, so time to roll up sleeves and learn me some profiling/benchmarking. What follows is my internal monologue and the notes I scribbled down during the learning process. For those that want to follow along, I'vecreated a small repo.

Getting started

Voltaire said that Hell is other people's code. My first step was tore-write it intoModern Perl and in the process, understand what each line does. When it's written idiomatically, it's easier to refactor and I should be able to make some minor performance improvements along the way.

Assume that the original script has been tested enough. For me to be correct, I've got to produce the exact same output. I got close, except for the header line.
line 99print OUT "\t$prefix_$j"; becomes
line 89print $out_fh "\t$j"; Yes, that's a bug because$prefix_ doesn't exist.

Search "benchmarking tools for linux" and decide thathyperfine is good for what I'm doing. Run Jennifer's new python script against my refactored perl and find that the python is 1.26 times faster for k=3 and 1.47 times faster for k=4. For the Covid-19 sequence, these are both on the order of hundreds of milliseconds.

hyperfine--warmup 3'perl/get_kmer_frequencies.pl Covid-19_seq.fasta 3 boyd1''python/get_kmer_frequencies.py -i Covid-19_seq.fasta -k 3 -p boyd2'

Ok, not bad. Better than 2.9 times faster, but that's probably down to the way thathyperfine warms the cache and separates out User time from System time.

Oh, I should just check how much I improved when I refactored. Run it against Jennifer's original perl script and ... hers was 1.1 times faster. Well, that was a bit embarrassing.

ahem I was ... aiming at improving readability, .. maintainability, y'know best practice and all that. That's my story and I'm sticking to it. ;)

For sanity's sake

Check that the output of the new file is the same as the original, otherwise you've messed up the refactoring. I started using this test script withprove to make it quick and easy.
Saved asi.t, I run it withprove i.t for the lols.
It gets noisy when there's a problem, so I go back to running it by hand.

useTest2::V0;my$standard='get_kmer_frequencies.pl';my@files=sort{-M$a<=>-M$b}glob'get_kmer*';okmy$latest=shift@files;isnt$latest,$standard,'Files to compare';my@args=qw'Covid-19.fasta 3';oksystem('perl',$standard,@args,'A')==0,'Make A_kmers.txt';oksystem('perl',$latest,@args,'B')==0,'Make B_kmers.txt';is`diff A_kmers.txt B_kmers.txt`,q{},'No differences in output';done_testing();

Clever people do this from the start.
I did this after a bug I introduced messed up the output and I hadn't immediately noticed. What it was is that I changed the key separator to a character that was found in some of those keys and it then split those keys. Oops.

NYTProf time

When you get serious about optimizing programs, trying to enhance performance, you reach forprofiling tools that can analyze your code's memory or time complexity. In Perl,Devel::NYTProf comes highly recommended. I use it to collect data on the number of times each statement is called and how long it spends executing it. That way I can work out where to invest the effort making the script faster, what gives the most bang for the buck.

Grab the profiler and run

perl -d:NYTProf get_kmer_frequencies.pl Covid-19_seq.fasta 3 boyd1

and open up thenytprof/index.html usingnytprofhtml --open to see

Calls   P   F   Exclusive Time  Inclusive Time  Subroutine9653    1   1   31.6ms  31.6ms  main::rc_seq25      2   1   28.1ms  59.7ms  main::process_it82498   7   1   9.46ms  9.46ms  main::CORE:print (opcode)3175    4   1   7.89ms  7.89ms  main::CORE:sort (opcode)

Sorting out the sort

Obviously, therc_seq is the big sub that needs attention, but what about thatsort? Quickly looking at the sort onLine 78for my $i (keys %knucs) I see that there's no reason to sort those keys. Saved one sort and the script runs about the same. There's another sortinside a loop which can be extracted out of the loop.Extracting that made it run 1.15 times faster!

Changing the header line (line 87) from a for loop to ajoin over a list is 1 or 2 percent faster.

How do I print thee? Let me count the ways.

Messing about with printing in the inner loop didn't gain much, but changing the key separator from a tab"\t" (interpolated string) to an underscore'_' (a string literal) made a 10% improvement. (it also introduced the bug noted above because the keys used the underscore. changed it to a colon - bug gone)

say is marginally slower thanprint so useprint inside the loop that gets called a lot to save maybe 10% on that call. From 32ms to 28ms is a small, but nice gain for a one line change.

rc_seq - transforming the sequence

Therc_seq sub is anif-elsif block that splits a string into individual characters, translatesACGT into their complement (TGCA), reverses the array and joins it back into a string.

Being Perl, we canmanipulate and reverse the string in-place. The change makes it shorter and more obvious (sometimes it runs faster). Actually, I ran this through the profiler and the sub now runs 5 times faster.

process_it - collecting the frequencies

This sub does the work of splitting the sequence intokmers and counting them. The longest time spent here isincrementing the%knucs hash.

The second longest time is spent turning the sequence into an array of letters to create all the kmer substrings. Splitting isn't bad, butjoining sets of letters together is. Use thestring function,substr instead and speed that line up by 5 times.

Now marginally faster the python script in speed. Over 20% faster for k=3, and 5% (+/- 5%) faster for k=4. That's a decent improvement.

Like the end of a great song ... a Key change!

line 91

my @items = map { my $key = join ':', $_, $i; $knucs{$key} // 0 } @record_keys;

The script spends most of its time (55ms!!!), longer thananything else, on this line.

Assume that the problem isn't themap but constructing the key for the lookup. Change the key to a2 dimensional lookup and see if that improves things.

WHEN you finally get it right (and remember the correct order that you construct the keys in),line 92 is now 2.5 times faster than before and the perl script is now 40% faster than the python script.

Keys are constructed/used on lines 79, 91, 135

STOP!!!!

Know when to stop.

There are no more obvious or easy gains here. Any more work is likely to yield small returns. Go outside, have a life or at the least consultthe relevant chart.

Well, after thinking a while, maybe constructing the output could be improved, but I'm moving on. I've exceeded my goal of making the perl script as fast as the python script and learned more about refactoring and profiling. A bit like how audiophiles use your music to listen to their equipment, I've used Jennifer's science to better understand my Perl and had fun doing it.

There's a niggling thought at the back of my mind, now that I feel I better understand the purpose of the script, whetherBioPerl can do this even faster. I will leave that for another day. Oh, lookglycine has already donemost of the hard work for me. Many thanks!

Lessons learned

In summary, these are reflections on the changes that I made in chronological order. This may be someone's first time considering performance, so I include basic rules of thumb I used along with the things I did not know before.

Modern Perl style adds a small amount of overhead, but the sanity it brings is a price worth paying.
Streamline a method of checking the output hasn't changed
Don'tsort when order is not important
Calculate constant valuesoutside of loops
Use built-in list functions over loops (join instead offor)
Interpolated strings are slower than string literals (prefer single quotes over double quotes)
say is slightly slower thanprint. Avoid it in heavy loops.
substr is faster thanspliting andjoining
Creating a single hash key is slower than using a 2 level hash

The Human Genome isway too large. Grab the protein sequence for Caenorhabditis elegans. It takes about 5 minutes to run.

Run your frequent tests with the Covid-19 sequence. Repeated runs with anything larger take too long for rapid turnaround.

WARNING: hyperfine will run each program 20 to 40 times to get decent statistics. You won't want to wait around for a file that takes 5 minutes to process a single run.

I'll leave you with a couple of related references for further reading,chrisarg's work onparsing FastQ files fast
and a marketsplash tutorial onPerl code profiling tools

In Conclusion

My corollary to Cunningham's Law:
Don't ask people how to make your code run faster;
Tell them their language is slow

It's taken a lot longer to reply Jennifer's post than I'd anticipated, but right now I have the warm glow that comes from being able to say, (until someone iterates on the above corollary)

...Python is SLOW!

Top comments(1)

Shengwei Li

I'm a bioinformatician based in China.

Location
Mile city, Yunnan province, China.
Education
Shanghai Jiao Tong University School of Medicine
Pronouns
he/him
Joined
Mar 4, 2025

• Mar 10

Copy link

Good post, please talk with me about in silico cloning of disease genes at the DEV Community site(dev.to/lisw05).
You are welcome.
Steven(orcid.org/my-orcid?orcid=0009-0003...)
Mar 10, 2025