gh-106865: Improve difflib.find_longest_match performance with cache #106877


Closed

Conversation

@gaogaotiantian (Member) commented Jul 18, 2023 · edited by bedevere-bot

difflib.SequenceMatcher does not perform well on large inputs. This patch tries to improve that with some micro-optimizations on the hot path.

When dealing with large inputs, for example large files, almost all the time is spent in find_longest_match, and the hottest code is the nested for loop:

```python
for i in range(alo, ahi):
    # look at all instances of a[i] in b; note that because
    # b2j has no junk keys, the loop is skipped if a[i] is junk
    j2lenget = j2len.get
    newj2len = {}
    for j in b2j.get(a[i], nothing):
        # a[i] matches b[j]
        if j < blo:
            continue
        if j >= bhi:
            break
        k = newj2len[j] = j2lenget(j-1, 0) + 1
        if k > bestsize:
            besti, bestj, bestsize = i-k+1, j-k+1, k
```

The body of the inner loop executes (ahi - alo) * len(b2j[a[i]]) times, so any optimization inside the loop could potentially be a large improvement.

The patch runs the same loop the first time a[i] is encountered, except that it also builds a cache of the valid b2j entries - the j values that fall inside the range [blo, bhi). If a[i] has appeared before, the cached list is used instead, and the two range checks can be skipped.

Also, if a[i] has appeared before and there is at least one valid entry in the cached list, bestsize is already at least 1 - so the if k > bestsize check only needs to happen when k is greater than 1, which happens to be exactly the case we already detect if we use an if to check whether j-1 is in the dict.

In this way, we eliminate plenty of code on the cached path. This optimization is significant when the cached path is hit frequently, i.e. when the items in a repeat frequently. That is the case for any large file, especially files with a small vocabulary such as ASCII text.
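For concreteness, here is a minimal, self-contained sketch of the caching idea described above - not the actual diff in this PR; the helper name, the in_range cache, and the omission of junk/autojunk handling are simplifications for illustration:

```python
def _longest_match_cached(a, b2j, alo, ahi, blo, bhi):
    """Sketch of the cached inner loop (junk/autojunk handling omitted)."""
    besti, bestj, bestsize = alo, blo, 0
    j2len = {}
    in_range = {}  # cache: item of a -> list of j already known to be in [blo, bhi)
    for i in range(alo, ahi):
        newj2len = {}
        j2lenget = j2len.get
        item = a[i]
        js = in_range.get(item)
        if js is None:
            # First occurrence of this item: filter b2j[item] once and remember it.
            js = in_range[item] = [j for j in b2j.get(item, ()) if blo <= j < bhi]
        for j in js:
            # Cached path: no blo/bhi range checks needed here.
            k = newj2len[j] = j2lenget(j - 1, 0) + 1
            if k > bestsize:
                besti, bestj, bestsize = i - k + 1, j - k + 1, k
        j2len = newj2len
    return besti, bestj, bestsize
```

On repeated items the inner loop runs only over positions already known to be in range, which is where the savings come from.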

I tried this by hand on a couple of files - a C source file and a README file.

benchmark code
```python
import difflib
import time

with open("../viztracer/src/viztracer/modules/snaptrace.c") as f:
    code = f.read()
with open("../viztracer/README.md") as f:
    readme = f.read()

s = difflib.SequenceMatcher(None, code, code)
start = time.time()
s.get_opcodes()
print(time.time() - start)

s = difflib.SequenceMatcher(None, code, readme)
start = time.time()
s.get_opcodes()
print(time.time() - start)

s = difflib.SequenceMatcher(None, readme, code)
start = time.time()
s.get_opcodes()
print(time.time() - start)
```

The improvement in time spent was significant:

0.6531200408935547
0.21375417709350586
0.7097032070159912

to

0.20792889595031738
0.11065959930419922
0.28839802742004395

This patch will have a small negative performance impact on smaller test cases and on cases where none of the items in a are duplicated. For example, it is significantly slower in test_recursion_limit:

```python
def test_recursion_limit(self):
    # Check if the problem described in patch #1413711 exists.
    limit = sys.getrecursionlimit()
    old = [(i % 2 and "K:%d" or "V:A:%d") % i for i in range(limit * 2)]
    new = [(i % 2 and "K:%d" or "V:B:%d") % i for i in range(limit * 2)]
    difflib.SequenceMatcher(None, old, new).get_opcodes()
```

However, in real-world scenarios it does not make much sense to diff/match two sequences with a giant vocabulary, so in almost all actual use cases this should improve performance, especially on large sequences.

We could, however, check the vocabulary ratio (len(set(a)) / len(a)) and use it to decide whether to apply this optimization - extra overhead, but potentially better performance in the worst case.
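A minimal sketch of what such a guard could look like (should_cache and the 0.5 threshold are hypothetical, purely for illustration):

```python
def should_cache(a, threshold=0.5):
    # Enable the cached path only when items of `a` repeat often enough
    # (small vocabulary relative to the sequence length) for it to pay off.
    return bool(a) and len(set(a)) / len(a) < threshold
```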

P.S. Changing cache.append to cacheappend = cache.append does not have an observable impact, but I can do that if it's needed.

@gaogaotiantian changed the title from "Improve difflib.find_longest_match performance with cache" to "gh-106865: Improve difflib.find_longest_match performance with cache" on Jul 18, 2023
@gaogaotiantian (Member, Author) commented:

@tim-one I know this is probably ancient code to you :) git blames you for working on the code 23 years ago. Could you maybe share some thoughts on the optimization? Thanks!


@tim-one (Member) commented:

> However, in real-world scenarios it does not make much sense to diff/match two sequences with a giant vocabulary,

To the contrary, comparing lists of lines is by far the most common - and important intended - use case, where a "letter in the alphabet" is the entire content of a line. I don't know why you would want to compare text files as sequences of characters instead, but "so don't do that" comes to mind 😉.

Two common uses for comparing sequences of characters come to mind:

  1. When a higher-level diff (by lines) wants to find the character-level differences between two mismatching lines. Indeed, lots of code inside difflib does that, to support its higher-level file-differencing functions. In that case, the sequences are typically quite short (under a hundred characters each); see the sketch after this list.
  2. Newbies comparing multi-megabyte DNA sequences. They'll never be happy with difflib, or with anything else short of highly specialized differencing algorithms designed for their domain.
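As a small illustration of that first use case (the input lines here are made up), a line-level SequenceMatcher can drive a character-level pass over mismatching lines:

```python
import difflib

old_lines = ["def f(x):\n", "    return x + 1\n"]
new_lines = ["def f(x):\n", "    return x + 2\n"]

sm = difflib.SequenceMatcher(None, old_lines, new_lines)
for tag, i1, i2, j1, j2 in sm.get_opcodes():
    if tag == "replace":
        # Character-level diff of two short mismatching lines.
        inner = difflib.SequenceMatcher(None, old_lines[i1], new_lines[j1])
        print(tag, inner.get_opcodes())
    else:
        print(tag, old_lines[i1:i2], new_lines[j1:j2])
```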

So the patch looks good, but I'm skeptical that it addresses a "real problem" that's common enough to be worth any slowdown in the actual common cases. The autojunk gimmick was added to address "very common" repeated lines, and it's good at speeding up those cases, but it was A Big Mistake to enable it by default.

What would be welcome is a different algorithm specifically designed for "long sequences from a small alphabet" cases. SequenceMatcher works OK for those, provided the sequences aren't very long, but the core algorithm was really designed for "giant alphabet" cases (viewing text files as sequences of lines).

@tim-one (Member) commented Jul 19, 2023 · edited

BTW, if you want to pursue this, there's an "optimization" I never bothered with because I didn't expect it would pay in the common cases (which I apparently have a different idea of than you have 😉): the one-at-a-time comparisons of the j's to blo and bhi are obvious magnets for contrived bad cases. Instead, something along the lines of:

```python
js = b2j[a[i]]   # this is assuming we know a[i] is in the dict
lo = bisect_left(js, blo)
hi = bisect_left(js, bhi, lo)
if lo or hi < len(js):
    # or maybe faster to always do the slice; it's correct either way
    js = js[lo:hi]
```

That finds all the j's in range efficiently regardless of how many are out of range, and regardless of how many are in range, materializes them in a list "at C speed" instead of appending one at a time "at Python speed". The result is the value of your cache, but computed in advance without Python-level looping.
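A quick self-contained illustration of that slicing (the values are made up):

```python
from bisect import bisect_left

js = [2, 5, 9, 14, 20, 31, 40]   # hypothetical sorted positions of a[i] in b
blo, bhi = 6, 32

lo = bisect_left(js, blo)        # first index with js[lo] >= blo
hi = bisect_left(js, bhi, lo)    # first index with js[hi] >= bhi
print(js[lo:hi])                 # [9, 14, 20, 31] -- every j in [blo, bhi)
```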

@gaogaotiantian (Member, Author) commented:

Okay, I might have misunderstood the intended usage of SequenceMatcher. I agree that using lines would be a better way to do diffs.

If we do not expect duplicate items, then even the bisection may not compensate for the extra cost of creating lists and doing the bisection, right? We might still slow down the cases where there are no duplicates.

I'll leave the decision to you; I'm okay if things stay the way they are. If you think this could be helpful when users use strings as the sequences instead of lists of strings, I'll do the bisect optimization.
