gh-130167: Improve speed of difflib.IS_LINE_JUNK by replacing re #130170


Merged
AA-Turner merged 13 commits into python:main from donBarbos:improve-difflib-speed on May 1, 2025

Conversation

@donBarbos (Contributor) commented Feb 16, 2025 · edited by bedevere-app bot

Now we don't import re globally, and there is no longer an extra (undocumented) function parameter (see the documentation).

How did I benchmark the new version?

I wrote the following Python script, difflib_bench.py:

import re
import timeit

def is_line_junk_regex(line, pat=re.compile(r"\s*(?:#\s*)?$").match):
    return pat(line) is not None

def is_line_junk_no_regex(line):
    return line.strip() == "" or line.lstrip().rstrip() == "#"

test_cases = [
    "    ", "    #", "   # comment", "code line", " 123 ", " 123 #",
    " 123 # hi", " 123 #comment", "\n", "  #\n", "hello\n", "", " ",
    "\t", "#", " #", "# ", "   #   ", "text", "text # comment", "#text",
    "##", "#\t", "   ", " #\t ",
]

def benchmark(func, cases, n=100000):
    return timeit.timeit(lambda: [func(line) for line in cases], number=n)

regex_time = benchmark(is_line_junk_regex, test_cases)
no_regex_time = benchmark(is_line_junk_no_regex, test_cases)
print(f"Regex time: {regex_time:.6f} sec")
print(f"No regex time: {no_regex_time:.6f} sec")

timeit result: 1.33 s -> 0.64 s = 2.08x as fast

$ ./python -B difflib_bench.py
Regex time: 1.330593 sec
No regex time: 0.638559 sec

donBarbos and others added 2 commits February 16, 2025 07:20
Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
Co-authored-by: Adam Turner <9087854+AA-Turner@users.noreply.github.com>
@terryjreedy (Member) left a comment


One oddity of difflib is that re is imported at module level for IS_LINE_JUNK and again at class level in class _mdiff, line 1378. It is used in line 1381 to define the class name change_re = re.compile(r'(\++|\-+|\^+)'). To avoid importing re when importing difflib, the second import would have to be moved within a function. This is possible since change_re is only used within the _make_line method that immediately follows (1386).

I have not looked at how commonly this method is used.
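For illustration, a minimal sketch of the deferral Terry describes (hypothetical, not the code that was merged): keep the import and the compile call inside the only helper that needs the pattern, so importing difflib itself no longer pulls in re.

def _make_line_sketch(text):
    import re  # deferred: re is only imported when this helper actually runs
    change_re = re.compile(r'(\++|\-+|\^+)')  # re caches compiled patterns, so repeat calls stay cheap
    return change_re.findall(text)

print(_make_line_sketch("++ -- ^^"))  # ['++', '--', '^^']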

@bedevere-app (bot) commented

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@terryjreedy (Member) commented

I checked correctness of the return expression replacement, using test_cases above, with

for s in test_cases:
    (pat(s) is not None) is (s.strip() in ('', '#'))

and all are True.

@donBarbos (Contributor, Author) commented

I checked correctness of the return expression replacement, using test_cases above, with

OK, thank you, but this test_cases list is too small a sample, so I have supplemented my script with the following checks:

import string
import random
from itertools import product

# First checks:
for case in test_cases:
    assert is_line_junk_regex(case) == is_line_junk_no_regex(case), f"Inconsistent behavior for {repr(case)}"

def random_string(length=10):
    chars = string.printable + " " * 5 + "#" * 5
    return "".join(random.choice(chars) for _ in range(length))

# Second checks:
for _ in range(10000):
    case = random_string(random.randint(0, 20))
    assert is_line_junk_regex(case) == is_line_junk_no_regex(case), f"Inconsistent behavior for {repr(case)}"

# Third checks:
chars = " #\tabc"
for length in range(6):
    for case in map("".join, product(chars, repeat=length)):
        assert is_line_junk_regex(case) == is_line_junk_no_regex(case), f"Inconsistent behavior for {repr(case)}"

Then I ran the tests in Lib/test/test_difflib.py and everything finished well 👍

@donBarbos (Contributor, Author) commented

I have made the requested changes; please review again

@bedevere-app (bot) commented

Thanks for making the requested changes!

@terryjreedy, @AA-Turner: please review the changes made to this pull request.

@picnixz (Member) left a comment


Since we're improving performance, we don't want to recompute line.strip()
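For context, the shape of the change being suggested, as a sketch (not necessarily the exact committed diff): strip once and reuse the result instead of stripping repeatedly.

def IS_LINE_JUNK(line):
    # Compute the stripped string a single time and compare it twice.
    stripped = line.strip()
    return stripped == '' or stripped == '#'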

Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
@donBarbos (Contributor, Author) commented

Since we're improving performance, we don't want to recompute line.strip()

Sorry, I suspected this but hoped that there was a cache mechanism here :)

@serhiy-storchaka (Member) left a comment


I am not sure that the performance of this function is worth changing the code. In any case, the relative performance depends on the input -- for a long string with a single space character at one end (e.g. 'x'*10000 + ' ') the non-regex version is more than 10 times slower than the regex version. For length 100000 the difference is more than 100, and for length 1000000 -- more than 1000.

Hmm, it seems that that implementation has more than quadratic complexity. Of course, most data for which this function is used is short strings, but such behavior can be considered a security threat.

@terryjreedy (Member) commented Feb 16, 2025 · edited by hugovk

I checked correctness of the return expression replacement, using test_cases above, with

for s in test_cases:
    (pat(s) is not None) is (s.strip() in ('', '#'))

and all are True.

I am going to be neutral on whether the basic change should be merged. If it is, I personally think the 'pat' parameter can be considered private and eliminated. But I would defer the back-compatibility question to others, including 'encukou'. The only documented use of the function, marked as a constant, is to be passed as an argument. The argument is always passed on to SequenceMatcher, which calls it once with 1 argument, which would be a line.

@donBarbos (Contributor, Author) commented Feb 16, 2025 · edited by hugovk

@serhiy-storchaka

In any case, the relative performance depends on the input -- for a long string with a single space character at one end (e.g. 'x'*10000 + ' ') the non-regex version is more than 10 times slower than the regex version. For length 100000 the difference is more than 100, and for length 1000000 -- more than 1000.

I tested your cases but didn't get the result you expected (maybe I did something wrong?)
I ran the following script (I accidentally mixed up the dividend and divisor, but it doesn't matter):

import re
import timeit

def is_line_junk_regex(line, pat=re.compile(r"\s*(?:#\s*)?$").match):
    return pat(line) is not None

def is_line_junk_no_regex(line):
    return line.strip() == "" or line.lstrip().rstrip() == "#"

def benchmark(func, cases, n=1_000):
    return timeit.timeit(lambda: [func(line) for line in cases], number=n)

lengths = [10_000, 100_000, 1_000_000]
for length in lengths:
    test_string = "x" * length + " "
    regex_time = benchmark(is_line_junk_regex, test_string)
    no_regex_time = benchmark(is_line_junk_no_regex, test_string)
    print(f"String length: {length}")
    print(f"Regex version: {regex_time:.6f} seconds")
    print(f"Non-regex version: {no_regex_time:.6f} seconds")
    print(f"Speedup: {no_regex_time/regex_time:.2f}x\n")

and got the following results:

$ ./python -B bench.py
String length: 10000
Regex version: 4.220341 seconds
Non-regex version: 1.983180 seconds
Speedup: 0.47x

String length: 100000
Regex version: 42.698992 seconds
Non-regex version: 20.897327 seconds
Speedup: 0.49x

String length: 1000000
Regex version: 516.844790 seconds
Non-regex version: 262.632697 seconds
Speedup: 0.51x

@picnixz (Member) commented Feb 16, 2025 · edited

Aren't you iterating over single characters instead of a full line:

[func(line) for line in cases]

You should pass [test_string] instead of just test_string (sorry if I'm wrong, I'm tired)
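To make the distinction concrete, a small illustration (not from the PR): iterating over a string yields one-character strings, so the benchmark above timed the predicates on single characters rather than on the long line.

test_string = "x" * 10 + " "
print([c for c in test_string][:3])       # ['x', 'x', 'x'] -- characters, not lines
print([line for line in [test_string]])   # ['xxxxxxxxxx '] -- the whole line, exactly once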

@AA-Turner (Member) commented

I was the one who originally requested restoring pat, but I do find Terry's argument compelling, and I wouldn't oppose removing pat.

I think we should restore this PR to the simpler previous version (the micro-benchmark differences between string comparisons are so small they will be swallowed by noise). The class-level re import should be resolved before merge, though.

A

Co-authored-by: Hugo van Kemenade <1324225+hugovk@users.noreply.github.com>
@serhiy-storchaka (Member) left a comment


I was wrong, the complexity is not quadratic, but linear.

Anyway, the relative performance depends on the input. For some inputs the regexp version is faster, for others it is slower. Claiming that this improves speed for any input is bold.
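As an illustration of that point (a sketch, not part of the PR): on a short junk-ish line the plain-string version tends to win, while on a long non-junk line the regex fails at the first character and avoids the copies that strip() makes.

import re
import timeit

pat = re.compile(r"\s*(?:#\s*)?$").match

def junk_regex(line):
    return pat(line) is not None

def junk_strip(line):
    return line.strip() in ('', '#')

cases = {
    "short junk": "   #   ",                # typical input for the function
    "long non-junk": "x" * 100_000 + " ",   # pathological input from the review
}
for name, line in cases.items():
    t_re = timeit.timeit(lambda: junk_regex(line), number=10_000)
    t_st = timeit.timeit(lambda: junk_strip(line), number=10_000)
    print(f"{name}: regex {t_re:.4f}s  strip {t_st:.4f}s")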

@rhettinger (Contributor) commented

@tim-one Do you have a reason to want to keep the regex?

@tim-one (Member) commented

Can't say I much care, but I don't get it. What's the point? It's dubious on the face of it to build an entirely new string object just to do simple lexical analysis. If this is some "micro-optimization" thing, instead of

stripped = line.strip()
return stripped == '' or stripped == '#'

try

return line.strip() in '#'  # empty or single '#'

?
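For the record (an illustration, not part of the thread's code): the membership test works because `in` between strings is a substring check, so only the empty string and '#' itself pass.

# substring membership: only '' and '#' are substrings of '#'
for s in ('', '#', ' ', 'x', '##', '# '):
    print(s in '#', repr(s))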

@donBarbos (Contributor, Author) commented Feb 18, 2025 · edited

Can't say I much care, but I don't get it. What's the point? It's dubious on the face of it to build an entirely new string object just to do simple lexical analysis. If this is some "micro-optimization" thing, instead of

stripped = line.strip()
return stripped == '' or stripped == '#'

try

return line.strip() in '#'  # empty or single '#'

?

that's how it was originally :)

I'll probably restore it so as not to overcomplicate things

tim-one reacted with laugh emoji

Co-authored-by: Tim Peters <tim.peters@gmail.com>
@donBarbos (Contributor, Author) commented

@rhettinger

@tim-one Do you have a reason to want to keep the regex?

Tim doesn't seem to mind

@donBarbos (Contributor, Author) commented

I don't know who I can ping but it seems like there are more suggestions

@AA-Turner enabled auto-merge (squash) May 1, 2025 03:48
@AA-Turner merged commit bce45bc into python:main on May 1, 2025
39 checks passed
Reviewers

@eendebakpt left review comments
@serhiy-storchaka left review comments
@sobolevn left review comments
@AA-Turner approved these changes
@picnixz left review comments
@tim-one left review comments
@hugovk approved these changes
Awaiting requested review from @terryjreedy

10 participants
@donBarbos @terryjreedy @picnixz @AA-Turner @rhettinger @tim-one @eendebakpt @hugovk @serhiy-storchaka @sobolevn
