Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

gh-119702: New dynamic algorithm selection for string search (+rfind alignment)#120025

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Open
dg-pb wants to merge30 commits intopython:main
base:main
Choose a base branch
Loading
fromdg-pb:implement-119702

Conversation

dg-pb
Copy link
Contributor

@dg-pbdg-pb commentedJun 4, 2024
edited
Loading

1. Work

I managed to combine all good tricks that I found in the library into 1 dynamic solution, which seems to perform well and eliminate hard-coded boundaries for algorithm selection.

  1. Instead of 3 different implementations only one (horspool_find) is now called (for both forward and reverse search). It dynamically defaults to linear-complexity-assured solution (two_way_find) if it predicts it will perform better.
  2. Direction agnostic logic allowedrfind to use exact same code asfind.
  3. Special casen == m to usememcmp added.

2. Results

Aggregate impact of this change seems to be net positive. It results in non-trivial average performance increase, adapts more advanced search algorithms for reverse search, smooths out performancesurface and improves on general code structure and documentation.

Benefits:

  1. Performancesurface is much smoother now. There is only 1 logic that can cause a step change now and it is dynamic as opposed to many hard-coded step changes of the current logic.
  2. Direction agnostic logic works well and eases the strain of the alternative of having to keep 2 implementations in sync.
  3. Benchmarks:
  • Average75% performance increase offind for artificial benchmark of shuffled alphabet.
  • Average34% performance increase offind for real file search of different slice lengths.
  • Average247% performance increase ofrfind for artificial benchmark of shuffled alphabet.

Worth noting:

  1. Splitting 2 directions (forward and reversed) into 2 implementations would result in 10-30% better performance based on tested benchmarks. However, I think it is a good trade-off, given the advantages of unified approach.
  2. There are areas and cases where new algorithm performs worse (see benchmark). However, they are either not clustered or where they are, the performance decrease is non-substantial.

3. Benchmarks:

Benchmark result value:

current_runtime=`run time of current python version`new_runtime=`run time of this PR`result= (new_runtime-current_runtime)/min(new_runtime,current_runtime)

3.1. Artificial dataset via randomized alphabet.

Case Generation Gode

# shuffled alphabetalphabet='DHUXYEZQCLFKISBVRGNAMWPTOJ'zipf= [1/xforxinrange(1,1+len(alphabet))]defzipf_string(length,seed):letters=random.Random(seed).choices(alphabet,weights=zipf,k=length)return''.join(letters)NLS= [2,3,4,6,8,12,16,24,32,48,64,96,128,192,256,384,500,1000,10000,100_000]HSS= [500,750,1000,1500,2_000,3_000,4_000,6_000,8_000,12_000,16_000,24_000,32_000,48_000,64_000,96_000,1_000_000]defgenerate_benchmarks():output= []forminNLS:forninHSS:ifn<m:continueforsin (1,2,3):seed= (s*n+m)%1_000_003needle=zipf_string(m,seed)haystack=zipf_string(n,seed**2)name=f"needle={m}, haystack={n}, seed={s}"output.append((name,needle,haystack))withopen(f"{PATH}/_generated.py",'w')asf:print("benches = [",file=f)forname,needle,haystackinoutput:print(f"{(name,needle,haystack)!r},",file=f)print("]",file=f)

1.a. Results. Current vs new `str.find/str.count`.

Screenshot 2024-06-27 at 15 06 37

Comparison forlen(haystack) == 1000 forstr.find. x-axis is"{needle_len}:{seed}". Upper chart is run time, lower chart is percentage difference. It depicts the issue this PR is addressing. I.e. Big sub-optimal step-changes in performance for small input changes.

Screenshot 2024-06-27 at 14 12 55
1.b. Results. Current vs new `str.rfind/str.rcount`..

Screenshot 2024-06-08 at 23 23 17

3.2. Search for arbitrary chunks in real files.

Case Generation Gode. `str.rfind/str.rcount`..

/FILES= {"c": (CPYTHON_PATH/"Objects"/"unicodeobject.c").read_text(),"py": (CPYTHON_PATH/"Lib"/"_pydecimal.py").read_text(),"en": (CPYTHON_PATH/"Doc"/"library"/"stdtypes.rst").read_text(),"bin": (CPYTHON_PATH/"python.exe").read_bytes(),}MS= [10,15,20,30,40,60,80,120,160,240,320,640,1280]MR=range(12)defgenerate_benchmarks():results=dict()forfile_label,haystackinFILES.items():n=len(haystack)forminMS:foriinMR:stt= (1_000_003*i)% (n-m)needle=haystack[stt:stt+m]results[(m,file_label,i)]=haystack,needlereturnresults

2.a. Results. Current vs new `str.find/str.count`.

Screenshot 2024-06-27 at 15 06 54

nineteendo, bratao, and 0dminnimda reacted with thumbs up emoji
@dg-pbdg-pb changed the titlegh-119702: New dynamic algorithm for string searchgh-119702: New dynamic algorithm for string search (+rfind upgrade)Jun 4, 2024
@dg-pb
Copy link
ContributorAuthor

dg-pb commentedJun 4, 2024
edited
Loading

Performance comparison of `count` vs `rcount` (both of this PR).

Screenshot 2024-06-05 at 02 05 25

@dg-pbdg-pb marked this pull request as draftJune 5, 2024 11:49
@sweeneyde
Copy link
Member

-1 on any changes to rfind(). Maybe someone can consider it later, let's leave it off of this PR.

Also, how should I read your tables? I'm confused by-131.2%, since that implies negative time to me.

@dg-pb
Copy link
ContributorAuthor

dg-pb commentedJun 5, 2024
edited
Loading

-1 on any changes to rfind(). Maybe someone can consider it later, let's leave it off of this PR.

It looks a big mess now, but I am half-way through making functions that handle both directions with the same code. Initially it slowed down code and looked complex, but it is slowly recovering in both aspects. I should have a presentable version in a day or so. You can take a look then and if you're not happy I can always revert to the first commit of this PR.

Also, how should I read your tables? I'm confused by-131.2%, since that implies negative time to me.

The formula is as follows:

current_time=`time of current python version`new_time=`time of this PR`result= (new_time-current_time)/min(new_time,current_time)

I did minimum to make it fair. E.g.:

new_time=2current_time=1(new_time-current_time)/current_time=100%(new_time-current_time)/min(new_time,current_time)=100%new_time=1current_time=2(new_time-current_time)/current_time=-50%(new_time-current_time)/min(new_time,current_time)=-100%

This way it is symmetric to both positive and negative.

So-131.2% means old version is131.2% slower than the new.

@nineteendo

This comment was marked as resolved.

@dg-pb
Copy link
ContributorAuthor

dg-pb commentedJun 6, 2024
edited
Loading

Could you keep the variable names clear? I don't like 1 letter abbreviations. (i &j for loop counters are generally fine though.)

Changing variable names is easy. However, there were several functions defining same variables with different names:
a)default_find and friends was usings/n/p/m, which is a default naming for string search problems. Seehttps://en.wikipedia.org/wiki/Boyer–Moore_string-search_algorithm. Generally, 90% of search algorithms use this notation.
b)two_way_find was using full names ofhaystack &needle.

I took liberty to synchronize them and picked a) for the following reasons:
a) To me, it is easier to ingest the code visually when variable names are short. High level code, where there are no complex loops, probably long names are better. But when code gets complex, such as algorithms, then bottleneck becomes the logic, not readability, because one needs to look a certain amount of time before starting to grasp something. 2% into the work variable names are learnt by heart (whatever they might be) and visual inspection of the code is priority for the rest 98% of the time. I think there is a reason why libraries of algorithms use short names.
b) haystack and needle are definitions that are created for analogy (and probably fun). I prefer more formal names for variables or the ones that match equations. Fun variable names can be mentioned in documentation.

However, this is my take. If there are strong objections, it is no problem to revert them back.

Nevertheless, I think they should be synchronized across all the functions in the module. Naming discrepancy has cost me a fair bit of time due to having to readjust myself every time I changed focus.

UPDATE:
AlsoObjects/stringlib/find.h usesstr/str_len/sub/sub_len and it contains higher level functions. Generally, variable names get shorter as it gets to lower level, so I would at least like to keep the same length. But given complexity of the code would still try to insist on 1-letter names that match most of resources on such problems (especially code).

nineteendo and BrenBarn reacted with thumbs up emoji

@dg-pbdg-pb marked this pull request as ready for reviewJune 6, 2024 09:58
Copy link
Member

@picnixzpicnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I haven't read everything but here are some general comments:

Also, please document your code (especially the structures) and explain what is what when you use short variables. Even though it is standard in such algorithms, it does not hurt to remind ourselves thatp is the pattern,m is its length, etc. But there are other variables with short names that have no explanation.

I've suggested some improvements on blank lines and if-else blocks but feel free to ignore them.

dg-pb and nineteendo reacted with thumbs up emoji
@dg-pbdg-pb marked this pull request as draftJune 7, 2024 22:48
@dg-pb
Copy link
ContributorAuthor

dg-pb commentedJun 9, 2024
edited
Loading

Also, please document your code (especially the structures) and explain what is what when you use short variables. Even though it is standard in such algorithms, it does not hurt to remind ourselves thatp is the pattern,m is its length, etc. But there are other variables with short names that have no explanation.

I eventually changed back argument names toneedle andhaystack. Also did the same withpre-work structure. It is easier to figure things out this way, when there is no need to work with low details of algorithm.

However, inside algorithm logics/n/p/m are used and it is documented. But happy to change them to full names if needed.

nineteendo reacted with thumbs up emoji

@dg-pbdg-pb marked this pull request as draftJune 11, 2024 04:07
@dg-pbdg-pb marked this pull request as ready for reviewJune 11, 2024 08:03
@nineteendo

This comment was marked as resolved.

@dg-pb
Copy link
ContributorAuthor

Why was the force push necessary?

It wasn't necessary, but preferable, given the situation I was in.

@dg-pbdg-pb changed the titlegh-119702: New dynamic algorithm for string search (+rfind upgrade)gh-119702: New dynamic algorithm selection for string search (+rfind upgrade)Jul 13, 2024
@dg-pbdg-pb changed the titlegh-119702: New dynamic algorithm selection for string search (+rfind upgrade)gh-119702: New dynamic algorithm selection for string search (+ alignment withrfind)Jul 13, 2024
@dg-pbdg-pb changed the titlegh-119702: New dynamic algorithm selection for string search (+ alignment withrfind)gh-119702: New dynamic algorithm selection for string search (+rfind alignment)Jul 13, 2024
@pitrou
Copy link
Member

cc@tim-one

dg-pb reacted with eyes emoji

@tim-one
Copy link
Member

@sweeneyde, are you still around? You're obviously the best person to look at this, and it appears to be substantial work with real value. I can't make time to climb this learning curve.

Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers

@picnixzpicnixzpicnixz left review comments

@nineteendonineteendonineteendo left review comments

Assignees
No one assigned
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

6 participants
@dg-pb@sweeneyde@nineteendo@pitrou@tim-one@picnixz

[8]ページ先頭

©2009-2025 Movatter.jp