Uh oh!
There was an error while loading.Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork32k
gh-119702: New dynamic algorithm selection for string search (+rfind
alignment)#120025
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
base:main
Are you sure you want to change the base?
Uh oh!
There was an error while loading.Please reload this page.
Conversation
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
rfind
upgrade)dg-pb commentedJun 4, 2024 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
-1 on any changes to rfind(). Maybe someone can consider it later, let's leave it off of this PR. Also, how should I read your tables? I'm confused by |
dg-pb commentedJun 5, 2024 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
It looks a big mess now, but I am half-way through making functions that handle both directions with the same code. Initially it slowed down code and looked complex, but it is slowly recovering in both aspects. I should have a presentable version in a day or so. You can take a look then and if you're not happy I can always revert to the first commit of this PR.
The formula is as follows: current_time=`time of current python version`new_time=`time of this PR`result= (new_time-current_time)/min(new_time,current_time) I did minimum to make it fair. E.g.: new_time=2current_time=1(new_time-current_time)/current_time=100%(new_time-current_time)/min(new_time,current_time)=100%new_time=1current_time=2(new_time-current_time)/current_time=-50%(new_time-current_time)/min(new_time,current_time)=-100% This way it is symmetric to both positive and negative. So |
This comment was marked as resolved.
This comment was marked as resolved.
Uh oh!
There was an error while loading.Please reload this page.
dg-pb commentedJun 6, 2024 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
Changing variable names is easy. However, there were several functions defining same variables with different names: I took liberty to synchronize them and picked a) for the following reasons: However, this is my take. If there are strong objections, it is no problem to revert them back. Nevertheless, I think they should be synchronized across all the functions in the module. Naming discrepancy has cost me a fair bit of time due to having to readjust myself every time I changed focus. UPDATE: |
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I haven't read everything but here are some general comments:
Also, please document your code (especially the structures) and explain what is what when you use short variables. Even though it is standard in such algorithms, it does not hurt to remind ourselves thatp is the pattern,m is its length, etc. But there are other variables with short names that have no explanation.
I've suggested some improvements on blank lines and if-else blocks but feel free to ignore them.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
dg-pb commentedJun 9, 2024 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
I eventually changed back argument names to However, inside algorithm logic |
This comment was marked as resolved.
This comment was marked as resolved.
It wasn't necessary, but preferable, given the situation I was in. |
rfind
upgrade)rfind
upgrade)rfind
upgrade)rfind
)rfind
)rfind
alignment)cc@tim-one |
@sweeneyde, are you still around? You're obviously the best person to look at this, and it appears to be substantial work with real value. I can't make time to climb this learning curve. |
Uh oh!
There was an error while loading.Please reload this page.
1. Work
I managed to combine all good tricks that I found in the library into 1 dynamic solution, which seems to perform well and eliminate hard-coded boundaries for algorithm selection.
horspool_find
) is now called (for both forward and reverse search). It dynamically defaults to linear-complexity-assured solution (two_way_find
) if it predicts it will perform better.rfind
to use exact same code asfind
.n == m
to usememcmp
added.2. Results
Aggregate impact of this change seems to be net positive. It results in non-trivial average performance increase, adapts more advanced search algorithms for reverse search, smooths out performance
surface
and improves on general code structure and documentation.Benefits:
surface
is much smoother now. There is only 1 logic that can cause a step change now and it is dynamic as opposed to many hard-coded step changes of the current logic.75%
performance increase offind
for artificial benchmark of shuffled alphabet.34%
performance increase offind
for real file search of different slice lengths.247%
performance increase ofrfind
for artificial benchmark of shuffled alphabet.Worth noting:
3. Benchmarks:
Benchmark result value:
3.1. Artificial dataset via randomized alphabet.
Case Generation Gode
1.a. Results. Current vs new `str.find/str.count`.
Comparison for
len(haystack) == 1000
forstr.find
. x-axis is"{needle_len}:{seed}"
. Upper chart is run time, lower chart is percentage difference. It depicts the issue this PR is addressing. I.e. Big sub-optimal step-changes in performance for small input changes.1.b. Results. Current vs new `str.rfind/str.rcount`..
3.2. Search for arbitrary chunks in real files.
Case Generation Gode. `str.rfind/str.rcount`..
2.a. Results. Current vs new `str.find/str.count`.