
bedevere-appbot mentioned this pull request

Better calibration of str.find aka FASTSEARCH #119702

Open

nineteendo reviewed

Objects/stringlib/fastsearch.h (outdated)

nineteendo reviewed

Objects/stringlib/fastsearch.h

Objects/stringlib/fastsearch.h (outdated)

rfind implemented

3bb688a

dg-pb changed the title ~~gh-119702: New dynamic algorithm for string search~~ gh-119702: New dynamic algorithm for string search (+rfind upgrade)

dg-pb (Contributor, Author) commented Jun 4, 2024 (edited)

Performance comparison of `count` vs `rcount` (both of this PR).

bug

48e7dd5

dg-pb marked this pull request as draft

June 5, 2024 11:49

bedevere-appbot removed the awaiting review label

Jun 5, 2024

bi-directional horspool_find

d11f9b9

sweeneyde (Member) commented Jun 5, 2024

-1 on any changes to rfind(). Maybe someone can consider it later, let's leave it off of this PR.

Also, how should I read your tables? I'm confused by -131.2%, since that implies negative time to me.

dg-pb (Contributor, Author) commented Jun 5, 2024 (edited)

> -1 on any changes to rfind(). Maybe someone can consider it later, let's leave it off of this PR.

It looks like a big mess now, but I am halfway through making functions that handle both directions with the same code. Initially this slowed the code down and made it look complex, but it is slowly recovering on both counts. I should have a presentable version in a day or so. You can take a look then, and if you're not happy I can always revert to the first commit of this PR.

> Also, how should I read your tables? I'm confused by -131.2%, since that implies negative time to me.

The formula is as follows:

    current_time = `time of current python version`
    new_time = `time of this PR`
    result = (new_time - current_time) / min(new_time, current_time)

I divide by the minimum of the two times to make the comparison fair. E.g.:

    new_time = 2
    current_time = 1
    (new_time - current_time) / current_time = 100%
    (new_time - current_time) / min(new_time, current_time) = 100%

    new_time = 1
    current_time = 2
    (new_time - current_time) / current_time = -50%
    (new_time - current_time) / min(new_time, current_time) = -100%

This way the measure is symmetric: a slowdown and a speedup by the same factor give the same magnitude with opposite signs.

So -131.2% means the old version is 131.2% slower than the new one.
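As a minimal sketch, the formula above can be written as a small Python helper. The function name `relative_diff` and the timing value 2.312 are hypothetical and used only for illustration; they are not taken from the PR's benchmark script.

```python
def relative_diff(current_time, new_time):
    """Symmetric relative difference as a fraction (multiply by 100 for %).

    Positive: the new version is slower. Negative: the old version is slower.
    Dividing by the smaller of the two times keeps the scale symmetric.
    """
    return (new_time - current_time) / min(new_time, current_time)


# The two worked examples from the comment above.
print(relative_diff(current_time=1, new_time=2))  # 1.0, i.e. +100%
print(relative_diff(current_time=2, new_time=1))  # -1.0, i.e. -100%

# Hypothetical timings that would produce a table entry of about -131.2%.
print(round(relative_diff(current_time=2.312, new_time=1.0) * 100, 1))
```

With this definition, a table entry of -131.2% corresponds to the old version taking about 2.312 times as long as the new one.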

dg-pb added 5 commits

June 6, 2024 03:26

optimized horspool

6c9dbc3

more conservative bloom

1b9bdc9

fix assertions

982b510

seamless reverse integration

4e9d278

ready for review

c8e1cc5

This comment was marked as resolved.

nineteendo reviewed

Objects/stringlib/fastsearch.h (outdated)

dg-pb (Contributor, Author) commented Jun 6, 2024 (edited)

> Could you keep the variable names clear? I don't like 1 letter abbreviations. (i & j for loop counters are generally fine though.)

Changing variable names is easy. However, several functions defined the same variables under different names:
a) default_find and friends were using s/n/p/m, which is the default naming for string search problems. See https://en.wikipedia.org/wiki/Boyer–Moore_string-search_algorithm. Generally, 90% of search algorithms use this notation.
b) two_way_find was using the full names haystack & needle.

I took the liberty of synchronizing them and picked a) for the following reasons:
a) To me, it is easier to take in the code visually when variable names are short. In high-level code, where there are no complex loops, long names are probably better. But when code gets complex, as in algorithms, the bottleneck becomes the logic, not readability, because one needs to look at it for a certain amount of time before starting to grasp anything. After the first 2% of the work, the variable names are learnt by heart (whatever they might be), and visual inspection of the code is the priority for the remaining 98% of the time. I think there is a reason why libraries of algorithms use short names.
b) haystack and needle are names created for analogy (and probably fun). I prefer more formal variable names, or ones that match the equations. Fun variable names can be mentioned in documentation.

However, this is just my take. If there are strong objections, it is no problem to revert them.

Nevertheless, I think they should be synchronized across all the functions in the module. The naming discrepancy cost me a fair bit of time, because I had to readjust every time I changed focus.

UPDATE:
Also, Objects/stringlib/find.h uses str/str_len/sub/sub_len, and it contains higher-level functions. Generally, variable names get shorter at lower levels, so I would at least like to keep them no longer than those. But given the complexity of the code, I would still insist on 1-letter names that match most resources on such problems (especially code).
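To illustrate the s/n/p/m convention discussed above, here is a minimal Python sketch of a Horspool-style search written with those names (s is the text, n its length, p the pattern, m its length). It is not the code from fastsearch.h: the function name `horspool_find_sketch` and the dictionary-based shift table are assumptions made purely for illustration.

```python
def horspool_find_sketch(s, n, p, m):
    """Return the lowest index of p[:m] in s[:n], or -1 if it is absent."""
    if m == 0:
        return 0
    if m > n:
        return -1

    # Bad-character table: for every symbol of p except the last one,
    # the distance from its last occurrence to the end of the pattern.
    shift = {}
    for j in range(m - 1):
        shift[p[j]] = m - 1 - j

    i = 0  # current alignment of p against s
    while i <= n - m:
        if s[i:i + m] == p[:m]:
            return i
        # Skip ahead based on the text symbol under the pattern's last position;
        # symbols that do not occur in p[:m-1] allow a full shift of m.
        i += shift.get(s[i + m - 1], m)
    return -1


text, pattern = "hello world", "world"
print(horspool_find_sketch(text, len(text), pattern, len(pattern)))
```

The same loop written with haystack/needle/haystack_len/needle_len behaves identically; the short names mainly keep index expressions such as `s[i + m - 1]` compact.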

dg-pb commented

Objects/stringlib/fastsearch.h (outdated)

dg-pb marked this pull request as ready for review

June 6, 2024 09:58

bedevere-appbot added the awaiting review label

rm variable & add comment

b5bd4c5

picnixz reviewed