bpo-37966: Fully implement the UAX #15 quick-check algorithm. #15558


Merged
benjaminp merged 5 commits into python:master from gnprice:pr-quickcheck on Sep 4, 2019

Conversation

gnprice (Contributor) commented Aug 28, 2019 (edited by bedevere-bot)

The purpose of the unicodedata.is_normalized function is to answer
the question str == unicodedata.normalize(form, str) more
efficiently than writing just that, by using the "quick check"
optimization described in the Unicode standard in UAX #15.

However, it turns out the code doesn't implement the full algorithm
from the standard, and as a result we often miss the optimization and
end up having to compute the whole normalized string after all.

Implement the standard's algorithm. This greatly speeds up
unicodedata.is_normalized in many cases where our partial variant
of quick-check had been returning MAYBE and the standard algorithm
returns NO.
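
For reference, the standard's quick-check loop has roughly the following shape. This is a minimal, self-contained C sketch: the toy_* lookups are made-up stand-ins for the tables that makeunicodedata.py generates, and the module's real code differs in its types and table access.

    #include <stdio.h>
    #include <stddef.h>

    typedef enum { YES = 0, MAYBE = 1, NO = 2 } QuickcheckResult;

    /* Toy stand-in for the canonical combining class lookup. */
    static unsigned char
    toy_combining_class(unsigned long ch)
    {
        return ch == 0x0338 ? 1 : 0;    /* pretend only U+0338 is a combining mark */
    }

    /* Toy stand-in for the per-form Quick_Check property lookup. */
    static QuickcheckResult
    toy_quickcheck_property(unsigned long ch)
    {
        if (ch == 0xF900) return NO;      /* pretend U+F900 is Quick_Check=No */
        if (ch == 0x0338) return MAYBE;   /* pretend U+0338 is Quick_Check=Maybe */
        return YES;
    }

    static QuickcheckResult
    quickcheck_sketch(const unsigned long *str, size_t len)
    {
        unsigned char prev_combining = 0;
        QuickcheckResult result = YES;

        for (size_t i = 0; i < len; i++) {
            unsigned char combining = toy_combining_class(str[i]);

            /* Combining marks out of canonical order: definitely not normalized. */
            if (combining != 0 && prev_combining > combining) {
                return NO;
            }

            QuickcheckResult qc = toy_quickcheck_property(str[i]);
            if (qc == NO) {
                return NO;        /* the full algorithm can give a definite NO */
            }
            if (qc == MAYBE) {
                result = MAYBE;   /* remember it, but keep scanning for a NO */
            }
            prev_combining = combining;
        }
        return result;
    }

    int main(void)
    {
        unsigned long s[] = { 0xF900, 0x41 };
        printf("%d\n", quickcheck_sketch(s, 2));   /* prints 2, i.e. NO */
        return 0;
    }

The difference this patch is about lives in the NO branch: the partial variant in master effectively collapsed NO into MAYBE, so it fell back to the slow normalize-and-compare even when a definite NO was available.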

At a quick test on my desktop, the existing code takes about 4.4 ms/MB
(so 4.4 ns per byte) when the partial quick-check returns MAYBE and it
has to do the slow normalize-and-compare:

$ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
    -- 'unicodedata.is_normalized("NFD", s)'
50 loops, best of 5: 4.39 msec per loop

With this patch, it gets the answer instantly (58 ns) on the same 1 MB
string:

$ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
    -- 'unicodedata.is_normalized("NFD", s)'
5000000 loops, best of 5: 58.2 nsec per loop

https://bugs.python.org/issue37966

This link doesn't work. Going back through that UAX's history to find the version that was current when this code was added in commit 7a0fedf in 2009-04, we find that that anchor still works in that version: https://www.unicode.org/reports/tr15/tr15-29.html#Annex8

It's a section heading, "14. Detecting Normalization Forms". Happily, the anchor that the corresponding section heading now offers looks much more reasonable -- it's the title of the section -- and so likely to be long-term stable. ("Annex 8" seems like some kind of editing error.) Switch to that.
This restores a small optimization that the original version of this code had for the `unicodedata.normalize` use case. With this, that case is actually faster than in master!

$ build.base/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
    -- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 561 usec per loop

$ build.dev/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
    -- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 512 usec per loop
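
The shape of that short-cut is roughly the following -- a hypothetical sketch rather than the commit's code, with is_normalized_quickcheck left as a prototype since the real scan lives in the module:

    #include "Python.h"

    typedef enum { YES = 0, MAYBE = 1, NO = 2 } QuickcheckResult;

    /* Prototype only; the real scan is implemented in the module. */
    QuickcheckResult
    is_normalized_quickcheck(PyObject *self, PyObject *input,
                             int nfc, int k, int yes_only);

    static PyObject *
    normalize_sketch(PyObject *self, PyObject *input, int nfc, int k)
    {
        /* If the quick check already proves the string is normalized,
         * hand back the input unchanged instead of rebuilding it. */
        if (is_normalized_quickcheck(self, input, nfc, k, 1) == YES) {
            Py_INCREF(input);
            return input;
        }
        /* ...otherwise the real function would do the normalization work... */
        Py_RETURN_NONE;   /* placeholder so the sketch stands on its own */
    }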
benjaminp (Contributor) left a comment:

Thanks. Looks good. Here are a few nitty comments.

is_normalized(PyObject *self, PyObject *input, int nfc, int k)
// This needs to match the logic in makeunicodedata.py
// which constructs the quickcheck data.
typedef enum {YES = 0, MAYBE = 1, NO = 2} QuickcheckResult;
Contributor:

Maybe makeunicodedata.py should output this enum (with better name namespacing) then?
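
For example, the generated header could carry something along these lines (purely illustrative -- the generator isn't touched in this PR, and these names are made up):

    /* Hypothetical makeunicodedata.py output; names are illustrative only,
     * values mirror the enum quoted above. */
    typedef enum {
        UNICODEDATA_QUICKCHECK_YES = 0,
        UNICODEDATA_QUICKCHECK_MAYBE = 1,
        UNICODEDATA_QUICKCHECK_NO = 2
    } UnicodedataQuickcheckResult;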

Contributor (author):

Ah, good idea!

Do you think it'd be cleaner to add that to this PR, or make a small separate one?

Contributor:

Could be done later, I think.

gnprice reacted with thumbs up emoji
@@ -208,6 +208,8 @@ def test_issue29456(self):
self.assertEqual(self.db.normalize('NFC', u11a7_str_a), u11a7_str_b)
self.assertEqual(self.db.normalize('NFC', u11c3_str_a), u11c3_str_b)

# For tests of unicodedata.is_normalized / self.db.is_normalized ,
# see test_normalization.py .
Contributor:

ugh, I would support merging test_normalization into this file for clarity.

Contributor (author):

Yeah; as it is, it actually caused me to think this function had evaded being tested at all -- I'd gone as far as to write up some tests myself before I spotted that other file.

A bit of context on someone's thinking in having a separate file can be found above at line 178:

        # The rest can be found in test_normalization.py
        # which requires an external file.

I don't see why that means the test code should be in a separate file, though. There's already a try/skip mechanism to deal with the external file being unavailable. (I'm guessing that if I looked in the history I'd find that that mechanism wasn't there when the separate file was first added.)

Happy to send a PR to fold that file in.

*/
static QuickcheckResult
is_normalized_quickcheck(PyObject *self, PyObject *input,
int nfc, int k, int yes_only)
Contributor:

These "boolean int" parameters could be actual bools.

Contributor (author):

Oh wow it's the future! 🙂 I guess that's a C99 feature, and one of those we now accept using. Good to know.

I'm not sure I want to convert the existing two parameters within this PR; it seems like that could muddy the diff a bit with changes that are orthogonal to what it's really doing. But I guess I can see how it looks.

Contributor (author):

Pushed a change to bool for the new parameter.

I was a bit surprised to find that I needed to add an #include <stdbool.h> line to the file! Perhaps that should be in a central include file, in the same way as limits.h and inttypes.h are. That's certainly orthogonal to this PR, though.
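
Concretely, the declaration ends up looking something like this (a sketch of just the signature -- per the diff quoted above, nfc and k stay int at this point):

    #include "Python.h"
    #include <stdbool.h>

    typedef enum { YES = 0, MAYBE = 1, NO = 2 } QuickcheckResult;

    /* Declaration-only sketch: only the new flag uses bool here. */
    QuickcheckResult
    is_normalized_quickcheck(PyObject *self, PyObject *input,
                             int nfc, int k, bool yes_only);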

Contributor (author):

I made a quick further commit converting nfc and k to bool too (including in the callers, and using true/false rather than 1/0 for literal values).

I'm ambivalent about adding it here. It expands the diff a bit -- was:
1 file changed, 51 insertions(+), 24 deletions(-)
and becomes:
1 file changed, 57 insertions(+), 30 deletions(-)
Slightly inclined to leave it out, but if you'd rather include it I'll do that.

Contributor:

What makes the diffstat larger? Aren't you already touching all callers, since you added a parameter?

Contributor:

I think we want to avoid polluting public headers with a stdbool.h include -- bool is a bit of a name grab in arbitrary mature C codebases. Using stdbool.h where it's needed is also consistent with IWYU.

Contributor (author):

I think we want to avoid polluting public headers with a stdbool.h include -- bool is a bit of a name grab in arbitrary mature C codebases.

Ah, fair enough. I think that's consistent with putting it in a generic internal-only header, though. Within the tree, the very few existing uses are either stdbool, or (in Modules/expat/xmltok.c) a sort of manual stdbool.

Using stdbool.h where it's needed is also consistent with IWYU.

Hmm true. I guess IWYU doesn't feel like how this codebase is mostly written... but maybe it's best to make it more so.

Contributor (author):

What makes the diffstat larger? Aren't you already touching all callers, since you added a parameter?

Here's the extra commit: gnprice@896c970f5
The lines not otherwise touched are where we declare and then fiddle with the nfc and k locals in unicodedata_UCD_is_normalized_impl.

Contributor:

Okay, we can fix up the rest in a sequel.

* normalized or certainly not, and MAYBE if quickcheck is unable to
* tell.
*
* If `yes_only` is true, then return MAYBE as soon as we determine
Contributor:

yes_only is a bizarre name for a parameter that causes the function to return MAYBE.

Contributor (author):

Yeah, I'm not really a fan of this name; I'd be glad for a better one.

The idea is that this flag says the caller doesn't care about NO vs. MAYBE; it only cares whether the answer is YES.
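
Put as code, the effect on the scan is roughly this (a sketch with a toy lookup standing in for the real Quick_Check table -- not the code in the PR):

    #include <stdbool.h>
    #include <stddef.h>

    typedef enum { YES = 0, MAYBE = 1, NO = 2 } QuickcheckResult;

    /* Toy Quick_Check lookup so the sketch stands on its own. */
    static QuickcheckResult
    toy_quickcheck(unsigned long ch)
    {
        return ch == 0xF900 ? NO : YES;
    }

    static QuickcheckResult
    scan_sketch(const unsigned long *str, size_t len, bool yes_only)
    {
        QuickcheckResult result = YES;
        for (size_t i = 0; i < len; i++) {
            QuickcheckResult qc = toy_quickcheck(str[i]);
            if (yes_only && qc != YES) {
                return MAYBE;     /* caller only cares whether the answer is YES */
            }
            if (qc == NO) {
                return NO;        /* full algorithm: report a definite NO */
            }
            if (qc == MAYBE) {
                result = MAYBE;   /* full algorithm: remember it and keep scanning */
            }
        }
        return result;
    }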

Contributor:

Maybe invert the sense and call it want_definite, meaning "process the string until we're sure it's not normalized or we reach the end"?

Contributor (author):

The algorithm in the standard is the yes_only = false (or want_definite = true) case, so I think things are simplest to describe if that's the baseline and the other case is described by contrast with it. Could certainly handle that with some added words, though.

Another aspect is that there is an asymmetry between a definite YES and a definite NO. In the yes_only / !want_definite case, what the caller really most wants is a definite YES... it's just a definite NO that is no more helpful than a MAYBE. Or another angle on that: if the caller really didn't care about getting a definite answer, it could just assume MAYBE and not bother calling the helper 😉 It's in the hope of getting a YES that it makes the call.

More ideas just now: perhaps maybe_is_no, to mean "a MAYBE is no worse than a NO"?

Or yes_or_maybe, for "there are only two distinct answers, YES and MAYBE". That's basically what I was going for with yes_only, but probably a less paradoxical way of putting it 🙂

Member:

The API also surprised me. I suggest to rename is_normalized_quickcheck() to is_normalized_impl() and add 2 functions: is_normalized() (yes_only=false) and is_normalized_quickcheck() (yes_only=true).

Contributor (author):

I suggest to rename is_normalized_quickcheck() to is_normalized_impl() and add 2 functions: is_normalized() (yes_only=false) and is_normalized_quickcheck() (yes_only=true).

Hmm, I like the idea of covering over some of the multiplication of flags with wrapper functions!

I don't quite like these names, though. First, I don't think is_normalized is a good name, for either value of the flag, because the function doesn't actually take full responsibility for finding out if the string is normalized -- it can return MAYBE, and then it's up to the caller to go and normalize the string and compare.

What the function does do is implement a quick but partial check to try to see if the string is normalized. Specifically, it implements one that's defined in the standard, which the standard calls quickCheck in its sample code, and which uses a property in the UCD called Quick_Check.

More precisely, the yes_only=false case implements that standard quick-check algorithm; the yes_only=true (or yes_or_maybe=true, etc.) case (which is the version in master) is a variation on it. I think the name "quickcheck" or is_normalized_quickcheck, with no further qualifications, sounds like it will mean the standard's algorithm, and it's confusing if it means a different algorithm instead.

Here's another possible set of names to use with the same idea:

  • rename this function to is_normalized_quickcheck_impl
  • is_normalized_quickcheck is the standard's quickcheck (yes_only/yes_or_maybe=false)
  • is_normalized_quickercheck is the variant that's more impatient to return and hand responsibility back to the caller (yes_or_maybe=true)
    • Or is_normalized_quickcheck_quicker

Or if those names feel too long: is_normalized_qc_impl, is_normalized_qc, is_normalized_qc_quicker.

How does that sound?
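
As a sketch, the wrapper idea with those names would look roughly like this (hypothetical -- the shared implementation is only a prototype here, and the PR as merged kept the single flag instead):

    #include "Python.h"
    #include <stdbool.h>

    typedef enum { YES = 0, MAYBE = 1, NO = 2 } QuickcheckResult;

    /* Shared implementation; prototype only, the real scan lives in the module. */
    QuickcheckResult
    is_normalized_quickcheck_impl(PyObject *self, PyObject *input,
                                  int nfc, int k, bool yes_only);

    /* The standard's quickCheck: scans the whole string, so it can answer NO. */
    static QuickcheckResult
    is_normalized_quickcheck(PyObject *self, PyObject *input, int nfc, int k)
    {
        return is_normalized_quickcheck_impl(self, input, nfc, k, false);
    }

    /* The "quicker" variant: gives up with MAYBE as soon as YES is ruled out. */
    static QuickcheckResult
    is_normalized_quickercheck(PyObject *self, PyObject *input, int nfc, int k)
    {
        return is_normalized_quickcheck_impl(self, input, nfc, k, true);
    }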

Contributor (author):

Or yes_or_maybe, for "there are only two distinct answers, YES and MAYBE".

Another idea for the flag name (though its name becomes less important if we use Victor's idea of a pair of wrapper functions): yes_vs_maybe.

Contributor:

It doesn't seem like anything is standing out as dramatically better. At least the semantics are clearly documented and easily read from the function.

So, let's go with the current state.

benjaminp merged commit 2f09413 into python:master on Sep 4, 2019
miss-islington (Contributor):

Thanks @gnprice for the PR, and @benjaminp for merging it 🌮🎉. I'm working now to backport this PR to: 3.8.
🐍🍒⛏🤖

bedevere-bot:

GH-15671 is a backport of this pull request to the 3.8 branch.

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request on Sep 4, 2019:
bpo-37966: Fully implement the UAX #15 quick-check algorithm. (pythonGH-15558) (cherry picked from commit 2f09413)
miss-islington added a commit that referenced this pull request on Sep 4, 2019:
bpo-37966: Fully implement the UAX #15 quick-check algorithm. (GH-15558) (cherry picked from commit 2f09413)
gnprice deleted the pr-quickcheck branch on September 6, 2019 03:23
gnprice (Contributor, author):

Thanks for all the reviews and the merge!

@benjaminp, I'll go on and send next some of those followups you suggested.

gnprice (Contributor, author):

@benjaminp, I'll go on and send next some of those followups you suggested.

Filed bpo-38043 for that, and sent PRs for two of them.

lisroach pushed a commit to lisroach/cpython that referenced this pull request on Sep 10, 2019:
bpo-37966: Fully implement the UAX #15 quick-check algorithm. (pythonGH-15558)
DinoV pushed a commit to DinoV/cpython that referenced this pull request on Jan 14, 2020:
bpo-37966: Fully implement the UAX #15 quick-check algorithm. (pythonGH-15558)
websurfer5 pushed a commit to websurfer5/cpython that referenced this pull request on Jul 20, 2020:
bpo-37966: Fully implement the UAX #15 quick-check algorithm. (pythonGH-15558)
Reviewers

vstinner left review comments
benjaminp approved these changes
Awaiting requested review from serhiy-storchaka

6 participants
@gnprice, @miss-islington, @bedevere-bot, @vstinner, @benjaminp, @the-knights-who-say-ni
