bpo-37966: Fully implement the UAX #15 quick-check algorithm. #15558


Merged
benjaminp merged 5 commits into python:master from gnprice:pr-quickcheck on Sep 4, 2019

Conversation

gnprice (Contributor) commented Aug 28, 2019 (edited by bedevere-bot)

The purpose of the unicodedata.is_normalized function is to answer
the question str == unicodedata.normalize(form, str) more
efficiently than writing just that, by using the "quick check"
optimization described in the Unicode standard in UAX #15.

However, it turns out the code doesn't implement the full algorithm
from the standard, and as a result we often miss the optimization and
end up having to compute the whole normalized string after all.

Implement the standard's algorithm. This greatly speeds up
unicodedata.is_normalized in many cases where our partial variant
of quick-check had been returning MAYBE and the standard algorithm
returns NO.
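
For reference, the standard's quick-check loop has roughly the following shape. This is a minimal, self-contained C sketch: the toy_* lookups are made-up stand-ins for the tables that makeunicodedata.py generates, and the module's real code differs in its types and table access.

    #include <stdio.h>
    #include <stddef.h>

    typedef enum { YES = 0, MAYBE = 1, NO = 2 } QuickcheckResult;

    /* Toy stand-in for the canonical combining class lookup. */
    static unsigned char
    toy_combining_class(unsigned long ch)
    {
        return ch == 0x0338 ? 1 : 0;    /* pretend only U+0338 is a combining mark */
    }

    /* Toy stand-in for the per-form Quick_Check property lookup. */
    static QuickcheckResult
    toy_quickcheck_property(unsigned long ch)
    {
        if (ch == 0xF900) return NO;      /* pretend U+F900 is Quick_Check=No */
        if (ch == 0x0338) return MAYBE;   /* pretend U+0338 is Quick_Check=Maybe */
        return YES;
    }

    static QuickcheckResult
    quickcheck_sketch(const unsigned long *str, size_t len)
    {
        unsigned char prev_combining = 0;
        QuickcheckResult result = YES;

        for (size_t i = 0; i < len; i++) {
            unsigned char combining = toy_combining_class(str[i]);

            /* Combining marks out of canonical order: definitely not normalized. */
            if (combining != 0 && prev_combining > combining) {
                return NO;
            }

            QuickcheckResult qc = toy_quickcheck_property(str[i]);
            if (qc == NO) {
                return NO;        /* the full algorithm can give a definite NO */
            }
            if (qc == MAYBE) {
                result = MAYBE;   /* remember it, but keep scanning for a NO */
            }
            prev_combining = combining;
        }
        return result;
    }

    int main(void)
    {
        unsigned long s[] = { 0xF900, 0x41 };
        printf("%d\n", quickcheck_sketch(s, 2));   /* prints 2, i.e. NO */
        return 0;
    }

The difference this patch is about lives in the NO branch: the partial variant in master effectively collapsed NO into MAYBE, so it fell back to the slow normalize-and-compare even when a definite NO was available.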

At a quick test on my desktop, the existing code takes about 4.4 ms/MB
(so 4.4 ns per byte) when the partial quick-check returns MAYBE and it
has to do the slow normalize-and-compare:

$ build.base/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
    -- 'unicodedata.is_normalized("NFD", s)'
50 loops, best of 5: 4.39 msec per loop

With this patch, it gets the answer instantly (58 ns) on the same 1 MB
string:

$ build.dev/python -m timeit -s 'import unicodedata; s = "\uf900"*500000' \
    -- 'unicodedata.is_normalized("NFD", s)'
5000000 loops, best of 5: 58.2 nsec per loop

https://bugs.python.org/issue37966

This link doesn't work. Going back through that UAX's history to find the version that was current when this code was added in commit 7a0fedf in 2009-04, we find that that anchor still works in that version: https://www.unicode.org/reports/tr15/tr15-29.html#Annex8

It's a section heading, "14. Detecting Normalization Forms". Happily, the anchor that the corresponding section heading now offers looks much more reasonable -- it's the title of the section -- and so likely to be long-term stable. ("Annex 8" seems like some kind of editing error.) Switch to that.
This restores a small optimization that the original version of this code had for the `unicodedata.normalize` use case. With this, that case is actually faster than in master!

$ build.base/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
    -- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 561 usec per loop

$ build.dev/python -m timeit -s 'import unicodedata; s = "\u0338"*500000' \
    -- 'unicodedata.normalize("NFD", s)'
500 loops, best of 5: 512 usec per loop
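
The shape of that short-cut is roughly the following -- a hypothetical sketch rather than the commit's code, with is_normalized_quickcheck left as a prototype since the real scan lives in the module:

    #include "Python.h"

    typedef enum { YES = 0, MAYBE = 1, NO = 2 } QuickcheckResult;

    /* Prototype only; the real scan is implemented in the module. */
    QuickcheckResult
    is_normalized_quickcheck(PyObject *self, PyObject *input,
                             int nfc, int k, int yes_only);

    static PyObject *
    normalize_sketch(PyObject *self, PyObject *input, int nfc, int k)
    {
        /* If the quick check already proves the string is normalized,
         * hand back the input unchanged instead of rebuilding it. */
        if (is_normalized_quickcheck(self, input, nfc, k, 1) == YES) {
            Py_INCREF(input);
            return input;
        }
        /* ...otherwise the real function would do the normalization work... */
        Py_RETURN_NONE;   /* placeholder so the sketch stands on its own */
    }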
benjaminp (Contributor) left a comment:

Thanks. Looks good. Here are a few nitty comments.

is_normalized(PyObject *self, PyObject *input, int nfc, int k)
// This needs to match the logic in makeunicodedata.py
// which constructs the quickcheck data.
typedef enum {YES = 0, MAYBE = 1, NO = 2} QuickcheckResult;
Contributor:

Maybe makeunicodedata.py should output this enum (with better name namespacing) then?
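
For example, the generated header could carry something along these lines (purely illustrative -- the generator isn't touched in this PR, and these names are made up):

    /* Hypothetical makeunicodedata.py output; names are illustrative only,
     * values mirror the enum quoted above. */
    typedef enum {
        UNICODEDATA_QUICKCHECK_YES = 0,
        UNICODEDATA_QUICKCHECK_MAYBE = 1,
        UNICODEDATA_QUICKCHECK_NO = 2
    } UnicodedataQuickcheckResult;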

Contributor (author):

Ah, good idea!

Do you think it'd be cleaner to add that to this PR, or make a small separate one?

Contributor:

Could be done later, I think.

gnprice reacted with thumbs up emoji
@@ -208,6 +208,8 @@ def test_issue29456(self):
self.assertEqual(self.db.normalize('NFC', u11a7_str_a), u11a7_str_b)
self.assertEqual(self.db.normalize('NFC', u11c3_str_a), u11c3_str_b)

# For tests of unicodedata.is_normalized / self.db.is_normalized ,
# see test_normalization.py .
Contributor:

ugh, I would support merging test_normalization into this file for clarity.

Contributor (author):

Yeah; as it is, it actually caused me to think this function had evaded being tested at all -- I'd gone as far as to write up some tests myself before I spotted that other file.

A bit of context on someone's thinking in having a separate file can be found above at line 178:

        # The rest can be found in test_normalization.py
        # which requires an external file.

I don't see why that means the test code should be in a separate file, though. There's already a try/skip mechanism to deal with the external file being unavailable. (I'm guessing that if I looked in the history I'd find that that mechanism wasn't there when the separate file was first added.)

Happy to send a PR to fold that file in.

*/
static QuickcheckResult
is_normalized_quickcheck(PyObject *self, PyObject *input,
int nfc, int k, int yes_only)
Contributor:

These "boolean int" parameters could be actual bools.

Contributor (author):

Oh wow it's the future! 🙂 I guess that's a C99 feature, and one of those we now accept using. Good to know.

I'm not sure I want to convert the existing two parameters within this PR; it seems like that could muddy the diff a bit with changes that are orthogonal to what it's really doing. But I guess I can see how it looks.

Contributor (author):

Pushed a change to bool for the new parameter.

I was a bit surprised to find that I needed to add an #include <stdbool.h> line to the file! Perhaps that should be in a central include file, in the same way as limits.h and inttypes.h are. That's certainly orthogonal to this PR, though.
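
Concretely, the declaration ends up looking something like this (a sketch of just the signature -- per the diff quoted above, nfc and k stay int at this point):

    #include "Python.h"
    #include <stdbool.h>

    typedef enum { YES = 0, MAYBE = 1, NO = 2 } QuickcheckResult;

    /* Declaration-only sketch: only the new flag uses bool here. */
    QuickcheckResult
    is_normalized_quickcheck(PyObject *self, PyObject *input,
                             int nfc, int k, bool yes_only);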

Contributor (author):

I made a quick further commit converting nfc and k to bool too (including in the callers, and using true/false rather than 1/0 for literal values).

I'm ambivalent about adding it here. It expands the diff a bit -- was:
1 file changed, 51 insertions(+), 24 deletions(-)
and becomes:
1 file changed, 57 insertions(+), 30 deletions(-)
Slightly inclined to leave it out, but if you'd rather include it I'll do that.

Contributor:

What makes the diffstat larger? Aren't you already touching all callers, since you added a parameter?

Contributor:

I think we want to avoid polluting public headers with a stdbool.h include -- bool is a bit of a name grab in arbitrary mature C codebases. Using stdbool.h where it's needed is also consistent with IWYU.

Contributor (author):

I think we want to avoid polluting public headers with a stdbool.h include -- bool is a bit of a name grab in arbitrary mature C codebases.

Ah, fair enough. I think that's consistent with putting it in a generic internal-only header, though. Within the tree, the very few existing uses are either stdbool, or (in Modules/expat/xmltok.c) a sort of manual stdbool.

Using stdbool.h where it's needed is also consistent with IWYU.

Hmm true. I guess IWYU doesn't feel like how this codebase is mostly written... but maybe it's best to make it more so.

Contributor (author):

What makes the diffstat larger? Aren't you already touching all callers, since you added a parameter?

Here's the extra commit: gnprice@896c970f5
The lines not otherwise touched are where we declare and then fiddle with the nfc and k locals in unicodedata_UCD_is_normalized_impl.

Contributor:

Okay, we can fix up the rest in a sequel.

* normalized or certainly not, and MAYBE if quickcheck is unable to
* tell.
*
* If `yes_only` is true, then return MAYBE as soon as we determine
Contributor:

yes_only is a bizarre name for a parameter that causes the function to return MAYBE.

Contributor (author):

Yeah, I'm not really a fan of this name; I'd be glad for a better one.

The idea is that this flag says the caller doesn't care about NO vs. MAYBE; it only cares whether the answer is YES.
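
Put as code, the effect on the scan is roughly this (a sketch with a toy lookup standing in for the real Quick_Check table -- not the code in the PR):

    #include <stdbool.h>
    #include <stddef.h>

    typedef enum { YES = 0, MAYBE = 1, NO = 2 } QuickcheckResult;

    /* Toy Quick_Check lookup so the sketch stands on its own. */
    static QuickcheckResult
    toy_quickcheck(unsigned long ch)
    {
        return ch == 0xF900 ? NO : YES;
    }

    static QuickcheckResult
    scan_sketch(const unsigned long *str, size_t len, bool yes_only)
    {
        QuickcheckResult result = YES;
        for (size_t i = 0; i < len; i++) {
            QuickcheckResult qc = toy_quickcheck(str[i]);
            if (yes_only && qc != YES) {
                return MAYBE;     /* caller only cares whether the answer is YES */
            }
            if (qc == NO) {
                return NO;        /* full algorithm: report a definite NO */
            }
            if (qc == MAYBE) {
                result = MAYBE;   /* full algorithm: remember it and keep scanning */
            }
        }
        return result;
    }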

Contributor:

Maybe invert the sense and call it want_definite, meaning "process the string until we're sure it's not normalized or we reach the end"?

Contributor (author):

The algorithm in the standard is the yes_only = false (or want_definite = true) case, so I think things are simplest to describe if that's the baseline and the other case is described by contrast with it. Could certainly handle that with some added words, though.

Another aspect is that there is an asymmetry between a definite YES and a definite NO. In the yes_only / !want_definite case, what the caller really most wants is a definite YES... it's just a definite NO that is no more helpful than a MAYBE. Or another angle on that: if the caller really didn't care about getting a definite answer, it could just assume MAYBE and not bother calling the helper 😉 It's in the hope of getting a YES that it makes the call.

More ideas just now: perhaps maybe_is_no, to mean "a MAYBE is no worse than a NO"?

Or yes_or_maybe, for "there are only two distinct answers, YES and MAYBE". That's basically what I was going for with yes_only, but probably a less paradoxical way of putting it 🙂

Member:

The API also surprised me. I suggest to rename is_normalized_quickcheck() to is_normalized_impl() and add 2 functions: is_normalized() (yes_only=false) and is_normalized_quickcheck() (yes_only=true).

Contributor (author):

I suggest to rename is_normalized_quickcheck() to is_normalized_impl() and add 2 functions: is_normalized() (yes_only=false) and is_normalized_quickcheck() (yes_only=true).

Hmm, I like the idea of covering over some of the multiplication of flags with wrapper functions!

I don't quite like these names, though. First, I don't think is_normalized is a good name, for either value of the flag, because the function doesn't actually take full responsibility for finding out if the string is normalized -- it can return MAYBE, and then it's up to the caller to go and normalize the string and compare.

What the function does do is implement a quick but partial check to try to see if the string is normalized. Specifically, it implements one that's defined in the standard, which the standard calls quickCheck in its sample code, and which uses a property in the UCD called Quick_Check.

More precisely, the yes_only=false case implements that standard quick-check algorithm; the yes_only=true (or yes_or_maybe=true, etc.) case (which is the version in master) is a variation on it. I think the name "quickcheck" or is_normalized_quickcheck, with no further qualifications, sounds like it will mean the standard's algorithm, and it's confusing if it means a different algorithm instead.

Here's another possible set of names to use with the same idea:

  • rename this function to is_normalized_quickcheck_impl
  • is_normalized_quickcheck is the standard's quickcheck (yes_only/yes_or_maybe=false)
  • is_normalized_quickercheck is the variant that's more impatient to return and hand responsibility back to the caller (yes_or_maybe=true)
    • Or is_normalized_quickcheck_quicker

Or if those names feel too long: is_normalized_qc_impl, is_normalized_qc, is_normalized_qc_quicker.

How does that sound?
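
As a sketch, the wrapper idea with those names would look roughly like this (hypothetical -- the shared implementation is only a prototype here, and the PR as merged kept the single flag instead):

    #include "Python.h"
    #include <stdbool.h>

    typedef enum { YES = 0, MAYBE = 1, NO = 2 } QuickcheckResult;

    /* Shared implementation; prototype only, the real scan lives in the module. */
    QuickcheckResult
    is_normalized_quickcheck_impl(PyObject *self, PyObject *input,
                                  int nfc, int k, bool yes_only);

    /* The standard's quickCheck: scans the whole string, so it can answer NO. */
    static QuickcheckResult
    is_normalized_quickcheck(PyObject *self, PyObject *input, int nfc, int k)
    {
        return is_normalized_quickcheck_impl(self, input, nfc, k, false);
    }

    /* The "quicker" variant: gives up with MAYBE as soon as YES is ruled out. */
    static QuickcheckResult
    is_normalized_quickercheck(PyObject *self, PyObject *input, int nfc, int k)
    {
        return is_normalized_quickcheck_impl(self, input, nfc, k, true);
    }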

Contributor (author):

Or yes_or_maybe, for "there are only two distinct answers, YES and MAYBE".

Another idea for the flag name (though its name becomes less important if we use Victor's idea of a pair of wrapper functions): yes_vs_maybe.

Contributor:

It doesn't seem like anything is standing out as dramatically better. At least the semantics are clearly documented and easily read from the function.

So, let's go with the current state.

benjaminp merged commit 2f09413 into python:master on Sep 4, 2019
miss-islington (Contributor):

Thanks @gnprice for the PR, and @benjaminp for merging it 🌮🎉. I'm working now to backport this PR to: 3.8.
🐍🍒⛏🤖

bedevere-bot:

GH-15671 is a backport of this pull request to the 3.8 branch.

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request on Sep 4, 2019:
bpo-37966: Fully implement the UAX #15 quick-check algorithm. (pythonGH-15558) (cherry picked from commit 2f09413)
miss-islington added a commit that referenced this pull request on Sep 4, 2019:
bpo-37966: Fully implement the UAX #15 quick-check algorithm. (GH-15558) (cherry picked from commit 2f09413)
gnprice deleted the pr-quickcheck branch on September 6, 2019 03:23
gnprice (Contributor, author):

Thanks for all the reviews and the merge!

@benjaminp, I'll go on and send next some of those followups you suggested.

gnprice (Contributor, author):

@benjaminp, I'll go on and send next some of those followups you suggested.

Filed bpo-38043 for that, and sent PRs for two of them.

lisroach pushed a commit to lisroach/cpython that referenced this pull request on Sep 10, 2019:
bpo-37966: Fully implement the UAX #15 quick-check algorithm. (pythonGH-15558)
DinoV pushed a commit to DinoV/cpython that referenced this pull request on Jan 14, 2020:
bpo-37966: Fully implement the UAX #15 quick-check algorithm. (pythonGH-15558)
websurfer5 pushed a commit to websurfer5/cpython that referenced this pull request on Jul 20, 2020:
bpo-37966: Fully implement the UAX #15 quick-check algorithm. (pythonGH-15558)
Reviewers

vstinner left review comments
benjaminp approved these changes
Awaiting requested review from serhiy-storchaka

6 participants
@gnprice, @miss-islington, @bedevere-bot, @vstinner, @benjaminp, @the-knights-who-say-ni
