I have cleaned up the changes and ensured the behavior remains the same; however, there are still a few points I need input on from @malemburg
(And, as Benedikt said, these should be their own issue.)

This function is documented as taking strings, but during the 2->3 conversion an undocumented and untested change was made that allowed it to accept bytes. I have kept it this way (in Python, to make removal simpler), though I think this should either be documented and tested, or removed.
The function has been documented as ASCII-only, and for bytes, it is. However, for strings, this has not been enforced with an error. What should we do?
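
A minimal sketch of the two points above. `normalize_sketch` is a hypothetical stand-in modeled on the pure-Python loop, not the stdlib function itself, and it assumes the behavior described in this thread: bytes are decoded as ASCII (so non-ASCII bytes raise), while non-ASCII characters in a `str` are dropped without an error.

```python
def normalize_sketch(encoding):
    """Hypothetical stand-in for encodings.normalize_encoding (illustration only)."""
    if isinstance(encoding, bytes):
        # Bytes input: decoding as ASCII is what enforces the ASCII-only contract.
        encoding = str(encoding, "ascii")

    chars = []
    punct = False
    for c in encoding:
        if c.isalnum() or c == ".":
            # Collapse any run of punctuation seen so far into one underscore.
            if punct and chars:
                chars.append("_")
            # Non-ASCII alphanumerics are skipped, not rejected.
            if c.isascii():
                chars.append(c)
            punct = False
        else:
            punct = True
    return "".join(chars)


# str input: the non-ASCII character is dropped and no error is raised.
str_result = normalize_sketch("utf\xe9-8")
print(str_result)  # utf_8

# bytes input: the same non-ASCII byte fails at the ASCII decoding step.
try:
    normalize_sketch(b"utf\xe9-8")
    bytes_rejected = False
except UnicodeDecodeError:
    bytes_rejected = True
print(bytes_rejected)  # True
```

The asymmetry is the open question: either strings should raise as well, or the documentation should state that non-ASCII characters are ignored.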

StanFromIreland added 2 commits

July 14, 2025 10:29

Clean up tests

3660160

Remove unnecessary message

4e12b9e

StanFromIreland marked this pull request as ready for review

July 14, 2025 12:54

bedevere-appbot added the awaiting review label

Jul 14, 2025

StanFromIreland requested a review from picnixz

July 14, 2025 12:54

ZeroIntensity reviewed

Jul 15, 2025

ZeroIntensity left a comment

Would you mind running some microbenchmarks?

Modules/_codecsmodule.c (outdated, resolved)

Modules/_codecsmodule.c (resolved)

Lib/encodings/__init__.py (outdated, resolved)

picnixz reviewed

Jul 15, 2025

Lib/test/test_codecs.py (outdated, resolved)

Lib/encodings/__init__.py (outdated, resolved)

Review

1c9e55a

StanFromIreland commented Jul 15, 2025

Benchmarks:

script

import time
from encodings import normalize_encoding

import pyperf


def bench(loops):
    range_it = range(loops)
    t0 = time.perf_counter()
    for _ in range_it:
        normalize_encoding('utf_8')
        normalize_encoding('utf\xE9\u20AC\U0010ffff-8')
        normalize_encoding('utf   8')
        normalize_encoding('%%%~')
        normalize_encoding('UTF...8')
    return time.perf_counter() - t0


runner = pyperf.Runner()
runner.bench_time_func('normalize_encoding', bench, inner_loops=30)

Main branch:

normalize_encoding: Mean +- std dev: 173 ns +- 7 ns

This PR:

normalize_encoding: Mean +- std dev: 42.9 ns +- 1.1 ns
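
For reference, the ratio between the two reported means works out as follows. This is a back-of-the-envelope calculation that ignores the standard deviations; the variable names are illustrative only.

```python
main_ns = 173.0  # mean on the main branch, from the results above
pr_ns = 42.9     # mean with this PR, from the results above

speedup = main_ns / pr_ns
print(f"{speedup:.1f}x faster")  # 4.0x faster
```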

ZeroIntensity reviewed

Jul 15, 2025

Modules/_codecsmodule.c

		return NULL;
		}

		char *normalized = PyMem_Malloc(len + 1);

ZeroIntensity Jul 15, 2025

It'd be possible to use a VLA for this, but I'm not sure it's worth the additional complexity. @picnixz WDYT?

malemburg commented Jul 16, 2025

Sorry for the lack of response. I'm currently at EuroPython and pretty busy with other things. I'll have a look on Saturday during the sprints.

This was referenced Jul 16, 2025

Improve encodings.normalize behaviour or docs #136702

Open

gh-136736: Fix handling alphanumerical non-ASCII characters in encodings.normalize_encoding() #136737

Open

serhiy-storchaka self-requested a review

July 17, 2025 09:16

ZeroIntensity reviewed

Jul 18, 2025

Modules/_codecsmodule.c

		return NULL;
		}

		if (PyUnicodeWriter_WriteUTF8(writer, normalized, (Py_ssize_t)strlen(normalized)) < 0) {

ZeroIntensity Jul 18, 2025

The size shouldn't need to be recalculated here. It's always len + 1, right?

StanFromIreland Jul 18, 2025

cpython/Lib/test/test_codecs.py

Lines 3901 to 3902 in 28937d3

	self.assertEqual(normalize('utf\xE9\u20AC\U0010ffff-8'), 'utf_8')
	self.assertEqual(normalize('utf 8'), 'utf_8')

No, the length must be recalculated to match the current behavior, where the normalized name can be shorter than the input.
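
A small sketch of why the length cannot be assumed, using the existing `encodings.normalize_encoding` with ASCII-only inputs taken from the benchmark script earlier in this thread. Only ASCII inputs are used because the handling of non-ASCII characters is still under discussion; the `lengths` dictionary is an illustrative name.

```python
from encodings import normalize_encoding

# Map each input to (input length, normalized name, normalized length).
lengths = {}
for name in ("utf_8", "utf   8", "%%%~"):
    normalized = normalize_encoding(name)
    lengths[name] = (len(name), normalized, len(normalized))
    print(f"{name!r} ({len(name)}) -> {normalized!r} ({len(normalized)})")
```

Runs of punctuation collapse into a single underscore and leading or trailing punctuation disappears entirely, so the output length depends on the content of the input, not only on its length.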

Labels

awaiting review

4 participants

gh-55531: Implement `normalize_encoding` in C #136643
