
gh-136595: Normalize surrogate pairs in REPL input to fix UnicodeEnco…#136639


Draft

vedant713 wants to merge 3 commits into python:main from vedant713:patch-4

vedant713 commented Jul 14, 2025 (edited by bedevere-app bot)

The new REPL implementation (_pyrepl) crashes on Windows when the user inputs Unicode characters outside the Basic Multilingual Plane (≥ U+10000), such as emoji (e.g. 🐍). This happens because the Windows input layer provides surrogate pairs (UTF-16 code units) that _pyrepl attempts to process and tokenize directly, leading to unpaired surrogate handling issues.

This commit introduces a `normalize_surrogates()` helper in `Reader` to explicitly normalize surrogate pairs by encoding to UTF-16 with 'surrogatepass' and decoding back. The `get_unicode()` method is patched to use this normalization so that any code consuming REPL input (e.g. syntax highlighting via tokenize) receives valid Unicode text.
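The round trip described above can be sketched roughly as follows. This is a standalone illustration, not the code from the PR diff: the function here is a free function rather than a `Reader` method, and the choice of `utf-16-le` (to avoid a byte-order mark) is an assumption.

```python
def normalize_surrogates(text: str) -> str:
    """Combine UTF-16 surrogate pairs in `text` into single code points.

    Illustrative sketch only; the actual helper in the PR may differ.
    """
    # 'surrogatepass' lets the encoder emit surrogate code points as raw
    # UTF-16 code units instead of raising UnicodeEncodeError.
    raw = text.encode("utf-16-le", "surrogatepass")
    # A strict decode joins each valid high/low pair into one code point.
    # It raises UnicodeDecodeError if the input has a lone surrogate.
    return raw.decode("utf-16-le")


# U+1F40D (snake emoji) as it might arrive from a UTF-16 input layer:
# a high surrogate followed by a low surrogate.
paired = "\ud83d\udc0d"
joined = normalize_surrogates(paired)
print(len(paired), len(joined), hex(ord(joined)))
```

Text without surrogates passes through unchanged, so applying such a helper to every input string would be harmless for BMP-only input.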

This resolves UnicodeEncodeError crashes in the REPL when typing emoji or other non-BMP characters on Windows.

Fixes #136595

Issue: Unicode characters ≥ 0x10000 cannot be inputted/behaves unusually at the REPL terminal. #136595

gh-136595: Normalize surrogate pairs in REPL input to fix UnicodeEncodeError on Windows

b9c4f77

bedevere-app bot commented Jul 14, 2025

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

bedevere-app bot mentioned this pull request on Jul 14, 2025:

Unicode characters ≥ 0x10000 cannot be inputted/behaves unusually at the REPL terminal. #136595 (Open)

blurb-it bot and others added 2 commits

July 14, 2025 01:27

📜🤖 Added by blurb_it.

a567845

Update reader.py

7a31a1f

serhiy-storchaka (Member) commented Jul 16, 2025

This implementation fails if there are lone surrogate characters. Even after fixing this, it will not completely solve the original issue for the case of lone surrogate characters -- we need to handle this at the encoding to UTF-8 step.

See also a different (regular expression based) implementation in #121219.
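To make the lone-surrogate concern concrete, here is a small sketch. It is not taken from this PR or from #121219; all function names are hypothetical, and `backslashreplace` is just one possible policy for the UTF-8 step. It shows a strict UTF-16 round trip failing on a lone surrogate, a regular-expression join that leaves lone surrogates untouched, and why the UTF-8 encoding step still needs its own handling.

```python
import re

# A high surrogate immediately followed by a low surrogate.
_SURROGATE_PAIR = re.compile(r"[\ud800-\udbff][\udc00-\udfff]")


def strict_round_trip(text: str) -> str:
    # Assumed shape of an encode/decode normalization with a strict decode.
    return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le")


def join_surrogate_pairs(text: str) -> str:
    # Hypothetical regex-based variant: only well-formed pairs are replaced,
    # so lone surrogates pass through instead of raising.
    def _join(match: re.Match) -> str:
        high, low = match.group()
        return chr(0x10000 + ((ord(high) - 0xD800) << 10) + (ord(low) - 0xDC00))

    return _SURROGATE_PAIR.sub(_join, text)


def encode_utf8_tolerant(text: str) -> bytes:
    # One possible policy for the UTF-8 step: escape what cannot be encoded.
    return text.encode("utf-8", "backslashreplace")


lone = "a\ud83db"  # high surrogate with no matching low surrogate

try:
    strict_round_trip(lone)
except UnicodeDecodeError as exc:
    print("strict round trip failed:", exc.reason)

still_lone = join_surrogate_pairs(lone)  # unchanged, no exception

try:
    still_lone.encode("utf-8")
except UnicodeEncodeError as exc:
    print("UTF-8 encode failed:", exc.reason)

print(encode_utf8_tolerant(still_lone))
```

The regex variant avoids the exception, but the lone surrogate survives in the string, which is why the encoding step is where the remaining failure shows up.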

Labels

None yet

2 participants
