Uh oh!
There was an error while loading.Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork32.4k
gh-136595: Normalize surrogate pairs in REPL input to fix UnicodeEnco…#136639
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
base:main
Are you sure you want to change the base?
Uh oh!
There was an error while loading.Please reload this page.
Conversation
…deEncodeError on WindowsThe new REPL implementation (_pyrepl) crashes on Windows when the user inputs Unicode characters outside the Basic Multilingual Plane (≥ U+10000), such as emoji (e.g. 🐍). This happens because the Windows input layer provides surrogate pairs (UTF-16 code units) that _pyrepl attempts to process and tokenize directly, leading to unpaired surrogate handling issues.This commit introduces a `normalize_surrogates()` helper in `Reader` to explicitly normalize surrogate pairs by encoding to UTF-16 with 'surrogatepass' and decoding back. The `get_unicode()` method is patched to use this normalization so that any code consuming REPL input (e.g. syntax highlighting via tokenize) receives valid Unicode text.This resolves UnicodeEncodeError crashes in the REPL when typing emoji or other non-BMP characters on Windows.Fixespython#136595
Most changes to Pythonrequire a NEWS entry. Add one using theblurb_it web app or theblurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
This implementation fails if there are lone surrogate characters. Even after fixing this, it will not completely solve the original issue for the case of lone surrogate characters -- we need to handle this at the encoding to UTF-8 step. See also a different (regular expression based) implementation in#121219. |
Uh oh!
There was an error while loading.Please reload this page.
The new REPL implementation (_pyrepl) crashes on Windows when the user inputs Unicode characters outside the Basic Multilingual Plane (≥ U+10000), such as emoji (e.g. 🐍). This happens because the Windows input layer provides surrogate pairs (UTF-16 code units) that _pyrepl attempts to process and tokenize directly, leading to unpaired surrogate handling issues.
This commit introduces a
normalize_surrogates()
helper inReader
to explicitly normalize surrogate pairs by encoding to UTF-16 with 'surrogatepass' and decoding back. Theget_unicode()
method is patched to use this normalization so that any code consuming REPL input (e.g. syntax highlighting via tokenize) receives valid Unicode text.This resolves UnicodeEncodeError crashes in the REPL when typing emoji or other non-BMP characters on Windows.
Fixes#136595