Base implementation of _tokenize module #6240
Conversation
coderabbitai bot commented Nov 8, 2025 (edited)
Important: Review skipped. Review was skipped due to path filters.

⛔ Files ignored due to path filters (2)

CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters. For example, including … You can disable this status message by setting the …

Walkthrough

Reorders and augments Ruff-related dependencies in top-level …

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant PyUser as Python user
    participant TokenIter as TokenizerIter (Py)
    participant State as PyTokenizerIterState
    participant Parser as Ruff parser

    PyUser->>TokenIter: next() / __next__()
    TokenIter->>TokenIter: check internal state & tokens
    alt tokens exhausted
        TokenIter->>PyUser: call readline(vm) -> line
        alt readline returns line
            TokenIter->>State: push_line(line)
            State->>Parser: parse_unchecked_source(line...)
            Parser-->>State: tokens
            State->>TokenIter: next_token()
            TokenIter->>TokenIter: token_kind_value(kind)
            alt extra_tokens and operator
                TokenIter->>TokenIter: remap -> TOKEN_OP (55)
            end
            TokenIter-->>PyUser: (type, string, start, end, current_line)
        else readline -> EOF
            TokenIter->>State: mark EOF
            TokenIter-->>PyUser: EndOfFile tuple (empty string + positions)
        end
    else tokens available
        TokenIter-->>PyUser: next token tuple
    end
```

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
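To make the flow above concrete, here is a rough Python-level sketch of how a consumer would drive the iterator described in the walkthrough; the import path and constructor surface are assumptions taken from this summary and the review below, not a confirmed API:

```python
# Sketch only: assumes the _tokenize.TokenizerIter surface described in the
# walkthrough (readline callable in, 5-tuples out); not a verified signature.
import io
from _tokenize import TokenizerIter  # hypothetical import of the new module

readline = io.StringIO("x = 1\n").readline
for tok_type, string, start, end, line in TokenizerIter(readline, extra_tokens=False):
    # Each item mirrors tokenize.TokenInfo fields: type, string, start, end, line.
    print(tok_type, repr(string), start, end, repr(line))
```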
Suggested reviewers
Poem
Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 4
📜 Review details
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (15)
- Cargo.lock is excluded by !**/*.lock
- Lib/test/test_tokenize.py is excluded by !Lib/**
- Lib/test/tokenizedata/__init__.py is excluded by !Lib/**
- Lib/test/tokenizedata/bad_coding.py is excluded by !Lib/**
- Lib/test/tokenizedata/bad_coding2.py is excluded by !Lib/**
- Lib/test/tokenizedata/badsyntax_3131.py is excluded by !Lib/**
- Lib/test/tokenizedata/badsyntax_pep3120.py is excluded by !Lib/**
- Lib/test/tokenizedata/coding20731.py is excluded by !Lib/**
- Lib/test/tokenizedata/tokenize_tests-latin1-coding-cookie-and-utf8-bom-sig.txt is excluded by !Lib/**
- Lib/test/tokenizedata/tokenize_tests-no-coding-cookie-and-utf8-bom-sig-only.txt is excluded by !Lib/**
- Lib/test/tokenizedata/tokenize_tests-utf8-coding-cookie-and-no-utf8-bom-sig.txt is excluded by !Lib/**
- Lib/test/tokenizedata/tokenize_tests-utf8-coding-cookie-and-utf8-bom-sig.txt is excluded by !Lib/**
- Lib/test/tokenizedata/tokenize_tests.txt is excluded by !Lib/**
- Lib/token.py is excluded by !Lib/**
- Lib/tokenize.py is excluded by !Lib/**
📒 Files selected for processing (4)
- Cargo.toml (1 hunks)
- stdlib/Cargo.toml (1 hunks)
- stdlib/src/lib.rs (2 hunks)
- stdlib/src/tokenize.rs (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.rs: Format Rust code with the default rustfmt style (run cargo fmt)
Run clippy and fix any warnings or lints introduced by your changes
Follow Rust best practices for error handling and memory management
Files:
stdlib/src/lib.rs, stdlib/src/tokenize.rs
{vm,stdlib}/**/*.rs
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use RustPython macros (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust
Files:
stdlib/src/lib.rs, stdlib/src/tokenize.rs
🧠 Learnings (4)
📓 Common learnings
Learnt from: CR | Repo: RustPython/RustPython | PR: 0 | File: .github/copilot-instructions.md:0-0 | Timestamp: 2025-09-11T05:06:59.971Z | Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust

📚 Learning: 2025-09-11T05:06:59.971Z
Learnt from: CR | Repo: RustPython/RustPython | PR: 0 | File: .github/copilot-instructions.md:0-0 | Timestamp: 2025-09-11T05:06:59.971Z | Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust
Applied to files:
stdlib/src/lib.rsstdlib/src/tokenize.rs
📚 Learning: 2025-09-11T05:06:59.971Z
Learnt from: CR | Repo: RustPython/RustPython | PR: 0 | File: .github/copilot-instructions.md:0-0 | Timestamp: 2025-09-11T05:06:59.971Z | Learning: Applies to Lib/**/*.py : Minimize modifications to CPython standard library files; only make minimal changes to work around RustPython limitations
Applied to files:
stdlib/Cargo.tomlstdlib/src/tokenize.rsCargo.toml
📚 Learning: 2025-09-11T05:06:59.971Z
Learnt from: CR | Repo: RustPython/RustPython | PR: 0 | File: .github/copilot-instructions.md:0-0 | Timestamp: 2025-09-11T05:06:59.971Z | Learning: Applies to extra_tests/**/*.py : Use ruff to lint custom Python code
Applied to files:
Cargo.toml
🧬 Code graph analysis (2)
stdlib/src/lib.rs (2)
Lib/tokenize.py (1)
tokenize (466-492)
stdlib/src/socket.rs (1)
make_module(7-11)
stdlib/src/tokenize.rs (2)
vm/src/stdlib/io.rs (1)
vm (3324-3326)
Lib/test/test_tokenize.py (4)
readline (1329-1335), readline (1350-1356), readline (1576-1581), readline (2106-2107)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: Run snippets and cpython tests (ubuntu-latest)
- GitHub Check: Run tests under miri
- GitHub Check: Run snippets and cpython tests on wasm-wasi
- GitHub Check: Check the WASM package and demo
- GitHub Check: Run snippets and cpython tests (macos-latest)
- GitHub Check: Run snippets and cpython tests (windows-latest)
- GitHub Check: Check Rust code with rustfmt and clippy
- GitHub Check: Run rust tests (windows-latest)
- GitHub Check: Run rust tests (ubuntu-latest)
- GitHub Check: Run rust tests (macos-latest)
- GitHub Check: Ensure compilation on various targets
```rust
if nline.is_empty() {
    state.eof = true;
    *zelf.state.write() = state.clone();

    let line_num = &state.start().0;
    let out = vm
        .ctx
        .new_tuple(vec![
            token_kind_value(TokenKind::EndOfFile).to_pyobject(vm),
            vm.ctx.new_str("").into(),
            vm.ctx
                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
                .into(),
            vm.ctx
                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
                .into(),
            vm.ctx.new_str(state.current_line()).into(),
        ])
        .into();
    return Ok(PyIterReturn::Return(out));
}
state.push_line(&nline);
};
```
End-of-file tuple needs to come from the real EOF token.
Synthesizing the tuple here hardcodes (line, -1) and reuses the previous line text, so consumers see the wrong coordinates and line content. Grab the parser's EndOfFile token, set it as prev_token, and let the existing emission path produce the tuple.
```diff
-if nline.is_empty() {
-    state.eof = true;
-    *zelf.state.write() = state.clone();
-
-    let line_num = &state.start().0;
-    let out = vm
-        .ctx
-        .new_tuple(vec![
-            token_kind_value(TokenKind::EndOfFile).to_pyobject(vm),
-            vm.ctx.new_str("").into(),
-            vm.ctx
-                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                .into(),
-            vm.ctx
-                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                .into(),
-            vm.ctx.new_str(state.current_line()).into(),
-        ])
-        .into();
-    return Ok(PyIterReturn::Return(out));
-}
+if nline.is_empty() {
+    state.eof = true;
+    if let Some(eof) = state
+        .tokens
+        .iter()
+        .rev()
+        .find(|token| token.kind() == TokenKind::EndOfFile)
+    {
+        state.prev_token = Some(*eof);
+        break eof.clone();
+    }
+    return Ok(PyIterReturn::StopIteration(None));
+}
```
Committable suggestion skipped: line range outside the PR's diff.
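For comparison, CPython's tokenize reports the end-of-file token with a real (line, column) position and an empty line string rather than a -1 column; a quick standard-library sketch to cross-check the expected tuple shape:

```python
# Reference check against CPython's tokenize module: the trailing NEWLINE and
# ENDMARKER carry concrete coordinates (column 0) and line text, never -1.
import io
import tokenize

tokens = list(tokenize.generate_tokens(io.StringIO("x = 1\n").readline))
for tok in tokens[-2:]:
    print(tokenize.tok_name[tok.type], repr(tok.string), tok.start, tok.end, repr(tok.line))
# Expected shape: NEWLINE   '\n' (1, 5) (1, 6) 'x = 1\n'
#                 ENDMARKER ''   (2, 0) (2, 0) ''
```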
🤖 Prompt for AI Agents
In stdlib/src/tokenize.rs around lines 116 to 138, don't synthesize the EOF tuple here (which hardcodes (line, -1) and reuses the previous line text); instead obtain the parser's EndOfFile token, set it as the state's prev_token (or otherwise set prev_token to that EOF token), set state.eof = true and persist the state as before, then allow the existing token-emission path to produce and return the EOF tuple so the coordinates and line content come from the real EOF token rather than being hardcoded.

```rust
if !zelf.extra_tokens {
    token_repr = token_repr.trim();
}
```
Don't trim token text when extra_tokens is false.
trim() wipes indentation whitespace (and other significant padding), so INDENT tokens end up empty strings. CPython preserves the exact lexeme. Just use the slice as-is.
```diff
-let mut token_repr = &state.source[state.range()];
-if !zelf.extra_tokens {
-    token_repr = token_repr.trim();
-}
+let token_repr = &state.source[state.range()];
```
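As a behavioral reference for why the lexeme must be kept verbatim, CPython's tokenize reports the INDENT token's string as the literal leading whitespace; a small sketch:

```python
# CPython reference: the INDENT token's string is the raw indentation ("    "),
# which a trim() on the Rust side would turn into an empty string.
import io
import tokenize

src = "if True:\n    pass\n"
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    if tok.type == tokenize.INDENT:
        print(repr(tok.string))  # expected: '    '
```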
🤖 Prompt for AI Agents
In stdlib/src/tokenize.rs around lines 152 to 154, token text is being trimmed when zelf.extra_tokens is false, which removes significant whitespace (e.g., INDENT lexemes); remove the call to trim() and use the slice as-is so the original lexeme (including leading/trailing whitespace) is preserved when extra_tokens is false.

```rust
    extra_tokens: bool,
    #[pyarg(named, optional)]
```
extra_tokens must default to False.
CPython's TokenizerIter allows calling TokenizerIter(readline) without naming the flag. Requiring the argument here breaks that API.
```diff
-#[pyarg(named)]
+#[pyarg(named, default = "false")]
 extra_tokens: bool,
```
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In stdlib/src/tokenize.rs around lines 179-180, the extra_tokens parameter currently requires being passed and therefore breaks CPython's TokenizerIter API; make extra_tokens default to False so it can be omitted. Update the pyarg declaration to make the argument optional with a default of false (or change the parameter to Option<bool> and treat None as false) so TokenizerIter(readline) works without naming this flag.
Actionable comments posted: 2
♻️ Duplicate comments (4)
stdlib/src/tokenize.rs (4)
39-42: Don't swallow non-StopIteration errors from readline. Catching every exception and treating it as EOF hides real failures (TypeError, UnicodeDecodeError, etc.). Only StopIteration should terminate the stream; all other errors must propagate. Apply this diff to fix the error handling:
```diff
-let raw_line = match self.readline.invoke((), vm) {
-    Ok(v) => v,
-    Err(_) => return Ok(String::new()),
-};
+let raw_line = match self.readline.invoke((), vm) {
+    Ok(v) => v,
+    Err(exc) => {
+        if exc.fast_isinstance(vm.ctx.exceptions.stop_iteration) {
+            return Ok(String::new());
+        }
+        return Err(exc);
+    }
+};
```
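A small CPython-side sketch of the behavior this diff aims for: only StopIteration from readline ends the stream, while other exceptions should surface to the caller (exact propagation details may vary by CPython version):

```python
# Behavioral sketch: a readline that raises a non-StopIteration error should
# not be silently treated as EOF by the tokenizer.
import tokenize

def exploding_readline():
    raise ValueError("boom")

try:
    list(tokenize.generate_tokens(exploding_readline))
except ValueError as exc:
    print("propagated as expected:", exc)
```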
116-138: End-of-file tuple needs to come from the real EOF token. Synthesizing the EOF tuple here hardcodes (line, -1) and reuses the previous line text, so consumers see the wrong coordinates and line content. Instead, retrieve the parser's EndOfFile token, set it as prev_token, and let the existing emission path produce the correct tuple. Apply this diff to use the parser's EOF token:
```diff
 if nline.is_empty() {
     state.eof = true;
-    *zelf.state.write() = state.clone();
-
-    let line_num = &state.start().0;
-    let out = vm
-        .ctx
-        .new_tuple(vec![
-            token_kind_value(TokenKind::EndOfFile).to_pyobject(vm),
-            vm.ctx.new_str("").into(),
-            vm.ctx
-                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                .into(),
-            vm.ctx
-                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                .into(),
-            vm.ctx.new_str(state.current_line()).into(),
-        ])
-        .into();
-    return Ok(PyIterReturn::Return(out));
+    if let Some(eof) = state
+        .tokens
+        .iter()
+        .rev()
+        .find(|token| token.kind() == TokenKind::EndOfFile)
+    {
+        state.prev_token = Some(*eof);
+        break *eof;
+    }
+    *zelf.state.write() = state.clone();
+    return Ok(PyIterReturn::StopIteration(None));
 }
```
151-154: Don't trim token text when extra_tokens is false. trim() removes indentation whitespace and other significant padding, causing INDENT tokens to become empty strings. CPython preserves the exact lexeme. Use the slice as-is without trimming. Apply this diff to preserve the exact lexeme:
```diff
-let mut token_repr = &state.source[state.range()];
-if !zelf.extra_tokens {
-    token_repr = token_repr.trim();
-}
+let token_repr = &state.source[state.range()];
```
178-180: extra_tokens must default to False. CPython's TokenizerIter allows calling TokenizerIter(readline) without naming the flag. Requiring the argument here breaks API compatibility. Apply this diff to make extra_tokens optional with a default:
extra_tokensoptional with a default:#[pyarg(positional)] readline: ArgCallable,- #[pyarg(named)]- extra_tokens: bool,+ #[pyarg(named, default = "false")]+ extra_tokens: bool, #[pyarg(named, optional)] encoding: Option<PyStrRef>,
🧹 Nitpick comments (2)
stdlib/src/tokenize.rs (2)
108-109: Reminder: Address the TODO comment for error handling. The TODO indicates that error checking (e.g., raising SyntaxError for parse errors) is missing. The state.errors field is populated but never checked. Do you want me to generate code to check state.errors and raise appropriate SyntaxError exceptions, or open a new issue to track this task?
200-207: Re-parsing the entire source on each line may be inefficient. The push_line method re-parses the entire accumulated source text each time a new line is added. For large files, this could become a performance bottleneck with O(n²) behavior. However, this approach may be necessary for correct tokenization since Python's grammar is context-sensitive (e.g., for indent/dedent tracking). Consider profiling to determine if incremental parsing or caching strategies would be beneficial.
📜 Review details
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
stdlib/src/tokenize.rs (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.rs: Format Rust code with the default rustfmt style (run cargo fmt)
Run clippy and fix any warnings or lints introduced by your changes
Follow Rust best practices for error handling and memory management
Files:
stdlib/src/tokenize.rs
{vm,stdlib}/**/*.rs
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use RustPython macros (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust
Files:
stdlib/src/tokenize.rs
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR | Repo: RustPython/RustPython | PR: 0 | File: .github/copilot-instructions.md:0-0 | Timestamp: 2025-09-11T05:06:59.971Z | Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust
Learnt from: CR | Repo: RustPython/RustPython | PR: 0 | File: .github/copilot-instructions.md:0-0 | Timestamp: 2025-09-11T05:06:59.971Z | Learning: Applies to Lib/**/*.py : Minimize modifications to CPython standard library files; only make minimal changes to work around RustPython limitations

📚 Learning: 2025-09-11T05:06:59.971Z
Learnt from: CR | Repo: RustPython/RustPython | PR: 0 | File: .github/copilot-instructions.md:0-0 | Timestamp: 2025-09-11T05:06:59.971Z | Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust
Applied to files:
stdlib/src/tokenize.rs
🧬 Code graph analysis (1)
stdlib/src/tokenize.rs (2)
vm/src/stdlib/io.rs (1)
vm (3324-3326)
Lib/test/test_tokenize.py (4)
readline (1329-1335), readline (1350-1356), readline (1576-1581), readline (2106-2107)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: Run snippets and cpython tests on wasm-wasi
- GitHub Check: Check Rust code with rustfmt and clippy
- GitHub Check: Check the WASM package and demo
- GitHub Check: Run snippets and cpython tests (macos-latest)
- GitHub Check: Run snippets and cpython tests (ubuntu-latest)
- GitHub Check: Run snippets and cpython tests (windows-latest)
- GitHub Check: Run rust tests (ubuntu-latest)
- GitHub Check: Run rust tests (windows-latest)
- GitHub Check: Run rust tests (macos-latest)
- GitHub Check: Ensure compilation on various targets
🔇 Additional comments (5)
stdlib/src/tokenize.rs (5)
1-23: LGTM! Module structure and imports follow RustPython conventions correctly.
25-33: LGTM! The struct definition follows RustPython patterns correctly. The TODO comment is noted but doesn't block the current implementation.
61-72: LGTM! Debug implementation and empty pyclass impl block are correct.
74-93: LGTM! Constructor implementation follows RustPython patterns correctly.
210-220: No changes needed to current_line logic. The current_line method correctly uses the LineRanges trait methods from ruff_source_file: full_line_str(offset) returns the full line containing an offset, and full_lines_str(range) returns text covering all lines that a range spans. The logic properly handles edge cases:

- Multi-line tokens: correctly handled via full_lines_str(range) for non-Newline tokens
- Newline tokens: correctly isolated via full_line_str(range.start())
- When prev_token is None: safely defaults to TokenKind::Unknown with TextRange::default(), which calls full_lines_str and returns lines from the start

The code is sound and requires no modifications.
```rust
fn next_token(&mut self) -> Option<Token> {
    for token in self.tokens.iter() {
        let (kind, range) = token.as_tuple();
        if matches!(kind, TokenKind::NonLogicalNewline) {
            continue;
        }
        if matches!(range.ordering(self.range()), Ordering::Greater) {
            self.prev_token = Some(*token);
            return self.prev_token;
        }
    }
    None
}
```
🧩 Analysis chain
Verify the token ordering logic.
The next_token method uses range.ordering(self.range()) to find tokens after the current position. Ensure this correctly handles:

- Zero-width tokens (if any)
- Tokens at the exact same position
- The initial state when prev_token is None (resulting in TextRange::default())
Run the following script to verify the behavior with the Ruff crates:
🏁 Script executed:
```shell
#!/bin/bash
# Check how TextRange::default() and ordering work
ast-grep --pattern $'TextRange::default()'
rg -n -C3 "TextRange::default|ordering" --type=rust
```
Length of output: 10814
I need to find the TextRange type definition and understand its ordering() method to verify the logic.
Now I need to understand the context better. Let me search for how next_token is used and what the intended behavior should be.
Let me search for the actual tokenize.rs file context to understand the full implementation.
Let me try a different approach to view the file. Now let me examine the tokenize.rs file more closely:
🏁 Script executed:

```shell
#!/bin/bash
# Get the full context of the tokenize.rs file around the problematic method
cat stdlib/src/tokenize.rs | head -300 | tail -100
```

Length of output: 3125
🏁 Script executed:
```shell
#!/bin/bash
# Find and display the tokenize.rs file with context around the next_token method
fd -name "tokenize.rs" -type f
```
Length of output: 230
🏁 Script executed:
```shell
#!/bin/bash
# Display the tokenize.rs file to understand the structure and usage
rg -n "fn next_token|impl|struct|fn range|self.tokens" stdlib/src/tokenize.rs -A5 -B2 --type=rust
```
Length of output: 3859
Let me examine the IterNext implementation to understand how next_token() is actually being used:
🏁 Script executed:

```shell
#!/bin/bash
# Get the IterNext implementation to see how next_token is used
sed -n '97,173p' stdlib/src/tokenize.rs
```

Length of output: 2843
Token ordering logic has a critical issue: tokens at position 0 are skipped on initial state.
When prev_token is None, self.range() returns TextRange::default(), which is a zero-width range at position 0 (0..0). The comparison range.ordering(self.range()) only accepts Ordering::Greater, i.e. tokens completely after the current range. Since tokens starting at position 0 (e.g., 0..1) overlap with 0..0, they return Ordering::Equal and are skipped. Only tokens starting after position 0 are ever returned, meaning the initial token in the source is never yielded.
The same issue affects zero-width tokens (e.g., at position N): they overlap with their containing range and won't be returned when that range is current.
The fix requires checking for Ordering::Equal and Ordering::Greater on initial state, or using TextRange::EMPTY properly, or advancing past the current position differently to include overlapping tokens.
🤖 Prompt for AI Agents
In stdlib/src/tokenize.rs around lines 223 to 238, the token selection logic skips tokens that overlap the current range (notably tokens at position 0 when prev_token is None) because it only accepts Ordering::Greater; change the comparison to accept both Ordering::Equal and Ordering::Greater (or compare token.range().start() against self.range().end()) so tokens that start at the same position as the current range are returned; update the conditional to check for Equal || Greater (or use token_start >= current_end) and ensure prev_token is set before returning to preserve state for subsequent calls.

```rust
const fn token_kind_value(kind: TokenKind) -> u8 {
    match kind {
        TokenKind::EndOfFile => 0,
        TokenKind::Name
        | TokenKind::For
        | TokenKind::In
        | TokenKind::Pass
        | TokenKind::Class
        | TokenKind::And
        | TokenKind::Is
        | TokenKind::Raise
        | TokenKind::True
        | TokenKind::False
        | TokenKind::Assert
        | TokenKind::Try
        | TokenKind::While
        | TokenKind::Yield
        | TokenKind::Lambda
        | TokenKind::None
        | TokenKind::Not
        | TokenKind::Or
        | TokenKind::Break
        | TokenKind::Continue
        | TokenKind::Global
        | TokenKind::Nonlocal
        | TokenKind::Return
        | TokenKind::Except
        | TokenKind::Import
        | TokenKind::Case
        | TokenKind::Match
        | TokenKind::Type
        | TokenKind::Await
        | TokenKind::With
        | TokenKind::Del
        | TokenKind::Finally
        | TokenKind::From
        | TokenKind::Def
        | TokenKind::If
        | TokenKind::Else
        | TokenKind::Elif
        | TokenKind::As
        | TokenKind::Async => 1,
        TokenKind::Int | TokenKind::Complex | TokenKind::Float => 2,
        TokenKind::String => 3,
        TokenKind::Newline | TokenKind::NonLogicalNewline => 4,
        TokenKind::Indent => 5,
        TokenKind::Dedent => 6,
        TokenKind::Lpar => 7,
        TokenKind::Rpar => 8,
        TokenKind::Lsqb => 9,
        TokenKind::Rsqb => 10,
        TokenKind::Colon => 11,
        TokenKind::Comma => 12,
        TokenKind::Semi => 13,
        TokenKind::Plus => 14,
        TokenKind::Minus => 15,
        TokenKind::Star => 16,
        TokenKind::Slash => 17,
        TokenKind::Vbar => 18,
        TokenKind::Amper => 19,
        TokenKind::Less => 20,
        TokenKind::Greater => 21,
        TokenKind::Equal => 22,
        TokenKind::Dot => 23,
        TokenKind::Percent => 24,
        TokenKind::Lbrace => 25,
        TokenKind::Rbrace => 26,
        TokenKind::EqEqual => 27,
        TokenKind::NotEqual => 28,
        TokenKind::LessEqual => 29,
        TokenKind::GreaterEqual => 30,
        TokenKind::Tilde => 31,
        TokenKind::CircumFlex => 32,
        TokenKind::LeftShift => 33,
        TokenKind::RightShift => 34,
        TokenKind::DoubleStar => 35,
        TokenKind::PlusEqual => 36,
        TokenKind::MinusEqual => 37,
        TokenKind::StarEqual => 38,
        TokenKind::SlashEqual => 39,
        TokenKind::PercentEqual => 40,
        TokenKind::AmperEqual => 41,
        TokenKind::VbarEqual => 42,
        TokenKind::CircumflexEqual => 43,
        TokenKind::LeftShiftEqual => 44,
        TokenKind::RightShiftEqual => 45,
        TokenKind::DoubleStarEqual => 46,
        TokenKind::DoubleSlash => 47,
        TokenKind::DoubleSlashEqual => 48,
        TokenKind::At => 49,
        TokenKind::AtEqual => 50,
        TokenKind::Rarrow => 51,
        TokenKind::Ellipsis => 52,
        TokenKind::ColonEqual => 53,
        TokenKind::Exclamation => 54,
        TokenKind::FStringStart => 59,
        TokenKind::FStringMiddle => 60,
        TokenKind::FStringEnd => 61,
        TokenKind::Comment => 62,
        TokenKind::TStringStart => 62,  // 3.14 compatible
        TokenKind::TStringMiddle => 63, // 3.14 compatible
        TokenKind::TStringEnd => 64,    // 3.14 compatible
        TokenKind::IpyEscapeCommand | TokenKind::Question => 0, // Ruff's specific
        TokenKind::Unknown => 0,
    }
}
```
🧩 Analysis chain
Verify token value mappings match CPython.
This mapping should match the token values from CPython's token module. The comment at line 22 indicates TOKEN_OP is 55, which is correct for Python 3.x.
Please verify that all other token values match CPython, especially:
- Lines 380-382: TString tokens marked as "3.14 compatible"
- Line 379: Comment token maps to 62 (same as TStringStart on line 380)
- Lines 383-384: Ruff-specific tokens map to 0
🌐 Web query:
What are the token type values in CPython's token module for Python 3.14?

💡 Result:
I can show you the exact numeric values from your local CPython 3.14 installation — run this in a Python 3.14 interpreter:
python -c "import token, pprint; pprint.pprint({name: getattr(token, name) for name in sorted(token.tok_name.values())})"
Or, to get the reverse mapping (name → numeric value):
python -c "import token, pprint; pprint.pprint({k: v for v, k in token.tok_name.items()})"
The official Python 3.14 documentation for the token module (and the module source) is here for reference. [1]
Sources:
[1] Python 3.14 token module documentation — docs.python.org.
```python
#!/usr/bin/env python3
import sys

try:
    import token

    print("Python version:", sys.version)
    print("\nToken name to value mapping:")
    # Get token name to value mapping
    tok_mapping = {}
    for value, name in token.tok_name.items():
        tok_mapping[name] = value
    # Sort and print
    for name in sorted(tok_mapping.keys()):
        print(f"{name}: {tok_mapping[name]}")
except ImportError as e:
    print(f"Error: {e}")
```
🏁 Script executed:
```shell
#!/bin/bash
# Also check if we can inspect the tokenize.rs file for comparison
cd /tmp || exit 1
cat > check_cpython_tokens.py << 'EOF'
import token
import pprint

print("=== Token module values ===")
for tok_val in sorted(token.tok_name.keys()):
    print(f"{token.tok_name[tok_val]}: {tok_val}")
EOF
python3 check_cpython_tokens.py
```
Length of output: 844
🏁 Script executed:
```shell
#!/bin/bash
# Get the full TokenKind enum definition and context around the function
cd /tmp || exit 1
find . -name "tokenize.rs" -type f 2>/dev/null | head -5
```
Length of output: 0
🏁 Script executed:
```shell
#!/bin/bash
# Search for TokenKind enum definition
rg -n "enum TokenKind" --type=rs -A 100
```
Length of output: 26
🏁 Script executed:
```shell
#!/bin/bash
# Inspect the TokenKind enum and the token_kind_value function
rg -n "enum TokenKind" -A 150 | head -200
```
Length of output: 0
🏁 Script executed:
```shell
#!/bin/bash
# Look at the token_kind_value function and surrounding context
rg -n "const fn token_kind_value" -B 5 -A 120 | head -200
```
Length of output: 6992
🏁 Script executed:
```shell
#!/bin/bash
# Search for any comments about token mapping or Python compatibility
rg -n "CPython|Python.*token|token.*value" stdlib/src/tokenize.rs | head -20
```
Length of output: 333
Critical token mapping errors found - multiple duplicates and CPython misalignment.
Verification against CPython 3.14 token module revealed significant issues:
- Duplicate values (lines 379-380): Both Comment and TStringStart map to 62, which is invalid
- Incorrect Comment value (line 379): Maps to 62, but CPython has COMMENT=61
- FString tokens don't match CPython (lines 376-378): Map to 59, 60, 61, which correspond to SOFT_KEYWORD, ERRORTOKEN, and COMMENT in CPython, not FString values
- TString tokens misaligned (lines 380-382): The "3.14 compatible" comment is inaccurate; these values (62, 63, 64) don't correspond to template string tokens in CPython 3.14
- Exclamation token (line 375): Maps to 54, but CPython has OP=54; Exclamation is not a standard Python token type
The mapping for lines 375-382 needs correction to either match CPython standard tokens or use reserved/unassigned values to avoid collisions and type misalignment.
🤖 Prompt for AI Agents
In stdlib/src/tokenize.rs around lines 281-386: the TokenKind->u8 mapping has duplicate and misaligned values (e.g., Comment and TStringStart both 62, FStringStart/Middle/End set to 59/60/61 which conflict with CPython 3.14 meanings, Exclamation mapped to 54 which collides with OP, and the "3.14 compatible" comments are incorrect). Fix by replacing those values with the correct CPython 3.14 token numeric assignments for COMMENT, ERRORTOKEN, SOFT_KEYWORD, and so on if you intend CPython parity; if you intentionally diverge, pick unused/reserved numeric values that do not collide with existing entries, remove or correct the misleading "3.14 compatible" comments, ensure IpyEscapeCommand/Question and Unknown map to distinct safe values (or 0 only when appropriate), and run tokenizer unit tests and a full token value uniqueness check to confirm no duplicates remain.
Actionable comments posted: 0
♻️ Duplicate comments (5)
stdlib/src/tokenize.rs (5)
121-143: EOF tuple must use the parser's actual EndOfFile token. The code synthesizes the EOF tuple with hardcoded (-1) positions (lines 132, 135) and reuses state.current_line(), which produces incorrect coordinates and line content. Instead, retrieve the parser's EndOfFile token from state.tokens, set it as prev_token, and let the standard token-emission path (lines 145-175) produce the tuple with correct metadata. Apply this approach:
```diff
 let nline = zelf.readline(vm)?;
 if nline.is_empty() {
     state.eof = true;
-    *zelf.state.write() = state.clone();
-
-    let line_num = &state.start().0;
-    let out = vm
-        .ctx
-        .new_tuple(vec![
-            token_kind_value(TokenKind::EndOfFile).to_pyobject(vm),
-            vm.ctx.new_str("").into(),
-            vm.ctx
-                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                .into(),
-            vm.ctx
-                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                .into(),
-            vm.ctx.new_str(state.current_line()).into(),
-        ])
-        .into();
-    return Ok(PyIterReturn::Return(out));
+    if let Some(eof) = state
+        .tokens
+        .iter()
+        .find(|token| token.kind() == TokenKind::EndOfFile)
+    {
+        state.prev_token = Some(*eof);
+        break *eof;
+    }
+    *zelf.state.write() = state.clone();
+    return Ok(PyIterReturn::StopIteration(None));
 }
```
156-159: Remove trim() to preserve exact token lexemes. Calling trim() when extra_tokens is false (line 158) strips significant whitespace, causing INDENT tokens to become empty strings. CPython's tokenizer preserves the exact lexeme. Use the slice as-is. Apply this diff:
```diff
-let mut token_repr = &state.source[state.range()];
-if !zelf.extra_tokens {
-    token_repr = token_repr.trim();
-}
+let token_repr = &state.source[state.range()];
```
183-184: Make extra_tokens optional with default false. CPython allows TokenizerIter(readline) without specifying extra_tokens. The current declaration requires the argument, breaking API compatibility. Apply this diff:
```diff
-#[pyarg(named)]
-extra_tokens: bool,
+#[pyarg(named, default = "false")]
+extra_tokens: bool,
```
227-243: Token ordering skips tokens that overlap the current range. When prev_token is None, self.range() returns TextRange::default() (0..0). The comparison range.ordering(self.range()) (line 236) only accepts Ordering::Greater, i.e. tokens completely after the current range. Tokens starting at position 0 (e.g., 0..1) overlap with 0..0, returning Ordering::Equal, and are skipped. The first token in the source is never yielded. Fix by comparing start positions:
```diff
-if matches!(range.ordering(self.range()), Ordering::Greater) {
+if range.start() > self.range().end() {
     self.prev_token = Some(*token);
     return self.prev_token;
 }
```
286-391: Critical token mapping errors: duplicates and CPython conflicts. Multiple critical issues remain:

- Duplicate value 62 (lines 384-385): Both Comment and TStringStart map to 62
- Incorrect Comment mapping (line 384): Maps to 62, but CPython 3.14 has COMMENT=61
- FString conflicts (lines 381-383): Values 59, 60, 61 correspond to SOFT_KEYWORD, ERRORTOKEN, and COMMENT in CPython 3.14
- Exclamation collision (line 380): Maps to 54, which is CPython's OP token
- Misleading comments (lines 385-387): "3.14 compatible" claims are inaccurate
These mappings must be corrected to either:
- Match CPython 3.14 standard token values exactly, or
- Use reserved/unassigned values (e.g., 100+) to avoid collisions
Run this verification to confirm no remaining duplicates after fixes:
```shell
#!/bin/bash
# Extract all token value assignments and check for duplicates
rg -n "TokenKind::\w+ => \d+" stdlib/src/tokenize.rs | \
  gawk '{print $NF}' | sort | uniq -d | \
  xargs -I {} echo "Duplicate token value found: {}"
```
🧹 Nitpick comments (2)
stdlib/src/tokenize.rs (2)
189-212: State structure is correct; consider performance implications. The push_line method reparses the entire accumulated source on each call (line 208). While correct, this has O(n²) complexity for n lines. This is acceptable for now but may need optimization for large files.
113-118: TODO: Implement syntax error handling. The TODO on line 114 indicates that parse errors from state.errors are not currently raised as SyntaxError. This means invalid Python syntax may be silently ignored. Consider implementing error checking in a follow-up to ensure proper error reporting. Based on learnings.
📜 Review details
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Lib/test/test_tokenize.py is excluded by !Lib/**
📒 Files selected for processing (1)
stdlib/src/tokenize.rs (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.rs: Format Rust code with the default rustfmt style (run cargo fmt)
Run clippy and fix any warnings or lints introduced by your changes
Follow Rust best practices for error handling and memory management
Files:
stdlib/src/tokenize.rs
{vm,stdlib}/**/*.rs
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use RustPython macros (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust
Files:
stdlib/src/tokenize.rs
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR | Repo: RustPython/RustPython | PR: 0 | File: .github/copilot-instructions.md:0-0 | Timestamp: 2025-09-11T05:06:59.971Z | Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust
Learnt from: CR | Repo: RustPython/RustPython | PR: 0 | File: .github/copilot-instructions.md:0-0 | Timestamp: 2025-09-11T05:06:59.971Z | Learning: Applies to Lib/**/*.py : Minimize modifications to CPython standard library files; only make minimal changes to work around RustPython limitations

📚 Learning: 2025-09-11T05:06:59.971Z
Learnt from: CR | Repo: RustPython/RustPython | PR: 0 | File: .github/copilot-instructions.md:0-0 | Timestamp: 2025-09-11T05:06:59.971Z | Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust
Applied to files:
stdlib/src/tokenize.rs
🧬 Code graph analysis (1)
stdlib/src/tokenize.rs (2)
vm/src/stdlib/io.rs (1)
vm (3324-3326)
Lib/test/test_tokenize.py (4)
readline (1328-1334), readline (1349-1355), readline (1575-1580), readline (2105-2106)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: Run snippets and cpython tests (ubuntu-latest)
- GitHub Check: Run snippets and cpython tests on wasm-wasi
- GitHub Check: Run snippets and cpython tests (macos-latest)
- GitHub Check: Run snippets and cpython tests (windows-latest)
- GitHub Check: Run rust tests (macos-latest)
- GitHub Check: Run rust tests (windows-latest)
- GitHub Check: Check the WASM package and demo
- GitHub Check: Run rust tests (ubuntu-latest)
- GitHub Check: Check Rust code with rustfmt and clippy
- GitHub Check: Ensure compilation on various targets
🔇 Additional comments (9)
stdlib/src/tokenize.rs (9)
1-23: LGTM: Module structure and imports are well-organized. The #[pymodule] macro usage follows RustPython patterns, and the TOKEN_OP constant correctly matches CPython's value.
25-33: Struct definition is sound; track the TODO for future work. The use of PyRwLock for state management is appropriate. The TODO on line 29 regarding PyObject vs ArgCallable can be addressed in a follow-up.
35-64: LGTM: Error handling correctly distinguishes StopIteration from other exceptions. The implementation now properly treats only StopIteration as EOF (lines 42-43) and propagates all other errors (line 45), fixing the critical issue from the previous review. The encoding logic for both bytes and strings is also correct.
66-77: LGTM: Standard Debug implementation and trait delegation.
79-98: LGTM: Constructor implementation is clean.
100-100: LGTM: Standard SelfIter implementation.
214-225: LGTM: Current line extraction logic is appropriate.
245-267: LGTM: Position helper methods correctly convert to Python conventions. The start() and end() methods properly return 1-indexed line numbers and 0-indexed column offsets, matching Python's tokenize module.
270-284: LGTM: Default state initialization is clean.
Summary by CodeRabbit
New Features
Chores