Base implementation of _tokenize module #6240


Draft
ShaharNaveh wants to merge 8 commits into RustPython:main from ShaharNaveh:update-token

Conversation

@ShaharNaveh (Collaborator) commented Nov 8, 2025 · edited by coderabbitai bot

Summary by CodeRabbit

  • New Features

    • New Python tokenizer module: Tokenizer iterator that yields token tuples (type, text, start, end, line), accepts a readline callable, optional encoding, and an extra-token mode; emits EOF tokens with position metadata for streaming/tokenization.
  • Chores

    • Internal dependency declarations updated to support the new tokenizer module.

coderabbitai bot (Contributor) commented Nov 8, 2025 · edited

Important

Review skipped

Review was skipped due to path filters

⛔ Files ignored due to path filters (2)
  • Lib/test/test_tarfile.py is excluded by !Lib/**
  • Lib/test/test_unittest/testmock/testhelpers.py is excluded by !Lib/**

CodeRabbit blocks several paths by default. You can override this behavior by explicitly including those paths in the path filters. For example, including **/dist/** will override the default block on the dist directory by removing the pattern from both lists.

You can disable this status message by setting reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Reorders and augments Ruff-related dependencies in the top-level Cargo.toml and adds workspace dependencies in stdlib/Cargo.toml; registers a new internal tokenize module in stdlib/src/lib.rs; and introduces stdlib/src/tokenize.rs, a PyO3-backed _tokenize module exposing TokenizerIter that yields Python token 5-tuples and handles tokenizer state and EOF.
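To make the shape of the new API concrete, here is a hypothetical usage sketch based only on the description above. The import path, argument defaults, and exact iteration behavior are assumptions drawn from this PR's constructor args (readline positional; extra_tokens and encoding named) and may differ from the final implementation.

```python
# Hypothetical sketch of driving the new TokenizerIter from Python.
# Assumes the module is importable as `_tokenize` and that `extra_tokens`
# can be passed by name, as described in this PR.
import io

from _tokenize import TokenizerIter

source = "x = 1\nprint(x)\n"
readline = io.StringIO(source).readline  # called once per line; "" signals EOF

for tok_type, tok_str, start, end, line in TokenizerIter(readline, extra_tokens=False):
    # Each item is a 5-tuple: (type, text, (row, col) start, (row, col) end, current line)
    print(tok_type, repr(tok_str), start, end, repr(line))
```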

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Top-level manifest: Cargo.toml | Reordered Ruff-related crate entries; added ruff_python_trivia; moved ruff_text_size near ruff_source_file. No API changes. |
| Stdlib manifest updates: stdlib/Cargo.toml | Added workspace dependencies: ruff_python_ast, ruff_python_parser, ruff_source_file, ruff_text_size (all workspace = true). |
| Module registration: stdlib/src/lib.rs | Added mod tokenize; and registered _tokenize by including "_tokenize" => tokenize::make_module in the module initializers. |
| Tokenizer implementation: stdlib/src/tokenize.rs | New PyO3 module _tokenize exposing Python class TokenizerIter (PyTokenizerIter) and constructor args. Implements an iterator that reads lines via a Python readline callable, manages internal state with a PyRwLock, emits 5-tuples (type, string, start_pos, end_pos, current_line), remaps operator tokens to TOKEN_OP when extra_tokens is true, and emits an EOF token tuple. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant PyUser as Python user
    participant TokenIter as TokenizerIter (Py)
    participant State as PyTokenizerIterState
    participant Parser as Ruff parser
    PyUser->>TokenIter: next() / __next__()
    TokenIter->>TokenIter: check internal state & tokens
    alt tokens exhausted
        TokenIter->>PyUser: call readline(vm) -> line
        alt readline returns line
            TokenIter->>State: push_line(line)
            State->>Parser: parse_unchecked_source(line...)
            Parser-->>State: tokens
            State->>TokenIter: next_token()
            TokenIter->>TokenIter: token_kind_value(kind)
            alt extra_tokens and operator
                TokenIter->>TokenIter: remap -> TOKEN_OP (55)
            end
            TokenIter-->>PyUser: (type, string, start, end, current_line)
        else readline -> EOF
            TokenIter->>State: mark EOF
            TokenIter-->>PyUser: EndOfFile tuple (empty string + positions)
        end
    else tokens available
        TokenIter-->>PyUser: next token tuple
    end
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Focus review on stdlib/src/tokenize.rs concurrency (use of PyRwLock), correctness of token position/range calculations, EOF emission behavior, and token-kind remapping.
  • Verify workspace dependency consistency across Cargo.toml and stdlib/Cargo.toml.

Suggested reviewers

  • youknowone

Poem

A rabbit reads each Python line with care,
Hops through tokens drifting in the air,
Ruff crates in order, ready for the quest,
Tuples hop out tidy, positions dressed,
EOF tiptoes in — the iterator rests. 🐇✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: implementing a base _tokenize module with a TokenizerIter class and supporting infrastructure. |
| Docstring coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 4

📜 Review details

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5cad66c and 7bc0197.

⛔ Files ignored due to path filters (15)
  • Cargo.lock is excluded by !**/*.lock
  • Lib/test/test_tokenize.py is excluded by !Lib/**
  • Lib/test/tokenizedata/__init__.py is excluded by !Lib/**
  • Lib/test/tokenizedata/bad_coding.py is excluded by !Lib/**
  • Lib/test/tokenizedata/bad_coding2.py is excluded by !Lib/**
  • Lib/test/tokenizedata/badsyntax_3131.py is excluded by !Lib/**
  • Lib/test/tokenizedata/badsyntax_pep3120.py is excluded by !Lib/**
  • Lib/test/tokenizedata/coding20731.py is excluded by !Lib/**
  • Lib/test/tokenizedata/tokenize_tests-latin1-coding-cookie-and-utf8-bom-sig.txt is excluded by !Lib/**
  • Lib/test/tokenizedata/tokenize_tests-no-coding-cookie-and-utf8-bom-sig-only.txt is excluded by !Lib/**
  • Lib/test/tokenizedata/tokenize_tests-utf8-coding-cookie-and-no-utf8-bom-sig.txt is excluded by !Lib/**
  • Lib/test/tokenizedata/tokenize_tests-utf8-coding-cookie-and-utf8-bom-sig.txt is excluded by !Lib/**
  • Lib/test/tokenizedata/tokenize_tests.txt is excluded by !Lib/**
  • Lib/token.py is excluded by !Lib/**
  • Lib/tokenize.py is excluded by !Lib/**
📒 Files selected for processing (4)
  • Cargo.toml (1 hunks)
  • stdlib/Cargo.toml (1 hunks)
  • stdlib/src/lib.rs (2 hunks)
  • stdlib/src/tokenize.rs (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.rs: Format Rust code with the default rustfmt style (run cargo fmt)
Run clippy and fix any warnings or lints introduced by your changes
Follow Rust best practices for error handling and memory management

Files:

  • stdlib/src/lib.rs
  • stdlib/src/tokenize.rs
{vm,stdlib}/**/*.rs

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use RustPython macros (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

Files:

  • stdlib/src/lib.rs
  • stdlib/src/tokenize.rs
🧠 Learnings (4)
📓 Common learnings
Learnt from: CR | Repo: RustPython/RustPython | PR: 0 | File: .github/copilot-instructions.md:0-0 | Timestamp: 2025-09-11T05:06:59.971Z | Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust
📚 Learning: 2025-09-11T05:06:59.971Z
Learnt from: CR | Repo: RustPython/RustPython | PR: 0 | File: .github/copilot-instructions.md:0-0 | Timestamp: 2025-09-11T05:06:59.971Z | Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust

Applied to files:

  • stdlib/src/lib.rs
  • stdlib/src/tokenize.rs
📚 Learning: 2025-09-11T05:06:59.971Z
Learnt from: CR | Repo: RustPython/RustPython | PR: 0 | File: .github/copilot-instructions.md:0-0 | Timestamp: 2025-09-11T05:06:59.971Z | Learning: Applies to Lib/**/*.py : Minimize modifications to CPython standard library files; only make minimal changes to work around RustPython limitations

Applied to files:

  • stdlib/Cargo.toml
  • stdlib/src/tokenize.rs
  • Cargo.toml
📚 Learning: 2025-09-11T05:06:59.971Z
Learnt from: CR | Repo: RustPython/RustPython | PR: 0 | File: .github/copilot-instructions.md:0-0 | Timestamp: 2025-09-11T05:06:59.971Z | Learning: Applies to extra_tests/**/*.py : Use ruff to lint custom Python code

Applied to files:

  • Cargo.toml
🧬 Code graph analysis (2)
stdlib/src/lib.rs (2)
Lib/tokenize.py (1)
  • tokenize (466-492)
stdlib/src/socket.rs (1)
  • make_module (7-11)
stdlib/src/tokenize.rs (2)
vm/src/stdlib/io.rs (1)
  • vm (3324-3326)
Lib/test/test_tokenize.py (4)
  • readline (1329-1335)
  • readline (1350-1356)
  • readline (1576-1581)
  • readline (2106-2107)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: Run snippets and cpython tests (ubuntu-latest)
  • GitHub Check: Run tests under miri
  • GitHub Check: Run snippets and cpython tests on wasm-wasi
  • GitHub Check: Check the WASM package and demo
  • GitHub Check: Run snippets and cpython tests (macos-latest)
  • GitHub Check: Run snippets and cpython tests (windows-latest)
  • GitHub Check: Check Rust code with rustfmt and clippy
  • GitHub Check: Run rust tests (windows-latest)
  • GitHub Check: Run rust tests (ubuntu-latest)
  • GitHub Check: Run rust tests (macos-latest)
  • GitHub Check: Ensure compilation on various targets

Comment on lines +116 to +138
```rust
                if nline.is_empty() {
                    state.eof = true;
                    *zelf.state.write() = state.clone();

                    let line_num = &state.start().0;
                    let out = vm
                        .ctx
                        .new_tuple(vec![
                            token_kind_value(TokenKind::EndOfFile).to_pyobject(vm),
                            vm.ctx.new_str("").into(),
                            vm.ctx
                                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
                                .into(),
                            vm.ctx
                                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
                                .into(),
                            vm.ctx.new_str(state.current_line()).into(),
                        ])
                        .into();
                    return Ok(PyIterReturn::Return(out));
                }
                state.push_line(&nline);
            };
```

⚠️ Potential issue | 🔴 Critical

End-of-file tuple needs to come from the real EOF token.

Synthesizing the tuple here hardcodes (line, -1) and reuses the previous line text, so consumers see the wrong coordinates and line content. Grab the parser's EndOfFile token, set it as prev_token, and let the existing emission path produce the tuple.

```diff
-                if nline.is_empty() {
-                    state.eof = true;
-                    *zelf.state.write() = state.clone();
-
-                    let line_num = &state.start().0;
-                    let out = vm
-                        .ctx
-                        .new_tuple(vec![
-                            token_kind_value(TokenKind::EndOfFile).to_pyobject(vm),
-                            vm.ctx.new_str("").into(),
-                            vm.ctx
-                                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                                .into(),
-                            vm.ctx
-                                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                                .into(),
-                            vm.ctx.new_str(state.current_line()).into(),
-                        ])
-                        .into();
-                    return Ok(PyIterReturn::Return(out));
-                }
+                if nline.is_empty() {
+                    state.eof = true;
+                    if let Some(eof) = state
+                        .tokens
+                        .iter()
+                        .rev()
+                        .find(|token| token.kind() == TokenKind::EndOfFile)
+                    {
+                        state.prev_token = Some(*eof);
+                        break eof.clone();
+                    }
+                    return Ok(PyIterReturn::StopIteration(None));
+                }
```

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In stdlib/src/tokenize.rs around lines 116 to 138, don't synthesize the EOF tuple here (which hardcodes (line, -1) and reuses the previous line text); instead obtain the parser's EndOfFile token, set it as the state's prev_token (or otherwise set prev_token to that EOF token), set state.eof = true and persist the state as before, then allow the existing token-emission path to produce and return the EOF tuple so the coordinates and line content come from the real EOF token rather than being hardcoded.
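For reference, this is the behavior being asked for: CPython's tokenize module ends a stream with an ENDMARKER token whose coordinates come from the tokenizer itself rather than a synthesized (line, -1) pair. A quick check against the standard library (any recent CPython):

```python
import io
import tokenize

toks = list(tokenize.generate_tokens(io.StringIO("x = 1\n").readline))
endmarker = toks[-1]
print(endmarker.type == tokenize.ENDMARKER)                   # True
print(repr(endmarker.string), endmarker.start, endmarker.end) # '' (2, 0) (2, 0), not (line, -1)
```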

Comment on lines +152 to +154
```rust
            if !zelf.extra_tokens {
                token_repr = token_repr.trim();
            }
```

⚠️ Potential issue | 🟠 Major

Don't trim token text when extra_tokens is false.

trim() wipes indentation whitespace (and other significant padding), so INDENT tokens end up as empty strings. CPython preserves the exact lexeme. Just use the slice as-is.

```diff
-            let mut token_repr = &state.source[state.range()];
-            if !zelf.extra_tokens {
-                token_repr = token_repr.trim();
-            }
+            let token_repr = &state.source[state.range()];
```
🤖 Prompt for AI Agents
In stdlib/src/tokenize.rs around lines 152 to 154, token text is being trimmed when zelf.extra_tokens is false, which removes significant whitespace (e.g., INDENT lexemes); remove the call to trim() and use the slice as-is so the original lexeme (including leading/trailing whitespace) is preserved when extra_tokens is false.
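To see why trimming is wrong, compare with CPython's tokenize, which reports the raw indentation as the INDENT token's string:

```python
import io
import tokenize

src = "if True:\n    pass\n"
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    if tok.type == tokenize.INDENT:
        print(repr(tok.string))  # '    ': the INDENT lexeme is the indentation itself, not ''
```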

Comment on lines +179 to +180
```rust
        extra_tokens: bool,
        #[pyarg(named, optional)]
```

⚠️ Potential issue | 🔴 Critical

extra_tokens must default to False.

CPython's TokenizerIter allows calling TokenizerIter(readline) without naming the flag. Requiring the argument here breaks that API.

```diff
-        #[pyarg(named)]
+        #[pyarg(named, default = "false")]
         extra_tokens: bool,
```

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In stdlib/src/tokenize.rs around lines 179-180, the extra_tokens parameter currently requires being passed and therefore breaks CPython's TokenizerIter API; make extra_tokens default to False so it can be omitted. Update the pyarg declaration to make the argument optional with a default of false (or change the parameter to Option<bool> and treat None as false) so TokenizerIter(readline) works without naming this flag.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

♻️ Duplicate comments (4)
stdlib/src/tokenize.rs (4)

39-42: Don't swallow non-StopIteration errors from readline.

Catching every exception and treating it as EOF hides real failures (TypeError, UnicodeDecodeError, etc.). Only StopIteration should terminate the stream; all other errors must propagate.

Apply this diff to fix the error handling:

```diff
-            let raw_line = match self.readline.invoke((), vm) {
-                Ok(v) => v,
-                Err(_) => return Ok(String::new()),
-            };
+            let raw_line = match self.readline.invoke((), vm) {
+                Ok(v) => v,
+                Err(exc) => {
+                    if exc.fast_isinstance(vm.ctx.exceptions.stop_iteration) {
+                        return Ok(String::new());
+                    }
+                    return Err(exc);
+                }
+            };
```
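As a point of comparison, CPython's tokenize treats StopIteration from the readline callable as end of input (this is what makes the common iter(lines).__next__ pattern work), while other exceptions surface to the caller instead of being swallowed as EOF. A small illustration, assuming a recent CPython:

```python
import tokenize

lines = ["x = 1\n"]

# A readline that raises StopIteration at the end (iterator protocol) ends the stream cleanly.
toks = list(tokenize.generate_tokens(iter(lines).__next__))
print(toks[-1].type == tokenize.ENDMARKER)  # True

# A readline that fails for another reason is not treated as EOF; the error surfaces instead.
def broken_readline():
    raise ValueError("boom")

try:
    list(tokenize.generate_tokens(broken_readline))
except Exception as exc:
    print("raised:", type(exc).__name__)
```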

116-138: End-of-file tuple needs to come from the real EOF token.

Synthesizing the EOF tuple here hardcodes (line, -1) and reuses the previous line text, so consumers see the wrong coordinates and line content. Instead, retrieve the parser's EndOfFile token, set it as prev_token, and let the existing emission path produce the correct tuple.

Apply this diff to use the parser's EOF token:

```diff
                 if nline.is_empty() {
                     state.eof = true;
-                    *zelf.state.write() = state.clone();
-
-                    let line_num = &state.start().0;
-                    let out = vm
-                        .ctx
-                        .new_tuple(vec![
-                            token_kind_value(TokenKind::EndOfFile).to_pyobject(vm),
-                            vm.ctx.new_str("").into(),
-                            vm.ctx
-                                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                                .into(),
-                            vm.ctx
-                                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                                .into(),
-                            vm.ctx.new_str(state.current_line()).into(),
-                        ])
-                        .into();
-                    return Ok(PyIterReturn::Return(out));
+                    if let Some(eof) = state
+                        .tokens
+                        .iter()
+                        .rev()
+                        .find(|token| token.kind() == TokenKind::EndOfFile)
+                    {
+                        state.prev_token = Some(*eof);
+                        break *eof;
+                    }
+                    *zelf.state.write() = state.clone();
+                    return Ok(PyIterReturn::StopIteration(None));
                 }
```

151-154: Don't trim token text when extra_tokens is false.

trim() removes indentation whitespace and other significant padding, causing INDENT tokens to become empty strings. CPython preserves the exact lexeme. Use the slice as-is without trimming.

Apply this diff to preserve the exact lexeme:

```diff
-            let mut token_repr = &state.source[state.range()];
-            if !zelf.extra_tokens {
-                token_repr = token_repr.trim();
-            }
+            let token_repr = &state.source[state.range()];
```

178-180: extra_tokens must default to False.

CPython's TokenizerIter allows calling TokenizerIter(readline) without naming the flag. Requiring the argument here breaks API compatibility.

Apply this diff to make extra_tokens optional with a default:

```diff
         #[pyarg(positional)]
         readline: ArgCallable,
-        #[pyarg(named)]
-        extra_tokens: bool,
+        #[pyarg(named, default = "false")]
+        extra_tokens: bool,
         #[pyarg(named, optional)]
         encoding: Option<PyStrRef>,
```
🧹 Nitpick comments (2)
stdlib/src/tokenize.rs (2)

108-109: Reminder: Address the TODO comment for error handling.

The TODO indicates that error checking (e.g., raising SyntaxError for parse errors) is missing. The state.errors field is populated but never checked.

Do you want me to generate code to check state.errors and raise appropriate SyntaxError exceptions, or open a new issue to track this task?


200-207: Re-parsing the entire source on each line may be inefficient.

The push_line method re-parses the entire accumulated source text each time a new line is added. For large files, this could become a performance bottleneck with O(n²) behavior.

However, this approach may be necessary for correct tokenization since Python's grammar is context-sensitive (e.g., for indent/dedent tracking). Consider profiling to determine if incremental parsing or caching strategies would be beneficial.

📜 Review details

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7bc0197 and 04371d3.

📒 Files selected for processing (1)
  • stdlib/src/tokenize.rs (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.rs: Format Rust code with the default rustfmt style (run cargo fmt)
Run clippy and fix any warnings or lints introduced by your changes
Follow Rust best practices for error handling and memory management

Files:

  • stdlib/src/tokenize.rs
{vm,stdlib}/**/*.rs

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use RustPython macros (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

Files:

  • stdlib/src/tokenize.rs
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR | Repo: RustPython/RustPython | PR: 0 | File: .github/copilot-instructions.md:0-0 | Timestamp: 2025-09-11T05:06:59.971Z | Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust
Learnt from: CR | Repo: RustPython/RustPython | PR: 0 | File: .github/copilot-instructions.md:0-0 | Timestamp: 2025-09-11T05:06:59.971Z | Learning: Applies to Lib/**/*.py : Minimize modifications to CPython standard library files; only make minimal changes to work around RustPython limitations
📚 Learning: 2025-09-11T05:06:59.971Z
Learnt from: CR | Repo: RustPython/RustPython | PR: 0 | File: .github/copilot-instructions.md:0-0 | Timestamp: 2025-09-11T05:06:59.971Z | Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust

Applied to files:

  • stdlib/src/tokenize.rs
🧬 Code graph analysis (1)
stdlib/src/tokenize.rs (2)
vm/src/stdlib/io.rs (1)
  • vm (3324-3326)
Lib/test/test_tokenize.py (4)
  • readline (1329-1335)
  • readline (1350-1356)
  • readline (1576-1581)
  • readline (2106-2107)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: Run snippets and cpython tests on wasm-wasi
  • GitHub Check: Check Rust code with rustfmt and clippy
  • GitHub Check: Check the WASM package and demo
  • GitHub Check: Run snippets and cpython tests (macos-latest)
  • GitHub Check: Run snippets and cpython tests (ubuntu-latest)
  • GitHub Check: Run snippets and cpython tests (windows-latest)
  • GitHub Check: Run rust tests (ubuntu-latest)
  • GitHub Check: Run rust tests (windows-latest)
  • GitHub Check: Run rust tests (macos-latest)
  • GitHub Check: Ensure compilation on various targets
🔇 Additional comments (5)
stdlib/src/tokenize.rs (5)

1-23: LGTM!

Module structure and imports follow RustPython conventions correctly.


25-33: LGTM!

The struct definition follows RustPython patterns correctly. The TODO comment is noted but doesn't block the current implementation.


61-72: LGTM!

Debug implementation and empty pyclass impl block are correct.


74-93: LGTM!

Constructor implementation follows RustPython patterns correctly.


210-220: No changes needed to current_line logic.

The current_line method correctly uses the LineRanges trait methods from ruff_source_file: full_line_str(offset) returns the full line containing an offset, and full_lines_str(range) returns text covering all lines that a range spans. The logic properly handles edge cases:

  • Multi-line tokens: Correctly handled via full_lines_str(range) for non-Newline tokens
  • Newline tokens: Correctly isolated via full_line_str(range.start())
  • When prev_token is None: Safely defaults to TokenKind::Unknown with TextRange::default(), which calls full_lines_str and returns lines from the start

The code is sound and requires no modifications.

Comment on lines +223 to +238
```rust
    fn next_token(&mut self) -> Option<Token> {
        for token in self.tokens.iter() {
            let (kind, range) = token.as_tuple();

            if matches!(kind, TokenKind::NonLogicalNewline) {
                continue;
            }

            if matches!(range.ordering(self.range()), Ordering::Greater) {
                self.prev_token = Some(*token);
                return self.prev_token;
            }
        }

        None
    }
```

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify the token ordering logic.

The next_token method uses range.ordering(self.range()) to find tokens after the current position. Ensure this correctly handles:

  • Zero-width tokens (if any)
  • Tokens at the exact same position
  • The initial state when prev_token is None (resulting in TextRange::default())

Run the following script to verify the behavior with the Ruff crates:


🏁 Script executed:

```bash
#!/bin/bash
# Check how TextRange::default() and ordering work
ast-grep --pattern $'TextRange::default()'
rg -n -C3 "TextRange::default|ordering" --type=rust
```

Length of output: 10814


I need to find the TextRange type definition and understand its ordering() method to verify the logic.

Now I need to understand the context better. Let me search for how next_token is used and what the intended behavior should be.

Let me search for the actual tokenize.rs file context to understand the full implementation.

Let me try a different approach to view the file:

Now let me examine the tokenize.rs file more closely:

---

🏁 Script executed:

```bash
#!/bin/bash
# Get the full context of the tokenize.rs file around the problematic method
cat stdlib/src/tokenize.rs | head -300 | tail -100
```

Length of output: 3125


🏁 Script executed:

```bash
#!/bin/bash
# Find and display the tokenize.rs file with context around the next_token method
fd -name "tokenize.rs" -type f
```

Length of output: 230


🏁 Script executed:

```bash
#!/bin/bash
# Display the tokenize.rs file to understand the structure and usage
rg -n "fn next_token|impl|struct|fn range|self.tokens" stdlib/src/tokenize.rs -A5 -B2 --type=rust
```

Length of output: 3859


Let me examine the IterNext implementation to understand how next_token() is actually being used:

---

🏁 Script executed:

```bash
#!/bin/bash
# Get the IterNext implementation to see how next_token is used
sed -n '97,173p' stdlib/src/tokenize.rs
```

Length of output: 2843


Token ordering logic has a critical issue: tokens at position 0 are skipped on initial state.

When prev_token is None, self.range() returns TextRange::default(), which is a zero-width range at position 0 (0..0). The comparison matches!(range.ordering(self.range()), Ordering::Greater) only returns true for tokens completely after the current range. Since tokens starting at position 0 (e.g., 0..1) overlap with 0..0, they return Ordering::Equal and are skipped. Only tokens starting after position 0 are returned first, meaning the initial token in the source is never yielded.

The same issue affects zero-width tokens (e.g., at position N): they overlap with their containing range and won't be returned when that range is current.

The fix requires checking for Ordering::Equal and Ordering::Greater on initial state, or using TextRange::EMPTY properly, or advancing past the current position differently to include overlapping tokens.

🤖 Prompt for AI Agents
In stdlib/src/tokenize.rs around lines 223 to 238, the token selection logic skips tokens that overlap the current range (notably tokens at position 0 when prev_token is None) because it only accepts Ordering::Greater; change the comparison to accept both Ordering::Equal and Ordering::Greater (or compare token.range().start() against self.range().end()) so tokens that start at the same position as the current range are returned; update the conditional to check for Equal || Greater (or use token_start >= current_end) and ensure prev_token is set before returning to preserve state for subsequent calls.

Comment on lines +281 to +386
```rust
const fn token_kind_value(kind: TokenKind) -> u8 {
match kind {
TokenKind::EndOfFile => 0,
TokenKind::Name
| TokenKind::For
| TokenKind::In
| TokenKind::Pass
| TokenKind::Class
| TokenKind::And
| TokenKind::Is
| TokenKind::Raise
| TokenKind::True
| TokenKind::False
| TokenKind::Assert
| TokenKind::Try
| TokenKind::While
| TokenKind::Yield
| TokenKind::Lambda
| TokenKind::None
| TokenKind::Not
| TokenKind::Or
| TokenKind::Break
| TokenKind::Continue
| TokenKind::Global
| TokenKind::Nonlocal
| TokenKind::Return
| TokenKind::Except
| TokenKind::Import
| TokenKind::Case
| TokenKind::Match
| TokenKind::Type
| TokenKind::Await
| TokenKind::With
| TokenKind::Del
| TokenKind::Finally
| TokenKind::From
| TokenKind::Def
| TokenKind::If
| TokenKind::Else
| TokenKind::Elif
| TokenKind::As
| TokenKind::Async => 1,
TokenKind::Int | TokenKind::Complex | TokenKind::Float => 2,
TokenKind::String => 3,
TokenKind::Newline | TokenKind::NonLogicalNewline => 4,
TokenKind::Indent => 5,
TokenKind::Dedent => 6,
TokenKind::Lpar => 7,
TokenKind::Rpar => 8,
TokenKind::Lsqb => 9,
TokenKind::Rsqb => 10,
TokenKind::Colon => 11,
TokenKind::Comma => 12,
TokenKind::Semi => 13,
TokenKind::Plus => 14,
TokenKind::Minus => 15,
TokenKind::Star => 16,
TokenKind::Slash => 17,
TokenKind::Vbar => 18,
TokenKind::Amper => 19,
TokenKind::Less => 20,
TokenKind::Greater => 21,
TokenKind::Equal => 22,
TokenKind::Dot => 23,
TokenKind::Percent => 24,
TokenKind::Lbrace => 25,
TokenKind::Rbrace => 26,
TokenKind::EqEqual => 27,
TokenKind::NotEqual => 28,
TokenKind::LessEqual => 29,
TokenKind::GreaterEqual => 30,
TokenKind::Tilde => 31,
TokenKind::CircumFlex => 32,
TokenKind::LeftShift => 33,
TokenKind::RightShift => 34,
TokenKind::DoubleStar => 35,
TokenKind::PlusEqual => 36,
TokenKind::MinusEqual => 37,
TokenKind::StarEqual => 38,
TokenKind::SlashEqual => 39,
TokenKind::PercentEqual => 40,
TokenKind::AmperEqual => 41,
TokenKind::VbarEqual => 42,
TokenKind::CircumflexEqual => 43,
TokenKind::LeftShiftEqual => 44,
TokenKind::RightShiftEqual => 45,
TokenKind::DoubleStarEqual => 46,
TokenKind::DoubleSlash => 47,
TokenKind::DoubleSlashEqual => 48,
TokenKind::At => 49,
TokenKind::AtEqual => 50,
TokenKind::Rarrow => 51,
TokenKind::Ellipsis => 52,
TokenKind::ColonEqual => 53,
TokenKind::Exclamation => 54,
TokenKind::FStringStart => 59,
TokenKind::FStringMiddle => 60,
TokenKind::FStringEnd => 61,
TokenKind::Comment => 62,
TokenKind::TStringStart => 62, // 3.14 compatible
TokenKind::TStringMiddle => 63, // 3.14 compatible
TokenKind::TStringEnd => 64, // 3.14 compatible
TokenKind::IpyEscapeCommand | TokenKind::Question => 0, // Ruff's specific
TokenKind::Unknown => 0,
}
}
```

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify token value mappings match CPython.

This mapping should match the token values from CPython's token module. The comment at line 22 indicates TOKEN_OP is 55, which is correct for Python 3.x.

Please verify that all other token values match CPython, especially:

  • Lines 380-382: TString tokens marked as "3.14 compatible"
  • Line 379: Comment token maps to 62 (same as TStringStart on line 380)
  • Lines 383-384: Ruff-specific tokens map to 0

🌐 Web query:

What are the token type values in CPython's token module for Python 3.14?

💡 Result:

I can show you the exact numeric values from your local CPython 3.14 installation — run this in a Python 3.14 interpreter:

python -c "import token, pprint; pprint.pprint({name: getattr(token, name) for name in sorted(token.tok_name.values())})"

Or, to get the reverse mapping (name → numeric value):

python -c "import token, pprint; pprint.pprint({k: v for v, k in token.tok_name.items()})"

The official Python 3.14 documentation for the token module (and the module source) is here for reference. [1]

Sources:
[1] Python 3.14 token module documentation — docs.python.org.


```python
#!/usr/bin/env python3
import sys

try:
    import token

    print("Python version:", sys.version)
    print("\nToken name to value mapping:")
    # Get token name to value mapping
    tok_mapping = {}
    for value, name in token.tok_name.items():
        tok_mapping[name] = value
    # Sort and print
    for name in sorted(tok_mapping.keys()):
        print(f"{name}: {tok_mapping[name]}")
except ImportError as e:
    print(f"Error: {e}")
```

🏁 Script executed:

```bash
#!/bin/bash
# Also check if we can inspect the tokenize.rs file for comparison
cd /tmp || exit 1
cat > check_cpython_tokens.py << 'EOF'
import token
import pprint

print("=== Token module values ===")
for tok_val in sorted(token.tok_name.keys()):
    print(f"{token.tok_name[tok_val]}: {tok_val}")
EOF
python3 check_cpython_tokens.py
```

Length of output: 844



🏁 Script executed:

```bash
#!/bin/bash
# Get the full TokenKind enum definition and context around the function
cd /tmp || exit 1
find . -name "tokenize.rs" -type f 2>/dev/null | head -5
```

Length of output: 0


🏁 Script executed:

```bash
#!/bin/bash
# Search for TokenKind enum definition
rg -n "enum TokenKind" --type=rs -A 100
```

Length of output: 26



🏁 Script executed:

```bash
#!/bin/bash
# Inspect the TokenKind enum and the token_kind_value function
rg -n "enum TokenKind" -A 150 | head -200
```

Length of output: 0


🏁 Script executed:

```bash
#!/bin/bash
# Look at the token_kind_value function and surrounding context
rg -n "const fn token_kind_value" -B 5 -A 120 | head -200
```

Length of output: 6992


🏁 Script executed:

```bash
#!/bin/bash
# Search for any comments about token mapping or Python compatibility
rg -n "CPython|Python.*token|token.*value" stdlib/src/tokenize.rs | head -20
```

Length of output: 333


Critical token mapping errors found - multiple duplicates and CPython misalignment.

Verification against CPython 3.14 token module revealed significant issues:

  1. Duplicate values (lines 379-380): Both Comment and TStringStart map to 62, which is invalid
  2. Incorrect Comment value (line 379): Maps to 62, but CPython has COMMENT=61
  3. FString tokens don't match CPython (lines 376-378): Map to 59, 60, 61, which correspond to SOFT_KEYWORD, ERRORTOKEN, and COMMENT in CPython, not FString values
  4. TString tokens misaligned (lines 380-382): The "3.14 compatible" comment is inaccurate; these values (62, 63, 64) don't correspond to template string tokens in CPython 3.14
  5. Exclamation token (line 375): Maps to 54, but CPython has OP=54; Exclamation is not a standard Python token type

The mapping for lines 375-382 needs correction to either match CPython standard tokens or use reserved/unassigned values to avoid collisions and type misalignment.

🤖 Prompt for AI Agents
stdlib/src/tokenize.rs around lines 281-386: the TokenKind->u8 mapping has duplicate and misaligned values (e.g., Comment and TStringStart both 62, FStringStart/Middle/End set to 59/60/61 which conflict with CPython 3.14 meanings, Exclamation mapped to 54 which collides with OP, and the "3.14 compatible" comments are incorrect). Fix by replacing those values with the correct CPython 3.14 token numeric assignments for COMMENT, ERRORTOKEN, SOFT_KEYWORD, and so on if you intend CPython parity; if you intentionally diverge, pick unused/reserved numeric values that do not collide with existing entries, remove or correct the misleading "3.14 compatible" comments, ensure IpyEscapeCommand/Question and Unknown map to distinct safe values (or 0 only when appropriate), and run tokenizer unit tests and a full token value uniqueness check to confirm no duplicates remain.
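A quick way to ground this discussion is to print the values your local CPython actually assigns to the names this review calls out (attributes missing on older versions print as None). This is only a verification aid, not part of the PR:

```python
import sys
import token

print(sys.version)
for name in ("OP", "COMMENT", "NL", "ERRORTOKEN", "SOFT_KEYWORD",
             "FSTRING_START", "FSTRING_MIDDLE", "FSTRING_END"):
    # None means this CPython version does not define the token name
    print(name, "=", getattr(token, name, None))
```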

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (5)
stdlib/src/tokenize.rs (5)

121-143: EOF tuple must use the parser's actual EndOfFile token.

The code synthesizes the EOF tuple with hardcoded (-1) positions (lines 132, 135) and reuses state.current_line(), which produces incorrect coordinates and line content. Instead, retrieve the parser's EndOfFile token from state.tokens, set it as prev_token, and let the standard token-emission path (lines 145-175) produce the tuple with correct metadata.

Apply this approach:

```diff
                 let nline = zelf.readline(vm)?;
                 if nline.is_empty() {
                     state.eof = true;
-                    *zelf.state.write() = state.clone();
-
-                    let line_num = &state.start().0;
-                    let out = vm
-                        .ctx
-                        .new_tuple(vec![
-                            token_kind_value(TokenKind::EndOfFile).to_pyobject(vm),
-                            vm.ctx.new_str("").into(),
-                            vm.ctx
-                                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                                .into(),
-                            vm.ctx
-                                .new_tuple(vec![line_num.to_pyobject(vm), (-1).to_pyobject(vm)])
-                                .into(),
-                            vm.ctx.new_str(state.current_line()).into(),
-                        ])
-                        .into();
-                    return Ok(PyIterReturn::Return(out));
+                    if let Some(eof) = state
+                        .tokens
+                        .iter()
+                        .find(|token| token.kind() == TokenKind::EndOfFile)
+                    {
+                        state.prev_token = Some(*eof);
+                        break *eof;
+                    }
+                    *zelf.state.write() = state.clone();
+                    return Ok(PyIterReturn::StopIteration(None));
                 }
```

156-159: Remove trim() to preserve exact token lexemes.

Calling trim() when extra_tokens is false (line 158) strips significant whitespace, causing INDENT tokens to become empty strings. CPython's tokenizer preserves the exact lexeme. Use the slice as-is.

Apply this diff:

```diff
-            let mut token_repr = &state.source[state.range()];
-            if !zelf.extra_tokens {
-                token_repr = token_repr.trim();
-            }
+            let token_repr = &state.source[state.range()];
```

183-184: Make extra_tokens optional with default false.

CPython allows TokenizerIter(readline) without specifying extra_tokens. The current declaration requires the argument, breaking API compatibility.

Apply this diff:

```diff
-        #[pyarg(named)]
-        extra_tokens: bool,
+        #[pyarg(named, default = "false")]
+        extra_tokens: bool,
```

227-243: Token ordering skips tokens that overlap the current range.

When prev_token is None, self.range() returns TextRange::default() (0..0). The comparison matches!(range.ordering(self.range()), Ordering::Greater) (line 236) only matches tokens completely after the current range. Tokens starting at position 0 (e.g., 0..1) overlap with 0..0, returning Ordering::Equal, and are skipped. The first token in the source is never yielded.

Fix by comparing start positions:

```diff
-                if matches!(range.ordering(self.range()), Ordering::Greater) {
+                if range.start() > self.range().end() {
                     self.prev_token = Some(*token);
                     return self.prev_token;
                 }
```

286-391: Critical token mapping errors: duplicates and CPython conflicts.

Multiple critical issues remain:

  1. Duplicate value 62 (lines 384-385): Both Comment and TStringStart map to 62
  2. Incorrect Comment mapping (line 384): Maps to 62, but CPython 3.14 has COMMENT=61
  3. FString conflicts (lines 381-383): Values 59, 60, 61 correspond to SOFT_KEYWORD, ERRORTOKEN, and COMMENT in CPython 3.14
  4. Exclamation collision (line 380): Maps to 54, which is CPython's OP token
  5. Misleading comments (lines 385-387): The "3.14 compatible" claims are inaccurate

These mappings must be corrected to either:

  • Match CPython 3.14 standard token values exactly, or
  • Use reserved/unassigned values (e.g., 100+) to avoid collisions

Run this verification to confirm no remaining duplicates after fixes:

```bash
#!/bin/bash
# Extract all token value assignments and check for duplicates
rg -n "TokenKind::\w+ => \d+" stdlib/src/tokenize.rs | \
  gawk '{print $NF}' | sort | uniq -d | \
  xargs -I {} echo "Duplicate token value found: {}"
```
🧹 Nitpick comments (2)
stdlib/src/tokenize.rs (2)

189-212: State structure is correct; consider performance implications.

The push_line method reparses the entire accumulated source on each call (line 208). While correct, this has O(n²) complexity for n lines. This is acceptable for now but may need optimization for large files.


113-118: TODO: Implement syntax error handling.

The TODO on line 114 indicates that parse errors from state.errors are not currently raised as SyntaxError. This means invalid Python syntax may be silently ignored. Consider implementing error checking in a follow-up to ensure proper error reporting.

Based on learnings.

📜 Review details

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 04371d3 and 30f54f6.

⛔ Files ignored due to path filters (1)
  • Lib/test/test_tokenize.py is excluded by !Lib/**
📒 Files selected for processing (1)
  • stdlib/src/tokenize.rs (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.rs

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.rs: Format Rust code with the default rustfmt style (run cargo fmt)
Run clippy and fix any warnings or lints introduced by your changes
Follow Rust best practices for error handling and memory management

Files:

  • stdlib/src/tokenize.rs
{vm,stdlib}/**/*.rs

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Use RustPython macros (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

Files:

  • stdlib/src/tokenize.rs
🧠 Learnings (2)
📓 Common learnings
Learnt from: CR | Repo: RustPython/RustPython | PR: 0 | File: .github/copilot-instructions.md:0-0 | Timestamp: 2025-09-11T05:06:59.971Z | Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust
Learnt from: CR | Repo: RustPython/RustPython | PR: 0 | File: .github/copilot-instructions.md:0-0 | Timestamp: 2025-09-11T05:06:59.971Z | Learning: Applies to Lib/**/*.py : Minimize modifications to CPython standard library files; only make minimal changes to work around RustPython limitations
📚 Learning: 2025-09-11T05:06:59.971Z
Learnt from: CR | Repo: RustPython/RustPython | PR: 0 | File: .github/copilot-instructions.md:0-0 | Timestamp: 2025-09-11T05:06:59.971Z | Learning: Applies to {vm,stdlib}/**/*.rs : Use RustPython macros (`pyclass`, `pymodule`, `pyfunction`, etc.) when implementing Python functionality in Rust

Applied to files:

  • stdlib/src/tokenize.rs
🧬 Code graph analysis (1)
stdlib/src/tokenize.rs (2)
vm/src/stdlib/io.rs (1)
  • vm (3324-3326)
Lib/test/test_tokenize.py (4)
  • readline (1328-1334)
  • readline (1349-1355)
  • readline (1575-1580)
  • readline (2105-2106)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: Run snippets and cpython tests (ubuntu-latest)
  • GitHub Check: Run snippets and cpython tests on wasm-wasi
  • GitHub Check: Run snippets and cpython tests (macos-latest)
  • GitHub Check: Run snippets and cpython tests (windows-latest)
  • GitHub Check: Run rust tests (macos-latest)
  • GitHub Check: Run rust tests (windows-latest)
  • GitHub Check: Check the WASM package and demo
  • GitHub Check: Run rust tests (ubuntu-latest)
  • GitHub Check: Check Rust code with rustfmt and clippy
  • GitHub Check: Ensure compilation on various targets
🔇 Additional comments (9)
stdlib/src/tokenize.rs (9)

1-23: LGTM: Module structure and imports are well-organized.

The #[pymodule] macro usage follows RustPython patterns, and the TOKEN_OP constant correctly matches CPython's value.


25-33: Struct definition is sound; track the TODO for future work.

The use of PyRwLock for state management is appropriate. The TODO on line 29 regarding PyObject vs ArgCallable can be addressed in a follow-up.


35-64: LGTM: Error handling correctly distinguishes StopIteration from other exceptions.

The implementation now properly treats only StopIteration as EOF (lines 42-43) and propagates all other errors (line 45), fixing the critical issue from the previous review. The encoding logic for both bytes and strings is also correct.


66-77: LGTM: Standard Debug implementation and trait delegation.


79-98: LGTM: Constructor implementation is clean.


100-100: LGTM: Standard SelfIter implementation.


214-225: LGTM: Current line extraction logic is appropriate.


245-267: LGTM: Position helper methods correctly convert to Python conventions.

The start() and end() methods properly return 1-indexed line numbers and 0-indexed column offsets, matching Python's tokenize module.
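For reference, the convention being matched here, taken from CPython's tokenize (rows are 1-based, columns are 0-based):

```python
import io
import tokenize

first = next(tokenize.generate_tokens(io.StringIO("x = 1\n").readline))
print(first.string, first.start, first.end)  # x (1, 0) (1, 1)
```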


270-284: LGTM: Default state initialization is clean.

ShaharNaveh marked this pull request as draft November 14, 2025 10:26

Reviewers

coderabbitai[bot] left review comments

Awaiting requested review from coolreader18

Awaiting requested review from youknowone

At least 1 approving review is required to merge this pull request.

Assignees

No one assigned

Labels

None yet

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

1 participant

@ShaharNaveh
