
gh-102856: Initial implementation of PEP 701 #102855


Merged

Conversation

@pablogsal (Member) commented Mar 20, 2023 (edited by bedevere-bot)
@pablogsal changed the title from "Initial implementation of PEP 701" to "gh-102856: Initial implementation of PEP 701" on Mar 20, 2023
@pablogsal force-pushed the fstring-grammar-rebased-after-sprint branch from 7cb2e44 to ed0ef34 on March 20, 2023 22:29
@ghost commented Mar 26, 2023 (edited by ghost)

All commit authors signed the Contributor License Agreement.
CLA signed

@pablogsal marked this pull request as ready for review on April 13, 2023 10:05
@sunmy2019 (Member)

One issue is that, with the current grammar,

`f"{lambda x:{123}}"`

will be recognized as a valid lambda, but

`f"{lambda x: {123}}"`
`f"{lambda x:{123}}"`

won't. It definitely confuses users.
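For reference, a minimal runnable sketch (not from this PR) of the two roles the `:` can play inside a replacement field, and the usual way to sidestep the conflict by parenthesising the lambda:

```python
# Hypothetical illustration, runnable on any Python with f-strings (3.6+).
value = f"{(lambda x: x + 1)(41)}"  # parenthesised lambda: the ':' unambiguously belongs to the lambda
width = f"{123:{10}}"               # unparenthesised ':' starts a format spec (here with a nested field)
print(value)        # 42
print(repr(width))  # '       123'  (123 right-aligned in a field of width 10)
```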

I can't figure out an elegant way to fix this with the current tokens. Since the info of `in_format_spec` only exists while the token is being tokenized, the information is lost when exiting that state.

One workaround is to emit an empty `fstring_middle` to prevent any further match by the `lambdef`.

Another workaround is to add two tokens, `FSTRING_REPLACEMENT_FIELD_START`/`END`; this preserves the `in_format_spec` info when passed to the parser.

@pablogsal (Member, Author)

@sunmy2019 the changes may also make `test_tokenize.TestRoundtrip.test_random_files` fail for some cases, but that may be an older failure.

@sunmy2019 (Member)

> @sunmy2019 the changes may also make `test_tokenize.TestRoundtrip.test_random_files` fail for some cases, but that may be an older failure.

I ran CPU-heavy tests yesterday and found this failure. See here: pablogsal#67 (comment)

Both the tokenize and the untokenize functions need a rewrite.

@pablogsal added the 🔨 test-with-refleak-buildbots label (Test PR w/ refleak buildbots; report in status section) on Apr 13, 2023
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit 18f69e6 🤖

If you want to schedule another build, you need to add the 🔨 test-with-refleak-buildbots label again.

@bedevere-bot removed the 🔨 test-with-refleak-buildbots label on Apr 13, 2023
@pablogsal (Member, Author)

We are almost there! We have a failing test on some buildbots:

https://buildbot.python.org/all/#/builders/802/builds/760

I cannot reproduce it on my Mac machine. Maybe someone has more luck with a Linux system.

@isidentical (Member)

> I cannot reproduce it on my Mac machine. Maybe someone has more luck with a Linux system.

No luck on my side either (with a Linux machine + debug build + refleaks) for test_ast/test_fstring/test_tokenize. Trying the whole test suite (is there a specific option I might be missing?)

@lysnikolaou (Member)

I'm able to reproduce on a Debian container using Docker on my macOS. The problem has to do with code like `eval('f""')`. When the f-string is too small, it results in either `start_char` or `peek1` or both (here) being EOF. For some reason, on this machine with this configuration they're not -1 (EOF), but rather 255, which means that the relevant check in `tok_backup` fails and we have a fatal error raised from here. I can't explain why they wouldn't be EOF until now, but I'm looking.

@lysnikolaou (Member)

More info. When running it with Python, I get the following error:

root@9ee555036b0f:/usr/src/cpython# cat t.py
eval('f"a"')
root@9ee555036b0f:/usr/src/cpython# ./python t.py
Fatal Python error: tok_backup: tok_backup: wrong character
Python runtime state: initialized

Current thread 0x0000ffff9de38750 (most recent call first):
  File "/usr/src/cpython/t.py", line 1 in <module>
Aborted

Here's a simple step through tok_get_fstring_mode on gdb in the last pass that generates the error:

(gdb) file ./python
Reading symbols from ./python...
warning: File "/usr/src/cpython/python-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
        add-auto-load-safe-path /usr/src/cpython/python-gdb.py
line to your configuration file "/root/.gdbinit".
To completely disable this security protection add
        set auto-load safe-path /
line to your configuration file "/root/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
        info "(gdb)Auto-loading safe path"
(gdb) break tok_get_fstring_mode
Breakpoint 1 at 0x17119c: file Parser/tokenizer.c, line 2442.
(gdb) r t.py
Starting program: /usr/src/cpython/python t.py
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".

Breakpoint 1, tok_get_fstring_mode (tok=0xaaab0ecfec60, current_tok=0xaaab0ecff7d0, token=0xffffc95f8278) at Parser/tokenizer.c:2442
2442    {
(gdb) c
Continuing.

Breakpoint 1, tok_get_fstring_mode (tok=0xaaab0ecfec60, current_tok=0xaaab0ecff7d0, token=0xffffc95f8278) at Parser/tokenizer.c:2442
2442    {
(gdb) p tok->cur
$1 = 0xffff883fc1e3 "\""
(gdb) p tok->buf
$2 = 0xffff883fc1e0 "f\"a\""
(gdb) n
2448        tok->start = tok->cur;
(gdb)
2449        tok->first_lineno = tok->lineno;
(gdb)
2450        tok->starting_col_offset = tok->col_offset;
(gdb)
2454        char start_char = tok_nextc(tok);
(gdb)
2455        char peek1 = tok_nextc(tok);
(gdb) p start_char
$3 = 34 '"'
(gdb) s
tok_nextc (tok=0xaaab0ecfec60) at Parser/tokenizer.c:1169
1169    {
(gdb) n
1172        if (tok->cur != tok->inp) {
(gdb)
1176        if (tok->done != E_OK) {
(gdb)
1179        if (tok->fp == NULL) {
(gdb)
1180            rc = tok_underflow_string(tok);
(gdb) s
tok_underflow_string (tok=0xaaab0ecfec60) at Parser/tokenizer.c:965
965     tok_underflow_string(struct tok_state *tok) {
(gdb) list
960         } while (tok->inp[-1] != '\n');
961         return 1;
962     }
963
964     static int
965     tok_underflow_string(struct tok_state *tok) {
966         char *end = strchr(tok->inp, '\n');
967         if (end != NULL) {
968             end++;
969         }
(gdb)
970         else {
971             end = strchr(tok->inp, '\0');
972             if (end == tok->inp) {
973                 tok->done = E_EOF;
974                 return 0;
975             }
976         }
977         if (tok->start == NULL) {
978             tok->buf = tok->cur;
979         }
(gdb) n
966         char *end = strchr(tok->inp, '\n');
(gdb)
967         if (end != NULL) {
(gdb)
971             end = strchr(tok->inp, '\0');
(gdb)
972             if (end == tok->inp) {
(gdb)
973                 tok->done = E_EOF;
(gdb)
974                 return 0;
(gdb)
tok_nextc (tok=0xaaab0ecfec60) at Parser/tokenizer.c:1189
1189        if (tok->debug) {
(gdb) list
1184        }
1185        else {
1186            rc = tok_underflow_file(tok);
1187        }
1188    #if defined(Py_DEBUG)
1189        if (tok->debug) {
1190            fprintf(stderr, "line[%d] = ", tok->lineno);
1191            print_escape(stderr, tok->cur, tok->inp - tok->cur);
1192            fprintf(stderr, "  tok->done = %d\n", tok->done);
1193        }
(gdb)
1194    #endif
1195        if (!rc) {
1196            tok->cur = tok->inp;
1197            return EOF;
1198        }
1199        tok->line_start = tok->cur;
1200
1201        if (contains_null_bytes(tok->line_start, tok->inp - tok->line_start)) {
1202            syntaxerror(tok, "source code cannot contain null bytes");
1203            tok->cur = tok->inp;
(gdb) n
1195        if (!rc) {
(gdb)
1196            tok->cur = tok->inp;
(gdb)
1197            return EOF;
(gdb)
tok_get_fstring_mode (tok=0xaaab0ecfec60, current_tok=0xaaab0ecff7d0, token=0xffffc95f8278) at Parser/tokenizer.c:2456
2456        tok_backup(tok, peek1);
(gdb) p peek1
$4 = 255 '\377'

@isidentical (Member) commented Apr 13, 2023 (edited)

> When the f-string is too small, it results in either `start_char` or `peek1` or both (here) being EOF.

Oh, this kind of makes sense, at least as to how we got there. I wonder whether we could simply look at `peek1` only if `start_char` is `{`/`}`. This would prevent the secondary `tok_nextc`/`tok_backup` pair in case the string is too small.

E.g. something like this (just as a hack to test if it works):

diff --git a/Parser/tokenizer.c b/Parser/tokenizer.c
index d88d737860..34f291cf89 100644
--- a/Parser/tokenizer.c
+++ b/Parser/tokenizer.c
@@ -2452,8 +2452,14 @@ tok_get_fstring_mode(struct tok_state *tok, tokenizer_mode* current_tok, struct
     // If we start with a bracket, we defer to the normal mode as there is nothing for us to tokenize
     // before it.
     char start_char = tok_nextc(tok);
-    char peek1 = tok_nextc(tok);
-    tok_backup(tok, peek1);
+    char peek1;
+    if (start_char == '{' || start_char == '}') {
+        peek1 = tok_nextc(tok);
+        tok_backup(tok, peek1);
+    }
+    else {
+        peek1 = '0';
+    }
     tok_backup(tok, start_char);
 
     if ((start_char == '{' && peek1 != '{') || (start_char == '}' && peek1 != '}')) {

For me, `eval(f"a")`

@pablogsal added the 🔨 test-with-refleak-buildbots label (Test PR w/ refleak buildbots; report in status section) on Apr 13, 2023
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit d28efe1 🤖

If you want to schedule another build, you need to add the 🔨 test-with-refleak-buildbots label again.

@bedevere-bot removed the 🔨 test-with-refleak-buildbots label on Apr 13, 2023
@lysnikolaou (Member) commented Apr 13, 2023 (edited)

> When the f-string is too small, it results in either `start_char` or `peek1` or both (here) being EOF.

> Oh, this kind of makes sense, at least as to how we got there. I wonder whether we could simply look at `peek1` only if `start_char` is `{`/`}`. This would prevent the secondary `tok_nextc`/`tok_backup` pair in case the string is too small.

Not sure whether this is the actual problem though. `tok_backup` is okay to handle `EOF` and, on the other platforms we're testing, everything seems to work okay. The reason is that every check will fail until we reach this, which should be able to handle things correctly.

The big question to me is how we end up with `peek1 == 255`, when it very clearly came from `return EOF` and the subsequent check `c == EOF` in `tok_backup` fails.

@sunmy2019 (Member) commented Apr 13, 2023 (edited)

> The big question to me is how we end up with `peek1 == 255`, when it very clearly came from `return EOF` and the subsequent check `c == EOF` in `tok_backup` fails.

`char` is unsigned on those platforms (ARM). Thus,

    char start_char = tok_nextc(tok);
    char peek1 = tok_nextc(tok);

will lead to a 255.

Then the 255 is converted to an int again in `tok_backup`.

I can reproduce this problem on x86 with

    unsigned char start_char = tok_nextc(tok);
    unsigned char peek1 = tok_nextc(tok);

@sunmy2019 (Member)

C allows any `int` to convert to `char`, which may silently change its value. That is exactly what we have in this case.

Explicitly using `signed char` or `int` should fix this 255 problem. (I prefer the `int` one.)
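A rough, runnable sketch (not from the PR; the actual code is C) of the conversion being described, using `ctypes` to mimic the char widths: once EOF (-1) is stored in an unsigned 8-bit char it becomes 255, so a later `== EOF` comparison fails, which is exactly why the check in `tok_backup` stopped firing on ARM.

```python
# Hypothetical illustration of the C signedness pitfall via ctypes.
import ctypes

EOF = -1
as_unsigned_char = ctypes.c_ubyte(EOF).value  # what plain `char` holds where char is unsigned (e.g. ARM Linux)
as_signed_char = ctypes.c_byte(EOF).value     # what plain `char` holds where char is signed (e.g. x86-64 Linux)
as_int = EOF                                  # the suggested fix: keep the value in an int

print(as_unsigned_char, as_unsigned_char == EOF)  # 255 False  -> the `c == EOF` check fails
print(as_signed_char, as_signed_char == EOF)      # -1 True
print(as_int, as_int == EOF)                      # -1 True
```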

@lysnikolaou (Member)

Oooh, that's right! Didn't know that ARM has unsigned chars by default. Pushed a fix.

@isidentical (Member)

Wow, that's a nice find!!

@pablogsal (Member, Author)

Same thing for @lysnikolaou 😉

@pablogsal added the 🔨 test-with-buildbots label (Test PR w/ buildbots; report in status section) on Apr 18, 2023
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit afb310d 🤖

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot removed the 🔨 test-with-buildbots label on Apr 18, 2023
@lysnikolaou (Member) left a comment

LGTM! Let's merge! 🚀
@isidentical (Member) left a comment

💫 Looks great, thanks everyone for their amazing work!!
@pablogsal merged commit 1ef61cf into python:main on Apr 19, 2023
@pablogsal deleted the fstring-grammar-rebased-after-sprint branch on April 19, 2023 16:18
@python deleted a comment from bedevere-bot on Apr 19, 2023
carljm added a commit to carljm/cpython that referenced this pull request on Apr 20, 2023

* main: (24 commits)
  pythongh-98040: Move the Single-Phase Init Tests Out of test_imp (pythongh-102561)
  pythongh-83861: Fix datetime.astimezone() method (pythonGH-101545)
  pythongh-102856: Clean some of the PEP 701 tokenizer implementation (python#103634)
  pythongh-102856: Skip test_mismatched_parens in WASI builds (python#103633)
  pythongh-102856: Initial implementation of PEP 701 (python#102855)
  pythongh-103583: Add ref. dependency between multibytecodec modules (python#103589)
  pythongh-83004: Harden msvcrt further (python#103420)
  pythonGH-88342: clarify that `asyncio.as_completed` accepts generators yielding tasks (python#103626)
  pythongh-102778: IDLE - make sys.last_exc available in Shell after traceback (python#103314)
  pythongh-103582: Remove last references to `argparse.REMAINDER` from docs (python#103586)
  pythongh-103583: Always pass multibyte codec structs as const (python#103588)
  pythongh-103617: Fix compiler warning in _iomodule.c (python#103618)
  pythongh-103596: [Enum] do not shadow mixed-in methods/attributes (pythonGH-103600)
  pythonGH-100530: Change the error message for non-class class patterns (pythonGH-103576)
  pythongh-95299: Remove lingering setuptools reference in installer scripts (pythonGH-103613)
  [Doc] Fix a typo in optparse.rst (python#103504)
  pythongh-101100: Fix broken reference `__format__` in `string.rst` (python#103531)
  pythongh-95299: Stop installing setuptools as a part of ensurepip and venv (python#101039)
  pythonGH-103484: Docs: add linkcheck allowed redirects entries for most cases (python#103569)
  pythongh-67230: update whatsnew note for csv changes (python#103598)
  ...
@lysnikolaou restored the fstring-grammar-rebased-after-sprint branch on July 22, 2023 09:14
@lysnikolaou deleted the fstring-grammar-rebased-after-sprint branch on July 22, 2023 09:18
dhruvmanila added a commit to astral-sh/ruff that referenced this pull request on Sep 14, 2023
## Summary

This PR adds support for PEP 701 in the parser to use the new tokens emitted by the lexer to construct the f-string node.

### Grammar

Without an official grammar, the f-strings were parsed manually. Now that we have the specification, that is being used in the LALRPOP grammar to parse the f-strings.

### `string.rs`

This file includes the logic for parsing string literals and joining the implicit string concatenation. Now that we don't require parsing f-strings manually, a lot of code involving the same is removed.

Earlier, there were 2 entry points to this module:
* `parse_string`: Used to parse a single string literal
* `parse_strings`: Used to parse strings which were implicitly concatenated

Now, there are 3 entry points:
* `parse_string_literal`: Renamed from `parse_string`
* `parse_fstring_middle`: Used to parse a `FStringMiddle` token, which is basically a string literal without the quotes
* `concatenate_strings`: Renamed from `parse_strings`, but now it takes the parsed nodes instead. So, we just need to concatenate them into a single node.

> A short primer on the `FStringMiddle` token: this includes the portion of text inside the f-string that's not part of the expression and isn't an opening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the `foo `, `.3f` and ` bar` are `FStringMiddle` token content.

### `Constant::kind` changed in the AST

Discussion in the official implementation: python/cpython#102855 (comment)

This change in the AST is when unicode strings (prefixed with `u`) and f-strings are used in an implicitly concatenated string value. For example,

```python
u"foo" f"{bar}" "baz" " some"
```

Pre Python 3.12, the kind field would be assigned only if the prefix was on the first string. So, taking the above example, both `"foo"` and `"baz some"` (implicit concatenation) would be given the `u` kind:

<details><summary>Pre 3.12 AST:</summary>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some', kind='u')
```

</details>

But, post Python 3.12, only the string with the `u` prefix will be assigned the value:

<details><summary>Post 3.12 AST:</summary>

```python
Constant(value='foo', kind='u'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='baz some')
```

</details>

Here are some more iterations around the change:

1. `"foo" f"{bar}" u"baz" "no"`

<details><summary>Pre 3.12</summary>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</details>

<details><summary>3.12</summary>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno', kind='u')
```

</details>

2. `"foo" f"{bar}" "baz" u"no"`

<details><summary>Pre 3.12</summary>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</details>

<details><summary>3.12</summary>

```python
Constant(value='foo'),
FormattedValue(
  value=Name(id='bar', ctx=Load()),
  conversion=-1),
Constant(value='bazno')
```

</details>

3. `u"foo" f"bar {baz} realy" u"bar" "no"`

<details><summary>Pre 3.12</summary>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno', kind='u')
```

</details>

<details><summary>3.12</summary>

```python
Constant(value='foobar ', kind='u'),
FormattedValue(
  value=Name(id='baz', ctx=Load()),
  conversion=-1),
Constant(value=' realybarno')
```

</details>

### Errors

With the hand-written parser, we were able to provide better error messages in case of any errors such as the following, but now they are all removed and in those cases an "unexpected token" error will be thrown by LALRPOP:
* A closing delimiter was not opened properly
* An opening delimiter was not closed properly
* Empty expression not allowed

The "Too many nested expressions in an f-string" error was removed; instead we can create a lint rule for that.

And, "The f-string expression cannot include the given character" was removed because f-strings now support those characters, which are mainly the same quotes as the outer ones, escape sequences, comments, etc.

## Test Plan

1. Refactor existing test cases to use `parse_suite` instead of `parse_fstrings` (doesn't exist anymore)
2. Additional test cases are added as required

Updated the snapshots. The change from `parse_fstrings` to `parse_suite` means that the snapshot would produce the module node instead of just a list of f-string parts. I've manually verified that the parts are still the same along with the node ranges.

## Benchmarks

#7263 (comment)

fixes: #7043
fixes: #6835
dhruvmanila added a commit to astral-sh/ruff that referenced this pull requestSep 18, 2023
## SummaryThis PR adds support for PEP 701 in the parser to use the new tokensemitted by the lexer to construct the f-string node.### GrammarWithout an official grammar, the f-strings were parsed manually. Nowthat we've the specification, that is being used in the LALRPOP to parsethe f-strings.### `string.rs`This file includes the logic for parsing string literals and joining theimplicit string concatenation. Now that we don't require parsingf-strings manually a lot of code involving the same is removed.Earlier, there were 2 entry points to this module:* `parse_string`: Used to parse a single string literal* `parse_strings`: Used to parse strings which were implicitlyconcatenatedNow, there are 3 entry points:* `parse_string_literal`: Renamed from `parse_string`* `parse_fstring_middle`: Used to parse a `FStringMiddle` token which isbasically a string literal without the quotes* `concatenate_strings`: Renamed from `parse_strings` but now it takesthe parsed nodes instead. So, we just need to concatenate them into asingle node.> A short primer on `FStringMiddle` token: This includes the portion oftext inside the f-string that's not part of the expression and isn't anopening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the`foo `, `.3f` and ` bar` are `FStringMiddle` token content.### `Constant::kind` changed in the AST***Discussion in the official implementation:python/cpython#102855 (comment)This change in the AST is when unicode strings (prefixed with `u`) andf-strings are used in an implicitly concatenated string value. Forexample,```pythonu"foo" f"{bar}" "baz" " some"```Pre Python 3.12, the kind field would be assigned only if the prefix wason the first string. So, taking the above example, both `"foo"` and`"baz some"` (implicit concatenation) would be given the `u` kind:<details><summary>Pre 3.12 AST:</summary><p>```pythonConstant(value='foo', kind='u'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='baz some', kind='u')```</p></details> But, post Python 3.12, only the string with the `u` prefix will beassigned the value:<details><summary>Pre 3.12 AST:</summary><p>```pythonConstant(value='foo', kind='u'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='baz some')```</p></details>Here are some more iterations around the change:1. `"foo" f"{bar}" u"baz" "no"`<details><summary>Pre 3.12</summary><p>```pythonConstant(value='foo'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='bazno')```</p></details><details><summary>3.12</summary><p>```pythonConstant(value='foo'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='bazno', kind='u')```</p></details> 2. `"foo" f"{bar}" "baz" u"no"`<details><summary>Pre 3.12</summary><p>```pythonConstant(value='foo'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='bazno')```</p></details><details><summary>3.12</summary><p>```pythonConstant(value='foo'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='bazno')```</p></details>3. 
`u"foo" f"bar {baz} realy" u"bar" "no"`<details><summary>Pre 3.12</summary><p>```pythonConstant(value='foobar ', kind='u'),FormattedValue(  value=Name(id='baz', ctx=Load()),  conversion=-1),Constant(value=' realybarno', kind='u')```</p></details><details><summary>3.12</summary><p>```pythonConstant(value='foobar ', kind='u'),FormattedValue(  value=Name(id='baz', ctx=Load()),  conversion=-1),Constant(value=' realybarno')```</p></details> ### ErrorsWith the hand written parser, we were able to provide better errormessages in case of any errors such as the following but now they allare removed and in those cases an "unexpected token" error will bethrown by lalrpop:* A closing delimiter was not opened properly* An opening delimiter was not closed properly* Empty expression not allowedThe "Too many nested expressions in an f-string" was removed and insteadwe can create a lint rule for that.And, "The f-string expression cannot include the given character" wasremoved because f-strings now support those characters which are mainlysame quotes as the outer ones, escape sequences, comments, etc.## Test Plan1. Refactor existing test cases to use `parse_suite` instead of`parse_fstrings` (doesn't exists anymore)2. Additional test cases are added as requiredUpdated the snapshots. The change from `parse_fstrings` to `parse_suite`means that the snapshot would produce the module node instead of just alist of f-string parts. I've manually verified that the parts are stillthe same along with the node ranges.## Benchmarks#7263 (comment)fixes:#7043fixes:#6835
dhruvmanila added a commit to astral-sh/ruff that referenced this pull requestSep 19, 2023
## SummaryThis PR adds support for PEP 701 in the parser to use the new tokensemitted by the lexer to construct the f-string node.### GrammarWithout an official grammar, the f-strings were parsed manually. Nowthat we've the specification, that is being used in the LALRPOP to parsethe f-strings.### `string.rs`This file includes the logic for parsing string literals and joining theimplicit string concatenation. Now that we don't require parsingf-strings manually a lot of code involving the same is removed.Earlier, there were 2 entry points to this module:* `parse_string`: Used to parse a single string literal* `parse_strings`: Used to parse strings which were implicitlyconcatenatedNow, there are 3 entry points:* `parse_string_literal`: Renamed from `parse_string`* `parse_fstring_middle`: Used to parse a `FStringMiddle` token which isbasically a string literal without the quotes* `concatenate_strings`: Renamed from `parse_strings` but now it takesthe parsed nodes instead. So, we just need to concatenate them into asingle node.> A short primer on `FStringMiddle` token: This includes the portion oftext inside the f-string that's not part of the expression and isn't anopening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the`foo `, `.3f` and ` bar` are `FStringMiddle` token content.### `Constant::kind` changed in the AST***Discussion in the official implementation:python/cpython#102855 (comment)This change in the AST is when unicode strings (prefixed with `u`) andf-strings are used in an implicitly concatenated string value. Forexample,```pythonu"foo" f"{bar}" "baz" " some"```Pre Python 3.12, the kind field would be assigned only if the prefix wason the first string. So, taking the above example, both `"foo"` and`"baz some"` (implicit concatenation) would be given the `u` kind:<details><summary>Pre 3.12 AST:</summary><p>```pythonConstant(value='foo', kind='u'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='baz some', kind='u')```</p></details> But, post Python 3.12, only the string with the `u` prefix will beassigned the value:<details><summary>Pre 3.12 AST:</summary><p>```pythonConstant(value='foo', kind='u'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='baz some')```</p></details>Here are some more iterations around the change:1. `"foo" f"{bar}" u"baz" "no"`<details><summary>Pre 3.12</summary><p>```pythonConstant(value='foo'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='bazno')```</p></details><details><summary>3.12</summary><p>```pythonConstant(value='foo'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='bazno', kind='u')```</p></details> 2. `"foo" f"{bar}" "baz" u"no"`<details><summary>Pre 3.12</summary><p>```pythonConstant(value='foo'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='bazno')```</p></details><details><summary>3.12</summary><p>```pythonConstant(value='foo'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='bazno')```</p></details>3. 
`u"foo" f"bar {baz} realy" u"bar" "no"`<details><summary>Pre 3.12</summary><p>```pythonConstant(value='foobar ', kind='u'),FormattedValue(  value=Name(id='baz', ctx=Load()),  conversion=-1),Constant(value=' realybarno', kind='u')```</p></details><details><summary>3.12</summary><p>```pythonConstant(value='foobar ', kind='u'),FormattedValue(  value=Name(id='baz', ctx=Load()),  conversion=-1),Constant(value=' realybarno')```</p></details> ### ErrorsWith the hand written parser, we were able to provide better errormessages in case of any errors such as the following but now they allare removed and in those cases an "unexpected token" error will bethrown by lalrpop:* A closing delimiter was not opened properly* An opening delimiter was not closed properly* Empty expression not allowedThe "Too many nested expressions in an f-string" was removed and insteadwe can create a lint rule for that.And, "The f-string expression cannot include the given character" wasremoved because f-strings now support those characters which are mainlysame quotes as the outer ones, escape sequences, comments, etc.## Test Plan1. Refactor existing test cases to use `parse_suite` instead of`parse_fstrings` (doesn't exists anymore)2. Additional test cases are added as requiredUpdated the snapshots. The change from `parse_fstrings` to `parse_suite`means that the snapshot would produce the module node instead of just alist of f-string parts. I've manually verified that the parts are stillthe same along with the node ranges.## Benchmarks#7263 (comment)fixes:#7043fixes:#6835
dhruvmanila added a commit to astral-sh/ruff that referenced this pull requestSep 20, 2023
## SummaryThis PR adds support for PEP 701 in the parser to use the new tokensemitted by the lexer to construct the f-string node.### GrammarWithout an official grammar, the f-strings were parsed manually. Nowthat we've the specification, that is being used in the LALRPOP to parsethe f-strings.### `string.rs`This file includes the logic for parsing string literals and joining theimplicit string concatenation. Now that we don't require parsingf-strings manually a lot of code involving the same is removed.Earlier, there were 2 entry points to this module:* `parse_string`: Used to parse a single string literal* `parse_strings`: Used to parse strings which were implicitlyconcatenatedNow, there are 3 entry points:* `parse_string_literal`: Renamed from `parse_string`* `parse_fstring_middle`: Used to parse a `FStringMiddle` token which isbasically a string literal without the quotes* `concatenate_strings`: Renamed from `parse_strings` but now it takesthe parsed nodes instead. So, we just need to concatenate them into asingle node.> A short primer on `FStringMiddle` token: This includes the portion oftext inside the f-string that's not part of the expression and isn't anopening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the`foo `, `.3f` and ` bar` are `FStringMiddle` token content.### `Constant::kind` changed in the AST***Discussion in the official implementation:python/cpython#102855 (comment)This change in the AST is when unicode strings (prefixed with `u`) andf-strings are used in an implicitly concatenated string value. Forexample,```pythonu"foo" f"{bar}" "baz" " some"```Pre Python 3.12, the kind field would be assigned only if the prefix wason the first string. So, taking the above example, both `"foo"` and`"baz some"` (implicit concatenation) would be given the `u` kind:<details><summary>Pre 3.12 AST:</summary><p>```pythonConstant(value='foo', kind='u'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='baz some', kind='u')```</p></details> But, post Python 3.12, only the string with the `u` prefix will beassigned the value:<details><summary>Pre 3.12 AST:</summary><p>```pythonConstant(value='foo', kind='u'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='baz some')```</p></details>Here are some more iterations around the change:1. `"foo" f"{bar}" u"baz" "no"`<details><summary>Pre 3.12</summary><p>```pythonConstant(value='foo'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='bazno')```</p></details><details><summary>3.12</summary><p>```pythonConstant(value='foo'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='bazno', kind='u')```</p></details> 2. `"foo" f"{bar}" "baz" u"no"`<details><summary>Pre 3.12</summary><p>```pythonConstant(value='foo'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='bazno')```</p></details><details><summary>3.12</summary><p>```pythonConstant(value='foo'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='bazno')```</p></details>3. 
`u"foo" f"bar {baz} realy" u"bar" "no"`<details><summary>Pre 3.12</summary><p>```pythonConstant(value='foobar ', kind='u'),FormattedValue(  value=Name(id='baz', ctx=Load()),  conversion=-1),Constant(value=' realybarno', kind='u')```</p></details><details><summary>3.12</summary><p>```pythonConstant(value='foobar ', kind='u'),FormattedValue(  value=Name(id='baz', ctx=Load()),  conversion=-1),Constant(value=' realybarno')```</p></details> ### ErrorsWith the hand written parser, we were able to provide better errormessages in case of any errors such as the following but now they allare removed and in those cases an "unexpected token" error will bethrown by lalrpop:* A closing delimiter was not opened properly* An opening delimiter was not closed properly* Empty expression not allowedThe "Too many nested expressions in an f-string" was removed and insteadwe can create a lint rule for that.And, "The f-string expression cannot include the given character" wasremoved because f-strings now support those characters which are mainlysame quotes as the outer ones, escape sequences, comments, etc.## Test Plan1. Refactor existing test cases to use `parse_suite` instead of`parse_fstrings` (doesn't exists anymore)2. Additional test cases are added as requiredUpdated the snapshots. The change from `parse_fstrings` to `parse_suite`means that the snapshot would produce the module node instead of just alist of f-string parts. I've manually verified that the parts are stillthe same along with the node ranges.## Benchmarks#7263 (comment)fixes:#7043fixes:#6835
dhruvmanila added a commit to astral-sh/ruff that referenced this pull requestSep 22, 2023
This PR adds support for PEP 701 in the parser to use the new tokensemitted by the lexer to construct the f-string node.Without an official grammar, the f-strings were parsed manually. Nowthat we've the specification, that is being used in the LALRPOP to parsethe f-strings.This file includes the logic for parsing string literals and joining theimplicit string concatenation. Now that we don't require parsingf-strings manually a lot of code involving the same is removed.Earlier, there were 2 entry points to this module:* `parse_string`: Used to parse a single string literal* `parse_strings`: Used to parse strings which were implicitlyconcatenatedNow, there are 3 entry points:* `parse_string_literal`: Renamed from `parse_string`* `parse_fstring_middle`: Used to parse a `FStringMiddle` token which isbasically a string literal without the quotes* `concatenate_strings`: Renamed from `parse_strings` but now it takesthe parsed nodes instead. So, we just need to concatenate them into asingle node.> A short primer on `FStringMiddle` token: This includes the portion oftext inside the f-string that's not part of the expression and isn't anopening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the`foo `, `.3f` and ` bar` are `FStringMiddle` token content.***Discussion in the official implementation:python/cpython#102855 (comment)This change in the AST is when unicode strings (prefixed with `u`) andf-strings are used in an implicitly concatenated string value. Forexample,```pythonu"foo" f"{bar}" "baz" " some"```Pre Python 3.12, the kind field would be assigned only if the prefix wason the first string. So, taking the above example, both `"foo"` and`"baz some"` (implicit concatenation) would be given the `u` kind:<details><summary>Pre 3.12 AST:</summary><p>```pythonConstant(value='foo', kind='u'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='baz some', kind='u')```</p></details>But, post Python 3.12, only the string with the `u` prefix will beassigned the value:<details><summary>Pre 3.12 AST:</summary><p>```pythonConstant(value='foo', kind='u'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='baz some')```</p></details>Here are some more iterations around the change:1. `"foo" f"{bar}" u"baz" "no"`<details><summary>Pre 3.12</summary><p>```pythonConstant(value='foo'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='bazno')```</p></details><details><summary>3.12</summary><p>```pythonConstant(value='foo'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='bazno', kind='u')```</p></details>2. `"foo" f"{bar}" "baz" u"no"`<details><summary>Pre 3.12</summary><p>```pythonConstant(value='foo'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='bazno')```</p></details><details><summary>3.12</summary><p>```pythonConstant(value='foo'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='bazno')```</p></details>3. 
`u"foo" f"bar {baz} realy" u"bar" "no"`<details><summary>Pre 3.12</summary><p>```pythonConstant(value='foobar ', kind='u'),FormattedValue(  value=Name(id='baz', ctx=Load()),  conversion=-1),Constant(value=' realybarno', kind='u')```</p></details><details><summary>3.12</summary><p>```pythonConstant(value='foobar ', kind='u'),FormattedValue(  value=Name(id='baz', ctx=Load()),  conversion=-1),Constant(value=' realybarno')```</p></details>With the hand written parser, we were able to provide better errormessages in case of any errors such as the following but now they allare removed and in those cases an "unexpected token" error will bethrown by lalrpop:* A closing delimiter was not opened properly* An opening delimiter was not closed properly* Empty expression not allowedThe "Too many nested expressions in an f-string" was removed and insteadwe can create a lint rule for that.And, "The f-string expression cannot include the given character" wasremoved because f-strings now support those characters which are mainlysame quotes as the outer ones, escape sequences, comments, etc.1. Refactor existing test cases to use `parse_suite` instead of`parse_fstrings` (doesn't exists anymore)2. Additional test cases are added as requiredUpdated the snapshots. The change from `parse_fstrings` to `parse_suite`means that the snapshot would produce the module node instead of just alist of f-string parts. I've manually verified that the parts are stillthe same along with the node ranges.#7263 (comment)fixes:#7043fixes:#6835
dhruvmanila added a commit to astral-sh/ruff that referenced this pull requestSep 22, 2023
This PR adds support for PEP 701 in the parser to use the new tokensemitted by the lexer to construct the f-string node.Without an official grammar, the f-strings were parsed manually. Nowthat we've the specification, that is being used in the LALRPOP to parsethe f-strings.This file includes the logic for parsing string literals and joining theimplicit string concatenation. Now that we don't require parsingf-strings manually a lot of code involving the same is removed.Earlier, there were 2 entry points to this module:* `parse_string`: Used to parse a single string literal* `parse_strings`: Used to parse strings which were implicitlyconcatenatedNow, there are 3 entry points:* `parse_string_literal`: Renamed from `parse_string`* `parse_fstring_middle`: Used to parse a `FStringMiddle` token which isbasically a string literal without the quotes* `concatenate_strings`: Renamed from `parse_strings` but now it takesthe parsed nodes instead. So, we just need to concatenate them into asingle node.> A short primer on `FStringMiddle` token: This includes the portion oftext inside the f-string that's not part of the expression and isn't anopening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the`foo `, `.3f` and ` bar` are `FStringMiddle` token content.***Discussion in the official implementation:python/cpython#102855 (comment)This change in the AST is when unicode strings (prefixed with `u`) andf-strings are used in an implicitly concatenated string value. Forexample,```pythonu"foo" f"{bar}" "baz" " some"```Pre Python 3.12, the kind field would be assigned only if the prefix wason the first string. So, taking the above example, both `"foo"` and`"baz some"` (implicit concatenation) would be given the `u` kind:<details><summary>Pre 3.12 AST:</summary><p>```pythonConstant(value='foo', kind='u'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='baz some', kind='u')```</p></details>But, post Python 3.12, only the string with the `u` prefix will beassigned the value:<details><summary>Pre 3.12 AST:</summary><p>```pythonConstant(value='foo', kind='u'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='baz some')```</p></details>Here are some more iterations around the change:1. `"foo" f"{bar}" u"baz" "no"`<details><summary>Pre 3.12</summary><p>```pythonConstant(value='foo'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='bazno')```</p></details><details><summary>3.12</summary><p>```pythonConstant(value='foo'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='bazno', kind='u')```</p></details>2. `"foo" f"{bar}" "baz" u"no"`<details><summary>Pre 3.12</summary><p>```pythonConstant(value='foo'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='bazno')```</p></details><details><summary>3.12</summary><p>```pythonConstant(value='foo'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='bazno')```</p></details>3. 
`u"foo" f"bar {baz} realy" u"bar" "no"`<details><summary>Pre 3.12</summary><p>```pythonConstant(value='foobar ', kind='u'),FormattedValue(  value=Name(id='baz', ctx=Load()),  conversion=-1),Constant(value=' realybarno', kind='u')```</p></details><details><summary>3.12</summary><p>```pythonConstant(value='foobar ', kind='u'),FormattedValue(  value=Name(id='baz', ctx=Load()),  conversion=-1),Constant(value=' realybarno')```</p></details>With the hand written parser, we were able to provide better errormessages in case of any errors such as the following but now they allare removed and in those cases an "unexpected token" error will bethrown by lalrpop:* A closing delimiter was not opened properly* An opening delimiter was not closed properly* Empty expression not allowedThe "Too many nested expressions in an f-string" was removed and insteadwe can create a lint rule for that.And, "The f-string expression cannot include the given character" wasremoved because f-strings now support those characters which are mainlysame quotes as the outer ones, escape sequences, comments, etc.1. Refactor existing test cases to use `parse_suite` instead of`parse_fstrings` (doesn't exists anymore)2. Additional test cases are added as requiredUpdated the snapshots. The change from `parse_fstrings` to `parse_suite`means that the snapshot would produce the module node instead of just alist of f-string parts. I've manually verified that the parts are stillthe same along with the node ranges.#7263 (comment)fixes:#7043fixes:#6835
dhruvmanila added a commit to astral-sh/ruff that referenced this pull requestSep 22, 2023
This PR adds support for PEP 701 in the parser to use the new tokensemitted by the lexer to construct the f-string node.Without an official grammar, the f-strings were parsed manually. Nowthat we've the specification, that is being used in the LALRPOP to parsethe f-strings.This file includes the logic for parsing string literals and joining theimplicit string concatenation. Now that we don't require parsingf-strings manually a lot of code involving the same is removed.Earlier, there were 2 entry points to this module:* `parse_string`: Used to parse a single string literal* `parse_strings`: Used to parse strings which were implicitlyconcatenatedNow, there are 3 entry points:* `parse_string_literal`: Renamed from `parse_string`* `parse_fstring_middle`: Used to parse a `FStringMiddle` token which isbasically a string literal without the quotes* `concatenate_strings`: Renamed from `parse_strings` but now it takesthe parsed nodes instead. So, we just need to concatenate them into asingle node.> A short primer on `FStringMiddle` token: This includes the portion oftext inside the f-string that's not part of the expression and isn't anopening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the`foo `, `.3f` and ` bar` are `FStringMiddle` token content.***Discussion in the official implementation:python/cpython#102855 (comment)This change in the AST is when unicode strings (prefixed with `u`) andf-strings are used in an implicitly concatenated string value. Forexample,```pythonu"foo" f"{bar}" "baz" " some"```Pre Python 3.12, the kind field would be assigned only if the prefix wason the first string. So, taking the above example, both `"foo"` and`"baz some"` (implicit concatenation) would be given the `u` kind:<details><summary>Pre 3.12 AST:</summary><p>```pythonConstant(value='foo', kind='u'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='baz some', kind='u')```</p></details>But, post Python 3.12, only the string with the `u` prefix will beassigned the value:<details><summary>Pre 3.12 AST:</summary><p>```pythonConstant(value='foo', kind='u'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='baz some')```</p></details>Here are some more iterations around the change:1. `"foo" f"{bar}" u"baz" "no"`<details><summary>Pre 3.12</summary><p>```pythonConstant(value='foo'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='bazno')```</p></details><details><summary>3.12</summary><p>```pythonConstant(value='foo'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='bazno', kind='u')```</p></details>2. `"foo" f"{bar}" "baz" u"no"`<details><summary>Pre 3.12</summary><p>```pythonConstant(value='foo'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='bazno')```</p></details><details><summary>3.12</summary><p>```pythonConstant(value='foo'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='bazno')```</p></details>3. 
`u"foo" f"bar {baz} realy" u"bar" "no"`<details><summary>Pre 3.12</summary><p>```pythonConstant(value='foobar ', kind='u'),FormattedValue(  value=Name(id='baz', ctx=Load()),  conversion=-1),Constant(value=' realybarno', kind='u')```</p></details><details><summary>3.12</summary><p>```pythonConstant(value='foobar ', kind='u'),FormattedValue(  value=Name(id='baz', ctx=Load()),  conversion=-1),Constant(value=' realybarno')```</p></details>With the hand written parser, we were able to provide better errormessages in case of any errors such as the following but now they allare removed and in those cases an "unexpected token" error will bethrown by lalrpop:* A closing delimiter was not opened properly* An opening delimiter was not closed properly* Empty expression not allowedThe "Too many nested expressions in an f-string" was removed and insteadwe can create a lint rule for that.And, "The f-string expression cannot include the given character" wasremoved because f-strings now support those characters which are mainlysame quotes as the outer ones, escape sequences, comments, etc.1. Refactor existing test cases to use `parse_suite` instead of`parse_fstrings` (doesn't exists anymore)2. Additional test cases are added as requiredUpdated the snapshots. The change from `parse_fstrings` to `parse_suite`means that the snapshot would produce the module node instead of just alist of f-string parts. I've manually verified that the parts are stillthe same along with the node ranges.#7263 (comment)fixes:#7043fixes:#6835
dhruvmanila added a commit to astral-sh/ruff that referenced this pull requestSep 26, 2023
This PR adds support for PEP 701 in the parser to use the new tokensemitted by the lexer to construct the f-string node.Without an official grammar, the f-strings were parsed manually. Nowthat we've the specification, that is being used in the LALRPOP to parsethe f-strings.This file includes the logic for parsing string literals and joining theimplicit string concatenation. Now that we don't require parsingf-strings manually a lot of code involving the same is removed.Earlier, there were 2 entry points to this module:* `parse_string`: Used to parse a single string literal* `parse_strings`: Used to parse strings which were implicitlyconcatenatedNow, there are 3 entry points:* `parse_string_literal`: Renamed from `parse_string`* `parse_fstring_middle`: Used to parse a `FStringMiddle` token which isbasically a string literal without the quotes* `concatenate_strings`: Renamed from `parse_strings` but now it takesthe parsed nodes instead. So, we just need to concatenate them into asingle node.> A short primer on `FStringMiddle` token: This includes the portion oftext inside the f-string that's not part of the expression and isn't anopening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the`foo `, `.3f` and ` bar` are `FStringMiddle` token content.***Discussion in the official implementation:python/cpython#102855 (comment)This change in the AST is when unicode strings (prefixed with `u`) andf-strings are used in an implicitly concatenated string value. Forexample,```pythonu"foo" f"{bar}" "baz" " some"```Pre Python 3.12, the kind field would be assigned only if the prefix wason the first string. So, taking the above example, both `"foo"` and`"baz some"` (implicit concatenation) would be given the `u` kind:<details><summary>Pre 3.12 AST:</summary><p>```pythonConstant(value='foo', kind='u'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='baz some', kind='u')```</p></details>But, post Python 3.12, only the string with the `u` prefix will beassigned the value:<details><summary>Pre 3.12 AST:</summary><p>```pythonConstant(value='foo', kind='u'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='baz some')```</p></details>Here are some more iterations around the change:1. `"foo" f"{bar}" u"baz" "no"`<details><summary>Pre 3.12</summary><p>```pythonConstant(value='foo'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='bazno')```</p></details><details><summary>3.12</summary><p>```pythonConstant(value='foo'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='bazno', kind='u')```</p></details>2. `"foo" f"{bar}" "baz" u"no"`<details><summary>Pre 3.12</summary><p>```pythonConstant(value='foo'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='bazno')```</p></details><details><summary>3.12</summary><p>```pythonConstant(value='foo'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='bazno')```</p></details>3. 
`u"foo" f"bar {baz} realy" u"bar" "no"`<details><summary>Pre 3.12</summary><p>```pythonConstant(value='foobar ', kind='u'),FormattedValue(  value=Name(id='baz', ctx=Load()),  conversion=-1),Constant(value=' realybarno', kind='u')```</p></details><details><summary>3.12</summary><p>```pythonConstant(value='foobar ', kind='u'),FormattedValue(  value=Name(id='baz', ctx=Load()),  conversion=-1),Constant(value=' realybarno')```</p></details>With the hand written parser, we were able to provide better errormessages in case of any errors such as the following but now they allare removed and in those cases an "unexpected token" error will bethrown by lalrpop:* A closing delimiter was not opened properly* An opening delimiter was not closed properly* Empty expression not allowedThe "Too many nested expressions in an f-string" was removed and insteadwe can create a lint rule for that.And, "The f-string expression cannot include the given character" wasremoved because f-strings now support those characters which are mainlysame quotes as the outer ones, escape sequences, comments, etc.1. Refactor existing test cases to use `parse_suite` instead of`parse_fstrings` (doesn't exists anymore)2. Additional test cases are added as requiredUpdated the snapshots. The change from `parse_fstrings` to `parse_suite`means that the snapshot would produce the module node instead of just alist of f-string parts. I've manually verified that the parts are stillthe same along with the node ranges.#7263 (comment)fixes:#7043fixes:#6835
dhruvmanila added a commit to astral-sh/ruff that referenced this pull requestSep 27, 2023
This PR adds support for PEP 701 in the parser to use the new tokensemitted by the lexer to construct the f-string node.Without an official grammar, the f-strings were parsed manually. Nowthat we've the specification, that is being used in the LALRPOP to parsethe f-strings.This file includes the logic for parsing string literals and joining theimplicit string concatenation. Now that we don't require parsingf-strings manually a lot of code involving the same is removed.Earlier, there were 2 entry points to this module:* `parse_string`: Used to parse a single string literal* `parse_strings`: Used to parse strings which were implicitlyconcatenatedNow, there are 3 entry points:* `parse_string_literal`: Renamed from `parse_string`* `parse_fstring_middle`: Used to parse a `FStringMiddle` token which isbasically a string literal without the quotes* `concatenate_strings`: Renamed from `parse_strings` but now it takesthe parsed nodes instead. So, we just need to concatenate them into asingle node.> A short primer on `FStringMiddle` token: This includes the portion oftext inside the f-string that's not part of the expression and isn't anopening or closing brace. For example, in `f"foo {bar:.3f{x}} bar"`, the`foo `, `.3f` and ` bar` are `FStringMiddle` token content.***Discussion in the official implementation:python/cpython#102855 (comment)This change in the AST is when unicode strings (prefixed with `u`) andf-strings are used in an implicitly concatenated string value. Forexample,```pythonu"foo" f"{bar}" "baz" " some"```Pre Python 3.12, the kind field would be assigned only if the prefix wason the first string. So, taking the above example, both `"foo"` and`"baz some"` (implicit concatenation) would be given the `u` kind:<details><summary>Pre 3.12 AST:</summary><p>```pythonConstant(value='foo', kind='u'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='baz some', kind='u')```</p></details>But, post Python 3.12, only the string with the `u` prefix will beassigned the value:<details><summary>Pre 3.12 AST:</summary><p>```pythonConstant(value='foo', kind='u'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='baz some')```</p></details>Here are some more iterations around the change:1. `"foo" f"{bar}" u"baz" "no"`<details><summary>Pre 3.12</summary><p>```pythonConstant(value='foo'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='bazno')```</p></details><details><summary>3.12</summary><p>```pythonConstant(value='foo'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='bazno', kind='u')```</p></details>2. `"foo" f"{bar}" "baz" u"no"`<details><summary>Pre 3.12</summary><p>```pythonConstant(value='foo'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='bazno')```</p></details><details><summary>3.12</summary><p>```pythonConstant(value='foo'),FormattedValue(  value=Name(id='bar', ctx=Load()),  conversion=-1),Constant(value='bazno')```</p></details>3. 
`u"foo" f"bar {baz} realy" u"bar" "no"`<details><summary>Pre 3.12</summary><p>```pythonConstant(value='foobar ', kind='u'),FormattedValue(  value=Name(id='baz', ctx=Load()),  conversion=-1),Constant(value=' realybarno', kind='u')```</p></details><details><summary>3.12</summary><p>```pythonConstant(value='foobar ', kind='u'),FormattedValue(  value=Name(id='baz', ctx=Load()),  conversion=-1),Constant(value=' realybarno')```</p></details>With the hand written parser, we were able to provide better errormessages in case of any errors such as the following but now they allare removed and in those cases an "unexpected token" error will bethrown by lalrpop:* A closing delimiter was not opened properly* An opening delimiter was not closed properly* Empty expression not allowedThe "Too many nested expressions in an f-string" was removed and insteadwe can create a lint rule for that.And, "The f-string expression cannot include the given character" wasremoved because f-strings now support those characters which are mainlysame quotes as the outer ones, escape sequences, comments, etc.1. Refactor existing test cases to use `parse_suite` instead of`parse_fstrings` (doesn't exists anymore)2. Additional test cases are added as requiredUpdated the snapshots. The change from `parse_fstrings` to `parse_suite`means that the snapshot would produce the module node instead of just alist of f-string parts. I've manually verified that the parts are stillthe same along with the node ranges.#7263 (comment)fixes:#7043fixes:#6835
dhruvmanila added a commit to astral-sh/ruff that referenced this pull request Sep 28, 2023
dhruvmanila added a commit to astral-sh/ruff that referenced this pull request Sep 29, 2023
dhruvmanila added a commit to astral-sh/ruff that referenced this pull request Sep 29, 2023
Reviewers

@sunmy2019 requested changes

@Eclips4 left review comments

@lysnikolaou approved these changes

@isidentical approved these changes

Assignees
No one assigned
Labels
None yet
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

6 participants
@pablogsal @sunmy2019 @bedevere-bot @isidentical @lysnikolaou @Eclips4
