gh-109638: Fix for significant backtracking in csv.Sniffer #109639


Open
sg3-141-592 wants to merge 4 commits into python:main from sg3-141-592:main

@sg3-141-592 commented Sep 21, 2023 (edited)

As described in #109638, it is possible to trigger significant backtracking in csv.Sniffer() inside the doublequote-checking regex. This change introduces a zero-length lookahead assertion to reduce the amount of backtracking.

In testing, this yields a significant improvement:

    import csv
    import time

    for NUM_ITERATIONS in range(10, 70, 10):
        test_str = '"",' * NUM_ITERATIONS + '"' * NUM_ITERATIONS + '0' + '"' * NUM_ITERATIONS + '0'
        t0 = time.time()
        dialect = csv.Sniffer().sniff(test_str)
        t1 = time.time()
        print(f"{NUM_ITERATIONS},{t1-t0}")
NUM_ITERATIONS | Before (s)  | After Regex Fix (s)
10             | 0.002374649 | 0.000935555
20             | 0.051596165 | 0.003250599
30             | 0.371201515 | 0.017635345
40             | 1.477169752 | 0.054927588
50             | 4.845417738 | 0.120140076
60             | 11.52993703 | 0.252950668
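
As a rough reading of the timings reported above, the following sketch computes the speedup per row and a crude growth exponent between the 30 and 60 iteration rows. The helper names (`speedup`, `growth_exponent`) are illustrative and not part of the PR, and a two-point estimate from single timing runs is noisy, so treat the exponents as indicative only.

```python
import math

# (NUM_ITERATIONS, before_seconds, after_seconds), copied from the reported timings.
timings = [
    (10, 0.002374649, 0.000935555),
    (20, 0.051596165, 0.003250599),
    (30, 0.371201515, 0.017635345),
    (40, 1.477169752, 0.054927588),
    (50, 4.845417738, 0.120140076),
    (60, 11.52993703, 0.252950668),
]


def speedup(before, after):
    """How many times faster the patched regex ran."""
    return before / after


def growth_exponent(n_lo, t_lo, n_hi, t_hi):
    """Estimate k in t ~ n**k from two (n, t) measurements."""
    return math.log(t_hi / t_lo) / math.log(n_hi / n_lo)


speedups = {n: speedup(before, after) for n, before, after in timings}

# Compare the 30 and 60 iteration rows (a doubling of the input size).
(n_lo, before_lo, after_lo), (n_hi, before_hi, after_hi) = timings[2], timings[5]
before_k = growth_exponent(n_lo, before_lo, n_hi, before_hi)
after_k = growth_exponent(n_lo, after_lo, n_hi, after_hi)

for n, s in speedups.items():
    print(f"{n}: {s:.1f}x faster")
print(f"estimated growth exponent before: {before_k:.2f}, after: {after_k:.2f}")
```

On these numbers the speedup widens as the input grows, and the estimated exponent drops by about one, which is consistent with the fix removing one source of backtracking rather than all of it.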


@ghost

This comment was marked as resolved.

@bedevere-app

This comment was marked as resolved.

Lib/csv.py Outdated
@@ -270,8 +270,9 @@ def _guess_quote_and_delimiter(self, data, delimiters):

# if we see an extra quote between delimiters, we've got a
# double quoted format
# in future Python versions this zero width look-ahead assert can be replaced with atomic groups
Member

Could you please explain this comment?

Author

Sure. The zero-width lookahead assertion in this regex change could instead be written as an atomic group, which is cleaner and more concise.

    # Current change
    (,|^)\W*"(?=(?P<zero>[^,|"\n]*))(?P=zero)"[^,|\n]*"\W*(,|$)

    # Atomic Group
    (,|^)\W*"(?>[^,|"\n]*)"[^,|\n]*"\W*(,|$)

But atomic groups are only supported from Python 3.11 onwards, so I avoided using them here.
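
To illustrate that the two forms are meant to accept the same inputs, here is a minimal sketch that runs both patterns (copied from this comment) over a few sample strings. The `matches` helper, the sample strings, and the use of `re.MULTILINE` are assumptions for illustration, not taken from Lib/csv.py; the atomic-group pattern is only compiled on Python 3.11 or later.

```python
import re
import sys

# Lookahead-plus-backreference emulation of an atomic group (works on older Pythons).
EMULATED = r'(,|^)\W*"(?=(?P<zero>[^,|"\n]*))(?P=zero)"[^,|\n]*"\W*(,|$)'
# Native atomic group; this syntax requires Python 3.11+.
ATOMIC = r'(,|^)\W*"(?>[^,|"\n]*)"[^,|\n]*"\W*(,|$)'

# Hypothetical sample inputs: two with an extra quote inside a field, two without.
samples = ['"a""b",c', 'x,"he said ""hi""",y', 'plain,text', '"a","b"']


def matches(pattern, text):
    """Return True if the pattern is found anywhere in text."""
    return re.search(pattern, text, re.MULTILINE) is not None


emulated_results = [matches(EMULATED, s) for s in samples]
print("emulated:", emulated_results)

if sys.version_info >= (3, 11):
    atomic_results = [matches(ATOMIC, s) for s in samples]
    print("atomic:  ", atomic_results)
    print("agree:   ", emulated_results == atomic_results)
```

The emulation works because a lookahead, once matched, is never re-entered on backtracking, so the captured run is fixed and the backreference simply consumes it.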

Member

Thanks Sean. This PR (if merged) would be part of Python 3.13, so let's use the better atomic group method.

A

Author

Sure, I've switched us to the simpler atomic group setup. Performance is identical to the previous fix.

@aterrel left a comment

This looks like a good quick fix for the problem.

Ultimately though, these regexes are hard to read and cause a few problems with lists and other items. I think we should be thinking about how to replace the sniffer with one that has higher accuracy (see https://github.com/ws-garcia/CSVsniffer, which shows only 67.54% accuracy). I've posted to discuss.python.org on this topic here: https://discuss.python.org/t/rewrite-csv-sniffer

Reviewers

@AA-Turner left review comments

@aterrel approved these changes


3 participants
@sg3-141-592 @aterrel @AA-Turner
