difflib.py Differ.compare is too slow [for degenerate cases] #119105

Closed
Labels: performance (Performance or resource usage), type-feature (A feature request or enhancement)
@pulkin

Description


Bug report

Bug description:

```python
import difflib

a = ["0123456789\n"] * 1_000
b = ["01234a56789\n"] * 1_000

# Very slow and will probably hit the maximum recursion depth:
list(difflib.Differ().compare(a, b))
```

The case is pathological in the sense that many pairs of lines have the same exact diff / ratio.
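To illustrate (a hypothetical check, not part of the original report): every cross pair of lines in the repro above is a copy of the same two strings, so every candidate pair scores exactly the same `SequenceMatcher` ratio and there is no unique "best" pair to anchor a split on:

```python
import difflib

a = "0123456789\n"
b = "01234a56789\n"

# All line pairs in the repro reduce to this single comparison,
# so every (a_i, b_j) pair has the identical ratio.
r = difflib.SequenceMatcher(None, a, b).ratio()
```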

The issue is that, in the current implementation, `_fancy_replace` takes the first pair of lines (among those with the same ratio) as a split point and then calls itself recursively for all lines starting at 2, then 3, 4, etc. This repeats 1_000 times, resulting in a massive recursion depth and O(N^3) complexity scaling.

For an average random case it should split anywhere in the range of 1_000, with a complexity scaling of O(N^2 log N).

I personally encountered this when diffing CSV files where one of the files has a column added, which, apparently, results in identical ratios for every line in the file.

Proposal

Fixing this is not so hard by adding some heuristics (WIP): pulkin@31e1ed0

The idea is very straightforward: while doing the `_fancy_replace` magic, if you see many diffs with the same exact ratio, pick the one closest to the middle of the chunks rather than the first one (which can be the worst possible choice). This is done by promoting the ratios of those pairs of lines that are closer to the middle of the chunks.

The `_drag_to_center` function turns a line number into a weight added to the ratio (twice: once for a and once for b). The weight is zero at both ends of a chunk and maximal in the middle (a quadratic polynomial was chosen for simplicity). The magnitude of the weight, `_gravity`, is small enough to only affect ratios that are exactly the same: it relies on the assumption that a line has fewer than roughly 500k symbols, so that the steps in the `ratio` are greater than 1e-6. If this assumption fails, some diffs may become different (not necessarily worse).

Performance impact for non-pathological cases is probably minimal.

CPython versions tested on:

3.9, 3.12

Operating systems tested on:

Linux
