difflib.py Differ.compare is too slow [for degenerate cases] #119105

Closed
Labels: performance (Performance or resource usage), type-feature (A feature request or enhancement)
@pulkin

Description


Bug report

Bug description:

```python
import difflib

a = ["0123456789\n"] * 1_000
b = ["01234a56789\n"] * 1_000

# Very slow and will probably hit the maximum recursion depth:
list(difflib.Differ().compare(a, b))
```

The case is pathological in the sense that many pairs of lines have the same exact diff / ratio.
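To illustrate (a hypothetical check, not part of the original report): every cross pair of lines in the repro above is a copy of the same two strings, so every candidate pair scores exactly the same `SequenceMatcher` ratio and there is no unique "best" pair to anchor a split on:

```python
import difflib

a = "0123456789\n"
b = "01234a56789\n"

# All line pairs in the repro reduce to this single comparison,
# so every (a_i, b_j) pair has the identical ratio.
r = difflib.SequenceMatcher(None, a, b).ratio()
```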

The issue is that, in the current implementation, `_fancy_replace` takes the first pair of lines (among those with the same ratio) as a split point and then calls itself recursively for all lines starting at 2, then 3, 4, etc. This repeats 1_000 times, resulting in a massive recursion depth and O(N^3) complexity scaling.

For an average random case it should split anywhere in the range of 1_000, with a complexity scaling of O(N^2 log N).

I personally encountered this when diffing CSV files where one of the files has a column added, which, apparently, results in identical ratios for every line in the file.

Proposal

Fixing this is not so hard by adding some heuristics (WIP): pulkin@31e1ed0

The idea is very straightforward: while doing the `_fancy_replace` magic, if you see many diffs with the same exact ratio, pick the one closest to the middle of the chunks rather than the first one (which can be the worst possible choice). This is done by promoting the ratios of those pairs of lines that are closer to the middle of the chunks.

The `_drag_to_center` function turns a line number into a weight added to the ratio (twice: once for a and once for b). The weight is zero at both ends of a chunk and maximal in the middle (a quadratic polynomial was chosen for simplicity). The magnitude of the weight, `_gravity`, is small enough to only affect ratios that are exactly the same: it relies on the assumption that a line has fewer than roughly 500k symbols, so that the steps in the `ratio` are greater than 1e-6. If this assumption fails, some diffs may become different (not necessarily worse).

Performance impact for non-pathological cases is probably minimal.

CPython versions tested on:

3.9, 3.12

Operating systems tested on:

Linux
