Make parsing of text be non-quadratic. #579
In Python, appending strings is not guaranteed to be constant-time, since they are documented to be immutable. In some corner cases, CPython is able to make these operations constant-time, but reaching into ETree objects is not such a case.

This leads to parse times being quadratic in the size of the text in the input in pathological cases where parsing outputs a large number of adjacent text nodes which must be combined (e.g. HTML-escaped values). Specifically, we expect doubling the size of the input to result in approximately doubling the time to parse; instead, we observe quadratic behavior:

```
In [1]: import html5lib

In [2]: %timeit -n1 -r5 html5lib.parse("<" * 200000)
2.99 s ± 269 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

In [3]: %timeit -n1 -r5 html5lib.parse("<" * 400000)
6.7 s ± 242 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

In [4]: %timeit -n1 -r5 html5lib.parse("<" * 800000)
19.5 s ± 1.48 s per loop (mean ± std. dev. of 5 runs, 1 loop each)
```

Switch from appending to the internal `str`, to appending text to an array of text chunks, as appends can be done in constant time. Using `bytearray` is a similar solution, but benchmarks slightly worse because the strings must be encoded before being appended.

This improves parsing of text documents noticeably:

```
In [1]: import html5lib

In [2]: %timeit -n1 -r5 html5lib.parse("<" * 200000)
2.3 s ± 373 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

In [3]: %timeit -n1 -r5 html5lib.parse("<" * 400000)
3.85 s ± 29.7 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

In [4]: %timeit -n1 -r5 html5lib.parse("<" * 800000)
8.04 s ± 317 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
```
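To make the doubling argument concrete, here is a small sketch that computes the ratio between consecutive timings quoted in the description. The helper name `doubling_ratios` is made up for illustration. As a rough guide, a ratio near 2 per doubling of input is consistent with linear growth, while a ratio approaching 4 is consistent with quadratic growth.

```python
def doubling_ratios(timings):
    """Return later/earlier for each pair of consecutive timings.

    Assumes each timing was measured on an input twice as large
    as the one before it.
    """
    return [later / earlier for earlier, later in zip(timings, timings[1:])]


# Mean timings in seconds, copied from the %timeit runs in the description.
before = [2.99, 6.7, 19.5]
after = [2.3, 3.85, 8.04]

print([round(r, 2) for r in doubling_ratios(before)])  # [2.24, 2.91]
print([round(r, 2) for r in doubling_ratios(after)])   # [1.67, 2.09]
```

With the change applied, the ratios stay close to 2, whereas before the change they grow as the input gets larger.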
andersk commented Feb 28, 2024
This solution can’t work, as it’s a breaking change to the public API.

Before:

```
>>> html5lib.parse("hello")[1].text
'hello'
```

After:

```
>>> html5lib.parse("hello")[1].text
<html5lib.treebuilders.etree.TextBuffer object at 0x7ff2e31268d0>
```
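For illustration only, one way a chunk buffer could stay invisible to callers is to keep the chunks private and join them when `text` is read. The class below is hypothetical: `BufferedTextNode` and `append_text` are not html5lib or ElementTree names, and the sketch does not address whether the same trick is feasible on real ETree elements.

```python
class BufferedTextNode:
    """Hypothetical node type, not part of html5lib or ElementTree."""

    def __init__(self):
        # Text is collected as a list of chunks; list.append is
        # amortized constant-time.
        self._chunks = []

    def append_text(self, data):
        self._chunks.append(data)

    @property
    def text(self):
        # Join on read and cache the result, so callers always see a
        # plain str (or None when there is no text, as ElementTree does).
        if len(self._chunks) > 1:
            self._chunks = ["".join(self._chunks)]
        return self._chunks[0] if self._chunks else None


node = BufferedTextNode()
for piece in ["hel", "lo"]:
    node.append_text(piece)
print(repr(node.text))  # 'hello'
```

The join costs time proportional to the total text length, but it runs once per read rather than once per appended chunk.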
lopuhin commented Mar 10, 2025
From what I can see, there are also plenty of operations in the html5lib-python/html5lib/_tokenizer.py Line 215 in fd4f032
andersk commented Mar 10, 2025 • edited
@lopuhin That line is slow even in CPython. In CPython, appending a character is only O(1) if the string is a local variable inside a function with no other references. It is O(n) for an object property

```python
import timeit


def linear_local(n):
    s = ""
    for i in range(n):
        s += "a"  # fast


def quadratic_object(n):
    class C:
        pass

    c = C()
    c.s = ""
    for i in range(n):
        c.s += "a"  # slow


def quadratic_array(n):
    a = [""]
    for i in range(n):
        a[0] += "a"  # slow


def quadratic_global(n):
    global s
    s = ""
    for i in range(n):
        s += "a"  # slow


def quadratic_nonlocal(n):
    s = ""

    def inner():
        nonlocal s
        for i in range(n):
            s += "a"  # slow

    inner()


for f in [linear_local, quadratic_object, quadratic_array, quadratic_global, quadratic_nonlocal]:
    for n in [100000, 200000, 400000, 800000]:
        print(f.__name__, n, timeit.timeit(lambda: f(n), number=1))
```

Output with CPython 3.13.2:
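As a counterpart to the object-property case in the benchmark, the sketch below keeps a list of chunks on the attribute and joins once at the end. The function name `linear_object_chunks` is made up for this example, and no timings are quoted because none were measured here.

```python
import timeit


def linear_object_chunks(n):
    class C:
        pass

    c = C()
    c.chunks = []
    for i in range(n):
        # list.append does not copy the existing contents, unlike
        # `c.s += "a"`, which may rebuild the whole string each time.
        c.chunks.append("a")
    # A single join builds the final string in one pass.
    c.s = "".join(c.chunks)
    return c.s


for n in [100000, 200000, 400000, 800000]:
    print("linear_object_chunks", n,
          timeit.timeit(lambda: linear_object_chunks(n), number=1))
```

This suggests the slowdown comes from repeated string concatenation on the attribute rather than from storing data on an object as such.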
lopuhin commented Mar 10, 2025
Good point, thank you! Indeed, I can reproduce the slowness on a particular HTML document under CPython as well, although the difference is smaller than under GraalPy.
Old flamegraph:

New flamegraph:
