In Python, appending strings is not guaranteed to be constant-time, since they are documented to be immutable. In some corner cases, CPython is able to make these operations constant-time, but reaching into ETree objects is not such a case.

This leads to parse times being quadratic in the size of the text in the input in pathological cases where parsing outputs a large number of adjacent text nodes which must be combined (e.g. HTML-escaped values). Specifically, we expect doubling the size of the input to result in approximately doubling the time to parse; instead, we observe quadratic behavior:

    In [1]: import html5lib

    In [2]: %timeit -n1 -r5 html5lib.parse("&lt;" * 200000)
    2.99 s ± 269 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

    In [3]: %timeit -n1 -r5 html5lib.parse("&lt;" * 400000)
    6.7 s ± 242 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

    In [4]: %timeit -n1 -r5 html5lib.parse("&lt;" * 800000)
    19.5 s ± 1.48 s per loop (mean ± std. dev. of 5 runs, 1 loop each)

Switch from appending to the internal `str` to appending text to an array of text chunks, as appends can be done in constant time. Using `bytearray` is a similar solution, but benchmarks slightly worse because the strings must be encoded before being appended.
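As a rough illustration of the chunk-list idea, the sketch below accumulates text in a Python list and joins it only when the full string is needed. `ChunkBuffer` is a hypothetical name used for this example; it is not the class added by this PR and makes no claim about how the PR wires the buffer into the tree builder.

```python
class ChunkBuffer:
    """Hypothetical text accumulator: cheap appends, one join on read."""

    def __init__(self, initial=""):
        # Store pieces separately instead of growing one immutable str.
        self._chunks = [initial] if initial else []

    def append(self, text):
        # list.append is amortized constant time and never copies earlier chunks.
        self._chunks.append(text)

    def __str__(self):
        # Collapse to a single chunk so repeated reads do not re-join everything.
        if len(self._chunks) > 1:
            self._chunks = ["".join(self._chunks)]
        return self._chunks[0] if self._chunks else ""


buf = ChunkBuffer()
for _ in range(5):
    buf.append("<")
print(str(buf))  # <<<<<
```

The total work is proportional to the combined length of the chunks, because each character is copied once by `"".join` rather than once per append.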

This improves parsing of text documents noticeably:

    In [1]: import html5lib

    In [2]: %timeit -n1 -r5 html5lib.parse("&lt;" * 200000)
    2.3 s ± 373 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

    In [3]: %timeit -n1 -r5 html5lib.parse("&lt;" * 400000)
    3.85 s ± 29.7 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)

    In [4]: %timeit -n1 -r5 html5lib.parse("&lt;" * 800000)
    8.04 s ± 317 ms per loop (mean ± std. dev. of 5 runs, 1 loop each)
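One way to read the two sets of timings is to estimate the growth exponent between consecutive doublings, log2(t(2n) / t(n)): a value near 1 suggests linear scaling, a value near 2 quadratic. The sketch below uses only the mean times reported above; `growth_exponents` is a helper name made up for this example, and three data points are too few to be more than indicative.

```python
import math

# Mean timings in seconds, copied from the %timeit output above
# for inputs of 200000, 400000 and 800000 repetitions of "&lt;".
before = [2.99, 6.7, 19.5]
after = [2.3, 3.85, 8.04]


def growth_exponents(times):
    """Return log2(t(2n) / t(n)) for each consecutive pair of timings."""
    return [math.log2(b / a) for a, b in zip(times, times[1:])]


for label, times in (("before", before), ("after", after)):
    exps = ", ".join(f"{e:.2f}" for e in growth_exponents(times))
    print(f"{label}: {exps}")
```

On these numbers the exponent before the change climbs to roughly 1.5 at the largest size, while after the change it stays close to 1.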

Old flamegraph:

New flamegraph:

Make parsing of text be non-quadratic.

075cb7c


andersk commented Feb 28, 2024

This solution can’t work, as it’s a breaking change to the public API. Before:

    >>> html5lib.parse("hello")[1].text
    'hello'

After:

    >>> html5lib.parse("hello")[1].text
    <html5lib.treebuilders.etree.TextBuffer object at 0x7ff2e31268d0>
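To make the compatibility concern concrete, here is a minimal sketch of how a node could buffer chunks internally while still exposing `.text` as a plain `str`. `Node` and `append_text` are hypothetical names for illustration only; this is not html5lib's or ElementTree's element class, and whether something similar could be wired into the ETree tree builder is an open assumption.

```python
class Node:
    """Hypothetical tree node that buffers text but exposes a str."""

    def __init__(self):
        self._chunks = []

    def append_text(self, data):
        # Amortized O(1): earlier text is not copied on each append.
        self._chunks.append(data)

    @property
    def text(self):
        # Join lazily and cache the result so callers always see a str
        # (or None when no text was added, mirroring ElementTree's convention).
        if len(self._chunks) > 1:
            self._chunks = ["".join(self._chunks)]
        return self._chunks[0] if self._chunks else None


node = Node()
node.append_text("hel")
node.append_text("lo")
print(repr(node.text))  # 'hello'
```

Callers reading `node.text` never see the buffer object, which is the property the "Before" behavior above relies on.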

lopuhin commented Mar 10, 2025

From what I can see, there are also plenty of operations in `_tokenizer.py` which assume that it's possible to append a character to a string in O(1), which is often the case in CPython, but not the case for other implementations, where having a pure-Python parser can be especially valuable. E.g. here:

html5lib-python/html5lib/_tokenizer.py

Line 215 in fd4f032

    self.currentToken["data"][-1][1] += output

andersk commented Mar 10, 2025 (edited)

@lopuhin That line is slow even in CPython.

In CPython, appending a character is only O(1) if the string is a local variable inside a function with no other references. It is O(n) for an object property `obj.prop` or an array element `arr[i]` (even if the object or array itself is a local variable), or for a `global` or `nonlocal` variable. In all of those cases, the string has a refcount of at least 2, which prevents it from being safely mutated in place and forces it to be copied.

    import timeit


    def linear_local(n):
        s = ""
        for i in range(n):
            s += "a"  # fast


    def quadratic_object(n):
        class C:
            pass

        c = C()
        c.s = ""
        for i in range(n):
            c.s += "a"  # slow


    def quadratic_array(n):
        a = [""]
        for i in range(n):
            a[0] += "a"  # slow


    def quadratic_global(n):
        global s
        s = ""
        for i in range(n):
            s += "a"  # slow


    def quadratic_nonlocal(n):
        s = ""

        def inner():
            nonlocal s
            for i in range(n):
                s += "a"  # slow

        inner()


    for f in [linear_local, quadratic_object, quadratic_array, quadratic_global, quadratic_nonlocal]:
        for n in [100000, 200000, 400000, 800000]:
            print(f.__name__, n, timeit.timeit(lambda: f(n), number=1))

Output with CPython 3.13.2:

    linear_local 100000 0.006017955995048396
    linear_local 200000 0.013165883996407501
    linear_local 400000 0.027179232012713328
    linear_local 800000 0.052238386997487396
    quadratic_object 100000 0.11766406099195592
    quadratic_object 200000 0.5580674420052674
    quadratic_object 400000 2.6726826040103333
    quadratic_object 800000 12.140160495007876
    quadratic_array 100000 0.12400677500409074
    quadratic_array 200000 0.5755963019910268
    quadratic_array 400000 2.642135899004643
    quadratic_array 800000 11.990410245998646
    quadratic_global 100000 0.12772354800836183
    quadratic_global 200000 0.5731496340013109
    quadratic_global 400000 2.738810390001163
    quadratic_global 800000 12.154955972000607
    quadratic_nonlocal 100000 0.1292998229910154
    quadratic_nonlocal 200000 0.5955325639952207
    quadratic_nonlocal 400000 2.6306100980000338
    quadratic_nonlocal 800000 11.95639012400352

lopuhin commented Mar 10, 2025

Good point, thank you! Indeed, I can reproduce the slowness on a particular HTML document under CPython as well, although the difference is smaller than under GraalPy.

Make parsing of text be non-quadratic. #579

alexmv commented Feb 27, 2024 (edited)
