Optimize textwrap.indent() #107369

Closed

bedevere-bot added the awaiting core review label

methane added the performance (Performance or resource usage) and stdlib (Python modules in the Lib dir) labels

Add what's new entry

6ee731c

eendebakpt approved these changes

Contributor

eendebakpt left a comment

Looks good! Using str.split for the predicate instead of line.strip might change something for input that is not str, but I think this is ok.
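A minimal sketch of the concern raised here, assuming the comment refers to passing an unbound str method (str.strip is used below for illustration) in place of a function that calls line.strip(). The names bound_predicate and unbound_predicate are hypothetical and do not appear in the PR.

```python
def bound_predicate(line):
    # Duck-typed: works for any object that has a .strip() method.
    return line.strip()


# An unbound method of str: it only accepts str (or str subclass) instances.
unbound_predicate = str.strip

# Both behave the same for ordinary str input.
print(bool(bound_predicate("  x\n")), bool(unbound_predicate("  x\n")))

# bytes has its own .strip(), so the duck-typed version still works...
print(bool(bound_predicate(b"  x\n")))

# ...while the unbound str method rejects non-str input.
try:
    unbound_predicate(b"  x\n")
except TypeError as exc:
    print("TypeError:", exc)
```

This only illustrates the predicate in isolation; whether textwrap.indent() as a whole ever supported non-str input is a separate question that this sketch does not answer.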

serhiy-storchaka reviewed

Member

serhiy-storchaka left a comment

lstrip is faster for non-indented lines.

I wonder whether the following variants can be faster for some input, and for how wide a category of input.

    def predicate(line):
        return line and (not line[0].isspace() or line.lstrip())

    predicate = re.compile(r'\S').search
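As a quick sanity check, the two variants above can be compared against bool(line.lstrip()) on a handful of lines. This is only a sketch: the names first_char_predicate, regex_predicate, and the sample list are made up for illustration, and the samples are limited to ASCII whitespace.

```python
import re


def first_char_predicate(line):
    # Short-circuits on the first character for non-indented lines and
    # falls back to lstrip() only when the line starts with whitespace.
    return line and (not line[0].isspace() or line.lstrip())


# Truthy (a match object) if the line contains any non-whitespace character.
regex_predicate = re.compile(r"\S").search

samples = ["", "\n", "   \n", "\t\n", "foo\n", "    foo\n", "foo", " x"]

for line in samples:
    expected = bool(line.lstrip())
    assert bool(first_char_predicate(line)) == expected
    assert bool(regex_predicate(line)) == expected

print("all variants agree on", len(samples), "sample lines")
```

The variants return different objects (a string, a bool, or a match object), so only their truthiness is compared.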

Member, Author

methane commented Jul 28, 2023 (edited)

_has_nonspace = re.compile(r'\S').search as a global, with predicate = _has_nonspace: 3.5ms
str.rstrip: 1.95ms
str.lstrip: 2.03ms
lambda x: not x.isspace(): 2.07ms

Since we use splitlines(keepends=True), we can use just not x.isspace(). (It is guaranteed that no line is empty: "".splitlines(keepends=True) == [] and "foo\n".splitlines(True) == ['foo\n'].)
But it is a bit tricky and has a relatively high cognitive load.
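The guarantee can be checked directly. The sample text below is made up for illustration; the point is that every element returned by splitlines(keepends=True) keeps its line ending, so none of them can be the empty string.

```python
text = "foo\n\n  \nbar"
lines = text.splitlines(keepends=True)
print(lines)  # ['foo\n', '\n', '  \n', 'bar']

# A blank line in the input shows up as '\n', not as ''.
assert all(lines)

# An empty input yields no lines at all rather than a single empty line.
assert "".splitlines(keepends=True) == []
assert "foo\n".splitlines(True) == ["foo\n"]
```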

In the case of unicodeobject.c, rstrip is a bit faster. But that may be because most lines are already indented.

So I chose str.lstrip here, as Serhiy suggested.

Use lstrip instead of strip

fad98a2

Member

serhiy-storchaka commented Jul 28, 2023

Now that you mention it, I can see that using isspace() is the most obvious way to do this. Why did I not see it earlier?

We want to test whether the line has any non-space character. bool(line.strip()) is actually a tricky way: we strip the spaces from the line, and if the rest is not an empty string, then the original line has non-space characters too. not line.isspace() is a straightforward way: it asks the opposite question (does the line contain only space characters?) and negates the result.

Algorithmically, isspace() looks preferable, because it does not create a string. But in practice it may not matter in common cases. Did you compare the variants with different inputs? For example, Misc/NEWS.d/3.8.0a1.rst may show a very different result.

eendebakpt reviewed

With 4c6a46a and https://gist.github.com/methane/5c6153c564d9508199a81c48d33161eb

Lib/textwrap.py (outdated, resolved)

avoid temporary tuple.

4c6a46a

Member, Author

methane commented Jul 29, 2023

Now that you mention it, I can see that using isspace() is the most obvious way to do this. Why did I not see it earlier?

Because "".isspace() is False. We need to guarantee that "" is not used here.
x and not x.isspace() would be a bit more obvious, but a little slower.
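The edge case can be seen by comparing the three predicates side by side. The function names below are hypothetical; this is a sketch of the behaviour, not code from the PR.

```python
def by_strip(x):
    # Builds a stripped copy and tests whether anything is left.
    return bool(x.strip())


def by_isspace(x):
    # No new string is created, but "" is treated as "has content".
    return not x.isspace()


def by_guarded_isspace(x):
    # The explicit emptiness check restores agreement with by_strip.
    return bool(x and not x.isspace())


for line in ["", "\n", "   \n", "foo\n", "  foo\n"]:
    print(repr(line), by_strip(line), by_isspace(line), by_guarded_isspace(line))

# The only disagreement is the empty string.
assert by_strip("") is False
assert by_isspace("") is True
assert by_guarded_isspace("") is False
```

Because text.splitlines(True) never yields "", the unguarded form is safe inside textwrap.indent(), which is the point made above.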

Algorithmically, isspace() looks preferable, because it does not create a string. But in practice it may not matter in common cases. Did you compare the variants with different inputs? For example, Misc/NEWS.d/3.8.0a1.rst may show a very different result.

lstrip() is slow when every line has a long indent. But Misc/NEWS.d/3.8.0a1.rst has almost no indents.

    > ./python.exe bench_indent.py Misc/NEWS.d/3.8.0a1.rst
    filename='Misc/NEWS.d/3.8.0a1.rst' 8978 lines.
                       lstrip: 0.736msec
              not x.isspace(): 0.877msec
        x and not x.isspace(): 0.929msec
    > ./python.exe bench_indent.py Objects/unicodeobject.c
    filename='Objects/unicodeobject.c' 15332 lines.
                       lstrip: 1.812msec
              not x.isspace(): 1.877msec
        x and not x.isspace(): 1.970msec

If I add text = textwrap.indent(text, " "*32) before the benchmark:

    > ./python.exe bench_indent.py Objects/unicodeobject.c
    filename='Objects/unicodeobject.c' 15332 lines.
                       lstrip: 2.259msec
              not x.isspace(): 2.356msec
        x and not x.isspace(): 2.437msec
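For readers who want to try a comparison like the runs above without the linked gist, here is a rough sketch using timeit. It is not the author's bench_indent.py: the synthetic input, the helper names make_text and bench, and the repeat counts are all assumptions, and absolute timings will differ by machine and Python build.

```python
import textwrap
import timeit


def make_text(n_lines=2000, indent_width=0):
    # Synthetic input: every fifth line is blank, the rest share one indent.
    body = "some text that is long enough to look like a source line\n"
    pad = " " * indent_width
    return "".join(pad + body if i % 5 else "\n" for i in range(n_lines))


PREDICATES = {
    "lstrip": str.lstrip,
    "not x.isspace()": lambda x: not x.isspace(),
    "x and not x.isspace()": lambda x: x and not x.isspace(),
}


def bench(text, repeat=5, number=20):
    # Returns the best time per indent() call in milliseconds for each predicate.
    results = {}
    for name, pred in PREDICATES.items():
        timer = timeit.Timer(lambda: textwrap.indent(text, "    ", pred))
        best = min(timer.repeat(repeat=repeat, number=number)) / number
        results[name] = best * 1e3
    return results


if __name__ == "__main__":
    for width in (0, 32):
        print(f"indent_width={width}")
        for name, msec in bench(make_text(indent_width=width)).items():
            print(f"{name:>25}: {msec:.3f}msec")
```

Running with indent_width=32 mimics the pre-indented case, where lstrip() has more leading whitespace to skip on every line.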

Member, Author

methane commented Jul 29, 2023

To maximize performance, we can stop using a lambda by handling predicate is None with its own loop:

    if predicate is None:
        for line in text.splitlines(True):
            if not line.isspace():
                prefixed_lines.append(prefix)
            prefixed_lines.append(line)
    else:
        for line in text.splitlines(True):
            if predicate(line):
                prefixed_lines.append(prefix)
            prefixed_lines.append(line)

    filename='Objects/unicodeobject.c' 15332 lines.
                         None: 1.604msec
                       lstrip: 1.826msec
              not x.isspace(): 1.883msec
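The fragment above can be wrapped into a standalone function to check that it behaves like the standard library. The wrapper name indent_sketch is hypothetical, and the surrounding lines (initialising prefixed_lines and joining the result) are assumptions about the enclosing function, not a copy of Lib/textwrap.py.

```python
import textwrap


def indent_sketch(text, prefix, predicate=None):
    prefixed_lines = []
    if predicate is None:
        # splitlines(True) keeps line endings, so no line is ever "" and
        # "not line.isspace()" is enough to detect non-blank lines.
        for line in text.splitlines(True):
            if not line.isspace():
                prefixed_lines.append(prefix)
            prefixed_lines.append(line)
    else:
        for line in text.splitlines(True):
            if predicate(line):
                prefixed_lines.append(prefix)
            prefixed_lines.append(line)
    return "".join(prefixed_lines)


sample = "a\n\n  b\n   \nc"
assert indent_sketch(sample, "> ") == textwrap.indent(sample, "> ")
print(indent_sketch(sample, "> "))
```

Appending the prefix and the line as separate list items avoids building an intermediate prefix + line string for every line; the single join at the end does the concatenation.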

methane added 2 commits

July 29, 2023 12:34

use str.isspace instead of lstrip

5e60878

add comment about splitlines(True)

16e3dbd

serhiy-storchaka approved these changes

Member

serhiy-storchaka left a comment

Thank you for your research, Inada-san. Which to use here, lstrip or isspace, I leave up to you. It does not really matter in most cases.

bedevere-bot added awaiting merge and removed awaiting core review labels

25% -> 30%

734fd01

methane enabled auto-merge (squash)

July 29, 2023 06:03

methane merged commit 37551c9 into python:main

methane deleted the opt-textwrap-indent branch

July 29, 2023 06:37

bedevere-bot removed the awaiting merge label