bpo-24665: Add CJK support in textwrap by default. #5649

Closed

JulienPalard wants to merge 1 commit into python:master from JulienPalard:textwrap-cjk

Conversation

Member

JulienPalard commented Feb 13, 2018 (edited by bedevere-bot)

Related to:

https://bugs.python.org/issue24665

the-knights-who-say-ni added the CLA signed label

Feb 13, 2018

bedevere-bot added the awaiting merge label

Feb 13, 2018

ned-deily requested a review from larryhastings

February 13, 2018 03:19

fgallaire reviewed

Feb 14, 2018

Lib/textwrap.py

		width = 0
		pos = 0
		for char in text:
			width += 2 if east_asian_width(char) in {'F', 'W'} else 1

Why inline _len()? I haven't seen any performance issues, and the inlined version is less readable (less Pythonic).

Member, Author

How do you know where to break once you have the whole value?

Hello, I was reading too fast. In my version there's the _wide boolean function,
so here it would be width += _wide(char) + 1

And _len is just return sum(2 if _wide(char) else 1 for char in text), with no performance issues.

More pythonic, DRY.

Member, Author

I wouldn't bet on the performance: calling _len from _slice adds two function calls per character (one to _len and one to sum). In one case I'm measuring a single character, and in the other a whole string. Yes, I could also factor this ternary out into a third function, but I don't find that more readable.

fgallaire reviewed

Feb 14, 2018

Lib/textwrap.py

		width += 2 if east_asian_width(char) in {'F', 'W'} else 1
		if width > index:
			break
		pos += 1

fgallaire, Feb 14, 2018

Why not use enumerate()? The manual counter is less readable (less Pythonic).

JulienPalard (Member, Author), Feb 15, 2018

Because it doesn't work with enumerate: the final increment is not performed. I don't remember exactly which case it was, but if you run the unit tests you'll spot it easily, since one was failing. I'll look into it if needed, but I can't right now.

fgallaire, Feb 15, 2018 (edited)

I'm interested in that: the code was heavily tested for txt2tags, and those tests didn't catch this problem.

JulienPalard (Member, Author), Mar 6, 2018

Your initial implementation worked thanks to your if cjk_len(text) <= index: return text, '' check, which handles the special case explicitly. I may have tried to avoid it.

fgallaire, Mar 6, 2018

"Explicit is better than implicit." But what matters most is that both solutions are correct.

fgallaire commented Feb 14, 2018

I don't see my author credit.

fgallaire commented Feb 15, 2018

And you missed the if self.width <= 0: bug fixed in #89

JulienPalard force-pushed the textwrap-cjk branch 2 times, most recently from 45fd84d to 4623375

March 6, 2018 22:42

bpo-24665: Add CJK support in textwrap by default.

57b2882

Co-authored-by: Florent Gallaire <fgallaire@gmail.com>

JulienPalard force-pushed the textwrap-cjk branch from 4623375 to 57b2882

March 6, 2018 22:43

JulienPalard (Member, Author) commented Mar 6, 2018

> And you miss the if self.width <= 0: bug fixed in #89

You're right! And trying to split a wide character leads to an infinite loop.

> Don't see my author credit

Gladly fixed, and I've added you as co-author.

fgallaire commented Mar 6, 2018

Thanks @JulienPalard, I'm so happy! I had almost lost hope of seeing this issue fixed.

fgallaire reviewed

Mar 6, 2018

Lib/textwrap.py

		if self.width <= 0:
			raise ValueError("invalid width %r (must be > 0)" % self.width)
		elif self.width == 1 and _width(text) > len(text):
			raise ValueError("invalid width 1 (must be > 1 when CJK chars)")

fgallaire, Mar 6, 2018

I implemented a more complex solution:

elif self.width == 1 and (sum(self._width(chunk) for chunk in chunks) > sum(len(chunk) for chunk in chunks)):

It raises the exception earlier, but it's probably not strictly necessary.
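
To make the purpose of the extra check concrete, here is a minimal standalone sketch. It assumes a module-level _width helper like the one in the excerpt; check_width is a hypothetical free function standing in for the method body, which in the PR reads self.width instead of taking a parameter.

```python
from unicodedata import east_asian_width


def _width(text):
    """Display width of text, counting Fullwidth/Wide characters as 2."""
    return sum(2 if east_asian_width(char) in {'F', 'W'} else 1
               for char in text)


def check_width(width, text):
    """Reject widths that can never fit the text (hypothetical helper).

    A two-column character cannot fit in a one-column line, so wrapping
    would never make progress; failing early avoids that situation.
    """
    if width <= 0:
        raise ValueError("invalid width %r (must be > 0)" % width)
    elif width == 1 and _width(text) > len(text):
        # _width(text) > len(text) means at least one wide character
        raise ValueError("invalid width 1 (must be > 1 when CJK chars)")
```

With this sketch, check_width(1, "abc") passes silently, while check_width(1, "日") and check_width(0, "abc") both raise ValueError.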

JulienPalard requested a review from vstinner

March 28, 2018 21:16

terryjreedy requested changes

Jul 8, 2018

terryjreedy (Member) left a comment

The change I request is that this be closed because it is conceptually wrong. Textwrap works in terms of abstract 'characters' (code points), not physical units. I will explain this on the issue.

Aside from that, 2 is the wrong number to add, as 'double-width' characters are not actually twice as wide as fixed-pitch ASCII chars of the same height. See the issue for this as well.