
Fixes to characters considered zero-width #34


Merged

Manishearth merged 6 commits into unicode-rs:master from Jules-Bertholet:default-ignorable-code-point

Feb 13, 2024


Conversation

Contributor

Jules-Bertholet commented Feb 10, 2024 (edited)

These characters are supposed to be completely invisible and ignored by rendering unless specially supported: https://www.unicode.org/faq/unsup_char.html#3. Characters affected

Edit: Now also fixes #26

Edit 2: I've marked Prepended_Concatenation_Marks as not zero-width. This matches the behavior of glibc.

Edit 3: I've given U+115F HANGUL CHOSEONG FILLER back its width 2, because it's expected to be combined with other jamo to form a width-2 syllable block.

Treat Default_Ignorable_Code_Points as zero-width

aed33e9

Member

Manishearth commented Feb 10, 2024

This implements a specific standardized algorithm as documented in the readme. This rule around Default_Ignorable doesn't seem to be documented there.

This is not a general purpose terminal width library.

Manishearth closed this

Feb 10, 2024

Contributor, Author

Jules-Bertholet commented Feb 10, 2024 (edited)

This library already differs from UAX 11 in several important ways:

This crate has special handling for the soft hyphen and certain Hangul characters, neither of which are mentioned in UAX 11:
unicode-width/scripts/unicode.py
Lines 482 to 487 in 8942487
    # Override for soft hyphen
    width_map[0x00AD] = EffectiveWidth.NARROW
    # Override for Hangul Jamo medial vowels & final consonants
    for i in range(0x1160, 0x11FF + 1):
        width_map[i] = EffectiveWidth.ZERO
UAX 11 says "[UTS51] emoji presentation sequences behave as though they were East Asian Wide, regardless of their assigned East_Asian_Width property value", but this crate doesn't do that
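
To make the first point concrete, here is a small sketch (not part of the crate) that asks Python's standard unicodedata module what East_Asian_Width says about the overridden code points. The helper name describe is made up for illustration. East_Asian_Width only distinguishes wide from narrow-ish classes and never yields "zero", so the zero width for medial vowels and final consonants has to come from somewhere other than UAX 11.

```python
import unicodedata


def describe(cp: int) -> tuple[str, str]:
    """Return (East_Asian_Width, General_Category) for a code point."""
    ch = chr(cp)
    return unicodedata.east_asian_width(ch), unicodedata.category(ch)


# Soft hyphen, a leading consonant, a medial vowel, a final consonant.
for cp in (0x00AD, 0x1100, 0x1161, 0x11A8):
    eaw, cat = describe(cp)
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: EAW={eaw}, category={cat}")
```

The exact values depend on the Unicode version bundled with the Python interpreter in use.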

Member

Manishearth commented Feb 10, 2024

Hmm, yeah. I didn't originally write this but I would like for the code to follow the spec first and offer these things as settings

Contributor, Author

Jules-Bertholet commented Feb 10, 2024

UAX 11 doesn't really give a full, exact algorithm for getting a "width value" for a string. For example, control codes aren't even mentioned, nor are line breaks etc. So I think referring to other parts of the Unicode standard as well makes perfect sense.

Manishearth reopened this

Feb 10, 2024

Member

Manishearth commented Feb 10, 2024

Hmm that's fair. Will review later.

I would ideally like someone to take a holistic view of this crate, compare with the specs, and document/add options. Haven't had time to do this myself ever since I inherited it.

Jules-Bertholet changed the title from "Treat Default_Ignorable_Code_Points as zero-width" to "Treat Default_Ignorable_Code_Points as zero-width, as well as vowel and trailing Jamo"

Feb 10, 2024

Jules-Bertholet added 2 commits

February 10, 2024 18:48

Treat all jungseong and jongseong jamo as 0-width

397ab07

Fixes #26

Don't treat Prepended_Concatenation_Marks as zero width

a6b5a52

Jules-Bertholet changed the title from "Treat Default_Ignorable_Code_Points as zero-width, as well as vowel and trailing Jamo" to "Fixes to characters considered zero-width"

Feb 11, 2024

Jules-Bertholet added 2 commits

February 10, 2024 23:27

Give U+115F HANGUL CHOSEONG FILLER width 2

5da0090

Add more info to README

436b0db

Contributor, Author

Jules-Bertholet commented Feb 11, 2024 (edited)

I would ideally like someone to take a holistic view of this crate, compare with the specs, and document

I've added some comments throughout the code, but here is a summary of the current rules (with this PR's changes included):

The soft hyphen (U+00AD) is single-width. This is based on the behavior of wcwidth()/wcswidth() from POSIX in various implementations. There is more background on this character at https://archive.is/fCT3c; the TL;DR is that the Unicode spec interprets it slightly differently than ISO Latin 1 originally did. Arguably this could be configurable.
Hangul jamo medial vowels & final consonants are zero-width. This is to ensure that Hangul syllable blocks, which consist of initial consonant + medial vowel + optional final consonant, have total width 2. The spec suggests that archaic Korean can have syllable blocks with more than 1 initial consonant, which might lead this crate to give wrong results for those cases. May be worth investigating. And nonstandard jamo sequences (missing an initial consonant) might be rendered with a different width than this crate says, of course.
All Default_Ignorable_Code_Points are zero-width, except for U+115F HANGUL CHOSEONG FILLER. These code points are documented by Unicode as having no rendering whatsoever unless an implementation specifically supports them. U+115F is exempted, however, because it is meant for Hangul syllable blocks that are missing their initial consonant on purpose, but that still have a medial vowel and/or final consonant. So, because we still want the completed block to have width 2, we give the filler width 2.
All nonspacing marks (general category Mn or Me), as well as control (Cc) characters, are zero-width.
Otherwise, we follow UAX 11 in assigning width properties to individual characters.
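
For readers who want to poke at these rules, here is a rough Python sketch of them in the order listed. It is not the crate's implementation: the names char_width, str_width and SOME_DEFAULT_IGNORABLE are invented here, the Default_Ignorable_Code_Point set is only a tiny hand-picked subset (the real list lives in the Unicode Character Database), and ambiguous-width characters are simply treated as narrow.

```python
import unicodedata

# Assumption: partial, illustrative subset of Default_Ignorable_Code_Point.
SOME_DEFAULT_IGNORABLE = {0x034F, 0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF}


def char_width(ch: str) -> int:
    cp = ord(ch)
    if cp == 0x00AD:                 # soft hyphen: single-width
        return 1
    if 0x1160 <= cp <= 0x11FF:       # medial vowels and final consonants
        return 0
    if cp == 0x115F:                 # HANGUL CHOSEONG FILLER keeps width 2
        return 2
    if cp in SOME_DEFAULT_IGNORABLE:
        return 0
    if unicodedata.category(ch) in ("Mn", "Me", "Cc"):
        return 0
    # Fallback: East_Asian_Width, with Wide/Fullwidth as 2 columns.
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def str_width(s: str) -> int:
    return sum(char_width(ch) for ch in s)


# Decomposed syllable: initial consonant + medial vowel + final consonant.
print(str_width("\u1112\u1161\u11ab"))
# Filler standing in for a missing initial consonant.
print(str_width("\u115f\u1161"))
```

Both sample strings are complete syllable blocks, so under these rules each should come out at the width of one block.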

What's still not handled, or could be handled differently:

This crate could give "wrong" width for anything "malformed" (defective combining character sequences, nonstandard jamo sequences, more I don't know about maybe?)
UnicodeWidthChar gives a width of Some(0) to U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR, maybe these should be None?
Soft hyphen situation
Multi-choseong jamo?
Multichar emoji sequences. "Treat emoji presentation sequences as fullwidth" (#35) makes emoji presentation sequences conform to UAX 11, but emoji ZWJ sequences are not handled (as documented in the README). And there are several possible knobs if such support were to be added.
Some non-East-Asian characters would seem to imply an extra-wide rendering, like U+2A75 TWO CONSECUTIVE EQUALS SIGNS (⩵) or the U+2E3B THREE-EM DASH (⸻). But Unicode doesn't encode this information, so this library can't realistically account for them.
Egyptian hieroglyph format controls, once rendering software starts to support them
Control characters, could give more options on how to handle them
U+2044 FRACTION SLASH is weird
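
As an illustration of the emoji ZWJ item, the sketch below sums per-code-point widths for a family emoji sequence. The function naive_width is hypothetical and deliberately simplistic (zero for ZWJ and combining marks, East_Asian_Width otherwise); it is only meant to show why summing code points overcounts when a renderer draws the whole sequence as a single glyph.

```python
import unicodedata


def naive_width(s: str) -> int:
    """Sum per-code-point widths with no knowledge of emoji sequences."""
    total = 0
    for ch in s:
        # ZERO WIDTH JOINER and combining marks contribute nothing.
        if ch == "\u200d" or unicodedata.category(ch) in ("Mn", "Me"):
            continue
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total


# MAN + ZWJ + WOMAN + ZWJ + GIRL: three wide emoji joined into one sequence.
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"
print(naive_width(family))
```

Each of the three emoji counts as wide on its own, so the sum is three times the width of a single emoji, whereas software that supports the sequence would typically show one glyph.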

Contributor, Author

Jules-Bertholet commented Feb 11, 2024

https://www.unicode.org/L2/L2023/23107-terminal-suppt.pdf "Measurement" section highlights more problem cases

Contributor, Author

Jules-Bertholet commented Feb 12, 2024

See also https://www.unicode.org/versions/Unicode15.1.0/ch05.pdf#G40095, "Characters Ignored for Display"

Mark interlinear annotation chars and Egyptian hieroglyph format controls as non-zero width

aae585f

Contributor, Author

Jules-Bertholet commented Feb 12, 2024 (edited)

Unicode §5.21 - "Characters Ignored for Display" - "Default Ignorable Code Point" says:

A small number of format characters (General_Category = Cf) are also not given the Default_Ignorable_Code_Point property. This may surprise implementers, who often assume that all format characters are generally ignored in fallback display. The exact list of these exceptional format characters can be found in the Unicode Character Database. There are, however, three important sets of such format characters to note:
prepended concatenation marks
interlinear annotation characters
Egyptian hieroglyph format controls
The prepended concatenation marks always have a visible display. See "Prepended Concatenation Marks" in Section 23.2, Layout Controls for more discussion of the use and display of these signs.
The other two notable sets of format characters that exceptionally are not ignored in fallback display consist of the interlinear annotation characters, U+FFF9 INTERLINEAR ANNOTATION ANCHOR through U+FFFB INTERLINEAR ANNOTATION TERMINATOR, and the Egyptian hieroglyph format controls, U+13430 EGYPTIAN HIEROGLYPH VERTICAL JOINER through U+1343F EGYPTIAN HIEROGLYPH END WALLED ENCLOSURE. These characters should have a visible glyph display for fallback rendering, because if they are not displayed, it is too easy to misread the resulting displayed text. See "Annotation Characters" in Section 23.8, Specials, as well as Section 11.4, Egyptian Hieroglyphs for more discussion of the use and display of these characters.

Software that interprets the interlinear annotation characters should probably do that processing before passing to unicode-width, so assuming fallback rendering makes sense in that case. Additionally, next to no implementations currently support the Egyptian hieroglyph format controls, so assuming a fallback rendering probably makes sense there as well. Therefore, I've marked both as non-zero width.
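
A small sketch of the distinction, using only the two ranges named in the quoted text. The function name has_fallback_glyph is made up for illustration, and the prepended concatenation marks are left out because their exact list comes from the Unicode Character Database rather than from this thread. The point is that General_Category = Cf alone does not tell you whether a character is ignored for display.

```python
import unicodedata

# Ranges taken from the Unicode text quoted above.
INTERLINEAR_ANNOTATION = range(0xFFF9, 0xFFFB + 1)
EGYPTIAN_FORMAT_CONTROLS = range(0x13430, 0x1343F + 1)


def has_fallback_glyph(cp: int) -> bool:
    """True for format characters that should stay visible in fallback display."""
    return cp in INTERLINEAR_ANNOTATION or cp in EGYPTIAN_FORMAT_CONTROLS


# Both are format characters (Cf), but only the first is expected to be shown.
for cp in (0xFFF9, 0x200B):
    print(f"U+{cp:04X}", unicodedata.category(chr(cp)), has_fallback_glyph(cp))
```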

Manishearth approved these changes

Feb 13, 2024

Manishearth merged commit fda272b into unicode-rs:master

Feb 13, 2024

Jules-Bertholet deleted the default-ignorable-code-point branch

February 13, 2024 21:23
