Crate unicode_width


Determine displayed width of `char` and `str` types according to Unicode Standard Annex #11 and other portions of the Unicode standard. See the Rules for determining width section for the exact rules.

This crate is `#![no_std]`.

```rust
use unicode_width::UnicodeWidthStr;

let teststr = "Ｈｅｌｌｏ, ｗｏｒｌｄ!";
let width = UnicodeWidthStr::width(teststr);
println!("{}", teststr);
println!("The above string is {} columns wide.", width);
```

§`"cjk"` feature flag

This crate has one Cargo feature flag, `"cjk"` (enabled by default). It enables the `UnicodeWidthChar::width_cjk` and `UnicodeWidthStr::width_cjk` methods, which perform an alternate width calculation more suited to CJK contexts. The flag also unseals the `UnicodeWidthChar` and `UnicodeWidthStr` traits.

Disabling the flag (with `default-features = false` in `Cargo.toml`) will reduce the amount of static data needed by the crate.
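As a minimal sketch, a dependency entry that turns the flag off might look like the following. The version number is an assumption for illustration only; substitute the release you actually depend on.

```toml
[dependencies]
# Version "0.2" is assumed here, not taken from this page.
# `default-features = false` drops the default "cjk" feature.
unicode-width = { version = "0.2", default-features = false }
```

With this entry the `width_cjk` methods are not available, so any call to them should be gated behind `#[cfg(feature = "cjk")]` or removed.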

```rust
use unicode_width::UnicodeWidthStr;

let teststr = "“𘀀”";
assert_eq!(teststr.width(), 4);

#[cfg(feature = "cjk")]
assert_eq!(teststr.width_cjk(), 6);
```

§Rules for determining width

This crate currently uses the following rules to determine the width of a character or string, in order of decreasing precedence. These may be tweaked in the future.

In the following cases, the width of a string differs from the sum of the widths of its constituent characters:
- The sequence "\r\n" has width 1.
- Emoji-specific ligatures:
  - Well-formed, fully-qualified emoji ZWJ sequences have width 2.
  - Emoji modifier sequences have width 2.
  - Emoji presentation sequences have width 2.
  - Outside of an East Asian context, text presentation sequences have width 1 if their base character:
    - Has the Emoji_Presentation property, and
    - Is not in the Enclosed Ideographic Supplement block.
- '\u{2018}', '\u{2019}', '\u{201C}', and '\u{201D}' always have width 1 when followed by '\u{FE00}' or '\u{FE02}', and width 2 when followed by '\u{FE01}'.
- Script-specific ligatures:
  - For all the following ligatures, the insertion of any number of default-ignorable combining marks anywhere in the sequence will not change the total width. In addition, for all non-Arabic ligatures, the insertion of any number of '\u{200D}' ZERO WIDTH JOINERs will not affect the width.
  - Arabic: A character sequence consisting of one character with Joining_Group=Lam, followed by any number of characters with Joining_Type=Transparent, followed by one character with Joining_Group=Alef, has total width 1. For example: لا‎, لآ‎, ڸا‎, لٟٞأ
  - Buginese: "\u{1A15}\u{1A17}\u{200D}\u{1A10}" (<a, -i> ya, ᨕᨗ‍ᨐ) has total width 1.
  - Hebrew: "א\u{200D}ל" (Alef-Lamed, א‍ל) has total width 1.
  - Khmer: Coeng signs consisting of '\u{17D2}' followed by a character in '\u{1780}'..='\u{1782}' | '\u{1784}'..='\u{1787}' | '\u{1789}'..='\u{178C}' | '\u{178E}'..='\u{1793}' | '\u{1795}'..='\u{1798}' | '\u{179B}'..='\u{179D}' | '\u{17A0}' | '\u{17A2}' | '\u{17A7}' | '\u{17AB}'..='\u{17AC}' | '\u{17AF}' have width 0.
  - Kirat Rai: Any sequence canonically equivalent to '\u{16D68}', '\u{16D69}', or '\u{16D6A}' has total width 1.
  - Lisu: Tone letter combinations consisting of a character in the range '\u{A4F8}'..='\u{A4FB}' followed by a character in the range '\u{A4FC}'..='\u{A4FD}' have width 1. For example: ꓹꓼ
  - Old Turkic: "\u{10C32}\u{200D}\u{10C03}" (𐰲‍𐰃) has total width 1.
  - Tifinagh: A sequence of a Tifinagh consonant in the range '\u{2D31}'..='\u{2D65}' | '\u{2D6F}', followed by either '\u{2D7F}' TIFINAGH CONSONANT JOINER or '\u{200D}', followed by another Tifinagh consonant, has total width 1. For example: ⵏ⵿ⴾ
- In an East Asian context only, <, =, or > have width 2 when followed by '\u{0338}' COMBINING LONG SOLIDUS OVERLAY. The two characters may be separated by any number of characters whose canonical decompositions consist only of characters meeting one of the following requirements:
  - Has Canonical_Combining_Class greater than 1, or
  - Is a default-ignorable combining mark.
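Two of the simplest string-level rules above ("\r\n" and the Lisu tone letter pairs) can be sketched without any Unicode data tables. The function below is a hypothetical illustration, not the crate's implementation: `toy_width` is an invented name, and it assumes every character outside these two rules has width 1, which is not true in general.

```rust
/// Toy width function (hypothetical; NOT the crate's algorithm).
/// Models only two string-level rules and treats every other
/// character as width 1.
fn toy_width(s: &str) -> usize {
    let mut width = 0;
    let mut chars = s.chars().peekable();
    while let Some(c) = chars.next() {
        // Does `c` combine with the next character into one width-1 unit?
        let merges_with_next = matches!(
            (c, chars.peek().copied()),
            ('\r', Some('\n'))
                | ('\u{A4F8}'..='\u{A4FB}', Some('\u{A4FC}'..='\u{A4FD}'))
        );
        if merges_with_next {
            chars.next(); // consume the second half of the pair
        }
        width += 1;
    }
    width
}

fn main() {
    // "\r\n" counts once, so the total is 3 rather than 4.
    assert_eq!(toy_width("a\r\nb"), 3);
    // A Lisu tone letter pair in the documented order has width 1.
    assert_eq!(toy_width("\u{A4F9}\u{A4FC}"), 1);
    // In the opposite order the rule does not apply.
    assert_eq!(toy_width("\u{A4FC}\u{A4F9}"), 2);
}
```

The point of the sketch is only that a string's width cannot be computed by summing per-character widths in isolation: the scan has to look ahead at the following character.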
In all other cases, the width of the string equals the sum of its character widths:
1. '\u{2D7F}' TIFINAGH CONSONANT JOINER has width 1 (outside of the ligatures described previously).
2. '\u{115F}' HANGUL CHOSEONG FILLER and '\u{17A4}' KHMER INDEPENDENT VOWEL QAA have width 2.
3. '\u{17D8}' KHMER SIGN BEYYAL has width 3.
4. The following have width 0:
  - Characters with the Default_Ignorable_Code_Point property.
  - Characters with the Grapheme_Extend property.
  - Characters with a Hangul_Syllable_Type of Vowel_Jamo (V) or Trailing_Jamo (T).
  - The following Prepended_Concatenation_Marks:
  - Characters with the Grapheme_Cluster_Break=Prepend property, that are not also Prepended_Concatenation_Marks.
  - '\u{A8FA}' DEVANAGARI CARET.
5. Characters with an East_Asian_Width of Fullwidth or Wide have width 2.
6. Characters fulfilling all of the following conditions have width 2 in an East Asian context, and width 1 otherwise:
  - Fulfills one of the following conditions:
    - Has an East_Asian_Width of Ambiguous, or
    - Has a Line_Break of AI, or
    - Has a canonical decomposition to an Ambiguous character followed by '\u{0338}' COMBINING LONG SOLIDUS OVERLAY, or
    - Is '\u{0387}' GREEK ANO TELEIA; and
  - Does not have a General_Category of Letter or Modifier_Symbol.
7. All other characters have width 1.
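The numbered rules are applied in order, so an earlier rule wins when a character matches more than one. A minimal sketch of that precedence follows. `CharProps` and `char_width` are hypothetical names, and the property flags are supplied by the caller rather than looked up in real Unicode data, so this shows only the ordering, not how the crate obtains properties.

```rust
/// Hypothetical, caller-supplied summary of a character's properties.
#[derive(Default, Clone, Copy)]
struct CharProps {
    /// Matches any of the width-0 conditions in rule 4.
    zero_width: bool,
    /// East_Asian_Width is Fullwidth or Wide (rule 5).
    wide: bool,
    /// Meets all of the conditions in rule 6.
    ambiguous: bool,
}

/// Applies rules 1 to 7 in order of decreasing precedence (sketch only).
fn char_width(c: char, props: CharProps, east_asian: bool) -> usize {
    match c {
        '\u{2D7F}' => 1,                // rule 1
        '\u{115F}' | '\u{17A4}' => 2,   // rule 2
        '\u{17D8}' => 3,                // rule 3
        _ if props.zero_width => 0,     // rule 4
        _ if props.wide => 2,           // rule 5
        _ if props.ambiguous => {
            if east_asian { 2 } else { 1 } // rule 6
        }
        _ => 1,                         // rule 7
    }
}

fn main() {
    let plain = CharProps::default();
    assert_eq!(char_width('a', plain, false), 1);

    // Flags below are made up to exercise the ordering.
    let zero_and_wide = CharProps { zero_width: true, wide: true, ambiguous: false };
    assert_eq!(char_width('x', zero_and_wide, false), 0); // rule 4 beats rule 5

    let ambiguous = CharProps { ambiguous: true, ..CharProps::default() };
    assert_eq!(char_width('x', ambiguous, true), 2);
    assert_eq!(char_width('x', ambiguous, false), 1);

    // The explicitly listed characters win regardless of the flags.
    assert_eq!(char_width('\u{17D8}', zero_and_wide, false), 3);
}
```

Writing the rules as ordered `match` arms keeps the precedence visible in the source: moving an arm changes which rule wins.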

§Canonical equivalence

Canonically equivalent strings are assigned the same width (CJK and non-CJK).

Constants§

UNICODE_VERSION: The version of Unicode that this version of unicode-width is based on.

Traits§

UnicodeWidthChar: Methods for determining displayed width of Unicode characters.
UnicodeWidthStr: Methods for determining displayed width of Unicode strings.
