The std.uni module provides an implementation of fundamental Unicode algorithms and data structures. It does not include UTF encoding and decoding primitives; see std.utf.decode and std.utf.encode in std.utf for that functionality.
All primitives listed operate on Unicode characters and sets of characters. For functions which operate on ASCII characters and ignore Unicode characters, see std.ascii. For definitions of Unicode character, code point and other terms used throughout this module, see the terminology section below.
The focus of this module is the core needs of developing Unicode-aware applications. To that effect it provides the following optimized primitives:
It's recognized that an application may need further enhancements and extensions, such as less commonly known algorithms, or tailoring existing ones for region specific needs. To help users with building any extra functionality beyond the core primitives, the module provides:
    import std.algorithm.searching : find;
    import std.uni;

    void main()
    {
        // initialize code point sets using script/block or property name
        // now 'set' contains code points from both scripts.
        auto set = unicode("Cyrillic") | unicode("Armenian");
        // same thing but simpler and checked at compile-time
        auto ascii = unicode.ASCII;
        auto currency = unicode.Currency_Symbol;

        // easy set ops
        auto a = set & ascii;
        assert(a.empty); // as it has no intersection with ascii
        a = set | ascii;
        auto b = currency - a; // subtract all ASCII, Cyrillic and Armenian

        // some properties of code point sets
        assert(b.length > 45); // 46 items in Unicode 6.1, even more in 6.2
        // testing presence of a code point in a set
        // is just fine, it is O(logN)
        assert(!b['$']);
        assert(!b['\u058F']); // Armenian dram sign
        assert(b['¥']);

        // building fast lookup tables, these guarantee O(1) complexity
        // 1-level Trie lookup table essentially a huge bit-set ~262Kb
        auto oneTrie = toTrie!1(b);
        // 2-level far more compact but typically slightly slower
        auto twoTrie = toTrie!2(b);
        // 3-level even smaller, and a bit slower yet
        auto threeTrie = toTrie!3(b);
        assert(oneTrie['£']);
        assert(twoTrie['£']);
        assert(threeTrie['£']);

        // build the trie with the most sensible trie level
        // and bind it as a functor
        auto cyrillicOrArmenian = toDelegate(set);
        auto balance = find!(cyrillicOrArmenian)("Hello ընկեր!");
        assert(balance == "ընկեր!");
        // compatible with bool delegate(dchar)
        bool delegate(dchar) bindIt = cyrillicOrArmenian;

        // Normalization
        string s = "Plain ascii (and not only), is always normalized!";
        assert(s is normalize(s)); // is the same string

        string nonS = "A\u0308ffin"; // A followed by a combining diaeresis
        auto nS = normalize(nonS); // to NFC, the W3C endorsed standard
        assert(nS == "Äffin");
        assert(nS != nonS);
        string composed = "Äffin";

        assert(normalize!NFD(composed) == "A\u0308ffin");
        // to NFKD, compatibility decomposition useful for fuzzy matching/searching
        assert(normalize!NFKD("2¹⁰") == "210");
    }
The following is a list of important Unicode notions and definitions. Any conventions used specifically in this module alone are marked as such. The descriptions are based on the formal definition as found in chapter three of The Unicode Standard Core Specification.
A unit of information used for the organization, control, or representation of textual data. Note that:
This module defines a number of primitives that work with graphemes: Grapheme, decodeGrapheme and graphemeStride. All of them use extended grapheme boundaries as defined in the aforementioned standard annex.
A combining character with the General Category of Nonspacing Mark (Mn) or Enclosing Mark (Me).
A combining character that is not a nonspacing mark.
The concepts of canonical equivalent or compatibility equivalent characters in the Unicode Standard make it necessary to have a full, formal definition of equivalence for Unicode strings. String equivalence is determined by a process called normalization, whereby strings are converted into forms which are compared directly for identity. This is the primary goal of the normalization process; see the function normalize to convert into any of the four defined forms.
A very important attribute of the Unicode Normalization Forms is that they must remain stable between versions of the Unicode Standard. A Unicode string normalized to a particular Unicode Normalization Form in one version of the standard is guaranteed to remain in that Normalization Form for implementations of future versions of the standard.
The Unicode Standard specifies four normalization forms. Informally, two of these forms are defined by maximal decomposition of equivalent sequences, and two of these forms are defined by maximal composition of equivalent sequences.
The choice of the normalization form depends on the particular use case. NFC is the best form for general text, since it's more compatible with strings converted from legacy encodings. NFKC is the preferred form for identifiers, especially where there are security concerns. NFD and NFKD are the most useful for internal processing.
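As a minimal sketch of how the forms differ in practice, the snippet below normalizes two spellings of the same word and a string containing a compatibility ligature. The helper canonicallyEqual is a hypothetical name introduced for illustration; only normalize and the NFC, NFD, NFKC and NFKD constants come from this module.

```d
import std.uni;

/// Hypothetical helper: true if `a` and `b` are canonically equivalent,
/// decided by comparing their NFC forms.
bool canonicallyEqual(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    // "é" as one precomposed code point (U+00E9)
    string precomposed = "caf\u00E9";
    // the same text as 'e' followed by a combining acute accent (U+0301)
    string decomposed = "cafe\u0301";

    // the raw strings differ, yet they are canonically equivalent
    assert(precomposed != decomposed);
    assert(canonicallyEqual(precomposed, decomposed));
    assert(normalize!NFC(decomposed) == precomposed);
    assert(normalize!NFD(precomposed) == decomposed);

    // compatibility forms also fold compatibility characters,
    // here the "fi" ligature U+FB01
    assert(normalize!NFKC("\uFB01n") == "fin");
    assert(normalize!NFKD("\uFB01n") == "fin");
    // canonical forms leave the ligature untouched
    assert(normalize!NFC("\uFB01n") == "\uFB01n");
}
```

Comparing NFC forms is only one possible choice for the helper; comparing NFD forms would give the same answer for canonical equivalence.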
The Unicode standard describes a set of algorithms that depend on having the ability to quickly look up various properties of a code point. Given the codespace of about 1 million code points, it is not a trivial task to provide a space-efficient solution for the multitude of properties.
Common approaches such as hash-tables or binary search over sorted code point intervals (as in InversionList) are insufficient. Hash-tables have an enormous memory footprint and binary search over intervals is not fast enough for some heavy-duty algorithms.
The recommended solution (see Unicode Implementation Guidelines) is using multi-stage tables that are an implementation of the Trie data structure with integer keys and a fixed number of stages. For the remainder of the section this will be called a fixed trie. The following describes a particular implementation that is aimed at speed of access at the expense of ideal size savings.
Taking a 2-level Trie as an example, the principle of operation is as follows. Split the number of bits in a key (code point, 21 bits) into 2 components (e.g. 13 and 8). The first is the number of bits in the index of the trie and the other is the number of bits in each page of the trie. The layout of the trie is then an array of size 2^^bits-of-index followed by an array of memory chunks of size 2^^bits-of-page/bits-per-element.
The number of pages is variable (but not less than 1) unlike the number of entries in the index. The slots of the index all have to contain the number of a page that is present. The lookup is then just a couple of operations: slice the upper bits, look up an index entry for these, take the page at this index and use the lower bits as an offset within this page.
Assuming that pages are laid out consecutively in one array at pages, the pseudo-code is:

    auto elemsPerPage = (2 ^^ bits_per_page) / Value.sizeOfInBits;
    pages[index[n >> bits_per_page]][n & (elemsPerPage - 1)];

Where if elemsPerPage is a power of 2 the whole process is a handful of simple instructions and 2 array reads. Subsequent levels of the trie are introduced by recursing on this notion: the index array is treated as values. The number of bits in the index is then again split into 2 parts, with pages over 'current-index' and the new 'upper-index'.
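The two-level lookup can be sketched with a hand-rolled table. This is an illustration under stated assumptions, not the Trie type of this module: the names TwoStageTable and build are hypothetical, values are unpacked bool (no bit-packing), and the key is split into 13 index bits and 8 page bits.

```d
/// Hypothetical two-stage table mapping a code point to bool.
struct TwoStageTable
{
    enum pageBits = 8;               // low bits select a slot within a page
    enum indexBits = 21 - pageBits;  // high bits select an index entry
    enum pageSize = 1 << pageBits;

    ushort[] index;  // 2^^indexBits entries, each holding a page number
    bool[] pages;    // distinct pages stored back to back

    bool opIndex(dchar ch) const
    {
        immutable uint n = ch;
        immutable size_t page = index[n >> pageBits];           // first read
        return pages[page * pageSize + (n & (pageSize - 1))];   // second read
    }
}

/// Builds the table from a predicate, storing identical pages only once.
TwoStageTable build(scope bool delegate(dchar) pred)
{
    TwoStageTable table;
    table.index = new ushort[1 << TwoStageTable.indexBits];
    ushort[immutable(bool)[]] seen; // page contents -> page number
    foreach (i; 0 .. table.index.length)
    {
        auto page = new bool[TwoStageTable.pageSize];
        foreach (j, ref slot; page)
        {
            immutable n = i * TwoStageTable.pageSize + j;
            // keys above U+10FFFF are not code points, keep them false
            slot = n <= 0x10FFFF && pred(cast(dchar) n);
        }
        immutable(bool)[] key = page.idup;
        if (auto known = key in seen)
            table.index[i] = *known;    // reuse the identical page
        else
        {
            immutable id = cast(ushort) seen.length;
            seen[key] = id;
            table.pages ~= page;
            table.index[i] = id;
        }
    }
    return table;
}

void main()
{
    auto lower = build(ch => ch >= 'a' && ch <= 'z');
    assert(lower['q']);
    assert(!lower['Q']);
    assert(!lower['\U0001F600']);
    // only two distinct pages survive: the one holding 'a'..'z' and an all-false one
    assert(lower.pages.length == 2 * TwoStageTable.pageSize);
}
```

The page merging in build is what keeps the table small: every all-false page shares a single stored copy, so the cost is dominated by the index array.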
For completeness a level 1 trie is simply an array. The current implementation takes advantage of bit-packing values when the range is known to be limited in advance (such as bool). See also BitPacked for enforcing it manually. The major size advantage however comes from the fact that multiple identical pages on every level are merged by construction.
The process of constructing a trie is more involved and is hidden from the user in the form of the convenience functions codepointTrie, codepointSetTrie and the even more convenient toTrie. In general a set or built-in AA with dchar type can be turned into a trie. The trie object in this module is read-only (immutable); it's effectively frozen after construction.
This is a full list of Unicode properties accessible through unicode with specific helpers per category nested within. Consult the CLDR utility when in doubt about the contents of a particular set.
General category sets listed below are only accessible with the unicode shorthand accessor.
| Abb. | Long form | Abb. | Long form | Abb. | Long form |
|---|---|---|---|---|---|
| L | Letter | Cn | Unassigned | Po | Other_Punctuation |
| Ll | Lowercase_Letter | Co | Private_Use | Ps | Open_Punctuation |
| Lm | Modifier_Letter | Cs | Surrogate | S | Symbol |
| Lo | Other_Letter | N | Number | Sc | Currency_Symbol |
| Lt | Titlecase_Letter | Nd | Decimal_Number | Sk | Modifier_Symbol |
| Lu | Uppercase_Letter | Nl | Letter_Number | Sm | Math_Symbol |
| M | Mark | No | Other_Number | So | Other_Symbol |
| Mc | Spacing_Mark | P | Punctuation | Z | Separator |
| Me | Enclosing_Mark | Pc | Connector_Punctuation | Zl | Line_Separator |
| Mn | Nonspacing_Mark | Pd | Dash_Punctuation | Zp | Paragraph_Separator |
| C | Other | Pe | Close_Punctuation | Zs | Space_Separator |
| Cc | Control | Pf | Final_Punctuation | - | Any |
| Cf | Format | Pi | Initial_Punctuation | - | ASCII |
Sets for other commonly useful properties that are accessible with unicode:
| Name | Name | Name |
|---|---|---|
| Alphabetic | Ideographic | Other_Uppercase |
| ASCII_Hex_Digit | IDS_Binary_Operator | Pattern_Syntax |
| Bidi_Control | ID_Start | Pattern_White_Space |
| Cased | IDS_Trinary_Operator | Quotation_Mark |
| Case_Ignorable | Join_Control | Radical |
| Dash | Logical_Order_Exception | Soft_Dotted |
| Default_Ignorable_Code_Point | Lowercase | STerm |
| Deprecated | Math | Terminal_Punctuation |
| Diacritic | Noncharacter_Code_Point | Unified_Ideograph |
| Extender | Other_Alphabetic | Uppercase |
| Grapheme_Base | Other_Default_Ignorable_Code_Point | Variation_Selector |
| Grapheme_Extend | Other_Grapheme_Extend | White_Space |
| Grapheme_Link | Other_ID_Continue | XID_Continue |
| Hex_Digit | Other_ID_Start | XID_Start |
| Hyphen | Other_Lowercase | |
| ID_Continue | Other_Math | |
Below is the table with block names accepted by unicode.block. Note that the shorthand version unicode requires "In" to be prepended to the names of blocks so as to disambiguate scripts and blocks.
| Aegean Numbers | Ethiopic Extended | Mongolian |
| Alchemical Symbols | Ethiopic Extended-A | Musical Symbols |
| Alphabetic Presentation Forms | Ethiopic Supplement | Myanmar |
| Ancient Greek Musical Notation | General Punctuation | Myanmar Extended-A |
| Ancient Greek Numbers | Geometric Shapes | New Tai Lue |
| Ancient Symbols | Georgian | NKo |
| Arabic | Georgian Supplement | Number Forms |
| Arabic Extended-A | Glagolitic | Ogham |
| Arabic Mathematical Alphabetic Symbols | Gothic | Ol Chiki |
| Arabic Presentation Forms-A | Greek and Coptic | Old Italic |
| Arabic Presentation Forms-B | Greek Extended | Old Persian |
| Arabic Supplement | Gujarati | Old South Arabian |
| Armenian | Gurmukhi | Old Turkic |
| Arrows | Halfwidth and Fullwidth Forms | Optical Character Recognition |
| Avestan | Hangul Compatibility Jamo | Oriya |
| Balinese | Hangul Jamo | Osmanya |
| Bamum | Hangul Jamo Extended-A | Phags-pa |
| Bamum Supplement | Hangul Jamo Extended-B | Phaistos Disc |
| Basic Latin | Hangul Syllables | Phoenician |
| Batak | Hanunoo | Phonetic Extensions |
| Bengali | Hebrew | Phonetic Extensions Supplement |
| Block Elements | High Private Use Surrogates | Playing Cards |
| Bopomofo | High Surrogates | Private Use Area |
| Bopomofo Extended | Hiragana | Rejang |
| Box Drawing | Ideographic Description Characters | Rumi Numeral Symbols |
| Brahmi | Imperial Aramaic | Runic |
| Braille Patterns | Inscriptional Pahlavi | Samaritan |
| Buginese | Inscriptional Parthian | Saurashtra |
| Buhid | IPA Extensions | Sharada |
| Byzantine Musical Symbols | Javanese | Shavian |
| Carian | Kaithi | Sinhala |
| Chakma | Kana Supplement | Small Form Variants |
| Cham | Kanbun | Sora Sompeng |
| Cherokee | Kangxi Radicals | Spacing Modifier Letters |
| CJK Compatibility | Kannada | Specials |
| CJK Compatibility Forms | Katakana | Sundanese |
| CJK Compatibility Ideographs | Katakana Phonetic Extensions | Sundanese Supplement |
| CJK Compatibility Ideographs Supplement | Kayah Li | Superscripts and Subscripts |
| CJK Radicals Supplement | Kharoshthi | Supplemental Arrows-A |
| CJK Strokes | Khmer | Supplemental Arrows-B |
| CJK Symbols and Punctuation | Khmer Symbols | Supplemental Mathematical Operators |
| CJK Unified Ideographs | Lao | Supplemental Punctuation |
| CJK Unified Ideographs Extension A | Latin-1 Supplement | Supplementary Private Use Area-A |
| CJK Unified Ideographs Extension B | Latin Extended-A | Supplementary Private Use Area-B |
| CJK Unified Ideographs Extension C | Latin Extended Additional | Syloti Nagri |
| CJK Unified Ideographs Extension D | Latin Extended-B | Syriac |
| Combining Diacritical Marks | Latin Extended-C | Tagalog |
| Combining Diacritical Marks for Symbols | Latin Extended-D | Tagbanwa |
| Combining Diacritical Marks Supplement | Lepcha | Tags |
| Combining Half Marks | Letterlike Symbols | Tai Le |
| Common Indic Number Forms | Limbu | Tai Tham |
| Control Pictures | Linear B Ideograms | Tai Viet |
| Coptic | Linear B Syllabary | Tai Xuan Jing Symbols |
| Counting Rod Numerals | Lisu | Takri |
| Cuneiform | Low Surrogates | Tamil |
| Cuneiform Numbers and Punctuation | Lycian | Telugu |
| Currency Symbols | Lydian | Thaana |
| Cypriot Syllabary | Mahjong Tiles | Thai |
| Cyrillic | Malayalam | Tibetan |
| Cyrillic Extended-A | Mandaic | Tifinagh |
| Cyrillic Extended-B | Mathematical Alphanumeric Symbols | Transport And Map Symbols |
| Cyrillic Supplement | Mathematical Operators | Ugaritic |
| Deseret | Meetei Mayek | Unified Canadian Aboriginal Syllabics |
| Devanagari | Meetei Mayek Extensions | Unified Canadian Aboriginal Syllabics Extended |
| Devanagari Extended | Meroitic Cursive | Vai |
| Dingbats | Meroitic Hieroglyphs | Variation Selectors |
| Domino Tiles | Miao | Variation Selectors Supplement |
| Egyptian Hieroglyphs | Miscellaneous Mathematical Symbols-A | Vedic Extensions |
| Emoticons | Miscellaneous Mathematical Symbols-B | Vertical Forms |
| Enclosed Alphanumerics | Miscellaneous Symbols | Yijing Hexagram Symbols |
| Enclosed Alphanumeric Supplement | Miscellaneous Symbols and Arrows | Yi Radicals |
| Enclosed CJK Letters and Months | Miscellaneous Symbols And Pictographs | Yi Syllables |
| Enclosed Ideographic Supplement | Miscellaneous Technical | |
| Ethiopic | Modifier Tone Letters | |
Below is the table with script names accepted by unicode.script and by the shorthand version unicode:
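The "In" prefix rule for block names can be illustrated with a short sketch. The helper names inCyrillicBlock and inCyrillicScript are hypothetical; the accessors unicode.block, unicode.script and the shorthand unicode are the ones described in this module.

```d
import std.uni;

/// Hypothetical helper: membership in the block named "Cyrillic".
bool inCyrillicBlock(dchar ch)
{
    auto blk = unicode.InCyrillic; // shorthand needs the "In" prefix
    return blk[ch];
}

/// Hypothetical helper: membership in the script named "Cyrillic".
bool inCyrillicScript(dchar ch)
{
    auto scr = unicode.Cyrillic; // without "In" the script is found
    return scr[ch];
}

void main()
{
    auto viaBlock = unicode.block.Cyrillic;
    auto viaShorthand = unicode.InCyrillic;
    assert(viaBlock == viaShorthand);

    auto script = unicode.Cyrillic;
    assert(script == unicode.script.Cyrillic);
    // a block is a contiguous range, a script is not: the two sets differ
    assert(script != viaBlock);

    // U+0416 (Ж) belongs to both
    assert(inCyrillicBlock('\u0416') && inCyrillicScript('\u0416'));
    // U+A640 is Cyrillic script but sits in the Cyrillic Extended-B block
    assert(inCyrillicScript('\uA640'));
    assert(!inCyrillicBlock('\uA640'));
}
```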
| Arabic | Hanunoo | Old_Italic |
| Armenian | Hebrew | Old_Persian |
| Avestan | Hiragana | Old_South_Arabian |
| Balinese | Imperial_Aramaic | Old_Turkic |
| Bamum | Inherited | Oriya |
| Batak | Inscriptional_Pahlavi | Osmanya |
| Bengali | Inscriptional_Parthian | Phags_Pa |
| Bopomofo | Javanese | Phoenician |
| Brahmi | Kaithi | Rejang |
| Braille | Kannada | Runic |
| Buginese | Katakana | Samaritan |
| Buhid | Kayah_Li | Saurashtra |
| Canadian_Aboriginal | Kharoshthi | Sharada |
| Carian | Khmer | Shavian |
| Chakma | Lao | Sinhala |
| Cham | Latin | Sora_Sompeng |
| Cherokee | Lepcha | Sundanese |
| Common | Limbu | Syloti_Nagri |
| Coptic | Linear_B | Syriac |
| Cuneiform | Lisu | Tagalog |
| Cypriot | Lycian | Tagbanwa |
| Cyrillic | Lydian | Tai_Le |
| Deseret | Malayalam | Tai_Tham |
| Devanagari | Mandaic | Tai_Viet |
| Egyptian_Hieroglyphs | Meetei_Mayek | Takri |
| Ethiopic | Meroitic_Cursive | Tamil |
| Georgian | Meroitic_Hieroglyphs | Telugu |
| Glagolitic | Miao | Thaana |
| Gothic | Mongolian | Thai |
| Greek | Myanmar | Tibetan |
| Gujarati | New_Tai_Lue | Tifinagh |
| Gurmukhi | Nko | Ugaritic |
| Han | Ogham | Vai |
| Hangul | Ol_Chiki | Yi |
Below is the table of names accepted by unicode.hangulSyllableType.
| Abb. | Long form |
|---|---|
| L | Leading_Jamo |
| LV | LV_Syllable |
| LVT | LVT_Syllable |
| T | Trailing_Jamo |
| V | Vowel_Jamo |
References
ASCII Table, Wikipedia, The Unicode Consortium, Unicode normalization forms, Unicode text segmentation, Unicode Implementation Guidelines, Unicode Conformance
Trademarks
Unicode(tm) is a trademark of Unicode, Inc.
Source
std/uni/package.d
lineSep;
paraSep;
nelSep;
isCodepointSet(T)
isIntegralPair(T, V = uint);
(T x){ V a = x[0]; V b = x[1];}
The following must not compile: (T x){ V c = x[2];}
CodepointSet = InversionList!(GcPolicy).InversionList;
CodepointInterval;
InversionList(SP = GcPolicy);
InversionList is a set of code points represented as an array of open-right [a, b) intervals (see CodepointInterval above). The name comes from the way the representation reads left to right. For instance a set of all values [10, 50), [80, 90), plus a singular value 60 looks like this:
10, 50, 60, 61, 80, 90
The way to read this is: start with negative, meaning that all numbers smaller than the next one are not present in this set (and positive means the contrary). Then switch positive/negative after each number passed from left to right.
This way negative spans until 10, then positive until 50, then negative until 60, then positive until 61, and so on. As seen this provides space-efficient storage of highly redundant data that comes in long runs, a description which Unicode character properties fit nicely. The technique itself could be seen as a variation on RLE encoding.
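The reading rule above translates into a membership test: a value is in the set exactly when an odd number of boundaries are less than or equal to it. The sketch below is a hypothetical stand-alone helper over a raw boundary array, not the InversionList implementation; it is checked against CodepointSet for the example sequence.

```d
import std.range : assumeSorted;
import std.uni : CodepointSet;

/// Hypothetical helper: membership test on a raw inversion list.
bool contains(const(uint)[] boundaries, uint value)
{
    // count the boundaries that are <= value via binary search
    immutable passed = boundaries.assumeSorted.lowerBound(value + 1).length;
    // an odd count means the last switch was to "positive"
    return passed % 2 == 1;
}

void main()
{
    immutable uint[] list = [10, 50, 60, 61, 80, 90];
    assert(!contains(list, 9));
    assert(contains(list, 10));   // intervals are closed on the left
    assert(!contains(list, 50));  // and open on the right
    assert(contains(list, 60));   // the singular value
    assert(!contains(list, 61));

    // agrees with the module's own set type on the same intervals
    auto set = CodepointSet(10, 50, 60, 61, 80, 90);
    foreach (uint v; 0 .. 100)
        assert(contains(list, v) == set[v]);
}
```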
Sets are value types (just like int is) thus they are never aliased.
Example
    auto a = CodepointSet('a', 'z'+1);
    auto b = CodepointSet('A', 'Z'+1);
    auto c = a;
    a = a | b;
    assert(a == CodepointSet('A', 'Z'+1, 'a', 'z'+1));
    assert(a != c);

See also unicode for simpler construction of sets from predefined ones.
Memory usage is 8 bytes per each contiguous interval in a set. The value semantics are achieved by using the COW technique and thus it's not safe to cast this type to shared.
Note
It's not recommended to rely on the template parameters or the exact type of a current code point set in std.uni. The type and parameters may change when the standard allocators design is finalized. Use isCodepointSet with templates or just stick with the default alias CodepointSet throughout the whole code base.
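A minimal sketch of the advice above: a generic function constrained with isCodepointSet accepts any code point set without naming its concrete type. The function countMembers is a hypothetical example, not part of the module.

```d
import std.uni;

/// Hypothetical helper: counts the code points of `text` found in `set`.
/// The constraint avoids spelling out the concrete InversionList type.
size_t countMembers(Set)(Set set, const(char)[] text)
if (isCodepointSet!Set)
{
    size_t n;
    foreach (dchar ch; text) // decode UTF-8 into code points
        if (set[ch])
            ++n;
    return n;
}

void main()
{
    assert(countMembers(unicode.Cyrillic, "Привет, world") == 6);

    CodepointSet digits = CodepointSet('0', '9' + 1);
    assert(countMembers(digits, "a1b22") == 3);

    // anything that is not a code point set is rejected at compile time
    static assert(!__traits(compiles, countMembers(42, "abc")));
}
```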
set)intervals)intervals...);

    import std.algorithm.comparison : equal;

    auto set = CodepointSet('a', 'z'+1, 'а', 'я'+1);
    foreach (v; 'a'..'z'+1)
        assert(set[v]);
    // Cyrillic lowercase interval
    foreach (v; 'а'..'я'+1)
        assert(set[v]);
    // specific order is not required, intervals may intersect
    auto set2 = CodepointSet('а', 'я'+1, 'a', 'd', 'b', 'z'+1);
    // the same end result
    assert(set2.byInterval.equal(set.byInterval));
    // test constructor this(Range)(Range intervals)
    auto chessPiecesWhite = CodepointInterval(9812, 9818);
    auto chessPiecesBlack = CodepointInterval(9818, 9824);
    auto set3 = CodepointSet([chessPiecesWhite, chessPiecesBlack]);
    foreach (v; '♔'..'♟'+1)
        assert(set3[v]);

byInterval() scope;
opIndex(uint val) const;
val in this set.

    auto gothic = unicode.Gothic;
    // Gothic letter ahsa
    assert(gothic['\U00010330']);
    // no ascii in Gothic obviously
    assert(!gothic['$']);

length();
opBinary(string op, U)(U rhs)
Sets support natural syntax for set algebra, namely:
| Operator | Math notation | Description |
|---|---|---|
| & | a ∩ b | intersection |
| \| | a ∪ b | union |
| - | a ∖ b | subtraction |
| ~ | a ~ b | symmetric set difference i.e. (a ∪ b) \ (a ∩ b) |
    import std.algorithm.comparison : equal;
    import std.range : iota;

    auto lower = unicode.LowerCase;
    auto upper = unicode.UpperCase;
    auto ascii = unicode.ASCII;
    assert((lower & upper).empty); // no intersection
    auto lowerASCII = lower & ascii;
    assert(lowerASCII.byCodepoint.equal(iota('a', 'z'+1)));
    // throw away all of the lowercase ASCII
    writeln((ascii - lower).length); // 128 - 26
    auto onlyOneOf = lower ~ ascii;
    assert(!onlyOneOf['Δ']); // not ASCII and not lowercase
    assert(onlyOneOf['$']); // ASCII and not lowercase
    assert(!onlyOneOf['a']); // ASCII and lowercase
    assert(onlyOneOf['я']); // not ASCII but lowercase
    // throw away all cased letters from ASCII
    auto noLetters = ascii - (lower | upper);
    writeln(noLetters.length); // 128 - 26 * 2

opOpAssign(string op, U)(U rhs)
opBinaryRight(string op : "in", U)(U ch) const
ch in this set, the same as opIndex.

    assert('я' in unicode.Cyrillic);
    assert(!('z' in unicode.Cyrillic));

opUnary(string op : "!")();
byCodepoint();

    import std.algorithm.comparison : equal;
    import std.range : iota;

    auto set = unicode.ASCII;
    assert(set.byCodepoint.equal(iota(0, 0x80)));

toString(Writer)(scope Writer sink, ref scope const FormatSpec!char fmt);

    import std.conv : to;
    import std.format : format;
    import std.uni : unicode;

    // This was originally using Cyrillic script.
    // Unfortunately this is a pretty active range for changes,
    // and hence broke in an update.
    // Therefore the range Basic latin was used instead as it is
    // unlikely to ever change.
    writeln(unicode.InBasic_latin.to!string); // "[0..128)"
    // The specs '%s' and '%d' are equivalent to the to!string call above.
    writeln(format("%d", unicode.InBasic_latin)); // unicode.InBasic_latin.to!string
    writeln(format("%#x", unicode.InBasic_latin)); // "[0..0x80)"
    writeln(format("%#X", unicode.InBasic_latin)); // "[0..0X80)"

add()(uint a, uint b);

    CodepointSet someSet;
    someSet.add('0', '5').add('A', 'Z'+1);
    someSet.add('5', '9'+1);
    assert(someSet['0']);
    assert(someSet['5']);
    assert(someSet['9']);
    assert(someSet['Z']);

inverted();

    auto set = unicode.ASCII;
    // union with the inverse gets all of the code points in the Unicode
    writeln((set | set.inverted).length); // 0x110000
    // no intersection with the inverse
    assert((set & set.inverted).empty);

toSourceCode(string funcName = "");
funcName taking a single dchar argument. If funcName is empty the code is adjusted to be a lambda function.
Note
Use with care for relatively small or regular sets. It could end up being slower than just using multi-staged tables.
Example
    import std.stdio;

    // construct set directly from [a, b) intervals
    auto set = CodepointSet(10, 12, 45, 65, 100, 200);
    writeln(set);
    writeln(set.toSourceCode("func"));

The above outputs something along the lines of:

    bool func(dchar ch) @safe pure nothrow @nogc
    {
        if (ch < 45)
        {
            if (ch == 10 || ch == 11) return true;
            return false;
        }
        else if (ch < 65) return true;
        else
        {
            if (ch < 100) return false;
            if (ch < 200) return true;
            return false;
        }
    }

empty() const;

    CodepointSet emptySet;
    writeln(emptySet.length); // 0
    assert(emptySet.empty);

codepointSetTrie(sizes...) if (sumOfIntegerTuple!sizes == 21)
Note
The sum of sizes must be equal to 21.
Example
    {
        import std.stdio;
        auto set = unicode("Number");
        auto trie = codepointSetTrie!(8, 5, 8)(set);
        writeln("Input code points to test:");
        foreach (line; stdin.byLine)
        {
            int count = 0;
            foreach (dchar ch; line)
                if (trie[ch]) // is number
                    count++;
            writefln("Contains %d number code points.", count);
        }
    }

CodepointSetTrie(sizes...) if (sumOfIntegerTuple!sizes == 21)
codepointTrie(T, sizes...) if (sumOfIntegerTuple!sizes == 21)
CodepointTrie(T, sizes...) if (sumOfIntegerTuple!sizes == 21)
Note
Overload taking CodepointSets will naturally convert only to bool mapping Tries.
CodepointTrie is the type of Trie as generated by the codepointTrie function.
codepointTrie()(T[dchar] map, T defValue = T.init);
codepointTrie(R)(R range, T defValue = T.init)
MatcherConcept;
Note
For illustration purposes only, every method call results in assertion failure. Use utfMatcher to obtain a concrete matcher for UTF-8 or UTF-16 encodings.
match(Range)(ref Range inp)
skip(Range)(ref Range inp)
test(Range)(ref Range inp)
Performs the semantic equivalent of 2 operations: decoding a code point at the front of inp and testing if it belongs to the set of code points of this matcher.
The effect on inp depends on the kind of function called:
Match. If the code point is found in the set then range inp is advanced by its size in code units, otherwise the range is not modified.
Skip. The range is always advanced by the size of the tested code point regardless of the result of test.
Test. The range is left unaffected regardless of the result of test.
    string truth = "2² = 4";
    auto m = utfMatcher!char(unicode.Number);
    assert(m.match(truth)); // '2' is a number all right
    assert(truth == "² = 4"); // skips on match
    assert(m.match(truth)); // so is the superscript '2'
    assert(!m.match(truth)); // space is not a number
    assert(truth == " = 4"); // unaffected on no match
    assert(!m.skip(truth)); // same test ...
    assert(truth == "= 4"); // but skips a codepoint regardless
    assert(!m.test(truth)); // '=' is not a number
    assert(truth == "= 4"); // test never affects argument

subMatcher(Lengths...)();
isUtfMatcher(M, C);
utfMatcher(Char, Set)(Set set)
set for encoding that has Char as code unit.
toTrie(size_t level, Set)(Set set)
set of code points.
Level 1 is fastest and the most memory hungry (a bit array).
Level 4 is the slowest and has the smallest footprint.
See the Synopsis section for example.
Note
Level 4 stays very practical (being faster and more predictable) compared to using direct lookup on the set itself.
toDelegate(Set)(Set set)
Builds a Trie with typically optimal speed-size trade-off and wraps it into a delegate of the following type: bool delegate(dchar ch).
Effectively this creates a 'tester' lambda suitable for algorithms like std.algorithm.find that take unary predicates.
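As a brief sketch of using such a 'tester' with another algorithm, the snippet below counts matching code points with std.algorithm.searching.count. The function countGreek is a hypothetical name used only for illustration.

```d
import std.algorithm.searching : count;
import std.uni;

/// Hypothetical helper: number of Greek-script code points in `text`.
size_t countGreek(string text)
{
    // built once, then reused for every code point of the input
    bool delegate(dchar) isGreek = toDelegate(unicode.Greek);
    return text.count!(ch => isGreek(ch));
}

void main()
{
    assert(countGreek("αβγ abc") == 3);
    assert(countGreek("plain ascii") == 0);
}
```

Building the delegate once and reusing it matters here, since construction builds a trie while each call afterwards is a cheap table lookup.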
See the Synopsis section for example.
unicode;
opDispatch(string name)();
opCall(C)(scope const C[] name)
name is not known beforehand; otherwise compile-time checked opDispatch is typically a better choice.
block;
Note
Here block names are unambiguous as no scripts are searched and thus to search use simply unicode.block.BlockName notation.

    // use .block for explicitness
    writeln(unicode.block.Greek_and_Coptic); // unicode.InGreek_and_Coptic

script;

    auto arabicScript = unicode.script.arabic;
    auto arabicBlock = unicode.block.arabic;
    // there is an intersection between script and block
    // (U+0628, Arabic letter beh, belongs to both)
    assert(arabicBlock['\u0628']);
    assert(arabicScript['\u0628']);
    // but they are different
    assert(arabicBlock != arabicScript);
    writeln(arabicBlock); // unicode.inArabic
    writeln(arabicScript); // unicode.arabic

hangulSyllableType;

    // L here is syllable type not Letter as in unicode.L short-cut
    auto leadingVowel = unicode.hangulSyllableType("L");
    // check that some leading vowels are present
    foreach (vowel; '\u1110'..'\u115F')
        assert(leadingVowel[vowel]);
    writeln(leadingVowel); // unicode.hangulSyllableType.L

parseSet(Range)(ref Range range, bool casefold = false)
range using standard regex syntax '[...]'. The range is advanced skipping over regex set definition. casefold parameter determines if the set should be casefolded, that is, include both lower and upper case versions for any letters in the set.
graphemeStride(C)(scope const C[] input, size_t index)
index. Both the resulting length and the index are measured in code units.
| C | type that is implicitly convertible to dchars |
| C[] input | array of grapheme clusters |
| size_t index | starting index into input[] |

    writeln(graphemeStride("  ", 1)); // 1
    // A + combining ring above
    string city = "A\u030Arhus";
    size_t first = graphemeStride(city, 0);
    assert(first == 3); // \u030A has 2 UTF-8 code units
    writeln(city[0 .. first]); // "A\u030A"
    writeln(city[first .. $]); // "rhus"
decodeGrapheme(Input)(ref Input inp)
inp.
Note
This function modifies inp and thus inp must be an L-value.
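A small sketch of decodeGrapheme consuming its argument. The helper firstClusterLength is a hypothetical name; it returns the number of code points in the first cluster and leaves the remainder in the caller's string.

```d
import std.uni;

/// Hypothetical helper: removes the first grapheme cluster from `text`
/// and returns how many code points it contained.
size_t firstClusterLength(ref string text)
{
    auto g = decodeGrapheme(text); // advances `text` past the cluster
    return g.length;
}

void main()
{
    string s = "e\u0301x"; // 'e' + combining acute accent, then 'x'
    assert(firstClusterLength(s) == 2);
    assert(s == "x");      // the cluster was consumed
    assert(firstClusterLength(s) == 1);
    assert(s.length == 0);
}
```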
popGrapheme(Input)(ref Input inp)
inp, but doesn't return it. Instead returns the number of code units read. This differs from number of code points read only if input is an autodecodable string.
Note
This function modifies inp and thus inp must be an L-value.

    // Two Union Jacks of the Great Britain in each
    string s = "\U0001F1EC\U0001F1E7\U0001F1EC\U0001F1E7";
    wstring ws = "\U0001F1EC\U0001F1E7\U0001F1EC\U0001F1E7";
    dstring ds = "\U0001F1EC\U0001F1E7\U0001F1EC\U0001F1E7";
    // String pop length in code units, not points.
    writeln(s.popGrapheme()); // 8
    writeln(ws.popGrapheme()); // 4
    writeln(ds.popGrapheme()); // 2
    writeln(s); // "\U0001F1EC\U0001F1E7"
    writeln(ws); // "\U0001F1EC\U0001F1E7"
    writeln(ds); // "\U0001F1EC\U0001F1E7"

    import std.algorithm.comparison : equal;
    import std.algorithm.iteration : filter;
    // Also works for non-random access ranges as long as the
    // character type is 32-bit.
    auto testPiece = "\r\nhello!"d.filter!(x => !x.isAlpha);
    // Windows-style line ending is two code points in a single grapheme.
    writeln(testPiece.popGrapheme()); // 2
    assert(testPiece.equal("!"d));

byGrapheme(Range)(Range range)
Iterate a string by Grapheme.
Useful for doing string manipulation that needs to be aware of graphemes.
    import std.algorithm.comparison : equal;
    import std.range.primitives : walkLength;
    import std.range : take, drop;

    auto text = "noe\u0308l"; // noël using e + combining diaeresis
    assert(text.walkLength == 5); // 5 code points
    auto gText = text.byGrapheme;
    assert(gText.walkLength == 4); // 4 graphemes
    assert(gText.take(3).equal("noe\u0308".byGrapheme));
    assert(gText.drop(3).equal("l".byGrapheme));

byCodePoint(Range)(Range range)
byCodePoint(Range)(Range range)
Lazily transform a range of Graphemes to a range of code points.
Useful for converting the result to a string after doing operations on graphemes.
If passed in a range of code points, returns a range with equivalent capabilities.
    import std.array : array;
    import std.conv : text;
    import std.range : retro;

    string s = "noe\u0308l"; // noël
    // reverse it and convert the result to a string
    string reverse = s.byGrapheme
        .array
        .retro
        .byCodePoint
        .text;
    assert(reverse == "le\u0308on"); // lëon

Grapheme;
A structure designed to effectively pack characters of a grapheme cluster.
Grapheme has value semantics so 2 copies of a Grapheme always refer to distinct objects. In most actual scenarios a Grapheme fits on the stack and avoids memory allocation overhead for all but quite long clusters.
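The value semantics can be seen in a short sketch: modifying a copy leaves the original untouched. The helper withMark is a hypothetical name introduced for this illustration.

```d
import std.algorithm.comparison : equal;
import std.uni;

/// Hypothetical helper: takes a Grapheme by value, appends a mark to
/// the local copy and returns it. The caller's object is not affected.
Grapheme withMark(Grapheme g, dchar mark)
{
    g ~= mark;
    return g;
}

void main()
{
    auto original = Grapheme("e\u0301");
    auto extended = withMark(original, '\u0323'); // combining dot below

    assert(original.length == 2);
    assert(extended.length == 3);
    assert(original[].equal("e\u0301"));
    assert(extended[].equal("e\u0301\u0323"));
}
```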
chars...)seq)
opIndex(size_t index) const;
opIndexAssign(dchar ch, size_t index);
ch at given index in this cluster.
Warning
Use of this facility may invalidate grapheme cluster, see also Grapheme.valid.

    auto g = Grapheme("A\u0302");
    writeln(g[0]); // 'A'
    assert(g.valid);
    g[1] = '~'; // ASCII tilde is not a combining mark
    writeln(g[1]); // '~'
    assert(!g.valid);

opSlice(size_t a, size_t b) return;
opSlice() return;
Warning
Invalidates when this Grapheme leaves the scope, attempts to use it then would lead to memory corruption.
length() const;
opOpAssign(string op)(dchar ch);
ch to this grapheme.
Warning
Use of this facility may invalidate grapheme cluster, see also valid.

    import std.algorithm.comparison : equal;

    auto g = Grapheme("A");
    assert(g.valid);
    g ~= '\u0301';
    assert(g[].equal("A\u0301"));
    assert(g.valid);
    g ~= "B";
    // not a valid grapheme cluster anymore
    assert(!g.valid);
    // still could be useful though
    assert(g[].equal("A\u0301B"));

opOpAssign(string op, Input)(scope Input inp)
inp to this Grapheme.
valid()();
sicmp(S1, S2)(scope S1 r1, scope S2 r2)
Does basic case-insensitive comparison of r1 and r2. This function uses a simpler comparison rule thus achieving better performance than icmp. However keep in mind the warning below.
| S1 r1 | an input range of characters |
| S2 r2 | an input range of characters |
r1 is lexicographically "less" than r2, >0 if r1 is lexicographically "greater" than r2
Warning
This function only handles 1:1 code point mapping and thus is not sufficient for certain alphabets like German, Greek and few others.

    writeln(sicmp("Август", "авгусТ")); // 0
    // Greek also works as long as there is no 1:M mapping in sight
    writeln(sicmp("ΌΎ", "όύ")); // 0
    // things like the following won't get matched as equal
    // Greek small letter iota with dialytika and tonos
    assert(sicmp("ΐ", "\u03B9\u0308\u0301") != 0);
    // while icmp has no problem with that
    writeln(icmp("ΐ", "\u03B9\u0308\u0301")); // 0
    writeln(icmp("ΌΎ", "όύ")); // 0

icmp(S1, S2)(S1 r1, S2 r2)
r1 and r2. Follows the rules of full case-folding mapping. This includes matching as equal German ß with "ss" and other 1:M code point mappings unlike sicmp. The cost of icmp being pedantically correct is slightly worse performance.
| S1 r1 | a forward range of characters |
| S2 r2 | a forward range of characters |

    writeln(icmp("Rußland", "Russland")); // 0
    writeln(icmp("ᾩ -> \u1F70\u03B9", "\u1F61\u03B9 -> ᾲ")); // 0

icmp @safe @nogc nothrow pure.

    import std.utf : byDchar;

    writeln(icmp("Rußland".byDchar, "Russland".byDchar)); // 0
    writeln(icmp("ᾩ -> \u1F70\u03B9".byDchar, "\u1F61\u03B9 -> ᾲ".byDchar)); // 0
combiningClass(dcharch);Returns thecombining class ofch.
// shorten the codealias CC =combiningClass;// combining tildawriteln(CC('\u0303'));// 230// combining ring belowwriteln(CC('\u0325'));// 220// the simple consequence is that "tilda" should be// placed after a "ring below" in a sequence
UnicodeDecomposition: int;
Canonical
Compatibility
Note: Compatibility decomposition is a lossy conversion, typically suitable only for fuzzy matching and internal processing.
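compose(dchar first, dchar second);
Returns the composed character if first and second compose, and dchar.init otherwise. The assumption is that first comes before second in the original text, usually meaning that first is a starter.
Note: Hangul syllables are not covered by this function. See composeJamo below.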
    writeln(compose('A', '\u0308')); // '\u00C4'
    writeln(compose('A', 'B')); // dchar.init
    writeln(compose('C', '\u0301')); // '\u0106'
    // note that the starter is the first one
    // thus the following doesn't compose
    writeln(compose('\u0308', 'A')); // dchar.init
decompose(UnicodeDecomposition decompType = Canonical)(dchar ch);
Returns the Canonical (by default) or Compatibility decomposition of ch. If no decomposition is available, returns a Grapheme containing ch itself.
Note: This function also decomposes Hangul syllables as prescribed by the standard.
    import std.algorithm.comparison : equal;
    writeln(compose('A', '\u0308')); // '\u00C4'
    writeln(compose('A', 'B')); // dchar.init
    writeln(compose('C', '\u0301')); // '\u0106'
    // note that the starter is the first one
    // thus the following doesn't compose
    writeln(compose('\u0308', 'A')); // dchar.init
    assert(decompose('Ĉ')[].equal("C\u0302"));
    assert(decompose('D')[].equal("D"));
    assert(decompose('\uD4DC')[].equal("\u1111\u1171\u11B7"));
    assert(decompose!Compatibility('¹')[].equal("1"));
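Since compose and decompose are inverse operations for simple starter-plus-mark pairs, a round trip can illustrate how they fit together. The sketch below is an illustration only: recompose is a hypothetical helper, and it deliberately handles just the case where the canonical decomposition has exactly two code points.

```d
import std.algorithm.comparison : equal;
import std.uni;

/// Hypothetical helper (not part of std.uni): decomposes ch canonically
/// and composes it again, but only when the decomposition is a pair.
dchar recompose(dchar ch)
{
    auto g = decompose(ch);
    if (g.length != 2)
        return dchar.init; // nothing to compose, or more than a simple pair
    return compose(g[0], g[1]);
}

void main()
{
    assert(decompose('\u00C4')[].equal("A\u0308"));
    assert(recompose('\u00C4') == '\u00C4'); // round trip succeeds
    assert(recompose('D') == dchar.init);    // 'D' has no decomposition
}
```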
decomposeHangul(dchar ch);
Decomposes a Hangul syllable. If ch is not a composed syllable, this function returns a Grapheme containing only ch as is.
    import std.algorithm.comparison : equal;
    assert(decomposeHangul('\uD4DB')[].equal("\u1111\u1171\u11B6"));
composeJamo(dchar lead, dchar vowel, dchar trailing = (dchar).init);
Tries to compose a Hangul syllable out of a leading consonant (lead), a vowel and an optional trailing consonant jamo. If lead or vowel is not a valid Hangul jamo of the respective character class, returns dchar.init.
    writeln(composeJamo('\u1111', '\u1171', '\u11B6')); // '\uD4DB'
    // leaving out the trailing consonant, or passing any code point
    // that is not a trailing consonant, composes an LV-syllable
    writeln(composeJamo('\u1111', '\u1171')); // '\uD4CC'
    writeln(composeJamo('\u1111', '\u1171', ' ')); // '\uD4CC'
    writeln(composeJamo('\u1111', 'A')); // dchar.init
    writeln(composeJamo('A', '\u1171')); // dchar.init
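decomposeHangul and composeJamo can be combined into a round trip, which may help when checking jamo-level processing. The following sketch assumes a hypothetical helper named recomposeHangul; it passes non-syllables through unchanged, which is a design choice of the sketch and not behaviour of std.uni.

```d
import std.uni;

/// Hypothetical helper (not part of std.uni): splits a Hangul syllable
/// into jamo and puts it back together.
dchar recomposeHangul(dchar syllable)
{
    auto jamo = decomposeHangul(syllable);
    if (jamo.length == 3) // LVT syllable
        return composeJamo(jamo[0], jamo[1], jamo[2]);
    if (jamo.length == 2) // LV syllable
        return composeJamo(jamo[0], jamo[1]);
    return syllable;      // not a composed syllable, returned as is
}

void main()
{
    assert(recomposeHangul('\uD4DB') == '\uD4DB'); // LVT
    assert(recomposeHangul('\uD4CC') == '\uD4CC'); // LV
    assert(recomposeHangul('A') == 'A');
}
```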
NormalizationForm: int;
NFC
NFD
NFKC
NFKD
normalize(NormalizationForm norm = NFC, C)(return scope inout(C)[] input);
Returns the input string normalized to the chosen form. Form C is used by default.
Note: In cases where the string in question is already normalized, it is returned unmodified and no memory allocation happens.
    // any encoding works
    wstring greet = "Hello world";
    assert(normalize(greet) is greet); // the same exact slice
    // An example of a character with all 4 forms being different:
    // Greek upsilon with acute and hook symbol (code point 0x03D3)
    writeln(normalize!NFC("ϓ")); // "\u03D3"
    writeln(normalize!NFD("ϓ")); // "\u03D2\u0301"
    writeln(normalize!NFKC("ϓ")); // "\u038E"
    writeln(normalize!NFKD("ϓ")); // "\u03A5\u0301"
allowedIn(NormalizationForm norm)(dchar ch);
Returns whether ch is always allowed (Quick_Check=YES) in normalization form norm.
    // e.g. Cyrillic is always allowed, so is ASCII
    assert(allowedIn!NFC('я'));
    assert(allowedIn!NFD('я'));
    assert(allowedIn!NFKC('я'));
    assert(allowedIn!NFKD('я'));
    assert(allowedIn!NFC('Z'));
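One possible use of allowedIn is a cheap pre-scan before normalizing. The sketch below assumes a hypothetical helper, quickCheckNFD, that only tests the per-code-point property; it does not verify the ordering of combining marks, so a true result is a hint, not proof, that the string is already in NFD.

```d
import std.algorithm.searching : all;
import std.uni;

/// Hypothetical helper (not part of std.uni): true if every code point
/// of s has Quick_Check=YES for NFD. Mark ordering is NOT checked.
bool quickCheckNFD(string s)
{
    return s.all!(c => allowedIn!NFD(c));
}

void main()
{
    assert(quickCheckNFD("Hello, мир"));
    // U+00C4 has a canonical decomposition, so it is never allowed in NFD
    assert(!quickCheckNFD("\u00C4ffin"));
}
```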
isWhite(dchar c);
Returns whether c is a Unicode whitespace character (general Unicode category: part of C0 (tab, vertical tab, form feed, carriage return, and linefeed characters), Zs, Zl, Zp, and NEL (U+0085)).
isLower(dchar c);
Returns whether c is a Unicode lowercase character.
isUpper(dchar c);
Returns whether c is a Unicode uppercase character.
asLowerCase(Range)(Range str)
asUpperCase(Range)(Range str)
Range str | string or range of characters |
    import std.algorithm.comparison : equal;
    assert("hEllo".asUpperCase.equal("HELLO"));
asCapitalized(Range)(Range str)
Range str | string or range of characters |
    import std.algorithm.comparison : equal;
    assert("hEllo".asCapitalized.equal("Hello"));
toLowerInPlace(C)(ref C[] s)
Converts s to lowercase (by performing Unicode lowercase mapping) in place. For a few characters the string length may increase after the transformation; in such a case the function reallocates exactly once. If s does not have any uppercase characters, then s is unaltered.
toUpperInPlace(C)(ref C[] s)
Converts s to uppercase (by performing Unicode uppercase mapping) in place. For a few characters the string length may increase after the transformation; in such a case the function reallocates exactly once. If s does not have any lowercase characters, then s is unaltered.
toLower(dchar c);
If c is a Unicode uppercase character, then its lowercase equivalent is returned. Otherwise c is returned.
Warning: Certain alphabets like German and Greek have no 1:1 upper-lower mapping. Use the overload of toLower which takes a full string instead.
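The in-place conversions described above operate on mutable arrays, so an immutable string literal has to be duplicated first. A minimal sketch, with illustrative variable names:

```d
import std.uni;

void main()
{
    // .dup yields a mutable char[] copy of the literal
    char[] s = "HeLLo Wörld".dup;
    toLowerInPlace(s);
    assert(s == "hello wörld");

    char[] t = "hello".dup;
    toUpperInPlace(t);
    assert(t == "HELLO");

    // Input with nothing to convert is left unaltered
    char[] u = "already lower".dup;
    toLowerInPlace(u);
    assert(u == "already lower");
}
```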
toLower(S)(return scope S s)
toLower(S)(S s)
Returns a string which is identical to s except that all of its characters are converted to lowercase (by performing Unicode lowercase mapping). If none of the characters of s were affected, then s itself is returned, provided s is a string-like type.
S s | A random access range of characters |
toUpper(dchar c);
If c is a Unicode lowercase character, then its uppercase equivalent is returned. Otherwise c is returned.
Warning: Certain alphabets like German and Greek have no 1:1 upper-lower mapping. Use the overload of toUpper which takes a full string instead.
toUpper can be used as an argument to std.algorithm.iteration.map to produce an algorithm that can convert a range of characters to upper case without allocating memory. A string can then be produced by using std.algorithm.mutation.copy to send it to a std.array.appender.
    import std.algorithm.iteration : map;
    import std.algorithm.mutation : copy;
    import std.array : appender;
    auto abuf = appender!(char[])();
    "hello".map!toUpper.copy(abuf);
    writeln(abuf.data); // "HELLO"
toUpper(S)(return scope S s)
toUpper(S)(S s)
Returns a string which is identical to s except that all of its characters are converted to uppercase (by performing Unicode uppercase mapping). If none of the characters of s were affected, then s itself is returned, provided s is a string-like type.
S s | A random access range of characters |
isAlpha(dchar c);
Returns whether c is a Unicode alphabetic character (general Unicode category: Alphabetic).
isMark(dchar c);
Returns whether c is a Unicode mark (general Unicode category: Mn, Me, Mc).
isNumber(dchar c);
Returns whether c is a Unicode numerical character (general Unicode category: Nd, Nl, No).
isAlphaNum(dchar c);
Returns whether c is a Unicode alphabetic character or number (general Unicode category: Alphabetic, Nd, Nl, No).
dchar c | any Unicode character |
isPunctuation(dchar c);
Returns whether c is a Unicode punctuation character (general Unicode category: Pd, Ps, Pe, Pc, Po, Pi, Pf).
isSymbol(dchar c);
Returns whether c is a Unicode symbol character (general Unicode category: Sm, Sc, Sk, So).
isSpace(dchar c);
Returns whether c is a Unicode space character (general Unicode category: Zs).
isGraphical(dchar c);
Returns whether c is a Unicode graphical character (general Unicode category: L, M, N, P, S, Zs).
isControl(dchar c);
Returns whether c is a Unicode control character (general Unicode category: Cc).
isFormat(dchar c);
Returns whether c is a Unicode formatting character (general Unicode category: Cf).
isPrivateUse(dchar c);
Returns whether c is a Unicode Private Use code point (general Unicode category: Co).
isSurrogate(dchar c);
Returns whether c is a Unicode surrogate code point (general Unicode category: Cs).
isSurrogateHi(dchar c);
Returns whether c is a Unicode high surrogate (lead surrogate).
isSurrogateLo(dchar c);
Returns whether c is a Unicode low surrogate (trail surrogate).
isNonCharacter(dchar c);
Returns whether c is a Unicode non-character, i.e. a code point with no assigned abstract character (general Unicode category: Cn).
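The classification predicates above can be combined to profile a piece of text. The following is a sketch under stated assumptions: the Counts struct and the classify function are hypothetical names introduced for illustration, and the order of the tests is an arbitrary choice of the sketch.

```d
import std.uni;

/// Hypothetical result type (not part of std.uni)
struct Counts
{
    size_t letters, numbers, punctuation, spaces;
}

/// Hypothetical helper: tallies code points by broad category.
Counts classify(string text)
{
    Counts c;
    // foreach with dchar decodes the UTF-8 string into code points
    foreach (dchar ch; text)
    {
        if (isAlpha(ch)) ++c.letters;
        else if (isNumber(ch)) ++c.numbers;
        else if (isPunctuation(ch)) ++c.punctuation;
        else if (isWhite(ch)) ++c.spaces;
    }
    return c;
}

void main()
{
    auto c = classify("Привет, мир 42!");
    assert(c == Counts(9, 2, 2, 2));
}
```

Iterating with an explicit dchar loop variable matters here: the predicates take code points, not UTF-8 code units.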