Library Reference

version 2.112.0


std.uni

The std.uni module provides an implementation of fundamental Unicode algorithms and data structures. This doesn't include UTF encoding and decoding primitives; see std.utf.decode and std.utf.encode in std.utf for this functionality.

Category | Functions
Decode | byCodePoint, byGrapheme, decodeGrapheme, graphemeStride, popGrapheme
Comparison | icmp, sicmp
Classification | isAlpha, isAlphaNum, isCodepointSet, isControl, isFormat, isGraphical, isIntegralPair, isMark, isNonCharacter, isNumber, isPrivateUse, isPunctuation, isSpace, isSurrogate, isSurrogateHi, isSurrogateLo, isSymbol, isWhite
Normalization | NFC, NFD, NFKD, NormalizationForm, normalize
Decompose | decompose, decomposeHangul, UnicodeDecomposition
Compose | compose, composeJamo
Sets | CodepointInterval, CodepointSet, InversionList, unicode
Trie | codepointSetTrie, CodepointSetTrie, codepointTrie, CodepointTrie, toTrie, toDelegate
Casing | asCapitalized, asLowerCase, asUpperCase, isLower, isUpper, toLower, toLowerInPlace, toUpper, toUpperInPlace
Utf8Matcher | isUtfMatcher, MatcherConcept, utfMatcher
Separators | lineSep, nelSep, paraSep
Building blocks | allowedIn, combiningClass, Grapheme

All primitives listed operate on Unicode characters and sets of characters. For functions which operate on ASCII characters and ignore Unicode characters, see std.ascii. For definitions of Unicode character, code point and other terms used throughout this module see the terminology section below.

The focus of this module is the core needs of developing Unicode-aware applications. To that effect it provides the following optimized primitives:

It's recognized that an application may need further enhancements and extensions, such as less commonly known algorithms, or tailoring existing ones for region specific needs. To help users with building any extra functionality beyond the core primitives, the module provides:

  • CodepointSet, a type for easy manipulation of sets of characters. Besides the typical set algebra it provides an unusual feature: a D source code generator for detection of code points in this set. This is a boon for meta-programming parser frameworks, and is used internally to power classification in small sets like isWhite.
  • A way to construct optimal packed multi-stage tables, also known as a special case of Trie. The functions codepointTrie and codepointSetTrie construct custom tries that map dchar to value. The end result is a fast and predictable O(1) lookup that powers functions like isAlpha and combiningClass, but for user-defined data sets.
  • A useful technique for Unicode-aware parsers that perform character classification of encoded code points is to avoid unnecessary decoding at all costs. utfMatcher provides an improvement over the usual workflow of decode-classify-process, combining the decoding and classification steps. By extracting necessary bits directly from encoded code units, matchers achieve significant performance improvements. See MatcherConcept for the common interface of UTF matchers.
  • Generally useful building blocks for customized normalization: combiningClass for querying combining class and allowedIn for testing the Quick_Check property of a given normalization form.
  • Access to a large selection of commonly used sets of code points. Supported sets include Script, Block and General Category. The exact contents of a set can be observed in the CLDR utility, on the property index page of the Unicode website. See unicode for easy and (optionally) compile-time checked set queries.
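
The normalization building blocks mentioned above can be combined in a few lines. The following is a minimal sketch, not part of the module: the helper name isCanonicallyOrdered is hypothetical, and it only checks the ordering rule (no mark may follow a mark of higher non-zero combining class), not full normalization.

```d
import std.uni : allowedIn, combiningClass, NFC, NFD;

/// Hypothetical helper: true if no combining mark follows a mark with a
/// higher non-zero combining class, as canonical ordering requires.
bool isCanonicallyOrdered(const(dchar)[] text)
{
    ubyte previous = 0;
    foreach (dchar ch; text)
    {
        immutable ubyte current = combiningClass(ch);
        // Class 0 resets the run; only non-zero classes are compared.
        if (current != 0 && current < previous)
            return false;
        previous = current;
    }
    return true;
}

unittest
{
    // U+0323 (dot below) has class 220, U+0308 (diaeresis) has class 230.
    assert(isCanonicallyOrdered("a\u0323\u0308"d));
    assert(!isCanonicallyOrdered("a\u0308\u0323"d));

    // U+00C4 is allowed in NFC, but it decomposes, so it is not allowed in NFD.
    assert(allowedIn!NFC('\u00C4'));
    assert(!allowedIn!NFD('\u00C4'));
}
```

A check like this could serve as a cheap pre-test before deciding whether a string needs a full normalize call.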

Synopsis

    import std.algorithm.searching : find;
    import std.uni;

    void main()
    {
        // initialize code point sets using script/block or property name
        // now 'set' contains code points from both scripts.
        auto set = unicode("Cyrillic") | unicode("Armenian");
        // same thing but simpler and checked at compile-time
        auto ascii = unicode.ASCII;
        auto currency = unicode.Currency_Symbol;

        // easy set ops
        auto a = set & ascii;
        assert(a.empty); // as it has no intersection with ascii
        a = set | ascii;
        auto b = currency - a; // subtract all ASCII, Cyrillic and Armenian

        // some properties of code point sets
        assert(b.length > 45); // 46 items in Unicode 6.1, even more in 6.2
        // testing presence of a code point in a set
        // is just fine, it is O(logN)
        assert(!b['$']);
        assert(!b['\u058F']); // Armenian dram sign
        assert(b['¥']);

        // building fast lookup tables, these guarantee O(1) complexity
        // 1-level Trie lookup table essentially a huge bit-set ~262Kb
        auto oneTrie = toTrie!1(b);
        // 2-level far more compact but typically slightly slower
        auto twoTrie = toTrie!2(b);
        // 3-level even smaller, and a bit slower yet
        auto threeTrie = toTrie!3(b);
        assert(oneTrie['£']);
        assert(twoTrie['£']);
        assert(threeTrie['£']);

        // build the trie with the most sensible trie level
        // and bind it as a functor
        auto cyrillicOrArmenian = toDelegate(set);
        auto balance = find!(cyrillicOrArmenian)("Hello ընկեր!");
        assert(balance == "ընկեր!");
        // compatible with bool delegate(dchar)
        bool delegate(dchar) bindIt = cyrillicOrArmenian;

        // Normalization
        string s = "Plain ascii (and not only), is always normalized!";
        assert(s is normalize(s)); // is the same string

        string nonS = "A\u0308ffin"; // 'A' followed by a combining diaeresis
        auto nS = normalize(nonS); // to NFC, the W3C endorsed standard
        assert(nS == "Äffin");
        assert(nS != nonS);
        string composed = "Äffin";

        assert(normalize!NFD(composed) == "A\u0308ffin");
        // to NFKD, compatibility decomposition useful for fuzzy matching/searching
        assert(normalize!NFKD("2¹⁰") == "210");
    }

Terminology

The following is a list of important Unicode notions and definitions. Any conventions used specifically in this module alone are marked as such. The descriptions are based on the formal definition as found in chapter three of The Unicode Standard Core Specification.

Abstract character
A unit of information used for the organization, control, or representation of textual data. Note that:
  • When representing data, the nature of that data is generally symbolic as opposed to some other kind of data (for example, visual).
  • An abstract character has no concrete form and should not be confused with a glyph.
  • An abstract character does not necessarily correspond to what a user thinks of as a “character” and should not be confused with a Grapheme.
  • The abstract characters encoded (see Encoded character) are known as Unicode abstract characters.
  • Abstract characters not directly encoded by the Unicode Standard can often be represented by the use of combining character sequences.

Canonical decomposition
The decomposition of a character or character sequence that results from recursively applying the canonical mappings found in the Unicode Character Database and those described in Conjoining Jamo Behavior (section 12 of Unicode Conformance).

Canonical composition
The precise definition of the Canonical composition is the algorithm as specified in Unicode Conformance section 11. Informally it's the process that does the reverse of the canonical decomposition with the addition of certain rules that e.g. prevent legacy characters from appearing in the composed result.

Canonical equivalent
Two character sequences are said to be canonical equivalents if their full canonical decompositions are identical.

Character
Typically differs by context. For the purpose of this documentation the term character implies encoded character, that is, a code point having an assigned abstract character (a symbolic meaning).

Code point
Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF (hex). Not all code points are assigned to encoded characters.

Code unit
The minimal bit combination that can represent a unit of encoded text for processing or interchange. Depending on the encoding this could be: 8-bit code units in UTF-8 (char), 16-bit code units in UTF-16 (wchar), and 32-bit code units in UTF-32 (dchar). Note that in UTF-32, a code unit is a code point and is represented by the D dchar type.
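
To make the distinction concrete, here is a small sketch that counts the code units one string needs in each encoding. The helper name codeUnitCounts is hypothetical; the conversions use std.conv.to.

```d
import std.conv : to;

/// Hypothetical helper: code units needed for `s` in UTF-8, UTF-16 and UTF-32.
size_t[3] codeUnitCounts(string s)
{
    return [s.length, s.to!wstring.length, s.to!dstring.length];
}

unittest
{
    // U+10330 GOTHIC LETTER AHSA lies outside the Basic Multilingual Plane.
    auto gothic = codeUnitCounts("\U00010330");
    assert(gothic[0] == 4 && gothic[1] == 2 && gothic[2] == 1);

    // An ASCII letter needs exactly one code unit in every encoding.
    auto ascii = codeUnitCounts("D");
    assert(ascii[0] == 1 && ascii[1] == 1 && ascii[2] == 1);
}
```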

Combining character
A character with the General Category of Combining Mark (M).
  • All characters with non-zero canonical combining class are combining characters, but the reverse is not the case: there are combining characters with a zero combining class.
  • These characters are not normally used in isolation unless they are being described. They include such characters as accents, diacritics, Hebrew points, Arabic vowel signs, and Indic matras.

Combining class
A numerical value used by the Unicode Canonical Ordering Algorithm to determine which sequences of combining marks are to be considered canonically equivalent and which are not.

Compatibility decomposition
The decomposition of a character or character sequence that results from recursively applying both the compatibility mappings and the canonical mappings found in the Unicode Character Database, and those described in Conjoining Jamo Behavior, until no characters can be further decomposed.

Compatibility equivalent
Two character sequences are said to be compatibility equivalents if their full compatibility decompositions are identical.

Encoded character
An association (or mapping) between an abstract character and a code point.

Glyph
The actual, concrete image of a glyph representation having been rasterized or otherwise imaged onto some display surface.

Grapheme base
A character with the property Grapheme_Base, or any standard Korean syllable block.

Grapheme cluster
Defined as the text between grapheme boundaries as specified by Unicode Standard Annex #29, Unicode text segmentation. Important general properties of a grapheme:
  • The grapheme cluster represents a horizontally segmentable unit of text, consisting of some grapheme base (which may consist of a Korean syllable) together with any number of nonspacing marks applied to it.
  • A grapheme cluster typically starts with a grapheme base and then extends across any subsequent sequence of nonspacing marks. A grapheme cluster is most directly relevant to text rendering and processes such as cursor placement and text selection in editing, but may also be relevant to comparison and searching.
  • For many processes, a grapheme cluster behaves as if it was a single character with the same properties as its grapheme base. Effectively, nonspacing marks apply graphically to the base, but do not change its properties.

This module defines a number of primitives that work with graphemes: Grapheme, decodeGrapheme and graphemeStride. All of them use extended grapheme boundaries as defined in the aforementioned standard annex.
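
As a short illustration of how code units, code points and grapheme clusters differ for the same text, consider the sketch below. The helper name graphemeCount is hypothetical; byGrapheme and graphemeStride are the module's own primitives.

```d
import std.range.primitives : walkLength;
import std.uni : byGrapheme, graphemeStride;

/// Hypothetical helper: number of user-perceived characters in `s`.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

unittest
{
    string s = "A\u0308ffin"; // 'A' + combining diaeresis, then "ffin"
    assert(s.length == 7);             // UTF-8 code units
    assert(s.walkLength == 6);         // code points
    assert(graphemeCount(s) == 5);     // grapheme clusters
    assert(graphemeStride(s, 0) == 3); // first cluster spans 3 code units
}
```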

Nonspacing mark
A combining character with the General Category of Nonspacing Mark (Mn) or Enclosing Mark (Me).

Spacing mark
A combining character that is not a nonspacing mark.

Normalization

The concepts of canonical equivalent or compatibility equivalent characters in the Unicode Standard make it necessary to have a full, formal definition of equivalence for Unicode strings. String equivalence is determined by a process called normalization, whereby strings are converted into forms which are compared directly for identity. This is the primary goal of the normalization process; see the function normalize to convert into any of the four defined forms.

A very important attribute of the Unicode Normalization Forms is that they must remain stable between versions of the Unicode Standard. A Unicode string normalized to a particular Unicode Normalization Form in one version of the standard is guaranteed to remain in that Normalization Form for implementations of future versions of the standard.

The Unicode Standard specifies four normalization forms. Informally, two of these forms are defined by maximal decomposition of equivalent sequences, and two of these forms are defined by maximal composition of equivalent sequences.

The choice of the normalization form depends on the particular use case. NFC is the best form for general text, since it's more compatible with strings converted from legacy encodings. NFKC is the preferred form for identifiers, especially where there are security concerns. NFD and NFKD are the most useful for internal processing.
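
The equivalence tests described in this section can be sketched by normalizing both sides to the same form before comparing. The helper names below are hypothetical; normalize and the NFD, NFKD and NFKC constants come from this module.

```d
import std.uni : normalize, NFD, NFKC, NFKD;

/// Hypothetical helper: compares the full canonical decompositions.
bool canonicallyEquivalent(string a, string b)
{
    return normalize!NFD(a) == normalize!NFD(b);
}

/// Hypothetical helper: compares the full compatibility decompositions.
bool compatibilityEquivalent(string a, string b)
{
    return normalize!NFKD(a) == normalize!NFKD(b);
}

unittest
{
    // Precomposed U+00C4 versus 'A' followed by U+0308.
    assert(canonicallyEquivalent("\u00C4ffin", "A\u0308ffin"));

    // U+FB01 LATIN SMALL LIGATURE FI is only a compatibility equivalent of "fi".
    assert(!canonicallyEquivalent("\uFB01", "fi"));
    assert(compatibilityEquivalent("\uFB01", "fi"));
    assert(normalize!NFKC("\uFB01") == "fi");
}
```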

Construction of lookup tables

The Unicode standard describes a set of algorithms that depend on having the ability to quickly look up various properties of a code point. Given the codespace of about 1 million code points, it is not a trivial task to provide a space-efficient solution for the multitude of properties.

Common approaches such as hash-tables or binary search over sorted code point intervals (as in InversionList) are insufficient. Hash-tables have an enormous memory footprint and binary search over intervals is not fast enough for some heavy-duty algorithms.

The recommended solution (see Unicode Implementation Guidelines) is using multi-stage tables that are an implementation of the Trie data structure with integer keys and a fixed number of stages. For the remainder of the section this will be called a fixed trie. The following describes a particular implementation that is aimed at speed of access at the expense of ideal size savings.

Taking a 2-level Trie as an example, the principle of operation is as follows. Split the number of bits in a key (code point, 21 bits) into 2 components (e.g. 13 and 8). The first is the number of bits in the index of the trie and the other is the number of bits in each page of the trie. The layout of the trie is then an array of size 2^^bits-of-index followed by an array of memory chunks of size 2^^bits-of-page/bits-per-element.

The number of pages is variable (but not less than 1), unlike the number of entries in the index. The slots of the index all have to contain the number of a page that is present. The lookup is then just a couple of operations: slice the upper bits, look up an index for these, take a page at this index and use the lower bits as an offset within this page.

Assuming that pages are laid out consecutively in one array at pages, the pseudo-code is:

    auto elemsPerPage = (2 ^^ bits_per_page) / Value.sizeOfInBits;
    pages[index[n >> bits_per_page]][n & (elemsPerPage - 1)];

If elemsPerPage is a power of 2, the whole process is a handful of simple instructions and 2 array reads. Subsequent levels of the trie are introduced by recursing on this notion: the index array is treated as values. The number of bits in the index is then again split into 2 parts, with pages over 'current-index' and the new 'upper-index'.
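
The two-stage lookup can be sketched by hand. The code below is an illustration only and is not how the module's tries are implemented: the names TwoStageTable and buildTable are hypothetical, each value takes a whole ubyte (no bit-packing), and the index only covers the 0x110000 code points rather than a full 2^^13 slots.

```d
/// Hypothetical two-stage table storing one ubyte per code point.
struct TwoStageTable
{
    enum pageBits = 8;             // low bits select a slot inside a page
    enum pageSize = 1 << pageBits;

    ushort[] index; // upper bits of the key -> page number
    ubyte[] pages;  // unique pages stored back to back

    ubyte opIndex(uint codePoint) const
    {
        immutable page = index[codePoint >> pageBits];
        return pages[page * pageSize + (codePoint & (pageSize - 1))];
    }
}

/// Builds the table from a predicate, merging identical pages.
TwoStageTable buildTable(scope bool delegate(uint) predicate)
{
    TwoStageTable table;
    uint[immutable(ubyte)[]] known; // page contents -> page number
    foreach (uint block; 0 .. 0x110000 >> TwoStageTable.pageBits)
    {
        auto page = new ubyte[TwoStageTable.pageSize];
        foreach (i, ref slot; page)
            slot = predicate(cast(uint)((block << TwoStageTable.pageBits) + i));
        immutable(ubyte)[] key = page.idup;
        if (auto found = key in known)
        {
            table.index ~= cast(ushort) *found; // identical page: reuse it
        }
        else
        {
            immutable number = cast(uint)(table.pages.length / TwoStageTable.pageSize);
            known[key] = number;
            table.index ~= cast(ushort) number;
            table.pages ~= page;
        }
    }
    return table;
}

unittest
{
    auto digits = buildTable(n => n >= '0' && n <= '9');
    assert(digits['7'] == 1);
    assert(digits['a'] == 0);
    assert(digits[0x10FFFF] == 0);
    // Only two distinct pages exist: the one holding the digits and the empty one.
    assert(digits.pages.length == 2 * TwoStageTable.pageSize);
}
```

The last assertion shows where the size savings come from: all pages with identical contents share a single copy.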

For completeness, a level 1 trie is simply an array. The current implementation takes advantage of bit-packing values when the range is known to be limited in advance (such as bool). See also BitPacked for enforcing it manually. The major size advantage however comes from the fact that multiple identical pages on every level are merged by construction.

The process of constructing a trie is more involved and is hidden from the user in the form of the convenience functions codepointTrie, codepointSetTrie and the even more convenient toTrie. In general a set or built-in AA with dchar type can be turned into a trie. The trie object in this module is read-only (immutable); it's effectively frozen after construction.

Unicode properties

This is a full list of Unicode properties accessible through unicode with specific helpers per category nested within. Consult the CLDR utility when in doubt about the contents of a particular set.

General category sets listed below are only accessible with the unicode shorthand accessor.

General category
Abb. | Long form | Abb. | Long form | Abb. | Long form
L | Letter | Cn | Unassigned | Po | Other_Punctuation
Ll | Lowercase_Letter | Co | Private_Use | Ps | Open_Punctuation
Lm | Modifier_Letter | Cs | Surrogate | S | Symbol
Lo | Other_Letter | N | Number | Sc | Currency_Symbol
Lt | Titlecase_Letter | Nd | Decimal_Number | Sk | Modifier_Symbol
Lu | Uppercase_Letter | Nl | Letter_Number | Sm | Math_Symbol
M | Mark | No | Other_Number | So | Other_Symbol
Mc | Spacing_Mark | P | Punctuation | Z | Separator
Me | Enclosing_Mark | Pc | Connector_Punctuation | Zl | Line_Separator
Mn | Nonspacing_Mark | Pd | Dash_Punctuation | Zp | Paragraph_Separator
C | Other | Pe | Close_Punctuation | Zs | Space_Separator
Cc | Control | Pf | Final_Punctuation | - | Any
Cf | Format | Pi | Initial_Punctuation | - | ASCII

Sets for other commonly useful properties that are accessible with unicode:

Common binary properties
Name | Name | Name
Alphabetic | Ideographic | Other_Uppercase
ASCII_Hex_Digit | IDS_Binary_Operator | Pattern_Syntax
Bidi_Control | ID_Start | Pattern_White_Space
Cased | IDS_Trinary_Operator | Quotation_Mark
Case_Ignorable | Join_Control | Radical
Dash | Logical_Order_Exception | Soft_Dotted
Default_Ignorable_Code_Point | Lowercase | STerm
Deprecated | Math | Terminal_Punctuation
Diacritic | Noncharacter_Code_Point | Unified_Ideograph
Extender | Other_Alphabetic | Uppercase
Grapheme_Base | Other_Default_Ignorable_Code_Point | Variation_Selector
Grapheme_Extend | Other_Grapheme_Extend | White_Space
Grapheme_Link | Other_ID_Continue | XID_Continue
Hex_Digit | Other_ID_Start | XID_Start
Hyphen | Other_Lowercase |
ID_Continue | Other_Math |

Below is the table with block names accepted by unicode.block. Note that the shorthand version unicode requires "In" to be prepended to the names of blocks so as to disambiguate scripts and blocks.

Blocks
Aegean Numbers | Ethiopic Extended | Mongolian
Alchemical Symbols | Ethiopic Extended-A | Musical Symbols
Alphabetic Presentation Forms | Ethiopic Supplement | Myanmar
Ancient Greek Musical Notation | General Punctuation | Myanmar Extended-A
Ancient Greek Numbers | Geometric Shapes | New Tai Lue
Ancient Symbols | Georgian | NKo
Arabic | Georgian Supplement | Number Forms
Arabic Extended-A | Glagolitic | Ogham
Arabic Mathematical Alphabetic Symbols | Gothic | Ol Chiki
Arabic Presentation Forms-A | Greek and Coptic | Old Italic
Arabic Presentation Forms-B | Greek Extended | Old Persian
Arabic Supplement | Gujarati | Old South Arabian
Armenian | Gurmukhi | Old Turkic
Arrows | Halfwidth and Fullwidth Forms | Optical Character Recognition
Avestan | Hangul Compatibility Jamo | Oriya
Balinese | Hangul Jamo | Osmanya
Bamum | Hangul Jamo Extended-A | Phags-pa
Bamum Supplement | Hangul Jamo Extended-B | Phaistos Disc
Basic Latin | Hangul Syllables | Phoenician
Batak | Hanunoo | Phonetic Extensions
Bengali | Hebrew | Phonetic Extensions Supplement
Block Elements | High Private Use Surrogates | Playing Cards
Bopomofo | High Surrogates | Private Use Area
Bopomofo Extended | Hiragana | Rejang
Box Drawing | Ideographic Description Characters | Rumi Numeral Symbols
Brahmi | Imperial Aramaic | Runic
Braille Patterns | Inscriptional Pahlavi | Samaritan
Buginese | Inscriptional Parthian | Saurashtra
Buhid | IPA Extensions | Sharada
Byzantine Musical Symbols | Javanese | Shavian
Carian | Kaithi | Sinhala
Chakma | Kana Supplement | Small Form Variants
Cham | Kanbun | Sora Sompeng
Cherokee | Kangxi Radicals | Spacing Modifier Letters
CJK Compatibility | Kannada | Specials
CJK Compatibility Forms | Katakana | Sundanese
CJK Compatibility Ideographs | Katakana Phonetic Extensions | Sundanese Supplement
CJK Compatibility Ideographs Supplement | Kayah Li | Superscripts and Subscripts
CJK Radicals Supplement | Kharoshthi | Supplemental Arrows-A
CJK Strokes | Khmer | Supplemental Arrows-B
CJK Symbols and Punctuation | Khmer Symbols | Supplemental Mathematical Operators
CJK Unified Ideographs | Lao | Supplemental Punctuation
CJK Unified Ideographs Extension A | Latin-1 Supplement | Supplementary Private Use Area-A
CJK Unified Ideographs Extension B | Latin Extended-A | Supplementary Private Use Area-B
CJK Unified Ideographs Extension C | Latin Extended Additional | Syloti Nagri
CJK Unified Ideographs Extension D | Latin Extended-B | Syriac
Combining Diacritical Marks | Latin Extended-C | Tagalog
Combining Diacritical Marks for Symbols | Latin Extended-D | Tagbanwa
Combining Diacritical Marks Supplement | Lepcha | Tags
Combining Half Marks | Letterlike Symbols | Tai Le
Common Indic Number Forms | Limbu | Tai Tham
Control Pictures | Linear B Ideograms | Tai Viet
Coptic | Linear B Syllabary | Tai Xuan Jing Symbols
Counting Rod Numerals | Lisu | Takri
Cuneiform | Low Surrogates | Tamil
Cuneiform Numbers and Punctuation | Lycian | Telugu
Currency Symbols | Lydian | Thaana
Cypriot Syllabary | Mahjong Tiles | Thai
Cyrillic | Malayalam | Tibetan
Cyrillic Extended-A | Mandaic | Tifinagh
Cyrillic Extended-B | Mathematical Alphanumeric Symbols | Transport And Map Symbols
Cyrillic Supplement | Mathematical Operators | Ugaritic
Deseret | Meetei Mayek | Unified Canadian Aboriginal Syllabics
Devanagari | Meetei Mayek Extensions | Unified Canadian Aboriginal Syllabics Extended
Devanagari Extended | Meroitic Cursive | Vai
Dingbats | Meroitic Hieroglyphs | Variation Selectors
Domino Tiles | Miao | Variation Selectors Supplement
Egyptian Hieroglyphs | Miscellaneous Mathematical Symbols-A | Vedic Extensions
Emoticons | Miscellaneous Mathematical Symbols-B | Vertical Forms
Enclosed Alphanumerics | Miscellaneous Symbols | Yijing Hexagram Symbols
Enclosed Alphanumeric Supplement | Miscellaneous Symbols and Arrows | Yi Radicals
Enclosed CJK Letters and Months | Miscellaneous Symbols And Pictographs | Yi Syllables
Enclosed Ideographic Supplement | Miscellaneous Technical |
Ethiopic | Modifier Tone Letters |
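
The difference between a block and a script of the same name, and the role of the "In" prefix, can be checked directly. This is a small sketch using only the unicode accessor; the choice of U+0500 as the probe character is mine.

```d
import std.uni : unicode;

unittest
{
    // The shorthand needs the "In" prefix to select a block ...
    assert(unicode.InCyrillic == unicode.block.Cyrillic);
    // ... and no prefix to select a script.
    assert(unicode.Cyrillic == unicode.script.Cyrillic);

    // U+0500 belongs to the Cyrillic script but lives in the
    // "Cyrillic Supplement" block, outside the "Cyrillic" block.
    assert(unicode.script.Cyrillic['\u0500']);
    assert(!unicode.block.Cyrillic['\u0500']);
    assert(unicode.InCyrillic_Supplement['\u0500']);
}
```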

Below is the table with script names accepted by unicode.script and by the shorthand version unicode:

Scripts
Arabic | Hanunoo | Old_Italic
Armenian | Hebrew | Old_Persian
Avestan | Hiragana | Old_South_Arabian
Balinese | Imperial_Aramaic | Old_Turkic
Bamum | Inherited | Oriya
Batak | Inscriptional_Pahlavi | Osmanya
Bengali | Inscriptional_Parthian | Phags_Pa
Bopomofo | Javanese | Phoenician
Brahmi | Kaithi | Rejang
Braille | Kannada | Runic
Buginese | Katakana | Samaritan
Buhid | Kayah_Li | Saurashtra
Canadian_Aboriginal | Kharoshthi | Sharada
Carian | Khmer | Shavian
Chakma | Lao | Sinhala
Cham | Latin | Sora_Sompeng
Cherokee | Lepcha | Sundanese
Common | Limbu | Syloti_Nagri
Coptic | Linear_B | Syriac
Cuneiform | Lisu | Tagalog
Cypriot | Lycian | Tagbanwa
Cyrillic | Lydian | Tai_Le
Deseret | Malayalam | Tai_Tham
Devanagari | Mandaic | Tai_Viet
Egyptian_Hieroglyphs | Meetei_Mayek | Takri
Ethiopic | Meroitic_Cursive | Tamil
Georgian | Meroitic_Hieroglyphs | Telugu
Glagolitic | Miao | Thaana
Gothic | Mongolian | Thai
Greek | Myanmar | Tibetan
Gujarati | New_Tai_Lue | Tifinagh
Gurmukhi | Nko | Ugaritic
Han | Ogham | Vai
Hangul | Ol_Chiki | Yi

Below is the table of names accepted by unicode.hangulSyllableType.

Hangul syllable type
Abb. | Long form
L | Leading_Jamo
LV | LV_Syllable
LVT | LVT_Syllable
T | Trailing_Jamo
V | Vowel_Jamo

References: ASCII Table, Wikipedia, The Unicode Consortium, Unicode normalization forms, Unicode text segmentation, Unicode Implementation Guidelines, Unicode Conformance

Trademarks: Unicode(tm) is a trademark of Unicode, Inc.

License:
Boost License 1.0.
Authors:
Dmitry Olshansky

Source: std/uni/package.d

Standards:
Unicode v6.2
enum dchar lineSep;
Constant code point (0x2028) - line separator.
enum dchar paraSep;
Constant code point (0x2029) - paragraph separator.
enum dchar nelSep;
Constant code point (0x0085) - next line.
template isCodepointSet(T)
Tests if T is some kind of set of code points. Intended for template constraints.
enum auto isIntegralPair(T, V = uint);
Tests if T is a pair of integers that implicitly convert to V. The following code must compile for any pair T:
(T x){ V a = x[0]; V b = x[1];}
The following must not compile:
(T x){ V c = x[2];}
alias CodepointSet = InversionList!(GcPolicy).InversionList;
The recommended default type for a set of code points. For details, see the current implementation: InversionList.
struct CodepointInterval;
The recommended type of std.typecons.Tuple to represent [a, b) intervals of code points, as used in InversionList. Any interval type should pass the isIntegralPair trait.
struct InversionList(SP = GcPolicy);

InversionList is a set of code points represented as an array of open-right [a, b) intervals (see CodepointInterval above). The name comes from the way the representation reads left to right. For instance a set of all values [10, 50), [80, 90), plus a singular value 60 looks like this:

10, 50, 60, 61, 80, 90

The way to read this is: start with negative, meaning that all numbers smaller than the next one are not present in this set (and positive means the contrary). Then switch positive/negative after each number passed from left to right.

This way negative spans until 10, then positive until 50, then negative until 60, then positive until 61, and so on. As seen, this provides space-efficient storage of highly redundant data that comes in long runs, a description which Unicode character properties fit nicely. The technique itself could be seen as a variation on RLE encoding.
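
The reading rule above translates into a short membership test: count how many boundaries are less than or equal to the value, and treat an odd count as "inside". The sketch below is an illustration on a plain array, not the module's implementation; the name containsCodePoint is hypothetical.

```d
import std.range : assumeSorted;

/// Hypothetical membership test on a flat inversion list:
/// an odd number of boundaries <= x means x is inside the set.
bool containsCodePoint(const(uint)[] boundaries, uint x)
{
    // lowerBound yields the elements strictly less than x + 1, i.e. <= x.
    immutable passed = boundaries.assumeSorted.lowerBound(x + 1).length;
    return passed % 2 == 1;
}

unittest
{
    import std.uni : CodepointSet;

    const(uint)[] list = [10, 50, 60, 61, 80, 90];
    auto set = CodepointSet(10, 50, 60, 61, 80, 90);

    // The hand-written test agrees with CodepointSet for every value tried.
    foreach (uint x; 0 .. 100)
        assert(containsCodePoint(list, x) == set[x]);
    assert(containsCodePoint(list, 60) && !containsCodePoint(list, 61));
}
```

The binary search is what makes the lookup O(log N) in the number of intervals.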

Sets are value types (just like int is), thus they are never aliased.

Example

    auto a = CodepointSet('a', 'z'+1);
    auto b = CodepointSet('A', 'Z'+1);
    auto c = a;
    a = a | b;
    assert(a == CodepointSet('A', 'Z'+1, 'a', 'z'+1));
    assert(a != c);

See also unicode for simpler construction of sets from predefined ones.

Memory usage is 8 bytes per each contiguous interval in a set. The value semantics are achieved by using the COW technique and thus it's not safe to cast this type to shared.

Note

It's not recommended to rely on the template parameters or the exact type of a current code point set in std.uni. The type and parameters may change when the standard allocators design is finalized. Use isCodepointSet with templates or just stick with the default alias CodepointSet throughout the whole code base.

pure this(Set)(Set set)
if (isCodepointSet!Set);
Construct from another code point set of any type.
pure this(Range)(Range intervals)
if (isForwardRange!Range && isIntegralPair!(ElementType!Range));
Construct a set from a forward range of code point intervals.
this()(uint[] intervals...);
Construct a set from plain values of code point intervals.
Examples:
    import std.algorithm.comparison : equal;

    auto set = CodepointSet('a', 'z'+1, 'а', 'я'+1);
    foreach (v; 'a'..'z'+1)
        assert(set[v]);
    // Cyrillic lowercase interval
    foreach (v; 'а'..'я'+1)
        assert(set[v]);
    // specific order is not required, intervals may intersect
    auto set2 = CodepointSet('а', 'я'+1, 'a', 'd', 'b', 'z'+1);
    // the same end result
    assert(set2.byInterval.equal(set.byInterval));
    // test constructor this(Range)(Range intervals)
    auto chessPiecesWhite = CodepointInterval(9812, 9818);
    auto chessPiecesBlack = CodepointInterval(9818, 9824);
    auto set3 = CodepointSet([chessPiecesWhite, chessPiecesBlack]);
    foreach (v; '♔'..'♟'+1)
        assert(set3[v]);
@property auto byInterval() scope;
Get a range that spans all of the code point intervals in this InversionList.
bool opIndex(uint val) const;
Tests the presence of code point val in this set.
Examples:
    auto gothic = unicode.Gothic;
    // Gothic letter ahsa
    assert(gothic['\U00010330']);
    // no ascii in Gothic obviously
    assert(!gothic['$']);
@property size_t length();
Number of code points in this set.
This opBinary(string op, U)(U rhs)
if (isCodepointSet!U || is(U : dchar));

Sets support natural syntax for set algebra, namely:

Operator   Math notation   Description
&          a ∩ b           intersection
|          a ∪ b           union
-          a ∖ b           subtraction
~          a ~ b           symmetric set difference, i.e. (a ∪ b) \ (a ∩ b)
Examples:
    import std.algorithm.comparison : equal;
    import std.range : iota;
    import std.stdio : writeln;

    auto lower = unicode.LowerCase;
    auto upper = unicode.UpperCase;
    auto ascii = unicode.ASCII;

    assert((lower & upper).empty); // no intersection
    auto lowerASCII = lower & ascii;
    assert(lowerASCII.byCodepoint.equal(iota('a', 'z'+1)));
    // throw away all of the lowercase ASCII
    writeln((ascii - lower).length); // 128 - 26

    auto onlyOneOf = lower ~ ascii;
    assert(!onlyOneOf['Δ']); // not ASCII and not lowercase
    assert(onlyOneOf['$']); // ASCII and not lowercase
    assert(!onlyOneOf['a']); // ASCII and lowercase
    assert(onlyOneOf['я']); // not ASCII but lowercase

    // throw away all cased letters from ASCII
    auto noLetters = ascii - (lower | upper);
    writeln(noLetters.length); // 128 - 26 * 2
ref This opOpAssign(string op, U)(U rhs)
if (isCodepointSet!U || is(U : dchar));
The 'op=' versions of the above overloaded operators.
bool opBinaryRight(string op : "in", U)(U ch) const
if (is(U : dchar));
Tests the presence of code point ch in this set, the same as opIndex.
Examples:
    assert('я' in unicode.Cyrillic);
    assert(!('z' in unicode.Cyrillic));
auto opUnary(string op : "!")();
Obtains a set that is the inversion of this set.
See Also: inverted
@property auto byCodepoint();
A range that spans each code point in this set.
Examples:
    import std.algorithm.comparison : equal;
    import std.range : iota;

    auto set = unicode.ASCII;
    assert(set.byCodepoint.equal(iota(0, 0x80)));
void toString(Writer)(scope Writer sink, ref scope const FormatSpec!char fmt);
Obtain a textual representation of this InversionList in the form of open-right intervals.
The formatting flag is applied individually to each value, for example:
  • %s and %d format the intervals as a [low .. high) range of integrals
  • %x formats the intervals as a [low .. high) range of lowercase hex characters
  • %X formats the intervals as a [low .. high) range of uppercase hex characters
Examples:
    import std.conv : to;
    import std.format : format;
    import std.stdio : writeln;
    import std.uni : unicode;

    // This was originally using Cyrillic script.
    // Unfortunately this is a pretty active range for changes,
    // and hence broke in an update.
    // Therefore the range Basic latin was used instead as it is
    // unlikely to ever change.
    writeln(unicode.InBasic_latin.to!string); // "[0..128)"

    // The specs '%s' and '%d' are equivalent to the to!string call above.
    writeln(format("%d", unicode.InBasic_latin)); // unicode.InBasic_latin.to!string

    writeln(format("%#x", unicode.InBasic_latin)); // "[0..0x80)"
    writeln(format("%#X", unicode.InBasic_latin)); // "[0..0X80)"
ref auto add()(uint a, uint b);
Add an interval [a, b) to this set.
Examples:
    CodepointSet someSet;
    someSet.add('0', '5').add('A', 'Z'+1);
    someSet.add('5', '9'+1);
    assert(someSet['0']);
    assert(someSet['5']);
    assert(someSet['9']);
    assert(someSet['Z']);
@property auto inverted();
Obtains a set that is the inversion of this set.
See the '!' opUnary for the same but using operators.
Examples:
    import std.stdio : writeln;

    auto set = unicode.ASCII;
    // union with the inverse gets all of the code points in Unicode
    writeln((set | set.inverted).length); // 0x110000
    // no intersection with the inverse
    assert((set & set.inverted).empty);
string toSourceCode(string funcName = "");
Generates a string with D source code of a unary function with the name funcName taking a single dchar argument. If funcName is empty, the code is adjusted to be a lambda function.
The generated function tests if the code point passed belongs to this set or not. The result is to be used with a string mixin. The intended usage area is aggressive optimization via meta programming in parser generators and the like.

Note: Use with care for relatively small or regular sets. It could end up being slower than just using multi-staged tables.

Example

    import std.stdio;

    // construct set directly from [a, b) intervals
    auto set = CodepointSet(10, 12, 45, 65, 100, 200);
    writeln(set);
    writeln(set.toSourceCode("func"));
The above outputs something along the lines of:
    bool func(dchar ch) @safe pure nothrow @nogc
    {
        if (ch < 45)
        {
            if (ch == 10 || ch == 11) return true;
            return false;
        }
        else if (ch < 65) return true;
        else
        {
            if (ch < 100) return false;
            if (ch < 200) return true;
            return false;
        }
    }

@property bool empty() const;
True if this set doesn't contain any code points.
Examples:
    CodepointSet emptySet;
    writeln(emptySet.length); // 0
    assert(emptySet.empty);
template codepointSetTrie(sizes...) if (sumOfIntegerTuple!sizes == 21)
A shorthand for creating a custom multi-level fixed Trie from a CodepointSet. sizes are numbers of bits per level, with the most significant bits used first.

Note: The sum of sizes must be equal to 21.

See Also:
toTrie, which is even simpler.

Example

    import std.stdio;
    import std.uni;

    void main()
    {
        auto set = unicode("Number");
        auto trie = codepointSetTrie!(8, 5, 8)(set);
        writeln("Input code points to test:");
        foreach (line; stdin.byLine)
        {
            int count = 0;
            foreach (dchar ch; line)
                if (trie[ch]) // is number
                    count++;
            writefln("Contains %d number code points.", count);
        }
    }

    template CodepointSetTrie(sizes...) if (sumOfIntegerTuple!sizes == 21)
    Type of Trie generated by the codepointSetTrie function.
    template codepointTrie(T, sizes...) if (sumOfIntegerTuple!sizes == 21)

    template CodepointTrie(T, sizes...) if (sumOfIntegerTuple!sizes == 21)
    A slightly more general tool for building a fixed Trie for Unicode data.
    Specifically, unlike codepointSetTrie, it allows creating mappings of dchar to an arbitrary type T.

    Note: The overload taking CodepointSets will naturally convert only to bool-mapping Tries.

    CodepointTrie is the type of Trie generated by the codepointTrie function.

    auto codepointTrie()(T[dchar] map, T defValue = T.init);
    auto codepointTrie(R)(R range, T defValue = T.init)
    if (isInputRange!R && is(typeof(ElementType!R.init[0]) : T) && is(typeof(ElementType!R.init[1]) : dchar));
    struct MatcherConcept;
    Conceptual type that outlines the common properties of all UTF Matchers.

    Note: For illustration purposes only; every method call results in an assertion failure. Use utfMatcher to obtain a concrete matcher for UTF-8 or UTF-16 encodings.

    bool match(Range)(ref Range inp)
    if (isRandomAccessRange!Range && is(ElementType!Range : char));

    bool skip(Range)(ref Range inp)
    if (isRandomAccessRange!Range && is(ElementType!Range : char));

    bool test(Range)(ref Range inp)
    if (isRandomAccessRange!Range && is(ElementType!Range : char));

    Performs the semantic equivalent of 2 operations: decoding a code point at the front of inp and testing if it belongs to the set of code points of this matcher.

    The effect on inp depends on the kind of function called:

    Match. If the code point is found in the set then the range inp is advanced by its size in code units, otherwise the range is not modified.

    Skip. The range is always advanced by the size of the tested code point regardless of the result of the test.

    Test. The range is left unaffected regardless of the result of the test.

    Examples:
    string truth = "2² = 4";
    auto m = utfMatcher!char(unicode.Number);
    assert(m.match(truth)); // '2' is a number all right
    assert(truth == "² = 4"); // skips on match
    assert(m.match(truth)); // so is the superscript '2'
    assert(!m.match(truth)); // space is not a number
    assert(truth == " = 4"); // unaffected on no match
    assert(!m.skip(truth)); // same test ...
    assert(truth == "= 4"); // but skips a codepoint regardless
    assert(!m.test(truth)); // '=' is not a number
    assert(truth == "= 4"); // test never affects argument
    @property auto subMatcher(Lengths...)();
    Advanced feature: provides direct access to a subset of the matcher based on a set of known encoding lengths. Lengths are provided in code units. The sub-matcher may then do fewer operations per any test/match.
    Use with care, as the sub-matcher won't match any code points whose encoded length doesn't belong to the selected set of lengths. Also, the sub-matcher object references the parent matcher and must not be used past the lifetime of the latter.
    Another caveat of using a sub-matcher is that skip is not available, precisely because the sub-matcher doesn't detect all lengths.
    enum auto isUtfMatcher(M, C);
    Tests if M is a UTF Matcher for ranges of Char.
    auto utfMatcher(Char, Set)(Set set)
    if (isCodepointSet!Set);
    Constructs a matcher object to classify code points from the set for an encoding that has Char as its code unit.
    See MatcherConcept for an API outline.
    auto toTrie(size_t level, Set)(Set set)
    if (isCodepointSet!Set);
    Convenience function to construct optimal configurations for a packed Trie from any set of code points.
    The parameter level indicates the number of trie levels to use; allowed values are 1, 2, 3 or 4. Levels represent different speed-size trade-offs.

    Level 1 is the fastest and the most memory hungry (a bit array).

    Level 4 is the slowest and has the smallest footprint.

    See the Synopsis section for an example.

    Note: Level 4 stays very practical (being faster and more predictable) compared to using direct lookup on the set itself.

    auto toDelegate(Set)(Set set)
    if (isCodepointSet!Set);

    Builds a Trie with a typically optimal speed-size trade-off and wraps it into a delegate of the following type: bool delegate(dchar ch).

    Effectively this creates a 'tester' lambda suitable for algorithms like std.algorithm.find that take unary predicates.

    See the Synopsis section for an example.
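    As a minimal sketch of how toTrie and toDelegate might be used together, the program below builds a 2-level trie and a delegate from the same set. The helper names countNumbers and fromFirstNumber are hypothetical and exist only for this illustration; the level of 2 is an arbitrary choice, not a recommendation.

```d
import std.algorithm.searching : find;
import std.uni : toDelegate, toTrie, unicode;

// Hypothetical helper: counts the code points of `text` that belong to
// the Unicode "Number" set, testing membership through a 2-level trie.
size_t countNumbers(string text)
{
    auto trie = toTrie!2(unicode.Number);
    size_t count = 0;
    foreach (dchar ch; text)
        if (trie[ch])
            ++count;
    return count;
}

// Hypothetical helper: returns the tail of `text` that starts at its
// first number code point (empty if there is none).
string fromFirstNumber(string text)
{
    auto isNum = toDelegate(unicode.Number); // bool delegate(dchar)
    return text.find!(ch => isNum(ch));
}

void main()
{
    assert(countNumbers("2² = 4") == 3);
    assert(fromFirstNumber("x = 42") == "42");
    assert(fromFirstNumber("none") == "");
}
```

    The delegate hides the trie type, which is convenient when the predicate has to be stored or passed around; indexing the trie directly avoids the indirect call.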
    struct unicode;
    A single entry point to look up Unicode code point sets by name or alias of a block, script or general category.
    It uses well-defined standard rules of property name lookup. This includes fuzzy matching of names, so that 'White_Space', 'white-SpAce' and 'whitespace' are all considered equal and yield the same set of white space characters.
    pure @property auto opDispatch(string name)();
    Performs the lookup of a set of code points with compile-time correctness checking. This short-cut version combines 3 searches: across blocks, scripts, and common binary properties.
    Note that since scripts and blocks overlap, the usual trick to disambiguate is used: to get a block use unicode.InBlockName, to search a script use unicode.ScriptName.
    See Also:
    block, script and (not included in this search) hangulSyllableType.
    auto opCall(C)(scope const C[] name)
    if (is(C : dchar));
    The same lookup across blocks, scripts, or binary properties, but performed at run-time. This version is provided for cases where name is not known beforehand; otherwise compile-time checked opDispatch is typically a better choice.
    See the table of properties for available sets.
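    The two lookup styles and the fuzzy name matching described above can be sketched as follows. The helper inNamedSet is hypothetical; the set names used are the ones mentioned in this section.

```d
import std.uni : unicode;

// Hypothetical helper: run-time lookup of a set by name, followed by a
// membership test for `ch`.
bool inNamedSet(string name, dchar ch)
{
    return unicode(name)[ch];
}

void main()
{
    // Fuzzy matching: all three spellings name the same set.
    assert(unicode("White_Space") == unicode("white-SpAce"));
    assert(unicode("white-SpAce") == unicode("whitespace"));

    // The compile-time checked form yields the same set.
    assert(unicode.White_Space == unicode("whitespace"));

    // The "In" prefix selects a block, the bare name selects a script.
    assert(unicode.InCyrillic['я']);
    assert(unicode.Cyrillic['я']);

    assert(inNamedSet("Number", '7'));
    assert(!inNamedSet("Number", 'x'));
}
```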
    struct block;
    Narrows down the search for sets of code points to all Unicode blocks.

    Note: Here block names are unambiguous as no scripts are searched, and thus to search use simply the unicode.block.BlockName notation.

    See the table of properties for available sets.

    Examples:
    // use .block for explicitness
    writeln(unicode.block.Greek_and_Coptic); // unicode.InGreek_and_Coptic
    struct script;
    Narrows down the search for sets of code points to all Unicode scripts.
    See the table of properties for available sets.
    Examples:
    auto arabicScript = unicode.script.arabic;
    auto arabicBlock = unicode.block.arabic;
    // there is an intersection between script and block
    assert(arabicBlock['؁']);
    assert(arabicScript['؁']);
    // but they are different
    assert(arabicBlock != arabicScript);
    writeln(arabicBlock); // unicode.inArabic
    writeln(arabicScript); // unicode.arabic
    struct hangulSyllableType;
    Fetches a set of code points that have the given hangul syllable type.
    Other non-binary properties (once supported) follow the same notation: unicode.propertyName.propertyValue for compile-time checked access and unicode.propertyName(propertyValue) for run-time checked access.
    See the table of properties for available sets.
    Examples:
    // L here is syllable type not Letter as in unicode.L short-cut
    auto leadingVowel = unicode.hangulSyllableType("L");
    // check that some leading vowels are present
    foreach (vowel; '\u1110' .. '\u115F')
        assert(leadingVowel[vowel]);
    writeln(leadingVowel); // unicode.hangulSyllableType.L
    CodepointSet parseSet(Range)(ref Range range, bool casefold = false)
    if (isInputRange!Range && is(ElementType!Range : dchar));
    Parses a Unicode code point set from the given range using standard regex syntax '[...]'. The range is advanced, skipping over the regex set definition. The casefold parameter determines if the set should be casefolded, that is, include both lower and upper case versions for any letters in the set.
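    A minimal sketch of parseSet follows. It assumes that parseSet is reached as a static member of unicode, which its position among the members of unicode suggests but which this section does not state outright. The helper restAfterSet is hypothetical.

```d
import std.uni : unicode;

// Hypothetical helper: parses a leading '[...]' set off `pattern` and
// returns whatever is left of the pattern afterwards.
string restAfterSet(string pattern)
{
    unicode.parseSet(pattern); // advances `pattern` past the set
    return pattern;
}

void main()
{
    string pat = "[a-zA-Z0-9]hello";
    auto set = unicode.parseSet(pat);

    // Some code points inside and outside of the parsed ranges.
    assert(set['a'] && set['Z'] && set['9']);
    assert(!set['_']);

    // The range now starts right after the closing bracket.
    assert(pat == "hello");
    assert(restAfterSet("[0-9]+") == "+");
}
```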
    pure @safe size_t graphemeStride(C)(scope const C[] input, size_t index)
    if (is(C : dchar));
    Computes the length of the grapheme cluster starting at index. Both the resulting length and the index are measured in code units.
    Parameters:
    C: type that is implicitly convertible to dchars
    C[] input: array of grapheme clusters
    size_t index: starting index into input[]
    Returns:
    length of grapheme cluster
    Examples:
    writeln(graphemeStride("  ", 1)); // 1
    // A + combining ring above
    string city = "A\u030Arhus";
    size_t first = graphemeStride(city, 0);
    assert(first == 3); // \u030A has 2 UTF-8 code units
    writeln(city[0 .. first]); // "A\u030A"
    writeln(city[first .. $]); // "rhus"
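    Because graphemeStride works in code units, it can be used to cut a string at a grapheme boundary without decoding it into Grapheme objects. The following is a sketch only; takeGraphemes is a hypothetical helper, not part of this module.

```d
import std.uni : graphemeStride;

// Hypothetical helper: returns the longest prefix of `s` that holds at
// most `n` grapheme clusters, never splitting a cluster.
string takeGraphemes(string s, size_t n)
{
    size_t end = 0; // offset in code units
    for (size_t i = 0; i < n && end < s.length; ++i)
        end += graphemeStride(s, end);
    return s[0 .. end];
}

void main()
{
    // 'e' and the combining diaeresis stay together.
    assert(takeGraphemes("noe\u0308l", 3) == "noe\u0308");
    assert(takeGraphemes("A\u030Arhus", 1) == "A\u030A");
    assert(takeGraphemes("abc", 10) == "abc");
    assert(takeGraphemes("abc", 0) == "");
}
```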
    Grapheme decodeGrapheme(Input)(ref Input inp)
    if (isInputRange!Input && is(immutable(ElementType!Input) == immutable(dchar)));
    Reads one full grapheme cluster from an input range of dchar inp.
    For examples see Grapheme below.

    Note: This function modifies inp and thus inp must be an L-value.
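    Since decodeGrapheme consumes its argument, a loop over a local copy of the input is one way to walk a string cluster by cluster. The sketch below records the code point length of each cluster; clusterLengths is a hypothetical helper.

```d
import std.algorithm.comparison : equal;
import std.uni : decodeGrapheme;

// Hypothetical helper: returns the length in code points of every
// grapheme cluster found in `text`.
size_t[] clusterLengths(string text)
{
    size_t[] lengths;
    while (text.length > 0)
    {
        auto g = decodeGrapheme(text); // pops one cluster off `text`
        lengths ~= g.length;
    }
    return lengths;
}

void main()
{
    // n, o, e + combining diaeresis, l
    assert(clusterLengths("noe\u0308l").equal([1, 1, 2, 1]));
    // CR LF forms a single cluster
    assert(clusterLengths("\r\n!").equal([2, 1]));
    assert(clusterLengths("").length == 0);
}
```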

    size_t popGrapheme(Input)(ref Input inp)
    if (isInputRange!Input && is(immutable(ElementType!Input) == immutable(dchar)));
    Reads one full grapheme cluster from an input range of dchar inp, but doesn't return it. Instead returns the number of code units read. This differs from the number of code points read only if inp is an autodecodable string.

    Note: This function modifies inp and thus inp must be an L-value.

    Examples:
    // Two Union Jacks of the Great Britain in each
    string s = "\U0001F1EC\U0001F1E7\U0001F1EC\U0001F1E7";
    wstring ws = "\U0001F1EC\U0001F1E7\U0001F1EC\U0001F1E7";
    dstring ds = "\U0001F1EC\U0001F1E7\U0001F1EC\U0001F1E7";

    // String pop length in code units, not points.
    writeln(s.popGrapheme()); // 8
    writeln(ws.popGrapheme()); // 4
    writeln(ds.popGrapheme()); // 2

    writeln(s); // "\U0001F1EC\U0001F1E7"
    writeln(ws); // "\U0001F1EC\U0001F1E7"
    writeln(ds); // "\U0001F1EC\U0001F1E7"

    import std.algorithm.comparison : equal;
    import std.algorithm.iteration : filter;

    // Also works for non-random access ranges as long as the
    // character type is 32-bit.
    auto testPiece = "\r\nhello!"d.filter!(x => !x.isAlpha);
    // Windows-style line ending is two code points in a single grapheme.
    writeln(testPiece.popGrapheme()); // 2
    assert(testPiece.equal("!"d));
    auto byGrapheme(Range)(Range range)
    if (isInputRange!Range && is(immutable(ElementType!Range) == immutable(dchar)));

    Iterates a string by Grapheme.

    Useful for doing string manipulation that needs to be aware of graphemes.

    See Also:
    Examples:
    import std.algorithm.comparison : equal;
    import std.range.primitives : walkLength;
    import std.range : take, drop;

    auto text = "noe\u0308l"; // noël using e + combining diaeresis
    assert(text.walkLength == 5); // 5 code points

    auto gText = text.byGrapheme;
    assert(gText.walkLength == 4); // 4 graphemes

    assert(gText.take(3).equal("noe\u0308".byGrapheme));
    assert(gText.drop(3).equal("l".byGrapheme));
    auto byCodePoint(Range)(Range range)
    if (isInputRange!Range && is(immutable(ElementType!Range) == immutable(Grapheme)));

    auto byCodePoint(Range)(Range range)
    if (isInputRange!Range && is(immutable(ElementType!Range) == immutable(dchar)));

    Lazily transforms a range of Graphemes to a range of code points.

    Useful for converting the result to a string after doing operations on graphemes.

    If passed in a range of code points, returns a range with equivalent capabilities.

    Examples:
    import std.array : array;
    import std.conv : text;
    import std.range : retro;

    string s = "noe\u0308l"; // noël

    // reverse it and convert the result to a string
    string reverse = s.byGrapheme
        .array
        .retro
        .byCodePoint
        .text;

    assert(reverse == "le\u0308on"); // lëon
    struct Grapheme;

    A structure designed to effectively pack the characters of a grapheme cluster.

    Grapheme has value semantics, so 2 copies of a Grapheme always refer to distinct objects. In most actual scenarios a Grapheme fits on the stack and avoids memory allocation overhead for all but quite long clusters.

    See Also:
    this(C)(scope const C[] chars...)
    if (is(C : dchar));

    this(Input)(Input seq)
    if (!isDynamicArray!Input && isInputRange!Input && is(ElementType!Input : dchar));
    Constructor.
    pure nothrow @nogc @trusted dchar opIndex(size_t index) const;
    Gets a code point at the given index in this cluster.
    pure nothrow @nogc @trusted void opIndexAssign(dchar ch, size_t index);
    Writes a code point ch at the given index in this cluster.

    Warning: Use of this facility may invalidate the grapheme cluster, see also Grapheme.valid.

    Examples:
    auto g = Grapheme("A\u0302");
    writeln(g[0]); // 'A'
    assert(g.valid);
    g[1] = '~'; // ASCII tilde is not a combining mark
    writeln(g[1]); // '~'
    assert(!g.valid);
    pure nothrow @nogc @safe SliceOverIndexed!Grapheme opSlice(size_t a, size_t b) return;

    pure nothrow @nogc @safe SliceOverIndexed!Grapheme opSlice() return;
    Random-access range over the Grapheme's characters.

    Warning: Invalidates when this Grapheme leaves the scope; attempts to use it then would lead to memory corruption.

    pure nothrow @nogc @property @safe size_t length() const;
    Grapheme cluster length in code points.
    ref @trusted auto opOpAssign(string op)(dchar ch);
    Appends character ch to this grapheme.

    Warning: Use of this facility may invalidate the grapheme cluster, see also valid.

    Examples:
    import std.algorithm.comparison : equal;
    auto g = Grapheme("A");
    assert(g.valid);
    g ~= '\u0301';
    assert(g[].equal("A\u0301"));
    assert(g.valid);
    g ~= "B";
    // not a valid grapheme cluster anymore
    assert(!g.valid);
    // still could be useful though
    assert(g[].equal("A\u0301B"));
    ref auto opOpAssign(string op, Input)(scope Input inp)
    if (isInputRange!Input && is(ElementType!Input : dchar));
    Appends all characters from the input range inp to this Grapheme.
    @property bool valid()();
    True if this object contains a valid extended grapheme cluster. Decoding primitives of this module always return a valid Grapheme.
    Appending to and direct manipulation of a grapheme's characters may render it no longer valid. Certain applications may choose to use Grapheme as a "small string" of any code points and ignore this property entirely.
    int sicmp(S1, S2)(scope S1 r1, scope S2 r2)
    if (isInputRange!S1 && isSomeChar!(ElementEncodingType!S1) && isInputRange!S2 && isSomeChar!(ElementEncodingType!S2));

    Does basic case-insensitive comparison of r1 and r2. This function uses a simpler comparison rule, thus achieving better performance than icmp. However, keep in mind the warning below.

    Parameters:
    S1 r1: an input range of characters
    S2 r2: an input range of characters
    Returns:
    An int that is 0 if the strings match, <0 if r1 is lexicographically "less" than r2, >0 if r1 is lexicographically "greater" than r2

    Warning: This function only handles 1:1 code point mapping and thus is not sufficient for certain alphabets like German, Greek and a few others.

    See Also:
    Examples:
    writeln(sicmp("Август", "авгусТ")); // 0
    // Greek also works as long as there is no 1:M mapping in sight
    writeln(sicmp("ΌΎ", "όύ")); // 0
    // things like the following won't get matched as equal
    // Greek small letter iota with dialytika and tonos
    assert(sicmp("ΐ", "\u03B9\u0308\u0301") != 0);

    // while icmp has no problem with that
    writeln(icmp("ΐ", "\u03B9\u0308\u0301")); // 0
    writeln(icmp("ΌΎ", "όύ")); // 0
    int icmp(S1, S2)(S1 r1, S2 r2)
    if (isForwardRange!S1 && isSomeChar!(ElementEncodingType!S1) && isForwardRange!S2 && isSomeChar!(ElementEncodingType!S2));
    Does case-insensitive comparison of r1 and r2. Follows the rules of full case-folding mapping. This includes matching as equal German ß with "ss" and other 1:M code point mappings, unlike sicmp. The cost of icmp being pedantically correct is slightly worse performance.
    Parameters:
    S1 r1: a forward range of characters
    S2 r2: a forward range of characters
    Returns:
    An int that is 0 if the strings match, <0 if r1 is lexicographically "less" than r2, >0 if r1 is lexicographically "greater" than r2
    See Also:
    Examples:
    writeln(icmp("Rußland", "Russland")); // 0
    writeln(icmp("ᾩ -> \u1F70\u03B9", "\u1F61\u03B9 -> ᾲ")); // 0
    Examples:
    By using std.utf.byUTF and its aliases, GC allocations via auto-decoding and thrown exceptions can be avoided, making icmp @safe @nogc nothrow pure.
    import std.utf : byDchar;

    writeln(icmp("Rußland".byDchar, "Russland".byDchar)); // 0
    writeln(icmp("ᾩ -> \u1F70\u03B9".byDchar, "\u1F61\u03B9 -> ᾲ".byDchar)); // 0
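    Since icmp returns a three-way comparison result, it can serve as the basis of a case-insensitive ordering. The sketch below is one possible way to do that; sortedIgnoringCase is a hypothetical helper. It also contrasts icmp with sicmp on the 1:M mapping mentioned above.

```d
import std.algorithm.sorting : sort;
import std.uni : icmp, sicmp;

// Hypothetical helper: returns a copy of `words` ordered without regard
// to case, using full case folding via icmp.
string[] sortedIgnoringCase(const string[] words)
{
    string[] result = words.dup;
    result.sort!((a, b) => icmp(a, b) < 0);
    return result;
}

void main()
{
    assert(sortedIgnoringCase(["banana", "Cherry", "apple"])
        == ["apple", "banana", "Cherry"]);

    // Full case folding treats ß and "ss" as equal ...
    assert(icmp("Rußland", "Russland") == 0);
    // ... while the simple 1:1 folding used by sicmp does not.
    assert(sicmp("Rußland", "Russland") != 0);
}
```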
    pure nothrow @nogc @safe ubyte combiningClass(dchar ch);

    Returns the combining class of ch.

    Examples:
    // shorten the code
    alias CC = combiningClass;

    // combining tilde
    writeln(CC('\u0303')); // 230
    // combining ring below
    writeln(CC('\u0325')); // 220
    // the simple consequence is that "tilde" should be
    // placed after a "ring below" in a sequence
    enum UnicodeDecomposition: int;
    Unicode character decomposition type.
    Canonical
    Canonical decomposition. The result is a canonically equivalent sequence.
    Compatibility
    Compatibility decomposition. The result is a compatibility equivalent sequence.

    Note: Compatibility decomposition is a lossy conversion, typically suitable only for fuzzy matching and internal processing.

    pure nothrow @safe dchar compose(dchar first, dchar second);
    Tries to canonically compose 2 characters. Returns the composed character if they do compose and dchar.init otherwise.
    The assumption is that first comes before second in the original text, usually meaning that first is a starter.

    Note: Hangul syllables are not covered by this function. See composeJamo below.

    Examples:
    writeln(compose('A', '\u0308')); // '\u00C4'
    writeln(compose('A', 'B')); // dchar.init
    writeln(compose('C', '\u0301')); // '\u0106'
    // note that the starter is the first one
    // thus the following doesn't compose
    writeln(compose('\u0308', 'A')); // dchar.init
    @safe Grapheme decompose(UnicodeDecomposition decompType = Canonical)(dchar ch);
    Returns a full Canonical (by default) or Compatibility decomposition of character ch. If no decomposition is available, returns a Grapheme with ch itself.

    Note: This function also decomposes hangul syllables as prescribed by the standard.

    See Also:
    decomposeHangul for a restricted version that takes into account only hangul syllables but no other decompositions.
    Examples:
    import std.algorithm.comparison : equal;

    writeln(compose('A', '\u0308')); // '\u00C4'
    writeln(compose('A', 'B')); // dchar.init
    writeln(compose('C', '\u0301')); // '\u0106'
    // note that the starter is the first one
    // thus the following doesn't compose
    writeln(compose('\u0308', 'A')); // dchar.init

    assert(decompose('Ĉ')[].equal("C\u0302"));
    assert(decompose('D')[].equal("D"));
    assert(decompose('\uD4DC')[].equal("\u1111\u1171\u11B7"));
    assert(decompose!Compatibility('¹')[].equal("1"));
    pure nothrow @safe Grapheme decomposeHangul(dchar ch);
    Decomposes a Hangul syllable. If ch is not a composed syllable then this function returns a Grapheme containing only ch as is.
    Examples:
    import std.algorithm.comparison : equal;
    assert(decomposeHangul('\uD4DB')[].equal("\u1111\u1171\u11B6"));
    pure nothrow @nogc @safe dchar composeJamo(dchar lead, dchar vowel, dchar trailing = (dchar).init);
    Tries to compose a hangul syllable out of a leading consonant (lead), a vowel and an optional trailing consonant jamo.
    On success returns the composed LV or LVT hangul syllable.
    If either lead or vowel is not a valid hangul jamo of the respective character class, returns dchar.init.
    Examples:
    writeln(composeJamo('\u1111', '\u1171', '\u11B6')); // '\uD4DB'
    // leaving out T-vowel, or passing any codepoint
    // that is not trailing consonant composes an LV-syllable
    writeln(composeJamo('\u1111', '\u1171')); // '\uD4CC'
    writeln(composeJamo('\u1111', '\u1171', ' ')); // '\uD4CC'
    writeln(composeJamo('\u1111', 'A')); // dchar.init
    writeln(composeJamo('A', '\u1171')); // dchar.init
    enum NormalizationForm: int;
    Enumeration type for normalization forms, passed as a template parameter for functions like normalize.
    NFC

    NFD

    NFKC

    NFKD
    Shorthand aliases for values indicating normalization forms.
    pure @safe inout(C)[] normalize(NormalizationForm norm = NFC, C)(return scope inout(C)[] input);
    Returns the input string normalized to the chosen form. Form C is used by default.
    For more information on normalization forms see the normalization section.

    Note: In cases where the string in question is already normalized, it is returned unmodified and no memory allocation happens.

    Examples:
    // any encoding works
    wstring greet = "Hello world";
    assert(normalize(greet) is greet); // the same exact slice

    // An example of a character with all 4 forms being different:
    // Greek upsilon with acute and hook symbol (code point 0x03D3)
    writeln(normalize!NFC("ϓ")); // "\u03D3"
    writeln(normalize!NFD("ϓ")); // "\u03D2\u0301"
    writeln(normalize!NFKC("ϓ")); // "\u038E"
    writeln(normalize!NFKD("ϓ")); // "\u03A5\u0301"
    bool allowedIn(NormalizationForm norm)(dchar ch);
    Tests if dchar ch is always allowed (Quick_Check=YES) in normalization form norm.
    Examples:
    // e.g. Cyrillic is always allowed, so is ASCII
    assert(allowedIn!NFC('я'));
    assert(allowedIn!NFD('я'));
    assert(allowedIn!NFKC('я'));
    assert(allowedIn!NFKD('я'));
    assert(allowedIn!NFC('Z'));
    pure nothrow @nogc @safe bool isWhite(dchar c);
    Whether or not c is a Unicode whitespace character. (general Unicode category: Part of C0 (tab, vertical tab, form feed, carriage return, and linefeed characters), Zs, Zl, Zp, and NEL (U+0085))
    pure nothrow @nogc @safe bool isLower(dchar c);
    Returns whether c is a Unicode lowercase character.
    pure nothrow @nogc @safe bool isUpper(dchar c);
    Returns whether c is a Unicode uppercase character.
    auto asLowerCase(Range)(Range str)
    if (isInputRange!Range && isSomeChar!(ElementEncodingType!Range) && !isConvertibleToString!Range);

    auto asUpperCase(Range)(Range str)
    if (isInputRange!Range && isSomeChar!(ElementEncodingType!Range) && !isConvertibleToString!Range);
    Converts an input range or a string to upper or lower case.
    Does not allocate memory. Characters in UTF-8 or UTF-16 format that cannot be decoded are treated as std.utf.replacementDchar.
    Parameters:
    Range str: string or range of characters
    Returns:
    an input range of dchars
    See Also:
    Examples:
    import std.algorithm.comparison : equal;

    assert("hEllo".asUpperCase.equal("HELLO"));
    auto asCapitalized(Range)(Range str)
    if (isInputRange!Range && isSomeChar!(ElementEncodingType!Range) && !isConvertibleToString!Range);
    Capitalizes an input range or string, meaning convert the first character to upper case and subsequent characters to lower case.
    Does not allocate memory. Characters in UTF-8 or UTF-16 format that cannot be decoded are treated as std.utf.replacementDchar.
    Parameters:
    Range str: string or range of characters
    Returns:
    an input range of dchars
    See Also:
    Examples:
    import std.algorithm.comparison : equal;

    assert("hEllo".asCapitalized.equal("Hello"));
    pure @trusted void toLowerInPlace(C)(ref C[] s)
    if (is(C == char) || is(C == wchar) || is(C == dchar));
    Converts s to lowercase (by performing Unicode lowercase mapping) in place. For a few characters the string length may increase after the transformation; in such a case the function reallocates exactly once. If s does not have any uppercase characters, then s is unaltered.
    pure @trusted void toUpperInPlace(C)(ref C[] s)
    if (is(C == char) || is(C == wchar) || is(C == dchar));
    Converts s to uppercase (by performing Unicode uppercase mapping) in place. For a few characters the string length may increase after the transformation; in such a case the function reallocates exactly once. If s does not have any lowercase characters, then s is unaltered.
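    A minimal sketch of the in-place conversions follows. Both functions take the array by reference, so they need a mutable array variable rather than a string literal; upperCopy and lowerCopy are hypothetical wrappers added only for this illustration.

```d
import std.uni : toLowerInPlace, toUpperInPlace;

// Hypothetical wrapper: upper-cases a mutable copy of `input` in place.
string upperCopy(string input)
{
    char[] buf = input.dup;
    toUpperInPlace(buf); // `buf` may be reallocated if the length grows
    return buf.idup;
}

// Hypothetical wrapper: lower-cases a mutable copy of `input` in place.
string lowerCopy(string input)
{
    char[] buf = input.dup;
    toLowerInPlace(buf);
    return buf.idup;
}

void main()
{
    assert(upperCopy("Hello, Wörld") == "HELLO, WÖRLD");
    assert(lowerCopy("Hello, Wörld") == "hello, wörld");
    assert(lowerCopy("already lower") == "already lower");
}
```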
    pure nothrow @nogc @safe dchar toLower(dchar c);
    If c is a Unicode uppercase character, then its lowercase equivalent is returned. Otherwise c is returned.

    Warning: Certain alphabets like German and Greek have no 1:1 upper-lower mapping. Use the overload of toLower which takes a full string instead.

    @trusted ElementEncodingType!S[] toLower(S)(return scope S s)
    if (isSomeString!S);

    ElementEncodingType!S[] toLower(S)(S s)
    if (!isSomeString!S && (isRandomAccessRange!S && hasLength!S && hasSlicing!S && isSomeChar!(ElementType!S)));
    Creates a new array which is identical to s except that all of its characters are converted to lowercase (by performing Unicode lowercase mapping). If none of s's characters were affected, then s itself is returned if s is a string-like type.
    Parameters:
    S s: A random access range of characters
    Returns:
    An array with the same element type as s.
    pure nothrow @nogc @safe dchar toUpper(dchar c);
    If c is a Unicode lowercase character, then its uppercase equivalent is returned. Otherwise c is returned.

    Warning: Certain alphabets like German and Greek have no 1:1 upper-lower mapping. Use the overload of toUpper which takes a full string instead.

    toUpper can be used as an argument to std.algorithm.iteration.map to produce an algorithm that can convert a range of characters to upper case without allocating memory. A string can then be produced by using std.algorithm.mutation.copy to send it to a std.array.appender.

    Examples:
    import std.algorithm.iteration : map;
    import std.algorithm.mutation : copy;
    import std.array : appender;

    auto abuf = appender!(char[])();
    "hello".map!toUpper.copy(abuf);
    writeln(abuf.data); // "HELLO"
    @trusted ElementEncodingType!S[] toUpper(S)(return scope S s)
    if (isSomeString!S);

    ElementEncodingType!S[] toUpper(S)(S s)
    if (!isSomeString!S && (isRandomAccessRange!S && hasLength!S && hasSlicing!S && isSomeChar!(ElementType!S)));
    Allocates a new array which is identical to s except that all of its characters are converted to uppercase (by performing Unicode uppercase mapping). If none of s's characters were affected, then s itself is returned if s is a string-like type.
    Parameters:
    S s: A random access range of characters
    Returns:
    A new array with the same element type as s.
    pure nothrow @nogc @safe bool isAlpha(dchar c);
    Returns whether c is a Unicode alphabetic character (general Unicode category: Alphabetic).
    pure nothrow @nogc @safe bool isMark(dchar c);
    Returns whether c is a Unicode mark (general Unicode category: Mn, Me, Mc).
    pure nothrow @nogc @safe bool isNumber(dchar c);
    Returns whether c is a Unicode numerical character (general Unicode category: Nd, Nl, No).
    pure nothrow @nogc @safe bool isAlphaNum(dchar c);
    Returns whether c is a Unicode alphabetic character or number (general Unicode category: Alphabetic, Nd, Nl, No).
    Parameters:
    dchar c: any Unicode character
    Returns:
    true if the character is in the Alphabetic, Nd, Nl, or No Unicode categories
    pure nothrow @nogc @safe bool isPunctuation(dchar c);
    Returns whether c is a Unicode punctuation character (general Unicode category: Pd, Ps, Pe, Pc, Po, Pi, Pf).
    pure nothrow @nogc @safe bool isSymbol(dchar c);
    Returns whether c is a Unicode symbol character (general Unicode category: Sm, Sc, Sk, So).
    pure nothrow @nogc @safe bool isSpace(dchar c);
    Returns whether c is a Unicode space character (general Unicode category: Zs).

    Note: This doesn't include '\n', '\r', '\t' and other non-space characters. For the commonly used, less strict semantics see isWhite.
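    The difference between isSpace and isWhite noted above can be sketched as follows; countWhiteButNotSpace is a hypothetical helper used only to make the distinction testable.

```d
import std.uni : isSpace, isWhite;

// Hypothetical helper: counts code points that are whitespace in the
// broad isWhite sense but are not Zs space separators.
size_t countWhiteButNotSpace(string text)
{
    size_t n = 0;
    foreach (dchar c; text)
        if (isWhite(c) && !isSpace(c))
            ++n;
    return n;
}

void main()
{
    // Space separators (Zs) satisfy both predicates.
    assert(isSpace(' ') && isWhite(' '));
    assert(isSpace('\u00A0') && isWhite('\u00A0')); // no-break space

    // Line breaks and tabs are whitespace, but not Zs.
    assert(!isSpace('\n') && isWhite('\n'));
    assert(!isSpace('\t') && isWhite('\t'));
    assert(!isSpace('\u2028') && isWhite('\u2028')); // line separator (Zl)

    // Only the tab and the newline are counted here.
    assert(countWhiteButNotSpace("a b\tc\nd\u00A0e") == 2);
}
```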

    pure nothrow @nogc @safe bool isGraphical(dchar c);
    Returns whether c is a Unicode graphical character (general Unicode category: L, M, N, P, S, Zs).
    pure nothrow @nogc @safe bool isControl(dchar c);
    Returns whether c is a Unicode control character (general Unicode category: Cc).
    pure nothrow @nogc @safe bool isFormat(dchar c);
    Returns whether c is a Unicode formatting character (general Unicode category: Cf).
    pure nothrow @nogc @safe bool isPrivateUse(dchar c);
    Returns whether c is a Unicode Private Use code point (general Unicode category: Co).
    pure nothrow @nogc @safe bool isSurrogate(dchar c);
    Returns whether c is a Unicode surrogate code point (general Unicode category: Cs).
    pure nothrow @nogc @safe bool isSurrogateHi(dchar c);
    Returns whether c is a Unicode high surrogate (lead surrogate).
    pure nothrow @nogc @safe bool isSurrogateLo(dchar c);
    Returns whether c is a Unicode low surrogate (trail surrogate).
    pure nothrow @nogc @safe bool isNonCharacter(dchar c);
    Returns whether c is a Unicode non-character, i.e. a code point with no assigned abstract character (general Unicode category: Cn).
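    The classification predicates above can be combined into a simple dispatcher. The sketch below is illustrative only: describe is a hypothetical helper, and the order of the checks is a choice made here, since some sets overlap (for instance, Alphabetic includes letter numbers from Nl).

```d
import std.uni : isAlpha, isControl, isFormat, isMark, isNumber,
    isPrivateUse, isPunctuation, isSpace, isSurrogate, isSymbol;

// Hypothetical helper: names the first matching class of `c`, checking
// the predicates in a fixed, arbitrarily chosen order.
string describe(dchar c)
{
    if (isAlpha(c)) return "alphabetic";
    if (isNumber(c)) return "number";
    if (isMark(c)) return "mark";
    if (isPunctuation(c)) return "punctuation";
    if (isSymbol(c)) return "symbol";
    if (isSpace(c)) return "space";
    if (isControl(c)) return "control";
    if (isFormat(c)) return "format";
    if (isSurrogate(c)) return "surrogate";
    if (isPrivateUse(c)) return "private use";
    return "other";
}

void main()
{
    assert(describe('я') == "alphabetic");
    assert(describe('7') == "number");
    assert(describe('\u0301') == "mark"); // combining acute accent
    assert(describe('!') == "punctuation");
    assert(describe('+') == "symbol");
    assert(describe(' ') == "space");
    assert(describe('\n') == "control");
    assert(describe('\u00AD') == "format"); // soft hyphen
    assert(describe(cast(dchar) 0xD800) == "surrogate");
    assert(describe('\uE000') == "private use");
}
```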
    Copyright © 1999-2026 by the D Language Foundation | Page generated by Ddoc on Sat Feb 21 00:07:41 2026
