Unicode

The Unicode module provides essential functionality for managing Unicode characters and strings. It includes validation, category determination, normalization, case transformation, and grapheme segmentation, enabling effective Unicode data handling.

Unicode.julia_chartransform Function
Unicode.julia_chartransform(c::Union{Char,Integer})

Map the Unicode character (Char) or codepoint (Integer) c to the corresponding "equivalent" character or codepoint, respectively, according to the custom equivalence used within the Julia parser (in addition to NFC normalization).

For example, 'µ' (U+00B5 micro) is treated as equivalent to 'μ' (U+03BC mu) by Julia's parser, so julia_chartransform performs this transformation while leaving other characters unchanged:

julia> Unicode.julia_chartransform('µ')
'μ': Unicode U+03BC (category Ll: Letter, lowercase)

julia> Unicode.julia_chartransform('x')
'x': ASCII/Unicode U+0078 (category Ll: Letter, lowercase)

julia_chartransform is mainly useful for passing to the Unicode.normalize function in order to mimic the normalization used by the Julia parser:

julia> s = "µö"
"µö"

julia> s2 = Unicode.normalize(s, compose=true, stable=true, chartransform=Unicode.julia_chartransform)
"μö"

julia> collect(s2)
2-element Vector{Char}:
 'μ': Unicode U+03BC (category Ll: Letter, lowercase)
 'ö': Unicode U+00F6 (category Ll: Letter, lowercase)

julia> s2 == string(Meta.parse(s))
true
Julia 1.8

This function was introduced in Julia 1.8.
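One possible use is checking whether two identifier spellings normalize to the same name. The sketch below wraps the normalization call shown above in two helpers; `parser_normalize` and `same_identifier` are hypothetical names introduced here for illustration and are not part of the Unicode module. It assumes Julia 1.8 or later.

```julia
using Unicode

# Hypothetical helper: NFC normalization plus Julia's custom character
# equivalences, using the same keywords as the example above.
parser_normalize(s::AbstractString) =
    Unicode.normalize(s, compose=true, stable=true,
                      chartransform=Unicode.julia_chartransform)

# Hypothetical helper: true if both spellings normalize to the same string.
same_identifier(a::AbstractString, b::AbstractString) =
    parser_normalize(a) == parser_normalize(b)

micro = "\u00b5m"   # starts with U+00B5 (micro sign)
mu    = "\u03bcm"   # starts with U+03BC (Greek small letter mu)

same_identifier(micro, mu)   # true: equal after normalization
micro == mu                  # false: the raw strings differ
```

Writing the two spellings with `\u` escapes keeps the example unambiguous even if an editor or copy-paste step silently normalizes the text.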

Unicode.isassigned Function
Unicode.isassigned(c) -> Bool

Return true if the given char or integer is an assigned Unicode code point.

Examples

julia> Unicode.isassigned(101)
true

julia> Unicode.isassigned('\x01')
true
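As a rough illustration, the check can be used to flag suspicious input before further processing. The helper name `unassigned_chars` is hypothetical, and the example assumes U+0378 is still unassigned in the Unicode version bundled with the running Julia (it is unassigned as of Unicode 16).

```julia
using Unicode

# Hypothetical helper: collect the characters of `s` that are not assigned
# code points. The function is called with its module prefix here.
unassigned_chars(s::AbstractString) = [c for c in s if !Unicode.isassigned(c)]

unassigned_chars("abc")          # empty: all three letters are assigned
unassigned_chars("abc\u0378")    # one element, assuming U+0378 is unassigned
Unicode.isassigned(0x61)         # true: 'a' given as an integer code point
```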
Unicode.isequal_normalized Function
isequal_normalized(s1::AbstractString, s2::AbstractString; casefold=false, stripmark=false, chartransform=identity)

Return whether s1 and s2 are canonically equivalent Unicode strings. If casefold=true, case is ignored (Unicode case-folding is performed); if stripmark=true, diacritical marks and other combining characters are stripped.

As with Unicode.normalize, you can also pass an arbitrary function via the chartransform keyword (mapping Integer codepoints to codepoints) to perform custom normalizations, such as Unicode.julia_chartransform.

Julia 1.8

The isequal_normalized function was added in Julia 1.8.

Examples

For example, the string "noël" can be constructed in two canonically equivalent ways in Unicode, depending on whether "ë" is formed from a single codepoint U+00EB or from the ASCII character 'e' followed by the U+0308 combining-diaeresis character.

julia> s1 = "no\u00EBl"
"noël"

julia> s2 = "noe\u0308l"
"noël"

julia> s1 == s2
false

julia> isequal_normalized(s1, s2)
true

julia> isequal_normalized(s1, "noel", stripmark=true)
true

julia> isequal_normalized(s1, "NOËL", casefold=true)
true
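A small sketch of how this comparison might be used for forgiving lookups, where composed and decomposed spellings, case, and accents should all match. The helper `find_loose` and the sample data are hypothetical; Julia 1.8 or later is assumed.

```julia
using Unicode

# Hypothetical helper: index of the first entry equal to `query` up to
# canonical equivalence, case, and diacritical marks (or `nothing`).
find_loose(entries, query::AbstractString) =
    findfirst(e -> isequal_normalized(e, query; casefold=true, stripmark=true), entries)

# Mixed spellings: the first two use 'e' + U+0308, the third uses U+00E9.
people = ["Zoe\u0308", "Noe\u0308l", "Ren\u00e9e"]

find_loose(people, "noel")    # 2
find_loose(people, "RENEE")   # 3
find_loose(people, "chloe")   # nothing
```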
Unicode.normalize Function
Unicode.normalize(s::AbstractString; keywords...)
Unicode.normalize(s::AbstractString, normalform::Symbol)

Normalize the string s. By default, canonical composition (compose=true) is performed without ensuring Unicode versioning stability (stable=false), which produces the shortest possible equivalent string but may introduce composition characters not present in earlier Unicode versions.

Alternatively, one of the four "normal forms" of the Unicode standard can be specified: normalform can be :NFC, :NFD, :NFKC, or :NFKD. Normal forms C (canonical composition) and D (canonical decomposition) convert different visually identical representations of the same abstract string into a single canonical form, with form C being more compact. Normal forms KC and KD additionally canonicalize "compatibility equivalents": they convert characters that are abstractly similar but visually distinct into a single canonical choice (e.g. they expand ligatures into the individual characters), with form KC being more compact.
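To make the difference between the forms concrete, the following sketch normalizes one string that contains both a ligature and a combining accent. The sample string is an assumption chosen for illustration.

```julia
using Unicode

# U+FB01 (fi ligature) + "anc" + 'e' + U+0301 (combining acute accent)
s = "\ufb01ance\u0301"

nfc  = Unicode.normalize(s, :NFC)    # composes e + U+0301 into U+00E9, keeps the ligature
nfd  = Unicode.normalize(s, :NFD)    # keeps e and U+0301 separate, keeps the ligature
nfkc = Unicode.normalize(s, :NFKC)   # composes and expands the ligature into "fi"

length(nfc), length(nfd), length(nfkc)   # (5, 6, 6) code points
```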

Alternatively, finer control and additional transformations may be obtained by calling Unicode.normalize(s; keywords...), where any number of the following boolean keyword options (which all default to false except for compose) are specified:

  • compose=false: do not perform canonical composition
  • decompose=true: do canonical decomposition instead of canonical composition (compose=true is ignored if present)
  • compat=true: compatibility equivalents are canonicalized
  • casefold=true: perform Unicode case folding, e.g. for case-insensitive string comparison
  • newline2lf=true, newline2ls=true, or newline2ps=true: convert various newline sequences (LF, CRLF, CR, NEL) into a linefeed (LF), line-separation (LS), or paragraph-separation (PS) character, respectively
  • stripmark=true: strip diacritical marks (e.g. accents)
  • stripignore=true: strip Unicode's "default ignorable" characters (e.g. the soft hyphen or the left-to-right marker)
  • stripcc=true: strip control characters; horizontal tabs and form feeds are converted to spaces; newlines are also converted to spaces unless a newline-conversion flag was specified
  • rejectna=true: throw an error if unassigned code points are found
  • stable=true: enforce Unicode versioning stability (never introduce characters missing from earlier Unicode versions)
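A minimal sketch of combining some of these keywords, assuming the goal is a key for case- and accent-insensitive matching plus newline cleanup. The helper name `search_key` is hypothetical.

```julia
using Unicode

# Hypothetical helper: fold case and strip accents in one pass.
# compose=true is the default, which stripmark=true relies on.
search_key(s::AbstractString) =
    Unicode.normalize(s, casefold=true, stripmark=true)

search_key("Cr\u00e8me Br\u00fbl\u00e9e")          # "creme brulee"

# Convert mixed newline conventions (CRLF, CR) to plain linefeeds.
Unicode.normalize("a\r\nb\rc", newline2lf=true)    # "a\nb\nc"
```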

You can also use the chartransform keyword (which defaults to identity) to pass an arbitrary function mapping Integer codepoints to codepoints, which is called on each character in s as it is processed, in order to perform arbitrary additional normalizations. For example, by passing chartransform=Unicode.julia_chartransform, you can apply a few Julia-specific character normalizations that are performed by Julia when parsing identifiers (in addition to NFC normalization: compose=true, stable=true).
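As an illustration of a user-supplied mapping, the sketch below replaces the right single quotation mark with an ASCII apostrophe during normalization. The function `straighten_quote` is a hypothetical example, not something provided by the module; Julia 1.8 or later is assumed.

```julia
using Unicode

# Hypothetical custom mapping: U+2019 (right single quotation mark) becomes
# U+0027 (ASCII apostrophe); every other code point is returned unchanged.
# oftype keeps the return type identical to the input integer type.
straighten_quote(c::Integer) = c == 0x2019 ? oftype(c, 0x27) : c

Unicode.normalize("it\u2019s", chartransform=straighten_quote)   # "it's"
```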

For example, NFKC corresponds to the options compose=true, compat=true, stable=true.
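The correspondence between normal forms and keywords can be checked directly. This sketch compares the symbol and keyword spellings for NFKC and NFC on an arbitrary sample string.

```julia
using Unicode

# Sample input: fi ligature, combining accent, and micro sign.
s = "\ufb01ance\u0301 \u00b5"

# NFKC written as keywords, as stated above.
nfkc_matches = Unicode.normalize(s, :NFKC) ==
               Unicode.normalize(s, compose=true, compat=true, stable=true)

# NFC written as keywords (compose=true, stable=true).
nfc_matches = Unicode.normalize(s, :NFC) ==
              Unicode.normalize(s, compose=true, stable=true)
```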

Examples

julia> "\u00e9" == Unicode.normalize("e\u0301") # LHS: Unicode U+00e9, RHS: U+0065 & U+0301
true

julia> "\u03bc" == Unicode.normalize("\u00b5", compat=true) # LHS: Unicode U+03bc, RHS: Unicode U+00b5
true

julia> Unicode.normalize("JuLiA", casefold=true)
"julia"

julia> Unicode.normalize("JúLiA", stripmark=true)
"JuLiA"
Julia 1.8

The chartransform keyword argument requires Julia 1.8.

Unicode.graphemes Function
graphemes(s::AbstractString) -> GraphemeIterator

Return an iterator over substrings of s that correspond to the extended graphemes in the string, as defined by Unicode UAX #29. (Roughly, these are what users would perceive as single characters, even though they may contain more than one codepoint; for example a letter combined with an accent mark is a single grapheme.)
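A short sketch of the difference between counting codepoints and counting graphemes, and of why iterating by grapheme matters when rearranging text. The sample string is an assumption for illustration.

```julia
using Unicode

s = "expose\u0301"        # final "é" written as 'e' + U+0301

length(s)                 # 7 codepoints (Chars)
length(graphemes(s))      # 6 user-perceived characters

# Reversing by Char would detach the accent from its letter;
# reversing by grapheme keeps each cluster intact.
reversed = join(reverse(collect(graphemes(s))))
```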

graphemes(s::AbstractString, m:n) -> SubString

Returns a SubString of s consisting of the m-th through n-th graphemes of the string s, where the second argument m:n is an integer-valued AbstractUnitRange.

Loosely speaking, this corresponds to the m:n-th user-perceived "characters" in the string. For example:

julia> s = graphemes("expose\u0301", 3:6)
"posé"

julia> collect(s)
5-element Vector{Char}:
 'p': ASCII/Unicode U+0070 (category Ll: Letter, lowercase)
 'o': ASCII/Unicode U+006F (category Ll: Letter, lowercase)
 's': ASCII/Unicode U+0073 (category Ll: Letter, lowercase)
 'e': ASCII/Unicode U+0065 (category Ll: Letter, lowercase)
 '́': Unicode U+0301 (category Mn: Mark, nonspacing)

This consists of the 3rd to 7th codepoints (Chars) in "exposé", because the grapheme "é" is actually two Unicode codepoints (an 'e' followed by an acute-accent combining character U+0301).

Because finding grapheme boundaries requires iteration over the string contents, the graphemes(s, m:n) function requires time proportional to the length of the string (number of codepoints) before the end of the substring.

Julia 1.9

The m:n argument of graphemes requires Julia 1.9.
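One way the range form might be used is to shorten text for display without splitting a letter from its accent. The helper `truncate_graphemes` is a hypothetical name; the sketch assumes Julia 1.9 or later and guards the edge cases itself rather than relying on how graphemes treats empty or out-of-range arguments.

```julia
using Unicode

# Hypothetical helper: keep at most `n` graphemes of `s`, returning a SubString.
function truncate_graphemes(s::AbstractString, n::Integer)
    n < 1 && return SubString(s, 1, 0)          # empty result
    n >= length(graphemes(s)) && return SubString(s)
    return graphemes(s, 1:n)
end

word = "e\u0301te\u0301"          # "été" with both accents as combining marks

truncate_graphemes(word, 1)       # "é" (two codepoints, one grapheme)
first(word, 1)                    # "e": counting Chars drops the accent
```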


This document was generated with Documenter.jl version 1.8.0 on Wednesday 9 July 2025. Using Julia version 1.11.6.
