[Unicode]  Technical Reports
 

Unicode Standard Annex #44

Unicode Character Database

Version: Unicode 5.2.0
Authors: Mark Davis (markdavis@google.com) and Ken Whistler (ken@unicode.org)
Date: 2009-09-24
This Version: http://www.unicode.org/reports/tr44/tr44-4.html
Previous Version: http://www.unicode.org/reports/tr44/tr44-2.html
Latest Version: http://www.unicode.org/reports/tr44/
Latest Proposed Update: http://www.unicode.org/reports/tr44/proposed.html
Revision: 4

Summary

This annex provides the core documentation for the Unicode Character Database (UCD). It describes the layout and organization of the Unicode Character Database and how it specifies the formal definitions of the Unicode Character Properties.

Status

This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.

A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published online as a separate document. The Unicode Standard may require conformance to normative content in a Unicode Standard Annex, if so specified in the Conformance chapter of that version of the Unicode Standard. The version number of a UAX document corresponds to the version of the Unicode Standard of which it forms a part.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this annex is found in Unicode Standard Annex #41, “Common References for Unicode Standard Annexes.” For the latest version of the Unicode Standard, see [Unicode]. For a list of current Unicode Technical Reports, see [Reports]. For more information about versions of the Unicode Standard, see [Versions]. For any errata which may apply to this annex, see [Errata].


Note: the information in this annex is not intended as an exhaustive description of the use and interpretation of Unicode character properties and behavior. It must be used in conjunction with the data in the other files in the Unicode Character Database, and relies on the notation and definitions supplied in The Unicode Standard. All chapter references are to Version 5.2.0 of the standard unless otherwise indicated.

1 Introduction

The Unicode Standard is far more than a simple encoding of characters. The standard also associates a rich set of semantics with each encoded character—properties that are required for interoperability and correct behavior in implementations, as well as for Unicode conformance. These semantics are cataloged in the Unicode Character Database (UCD), a collection of data files which contain the Unicode character code points and character names. The data files define the Unicode character properties and mappings between Unicode characters (such as case mappings).

This annex describes the UCD and provides a guide to the various documentation files associated with it. Additional information about character properties and their use is contained in the Unicode Standard and its annexes. In particular, implementers should familiarize themselves with the formal definitions and conformance requirements for properties detailed in Section 3.5, "Properties" in [Unicode] and with the material in Chapter 4, "Character Properties" in [Unicode].

The latest version of the UCD is always located on the Unicode Web site at:

http://www.unicode.org/Public/UNIDATA/

The specific files for the UCD associated with this version of the Unicode Standard (5.2.0) are located at:

http://www.unicode.org/Public/5.2.0/

Stable, archived versions of the UCD associated with all earlier versions of the Unicode Standard can be accessed from:

http://www.unicode.org/ucd/

For a description of the changes in the UCD for this version and earlier versions, see the UCD Change History.

2 Conformance

The Unicode Character Database is an integral part of the Unicode Standard.

The UCD contains normative property and mapping information required for implementation of various Unicode algorithms such as the Unicode Bidirectional Algorithm, Unicode Normalization, and Unicode Casefolding. The data files also contain additional informative and provisional character property information.

Each specification of a Unicode algorithm, whether specified in the text of [Unicode] or in one of the Unicode Standard Annexes, designates which data file(s) in the UCD are needed to provide normative property information required by that algorithm.

For information on the meaning and application of the terms normative, informative, and provisional, see Section 3.5, "Properties" in [Unicode].

For information about the applicable terms of use for the UCD, see the Unicode Terms of Use.

2.1 Simple and Derived Properties

Some character properties in the UCD are simple properties. This status has no bearing on whether or not the properties are normative, but merely indicates that their values are not derived from some combination of other properties.

Other character properties are derived. This means that their values are derived by rule from some other combination of properties. Generally such rules are stated as set operations, and may or may not include explicit exception lists for individual characters.

Certain simple properties are defined merely to make the statement of the rule defining a derived property more compact or general. Such properties are known as contributory properties. Sometimes these contributory properties are defined to encapsulate the messiness inherent in exception lists. At other times, a contributory property may be defined to help stabilize the definition of an important derived property which is subject to stability guarantees.

Derived character properties are not considered second-class citizens among Unicode character properties. They are defined to make implementation of important algorithms easier to state. Included among the first-class derived properties important for such implementations are: Uppercase, Lowercase, XID_Start, XID_Continue, Math, and Default_Ignorable_Code_Point, all defined in DerivedCoreProperties.txt, as well as derived properties for the optimization of normalization, defined in DerivedNormalizationProps.txt.

Implementations should simply use the derived properties, and should not try to rederive them from lists of simple properties and collections of rules, because of the chances for error and divergence when doing so.

Definitions of property derivations are provided for information only, typically in comment fields in the data files. Such definitions may be refactored, refined, or corrected over time.

If there are any cases of mismatches between the definition of a derived property as listed in DerivedCoreProperties.txt or similar data files in the UCD, and the definition of a derived property as a set definition rule, the explicit listing in the data file should always be taken as the normative definition of the property. As described in Stability of Releases, the property listing in the data files for any given version of the standard will never change for that version.
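As an illustration of using a derived property directly from its listing rather than rederiving it, the following minimal sketch reads lines laid out as "code point or range ; property name # comment". The helper name parse_property_listing and the two sample lines are assumptions made for this example; a real implementation would read the released data file itself.

```python
from collections import defaultdict

# Two illustrative lines in the style of DerivedCoreProperties.txt
# (sample data for this sketch, not a copy of the released file).
SAMPLE = """\
# comment-only lines and blank lines are skipped

0041..005A    ; Uppercase # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00B5          ; Lowercase # L&       MICRO SIGN
"""

def parse_property_listing(text):
    """Return {property name: set of code points} from a listing."""
    members = defaultdict(set)
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        code_points, prop = fields[0], fields[1]
        first, _, last = code_points.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start  # single code point or range
        members[prop].update(range(start, end + 1))
    return members

props = parse_property_listing(SAMPLE)
```

Membership tests such as `ord("A") in props["Uppercase"]` then rely only on the explicit listing, which is the normative definition.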

2.2 Use of Default Values

Unicode character properties have default values. Default values are the value or values that a character property takes for an unassigned code point, or in some instances, for designated subranges of code points, whether assigned or unassigned. For example, the default value of a binary Unicode character property is always "N".

For the formal discussion of default values, see D26 in Section 3.5, "Properties" in [Unicode]. For conventions related to default values in various data files of the UCD and for documentation regarding the particular default values of individual Unicode character properties, see Default Values.

2.3 Stability of Releases

Just as for the Unicode Standard as a whole, each version of the UCD, once published, is absolutely stable and will never change. Each released version is archived in a directory on the Unicode Web site, with a directory number associated with that version. URLs pointing to that version's directory are also stable and will be maintained in perpetuity.

Any errors discovered for a released version of the UCD are noted in [Errata], and if appropriate will be corrected in a subsequent version of the UCD.

Stability guarantees constraining how Unicode character properties can (or cannot) change between releases of the UCD are documented in the Unicode Consortium Stability Policies [Stability].

2.3.1 Changes to Properties Between Releases

Updates to character properties in the Unicode Character Database may be required for any of three reasons:

  1. To cover new characters added to the standard
  2. To add new character properties to the standard
  3. To change the assigned values for a property for some characters already in the standard

While the Unicode Consortium endeavors to keep the values of all character properties as stable as possible between versions, occasionally circumstances may arise which require changing them. In particular, as less well-documented scripts, such as those for minority languages, or historic scripts are added to the standard, the exact character properties and behavior may not fully be known when the script is first encoded. The properties for some of these characters may change as further information becomes available or as implementations turn up problems in the initial property assignments. As far as possible, any readjustment of property values based on growing implementation experience is made to be compatible with established practice.

Occasionally, a character property value is changed to prevent incorrect generalizations about a character's use based on its nominal property values. For example, U+200B ZERO WIDTH SPACE was originally classified as a space character (General_Category=Zs), but it was reclassified as a Format character (General_Category=Cf) to clearly distinguish it from space characters in its function as a format control for line breaking.
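The reclassification can be observed with Python's standard unicodedata module, shown here only as a convenient way to query General_Category. The module reports values from whichever UCD version the Python build ships, so the result reflects that version rather than Version 5.2.0 specifically.

```python
import unicodedata

zwsp = "\u200b"   # ZERO WIDTH SPACE
space = "\u0020"  # SPACE

# General_Category as short aliases, for example "Zs" or "Cf".
zwsp_category = unicodedata.category(zwsp)
space_category = unicodedata.category(space)
print(zwsp_category, space_category)
```

Code that treats every Zs character as a space therefore no longer picks up U+200B, which is the generalization the change was meant to prevent.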

There is no guarantee that a particular value for an enumerated property will actually have characters associated with it. Also, because of changes in property value assignments between versions of the standard, a property value that once had characters associated with it may later have none. Such conditions and changes are rare, but implementations must not assume that all property values are associated with non-null sets of characters. For example, currently the special Script property value Katakana_Or_Hiragana has no characters associated with it.

2.3.2 Obsolete Properties

In some instances an entire property may become obsolete. For example, the ISO_Comment property was once used to keep track of annotations for characters used in the production of name lists for ISO/IEC 10646 code charts. As of Unicode 5.2.0 that property became obsolete, and its value is now defaulted to the null string for all Unicode code points.

An obsolete property is never removed from the UCD.

2.3.3 Deprecated Properties

Occasionally an obsolete property may also be formally deprecated. This is an indication that the property is no longer recommended for use, perhaps because its original intent has been replaced by another property or because its specification was somehow defective. For example, the Grapheme_Link property is deprecated. See also the general discussion of Deprecation.

A deprecated property is never removed from the UCD.

2.3.4 Stabilized Properties

Another possibility is that an obsolete property may be declared to be stabilized. Such a determination does not indicate that the property should or should not be used; instead it is a declaration that the UTC will no longer actively maintain the property or extend it for newly encoded characters. The property values of a stabilized property are frozen as of a particular release of the standard. For example, the Hyphen property was stabilized as of Version 4.0.0.

A stabilized property is never removed from the UCD.

3 Documentation

This annex provides the core documentation for the UCD, but additional information about character properties is available in other parts of the standard and in additional documentation files contained within the UCD.

3.1 Character Properties in the Standard

The formal definitions related to character properties used by the Unicode Standard are documented in Section 3.5, "Properties" in [Unicode]. Understanding those definitions and related terminology is essential to the appropriate use of Unicode character properties.

See Section 4.1, "Unicode Character Database", in [Unicode] for a general discussion of the UCD and its use in defining properties. The rest of Chapter 4 provides important explanations regarding the meaning and use of various normative character properties.

3.2 The Character Property Model

For a general discussion of the property model which underlies the definitions associated with the UCD, see UTR #23: The Unicode Character Property Model [UTR23]. That technical report is informative, but over the years various content from it has been incorporated into normative portions of the Unicode Standard, particularly for the definitions in Chapter 3.

UTR #23 also discusses string functions and their relation to character properties.

3.3 NamesList.html

NamesList.html formally describes the format of the NamesList.txt data file in BNF. That data file is used to drive the printing of the Unicode code charts and names list. See also Section 17.1, "Character Names List", in [Unicode] for a detailed discussion of the conventions used in the Unicode names list as formatted for printing.

3.4 StandardizedVariants.html

StandardizedVariants.html documents standardized variants, showing a representative glyph for each. It is closely tied to the data file, StandardizedVariants.txt, which defines those sequences normatively.

3.5 Unihan and UAX #38

UAX #38, Unicode Han Database (Unihan) [UAX38] describes the format and content of the Unihan Database, which collects together all property information for CJK Unified Ideographs. That annex also specifies in detail which of the Unihan character properties are normative, informative, or provisional.

The Unihan Database contains extensive and detailed mapping information for CJK Unified Ideographs encoded in the Unicode Standard, but it is aimed only at those ideographs, not at other characters used in the East Asian context in general. In contrast, East Asian legacy character sets, including important commercial and national character set standards, contain many non-CJK characters. As a result, the Unihan Database must be supplemented from other sources to establish mapping tables for those character sets.

The majority of the content of the Unihan Database is released for each version of the Unicode Standard as a collection of Unihan data files in the UCD. Because of their large size, these data files are released only as a zipped file, Unihan.zip. The details of the particular data files in Unihan.zip and the CJK properties each one contains are provided in [UAX38]. For versions of the UCD prior to Version 5.2.0, all of the CJK properties were listed together in a very large, single file, Unihan.txt.

3.6 Data File Comments

In addition to the specific documentation files for the UCD, individual data files often contain extensive header comments describing their content and any special conventions used in the data.

In some instances, individual property definition sections also contain comments with information about how the property may be derived. Such comments are informative; while they are intended to convey the intent of the derivation, in case of any mismatch between a statement of a derivation in a comment field and the actual listing of the derived property, it is the list which is to be taken as normative. See Simple and Derived Properties.

3.7 Obsolete Documentation Files

UCD.html was formerly the primary documentation file for the UCD. As of Version 5.2.0, its content has been wholly incorporated into this document.

Unihan.html was formerly the primary documentation file for the Unihan Database. As of Version 5.1.0, its content has been wholly incorporated into [UAX38].

Versions of the Unicode Standard prior to Version 4.0.0 contained small, focussed documentation files, UnicodeCharacterDatabase.html, PropList.html, and DerivedProperties.html, which were later consolidated into UCD.html.

4 UCD Files

The heart of the UCD consists of the data files themselves. This section describes the directory structure for the UCD, the format conventions for the data files, and provides documentation for data files not documented elsewhere in this annex.

4.1 Directory Structure

Each version of the UCD is released in a separate, numbered directory under the Public directory on the Unicode Web site. The content of that directory is complete for that release. It is also stable: once released, it will be archived permanently in that directory, unchanged, at a stable URL.

The specific files for the UCD associated with this version of the Unicode Standard (5.2.0) are located at:

http://www.unicode.org/Public/5.2.0/

4.1.1 UCD Files Proper

The UCD proper is located in the ucd subdirectory of the numbered version directory. That directory contains all of the documentation files and most of the data files for the UCD, including some data files for derived properties.

Although all UCD data files are version-specific for a release and most contain internal date and version stamps, the file names of the released data files do not differ from version to version. When linking to a version-specific data file, the version will be indicated by the version number of the directory for the release.

All files for derived extracted properties are in the extracted subdirectory of the ucd subdirectory. See Derived Extracted Properties for documentation regarding those data files and their content.

A number of auxiliary properties are specified in files in the auxiliary subdirectory of the ucd subdirectory. In Version 5.2.0 it contains data files specifying properties associated with UAX #29, Unicode Text Segmentation [UAX29] and with UAX #14, Unicode Line Breaking Algorithm [UAX14], as well as test data for those algorithms. See Segmentation Test Files and Documentation for more information about the test data.
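Because file names stay the same across releases and the version is carried by the directory, a version-specific link can be assembled mechanically. The sketch below assumes the layout described in this section (numbered version directory, ucd subdirectory, optional extracted or auxiliary subdirectory); the function name ucd_file_url is hypothetical, and the layout does not hold for releases earlier than Version 4.1.0.

```python
BASE = "http://www.unicode.org/Public"

def ucd_file_url(version, file_name, subdirectory=""):
    """Build the URL of a UCD file for a released version.

    Assumes the directory layout used for Version 4.1.0 and later.
    subdirectory is "" for the ucd directory itself, or a name such
    as "extracted" or "auxiliary".
    """
    parts = [BASE, version, "ucd"]
    if subdirectory:
        parts.append(subdirectory)
    parts.append(file_name)
    return "/".join(parts)

url = ucd_file_url("5.2.0", "PropList.txt")
print(url)
```

Changing only the version argument yields the link to the same file in another release.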

4.1.2 UCD XML Files

The XML version of the UCD is located in the ucdxml subdirectory of the numbered version directory. See the UCD in XML for more details.

4.1.3 Charts

The code charts specific to a version of Unicode are archived as a single large PDF file in the charts subdirectory of the numbered version directory. See the readme.txt in that subdirectory and the general web page explaining the Unicode Code Charts for more details.

4.1.4 Beta Review Considerations

Prior to the formal release for any particular version of the UCD, a beta review is conducted. The beta review files are located in the same directory that is later used for the released UCD, but during the beta review period, the subdirectory structure differs somewhat and may contain temporary files, including documentation of diffs between deltas for the beta review. Also, during the beta review, all data file names are suffixed with version numbers and delta numbers. So a typical file name during beta review may be "PropList-5.2.0d13.txt" instead of the finally released "PropList.txt".
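Tools that process beta review files may need to map a beta file name back to its released name. The following sketch assumes that beta names always follow the pattern of the example above (base name, hyphen, three-part version, "d" plus a delta number, then the extension); the function name released_name is hypothetical.

```python
import re

# Pattern for beta-review names such as "PropList-5.2.0d13.txt":
# base name, hyphen, version, "d" plus delta number, extension.
BETA_NAME = re.compile(
    r"^(?P<base>.+)-(?P<version>\d+\.\d+\.\d+)d(?P<delta>\d+)(?P<ext>\.\w+)$"
)

def released_name(file_name):
    """Return (released file name, version, delta) for a beta file name.

    Names that do not match the beta pattern are returned unchanged,
    with None for version and delta.
    """
    match = BETA_NAME.match(file_name)
    if match is None:
        return file_name, None, None
    name = match.group("base") + match.group("ext")
    return name, match.group("version"), int(match.group("delta"))

result = released_name("PropList-5.2.0d13.txt")
print(result)
```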

Notices contained in a ReadMe.txt file in the UCD directory during the beta review period also make it clear that that directory contains preliminary material under review, rather than a final, stable release.

4.1.5 File Directory Differences for Early Releases

The UCD in XML was introduced in Version 5.1.0, so UCD directories prior to that do not contain the ucdxml subdirectory.

UCD directories prior to Version 4.1.0 do not contain the auxiliary subdirectory.

UCD directories prior to Version 3.2.0 do not contain the extracted subdirectory.

The general structure of the file directory for a released version of the UCD described above applies to Versions 4.1.0 and later. Prior to Version 4.1.0, versions of the UCD were not self-contained, complete sets of data files for that version, but instead only contained any new data files or any data files which had changed since the prior release.

Because of this, the property files for a given version prior to Version 4.1.0 can be spread over several directories. Consult the component listings at Enumerated Versions to find out which files in which directories comprise a complete set of data files for that version.

The directory naming conventions and the file naming conventions also differed prior to Version 4.1.0. So, for example, Version 4.0.0 of the UCD is contained in a directory named 4.0-Update, and Version 4.0.1 of the UCD in a directory named 4.0-Update1. Furthermore, for these earlier versions, the data file names do contain explicit version numbers.

4.2 File Format Conventions

Files in the UCD use the format conventions described in this section, unless otherwise specified.

4.2.1 Data Fields

4.2.2 Code Points and Sequences

4.2.3 Code Point Ranges

4.2.4 Comments

4.2.5 Code Point Labels

Table 1. Code Point Label Tags

| Tag | General_Category | Note |
|---|---|---|
| reserved | Cn | Noncharacter_Code_Point=F |
| noncharacter | Cn | Noncharacter_Code_Point=T |
| control | Cc | |
| private-use | Co | |
| surrogate | Cs | |
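The mapping in Table 1 can be sketched as a small function. Several points here are assumptions rather than statements from this section: the label syntax "<tag-XXXX>" with the code point in uppercase hexadecimal, the function names, and the use of Python's unicodedata module (which reports General_Category for the UCD version shipped with that Python build). The noncharacter test uses the fixed set U+FDD0..U+FDEF plus the last two code points of each plane.

```python
import unicodedata

# Tag for each General_Category value listed in Table 1.
# Cn is split by Noncharacter_Code_Point and handled separately.
TAG_BY_CATEGORY = {"Cc": "control", "Co": "private-use", "Cs": "surrogate"}

def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def code_point_label(cp):
    """Return a label such as "<control-0009>", or None if no tag applies.

    The "<tag-XXXX>" syntax is an assumption of this sketch.
    """
    category = unicodedata.category(chr(cp))
    if category == "Cn":
        tag = "noncharacter" if is_noncharacter(cp) else "reserved"
    else:
        tag = TAG_BY_CATEGORY.get(category)
    return None if tag is None else f"<{tag}-{cp:04X}>"

print(code_point_label(0x0009), code_point_label(0xFFFE))
```

Which code points come out as reserved depends on the UCD version behind the lookup, because unassigned code points become assigned in later versions.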

4.2.6 Multiple Values

4.2.7 Binary Property Values

4.2.8 Default Values

Default values for common catalog, enumeration, and numeric properties are listed in Table 2.

Table 2. Default Values for Properties

| Property Name | Default Value |
|---|---|
| Age | unassigned |
| Bidi_Class | L, AL, R |
| Block | No_Block |
| Canonical_Combining_Class | Not_Reordered (= 0) |
| Decomposition_Type | None |
| East_Asian_Width | Neutral (= N), Wide (= W) |
| General_Category | Cn |
| Numeric_Type | None |
| Numeric_Value | NaN |
| Script | Unknown (= Zzzz) |

Default values for the Unicode character property Bidi_Class are complex. See UAX #9, The Unicode Bidirectional Algorithm [UAX9] and DerivedBidiClass.txt for more details.

Default values for the East_Asian_Width property are also complex. This property defaults to Neutral for most code points, but defaults to Wide for unassigned code points in blocks associated with CJK ideographs. See UAX #11, East Asian Width [UAX11] and DerivedEastAsianWidth.txt for more details.
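One way an implementation might apply the simple defaults from Table 2 is to store only explicitly listed values and fall back to the default on lookup. In this sketch the function property_value and the explicit mapping are hypothetical; Bidi_Class and East_Asian_Width are deliberately left out because their defaults vary by code point.

```python
# Simple defaults copied from Table 2. Bidi_Class and East_Asian_Width
# are omitted because their defaults depend on the code point.
DEFAULTS = {
    "Age": "unassigned",
    "Block": "No_Block",
    "Canonical_Combining_Class": 0,  # Not_Reordered
    "Decomposition_Type": "None",
    "General_Category": "Cn",
    "Numeric_Type": "None",
    "Numeric_Value": "NaN",
    "Script": "Zzzz",  # Unknown
}

def property_value(explicit, prop, cp):
    """Look up prop for cp, falling back to the property's default.

    explicit is a hypothetical mapping {property: {code point: value}}
    holding only the values listed explicitly in the data files.
    """
    return explicit.get(prop, {}).get(cp, DEFAULTS[prop])

# Illustrative data: one explicit Script assignment.
explicit = {"Script": {0x0041: "Latn"}}
print(property_value(explicit, "Script", 0x0041))
print(property_value(explicit, "Script", 0x0378))
```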

4.2.9 Text Encoding

4.2.10 Line Termination

4.2.11 Other Conventions

4.2.12 Other File Formats

4.3 File List

The exact list of files associated with any particular version of the UCD is available on the Unicode Web site by referring to the component listings at Enumerated Versions.

The majority of the data files in the UCD provide specifications of character properties for Unicode characters. Those files and their contents are documented in detail in the Property Table section below.

The data files in the extracted subdirectory constitute reformatted listings of single character properties extracted from UnicodeData.txt or other primary data files. The reformatting is provided to make it easier to see the particular set of characters having certain values for enumerated properties, or to separate the statement of that property from other properties defined together in UnicodeData.txt. These extracted, derived data files are further documented in the Derived Extracted Properties section below.

The UCD also contains a number of test data files, whose purpose is to provide standard test cases useful in verifying the implementation of complex Unicode algorithms. See the Test Files section below for more documentation.

The remaining files in the Unicode Character Database do not directly specify Unicode properties. The important ones and their functions are listed in Table 3. The Status column indicates whether the file (and its content) is considered Normative (N), Informative (I), or Provisional (P).

Table 3. Files in the UCD

| File Name | Reference | Status | Description |
|---|---|---|---|
| CJKRadicals.txt | [UAX38] | I | List of Unified CJK Ideographs and CJK Radicals that correspond to specific radical numbers used in the CJK radical stroke counts. |
| Index.txt | Chapter 17 | I | Index to Unicode characters, as printed in the Unicode Standard. |
| NamesList.txt | Chapter 17 | I | Names list used for production of the code charts, derived from UnicodeData.txt. It contains additional annotations. |
| NamesList.html | Chapter 17 | I | Documents the format of NamesList.txt. |
| StandardizedVariants.txt | Chapter 16 | N | Lists all the standardized variant sequences that have been defined, plus a textual description of their desired appearance. |
| StandardizedVariants.html | Chapter 16 | N | A derived documentation file, generated from StandardizedVariants.txt, plus a list of sample glyphs showing the desired appearance of each standardized variant. |
| NamedSequences.txt | [UAX34] | N | Lists the names for all approved named sequences. |
| NamedSequencesProv.txt | [UAX34] | P | Lists the names for all provisional named sequences. |

For more information about these files and their use, see the referenced annexes or chapters of the Unicode Standard.

4.4 Zipped Files

Starting with Version 4.1.0, zipped versions of all of the UCD files, both data files and documentation files, are available under the Public/zipped directory on the Unicode Web site. Each collection of zipped files is located there in a numbered subdirectory corresponding to that version of the UCD.

Two different zipped files are provided for each version: UCD.zip and Unihan.zip.

This bifurcation allows for better management of downloading version-specific information, because Unihan.zip contains all the pertinent CJK-related property information, while UCD.zip contains all of the rest of the UCD property information, for those who may not need the voluminous CJK data.

In versions of the UCD prior to Version 4.1.0, zipped copies of the Unihan data files (which for those versions were released as a single large text file, Unihan.txt) are provided in the same directory as the UCD data files. These zipped files are only posted for versions of the UCD in which Unihan.txt was updated.

4.5 UCD in XML

Starting with Version 5.1.0, a set of XML data files are also released with each version of the UCD. Those data files make it possible to import and process the UCD property data using standard XML parsing tools, instead of the specialized parsing required for the various individual data files of the UCD.

4.5.1 UAX #42

UAX #42, Unicode Character Database in XML [UAX42] defines an XML schema which is used to incorporate all of the Unicode character property information into the XML version of the UCD. See that annex for details of the schema and conventions regarding the grouping of property values for more compact representations.

4.5.2 XML File List

The XML version of the UCD is contained in the ucdxml subdirectory of the UCD. The files are all zipped. The list of files is shown in Table 4.

Table 4. XML File List

| File Name | CJK | non-CJK |
|---|---|---|
| ucd.all.flat.zip | x | x |
| ucd.all.grouped.zip | x | x |
| ucd.nounihan.flat.zip | | x |
| ucd.nounihan.grouped.zip | | x |
| ucd.unihan.flat.zip | x | |
| ucd.unihan.grouped.zip | x | |

The "flat" file versions simply list all attributes with no particular compression. The "grouped" file versions apply the grouping mechanism described in [UAX42] to cut down on the size of the data files.
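The relationship between the two forms can be sketched by expanding a grouped fragment into flat, per-character attribute sets. The fragment below is hypothetical, and the element names (ucd, repertoire, group, char), the attribute names (cp, gc, sc, na), and the namespace URI are assumptions about the [UAX42] schema that should be checked against that annex. The sketch also assumes that a value on a character overrides the same attribute on its group.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for UCD XML documents.
NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"

# Hypothetical fragment in the "grouped" style; real files carry
# many more attributes per character.
GROUPED = """\
<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0">
  <repertoire>
    <group gc="Lu" sc="Latn">
      <char cp="0041" na="LATIN CAPITAL LETTER A"/>
      <char cp="00C0" na="LATIN CAPITAL LETTER A WITH GRAVE"/>
    </group>
  </repertoire>
</ucd>
"""

def flatten(xml_text):
    """Yield one attribute dict per char element, inheriting group attributes."""
    root = ET.fromstring(xml_text)
    for group in root.iter(NS + "group"):
        for char in group.iter(NS + "char"):
            merged = dict(group.attrib)
            merged.update(char.attrib)  # character values override group values
            yield merged

chars = list(flatten(GROUPED))
print(chars[0])
```

The grouped form states shared values once per group, which is where the size reduction comes from.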

5 Properties

This section documents the Unicode character properties, relating them in detail to the particular UCD data files in which they are specified. For enumerated properties in particular, this section also documents the actual values which those properties can have.

An index of all the non-CJK character properties by name can be found below in the Property Summary section. For a comparable index of CJK character properties, see UAX #38, Unicode Han Database (Unihan) [UAX38].

5.1 Property Table

The big property table below, Table 6, specifies the list of character properties defined in the UCD. Table 6 is divided into separate sections for each data file in the UCD. Data files which define a single property or a small number of properties are listed first, followed by the data files which define a large number of properties: DerivedCoreProperties.txt, DerivedNormalizationProps.txt, PropList.txt, and UnicodeData.txt. In some instances for these files defining many properties, the entries in the property table are grouped by type, for clarity in presentation, rather than being listed alphabetically.

In Table 6 each property is described as follows:

First Column. This column contains the name of each of the character properties specified in the respective data file. Any special status for a property, such as whether it is obsolete, deprecated, or stabilized, is also indicated in the first column.

Second Column. This column indicates the type of the property, according to the key in Table 5.

Table 5. Property Type Key

| Property Type | Symbol | Examples |
|---|---|---|
| Catalog | C | Age, Block |
| Enumeration | E | Joining_Type, Line_Break |
| Binary | B | Uppercase, White_Space |
| String | S | Uppercase_Mapping, Case_Folding |
| Numeric | N | Numeric_Value |
| Miscellaneous | M | Name, Jamo_Short_Name |

Third Column. This column indicates the status of the property: Normative (N), Informative (I), or Contributory (C).

Fourth Column. This column provides a description of the property or properties. This includes information on derivation for derived properties, as well as references to locations in the standard where the property is defined or discussed in detail.

In the section of the table for UnicodeData.txt, the data field numbers are also supplied in parentheses at the start of the description.

For a few entries in the property table, values specified in the fields in a data file only contribute to a full definition of a Unicode character property. For example, the values in field 1 (Name) in UnicodeData.txt do not provide all the values for the Name property for all code points; Jamo.txt must also be used, and the Name property for CJK Unified Ideographs is derived by rule.
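The Name example can be made concrete with a short sketch. The rule shown for CJK Unified Ideographs (a fixed prefix followed by the code point in hexadecimal) is the one given in the Unicode Standard; the function name is hypothetical, and Python's unicodedata module is used only as an independent source of Name values for comparison.

```python
import unicodedata

def cjk_unified_ideograph_name(cp):
    """Name derived by rule: fixed prefix plus the code point in hex."""
    return f"CJK UNIFIED IDEOGRAPH-{cp:04X}"

rule_name = cjk_unified_ideograph_name(0x4E00)
library_name = unicodedata.name("\u4e00")

# Hangul syllable names are likewise not listed one by one; they are
# composed from Jamo short names.
hangul_name = unicodedata.name("\uac00")
print(rule_name, "|", hangul_name)
```

An implementation that reads only field 1 of UnicodeData.txt would miss both kinds of names.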

None of the Unicode character properties should be used simply on the basis of the descriptions in the property table without consulting the relevant discussions in the Unicode Standard. Because of the enormous variety of characters in the repertoire of the Unicode Standard, character properties tend not to be self-evident in application, even when the names of the properties may seem familiar from their usage with much smaller legacy character encodings.

Table 6. Property Table

ArabicShaping.txt
Joining_Type
Joining_Group
ENBasic Arabic and Syriac character shaping properties, such as initial, medial and final shapes. See Section 8.2, "Arabic" in [Unicode].
BidiMirroring.txt
Bidi_Mirroring_GlyphSIInformative mapping for substituting characters in an implementation of bidirectional mirroring. This maps a subset of characters with the Bidi_Mirrored property to other characters that normally are displayed with the corresponding mirrored glyph. See UAX #9: The Unicode Bidirectional Algorithm [UAX9]. Do not confuse this with theBidi_Mirrored property itself.
Blocks.txt
BlockCNList of block names, which are arbitrary names for ranges of code points. See Chapter 17 in [Unicode].
CompositionExclusions.txt
Composition_ExclusionBNProperties for normalization. See UAX #15: Unicode Normalization Forms [UAX15]. Unlike other files, CompositionExclusions simply lists the relevant code points.
CaseFolding.txt
Simple_Case_Folding
Case_Folding
SNMapping from characters to their case-folded forms. This is an informative file containing normative derived properties.

Derived from UnicodeData and SpecialCasing.

Note:The case foldings are omitted in the data file if they are the same as the code point itself.

DerivedAge.txt
AgeCN/IThis file shows when various code points were designated/assigned in successive versions of the Unicode Standard.

The Age property is normative in the sense that it is completely specified based on when a character is encoded in the standard. However, DerivedAge.txt is provided for information. The value of the Age property for a code point can be derived by analysis of successive versions of the UCD, and Age is not used normatively in the specification of any Unicode algorithm.

Note: When using the Age property in regular expressions, an expression such as "\p{age=3.0}" matches all of the code points assigned in Version 3.0, that is, all the code points with a value less than or equal to 3.0 for the Age property. For more information, see [UTS18].

EastAsianWidth.txt
East_Asian_WidthEIProperties for determining the choice of wide versus narrow glyphs in East Asian contexts. Property values are described in UAX #11: East Asian Width [UAX11].
HangulSyllableType.txt
Hangul_Syllable_Type
 
ENThe values L, V, T, LV, and LVT used in Chapter 3 in [Unicode].
Jamo.txt
Jamo_Short_Name
 
MCThe Hangul Syllable names are derived from the Jamo Short Names, as described in Chapter 3 in [Unicode].
LineBreak.txt
Line_BreakENProperties for line breaking. For more information, see UAX #14: Unicode Line Breaking Algorithm [UAX14].
GraphemeBreakProperty.txt
Grapheme_Cluster_BreakEI

See UAX #29: Unicode Text Segmentation [UAX29]

SentenceBreakProperty.txt
Sentence_BreakEI

See UAX #29: Unicode Text Segmentation [UAX29]

WordBreakProperty.txt
Word_BreakEI

See UAX #29: Unicode Text Segmentation [UAX29]

NameAliases.txt
Name_Alias
 
MNNormative formal aliases for characters with erroneous names, as described in Chapter 4 in [Unicode]. These aliases exactly match the formal aliases published in the Unicode Standard code charts.
NormalizationCorrections.txt
used in Decomposition MappingsSNNormalizationCorrections lists code point differences for Normalization Corrigenda. For more information, see UAX #15: Unicode Normalization Forms [UAX15].
Scripts.txt
ScriptCIScript values for use in regular expressions. For more information, see UAX #24: Unicode Script Property [UAX24].
SpecialCasing.txt
Uppercase_Mapping
Lowercase_Mapping
Titlecase_Mapping
SIData for producing (in combination with the simple case mappings from UnicodeData.txt) the full case mappings.
Unihan data files (for more information, see [UAX38])
Numeric_Type
Numeric_Value
EIThe characters tagged with either kPrimaryNumeric, kAccountingNumeric, or kOtherNumeric are given the property value Numeric_Type=Numeric, and the Numeric_Value indicated in those tags.

Most characters have these numeric properties based on values from UnicodeData.txt. See Numeric_Type.

Unicode_Radical_StrokeMIThe Unicode radical-stroke count, based on the tag kRSUnicode.
DerivedCoreProperties.txt
LowercaseBICharacters with the Lowercase property. For more information, see Chapter 4 in [Unicode].

Generated from: Ll + Other_Lowercase

UppercaseBICharacters with the Uppercase property. For more information, see Chapter 4 in [Unicode].

Generated from: Lu + Other_Uppercase

CasedBICharacters which are considered to be either uppercase, lowercase or titlecase characters. This property is not identical to the Changes_When_Casemapped property. For more information, see D120 in Section 3.13, "Default Case Algorithms" in [Unicode].

Generated from: Lowercase + Uppercase + Lt

Case_IgnorableBICharacters which are ignored for casing purposes. For more information, see D121 in Section 3.13, "Default Case Algorithms" in [Unicode].

Generated from: Mn + Me + Cf + Lm + Sk + Word_Break=MidLetter + Word_Break=MidNumLet

Changes_When_LowercasedBICharacters whose normalized forms are not stable under a toLowercase mapping. For more information, see D124 in Section 3.13, "Default Case Algorithms" in [Unicode].

Generated from: toLowercase(toNFD(X)) != toNFD(X)

Changes_When_UppercasedBICharacters whose normalized forms are not stable under a toUppercase mapping. For more information, see D125 in Section 3.13, "Default Case Algorithms" in [Unicode].

Generated from: toUppercase(toNFD(X)) != toNFD(X)

Changes_When_TitlecasedBICharacters whose normalized forms are not stable under a toTitlecase mapping. For more information, see D126 in Section 3.13, "Default Case Algorithms" in [Unicode].

Generated from: toTitlecase(toNFD(X)) != toNFD(X)

Changes_When_CasefoldedBICharacters whose normalized forms are not stable under case folding. For more information, see D127 in Section 3.13, "Default Case Algorithms" in [Unicode].

Generated from: toCasefold(toNFD(X)) != toNFD(X)

Changes_When_CasemappedBICharacters which may change when they undergo case mapping. For more information, see D128 in Section 3.13, "Default Case Algorithms" in [Unicode].

Generated from: Changes_When_Lowercased(X) or Changes_When_Uppercased(X) or Changes_When_Titlecased(X)

AlphabeticBICharacters with the Alphabetic property. For more information, see Chapter 4 in [Unicode].

Generated from: Lu + Ll + Lt + Lm + Lo + Nl + Other_Alphabetic

Default_Ignorable_Code_PointBNFor programmatic determination of default ignorable code points. New characters that should be ignored in rendering (unless explicitly supported) will be assigned in these ranges, permitting programs to correctly handle the default rendering of such characters when not otherwise supported. For more information, see the FAQ "Display of Unsupported Characters" and Section 5.21, "Default Ignorable Code Points" in [Unicode].

Generated from
Other_Default_Ignorable_Code_Point
+ Cf (format characters)
+ Variation_Selector
- White_Space
- FFF9..FFFB (annotation characters)
- 0600..0603, 06DD, 070F (exceptional Cf characters that should be visible)

Grapheme_BaseBIFor programmatic determination of grapheme cluster boundaries. For more information, see UAX #29: Unicode Text Segmentation [UAX29].

Generated from: [0..10FFFF] - Cc - Cf - Cs - Co - Cn - Zl - Zp - Grapheme_Extend

Grapheme_ExtendBIFor programmatic determination of grapheme cluster boundaries. For more information, see UAX #29: Unicode Text Segmentation [UAX29].

Generated from: Me + Mn + Other_Grapheme_Extend

Note: Depending on an application's interpretation of Co (private use), they may be either in Grapheme_Base, or in Grapheme_Extend, or in neither.

Grapheme_Link (Deprecated as of 5.0.0)BIFormerly proposed for programmatic determination of grapheme cluster boundaries.

Generated from: Canonical_Combining_Class=Virama

MathBICharacters with the Math property. For more information, see Chapter 4 in [Unicode].

Generated from: Sm + Other_Math

ID_StartBIUsed to determine programming identifiers, as described in UAX #31: Unicode Identifier and Pattern Syntax [UAX31].
ID_ContinueBI
XID_StartBI
XID_ContinueBI
DerivedNormalizationProps.txt
Full_Composition_ExclusionBNCharacters that are excluded from composition: those listed explicitly in CompositionExclusions.txt, plus the derivable sets of Singleton Decompositions and Non-Starter Decompositions, as documented in that data file.
Expands_On_NFC
Expands_On_NFD
Expands_On_NFKC
Expands_On_NFKD
BNCharacters that expand to more than one character in the specified normalization form.
FC_NFKC_ClosureSNCharacters that require extra mappings for closure under Case Folding plus Normalization Form KC. Characters marked with this property have a third field with the mapping in it.

Generated with the following, where Fold is defined as the default fold operation (excluding the Turkic-specific foldings):

b = NFKC(Fold(a));
c = NFKC(Fold(b));
if (c != b) add mapping from a to c to the set of mappings that constitute the FC_NFKC_Closure list

Note: The FC_NFKC_Closure value is omitted in the data file if it is the same as the code point itself.

NFD_Quick_Check
NFKD_Quick_Check
NFC_Quick_Check
NFKC_Quick_Check
ENFor property values, see Decompositions and Normalization. (Abbreviated names: NFD_QC, NFKD_QC, NFC_QC, NFKC_QC)
NFKC_CasefoldSIMapping from a character to the string produced by casefolding it, removing any Default_Ignorable_Code_Point=T characters, and converting to NFKC form. (This set of transforms is then repeated, to deal with certain edge cases.)
Changes_When_NFKC_CasefoldedBICharacters which are not identical to their NFKC_Casefold mapping.

Generated from: (cp != NFKC_CaseFold(cp))

PropList.txt
ASCII_Hex_DigitBNASCII characters commonly used for the representation of hexadecimal numbers.
Bidi_ControlBNFormat control characters which have specific functions in the Unicode Bidirectional Algorithm [UAX9].
DashBIPunctuation characters explicitly called out as dashes in the Unicode Standard, plus their compatibility equivalents. Most of these have the General_Category value Pd, but some have the General_Category value Sm because of their use in mathematics.
DeprecatedBNFor a machine-readable list of deprecated characters. No characters will ever be removed from the standard, but the usage of deprecated characters is strongly discouraged.
DiacriticBICharacters that linguistically modify the meaning of another character to which they apply. Some diacritics are not combining characters, and some combining characters are not diacritics.
ExtenderBICharacters whose principal function is to extend the value or shape of a preceding alphabetic character. Typical of these are length and iteration marks.
Hex_DigitBICharacters commonly used for the representation of hexadecimal numbers, plus their compatibility equivalents.
Hyphen (Stabilized as of 4.0.0)BIDashes which are used to mark connections between pieces of words, plus the Katakana middle dot. The Katakana middle dot functions like a hyphen, but is shaped like a dot rather than a dash.
IdeographicBICharacters considered to be CJKV (Chinese, Japanese, Korean, and Vietnamese) ideographs.
IDS_Binary_OperatorBNUsed in Ideographic Description Sequences.
IDS_Trinary_OperatorBNUsed in Ideographic Description Sequences.
Join_ControlBNFormat control characters which have specific functions for control of cursive joining and ligation.
Logical_Order_ExceptionBNThere are a small number of characters that do not use logical order. These characters require special handling in most processing.
Noncharacter_Code_PointBNCode points permanently reserved for internal use.
Other_AlphabeticBCUsed in deriving the Alphabetic property.
Other_Default_Ignorable_Code_PointBCUsed in deriving the Default_Ignorable_Code_Point property.
Other_Grapheme_ExtendBCUsed in deriving the Grapheme_Extend property.
Other_ID_ContinueBCUsed for backward compatibility of ID_Continue.
Other_ID_StartBCUsed for backward compatibility of ID_Start.
Other_LowercaseBCUsed in deriving the Lowercase property.
Other_MathBCUsed in deriving the Math property.
Other_UppercaseBCUsed in deriving the Uppercase property.
Pattern_SyntaxBNUsed for pattern syntax as described in UAX #31: Unicode Identifier and Pattern Syntax [UAX31].
Pattern_White_SpaceBN
Quotation_MarkBIPunctuation characters that function as quotation marks.
RadicalBNUsed in Ideographic Description Sequences.
Soft_DottedBNCharacters with a "soft dot", like i or j. An accent placed on these characters causes the dot to disappear. An explicit dot above can be added where required, such as in Lithuanian.
STermBISentence Terminal. Used in UAX #29: Unicode Text Segmentation [UAX29].
Terminal_PunctuationBIPunctuation characters that generally mark the end of textual units.
Unified_IdeographBNUsed in Ideographic Description Sequences.
Variation_SelectorBNIndicates characters that are Variation Selectors. For details on the behavior of these characters, see StandardizedVariants.html, Section 16.4, "Variation Selectors" in [Unicode], and the Unicode Ideographic Variation Database [UTS37].
White_SpaceBNSeparator characters and control characters which should be treated by programming languages as "white space" for the purpose of parsing elements.

Note: ZERO WIDTH SPACE and ZERO WIDTH NO-BREAK SPACE are not included, because their functions are restricted to line-break control. Their names are unfortunately misleading in this respect.

Note: There are other senses of "whitespace" that encompass a different set of characters.

UnicodeData.txt
NameMN(1) These names match exactly the names published in the code charts of the Unicode Standard. The derived Hangul Syllable names are omitted from this file; see Jamo.txt for their derivation.
General_CategoryEN(2) This is a useful breakdown into various character types which can be used as a default categorization in implementations. For the property values, see General Category Values.
Canonical_Combining_ClassNN(3) The classes used for the Canonical Ordering Algorithm in the Unicode Standard. This property could be considered either an enumerated property or a numeric property: the principal use of the property is in terms of the numeric values. For the property value names associated with different numeric values, see DerivedCombiningClass.txt and Canonical Combining Class Values.
Bidi_ClassEN(4) These are the categories required by the Unicode Bidirectional Algorithm. For the property values, see Bidirectional Class Values. For more information, see UAX #9: The Unicode Bidirectional Algorithm [UAX9].

The default property values depend on the code point, and are explained in DerivedBidiClass.txt

Decomposition_Type
Decomposition_Mapping
E
S
N(5) This field contains both values, with the type in angle brackets. The decomposition mappings exactly match the decomposition mappings published with the character names in the Unicode Standard. For more information, see Character Decomposition Mappings.
Numeric_Type
Numeric_Value
E
N
N(6) If the character has the property value Numeric_Type=Decimal, then the Numeric_Value of that digit is represented with an integer value in fields 6, 7, and 8. See the discussion of decimal digits in Chapter 4 in [Unicode].
E
N
N(7) If the character has the property value Numeric_Type=Digit, then the Numeric_Value of that digit is represented with an integer value in fields 7 and 8, and field 6 is null. This covers digits that need special handling, such as the compatibility superscript digits.
E
N
N(8) If the character has the property value Numeric_Type=Numeric, then the Numeric_Value of that character is represented with a positive or negative integer or rational number in this field, and fields 6 and 7 are null. This includes fractions such as, for example, "1/5" for U+2155 VULGAR FRACTION ONE FIFTH.

Some characters have these properties based on values from the Unihan data files. See Numeric_Type, Han.

Bidi_MirroredBN(9) If the character is a "mirrored" character in bidirectional text, this field has the value "Y"; otherwise "N". See Section 4.7, "Bidi Mirrored—Normative" of [Unicode]. Do not confuse this with the Bidi_Mirroring_Glyph property.
Unicode_1_NameMI(10) Old name as published in Unicode 1.0. This name is only provided when it is significantly different from the current name for the character. The value of field 10 for control characters does not always match the Unicode 1.0 names. Instead, field 10 contains ISO 6429 names for control functions, for printing in the code charts.
ISO_Comment (Obsolete as of 5.2.0)MI(11) ISO 10646 comment field. It was used for notes that appeared in parentheses in the 10646 names list, or contained an asterisk to mark an Annex P note.

As of Unicode 5.2.0, this field no longer contains any non-null values.

Simple_Uppercase_MappingSN(12) Simple uppercase mapping (single character result).
If a character is part of an alphabet with case distinctions, and has a simple uppercase equivalent, then the uppercase equivalent is in this field. The simple mappings have a single character result, where the full mappings may have multi-character results. For more information, see Case and Case Mapping.
Simple_Lowercase_MappingSN(13) Simple lowercase mapping (single character result).
Simple_Titlecase_MappingSN(14) Simple titlecase mapping (single character result).

Note: If this field is null, then the Simple_Titlecase_Mapping is the same as the Simple_Uppercase_Mapping for this character.
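The "Generated from" rules given in the table for the Changes_When_* properties can be sketched in Python as follows. This is only an approximation under stated assumptions: Python's built-in str.lower, str.upper, str.title, and str.casefold stand in for toLowercase, toUppercase, toTitlecase, and toCasefold, and the results reflect the version of the UCD bundled with the Python interpreter rather than Version 5.2.0. The function names are illustrative and are not part of the UCD.

```python
import unicodedata


def _nfd(ch: str) -> str:
    """Return toNFD(X) for a single character X."""
    return unicodedata.normalize("NFD", ch)


def changes_when_lowercased(ch: str) -> bool:
    # toLowercase(toNFD(X)) != toNFD(X)
    nfd = _nfd(ch)
    return nfd.lower() != nfd


def changes_when_uppercased(ch: str) -> bool:
    # toUppercase(toNFD(X)) != toNFD(X)
    nfd = _nfd(ch)
    return nfd.upper() != nfd


def changes_when_titlecased(ch: str) -> bool:
    # toTitlecase(toNFD(X)) != toNFD(X); str.title() is used as a stand-in
    nfd = _nfd(ch)
    return nfd.title() != nfd


def changes_when_casefolded(ch: str) -> bool:
    # toCasefold(toNFD(X)) != toNFD(X)
    nfd = _nfd(ch)
    return nfd.casefold() != nfd


def changes_when_casemapped(ch: str) -> bool:
    # Union of the lowercased, uppercased, and titlecased conditions
    return (
        changes_when_lowercased(ch)
        or changes_when_uppercased(ch)
        or changes_when_titlecased(ch)
    )
```

For production use, the values listed in DerivedCoreProperties.txt for the targeted Unicode version remain the reference; the sketch only illustrates how the derivations are phrased.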

 

5.2 Derived Extracted Properties

A number of Unicode character properties have been separated out, reformatted, and listed in range format, one property per file. These files are located under the extracted directory of the UCD. The exact list of derived extracted files and the extracted properties they represent is given in Table 7.

The derived extracted files are provided purely as a reformatting of data for properties specified in other data files. In case of any inadvertent mismatch between the primary data files specifying those properties and these lists of extracted properties, the primary data files are taken as definitive.
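As an illustration of the range format, the following Python sketch parses lines of the general shape used by the extracted files: a code point or a "first..last" range, a semicolon, a property value, and an optional "#" comment. The SAMPLE text is a hypothetical excerpt written for this example and is not copied from any data file; the function name parse_ranges is likewise an assumption.

```python
from typing import Iterator, Tuple

# Hypothetical excerpt in the style of an extracted property file.
SAMPLE = """\
# Illustrative data only
0041..005A    ; L # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00B5          ; L # L&       MICRO SIGN
"""


def parse_ranges(text: str) -> Iterator[Tuple[int, int, str]]:
    """Yield (first, last, value) for each data line in range format."""
    for raw in text.splitlines():
        # Drop the trailing comment, then skip blank and comment-only lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start  # single code point
        yield start, end, fields[1]
```

Calling list(parse_ranges(SAMPLE)) yields one tuple per data line, with a single code point reported as a range whose first and last values are equal.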

Table 7. Extracted Properties

File                          Status  Property                   Extracted from
DerivedBidiClass.txt          N       Bidi_Class                 UnicodeData.txt, field 4
DerivedBinaryProperties.txt   N       Bidi_Mirrored              UnicodeData.txt, field 9
DerivedCombiningClass.txt     N       Canonical_Combining_Class  UnicodeData.txt, field 3
DerivedDecompositionType.txt  N/I     Decomposition_Type         the <tag> in UnicodeData.txt, field 5
DerivedEastAsianWidth.txt     I       East_Asian_Width           EastAsianWidth.txt, field 1
DerivedGeneralCategory.txt    N       General_Category           UnicodeData.txt, field 2
DerivedJoiningGroup.txt       N       Joining_Group              ArabicShaping.txt, field 2
DerivedJoiningType.txt        N       Joining_Type               ArabicShaping.txt, field 1
DerivedLineBreak.txt          N       Line_Break                 LineBreak.txt, field 1
DerivedNumericType.txt        N       Numeric_Type               UnicodeData.txt, fields 6 through 8
DerivedNumericValues.txt      N       Numeric_Value              UnicodeData.txt, field 8

For the extraction of Decomposition_Type, characters with canonical decomposition mappings in field 5 of UnicodeData.txt have no tag. For those characters, the extracted value is Decomposition_Type=Canonical. For characters with compatibility decomposition mappings, there are explicit tags in field 5, and the value of Decomposition_Type is equivalent to those tags. The value Decomposition_Type=Canonical is normative. Other values for Decomposition_Type are informative.
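The extraction rule for Decomposition_Type can be sketched with Python's standard unicodedata module, whose decomposition() function returns the raw content of field 5 as a string. The function name decomposition_type is an assumption made for this example, and the returned tag text is the tag as it appears in field 5, not the corresponding property value alias. The sketch also does not cover Hangul syllables, whose decompositions are derived algorithmically rather than listed in field 5.

```python
import unicodedata


def decomposition_type(ch: str) -> str:
    """Classify a character by the content of UnicodeData.txt field 5."""
    field5 = unicodedata.decomposition(ch)
    if not field5:
        # Empty field 5: no explicit decomposition mapping is listed.
        return "None"
    if field5.startswith("<"):
        # A compatibility mapping: the tag names the formatting type.
        tag = field5.split(None, 1)[0]
        return tag.strip("<>")
    # A mapping without a tag is a canonical mapping.
    return "Canonical"
```

For example, U+00E9 has an untagged mapping and is reported as Canonical, while U+00B2 carries the tag <super>.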

Numeric_Value is extracted based on the actual numeric value of the data in field 8 of UnicodeData.txt or the values of the kPrimaryNumeric, kAccountingNumeric, or kOtherNumeric tags, for characters listed in the Unihan data files.

Numeric_Type is extracted as follows. If fields 6, 7, and 8 in UnicodeData.txt are all non-empty, then Numeric_Type=Decimal. Otherwise, if fields 7 and 8 are both non-empty, then Numeric_Type=Digit. Otherwise, if field 8 is non-empty, then Numeric_Type=Numeric. For characters listed in the Unihan data files, Numeric_Type=Numeric for characters that have kPrimaryNumeric, kAccountingNumeric, or kOtherNumeric tags. The default value is Numeric_Type=None.
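The rule above for fields 6 through 8 can be sketched as a small Python function operating on one semicolon-delimited line of UnicodeData.txt. The function name is an assumption for illustration, and the sketch deliberately ignores the Unihan tags (kPrimaryNumeric, kAccountingNumeric, kOtherNumeric), so it covers only the UnicodeData.txt part of the derivation.

```python
from fractions import Fraction
from typing import Optional, Tuple


def numeric_type_and_value(line: str) -> Tuple[str, Optional[Fraction]]:
    """Derive (Numeric_Type, Numeric_Value) from one UnicodeData.txt line."""
    fields = line.split(";")
    decimal, digit, numeric = fields[6], fields[7], fields[8]
    if decimal and digit and numeric:
        numeric_type = "Decimal"
    elif digit and numeric:
        numeric_type = "Digit"
    elif numeric:
        numeric_type = "Numeric"
    else:
        # Default value when fields 6, 7, and 8 are all empty.
        return "None", None
    # Field 8 holds an integer or a rational such as "1/5".
    return numeric_type, Fraction(numeric)
```

A line such as "2155;VULGAR FRACTION ONE FIFTH;No;0;ON;<fraction> 0031 2044 0035;;;1/5;N;FRACTION ONE FIFTH;;;;" has only field 8 filled in, so it is classified as Numeric with the value 1/5.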

5.3 Property Summary

Table 8 provides a summary list of the Unicode character properties, excluding most of those specific to the Unihan data files. The properties are roughly organized into groups based on their usage. This grouping is primarily for documentation convenience and, except for contributory properties, has no normative implications. The link on each property leads to its description in the Property Table above.

Table 8. Property Summary Table

GeneralNormalizationCJK
NameCanonical_Combining_ClassIdeographic
Name_AliasDecomposition_MappingUnified_Ideograph
BlockComposition_ExclusionRadical
AgeFull_Composition_ExclusionIDS_Binary_Operator
General_CategoryDecomposition_TypeIDS_Trinary_Operator
ScriptFC_NFKC_ClosureUnicode_Radical_Stroke
White_SpaceNFC_Quick_CheckMiscellaneous
AlphabeticNFKC_Quick_CheckMath
Hangul_Syllable_TypeNFD_Quick_CheckQuotation_Mark
Noncharacter_Code_PointNFKD_Quick_CheckDash
Default_Ignorable_Code_PointExpands_On_NFCHyphen (stabilized)
DeprecatedExpands_On_NFDSTerm
Logical_Order_ExceptionExpands_On_NFKCTerminal_Punctuation
Variation_SelectorExpands_On_NFKDDiacritic
 NFKC_CasefoldExtender
 Changes_When_NFKC_CasefoldedGrapheme_Base
CaseShaping and RenderingGrapheme_Extend
UppercaseJoin_ControlGrapheme_Link (deprecated)
LowercaseJoining_GroupUnicode_1_Name
Lowercase_MappingJoining_TypeISO_Comment (obsolete)
Titlecase_MappingLine_Break 
Uppercase_MappingGrapheme_Cluster_Break 
Case_FoldingSentence_BreakContributory Properties
Simple_Lowercase_MappingWord_BreakOther_Alphabetic
Simple_Titlecase_MappingEast_Asian_WidthOther_Default_Ignorable_Code_Point
Simple_Uppercase_MappingBidirectionalOther_Grapheme_Extend
Simple_Case_FoldingBidi_ClassOther_ID_Start
Soft_DottedBidi_ControlOther_ID_Continue
CasedBidi_MirroredOther_Lowercase
Case_IgnorableBidi_Mirroring_GlyphOther_Math
Changes_When_LowercasedIdentifiersOther_Uppercase
Changes_When_UppercasedID_ContinueJamo_Short_Name
Changes_When_TitlecasedID_StartNumeric
Changes_When_CasefoldedXID_ContinueNumeric_Value
Changes_When_CasemappedXID_StartNumeric_Type
 Pattern_SyntaxHex_Digit
 Pattern_White_SpaceASCII_Hex_Digit

 

5.3.1 Contributory Properties

Contributory properties contain sets of exceptions used in the generation of other properties derived from them. The contributory properties specifically concerned with identifiers and casing contribute to the maintenance of stability guarantees for properties and/or to invariance relationships between related properties. Other contributory properties are simply defined as a convenience for property derivation.

Most contributory properties have names using the pattern "Other_XXX" and are used to derive the corresponding "XXX" property. For example, the Other_Alphabetic property is used in the derivation of the Alphabetic property.

Contributory properties are typically defined in PropList.txt and the corresponding derived property is then listed in DerivedCoreProperties.txt.

Jamo_Short_Name is an unusual contributory property, both in terms of its name and how it is used. It is defined in its own property file, Jamo.txt, and is used to derive the Name property value for Hangul syllable characters, according to the rules spelled out in Section 3.12, "Conjoining Jamo Behavior" in [Unicode].

Contributory is considered to be a distinct status for a Unicode character property. Contributory properties are neither normative nor informative. This distinct status is marked in the property table.

Contributory properties are incomplete by themselves and are not intended for independent use. For example, an API returning Unicode property values should implement the derived core properties such as Alphabetic or Default_Ignorable_Code_Point, rather than the corresponding contributory properties, Other_Alphabetic or Other_Default_Ignorable_Code_Point.
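The relationship between a contributory property and its derived property can be sketched as follows, using the derivation of Alphabetic (Lu + Ll + Lt + Lm + Lo + Nl + Other_Alphabetic). The set OTHER_ALPHABETIC_SAMPLE is a deliberately tiny, illustrative subset; a real implementation would load the full Other_Alphabetic list from PropList.txt. The names used here are assumptions for this example.

```python
import unicodedata

# Illustrative subset only: U+0345 COMBINING GREEK YPOGEGRAMMENI.
# The complete set is defined in PropList.txt.
OTHER_ALPHABETIC_SAMPLE = frozenset({0x0345})

# General_Category values that contribute to Alphabetic.
ALPHABETIC_CATEGORIES = frozenset({"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"})


def is_alphabetic(ch: str, other_alphabetic=OTHER_ALPHABETIC_SAMPLE) -> bool:
    """Derived property: General_Category test plus the contributory set."""
    return (
        unicodedata.category(ch) in ALPHABETIC_CATEGORIES
        or ord(ch) in other_alphabetic
    )
```

An API built this way exposes only is_alphabetic; the contributory set stays an internal input to the derivation, in line with the guidance above.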

5.4 Case and Case Mapping

Case for bicameral scripts and case mapping of characters are complicated topics in the Unicode Standard—both because of their inherent algorithmic complexity and because of the number of characters and special edge cases involved.

This section provides a brief roadmap to discussions about these topics, and specifications and definitions in the standard, as well as explaining which case-related properties are defined in the UCD.

Section 3.13, "Default Case Algorithms" in [Unicode] provides formal definitions for a number of case-related concepts (cased, case-ignorable, ...), for case conversion (toUppercase(X), ...), and for case detection (isUppercase(X), ...). It also provides the formal definition of caseless matching for the standard, taking normalization into account.
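As a rough illustration of caseless matching that takes normalization into account, the following Python sketch compares two strings after applying NFD, case folding, and NFD again. It assumes that str.casefold() is an acceptable stand-in for toCasefold and that the UCD version bundled with the interpreter is suitable; the function name is hypothetical, and the formal definition in Section 3.13 remains the reference.

```python
import unicodedata


def canonical_caseless_key(text: str) -> str:
    """Return NFD(toCasefold(NFD(text))), used as a comparison key."""
    nfd = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", nfd.casefold())


def canonical_caseless_match(x: str, y: str) -> bool:
    """Compare two strings caselessly, ignoring canonical differences."""
    return canonical_caseless_key(x) == canonical_caseless_key(y)
```

With this sketch, a precomposed "É" matches "e" followed by U+0301 COMBINING ACUTE ACCENT, because both reduce to the same decomposed, case-folded sequence.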

Section 4.2, "Case—Normative" in [Unicode] introduces case and case mapping properties. Table 4-1, "Sources for Case Mapping Information" in [Unicode] describes the kind of case-related information that is available in various data files of the UCD. Table 9 lists those data files again, giving the explicit list of case-related properties defined in each. The link on each property leads to its description in the Property Table above.

Table 9. UCD Files and Case Properties

File Name                      Case Properties
UnicodeData.txt                Simple_Uppercase_Mapping, Simple_Lowercase_Mapping, Simple_Titlecase_Mapping
SpecialCasing.txt              Uppercase_Mapping, Lowercase_Mapping, Titlecase_Mapping
CaseFolding.txt                Simple_Case_Folding, Case_Folding
DerivedCoreProperties.txt      Uppercase, Lowercase, Cased, Case_Ignorable, Changes_When_Lowercased, Changes_When_Uppercased, Changes_When_Titlecased, Changes_When_Casefolded, Changes_When_Casemapped
DerivedNormalizationProps.txt  NFKC_Casefold, Changes_When_NFKC_Casefolded
PropList.txt                   Soft_Dotted, Other_Uppercase, Other_Lowercase

For compatibility with existing parsers, UnicodeData.txt only contains case mappings for characters where they constitute one-to-one mappings; it also omits information about context-sensitive case mappings. Information about these special cases can be found in the separate data file, SpecialCasing.txt, expressed as separate properties.

Section 5.18, "Case Mappings", in [Unicode] discusses various implementation issues for handling case, including language-specific case mapping, as for Greek and for Turkish. That section also describes case folding in particular detail.

The special casing conditions associated with case mapping for Greek, Turkish, and Lithuanian are specified in an additional field in SpecialCasing.txt. For example, the lowercase mapping for sigma in Greek varies according to its position in a word. The condition list does not constitute a formal character property in the UCD, because it is a statement about the context of occurrence of casing behavior for a character or characters, rather than a semantic attribute of those characters. Versions of the UCD from Version 3.2.0 to Version 5.0.0 did list property aliases for Special_Case_Condition (scc), but this was determined to be an error when the UCD was analyzed for representation in XML; consequently, the Special_Case_Condition property aliases were removed as of Version 5.1.0.
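The difference between one-to-one mappings and the full, context-sensitive mappings can be observed with Python's built-in string methods. The sketch below assumes CPython's str.upper and str.lower, which apply full case mappings including the final-sigma condition but do not apply the language-specific conditions for Turkish or Lithuanian. The variable names are chosen for this example only.

```python
# Full uppercase mapping with a multi-character result:
# U+00DF LATIN SMALL LETTER SHARP S uppercases to "SS".
upper_sharp_s = "\u00df".upper()

# Context-sensitive lowercase mapping for Greek sigma:
# a word-final U+03A3 lowercases to U+03C2 (final sigma).
greek_word = "\u039f\u0394\u039f\u03a3"
lower_word = greek_word.lower()

# An isolated U+03A3 has no preceding cased letter, so it lowercases
# to U+03C3 (non-final sigma).
lower_isolated = "\u03a3".lower()

print(upper_sharp_s, lower_word, lower_isolated)
```

A parser limited to the one-to-one mappings in UnicodeData.txt would leave U+00DF unchanged on uppercasing and would always map U+03A3 to U+03C3.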

Caseless matching is of particular concern for a number of text processing algorithms, so is also discussed at some length in UAX #31: Unicode Identifier and Pattern Syntax [UAX31] and in UTS #10: Unicode Collation Algorithm [UTS10].

Further information about locale-specific casing conventions can be found in the Unicode Common Locale Data Repository [CLDR].

5.5 Property Value Lists

The following subsections give summaries of property values for certain Enumeration properties. Other property values are documented in other, topically-specific annexes; for example, the Line_Break property values are documented in UAX #14: Unicode Line Breaking Algorithm [UAX14] and the various segmentation-related property values are documented in UAX #29: Unicode Text Segmentation [UAX29].

5.5.1 General Category Values

The General_Category property of a code point provides for the most general classification of that code point. It is usually determined based on the primary characteristic of the assigned character for that code point. For example, is the character a letter, a mark, a number, punctuation, or a symbol, and if so, of what type? Other General_Category values define the classification of code points which are not assigned to regular graphic characters, including such statuses as private-use, control, surrogate code point, and reserved unassigned.

Many characters have multiple uses, and not all such cases can be captured entirely by the General_Category value. For example, the General_Category value of Latin, Greek, or Hebrew letters does not attempt to cover (or preclude) the numerical use of such letters as Roman numerals or in other numerary systems. Conversely, the General_Category of ASCII digits 0..9 as Nd (decimal digit) does not attempt to cover (or preclude) the occasional use of these digits as letters in various orthographies. The General_Category is simply the first-order, most usual categorization of a character.

For more information about the General_Category property, see Chapter 4 in [Unicode].

The values in the General_Category field in UnicodeData.txt make use of the short, abbreviated property value aliases for General_Category. For convenience in reference, Table 10 lists all the abbreviated and long value aliases for General_Category values, reproduced from PropertyValueAliases.txt, along with a brief description of each category.

Table 10. General_Category Values

Abbr  Long                    Description
Lu    Uppercase_Letter        an uppercase letter
Ll    Lowercase_Letter        a lowercase letter
Lt    Titlecase_Letter        a digraphic character, with first part uppercase
Lm    Modifier_Letter         a modifier letter
Lo    Other_Letter            other letters, including syllables and ideographs
Mn    Nonspacing_Mark         a nonspacing combining mark (zero advance width)
Mc    Spacing_Mark            a spacing combining mark (positive advance width)
Me    Enclosing_Mark          an enclosing combining mark
Nd    Decimal_Number          a decimal digit
Nl    Letter_Number           a letterlike numeric character
No    Other_Number            a numeric character of other type
Pc    Connector_Punctuation   a connecting punctuation mark, like a tie
Pd    Dash_Punctuation        a dash or hyphen punctuation mark
Ps    Open_Punctuation        an opening punctuation mark (of a pair)
Pe    Close_Punctuation       a closing punctuation mark (of a pair)
Pi    Initial_Punctuation     an initial quotation mark
Pf    Final_Punctuation       a final quotation mark
Po    Other_Punctuation       a punctuation mark of other type
Sm    Math_Symbol             a symbol of primarily mathematical use
Sc    Currency_Symbol         a currency sign
Sk    Modifier_Symbol         a non-letterlike modifier symbol
So    Other_Symbol            a symbol of other type
Zs    Space_Separator         a space character (of various non-zero widths)
Zl    Line_Separator          U+2028 LINE SEPARATOR only
Zp    Paragraph_Separator     U+2029 PARAGRAPH SEPARATOR only
Cc    Control                 a C0 or C1 control code
Cf    Format                  a format control character
Cs    Surrogate               a surrogate code point
Co    Private_Use             a private-use character
Cn    Unassigned              a reserved unassigned code point or a noncharacter

Note that the value gc=Cn does not actually occur in UnicodeData.txt, because that data file does not list unassigned code points.

Characters with the quotation-related General_Category values Pi or Pf may behave like opening punctuation (gc=Ps) or closing punctuation (gc=Pe), depending on usage and quotation conventions.

The symbol "L&" is used to stand for any combination of uppercase, lowercase or titlecase letters (Lu, Ll, or Lt), in the first part of comments in the data files. The LC value for the General_Category property, as documented in PropertyValueAliases.txt also stands for uppercase, lowercase or titlecase letters.
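The abbreviated General_Category values can be queried in Python through unicodedata.category, which returns the two-letter alias. The following sketch also shows one way to express the LC grouping (Lu, Ll, or Lt); the names CASED_LETTER and is_cased_letter are assumptions for this example, and the results depend on the UCD version bundled with the interpreter.

```python
import unicodedata

# LC (also written "L&" in data file comments): Lu, Ll, or Lt.
CASED_LETTER = frozenset({"Lu", "Ll", "Lt"})


def general_category(ch: str) -> str:
    """Return the abbreviated General_Category value of a character."""
    return unicodedata.category(ch)


def is_cased_letter(ch: str) -> bool:
    """True if the character's General_Category falls in the LC group."""
    return unicodedata.category(ch) in CASED_LETTER
```

Although gc=Cn does not occur in UnicodeData.txt, a lookup function still has to return it as the default, for example for the noncharacter U+FFFF.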

The Unicode Standard does not assign non-default property values to control characters (gc=Cc), except for certain well-defined exceptions involving the Unicode Bidirectional Algorithm, the Unicode Line Breaking Algorithm, and Unicode Text Segmentation. Also, implementations will usually assign behavior to certain line breaking control characters—most notably U+000D and U+000A (CR and LF)—according to platform conventions. See Section 5.8, "Newline Guidelines" in [Unicode] for more information.

5.5.2 Bidirectional Class Values

The values in the Bidi_Class field in UnicodeData.txt make use of the short, abbreviated property value aliases for Bidi_Class. For convenience in reference, Table 11 lists all the abbreviated and long value aliases for Bidi_Class values, reproduced from PropertyValueAliases.txt, along with a brief description of each category.

Table 11. Bidi_Class Values

Abbr  Long                     Description
L     Left_To_Right            any strong left-to-right character
LRE   Left_To_Right_Embedding  U+202A: the LR embedding control
LRO   Left_To_Right_Override   U+202D: the LR override control
R     Right_To_Left            any strong right-to-left (non-Arabic-type) character
AL    Arabic_Letter            any strong right-to-left (Arabic-type) character
RLE   Right_To_Left_Embedding  U+202B: the RL embedding control
RLO   Right_To_Left_Override   U+202E: the RL override control
PDF   Pop_Directional_Format   U+202C: terminates an embedding or override control
EN    European_Number          any ASCII digit or Eastern Arabic-Indic digit
ES    European_Separator       plus and minus signs
ET    European_Terminator      a terminator in a numeric format context, includes currency signs
AN    Arabic_Number            any Arabic-Indic digit
CS    Common_Separator         commas, colons, and slashes
NSM   Nonspacing_Mark          any nonspacing mark
BN    Boundary_Neutral         most format characters, control codes, or noncharacters
B     Paragraph_Separator      various newline characters
S     Segment_Separator        various segment-related control codes
WS    White_Space              spaces
ON    Other_Neutral            most other symbols and punctuation marks

Please refer to UAX #9: The Unicode Bidirectional Algorithm [UAX9] for an explanation of the significance of these values when formatting bidirectional text.
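The abbreviated Bidi_Class values can be inspected in Python with unicodedata.bidirectional. The helper below is a minimal sketch for examining the classes of the characters in a string; its name is an assumption for this example, and it performs no part of the Unicode Bidirectional Algorithm itself.

```python
import unicodedata
from typing import List, Tuple


def bidi_classes(text: str) -> List[Tuple[str, str]]:
    """Pair each character with its abbreviated Bidi_Class value."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]
```

For a mixed string such as "a", U+05D0 HEBREW LETTER ALEF, U+0627 ARABIC LETTER ALEF, "1", the classes reported are L, R, AL, and EN respectively.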

5.5.3 Character Decomposition Mapping

The value of the Decomposition_Mapping property for a character is provided in field 5 of UnicodeData.txt. This is a string property, consisting of a sequence of one or more Unicode code points. The default value of the Decomposition_Mapping property is the code point of the character itself. The use of the default value for a character is indicated by leaving field 5 empty in UnicodeData.txt. Informally, the value of the Decomposition_Mapping property for a character is known simply as its decomposition mapping. When a character's decomposition mapping is other than the default value, the decomposition mapping is printed out explicitly in the names list for the Unicode code charts.

The prefixed tags supplied with a subset of the decomposition mappings generally indicate formatting information. Where no such tag is given, the mapping is canonical. Conversely, the presence of a formatting tag also indicates that the mapping is a compatibility mapping and not a canonical mapping. In the absence of other formatting information in a compatibility mapping, the tag is used to distinguish it from canonical mappings.

In some instances a canonical mapping or a compatibility mapping may consist of a single character. For a canonical mapping, this indicates that the character is a canonical equivalent of another single character. For a compatibility mapping, this indicates that the character is a compatibility equivalent of another single character.
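The distinction between default, canonical, and compatibility mappings can be illustrated with a short Python sketch. The function name parse_decomposition_field is hypothetical and not part of the UCD; the sketch relies on Python's unicodedata.decomposition(), which returns the same string that appears in field 5 of UnicodeData.txt for the Unicode version bundled with the interpreter.

```python
import unicodedata


def parse_decomposition_field(code_point, field):
    """Split field 5 of UnicodeData.txt into (tag, mapping).

    tag is None when no formatting tag is present (a canonical mapping, or
    the default value); otherwise it is the compatibility formatting tag
    without its angle brackets. An empty field yields the default value,
    which is the code point of the character itself.
    """
    parts = field.split()
    if not parts:
        return None, [code_point]
    tag = None
    if parts[0].startswith("<"):
        tag = parts[0].strip("<>")
        parts = parts[1:]
    return tag, [int(p, 16) for p in parts]


# U+00C5 has a canonical mapping, U+00A0 a compatibility mapping,
# and U+0041 the default value (empty field).
for cp in (0x00C5, 0x00A0, 0x0041):
    field = unicodedata.decomposition(chr(cp))
    print(f"U+{cp:04X}", parse_decomposition_field(cp, field))
```

Returning the tag separately lets a caller treat any mapping with a non-None tag as a compatibility mapping, as described above.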

The compatibility formatting tags used in the UCD are listed in Table 12.

Table 12. Compatibility Formatting Tags

Tag         Description
<font>      Font variant (for example, a blackletter form)
<noBreak>   No-break version of a space or hyphen
<initial>   Initial presentation form (Arabic)
<medial>    Medial presentation form (Arabic)
<final>     Final presentation form (Arabic)
<isolated>  Isolated presentation form (Arabic)
<circle>    Encircled form
<super>     Superscript form
<sub>       Subscript form
<vertical>  Vertical layout presentation form
<wide>      Wide (or zenkaku) compatibility character
<narrow>    Narrow (or hankaku) compatibility character
<small>     Small variant form (CNS compatibility)
<square>    CJK squared font variant
<fraction>  Vulgar fraction form
<compat>    Otherwise unspecified compatibility character

Note: There is a difference between decomposition and the Decomposition_Mapping property. The Decomposition_Mapping property is a string property whose values (mappings) are defined in UnicodeData.txt, while the decomposition (also termed "full decomposition") is defined in Section 3.7, "Decomposition" in [Unicode] to use those mappings recursively.

Starting from Unicode 2.1.9, the decomposition mappings in UnicodeData.txt can be used to derive the full decomposition of any single character in canonical order, without the need to separately apply the Canonical Ordering Algorithm. However, canonical ordering of combining character sequences must still be applied in decomposition when normalizing source text which contains any combining marks.

The normalization of Hangul conjoining jamos and of Hangul syllables depends on algorithmic mapping, as specified in Section 3.12, "Conjoining Jamo Behavior" in [Unicode]. That algorithm specifies the full decomposition of all precomposed Hangul syllables, but effectively it is equivalent to the recursive application of pairwise decomposition mappings, as for all other Unicode characters. Formally, the Decomposition_Mapping property value for a Hangul syllable is the pairwise decomposition and not the full decomposition.

Each character with the Hangul_Syllable_Type value LVT will have a Decomposition_Mapping consisting of a character with an LV value and a character with a T value. Thus for U+CE31 the Decomposition_Mapping is <U+CE20, U+11B8>, rather than <U+110E, U+1173, U+11B8>.
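The difference between the pairwise Decomposition_Mapping and the full decomposition of a Hangul syllable can be sketched in Python as follows. The constants are those of the arithmetic in Section 3.12 of [Unicode]; the function names are hypothetical and chosen only for this illustration.

```python
S_BASE, T_BASE = 0xAC00, 0x11A7
L_BASE, V_BASE = 0x1100, 0x1161
V_COUNT, T_COUNT = 21, 28
S_COUNT = 19 * V_COUNT * T_COUNT  # 11172 precomposed syllables


def pairwise_decomposition(cp):
    """Return the Decomposition_Mapping (two code points) of a Hangul syllable."""
    s_index = cp - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    t_index = s_index % T_COUNT
    if t_index:
        # LVT syllable: maps to <LV, T>
        return [cp - t_index, T_BASE + t_index]
    # LV syllable: maps to <L, V>
    l_index, v_index = divmod(s_index // T_COUNT, V_COUNT)
    return [L_BASE + l_index, V_BASE + v_index]


def full_decomposition(cp):
    """Apply the pairwise mapping recursively to reach conjoining jamos."""
    first, second = pairwise_decomposition(cp)
    if S_BASE <= first < S_BASE + S_COUNT:
        return pairwise_decomposition(first) + [second]
    return [first, second]


print([f"U+{c:04X}" for c in pairwise_decomposition(0xCE31)])
print([f"U+{c:04X}" for c in full_decomposition(0xCE31)])
```

For U+CE31 the first call yields the pair given above, and the second yields the three conjoining jamos.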

5.5.4 Canonical Combining Class Values

The values in the Canonical_Combining_Class field in UnicodeData.txt are numerical values used in the Canonical Ordering Algorithm. Some of those numerical values also have explicit symbolic labels as property value aliases, to make their intended application more understandable. For convenience in reference, Table 13 lists all the long symbolic aliases for Canonical_Combining_Class values, reproduced from PropertyValueAliases.txt, along with a brief description of each category.

Table 13. Canonical_Combining_Class Values

Value  Long                  Description
0      Not_Reordered         Spacing and enclosing marks; also many vowel and consonant signs, even if nonspacing
1      Overlay               Marks which overlay a base letter or symbol
7      Nukta                 Diacritic nukta marks in Brahmi-derived scripts
8      Kana_Voicing          Hiragana/Katakana voicing marks
9      Virama                Viramas
10                           Start of fixed position classes
199                          End of fixed position classes
200    Attached_Below_Left   Marks attached at the bottom left
202    Attached_Below        Marks attached directly below
204                          Marks attached at the bottom right
208                          Marks attached to the left
210                          Marks attached to the right
212                          Marks attached at the top left
214    Attached_Above        Marks attached directly above
216    Attached_Above_Right  Marks attached at the top right
218    Below_Left            Distinct marks at the bottom left
220    Below                 Distinct marks directly below
222    Below_Right           Distinct marks at the bottom right
224    Left                  Distinct marks to the left
226    Right                 Distinct marks to the right
228    Above_Left            Distinct marks at the top left
230    Above                 Distinct marks directly above
232    Above_Right           Distinct marks at the top right
233    Double_Below          Distinct marks subtending two bases
234    Double_Above          Distinct marks extending above two bases
240    Iota_Subscript        Greek iota subscript only

Some of the Canonical_Combining_Class values in the table are not currently used for any characters but are specified here for completeness. Some values do not have long symbolic aliases, but these two sets are not congruent. Do not assume that absence of a long symbolic alias implies non-use of a particular Canonical_Combining_Class. See DerivedCombiningClass.txt for a complete listing of the use of Canonical_Combining_Class values for any particular version of the UCD.

Combining marks with ccc=224 (Left) follow their base character in storage, as for all combining marks, but are rendered visually on the left side of it. For all past versions of the UCD and continuing with this version of the UCD, only two tone marks used in certain notations for Hangul syllables have ccc=224. Those marks are actually rendered visually on the left side of the preceding grapheme cluster, in the case of Hangul syllables resulting from sequences of conjoining jamos.

Those few instances of combining marks with ccc=Left should be distinguished from the far more numerous examples of left-side vowel signs and vowel letters in Brahmi-derived scripts. The Canonical_Combining_Class value is zero (Not_Reordered) both for ordinary, left-side (reordrant) vowel signs such as U+093F DEVANAGARI VOWEL SIGN I and for Thai-style left-side (Logical_Order_Exception=Yes) vowel letters such as U+0E40 THAI CHARACTER SARA E. The "Not_Reordered" of ccc=Not_Reordered refers to the behavior of the character in terms of the Canonical Ordering Algorithm as part of the definition of Unicode Normalization; it does not refer to any issues of visual reordering of glyphs involved in display and rendering. See Section 3.11, "Canonical Ordering Behavior" in [Unicode].
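The contrast between ccc=Left and visually left-side vowels can be checked with Python's unicodedata.combining(), which returns the Canonical_Combining_Class for the Unicode version bundled with the interpreter (an assumption: that version is normally later than 5.2.0). The helper name code_points_with_ccc is hypothetical.

```python
import sys
import unicodedata


def code_points_with_ccc(value):
    """List every code point whose Canonical_Combining_Class equals value."""
    return [cp for cp in range(sys.maxunicode + 1)
            if unicodedata.combining(chr(cp)) == value]


# Left-side vowels are Not_Reordered (0); the Hangul tone marks are Left (224).
for cp in (0x093F, 0x0E40, 0x302E, 0x302F):
    print(f"U+{cp:04X} ccc={unicodedata.combining(chr(cp))}")

print([f"U+{cp:04X}" for cp in code_points_with_ccc(224)])
```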

5.5.5 Decompositions and Normalization

Decomposition is specified in Chapter 3, Conformance of [Unicode]. UAX #15, Unicode Normalization Forms [UAX15] specifies the interaction between decomposition and normalization. That annex specifies how the decompositions defined in UnicodeData.txt are used to derive normalized forms of Unicode text.

A number of derived properties related to Unicode normalization are called the "Quick_Check" properties. These are defined to enable various optimizations for implementations of normalization, as explained in Section 14, "Detecting Normalization Forms", in UAX #15, Unicode Normalization Forms [UAX15]. The values for the four Quick_Check properties for all code points are listed in DerivedNormalizationProps.txt. The interpretations of the possible property values are summarized in Table 14.

Table 14. Quick_Check Property Values

Property                          Value  Description
NFC_QC, NFKC_QC, NFD_QC, NFKD_QC  No     Characters that cannot ever occur in the respective normalization form.
NFC_QC, NFKC_QC                   Maybe  Characters that may occur in the respective normalization, depending on the context.
NFC_QC, NFKC_QC, NFD_QC, NFKD_QC  Yes    All other characters. This is the default value for Quick_Check properties.
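How the three values might be consumed is sketched below in Python, following the general shape of the quick check described in [UAX15]. The table NFC_QC_SAMPLE is a hypothetical three-entry excerpt; a real implementation would load every NFC_QC entry from DerivedNormalizationProps.txt. The function name quick_check_nfc is likewise an assumption of this sketch.

```python
import unicodedata

# Hypothetical excerpt of NFC_QC data; code points not listed default to Yes.
NFC_QC_SAMPLE = {
    0x0340: "N",  # COMBINING GRAVE TONE MARK
    0x212B: "N",  # ANGSTROM SIGN
    0x0301: "M",  # COMBINING ACUTE ACCENT
}


def quick_check_nfc(text, qc_table=NFC_QC_SAMPLE):
    """Return "Y", "N", or "M" for whether text is in NFC.

    "M" means that a full normalization pass is needed to decide.
    """
    last_ccc = 0
    result = "Y"
    for ch in text:
        ccc = unicodedata.combining(ch)
        if ccc != 0 and ccc < last_ccc:
            return "N"  # combining marks are not in canonical order
        value = qc_table.get(ord(ch), "Y")
        if value == "N":
            return "N"
        if value == "M":
            result = "M"
        last_ccc = ccc
    return result


print(quick_check_nfc("abc"), quick_check_nfc("e\u0301"), quick_check_nfc("\u212B"))
```

Because Yes is the default value, only the No and Maybe entries need to be stored.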

5.6 Property and Property Value Aliases

Both Unicode character properties themselves and their values are given symbolic aliases. The formal lists of aliases are provided so that well-defined symbolic values are available for XML formats of the UCD data, for regular expression property tests, and for other programmatic textual descriptions of Unicode data. The aliases for properties are defined in PropertyAliases.txt. The aliases for property values are defined in PropertyValueAliases.txt.

Table 15. Alias Files in the UCD

File Name                 Status  Description
PropertyAliases.txt       N       Names and abbreviations for properties
PropertyValueAliases.txt  N       Names and abbreviations for property values

Aliases are defined as ASCII-compatible identifiers, using only uppercase or lowercase A-Z, digits, and underscore "_". Case is not significant when comparing aliases, but the preferred form used in the data files for longer aliases is to titlecase them.

Aliases may be translated in appropriate environments, and additional aliases may be useful in certain contexts. There is no requirement that only the aliases defined in the alias files of the UCD be used when referring to Unicode character properties or their values; however, their use is recommended for interoperability in data formats or in programmatic contexts.

5.6.1 Property Aliases

In PropertyAliases.txt, the first field specifies an abbreviated symbolic name for the property, and the second field specifies the long symbolic name for the property. These are the preferred aliases. Additional aliases for a few properties are specified in the third or subsequent fields.

Aliases for normative and informative properties defined in the Unihan data files are included in PropertyAliases.txt, beginning with Version 5.2.

The long symbolic name alias is self-descriptive, and is treated as the official name of a Unicode character property. For clarity it is used whenever possible when referring to that property in this annex and elsewhere in the Unicode Standard. For example: "The Line_Break property is discussed in UAX #14, Unicode Line Breaking Algorithm [UAX14]."

The abbreviated symbolic name alias is short and less mnemonic, but is useful for expressions such as "lb=BA" in data or in other contexts where the meaning is clear.

The property aliases specified in PropertyAliases.txt constitute a unique name space. When using these symbolic values, no alias for one property will match an alias for another property.

5.6.2 Property Value Aliases

In PropertyValueAliases.txt, the first field contains the abbreviated alias for a Unicode property, the second field specifies an abbreviated symbolic name for a value of that property, and the third field specifies the long symbolic name for that value of that property. These are the preferred aliases. Additional aliases for some property values may be specified in the fourth or subsequent fields. For example, for binary properties, the abbreviated alias for the True value is "Y", and the long alias is "Yes", but each entry also specifies "T" and "True" as additional aliases for that value, as shown in Table 16.

Table 16. Binary Property Value Aliases

Long  Abbreviated  Other Aliases
Yes   Y            True, T
No    N            False, F

Not every property value has an associated alias. Property value aliases are typically supplied for catalog and enumeration properties, which have well-defined, enumerated values. It does not make sense to specify property value aliases, for example, for the Numeric_Value property, whose value could be any number, or for a string property such as Simple_Lowercase_Mapping, whose values are mappings from one code point to another.

The Canonical_Combining_Class property requires special handling in PropertyValueAliases.txt. The values of this property are numeric, but they comprise a closed, enumerated set of values. The more important of those values are given symbolic name aliases. In PropertyValueAliases.txt, the second field provides the numeric value, while the third field contains the abbreviated symbolic name alias and the fourth field contains the long symbolic name alias for that numeric value. For example:

ccc; 230; A    ; Above
ccc; 232; AR   ; Above_Right

Taken by themselves, property value aliases do not constitute a unique name space. The abbreviated aliases, in particular, are often re-used as aliases for values for different properties. All of the binary property value aliases, for example, make use of the same "Y", "Yes", "T", "True" symbols. Property value aliases may also overlap the symbols used for property aliases. For example, "Sc" is the abbreviated alias for the "Currency_Symbol" value of the General_Category property, but it is also the abbreviated alias for the Script property. However, the aliases for values for any single property are always unique within the context of that property. That means that expressions that combine a property alias and a property value alias, such as "lb=BA" or "gc=Sc", always refer unambiguously just to one value of one given property, and will not match any other value of any other property.

The property value alias entries for three properties, Age, Block, and Joining_Group, make use of a special metavalue "n/a" in the field for the abbreviated alias. This should be understood as meaning that no abbreviated alias is defined for that value for that property, rather than as an alias per se.

In a few cases, because of longstanding legacy practice in referring to values of a property by short identifiers, the abbreviated alias and the long alias are the same. This can be seen, for example, in some property value aliases for the Line_Break property and the Grapheme_Cluster_Break property.
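The field layout of PropertyValueAliases.txt, including the special case for Canonical_Combining_Class and the "n/a" metavalue, can be handled along the following lines. This is a minimal Python sketch; the function name parse_value_alias_line and the tuple layout it returns are assumptions of the example, and the sample lines in the usage section are illustrative rather than quoted from any particular version of the file.

```python
def parse_value_alias_line(line):
    """Parse one line of PropertyValueAliases.txt.

    Returns (property, numeric, short, long, extras), or None for comment
    and blank lines. numeric is set only for ccc entries; short is None
    when the file gives the metavalue "n/a".
    """
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [f.strip() for f in data.split(";")]
    prop = fields[0]
    numeric = None
    if prop == "ccc":
        # Special layout: the numeric value precedes the symbolic aliases.
        numeric = int(fields[1])
        fields = fields[:1] + fields[2:]
    short = None if fields[1] == "n/a" else fields[1]
    return prop, numeric, short, fields[2], fields[3:]


for sample in ("ccc; 230; A    ; Above",
               "Alpha; N ; No ; F ; False",
               "blk; n/a ; Basic_Latin"):
    print(parse_value_alias_line(sample))
```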

5.7 Matching Rules

When matching Unicode character property names and values, it is strongly recommended that all Property and Property Value Aliases be recognized. For best results in matching, rather than using exact binary comparisons, the following loose matching rules should be observed.

Numeric Property Values

For all numeric properties, and for properties such as Unicode_Radical_Stroke which are constructed from combinations of numeric values, use loose matching rule UAX44-LM1 when comparing property values.

UAX44-LM1. Apply numeric equivalences.
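This annex does not enumerate the numeric equivalences in question. One plausible reading, shown in the Python sketch below, is to compare values as exact rationals, so that "01.00" matches "1" and "0.5" matches "1/2". The function name numeric_values_match is hypothetical, and composite values such as those of Unicode_Radical_Stroke would need to be split into their numeric parts first, which this sketch does not attempt.

```python
from fractions import Fraction


def numeric_values_match(a, b):
    """Compare two numeric property values as exact rational numbers."""
    return Fraction(a.strip()) == Fraction(b.strip())


print(numeric_values_match("01.00", "1"))
print(numeric_values_match("0.5", "1/2"))
print(numeric_values_match("1", "-1"))
```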

Character Names

Unicode character names constitute a special case. Formally, they are values of the Name property. While each Unicode character name for an assigned character is guaranteed to be unique, names are assigned in such a way that the presence or absence of spaces cannot be used to distinguish them. Furthermore, implementations sometimes create identifiers from Unicode character names by inserting underscores for spaces. For best results in comparing Unicode character names, use loose matching rule UAX44-LM2.

UAX44-LM2. Ignore case, whitespace, underscore ('_'), and all medial hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E.
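A possible implementation of UAX44-LM2 is sketched below in Python. It assumes that a "medial" hyphen is one with a letter or digit immediately on both sides, so that a hyphen adjacent to a space is preserved; the function name loose_name_key is hypothetical.

```python
import re


def loose_name_key(name):
    """Return a comparison key for a character name under UAX44-LM2."""
    key = name.upper().replace("_", " ")
    key = " ".join(key.split())  # collapse runs of whitespace
    # Remove medial hyphens: those with a letter or digit on both sides.
    stripped = re.sub(r"(?<=[A-Z0-9])-(?=[A-Z0-9])", "", key)
    stripped = stripped.replace(" ", "")
    # Exception: the hyphen in U+1180 HANGUL JUNGSEONG O-E is significant.
    if stripped == "HANGULJUNGSEONGOE" and key.endswith("O-E"):
        return "HANGULJUNGSEONGO-E"
    return stripped


print(loose_name_key("zero-width space") == loose_name_key("ZERO WIDTH SPACE"))
print(loose_name_key("Hangul Jungseong O-E") == loose_name_key("HANGUL JUNGSEONG OE"))
```

Two names are then considered to match when their keys are equal. Input from which the spaces have already been removed cannot be distinguished reliably by this approach, because a formerly non-medial hyphen then looks medial.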

Symbolic Values

Property aliases and property value aliases are symbolic values. When comparing them, use loose matching rule UAX44-LM3.

UAX44-LM3. Ignore case, whitespace, underscore ('_'), and hyphens.

Loose matching is generally appropriate for the property values of Catalog, Enumeration, and Binary properties, which have symbolic aliases defined for their values. Loose matching should not be done for the property values of String properties, which do not have symbolic aliases defined for their values; exact matching for String property values is important, as case distinctions or other distinctions in those values may be significant.
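Rule UAX44-LM3 reduces to a simple key function, sketched here in Python. The name loose_alias_key is hypothetical; the rule itself is applied exactly as stated above.

```python
def loose_alias_key(alias):
    """Return a comparison key for a property or property value alias.

    Case, whitespace, underscores, and hyphens are ignored (UAX44-LM3).
    """
    return "".join(ch for ch in alias
                   if not (ch.isspace() or ch in "_-")).lower()


for spelling in ("Line_Break", "linebreak", "LINE-BREAK", " line break "):
    print(repr(spelling), "->", loose_alias_key(spelling))
```

A lookup table built from PropertyAliases.txt and PropertyValueAliases.txt can be indexed by these keys. As noted above, such keys should not be used for the values of String properties.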

5.8 Invariants

Property values in the UCD may be subject to correction in subsequent versions of the standard, as errors are found. Also, some multi-valued properties such as Line_Break or Word_Break may have additional values defined for them. However, some property values and some aspects of the file formats are considered invariant. This section documents such invariants.

5.8.1 Character Property Invariants

All formally guaranteed invariants for properties or property values are described in the Unicode Character Encoding Stability Policy [Stability]. That policy and the list of invariants it enumerates are maintained outside the context of the Unicode Standard per se. They are not part of the standard, but rather are constraints on what can and cannot change in the standard between versions, and on what decisions the Unicode Technical Committee can and cannot take regarding the standard.

In addition to the formally guaranteed invariants described in the Unicode Character Encoding Stability Policy, this section notes a few additional points regarding character property invariants in the UCD.

Some character properties are simply considered immutable: once assigned, they are never changed. For example, a character's name is immutable, because of its importance in exact identification of the character. The Canonical_Combining_Class and Decomposition_Mapping of a character are immutable, because of their importance to the stability of the Unicode Normalization Algorithm [UAX15].

The list of immutable character properties is shown in Table 17.

Table 17. Immutable Properties

Property Name              Abbr Name
Name                       na
Jamo_Short_Name            jsn
Canonical_Combining_Class  ccc
Decomposition_Mapping      dm
Pattern_Syntax             Pat_Syn
Pattern_White_Space        Pat_WS

In some cases, a property is not immutable, but the list of possible values that it can have is considered invariant. For example, while at least some General_Category values are subject to change and correction, the enumerated set of possible values that the General_Category property can have is fixed and cannot be added to in the future.

All characters other than those of General_Category M* are guaranteed to have Canonical_Combining_Class=0. Currently it is also true that all characters other than those of General_Category Mn have Canonical_Combining_Class=0. However, the more constrained statement is not a guaranteed invariant; it is possible that some new character of General_Category Me or Mc could be given a non-zero value for Canonical_Combining_Class in the future.

In Unicode 4.0 and thereafter, the General_Category value Decimal_Number (Nd), and the Numeric_Type value Decimal (de) are defined to be co-extensive; that is, the set of characters having General_Category=Nd will always be the same as the set of characters having Numeric_Type=de.
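An implementation can test this invariant against its own data. The Python sketch below assumes that unicodedata.decimal() returns a value exactly for the characters that have the Numeric_Type value Decimal in the Unicode version bundled with the interpreter; the function name nd_decimal_mismatches is hypothetical.

```python
import sys
import unicodedata


def nd_decimal_mismatches(start=0, stop=sys.maxunicode + 1):
    """Return code points where gc=Nd and a decimal value do not coincide."""
    mismatches = []
    for cp in range(start, stop):
        ch = chr(cp)
        is_nd = unicodedata.category(ch) == "Nd"
        is_decimal = unicodedata.decimal(ch, None) is not None
        if is_nd != is_decimal:
            mismatches.append(cp)
    return mismatches


# An empty list is expected if the invariant holds for the bundled data.
print(nd_decimal_mismatches(0x0000, 0x3000))
```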

5.8.2 UCD File Format Invariants

There are also some constraints on allowable change in the file formats for UCD files. In general, the file format conventions are changed as little as possible, to minimize the impact on implementations which parse the machine-readable data files. However, some of the constraints on allowable file format change go beyond conservatism in format and instead have the status of invariants. These guarantees apply in particular to UnicodeData.txt, the very first data file associated with the UCD.

The number and order of the fields in UnicodeData.txt is fixed. Any additional information about character properties to be added to the UCD in the future will appear in separate data files, rather than being added as an additional field to UnicodeData.txt or by reinterpretation of any of the existing fields.
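Because the number and order of fields is fixed, a parser can safely check the field count and address fields by position. The following Python sketch reads only the first six of the fifteen fields (numbered 0 through 14); the function name parse_unicode_data_line and the dictionary keys are hypothetical.

```python
UNICODE_DATA_FIELD_COUNT = 15  # fields 0..14, separated by semicolons


def parse_unicode_data_line(line):
    """Parse one line of UnicodeData.txt, relying on the fixed field order."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != UNICODE_DATA_FIELD_COUNT:
        raise ValueError(f"expected {UNICODE_DATA_FIELD_COUNT} fields, "
                         f"got {len(fields)}")
    return {
        "code_point": int(fields[0], 16),
        "name": fields[1],
        "general_category": fields[2],
        "canonical_combining_class": int(fields[3]),
        "bidi_class": fields[4],
        "decomposition_field": fields[5],  # empty means the default value
    }


line = ("00C5;LATIN CAPITAL LETTER A WITH RING ABOVE;Lu;0;L;0041 030A;"
        ";;;N;LATIN CAPITAL LETTER A RING;;;00E5;")
print(parse_unicode_data_line(line))
```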

5.8.3 Invariants in Implementations

Applications may wish to take the various character property and file format invariants into account when choosing how to implement character properties.

The Canonical_Combining_Class offers a good example. The character property invariants regarding Canonical_Combining_Class guarantee that values, once assigned, will never change, and that all values used will be in the range 0..255. This means that the Canonical_Combining_Class can be safely implemented in an unsigned byte and that any value stored in a table for an existing character will not need to be updated dynamically for a later version.

In practice, for Canonical_Combining_Class far fewer than 256 values are used. Unicode 3.0 used 53 values; Unicode 3.1 through Unicode 4.1 used 54 values; and Unicode 5.0 through Unicode 5.1 used 55 values. New, non-zero Canonical_Combining_Class values are seldom added to the standard. (For details about this history, see DerivedCombiningClass.txt.) Implementations may take advantage of this fact for compression, because only the ordering of the non-zero values, and not their absolute values, matters for the Canonical Ordering Algorithm. In principle, it would be possible for up to 256 values to be used in the future, but the chances of the actual number of values exceeding 128 are remote at this point. There are implementation advantages in restricting the number of internal class values to 128; for example, the ability to use signed bytes without implicit widening to ints in Java.
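The compression idea can be sketched in Python by replacing each Canonical_Combining_Class value with its rank among the values actually in use. The sketch takes its data from unicodedata.combining(), so the count it reports reflects the Unicode version bundled with the interpreter rather than the versions cited above; the function name ccc_rank_table is hypothetical.

```python
import sys
import unicodedata


def ccc_rank_table():
    """Map each ccc value in use to a dense rank that preserves ordering."""
    in_use = sorted({unicodedata.combining(chr(cp))
                     for cp in range(sys.maxunicode + 1)})
    return {ccc: rank for rank, ccc in enumerate(in_use)}


table = ccc_rank_table()
print(len(table), "distinct values in use; largest rank is", max(table.values()))
```

Since ranks keep the relative order of the original values, comparing ranks gives the same result in the Canonical Ordering Algorithm as comparing the values themselves, and zero still maps to zero.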

5.9 Validation

The Unicode character property values in the UCD files can be validated by means of regular expressions. Such validation can also be useful in testing of implementations that return property values. The method of validation depends on the type of property, as described below. These expressions use Perl syntax, but may of course be converted to other formal conventions for use with other regular expression engines.

The regular expressions which are appropriate for validation of particular properties may change in each subsequent version of the UCD. However, because of stability guarantees for character property aliases, these regular expressions for one version of the Unicode Standard will match valid values for previous versions of the standard.

5.9.1 Enumerated and Binary Properties

Enumerated and binary character properties can be validated by generating a regular expression using the PropertyValueAliases.txt file. Because enumerated properties have a defined list of possible values, the validating regular expression simply ORs together all of the possible values. Binary properties are a special case of enumerated property, with a predefined very short list of possible values.

For example, to validate the East_Asian_Width property in the UCD, or to test an implementation that returns the East_Asian_Width property, parse the following relevant lines from PropertyValueAliases.txt and produce a regular expression that concatenates each of the short and long property alias values.

# East_Asian_Width (ea)
ea ; A         ; Ambiguous
ea ; F         ; Fullwidth
ea ; H         ; Halfwidth
ea ; N         ; Neutral
ea ; Na        ; Narrow
ea ; W         ; Wide

The resulting regular expression would then be:

  /A|Ambiguous|F|Fullwidth|H|Halfwidth|N|Neutral|Na|Narrow|W|Wide/

For each Unicode binary character property, the regular expression can be precomputed simply as:

  /N|No|F|False|Y|Yes|T|True/

The Catalog properties, Age, Block, and Script, are another type of enumerated character property. All possible values of those properties for any given version of the Unicode Standard are listed in PropertyValueAliases.txt, so a validating regular expression for a Catalog property for that given version of the UCD can be generated by concatenating values, as for the other enumerated properties.
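The generation step described in this section can be sketched in Python as follows. The function name build_value_regex is hypothetical, and the sketch uses Python's re module rather than Perl; whole values are validated with fullmatch so that, for example, "N" does not shadow "Na".

```python
import re

EA_LINES = """\
# East_Asian_Width (ea)
ea ; A         ; Ambiguous
ea ; F         ; Fullwidth
ea ; H         ; Halfwidth
ea ; N         ; Neutral
ea ; Na        ; Narrow
ea ; W         ; Wide
"""


def build_value_regex(text, prop):
    """OR together every alias listed for prop in PropertyValueAliases text."""
    aliases = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if fields[0] == prop:
            aliases.extend(a for a in fields[1:] if a != "n/a")
    return re.compile("|".join(re.escape(a) for a in aliases))


ea_regex = build_value_regex(EA_LINES, "ea")
print(ea_regex.pattern)
print(bool(ea_regex.fullmatch("Na")), bool(ea_regex.fullmatch("Narrowish")))
```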

5.9.2 Canonical_Combining_Class Property

The Canonical_Combining_Class (ccc) property is a hybrid type. The possible values defined for it in UnicodeData.txt range from 0 to 255 and are numeric values. However, Canonical_Combining_Class also has symbolic aliases defined for those particular values that are in actual use; those symbolic aliases are listed in PropertyValueAliases.txt. To produce a validating regular expression for Canonical_Combining_Class, concatenate together the symbolic aliases from PropertyValueAliases.txt, and then add the numeric range 0..255.
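A validating expression for the ccc property might be assembled as in the Python sketch below. The alias list passed in the usage lines is a small illustrative sample, not the full set from PropertyValueAliases.txt, and the names NUMERIC_0_255 and build_ccc_regex are hypothetical. The numeric alternative rejects leading zeros, which is a choice made for this sketch.

```python
import re

# Matches the decimal integers 0..255 without leading zeros.
NUMERIC_0_255 = r"25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9]"


def build_ccc_regex(symbolic_aliases):
    """Combine symbolic aliases (a non-empty list) with the range 0..255."""
    names = "|".join(re.escape(a) for a in symbolic_aliases)
    return re.compile(f"(?:{names}|{NUMERIC_0_255})")


ccc_regex = build_ccc_regex(["NR", "Not_Reordered", "A", "Above",
                             "AR", "Above_Right"])
for value in ("230", "Above", "255", "256"):
    print(value, bool(ccc_regex.fullmatch(value)))
```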

5.9.3 Unihan Properties

The validating regular expressions for each property tag defined in the Unihan database are described in detail in [UAX38].

5.9.4 Other Properties

Regular expressions to validate String and Miscellaneous properties in the UCD are provided in Table 19. Although Catalog properties may use strict tests, as described in Section 5.9.1 Enumerated and Binary Properties, generic patterns for Age, Block, and Script are also provided in Table 19.

To simplify the presentation of these expressions, commonly occurring subexpressions are first abstracted out as variables defined in Table 18.

Table 18. Common Subexpressions for Validation

Variable          Value
$positiveDecimal  [0-9]+\.[0-9]+
$decimal          -?$positiveDecimal
$optionalDecimal  -?[0-9]+(\.[0-9]+)?
$name             [a-zA-Z0-9]+([_\ -][a-zA-Z0-9]+)*
$codePoint        (10|[A-F0-9])?[A-F0-9]{4}

The regular expressions listed in Table 19 cover all the straightforward cases for other property values. For properties involving somewhat more irregular values, such as Age, ISO_Comment, and Unicode_1_Name, details for validation can be found in [UAX42].

Table 19. Regular Expressions for Other Property Values

Abbr     Name                       Regex for Allowable Values
nv       Numeric_Value              /$decimal/ (Field 2)
                                    /$optionalDecimal/ (Field 3)
blk      Block                      /$name/
sc       Script                     /$name/
dm       Decomposition_Mapping      /$codePoint+/
FC_NFKC  FC_NFKC_Closure            /$codePoint+/
NFKC_CF  NFKC_Casefold              /$codePoint+/
cf       Case_Folding               /$codePoint+/
lc       Lowercase_Mapping          /$codePoint+/
tc       Titlecase_Mapping          /$codePoint+/
uc       Uppercase_Mapping          /$codePoint+/
sfc      Simple_Case_Folding        /$codePoint/
slc      Simple_Lowercase_Mapping   /$codePoint/
stc      Simple_Titlecase_Mapping   /$codePoint/
suc      Simple_Uppercase_Mapping   /$codePoint/
bmg      Bidi_Mirroring_Glyph       /$codePoint/
na       Name                       /$name/
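The subexpressions of Table 18 translate directly into Python's re syntax, as the following sketch shows for a few of the single-value properties in Table 19. In the $name pattern the hyphen in the separator class is written escaped, so that it is read as a literal hyphen rather than as a range operator. The names VALIDATORS and is_valid are hypothetical, and the properties whose values are sequences of code points are omitted because the sketch makes no assumption about how such sequences are delimited.

```python
import re

POSITIVE_DECIMAL = r"[0-9]+\.[0-9]+"
DECIMAL = rf"-?{POSITIVE_DECIMAL}"
OPTIONAL_DECIMAL = r"-?[0-9]+(\.[0-9]+)?"
NAME = r"[a-zA-Z0-9]+([_\- ][a-zA-Z0-9]+)*"
CODE_POINT = r"(10|[A-F0-9])?[A-F0-9]{4}"

VALIDATORS = {
    "nv": re.compile(OPTIONAL_DECIMAL),
    "blk": re.compile(NAME),
    "sc": re.compile(NAME),
    "sfc": re.compile(CODE_POINT),
    "bmg": re.compile(CODE_POINT),
    "na": re.compile(NAME),
}


def is_valid(abbr, value):
    """Check a property value against the generic pattern for that property."""
    return VALIDATORS[abbr].fullmatch(value) is not None


print(is_valid("bmg", "0029"), is_valid("bmg", "110000"))
print(is_valid("na", "LATIN SMALL LETTER A"), is_valid("nv", "-0.5"))
```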

5.10 Deprecation

In the Unicode Standard, the term deprecation is used somewhat differently than it is in some other standards. Deprecation is used to mean that a character or other feature is strongly discouraged from use. This should not, however, be taken as indicating that anything has been removed from the standard, nor that anything is planned for removal from the standard. Any such change is constrained by the Unicode Consortium Stability Policies [Stability].

For the Unicode Character Database, there are two important types of deprecation to be noted. First, an encoded character may be deprecated. Second, a character property may be deprecated.

When an encoded character is strongly discouraged from use, it is given the property value Deprecated=True. The Deprecated property is a binary property defined specifically to carry this information about Unicode characters. Very few characters are ever formally deprecated this way; it is not enough that a character be uncommon, obsolete, disliked, or not preferred. Only those few characters which have been determined by the UTC to have serious architectural defects or which have been determined to cause significant implementation problems are ever deprecated. Even in the most severe cases, such as the deprecated format control characters (U+206A..U+206F), an encoded character is never removed from the standard. Furthermore, although deprecated characters are strongly discouraged from use, and should be avoided in favor of other, more appropriate mechanisms, they may occur in data. Conformant implementations of Unicode processes such as Unicode normalization must handle even deprecated characters correctly.

In the Unicode Character Database, a character property may also become strongly discouraged, usually because it no longer serves the purpose it was originally defined for. In such cases, the property is labelled "deprecated" in the Property Table. For example, see the Grapheme_Link property.

6 Test Files

The UCD contains a number of test data files. Those provide data in standard formats which can be used to test implementations of Unicode algorithms. The test data files distributed with this version of the UCD are listed in Table 20.

Table 20. Unicode Algorithm Test Data Files

File Name              Specification  Status  Unicode Algorithm
BidiTest.txt           [UAX9]         N       Unicode Bidirectional Algorithm
NormalizationTest.txt  [UAX15]        N       Unicode Normalization Algorithm
LineBreakTest.txt      [UAX14]        N       Unicode Line Breaking Algorithm
GraphemeBreakTest.txt  [UAX29]        N       Grapheme Cluster Boundary Determination
WordBreakTest.txt      [UAX29]        N       Word Boundary Determination
SentenceBreakTest.txt  [UAX29]        N       Sentence Boundary Determination

The normative status of these test files reflects their use to determine the correctness of implementations claiming conformance to the respective algorithms listed in the table. There is no requirement that any particular Unicode implementation also implement the Unicode Line Breaking Algorithm, for example, but if it implements that algorithm correctly, it should be able to replicate the test case results specified in the data entries in LineBreakTest.txt.

6.1 NormalizationTest.txt

This file contains data which can be used to test an implementation of the Unicode Normalization Algorithm. (See [UAX15].)

The data file has a Unicode string in the first field (which may consist of just a single code point). The next four fields then specify the expected output results of converting that string to Unicode Normalization Forms NFC, NFD, NFKC, and NFKD, respectively. There are many tricky edge cases included in the input data, to ensure that implementations have correctly implemented some of the more complex subtleties of the Unicode Normalization Algorithm.

The header section of NormalizationTest.txt provides additional information regarding the normalization invariant relations that any conformant implementation should be able to replicate.

The Unicode Normalization Algorithm is not tailorable. Conformant implementations should be expected to produce results as specified in NormalizationTest.txt and should not deviate from those results.
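A test driver for this file can be quite small. The Python sketch below checks one data line against unicodedata.normalize(), using the column order described above. The function name check_normalization_line is hypothetical, and the sketch assumes that "#" introduces a comment, that lines beginning with "@" mark the parts of the file, and that the interpreter's bundled Unicode version is at least that of the test file.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")  # order of fields 2 through 5


def check_normalization_line(line):
    """Check one data line; return {form: passed}, or None for non-data lines."""
    data = line.split("#", 1)[0].strip()
    if not data or data.startswith("@"):
        return None
    columns = ["".join(chr(int(cp, 16)) for cp in field.split())
               for field in data.split(";")[:5]]
    source, expected = columns[0], columns[1:]
    return {form: unicodedata.normalize(form, source) == want
            for form, want in zip(FORMS, expected)}


print(check_normalization_line("1E0A;1E0A;0044 0307;1E0A;0044 0307; # sample"))
print(check_normalization_line("FB01;FB01;FB01;0066 0069;0066 0069;"))
```

A complete driver would also verify the invariant relations documented in the header of the file, which this sketch does not attempt.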

6.2 Segmentation Test Files and Documentation

LineBreakTest.txt, located in the auxiliary directory of the UCD, contains data which can be used to test an implementation of the Unicode Line Breaking Algorithm. (See [UAX14].) The header of that file specifies the data format and the use of the test data to specify line break opportunities. Note that non-ASCII characters are used in this test data as field delimiters.

There is an associated documentation file, LineBreakTest.html, which displays the results of the Line Breaking Algorithm in an interactive chart form, with a documented listing of the rules.

The Unicode text segmentation test data files are also located in the auxiliary directory of the UCD. They contain data which can be used to test an implementation of the segmentation algorithms specified in [UAX29]. The headers of those files specify the data format and the use of the test data to specify text segmentation opportunities. Note that non-ASCII characters are used in this test data as field delimiters.

There are also associated documentation files, which display the results of the segmentation algorithms in an interactive chart form, with a documented listing of the rules.

Unlike the Unicode Normalization Algorithm, the Unicode Line Breaking Algorithm and the various text segmentation algorithms are tailorable, and there is every expectation that implementations will tailor these algorithms to produce results as needed. The test data files only test the default behavior of the algorithms. Testing of tailored implementations will need to modify and/or extend the test cases as appropriate to match any documented tailoring.

6.3 BidiTest.txt

This file contains data which can be used to test an implementation of the Unicode Bidirectional Algorithm. (See [UAX9].)

The data in BidiTest.txt is intended to exhaustively test all possible combinations of Bidi_Class values for strings of length four or less. To allow for the resulting very large number of test cases, the data file has a somewhat complicated format which is described in the header. Fundamentally, for each input string and for each possible input paragraph level, the test data specifies the resulting bidi levels and expected reordering.

The Unicode Bidirectional Algorithm is tailorable within certain limits. Conformant implementations with no tailoring are expected to produce the results as specified in BidiTest.txt and should not deviate from those results. Tailored implementations can also use the data in BidiTest.txt to test for overall conformance to the algorithm by changing the assignment of properties to characters to reflect the details of their tailoring.

7 UCD Change History

This section summarizes the changes to the UCD—including its documentation files—and is organized by Unicode versions. The summary includes changes extending all the way back to Unicode 2.0.0, taken from the obsoleted UCD.html documentation file, which predates the creation of this annex. The intent is for this first consolidated version of the annex to preserve that complete prior history from UCD.html. Subsequent versions of the annex will provide only an abbreviated UCD change history section containing only the delta change information from each preceding version.

Starting from Unicode 4.0.1, references in the change history are often made to a Public Review Issue (PRI). See http://www.unicode.org/review/resolved-pri.html for more information about each of those cases.

Changes documented prior to Unicode 4.0 only covered UnicodeData.txt. From Unicode 4.0 onward, the documentation of changes includes modifications of other files as well.

Unicode 5.2.0

General:

The documentation file UCD.html was obsoleted. The main documentation for the UCD is now contained in [UAX44]. Documentation specifically for the Unihan data files can be found in [UAX38].

Changes in specific files:

Appropriate data files were updated to include the 6,648 new characters added in Unicode 5.2. Nine new properties were added.

Unicode 5.1.0

General:

UCD.html:

Changes in specific files:

Appropriate data files were updated to include the 1,624 new characters added in Unicode 5.1.

Unicode 5.0.0

UCD.html:

Common file changes:

In many data files an explicit default property assignment range was added (in a machine-readable comment line), to assist implementations in assigning values for code points not otherwise listed in the data file.
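A minimal Python sketch of reading such a comment line follows. It assumes the "# @missing:" convention used in UCD data files, where the comment carries a code point range followed by one or more semicolon-separated fields (a default value, sometimes preceded by a property name). The names DefaultRange, MISSING_RE, and parse_missing_line are hypothetical, and the example line is illustrative rather than quoted from a particular file.

```python
import re
from typing import List, NamedTuple, Optional

# Matches e.g. "# @missing: 0000..10FFFF; Some_Value" (assumed syntax).
MISSING_RE = re.compile(
    r"^#\s*@missing:\s*([0-9A-Fa-f]{4,6})\.\.([0-9A-Fa-f]{4,6})\s*;\s*(.*)$"
)


class DefaultRange(NamedTuple):
    start: int          # first code point of the range
    end: int            # last code point of the range (inclusive)
    fields: List[str]   # remaining fields, e.g. [value] or [property, value]


def parse_missing_line(line: str) -> Optional[DefaultRange]:
    """Return the default assignment on a comment line, or None."""
    match = MISSING_RE.match(line.strip())
    if match is None:
        return None
    fields = [field.strip() for field in match.group(3).split(";")]
    return DefaultRange(int(match.group(1), 16), int(match.group(2), 16), fields)


default = parse_missing_line("# @missing: 0000..10FFFF; Not_Reordered")
ordinary = parse_missing_line("# An ordinary comment line")
```

An implementation could apply the parsed default to every code point in the range first, and then overwrite it with the values listed explicitly in the data lines.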

Changes in specific files:

Appropriate data files were updated to include the 1,369 new characters added in Unicode 5.0.

Two new data files, NameAliases.txt and NamedSequencesProv.txt, were added to the UCD.

Unicode 4.1.0

General:

UCD.html:

Common file changes:

All remaining files not corrected for Unicode 4.0.1 have had their headers updated to explicitly point to Terms of Use. The headers have also been synchronized somewhat to share a more common format for file version, date, and pointers to documentation. The major exception is UnicodeData.txt, which for legacy reasons, has no header.

Changes in specific files:

Appropriate data files were updated to include the 1,273 new characters added in Unicode 4.1.0.

The description of the Unihan properties was separated out from UCD.html, extensively revised, and moved into a new documentation file, Unihan.html.

Unicode 4.0.1

UCD.html:

Common file changes:

Some property values have different casing (upper versus lower) for consistency between the data files and the PropertyValueAlias file. There are some additional changes in comments:

Changes in specific files:

Unicode 4.0.0

General:

For details on changes made to the UCD for Unicode 4.0.0, see Section D.4, "Changes from Unicode Version 3.2 to Version 4.0" in Appendix D of The Unicode Standard, Version 4.0.

Common file changes:

Default property values were more precisely defined, for code points not explicitly listed in the data files.

Changes in specific files:

Unicode 3.2.0

General:

For details on changes made to the UCD for Unicode 3.2.0, see Section D.3, "Changes from Unicode Version 3.1 to Version 3.2" in Appendix D of The Unicode Standard, Version 4.0.

Changes in specific files:

Appropriate data files were updated to include the 1,016 new characters added in Unicode 3.2.0.

Unicode 3.1.1

Changes in specific files:

Unicode 3.1.0

General:

For details on changes made to the UCD for Unicode 3.1.0, see Section D.2, "Changes from Unicode Version 3.0 to Version 3.1" in Appendix D of The Unicode Standard, Version 4.0.

Changes in specific files:

Appropriate data files were updated to include the 2,237 new entries, to cover new individual characters and the new ranges of Unified CJK Ideographs encoded in Unicode 3.1.0.

Unicode 3.0.1

General:

Changes in specific files:

Unicode 3.0.0

Modifications made for Version 3.0.0 of UnicodeData.txt include many new characters and a number of property changes. These are summarized in Appendix D of The Unicode Standard, Version 3.0.

Unicode 2.1.9

Modifications made for Version 2.1.9 of UnicodeData.txt include:

Unicode 2.1.8

Modifications made for Version 2.1.8 of UnicodeData.txt include:

Version 2.1.7

This version was for internal change tracking only, and never publicly released.

Version 2.1.6

This version was for internal change tracking only, and never publicly released.

Unicode 2.1.5

Modifications made for Version 2.1.5 of UnicodeData.txt include:

Version 2.1.4

This version was for internal change tracking only, and never publicly released.

Version 2.1.3

This version was for internal change tracking only, and never publicly released.

Unicode 2.1.2

Modifications made in updating UnicodeData.txt to Version 2.1.2 for the Unicode Standard, Version 2.1 (from Version 2.0) include:

Version 2.1.1

This version was for internal change tracking only, and never publicly released.

Unicode 2.0.0

The modifications made in updating UnicodeData.txt for the Unicode Standard, Version 2.0 include:

Acknowledgments

Mark Davis and Ken Whistler are the authors of the initial version and have added to and maintained the text of this annex. Julie Allen and Asmus Freytag provided editorial suggestions for improvement of the text. Over the years, many members of the UTC have participated in the review of the UCD and its documentation.

References

For references for this annex, see Unicode Standard Annex #41, “Common References for Unicode Standard Annexes.”

Modifications

For details of the change history, see the online copy of this annex at http://www.unicode.org/reports/tr44/.

The following summarizes modifications from previous revisions of this annex.

Revision 4 [KW]

  • Reissued for Unicode 5.2.0.
  • Completely reorganized and rewritten, to include all the content from the obsoleted UCD.html.
  • Added Section 5.10 re deprecation.
  • Added subsection in Section 4.2 re line termination conventions.
  • Added Contributory as a formal status and updated the Property Table accordingly.
  • Added note in Section 5.3.1 to indicate that contributory properties are neither normative nor informative.
  • Updated documentation for default values.
  • Cleaned up description of numeric properties.
  • Tweaked the description of NamesList.html.
  • Miscellaneous minor point edits.
  • Updated summary statement of the document.
  • Centered tables.
  • Added anchors and numbers to tables and adjusted text referencing tables accordingly.
  • Added clarifications about exceptional format issues for Unihan data files.
  • Updated references to Section 4.8, "Name—Normative" for derived names and for code point labels.
  • Added mention of property aliases from Unihan data files to Section 5.6.1.
  • Added documentation for new derived properties: Cased, Case_Ignorable, Changes_When_Lowercased, Changes_When_Uppercased, Changes_When_Titlecased, Changes_When_Casefolded, Changes_When_Casemapped, NFKC_Casefold, and Changes_When_NFKC_Casefolded.
  • Added strong pointers to Section 3.5 and Chapter 4 of [Unicode] in the Introduction.
  • Added new Section 2.3.1, "Changes to Properties Between Releases".
  • Updated default values for East_Asian_Width.
  • Clarified the applicability of comments in cases where properties have multiple default values.
  • Restructured Section 5.1 documentation of columns in the property table, for better text flow.
  • Reordered entries for DerivedCoreProperties.txt in the property table, for clarity.
  • Added documentation of new test file: BidiTest.txt.
  • Updated terminology related to the Unihan Database.
  • Added documentation for the new data file, CJKRadicals.txt.
  • Added Attached_Above for ccc=214 in Table 13.
  • Complete revision of Validation section and associated tables.
  • Minor revision of text in Section 4.1.5, "File Directory Differences for Early Releases."
  • Added a cautionary note about the use of the Age property in regular expressions.
  • Added sections explaining obsolete, deprecated, and stabilized properties, and clearly identified existing such properties in the property table.

Revision 3 being a proposed update, only changes between versions 4 and 2 are noted here.

Revision 2

  • Initial approved version for Unicode 5.1.0.

Revision 1

  • Initial draft.

Copyright © 2000-2009 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.

