Unicode® Standard Annex #29

Unicode Text Segmentation

Version	Unicode 17.0.0
Editors	Josh Hadley
Date	2025-08-17
This Version	https://www.unicode.org/reports/tr29/tr29-47.html
Previous Version	https://www.unicode.org/reports/tr29/tr29-45.html
Latest Version	https://www.unicode.org/reports/tr29/
Latest Proposed Update	https://www.unicode.org/reports/tr29/proposed.html
Revision	47

Summary

This annex describes guidelines for determining default segmentation boundaries between certain significant text elements: grapheme clusters (“user-perceived characters”), words, and sentences. For line boundaries, see [UAX14].

Status

This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.

A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published online as a separate document. The Unicode Standard may require conformance to normative content in a Unicode Standard Annex, if so specified in the Conformance chapter of that version of the Unicode Standard. The version number of a UAX document corresponds to the version of the Unicode Standard of which it forms a part.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this annex is found in Unicode Standard Annex #41, “Common References for Unicode Standard Annexes.” For the latest version of the Unicode Standard, see [Unicode]. For a list of current Unicode Technical Reports, see [Reports]. For more information about versions of the Unicode Standard, see [Versions]. For any errata which may apply to this annex, see [Errata].

1 Introduction

This annex describes guidelines for determining default boundaries between certain significant text elements: user-perceived characters, words, and sentences. The process of boundary determination is also called segmentation.

A string of Unicode-encoded text often needs to be broken up into text elements programmatically. Common examples of text elements include what users think of as characters, words, lines (more precisely, where line breaks are allowed), and sentences. The precise determination of text elements may vary according to orthographic conventions for a given script or language. The goal of matching user perceptions cannot always be met exactly because the text alone does not always contain enough information to unambiguously decide boundaries. For example, the period (U+002E FULL STOP) is used ambiguously, sometimes for end-of-sentence purposes, sometimes for abbreviations, and sometimes for numbers. In most cases, however, programmatic text boundaries can match user perceptions quite closely, although sometimes the best that can be done is not to surprise the user.

Rather than concentrate on algorithmically searching for text elements (often called segments), a simpler and more useful computation instead detects the boundaries (or breaks) between those text elements. The determination of those boundaries is often critical to performance, so it is important to be able to make such a determination as quickly as possible. (For a general discussion of text elements, see Chapter 2, General Structure, of [Unicode].)

The default boundary determination mechanism specified in this annex provides a straightforward and efficient way to determine some of the most significant boundaries in text: user-perceived characters, words, and sentences. Boundaries used in line breaking (also called word wrapping) are defined in [UAX14].

The sheer number of characters in the Unicode Standard, together with its representational power, places requirements on both the specification of text element boundaries and the underlying implementation. The specification needs to allow the designation of large sets of characters sharing the same characteristics (for example, uppercase letters), while the implementation must provide quick access and matches to those large sets. The mechanism also must handle special features of the Unicode Standard, such as nonspacing marks and conjoining jamos.

The default boundary determination builds upon the uniform character representation of the Unicode Standard, while handling the large number of characters and special features such as nonspacing marks and conjoining jamos in an effective manner. As this mechanism lends itself to a completely data-driven implementation, it can be tailored to particular orthographic conventions or user preferences without recoding.

As in other Unicode algorithms, these specifications provide a logical description of the processes: implementations can achieve the same results without using code or data that follows these rules step-by-step. In particular, many production-grade implementations will use a state-table approach. In that case, the performance does not depend on the complexity or number of rules. Rather, performance is only affected by the number of characters that may match after the boundary position in a rule that applies.

1.1 Notation

A boundary specification summarizes boundary property values used in that specification, then lists the rules for boundary determinations in terms of those property values. The summary is provided as a list, where each element of the list is one of the following:

A literal character
A range of literal characters
All characters satisfying a given condition, using properties defined in the Unicode Character Database [UCD]:
- Non-Boolean property values are given as <property> = <property value>, such as General_Category = Titlecase_Letter.
- Boolean properties are given as <property> = Yes, such as Uppercase = Yes.
- Other conditions are specified textually in terms of UCD properties.
Boolean combinations of the above
Two special identifiers, sot and eot, standing for start of text and end of text, respectively

For example, the following is such a list:

General_Category = Line_Separator, or
General_Category = Paragraph_Separator, or
General_Category = Control, or
General_Category = Format
and not U+000D CARRIAGE RETURN (CR)
and not U+000A LINE FEED (LF)
and not U+200C ZERO WIDTH NON-JOINER (ZWNJ)
and not U+200D ZERO WIDTH JOINER (ZWJ)
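
As an illustration only, the example list above can be read as a predicate over single characters. The sketch below uses Python's standard `unicodedata` module; the function name `in_example_list` is hypothetical. It assumes that the four "and not" clauses apply to the whole union of categories, since CR and LF have General_Category = Control rather than Format. The result depends on the Unicode version bundled with the Python runtime.

```python
import unicodedata

# General_Category short names used by unicodedata:
# Zl = Line_Separator, Zp = Paragraph_Separator, Cc = Control, Cf = Format.
CATEGORIES = {"Zl", "Zp", "Cc", "Cf"}

# Code points removed by the "and not" clauses: CR, LF, ZWNJ, ZWJ.
EXCLUDED = {0x000D, 0x000A, 0x200C, 0x200D}


def in_example_list(ch: str) -> bool:
    """Return True if the single character ch satisfies the example list."""
    return unicodedata.category(ch) in CATEGORIES and ord(ch) not in EXCLUDED
```

For instance, U+2028 LINE SEPARATOR and U+00AD SOFT HYPHEN satisfy the predicate, while CR, LF, ZWJ, and ordinary letters do not.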

In the table assigning the boundary property values, all of the values are intended to be disjoint except for the special value Any. In case of conflict, rows higher in the table have precedence in terms of assigning property values to characters. Data files containing explicit assignments of the property values are found in [Props].

Boundary determination is specified in terms of an ordered list of rules, indicating the status of a boundary position. The rules are numbered for reference and are applied in sequence to determine whether there is a boundary at any given offset. That is, there is an implicit “otherwise” at the front of each rule following the first. The rules are processed from top to bottom. As soon as a rule matches and produces a boundary status (boundary or no boundary) for that offset, the process is terminated.

Each rule consists of a left side, a boundary symbol (see Table 1), and a right side. Either of the sides can be empty. The left and right sides use the boundary property values in regular expressions. The regular expression syntax used is a simplified version of the format supplied in Unicode Technical Standard #18, “Unicode Regular Expressions” [UTS18].

Table 1. Boundary Symbols

÷	Boundary (allow break here)
×	No boundary (do not allow break here)
→	Treat whatever is on the left side as if it were what is on the right side

An open-box symbol (“␣”) is used to indicate a space in examples.
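
The ordered, first-match-wins evaluation described above can be sketched as follows. This is a toy model, not the normative rule set: it uses only three property values (CR, LF, and a catch-all Other), four illustrative rules, and looks at one character on each side of the offset, whereas real rules may need more context. The names `prop`, `RULES`, and `is_boundary` are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

SOT, EOT = "sot", "eot"  # pseudo-values for start of text and end of text

# A rule looks at the property values on each side of an offset and returns
# True (boundary, ÷), False (no boundary, ×), or None (rule does not apply).
Rule = Callable[[str, str], Optional[bool]]


def prop(ch: str) -> str:
    """Toy property assignment: only CR and LF get their own values."""
    return {"\r": "CR", "\n": "LF"}.get(ch, "Other")


RULES: List[Tuple[str, Rule]] = [
    ("sot ÷ Any", lambda left, right: True if left == SOT else None),
    ("Any ÷ eot", lambda left, right: True if right == EOT else None),
    ("CR × LF", lambda left, right: False if (left, right) == ("CR", "LF") else None),
    ("Any ÷ Any", lambda left, right: True),
]


def is_boundary(text: str, offset: int) -> bool:
    """Apply the rules top to bottom; the first rule that matches decides."""
    if not text:
        return False  # no boundaries in empty text
    left = prop(text[offset - 1]) if offset > 0 else SOT
    right = prop(text[offset]) if offset < len(text) else EOT
    for _name, rule in RULES:
        status = rule(left, right)
        if status is not None:
            return status
    raise AssertionError("unreachable: the last rule always applies")
```

Because "CR × LF" is listed before the catch-all rule, the offset between CR and LF is never a boundary, while every other offset in non-empty text is.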

1.2 Rule Constraints

These rules are constrained in three ways, to make implementations significantly simpler and more efficient. These constraints have not been found to be limitations for natural language use. In particular, the rules are formulated so that they can be efficiently implemented, such as with a deterministic finite-state machine based on a small number of property values.

Single boundaries. Each rule has exactly one boundary position. This restriction is more a limitation on the specification methods, because a rule with multiple boundaries could be expressed instead as multiple rules. For example:
- “a b ÷ c d ÷ e f” could be broken into two rules “a b ÷ c d e f” and “a b c d ÷ e f”
- “a b × c d × e f” could be broken into two rules “a b × c d e f” and “a b c d × e f”
Limited negation. Negation of expressions is limited to instances that resolve to a match against single characters, such as “¬(OLetter | Upper | Lower | Sep)”.
Ignore degenerates. No special provisions are made to get marginally better behavior for degenerate cases that never occur in practice, such as an A followed by an Indic combining mark.
Script boundaries. Script boundaries are treated as degenerate cases in these rules, so the string “aquaφοβία” is treated as a single word, and the sequence ‘a’ + ‘ ि’ as a single grapheme cluster. However, implementations are free to customize boundary testing to break at script boundaries, which may be especially useful for grapheme clusters. When this is done, the Common/Inherited values need to be handled properly, and the Script_Extensions property should be used instead of the Script property alone.

2 Conformance

There are many different ways to divide text elements corresponding to user-perceived characters, words, and sentences, and the Unicode Standard does not restrict the ways in which implementations can produce these divisions. However, it does provide conformance clauses to enable implementations to clearly describe their behavior in relation to the default behavior.

UAX29-C1. Extended Grapheme Cluster Boundaries: An implementation shall choose either UAX29-C1-1 or UAX29-C1-2 to determine whether an offset within a sequence of characters is an extended grapheme cluster boundary.

UAX29-C1-1. Use the property values defined in the Unicode Character Database [UCD] and the extended rules in Section 3.1 Grapheme Cluster Boundary Rules to determine the boundaries.

The default grapheme clusters are also known as extended grapheme clusters.

UAX29-C1-2. Declare the use of a profile of UAX29-C1-1, and define that profile with a precise specification of any changes in property values or rules and/or provide a description of programmatic overrides to the behavior of UAX29-C1-1.

Legacy grapheme clusters are such a profile.

UAX29-C2. Word Boundaries: An implementation shall choose either UAX29-C2-1 or UAX29-C2-2 to determine whether an offset within a sequence of characters is a word boundary.

UAX29-C2-1. Use the property values defined in the Unicode Character Database [UCD] and the rules in Section 4.1 Default Word Boundary Specification to determine the boundaries.

UAX29-C2-2. Declare the use of a profile of UAX29-C2-1, and define that profile with a precise specification of any changes in property values or rules and/or provide a description of programmatic overrides to the behavior of UAX29-C2-1.

UAX29-C3. Sentence Boundaries: An implementation shall choose either UAX29-C3-1 or UAX29-C3-2 to determine whether an offset within a sequence of characters is a sentence boundary.

UAX29-C3-1. Use the property values defined in the Unicode Character Database [UCD] and the rules in Section 5.1 Default Sentence Boundary Specification to determine the boundaries.

UAX29-C3-2. Declare the use of a profile of UAX29-C3-1, and define that profile with a precise specification of any changes in property values or rules and/or provide a description of programmatic overrides to the behavior of UAX29-C3-1.

This specification defines default mechanisms; more sophisticated implementations can and should tailor them for particular locales or environments and, for the purpose of claiming conformance, document the tailoring in the form of a profile. For example, reliable detection of word boundaries in languages such as Thai, Lao, Chinese, or Japanese requires the use of dictionary lookup or other mechanisms, analogous to English hyphenation. An implementation therefore may need to provide means for a programmatic override of the default mechanisms described in this annex. Note that a profile can both add and remove boundary positions, compared to the results specified by UAX29-C1-1, UAX29-C2-1, or UAX29-C3-1.

Notes:
Locale-sensitive boundary specifications, including boundary suppressions, can be expressed in LDML [UTS35]. Some profiles are available in the Common Locale Data Repository [CLDR].
Some changes to rules and data are needed for best segmentation behavior of additional emoji zwj sequences [UTS51]. Implementations are strongly encouraged to use the extended text segmentation rules in the latest version of CLDR.

To maintain canonical equivalence, all of the following specifications are defined on text normalized in form NFD, as defined in Unicode Standard Annex #15, “Unicode Normalization Forms” [UAX15]. Boundaries never occur within a combining character sequence or conjoining sequence, so the boundaries within non-NFD text can be derived from corresponding boundaries in the NFD form of that text. For convenience, the default rules have been written so that they can be applied directly to non-NFD text and yield equivalent results. (This may not be the case with tailored default rules.) For more information, see Section 6, Implementation Notes.

3 Grapheme Cluster Boundaries

A single Unicode code point is often, but not always, the same as a basic unit of a writing system for a language, or what a typical user might think of as a “character”. There are many cases where such a basic unit is made up of multiple Unicode code points. To avoid ambiguity with the term character as defined for encoding purposes, it can be useful to speak of a user-perceived character. For example, “G” + grave-accent is a user-perceived character: users think of it as a single character, yet it is actually represented by two Unicode code points.

The notion of user-perceived character is not always an unambiguous concept for a given writing system: it may differ based on language, script style, or even based on context, for the same user. Drop-caps and initialisms, text selection, or “character” counting for text size limits are all contexts in which the basic unit may be defined differently.

In implementations, the notion of user-perceived characters corresponds to the concept of grapheme clusters. They are a best-effort approximation that can be determined programmatically and unambiguously. The definition of grapheme clusters attempts to achieve uniformity across all human text without requiring language or font metadata about that text. As an approximation, it may not cover all potential types of user-perceived characters, and it may have suboptimal behavior in some scripts where further metadata is needed, or where a different notion of user-perceived character is preferred. Such special cases may require a customization of the algorithm, while the generic case continues to be supported by the standard algorithm.

As far as a user is concerned, the underlying representation of text is not important, but it is important that an editing interface present a uniform implementation of what the user thinks of as characters. Grapheme clusters can be treated as units, by default, for processes such as the formatting of drop caps, as well as the implementation of text selection, arrow key movement, forward deletion, and so forth. For example, when a grapheme cluster is represented internally by a character sequence consisting of base character + accents, then using the right arrow key would skip from the start of the base character to the end of the last accent.

Grapheme cluster boundaries are also important for collation, regular expressions, UI interactions, segmentation for vertical text, identification of boundaries for first-letter styling, and counting “character” positions within text. Word boundaries, line boundaries, and sentence boundaries should not occur within a grapheme cluster: in other words, a grapheme cluster should be an atomic unit with respect to the process of determining these other boundaries.

This document defines a default specification for grapheme clusters. It may be customized for particular languages, operations, or other situations. For example, arrow key movement could be tailored by language, or could use knowledge specific to particular fonts to move in a more granular manner, in circumstances where it would be useful to edit individual components. This could apply, for example, to the complex editorial requirements for the Northern Thai script Tai Tham (Lanna). Similarly, editing a grapheme cluster element by element may be preferable in some circumstances. For example, on a given system the backspace key might delete by code point, while the delete key may delete an entire cluster.

Moreover, there is not a one-to-one relationship between grapheme clusters and keys on a keyboard. A single key on a keyboard may correspond to a whole grapheme cluster, a part of a grapheme cluster, or a sequence of more than one grapheme cluster.

Grapheme clusters can only provide an approximation of where to put cursors. Detailed cursor placement depends on the text editing framework. The text editing framework determines where the edges of glyphs are, and how they correspond to the underlying characters, based on information supplied by the lower-level text rendering engine and font. For example, the text editing framework must know if a digraph is represented as a single glyph in the font, and therefore may not be able to position a cursor at the proper position separating its two components. That framework must also be able to determine display representation in cases where two glyphs overlap; this is true generally when a character is displayed together with a subsequent nonspacing mark, but must also be determined in detail for complex script rendering. For cursor placement, grapheme cluster boundaries can only supply an approximate guide, using least-common-denominator fonts for the script.

In those relatively rare circumstances where programmers need to supply end users with user-perceived character counts, the counts should correspond to the number of segments delimited by grapheme cluster boundaries. Grapheme clusters may also be used in searching and matching; for more information, see Unicode Technical Standard #10, “Unicode Collation Algorithm” [UTS10], and Unicode Technical Standard #18, “Unicode Regular Expressions” [UTS18].

The Unicode Standard provides a default algorithm for determining grapheme cluster boundaries; the default grapheme clusters are also known as extended grapheme clusters. For backwards compatibility with earlier versions of this specification, the Standard also defines and maintains a profile for legacy grapheme clusters.

These algorithms can be adapted to produce tailored grapheme clusters for specific locales or other customizations, such as the contractions used in collation tailoring tables. Table 1a shows some examples of the differences between these concepts. The tailored examples are only for illustration: what constitutes a grapheme cluster will depend on the customizations used by the particular tailoring in question.

Table 1a. Sample Grapheme Clusters

Ex	Characters	Comments
Grapheme clusters (both legacy and extended)
g̈	0067 ( g ) LATIN SMALL LETTER G 0308 ( ◌̈ ) COMBINING DIAERESIS	combining character sequences
각	AC01 ( 각 ) HANGUL SYLLABLE GAG	Hangul syllables such as gag (which may be a single character, or a sequence of conjoining jamos)
각	1100 ( ᄀ ) HANGUL CHOSEONG KIYEOK 1161 ( ᅡ ) HANGUL JUNGSEONG A 11A8 ( ᆨ ) HANGUL JONGSEONG KIYEOK
ก	0E01 ( ก ) THAI CHARACTER KO KAI	Thai ko
Extended grapheme clusters
நி	0BA8 ( ந ) TAMIL LETTER NA 0BBF ( ி ) TAMIL VOWEL SIGN I	Tamil ni
เ	0E40 ( เ ) THAI CHARACTER SARA E	Thai e
กำ	0E01 ( ก ) THAI CHARACTER KO KAI 0E33 ( ำ ) THAI CHARACTER SARA AM	Thai kam
षि	0937 ( ष ) DEVANAGARI LETTER SSA 093F ( ि ) DEVANAGARI VOWEL SIGN I	Devanagari ssi
क्षि	0915 ( क ) DEVANAGARI LETTER KA 094D ( ् ) DEVANAGARI SIGN VIRAMA 0937 ( ष ) DEVANAGARI LETTER SSA 093F ( ि ) DEVANAGARI VOWEL SIGN I	Devanagari kshi
Legacy grapheme clusters
ำ	0E33 ( ำ ) THAI CHARACTER SARA AM	Thai am
ष	0937 ( ष ) DEVANAGARI LETTER SSA	Devanagari ssa
ि	093F ( ि ) DEVANAGARI VOWEL SIGN I	Devanagari i
Possible tailored grapheme clusters in a profile
ch	0063 ( c ) LATIN SMALL LETTER C 0068 ( h ) LATIN SMALL LETTER H	Slovak ch digraph
kʷ	006B ( k ) LATIN SMALL LETTER K 02B7 ( ʷ ) MODIFIER LETTER SMALL W	sequence with modifier letter

See also: Where is my Character?, and the UCD file NamedSequences.txt [Data34].

A legacy grapheme cluster is defined as a base (such as A or カ) followed by zero or more continuing characters. One way to think of this is as a sequence of characters that form a “stack”.

The base can be single characters, or be any sequence of Hangul Jamo characters that form a Hangul Syllable, as defined by D133 in The Unicode Standard, or be a pair of Regional_Indicator (RI) characters. For more information about RI characters, see [UTS51].

The continuing characters include nonspacing marks, the Join_Controls (U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER) used in Indic languages, and a few spacing combining marks to ensure canonical equivalence. There are cases in Bangla, Khmer, Malayalam, and Odiya in which a ZWNJ occurs after a consonant and before a virama or other combining mark. These cases should not provide an opportunity for a grapheme cluster break. Therefore, ZWNJ has been included in the Extend class. Additional cases need to be added for completeness, so that any string of text can be divided up into a sequence of grapheme clusters. Some of these may be degenerate cases, such as a control code, or an isolated combining mark.
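
The "base followed by continuing characters" shape can be sketched in a few lines. The following is a deliberately simplified approximation, not the normative algorithm. It treats a pair of Regional_Indicator characters or any single character as a base, and treats only General_Category Mn and Me plus the two Join_Controls as continuing characters. It ignores Hangul syllable sequences, CR LF, controls, emoji sequences, and the few spacing marks mentioned above. The function names are hypothetical.

```python
import unicodedata
from typing import List


def is_continuing(ch: str) -> bool:
    """Approximate continuing characters: Mn, Me, ZWNJ, ZWJ (assumption)."""
    return unicodedata.category(ch) in ("Mn", "Me") or ch in ("\u200c", "\u200d")


def is_ri(ch: str) -> bool:
    """Regional_Indicator characters occupy U+1F1E6..U+1F1FF."""
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def simple_legacy_clusters(text: str) -> List[str]:
    """Split text into approximate legacy grapheme clusters."""
    clusters: List[str] = []
    i = 0
    while i < len(text):
        start = i
        # Base: a pair of RI characters, or else any single character
        # (an isolated combining mark becomes a degenerate cluster).
        if is_ri(text[i]) and i + 1 < len(text) and is_ri(text[i + 1]):
            i += 2
        else:
            i += 1
        # Zero or more continuing characters stack onto the base.
        while i < len(text) and is_continuing(text[i]):
            i += 1
        clusters.append(text[start:i])
    return clusters
```

With this sketch, <U+0067, U+0308> forms one cluster, while <U+0937, U+093F> splits into two because U+093F is a spacing mark, matching the legacy rows of Table 1a.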

An extended grapheme cluster is the same as a legacy grapheme cluster, with the addition of some other characters. The continuing characters are extended to include all spacing combining marks, such as the spacing (but dependent) vowel signs in Indic scripts. For example, this includes U+093F ( ि ) DEVANAGARI VOWEL SIGN I. The extended grapheme clusters should be used in implementations in preference to legacy grapheme clusters, because they provide better results for Indic scripts such as Tamil or Devanagari in which editing by orthographic syllable is typically preferred. For scripts such as Thai, Lao, and certain other Southeast Asian scripts, editing by visual unit is typically preferred, so for those scripts the behavior of extended grapheme clusters is similar to (but not identical to) the behavior of legacy grapheme clusters.

For the rules defining the boundaries for grapheme clusters, see Section 3.1. For more information on the composition of Hangul syllables, see Chapter 3, Conformance, of [Unicode].

A key feature of Unicode grapheme clusters (both legacy and extended) is that they remain unchanged across all canonically equivalent forms of the underlying text. Thus the boundaries remain unchanged whether the text is in NFC or NFD. Using a grapheme cluster as the fundamental unit of matching thus provides a very clear and easily explained basis for canonically equivalent matching. This is important for applications from searching to regular expressions.
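
A small experiment with Python's standard `unicodedata.normalize` illustrates the point. The code point count changes between NFC and NFD, but a cluster-style count does not. The helper `count_units` is a hypothetical, rough stand-in for real grapheme cluster counting: it simply skips nonspacing and enclosing marks, and assumes the text does not begin with such a mark.

```python
import unicodedata


def count_units(text: str) -> int:
    """Count characters that start a unit, skipping Mn and Me marks.

    A rough approximation of grapheme cluster counting, for illustration.
    """
    return sum(1 for ch in text if unicodedata.category(ch) not in ("Mn", "Me"))


# The same word in two canonically equivalent forms.
nfc = unicodedata.normalize("NFC", "re\u0301sume\u0301")
nfd = unicodedata.normalize("NFD", nfc)

print(len(nfc), len(nfd))                  # code point counts differ
print(count_units(nfc), count_units(nfd))  # unit counts agree
```

A length limit or cursor position expressed in such units therefore behaves the same way whichever normalization form the stored text happens to use.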

Another key feature is that default Unicode grapheme clusters are atomic units with respect to the process of determining the Unicode default word and sentence boundaries. They are usually, but not always, atomic units with respect to line boundaries: there are exceptions due to the special handling of spaces. For more information, see Section 9.2, Legacy Support for Space Character as Base for Combining Marks, in [UAX14].

Grapheme clusters can be tailored to meet further requirements. Such tailoring is permitted, but the possible rules are outside of the scope of this document. One example of such a tailoring would be for the aksaras, or orthographic syllables, used in many Indic scripts. Aksaras usually consist of a consonant, sometimes with an inherent vowel and sometimes followed by an explicit, dependent vowel whose rendering may end up on any side of the consonant letter base. Extended grapheme clusters include such simple combinations.

However, aksaras may also include one or more additional consonants, typically with a virama (halant) character between each pair of consonants in the sequence. Some consonant cluster aksaras are not incorporated into the default rules for extended grapheme clusters, in part because not all such sequences are considered to be single “characters” by users. Another reason is that additional changes to the rules are made when new information becomes available. Indic scripts vary considerably in how they handle the rendering of such aksaras: in some cases stacking them up into combined forms known as consonant conjuncts, and in other cases stringing them out horizontally, with visible renditions of the halant on each consonant in the sequence. There is even greater variability in how the typical liquid consonants (or “medials”), ya, ra, la, and wa, are handled for display in combinations in aksaras. So tailorings for aksaras may need to be script-, language-, font-, or context-specific to be useful.

Note: Font-based information may be required to determine the appropriate unit to use for UI purposes, such as identification of boundaries for first-letter paragraph styling. For example, such a unit could be a ligature formed of two grapheme clusters, such as لا (Arabic lam + alef).

The Unicode specification of grapheme clusters allows for more sophisticated profiles where appropriate. Such definitions may more precisely match the user expectations within individual languages for given processes. For example, “ch” may be considered a grapheme cluster in Slovak, for processes such as collation. The default definitions are, however, designed to provide a much more accurate match to overall user expectations for what the user perceives as characters than is provided by individual Unicode code points.

Note: The term cluster is used to emphasize that the term grapheme is used differently in linguistics.

Display of Grapheme Clusters. Grapheme clusters are not the same as ligatures. For example, the grapheme cluster “ch” in Slovak is not normally a ligature and, conversely, the ligature “fi” is not a grapheme cluster. Default grapheme clusters do not necessarily reflect text display. For example, the sequence <f, i> may be displayed as a single glyph on the screen, but would still be two grapheme clusters.

For information on the matching of grapheme clusters with regular expressions, see Unicode Technical Standard #18, “Unicode Regular Expressions” [UTS18].

Degenerate Cases. The default specifications are designed to be simple to implement, and provide an algorithmic determination of grapheme clusters. However, they do not have to cover edge cases that will not occur in practice. For the purpose of segmentation, they may also include degenerate cases that are not thought of as grapheme clusters, such as an isolated control character or combining mark. In this, they differ from the combining character sequences and extended combining character sequences defined in [Unicode]. In addition, Unassigned (Cn) code points and Private_Use (Co) characters are given property values that anticipate potential usage.

Combining Character Sequences and Grapheme Clusters. For comparison, Table 1b shows the relationship between combining character sequences and grapheme clusters, using regex notation. Note that given alternates (X|Y), the first match is taken. The simple identifiers starting with lowercase are variables that are defined in Table 1c; those starting with uppercase letters are Grapheme_Cluster_Break Property Values defined in Table 2.

Table 1b. Combining Character Sequences and Grapheme Clusters

Term	Regex	Notes
combining character sequence	`ccs-base? ccs-extend+`	A single base character is not a combining character sequence. However, a single combining mark is a (degenerate) combining character sequence.
extended combining character sequence	`extended_base? ccs-extend+`	extended_base includes Hangul Syllables
legacy grapheme cluster	`crlf \| Control \| legacy-core legacy-postcore*`	A single base character is a grapheme cluster. Degenerate cases include any isolated non-base characters, and non-base characters like controls.
extended grapheme cluster	`crlf \| Control \| precore* core postcore*`	Extended grapheme clusters add prepending and spacing marks.

Table 1b uses several symbols defined in Table 1c. Square brackets and \p{...} are used to indicate sets of characters, using the normal UnicodeSet notation.
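
As a hedged illustration of the first row of Table 1b, the pattern `ccs-base? ccs-extend+` can be checked with Python's standard `unicodedata` module. The sketch relies on the Table 1c definitions of ccs-base (General_Category L, N, P, S, or Zs) and ccs-extend (General_Category M, or Join_Control). The function names are hypothetical, and results depend on the Unicode version bundled with the Python runtime.

```python
import unicodedata

JOIN_CONTROL = {"\u200c", "\u200d"}  # ZWNJ and ZWJ


def is_ccs_base(ch: str) -> bool:
    """ccs-base: [\\p{L}\\p{N}\\p{P}\\p{S}\\p{Zs}]"""
    cat = unicodedata.category(ch)
    return cat[0] in "LNPS" or cat == "Zs"


def is_ccs_extend(ch: str) -> bool:
    """ccs-extend: [\\p{M}\\p{Join_Control}]"""
    return unicodedata.category(ch)[0] == "M" or ch in JOIN_CONTROL


def is_combining_character_sequence(s: str) -> bool:
    """Match the whole string against: ccs-base? ccs-extend+"""
    if not s:
        return False
    rest = s[1:] if is_ccs_base(s[0]) else s
    return len(rest) > 0 and all(is_ccs_extend(ch) for ch in rest)
```

Consistent with the notes in Table 1b, a lone base such as "a" is rejected, while a lone combining mark is accepted as a degenerate sequence.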

Table 1c. Regex Definitions

`ccs-base :=`	`[\p{L}\p{N}\p{P}\p{S}\p{Zs}]`
`ccs-extend :=`	`[\p{M}\p{Join_Control}]`
`extended_base :=`	`ccs-base \| hangul-syllable`
`crlf :=`	`CR LF \| CR \| LF`
`legacy-core :=`	`hangul-syllable \| RI-Sequence \| xpicto-sequence \| [^Control CR LF]`
`legacy-postcore :=`	`[Extend ZWJ]`
`core :=`	`hangul-syllable \| RI-Sequence \| xpicto-sequence \| conjunctCluster \| [^Control CR LF]`
`postcore :=`	`[Extend ZWJ SpacingMark]`
`precore :=`	`Prepend`
`RI-Sequence :=`	`RI RI`
`hangul-syllable :=`	`L* (V+ \| LV V* \| LVT) T* \| L+ \| T+`
`xpicto-sequence :=`	`\p{Extended_Pictographic} (Extend* ZWJ \p{Extended_Pictographic})*`
`conjunctCluster :=`	`\p{InCB=Consonant} ([\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Linker} [\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Consonant})+`
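
The `hangul-syllable` definition in Table 1c translates almost directly into an ordinary regular expression once each character is mapped to a one-letter class. The sketch below is illustrative only. It hard-codes the Hangul_Syllable_Type ranges instead of reading them from the UCD, and it ignores the non-Hangul characters that Table 2 also assigns to V. It distinguishes LV from LVT arithmetically: precomposed syllables start at U+AC00 with 28 trailing-consonant slots per leading consonant and vowel pair. The one-letter codes A (for LV) and B (for LVT) and all function names are hypothetical.

```python
import re

S_BASE, S_COUNT, T_COUNT = 0xAC00, 11172, 28


def hangul_class(ch: str) -> str:
    """Map a character to L, V, T, A (= LV), B (= LVT), or x (other)."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F or 0xA960 <= cp <= 0xA97C:
        return "L"
    if 0x1160 <= cp <= 0x11A7 or 0xD7B0 <= cp <= 0xD7C6:
        return "V"
    if 0x11A8 <= cp <= 0x11FF or 0xD7CB <= cp <= 0xD7FB:
        return "T"
    if S_BASE <= cp < S_BASE + S_COUNT:
        # Every 28th precomposed syllable has no trailing consonant (LV).
        return "A" if (cp - S_BASE) % T_COUNT == 0 else "B"
    return "x"


# hangul-syllable := L* (V+ | LV V* | LVT) T* | L+ | T+
HANGUL_SYLLABLE = re.compile(r"L*(?:V+|AV*|B)T*|L+|T+")


def is_hangul_syllable(s: str) -> bool:
    """Return True if the whole string matches hangul-syllable."""
    encoded = "".join(hangul_class(ch) for ch in s)
    return HANGUL_SYLLABLE.fullmatch(encoded) is not None
```

Both the precomposed U+AC01 and the jamo sequence <U+1100, U+1161, U+11A8> match, in line with the two Hangul rows of Table 1a.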

3.1 Default Grapheme Cluster Boundary Specification

The following is a general specification for grapheme cluster boundaries—language-specific rules in [CLDR] should be used where available.

The Grapheme_Cluster_Break property value assignments are explicitly listed in the corresponding data file in [Props]. The values in that file are the normative property values.

For illustration, property values are summarized in Table 2, but the lists of characters are illustrative.

Table 2. Grapheme_Cluster_Break Property Values

Value	Summary List of Characters
CR	U+000D CARRIAGE RETURN (CR)
LF	U+000A LINE FEED (LF)
Control	General_Category = Line_Separator, or General_Category = Paragraph_Separator, or General_Category = Control, or General_Category = Unassigned and Default_Ignorable_Code_Point, or General_Category = Format and not U+000D CARRIAGE RETURN and not U+000A LINE FEED and not U+200C ZERO WIDTH NON-JOINER (ZWNJ) and not U+200D ZERO WIDTH JOINER (ZWJ) and not Prepended_Concatenation_Mark = Yes
Extend	Grapheme_Extend = Yes, or Emoji_Modifier = Yes. This includes: General_Category = Nonspacing_Mark; General_Category = Enclosing_Mark; U+200C ZERO WIDTH NON-JOINER; plus a few General_Category = Spacing_Mark needed for canonical equivalence.
ZWJ	U+200D ZERO WIDTH JOINER
Regional_Indicator(RI)	Regional_Indicator = Yes Thisconsists of the range: U+1F1E6 REGIONAL INDICATOR SYMBOLLETTER A ..U+1F1FF REGIONAL INDICATOR SYMBOL LETTER Z
Prepend	Indic_Syllabic_Category = Consonant_Preceding_Repha,or Indic_Syllabic_Category = Consonant_Prefixed,or Prepended_Concatenation_Mark = Yes
SpacingMark	Grapheme_Cluster_Break ≠ Extend,and General_Category = Spacing_Mark, or anyof the following (which have General_Category = Other_Letter): U+0E33 ( ำ ) THAI CHARACTER SARA AM U+0EB3( ຳ ) LAO VOWEL SIGN AM Exceptions:The following (which have General_Category = Spacing_Markandwould otherwise be included) are specifically excluded: U+102B ( ါ ) MYANMAR VOWEL SIGN TALL AA U+102C( ာ ) MYANMAR VOWEL SIGN AA U+1038( း ) MYANMAR SIGN VISARGA U+1062( ၢ ) MYANMAR VOWEL SIGN SGAW KAREN EU ..U+1064( ၤ ) MYANMAR TONE MARK SGAW KAREN KE PHO U+1067 ( ၧ ) MYANMAR VOWEL SIGN WESTERN PWO KAREN EU ..U+106D ( ၭ ) MYANMAR SIGN WESTERN PWO KAREN TONE-5 U+1083 ( ႃ ) MYANMAR VOWEL SIGN SHAN AA U+1087( ႇ ) MYANMAR SIGN SHAN TONE-2 ..U+108C( ႌ ) MYANMAR SIGN SHAN COUNCIL TONE-3 U+108F( ႏ ) MYANMAR SIGN RUMAI PALAUNG TONE-5 U+109A( ႚ ) MYANMAR SIGN KHAMTI TONE-1 ..U+109C( ႜ ) MYANMAR VOWEL SIGN AITON A U+1A61( ᩡ ) TAI THAM VOWEL SIGN A U+1A63( ᩣ ) TAI THAM VOWEL SIGN AA U+1A64( ᩤ ) TAI THAM VOWEL SIGN TALL AA U+AA7B( ꩻ ) MYANMAR SIGN PAO KAREN TONE U+AA7D( ꩽ ) MYANMAR SIGN TAI LAING TONE-5 U+11720( 𑜠 ) AHOM VOWEL SIGN A U+11721( 𑜡 ) AHOM VOWEL SIGN AA
L	Hangul_Syllable_Type=L, such as: U+1100 ( ᄀ ) HANGUL CHOSEONG KIYEOK U+115F ( ᅟ ) HANGUL CHOSEONG FILLER U+A960 ( ꥠ ) HANGUL CHOSEONG TIKEUT-MIEUM U+A97C ( ꥼ ) HANGUL CHOSEONG SSANGYEORINHIEUH
V	Hangul_Syllable_Type=V, such as: U+1160 ( ᅠ ) HANGUL JUNGSEONG FILLER U+11A2 ( ᆢ ) HANGUL JUNGSEONG SSANGARAEA U+D7B0 ( ힰ ) HANGUL JUNGSEONG O-YEO U+D7C6 ( ퟆ ) HANGUL JUNGSEONG ARAEA-E, and: U+16D63 ( 𖵣 ) KIRAT RAI VOWEL SIGN AA U+16D67 ( 𖵧 ) KIRAT RAI VOWEL SIGN E .. U+16D6A ( 𖵪 ) KIRAT RAI VOWEL SIGN AU
T	Hangul_Syllable_Type=T, such as: U+11A8 ( ᆨ ) HANGUL JONGSEONG KIYEOK U+11F9 ( ᇹ ) HANGUL JONGSEONG YEORINHIEUH U+D7CB ( ퟋ ) HANGUL JONGSEONG NIEUN-RIEUL U+D7FB ( ퟻ ) HANGUL JONGSEONG PHIEUPH-THIEUTH
LV	Hangul_Syllable_Type=LV, that is: U+AC00 ( 가 ) HANGUL SYLLABLE GA U+AC1C ( 개 ) HANGUL SYLLABLE GAE U+AC38 ( 갸 ) HANGUL SYLLABLE GYA ...
LVT	Hangul_Syllable_Type=LVT, that is: U+AC01 ( 각 ) HANGUL SYLLABLE GAG U+AC02 ( 갂 ) HANGUL SYLLABLE GAGG U+AC03 ( 갃 ) HANGUL SYLLABLE GAGS U+AC04 ( 간 ) HANGUL SYLLABLE GAN ...
E_Base	This value is obsolete and unused.
E_Modifier	This value is obsolete and unused.
Glue_After_Zwj	This value is obsolete and unused.
E_Base_GAZ (EBG)	This value is obsolete and unused.
Any	This is not a property value; it is used in the rules to represent any code point.

3.1.1 Grapheme Cluster Boundary Rules

The same rules are used for the two variants of grapheme clusters, except the rules GB9a, GB9b, and GB9c. The following table shows the differences, which are also marked on the rules themselves. The extended rules are recommended, except where the legacy variant is required for a specific environment.

Grapheme Cluster Variant	Includes	Excludes
LG: legacy grapheme clusters		GB9a, GB9b, GB9c
EG: extended grapheme clusters	GB9a, GB9b, GB9c

When citing the Unicode definition of grapheme clusters, it must be clear which of the two alternatives is being specified: extended versus legacy.

Break at the start and end of text, unless the text is empty.
GB1	sot	÷	Any
GB2	Any	÷	eot
Do not break between a CR and LF. Otherwise, break before and after controls.
GB3	CR	×	LF
GB4	(Control | CR | LF)	÷
GB5		÷	(Control | CR | LF)
Do not break Hangul syllable or other conjoining sequences.
GB6	L	×	(L | V | LV | LVT)
GB7	(LV | V)	×	(V | T)
GB8	(LVT | T)	×	T
Do not break before extending characters or ZWJ.
GB9		×	(Extend | ZWJ)
The GB9a and GB9b rules only apply to extended grapheme clusters: Do not break before SpacingMarks, or after Prepend characters.
GB9a		×	SpacingMark
GB9b	Prepend	×
The GB9c rule only applies to extended grapheme clusters: Do not break within certain combinations with Indic_Conjunct_Break (InCB)=Linker.
GB9c	\p{InCB=Consonant} [ \p{InCB=Extend} \p{InCB=Linker} ]* \p{InCB=Linker} [ \p{InCB=Extend} \p{InCB=Linker} ]*	×	\p{InCB=Consonant}
Do not break within emoji modifier sequences or emoji zwj sequences.
GB11	\p{Extended_Pictographic} Extend* ZWJ	×	\p{Extended_Pictographic}
Do not break within emoji flag sequences. That is, do not break between regional indicator (RI) symbols if there is an odd number of RI characters before the break point.
GB12	sot (RI RI)* RI	×	RI
GB13	[^RI] (RI RI)* RI	×	RI
Otherwise, break everywhere.
GB999	Any	÷	Any
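
As a rough illustration of how the rules above compose, the following Python sketch applies GB3 through GB9 and GB12/GB13 in order. It is only a sketch under stated assumptions: the helper names gcb, is_break, and clusters are hypothetical, the property lookup is a hand-written approximation rather than the normative data file, and GB9a, GB9b, GB9c, and GB11 are omitted entirely.

```python
import unicodedata

def gcb(ch):
    """Rough stand-in for Grapheme_Cluster_Break (assumption: not the data file)."""
    cp = ord(ch)
    if ch == "\r":
        return "CR"
    if ch == "\n":
        return "LF"
    if cp == 0x200D:
        return "ZWJ"
    if cp == 0x200C or 0x1F3FB <= cp <= 0x1F3FF:
        return "Extend"  # ZWNJ and the emoji modifiers
    cat = unicodedata.category(ch)
    if cat in ("Mn", "Me"):
        return "Extend"
    if cat in ("Cc", "Zl", "Zp"):
        return "Control"  # simplified: Format characters are not handled
    if 0x1F1E6 <= cp <= 0x1F1FF:
        return "RI"
    if 0x1100 <= cp <= 0x115F or 0xA960 <= cp <= 0xA97C:
        return "L"
    if 0x1160 <= cp <= 0x11A7 or 0xD7B0 <= cp <= 0xD7C6:
        return "V"
    if 0x11A8 <= cp <= 0x11FF or 0xD7CB <= cp <= 0xD7FB:
        return "T"
    if 0xAC00 <= cp <= 0xD7A3:
        return "LV" if (cp - 0xAC00) % 28 == 0 else "LVT"
    return "Other"

def is_break(text, i):
    """True if a boundary falls between text[i-1] and text[i], for 0 < i < len(text)."""
    a, b = gcb(text[i - 1]), gcb(text[i])
    if a == "CR" and b == "LF":
        return False                                   # GB3
    if a in ("Control", "CR", "LF") or b in ("Control", "CR", "LF"):
        return True                                    # GB4, GB5
    if a == "L" and b in ("L", "V", "LV", "LVT"):
        return False                                   # GB6
    if a in ("LV", "V") and b in ("V", "T"):
        return False                                   # GB7
    if a in ("LVT", "T") and b == "T":
        return False                                   # GB8
    if b in ("Extend", "ZWJ"):
        return False                                   # GB9
    if a == "RI" and b == "RI":                        # GB12, GB13
        run, j = 0, i - 1
        while j >= 0 and gcb(text[j]) == "RI":
            run, j = run + 1, j - 1
        return run % 2 == 0   # no break after an odd number of RI
    return True                                        # GB999

def clusters(text):
    """Split text into (approximate) grapheme clusters."""
    out, start = [], 0
    for i in range(1, len(text)):
        if is_break(text, i):
            out.append(text[start:i])
            start = i
    if text:                  # GB1, GB2: no boundaries in empty text
        out.append(text[start:])
    return out
```

Under these assumptions, clusters("a\r\nb") keeps the CR LF pair together, and four consecutive regional indicators come out as two flag clusters. A production implementation would take its property values from the data file in [Props].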

Notes:
Grapheme cluster boundaries can be transformed into simple regular expressions. For more information, see Section 6.3, State Machines, and Table 1c, Regex Definitions.
The Grapheme_Base and Grapheme_Extend properties predated the development of the Grapheme_Cluster_Break property. The set of characters with Grapheme_Extend=Yes is used to derive the set of characters with Grapheme_Cluster_Break=Extend. However, the Grapheme_Base property proved to be insufficient for determining grapheme cluster boundaries. Grapheme_Base is no longer used by this specification.
Each emoji sequence is a single grapheme cluster. See definition ED-17 in Unicode Technical Standard #51, "Unicode Emoji" [UTS51].
Similar to Jamo clustering into Hangul Syllables, other characters bind tightly into grapheme clusters that, unlike combining characters, don't depend on a base character. These characters are said to exhibit conjoining behavior. For the purpose of Grapheme_Cluster_Break, the property value V has been extended beyond characters of Hangul_Syllable_Type=V to cover them.

4 Word Boundaries

Word boundaries are used in a number of different contexts. The most familiar ones are selection (double-click mouse selection), cursor movement (“move to next word” control-arrow keys), and the dialog option “Whole Word Search” for search and replace. They are also used in database queries, to determine whether elements are within a certain number of words of one another. Searching may also use word boundaries in determining matching items. Word boundaries are not restricted to whitespace and punctuation. Indeed, some languages do not use spaces at all.

Figure 1 gives an example of word boundaries, marked in the sample text with vertical bars. In the following discussion, search terms are indicated by enclosing them in square brackets for clarity. Spaces are indicated with the open-box symbol “␣”, and the matching parts between the search terms and target text are emphasized in color.

Figure 1. Word Boundaries

|The|␣|quick|␣|(|“|brown|”|)|␣|fox|␣|can’t|␣|jump|␣|32.3|␣|feet|,|␣|right|?|

Boundaries such as those flanking the words in Figure 1 are the boundaries that users would expect, for example, when searching for a term in the target text using Whole Word Search mode. In that mode there is a match if, in addition to a matching sequence of characters, there are word boundaries in the target text on both sides of the search term. In the sample target text in Figure 1, Whole Word Search would have results such as the following:

The search term [brown] matches because there are word boundaries on both sides.
The search term [brow] does not match because there is no word boundary in the target text between ‘w’ and the following character, ‘n’.
The term [“brown”] matches because there are word boundaries between the quotation marks and the parentheses that enclose them.
The term [(“brown”)] also matches because there are word boundaries between the parentheses and the space characters around them.
Finally, the term [␣(“brown”)␣] with spaces included matches as well, because there are word boundaries between the space characters and the letters immediately before and after them in the target text.

To allow for such matches that users would expect, there are word breaks by default between most characters that are not normally considered parts of words, such as punctuation and spaces.
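
The Whole Word Search test described above can be sketched in a few lines of Python. This is a minimal sketch, not a conformant implementation: the function names boundaries and whole_word_matches are hypothetical, and the regular expression is a crude stand-in for the default word boundary rules that only behaves sensibly on simple Latin text.

```python
import re

# Assumption: a crude approximation of word segments. A run of word characters,
# optionally joined by ".", "'" or U+2019, forms one segment; every other
# character is a segment by itself.
_TOKEN = re.compile(r"\w+(?:[.'\u2019]\w+)*|.", re.DOTALL)

def boundaries(text):
    """Return the set of offsets at which a (toy) word boundary occurs."""
    offsets = {0} if text else set()
    for match in _TOKEN.finditer(text):
        offsets.add(match.end())
    return offsets

def whole_word_matches(text, term):
    """Offsets where term occurs with a word boundary on both sides."""
    if not term:
        return []
    found = boundaries(text)
    hits = []
    start = text.find(term)
    while start != -1:
        if start in found and start + len(term) in found:
            hits.append(start)
        start = text.find(term, start + 1)
    return hits

SAMPLE = "The quick (\u201cbrown\u201d) fox can\u2019t jump 32.3 feet, right?"
```

With this sample, searching for "brown" reports a match at offset 12, while "brow" reports none because offset 16 is not a boundary.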

Word boundaries can also be used in intelligent cut and paste. With this feature, if the user cuts a selection of text on word boundaries, adjacent spaces are collapsed to a single space. For example, cutting “quick” from “The␣quick␣fox” would leave “The␣ ␣fox”. Intelligent cut and paste collapses this text to “The␣fox”. However, spaces need to be handled separately: cutting the center space from “The␣ ␣ ␣fox” probably should not collapse the remaining two spaces to one.
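
One possible reading of this behavior is sketched below in Python. The function smart_cut is hypothetical, it assumes the caller already passes offsets that lie on word boundaries, and it only considers U+0020 SPACE.

```python
def smart_cut(text, start, end):
    """Remove text[start:end]; collapse the two spaces left behind by a word cut.

    Assumption: start and end are word boundary offsets supplied by the caller.
    """
    removed = text[start:end]
    left, right = text[:start], text[end:]
    # Collapse only when the cut selection was not itself whitespace,
    # so that cutting one of several spaces leaves the others alone.
    if removed.strip() and left.endswith(" ") and right.startswith(" "):
        right = right[1:]
    return left + right
```

Cutting “quick” from “The quick fox” then yields “The fox”, while cutting the center space from three consecutive spaces leaves two spaces in place.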

Proximity tests in searching determine whether, for example, “quick” is within three words of “fox”. That is done with the above boundaries by ignoring any words that contain only whitespace, punctuation, and similar characters, as in Figure 2. Thus, for proximity, “fox” is within three words of “quick”. This same technique can be used for “get next/previous word” commands or keyboard arrow keys. Letters are not the only characters that can be used to determine the “significant” words; different implementations may include other types of characters such as digits or perform other analysis of the characters.

Figure 2. Extracted Words

The | quick | brown | fox | can’t | jump | 32.3 | feet | right
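
The extraction of significant words and the proximity test can be sketched as follows. The names significant_words and within are hypothetical, and treating any segment that contains a letter or digit as significant is just one of the choices the text allows.

```python
# Segments of the sample text, as delimited by default word boundaries.
FIGURE_1_SEGMENTS = [
    "The", " ", "quick", " ", "(", "\u201c", "brown", "\u201d", ")", " ",
    "fox", " ", "can\u2019t", " ", "jump", " ", "32.3", " ", "feet", ",",
    " ", "right", "?",
]

def significant_words(segments):
    """Keep segments that contain at least one letter or digit (an assumption)."""
    return [s for s in segments if any(ch.isalnum() for ch in s)]

def within(words, first, second, distance):
    """True if some occurrence of first is at most distance words from second."""
    pos_first = [i for i, w in enumerate(words) if w == first]
    pos_second = [i for i, w in enumerate(words) if w == second]
    return any(abs(i - j) <= distance for i in pos_first for j in pos_second)
```

Applied to the segments above, significant_words returns the nine words of Figure 2, and within(words, "quick", "fox", 3) holds because the two words are two positions apart.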

As with the other default specifications, implementations may override (tailor) the results to meet the requirements of different environments or particular languages. For some languages, it may also be necessary to have different tailored word break rules for selection versus Whole Word Search.

Whether the default word boundary detection described here is adequate, and whether word boundaries are related to line breaks, varies between scripts. The style of context analysis in line breaking (see [UAX14, section 3.1]) used for a script can provide some rough guidance:

For scripts that use the Western style of context analysis, default word boundaries and default line breaks are usually adequate. A default line break opportunity is usually a default word boundary, but there are exceptions such as a word containing a SHY (soft hyphen): it will break across lines, yet is a single word. Tailorings may find additional line break opportunities within words due to hyphenation. Scripts in this group include Latin, Arabic, Devanagari, and many others; they can be identified by having letters with line break class AL.
For scripts that use the East Asian or Brahmic styles of context analysis, the default word boundary detection is not adequate; it needs tailoring. The default line breaks, on the other hand, are usually adequate. Word boundaries are irrelevant to line breaking. Scripts in this group include Chinese, Japanese, Brahmi, Javanese, and others; they can be identified by having letters with line break class ID, AK, or AS.
For scripts that use the South East Asian style of context analysis, neither the default word boundaries nor the default line breaks are adequate. Both need tailoring. The reason is that line breaks should only occur at word boundaries, but there’s no demarcation of words. Scripts in this group include Thai, Myanmar, Khmer, and others; they can be identified by having letters with line break class SA.

Hangul is treated as part of the first group for default word boundary detection, and as part of the second group for default line breaking. Some scripts may be treated as being part of the first group only because not enough information is available for them.

4.1 Default Word Boundary Specification

The following is a general specification for word boundaries—language-specific rules in [CLDR] should be used where available.

The Word_Break property value assignments are explicitly listed in the corresponding data file in [Props]. The values in that file are the normative property values.

For illustration, property values are summarized in Table 3, but the lists of characters are illustrative.

Table 3. Word_Break Property Values

Value	Summary List of Characters
CR	U+000D CARRIAGE RETURN (CR)
LF	U+000A LINE FEED (LF)
Newline	U+000B LINE TABULATION U+000C FORM FEED (FF) U+0085 NEXT LINE (NEL) U+2028 LINE SEPARATOR U+2029 PARAGRAPH SEPARATOR
Extend	Grapheme_Extend = Yes, or General_Category = Spacing_Mark, or Emoji_Modifier = Yes and not U+200D ZERO WIDTH JOINER (ZWJ)
ZWJ	U+200D ZERO WIDTH JOINER
Regional_Indicator (RI)	Regional_Indicator = Yes. This consists of the range: U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER A .. U+1F1FF REGIONAL INDICATOR SYMBOL LETTER Z
Format	General_Category = Format and not U+200B ZERO WIDTH SPACE (ZWSP) and not U+200C ZERO WIDTH NON-JOINER (ZWNJ) and not U+200D ZERO WIDTH JOINER (ZWJ) and not Grapheme_Cluster_Break = Prepend
Katakana	Script = KATAKANA, or any of the following: U+3031 ( 〱 ) VERTICAL KANA REPEAT MARK U+3032 ( 〲 ) VERTICAL KANA REPEAT WITH VOICED SOUND MARK U+3033 ( 〳 ) VERTICAL KANA REPEAT MARK UPPER HALF U+3034 ( 〴 ) VERTICAL KANA REPEAT WITH VOICED SOUND MARK UPPER HALF U+3035 ( 〵 ) VERTICAL KANA REPEAT MARK LOWER HALF U+309B ( ゛ ) KATAKANA-HIRAGANA VOICED SOUND MARK U+309C ( ゜ ) KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK U+30A0 ( ゠ ) KATAKANA-HIRAGANA DOUBLE HYPHEN U+30FC ( ー ) KATAKANA-HIRAGANA PROLONGED SOUND MARK U+FF70 ( ｰ ) HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK
Hebrew_Letter	Script = Hebrew and General_Category = Other_Letter
ALetter	Alphabetic = Yes, or any of the following characters: U+00B8 ( ¸ ) CEDILLA U+02C2 ( ˂ ) MODIFIER LETTER LEFT ARROWHEAD .. U+02C5 ( ˅ ) MODIFIER LETTER DOWN ARROWHEAD U+02D2 ( ˒ ) MODIFIER LETTER CENTRED RIGHT HALF RING .. U+02D7 ( ˗ ) MODIFIER LETTER MINUS SIGN U+02DE ( ˞ ) MODIFIER LETTER RHOTIC HOOK U+02DF ( ˟ ) MODIFIER LETTER CROSS ACCENT U+02E5 ( ˥ ) MODIFIER LETTER EXTRA-HIGH TONE BAR .. U+02EB ( ˫ ) MODIFIER LETTER YANG DEPARTING TONE MARK U+02ED ( ˭ ) MODIFIER LETTER UNASPIRATED U+02EF ( ˯ ) MODIFIER LETTER LOW DOWN ARROWHEAD .. U+02FF ( ˿ ) MODIFIER LETTER LOW LEFT ARROW U+055A ( ՚ ) ARMENIAN APOSTROPHE U+055B ( ՛ ) ARMENIAN EMPHASIS MARK U+055C ( ՜ ) ARMENIAN EXCLAMATION MARK U+055E ( ՞ ) ARMENIAN QUESTION MARK U+058A ( ֊ ) ARMENIAN HYPHEN U+05F3 ( ׳ ) HEBREW PUNCTUATION GERESH U+070F ( ܏ ) SYRIAC ABBREVIATION MARK U+A708 ( ꜈ ) MODIFIER LETTER EXTRA-HIGH DOTTED TONE BAR .. U+A716 ( ꜖ ) MODIFIER LETTER EXTRA-LOW LEFT-STEM TONE BAR U+A720 ( ꜠ ) MODIFIER LETTER STRESS AND HIGH TONE U+A721 ( ꜡ ) MODIFIER LETTER STRESS AND LOW TONE U+A789 ( ꞉ ) MODIFIER LETTER COLON U+A78A ( ꞊ ) MODIFIER LETTER SHORT EQUALS SIGN U+AB5B ( ꭛ ) MODIFIER BREVE WITH INVERTED BREVE and Ideographic = No and Word_Break ≠ Katakana and Line_Break ≠ Complex_Context (SA) and Script ≠ Hiragana and Word_Break ≠ Extend and Word_Break ≠ Hebrew_Letter
Single_Quote	U+0027 ( ' ) APOSTROPHE
Double_Quote	U+0022 ( " ) QUOTATION MARK
MidNumLet	U+002E ( . ) FULL STOP U+2018 ( ‘ ) LEFT SINGLE QUOTATION MARK U+2019 ( ’ ) RIGHT SINGLE QUOTATION MARK U+2024 ( ․ ) ONE DOT LEADER U+FE52 ( ﹒ ) SMALL FULL STOP U+FF07 ( ＇ ) FULLWIDTH APOSTROPHE U+FF0E ( ． ) FULLWIDTH FULL STOP
MidLetter	U+003A ( : ) COLON (used in Swedish) U+00B7 ( · ) MIDDLE DOT U+0387 ( · ) GREEK ANO TELEIA U+055F ( ՟ ) ARMENIAN ABBREVIATION MARK U+05F4 ( ״ ) HEBREW PUNCTUATION GERSHAYIM U+2027 ( ‧ ) HYPHENATION POINT U+FE13 ( ︓ ) PRESENTATION FORM FOR VERTICAL COLON U+FE55 ( ﹕ ) SMALL COLON U+FF1A ( ： ) FULLWIDTH COLON
MidNum	Line_Break = Infix_Numeric, or any of the following: U+066C ( ٬ ) ARABIC THOUSANDS SEPARATOR U+FE50 ( ﹐ ) SMALL COMMA U+FE54 ( ﹔ ) SMALL SEMICOLON U+FF0C ( ， ) FULLWIDTH COMMA U+FF1B ( ； ) FULLWIDTH SEMICOLON and not U+003A ( : ) COLON and not U+FE13 ( ︓ ) PRESENTATION FORM FOR VERTICAL COLON and not U+002E ( . ) FULL STOP
Numeric	Line_Break = Numeric or General_Category = Decimal_Number and not U+066C ( ٬ ) ARABIC THOUSANDS SEPARATOR
ExtendNumLet	General_Category = Connector_Punctuation, or U+202F NARROW NO-BREAK SPACE (NNBSP)
E_Base	This value is obsolete and unused.
E_Modifier	This value is obsolete and unused.
Glue_After_Zwj	This value is obsolete and unused.
E_Base_GAZ (EBG)	This value is obsolete and unused.
WSegSpace	General_Category = Zs and not Line_Break = Glue
Any	This is not a property value; it is used in the rules to represent any code point.

4.1.1 Word Boundary Rules

The table of word boundary rules uses the macro values listed in Table 3a. Each macro represents a repeated union of the basic Word_Break property values and is shown in boldface to distinguish it from the basic property values.

Table 3a. Word_Break Rule Macros

Macro	Represents
AHLetter	(ALetter | Hebrew_Letter)
MidNumLetQ	(MidNumLet | Single_Quote)

Break at the start and end of text, unless the text is empty.
WB1	sot	÷	Any
WB2	Any	÷	eot
Do not break within CRLF.
WB3	CR	×	LF
Otherwise break before and after Newlines (including CR and LF).
WB3a	(Newline | CR | LF)	÷
WB3b		÷	(Newline | CR | LF)
Do not break within emoji zwj sequences.
WB3c	ZWJ	×	\p{Extended_Pictographic}
Keep horizontal whitespace together.
WB3d	WSegSpace	×	WSegSpace
Ignore Format and Extend characters, except after sot, CR, LF, and Newline. (See Section 6.2, Replacing Ignore Rules.) This also has the effect of: Any × (Format | Extend | ZWJ)
WB4	X (Extend | Format | ZWJ)*	→	X
Do not break between most letters.
WB5	AHLetter	×	AHLetter
Do not break letters across certain punctuation, such as within “e.g.” or “example.com”.
WB6	AHLetter	×	(MidLetter | MidNumLetQ) AHLetter
WB7	AHLetter (MidLetter | MidNumLetQ)	×	AHLetter
WB7a	Hebrew_Letter	×	Single_Quote
WB7b	Hebrew_Letter	×	Double_Quote Hebrew_Letter
WB7c	Hebrew_Letter Double_Quote	×	Hebrew_Letter
Do not break within sequences of digits, or digits adjacent to letters (“3a”, or “A3”).
WB8	Numeric	×	Numeric
WB9	AHLetter	×	Numeric
WB10	Numeric	×	AHLetter
Do not break within sequences, such as “3.2” or “3,456.789”.
WB11	Numeric (MidNum | MidNumLetQ)	×	Numeric
WB12	Numeric	×	(MidNum | MidNumLetQ) Numeric
Do not break between Katakana.
WB13	Katakana	×	Katakana
Do not break from extenders.
WB13a	(AHLetter | Numeric | Katakana | ExtendNumLet)	×	ExtendNumLet
WB13b	ExtendNumLet	×	(AHLetter | Numeric | Katakana)
Do not break within emoji flag sequences. That is, do not break between regional indicator (RI) symbols if there is an odd number of RI characters before the break point.
WB15	sot (RI RI)* RI	×	RI
WB16	[^RI] (RI RI)* RI	×	RI
Otherwise, break everywhere (including around ideographs).
WB999	Any	÷	Any
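
The following Python sketch shows how a subset of these rules might be evaluated with one character of context on each side. It is illustrative only: the names wb, is_word_break, and words are hypothetical, the property lookup is a small hand-written approximation suited to Latin text (str.isalpha also accepts ideographs and kana, which are not ALetter), and WB3c, WB4, WB7a to WB7c, WB13 for Katakana, and WB15/WB16 are not implemented.

```python
import unicodedata

MID_LETTER_Q = {"MidLetter", "MidNumLet", "Single_Quote"}  # MidLetter | MidNumLetQ
MID_NUM_Q = {"MidNum", "MidNumLet", "Single_Quote"}        # MidNum | MidNumLetQ
NEWLINES = {"Newline", "CR", "LF"}

def wb(ch):
    """Rough stand-in for Word_Break (assumption: not the normative data file)."""
    if ch == "\r":
        return "CR"
    if ch == "\n":
        return "LF"
    if ch in "\x0b\x0c\x85\u2028\u2029":
        return "Newline"
    if ch == "'":
        return "Single_Quote"
    if ch in ".\u2018\u2019\u2024":
        return "MidNumLet"
    if ch in ":\u00b7\u2027":
        return "MidLetter"
    if ch in ",;\u066c":
        return "MidNum"
    cat = unicodedata.category(ch)
    if cat == "Nd":
        return "Numeric"
    if cat == "Pc":
        return "ExtendNumLet"
    if cat == "Zs" and ch not in "\u00a0\u2007\u202f":
        return "WSegSpace"
    if ch.isalpha():
        return "ALetter"
    return "Other"

def is_word_break(p, i):
    """True if a boundary falls between positions i-1 and i of property list p."""
    a, b = p[i - 1], p[i]
    prev = p[i - 2] if i >= 2 else None
    nxt = p[i + 1] if i + 1 < len(p) else None
    if a == "CR" and b == "LF":
        return False                                            # WB3
    if a in NEWLINES or b in NEWLINES:
        return True                                             # WB3a, WB3b
    if a == "WSegSpace" and b == "WSegSpace":
        return False                                            # WB3d
    if a == "ALetter" and b == "ALetter":
        return False                                            # WB5
    if a == "ALetter" and b in MID_LETTER_Q and nxt == "ALetter":
        return False                                            # WB6
    if prev == "ALetter" and a in MID_LETTER_Q and b == "ALetter":
        return False                                            # WB7
    if a in ("Numeric", "ALetter") and b == "Numeric":
        return False                                            # WB8, WB9
    if a == "Numeric" and b == "ALetter":
        return False                                            # WB10
    if prev == "Numeric" and a in MID_NUM_Q and b == "Numeric":
        return False                                            # WB11
    if a == "Numeric" and b in MID_NUM_Q and nxt == "Numeric":
        return False                                            # WB12
    if a in ("ALetter", "Numeric", "ExtendNumLet") and b == "ExtendNumLet":
        return False                                            # WB13a
    if a == "ExtendNumLet" and b in ("ALetter", "Numeric"):
        return False                                            # WB13b
    return True                                                 # WB999

def words(text):
    """Split text into segments at (approximate) default word boundaries."""
    p = [wb(ch) for ch in text]
    out, start = [], 0
    for i in range(1, len(text)):
        if is_word_break(p, i):
            out.append(text[start:i])
            start = i
    if text:
        out.append(text[start:])
    return out
```

Under these assumptions, words("3,456.789") and words("example.com") each return a single segment, while “e.g.” splits into “e.g” and a trailing full stop because WB6 needs a letter after the punctuation.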

Notes:
It is not possible to provide a uniform set of rules that resolves all issues across languages or that handles all ambiguous situations within a given language. The goal for the specification presented in this annex is to provide a workable default; tailored implementations can be more sophisticated.
The correct interpretation of hyphens in the context of word boundaries is challenging. It is quite common for separate words to be connected with a hyphen: “out-of-the-box,” “under-the-table,” “Italian-American,” and so on. A significant number are hyphenated names, such as “Smith-Hawkins.” When doing a Whole Word Search or query, users expect to find the word within those hyphens. While there are some cases where they are not separate words (usually to resolve some ambiguity such as “re-sort” as opposed to “resort”), it is better overall to keep the hyphen out of the default definition. Hyphens include U+002D HYPHEN-MINUS, U+2010 HYPHEN, possibly also U+058A ARMENIAN HYPHEN, and U+30A0 KATAKANA-HIRAGANA DOUBLE HYPHEN.
Implementations may build on the information supplied by word boundaries. For example, a spell-checker would first check that each word was valid according to the above definition, checking the four words in “out-of-the-box.” If any of the words failed, it could build the compound word and check if it as a whole sequence was in the dictionary (even if all the components were not in the dictionary), such as with “re-iterate.” Of course, spell-checkers for highly inflected or agglutinative languages will need much more sophisticated algorithms.
The use of the apostrophe is ambiguous. It is usually considered part of one word (“can’t” or “aujourd’hui”) but it may also be considered as part of two words (“l’objectif”). A further complication is the use of the same character as an apostrophe and as a quotation mark. Therefore leading or trailing apostrophes are best excluded from the default definition of a word. In some languages, such as French and Italian, tailoring to break words when the character after the apostrophe is a vowel may yield better results in more cases. This can be done by adding a rule WB5a.
Break between apostrophe and vowels (French, Italian).
WB5a	apostrophe	÷	vowels
and defining appropriate property values for apostrophe and vowels. Apostrophe includes U+0027 ( ' ) APOSTROPHE and U+2019 ( ’ ) RIGHT SINGLE QUOTATION MARK (curly apostrophe). Finally, in some transliteration schemes, apostrophe is used at the beginning of words, requiring special tailoring.
Certain cases such as colons in words (for example, “AIK:are” and “c:a”) are included in the default even though they may be specific to relatively small user communities (Swedish) because they do not occur otherwise, in normal text, and so do not cause a problem for other languages.
For Hebrew, a tailoring may include a double quotation mark between letters, because legacy data may contain that in place of U+05F4 ( ״ ) HEBREW PUNCTUATION GERSHAYIM. This can be done by adding double quotation mark to MidLetter. U+05F3 ( ׳ ) HEBREW PUNCTUATION GERESH may also be included in a tailoring.
Format characters are included if they are not initial. Thus <LRM><ALetter> will break before the <letter>, but there is no break in <ALetter><LRM><ALetter> or <ALetter><LRM>.
Characters such as hyphens, apostrophes, quotation marks, and colons should be taken into account when using identifiers that are intended to represent words of one or more natural languages. See Section 2.4, Specific Character Adjustments, of [UAX31]. Treatment of hyphens, in particular, may be different in the case of processing identifiers than when using word break analysis for a Whole Word Search or query, because when handling identifiers the goal will be to parse maximal units corresponding to natural language “words,” rather than to find smaller word units within longer lexical units connected by hyphens.
Normally word breaking does not require breaking between different scripts. However, adding that capability may be useful in combination with other extensions of word segmentation. For example, in Korean the sentence “I live in Chicago.” is written as three segments delimited by spaces:
나는 Chicago에 산다.
According to Korean standards, the grammatical suffixes, such as “에” meaning “in”, are considered separate words. Thus the above sentence would be broken into the following five words:
나, 는, Chicago, 에, and 산다.
Separating the first two words requires a dictionary lookup, but for Latin text (“Chicago”) the separation is trivial based on the script boundary.
Modifier letters (General_Category = Lm) are almost all included in the ALetter class, by virtue of their Alphabetic property value. Thus, by default, modifier letters do not cause word breaks and should be included in word selections. Modifier symbols (General_Category = Sk) are not in the ALetter class and so do cause word breaks by default.
Some or all of the following characters may be tailored to be in MidLetter, depending on the environment:
U+002D ( - ) HYPHEN-MINUS
U+055A ( ՚ ) ARMENIAN APOSTROPHE
U+058A ( ֊ ) ARMENIAN HYPHEN
U+0F0B ( ་ ) TIBETAN MARK INTERSYLLABIC TSHEG
U+1806 ( ᠆ ) MONGOLIAN TODO SOFT HYPHEN
U+2010 ( ‐ ) HYPHEN
U+2011 ( ‑ ) NON-BREAKING HYPHEN
U+201B ( ‛ ) SINGLE HIGH-REVERSED-9 QUOTATION MARK
U+30A0 ( ゠ ) KATAKANA-HIRAGANA DOUBLE HYPHEN
U+30FB ( ・ ) KATAKANA MIDDLE DOT
U+FE63 ( ﹣ ) SMALL HYPHEN-MINUS
U+FF0D ( － ) FULLWIDTH HYPHEN-MINUS
In UnicodeSet notation, this is: [\u002D\uFF0D\uFE63\u058A\u1806\u2010\u2011\u30A0\u30FB\u201B\u055A\u0F0B]
For example, some writing systems use a hyphen character between syllables within a word. An example is the Iu Mien language written with the Thai script. Such words should behave as single words for the purpose of selection (“double-click”), indexing, and so forth, meaning that they should not word-break on the hyphen.
Some or all of the following characters may be tailored to be in MidNum, depending on the environment, to allow for languages that use spaces as thousands separators, such as €1 234,56.
U+0020 SPACE
U+00A0 NO-BREAK SPACE
U+2007 FIGURE SPACE
U+2008 PUNCTUATION SPACE
U+2009 THIN SPACE
U+202F NARROW NO-BREAK SPACE
In UnicodeSet notation, this is: [\u0020\u00A0\u2007\u2008\u2009\u202F]
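
As an illustration of the WB5a tailoring mentioned in the notes above, the following Python sketch post-processes a single default word segment and splits it after an apostrophe that is followed by a vowel. The function name apply_wb5a and the vowel set are assumptions made for this example; a real tailoring would define the apostrophe and vowel property values appropriate to the language.

```python
APOSTROPHES = {"'", "\u2019"}
# Assumption: a small illustrative vowel set for French and Italian.
VOWELS = set("aeiou\u00e0\u00e2\u00e8\u00e9\u00ea\u00eb\u00ec\u00ee\u00ef\u00f2\u00f4\u00f9\u00fb")

def apply_wb5a(word):
    """Split a word segment wherever an apostrophe is followed by a vowel."""
    parts, start = [], 0
    for i in range(1, len(word)):
        if word[i - 1] in APOSTROPHES and word[i].lower() in VOWELS:
            parts.append(word[start:i])   # WB5a: apostrophe ÷ vowels
            start = i
    parts.append(word[start:])
    return parts
```

With this vowel set, “l’objectif” is split into “l’” and “objectif”, while “aujourd’hui” and “can’t” remain single words because the character after the apostrophe is not a vowel.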

4.2 Name Validation

Related to word determination is the issue of personal name validation. Implementations sometimes need to validate fields in which personal names are entered. The goal is to distinguish between characters like those in “James Smith-Faley, Jr.” and those in “!#@♥≠”. It is important to be reasonably lenient, because users need to be able to add legitimate names, like “di Silva”, even if the names contain characters such as space. Typically, these personal name validations should not be language-specific; someone might be using a Web site in one language while his name is in a different language, for example. A basic set of name validation characters consists of the characters allowed in words according to the above definition, plus a number of exceptional characters:

Basic Name Validation Characters

[\p{name=/COMMA/}\p{name=/FULL STOP/}&\p{p}
\p{whitespace}-\p{c}
\p{alpha}
\p{wb=Katakana}\p{wb=Extend}\p{wb=ALetter}\p{wb=MidLetter}\p{wb=MidNumLet}
[\u002D\u055A\u058A\u0F0B\u1806\u2010\u2011\u201B\u2E17\u30A0\u30FB\uFE63\uFF0D]]

This is only a basic set of validation characters; in particular, the following points should be kept in mind:

It is a lenient, non-language-specific set, and could be tailored where only a limited set of languages are permitted, or for other environments. For example, the set can be narrowed if name fields are separated: “,” and “.” may not be necessary if titles are not allowed.
It includes characters that may not be appropriate for identifiers, and some that would not be parts of words. It also permits some characters that may be part of words in a broad sense, but not part of names, such as in “AIK:are” and “c:a” in Swedish, or hyphenation points used in dictionary words.
Additional tests may be needed in cases where security is at issue. In particular, names may be validated by transforming them to NFC format, and then testing to ensure that no characters in the result of the transformation change under NFKC. A second test is to use the information in Table 5, Recommended Scripts, in Unicode Identifier and Pattern Syntax [UAX31]. If the name has one or more characters with explicit script values that are not in Table 5, then reject the name.
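
The first of the security tests above, the NFC and NFKC comparison, can be sketched with Python's standard unicodedata module. The function name passes_nfkc_check is hypothetical, and the script test against Table 5 of [UAX31] is not covered here because the standard library does not expose the Script property.

```python
import unicodedata

def passes_nfkc_check(name):
    """True if the NFC form of name is unchanged by NFKC normalization."""
    nfc = unicodedata.normalize("NFC", name)
    return unicodedata.normalize("NFKC", nfc) == nfc
```

For example, a name containing U+FB01 LATIN SMALL LIGATURE FI fails the check, because NFKC replaces the ligature with the two letters “fi”, whereas “James Smith-Faley, Jr.” passes.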

5 Sentence Boundaries

Sentence boundaries are often used for triple-click or some other method of selecting or iterating through blocks of text that are larger than single words. They are also used to determine whether words occur within the same sentence in database queries.

Plain text provides inadequate information for determining good sentence boundaries. Periods can signal the end of a sentence, indicate abbreviations, or be used for decimal points, for example. Without much more sophisticated analysis, one cannot distinguish between the two following examples of the sequence <?, ”, space, uppercase-letter>. In the first example, they mark the end of a sentence, while in the second they do not.

He said, “Are you going?” John shook his head.

“Are you going?” John asked.

Without analyzing the text semantically, it is impossible to be certain which of these usages is intended (and sometimes ambiguities still remain). However, in most cases a straightforward mechanism works well.

Note: As with the other default specifications, implementations are free to override (tailor) the results to meet the requirements of different environments or particular languages. For example, locale-sensitive boundary suppression specifications can be expressed in LDML [UTS35]. Specific sentence boundary suppressions are available in the Common Locale Data Repository [CLDR] and may be used to improve the quality of boundary analysis.

5.1 Default Sentence Boundary Specification

The following is a general specification for sentence boundaries—language-specific rules in [CLDR] should be used where available.

The Sentence_Break property value assignments are explicitly listed in the corresponding data file in [Props]. The values in that file are the normative property values.

For illustration, property values are summarized in Table 4, but the lists of characters are illustrative.

Table 4. Sentence_Break Property Values

Value	Summary List of Characters
CR	U+000D CARRIAGE RETURN (CR)
LF	U+000A LINE FEED (LF)
Extend	Grapheme_Extend = Yes, or U+200D ZERO WIDTH JOINER (ZWJ), or General_Category = Spacing_Mark
Sep	U+0085 NEXT LINE (NEL) U+2028 LINE SEPARATOR U+2029 PARAGRAPH SEPARATOR
Format	General_Category = Format and not U+200C ZERO WIDTH NON-JOINER (ZWNJ) and not U+200D ZERO WIDTH JOINER (ZWJ)
Sp	White_Space = Yes and Sentence_Break ≠ Sep and Sentence_Break ≠ CR and Sentence_Break ≠ LF
Lower	Lowercase = Yes and Grapheme_Extend = No and not in the ranges (for Mkhedruli Georgian) U+10D0 (ა) GEORGIAN LETTER AN .. U+10FA (ჺ) GEORGIAN LETTER AIN and U+10FD (ჽ) GEORGIAN LETTER AEN .. U+10FF (ჿ) GEORGIAN LETTER LABIAL SIGN
Upper	General_Category = Titlecase_Letter, or Uppercase = Yes and not in the ranges (for Mtavruli Georgian) U+1C90 (Ა) GEORGIAN MTAVRULI CAPITAL LETTER AN .. U+1CBA (Ჺ) GEORGIAN MTAVRULI CAPITAL LETTER AIN and U+1CBD (Ჽ) GEORGIAN MTAVRULI CAPITAL LETTER AEN .. U+1CBF (Ჿ) GEORGIAN MTAVRULI CAPITAL LETTER LABIAL SIGN
OLetter	Alphabetic = Yes, or U+00A0 NO-BREAK SPACE (NBSP), or U+05F3 ( ׳ ) HEBREW PUNCTUATION GERESH and Lower = No and Upper = No and Sentence_Break ≠ Extend
Numeric	Line_Break = Numeric
ATerm	U+002E ( . ) FULL STOP U+2024 ( ․ ) ONE DOT LEADER U+FE52 ( ﹒ ) SMALL FULL STOP U+FF0E ( ． ) FULLWIDTH FULL STOP
SContinue	U+002C ( , ) COMMA U+002D ( - ) HYPHEN-MINUS U+003A ( : ) COLON U+003B ( ; ) SEMICOLON U+037E ( ; ) GREEK QUESTION MARK U+055D ( ՝ ) ARMENIAN COMMA U+060C ( ، ) ARABIC COMMA U+060D ( ‎؍‎ ) ARABIC DATE SEPARATOR U+07F8 ( ߸ ) NKO COMMA U+1802 ( ᠂ ) MONGOLIAN COMMA U+1808 ( ᠈ ) MONGOLIAN MANCHU COMMA U+2013 ( – ) EN DASH U+2014 ( — ) EM DASH U+3001 ( 、 ) IDEOGRAPHIC COMMA U+FE10 ( ︐ ) PRESENTATION FORM FOR VERTICAL COMMA U+FE11 ( ︑ ) PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC COMMA U+FE13 ( ︓ ) PRESENTATION FORM FOR VERTICAL COLON U+FE14 ( ︔ ) PRESENTATION FORM FOR VERTICAL SEMICOLON U+FE31 ( ︱ ) PRESENTATION FORM FOR VERTICAL EM DASH U+FE32 ( ︲ ) PRESENTATION FORM FOR VERTICAL EN DASH U+FE50 ( ﹐ ) SMALL COMMA U+FE51 ( ﹑ ) SMALL IDEOGRAPHIC COMMA U+FE54 ( ﹔ ) SMALL SEMICOLON U+FE55 ( ﹕ ) SMALL COLON U+FE58 ( ﹘ ) SMALL EM DASH U+FE63 ( ﹣ ) SMALL HYPHEN-MINUS U+FF0C ( ， ) FULLWIDTH COMMA U+FF0D ( － ) FULLWIDTH HYPHEN-MINUS U+FF1A ( ： ) FULLWIDTH COLON U+FF1B ( ； ) FULLWIDTH SEMICOLON U+FF64 ( ､ ) HALFWIDTH IDEOGRAPHIC COMMA
STerm	Sentence_Terminal = Yes and not ATerm
Close	General_Category = Open_Punctuation, or General_Category = Close_Punctuation, or Line_Break = Quotation and not U+05F3 ( ׳ ) HEBREW PUNCTUATION GERESH and ATerm = No and STerm = No
Any	This is not a property value; it is used in the rules to represent any code point.

5.1.1 Sentence Boundary Rules

The table of sentence boundary rules uses the macro values listed in Table 4a. Each macro represents a repeated union of the basic Sentence_Break property values and is shown in boldface to distinguish it from the basic property values.

Table 4a. Sentence_Break Rule Macros

Macro	Represents
ParaSep	(Sep | CR | LF)
SATerm	(STerm | ATerm)

Break at the start and end of text, unless the text is empty.
SB1	sot	÷	Any
SB2	Any	÷	eot
Do not break within CRLF.
SB3	CR	×	LF
Break after paragraph separators.
SB4	ParaSep	÷
Ignore Format and Extend characters, except after sot, ParaSep, and within CRLF. (See Section 6.2, Replacing Ignore Rules.) This also has the effect of: Any × (Format | Extend)
SB5	X (Extend | Format)*	→	X
Do not break after full stop in certain contexts. [See note below.]
SB6	ATerm	×	Numeric
SB7	(Upper | Lower) ATerm	×	Upper
SB8	ATerm Close* Sp*	×	( ¬(OLetter | Upper | Lower | ParaSep | SATerm) )* Lower
SB8a	SATerm Close* Sp*	×	(SContinue | SATerm)
Break after sentence terminators, but include closing punctuation, trailing spaces, and any paragraph separator. [See note below.]
SB9	SATerm Close*	×	(Close | Sp | ParaSep)
SB10	SATerm Close* Sp*	×	(Sp | ParaSep)
SB11	SATerm Close* Sp* ParaSep?	÷
Otherwise, do not break.
SB998	Any	×	Any

Notes:
Rules SB6–SB8 are designed to forbid breaks after ambiguous terminators (primarily U+002E FULL STOP) within strings such as those shown in Figure 3. The contexts which forbid breaks include occurrence directly before a number, between uppercase letters, when followed by a lowercase letter (optionally after certain punctuation), or when followed by certain continuation punctuation such as a comma, colon, or semicolon. These rules permit breaks in strings such as those shown in Figure 4. They cannot detect cases such as “...Mr. Jones...”; more sophisticated tailoring would be required to detect such cases.
Rules SB9–SB11 are designed to allow breaks after sequences of the following form, but not within them:
(STerm | ATerm) Close* Sp* (Sep | CR | LF)?
Note that in unusual cases, a word segment (determined according to Section 4, Word Boundaries) may span a sentence break (according to Section 5, Sentence Boundaries). Inconsistencies between word and sentence boundaries can be reduced by customizing SB11 to take account of whether a period is followed by a character from a script that does not normally require spaces between words.
Users can run experiments in an interactive online demo to observe default word and sentence boundaries in a given piece of text.

Figure 3. Forbidden Breaks on “.”

c.	d
3.	4
U.	S.
... the resp.	leaders are ...
...etc.)’	‘(the ...

Figure 4. Allowed Breaks on “.”

She said “See spot run.”	John shook his head. ...
... etc.	它们指...
...理数字.	它们指...
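The sequence form given in the notes for SB9–SB11 can be sketched as a regular expression. The following is a minimal illustration only: the character classes are small ASCII stand-ins chosen for this example (they are assumptions, not the Sentence_Break property data), and the function name is hypothetical. It deliberately omits SB6–SB8a, so it reports every terminator sequence as a candidate break.

```python
import re

# Toy ASCII stand-ins for the property classes (assumptions, not UCD data).
STERM = r"[!?]"
ATERM = r"\."
CLOSE = r"[\"')\]]"
SP = r"[ \t]"
PARASEP = r"(?:\r\n|\r|\n)"

# (STerm | ATerm) Close* Sp* (Sep | CR | LF)?
TERMINATOR_SEQUENCE = re.compile(
    rf"(?:{STERM}|{ATERM}){CLOSE}*{SP}*{PARASEP}?"
)


def candidate_breaks(text):
    """Return the offsets just after each terminator sequence.

    Only the SB9-SB11 shape is modeled: a break is allowed after the whole
    sequence, never inside it. The forbidding rules SB6-SB8a are not applied.
    """
    return [m.end() for m in TERMINATOR_SEQUENCE.finditer(text)]
```

Because the forbidding rules are left out, a string such as "3.4" still yields a candidate after the full stop; a complete implementation would suppress it under SB6.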

6 Implementation Notes

6.1 Normalization

The boundary specifications are stated in terms of text normalized according to Normalization Form NFD (see Unicode Standard Annex #15, “Unicode Normalization Forms” [UAX15]). In practice, normalization of the input is not required. To ensure that the same results are returned for canonically equivalent text (that is, the same boundary positions will be found, although those may be represented by different offsets), the grapheme cluster boundary specification has the following features:

There is never a break within a sequence of nonspacing marks.
There is never a break between a base character and subsequent nonspacing marks.

The specification also avoids certain problems by explicitly assigning the Extend property value to certain characters, such as U+09BE ( া ) BENGALI VOWEL SIGN AA, to deal with particular compositions.

The other default boundary specifications never break within grapheme clusters, and they always use a consistent property value for each grapheme cluster as a whole.
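The effect of the two features listed above can be illustrated with a rough sketch. It assumes a drastically simplified notion of a cluster (a character plus any following characters of General_Category Mn or Me), which is not the full grapheme cluster specification; the function name is hypothetical.

```python
import unicodedata


def toy_clusters(text):
    """Group each character with the nonspacing/enclosing marks after it.

    This is a simplification of the grapheme cluster rules, used only to
    show that NFC and NFD input produce corresponding clusters.
    """
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Me")
        if is_mark and clusters:
            clusters[-1] += ch      # never break before a mark
        else:
            clusters.append(ch)     # start a new cluster
    return clusters
```

For canonically equivalent strings the number of clusters is the same, even though the offsets of the boundaries differ between the composed and decomposed forms.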

6.2 Replacing Ignore Rules

An important rule for the default word and sentence specifications ignores Extend and Format characters. The main purpose of this rule is to always treat a grapheme cluster as a single character, that is, to not break a single grapheme cluster across two higher-level segments. For example, both word and sentence specifications do not distinguish between L, V, T, LV, and LVT: thus it does not matter whether there is a sequence of these or a single one. Format characters are also ignored by default, because these characters are normally irrelevant to such boundaries.

The “Ignore” rule is then equivalent to making the following changes in the rules:

Original		Modified
Replace the “Ignore” rule by the following, to disallow breaks within sequences (except after CRLF and related characters):
X (Extend | Format)* → X	⇒	(¬Sep) × (Extend | Format)
In all subsequent rules, insert (Extend | Format)* after every boundary property value, except in negations (such as ¬(OLetter | Upper ...). (It is not necessary to do this after the final property, on the right side of the break symbol.) For example:
Original		Modified
X Y × Z W	⇒	X(Extend | Format)* Y(Extend | Format)* × Z(Extend | Format)* W
X Y ×	⇒	X(Extend | Format)* Y(Extend | Format)* ×
An alternate expression that resolves to a single character is treated as a whole. For example:
Original		Modified
(STerm | ATerm)	⇒	(STerm | ATerm)(Extend | Format)*
This is not interpreted as:
	⇏	(STerm(Extend | Format)* | ATerm(Extend | Format)*)

Note: Where the “Ignore” rule uses a different set, such as (Extend | Format | ZWJ) instead of (Extend | Format), the corresponding changes would be made in the above replacements.

The “Ignore” rules should not be overridden by tailorings, with the possible exception of remapping some of the Format characters to other classes.
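The mechanical rewriting described in this section can be sketched as a small string transformation. This is an illustration under assumptions: rules are represented as lists of already tokenized property expressions, the function names are hypothetical, and an alternation such as (STerm | ATerm) is passed in as a single token so that it is treated as a whole.

```python
IGNORED = "(Extend | Format)*"


def with_ignored(value, ignored=IGNORED):
    """Append the ignored set to one property expression.

    Expressions containing a negation are left untouched, following the
    exception stated for negations.
    """
    return value if "¬" in value else f"{value}{ignored}"


def expand_rule(left, right, op="×", ignored=IGNORED):
    """Rewrite 'left op right' with the ignored set inserted.

    The final property on the right side of the break symbol is not
    expanded, because that is not necessary.
    """
    left_part = [with_ignored(v, ignored) for v in left]
    right_part = [with_ignored(v, ignored) for v in right[:-1]] + list(right[-1:])
    return " ".join(left_part + [op] + right_part)
```

Passing a different ignored argument, for example "(Extend | Format | ZWJ)*", corresponds to the note about rules that use a different set.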

6.3 State Machines

The rules for grapheme clusters can be easily converted into a regular expression, as in Table 1b, Combining Character Sequences and Grapheme Clusters. It must be evaluated starting at a known boundary (such as the start of the text), and it will determine the next boundary position. The resulting regular expression can also be used to generate fast, deterministic finite-state machines that will recognize all the same boundaries that the rules do.

The conversion into a regular expression is very straightforward for grapheme cluster boundaries. It is not as easy to convert the word and sentence boundaries, nor the more complex line boundaries [UAX14]. However, it is possible to also convert their rules into fast, deterministic finite-state machines that will recognize all the same boundaries that the rules do. The implementation of text segmentation in the ICU library follows that strategy.

For more information on Unicode Regular Expressions, see Unicode Technical Standard #18, “Unicode Regular Expressions” [UTS18].
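As a rough sketch of evaluating a regular expression from a known boundary, the following maps each character to a one-letter class and then matches a pattern over the class string. The four classes and the pattern cover only a toy subset of the grapheme cluster rules (CR LF, and a base followed by marks); they are assumptions for illustration, not the expression from Table 1b. Python's re module is a backtracking matcher rather than a generated finite-state machine, so it merely stands in for one here.

```python
import re
import unicodedata


def toy_class(ch):
    """Map a character to a one-letter toy boundary class."""
    if ch == "\r":
        return "C"
    if ch == "\n":
        return "N"
    if unicodedata.category(ch) in ("Mn", "Me"):
        return "E"   # stand-in for Extend
    return "X"       # everything else


# CR LF stays together; a lone CR or LF stands alone; otherwise a
# character absorbs any following Extend characters.
CLUSTER = re.compile(r"CN|[CN]|[XE]E*")


def toy_boundaries(text):
    """Return all boundary offsets, iterating from the start of the text."""
    classes = "".join(toy_class(ch) for ch in text)
    result = [0] if text else []
    pos = 0
    while pos < len(classes):
        pos = CLUSTER.match(classes, pos).end()  # always matches at least one class
        result.append(pos)
    return result
```

Each match starts at the boundary found by the previous one, which is why the iteration has to begin at a position already known to be a boundary.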

6.4 Random Access

Random access introduces a further complication. When iterating through a string from beginning to end, a regular expression or state machine works well. Finding the next boundary from each boundary is very fast. By constructing a state table for the reverse direction from the same specification of the rules, reverse iteration is possible.

However, suppose that the user wants to iterate starting at a random point in the text, or detect whether a random point in the text is a boundary. If the starting point does not provide enough context to allow the correct set of rules to be applied, then one could fail to find a valid boundary point. For example, suppose a user clicked after the first space after the question mark in “Are␣you␣there?␣ ␣No,␣I’m␣not”. On a forward iteration searching for a sentence boundary, one would fail to find the boundary before the “N”, because the “?” had not been seen yet.

A second set of rules to determine a “safe” starting point provides a solution. Iterate backward with this second set of rules until a safe starting point is located, then iterate forward from there. Iterate forward to find boundaries that were located between the safe point and the starting point; discard these. The desired boundary is the first one that is not less than the starting point. The safe rules must be designed so that they function correctly no matter what the starting point is, so they have to be conservative in terms of finding boundaries, and only find those boundaries that can be determined by a small context (a few neighboring characters).

Figure 5. Random Access

random access diagram

This process would represent a significant performance cost if it had to be performed on every search. However, this functionality can be wrapped up in an iterator object, which preserves the information regarding whether it currently is at a valid boundary point. Only if it is reset to an arbitrary location in the text is this extra backup processing performed. The iterator may even cache local values that it has already traversed.
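The back-up-then-iterate procedure can be sketched as follows. The forward rule here is a toy sentence rule over ASCII (a terminator, closing punctuation, spaces, and an optional newline), and the "safe" rule is simply "just after the nearest preceding newline, or the start of the text"; both are assumptions made for this example, and the function names are hypothetical.

```python
import re

# Toy forward rule: break after a terminator sequence, or after a newline.
TOY_BREAK = re.compile(r"[.?!][\"')\]]* *\n?|\n")


def next_boundary(text, start):
    """Forward iteration from a position assumed to be a boundary."""
    m = TOY_BREAK.search(text, start)
    return m.end() if m else len(text)


def safe_start(text, offset):
    """Conservative safe point: after the previous newline, else 0."""
    return text.rfind("\n", 0, offset) + 1


def following(text, offset):
    """First boundary that is not less than offset."""
    pos = safe_start(text, offset)
    while pos < offset:
        pos = next_boundary(text, pos)  # boundaries before offset are discarded
    return pos
```

With the text from the example above (using ordinary spaces), calling next_boundary directly from the clicked position skips to the end of the text, whereas following backs up first and finds the boundary before the "N".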

6.5 Tailoring

Rule-based implementation can also be combined with a code-based or table-based tailoring mechanism. In a typical state machine implementation, for example, a Unicode character is passed to a mapping table that maps characters to boundary property values. This mapping can use an efficient mechanism such as a trie. Once a boundary property value is produced, it is passed to the state machine.

The simplest customization is to adjust the values coming out of the character mapping table. For example, to mark the appropriate quotation marks for a given language as having the sentence boundary property value Close, artificial property values can be introduced for different quotation marks. A table can be applied after the main mapping table to map those artificial character property values to the real ones. To change languages, a different small table is substituted. The only real cost is then an extra array lookup.
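A minimal sketch of the two-table arrangement follows. The tables are plain dictionaries rather than a trie, the artificial values QUOTE_A and QUOTE_B and the language tags are hypothetical, and the assignments are invented for illustration; they are not taken from the Unicode Character Database.

```python
# Main mapping table: characters to property values, including two
# artificial values for quotation marks (hypothetical names).
MAIN_TABLE = {
    "\u201d": "QUOTE_A",
    "\u00bb": "QUOTE_B",
    ".": "ATerm",
    " ": "Sp",
}

# Small per-language tables mapping artificial values to real ones.
LANGUAGE_TABLES = {
    "lang1": {"QUOTE_A": "Close", "QUOTE_B": "Other"},
    "lang2": {"QUOTE_A": "Other", "QUOTE_B": "Close"},
}


def sentence_break_value(ch, language):
    """Look up the main table, then resolve any artificial value."""
    value = MAIN_TABLE.get(ch, "Other")
    return LANGUAGE_TABLES[language].get(value, value)
```

Switching languages only swaps the small second table; the main table and the state machine that consumes the resulting values stay the same.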

For code-based tailoring a different special range of property values can be added. The state machine is set up so that any special property value causes the state machine to halt and return a particular exception value. When this exception value is detected, the higher-level process can call specialized code according to whatever the exceptional value is. This can all be encapsulated so that it is transparent to the caller.

For example, Thai characters can be mapped to a special property value. When the state machine halts for one of these values, then a Thai word break implementation is invoked internally, to produce boundaries within the subsequent string of Thai characters. These boundaries can then be cached so that subsequent calls for next or previous boundaries merely return the cached values. Similarly Lao characters can be mapped to a different special property value, causing a different implementation to be invoked.
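The halt-and-dispatch idea can be sketched as below. Everything here is a stand-in: the default "state machine" just groups runs of equal toy classes, the special value THAI_SPECIAL is hypothetical, and toy_thai_breaker breaks every two characters in place of a real dictionary-based Thai implementation. Caching of the returned boundaries is omitted.

```python
def classify(ch):
    """Toy property mapping with one special value for the Thai block."""
    if "\u0e00" <= ch <= "\u0e7f":
        return "THAI_SPECIAL"
    return "SPACE" if ch == " " else "OTHER"


def toy_thai_breaker(run):
    """Placeholder for specialized code: interior offsets within the run."""
    return list(range(2, len(run), 2))


HANDLERS = {"THAI_SPECIAL": toy_thai_breaker}


def word_boundaries(text, handlers=HANDLERS):
    """Return boundary offsets, delegating special runs to their handler."""
    boundaries = [0] if text else []
    start = 0
    while start < len(text):
        value = classify(text[start])
        end = start + 1
        while end < len(text) and classify(text[end]) == value:
            end += 1
        if value in handlers:
            # The default machine "halts"; specialized code supplies the
            # boundaries inside this run.
            boundaries.extend(start + off for off in handlers[value](text[start:end]))
        boundaries.append(end)
        start = end
    return boundaries
```

Supporting another script in the same way would mean adding a second special value and a second entry in the handler table, leaving the caller unchanged.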

7 Testing

There is no requirement that Unicode-conformant implementations implement these default boundaries. As with the other default specifications, implementations are also free to override (tailor) the results to meet the requirements of different environments or particular languages. For those who do implement the default boundaries as specified in this annex, and wish to check that their implementation matches that specification, three test files have been made available in [Tests29].

These tests cannot be exhaustive, because of the large number of possible combinations; but they do provide samples that test all pairs of property values, using a representative character for each value, plus certain other sequences.
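A test line can be turned into a string and its expected boundaries with a few lines of code. The sketch below assumes the line layout used by the data files in [Tests29], in which ÷ (break) and × (no break) alternate with hexadecimal code points and a # begins a trailing comment; the function name is hypothetical, and the layout should be checked against the header of the file actually being read.

```python
def parse_test_line(line):
    """Parse one test line into (text, expected_break_offsets).

    Offsets are counted in code points. Returns None for blank or
    comment-only lines.
    """
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    chars = []
    breaks = []
    for token in data.split():
        if token == "÷":
            breaks.append(len(chars))     # break allowed at this position
        elif token == "×":
            continue                      # no break at this position
        else:
            chars.append(chr(int(token, 16)))
    return "".join(chars), breaks
```

An implementation under test can then be run on the returned text and its boundary offsets compared with the expected list.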

A sample HTML file is also available for each that shows various combinations in chart form, in [Charts29]. The header cells of the chart show the property value. The body cells in the chart show the break status: whether a break occurs between the row property value and the column property value. If the browser supports tool-tips, then hovering the mouse over a header cell will show a sample character, plus its abbreviated general category and script. Hovering over the break status will display the number of the rule responsible for that status.

Note: Testing two adjacent characters is insufficient for determining a boundary.

The chart may be followed by some test cases. These test cases consist of various strings with the break status between each pair of characters shown by blue lines for breaks and by whitespace for non-breaks. Hovering over each character (with tool-tips enabled) shows the character name and property value; hovering over the break status shows the number of the rule responsible for that status.

Due to the way they have been mechanically processed for generation, the test rules do not match the rules in this annex precisely. In particular:

The rules are cast into a more regex-like style.
The rules “sot ÷”, “÷ eot”, and “÷ Any” are added mechanically and have artificial numbers.
The rules are given decimal numbers without prefix, so rules such as WB13a are given a number using tenths, such as 13.1.
Where a rule has multiple parts (lines), each one is numbered using hundredths, such as
- 21.01) × $BA
- 21.02) × $HY
- ...
Any “treat as” or “ignore” rules are handled as discussed in this annex, and thus reflected in a transformation of the rules not visible in the tests.

The mapping from the rule numbering in this annex to the numbering for the test rules is summarized in Table 5.

Table 5. Numbering of Rules

Rule in This Annex	Test Rule	Comment
xx1	0.2	sot (start of text)
xx2	0.3	eot (end of text)
SB8a	8.1	Letter style
WB13a	13.1
WB13b	13.2
GB999	999.0	Any
WB999	999.0	Any

Note: Rule numbers may change between versions of this annex.

8 Hangul Syllable Boundary Determination

In rendering, a sequence of jamos is displayed as a series of syllable blocks. The following rules specify how to divide up an arbitrary sequence of jamos (including nonstandard sequences) into these syllable blocks. The symbols L, V, T, LV, LVT represent the corresponding Hangul_Syllable_Type property values; the symbol M stands for combining marks.

The precomposed Hangul syllables are of two types: LV or LVT. In determining the syllable boundaries, the LV behave as if they were a sequence of jamo L V, and the LVT behave as if they were a sequence of jamo L V T.

Within any sequence of characters, a syllable break never occurs between the pairs of characters shown in Table 6. In all cases other than those shown in Table 6, a syllable break occurs before and after any jamo or precomposed Hangul syllable. As for other characters, any combining mark between two conjoining jamos prevents the jamos from forming a syllable block.

Table 6. Hangul Syllable No-Break Rules

Do Not Break Between		Examples
L	L, V, LV or LVT	L × L L × V L × LV L × LVT
V or LV	V or T	V × V V × T LV × V LV × T
T or LVT	T	T × T LVT × T
Jamo, LV or LVT	Combining marks	L × M V × M T × M LV × M LVT × M
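The pairs in Table 6 translate directly into a lookup. The sketch below works on symbol names (L, V, T, LV, LVT, M) rather than on characters, which is an assumption made to keep the example short; any pair not listed in the table is treated as a break, and the function name is hypothetical.

```python
# For each left-hand symbol, the right-hand symbols it must not be
# separated from, transcribed from Table 6.
NO_BREAK = {
    "L": {"L", "V", "LV", "LVT", "M"},
    "V": {"V", "T", "M"},
    "LV": {"V", "T", "M"},
    "T": {"T", "M"},
    "LVT": {"T", "M"},
}


def syllable_blocks(symbols):
    """Split a sequence of symbols into syllable blocks."""
    blocks = []
    for symbol in symbols:
        if blocks and symbol in NO_BREAK.get(blocks[-1][-1], ()):
            blocks[-1].append(symbol)   # pair is in Table 6: no break
        else:
            blocks.append([symbol])     # otherwise a break occurs
    return blocks
```

Because M appears only on the right-hand side of the table, a mark attaches to the preceding jamo and the next jamo starts a new block.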

Even in Normalization Form NFC, a syllable block may contain a precomposed Hangul syllable in the middle. An example is L LVT T. Each well-formed modern Hangul syllable, however, can be represented in the form L V T? (that is one L, one V and optionally one T) and consists of a single encoded character in NFC.

For information on the behavior of Hangul compatibility jamos in syllables, see Section 18.6, Hangul of [Unicode].

8.1 Standard Korean Syllables

Standard Korean syllable block: A sequence of one or more L followed by a sequence of one or more V and a sequence of zero or more T, or any other sequence that is canonically equivalent.

All precomposed Hangul syllables, which have the form LV or LVT, are standard Korean syllable blocks.
Alternatively, a standard Korean syllable block may be expressed as a sequence of a choseong and a jungseong, optionally followed by a jongseong.
A choseong filler may substitute for a missing leading consonant, and a jungseong filler may substitute for a missing vowel.

Using regular expression notation, a canonically decomposed standard Korean syllable block is of the following form:

L+ V+ T*

Arbitrary standard Korean syllable blocks have a somewhat more complex form because they include any canonically equivalent sequence, thus including precomposed Korean syllables. The regular expressions for them have the following form:

(L+ V+ T*) | (L* LV V* T*) | (L* LVT T*)

All standard Korean syllable blocks used in modern Korean are of the form <L V T> or <L V> and have equivalent, single-character precomposed forms.
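The regular expression for arbitrary standard Korean syllable blocks can be tried out by first mapping each character to a one-letter code for its Hangul_Syllable_Type. In the sketch below the code point ranges are hard-coded from the author's understanding of the property data and should be verified against the Unicode Character Database; the letters A and B are arbitrary stand-ins for LV and LVT, and the function names are hypothetical.

```python
import re


def hangul_type(ch):
    """Return a one-letter code for the Hangul syllable type of ch."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F or 0xA960 <= cp <= 0xA97C:
        return "L"
    if 0x1160 <= cp <= 0x11A7 or 0xD7B0 <= cp <= 0xD7C6:
        return "V"
    if 0x11A8 <= cp <= 0x11FF or 0xD7CB <= cp <= 0xD7FB:
        return "T"
    if 0xAC00 <= cp <= 0xD7A3:
        # Precomposed syllables: every 28th one has no trailing consonant.
        return "A" if (cp - 0xAC00) % 28 == 0 else "B"   # A = LV, B = LVT
    return "x"


# (L+ V+ T*) | (L* LV V* T*) | (L* LVT T*)
STANDARD_BLOCK = re.compile(r"L+V+T*|L*AV*T*|L*BT*")


def is_standard_block(text):
    """True if text as a whole is one standard Korean syllable block."""
    types = "".join(hangul_type(ch) for ch in text)
    return STANDARD_BLOCK.fullmatch(types) is not None
```

A precomposed syllable and its conjoining-jamo decomposition are both accepted, which reflects the inclusion of canonically equivalent sequences.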

Old Korean characters are represented by a series of conjoining jamos. While the Unicode Standard allows for two L, V, or T characters as part of a syllable, KS X 1026-1 only allows single instances. Implementations that need to conform to KS X 1026-1 can tailor the default rules in Section 3.1, Default Grapheme Cluster Boundary Specification, accordingly.

8.2 Transforming into Standard Korean Syllables

A sequence of jamos that do not all match the regular expression for a standard Korean syllable block can be transformed into a sequence of standard Korean syllable blocks by the correct insertion of choseong fillers (L_f) and jungseong fillers (V_f). This transformation of a string of text into standard Korean syllables is performed by determining the syllable breaks as explained in the earlier subsection “Hangul Syllable Boundaries,” then inserting one or two fillers as necessary to transform each syllable into a standard Korean syllable as shown in Figure 6.

Figure 6. Inserting Fillers

L [^V] → L V_f [^V]

[^L] V → [^L] L_f V

[^V] T → [^V] L_f V_f T

In Figure 6, [^X] indicates a character that is not X, or the absence of a character.

In Table 7, the first row shows syllable breaks in a standard sequence, the second row shows syllable breaks in a nonstandard sequence, and the third row shows how the sequence in the second row could be transformed into standard form by inserting fillers into each syllable. Syllable breaks are shown by middle dots “·”.

Table 7. Korean Syllable Break Examples

No.	Sequence		Sequence with Syllable Breaks Marked
1	LVTLVLVLV_f L_f VL_fV_f T	→	LVT · LV · LV · LV_f · L_fV · L_f V_f T
2	LLTTVVTTVVLLVV	→	LL · TT · VVTT · VV · LLVV
3	LLTTVVTTVVLLVV	→	LLV_f · L_f V_fTT · L_f VVTT · L_f VV · LLVV
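The transformation in Figure 6 can be sketched as two steps: split the sequence into syllables, then add fillers to each syllable. The example works on symbol names and assumes the input contains only L, V, and T symbols (including the fillers L_f and V_f), with no combining marks or precomposed syllables; the function names are hypothetical.

```python
# Jamo pairs between which no syllable break occurs (subset of Table 6).
NO_BREAK = {("L", "L"), ("L", "V"), ("V", "V"), ("V", "T"), ("T", "T")}


def kind(symbol):
    """L_f counts as L and V_f counts as V."""
    return symbol[0]


def split_syllables(symbols):
    """Determine syllable breaks for a sequence of L/V/T symbols."""
    blocks = []
    for s in symbols:
        if blocks and (kind(blocks[-1][-1]), kind(s)) in NO_BREAK:
            blocks[-1].append(s)
        else:
            blocks.append([s])
    return blocks


def add_fillers(block):
    """Insert one or two fillers so the block becomes standard."""
    first = kind(block[0])
    if first == "T":
        return ["L_f", "V_f"] + block      # [^V] T -> [^V] L_f V_f T
    if first == "V":
        return ["L_f"] + block             # [^L] V -> [^L] L_f V
    if not any(kind(s) == "V" for s in block):
        return block + ["V_f"]             # L [^V] -> L V_f [^V]
    return block


def to_standard(symbols):
    return [add_fillers(b) for b in split_syllables(symbols)]
```

Applied to the sequence in row 2 of Table 7, this produces the blocks shown in row 3; a sequence that is already standard, such as row 1, is left unchanged.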

Acknowledgments

Mark Davis is the author of the initial version and has added to and maintained the text of this annex through Version 14.0. Laurențiu Iancu assisted in updating it for Versions 7.0 through 10.0.

Thanks to Julie Allen, Asmus Freytag, Manish Goregaokar, Andy Heninger, Ted Hopp, Tsuyoshi Ito, Martin Hosken, Michael Kaplan, Johan Curcio Lindström, Eric Mader, Otto Stolz, Steve Tolkin, Ken Whistler, and Karl Williamson for their feedback on this annex, including earlier versions.

References

For references for this annex, see Unicode Standard Annex #41, “Common References for Unicode Standard Annexes.”

Modifications

The following summarizes modifications from the previous published version of this annex.

Revision 47

Reissued for Unicode 17.0.0.
Section 4.1, Default Word Boundary Specification: Updated the derivation of Word_Break=ALetter to include U+00B8 CEDILLA based on usage in Saanich. [184-C27]

Modifications for previous versions are listed in those respective versions.

© 2004–2025 Unicode, Inc. This publication is protected by copyright, and permission must be obtained from Unicode, Inc. prior to any reproduction, modification, or other use not permitted by the Terms of Use. Specifically, you may make copies of this publication and may annotate and translate it solely for personal or internal business purposes and not for public distribution, provided that any such permitted copies and modifications fully reproduce all copyright and other legal notices contained in the original. You may not make copies of or modifications to this publication for public distribution, or incorporate it in whole or in part into any product or publication without the express written permission of Unicode.

Use of all Unicode Products, including this publication, is governed by the Unicode Terms of Use. The authors, contributors, and publishers have taken care in the preparation of this publication, but make no express or implied representation or warranty of any kind and assume no responsibility or liability for errors or omissions or for consequential or incidental damages that may arise therefrom. This publication is provided “AS-IS” without charge as a convenience to users.

Unicode and the Unicode Logo are registered trademarks of Unicode, Inc., in the United States and other countries.
