[Unicode]  Technical Reports
 

Unicode Standard Annex #14

Unicode Line Breaking Algorithm

Version: Unicode 5.1.0
Authors: Asmus Freytag (asmus@unicode.org), Andy Heninger (andy.heninger@gmail.com)
Date: 2008-03-31
This Version: http://www.unicode.org/reports/tr14/tr14-22.html
Previous Version: http://www.unicode.org/reports/tr14/tr14-19.html
Latest Version: http://www.unicode.org/reports/tr14/
Revision: 22

Summary

This annex presents the Unicode line breaking algorithm along with detailed descriptions of each of the character classes established by the Unicode line breaking property. The line breaking algorithm produces a set of "break opportunities", or positions that would be suitable for wrapping lines when preparing text for display. A model implementation using pair tables is also provided.

Status

This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.

A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published online as a separate document. The Unicode Standard may require conformance to normative content in a Unicode Standard Annex, if so specified in the Conformance chapter of that version of the Unicode Standard. The version number of a UAX document corresponds to the version of the Unicode Standard of which it forms a part.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this annex is found in Unicode Standard Annex #41, “Common References for Unicode Standard Annexes.” For the latest version of the Unicode Standard, see [Unicode]. For a list of current Unicode Technical Reports, see [Reports]. For more information about versions of the Unicode Standard, see [Versions].

1 Overview and Scope

The text of The Unicode Standard [Unicode] presents a limited description of some of the characters with specific functions in line breaking, but does not give a complete specification of line breaking behavior. This annex provides more detailed information about default line breaking behavior reflecting best practices for the support of multilingual texts.

For most Unicode characters, considerable variation in line breaking behavior can be expected, including variation based on local or stylistic preferences. For that reason, the line breaking properties provided for these characters are informative. Some characters are intended to explicitly influence line breaking. Their line breaking behavior is therefore expected to be identical across all implementations. As described in this annex, the Unicode Standard assigns normative line breaking properties to those characters. The Unicode Line Breaking Algorithm is a tailorable set of rules that uses these line breaking properties in context to determine line break opportunities.

This annex opens with formal definitions, a summary of the line breaking task and the context in which it occurs in overall text layout, followed by a brief section on conformance requirements. Three main sections follow: Section 5, Line Breaking Properties; Section 6, Line Breaking Algorithm; and Section 7, Pair Table-Based Implementation.

The final two sections discuss issues of customization and implementation.

2 Definitions

All terms not defined here shall be as defined in the Unicode Standard [Unicode5.0]. The notation defined in this annex differs somewhat from the notation defined elsewhere in the Unicode Standard. All other notation used here without an explicit definition shall be as defined elsewhere in the Unicode Standard.

LD1  Line Fitting: The process of determining how much text will fit on a line of text, given the available space between the margins and the actual display width of the text.

LD2  Line Break: The position in the text where one line ends and the next one starts.

LD3  Line Break Opportunity: A place where a line is allowed to end.

LD4  Line Breaking: The process of selecting one among several line break opportunities such that the resulting line is optimal or ends at a user-requested explicit line break.

LD5  Line Breaking Property: A character property with enumerated values, as listed in Table 1, and separated into normative and informative values.

LD6  Line Breaking Class: A class of characters with the same line breaking property value.

The Line Breaking Classes are described in Section 5.1, Description of Line Breaking Properties.

LD7  Mandatory Break: A line must break following a character that has the mandatory break property.

Such a break is also known as a forced break and is indicated in the rules as B !, where B is the character with the mandatory break property.

LD8  Direct Break: A line break opportunity exists between two adjacent characters of the given line breaking classes.

A direct break is indicated in the rules below as B ÷ A, where B is the character class of the character before and A is the character class of the character after the break. If they are separated by one or more space characters, a break opportunity exists instead after the last space. In the pair table, the optional space characters are not shown.

LD9  Indirect Break: A line break opportunity exists between two characters of the given line breaking classes only if they are separated by one or more spaces.

An indirect break is indicated in the pair table in Table 2 as B % A, where B is the character class of the character before and A is the character class of the character after the break. Even though space characters are not shown in the pair table, an indirect break can occur only if one or more spaces follow B. In the notation of the rules in Section 6, Line Breaking Algorithm, this would be represented as two rules: B × A and B SP+ ÷ A, where the “+” sign means one or more occurrences.

LD10  Prohibited Break: No line break opportunity exists between two characters of the given line breaking classes, even if they are separated by one or more space characters.

A prohibited break is indicated in the pair table in Table 2 as B ^ A, where B is the character class of the character before and A is the character class of the character after the break, and the optional space characters are not shown. In the notation of the rules in Section 6, Line Breaking Algorithm, this would be expressed as a rule of the form: B SP* × A.
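The three pair outcomes defined in LD8 through LD10 differ only in how they treat intervening spaces. The following minimal Python sketch makes that difference explicit. The names `PairAction` and `has_break_opportunity` are illustrative assumptions and do not appear in this annex.

```python
from enum import Enum


class PairAction(Enum):
    """Outcomes for a class pair (B, A); the member names are illustrative."""
    DIRECT = "÷"      # LD8:  B ÷ A
    INDIRECT = "%"    # LD9:  B × A  and  B SP+ ÷ A
    PROHIBITED = "^"  # LD10: B SP* × A


def has_break_opportunity(action: PairAction, spaces_between: int) -> bool:
    """Return True if a break opportunity exists between B and A.

    `spaces_between` counts the SP characters separating B and A.
    When spaces are present, the opportunity falls after the last space.
    """
    if action is PairAction.DIRECT:
        return True                # with or without spaces
    if action is PairAction.INDIRECT:
        return spaces_between > 0  # only when at least one space follows B
    return False                   # prohibited, even across spaces
```

For example, an indirect pair yields no opportunity when B and A are adjacent, but does yield one as soon as a single space separates them.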

LD11  Hyphenation: Hyphenation uses language-specific rules to provide additional line break opportunities within a word.

Table 1 lists all of the line breaking classes by name, also giving their class abbreviation and their status as tailorable or not. The examples and brief indication of line breaking behavior in this table are merely typical, not exhaustive. Section 5.1, Description of Line Breaking Properties, provides a detailed description of each line breaking class, including a detailed overview of the line breaking behavior for characters of that class.

Table 1. Line Breaking Classes (* = non-tailorable)

| Class | Descriptive Name | Examples | Characters with This Property... |
|---|---|---|---|
| Non-tailorable Line Breaking Classes | | | |
| BK * | Mandatory Break | NL, PS | Cause a line break (after) |
| CR * | Carriage Return | CR | Cause a line break (after), except between CR and LF |
| LF * | Line Feed | LF | Cause a line break (after) |
| CM * | Attached Characters and Combining Marks | Combining marks, control codes | Prohibit a line break between the character and the preceding character |
| NL * | Next Line | NEL | Cause a line break (after) |
| SG * | Surrogates | Surrogates | Do not occur in well-formed text |
| WJ * | Word Joiner | WJ | Prohibit line breaks before and after |
| ZW * | Zero Width Space | ZWSP | Provide a break opportunity |
| GL * | Non-breaking (“Glue”) | CGJ, NBSP, ZWNBSP | Prohibit line breaks before and after |
| SP * | Space | SPACE | Enable indirect line breaks |
| Break Opportunities | | | |
| B2 | Break Opportunity Before and After | Em dash | Provide a line break opportunity before and after the character |
| BA | Break Opportunity After | Spaces, hyphens | Generally provide a line break opportunity after the character |
| BB | Break Opportunity Before | Punctuation used in dictionaries | Generally provide a line break opportunity before the character |
| HY | Hyphen | HYPHEN-MINUS | Provide a line break opportunity after the character, except in numeric context |
| CB | Contingent Break Opportunity | Inline objects | Provide a line break opportunity contingent on additional information |
| Characters Prohibiting Certain Breaks | | | |
| CL | Closing Punctuation | “)”, “]”, “}”, etc. | Prohibit line breaks before |
| EX | Exclamation/Interrogation | “!”, “?”, etc. | Prohibit line breaks before |
| IN | Inseparable | Leaders | Allow only indirect line breaks between pairs |
| NS | Nonstarter | small kana | Allow only indirect line breaks before |
| OP | Opening Punctuation | “(”, “[”, “{”, etc. | Prohibit line breaks after |
| QU | Ambiguous Quotation | Quotation marks | Act like they are both opening and closing |
| Numeric Context | | | |
| IS | Infix Separator (Numeric) | . , | Prevent breaks after any and before numeric |
| NU | Numeric | Digits | Form numeric expressions for line breaking purposes |
| PO | Postfix (Numeric) | %, ¢ | Do not break following a numeric expression |
| PR | Prefix (Numeric) | $, £, ¥, etc. | Do not break in front of a numeric expression |
| SY | Symbols Allowing Break After | / | Prevent a break before, and allow a break after |
| Other Characters | | | |
| AI | Ambiguous (Alphabetic or Ideographic) | Characters with Ambiguous East Asian Width | Act like AL when the resolved EAW is N; otherwise, act as ID |
| AL | Ordinary Alphabetic and Symbol Characters | Alphabets and regular symbols | Are alphabetic characters or symbols that are used with alphabetic characters |
| H2 | Hangul LV Syllable | Hangul | Form Korean syllable blocks |
| H3 | Hangul LVT Syllable | Hangul | Form Korean syllable blocks |
| ID | Ideographic | Ideographs | Break before or after, except in some numeric context |
| JL | Hangul L Jamo | Conjoining jamo | Form Korean syllable blocks |
| JV | Hangul V Jamo | Conjoining jamo | Form Korean syllable blocks |
| JT | Hangul T Jamo | Conjoining jamo | Form Korean syllable blocks |
| SA | Complex Context Dependent (South East Asian) | South East Asian: Thai, Lao, Khmer | Provide a line break opportunity contingent on additional, language-specific context analysis |
| XX | Unknown | Unassigned, private-use | Have as yet unknown line breaking behavior or unassigned code positions |

3 Introduction

Lines are broken as a result of one of two conditions. The first condition is the presence of a mandatory line breaking character. The second condition results from a formatting algorithm having selected among available line break opportunities; ideally the chosen line break results in the optimal layout of the text.

Different formatting algorithms may use different methods to determine an optimal line break. For example, simple implementations consider a single line at a time, trying to find a locally optimal line break. A basic, yet widely used approach is to allow no compression or expansion of the intercharacter and interword spaces and consider the longest line that fits. More complex formatting algorithms often take into account the interaction of line breaking decisions for the whole paragraph. The well-known text layout system [TEX] implements an example of such a globally optimal strategy that may make complex tradeoffs across an entire paragraph to avoid unnecessary hyphenation and other legal, but inferior breaks. For a description of this strategy, see [Knuth78].
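The basic "longest line that fits" approach can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `greedy_fit` is hypothetical, width is measured as a simple character count, trailing spaces are assumed not to count against the line width, and the break opportunities are supplied by the caller rather than computed by the rules of this annex. When nothing fits, the sketch simply lets the line overflow to the next opportunity.

```python
def greedy_fit(text, opportunities, max_width):
    """Split `text` into lines, taking the longest line that fits each time.

    `opportunities` holds string offsets at which a line may end.
    Width is approximated by character count (an assumption of this sketch).
    """
    # The end of the text always ends a line.
    stops = sorted({p for p in opportunities if 0 < p < len(text)} | {len(text)})
    lines, start = [], 0
    while start < len(text):
        best = None
        for pos in stops:
            if pos <= start:
                continue
            # Trailing spaces are assumed not to count toward the width.
            if len(text[start:pos].rstrip(" ")) <= max_width:
                best = pos
            else:
                break
        if best is None:
            # Nothing fits: overflow up to the next opportunity.
            best = next(p for p in stops if p > start)
        lines.append(text[start:best])
        start = best
    return lines
```

With `text = "the quick brown fox"`, opportunities after each space (offsets 4, 10, and 16), and `max_width = 10`, the sketch yields the two lines `"the quick "` and `"brown fox"`.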

When compression or expansion is allowed, a locally optimal line break seeks to balance the relative merits of the resulting amounts of compression and expansion for different line break candidates. When expanding or compressing interword space according to common typographical practice, only the spaces marked by U+0020 SPACE, U+00A0 NO-BREAK SPACE, and U+3000 IDEOGRAPHIC SPACE are subject to compression, and only spaces marked by U+0020 SPACE, U+00A0 NO-BREAK SPACE, and occasionally spaces marked by U+2009 THIN SPACE are subject to expansion. All other space characters normally have fixed width. When expanding or compressing intercharacter space, the presence of U+200B ZERO WIDTH SPACE or U+2060 WORD JOINER is always ignored.
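As a rough illustration of the interword expansion practice just described, the following Python sketch distributes a given amount of extra width evenly over the expandable spaces of a line. The function name `distribute_expansion`, the even split, and the `allow_thin_space` switch are assumptions made for this example; the annex does not prescribe how expansion is apportioned.

```python
# Spaces subject to expansion according to the paragraph above.
EXPANDABLE = {"\u0020", "\u00A0"}      # SPACE, NO-BREAK SPACE
OCCASIONALLY_EXPANDABLE = {"\u2009"}   # THIN SPACE


def distribute_expansion(line, extra, allow_thin_space=False):
    """Map the index of each expandable space in `line` to its share of `extra`.

    All other space characters keep their fixed width and get no entry.
    """
    allowed = set(EXPANDABLE)
    if allow_thin_space:
        allowed |= OCCASIONALLY_EXPANDABLE
    slots = [i for i, ch in enumerate(line) if ch in allowed]
    if not slots:
        return {}
    share = extra / len(slots)  # even split: a simplification
    return {i: share for i in slots}
```

For the line `"a b\u2003c d"`, only the two U+0020 characters receive extra width; the EM SPACE at index 3 keeps its fixed width.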

Local custom or document style determines whether and to what degree expansion of intercharacter space is allowed in justifying a line. In languages such as German, where intercharacter space is commonly used to mark e m p h a s i s (like this), allowing variable intercharacter spacing would have the unintended effect of adding random emphasis, and is therefore best avoided. In table headings that use Han ideographs, even extreme amounts of intercharacter space commonly occur as short texts are spread out across the entire available space to distribute the characters evenly from end to end.

In line breaking it is necessary to distinguish between two related tasks. The first is the determination of all legal line break opportunities, given a string of text. This is the scope of the Unicode Line Break Algorithm. The second task is the selection of the actual location for breaking a given line of text. This selection not only takes into account the width of the line compared to the width of the text, but may also apply an additional prioritization of line breaks based on aesthetic and other criteria. What defines an optimal choice for a given line break is outside the scope of this annex, as are methods for its selection.

Finally, text layout systems may support an emergency mode that handles the case of an unusual line that contains no otherwise permitted line break opportunities. In such line layout emergencies, line breaks may be placed with no regard to the ordinary line breaking behavior of the characters involved. The details of such an emergency mode are outside the scope of this annex; however, it is recommended that grapheme clusters be kept together.

3.1 Determining Line Break Opportunities

Three principal styles of context analysis determine line break opportunities.

  1. Western: spaces and hyphens are used to determine breaks
  2. East Asian: lines can break anywhere, unless prohibited
  3. South East Asian: line breaks require morphological analysis

The Western style is commonly used for scripts employing the space character. Hyphenation is often used with space-based line breaking to provide additional line break opportunities—however, it requires knowledge of the language and it may need user interaction or overrides.

The second style of context analysis is used with East Asian ideographic and syllabic scripts. In these scripts, lines can break anywhere, except before or after certain characters. The precise set of prohibited line breaks may depend on user preference or local custom and is commonly tailorable.

Korean makes use of both styles of line break. When Korean text is justified, the second style is commonly used, even for interspersed Latin letters. But when ragged margins are used, the Western style (relying on spaces) is commonly used instead, even for ideographs.

The third style is used for scripts such as Thai, which do not use spaces, but which restrict word breaks to syllable boundaries, the determination of which requires knowledge of the language comparable to that required by a hyphenation algorithm. Such an algorithm is beyond the scope of the Unicode Standard.

For multilingual text, the Western and East Asian styles can be unified into a single set of specifications, based on the information in this annex. Unicode characters have explicit line breaking properties assigned to them. These properties can be utilized to implement the effect of both of these two styles of context analysis for line break opportunities. Customization for user preferences or document style can then be achieved by tailoring that specification.

In bidirectional text, line breaks are determined before applying rule L1 of the Unicode Bidirectional Algorithm [Bidi]. However, line breaking is strictly independent of directional properties of the characters or of any auxiliary information determined by the application of rules of that algorithm.

4 Conformance

There is no single method for determining line breaks; the rules may differ based on user preference and document layout. Therefore the information in this annex, including the specification of the line breaking algorithm, allows for the necessary flexibility in determining line breaks according to different conventions. However, some characters have been encoded explicitly for their effect on line breaking. Users adding such characters to a text expect that they will have the desired effect. For that reason, these characters have been given required line breaking behavior.

To handle certain situations, some line breaking implementations use techniques that cannot be expressed within the framework of the Unicode Line Breaking Algorithm. Examples include the use of dictionaries of words for languages that do not use spaces, such as Thai; recognition of the language of the text in order to choose among different punctuation conventions; the use of dictionaries of common abbreviations or contractions to resolve ambiguities with periods or apostrophes; or a deeper analysis of common syntaxes for numbers or dates, and so on. The conformance requirements permit variations of this kind.

Processes which support multiple modes for determining line breaks are also accommodated. This situation can arise with marked-up text, rich text, style sheets, or other environments in which a higher-level protocol can carry formatting instructions that prevent or force line breaks in positions that differ from those specified by the Unicode Line Break Algorithm. The approach taken here is to require that such processes have a conforming default line break behavior, and to disclose that they also include overrides or optional behaviors that are invoked via a higher-level protocol.

The methods by which a line layout process chooses optimal line breaks from among the available break opportunities are outside the scope of this specification. The behavior of a line layout process in situations where there are no suitable break opportunities is also outside the scope of this specification.

4.1 Conformance Requirements

UAX14-C1. A process that determines line breaks in Unicode text, and that purports to implement the Unicode Line Breaking Algorithm, shall do so in accordance with the specifications in this annex. In particular, the following three subconditions shall be met:

  1. The sets of mandatory break positions and of break opportunities which the implementation produces include all of those specified by the rules in Section 6.1, Non-tailorable Line Breaking Rules.
  2. There exist no break opportunities or mandatory breaks produced by the implementation that fall on a "non-break" position specified by the rules in Section 6.1, Non-tailorable Line Breaking Rules.
  3. If the implementation tailors the behavior of Section 6.2, Tailorable Line Breaking Rules, that fact must be disclosed.

UAX14-C2. If an implementation has a default line breaking operation which conforms to UAX14-C1, but also has overrides based on a higher-level protocol, that fact must be disclosed and any behavior that differs from that specified by the rules of Section 6.1, Non-tailorable Line Breaking Rules, must be documented.

Example: An XML format provides markup which disables all line breaking over some span of text. When the markup is not in place, the default behavior is in conformance according to UAX14-C1. As long as the existence of the option is disclosed, that format can be said to conform to the Unicode Line Breaking Algorithm according to UAX14-C2.

As is the case for all other Unicode algorithms, this specification is a logical description; particular implementations can have more efficient mechanisms as long as they produce the same results. See C18 in Chapter 3, Conformance, of [Unicode]. While only disclosure of tailorings is required in the conformance clauses, documentation of the differences in behaviors is strongly encouraged.
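Subconditions 1 and 2 of UAX14-C1 amount to two set comparisons, which a test harness might check along the following lines. This is a sketch under assumptions: the function name `check_uax14_c1` is hypothetical, mandatory breaks and break opportunities are merged into a single set of string offsets for simplicity, and the `required` and `forbidden` sets are assumed to have been derived separately from the non-tailorable rules. Subcondition 3 concerns disclosure and cannot be checked in code.

```python
def check_uax14_c1(produced, required, forbidden):
    """Check subconditions 1 and 2 of UAX14-C1 for one piece of text.

    All arguments are collections of string offsets:
      produced  - breaks and opportunities reported by the implementation
      required  - positions the non-tailorable rules specify as breaks
                  or break opportunities
      forbidden - positions the non-tailorable rules specify as non-breaks
    """
    produced, required, forbidden = set(produced), set(required), set(forbidden)
    missing = required - produced   # violates subcondition 1
    illegal = produced & forbidden  # violates subcondition 2
    return {
        "conforms": not missing and not illegal,
        "missing": missing,
        "illegal": illegal,
    }
```

Positions that are neither required nor forbidden are left to the tailorable rules, so an implementation may report them or not without affecting the result.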

5 Line Breaking Properties

This section provides detailed narrative descriptions of the line breaking behavior of many Unicode characters. In many instances, the descriptions in this section provide additional informative detail about handling a given character at the end of a line, or during line layout, which goes beyond the simple determination of line breaks. In some cases, the text also gives guidance as to preferred characters for achieving a particular effect in line breaking.

This section also summarizes the membership of character classes for each value of the line breaking property. Note that the mnemonic names for the line break classes are intended neither as exhaustive descriptions of their membership nor as indicators of their entire range of behaviors in the line breaking process. Instead, their main purpose is to serve as unique, yet broadly mnemonic labels. In other words, as long as their line break behavior is identical, otherwise unrelated characters will be found grouped together in the same line break class.

The classification by property values defined in this section and in the data file is used as input into two algorithms defined in Section 6, Line Breaking Algorithm, and Section 7, Pair Table-Based Implementation. These sections describe workable default line breaking methods. Section 8, Customization, discusses how the default line breaking behavior can be tailored to the needs of particular languages for particular document styles and user preferences.

Data File

The full classification of all Unicode characters by their line breaking properties is available in the file LineBreak.txt [Data14] in the Unicode Character Database [UCD]. This is a tab-delimited, two-column, plain text file, with code position and line breaking class. A comment at the end of each line indicates the character name. Ideographic, Hangul, Surrogate, and Private Use ranges are collapsed by giving a range in the first column.
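A data file of this shape can be loaded with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, and it accepts either a semicolon or a tab as the field separator, on the assumption that published UCD data files separate fields with semicolons while the description above mentions tabs. Falling back to class XX for code points not covered by any entry is likewise an assumption of this sketch.

```python
import re


def parse_linebreak_lines(lines):
    """Parse data-file lines into (first, last, line_breaking_class) tuples."""
    ranges = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        fields = re.split(r"[;\t]", line)
        if len(fields) < 2:
            continue
        code, cls = fields[0].strip(), fields[1].strip()
        # Collapsed ranges are written as FIRST..LAST.
        first, _, last = code.partition("..")
        ranges.append((int(first, 16), int(last or first, 16), cls))
    return ranges


def lookup_class(ranges, code_point, default="XX"):
    """Return the class for `code_point`, or `default` if no entry covers it."""
    for first, last, cls in ranges:
        if first <= code_point <= last:
            return cls
    return default
```

Because more classes may be added in future versions, the parser deliberately does not validate class names against a fixed list.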

Future Updates

As more scripts are added to the Unicode Standard and become more widely implemented and used on computers, more line breaking classes may be added or the assignment of line breaking class may be changed for some characters. Implementers must not make any assumptions to the contrary. Any future updates will be reflected in the latest version of the data file. (See the Unicode Character Database [UCD] for any specific version of the data file.)

5.1 Description of Line Breaking Properties

Line breaking classes are listed alphabetically. Each line breaking class is marked with an annotation in parentheses with the following meanings:

(A): the class allows a break opportunity after in specified contexts

(XA): the class prevents a break opportunity after in specified contexts

(B): the class allows a break opportunity before in specified contexts

(XB): the class prevents a break opportunity before in specified contexts

(P): the class allows a break opportunity for a pair of same characters

(XP): the class prevents a break opportunity for a pair of same characters

Note: The use of the letters B and A in these annotations marks the position of the break opportunity relative to the character. It is not to be confused with the use of the same letters in the other parts of this annex, where they indicate the positions of the characters relative to the break opportunity.

AI: Ambiguous (Alphabetic or Ideograph)

Some characters that ordinarily act like alphabetic or symbol characters (which have the AL line breaking class) are treated like ideographs (line breaking class ID) in certain East Asian legacy contexts. Their line breaking behavior therefore depends on the context. In the absence of appropriate context information, they are treated as class AL; see the note at the end of this description.

As originally defined, the line break class AI contained all characters with East_Asian_Width value A (ambiguous width) that would otherwise be AL in this classification. For more information on East_Asian_Width and how to resolve it, see Unicode Standard Annex #11, East Asian Width [EAW].

The original definition included many Latin, Greek, and Cyrillic characters. These characters are now classified by default as AL because use of the AL line breaking class better corresponds to modern practice. Where strict compatibility with older legacy implementations is desired, some of these characters need to be treated as ID in certain contexts. This can be done by always tailoring them to ID or by continuing to classify them as AI and resolving them to ID where required.

As part of the same revision, the set of ambiguous characters has been extended to completely encompass the enclosed alphanumeric characters used for numbering of bullets.

As updated, the AI line breaking class includes all characters with East Asian Width A that are outside the range U+0000..U+1FFF, plus the following characters:

24EA CIRCLED DIGIT ZERO
2780..2793 DINGBAT CIRCLED SANS-SERIF DIGIT ONE..DINGBAT NEGATIVE CIRCLED SANS-SERIF NUMBER TEN

Characters with the line break class AI with East_Asian_Width value A typically take the AL line breaking class when their resolved East_Asian_Width is N (narrow) and take the line breaking class ID when their resolved width is W (wide). The remaining characters are then resolved to AL or ID in a consistent fashion. The details of this resolution are not specified in this annex. The line breaking rules in Section 6, Line Breaking Algorithm, and the pair table in Section 7, Pair Table-Based Implementation, merely require that all ambiguous characters have been resolved appropriately as part of assigning line breaking classes to the input characters.

Note: The canonical decompositions of characters of class AI are not necessarily of class AI themselves, or conversely. The East_Asian_Width property A, on which the definition of AI is largely based, does not preserve canonical equivalence. In the context of line breaking, the fact that a character has been assigned class AI means that the line break implementation must resolve it to either AL or ID, in the absence of further tailoring. If preserving canonical equivalence is desired, an implementation is free to make sure that the resolved line break classes preserve canonical equivalence. Unless compatibility with particular legacy behavior is important, it may be sufficient to map all such characters to AL. This achieves a canonically equivalent resolution of line breaking classes, and is compatible with emerging modern practice that treats these characters increasingly like regular alphabetic characters.
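Because the details of the resolution are left open, the following Python sketch shows just one possible policy: AI resolves to ID when the caller reports a resolved East_Asian_Width of W, and to AL otherwise, including when no context is available. The function name `resolve_ai` and its parameters are assumptions made for illustration.

```python
def resolve_ai(line_breaking_class, resolved_eaw=None):
    """Resolve class AI to AL or ID; other classes pass through unchanged.

    `resolved_eaw` is the resolved East_Asian_Width ("N" or "W"),
    or None when no context information is available.
    """
    if line_breaking_class != "AI":
        return line_breaking_class
    if resolved_eaw == "W":
        return "ID"  # wide context: behave like an ideograph
    return "AL"      # narrow or unknown context: behave like AL
```

Calling `resolve_ai("AI")` with no width information gives `"AL"`, which matches the simple mapping suggested in the note above.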

AL: Ordinary Alphabetic and Symbol Characters (XP)

Ordinary characters require other characters to provide break opportunities; otherwise, no line breaks are allowed between pairs of them. However, this behavior is tailorable. In some Far Eastern documents, it may be desirable to allow breaking between pairs of ordinary characters, particularly Latin characters and symbols.

Note: Use ZWSP as a manual override to provide break opportunities around alphabetic or symbol characters.

Except as listed explicitly below as part of another line breaking class, and except as assigned class AI or ID based on East Asian Width, this class contains the following characters:

ALPHABETIC: all remaining characters of General Categories Lu, Ll, Lt, Lm, and Lo
SYMBOLS: all remaining characters of General Categories Sm, Sk, and So
NON-DECIMAL NUMBERS: all remaining characters of General Categories Nl and No
PUNCTUATION: all remaining characters of General Categories Pc, Pd, and Po

Plus these characters:

0600..0603 ARABIC NUMBER SIGN..ARABIC SIGN SAFHA
06DD ARABIC END OF AYAH
070F SYRIAC ABBREVIATION MARK
2061..2064 FUNCTION APPLICATION..INVISIBLE PLUS

These characters occur in the middle or at the beginning of words or alphanumeric or symbol sequences. However, when alphabetic characters are tailored to allow breaks, these characters should not allow breaks after.
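The General Category portion of this definition can be approximated with Python's standard `unicodedata` module, as in the sketch below. The names are illustrative, and two caveats apply: the test only identifies candidates, since any character explicitly assigned to another class (or to AI or ID) must be excluded first, and `unicodedata` reflects whichever Unicode version ships with the Python interpreter rather than the version described here.

```python
import unicodedata

# General Categories named in the AL description above.
AL_GENERAL_CATEGORIES = frozenset(
    {"Lu", "Ll", "Lt", "Lm", "Lo", "Sm", "Sk", "So", "Nl", "No", "Pc", "Pd", "Po"}
)

# The explicitly listed additions: 0600..0603, 06DD, 070F, 2061..2064.
AL_EXTRA_CODE_POINTS = frozenset(
    [*range(0x0600, 0x0604), 0x06DD, 0x070F, *range(0x2061, 0x2065)]
)


def is_al_candidate(ch):
    """Return True if `ch` would fall into AL absent a more specific class."""
    return (
        ord(ch) in AL_EXTRA_CODE_POINTS
        or unicodedata.category(ch) in AL_GENERAL_CATEGORIES
    )
```

For instance, HYPHEN-MINUS has General Category Pd and so passes this test, even though Table 1 assigns it to class HY; a complete classifier would consult the data file instead.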

BA: Break Opportunity After (A)

Like SPACE, the characters in this class provide a break opportunity; unlike SPACE, they do not take part in determining indirect breaks. They can be subdivided into several categories.

Breaking Spaces

Breaking spaces are the following subset of characters with General_Category Zs:

1680 OGHAM SPACE MARK
2000 EN QUAD
2001 EM QUAD
2002 EN SPACE
2003 EM SPACE
2004 THREE-PER-EM SPACE
2005 FOUR-PER-EM SPACE
2006 SIX-PER-EM SPACE
2008 PUNCTUATION SPACE
2009 THIN SPACE
200A HAIR SPACE
205F MEDIUM MATHEMATICAL SPACE

All of these space characters have a specific width, but otherwise behave as breaking spaces. In setting a justified line, none of these spaces normally changes in width, except for THIN SPACE when used in mathematical notation. See also the SP property.

The Ogham space mark may be rendered visibly between words but it is recommended that it be elided at the end of a line. For more information, see Section 5.7, Word Separator Characters.

See the ID property for U+3000 IDEOGRAPHIC SPACE. For a list of all space characters in the Unicode Standard, see Section 6.2, General Punctuation, in [Unicode5.0].
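The list of breaking spaces is short enough to hold in a constant. The Python sketch below encodes exactly the twelve code points listed above and uses the standard `unicodedata` module to confirm that each has General_Category Zs; the names `BA_BREAKING_SPACES` and `is_ba_breaking_space` are illustrative.

```python
import unicodedata

# 1680, 2000..2006, 2008..200A, and 205F, as listed above.
BA_BREAKING_SPACES = frozenset(
    [0x1680, *range(0x2000, 0x2007), *range(0x2008, 0x200B), 0x205F]
)


def is_ba_breaking_space(ch):
    """Return True if `ch` is one of the breaking spaces of class BA."""
    return ord(ch) in BA_BREAKING_SPACES


# Every listed code point is a space separator (General_Category Zs).
assert all(unicodedata.category(chr(cp)) == "Zs" for cp in BA_BREAKING_SPACES)
```

Not every Zs character is in the set: U+0020 SPACE has class SP, U+00A0 NO-BREAK SPACE is listed under GL in Table 1, and U+3000 IDEOGRAPHIC SPACE is covered under ID.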

Tabs

0009 TAB

Except for the effect of the location of the tab stops, the tab character acts similarly to a space for the purpose of line breaking.

Conditional Hyphens

00AD SOFT HYPHEN (SHY)

SHY marks the place where an optional line break may occur inside a word. It can be used with all scripts. SHY is rendered invisibly and has no width: it merely indicates an optional line break. The rendering of the optional line break depends on the script. For the Latin script, rendering the line break typically means displaying a hyphen at the end of the line; however, some languages require a change in spelling surrounding an optional line break. For examples, see Section 5.4, Use of Soft Hyphen.
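For the simple Latin-script case, the rendering behavior of SHY can be sketched in Python as follows. The function name `render_line` is hypothetical, the use of HYPHEN-MINUS as the visible hyphen is an assumption, and the sketch does not handle languages that change spelling around the break.

```python
SHY = "\u00AD"  # SOFT HYPHEN


def render_line(line, hyphen="-"):
    """Return the visible text of one laid-out line.

    A SHY at the end of the line means the break was taken there, so a
    visible hyphen is shown; every other SHY is invisible and has no width.
    """
    broke_at_shy = line.endswith(SHY)
    visible = line.replace(SHY, "")
    return visible + hyphen if broke_at_shy else visible
```

For example, the line `"hy\u00ADphen\u00AD"` renders as `"hyphen-"`, while `"hy\u00ADphenation"` renders as `"hyphenation"` because no break was taken at its SHY.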

Breaking Hyphens

Breaking hyphens establish explicit break opportunities immediately after each occurrence.

058A ARMENIAN HYPHEN
2010 HYPHEN
2012 FIGURE DASH
2013 EN DASH

Hyphens are graphic characters with width. Because, unlike spaces, they are visible, they are included in the measured part of the preceding line, except where the layout style allows hyphens to hang into the margins. For additional information about how to format line breaks resulting from the presence of hyphens, see Section 5.3, Use of Hyphen.

Visible Word Dividers

The following are other forms of visible word dividers that provide break opportunities:

05BE HEBREW PUNCTUATION MAQAF
0F0B TIBETAN MARK INTERSYLLABIC TSHEG
1361 ETHIOPIC WORDSPACE
17D8 KHMER SIGN BEYYAL
17DA KHMER SIGN KOOMUUT

The Tibetan tsheg is a visible mark, but it functions effectively like a space to separate words (or other units) in Tibetan. It provides a break opportunity after itself. For additional information, see Section 5.6, Tibetan Line Breaking.

The Ethiopic word space is a visible word delimiter and is kept on the previous line. In contrast, U+1360 ETHIOPIC SECTION MARK is typically used in a sequence of several such marks on a separate line, and separated by spaces. As such lines are typically marked with separate hard line breaks (BK), the section mark is treated like an ordinary symbol and given line break class AL.

2027 HYPHENATION POINT

A hyphenation point is a raised dot, which is mainly used in dictionaries and similar works to visibly indicate syllabification of words. Syllable breaks frequently also are potential line break opportunities in the middle of words. When an actual line break falls inside a word containing hyphenation point characters, the hyphenation point is usually rendered as a regular hyphen at the end of the line.

007C VERTICAL LINE

In some dictionaries, a vertical bar is used instead of a hyphenation point. In this usage, U+0323 COMBINING DOT BELOW is used to mark stressed syllables, so all breaks are marked by the vertical bar. For an actual break opportunity, the vertical bar is rendered as a hyphen in such usage.

Historic Word Separators

Historic texts, especially ancient ones, often do not use spaces, even for scripts where modern use of spaces is standard. Special punctuation was used to mark word boundaries in such texts. For modern text processing it is recommended to treat these as line break opportunities by default. WJ can be used to override this default, where necessary.

16EB RUNIC SINGLE DOT PUNCTUATION
16EC RUNIC MULTIPLE DOT PUNCTUATION
16ED RUNIC CROSS PUNCTUATION
2056 THREE DOT PUNCTUATION
2058 FOUR DOT PUNCTUATION
2059 FIVE DOT PUNCTUATION
205A TWO DOT PUNCTUATION
205B FOUR DOT MARK
205D TRICOLON
205E VERTICAL FOUR DOTS
2E19 PALM BRANCH
2E2A TWO DOTS OVER ONE DOT PUNCTUATION
2E2B ONE DOT OVER TWO DOTS PUNCTUATION
2E2C SQUARED FOUR DOT PUNCTUATION
2E2D FIVE DOT PUNCTUATION
2E30 RING POINT
10100 AEGEAN WORD SEPARATOR LINE
10101 AEGEAN WORD SEPARATOR DOT
10102 AEGEAN CHECK MARK
1039F UGARITIC WORD DIVIDER
103D0 OLD PERSIAN WORD DIVIDER
1091F PHOENICIAN WORD DIVIDER
12470 CUNEIFORM PUNCTUATION SIGN OLD ASSYRIAN WORD DIVIDER

Dandas

DEVANAGARI DANDA is similar to a full stop. The danda or historically related symbols are used with several other Indic scripts. Unlike a full stop, the danda is not used in number formatting. DEVANAGARI DOUBLE DANDA marks the end of a verse. It also has analogues in other scripts.

0964 DEVANAGARI DANDA
0965 DEVANAGARI DOUBLE DANDA
0E5A THAI CHARACTER ANGKHANKHU
0E5B THAI CHARACTER KHOMUT
104A MYANMAR SIGN LITTLE SECTION
104B MYANMAR SIGN SECTION
1735 PHILIPPINE SINGLE PUNCTUATION
1736 PHILIPPINE DOUBLE PUNCTUATION
17D4 KHMER SIGN KHAN
17D5 KHMER SIGN BARIYOOSAN
1B5E BALINESE CARIK SIKI
1B5F BALINESE CARIK PAREREN
A8CE SAURASHTRA DANDA
A8CF SAURASHTRA DOUBLE DANDA
AA5D CHAM PUNCTUATION DANDA
AA5E CHAM PUNCTUATION DOUBLE DANDA
AA5F CHAM PUNCTUATION TRIPLE DANDA
10A56 KHAROSHTHI PUNCTUATION DANDA
10A57 KHAROSHTHI PUNCTUATION DOUBLE DANDA

Tibetan

0F34 TIBETAN MARK BSDUS RTAGS
0F7F TIBETAN SIGN RNAM BCAD
0F85 TIBETAN MARK PALUTA
0FBE TIBETAN KU RU KHA
0FBF TIBETAN KU RU KHA BZHI MIG CAN
0FD2 TIBETAN MARK NYIS TSHEG

For additional information, see Section 5.6, Tibetan Line Breaking.

Other Terminating Punctuation

Terminating punctuation stays with the line, but otherwise allows a break after it. This is similar to EX, except that the latter may be separated by a space from the preceding word without allowing a break, whereas these marks are used without spaces.

1804 MONGOLIAN COLON
1805 MONGOLIAN FOUR DOTS
1B5A BALINESE PANTI
1B5B BALINESE PAMADA
1B5C BALINESE WINDU
1B5D BALINESE CARIK PAMUNGKAH
1B60 BALINESE PAMENENG
1C3B LEPCHA PUNCTUATION TA-ROL
1C3C LEPCHA PUNCTUATION NYET THYOOM TA-ROL
1C3D LEPCHA PUNCTUATION CER-WA
1C3E LEPCHA PUNCTUATION TSHOOK CER-WA
1C3F LEPCHA PUNCTUATION TSHOOK
1C7E OL CHIKI PUNCTUATION MUCAAD
1C7F OL CHIKI PUNCTUATION DOUBLE MUCAAD
2CFA COPTIC OLD NUBIAN DIRECT QUESTION MARK
2CFB COPTIC OLD NUBIAN INDIRECT QUESTION MARK
2CFC COPTIC OLD NUBIAN VERSE DIVIDER
2CFF COPTIC MORPHOLOGICAL DIVIDER
2E0E..2E15 EDITORIAL CORONIS..UPWARDS ANCORA
2E17 OBLIQUE DOUBLE HYPHEN
A60D VAI COMMA
A60F VAI QUESTION MARK
A92E KAYAH LI SIGN CWI
A92F KAYAH LI SIGN SHYA
10A50 KHAROSHTHI PUNCTUATION DOT
10A51 KHAROSHTHI PUNCTUATION SMALL CIRCLE
10A52 KHAROSHTHI PUNCTUATION CIRCLE
10A53 KHAROSHTHI PUNCTUATION CRESCENT BAR
10A54 KHAROSHTHI PUNCTUATION MANGALAM
10A55 KHAROSHTHI PUNCTUATION LOTUS

BB: Break Opportunities Before (B)

Characters of this line break class move to the next line at a line break and thus provide a line break opportunity before.

Dictionary Use

00B4 ACUTE ACCENT
1FFD GREEK OXIA

In some dictionaries, stressed syllables are indicated with a spacing acute accent instead of the hyphenation point. In this case the accent moves to the next line, and the preceding line ends with a hyphen. The oxia is canonically equivalent to the acute accent.

02DF MODIFIER LETTER CROSS ACCENT

A cross accent also appears in some dictionaries to mark the stress of the following syllable, and should be handled in the same way as the other stress marking characters in this section. The accent should not be separated from the syllable it marks by a break.

02C8 MODIFIER LETTER VERTICAL LINE
02CC MODIFIER LETTER LOW VERTICAL LINE

These characters are used in dictionaries to indicate stress and secondary stress when IPA is used. Both are prefixes to the stressed syllable in IPA. Breaking before them keeps them with the syllable.

Note: It is hard to find actual examples in most dictionaries because the pronunciation fields usually occur right after the headword, and the columns are wide enough to prevent line breaks in most pronunciations.

Tibetan and Phags-Pa Head Letters

0F01 TIBETAN MARK GTER YIG MGO TRUNCATED A
0F02 TIBETAN MARK GTER YIG MGO -UM RNAM BCAD MA
0F03 TIBETAN MARK GTER YIG MGO -UM GTER TSHEG MA
0F04 TIBETAN MARK INITIAL YIG MGO MDUN MA
0F06 TIBETAN MARK CARET YIG MGO PHUR SHAD MA
0F07 TIBETAN MARK YIG MGO TSHEG SHAD MA
0F09 TIBETAN MARK BSKUR YIG MGO
0F0A TIBETAN MARK BKA- SHOG YIG MGO
0FD0 TIBETAN MARK BSKA- SHOG GI MGO RGYAN
0FD1 TIBETAN MARK MNYAM YIG GI MGO RGYAN
0FD3 TIBETAN MARK INITIAL BRDA RNYING YIG MGO MDUN MA
A874 PHAGS-PA SINGLE HEAD MARK
A875 PHAGS-PA DOUBLE HEAD MARK

Tibetan head letters allow a break before. For more information, see Section 5.6, Tibetan Line Breaking.

Mongolian

1806 MONGOLIAN TODO SOFT HYPHEN

Despite its name, this Mongolian character is not an invisible control like SOFT HYPHEN, but rather a visible character like a regular hyphen. Unlike the hyphen, MONGOLIAN TODO SOFT HYPHEN stays with the following line. Whenever optional line breaks are to be marked invisibly, SOFT HYPHEN should be used instead.

B2: Break Opportunity Before and After (B/A/XP)

2014 EM DASH

The EM DASH is used to set off parenthetical text. Normally, it is used without spaces. However, this is language dependent. For example, in Swedish, spaces are used around the EM DASH. Line breaks can occur before and after an EM DASH. Because EM DASHes are sometimes used in pairs instead of a single quotation dash, the default behavior is not to break the line between them, even though not all fonts use connecting glyphs for the EM DASH.

BK: Mandatory Break (A) (Non-tailorable)

Explicit breaks act independently of the surrounding characters. No characters can be added to the BK class as part of tailoring, but implementations are not required to support the VT character.

000C FORM FEED (FF)
000B LINE TABULATION (VT)

FORM FEED separates pages. The text on the new page starts at the beginning of the line. In some layout modes there may be no visible advance to a new “page”.

2028 LINE SEPARATOR (LS)

The text after the Line Separator starts at the beginning of the line. This is similar to HTML <BR>.

2029 PARAGRAPH SEPARATOR (PS)

The text of the new paragraph starts at the beginning of the line. This character defines a paragraph break, causing suitable formatting to be applied, for example, interparagraph spacing or first line indentation. LS, FF, VT as well as CR, LF and NL do not define a paragraph break.

Newline Function (NLF)

Newline Functions are defined in the Unicode Standard as providing additional mandatory breaks. They are not individual characters, but are encoded as sequences of the control characters NEL, LF, and CR. If a character sequence for a Newline Function contains more than one character, it is kept together. The particular sequences that form an NLF depend on the implementation and other circumstances as described in Section 5.8, Newline Guidelines, of [Unicode5.0].

This specification defines the NLF implicitly. It defines the three character classes CR, LF, and NL. Their line break behavior, defined in rule LB5 in Section 6.1, Non-tailorable Line Breaking Rules, is to break after NL, LF, or CR, but not between CR and LF.
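The behavior just described can be illustrated with a minimal Python sketch. It is not part of this specification: the function name `mandatory_break_offsets` is hypothetical, and the sketch handles only the three classes CR, LF, and NL, leaving aside the BK characters and all other rules.

```python
CR, LF, NEL = "\u000D", "\u000A", "\u0085"

def mandatory_break_offsets(text):
    """Return the offsets after which a mandatory break occurs.

    Breaks after NL, LF, or CR, but never between CR and LF.
    """
    breaks = []
    for i, ch in enumerate(text):
        if ch == CR:
            # A CR directly followed by LF is kept together with it;
            # the break is recorded after the LF instead.
            if i + 1 < len(text) and text[i + 1] == LF:
                continue
            breaks.append(i + 1)
        elif ch == LF or ch == NEL:
            breaks.append(i + 1)
    return breaks
```

For the string "a", CR, LF, "b", CR, "c", LF, "d", the sketch reports breaks after offsets 3, 5, and 7, so the sequence <CR, LF> yields a single break.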

CB: Contingent Break Opportunity (B/A)

By default, there is a break opportunity both before and after any inline object. Object-specific line breaking behavior is implemented in the associated object itself, and where available can override the default to prevent either or both of the default break opportunities. Using U+FFFC OBJECT REPLACEMENT CHARACTER allows the object anchor to take a character position in the string.

FFFC OBJECT REPLACEMENT CHARACTER

Object-specific line break behavior is best implemented by querying the object itself, not by replacing the CB line breaking class by another class.

CL: Closing Punctuation (XB)

The closing character of any set of paired punctuation should be kept with the preceding character, and the same applies to all forms of wide comma and full stop. This is desirable, even when there are intervening space characters, so as to prevent the appearance of a bare closing punctuation mark at the head of a line. The CL line break class contains the following characters plus any characters of General_Category Pe in the Unicode Character Database.

3001..3002 IDEOGRAPHIC COMMA..IDEOGRAPHIC FULL STOP
FE11 PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC COMMA
FE12 PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP
FE50 SMALL COMMA
FE52 SMALL FULL STOP
FF0C FULLWIDTH COMMA
FF0E FULLWIDTH FULL STOP
FF61 HALFWIDTH IDEOGRAPHIC FULL STOP
FF64 HALFWIDTH IDEOGRAPHIC COMMA

CM: Attached Characters and Combining Marks (XB) (Non-tailorable)

Combining Characters

Combining character sequences are treated as units for the purpose of line breaking. The line breaking behavior of the sequence is that of the base character.

The preferred base character for showing combining marks in isolation is U+00A0 NO-BREAK SPACE. If a line break before or after the combining sequence is desired, U+200B ZERO WIDTH SPACE can be used. The use of U+0020 SPACE as a base character is deprecated.

For most purposes, combining characters take on the properties of their base characters, and that is how the CM class is treated in rule LB9 of this specification. As a result, if the sequence <0021, 20E4> is used to represent a triangle enclosing an exclamation point, it is effectively treated as EX, the line break class of the exclamation mark. If U+26A0 WARNING SIGN had been used, which also looks like an exclamation point inside a triangle, it would have the line break class of AL. Only the latter corresponds to the line breaking behavior expected by users for this symbol. To avoid surprising behavior, always use a base character that is a symbol or letter (Line Break AL) when using enclosing combining marks (General_Category Me).

The CM line break class includes all combining characters with General_Category Mc, Me, and Mn, unless listed explicitly elsewhere. This includes viramas.
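As an informal illustration of how a combining sequence takes on the class of its base, the following Python sketch rewrites a list of line break classes. The function name `resolve_combining` is hypothetical. The sketch assumes the classes have already been assigned, and it follows the usual reading of rules LB9 and LB10 in Section 6: a CM that has no base, or that follows BK, CR, LF, NL, SP, or ZW, is treated as AL.

```python
# Classes after which a combining mark does not attach (assumption drawn
# from rules LB9 and LB10, which lie outside this section).
NO_ATTACH = {"BK", "CR", "LF", "NL", "SP", "ZW"}

def resolve_combining(classes):
    """Replace each CM with the class of the base it attaches to."""
    resolved = []
    base = None
    for cls in classes:
        if cls == "CM":
            if base is None or base in NO_ATTACH:
                # No usable base: treat the mark as AL, which then acts
                # as the base for any further marks that follow.
                base = "AL"
            resolved.append(base)
        else:
            resolved.append(cls)
            base = cls
    return resolved
```

For the sequence <0021, 20E4>, the input classes EX, CM resolve to EX, EX, which is the behavior described above for the exclamation mark enclosed in a triangle.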

Control and Formatting Characters

Most control and formatting characters are ignored in line breaking and do not contribute to the line width. By giving them class CM, the line breaking behavior of the last preceding character that is not of class CM affects the line breaking behavior.

Note: When control codes and format characters are rendered visibly during editing, more graceful layout might be achieved by treating them as if they had the line break class of the visible symbols instead, that is, AL or ID. Such visible modes do not violate the constraint on tailorability, because they are logically equivalent to having temporarily substituted symbol characters, such as the characters from the Control Pictures block, or in some cases, character sequences, for the actual control characters.

The CM line break class includes all characters of General_Category Cc and Cf, unless listed explicitly elsewhere.

CR: Carriage Return (A) (Non-tailorable)

000D CARRIAGE RETURN (CR)

A CR indicates a mandatory break after, unless followed by a LF. See also the discussion under BK.

Note: On some platforms the character sequence <CR, CR, LF> is used to indicate the location of actual line breaks, whereas <CR, LF> is treated like a hard line break. As soon as a user edits the text, the location of all the <CR, CR, LF> sequences may change as the new text breaks differently, while the relative position of any <CR, LF> to the surrounding text stays the same. This convention allows an editor to return a buffer and the client to tell which text is displayed on which line by counting the number of <CR, CR, LF> and <CR, LF> sequences. This convention is essentially equivalent to markup that captures the result of applying the line break algorithm, not a tailoring of the CR character. The <CR, CR, LF> sequences are thus not considered part of the plain text content.
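A client that receives such a buffer could count the two kinds of sequences as in the following Python sketch. The function name `count_line_ends` is hypothetical, and the sketch assumes the buffer contains no runs of CR other than the two sequences described in the note.

```python
def count_line_ends(buffer):
    """Return (soft, hard): counts of <CR, CR, LF> and of plain <CR, LF>."""
    soft = buffer.count("\r\r\n")
    # Every <CR, CR, LF> also contains one <CR, LF>, so subtract those.
    hard = buffer.count("\r\n") - soft
    return soft, hard
```

The sum of the two counts gives the number of displayed line ends in the buffer, while only the second count reflects breaks that belong to the plain text content.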

EX: Exclamation/Interrogation (XB)

Characters in this line break class behave like closing characters, except in relation to postfix (PO) and non-starter characters (NS).

0021 EXCLAMATION MARK
003F QUESTION MARK

05C6 HEBREW PUNCTUATION NUN HAFUKHA
061B ARABIC SEMICOLON
061E ARABIC TRIPLE DOT PUNCTUATION MARK
061F ARABIC QUESTION MARK
06D4 ARABIC FULL STOP
07F9 NKO EXCLAMATION MARK
0F0D TIBETAN MARK SHAD
0F0E TIBETAN MARK NYIS SHAD
0F0F TIBETAN MARK TSHEG SHAD
0F10 TIBETAN MARK NYIS TSHEG SHAD
0F11 TIBETAN MARK RIN CHEN SPUNGS SHAD
0F14 TIBETAN MARK GTER TSHEG
1802 MONGOLIAN COMMA
1803 MONGOLIAN FULL STOP
1808 MONGOLIAN MANCHU COMMA
1809 MONGOLIAN MANCHU FULL STOP
1944 LIMBU EXCLAMATION MARK
1945 LIMBU QUESTION MARK
2762 HEAVY EXCLAMATION MARK ORNAMENT
2763 HEAVY HEART EXCLAMATION MARK ORNAMENT
2CF9 COPTIC OLD NUBIAN FULL STOP
2CFE COPTIC FULL STOP
2E2E REVERSED QUESTION MARK
A60C VAI SYLLABLE LENGTHENER
A60E VAI FULL STOP
A876 PHAGS-PA MARK SHAD
A877 PHAGS-PA MARK DOUBLE SHAD
FE15 PRESENTATION FORM FOR VERTICAL EXCLAMATION MARK
FE16 PRESENTATION FORM FOR VERTICAL QUESTION MARK
FE56..FE57 SMALL QUESTION MARK..SMALL EXCLAMATION MARK
FF01 FULLWIDTH EXCLAMATION MARK
FF1F FULLWIDTH QUESTION MARK

GL: Non-breaking (“Glue”) (XB/XA) (Non-tailorable)

Non-breaking characters prohibit breaks on either side, but that prohibition can be overridden by SP or ZW. In particular, when NBSP follows SPACE, there is a break opportunity after the SPACE and NBSP will go as visible space onto the next line. See also WJ. The following lists the characters of line break class GL with additional description.

00A0 NO-BREAK SPACE (NBSP)
202F NARROW NO-BREAK SPACE (NNBSP)
180E MONGOLIAN VOWEL SEPARATOR (MVS)

NO-BREAK SPACE is the preferred character to use where two words are to be visually separated but kept on the same line, as in the case of a title and a name “Dr.<NBSP>Joseph Becker”. When SPACE follows NBSP, there is no break, because there never is a break in front of SPACE. NARROW NO-BREAK SPACE is used in Mongolian. The MONGOLIAN VOWEL SEPARATOR acts like a NNBSP in its line breaking behavior. It additionally affects the shaping of certain vowel characters as described in Section 13.2, Mongolian, of [Unicode5.0].

NARROW NO-BREAK SPACE (NNBSP) is a narrow version of NO-BREAK SPACE, which except for its display width behaves exactly the same in its line breaking behavior. It is regularly used in Mongolian in certain grammatical contexts (before a particle), where it also influences the shaping of the glyphs for the particle. In Mongolian text, the NNBSP is typically displayed with 1/3 the width of a normal space character.

When NARROW NO-BREAK SPACE occurs in French text, it should be interpreted as an “espace fine insécable”.

The MONGOLIAN VOWEL SEPARATOR is equivalent to a NNBSP in its line breaking behavior, but has different effects in controlling the shaping of its preceding and following characters. It constitutes a word-internal space and is typically displayed with half the width of a NNBSP.

034F COMBINING GRAPHEME JOINER

This character has no visible glyph and its presence indicates that adjoining characters are to be treated as a graphemic unit, therefore preventing line breaks between them. The use of grapheme joiner affects other processes, such as sorting; therefore, U+2060 WORD JOINER should be used if the intent is to merely prevent a line break.

2007 FIGURE SPACE

This is the preferred space to use in numbers. It has the same width as a digit and keeps the number together for the purpose of line breaking.

2011 NON-BREAKING HYPHEN (NBHY)

This is the preferred character to use where words need to be hyphenated but may not be broken at the hyphen. Because of this use as a substitute for ordinary hyphen, the appearance of this character should match that of U+2010 HYPHEN.

0F08 TIBETAN MARK SBRUL SHAD
0F0C TIBETAN MARK DELIMITER TSHEG BSTAR
0F12 TIBETAN MARK RGYA GRAM SHAD

The TSHEG BSTAR looks exactly like a Tibetan tsheg, but can be used to prevent a break like no-break space. It inhibits breaking on either side. For more information, see Section 5.6, Tibetan Line Breaking.

035C..0362 COMBINING DOUBLE BREVE BELOW..COMBINING DOUBLE RIGHTWARDS ARROW BELOW

These diacritics span two characters, so no word or line breaks are possible on either side.

H2: Hangul LV Syllable (B/A)

This class includes all characters of Hangul Syllable Type LV.

Together with conjoining jamos, Hangul syllables form Korean Syllable Blocks, which are kept together; see [Boundaries]. Korean uses space-based line breaking in many styles of documents. To support these, Hangul syllables and conjoining jamo need to be tailored to use class AL. The default in this specification is class ID, which supports the case of Korean documents not using space-based line breaking. See Section 8.1, Types of Tailoring. See also JL, JT, JV, and H3.

H3: Hangul LVT Syllable (B/A)

This class includes all characters of Hangul Syllable Type LVT. See also JL, JT, JV, and H2.
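The distinction between H2 and H3 can be computed arithmetically rather than looked up. The following Python sketch is an illustration only: the function name `hangul_syllable_class` is hypothetical, and the constants come from the Hangul syllable composition arithmetic of the Unicode Standard (precomposed syllables occupy AC00..D7A3, with 28 trailing-consonant slots per LV pair), not from this annex.

```python
S_BASE = 0xAC00   # first precomposed Hangul syllable
S_LAST = 0xD7A3   # last precomposed Hangul syllable
T_COUNT = 28      # trailing-consonant slots per LV pair, including "none"

def hangul_syllable_class(cp):
    """Return 'H2' for an LV syllable, 'H3' for LVT, else None."""
    if not S_BASE <= cp <= S_LAST:
        return None
    # An index that is a multiple of T_COUNT has no trailing consonant.
    return "H2" if (cp - S_BASE) % T_COUNT == 0 else "H3"
```

Under this arithmetic U+AC00 is an LV syllable (H2) and U+AC01 is an LVT syllable (H3).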

HY: Hyphen (XA)

002D HYPHEN-MINUS

Some additional context analysis is required to distinguish usage of this character as a hyphen from its usage as a minus sign (or indicator of numerical range). If used as hyphen, it acts like hyphen, which has line break class BA.

Note: Some typescript conventions use runs of HYPHEN-MINUS to stand in for longer dashes or horizontal rules. If actual character code conversion is not performed and it is desired to treat them like the characters or layout elements they stand for, line breaking needs to support these runs explicitly.

ID: Ideographic (B/A)

Note: This class includes characters other than Han ideographs.

Characters with this property do not require other characters to provide break opportunities; lines can ordinarily break before and after and between pairs of ideographic characters. The ID line break class consists of the following characters:

2E80..2FFF CJK, KANGXI RADICALS, DESCRIPTION SYMBOLS
3000 IDEOGRAPHIC SPACE
3040..309F Hiragana (except small characters)
30A0..30FF Katakana (except small characters)
3400..4DB5 CJK UNIFIED IDEOGRAPHS EXTENSION A
4E00..9FBB CJK UNIFIED IDEOGRAPHS
F900..FAD9 CJK COMPATIBILITY IDEOGRAPHS
A000..A48F YI SYLLABLES
A490..A4CF YI RADICALS
FE62..FE66 SMALL PLUS SIGN..SMALL EQUALS SIGN
FF10..FF19 WIDE DIGITS
20000..2A6D6 CJK UNIFIED IDEOGRAPHS EXTENSION B
2F800..2FA1D CJK COMPATIBILITY IDEOGRAPHS SUPPLEMENT

It also includes all of the FULLWIDTH LATIN letters and all of the blocks in the range 3000..33FF not covered elsewhere.

Note: Use U+2060 WORD JOINER as a manual override to prevent break opportunities around characters of class ID.

U+3000 IDEOGRAPHIC SPACE may be subject to expansion or compression during line justification.

Korean

Korean is encoded with conjoining jamo, Hangul syllables, or both. See also JL, JT, JV, H2, and H3. The following set of compatibility jamo is treated as ID by default.

3130..318F HANGUL COMPATIBILITY JAMO

IN: Inseparable Characters (XP)

Leaders

These characters are intended to be used in consecutive sequence. There is never a line break between two characters of this class.

2024 ONE DOT LEADER
2025 TWO DOT LEADER
2026 HORIZONTAL ELLIPSIS
FE19 PRESENTATION FORM FOR VERTICAL HORIZONTAL ELLIPSIS

Horizontal ellipsis can be used as a three-dot leader.

IS: Numeric Separator (Infix) (XB)

Characters that usually occur inside a numerical expression may not be separated from the numeric characters that follow, unless a space character intervenes. For example, there is no break in “100.00” or “10,000”, nor in “12:59”.

002C COMMA
002E FULL STOP
003A COLON
003B SEMICOLON
037E GREEK QUESTION MARK (canonically equivalent to 003B)
0589 ARMENIAN FULL STOP
060C ARABIC COMMA
060D ARABIC DATE SEPARATOR
07F8 NKO COMMA
2044 FRACTION SLASH
FE10 PRESENTATION FORM FOR VERTICAL COMMA
FE13 PRESENTATION FORM FOR VERTICAL COLON
FE14 PRESENTATION FORM FOR VERTICAL SEMICOLON

When not used in a numeric context, infix separators are sentence-ending punctuation. Therefore they always prevent breaks before.

Note: FIGURE SPACE, not being a punctuation mark, has been given the line break class GL.

JL: Hangul L Jamo (B)

The JL line break class consists of all characters of Hangul Syllable Type L.

Conjoining jamos form Korean Syllable Blocks, which are kept together; see [Boundaries]. Korean uses space-based line breaking in many styles of documents. To support these, Hangul syllables and conjoining jamo need to be tailored to use class AL. The default in this specification is class ID, which supports the case of Korean documents not using space-based line breaking. See Section 8.1, Types of Tailoring. See also JT, JV, H2, and H3.

JT: Hangul T Jamo (A)

The JT line break class consists of all characters of Hangul Syllable Type T. See also JL, JV, H2, and H3.

JV: Hangul V Jamo (XA/XB)

The JV line break class consists of all characters of Hangul Syllable Type V. See also JL, JT, H2, and H3.

LF: Line Feed (A) (Non-tailorable)

000A LINE FEED (LF)

There is a mandatory break after any LF character, but see the discussion under BK.

NL: Next Line (A) (Non-tailorable)

0085 NEXT LINE (NEL)

The NL class acts like BK in all respects (there is a mandatory break after any NEL character). It cannot be tailored, but implementations are not required to support the NEL character; see the discussion under BK.

NS: Nonstarters (XB)

Nonstarter characters cannot start a line, but unlike CL they may allow a break in some contexts when they follow one or more space characters. Nonstarters include:

17D6 KHMER SIGN CAMNUC PII KUUH
203C DOUBLE EXCLAMATION MARK
203D INTERROBANG
2047 DOUBLE QUESTION MARK
2048 QUESTION EXCLAMATION MARK
2049 EXCLAMATION QUESTION MARK
3005 IDEOGRAPHIC ITERATION MARK
301C WAVE DASH
303C MASU MARK
303B VERTICAL IDEOGRAPHIC ITERATION MARK
309B..309E KATAKANA-HIRAGANA VOICED SOUND MARK..HIRAGANA VOICED ITERATION MARK
30A0 KATAKANA-HIRAGANA DOUBLE HYPHEN
30FB..30FE KATAKANA MIDDLE DOT..KATAKANA VOICED ITERATION MARK
A015 YI SYLLABLE WU (misnomer for YI SYLLABLE ITERATION MARK)
FE54..FE55 SMALL SEMICOLON..SMALL COLON
FF1A..FF1B FULLWIDTH COLON..FULLWIDTH SEMICOLON
FF65 HALFWIDTH KATAKANA MIDDLE DOT
FF70 HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK
FF9E..FF9F HALFWIDTH KATAKANA VOICED SOUND MARK..HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK

plus all Hiragana, Katakana, and Halfwidth Katakana “small” characters.

Note: Optionally, the NS restriction may be relaxed and some or all characters treated like ID to achieve a more permissive style of line breaking, especially in some East Asian document styles.

For additional information about U+30A0 KATAKANA-HIRAGANA DOUBLE HYPHEN, see Section 5.5, Use of Double Hyphen.

NU: Numeric (XP)

These characters behave like ordinary characters (AL) in the context of most characters but activate the prefix and postfix behavior of prefix and postfix characters.

Numeric characters consist of decimal digits (all characters of General_Category Nd), except those with East_Asian_Width F (Fullwidth), plus these characters:

066B ARABIC DECIMAL SEPARATOR
066C ARABIC THOUSANDS SEPARATOR

Unlike IS characters, the Arabic numeric punctuation does not occur as sentence terminal punctuation outside numbers.
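The membership test for NU can be sketched with Python's standard `unicodedata` module. The function name `is_numeric_class` is hypothetical. Note that `unicodedata` reflects whichever Unicode version ships with the Python interpreter, which is an assumption that may differ from the version of this annex, so results for recently added digits may not match.

```python
import unicodedata

# The two Arabic separators listed explicitly for class NU.
NU_EXTRA = {"\u066B", "\u066C"}

def is_numeric_class(ch):
    """Approximate membership in line break class NU."""
    if ch in NU_EXTRA:
        return True
    # Decimal digits, except those with East_Asian_Width F (Fullwidth).
    return (unicodedata.category(ch) == "Nd"
            and unicodedata.east_asian_width(ch) != "F")
```

With this sketch, U+0037 DIGIT SEVEN and U+0663 ARABIC-INDIC DIGIT THREE are accepted, whereas U+FF17 FULLWIDTH DIGIT SEVEN is rejected because it belongs to the WIDE DIGITS listed under class ID.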

OP: Opening Punctuation (XA)

The opening character of any set of paired punctuation should be kept with the following character. This is desirable, even when there are intervening space characters, so as to prevent the appearance of a bare opening punctuation mark at the end of a line. The OP line break class consists of all characters of General_Category Ps in the Unicode Character Database, plus

00A1 INVERTED EXCLAMATION MARK
00BF INVERTED QUESTION MARK
2E18 INVERTED INTERROBANG

Note: The first two of these characters used to be classed AI based on their East_Asian_Width assignment of A. Such characters are normally resolved to either ID or AL. However, the characters listed above are used as punctuation marks in Spanish, where they would behave more like a character of class OP.

PO: Postfix (Numeric) (XB)

Characters that usually follow a numerical expression may not be separated from preceding numeric characters or preceding closing characters, even if one or more space characters intervene. For example, there is no break opportunity in “(12.00) %”.

Some of these characters (in particular, degree sign and percent sign) can appear on both sides of a numeric expression. Therefore the line breaking algorithm by default does not break between PO and numbers or letters on either side.

The list of postfix characters is:

0025 PERCENT SIGN
00A2 CENT SIGN
00B0 DEGREE SIGN
060B AFGHANI SIGN
066A ARABIC PERCENT SIGN
2030 PER MILLE SIGN
2031 PER TEN THOUSAND SIGN
2032..2037 PRIME..REVERSED TRIPLE PRIME
20A7 PESETA SIGN
2103 DEGREE CELSIUS
2109 DEGREE FAHRENHEIT
FDFC RIAL SIGN
FE6A SMALL PERCENT SIGN
FF05 FULLWIDTH PERCENT SIGN
FFE0 FULLWIDTH CENT SIGN

Alphabetic characters are also widely used as unit designators in a postfix position. For purposes of line breaking, their classification as alphabetic is sufficient to keep them together with the preceding number.

PR: Prefix (Numeric) (XA)

Characters that usually precede a numerical expression may not be separated from following numeric characters or following opening characters, even if a space character intervenes. For example, there is no break opportunity in “$ (100.00)”.

Many currency signs can appear on both sides, or even the middle, of a numeric expression. Therefore the line breaking algorithm, by default, does not break between PR and numbers or letters on either side.

The PR line break class consists of all currency symbols (General_Category Sc) except as listed explicitly in PO, as well as the following:

002B PLUS SIGN
005C REVERSE SOLIDUS
00B1 PLUS-MINUS SIGN
2116 NUMERO SIGN
2212 MINUS SIGN
2213 MINUS-OR-PLUS SIGN

Note: Many currency symbols may be used either as prefix or as postfix, depending on local convention. For details on the conventions used, see [CLDR].
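The definition of PR combines a General_Category test with two explicit lists, which the following Python sketch makes concrete. The function name `is_prefix_class` and the set names are hypothetical. The excluded set holds the currency symbols that appear in the list for PO, and `unicodedata` reflects the Unicode version of the Python interpreter, which is an assumption that may differ from the version of this annex.

```python
import unicodedata

# Currency symbols (Sc) that are listed under PO and so excluded from PR.
PO_CURRENCY = {0x00A2, 0x060B, 0x20A7, 0xFDFC, 0xFFE0}

# Non-currency characters listed explicitly for PR.
PR_EXTRA = {0x002B, 0x005C, 0x00B1, 0x2116, 0x2212, 0x2213}

def is_prefix_class(ch):
    """Approximate membership in line break class PR."""
    cp = ord(ch)
    if cp in PR_EXTRA:
        return True
    return unicodedata.category(ch) == "Sc" and cp not in PO_CURRENCY
```

Under this sketch U+0024 DOLLAR SIGN and U+20AC EURO SIGN are prefix characters, while U+00A2 CENT SIGN is not, because it is listed under PO.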

QU: Ambiguous Quotation (XB/XA)

Some quotation characters can be opening or closing, or even both, depending on usage. The default is to treat them as both opening and closing. This will prevent some breaks that might have been legal for a particular language or usage, such as between a closing quote and a following opening punctuation.

Note: If language information is available, it can be used to determine which character is used as the opening quote and which as the closing quote. See the information in Section 6.2, General Punctuation, in [Unicode5.0]. In such a case, the quotation marks could be tailored to either OP or CL depending on their actual usage.

The QU line break class consists of characters of General_Category Pf or Pi in the Unicode Character Database as well as

0022 QUOTATION MARK
0027 APOSTROPHE
275B HEAVY SINGLE TURNED COMMA QUOTATION MARK ORNAMENT
275C HEAVY SINGLE COMMA QUOTATION MARK ORNAMENT
275D HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT
275E HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT
2E00..2E01 RIGHT ANGLE SUBSTITUTION MARKER..RIGHT ANGLE DOTTED SUBSTITUTION MARKER
2E06..2E08 RAISED INTERPOLATION MARKER..DOTTED TRANSPOSITION MARKER
2E0B RAISED SQUARE

U+23B6 BOTTOM SQUARE BRACKET OVER TOP SQUARE BRACKET is subtly different from the others in this class, in that it is both an opening and a closing punctuation character at the same time. However, its use is limited to certain vertical text modes in terminal emulation. Instead of creating a one-of-a-kind class for this rarely used character, assigning it to the QU class approximates the intended behavior.

SA: Complex-Context Dependent (South East Asian) (P)

Runs of these characters require morphological analysis to determine break opportunities. This is similar to, for example, a hyphenation algorithm. For the characters that have this property, no break opportunities will be found otherwise. Therefore complex context analysis, often involving dictionary lookup of some form, is required to determine non-emergency line breaks.

If such analysis is not available, it is recommended to treat them as AL.

Note: These characters can be mapped into their equivalent line breaking classes as the result of dictionary lookup, thus permitting a logical separation of this algorithm from the morphological analysis.

The class SA consists of all characters of General_Category Cf, Lo, Lm, Mn, or Mc in the following ranges, except as noted elsewhere:

0E00..0E7F Thai
0E80..0EFF Lao
1000..109F Myanmar
1780..17FF Khmer
1950..197F Tai Le
1980..19DF New Tai Lue
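The definition above combines block ranges with a General_Category filter, which can be sketched in Python as follows. The function name `is_sa_candidate` is hypothetical. The sketch does not apply the "except as noted elsewhere" exclusions, and `unicodedata` reflects the Unicode version of the Python interpreter, which is an assumption that may differ from the version of this annex.

```python
import unicodedata

# Block ranges listed for class SA.
SA_RANGES = [
    (0x0E00, 0x0E7F),  # Thai
    (0x0E80, 0x0EFF),  # Lao
    (0x1000, 0x109F),  # Myanmar
    (0x1780, 0x17FF),  # Khmer
    (0x1950, 0x197F),  # Tai Le
    (0x1980, 0x19DF),  # New Tai Lue
]
SA_CATEGORIES = {"Cf", "Lo", "Lm", "Mn", "Mc"}

def is_sa_candidate(ch):
    """True if ch lies in a listed range and has a listed General_Category."""
    cp = ord(ch)
    in_range = any(low <= cp <= high for low, high in SA_RANGES)
    return in_range and unicodedata.category(ch) in SA_CATEGORIES
```

The category filter is what keeps punctuation and digits in these blocks out of the class: U+0E01 THAI CHARACTER KO KAI qualifies, whereas U+0E5A THAI CHARACTER ANGKHANKHU and U+0E50 THAI DIGIT ZERO do not.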

SG: Surrogates (XP) (Non-tailorable)

Line break class SG comprises all code points with General_Category Cs. The line breaking behavior of isolated surrogates is undefined. In UTF-16, paired surrogates represent non-BMP code points. Such code points must be resolved before assigning line break properties. In UTF-8 and UTF-32, surrogate code points represent corrupted data and their line break behavior is undefined.

Note: The use of this line breaking class is deprecated. It was of limited usefulness for UTF-16 implementations that did not support characters beyond the BMP. The correct implementation is to resolve a pair of surrogates into a supplementary character before line breaking.
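Resolving surrogate pairs before classification can be sketched in Python as follows. The function names `combine_surrogates` and `resolve_utf16` are hypothetical. The input is assumed to be a list of UTF-16 code units given as integers, and isolated surrogates are passed through unchanged, because their behavior is undefined here.

```python
def combine_surrogates(high, low):
    """Combine a high and a low surrogate into one supplementary code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def resolve_utf16(units):
    """Turn UTF-16 code units into code points, pairing surrogates."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            code_points.append(combine_surrogates(unit, units[i + 1]))
            i += 2
        else:
            # BMP code unit, or an isolated surrogate left as is.
            code_points.append(unit)
            i += 1
    return code_points
```

For example, the pair <D800, DD00> resolves to U+10100 AEGEAN WORD SEPARATOR LINE, which can then be given its own line break class instead of SG.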

SP: Space (A) (Non-tailorable)

The space characters are used as explicit break opportunities; they allow line breaks before most other characters. However, spaces at the end of a line are ordinarily not measured for fit. If there is a sequence of space characters, and breaking after any of the space characters would result in the same visible line, then the line breaking position after the last space character in the sequence is the locally optimal one. In other words, when the last character measured for fit is before the space character, any number of space characters are kept together invisibly on the previous line and the first non-space character starts the next line.

0020 SPACE (SP)

Note: By default, SPACE, but none of the other breaking spaces, is used in determining an indirect break. For other breaking space characters, see BA.
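The treatment of a run of spaces at a break can be sketched in Python as follows. The function names `measured_width` and `break_after_space_run` are hypothetical, and width is counted in characters, which is a simplifying assumption; an actual layout system would measure glyph advances.

```python
SPACE = "\u0020"

def measured_width(line):
    """Width used for fitting: trailing SPACE characters are not measured."""
    return len(line.rstrip(SPACE))

def break_after_space_run(text, first_space):
    """Return the offset just after the run of SPACEs starting at first_space."""
    end = first_space
    while end < len(text) and text[end] == SPACE:
        end += 1
    return end
```

For the text "ab", three spaces, "cd", the break position after the run starting at offset 2 is offset 5. The first line then holds "ab" plus the three spaces with a measured width of 2, and "cd" starts the next line.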

SY: Symbols Allowing Break After (A)

The SY line breaking property is intended to provide a break opportunity after, except in front of digits, so as to not break “1/2” or “06/07/99”.

002F SOLIDUS

URLs are now so common in regular plain text that they need to be taken into account when assigning general-purpose line breaking properties. Slash (solidus) is allowed as an additional, limited break opportunity to improve layout of Web addresses. As a side effect, some common abbreviations such as “w/o” or “A/S”, which normally would not be broken, acquire a line break opportunity. The recommendation in this case is for the layout system not to utilize a line break opportunity allowed by SY unless the distance between it and the next line break opportunity exceeds an implementation-defined minimal distance.
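One way a layout system might apply this recommendation is sketched in Python below. The function name `filter_solidus_breaks` is hypothetical. The sketch assumes that break opportunities arrive as sorted (offset, class) pairs, that distance is counted in characters, and that an SY opportunity with no later opportunity is kept; all three are assumptions rather than requirements of this annex.

```python
def filter_solidus_breaks(opportunities, min_distance):
    """Drop SY break opportunities that are too close to the next one.

    opportunities: sorted list of (offset, line_break_class) pairs.
    min_distance: implementation-defined minimal distance.
    """
    kept = []
    for index, (offset, cause) in enumerate(opportunities):
        if cause == "SY" and index + 1 < len(opportunities):
            next_offset = opportunities[index + 1][0]
            # Use the SY break only if the distance exceeds the minimum.
            if next_offset - offset <= min_distance:
                continue
        kept.append((offset, cause))
    return kept
```

With a minimal distance of 3, the opportunity after the solidus in “w/o it” is dropped because the next opportunity follows two characters later, while the opportunity after a solidus in a long Web address is retained.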

Note: Normally, symbols are treated as AL. However, symbols can be added to this line breaking class or classes BA, BB, and B2 by tailoring. This can be used to allow additional line breaks, for example, after “=”. Mathematics requires additional specifications for line breaking, which are outside the scope of this annex.

WJ: Word Joiner (XB/XA) (Non-tailorable)

These characters glue together left and right neighbor characters such that they are kept on the same line.

2060 WORD JOINER (WJ)
FEFF ZERO WIDTH NO-BREAK SPACE (ZWNBSP)

The word joiner character is the preferred choice for an invisible character to keep other characters together that would otherwise be split across the line at a direct break. The character FEFF has the same effect, but because it is also used in an unrelated way as a byte order mark, the use of the WJ as the preferred interword glue simplifies the handling of FEFF.

By definition, WJ and ZWNBSP take precedence over the action of SP, but not ZW.

XX: Unknown (XP)

The XX line break class consists of all characters with General_Category Co and all code points with General_Category Cn.

Unassigned code positions, private-use characters, and characters for which reliable line breaking information is not available are assigned this line breaking property by default. The default behavior for this class is identical to class AL. Users can manually insert ZWSP or word joiner around characters of class XX to allow or prevent breaks as needed.

In addition, implementations can override or tailor this default behavior, for example by assigning characters the property ID or another class. Doing so may give better default behavior for their users. There are other possible means of determining the desired behavior of private-use characters. For example, one implementation might treat any private-use character in ideographic context as ID, while another implementation might support a method for assigning specific properties to specific definitions of private-use characters. The details of such use of private-use characters are outside the scope of this standard.

For supplementary characters, a useful default is to treat characters in the range 10000..1FFFD as AL and characters in the ranges 20000..2FFFD and 30000..3FFFD as ID, until the implementation can be revised to take into account the actual line breaking properties for these characters.
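The fallback described above amounts to a simple range check. The sketch below assumes the implementation consults it only for code points whose real line breaking property it does not yet know; the enum and the function name `default_supplementary_class` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of line breaking classes used by this sketch. */
enum lb_class { LB_AL, LB_ID, LB_XX };

/* Fallback class for a supplementary code point whose actual line
 * breaking property is not yet known to the implementation. */
enum lb_class default_supplementary_class(uint32_t cp)
{
    assert(cp <= 0x10FFFF);                 /* must be a Unicode code point */
    if (cp >= 0x10000 && cp <= 0x1FFFD)
        return LB_AL;                       /* range 10000..1FFFD */
    if ((cp >= 0x20000 && cp <= 0x2FFFD) ||
        (cp >= 0x30000 && cp <= 0x3FFFD))
        return LB_ID;                       /* ranges 20000..2FFFD, 30000..3FFFD */
    return LB_XX;                           /* not covered by this default */
}
```

Code points outside the three ranges are returned as XX so that the caller can apply its normal resolution for unknown characters.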

For more information on handling default property values for unassigned characters, see the discussion on default property values in Section 5.3, Unknown and Missing Characters, of [Unicode5.0].

The line breaking rules in Section 6, Line Breaking Algorithm, and the pair table in Section 7, Pair Table-Based Implementation, assume that all unknown characters have been assigned one of the other line breaking classes, such as AL, as part of assigning line breaking classes to the input characters.

Implementations that do not support a given character should also treat it as unknown (XX).

ZW: Zero Width Space (A) (Non-tailorable)

200B ZERO WIDTH SPACE (ZWSP)

This character is used to enable additional (invisible) break opportunities wherever SPACE cannot be used. As its name implies, it normally has no width. However, its presence between two characters does not prevent increased letter spacing in justification.

5.2 Dictionary Usage

Dictionaries follow specific conventions that guide their use of special characters to indicate features of the terms they list. Marks used for some of these conventions may occur near line break opportunities and therefore interact with line breaking. For example, in one dictionary a natural hyphen in a word becomes a tilde dash when the word is split.

Examples of conventions used in several dictionaries are briefly described in this subsection. Where possible, the line breaking properties for characters commonly used in dictionaries have been assigned to accommodate these and similar conventions by default. However, implementing the full conventions in dictionaries requires tailoring of line break classes and rules or other types of special support.

Looking up the noun “syllable” in eight dictionaries yields eight different conventions:

Dictionary of the English Language (Samuel Johnson, 1843): SY´LLABLE, where ´ is an oversized U+02B9 and follows the vowel of the main syllable (not the syllable itself).

Oxford English Dictionary (1st Edition): si·lă'bl, where · is a slightly raised middle dot indicating the vowel of the stressed syllable (similar to Johnson’s acute). The letter ă is U+0103. The ' is an apostrophe.

Oxford English Dictionary (2nd Edition) has gone to IPA: 'sIləb(ə)l, where ' is U+02C8, I is U+026A, and ə is U+0259 (both times). The ' comes before the stressed syllable. The () indicate the schwa may be omitted.

Chambers English Dictionary (7th Edition): sil´ə-bl, where the stressed syllable is followed by ´ U+02B9, ə is U+0259, and - is a hyphen. When splitting a word like abate´- ment, the stress mark ´ goes after the stressed syllable, followed by the hyphen. No special convention is used when splitting at a hyphen.

BBC English Dictionary: sIləbl, where I is <U+026A, U+0332> and ə is U+0259. The vowel of the stressed syllable is underlined.

Collins Cobuild English Language Dictionary: sIləbə°l, where I is <U+026A, U+0332> and has the same meaning as in the BBC English Dictionary. The ə is U+0259 (both times). The ° is a U+2070 and indicates the schwa may be omitted.

Readers Digest Great Illustrated Dictionary: syl·la·ble (sílləb'l). The spelling of the word has hyphenation points (· is a U+2027) followed by phonetic spelling. The vowel of the stressed syllable is given an accent, rather than being followed by an accent. The ' is an apostrophe.

Webster’s 3rd New International Dictionary: syl·la·ble /'siləbəl/. The spelling of the word has hyphenation points (· is a U+2027) and is followed by phonetic spelling. The stressed syllable is preceded by ' U+02C8. The ə’s are schwas as usual. Webster’s splits words at the end of a line with a normal hyphen. A U+2E17 DOUBLE OBLIQUE HYPHEN indicates that a hyphenated word is split at the hyphen.

Some dictionaries use a character that looks like a vertical series of four dots to indicate places where there is a syllable, but no allowable break. This can be represented by a sequence of U+205E VERTICAL FOUR DOTS followed by U+2060 WORD JOINER.

5.3 Use of Hyphen

The rules for treating hyphens in line breaking vary by language. In many instances, these rules are not supported as such in the algorithm, but the correct appearance can be realized by using a non-breaking hyphen.

Some languages and some transliteration systems use a hyphen at the first position in a word. For example, the Finnish orthography uses a hyphen at the start of a word in certain types of compounds of the form xxx yyy -zzz (where xxx yyy is a two-word expression that acts as the first part of a compound noun, with zzz as the second part). A line break after the hyphen is not allowed here; therefore, instead of a regular hyphen, U+2011 NON-BREAKING HYPHEN should be used.

There are line breaking conventions that modify the appearance of a line break when the line break opportunity is based on an explicit hyphen. In standard Polish orthography, explicit hyphens are always promoted to the next line if a line break occurs at that location in the text. For example, given the sentence "Tam wisi czerwono-niebieska flaga" ("There hangs a red-blue flag"), if the optimal line break occurs at the location of the explicit hyphen, an additional hyphen will be displayed at the beginning of the next line like this:

Tam wisi czerwono-
-niebieska flaga.

The same convention is used in Portuguese, where the use of hyphens is common, because they are mandatory for verb forms that include a pronoun. Homographs or ambiguity may arise if hyphens are treated incorrectly: for example, "disparate" means "folly" while "dispara-te" means "fire yourself" (or "fires onto you"). Therefore the former needs to be line broken as

dispara-
te

and the latter as

dispara-
-te.

A recommended practice is to type <SHY, NBHY> instead of <HYPHEN> to achieve promotion of the hyphen to the next line. This practice is reportedly already common and supported by major text layout applications. See also Section 5.4, Use of Soft Hyphen.

5.4 Use of Soft Hyphen

Unlike U+2010 HYPHEN, which always has a visible rendition, the character U+00AD SOFT HYPHEN (SHY) is an invisible format character that merely indicates a preferred intraword line break position. If the line is broken at that point, then whatever mechanism is appropriate for intraword line breaks should be invoked, just as if the line break had been triggered by another hyphenation mechanism, such as a dictionary lookup. Depending on the language and the word, that may produce different visible results.

The following are a few examples of spelling changes. Each example shows the linebreak as “ / ” and any inserted hyphens. There are many other cases.

The inserted hyphen glyph can take a wide variety of shapes, as appropriate for the situation. Examples include shapes like U+2010 HYPHEN, U+058A ARMENIAN HYPHEN, U+180A MONGOLIAN NIRUGU, or U+1806 MONGOLIAN TODO SOFT HYPHEN.

When a SHY is used to represent a possible hyphenation location, the spelling is that of the word without hyphenation: “tug<SHY>gummi”. It is up to the line breaking implementation to make any necessary spelling changes when such a possible hyphenation is actually used.

Sometimes it is desirable to encode text that includes line breaking decisions and will not be further broken into lines. If such text includes hyphenations, the spelling needs to reflect the changes due to hyphenation: “tugg<U+2010>/ gummi”, including the appropriate character for any inserted hyphen. For a list of dash-like characters in Unicode, see Section 6.2, General Punctuation, in [Unicode5.0].

Hyphenation, and therefore the SHY, can be used with the Arabic script. If the rendering system breaks at that point, the display, including shaping, should be what is appropriate for the given language. For example, sometimes a hyphen-like mark is placed on the end of the line. This mark looks like a kashida, but is not connected to the letter preceding it. Instead, the appearance of the mark is as if it had been placed, and the line divided, after the contextual shapes for the line have been determined. For more information on shaping, see [Bidi] and Section 8.2, Arabic, of [Unicode5.0].

There are three types of hyphens: explicit hyphens, conditional hyphens, and dictionary-inserted hyphens resulting from a hyphenation process. There is no character code for the third kind of hyphen. If a distinction is desired, the fact that a hyphen is dictionary-inserted and not user-supplied can only be represented out of band or by using another control code instead of SHY.

The action of a hyphenation algorithm is equivalent to the insertion of a SHY. However, when a word contains an explicit SHY, it is customarily treated as overriding the action of the hyphenator for that word.

The sequence <SHY, NBHY> is given a particular interpretation; see Section 5.3, Use of Hyphen.

5.5 Use of Double Hyphen

In some fonts, notably Fraktur fonts, it is customary to use a double-stroke form of the hyphen, usually oblique. Such use is merely a font-based glyph variation and does not affect line breaking in any way. In texts using such a font, automatic hyphenation or SHY would also result in the display of a double-stroke, oblique hyphen.

In some dictionaries, such as Webster’s 3rd New International Dictionary, double-stroke, oblique hyphens are used to indicate an explicit hyphen at the end of the line, in other words, a hyphen that would be retained when the term shown is not line wrapped. To support this, it is not necessary to store a special character in the data; one merely needs to substitute the glyph of any ordinary hyphen that winds up at the end of a line. For example, if the shape of the special hyphen, as in this case, matches an existing character, such as U+2E17 DOUBLE OBLIQUE HYPHEN, that character can be substituted temporarily for display purposes by the line formatter. In such a convention, automatic hyphenation or SHY would result in the display of an ordinary hyphen without further substitution. (See also Section 5.3, Use of Hyphen.)

Certain linguistic notations make use of a double-stroke, oblique hyphen to indicate specific features. The U+2E17 DOUBLE OBLIQUE HYPHEN character used in this case is not a hyphen and does not represent a line break opportunity. Automatic hyphenation or SHY would result in the display of an ordinary hyphen.

U+30A0 KATAKANA-HIRAGANA DOUBLE HYPHEN is used in scientific notation, for example, to mark the presence of a space that would otherwise have been lost in transcribing text, such as the name of a chemical compound, into Katakana. In such notation, ordinary hyphens are retained.

5.6 Tibetan Line Breaking

The Tibetan script uses spaces sparingly, relying instead on the tsheg. There is no punctuation equivalent to a period in Tibetan; Tibetan shad characters indicate the end of a “phrase,” not a sentence. “Phrases” are often metrical (that is, written after every N syllables), and a new sentence can often start within the middle of a phrase. Sentence boundaries need to be determined grammatically rather than by punctuation.

Traditionally there is nothing akin to a paragraph in Tibetan text. It is typical to have many pages of text without a paragraph break, that is, without an explicit line break. The closest thing to a paragraph in Tibetan is a new section or topic starting with U+0F12 or U+0F08. However, these occur inline: one section ends and a new one starts on the same line, and the new section is marked only by the presence of one of these characters.

Some modern books, newspapers, and magazines format text more like English, with a break before each section or topic and (often) the title of the section on a separate line. Where this is done, authors insert an explicit line break. Western punctuation (full stop, question mark, exclamation mark, comma, colon, semicolon, quotes) is starting to appear in Tibetan documents, particularly those published in India, Bhutan, and Nepal. Because there are no formal rules for their use in Tibetan, they get treated generically by default. In Tibetan documents published in China, CJK bracket and punctuation characters occur frequently; it is recommended to treat these as in Chinese written horizontally.

Note: The detailed rules for formatting Tibetan texts are complex, and the original assignment of line break classes was found to be wholly insufficient for the purpose. In [Unicode4.1], the assignment of line break classes for Tibetan was revised significantly in an attempt to better model Tibetan line breaking behavior. No new rules or line break classes were added.

The set of line break classes for Tibetan is expected to provide a good starting point, even though there is limited practical experience in their implementation. As more experience is gained, some modifications, possibly including new rules or additional line break classes, can be expected.

It is the stated intention of the Unicode Consortium to review these assignments in a future version and to furnish a more detailed and complete description of Tibetan line breaking and line formatting behavior.

5.7 Word Separator Characters

Visible word separator characters may behave in one of three ways at line breaks. As an example, consider the text “The:quick:brown:fox:jumped.”, where the colon (:) represents a visible word separator, with a break between "brown" and "fox". The desired visual appearance could be one of the following:

1. suppress the visible word separator

The:quick:brown
fox:jumped.

2. break before the visible word separator

The:quick:brown
:fox:jumped.

3. break after the visible word separator

The:quick:brown:
fox:jumped.

Both (2) and (3) can be expressed with the Unicode Line Breaking Algorithm by tailoring the Line Break property value for the word separator character to be Break Before or Break After, respectively.

For case (1), the line break opportunity is positioned after the word separator character, as in case (3), but the visual display of the character is suppressed. The means by which a line layout and display process inhibits the visible display of the separator character are outside of the scope of the Line Break algorithm. U+1680 OGHAM SPACE MARK is an example of a character which may exhibit this behavior.

6 Line Breaking Algorithm

Unicode Standard Annex #29, “Text Boundaries” [Boundaries], describes a particular method for boundary detection. It is based on a set of hierarchical rules and character classifications. That method is well suited for implementation of some of the advanced heuristics for line breaking.

A slightly simplified implementation of such an algorithm can be devised that uses a two-dimensional table to resolve break opportunities between pairs of characters. It is described in Section 7, Pair Table-Based Implementation.

The line breaking algorithm presented in this section can be expressed in a series of rules that take line breaking classes defined in Section 5.1, Description of Line Breaking Properties, as input. The title of each rule contains a mnemonic summary of the main effect of the rule. The formal statement of each line breaking rule consists either of a remap rule or of one or more regular expressions containing one or more line breaking classes and one of three special symbols indicating the type of line break opportunity:

! Mandatory break at the indicated position

× No break allowed at the indicated position

÷ Break allowed at the indicated position

The rules are applied in order. That is, there is an implicit “otherwise” at the front of each rule following the first. It is possible to construct alternate sets of such rules that are fully equivalent. To be equivalent, an alternate set of rules must have the same effect.

The distinction between a direct break and an indirect break as defined in Section 2, Definitions, is handled in rule LB18, which explicitly considers the effect of SP. Because rules are applied in order, allowing breaks following SP in rule LB18 implies that any prohibited break in rules LB19–LB30 is equivalent to an indirect break.

The examples for each rule use representative characters, where ‘H’ stands for an ideograph, ‘h’ for small kana, and ‘9’ for digits. Except where a rule contains no expressions, the italicized text of the rule is intended merely as a handy summary.

The algorithm consists of a part for which tailoring is prohibited and a freely tailorable part.

6.1 Non-tailorable Line Breaking Rules

The rules in this subsection and the membership in the classes BK, CM, CR, GL, LF, NL, SP, WJ, and ZW define behavior that is required of all line break implementations; see Section 4, Conformance.

Resolve line breaking classes:

LB1  Assign a line breaking class to each code point of the input. Resolve AI, CB, SA, SG, and XX into other line breaking classes depending on criteria outside the scope of this algorithm.

In the absence of such criteria, it is recommended that classes AI, SA, SG, and XX be resolved to AL, except that characters of class SA that have General_Category Mn or Mc be resolved to CM (see SA). Unresolved class CB is handled in rule LB20.
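The recommended defaults for rule LB1 can be sketched as a single pass over the class array. This is only an illustration under assumptions: the enums are hypothetical stand-ins, and the General_Category of each code point is assumed to be available in a parallel array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of line breaking classes; LB_OTHER stands for the rest. */
enum lb_class { LB_AI, LB_AL, LB_CB, LB_CM, LB_SA, LB_SG, LB_XX, LB_OTHER };

/* Hypothetical General_Category summary: only Mn and Mc matter here. */
enum gen_cat { GC_MN, GC_MC, GC_OTHER };

/* Applies the recommended LB1 defaults in place to n code points. */
void resolve_lb1_defaults(enum lb_class *cls, const enum gen_cat *gc, size_t n)
{
    assert(n == 0 || (cls != NULL && gc != NULL));
    for (size_t i = 0; i < n; i++) {
        switch (cls[i]) {
        case LB_AI:
        case LB_SG:
        case LB_XX:
            cls[i] = LB_AL;
            break;
        case LB_SA:
            /* SA marks (Mn or Mc) become CM, other SA characters become AL */
            cls[i] = (gc[i] == GC_MN || gc[i] == GC_MC) ? LB_CM : LB_AL;
            break;
        default:
            break;  /* CB stays unresolved here; rule LB20 handles it */
        }
    }
}
```

An implementation with better criteria, such as dictionary-based analysis for SA, would replace the corresponding branch rather than run after it.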

Start and end of text:

There are two special logical positions: sot, which occurs before the first character in the text, and eot, which occurs after the last character in the text. Thus an empty string would consist of sot followed immediately by eot. With these two definitions, the line break rules for start and end of text can be specified as follows:

LB2  Never break at the start of text.

sot ×

LB3  Always break at the end of text.

! eot

These two rules are designed to deal with degenerate cases, so that there is at least one character on each line, and at least one line break for the whole text. Emergency line breaking behavior usually also allows line breaks anywhere on the line if a legal line break cannot be found. This has the effect of preventing text from running into the margins.

Mandatory breaks:

A hard line break can consist of BK or a Newline Function (NLF) as described in Section 5.8, Newline Guidelines, of [Unicode5.0]. These three rules are designed to handle the line ending and line separating characters as described there.

LB4  Always break after hard line breaks.

BK !

LB5  Treat CR followed by LF, as well as CR, LF, and NL as hard line breaks.

CR × LF

CR !

LF !

NL !

LB6  Do not break before hard line breaks.

× ( BK | CR | LF | NL )
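Rules LB4 through LB6 can be sketched as an ordered chain of tests on one pair of adjacent classes. The sketch assumes classes have already been assigned to an array; the enums and the name `hard_break_after` are hypothetical, and a result of BRK_UNDECIDED simply means that later rules have to decide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of line breaking classes; LB_OTHER stands for the rest. */
enum lb_class { LB_BK, LB_CR, LB_LF, LB_NL, LB_OTHER };

enum brk { BRK_MANDATORY, BRK_PROHIBITED, BRK_UNDECIDED };

/* Decides the boundary between cls[i] and cls[i + 1] using only LB4-LB6.
 * The order of the tests mirrors the order of the rules. */
enum brk hard_break_after(const enum lb_class *cls, size_t n, size_t i)
{
    assert(cls != NULL && i + 1 < n);
    enum lb_class before = cls[i];
    enum lb_class after = cls[i + 1];

    if (before == LB_BK)
        return BRK_MANDATORY;                   /* LB4: BK ! */
    if (before == LB_CR && after == LB_LF)
        return BRK_PROHIBITED;                  /* LB5: CR x LF */
    if (before == LB_CR || before == LB_LF || before == LB_NL)
        return BRK_MANDATORY;                   /* LB5: CR !, LF !, NL ! */
    if (after == LB_BK || after == LB_CR || after == LB_LF || after == LB_NL)
        return BRK_PROHIBITED;                  /* LB6 */
    return BRK_UNDECIDED;                       /* left to later rules */
}
```

Testing CR followed by LF before the single-character cases is what keeps a CR LF pair together as one hard line break.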

 

Explicit breaks and non-breaks:

LB7  Do not break before spaces or zero width space.

× SP

× ZW

LB8  Break after zero width space.

ZW ÷

Combining marks:

See also Section 9.2, Legacy Support for Space Character as Base for Combining Marks.

LB9  Do not break a combining character sequence; treat it as if it has the line breaking class of the base character in all of the following rules.

Treat X CM* as if it were X.

where X is any line break class except BK, CR, LF, NL, SP, or ZW.

At any possible break opportunity between CM and a following character, CM behaves as if it had the type of its base character. Note that despite the summary title of this rule it is not limited to standard combining character sequences. For the purposes of line breaking, sequences containing most of the control codes or layout control characters are treated like combining sequences.

LB10  Treat any remaining combining mark as AL.

Treat any remaining CM as if it were AL.

This catches the case where a CM is the first character on the line or follows SP, BK, CR, LF, NL, or ZW.
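Rules LB9 and LB10 can be sketched as a preprocessing pass that removes CM from the class array before the remaining rules run. This is one possible arrangement, not the annex's sample code: besides rewriting the class, the pass records in a hypothetical `glued` array that no break is allowed before a mark attached to a base, because the sequence X CM* has to behave as a single X.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of line breaking classes. */
enum lb_class { LB_BK, LB_CR, LB_LF, LB_NL, LB_SP, LB_ZW, LB_CM, LB_AL, LB_OP };

/* Classes that cannot serve as the base X in rule LB9. */
static int is_cm_base(enum lb_class c)
{
    return !(c == LB_BK || c == LB_CR || c == LB_LF ||
             c == LB_NL || c == LB_SP || c == LB_ZW);
}

/* Rewrites cls[] in place so that later rules never see CM.
 * glued[i] is set to 1 when position i continues the sequence of its
 * base (no break before i), and to 0 otherwise. */
void resolve_combining_marks(enum lb_class *cls, unsigned char *glued, size_t n)
{
    assert(n == 0 || (cls != NULL && glued != NULL));
    for (size_t i = 0; i < n; i++) {
        glued[i] = 0;
        if (cls[i] != LB_CM)
            continue;
        if (i > 0 && is_cm_base(cls[i - 1])) {
            cls[i] = cls[i - 1];    /* LB9: behave like the base X */
            glued[i] = 1;
        } else {
            cls[i] = LB_AL;         /* LB10: no usable base */
        }
    }
}
```

Because the pass runs left to right, a run of marks after SP resolves to one AL followed by marks glued to it, which matches treating the whole run as a single AL.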

Word joiner:

LB11  Do not break before or after Word joiner and related characters.

× WJ

WJ ×

Non-breaking characters:

LB12  Do not break after NBSP and related characters.

GL ×

6.2 Tailorable Line Breaking Rules

The following rules and the classes referenced in them provide a reasonable default set of line break opportunities. Implementations SHOULD implement them unless alternate approaches produce better results for some classes of text or applications. When using alternative rules or algorithms, implementations must ensure that the mandatory breaks, break opportunities, and non-break positions determined by the algorithm and rules of Section 6.1, Non-tailorable Line Breaking Rules, are preserved. See Section 4, Conformance.

Non-breaking characters:

LB12a  Do not break before NBSP and related characters, except after spaces and hyphens.

[^SP BA HY] × GL

The expression [^SP, BA, HY] designates any line break class other than SP, BA, or HY. The symbol ^ is used, instead of !, to avoid confusion with the use of ! to indicate an explicit break. Unlike the case for WJ, inserting a SP overrides the non-breaking nature of a GL. Allowing a break after BA or HY matches widespread implementation practice and supports a common way of handling special line breaking of explicit hyphens, such as in Polish and Portuguese. See Section 5.3, Use of Hyphen.

Opening and closing:

These have special behavior with respect to spaces, and therefore come before rule LB18.

LB13  Do not break before ‘]’ or ‘!’ or ‘;’ or ‘/’, even after spaces.

× CL

× EX

× IS

× SY

LB14  Do not break after ‘[’, even after spaces.

OP SP* ×

LB15  Do not break within ‘”[’, even with intervening spaces.

QU SP* × OP

For more information on this rule, see the note in the description for the QU class.

LB16  Do not break between closing punctuation and a nonstarter (lb=NS), even with intervening spaces.

CL SP* × NS

LB17  Do not break within ‘——’, even with intervening spaces.

B2 SP* × B2

Spaces:

LB18  Break after spaces.

SP ÷

Special case rules:

LB19  Do not break before or after quotation marks, such as ‘ ” ’.

× QU

QU ×

LB20  Break before and after unresolved CB.

÷ CB

CB ÷

Conditional breaks should be resolved external to the line breaking rules. However, the default action is to treat unresolved CB as breaking before and after.

LB21  Do not break before hyphen-minus, other hyphens, fixed-width spaces, small kana, and other non-starters, or after acute accents.

× BA

× HY

× NS

BB ×

LB22  Do not break between two ellipses, or between letters or numbers and ellipsis.

AL × IN

ID × IN

IN × IN

NU × IN

Examples: ‘9...’, ‘a...’, ‘H...’

Numbers:

Do not break alphanumerics.

LB23  Do not break within ‘a9’, ‘3a’, or ‘H%’.

ID × PO

AL × NU

NU × AL

LB24  Do not break between prefix and letters or ideographs.

PR × ID

PR × AL

PO × AL

In general, it is recommended to not break lines inside numbers of the form described by the following regular expression:

( PR | PO ) ? ( OP | HY ) ? NU ( NU | SY | IS ) * CL ? ( PR | PO ) ?

Examples:  $(12.35)    2,1234    (12)¢    12.54¢
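The regular expression above can be matched directly against a sequence of line breaking classes. The sketch below is a hand-written matcher under assumptions (a hypothetical enum and function name); it returns how many classes, starting at the first one, form a number that should be kept on one line, or 0 if no number starts there.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of line breaking classes. */
enum lb_class { LB_PR, LB_PO, LB_OP, LB_HY, LB_NU, LB_SY, LB_IS, LB_CL, LB_AL };

/* Matches (PR | PO)? (OP | HY)? NU (NU | SY | IS)* CL? (PR | PO)?
 * anchored at cls[0]. Returns the match length, or 0 if there is none. */
size_t match_number(const enum lb_class *cls, size_t n)
{
    assert(n == 0 || cls != NULL);
    size_t i = 0;

    if (i < n && (cls[i] == LB_PR || cls[i] == LB_PO))
        i++;                                    /* optional prefix or postfix */
    if (i < n && (cls[i] == LB_OP || cls[i] == LB_HY))
        i++;                                    /* optional opener or sign */
    if (i >= n || cls[i] != LB_NU)
        return 0;                               /* a digit is required */
    i++;
    while (i < n && (cls[i] == LB_NU || cls[i] == LB_SY || cls[i] == LB_IS))
        i++;                                    /* digits and separators */
    if (i < n && cls[i] == LB_CL)
        i++;                                    /* optional closer */
    if (i < n && (cls[i] == LB_PR || cls[i] == LB_PO))
        i++;                                    /* optional trailing sign */
    return i;
}
```

For the example “$(12.35)”, the class sequence PR OP NU NU IS NU NU CL is matched in full, so no break would be taken inside it.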

The default line breaking algorithm approximates this with the following rule. Note that some cases have already been handled, such as ‘9,’, ‘[9’. For a tailoring that supports the regular expression directly, as well as a key to the notation, see Section 8.2, Examples of Customization.

LB25  Do not break between the following pairs of classes relevant to numbers:

CL × PO

CL × PR

NU × PO

NU × PR

PO × OP

PO × NU

PR × OP

PR × NU

HY × NU

IS × NU

NU × NU

SY × NU

Example pairs: ‘$9’, ‘$[’, ‘$-’, ‘-9’, ‘/9’, ‘99’, ‘,9’, ‘9%’, ‘]%’

Korean syllable blocks

Conjoining jamo, Hangul syllables, or combinations of both form Korean Syllable Blocks. Such blocks are effectively treated as if they were Hangul syllables; no breaks can occur in the middle of a syllable block. See Unicode Standard Annex #29, “Text Boundaries” [Boundaries], for more information on Korean Syllable Blocks.

LB26  Do not break a Korean syllable.

JL × (JL | JV | H2 | H3)

(JV | H2) × (JV | JT)

(JT | H3) × JT

where the notation (JT | H3) means JT or H3. The effective line breaking class for the syllable block matches the line breaking class for Hangul syllables, which is ID by default. This is achieved by the following rule:

LB27  Treat a Korean Syllable Block the same as ID.

(JL | JV | JT | H2 | H3) × IN

(JL | JV | JT | H2 | H3) × PO

PR × (JL | JV | JT | H2 | H3)

When Korean uses SPACE for line breaking, the classes in rule LB26, as well as characters of class ID, are often tailored to AL; see Section 8, Customization.
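The three expressions of rule LB26 above can be sketched as a predicate on a pair of classes, plus a helper that measures a syllable block. The enum and both function names are hypothetical, and the sketch covers only LB26; it does not apply the ID-like treatment of rule LB27.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical subset of line breaking classes; LB_OTHER stands for the rest. */
enum lb_class { LB_JL, LB_JV, LB_JT, LB_H2, LB_H3, LB_OTHER };

/* Returns 1 if rule LB26 prohibits a break between before and after. */
int lb26_prohibits_break(enum lb_class before, enum lb_class after)
{
    switch (before) {
    case LB_JL:                     /* JL x (JL | JV | H2 | H3) */
        return after == LB_JL || after == LB_JV ||
               after == LB_H2 || after == LB_H3;
    case LB_JV:
    case LB_H2:                     /* (JV | H2) x (JV | JT) */
        return after == LB_JV || after == LB_JT;
    case LB_JT:
    case LB_H3:                     /* (JT | H3) x JT */
        return after == LB_JT;
    default:
        return 0;
    }
}

/* Length of the run starting at cls[0] that LB26 keeps together. */
size_t syllable_block_length(const enum lb_class *cls, size_t n)
{
    assert(n == 0 || cls != NULL);
    if (n == 0)
        return 0;
    size_t len = 1;
    while (len < n && lb26_prohibits_break(cls[len - 1], cls[len]))
        len++;
    return len;
}
```

A caller could treat each run returned by the helper as one unit of class ID, or of class AL when the Korean tailoring mentioned above is in effect.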

Finally, join alphabetic letters into words and break everything else.

LB28  Do not break between alphabetics (“at”).

AL × AL

LB29  Do not break between numeric punctuation and alphabetics (“e.g.”).

IS × AL

LB30  Withdrawn. In Unicode 5.0, rule LB30 was intended to prevent breaks in cases where a part of a word appears between delimiters, for example in “person(s)”. The rule was withdrawn because it prevented desirable breaks after certain Asian punctuation characters with class CL. See Example 9 of Section 8, Customization, for options for restoring the functionality.

LB31  Break everywhere else.

ALL ÷

÷ ALL

7 Pair Table-Based Implementation

A two-dimensional table can be used to resolve break opportunities between pairs of characters. This section defines such a table. The rows of the table are labeled with the possible values of the line breaking property of the leading character in the pair. The columns are labeled with the line breaking class for the following character of the pair. Each intersection is labeled with the resulting line break opportunity.

The Japanese standard JIS X 4051-1995 [JIS] provides an example of a similar table-based definition. However, it uses line breaking classes whose membership is not solely determined by the line breaking property (as in this annex), but in some cases by heuristic analysis or markup of the text.

The implementation provided here directly uses the line breaking classes defined previously.

7.1 Minimal Table

If two rows of the table have identical values and the corresponding columns also have identical values, then the two line breaking classes can be coalesced. For example, the JIS standard uses 20 classes, of which only 14 appear to be unique. Any minimal table representation is unique, except for trivial reordering of rows and columns. Minimal tables for which the rows and columns are sorted alphabetically can be mechanically compared for differences. This is in contrast to the rules, where identical results can be achieved by sets of rules that cannot be easily compared by looking at their textual representation. However, any set of rules that is equivalent to a minimal pair table can be used to automatically generate such a table, which can then be used for comparison. The rules in Section 6, Line Breaking Algorithm, can be expressed as minimal pair tables if the extended context is used as described below.

7.2 Extended Context

Most of the rules in Section 6, Line Breaking Algorithm, involve only pairs of characters, or they apply to a single line break class preceded or followed by any character. These rules can be represented directly in a pair table. However, rules LB14–LB17 require extended context to handle spaces.

By broadening the definition of a pair from B A, where B is the line breaking class before a break and A the one after, to B SP* A, where SP* is an optional run of space characters, the same table can be used to distinguish between cases where SP can or cannot provide a line break opportunity (that is, direct and indirect breaks). Rules equivalent to the ones given in Section 6, Line Breaking Algorithm, can be formulated without explicit use of SP by using % to express indirect breaks instead. These rules can then be simplified to involve only pairs of classes, that is, only constructions of the form:

B ÷ A

B % A

B × A

where either A or B may be empty. These simplified rules can be automatically translated into a pair table, as in Table 2. Line breaking analysis then proceeds by pair table lookup as explained below. (For readability in table layout, the symbol ^ is used in the table instead of × and _ is used instead of ÷.)

Rule LB9 requires extended context for handling combining marks. This extended context must also be built into the code that interprets the pair table. For convenience in detecting the condition where A = CM, the symbols # and @ are used in the pair table, instead of % and ^, respectively. See Section 7.5, Combining Marks.

7.3 Example Pair Table

Table 2 implements the line breaking behavior described in this annex, with the limitation that only context of the form B SP* A is considered. BK, CR, LF, NL, and SP classes are handled explicitly in the outer loop, as given in the code sample below. Pair context of the form B CM* can be handled by handling the special entries @ and # in the driving loop, as explained in Section 7.5, Combining Marks. Conjoining jamos are considered separately in Section 7.6, Conjoining Jamos. In Table 2, the rows are labeled with the B class and the columns are labeled with the A class.

Table 2.Example Pair Table

 OPCLQUGLNSEXSYISPRPONUALIDINHYBABBB2ZWCMWJH2H3JLJVJT
OP^^^^^^^^^^^^^^^^^^^@^^^^^^
CL_^%%^^^^%%____%%__^#^_____
QU^^%%%^^^%%%%%%%%%%^#^%%%%%
GL%^%%%^^^%%%%%%%%%%^#^%%%%%
NS_^%%%^^^______%%__^#^_____
EX_^%%%^^^______%%__^#^_____
SY_^%%%^^^__%___%%__^#^_____
IS_^%%%^^^__%%__%%__^#^_____
PR%^%%%^^^__%%%_%%__^#^%%%%%
PO%^%%%^^^__%%__%%__^#^_____
NU_^%%%^^^%%%%_%%%__^#^_____
AL_^%%%^^^__%%_%%%__^#^_____
ID_^%%%^^^_%___%%%__^#^_____
IN_^%%%^^^_____%%%__^#^_____
HY_^%_%^^^__%___%%__^#^_____
BA_^%_%^^^______%%__^#^_____
BB%^%%%^^^%%%%%%%%%%^#^%%%%%
B2_^%%%^^^______%%_^^#^_____
ZW__________________^_______
CM_^%%%^^^__%%_%%%__^#^_____
WJ%^%%%^^^%%%%%%%%%%^#^%%%%%
H2_^%%%^^^_%___%%%__^#^___%%
H3_^%%%^^^_%___%%%__^#^____%
JL_^%%%^^^_%___%%%__^#^%%%%_
JV_^%%%^^^_%___%%%__^#^___%%
JT_^%%%^^^_%___%%%__^#^____%

Resolved outside the pair table: AI,BK,CB,CRLF,NL,SA,SG,SP,XX

Table 2 uses the following notation:

^ denotes a prohibited break: B ^ A is equivalent to B SP* × A; in other words, never break before A and after B, even if one or more spaces intervene.

% denotes an indirect break opportunity: B % A is equivalent to B × A and B SP+ ÷ A; in other words, do not break before A, unless one or more spaces follow B.

@ denotes a prohibited break for combining marks: B @ A is equivalent to B SP* × A, where A is of class CM. For more details, see Section 7.5, Combining Marks.

# denotes an indirect break opportunity for combining marks following a space: B # A is equivalent to (B × A and B SP+ ÷ A), where A is of class CM.

_ denotes a direct break opportunity (equivalent to ÷ as defined above).
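One compact way to hold Table 2 in code is to store each row as a string of the notation characters just described and to index it by the class of the following character. This is a sketch of a possible storage layout, not the annex's sample code [Code14]; the enum, `pair_table`, `pair_table_is_well_formed`, and `pair_action` are hypothetical names, and the enum order is assumed to match the column order of Table 2.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical enum; the order must match the columns of Table 2. */
enum lb_class { LB_OP, LB_CL, LB_QU, LB_GL, LB_NS, LB_EX, LB_SY, LB_IS, LB_PR,
                LB_PO, LB_NU, LB_AL, LB_ID, LB_IN, LB_HY, LB_BA, LB_BB, LB_B2,
                LB_ZW, LB_CM, LB_WJ, LB_H2, LB_H3, LB_JL, LB_JV, LB_JT,
                LB_COUNT };

/* One string per row (class B); one character per column (class A). */
static const char *const pair_table[LB_COUNT] = {
    /* OP */ "^^^^^^^^^^^^^^^^^^^@^^^^^^",
    /* CL */ "_^%%^^^^%%____%%__^#^_____",
    /* QU */ "^^%%%^^^%%%%%%%%%%^#^%%%%%",
    /* GL */ "%^%%%^^^%%%%%%%%%%^#^%%%%%",
    /* NS */ "_^%%%^^^______%%__^#^_____",
    /* EX */ "_^%%%^^^______%%__^#^_____",
    /* SY */ "_^%%%^^^__%___%%__^#^_____",
    /* IS */ "_^%%%^^^__%%__%%__^#^_____",
    /* PR */ "%^%%%^^^__%%%_%%__^#^%%%%%",
    /* PO */ "%^%%%^^^__%%__%%__^#^_____",
    /* NU */ "_^%%%^^^%%%%_%%%__^#^_____",
    /* AL */ "_^%%%^^^__%%_%%%__^#^_____",
    /* ID */ "_^%%%^^^_%___%%%__^#^_____",
    /* IN */ "_^%%%^^^_____%%%__^#^_____",
    /* HY */ "_^%_%^^^__%___%%__^#^_____",
    /* BA */ "_^%_%^^^______%%__^#^_____",
    /* BB */ "%^%%%^^^%%%%%%%%%%^#^%%%%%",
    /* B2 */ "_^%%%^^^______%%_^^#^_____",
    /* ZW */ "__________________^_______",
    /* CM */ "_^%%%^^^__%%_%%%__^#^_____",
    /* WJ */ "%^%%%^^^%%%%%%%%%%^#^%%%%%",
    /* H2 */ "_^%%%^^^_%___%%%__^#^___%%",
    /* H3 */ "_^%%%^^^_%___%%%__^#^____%",
    /* JL */ "_^%%%^^^_%___%%%__^#^%%%%_",
    /* JV */ "_^%%%^^^_%___%%%__^#^___%%",
    /* JT */ "_^%%%^^^_%___%%%__^#^____%",
};

/* Returns 1 if every row has exactly one cell per class. */
int pair_table_is_well_formed(void)
{
    for (int b = 0; b < LB_COUNT; b++)
        if (strlen(pair_table[b]) != (size_t)LB_COUNT)
            return 0;
    return 1;
}

/* Returns '^', '%', '@', '#', or '_' for the pair (before, after). */
char pair_action(enum lb_class before, enum lb_class after)
{
    assert((unsigned)before < (unsigned)LB_COUNT &&
           (unsigned)after < (unsigned)LB_COUNT);
    return pair_table[before][after];
}
```

Keeping the rows as text makes the data easy to compare against the printed table; a production implementation would more likely store the cells as small integers.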

Note: In the online edition, hovering over the cells in a browser with tool-tips enabled reveals the rule number that determines the breaking status for the pair in question. When a pair must be tested both with and without intervening spaces, multiple rules are given. Hovering over a line breaking class name gives a representative member of the class and additional information. Clicking on any line break class name anywhere in the document jumps to the definition.

7.4 Sample Code

The following two sections provide sample code [Code14] that demonstrates how the pair table is used. For a complete implementation of the line breaking algorithm, if statements to handle the line breaking classes CR, LF, and NL need to be added. They have been omitted here for brevity, but see Section 7.7, Explicit Breaks.

The sample code assumes that the line breaking classes AI, CB, SG, and XX have been resolved according to rule LB1 as part of initializing the pcls array. The code further assumes that the complex line break analysis for characters with line break class SA is handled in function findComplexBreak, for which the following placeholder is given:

    // placeholder function for complex break analysis
    // cls  - resolved line break class, may differ from pcls[0]
    // pcls - pointer to array of line breaking classes (input)
    // pbrk - pointer to array of line breaking opportunities (output)
    // cch  - remaining length of input
    int findComplexBreak(enum break_class cls, enum break_class *pcls,
                         enum break_action *pbrk, int cch)
    {
        int ich;                            // declared here: used after the loop

        if (!cch)
            return 0;

        for (ich = 1; ich < cch; ich++) {
            // .. do complex break analysis here
            // and report any break opportunities in pbrk ..

            pbrk[ich-1] = PROHIBITED_BRK;   // by default, no break

            if (pcls[ich] != SA)
                break;
        }
        return ich;
    }

The entries in the example pair table correspond to the following enumeration. For diagnostic purposes, the sample code returns these values to indicate not only the location but also the type of rule that triggered a given break opportunity.

    enum break_action {
        DIRECT_BRK = 0,             // _ in table
        INDIRECT_BRK,               // % in table
        COMBINING_INDIRECT_BRK,     // # in table
        COMBINING_PROHIBITED_BRK,   // @ in table
        PROHIBITED_BRK,             // ^ in table
        EXPLICIT_BRK                // ! in rules
    };

Because the contexts involved in indirect breaks of the form B SP* A are of indefinite length, they need to be handled explicitly in the driver code. The sample implementation of a findLineBrk function below remembers the line break class for the last character seen, but skips any occurrence of SP without resetting this value. Once character A is encountered, a simple lookback is used to see if it is preceded by a SP. This lookback is necessary only if B % A. To handle the case of a SP following sot, it is necessary to set cls to a dummy value. Using WJ gives the correct result and, as required, is unaffected by any tailoring.

    // handle spaces separately, all others by table
    // pcls - pointer to array of line breaking classes (input)
    // pbrk - pointer to array of line break opportunities (output)
    // cch  - number of elements in the arrays (“count of characters”) (input)
    // ich  - current index into the arrays (variable) (returned value)
    // cls  - current resolved line break class for 'before' character (variable)
    int findLineBrk(enum break_class *pcls, enum break_action *pbrk, int cch)
    {
        if (!cch)
            return 0;

        enum break_class cls = pcls[0];     // class of 'before' character
        int ich;                            // declared here: used after the loop

        // treat SP at start of input as if it followed a WJ
        if (cls == SP)
            cls = WJ;

        // loop over all pairs in the string up to a hard break
        for (ich = 1; (ich < cch) && (cls != BK); ich++) {
            // to handle explicit breaks, replace code from "for" loop condition
            // above to comment below by code given in Section 7.7

            // handle spaces explicitly
            if (pcls[ich] == SP) {
                pbrk[ich-1] = PROHIBITED_BRK;   // apply rule LB7: × SP
                continue;                       // do not update cls
            }

            // handle complex scripts in a separate function
            if (pcls[ich] == SA) {
                ich += findComplexBreak(cls, &pcls[ich-1], &pbrk[ich-1],
                                        cch - (ich-1));
                if (ich < cch)
                    cls = pcls[ich];
                continue;
            }

            // lookup pair table information in brkPairs[before, after];
            enum break_action brk = brkPairs[cls][pcls[ich]];

            pbrk[ich-1] = brk;                  // save break action in output array

            if (brk == INDIRECT_BRK) {          // resolve indirect break
                if (pcls[ich - 1] == SP)        // if context is A SP + B
                    pbrk[ich-1] = INDIRECT_BRK;     // break opportunity
                else                            // else
                    pbrk[ich-1] = PROHIBITED_BRK;   // no break opportunity
            }

            // handle breaks involving a combining mark (see Section 7.5)

            // save cls of 'before' character (unless bypassed by 'continue')
            cls = pcls[ich];
        }
        pbrk[ich-1] = EXPLICIT_BRK;             // always break at the end

        return ich;
    }

The function returns all of the break opportunities in the array pointed to by pbrk, using the values in the table. On return, pbrk[ich] is the type of break after the character at index ich.

A common optimization in implementations is to determine only the nearest line break opportunity prior to the position of the first character that would cause the line to become overfull. Such an optimization requires backward traversal of the string instead of forward traversal as shown in the sample code.
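
As a simpler, related illustration, the sketch below searches backward through break actions that have already been computed, looking for the last position at which a line of limited length may end. It does not implement the backward-traversal optimization itself, which would avoid computing most of pbrk in the first place. The helper names allows_break and last_break_within are hypothetical, and measuring the line in characters rather than in rendered width is an assumption made purely to keep the example short.

```c
#include <stdbool.h>

/* Same values as the break_action enumeration in Section 7.4. */
enum break_action {
    DIRECT_BRK = 0,
    INDIRECT_BRK,
    COMBINING_INDIRECT_BRK,
    COMBINING_PROHIBITED_BRK,
    PROHIBITED_BRK,
    EXPLICIT_BRK
};

/* Hypothetical helper: true if a line may end after a character whose
   resolved entry in pbrk is 'brk'. After resolution by the driver code,
   only the two "prohibited" values rule out a break. */
bool allows_break(enum break_action brk)
{
    return brk != PROHIBITED_BRK && brk != COMBINING_PROHIBITED_BRK;
}

/* Hypothetical helper: return the index of the last character that may
   end a line of at most max_chars characters, or -1 if there is none.
   pbrk[i] describes the break after the character at index i, and cch is
   the number of valid entries (for example, the value returned by
   findLineBrk). */
int last_break_within(const enum break_action *pbrk, int cch, int max_chars)
{
    int limit = (max_chars < cch) ? max_chars : cch;
    for (int i = limit - 1; i >= 0; i--) {
        if (allows_break(pbrk[i]))
            return i;
    }
    return -1;
}
```

A return value of -1 means that no break opportunity fits; what to do in that case (for example, an emergency break) is outside the scope of the default algorithm.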

7.5 Combining Marks

The implementation of combining marks in the pair table presents an additional complication because rule LB9 defines a context X CM* that is of arbitrary length. There are some similarities to the way contexts of the form B SP* A that are involved in indirect breaks are evaluated. However, contexts of the form SP CM* or CM* SP also need to be handled, while rule LB10 requires some CM* to be treated like AL.

Implementing LB10. This rule can be reflected directly in the example pair table in Table 2 by assigning the same values in the row marked CM as in the row marked AL. Incidentally, this is equivalent to rewriting the rules LB11 through LB31 by duplicating any expression that contains an AL on its left-hand side with another expression that contains a CM. For example, in LB22

AL × IN

would become

AL × IN

CM × IN

Rewriting these rules as indicated here (and then deleting LB10) is fully equivalent to the original rules because rule LB9 already accounts for all CMs that are not supposed to be treated like AL. For a complete prescription, see Example 9 in Section 8.2, Examples of Customization.

Implementing LB9. Rule LB9 is implemented in the example pair table in Table 2 by assigning a special # entry in the column marked CM for all rows referring to a line break class that allows a direct or indirect break after itself. (Note that the intersection between the row for class ZW and the column for class CM must be assigned “_” because of rule LB8.) The # corresponds to a break_action value of COMBINING_INDIRECT_BRK, which triggers the following code in the sample implementation:

    else if (brk == COMBINING_INDIRECT_BRK) {   // resolve combining mark break
        pbrk[ich-1] = PROHIBITED_BRK;           // do not break before CM
        if (pcls[ich-1] == SP) {
    #ifndef LEGACY_CM   // new: space is not a base
            pbrk[ich-1] = COMBINING_INDIRECT_BRK;   // apply rule SP ÷
    #else
            pbrk[ich-1] = PROHIBITED_BRK;           // legacy: keep SP CM together
            if (ich > 1)
                pbrk[ich-2] = ((pcls[ich - 2] == SP) ?
                                    INDIRECT_BRK : DIRECT_BRK);
    #endif
        } else                                  // apply rule LB9: X CM * -> X
            continue;                           // do not update cls
    }

When handling a COMBINING_INDIRECT_BRK, the last remembered line break class in variable cls is not updated, except for those cases covered by rule LB10. A tailoring of rule LB9 that keeps the last SPACE character preceding a combining mark, if any, and therefore breaks before that SPACE character can easily be implemented as shown in the sample code. (See Section 9.2, Legacy Support for Space Character as Base for Combining Marks.)

Any rows in Table 2 for line break classes that prohibit breaks after must be handled explicitly. In the example pair table, these are assigned a special entry “@”, which corresponds to a special break action of COMBINING_PROHIBITED_BRK that triggers the following code:

    else if (brk == COMBINING_PROHIBITED_BRK) {     // this is the case OP SP* CM
        pbrk[ich-1] = COMBINING_PROHIBITED_BRK;     // no break allowed
        if (pcls[ich-1] != SP)
            continue;                               // apply rule LB9: X CM* -> X
    }

The only line break class that unconditionally prevents breaks across a following SP is OP. The preceding code fragment ensures that OP CM is handled according to rule LB9 and OP SP CM is handled as OP SP AL according to rule LB10.
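
For comparison only, the combined effect of rules LB9 and LB10 can also be pictured as a pre-pass over the array of line break classes, which is a different technique from the special # and @ table entries used by the sample code. The sketch below is an illustration under stated assumptions, not the method of [Code14]: the function names and the no_break_after array are hypothetical, and the enumeration lists only the classes needed here, in arbitrary order.

```c
/* Subset of line break classes needed for this sketch (order is arbitrary). */
enum break_class { AL, BK, CM, CR, LF, NL, OP, SP, ZW };

/* LB9 applies to any base class X except BK, CR, LF, NL, SP, and ZW. */
static int can_carry_marks(enum break_class cls)
{
    return cls != BK && cls != CR && cls != LF &&
           cls != NL && cls != SP && cls != ZW;
}

/* Hypothetical pre-pass: rewrite pcls in place so that no CM remains.
   A CM that follows a suitable base takes over the class of that base
   (LB9: treat X CM* as X); any other CM becomes AL (LB10).
   no_break_after[i] is set to 1 where the break after index i must be
   prohibited because index i+1 was absorbed into its base. */
void resolve_combining_marks(enum break_class *pcls, int *no_break_after,
                             int cch)
{
    for (int i = 0; i < cch; i++)
        no_break_after[i] = 0;

    for (int i = 0; i < cch; i++) {
        if (pcls[i] != CM)
            continue;
        /* pcls[i-1] has already been resolved, so it is never CM here */
        if (i > 0 && can_carry_marks(pcls[i - 1])) {
            pcls[i] = pcls[i - 1];
            no_break_after[i - 1] = 1;
        } else {
            pcls[i] = AL;
        }
    }
}
```

With this pre-pass, OP CM becomes OP OP with the break between them suppressed, while SP CM becomes SP AL, which mirrors the two cases distinguished by the code fragments above.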

7.6 Conjoining Jamos

For Korean Syllable Blocks, the information in rule LB26 is represented by a simple pair table shown in Table 3.

Table 3. Korean Syllable Block Pair Table

      H2   H3   JL   JV   JT
H2    _    _    _    %    %
H3    _    _    _    _    %
JL    %    %    %    %    _
JV    _    _    _    %    %
JT    _    _    _    _    %

When constructing a pair table such as Table 2, this pair table for Korean syllable blocks in Table 3 is merged with the main pair table for all other line break classes by adding the cells from Table 3 beyond the lower-right corner of the main pair table. Next, according to rule LB27, any empty cells in the new rows are filled with the same values as in the existing row for class ID, and any empty cells for the new columns are filled with the same values as in the existing column for class ID. The resulting merged table is shown in Table 2.
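
The merge described above might be sketched as follows. This is a minimal illustration and not part of [Code14]: the function name merge_jamo_table, the row-major array layout, and the id_index parameter (the position of class ID in the main table) are all assumptions made for the example. The jamo_pairs array restates Table 3.

```c
/* Same values as the break_action enumeration in Section 7.4. */
enum break_action {
    DIRECT_BRK = 0,
    INDIRECT_BRK,
    COMBINING_INDIRECT_BRK,
    COMBINING_PROHIBITED_BRK,
    PROHIBITED_BRK,
    EXPLICIT_BRK
};

enum { N_JAMO = 5 };    /* H2, H3, JL, JV, JT, in the order of Table 3 */

/* Table 3 expressed as break actions: _ is DIRECT_BRK, % is INDIRECT_BRK. */
static const enum break_action jamo_pairs[N_JAMO][N_JAMO] = {
    /* H2 */ { DIRECT_BRK,   DIRECT_BRK,   DIRECT_BRK,   INDIRECT_BRK, INDIRECT_BRK },
    /* H3 */ { DIRECT_BRK,   DIRECT_BRK,   DIRECT_BRK,   DIRECT_BRK,   INDIRECT_BRK },
    /* JL */ { INDIRECT_BRK, INDIRECT_BRK, INDIRECT_BRK, INDIRECT_BRK, DIRECT_BRK   },
    /* JV */ { DIRECT_BRK,   DIRECT_BRK,   DIRECT_BRK,   INDIRECT_BRK, INDIRECT_BRK },
    /* JT */ { DIRECT_BRK,   DIRECT_BRK,   DIRECT_BRK,   DIRECT_BRK,   INDIRECT_BRK }
};

/* Hypothetical merge routine.
   main_tbl - n_main x n_main pair table, row-major (input)
   id_index - index of class ID within the main table (input)
   merged   - (n_main + N_JAMO) x (n_main + N_JAMO) table, row-major (output) */
void merge_jamo_table(const enum break_action *main_tbl, int n_main,
                      int id_index, enum break_action *merged)
{
    int n = n_main + N_JAMO;
    for (int r = 0; r < n; r++) {
        for (int c = 0; c < n; c++) {
            enum break_action v;
            if (r < n_main && c < n_main)
                v = main_tbl[r * n_main + c];           /* existing cell */
            else if (r >= n_main && c >= n_main)
                v = jamo_pairs[r - n_main][c - n_main]; /* cell from Table 3 */
            else if (r >= n_main)
                v = main_tbl[id_index * n_main + c];    /* new row: copy row ID */
            else
                v = main_tbl[r * n_main + id_index];    /* new column: copy column ID */
            merged[r * n + c] = v;
        }
    }
}
```

Placing the five Korean classes last keeps the indices of all other classes unchanged, which matches the layout of the final rows and columns of Table 2.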

7.7 Explicit Breaks

Handling explicit breaks is straightforward in the driver code, although it does clutter up the loop condition and body of the loop a bit. For completeness, the following sample shows how to change the loop condition and add if statements (both before and inside the loop) that handle BK, NL, CR, and LF. Because NL and BK behave identically by default, this code can be simplified in implementations where the character classification is changed so that BK will always be substituted for NL when assigning the line break class. Because this optimization does not change the result, it is not considered a tailoring and does not affect conformance.

    // handle case where input starts with an LF
    if (cls == LF)
        cls = BK;

    // treat initial NL like BK
    if (cls == NL)
        cls = BK;

    // loop over all pairs in the string up to a hard break or CRLF pair
    // (ich must be declared before the loop, because it is used after the loop)
    for (ich = 1; (ich < cch) && (cls != BK) && (cls != CR || pcls[ich] == LF); ich++) {

        // handle BK, NL and LF explicitly
        if (pcls[ich] == BK || pcls[ich] == NL || pcls[ich] == LF) {
            pbrk[ich-1] = PROHIBITED_BRK;
            cls = BK;
            continue;
        }

        // handle CR explicitly
        if (pcls[ich] == CR) {
            pbrk[ich-1] = PROHIBITED_BRK;
            cls = CR;
            continue;
        }

        // handle spaces explicitly...
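
The simplification mentioned above, substituting BK for NL at classification time, might look like the following sketch. The function name fold_nl_into_bk is hypothetical, and the enumeration lists only a few classes in arbitrary order; a real implementation would more likely perform the substitution inside its property lookup rather than in a separate pass.

```c
/* Subset of line break classes needed for this sketch (order is arbitrary). */
enum break_class { AL, BK, CR, LF, NL, SP };

/* Hypothetical helper: replace every NL by BK so that the driver loop
   no longer needs separate tests for NL. Because NL and BK behave
   identically by default, the resulting break positions are unchanged. */
void fold_nl_into_bk(enum break_class *pcls, int cch)
{
    for (int i = 0; i < cch; i++) {
        if (pcls[i] == NL)
            pcls[i] = BK;
    }
}
```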

8 Customization

A real-world line breaking algorithm has to be tailorable to some degree to meet user or document requirements.

In Korean, for example, two distinct line breaking modes occur, which can be summarized as breaking after each character or breaking after spaces (as in Latin text). The former tends to occur when text is set justified; the latter, when ragged margins are used. In that case, even ideographs are broken only at space characters. In Japanese, for example, tighter and looser specifications of prohibited line breaks may be used.

Specialized text or specialized text constructs may need specific line breaking behavior that differs from the default line breaking rules given in this annex. This may require additional tailorings beyond those considered in this section. For example, the rules given here are insufficient for mathematical equations, whether inline or in display format. Likewise, text that commonly contains lengthy URLs might benefit from special tailoring that suppresses SY × NU from rule LB25 within the scope of a URL to allow breaks after a “/” separated segment in the URL regardless of whether the next segment starts with a digit.

The remainder of this section gives an overview of common types of tailorings and examples of how to customize the pair table implementation of the line breaking algorithm for these tailorings.

8.1 Types of Tailoring

There are three principal ways of tailoring the sample pair table implementation of the line breaking algorithm:

  1. Changing the line breaking class assignment for some characters
    This is useful in cases where the line breaking properties of one class of characters are occasionally lumped together with the properties of another class to achieve a less restrictive line breaking behavior.
  2. Changing the table value assigned to a pair of character classes
    This is particularly useful if the behavior can be expressed by a change at a limited number of pair intersections. This form of customization is equivalent to permanently overriding some of the rules in Section 6, Line Breaking Algorithm.
  3. Changing the interpretation of the line breaking actions
    This is a dynamic equivalent of the preceding. Instead of changing the values for the pair intersection directly in the table, they are labeled with special values that cause different actions for different customizations. This is most suitable when customizations need to be enabled at run time.

Beyond these three straightforward customization steps, it is always possible to augment the algorithm itself, for example, by providing specialized rules to recognize and break common constructs, such as URLs, numeric expressions, and so on. Such open-ended customizations place no limits on possible changes, other than the requirement that characters with normative line breaking properties be correctly implemented.

Reference [Cedar97] reports on a real-world, pair table-based implementation of a line breaking algorithm substantially similar to the one presented here, including the types of customizations presented in this section. That implementation was able to simultaneously meet the requirements of customers in many European and East Asian countries with a single implementation of the algorithm.

8.2 Examples of Customization

Example 1. The exact method of resolving the line break class for characters with class SA is not specified in the default algorithm. One method of implementing line breaks for complex scripts is to invoke context-based classification for all runs of characters with class SA. For example, a dictionary-based algorithm could return different classes for Thai letters depending on their context: letters at the start of Thai words would become BB and other Thai letters would become AL. Alternatively, for text consisting of or predominantly containing characters with line breaking class SA, it may be useful to instead defer the determination of line breaks to a different algorithm entirely. Section 7.4, Sample Code, sketches such an approach in which the interface to the dictionary-based algorithm directly reports break opportunities.

Example 2. To implement terminal style line breaks, it would be necessary to allow breaks at fixed positions. These could occur inside a run of spaces or in the middle of words without regard to hyphenation. Such a modification essentially disregards the output of the line breaking algorithm, and is therefore not a conformant tailoring. For a system that supports both regular line breaking and terminal style line breaks, only some of its line break modes would be conformant.

Example 3. Depending on the nature of the document, Korean either uses implicit breaking around characters (type 2 as defined above in Section 3, Introduction) or uses spaces (type 1). Space-based layout is common in magazines and other informal documents with ragged margins, while books, with both margins justified, use the other type, as it affords more line break opportunities and therefore leads to better justification. Reference [Suign98] shows how the necessary customizations can be elegantly handled by selectively altering the interpretation of the pair entries. Only the intersections of ID/ID, AL/ID, and ID/AL are affected. For alphabetic style line breaking, breaks for these cases require space; for ideographic style line breaking, these cases do not require spaces. Therefore, one defines a pseudo-action, which is then resolved into either direct or indirect break action based on user selection of the preferred behavior for a given text.
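
The pseudo-action approach of Example 3 might be sketched as follows. This is only an illustration of the idea, not the code of [Suign98]: the extra enumerator STYLE_DEPENDENT_BRK, the korean_style type, and the function resolve_pseudo_action are hypothetical names introduced here.

```c
/* The break_action values from Section 7.4, extended with one
   hypothetical pseudo-action for the ID/ID, AL/ID, and ID/AL cells. */
enum break_action {
    DIRECT_BRK = 0,
    INDIRECT_BRK,
    COMBINING_INDIRECT_BRK,
    COMBINING_PROHIBITED_BRK,
    PROHIBITED_BRK,
    EXPLICIT_BRK,
    STYLE_DEPENDENT_BRK         /* hypothetical: resolved at run time */
};

/* Hypothetical user setting selecting the preferred behavior. */
enum korean_style {
    STYLE_IDEOGRAPHIC,          /* break around characters, no space needed */
    STYLE_ALPHABETIC            /* break only where spaces intervene */
};

/* Resolve the pseudo-action into a regular action; all other actions
   are passed through unchanged. */
enum break_action resolve_pseudo_action(enum break_action brk,
                                        enum korean_style style)
{
    if (brk != STYLE_DEPENDENT_BRK)
        return brk;
    return (style == STYLE_ALPHABETIC) ? INDIRECT_BRK : DIRECT_BRK;
}
```

In a driver such as findLineBrk, this resolution would presumably be applied to the value read from the pair table before the test for INDIRECT_BRK, so that the usual lookback for spaces still takes place in alphabetic style.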

Example 4. Sometimes in a Far Eastern context it is necessary to allow alphabetic characters and digit strings to break anywhere. According to reference [Suign98], this can again be done in the same way as Korean. In this case the intersections of NU/NU, NU/AL, AL/AL, and AL/NU are affected.

Example 5. Some users prefer to relax the requirement that Kana syllables be kept together. For example, the syllable kyu, spelled with the two kanas KI and “small yu”, would no longer be kept together as if KI and yu were atomic. This customization can be handled via the first method by changing the classification of the Kana small characters from NS to ID as needed.
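
A reclassification of this kind might be sketched as a thin wrapper around the default property lookup. The function names are hypothetical, the enumeration lists only the classes needed here, and the six code points shown (the small ya, yu, and yo letters of Hiragana and Katakana) are an illustrative subset rather than a complete list of the characters affected.

```c
#include <stdbool.h>
#include <stdint.h>

/* Subset of line break classes needed for this sketch (order is arbitrary). */
enum break_class { AL, ID, NS };

/* Illustrative subset of the Kana small characters. */
static bool is_small_ya_yu_yo(uint32_t cp)
{
    switch (cp) {
    case 0x3083: case 0x3085: case 0x3087:  /* HIRAGANA LETTER SMALL YA, YU, YO */
    case 0x30E3: case 0x30E5: case 0x30E7:  /* KATAKANA LETTER SMALL YA, YU, YO */
        return true;
    default:
        return false;
    }
}

/* Hypothetical tailoring hook: given a code point and its default line
   break class, return the class to use. When relax_kana is set, the
   selected small Kana are treated as ID instead of NS. */
enum break_class tailor_small_kana(uint32_t cp, enum break_class default_cls,
                                   bool relax_kana)
{
    if (relax_kana && default_cls == NS && is_small_ya_yu_yo(cp))
        return ID;
    return default_cls;
}
```

Because only the class assignment changes, the pair table and the driver code remain untouched.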

Example 6. Some implementations may wish to tailor the line breaking algorithm to resolve grapheme clusters according to Unicode Standard Annex #29, “Text Boundaries” [Boundaries], as a first stage. Generally, the line breaking algorithm does not create line break opportunities within default grapheme clusters; therefore such a tailoring would be expected to produce results that for most practical cases are close to what are defined by the default algorithm. However, if such a tailoring is chosen, characters that are members of line break class CM but not part of the definition of default grapheme clusters must still be handled by rules LB9 and LB10, or by some additional tailoring.

Example 7. Regular expression-based line breaking engines might get better results using a tailoring that directly implements the following regular expression for numeric expressions:

(PR | PO)? (OP | HY)? NU (NU | SY | IS)* CL? (PR | PO)?

This is equivalent to replacing the rule LB25 by the following tailored rule:

Regex-Number: Do not break numbers.

(PR | PO) × (OP | HY)? NU

(OP | HY) × NU

NU × (NU | SY | IS)

NU (NU | SY | IS)* × (NU | SY | IS | CL)

NU (NU | SY | IS)* CL? × (PO | PR)

This customized rule uses extended contexts that cannot be represented in a pair table. In these tailored rules, (PR | PO) means PR or PO, the symbol “?” means 0 or one occurrence and the symbol “*” means 0 or more occurrences. The last two rules can have a left side of any non-zero length.
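
For implementations that work on arrays of line break classes rather than with a regular expression engine, the expression of Example 7 can be matched by hand. The following is a minimal sketch under that assumption; the function name match_number is hypothetical, and the enumeration lists only the classes needed here, in arbitrary order.

```c
/* Subset of line break classes needed for this sketch (order is arbitrary). */
enum break_class { AL, CL, HY, IS, NU, OP, PO, PR, SY };

/* Hypothetical matcher for
       (PR | PO)? (OP | HY)? NU (NU | SY | IS)* CL? (PR | PO)?
   anchored at pcls[0]. Returns the number of classes matched, or 0 if
   no numeric expression starts at pcls[0]. Greedy matching suffices,
   because each optional part uses classes disjoint from the part that
   follows it. */
int match_number(const enum break_class *pcls, int cch)
{
    int i = 0;

    if (i < cch && (pcls[i] == PR || pcls[i] == PO))    /* (PR | PO)? */
        i++;
    if (i < cch && (pcls[i] == OP || pcls[i] == HY))    /* (OP | HY)? */
        i++;
    if (i >= cch || pcls[i] != NU)                      /* NU is required */
        return 0;
    i++;
    while (i < cch &&
           (pcls[i] == NU || pcls[i] == SY || pcls[i] == IS))
        i++;                                            /* (NU | SY | IS)* */
    if (i < cch && pcls[i] == CL)                       /* CL? */
        i++;
    if (i < cch && (pcls[i] == PR || pcls[i] == PO))    /* (PR | PO)? */
        i++;
    return i;
}
```

When the function returns a length n greater than 1, the tailored rule prohibits breaks between each pair of adjacent positions within the matched run of n classes.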

When the tailored rule is used, LB13 needs to be tailored as follows:

[^NU] × CL

× EX

[^NU] × IS

[^NU] × SY

Otherwise, single digits might be handled by rule LB13 before being handled in the regular expression. In these tailored rules [^NU] designates any line break class other than NU. The symbol ^ is used, instead of !, to avoid confusion with the use of ! to indicate an explicit break.

Example 8. For some implementations it may be difficult to implement LB9 due to the added complexity of its indefinite length context. Because combining marks are most commonly applied to characters of class AL, rule LB10 by itself generally produces acceptable results for such implementations, but such an approximation is not a conformant tailoring.

Example 9. Prevent breaks when part of a word appears within parentheses—for example, in “person(s)”.

  1. Reclassify U+0029, RIGHT PARENTHESIS, from line break class CL (Closing Punctuation) to line break class IS (Numeric Infix Separator).
  2. Add the following rule as LB30:

(AL | NU) × OP

This customization is one possible way of achieving the original purpose of LB30 (preventing breaks in words like "person(s)") without the undesired side effect of preventing breaks after Asian punctuation characters having line breaking class CL (Closing Punctuation).

9 Implementation Notes

This section provides additional notes on implementation issues.

9.1 Combining Marks in Regular Expression-Based Implementations

For implementations that use regular expressions, it is not possible to directly express rules LB9 and LB10. However, it is possible to make these rules unnecessary by rewriting all the rules from LB11 on down so that the overall result of the algorithm is unchanged. This restatement of the rules is therefore not a tailoring, but rather an equivalent statement of the algorithm that can be directly expressed as regular expressions.

To replace rule LB9, terms of the form

B # A

B SP* # A

B #

B SP* #

are replaced by terms of the form

B CM* # A

B CM* SP* # A

B CM* #

B CM* SP* #

where B and A are any line break class or set of alternate line break classes, such as (X | Y), and where # is any of the three operators !, ÷, or ×.

Note that because sot, BK, CR, LF, NL, and ZW are all handled by rules above LB9, these classes cannot occur in position B in any rule that is rewritten as shown here.

Replace LB10 by the following rule:

× CM

For each rule containing AL on its left side, add a rule that is identical except for the replacement of AL by CM, but taking care of correctly handling sets of alternate line break classes. For example, for rule

(AL | NU) × OP

add another rule

CM × OP.

These prescriptions for rewriting the rules are, in principle, valid even where the rules have been tailored as permitted in Section 4, Conformance. However, for extended context rules such as in Example 7, additional considerations apply. These are described in Section 6.2, Replacing Ignore Rules, of Unicode Standard Annex #29, “Text Boundaries” [Boundaries].

9.2 Legacy Support for Space Character as Base for Combining Marks

As stated in [Unicode5.0], Section 7.9, Combining Marks, combining characters are shown in isolation by applying them to U+00A0 NO-BREAK SPACE (NBSP). In earlier versions, this recommendation included the use of U+0020 SPACE. This use of SPACE for this purpose is now deprecated because it has been found to lead to many complications in text processing. Whether using either NBSP or SPACE as the base character, the visual appearance is the same, but the line breaking behavior is different. Under the current rules, SP CM* will allow a break between SP and CM*, which could result in a new line starting with a combining mark. Previously, whenever the base character was SP, the sequences CM* and SP CM* were defined to act like indivisible clusters, allowing breaks on either side like ID.

Where backward compatibility with documents created under the prior practice is desired, the following tailoring should be applied to those CM characters that have a General_Category value of Combining_Mark (M):

Legacy-CM: In all of the rules following rule LB8, if a space is the base character for a combining mark, the space is changed to type ID. In other words, break before SP in the same cases as one would break before an ID.

Treat SP CM* as if it were ID.

While this tailoring changes the location of the line break opportunities in the string, it is ordinarily not expected to affect the display of the text. That is because spaces at the end of the line are normally invisible and the recommended display for isolated combining marks is the same as if they were applied to a preceding SPACE or NBSP.

10 Testing

As with the other default specifications, implementations are free to override (tailor) the results to meet the requirements of different environments or particular languages as described in Section 4, Conformance. For those who do implement the default breaks as specified in this annex, and wish to check that their implementation matches that specification, a test file has been made available in [Tests14].

These tests cannot be exhaustive, because of the large number of possible combinations; but they do provide samples that test all pairs of property values, using a representative character for each value, plus certain other sequences.

A sample HTML file that shows various combinations in chart form is also available in [Charts14]. The header cells of the chart consist of a property value, followed by a representative code point number. The body cells in the chart show the break status: whether a break occurs between the row property value and the column property value. If the browser supports tool-tips, then hovering the mouse over the code point number will show the character name, General_Category and Script property values. Hovering over the break status will display the number of the rule responsible for that status.

Note: To determine a break it is generally not sufficient to just test the two adjacent characters.

The chart is followed by some test cases. These test cases consist of various strings with the break status between each pair of characters shown by blue lines for breaks and by whitespace for non-breaks. Hovering over each character (with tool-tips enabled) shows the character name and property value; hovering over the break status shows the number of the rule responsible for that status.

Due to the way they have been mechanically processed for generation, the test rules do not match the rules in this annex precisely. In particular:

  1. The rules are cast into a more regex-style.
  2. The rules “sot”, “eot”, and “Any” are added mechanically and have artificial numbers.
  3. The rules are given decimal numbers without prefix, so rules such as LB14 are given a number using tenths, such as 14.0.
  4. Where a rule has multiple parts (lines), each one is numbered using hundredths, such as
    • 13.01) [^NU] × CL
    • 13.02) × EX
    • ...
  5. LB9 and LB10 are handled as described in Section 9.1, Combining Marks in Regular Expression-Based Implementations, resulting in a transformation of the rules not visible in the tests.

The mapping from the rule numbering in this annex to the numbering for the test rules is summarized in Table 4.

Table 4. Numbering of Rules

Rule in This Annex    Test Rule    Comment
LB2                   0.2          start of text
LB3                   0.3          end of text
LB12a                 12.0         GL ×
LB12b                 12.1         [^SP, BA, HY] × GL
LB31                  999          ÷ any

 

References

For references for this annex, see Unicode Standard Annex #41, “Common References for Unicode Standard Annexes.”

Acknowledgments

Asmus Freytag is the author of the initial version and has added to and maintained the text of this annex.

The initial assignments of properties are based on input by Michel Suignard. Mark Davis provided algorithmic verification and formulation of the rules, and detailed suggestions on the algorithm and text. Ken Whistler, Rick McGowan and other members of the editorial committee provided valuable feedback. Tim Partridge enlarged the information on dictionary usage. Sun Gi Hong reviewed the information on Korean and provided copious printed samples. Eric Muller reanalyzed the behavior of the soft hyphen and collected the samples. Adam Twardoch provided the Polish example. António Martins-Tuválkin supplied information about Portuguese. Tomoyuki Sadahiro provided information on use of U+30A0. Christopher Fynn provided the background information on Tibetan line breaking. Andrew West, Kamal Mansour, Andrew Glass, Daniel Yacob, and Peter Kirk suggested improvements for Mongolian, Arabic, Kharoshthi, Ethiopic, and Hebrew punctuation characters, respectively. Kent Karlsson reviewed the line break properties for consistency. Jerry Hall reviewed the sample code. Elika J. Etemad (fantasai) reviewed the entire document in an effort to make it easier to reference from external standards. Many others provided additional review of the rules and property assignments.

Modifications

Change History

For details of the change history, see the online copy of this annex at http://www.unicode.org/reports/tr14/.

Rule Numbering Across Versions

Table 5 documents changes in the numbering of line breaking rules. A duplicate number indicates that a rule was subsequently split. (In each version, the rules are applied in their numerical order, not in the order they appear in this table.) Versions prior to 3.0.1 are not documented here.

Table 5. Rule Numbering Across Versions

5.1.0       5.0.0       4.1.0       4.0.1       4.0.0       3.2.0       3.1.0       3.0.1
LB1         1           1           1           1           1           1           1
LB2         2           2a          2a          2a          2a          2a          2a
LB3         3           2b          2b          2b          2b          2b          3b
LB4         4           3a          3a          3a          3a          3a          3a
LB5         5           3b          3b          3b          3a          3a          3a
LB6         6           3c          3c          3c          3b          3b          3b
LB7         7           4           4           4           4           4           4
LB8         8           5           5           5           5           5           5
                        deprecated  7a          7a          7           7           7
LB9         9           7b          7b          7b          6           6           6
LB10        10          7c          7c          7c
LB11        11          11b         11b         11b         13          13          13
LB12        12          13          11b         11b         13          13          13
LB12a       12          13          11b         11b         13          13          13
LB13        13          8           8           8           8           8           8
LB14        14          9           9           9           9           9           9
LB15        15          10          10          10          10          10          10
LB16        16          11          11          11          11          11          11
LB17        17          11a         11a         11a         11a         11a
LB18        18          12          12          12          12          12          12
LB19        19          14          14          14          14          14          14
LB20        20          14a         14a         14a
LB21        21          15          15          15          15          15          15
LB22        22          16          16          16          16          16          16
LB23        23          17          17          17          17          17          17
LB24        24          18          18          18          18          18          18
LB25        25          18          18          18          18          18          18
                        removed     18b         18b         15b         15b         15b
LB26        26          18b         6           6           6           6           6
LB27        27          18c         6           6           6           6           6
LB28        28          19          19          19          19          19          19
LB29        29          19b         19b
LB30        30
LB31        31          20          20          20          20          20          20

 

Change History

The following documents the changes introduced by each revision.

Revision 22:

  • Updated for Version 5.1.0.
  • Added 2E18, INVERTED INTERROBANG, to class OP.
  • Added 2064, INVISIBLE PLUS, to class AL.
  • Added 2E00..2E01, 2E06..2E08, 2E0B to class QU.
  • Removed LB30, to correct regression for U+3002 IDEOGRAPHIC FULL STOP
  • Added Example 9 to Section 8, Customization.
  • Substantial revisions to Section 4, Conformance.
  • Section 5.7, Word Separators, added.
  • Section 10, Testing, added.
  • Renumbered rules for consistency: 12a -> 12; 12b -> 12a
  • Added 02DF, MODIFIER LETTER CROSS ACCENT, to class BB.
  • Added discussion for 202F NARROW NO-BREAK SPACE and 180E MONGOLIAN VOWEL SEPARATOR
  • Corrected typos in LB13 and LB16
  • Added characters introduced with Unicode 5.1 to the lists associated with the line break properties.
  • Added Section 5.2 on the special handling of hyphens. Edited Section 3 for clarity.
  • Improved delineation between normative and informative information.
  • Changed from EX to IS
    • 060C ( ، ) ARABIC COMMA
  • Changed from EX to PO
    • 066A ( ٪ ) ARABIC PERCENT SIGN
  • Changed from AI to OP
    • 00A1 ( ¡ ) INVERTED EXCLAMATION MARK
    • 00BF ( ¿ ) INVERTED QUESTION MARK
  • Changed from BA to EX
    • 1802 ( ᠂ ) MONGOLIAN COMMA
    • 1803 ( ᠃ ) MONGOLIAN FULL STOP
    • 1808 ( ᠈ ) MONGOLIAN MANCHU COMMA
    • 1809 ( ᠉ ) MONGOLIAN MANCHU FULL STOP
    • 2CF9 ( ⳹ ) COPTIC OLD NUBIAN FULL STOP
    • 2CFE ( ⳾ ) COPTIC FULL STOP
  • Changed from BA to AL
    • 1A1E ( ᨞ ) BUGINESE PALLAWA
  • Changed from AL to BB
    • 1FFD ( ´ ) GREEK OXIA
  • Added a note on lack of canonical equivalence for the definition of ambiguous characters.
  • Corrected typos in the sample source code.
  • Split rule LB12 to accommodate Polish and Portuguese hyphenation.
Revisions 20 and 21 being a proposed update, only changes between revisions 19 and 22 are noted here.

Revision 19:

  • Changed 000B from CM to BK, changed 035C from CM to GL.
  • Changed 17D9 from NS to AL. 203D, 2047..2049 from AL to NS.
  • Corrected listing of NS property to match the data file to remove 17D8 and 17DA.
  • The data file has been corrected to match the listing of the BA property to include 1735 and 1736, also changed 05BE and 103D0 from AL to BA.
  • Changed the brackets 23B4..23B6 to AL.
  • Updated the SA property to make it more generic, includes changing many characters from CM to SA.
  • Reflected new characters
  • Made several text changes for clarification, including a reworded introduction to Section 6.
  • Added Section 9.
  • Restated the conformance clauses and reorganized the algorithm into a tailorable and a non-tailorable part; this affects text in Sections 4, 5, and 6.
  • Removed redundant term PR x HY from rule 18 and rule into new LB24 and LB26 to provide better granularity for tailoring.
  • Moved rule 11b and 13 above rule 8 (new LB13), restating rule 13 (new LB12) to preserve its effect in the new location.
  • Added new rule LB30 to handle words like “person(s)”.
  • Renumbered the rules.
  • Extensive copy-editing as part of Unicode 5.0 publication.
Revision 18 being a proposed update, only changes between revisions 17 and 19 are noted here.

Revision 17:

  • Significantly revised the line break classes for Tibetan, as well as Mongolian and Arabic Punctuation.
  • Added section 5.6 on Tibetan and section 7.7 on handling explicit breaks.
  • Added line break class assignments for Unicode 4.1 characters.
  • Significantly revised the line break class assignments for danda characters and made it consistent across scripts.
  • LB6: Replaced by new rules 18b and 18c, using new classes JL, JV, JT, H2, and H3.
  • LB7a: Deprecated rule 7a because SPACE as base character for standalone combining marks is deprecated.
  • LB7b: Revised 7b and section 7.5 as well as Table 2 to match the deprecation of rule 7a.
  • LB7b: Clarified that this rule does not apply to SP.
  • LB11a: Added a missing SP * to make the formula match the rule.
  • LB18b: Removed the existing rule 18b because it was redundant.
  • Corrected an erratum on revision 14 by splitting GL from WJ in rule 11b and moving to a new rule 13.
  • Updated the pair table and sample code to match the changes in the rules.
  • Updated the regular expression for numbers.
  • Added several notes on implementation techniques.
  • Moved all suggested tailorings from the rules section to the examples in section 8.2.
Revision 16 being a proposed update, only changes between revisions 17 and 15 are noted here.

Revision 15:

  • LB19b: Added new rule 19b.
  • Changed line breaking class: combining double diacritics from CM to GL, 037A and 2126 to match their canonical equivalents, 2140 corrected to AL, Arabic numerical separators from AL to NU, many alphabetic characters that are EAW=Ambiguous from AI to AL to better reflect current practice, remaining circled numbers and letters from AL to AI for consistency.
  • Added a note on the behavior of U+200B and U+3000 when lines are justified.
  • Reconciled the data file and description of line breaking classes in section 5. 
  • Reconciled the rules and pair table implementation of the algorithm.
  • Updated the text of the conformance statement in section 4.
  • Added section 5.5 on use of double hyphen.
  • Updated styles and table formatting.
  • Minor edits throughout.

Revision 14:

  • Added new line breaking classes NL and WJ to better support NEL and Word Joiner.
  • Deprecated the use of class SG.
  • Several changes to the rules. Moved rule 15b to 18b, added 14b, moved 13 to 11b. Split rule 6 into 6a and 7b and split rule 3a into 3a and 3b. Restated rule 7a and added rule 7c.
  • Updated the pair table and sample code, adding a special token '#' to account for breaks before SP followed by CM.
  • Clarified the behavior of SHY and MONGOLIAN TODO SOFT HYPHEN, as well as WJ and ZWNBSP.
  • Added a new subsection 5.4 on SOFT HYPHEN and a new subsection 7.6 on conjoining jamos.
  • Added to the discussion on how to treat combining marks.
  • Clarified the conformance requirements in Section 4.
  • Added a definition of line breaking class as a synonym for the unwieldy line breaking property value.
  • Expanded the introduction in Section 3.
  • Moved subsections on customization into a new Section 8 and expanded the text.
  • Many edits throughout the text to update it for Unicode 4.0.0.
Revision 13 being a proposed update, only changes between revisions 12 and 14 are noted here.

Revision 12:

  • Changed header for publication of Unicode. Fixed a few additional typos.
  • Updated for publication of Unicode, Version 3.2.
  • Added Word Joiner to GL and noted that it is now the preferred character instead of FEFF.
  • Added LB class assignments for the new Unicode 3.2 characters to the data file. Only characters whose LB class differs from those of characters with related General_Category are noted explicitly in this text.
Revision 11 being a proposed update, only changes between revisions 10 and 12 are noted here.

Revision 10:

  • Changed header for publication of Unicode 3.1. Fixed a few additional typos.

Revision 9: 

  • Fixed several typos, reformatted and sorted some lists by code points.
  • Reconciled the data file and the description for BB (00B4), XX (PUA), AI (2015, 25C8, PUA), ID (FE6B), BA (00B4).
  • Restored PUA to XX.
  • LB7: Restored the rule, and fixed the note so it matches the rule and Section 7.7 of [Unicode4.0].
  • LB11a: Added a rule to reconcile the rules against pair table entry B2 ^ B2.
  • LB19: Added an entry to reconcile the rules against pair table entry PR % ID.
  • Reworked Section 7.5.
  • Removed two unused definitions (overfull and underfull).

Revision 8: 

  • New status section, changed format of references. Fixed several typos.
  • Added headers to Table 1.
  • Added a note on use of B and A.
  • Added mention of PUA to AI and removed mention of PUA from XX because the data file assigns AI to them.
  • Clarified the membership and implication of classes CM and ID.
  • Updated class ID by the new ranges for 3.1.
  • LB6: Clarified the description of how this rule affects conjoining Jamo.
  • LB7: Fixed the note so it matches the rule.
  • LB17: Fixed the regular expression for numbers in the explanation for this rule.
  • Reworded Sections 7.6 and 7.7 to clarify the customization process.

Revision 7: 

  • Fixed several typos.
  • New header.

Revision 6: 

  • Rewrite and reorganization of the text as part of the publication of the Unicode Standard, Version 3.0.

No change history is available for earlier revisions.


Copyright © 1998-2008 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.

