Unicode® Standard Annex #9

Unicode Bidirectional Algorithm

Version: Unicode 16.0.0
Editors: Manish Goregaokar मनीष गोरेगांवकर (manish@unicode.org), Robin Leroy (eggrobin@unicode.org)
Date: 2024-09-02
This Version: https://www.unicode.org/reports/tr9/tr9-50.html
Previous Version: https://www.unicode.org/reports/tr9/tr9-48.html
Latest Version: https://www.unicode.org/reports/tr9/
Latest Proposed Update: https://www.unicode.org/reports/tr9/proposed.html
Revision: 50

Summary

This annex describes specifications for the positioning of characters in text containing characters flowing from right to left, such as Arabic or Hebrew.

Status

This document has been reviewed by Unicode members and other interested parties, and has been approved for publication by the Unicode Consortium. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.

A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published online as a separate document. The Unicode Standard may require conformance to normative content in a Unicode Standard Annex, if so specified in the Conformance chapter of that version of the Unicode Standard. The version number of a UAX document corresponds to the version of the Unicode Standard of which it forms a part.

Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this annex is found in Unicode Standard Annex #41, “Common References for Unicode Standard Annexes.” For the latest version of the Unicode Standard, see [Unicode]. For a list of current Unicode Technical Reports, see [Reports]. For more information about versions of the Unicode Standard, see [Versions]. For any errata which may apply to this annex, see [Errata].


1 Introduction

The Unicode Standard prescribes a memory representation order known as logical order. When text is presented in horizontal lines, most scripts display characters from left to right. However, there are several scripts (such as Arabic or Hebrew) where the natural ordering of horizontal text in display is from right to left. If all of the text has a uniform horizontal direction, then the ordering of the display text is unambiguous.

However, because these right-to-left scripts use digits that are written from left to right, the text is actually bidirectional: a mixture of right-to-left and left-to-right text. In addition to digits, embedded words from English and other scripts are also written from left to right, also producing bidirectional text. Without a clear specification, ambiguities can arise in determining the ordering of the displayed characters when the horizontal direction of the text is not uniform.

This annex describes the algorithm used to determine the directionality for bidirectional Unicode text. The algorithm extends the implicit model currently employed by a number of existing implementations and adds explicit formatting characters for special circumstances. In most cases, there is no need to include additional information with the text to obtain correct display ordering.

However, in the case of bidirectional text, there are circumstances where an implicit bidirectional ordering is not sufficient to produce comprehensible text. To deal with these cases, a minimal set of directional formatting characters is defined to control the ordering of characters when rendered. This allows exact control of the display ordering for legible interchange and ensures that plain text used for simple items like filenames or labels can always be correctly ordered for display.

The directional formatting characters are used only to influence the display ordering of text. In all other respects they should be ignored: they have no effect on the comparison of text or on word breaks, parsing, or numeric analysis.

Each character has an implicit bidirectional type. The bidirectional types left-to-right and right-to-left are called strong types, and characters of those types are called strong directional characters. The bidirectional types associated with numbers are called weak types, and characters of those types are called weak directional characters. With the exception of the directional formatting characters, the remaining bidirectional types and characters are called neutral. The algorithm uses the implicit bidirectional types of the characters in a text to arrive at a reasonable display ordering for text.
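As an informal illustration of these categories, the bidirectional type of a character can be inspected with Python's standard `unicodedata.bidirectional`, which returns the short name of the Bidi_Class value. The helper name `bidi_category` and the three sets below are assumptions made for this sketch; NSM and BN are grouped with the weak types, following Table 4 of this annex. The result depends on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

# Groupings assumed for this sketch, following the summary table of
# bidirectional character types in this annex.
STRONG = {"L", "R", "AL"}
WEAK = {"EN", "ES", "ET", "AN", "CS", "NSM", "BN"}
NEUTRAL = {"B", "S", "WS", "ON"}


def bidi_category(ch: str) -> str:
    """Classify one character as strong, weak, neutral, or explicit formatting."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class in STRONG:
        return "strong"
    if bidi_class in WEAK:
        return "weak"
    if bidi_class in NEUTRAL:
        return "neutral"
    if bidi_class == "":
        # unicodedata returns an empty string when it has no value for ch.
        return "unknown"
    # Remaining values: LRE, RLE, LRO, RLO, PDF, LRI, RLI, FSI, PDI.
    return "explicit formatting"


for sample in ("a", "\u05d0", "1", " ", "\u202b"):
    print(f"U+{ord(sample):04X}", unicodedata.bidirectional(sample), bidi_category(sample))
```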

When working with bidirectional text, the characters are still interpreted in logical order; only the display is affected. The display ordering of bidirectional text depends on the directional properties of the characters in the text. Note that there are important security issues connected with bidirectional text: for more information, see [UTR36].

2 Directional Formatting Characters

Three types of explicit directional formatting characters are used to modify the standard implicit Unicode Bidirectional Algorithm (UBA). In addition, there are implicit directional formatting characters, the right-to-left and left-to-right marks. The effects of all of these formatting characters are limited to the current paragraph; thus, they are terminated by a paragraph separator.

These formatting characters all have the property Bidi_Control, and are divided into three groups:

Implicit Directional Formatting Characters: LRM, RLM, ALM
Explicit Directional Embedding and Override Formatting Characters: LRE, RLE, LRO, RLO, PDF
Explicit Directional Isolate Formatting Characters: LRI, RLI, FSI, PDI

Although the term embedding is used for some explicit formatting characters, the text within the scope of the embedding formatting characters is not independent of the surrounding text. Characters within an embedding can affect the ordering of characters outside, and vice versa. This is not the case with the isolate formatting characters, however. Characters within an isolate cannot affect the ordering of characters outside it, or vice versa. The effect that an isolate as a whole has on the ordering of the surrounding characters is the same as that of a neutral character, whereas an embedding or override roughly has the effect of a strong character.

Directional isolate characters were introduced in Unicode 6.3 after it became apparent that directional embeddings usually have too strong an effect on their surroundings and are thus unnecessarily difficult to use. The new characters were introduced instead of changing the behavior of the existing ones because doing so might have had an undesirable effect on those existing documents that do rely on the old behavior. Nevertheless, the use of the directional isolates instead of embeddings is encouraged in new documents – once target platforms are known to support them.

On web pages, the explicit directional formatting characters (of all types – embedding, override, and isolate) should be replaced by other mechanisms suitable for HTML and CSS. For information on the correspondence between explicit directional formatting characters and equivalent HTML5 markup and CSS properties, see Section 2.7, Markup and Formatting Characters.

2.1 Explicit Directional Embeddings

The following characters signal that a piece of text is to be treated as embedded. For example, an English quotation in the middle of an Arabic sentence could be marked as being embedded left-to-right text. If there were a Hebrew phrase in the middle of the English quotation, that phrase could be marked as being embedded right-to-left text. Embeddings can be nested one inside another, and in isolates and overrides.

Abbr. | Code Point | Name | Description
LRE | U+202A | LEFT-TO-RIGHT EMBEDDING | Treat the following text as embedded left-to-right.
RLE | U+202B | RIGHT-TO-LEFT EMBEDDING | Treat the following text as embedded right-to-left.

The effect of right-left line direction, for example, can be accomplished by embedding the text with RLE...PDF. (PDF will be described in Section 2.3, Terminating Explicit Directional Embeddings and Overrides.)

2.2 Explicit Directional Overrides

The following characters allow the bidirectional character types to be overridden when required for special cases, such as for part numbers. They are to be avoided wherever possible, because of security concerns. For more information, see [UTR36]. Directional overrides can be nested one inside another, and in embeddings and isolates.

Abbr. | Code Point | Name | Description
LRO | U+202D | LEFT-TO-RIGHT OVERRIDE | Force following characters to be treated as strong left-to-right characters.
RLO | U+202E | RIGHT-TO-LEFT OVERRIDE | Force following characters to be treated as strong right-to-left characters.

The precise meaning of these characters will be made clear in the discussion of the algorithm. The right-to-left override, for example, can be used to force a part number made of mixed English, digits and Hebrew letters to be written from right to left.

2.3 Terminating Explicit Directional Embeddings and Overrides

The following character terminates the scope of the last LRE, RLE, LRO, or RLO whose scope has not yet been terminated.

Abbr. | Code Point | Name | Description
PDF | U+202C | POP DIRECTIONAL FORMATTING | End the scope of the last LRE, RLE, RLO, or LRO.

The precise meaning of this character will be made clear in the discussion of the algorithm.

2.4 Explicit Directional Isolates

The following characters signal that a piece of text is to be treated as directionally isolated from its surroundings. They are very similar to the explicit embedding formatting characters. However, while an embedding roughly has the effect of a strong character on the ordering of the surrounding text, an isolate has the effect of a neutral like U+FFFC OBJECT REPLACEMENT CHARACTER, and is assigned the corresponding display position in the surrounding text. Furthermore, the text inside the isolate has no effect on the ordering of the text outside it, and vice versa.

In addition to allowing the embedding of strongly directional text without unduly affecting the bidirectional order of its surroundings, one of the isolate formatting characters also offers an extra feature: embedding text while inferring its direction heuristically from its constituent characters.

Isolates can be nested one inside another, and in embeddings and overrides.

Abbr. | Code Point | Name | Description
LRI | U+2066 | LEFT-TO-RIGHT ISOLATE | Treat the following text as isolated and left-to-right.
RLI | U+2067 | RIGHT-TO-LEFT ISOLATE | Treat the following text as isolated and right-to-left.
FSI | U+2068 | FIRST STRONG ISOLATE | Treat the following text as isolated and in the direction of its first strong directional character that is not inside a nested isolate.

The precise meaning of these characters will be made clear in the discussion of the algorithm.

2.5 Terminating Explicit Directional Isolates

The following character terminates the scope of the last LRI, RLI, or FSI whose scope has not yet been terminated, as well as the scopes of any subsequent LREs, RLEs, LROs, or RLOs whose scopes have not yet been terminated.

Abbr. | Code Point | Name | Description
PDI | U+2069 | POP DIRECTIONAL ISOLATE | End the scope of the last LRI, RLI, or FSI.

The precise meaning of this character will be made clear in the discussion of the algorithm.

2.6 Implicit Directional Marks

These characters are very light-weight formatting. They act exactly like right-to-left or left-to-right characters, except that they do not display or have any other semantic effect. Their use is more convenient than using explicit embeddings or overrides because their scope is much more local.

Abbr. | Code Point | Name | Description
LRM | U+200E | LEFT-TO-RIGHT MARK | Left-to-right zero-width character
RLM | U+200F | RIGHT-TO-LEFT MARK | Right-to-left zero-width non-Arabic character
ALM | U+061C | ARABIC LETTER MARK | Right-to-left zero-width Arabic character

There is no special mention of the implicit directional marks in the following algorithm. That is because their effect on bidirectional ordering is exactly the same as a corresponding strong directional character; the only difference is that they do not appear in the display.
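The point that the marks behave like ordinary strong characters can be checked informally with Python's standard `unicodedata` module. This is only a sketch: the constant names and the helper `describe_mark` are chosen for illustration, and the values reported depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Code points taken from the table of implicit directional marks.
LRM = "\u200e"
RLM = "\u200f"
ALM = "\u061c"


def describe_mark(mark: str) -> tuple:
    """Return (Bidi_Class, General_Category) for a single character."""
    return (unicodedata.bidirectional(mark), unicodedata.category(mark))


for name, mark in (("LRM", LRM), ("RLM", RLM), ("ALM", ALM)):
    # Each mark carries a strong type (L, R, or AL) and is a format
    # character (Cf), so it has no visible glyph of its own.
    print(name, describe_mark(mark))
```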

2.7 Markup and Formatting Characters

The explicit formatting characters introduce state into the plain text, which must be maintained when editing or displaying the text. Processes that are modifying the text without being aware of this state may inadvertently affect the rendering of large portions of the text, for example by removing a PDF.

The Unicode Bidirectional Algorithm is designed so that the use of explicit formatting characters can be equivalently represented by out-of-line information, such as stylesheet information or markup. Conflicts can arise if markup and explicit formatting characters are both used in the same paragraph. Where available, markup should be used instead of the explicit formatting characters: for more information, see [UnicodeXML]. However, any alternative representation is only to be defined by reference to the behavior of the corresponding explicit formatting characters in this algorithm, to ensure conformance with the Unicode Standard.

HTML5 [HTML5] and CSS3 [CSS3Writing] provide support for bidi markup as follows:

Unicode | Equivalent Markup | Equivalent CSS | Comment
RLI ... PDI | dir = "rtl" | direction:rtl; unicode-bidi:isolate | dir attribute on any element
LRI ... PDI | dir = "ltr" | direction:ltr; unicode-bidi:isolate | dir attribute on any element
FSI ... PDI | <bdi>, dir = "auto" | unicode-bidi:plaintext | dir attribute on any element
RLE ... PDF | (none) | direction:rtl; unicode-bidi:embed | markup not available in HTML
LRE ... PDF | (none) | direction:ltr; unicode-bidi:embed | markup not available in HTML
RLO ... PDF | (none) | direction:rtl; unicode-bidi:bidi-override | markup not available in HTML
LRO ... PDF | (none) | direction:ltr; unicode-bidi:bidi-override | markup not available in HTML
FSI RLO ... PDF PDI | <bdo dir = "rtl"> | direction:rtl; unicode-bidi:isolate-override | (none)
FSI LRO ... PDF PDI | <bdo dir = "ltr"> | direction:ltr; unicode-bidi:isolate-override | (none)

Unlike HTML 4.0, HTML5 does not provide exact equivalents for LRE, RLE, LRO, and RLO, although the dir attribute and the BDO element as outlined above should in most cases work as well as or better than those formatting characters. When absolutely necessary, CSS can be used to get exact equivalents for LRE, RLE, LRO, and RLO, as well as for LRI, RLI, and FSI.

Whenever plain text is produced from a document containing markup, the equivalent formatting characters should be introduced, so that the correct ordering is not lost. For example, whenever cut and paste results in plain text this transformation should occur.
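A minimal sketch of that transformation for the isolate rows of the table above: given the text content of an element and the value of its dir attribute, wrap the text in the corresponding isolate initiator and PDI. The function name `isolate_for_plain_text` is hypothetical, and a real converter would also have to handle nesting, `<bdo>`, and CSS-specified directionality, which are not modeled here.

```python
# Code points from Sections 2.4 and 2.5.
LRI = "\u2066"
RLI = "\u2067"
FSI = "\u2068"
PDI = "\u2069"

# Mapping from the HTML dir attribute value to the isolate initiator,
# following the equivalence table in this section.
_INITIATOR_FOR_DIR = {"ltr": LRI, "rtl": RLI, "auto": FSI}


def isolate_for_plain_text(text: str, dir_value: str) -> str:
    """Wrap text in the isolate characters equivalent to a dir attribute."""
    try:
        initiator = _INITIATOR_FOR_DIR[dir_value]
    except KeyError:
        raise ValueError(f"unsupported dir value: {dir_value!r}") from None
    return f"{initiator}{text}{PDI}"


# Example: the content of <span dir="rtl">...</span> copied as plain text.
plain = isolate_for_plain_text("\u05e9\u05dc\u05d5\u05dd", "rtl")
print([f"U+{ord(c):04X}" for c in plain])
```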

3 Basic Display Algorithm

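The Unicode Bidirectional Algorithm (UBA) takes a stream of text as input and proceeds in four main phases:

Separation into paragraphs. The rest of the algorithm is applied separately to the text within each paragraph.
Initialization. A list of bidirectional character types is initialized, with one entry for each character in the original text, together with a list of embedding levels, one per character.
Resolution of the embedding levels. A series of rules is applied to the lists of embedding levels and bidirectional character types.
Reordering. The text within each paragraph is broken into lines, and the resolved embedding levels are used to reorder the text of each line for display.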

The algorithm reorders text only within a paragraph; characters in one paragraph have no effect on characters in a different paragraph. Paragraphs are divided by the Paragraph Separator or appropriate Newline Function (for guidelines on the handling of CR, LF, and CRLF, see Section 4.4, Directionality, and Section 5.8, Newline Guidelines of [Unicode]). Paragraphs may also be determined by higher-level protocols: for example, the text in two different cells of a table will be in different paragraphs.

Combining characters always attach to the preceding base character in the memory representation. Even after reordering for display and performing character shaping, the glyph representing a combining character will attach to the glyph representing its base character in memory. Depending on the line orientation and the placement direction of base letterform glyphs, it may, for example, attach to the glyph on the left, or on the right, or above.

This annex uses the numbering conventions for normative definitions and rules in Table 1.

Table 1. Normative Definitions and Rules

Numbering | Section
BDn | Definitions
Pn | Paragraph levels
Xn | Explicit levels and directions
Wn | Weak types
Nn | Neutral types
In | Implicit levels
Ln | Resolved levels

3.1 Definitions

3.1.1 Basics

BD1. The bidirectional character types are values assigned to each Unicode character, including unassigned characters. The formal property name in the Unicode Character Database [UCD] is Bidi_Class.

BD2. Embedding levels are numbers that indicate how deeply the text is nested, and the default direction of text on that level. The minimum embedding level of text is zero, and the maximum explicit depth is 125, a value referred to as max_depth in the rest of this document.

As rules X1 through X8 will specify, embedding levels are set by explicit formatting characters (embedding, isolate, and override); higher numbers mean the text is more deeply nested. The reason for having a limitation is to provide a precise stack limit for implementations to guarantee the same results. A maximum explicit level of 125 is far more than sufficient for ordering, even with mechanically generated formatting; the display becomes rather muddied with more than a small number of embeddings.

For implementation stability, this specification now guarantees that the value of 125 for max_depth will not be increased (or decreased) in future versions. Thus, it is safe for implementations to treat the max_depth value as a constant. The max_depth value has been 125 since UBA Version 6.3.0.

BD3. The default direction of the current embedding level (for the character in question) is called the embedding direction. It is L if the embedding level is even, and R if the embedding level is odd.

For example, in a particular piece of text, level 0 is plain English text. Level 1 is plain Arabic text, possibly embedded within English level 0 text. Level 2 is English text, possibly embedded within Arabic level 1 text, and so on. Unless their direction is overridden, English text and numbers will always be an even level; Arabic text (excluding numbers) will always be an odd level. The exact meaning of the embedding level will become clear when the reordering algorithm is discussed, but the following provides an example of how the algorithm works.
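The even/odd rule of BD3 is small enough to state directly in code. The sketch below is illustrative only; the function name `embedding_direction` is an assumption of this example, not a name used by the annex.

```python
def embedding_direction(level: int) -> str:
    """Return the embedding direction for a level: 'L' if even, 'R' if odd."""
    if level < 0:
        raise ValueError("embedding levels are never negative")
    return "L" if level % 2 == 0 else "R"


# Levels 0, 1, 2 from the example above: English, Arabic, nested English.
for level in (0, 1, 2):
    print(level, embedding_direction(level))
```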

BD4. The paragraph embedding level is the embedding level that determines the default bidirectional orientation of the text in that paragraph.

BD5. The direction of the paragraph embedding level is called the paragraph direction.

BD6. The directional override status determines whether the bidirectional type of characters is to be reset. The directional override status is set by using explicit directional formatting characters. This status has three states, as shown in Table 2.

Table 2. Directional Override Status

Status | Interpretation
Neutral | No override is currently active
Right-to-left | Characters are to be reset to R
Left-to-right | Characters are to be reset to L

BD7. A level run is a maximal substring of characters that have the same embedding level. It is maximal in that no character immediately before or after the substring has the same level (a level run is also known as a directional run).

As specified below, level runs are important at two different stages of the Bidirectional Algorithm. The first stage occurs after rules X1 through X9 have assigned an explicit embedding level to each character on the basis of the paragraph direction and the explicit directional formatting characters. At this stage, in rule X10, level runs are used to build up the units to which subsequent rules are applied. Those rules further adjust each character’s embedding level on the basis of its implicit bidirectional type and those of other characters in the unit – but not outside it. The level runs resulting from these resolved embedding levels are then used in the actual reordering of the text by rule L2. The following example illustrates level runs at this later stage of the algorithm.

Example

In this and the following examples, case is used to indicate different implicit character types for those unfamiliar with right-to-left letters. Uppercase letters stand for right-to-left characters (such as Arabic or Hebrew), and lowercase letters stand for left-to-right characters (such as English or Russian).

Memory:            car is THE CAR in arabic
Character types:   LLL-LL-RRR-RRR-LL-LLLLLL
Paragraph level:   0
Resolved levels:   000000011111110000000000

Notice that the neutral character (space) between THE and CAR gets the level of the surrounding characters. The level of the neutral characters could be changed by inserting appropriate directional marks around neutral characters, or using explicit directional formatting characters.
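To make the notion of a level run in BD7 concrete, the following sketch splits a list of levels into maximal runs of equal level and applies it to the resolved levels of the example above. The function name `level_runs` and the (start, end, level) tuple layout, with an exclusive end index, are choices made for this illustration.

```python
from itertools import groupby


def level_runs(levels):
    """Split a sequence of embedding levels into maximal runs of equal level.

    Returns a list of (start, end, level) tuples; end is exclusive.
    """
    runs = []
    start = 0
    for level, group in groupby(levels):
        length = sum(1 for _ in group)
        runs.append((start, start + length, level))
        start += length
    return runs


text = "car is THE CAR in arabic"
resolved = [int(c) for c in "000000011111110000000000"]
for start, end, level in level_runs(resolved):
    print(level, repr(text[start:end]))
```

Applied to the example, this yields three level runs: "car is " at level 0, "THE CAR" at level 1, and " in arabic" at level 0.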

3.1.2 Matching Explicit Directional Formatting Characters

BD8. An isolate initiator is a character of type LRI, RLI, or FSI.

As rules X5a through X5c will specify, an isolate initiator raises the embedding level for the characters following it when the rules enforcing the depth limit allow it.

BD9. The matching PDI for a given isolate initiator is the one determined by the following algorithm:

Note that all formatting characters except for isolate initiators and PDIs are ignored when finding the matching PDI.

Note that this algorithm assigns a matching PDI (or lack of one) to an isolate initiator whether the isolate initiator raises the embedding level or is prevented from doing so by the depth limit rules.
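One way to realize BD9 is a nesting counter that starts at one, goes up at each isolate initiator, and goes down at each PDI; the PDI that brings it to zero is the match. The sketch below assumes that approach and operates on a list of Bidi_Class short names for a single paragraph. The function name `matching_pdi` is hypothetical, and, as noted above, every type other than isolate initiators and PDI is ignored.

```python
ISOLATE_INITIATORS = {"LRI", "RLI", "FSI"}


def matching_pdi(bidi_types, initiator_index):
    """Return the index of the PDI matching the isolate initiator, or None.

    bidi_types holds Bidi_Class short names for one paragraph.
    """
    if bidi_types[initiator_index] not in ISOLATE_INITIATORS:
        raise ValueError("not an isolate initiator")
    depth = 1  # counts isolates opened and not yet closed
    for i in range(initiator_index + 1, len(bidi_types)):
        bidi_type = bidi_types[i]
        if bidi_type in ISOLATE_INITIATORS:
            depth += 1
        elif bidi_type == "PDI":
            depth -= 1
            if depth == 0:
                return i
        # All other types, including LRE, RLE, LRO, RLO, and PDF, are ignored.
    return None


types = ["L", "RLI", "R", "LRI", "L", "PDI", "R", "PDI", "L"]
print(matching_pdi(types, 1), matching_pdi(types, 3))
```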

As rule X6a will specify, a matching PDI returns the embedding level to the value it had before the isolate initiator that the PDI matches. The PDI itself is assigned the new embedding level. If it does not match any isolate initiator, or if the isolate initiator did not raise the embedding level, it leaves the embedding level unchanged. Thus, an isolate initiator and its matching PDI are always assigned the same explicit embedding level, which is the one outside the isolate. In the later stages of the Bidirectional Algorithm, an isolate initiator and its matching PDI function as invisible neutral characters, and their embedding level then helps ensure that the isolate has the effect of a neutral character on the display order of the text outside it, and is assigned the corresponding display position in the surrounding text.

BD10. An embedding initiator is a character of type LRE, RLE, LRO, or RLO.

Note that an embedding initiator initiates either a directional embedding or a directional override; its name omits overrides only for conciseness.

As rules X2 through X5 will specify, an embedding initiator raises the embedding level for the characters following it when the rules enforcing the depth limit allow it.

BD11. The matching PDF for a given embedding initiator is the one determined by the following algorithm:

Note that this algorithm assigns a matching PDF (or lack of one) to an embedding initiator whether it raises the embedding level or is prevented from doing so by the depth limit rules.

Although the algorithm above serves to give a precise meaning to the term “matching PDF”, note that the overall Bidirectional Algorithm never actually calls for its use to find the PDF matching an embedding initiator. Instead, rules X1 through X7 specify a mechanism that determines what embedding initiator scope, if any, is terminated by a PDF, i.e. which valid embedding initiator a PDF matches.

As rule X7 will specify, a matching PDF returns the embedding level to the value it had before the embedding initiator that the PDF matches. If it does not match any embedding initiator, or if the embedding initiator did not raise the embedding level, a PDF leaves the embedding level unchanged.

As rule X9 will specify, once explicit directional formatting characters have been used to assign embedding levels to the characters in a paragraph, embedding initiators and PDFs are removed (or virtually removed) from the paragraph. Thus, the embedding levels assigned to the embedding initiators and PDFs themselves are irrelevant. In this, embedding initiators and PDFs differ from isolate initiators and PDIs, which continue to play a part in determining the paragraph’s display order as mentioned above.

BD12. The directional isolate status is a Boolean value set by using isolate formatting characters: it is true when the current embedding level was started by an isolate initiator.

BD13. An isolating run sequence is a maximal sequence of level runs such that for all level runs except the last one in the sequence, the last character of the run is an isolate initiator whose matching PDI is the first character of the next level run in the sequence. It is maximal in the sense that if the first character of the first level run in the sequence is a PDI, it must not match any isolate initiator, and if the last character of the last level run in the sequence is an isolate initiator, it must not have a matching PDI.

The set of isolating run sequences in a paragraph can be computed by the following algorithm:

Note that:

In the following examples, assume that:

Example 1

Paragraph text: text1·RLE·text2·PDF·RLE·text3·PDF·text4

Level runs:

Resulting isolating run sequences:

Example 2

Paragraph text: text1·RLI·text2·PDI·RLI·text3·PDI·text4

Level runs:

Resulting isolating run sequences:

Example 3

Paragraph text: text1·RLI·text2·LRI·text3·RLE·text4·PDF·text5·PDI·text6·PDI·text7

Level runs:

Resulting isolating run sequences:

As rule X10 will specify, an isolating run sequence is the unit to which the rules following it are applied, and the last character of one level run in the sequence is considered to be immediately followed by the first character of the next level run in the sequence during this phase of the algorithm. Since those rules are based on the characters' implicit bidirectional types, an isolate really does have the same effect on the ordering of the text surrounding it as a neutral character – or, to be more precise, a pair of neutral characters, the isolate initiator and the PDI, which behave in those rules just like neutral characters.

3.1.3 Paired Brackets

The following definitions utilize the normative properties Bidi_Paired_Bracket and Bidi_Paired_Bracket_Type defined in the BidiBrackets.txt file [Data9] of the Unicode Character Database [UCD].

BD14. An opening paired bracket is a character whose Bidi_Paired_Bracket_Type property value is Open and whose current bidirectional character type is ON.

BD15. A closing paired bracket is a character whose Bidi_Paired_Bracket_Type property value is Close and whose current bidirectional character type is ON.

BD16. A bracket pair is a pair of characters consisting of an opening paired bracket and a closing paired bracket such that the Bidi_Paired_Bracket property value of the former or its canonical equivalent equals the latter or its canonical equivalent and which are algorithmically identified at specific text positions within an isolating run sequence. The following algorithm identifies all of the bracket pairs in a given isolating run sequence:

Note that bracket pairs can only occur in an isolating run sequence because they are processed in rule N0 after explicit level resolution. See Section 3.3.2, Explicit Levels and Directions.

Examples of bracket pairs

Text (positions 1 2 3 4 5 6 7 8) | Pairings
a ) b ( c | None
a ( b ] c | None
a ( b ) c | 2-4
a ( b [ c ) d ] | 2-6
a ( b ] c ) d | 2-6
a ( b ) c ) d | 2-4
a ( b ( c ) d | 4-6
a ( b ( c ) d ) | 2-8, 4-6
a ( b { c } d ) | 2-8, 4-6
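The pairings in the table above can be reproduced with a stack-based scan: opening brackets are pushed, and each closing bracket is compared against the stack from the top down, pairing with the nearest opener that expects it and discarding any openers above that one. The sketch below assumes that approach. The `PAIRED` map is a tiny hypothetical stand-in for the Bidi_Paired_Bracket data in BidiBrackets.txt, every bracket is assumed to still have bidirectional type ON, and canonical equivalence and any limit on stack depth are not modeled.

```python
# Hypothetical stand-in for Bidi_Paired_Bracket: opener -> expected closer.
PAIRED = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(PAIRED.values())


def bracket_pairs(chars):
    """Return sorted (open_pos, close_pos) pairs using 1-based positions."""
    stack = []  # entries are (expected closing bracket, position of opener)
    pairs = []
    for pos, ch in enumerate(chars, start=1):
        if ch in PAIRED:
            stack.append((PAIRED[ch], pos))
        elif ch in CLOSERS:
            # Search from the top of the stack toward the bottom.
            for depth in range(len(stack) - 1, -1, -1):
                expected, open_pos = stack[depth]
                if expected == ch:
                    pairs.append((open_pos, pos))
                    del stack[depth:]  # pop through the matched opener
                    break
            # No match: the closer is skipped and the stack is left as is.
    return sorted(pairs)


for sample in ("a)b(c", "a(b[c)d]", "a(b(c)d)"):
    print(sample, bracket_pairs(sample))
```

The samples are written without the column-separating spaces used in the table, so that each character's index matches its numbered position.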

3.1.4 Additional Abbreviations

Table 3 lists additional abbreviations used in the examples and internal character types used in the algorithm.

Table 3. Abbreviations for Examples and Internal Types

Symbol | Description
NI | Neutral or Isolate formatting character (B, S, WS, ON, FSI, LRI, RLI, PDI).
e | The text ordering type (L or R) that matches the embedding level direction (even or odd).
o | The text ordering type (L or R) that matches the direction opposite the embedding level direction (even or odd). Note that o is the opposite of e.
sos | The text ordering type (L or R) assigned to the virtual position before an isolating run sequence.
eos | The text ordering type (L or R) assigned to the virtual position after an isolating run sequence.

 

3.2 Bidirectional Character Types

The normative bidirectional character types for each character are specified in the Unicode Character Database [UCD] and are summarized in Table 4. This is a summary only: there are exceptions to the general scope. For example, certain characters such as U+0CBF KANNADA VOWEL SIGN I are given Type L (instead of NSM) to preserve canonical equivalence.

Table 4. Bidirectional Character Types

Category | Type | Description | General Scope
Strong | L | Left-to-Right | LRM, most alphabetic, syllabic, Han ideographs, non-European or non-Arabic digits, ...
Strong | R | Right-to-Left | RLM, Hebrew alphabet, and related punctuation
Strong | AL | Right-to-Left Arabic | ALM, Arabic, Thaana, and Syriac alphabets, most punctuation specific to those scripts, ...
Weak | EN | European Number | European digits, Eastern Arabic-Indic digits, ...
Weak | ES | European Number Separator | PLUS SIGN, MINUS SIGN
Weak | ET | European Number Terminator | DEGREE SIGN, currency symbols, ...
Weak | AN | Arabic Number | Arabic-Indic digits, Arabic decimal and thousands separators, ...
Weak | CS | Common Number Separator | COLON, COMMA, FULL STOP, NO-BREAK SPACE, ...
Weak | NSM | Nonspacing Mark | Characters with the General_Category values: Mn (Nonspacing_Mark) and Me (Enclosing_Mark)
Weak | BN | Boundary Neutral | Default ignorables, non-characters, and control characters, other than those explicitly given other types.
Neutral | B | Paragraph Separator | PARAGRAPH SEPARATOR, appropriate Newline Functions, higher-level protocol paragraph determination
Neutral | S | Segment Separator | Tab
Neutral | WS | Whitespace | SPACE, FIGURE SPACE, LINE SEPARATOR, FORM FEED, General Punctuation spaces, ...
Neutral | ON | Other Neutrals | All other characters, including OBJECT REPLACEMENT CHARACTER
Explicit Formatting | LRE | Left-to-Right Embedding | LRE
Explicit Formatting | LRO | Left-to-Right Override | LRO
Explicit Formatting | RLE | Right-to-Left Embedding | RLE
Explicit Formatting | RLO | Right-to-Left Override | RLO
Explicit Formatting | PDF | Pop Directional Format | PDF
Explicit Formatting | LRI | Left-to-Right Isolate | LRI
Explicit Formatting | RLI | Right-to-Left Isolate | RLI
Explicit Formatting | FSI | First Strong Isolate | FSI
Explicit Formatting | PDI | Pop Directional Isolate | PDI

 

3.3 Resolving Embedding Levels

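The body of the Bidirectional Algorithm uses bidirectional character types, explicit formatting characters, and bracket pairs to produce a list of resolved levels. This resolution process consists of the following steps:

Determining the paragraph level.
Determining explicit embedding levels and directions.
Resolving weak types.
Resolving neutral and isolate formatting types.
Resolving implicit embedding levels.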

3.3.1 The Paragraph Level

P1. Split the text into separate paragraphs. A paragraph separator (type B) is kept with the previous paragraph. Within each paragraph, apply all the other rules of this algorithm.
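As a rough illustration of P1, the following sketch splits a string after every character whose Bidi_Class is B, so that each separator stays with the paragraph it ends. The function name `split_paragraphs` is hypothetical. Treating CR followed by LF as a single separator is an assumption of this sketch, and paragraph boundaries imposed by a higher-level protocol are not modeled.

```python
import unicodedata


def split_paragraphs(text: str) -> list:
    """Split text into paragraphs, keeping each separator with the text before it."""
    paragraphs = []
    start = 0
    i = 0
    while i < len(text):
        if unicodedata.bidirectional(text[i]) == "B":
            end = i + 1
            # Assumption: keep a CRLF sequence together as one separator.
            if text[i] == "\r" and end < len(text) and text[end] == "\n":
                end += 1
            paragraphs.append(text[start:end])
            start = end
            i = end
        else:
            i += 1
    if start < len(text):
        paragraphs.append(text[start:])  # final paragraph without a separator
    return paragraphs


print(split_paragraphs("first\nsecond\r\nthird\u2029fourth"))
```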

P2. In each paragraph, find the first character of type L, AL, or R while skipping over any characters between an isolate initiator and its matching PDI or, if it has no matching PDI, the end of the paragraph.

Note that:

P3. If a character is found in P2 and it is of type AL or R, then set the paragraph embedding level to one; otherwise, set it to zero.
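Rules P2 and P3 can be sketched as a single scan that tracks how many isolates are currently open and only considers strong characters seen outside any isolate. The function name `paragraph_level` is an assumption of this example; the input is taken to be one paragraph, and Bidi_Class values come from Python's `unicodedata`, so results depend on the interpreter's Unicode version.

```python
import unicodedata

ISOLATE_INITIATORS = {"LRI", "RLI", "FSI"}


def paragraph_level(paragraph: str) -> int:
    """Return 1 if the first strong character outside isolates is R or AL, else 0."""
    open_isolates = 0
    for ch in paragraph:
        bidi_type = unicodedata.bidirectional(ch)
        if bidi_type in ISOLATE_INITIATORS:
            open_isolates += 1
        elif bidi_type == "PDI":
            # A PDI seen outside any isolate matches nothing and is ignored.
            if open_isolates > 0:
                open_isolates -= 1
        elif open_isolates == 0 and bidi_type in ("L", "R", "AL"):
            return 1 if bidi_type in ("R", "AL") else 0
    # No strong character found (or only inside isolates): level zero.
    return 0


print(paragraph_level("abc \u05d0\u05d1"))       # starts with L
print(paragraph_level("123 \u05d0\u05d1 abc"))   # digits are weak; first strong is R
```

An isolate initiator without a matching PDI leaves the counter above zero for the rest of the scan, which has the effect of skipping to the end of the paragraph as P2 requires.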

Whenever a higher-level protocol specifies the paragraph level, rules P2 and P3 may be overridden: see HL1.

3.3.2 Explicit Levels and Directions

This phase of the algorithm determines explicit embedding levels: levels introduced by explicit directional formatting characters (embedding, override, and isolate). This phase tracks how they nest, eventually producing a tagging of different ranges of text with their embedding levels, the levels increasing with deeper nesting.

This is done by applying the explicit level rule X1, which performs a logical pass over the paragraph, applying rules X2 through X8 to each character in turn. The following variables are used during this pass:

Note that there is no need for a valid embedding count in order to tell whether a PDF encountered by the pass matches a valid embedding initiator or nothing at all. That can be decided by checking the directional isolate status of the last entry on the directional status stack and the number of entries on the stack. If the last entry has a true directional isolate status, it is for a directional isolate within whose scope the PDF lies. Since the PDF cannot match an embedding initiator outside that isolate, and there are no embedding entries within the isolate, it matches nothing at all. And if the last entry has a false directional isolate status, but is also the only entry on the stack, it belongs to paragraph level, and thus once again the PDF matches nothing at all.
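The stack check described in this note can be sketched as follows. The `StatusEntry` layout (embedding level, directional override status, directional isolate status) is an assumption about what each entry of the directional status stack holds, and the function name is hypothetical. The sketch covers only the stack inspection discussed here; it does not consider the overflow counts.

```python
from typing import NamedTuple


class StatusEntry(NamedTuple):
    """Assumed shape of one entry on the directional status stack."""
    level: int
    override: str   # "neutral", "L", or "R"
    isolate: bool   # directional isolate status


def pdf_matches_embedding(stack) -> bool:
    """Decide from the stack alone whether a PDF matches an embedding initiator."""
    last = stack[-1]
    if last.isolate:
        # The PDF lies directly inside an isolate with no open embedding in it.
        return False
    if len(stack) == 1:
        # Only the paragraph-level entry is present.
        return False
    return True


paragraph_entry = StatusEntry(0, "neutral", False)
embedding_entry = StatusEntry(1, "neutral", False)
isolate_entry = StatusEntry(3, "neutral", True)

print(pdf_matches_embedding([paragraph_entry]))
print(pdf_matches_embedding([paragraph_entry, embedding_entry]))
print(pdf_matches_embedding([paragraph_entry, embedding_entry, isolate_entry]))
```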

As each character is processed, these variables’ values are modified and the character’s explicit embedding level is set as defined by rules X2 through X8 on the basis of the character’s bidirectional type and the variables’ current values.

X1. At the beginning of a paragraph, perform the following steps:

Explicit Embeddings

X2. With each RLE, perform the following steps:

For example, assuming the overflow counts are both zero, level 0 → 1; levels 1, 2 → 3; levels 3, 4 → 5; and so on. At max_depth or if either overflow count is non-zero, the level remains the same (overflow RLE).

X3. With each LRE, perform the following steps:

For example, assuming the overflow counts are both zero, levels 0, 1 → 2; levels 2, 3 → 4; levels 4, 5 → 6; and so on. At max_depth or max_depth-1 (which, being even, would have to go to max_depth+1) or if either overflow count is non-zero, the level remains the same (overflow LRE).
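The level arithmetic in the two examples above can be sketched as follows. The helper names are hypothetical, and `MAX_DEPTH` is assumed to be 125, the value the annex assigns to max_depth. Only the level computation is shown, not the stack or overflow-counter updates.

```python
MAX_DEPTH = 125  # assumed value of max_depth

def next_odd_level(level: int) -> int:
    """Least odd level greater than `level` (used for RLE, RLO, RLI)."""
    return (level + 1) | 1

def next_even_level(level: int) -> int:
    """Least even level greater than `level` (used for LRE, LRO, LRI)."""
    return (level + 2) & ~1

def level_after_rle(level: int, overflow_isolates: int = 0,
                    overflow_embeddings: int = 0) -> int:
    """Level in effect after an RLE; unchanged if the RLE overflows."""
    new = next_odd_level(level)
    if new <= MAX_DEPTH and overflow_isolates == 0 and overflow_embeddings == 0:
        return new
    return level

def level_after_lre(level: int, overflow_isolates: int = 0,
                    overflow_embeddings: int = 0) -> int:
    """Level in effect after an LRE; unchanged if the LRE overflows."""
    new = next_even_level(level)
    if new <= MAX_DEPTH and overflow_isolates == 0 and overflow_embeddings == 0:
        return new
    return level
```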

Explicit Overrides

An explicit directional override sets the embedding level in the same way the explicit embedding formatting characters do, but also changes the bidirectional character type of affected characters to the override direction.

X4. With each RLO, perform the following steps:

X5. With each LRO, perform the following steps:

Isolates

X5a. With each RLI, perform the following steps:

X5b. With each LRI, perform the following steps:

X5c. With each FSI, apply rules P2 and P3 to the sequence of characters between the FSI and its matching PDI, or if there is no matching PDI, the end of the paragraph, as if this sequence of characters were a paragraph. If these rules decide on paragraph embedding level 1, treat the FSI as an RLI in rule X5a. Otherwise, treat it as an LRI in rule X5b.

Note that the new embedding level is not set to the paragraph embedding level determined by P2 and P3. It goes up by one or two levels, as it would for an LRI or RLI.
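A sketch of the decision in X5c, under the assumption that the input string is a single paragraph. The function names are hypothetical; bidirectional types come from Python's `unicodedata` module. The result only says which rule applies next; it does not compute the new level.

```python
import unicodedata

ISOLATE_INITIATORS = {"LRI", "RLI", "FSI"}

def first_strong_level(text: str) -> int:
    """Apply P2 and P3 to `text` as if it were a paragraph."""
    depth = 0
    for ch in text:
        t = unicodedata.bidirectional(ch)
        if t in ISOLATE_INITIATORS:
            depth += 1
        elif t == "PDI":
            depth = max(depth - 1, 0)
        elif depth == 0 and t in ("L", "AL", "R"):
            return 0 if t == "L" else 1
    return 0

def resolve_fsi(paragraph: str, fsi_index: int) -> str:
    """Return "RLI" or "LRI": how the FSI at `fsi_index` is to be treated."""
    depth = 1
    end = len(paragraph)  # used when there is no matching PDI
    for j in range(fsi_index + 1, len(paragraph)):
        t = unicodedata.bidirectional(paragraph[j])
        if t in ISOLATE_INITIATORS:
            depth += 1
        elif t == "PDI":
            depth -= 1
            if depth == 0:
                end = j  # matching PDI found
                break
    inner = paragraph[fsi_index + 1:end]
    return "RLI" if first_strong_level(inner) == 1 else "LRI"
```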

Non-formatting characters

X6. For all types besides B, BN, RLE, LRE, RLO, LRO, PDF, RLI, LRI, FSI, and PDI:

In other words, if the directional override status of the last entry on the directional status stack is neutral, then characters retain their normal types: Arabic characters stay AL, Latin characters stay L, spaces stay WS, and so on. If the directional override status is right-to-left, then characters become R. If the directional override status is left-to-right, then characters become L.

Note that the current embedding level is not changed by this rule.

Terminating Isolates

A PDI terminates the scope of the isolate initiator it matches. It also terminates the scopes of all embedding initiators within the scope of the matched isolate initiator for which a matching PDF has not been encountered. If it does not match any isolate initiator, it is ignored.

X6a. With each PDI, perform the following steps:

Note that the level assigned to an isolate initiator is always the same as that assigned to the matching PDI.

Terminating Embeddings and Overrides

A PDF terminates the scope of the embedding initiator it matches. If it does not match any embedding initiator, it is ignored.

X7. With each PDF, perform the following steps:

End of Paragraph

X8. All explicit directional embeddings, overrides and isolates are completely terminated at the end of each paragraph.

3.3.3 Preparations for Implicit Processing

The explicit embedding levels that have been assigned to the characters by the preceding rules will be further adjusted (in rules I1–I2) on the basis of the characters' implicit bidirectional types (computed in rules W1–W7, N0–N2). The adjustment made for a given character will then depend on the characters around it. However, this dependency is limited by logically dividing the paragraph into sub-units, and doing the subsequent implicit processing on each unit independently, as shown in the next two steps.

X9. Remove all RLE, LRE, RLO, LRO, PDF, and BN characters.

X10. Perform the following steps:

Here are some examples, each of which is assumed to be a paragraph with base level 0 where no numbered character sequence (text1, text2, and so on) contains explicit directional formatting characters or paragraph separators. The dots in the examples are intended to separate elements for visual clarity; they are not part of the text.

Example 1: text1·RLE·text2·LRE·text3·PDF·text4·PDF·RLE·text5·PDF·text6

Isolating Run Sequence    Embedding Level    sos    eos
text1                     0                  L      R
text2                     1                  R      L
text3                     2                  L      L
text4·text5               1                  L      R
text6                     0                  R      L
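The sos and eos columns of Example 1 can be reproduced with a short sketch. It assumes the simple case of that example, where there are no isolates, so every level run is its own isolating run sequence; the function name is hypothetical. Each boundary type is the direction of the higher of the two levels on either side of the boundary, with the paragraph level used at the paragraph edges.

```python
from typing import List, Tuple

def boundary_types(run_levels: List[int],
                   paragraph_level: int) -> List[Tuple[str, str]]:
    """Return (sos, eos) for each level run, assuming no isolates."""
    def direction(level: int) -> str:
        return "L" if level % 2 == 0 else "R"

    result = []
    for i, level in enumerate(run_levels):
        before = run_levels[i - 1] if i > 0 else paragraph_level
        after = run_levels[i + 1] if i + 1 < len(run_levels) else paragraph_level
        sos = direction(max(before, level))
        eos = direction(max(level, after))
        result.append((sos, eos))
    return result

# Level runs of Example 1: text1, text2, text3, text4·text5, text6
example_1 = boundary_types([0, 1, 2, 1, 0], paragraph_level=0)
```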

Example 2: text1·RLI·text2·LRI·text3·PDI·text4·PDI·RLI·text5·PDI·text6

Isolating Run Sequence         Embedding Level    sos    eos
text1·RLI·PDI·RLI·PDI·text6    0                  L      L
text2·LRI·PDI·text4            1                  R      R
text3                          2                  L      L
text5                          1                  R      R

Example 3: text1·RLE·text2·LRI·text3·RLE·text4·PDI·text5·PDF·text6

Isolating Run Sequence    Embedding Level    sos    eos
text1                     0                  L      R
text2·LRI·PDI·text5       1                  R      R
text3                     2                  L      R
text4                     3                  R      R
text6                     0                  R      L

 

3.3.4 Resolving Weak Types

Weak types are now resolved one isolating run sequence at a time. At isolating run sequence boundaries where the type of the character on the other side of the boundary is required, the type assigned to sos or eos is used.

First, each nonspacing mark is resolved based on the character it follows.

W1. Examine each nonspacing mark (NSM) in the isolating run sequence, and change the type of the NSM to Other Neutral if the previous character is an isolate initiator or PDI, and to the type of the previous character otherwise. If the NSM is at the start of the isolating run sequence, it will get the type of sos. (Note that in an isolating run sequence, an isolate initiator followed by an NSM or any type other than PDI must be an overflow isolate initiator.)

Assume in this example that sos is R:

AL  NSM NSM → AL  AL  AL
sos NSM     → sos R
LRI NSM     → LRI ON
PDI NSM     → PDI ON

The text is next parsed for numbers. This pass will change the directional types European Number Separator, European Number Terminator, and Common Number Separator to be European Number text, Arabic Number text, or Other Neutral text. The text to be scanned may have already had its type altered by directional overrides. If so, then it will not parse as numeric.

W2. Search backward from each instance of a European number until the first strong type (R, L, AL, or sos) is found. If an AL is found, change the type of the European number to Arabic number.

AL EN     → AL AN
AL NI EN  → AL NI AN
sos NI EN → sos NI EN
L NI EN   → L NI EN
R NI EN   → R NI EN

W3. Change all ALs to R.

W4. A single European separator between two European numbers changes to a European number. A single common separator between two numbers of the same type changes to that type.

EN ES EN → EN EN EN
EN CS EN → EN EN EN
AN CS AN → AN AN AN

W5. A sequence of European terminators adjacent to European numbers changes to all European numbers.

ET ET EN → EN EN EN
EN ET ET → EN EN EN
AN ET EN → AN EN EN

W6. All remaining separators and terminators (after the application of W4 and W5) change to Other Neutral.

AN ET    → AN ON
L  ES EN → L  ON EN
EN CS AN → EN ON AN
ET AN    → ON AN

W7. Search backward from each instance of a European number until the first strong type (R, L, or sos) is found. If an L is found, then change the type of the European number to L.

L  NI EN → L  NI  L
R  NI EN → R  NI  EN
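Rules W1 through W7 can be sketched as successive passes over the list of types in one isolating run sequence. This is an illustrative sketch rather than the reference code: the function name is hypothetical, types are plain strings, and the input is assumed to have already had X9 applied (no BN or explicit embedding characters).

```python
from typing import List

ISOLATE_TYPES = {"LRI", "RLI", "FSI", "PDI"}

def resolve_weak_types(types: List[str], sos: str) -> List[str]:
    """Apply W1-W7 to one isolating run sequence; `sos` is "L" or "R"."""
    t = list(types)
    n = len(t)
    # W1: NSM takes the type of the previous character (ON after isolates).
    for i in range(n):
        if t[i] == "NSM":
            prev = t[i - 1] if i > 0 else sos
            t[i] = "ON" if prev in ISOLATE_TYPES else prev
    # W2: EN becomes AN when the nearest preceding strong type is AL.
    last_strong = sos
    for i in range(n):
        if t[i] in ("L", "R", "AL"):
            last_strong = t[i]
        elif t[i] == "EN" and last_strong == "AL":
            t[i] = "AN"
    # W3: AL becomes R.
    t = ["R" if x == "AL" else x for x in t]
    # W4: a single separator between two numbers of the same type.
    for i in range(1, n - 1):
        before, after = t[i - 1], t[i + 1]
        if t[i] == "ES" and before == after == "EN":
            t[i] = "EN"
        elif t[i] == "CS" and before == after and before in ("EN", "AN"):
            t[i] = before
    # W5: a run of ET adjacent to EN becomes EN.
    i = 0
    while i < n:
        if t[i] != "ET":
            i += 1
            continue
        j = i
        while j < n and t[j] == "ET":
            j += 1
        if (i > 0 and t[i - 1] == "EN") or (j < n and t[j] == "EN"):
            t[i:j] = ["EN"] * (j - i)
        i = j
    # W6: remaining separators and terminators become ON.
    t = ["ON" if x in ("ES", "ET", "CS") else x for x in t]
    # W7: EN becomes L when the nearest preceding strong type is L.
    last_strong = sos
    for i in range(n):
        if t[i] in ("L", "R"):
            last_strong = t[i]
        elif t[i] == "EN" and last_strong == "L":
            t[i] = "L"
    return t
```

Note that the function returns the types after all seven rules, so an input such as AL NSM NSM comes back as R R R (W1 followed by W3), not as the intermediate AL AL AL shown for W1 alone.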

3.3.5 Resolving Neutral and Isolate Formatting Types

In the next phase, neutral and isolate formatting (i.e. NI) characters are resolved one isolating run sequence at a time. As a result, all NIs become either R or L. Generally, NIs take on the direction of the surrounding text. In case of a conflict, they take on the embedding direction. At isolating run sequence boundaries where the type of the character on the other side of the boundary is required, the type assigned to sos or eos is used.

Bracket pairs within an isolating run sequence are processed as units so that both the opening and the closing paired bracket in a pair resolve to the same direction.

N0. Process bracket pairs in an isolating run sequence sequentially in the logical order of the text positions of the opening paired brackets using the logic given below. Within this scope, bidirectional types EN and AN are treated as R.

Example 1. Bracket pairs are resolved sequentially in logical order of the opening paired brackets.

(RTL paragraph direction)

Storage                   AB  (          CD  [           &   ef  ]           !   )          gh
Bidi_Class                R   ON         R   ON          ON  L   ON          ON  ON         L
N0 applied (first pair)       N0b: ON→R                                          N0b: ON→R
N0 applied (second pair)                     N0c2: ON→R          N0c2: ON→R
Display                   gh(![ef&]DC)BA

Example 2. Bracket pairs enclosing mixed strong types take the paragraph direction.

(RTL paragraph direction)

Storage       smith      (          fabrikam      ARABIC  )              HEBREW
Bidi_Class    L      WS  ON         L         WS  R       ON         WS  R
N0 applied               N0b: ON→R                        N0b: ON→R
Display       WERBEH (CIBARA fabrikam) smith

Note that in the above example, the resolution of the bracket pairs is stable if the order of smith and HEBREW, or fabrikam and ARABIC, is reversed.

Example 3. Bracket pairs enclosing strong types opposite the embedding direction with additional strong-type context take the direction opposite the embedding direction.

(RTL paragraph direction)

Storage       ARABIC      book  (           s   )
Bidi_Class    R       WS  L     ON          L   ON
N0 applied                      N0c1: ON→L      N0c1: ON→L
Display       book(s) CIBARA

N1. A sequence of NIs takes the direction of the surrounding strong text if the text on both sides has the same direction. European and Arabic numbers act as if they were R in terms of their influence on NIs. The start-of-sequence (sos) and end-of-sequence (eos) types are used at isolating run sequence boundaries.

 L  NI   L  →   L  L   L
 R  NI   R  →   R  R   R
 R  NI  AN  →   R  R  AN
 R  NI  EN  →   R  R  EN
AN  NI   R  →  AN  R   R
AN  NI  AN  →  AN  R  AN
AN  NI  EN  →  AN  R  EN
EN  NI   R  →  EN  R   R
EN  NI  AN  →  EN  R  AN
EN  NI  EN  →  EN  R  EN

N2. Any remaining NIs take the embedding direction.

NI → e

The embedding direction for the given NI character is derived from its embedding level: L if the character is set to an even level, and R if the level is odd. (See BD3.)

Assume in the following example that eos is L and sos is R. Then an application of N1 and N2 yields the following:

L   NI eos → L   L eos
R   NI eos → R   e eos
sos NI L   → sos e L
sos NI R   → sos R R
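Rules N1 and N2 can be sketched as a single pass over runs of NIs. The sketch assumes weak types have already been resolved, so the only remaining types are L, R, EN, AN, and the NI types, and it deliberately leaves out the bracket-pair rule N0. The function name and the string representation of types are assumptions.

```python
from typing import List, Optional

NI_TYPES = {"B", "S", "WS", "ON", "FSI", "LRI", "RLI", "PDI"}

def resolve_neutrals(types: List[str], sos: str, eos: str, level: int) -> List[str]:
    """Apply N1 and N2 (not N0) to one isolating run sequence."""
    embedding_direction = "L" if level % 2 == 0 else "R"

    def as_strong(t: str) -> Optional[str]:
        # Numbers influence NIs as if they were R.
        if t in ("R", "EN", "AN"):
            return "R"
        return "L" if t == "L" else None

    out = list(types)
    n = len(out)
    i = 0
    while i < n:
        if out[i] not in NI_TYPES:
            i += 1
            continue
        j = i
        while j < n and out[j] in NI_TYPES:
            j += 1
        before = as_strong(out[i - 1]) if i > 0 else sos
        after = as_strong(out[j]) if j < n else eos
        if before is not None and before == after:
            resolved = before               # N1
        else:
            resolved = embedding_direction  # N2
        out[i:j] = [resolved] * (j - i)
        i = j
    return out
```

With sos = R and eos = L, the four boundary cases above come out as listed, with e standing for L at an even level and R at an odd level.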

Examples. A list of numbers separated by neutrals and embedded in a directional run will come out in the run’s order.

Storage: he said "THE VALUES ARE 123, 456, 789, OK".
Display: he said "KO ,789 ,456 ,123 ERA SEULAV EHT".

In this case, both the comma and the space between the numbers take on the direction of the surrounding text (uppercase = right-to-left), ignoring the numbers. The commas are not considered part of the number because they are not surrounded on both sides by digits (see Section 3.3.4, Resolving Weak Types). However, if there is a preceding left-to-right sequence, then European numbers will adopt that direction:

Storage: IT IS A bmw 500, OK.
Display: .KO ,bmw 500 A SI TI

3.3.6 Resolving Implicit Levels

In the final phase, the embedding level of text may be increased, based on the resolved character type. Right-to-left text will always end up with an odd level, and left-to-right and numeric text will always end up with an even level. In addition, numeric text will always end up with a higher level than the paragraph level. (Note that it is possible for text to end up at level max_depth+1 as a result of this process.) This results in the following rules:

I1. For all characters with an even (left-to-right) embedding level, those of type R go up one level and those of type AN or EN go up two levels.

I2. For all characters with an odd (right-to-left) embedding level, those of type L, EN or AN go up one level.

Table 5 summarizes the results of the implicit algorithm.

Table 5. Resolving Implicit Levels

Type    Embedding Level
        Even    Odd
L       EL      EL+1
R       EL+1    EL
AN      EL+2    EL+1
EN      EL+2    EL+1
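Table 5 translates directly into a small function. The name `implicit_level` is hypothetical; the input type is assumed to be one of the fully resolved types L, R, AN, or EN.

```python
def implicit_level(resolved_type: str, level: int) -> int:
    """Return the level after rules I1 and I2 for one character."""
    if level % 2 == 0:
        # I1: even (left-to-right) embedding level.
        if resolved_type == "R":
            return level + 1
        if resolved_type in ("AN", "EN"):
            return level + 2
    else:
        # I2: odd (right-to-left) embedding level.
        if resolved_type in ("L", "EN", "AN"):
            return level + 1
    return level
```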

 

3.4 Reordering Resolved Levels

The following rules describe the logical process of finding the correct display order. As opposed to resolution phases, these rules act on a per-line basis and are applied after any line wrapping is applied to the paragraph.

Logically there are the following steps:

L1. On each line, reset the embedding level of the following characters to the paragraph embedding level:

  1. Segment separators,
  2. Paragraph separators,
  3. Any sequence of whitespace characters and/or isolate formatting characters (FSI, LRI, RLI, and PDI) preceding a segment separator or paragraph separator, and
  4. Any sequence of whitespace characters and/or isolate formatting characters (FSI, LRI, RLI, and PDI) at the end of the line.

In combination with the following rule, this means that trailing whitespace will appear at the visual end of the line (in the paragraph direction). Tabulation will always have a consistent direction within a paragraph.
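Rule L1 can be sketched as one backward scan over a line. The function name is hypothetical. The sketch assumes that the types passed in are the characters' original bidirectional types rather than the types resolved by the earlier phases, and that the characters removed by X9 are not present.

```python
from typing import List

WS_OR_ISOLATE = {"WS", "FSI", "LRI", "RLI", "PDI"}

def apply_l1(original_types: List[str], levels: List[int],
             paragraph_level: int) -> List[int]:
    """Return the levels of one line after rule L1."""
    out = list(levels)
    # True while scanning whitespace/isolate characters that sit at the
    # end of the line or immediately before a segment/paragraph separator.
    resetting = True
    for i in range(len(out) - 1, -1, -1):
        t = original_types[i]
        if t in ("S", "B"):
            out[i] = paragraph_level
            resetting = True
        elif t in WS_OR_ISOLATE:
            if resetting:
                out[i] = paragraph_level
        else:
            resetting = False
    return out
```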

L2. From the highest level found in the text to the lowest odd level on each line, including intermediate levels not actually present in the text, reverse any contiguous sequence of characters that are at that level or higher.

This rule reverses a progressively larger series of substrings.

The following examples illustrate the reordering, showing the successive steps in application of Rule L2. The original text is shown in the "Storage" row in the example tables. The invisible, zero-width formatting characters LRI, RLI, and PDI are represented with the symbols >, <, and =, respectively. The application of the rules from Section 3.3, Resolving Embedding Levels, and of Rule L1 results in the resolved levels listed in the "Resolved levels" row. (Since these examples only make use of the isolate formatting characters, Rule X9 does not remove any characters. Note that Example 3 would not work if it used embeddings instead because the two right-to-left phrases would have merged into a single right-to-left run, together with the neutral punctuation in between.) Each successive row thereafter shows one pass of reversal from Rule L2, such as "Reverse levels 1-2". At each iteration, the underlining shows the text that has been reversed.

The paragraph embedding level for the first, second, and third examples is 0 (left-to-right direction), and for the fourth example is 1 (right-to-left direction).

Example 1. (embedding level = 0)

Storage             car means CAR.
Resolved levels     00000000001110
Reverse level 1     car means RAC.
Display             car means RAC.

Example 2. (embedding level = 0)

Storage             <car MEANS CAR.=
Resolved levels     0222111111111110
Reverse level 2     <rac MEANS CAR.=
Reverse levels 1-2  <.RAC SNAEM car=
Display             .RAC SNAEM car

Example 3. (embedding level = 0)

Storage             he said “<car MEANS CAR=.” “<IT DOES=,” she agreed.
Resolved levels     000000000022211111111110000001111111000000000000000
Reverse level 2     he said “<rac MEANS CAR=.” “<IT DOES=,” she agreed.
Reverse levels 1-2  he said “<RAC SNAEM car=.” “<SEOD TI=,” she agreed.
Display             he said “RAC SNAEM car.” “SEOD TI,” she agreed.

Example 4. (embedding level = 1)

Storage             DID YOU SAY ’>he said “<car MEANS CAR=”=‘?
Resolved levels     111111111111112222222222444333333333322111
Reverse level 4     DID YOU SAY ’>he said “<rac MEANS CAR=”=‘?
Reverse levels 3-4  DID YOU SAY ’>he said “<RAC SNAEM car=”=‘?
Reverse levels 2-4  DID YOU SAY ’>”=rac MEANS CAR<“ dias eh=‘?
Reverse levels 1-4  ?‘=he said “<RAC SNAEM car=”>’ YAS UOY DID
Display             ?‘he said “RAC SNAEM car”’ YAS UOY DID
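The successive reversals of Rule L2 shown in these examples can be sketched as follows. The function name is hypothetical; the input is one line of text with one resolved level per character, given here as a digit string in the same form as the "Resolved levels" rows. Formatting characters are kept as the visible stand-ins >, <, and =, so the result corresponds to the last "Reverse levels" row, not to the "Display" row.

```python
def reorder_line(text: str, level_digits: str) -> str:
    """Apply rule L2 to one line; levels are given as one digit per character."""
    chars = list(text)
    levels = [int(d) for d in level_digits]
    if len(chars) != len(levels):
        raise ValueError("need exactly one level per character")
    odd_levels = [lv for lv in levels if lv % 2 == 1]
    if not odd_levels:
        return text  # nothing to reverse
    highest = max(levels)
    lowest_odd = min(odd_levels)
    n = len(chars)
    for level in range(highest, lowest_odd - 1, -1):
        i = 0
        while i < n:
            if levels[i] < level:
                i += 1
                continue
            j = i
            while j < n and levels[j] >= level:
                j += 1
            # Reverse the contiguous run of characters at `level` or higher.
            chars[i:j] = chars[i:j][::-1]
            i = j
    return "".join(chars)

example_2 = reorder_line("<car MEANS CAR.=", "0222111111111110")
```

The level list is never permuted, which is safe because every reversal at a higher level happens inside a run that is reversed as a whole at each lower level.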

L3. Combining marks applied to a right-to-left base character will at this point precede their base character. If the rendering engine expects them to follow the base characters in the final display process, then the ordering of the marks and the base character must be reversed.

Many font designers provide default metrics for combining marks that support rendering by simple overhang. Because of the reordering for right-to-left characters, it is common practice to make the glyphs for most combining characters overhang to the left (thus assuming the characters will be applied to left-to-right base characters) and make the glyphs for combining characters in right-to-left scripts overhang to the right (thus assuming that the characters will be applied to right-to-left base characters). With such fonts, the display ordering of the marks and base glyphs may need to be adjusted when combining marks are applied to “unmatching” base characters. See Section 5.13, Rendering Nonspacing Marks of [Unicode], for more information.

L4. A character is depicted by a mirrored glyph if and only if (a) the resolved directionality of that character is R, and (b) the Bidi_Mirrored property value of that character is Yes.

For example, U+0028 LEFT PARENTHESIS, which is interpreted in the Unicode Standard as an opening parenthesis, appears as “(” when its resolved level is even, and as the mirrored glyph “)” when its resolved level is odd. Note that for backward compatibility the characters U+FD3E ( ﴾ ) ORNATE LEFT PARENTHESIS and U+FD3F ( ﴿ ) ORNATE RIGHT PARENTHESIS are not mirrored.
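The condition in L4 can be sketched with Python's `unicodedata.mirrored`, which reports the Bidi_Mirrored property. The function name is hypothetical, and an odd resolved level is used as the test for resolved directionality R, as in the parenthesis example above. The sketch only decides whether a mirrored glyph is needed; choosing that glyph is left to the rendering system.

```python
import unicodedata

def needs_mirrored_glyph(ch: str, resolved_level: int) -> bool:
    """Return True if `ch` is to be depicted by a mirrored glyph (rule L4)."""
    is_rtl = resolved_level % 2 == 1
    return is_rtl and bool(unicodedata.mirrored(ch))
```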

3.5 Shaping

Cursively connected scripts, such as Arabic or Syriac, require the selection of positional character shapes that depend on adjacent characters (see Section 9.2, Arabic of [Unicode]). Shaping is logically applied after Rule I2 of the Bidirectional Algorithm and is limited to characters within the same level run. (Note that there is no practical difference between limiting shaping to a level run and an isolating run sequence because the isolate initiator and PDI characters are defined to have joining type U, i.e. non-joining. Thus, the characters before and after a directional isolate will not join across the isolate, even if the isolate is empty or overflows the depth limit.) Consider the following example string of Arabic characters, which is represented in memory as characters 1, 2, 3, and 4, and where the first two characters are overridden to be LTR. To show both paragraph directions, the next two are embedded, but with the normal RTL direction.

Position     1       2       3       4
Code point   062C    0639    0644    0645
Name         JEEM    AIN     LAM     MEEM
Direction    L       L       R       R

One can use explicit directional formatting characters to achieve this effect in plain text or use markup in HTML, as in the examples below. (The bold text would be for the right-to-left paragraph direction.)

The resulting shapes will be the following, according to the paragraph direction:

Left-Right Paragraph               Right-Left Paragraph
1       2      4       3           4       3      1       2
JEEM-F  AIN-I  MEEM-F  LAM-I       MEEM-F  LAM-I  JEEM-F  AIN-I

 

3.5.1 Shaping and Line Breaking

The process of breaking a paragraph into one or more lines that fit within particular bounds is outside the scope of the Bidirectional Algorithm. Where character shaping is involved, the width calculations must be based on the shaped glyphs.

Note that the soft-hyphen (SHY) works in cursively connected scripts as it does in other scripts. That is, it indicates a point where the line could be broken in the middle of a word. If the rendering system breaks at that point, the display, including shaping, should be what is appropriate for the given language. For more information on this and other line breaking issues, see Unicode Standard Annex #14, “Line Breaking Properties” [UAX14].

4 Bidirectional Conformance

The Bidirectional Algorithm specifies part of the intrinsic semantics of right-to-left characters and is thus required for conformance to the Unicode Standard where any such characters are displayed.

A process that claims conformance to this specification shall satisfy the following clauses:

UAX9-C1. In the absence of a permissible higher-level protocol, a process that renders text shall display all visible representations of characters (excluding formatting characters) in the order described by Section 3, Basic Display Algorithm, of this annex. In particular, this includes definitions BD1–BD16 and steps P1–P3, X1–X10, W1–W7, N0–N2, I1–I2, and L1–L4.

As is the case for all other Unicode algorithms, this is a logical description: particular implementations can have more efficient mechanisms as long as they produce the same results. See C18 in Chapter 3, Conformance of [Unicode], and the notes following.

UAX9-C2. The only permissible higher-level protocols are those listed in Section 4.3, Higher-Level Protocols. They are HL1, HL2, HL3, HL4, HL5, and HL6.

Note: The use of higher-level protocols introduces interchange problems, since the text may be displayed differently as plain text; see Section 6.5, Conversion to Plain Text. This can have security implications. Higher-level protocols are recommended wherever the semantics of segment order are more significant than those of displayed order, as is the case for source text. For detailed examples for which use of HL4 would be recommended, see Section 4.3.1, HL4 Example 1 for XML and Section 4.3.2, HL4 Example 2 for Program Text. For more information, see Section 4.1, Bidirectional Ordering, in Unicode Technical Standard #55, “Unicode Source Code Handling” [UTS55], as well as Unicode Technical Report #36, “Unicode Security Considerations” [UTR36].

4.1 Boundary Neutrals

The goal in marking a formatting or control character as BN is that it have no effect on the rest of the algorithm. (ZWJ and ZWNJ are exceptions; see X9.) Because conformance does not require the precise ordering of formatting characters with respect to others, implementations can handle them in different ways as long as they preserve the ordering of the other characters.

4.2 Explicit Formatting Characters

As with any Unicode characters, systems do not have to support any particular explicit directional formatting character (although it is not generally useful to include a terminating character without including the initiator). Generally, conforming systems will fall into four classes:

4.3 Higher-Level Protocols

The following clauses are the only permissible ways for systems to apply higher-level protocols to the ordering of bidirectional text. Some of the clauses apply to segments of structured text. This refers to the situation where text is interpreted as being structured, whether with explicit markup such as XML or HTML, or internally structured such as in a word processor or spreadsheet. In such a case, a segment is a span of text that is distinguished in some way by the structure.

HL1. Override P3, and set the paragraph embedding level explicitly. This does not apply when deciding how to treat FSI in rule X5c.

HL2. Override W2, and set EN or AN explicitly.

HL3. Emulate explicit directional formatting characters.

HL4. Apply the Bidirectional Algorithm to segments.

HL5. Provide artificial context.

HL6. Additional mirroring.

Clauses HL1 and HL3 are specialized applications of the more general clauses HL4 and HL5. They are provided here explicitly because they directly correspond to common operations.

4.3.1 HL4 Example 1 for XML

As an example of the application of HL4, suppose an XML document contains the following fragment. (Note: This is a simplified example for illustration: element names, attribute names, and attribute values could all be involved.)

ARABICenglishARABIC<e1 type='ab'>ARABICenglish<e2 type='cd'>english

This can be analyzed as being five different segments:

  1. ARABICenglishARABIC
  2. <e1 type='ab'>
  3. ARABICenglish
  4. <e2 type='cd'>
  5. english

To make the XML file readable as source text, the display in an editor could order these elements all in a uniform direction (for example, all left-to-right) and apply the Bidirectional Algorithm to each field separately. It could also choose to order the element names, attribute names, and attribute values uniformly in the same direction (for example, all left-to-right). For final display, the markup could be ignored, allowing all of the text (segments 1, 3, and 5) to be reordered together.
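The segmentation in this example can be sketched with a regular expression that separates tags from text. This is only an illustration for the simplified fragment above: the function name is hypothetical, and a real editor would use an XML tokenizer rather than a pattern that treats anything between "<" and ">" as markup.

```python
import re
from typing import List

TAG = re.compile(r"(<[^>]*>)")  # simplistic notion of a tag, for illustration

def split_segments(fragment: str) -> List[str]:
    """Split a fragment into alternating text and markup segments."""
    return [segment for segment in TAG.split(fragment) if segment]

fragment = "ARABICenglishARABIC<e1 type='ab'>ARABICenglish<e2 type='cd'>english"
segments = split_segments(fragment)
# For source display, each segment would be laid out separately;
# for final display, only the text segments would be reordered together.
text_only = [s for s in segments if not TAG.fullmatch(s)]
```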

4.3.2 HL4 Example 2 for Program Text

Consider the following two lines:

(1) x + tav == 1
(2) x + תו == 1

Internally, they are the same except that the ASCII identifier tav in line (1) is replaced by the Hebrew identifier תו in line (2). However, with a plain text display (with left-to-right paragraph direction) the user will be misled, thinking that line (2) is a comparison between (x + 1) and תו, whereas it is actually a comparison between (x + תו) and 1. The misleading rendering of (2) occurs because the directionality of the identifier תו influences subsequent weakly-directional tokens, so that the entire sequence “תו == 1” is at a higher resolved level.

This is illustrated in the first row of the following table, wherein characters at a resolved level higher than the embedding level are highlighted. Note that while the RTL display of that expression (second row) is not misleading, as the left-to-right directionality of x does not influence the subsequent text, a similar issue would arise if the terms were swapped (third row).

Paragraph direction    Underlying Representation    Display
LTR                    x + תו == 1                  x +תו == 1
RTL                    x + תו == 1                  x + תו ==1
RTL                    תו + x == 1                  תו +x == 1

It would be better to apply protocol HL4 when displaying these expressions, treating each identifier as a separate segment, thus isolating it from the rest of the source text, and then ordering the segments in a consistent direction, as shown in the following table.

Segment order    Segments       Display
LTR              x + תו == 1    x+תו==1
RTL              x + תו == 1    x+תו==1
RTL              תו + x == 1    תו+x==1

A specification for the application of protocol HL4 to program text is given in Section 4.1, Bidirectional Ordering, in Unicode Technical Standard #55, “Unicode Source Code Handling” [UTS55].

4.3.3 HL4 Example 3 for URLs

When a URL is displayed simply using the BIDI algorithm, the following results are produced. As per convention, uppercase represents RTL letters:

Environment    Display
LTR            http://ab.cd.com/mn/op
               http://ab.cd.HG.FE.com/LK/JI/mn/op
               http://LK/JI/HG.FE
RTL            http://ab.cd.com/mn/op
               mn/op/LK/JI/com.HG.FE.http://ab.cd
               LK/JI/HG.FE//:http

Note that the various fields of the URL can appear to the user in a jumbled order. Moreover, if any of the fields contain mixed bidi text (including digits), part of the contents of a field may flip around a delimiter, as in the following:

Memory positions

Memory pos.    0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16
Character      /   0   א   ב   1   a   b   2   /   3   c   d   4   ו   ד   5   /

Display positions

Display pos.   0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16
Memory pos.    16  15  14  13  4   5   6   7   8   9   10  11  12  3   2   1   0
Character      /   5   ד   ו   1   a   b   2   /   3   c   d   4   ב   א   0   /

The BIDI display process described in Section 4.1, Bidirectional Ordering, in Unicode Technical Standard #55, “Unicode Source Code Handling” [UTS55], can be applied to URLs to remedy this situation.

In applying the rules of that section, the atoms are both the delimiters and the text runs that they delimit. The latter are referred to as literals in the following. Delimiters include both the characters that separate the scheme, host, path, query, and fragment, plus any delimiters within each of those parts, such as query operators. For example:

http://foo.com/dir1/dir2?hl=fr&rl=CA#fii

The atoms are then displayed in monotonic order (RTL or LTR), and each literal is displayed with a paragraph direction equal to that monotonic order. This results in the following orderings:

Environment    Display
LTR            http://ab.cd.com/mn/op
               http://ab.cd.FE.HG.com/JI/LK/mn/op
               http://FE.HG/JI/LK
RTL            op/mn/com.cd.ab//:http
               op/mn/LK/JI/com.HG.FE.cd.ab//:http
               LK/JI/HG.FE//:http

Note: This display process is useful even when the labels follow the rule defined by RFC 5893:
  1. It applies to the entire URL, including the path, query, and fragment parts, which are out of scope for that RFC.
  2. Even within a domain name, the rule specified by RFC 5893 does not suffice to make display order of the sequence of labels consistent with network order, as documented in Section 2 of that RFC.
Being a display process, as opposed to a requirement on the text being displayed, it does not conflict with that RFC.
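The atom-based display described above can be imitated with the uppercase-for-RTL convention of the tables. Everything in this sketch is an assumption made for illustration: the single-character delimiter set, the function names, the treatment of an all-uppercase literal as a pure RTL run, and the memory-order input strings, which were inferred from the displayed rows. Literals with mixed directions would need the full Bidirectional Algorithm.

```python
import re
from typing import List

DELIMITER = re.compile(r"([:/?#&=.])")  # assumed subset of URL delimiters

def split_atoms(url: str) -> List[str]:
    """Split a URL into atoms: single-character delimiters and literals."""
    return [atom for atom in DELIMITER.split(url) if atom]

def display_url(url_in_memory_order: str, order: str) -> str:
    """Lay out atoms in monotonic order; `order` is "ltr" or "rtl"."""
    atoms = split_atoms(url_in_memory_order)
    if order == "rtl":
        atoms.reverse()
    # Uppercase literals stand for RTL text and are shown reversed.
    shown = [atom[::-1] if atom.isupper() else atom for atom in atoms]
    return "".join(shown)

ltr_view = display_url("http://ab.cd.EF.GH.com/IJ/KL/mn/op", "ltr")
rtl_view = display_url("http://ab.cd.EF.GH.com/IJ/KL/mn/op", "rtl")
```

Because the delimiters are single characters, "://" becomes "//:" in the RTL view, as in the table.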

4.4 Bidirectional Conformance Testing

The Unicode Character Database [UCD] includes two files that provide conformance tests for implementations of the Bidirectional Algorithm [Tests9]. One of the test files, BidiTest.txt, comprises exhaustive test sequences of bidirectional types up to a given length, currently 4. The other test file, BidiCharacterTest.txt, contains test sequences of explicit code points, including, for example, bracket pairs. The format of each test file is described in the header of that file.

5 Implementation Notes

5.1 Reference Code

Reference implementations of the Bidirectional Algorithm written in C and in Java are available. The source code can be downloaded from [Code9]. Implementers are encouraged to use these resources to test their implementations.

The reference code is designed to follow the steps of the algorithm without applying any optimizations. An example of an effective optimization is to first test for right-to-left characters and invoke the Bidirectional Algorithm only if they are present. Another example of optimization is in matching bracket pairs. The bidirectional bracket pairs (the characters with Bidi_Paired_Bracket_Type property values Open and Close) constitute a subset of the characters with bidirectional type ON. Conversely, the characters with a bidirectional type distinct from ON have the Bidi_Paired_Bracket_Type property value None. Therefore, lookup of Bidi_Paired_Bracket_Type property values for the identification of bracket pairs can be optimized by restricting the processing to characters whose bidirectional type is ON.
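The first optimization mentioned above, testing for right-to-left characters before running the algorithm, might look like the following sketch. The function name and the exact set of types treated as requiring the algorithm are assumptions; bidirectional types come from Python's `unicodedata` module. A paragraph level forced to right-to-left by a higher-level protocol would also require the algorithm and is not covered by this check.

```python
import unicodedata

# Types assumed to make the full algorithm necessary: right-to-left
# strong types, Arabic numbers, and right-to-left explicit formatting.
RTL_TRIGGER_TYPES = {"R", "AL", "AN", "RLE", "RLO", "RLI"}

def needs_bidi(text: str) -> bool:
    """Return True if `text` contains any character of a triggering type."""
    return any(unicodedata.bidirectional(ch) in RTL_TRIGGER_TYPES for ch in text)
```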

An online demo is also available at [Demo9], which shows the results of the Bidirectional Algorithm, as well as the embedding levels and the rules invoked for each character. Implementers are cautioned when using that online demo that it implements the rules for UBA as of Version 6.2, and has not been updated for the major changes to UBA in Version 6.3 and subsequent versions. The online demo also does not handle supplemental characters gracefully.

5.2 Retaining BNs and Explicit Formatting Characters

Some implementations may wish to retain the explicit directional embedding and override formatting characters and BNs when running the algorithm. In fact, retention of these formatting characters and BNs is important to users who need to display a graphic representation of hidden characters, and who thus need to obtain their visual positions for display.

The following describes how this may be done by implementations that do retain these characters through the steps of the algorithm. Note that this description is an informative implementation guideline; it should provide the same results as the explicit algorithm above, but in case of any deviation the explicit algorithm is the normative statement for conformance.

6 Usage

6.1 Joiners

As described under X9, the zero width joiner and non-joiner affect the shaping of the adjacent characters (those that are adjacent in the original backing-store order) even though those characters may end up being rearranged to be non-adjacent by the Bidirectional Algorithm. To determine the joining behavior of a particular character after applying the Bidirectional Algorithm, there are two main strategies:

6.2 Vertical Text

In the case of vertical line orientation, there are multiple ways to display bidirectional text. Some methods use the Bidirectional Algorithm, and some do not. The Unicode Standard does not specify whether text is presented with horizontal or vertical layout, or for vertical layout whether elements within the line are rotated. That is left up to higher-level protocols. For example, one of the common approaches for vertical line orientation is to rotate all the glyphs uniformly 90° clockwise. The Bidirectional Algorithm is used with this method. While some characters end up ordered from bottom to top, this method can represent a mixture of Arabic and Latin glyphs in the same way as occurs for horizontal line orientation.

Another possible approach is to render the text in a uniform single direction from top to bottom. This method has multiple variations to determine the orientation of characters. One variant uses the Bidirectional Algorithm to determine the level of the text, but then the levels are not used to reorder the text. Instead, the levels are used to determine the rotation of each segment of the text. Sometimes vertical lines follow a vertical baseline in which each character is oriented as normal (with no rotation), with characters ordered from top to bottom whether they are Hebrew, numbers, or Latin. When setting text using the Arabic script in vertical lines, it is more common to employ a horizontal baseline that is rotated by 90° counterclockwise so that the characters are ordered from top to bottom. Latin text and numbers may be rotated 90° clockwise so that those characters are also ordered from top to bottom.

6.3 Formatting

Because of the implicit character types and the heuristics for resolving neutral and numeric directional behavior, the implicit bidirectional ordering will generally produce the correct display without any further work. However, problematic cases may occur when a right-to-left paragraph begins with left-to-right characters, or there are nested segments of different-direction text, or there are weak characters on directional boundaries. In these cases, embeddings or directional marks may be required to get the right display. Part numbers may also require directional overrides.

The most common problematic case is that of neutrals on the boundary of an embedded language. This can be addressed by setting the level of the embedded text correctly. For example, with all the text at level 0 the following occurs:

Memory:  he said "I NEED WATER!", and expired.
Display: he said "RETAW DEEN I!", and expired.

If the exclamation mark is to be part of the Arabic quotation, then the user can select the text I NEED WATER! and explicitly mark it as embedded Arabic, which produces the following result:

Memory:  he said "<RLI>I NEED WATER!<PDI>", and expired.
Display: he said "!RETAW DEEN I", and expired.

However, an often simpler and better method of doing this is to place a right-to-left mark (RLM) after the exclamation mark. Because the exclamation mark is now not on a directional boundary, this produces the correct result. This is the best approach when manually editing text or programmatically generating text meant to be edited, or dealing with an application that simply does not support explicit formatting characters.

Memory:  he said "I NEED WATER!<RLM>", and expired.
Display: he said "!RETAW DEEN I", and expired.
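
The RLM technique above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `close_rtl_run` is hypothetical, the caller is assumed to already know that the text is right-to-left, and the check looks only at the final character.

```python
import unicodedata

RLM = "\u200F"  # U+200F RIGHT-TO-LEFT MARK

def close_rtl_run(rtl_text: str) -> str:
    """Append an RLM when right-to-left text ends in a non-strong character.

    Hypothetical helper: the caller is assumed to know the text is RTL.
    """
    if not rtl_text:
        return rtl_text
    last_class = unicodedata.bidirectional(rtl_text[-1])
    if last_class in ("L", "R", "AL"):
        return rtl_text  # already ends in a strong character
    return rtl_text + RLM

# Hebrew word followed by "!" (logical order), standing in for the quotation.
quote = close_rtl_run("\u05DE\u05D9\u05DD!")
sentence = f'he said "{quote}", and expired.'
```

Because the mark is stateless, later edits to the sentence cannot leave an unmatched formatting character behind.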

This latter approach is preferred because it does not make use of the explicit formatting characters, which can easily get out of sync if not fully supported by editors and other string manipulation. Nevertheless, the explicit formatting characters are absolutely necessary in cases where text of one direction contains text of the opposite direction which itself contains text of the original direction. Such cases are not as rare as one might think, because Latin-script brand names, technical terms, and abbreviations are often written in their original Latin characters when used in non-Latin-script text, including right-to-left text, as in the following:

Memory:  it is called "<RLI>AN INTRODUCTION TO java<PDI>" - $19.95 in hardcover.
Display: it is called "java OT NOITCUDORTNI NA" - $19.95 in hardcover.

Thus, when text is programmatically generated by inserting data into a template, and is not intended for later manual editing, and a particular insert happens to be of the opposite direction to the template's text, it is easiest to wrap the insert in explicit formatting characters (or their markup equivalent) declaring its direction, without analyzing whether it is really necessary to do so, or if the job could be done just with stateless directional marks.

Furthermore, in this common scenario, it is highly recommended to use directional isolate formatting characters as opposed to directional embedding formatting characters (once targeted display platforms are known to support isolates). This is because embeddings affect the surrounding text similarly to a strong character, whereas directional isolates have the effect of a neutral. The embeddings' stronger effect is often difficult to anticipate and is rarely useful. To demonstrate, here is the example above with embeddings instead of isolates:

Memory:  it is called "<RLE>AN INTRODUCTION TO java<PDF>" - $19.95 in hardcover.
Display: it is called "$19.95 - "java OT NOITCUDORTNI NA in hardcover.

This, of course, is not the intended display, and is due to the number “sticking” to the preceding RTL embedding (along with all the neutral characters in between), just as it would “stick” to a preceding RTL character.
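
For an insert whose direction is known, wrapping it in a directional isolate might look like the following Python sketch. The function name `isolate` and the `"ltr"`/`"rtl"` direction strings are assumptions made for illustration. The uppercase text stands in for right-to-left characters, as in the examples above.

```python
LRI = "\u2066"  # U+2066 LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # U+2067 RIGHT-TO-LEFT ISOLATE
PDI = "\u2069"  # U+2069 POP DIRECTIONAL ISOLATE

def isolate(text: str, direction: str) -> str:
    """Wrap an insert of known direction in a directional isolate."""
    if direction == "rtl":
        opener = RLI
    elif direction == "ltr":
        opener = LRI
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    return opener + text + PDI

# Uppercase letters stand in for right-to-left characters.
template = 'it is called "{title}" - $19.95 in hardcover.'
message = template.format(title=isolate("AN INTRODUCTION TO java", "rtl"))
```

The template itself stays free of formatting characters. Only the inserted value carries them, so the isolate and its terminator are always emitted as a pair.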

Directional isolates also offer a solution to the very common case where the direction of the text to be programmatically inserted is not known. Instead of analyzing the characters of the text to be inserted in order to decide whether to use an LRE or RLE (or LRI or RLI, or nothing at all), the software can take the easy way out and always wrap each unknown-direction insert in an FSI and PDI. Thus, an FSI instead of an RLI in the example above would produce the same display. FSI's first-strong heuristic is not infallible, but it will work most of the time even on mixed-script text.
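
As a rough Python illustration, wrapping an unknown-direction insert takes one line, and the first-strong heuristic can be approximated with the standard `unicodedata` module. The function names are hypothetical, and `first_strong_direction` is a simplified approximation for inspection only, not the normative rule. It skips characters nested inside isolates and returns `None` when no strong character is found.

```python
import unicodedata
from typing import Optional

FSI = "\u2068"  # U+2068 FIRST STRONG ISOLATE
PDI = "\u2069"  # U+2069 POP DIRECTIONAL ISOLATE

def wrap_unknown(text: str) -> str:
    """Wrap an insert of unknown direction in FSI ... PDI."""
    return FSI + text + PDI

def first_strong_direction(text: str) -> Optional[str]:
    """Approximate the first-strong heuristic: 'ltr', 'rtl', or None."""
    depth = 0  # nesting depth of isolates opened inside the text
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("LRI", "RLI", "FSI"):
            depth += 1
        elif bidi_class == "PDI":
            if depth > 0:
                depth -= 1
        elif depth == 0:
            if bidi_class == "L":
                return "ltr"
            if bidi_class in ("R", "AL"):
                return "rtl"
    return None  # no strong character outside nested isolates
```

In practice the software only needs `wrap_unknown`, since the display engine applies the heuristic itself. The second function is useful mainly for checking what the heuristic is likely to decide for a given insert.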

Although wrapping inserts in isolates is a useful technique, it is best not to wrap text that is known to contain no opposite-direction characters that are not already wrapped in an isolate. Unnecessary layers of wrapping not only add bulk and complexity; they can also wind up exceeding the depth limit and rendering ineffective the innermost isolates, which can make the text display incorrectly. One very common case of an insert that does not need wrapping is one known to be localized to the context locale, e.g. a translated message with all its inserted values either themselves localized, or wrapped in an isolate.

6.4 Separating Punctuation Marks

A common problem case is where the text really represents a sequence of items with separating punctuation marks, often programmatically concatenated. These separators are often strings of neutral characters. For example, a web page might have the following at the bottom:

advertising programs - business solutions - privacy policy - help - about

This might be built up on the server by concatenating a variable number of strings with " - " as a separator, for example. If all of the text is translated into Arabic or Hebrew and the overall page direction is set to be RTL, then the right result occurs, such as the following:

TUOBA - PLEH - YCILOP YCAVIRP - SNOITULOS SSENISUB - SMARGORP GNISITREVDA

However, suppose that in the translation, there remain some LTR characters. This is not uncommon for company names, product names, technical terms, and so on. If one of the separators is bounded on both sides by LTR characters, then the result will be badly jumbled. For example, suppose that "programs" in the first term and "business" in the second were left in English. Then the result would be

TUOBA - PLEH - YCILOP YCAVIRP - SNOITULOS programs - business GNISITREVDA

The result is a jumble, with the apparent first term being "advertising business" and the second being "programs solutions". The simplest solution for this problem is to include an RLM character in each separator string. That will cause each separator to adopt a right-to-left direction, and produce the correct output:

TUOBA - PLEH - YCILOP YCAVIRP - SNOITULOS business - programs GNISITREVDA

The explicit formatting characters (LRE, RLE, and PDF or LRI, RLI, FSI, and PDI) can be used to achieve the same effect; web pages would use spans with the attributes dir="ltr" or dir="rtl". Each separate field would be embedded, excluding the separators. In general, LRM and RLM are preferred to the explicit formatting characters because their effects are more local in scope, and are more robust than the dir attributes when text is copied. (Ideally programs would convert dir attributes to the corresponding explicit formatting characters when converting to plain text, but that is not generally supported.)
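
A minimal Python sketch of the RLM-in-separator approach follows. The function name `join_rtl_items` is hypothetical, and placing the RLM at the end of the separator is an assumption made for this example; the text above only requires that each separator contain one. The items use the uppercase-for-RTL convention of the examples.

```python
from typing import Iterable

RLM = "\u200F"  # U+200F RIGHT-TO-LEFT MARK

def join_rtl_items(items: Iterable[str], separator: str = " - ") -> str:
    """Join items for a right-to-left page, adding an RLM to each separator."""
    marked_separator = separator + RLM
    return marked_separator.join(items)

# Uppercase stands in for RTL text; "programs" and "business" stay in English.
footer = join_rtl_items([
    "ADVERTISING programs",
    "business SOLUTIONS",
    "PRIVACY POLICY",
    "HELP",
    "ABOUT",
])
```

The separator between "programs" and "business" now contains a strong right-to-left character, so it is no longer bounded on both sides by LTR characters.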

6.5 Conversion to Plain Text

For consistent appearance, when bidirectional text subject to a higher-level protocol is to be converted to Unicode plain text, formatting characters should be inserted to ensure that the display order resulting from the application of the Unicode Bidirectional Algorithm matches that specified by the higher-level protocol. The same principle should be followed whenever text using a higher-level protocol is converted to marked-up text that is unaware of the higher-level protocol. For example, if a higher-level protocol sets the paragraph direction to 1 (R) based on the number of L versus R/AL characters, when converted to plain text the paragraph would be embedded in a bracketing pair of RLE..PDF formatting characters. If the same text were converted to HTML 4.0 the attribute dir="rtl" would be added to the paragraph element.
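
The RLE..PDF and dir="rtl" conversions described above might be sketched in Python as follows. The function names and the integer `level` parameter (0 for L, 1 for R) are assumptions for illustration. Wrapping level-0 paragraphs in LRE..PDF is likewise an assumption, since the text above only describes the right-to-left case.

```python
import html

LRE = "\u202A"  # U+202A LEFT-TO-RIGHT EMBEDDING
RLE = "\u202B"  # U+202B RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # U+202C POP DIRECTIONAL FORMATTING

def paragraph_to_plain_text(text: str, level: int) -> str:
    """Bracket a paragraph so plain-text display keeps the given direction."""
    if level not in (0, 1):
        raise ValueError("paragraph level must be 0 (L) or 1 (R)")
    opener = RLE if level == 1 else LRE  # LRE for level 0 is an assumption
    return opener + text + PDF

def paragraph_to_html(text: str, level: int) -> str:
    """Express the same paragraph direction with the HTML dir attribute."""
    if level not in (0, 1):
        raise ValueError("paragraph level must be 0 (L) or 1 (R)")
    direction = "rtl" if level == 1 else "ltr"
    return f'<p dir="{direction}">{html.escape(text)}</p>'
```

Both functions take the direction as already decided by the higher-level protocol. Neither tries to work it out from the text.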

The display of program text is subject to higher-level protocols; see Section 4.3.2, HL4 Example 2 for Program Text, and Section 4.1, Bidirectional Ordering, in Unicode Technical Standard #55, “Unicode Source Code Handling” [UTS55]. However, in addition to preserving the appearance resulting from higher-level protocols, program text must be converted to plain text in a semantics-preserving way, by inserting characters that are ignored by the compiler. It is recommended that computer languages allow for the insertion of some formatting characters in appropriate locations without changing the meaning of a program; for computer languages that allow this insertion, a procedure is specified for conversion to plain text. See Section 4.1, Whitespace, in Unicode Standard Annex #31, “Unicode Identifiers and Syntax” [UAX31], and Section 5.2, Conversion to Plain Text, in Unicode Technical Standard #55, “Unicode Source Code Handling” [UTS55].

7 Mirroring

The mirrored property is important to ensure that the correct characters are used for the desired semantic. This is of particular importance where the name of a character does not indicate the intended semantic, such as with U+0028 “(” LEFT PARENTHESIS. While the name indicates that it is a left parenthesis, the character really expresses an open parenthesis: the leading character in a parenthetical phrase, not the trailing one.

Some of the characters that do not have the Bidi_Mirrored property may be rendered with mirrored glyphs, according to a higher-level protocol that adds mirroring: see Section 4.3, Higher-Level Protocols, especially HL6. Except in such cases, mirroring must be done according to rule L4, to ensure that the correct character is used to express the intended semantic, and to avoid interoperability and security problems.

Implementing rule L4 calls for mirrored glyphs. These glyphs may not be exact graphical mirror images. For example, clearly an italic parenthesis is not an exact mirror image of another: “(” is not the mirror image of “)”. Instead, mirror glyphs are those acceptable as mirrors within the normal parameters of the font in which they are represented.

In implementation, sometimes pairs of characters are acceptable mirrors for one another, for example, U+0028 “(” LEFT PARENTHESIS and U+0029 “)” RIGHT PARENTHESIS, or U+22E0 “⋠” DOES NOT PRECEDE OR EQUAL and U+22E1 “⋡” DOES NOT SUCCEED OR EQUAL. Other characters, such as U+2231 “∱” CLOCKWISE INTEGRAL, do not have corresponding characters that can be used for acceptable mirrors. The informative BidiMirroring.txt data file [Data9] lists the paired characters with acceptable mirror glyphs. The formal property name for this data in the Unicode Character Database [UCD] is Bidi_Mirroring_Glyph. A comment in the file indicates where the pairs are “best fit”: they should be acceptable in rendering, although ideally the mirrored glyphs may have somewhat different shapes.
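
The distinction between the Bidi_Mirrored property and the paired-character data can be illustrated with a short Python sketch. `unicodedata.mirrored` is the standard-library lookup for Bidi_Mirrored. The `MIRROR_PAIRS` table and the `mirror_candidate` function are hypothetical: the table is a hand-written excerpt holding only the two pairs named above, not the contents of BidiMirroring.txt.

```python
import unicodedata
from typing import Optional

# Hypothetical excerpt: only the pairs named in the text above.
MIRROR_PAIRS = {
    "(": ")",
    ")": "(",
    "\u22E0": "\u22E1",  # DOES NOT PRECEDE OR EQUAL -> DOES NOT SUCCEED OR EQUAL
    "\u22E1": "\u22E0",
}

def mirror_candidate(ch: str) -> Optional[str]:
    """Return a paired character usable as a mirror, if this excerpt has one.

    None means either that the character is not Bidi_Mirrored, or that it is
    mirrored but has no paired character here, in which case a renderer needs
    a dedicated mirrored glyph from the font.
    """
    if not unicodedata.mirrored(ch):
        return None
    return MIRROR_PAIRS.get(ch)
```

A renderer would consult such a table only for characters whose resolved direction is right-to-left, which this sketch leaves to the caller.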

Migration Issues

There are two major enhancements in the Unicode 6.3 version of the UBA:

Implementations of the new directional isolates should see very few compatibility issues; the UBA has been carefully modified to minimize differences for older text written without them. There are a few edge cases near the limit of the number of levels where there are some differences, but those are not likely to be encountered in practice.

With bracket pairs, there may be more changes. The problem is that without knowing (or having good UI access to) the directional marks or embeddings, people have constructed text with the correct visual appearance but incorrect underlying structure (e.g. …[…[…, appearing as …[…]…). The new algorithm catches cases like these, because such malformed sequences of brackets are not matched.

However, there are some cases where older implementations without rule N0 produced the desired appearance, and newer implementations will not. The user feedback on implementations was sufficiently positive that the decision was made to add N0.

There are also incompatibilities arising from some implementations failing to update correctly to previous versions of Unicode, notably the mishandling of solidus, such that "T 1/2" (where T is an Arabic character) appears incorrectly as "2/1 T".

To mitigate compatibility problems, it is strongly recommended that implementations take the following steps:

Section Reorganization

In Unicode 6.3, there was significant reorganization of the text. The following table shows the new and old section numbers.

Unicode 6.3                                                   Unicode 6.2
2.4 Explicit Directional Isolates                             n/a
2.5 Terminating Explicit Directional Isolates                 n/a
2.6 Implicit Directional Marks                                2.4
3.3.3 Preparations for Implicit Processing                    n/a
3.3.4 Resolving Weak Types … 3.3.6 Resolving Implicit Levels  3.3.3 … 3.3.5
6.1 Joiners                                                   5.3
6.2 Vertical Text                                             5.4
6.3 Formatting                                                5.5
6.4 Separating Punctuation Marks                              5.6
6.5 Conversion to Plain Text                                  n/a
Migration Issues                                              5.7

 

Acknowledgments

Mark Davis created the initial version of this annex and maintained the text until 2023. Ken Whistler also maintained the text from 2022–2023. Aharon Lanin and Andrew Glass made substantial additions to Revision 29 (Unicode 6.3.0). Robin Leroy made substantial additions to Revision 46 (Unicode 15.0.0).

Thanks to the following people for their contributions to the Bidirectional Algorithm or for their feedback on earlier versions of this annex: Ahmed Talaat (أحمد طلعت), Alaa Ghoneim (علاء غنيم), Asmus Freytag, Avery Bishop, Ayman Aldahleh (أيمن الدحلة), Behdad Esfahbod (بهداد اسفهبد), Doug Felt, Dwayne Robinson, Eric Mader, Ernest Cline, Gidi Shalom-Bendor (גידי שלום-בן דור), Gilead Almosnino (גלעד אלמוסנינו), Isai Scheinberg, Israel Gidali (ישראל גידלי), Joe Becker, John McConnell, Jonathan Kew, Jonathan Rosenne (יונתן רוזן), Kamal Mansour (كمال منصور), Kenneth Whistler, Khaled Sherif (خالد شريف), Koji Ishii, Laurențiu Iancu, Maha Hassan (مها حسن), Markus Scherer, Martin Dürst, Mati Allouche (מתתיהו אלוש), Michel Suignard, Mike Ksar (ميشيل قصار), Murray Sargent, Paul Nelson, Pedro Navarro, Peter Constable, Rick McGowan, Robert Steen, Roozbeh Pournader (روزبه پورنادر), Solra Bizna, Steve Atkin, and Thomas Milo (تُومَاسْ مِيلُو).

References

For references for this annex, see Unicode Standard Annex #41, “Common References for Unicode Standard Annexes.”

Modifications

The following summarizes modifications from the previous version of this annex.

Revision 50

Previous revisions can be accessed with the “Previous Version” link in the header.


© 1999–2024 Unicode, Inc. This publication is protected by copyright, and permission must be obtained from Unicode, Inc. prior to any reproduction, modification, or other use not permitted by the Terms of Use. Specifically, you may make copies of this publication and may annotate and translate it solely for personal or internal business purposes and not for public distribution, provided that any such permitted copies and modifications fully reproduce all copyright and other legal notices contained in the original. You may not make copies of or modifications to this publication for public distribution, or incorporate it in whole or in part into any product or publication without the express written permission of Unicode.

Use of all Unicode Products, including this publication, is governed by the Unicode Terms of Use. The authors, contributors, and publishers have taken care in the preparation of this publication, but make no express or implied representation or warranty of any kind and assume no responsibility or liability for errors or omissions or for consequential or incidental damages that may arise therefrom. This publication is provided “AS-IS” without charge as a convenience to users.

Unicode and the Unicode Logo are registered trademarks of Unicode, Inc., in the United States and other countries.

