
8 Language information and text direction

Contents

  1. Specifying the language of content: the lang attribute
    1. Language codes
    2. Inheritance of language codes
    3. Interpretation of language codes
  2. Specifying the direction of text and tables: the dir attribute
    1. Introduction to the bidirectional algorithm
    2. Inheritance of text direction information
    3. Setting the direction of embedded text
    4. Overriding the bidirectional algorithm: the BDO element
    5. Character references for directionality and joining control
    6. The effect of style sheets on bidirectionality

This section of the document discusses two important issues that affect the internationalization of HTML: specifying the language (the lang attribute) and direction (the dir attribute) of text in a document.

8.1 Specifying the language of content: the lang attribute

Attribute definitions
lang = language-code [CI]
This attribute specifies the base language of an element's attribute values and text content. The default value of this attribute is unknown.

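Language information specified via the lang attribute may be used by a user agent to control rendering in a variety of ways. Some situations where author-supplied language information may be helpful include:
  • Assisting search engines
  • Assisting speech synthesizers
  • Helping a user agent select glyph variants for high quality typography
  • Helping a user agent choose a set of quotation marks
  • Helping a user agent make decisions about hyphenation, ligatures, and spacing
  • Assisting spell checkers and grammar checkers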

The lang attribute specifies the language of element content and attribute values; whether it is relevant for a given attribute depends on the syntax and semantics of the attribute and the operation involved.

The intent of the lang attribute is to allow user agents to render content more meaningfully based on accepted cultural practice for a given language. This does not imply that user agents should render characters that are atypical for a particular language in less meaningful ways; user agents must make a best attempt to render all characters, regardless of the value specified by lang.

For instance, if characters from the Greek alphabet appear in the midst of English text:

<P><Q lang="en">Her super-powers were the result of &gamma;-radiation,</Q> he explained.</P>

a user agent (1) should try to render the English content in an appropriate manner (e.g., in its handling of the quotation marks) and (2) must make a best attempt to render γ even though it is not an English character.

Please consult the section on undisplayable characters for related information.

8.1.1 Language codes

The lang attribute's value is a language code that identifies a natural language spoken, written, or otherwise used for the communication of information among people. Computer languages are explicitly excluded from language codes.

[RFC1766] defines and explains the language codes that must be used in HTML documents.

Briefly, language codes consist of a primary code and a possibly empty series of subcodes:

        language-code = primary-code ( "-" subcode )*

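Here are some sample language codes:
  • "en": English
  • "en-US": the U.S. version of English.
  • "en-cockney": the Cockney version of English.
  • "i-navajo": the Navajo language spoken by some Native Americans.
  • "x-klingon": the primary tag "x" indicates an experimental language tag.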

Two-letter primary codes are reserved for [ISO639] language abbreviations. Two-letter codes include fr (French), de (German), it (Italian), nl (Dutch), el (Greek), es (Spanish), pt (Portuguese), ar (Arabic), he (Hebrew), ru (Russian), zh (Chinese), ja (Japanese), hi (Hindi), ur (Urdu), and sa (Sanskrit).

Any two-letter subcode is understood to be an [ISO3166] country code.
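
The grammar above can be sketched in a few lines of Python. This is a minimal illustration, not a full [RFC1766] validator: the function name parse_language_code is hypothetical, and the limit of one to eight ASCII letters per token is taken from RFC 1766 rather than from the text of this section.

```python
import re

# One token of a language code. Assumption: 1-8 ASCII letters per token,
# as in the RFC 1766 grammar; this section itself does not state the limit.
_TOKEN = re.compile(r"[A-Za-z]{1,8}")


def parse_language_code(code):
    """Split a language code into (primary_code, [subcodes]), lowercased."""
    tokens = code.split("-")
    if not all(_TOKEN.fullmatch(t) for t in tokens):
        raise ValueError(f"not a valid language code: {code!r}")
    # The lang attribute is case-insensitive, so normalise to lower case.
    tokens = [t.lower() for t in tokens]
    return tokens[0], tokens[1:]


print(parse_language_code("en-US"))  # ('en', ['us'])
print(parse_language_code("fr"))     # ('fr', [])
```

Returning the subcodes as a list keeps the "possibly empty series" of the grammar explicit: a bare primary code simply yields an empty list.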

8.1.2 Inheritance of language codes

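An element inherits language code information according to the following order of precedence (highest to lowest):
  • The lang attribute set for the element itself.
  • The closest parent element that has the lang attribute set (i.e., the lang attribute is inherited).
  • The HTTP "Content-Language" header (which may be configured in a server).
  • User agent default values and user preferences.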

In this example, the primary language of the document is French ("fr"). One paragraph is declared to be in Spanish ("es"), after which the primary language returns to French. The following paragraph includes an embedded Japanese ("ja") phrase, after which the primary language returns to French.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<HTML lang="fr">
<HEAD>
<TITLE>Un document multilingue</TITLE>
</HEAD>
<BODY>
...Interpreted as French...
<P lang="es">...Interpreted as Spanish...
<P>...Interpreted as French again...
<P>...French text interrupted by<EM lang="ja">some
      Japanese</EM>French begins here again...
</BODY>
</HTML>

Note. Table cells may inherit lang values not from their parent but from the first cell in a span. Please consult the section on alignment inheritance for details.
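
The lookup for the multilingual example above can be sketched as a walk up the element tree. The Element class and the effective_lang function are hypothetical names used only for illustration, and the fallback parameter is an assumed stand-in for whatever applies when no lang attribute is set anywhere in the markup. The sketch ignores the table-cell case mentioned in the note.

```python
class Element:
    """Hypothetical minimal element node: a name, an optional lang, a parent."""

    def __init__(self, name, lang=None, parent=None):
        self.name = name
        self.lang = lang
        self.parent = parent


def effective_lang(element, fallback=None):
    """Return the nearest lang value, checking the element, then its ancestors."""
    node = element
    while node is not None:
        if node.lang is not None:
            return node.lang
        node = node.parent
    # Nothing set in the markup: use whatever the caller supplies.
    return fallback


# The structure of the example document.
html = Element("HTML", lang="fr")
body = Element("BODY", parent=html)
p_es = Element("P", lang="es", parent=body)
p_fr = Element("P", parent=body)
em_ja = Element("EM", lang="ja", parent=p_fr)

print(effective_lang(p_es))   # es
print(effective_lang(p_fr))   # fr
print(effective_lang(em_ja))  # ja
```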

8.1.3 Interpretation of language codes

In the context of HTML, a language code should be interpreted by user agents as a hierarchy of tokens rather than a single token. When a user agent adjusts rendering according to language information (say, by comparing style sheet language codes and lang values), it should always favor an exact match, but should also consider matching primary codes to be sufficient. Thus, if the lang attribute value of "en-US" is set for the HTML element, a user agent should prefer style information that matches "en-US" first, then the more general value "en".

Note. Language code hierarchies do not guarantee that all languages with a common prefix will be understood by those fluent in one or more of those languages. They do allow a user to request this commonality when it is true for that user.
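
One way to read the matching rule above is "exact match first, then the primary code alone". The following sketch assumes that reading; best_style_match is a hypothetical helper, and the list of available codes stands in for the language codes a style sheet happens to provide rules for.

```python
def best_style_match(lang, available):
    """Pick the entry of `available` that best matches `lang`.

    An exact (case-insensitive) match wins; otherwise an entry equal to
    the primary code alone is accepted; otherwise None is returned.
    """
    wanted = lang.lower()
    by_lower = {code.lower(): code for code in available}
    if wanted in by_lower:
        return by_lower[wanted]
    primary = wanted.split("-")[0]
    return by_lower.get(primary)


print(best_style_match("en-US", ["en", "en-US"]))  # en-US
print(best_style_match("en-US", ["en", "fr"]))     # en
print(best_style_match("en-US", ["fr"]))           # None
```

The sketch deliberately does not treat "en-GB" as a match for "en-US"; whether sibling codes that share a primary code should match is left open here.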

8.2 Specifying the direction of text and tables: the dir attribute

Attribute definitions

dir = LTR | RTL [CI]
This attribute specifies the base direction of directionally neutral text (i.e., text that doesn't have inherent directionality as defined in [UNICODE]) in an element's content and attribute values. It also specifies the directionality of tables. Possible values:
  • LTR: Left-to-right text or table.
  • RTL: Right-to-left text or table.

In addition to specifying the language of a document with the lang attribute, authors may need to specify the base directionality (left-to-right or right-to-left) of portions of a document's text, of table structure, etc. This is done with the dir attribute.

The [UNICODE] specification assigns directionality to characters and defines a (complex) algorithm for determining the proper directionality of text. If a document does not contain a displayable right-to-left character, a conforming user agent is not required to apply the [UNICODE] bidirectional algorithm. If a document contains right-to-left characters, and if the user agent displays these characters, the user agent must use the bidirectional algorithm.

Although Unicode specifies special characters that deal with text direction, HTML offers higher-level markup constructs that do the same thing: the dir attribute (do not confuse with the DIR element) and the BDO element. Thus, to express a Hebrew quotation, it is more intuitive to write

<Q lang="he" dir="rtl">...a Hebrew quotation...</Q>

than the equivalent with Unicode references:

&#x202B;&#x05F4;...a Hebrew quotation...&#x05F4;&#x202C;

User agents must not use the lang attribute to determine text directionality.

The dir attribute is inherited and may be overridden. Please consult the section on the inheritance of text direction information for details.

8.2.1 Introduction to the bidirectional algorithm

The following example illustrates the expected behavior of the bidirectional algorithm. It involves English, a left-to-right script, and Hebrew, a right-to-left script.

Consider the following example text:

  english1 HEBREW2 english3 HEBREW4 english5 HEBREW6

The characters in this example (and in all related examples) are stored in the computer the way they are displayed here: the first character in the file is "e", the second is "n", and the last is "6".

Suppose the predominant language of the document containing this paragraph is English. This means that the base direction is left-to-right. The correct presentation of this line would be:

english1 2WERBEH english3 4WERBEH english5 6WERBEH
         <------          <------          <------
            H                H                H
------------------------------------------------->
                       E

The dotted lines indicate the structure of the sentence: English predominates and some Hebrew text is embedded. Achieving the correct presentation requires no additional markup since the Hebrew fragments are reversed correctly by user agents applying the bidirectional algorithm.

If, on the other hand, the predominant language of the document is Hebrew, the base direction is right-to-left. The correct presentation is therefore:

6WERBEH english5 4WERBEH english3 2WERBEH english1
        ------->         ------->         ------->
            E                E                E
<-------------------------------------------------
                       H

In this case, the whole sentence has been presented as right-to-left and the embedded English sequences have been properly reversed by the bidirectional algorithm.
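
The two presentations above can be reproduced with a toy model. This is emphatically not the [UNICODE] bidirectional algorithm: it only follows the convention of these examples, in which upper-case words stand for Hebrew and lower-case words for English, and it handles a single level of embedding. The names is_rtl_word and toy_visual_order are hypothetical.

```python
from itertools import groupby


def is_rtl_word(word):
    # Toy convention from the examples: upper-case words stand for Hebrew.
    return any(ch.isupper() for ch in word)


def toy_visual_order(text, base="ltr"):
    """Reorder space-separated words for display under the toy convention."""
    runs = []
    # Group consecutive words that share a direction into one run.
    for rtl, words in groupby(text.split(" "), key=is_rtl_word):
        run = " ".join(words)
        # A right-to-left run is displayed with its characters reversed.
        runs.append(run[::-1] if rtl else run)
    if base == "rtl":
        # With a right-to-left base, the runs are laid out right to left.
        runs.reverse()
    return " ".join(runs)


stored = "english1 HEBREW2 english3 HEBREW4 english5 HEBREW6"
print(toy_visual_order(stored, "ltr"))
print(toy_visual_order(stored, "rtl"))
```

The stored text is the same in both calls; only the base direction changes, which is the point the two diagrams make.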

8.2.2 Inheritance of text direction information

The Unicode bidirectional algorithm requires a base text direction for text blocks. To specify the base direction of a block-level element, set the element's dir attribute. The default value of the dir attribute is "ltr" (left-to-right text).

When the dir attribute is set for a block-level element, it remains in effect for the duration of the element and any nested block-level elements. Setting the dir attribute on a nested element overrides the inherited value.

To set the base text direction for an entire document, set the dir attribute on the HTML element.

For example:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<HTML dir="RTL">
<HEAD>
<TITLE>...a right-to-left title...</TITLE>
</HEAD>
...right-to-left text...
<P dir="ltr">...left-to-right text...</P>
<P>...right-to-left text again...</P>
</HTML>

Inline elements, on the other hand, do not inherit the dir attribute. This means that an inline element without a dir attribute does not open an additional level of embedding with respect to the bidirectional algorithm. (Here, an element is considered to be block-level or inline based on its default presentation. Note that the INS and DEL elements can be block-level or inline depending on their context.)
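
The rules of this section can be sketched as two small functions: one that resolves the base direction of a block by walking up to the nearest dir attribute (defaulting to "ltr"), and one that says whether an inline element opens a new embedding level. The Node class and both function names are hypothetical, and the block flag is assumed to be supplied by the caller.

```python
class Node:
    """Hypothetical element: name, optional dir value, parent, block-level flag."""

    def __init__(self, name, direction=None, parent=None, block=True):
        self.name = name
        self.direction = direction
        self.parent = parent
        self.block = block


def base_direction(node):
    """Nearest dir value on the node or an ancestor; "ltr" if none is set."""
    while node is not None:
        if node.direction is not None:
            # dir values are case-insensitive, so normalise them.
            return node.direction.lower()
        node = node.parent
    return "ltr"


def opens_embedding(node):
    """An inline element opens a new embedding level only if it has its own dir."""
    return (not node.block) and node.direction is not None


html = Node("HTML", direction="RTL")
p_ltr = Node("P", direction="ltr", parent=html)
p_plain = Node("P", parent=html)
em = Node("EM", parent=p_plain, block=False)
span = Node("SPAN", direction="RTL", parent=p_ltr, block=False)

print(base_direction(p_ltr), base_direction(p_plain))  # ltr rtl
print(opens_embedding(em), opens_embedding(span))      # False True
```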

8.2.3 Setting the direction of embedded text

The [UNICODE] bidirectional algorithm automatically reverses embedded character sequences according to their inherent directionality (as illustrated by the previous examples). However, in general only one level of embedding can be accounted for. To achieve additional levels of embedded direction changes, you must make use of the dir attribute on an inline element.

Consider the same example text as before:

english1 HEBREW2 english3 HEBREW4 english5 HEBREW6

Suppose the predominant language of the document containing this paragraph is English. Furthermore, the above English sentence contains a Hebrew section extending from HEBREW2 through HEBREW4 and the Hebrew section contains an English quotation (english3). The desired presentation of the text is thus:

english1 4WERBEH english3 2WERBEH english5 6WERBEH
                 ------->
                    E
         <-----------------------
                    H
------------------------------------------------->
                    E

To achieve two embedded direction changes, we must supply additional information, which we do by delimiting the second embedding explicitly. In this example, we use the SPAN element and the dir attribute to mark up the text:

english1 <SPAN dir="RTL">HEBREW2 english3 HEBREW4</SPAN> english5 HEBREW6

Authors may also use special Unicode characters to achieve multiple embedded direction changes. To achieve left-to-right embedding, surround embedded text with the characters LEFT-TO-RIGHT EMBEDDING ("LRE", hexadecimal 202A) and POP DIRECTIONAL FORMATTING ("PDF", hexadecimal 202C). To achieve right-to-left embedding, surround embedded text with the characters RIGHT-TO-LEFT EMBEDDING ("RLE", hexadecimal 202B) and PDF.
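
As a small illustration of the character-based approach, the sketch below wraps a string in the embedding characters named above. The embed helper is hypothetical; the code points are those given in the text, and the standard unicodedata module is used only to print their official names.

```python
import unicodedata

LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202B"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def embed(text, direction):
    """Wrap text in an embedding of the given direction ("ltr" or "rtl")."""
    opener = {"ltr": LRE, "rtl": RLE}[direction.lower()]
    return opener + text + PDF


marked = embed("HEBREW2 english3 HEBREW4", "rtl")
print(unicodedata.name(marked[0]))   # RIGHT-TO-LEFT EMBEDDING
print(unicodedata.name(marked[-1]))  # POP DIRECTIONAL FORMATTING
```

Every opener is paired with exactly one PDF, which mirrors the start tag and end tag of the equivalent SPAN markup.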

Using HTML directionality markup with Unicode characters. Authors and designers of authoring software should be aware that conflicts can arise if the dir attribute is used on inline elements (including BDO) concurrently with the corresponding [UNICODE] formatting characters. Preferably one or the other should be used exclusively. The markup method offers a better guarantee of document structural integrity and alleviates some problems when editing bidirectional HTML text with a simple text editor, but some software may be more apt at using the [UNICODE] characters. If both methods are used, great care should be exercised to ensure proper nesting of markup and directional embedding or override; otherwise, rendering results are undefined.

8.2.4 Overriding the bidirectional algorithm: the BDO element

<!ELEMENT BDO - - (%inline;)*          -- I18N BiDi over-ride -->
<!ATTLIST BDO
  %coreattrs;                          -- id, class, style, title --
  lang        %LanguageCode; #IMPLIED  -- language code --
  dir         (ltr|rtl)      #REQUIRED -- directionality --
  >

Start tag: required, End tag: required

Attribute definitions

dir = LTR | RTL [CI]
This mandatory attribute specifies the base direction of the element's text content. This direction overrides the inherent directionality of characters as defined in [UNICODE]. Possible values:
  • LTR: Left-to-right text.
  • RTL: Right-to-left text.

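Attributes defined elsewhere
  • id, class (document-wide identifiers)
  • lang (language information)
  • style (inline style information)
  • title (element title)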

The bidirectional algorithm and the dir attribute generally suffice to manage embedded direction changes. However, some situations may arise when the bidirectional algorithm results in incorrect presentation. The BDO element allows authors to turn off the bidirectional algorithm for selected fragments of text.

Consider a document containing the same text as before:

english1 HEBREW2 english3 HEBREW4 english5 HEBREW6

but assume that this text has already been put in visual order. One reason for this may be that the MIME standard ([RFC2045], [RFC1556]) favors visual order, i.e., that right-to-left character sequences are inserted right-to-left in the byte stream. In an email, the above might be formatted, including line breaks, as:

english1 2WERBEH english3
4WERBEH english5 6WERBEH

This conflicts with the [UNICODE] bidirectional algorithm, because that algorithm would invert 2WERBEH, 4WERBEH, and 6WERBEH a second time, displaying the Hebrew words left-to-right instead of right-to-left.

The solution in this case is to override the bidirectional algorithm by putting the email excerpt in a PRE element (to conserve line breaks) and each line in a BDO element, whose dir attribute is set to LTR:

<PRE>
<BDO dir="LTR">english1 2WERBEH english3</BDO>
<BDO dir="LTR">4WERBEH english5 6WERBEH</BDO>
</PRE>

This tells the bidirectional algorithm "Leave me left-to-right!" and would produce the desired presentation:

english1 2WERBEH english3
4WERBEH english5 6WERBEH

The BDO element should be used in scenarios where absolute control over sequence order is required (e.g., multi-language part numbers). The dir attribute is mandatory for this element.

Authors may also use special Unicode characters to override the bidirectional algorithm: LEFT-TO-RIGHT OVERRIDE (hexadecimal 202D) or RIGHT-TO-LEFT OVERRIDE (hexadecimal 202E). The POP DIRECTIONAL FORMATTING (hexadecimal 202C) character ends either bidirectional override.

Note. Recall that conflicts can arise if the dir attribute is used on inline elements (including BDO) concurrently with the corresponding [UNICODE] formatting characters.

Bidirectionality and character encoding. According to [RFC1555] and [RFC1556], there are special conventions for the use of "charset" parameter values to indicate bidirectional treatment in MIME mail, in particular to distinguish between visual, implicit, and explicit directionality. The parameter value "ISO-8859-8" (for Hebrew) denotes visual encoding, "ISO-8859-8-i" denotes implicit bidirectionality, and "ISO-8859-8-e" denotes explicit directionality.

Because HTML uses the Unicode bidirectionality algorithm, conforming documents encoded using ISO 8859-8 must be labeled as "ISO-8859-8-i". Explicit directional control is also possible with HTML, but cannot be expressed with ISO 8859-8, so "ISO-8859-8-e" should not be used.

The value "ISO-8859-8" implies that the document is formatted visually, misusing some markup (such as TABLE with right alignment and no line wrapping) to ensure reasonable display on older user agents that do not handle bidirectionality. Such documents do not conform to the present specification. If necessary, they can be made to conform to the current specification (and at the same time will be displayed correctly on older user agents) by adding BDO markup where necessary. Contrary to what is said in [RFC1555] and [RFC1556], ISO-8859-6 (Arabic) is not visual ordering.
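
The labelling rules above amount to a small lookup table. The sketch below records them as data; the names BIDI_TREATMENT, bidi_treatment, and is_conforming_hebrew_label are hypothetical, and comparing labels case-insensitively is an assumption of this sketch.

```python
# Bidirectional treatment implied by each "charset" label for Hebrew.
BIDI_TREATMENT = {
    "iso-8859-8": "visual",
    "iso-8859-8-i": "implicit",
    "iso-8859-8-e": "explicit",
}


def bidi_treatment(charset):
    """Return "visual", "implicit", or "explicit"; None for other labels."""
    # Assumption: labels are compared without regard to case or padding.
    return BIDI_TREATMENT.get(charset.strip().lower())


def is_conforming_hebrew_label(charset):
    """Only the implicit label fits documents that rely on the bidi algorithm."""
    return bidi_treatment(charset) == "implicit"


for label in ("ISO-8859-8", "ISO-8859-8-i", "ISO-8859-8-e"):
    print(label, bidi_treatment(label), is_conforming_hebrew_label(label))
```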

8.2.5 Character references for directionality and joining control

Since ambiguities sometimes arise as to the directionality of certain characters (e.g., punctuation), the [UNICODE] specification includes characters to enable their proper resolution. Also, Unicode includes some characters to control joining behavior where this is necessary (e.g., some situations with Arabic letters). HTML 4 includes character references for these characters.

The following DTD excerpt presents some of the directional entities:

   <!ENTITY zwnj CDATA "&#8204;"--=zero width non-joiner-->
   <!ENTITY zwj  CDATA "&#8205;"--=zero width joiner-->
   <!ENTITY lrm  CDATA "&#8206;"--=left-to-right mark-->
   <!ENTITY rlm  CDATA "&#8207;"--=right-to-left mark-->

The zwnj entity is used to block joining behavior in contexts where joining will occur but shouldn't. The zwj entity does the opposite; it forces joining when it wouldn't occur but should. For example, the Arabic letter "HEH" is used to abbreviate "Hijri", the name of the Islamic calendar system. Since the isolated form of "HEH" looks like the digit five as employed in Arabic script (based on Indic digits), the initial form of "HEH" is used to avoid mistaking "HEH" for a final digit five in a year. However, there is no following context (i.e., a joining letter) to which the "HEH" can join. The zwj character provides that context.

Similarly, in Persian texts, there are cases where a letter that normally would join a subsequent letter in a cursive connection should not. The character zwnj is used to block joining in such cases.

The other characters, lrm and rlm, are used to force directionality of directionally neutral characters. For example, if a double quotation mark comes between an Arabic (right-to-left) and a Latin (left-to-right) letter, the direction of the quotation mark is not clear (is it quoting the Arabic text or the Latin text?). The lrm and rlm characters have a directional property but no width and no word/line break property. Please consult [UNICODE] for more details.
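
The four character references can be checked against their code points with Python's standard library. This is only a quick inspection aid: html.unescape resolves names from the HTML5 entity table, which is assumed here to agree with the HTML 4 definitions of these four names, and unicodedata reports each character's Unicode name and bidirectional class.

```python
import html
import unicodedata

for name in ("zwnj", "zwj", "lrm", "rlm"):
    ch = html.unescape(f"&{name};")
    # Show the entity name, its code point, and the Unicode character name.
    print(name, f"U+{ord(ch):04X}", unicodedata.name(ch))

# lrm and rlm carry a strong direction of their own: "L" and "R".
print(unicodedata.bidirectional(html.unescape("&lrm;")))
print(unicodedata.bidirectional(html.unescape("&rlm;")))
```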

Mirrored character glyphs. In general, the bidirectional algorithm does not mirror character glyphs but leaves them unaffected. Characters such as parentheses are an exception (see [UNICODE], table 4-7). In cases where mirroring is desired, for example for Egyptian Hieroglyphs, Greek Boustrophedon, or special design effects, this should be controlled with styles.

8.2.6 The effect of style sheets on bidirectionality

In general, using style sheets to change an element's visual rendering from block-level to inline or vice-versa is straightforward. However, because the bidirectional algorithm relies on the inline/block-level distinction, special care must be taken during the transformation.

When an inline element that does not have a dir attribute is transformed to the style of a block-level element by a style sheet, it inherits the dir attribute from its closest parent block element to define the base direction of the block.

When a block element that does not have a dir attribute is transformed to the style of an inline element by a style sheet, the resulting presentation should be equivalent, in terms of bidirectional formatting, to the formatting obtained by explicitly adding a dir attribute (assigned the inherited value) to the transformed element.

