| Unicode | |
|---|---|
| Number | 210 |
| Permalink | google.aip.dev/210 |
| State | Approved |
| Created | 2018-08-20 |
| Updated | 2018-08-20 |
AIP-210: Unicode
APIs should be consistent on how they explain, limit, and bill for string values and their encodings. This ranges from little ambiguities (like fields "limited to 1024 characters") all the way to billing confusion (are names and values of properties in Datastore billed based on characters or bytes?).
In general, if we talk about limits measured in bytes, we are discriminating against non-ASCII text since it takes up more space. On the other hand, if we talk about "characters", we are ambiguous about whether those are Unicode "code points", "code units" for a particular encoding (e.g. UTF-8 or UTF-16), "graphemes", or "grapheme clusters".
Unicode primer
Character encoding tends to be an area we often gloss over, so a quick primer:
- Strings are just bytes that represent numbers according to some encoding format.
- When we talk about *characters*, we sometimes mean Unicode *code points*, which are numbers in the Unicode spec (up to 21 bits).
- Other times we might mean *graphemes* or *grapheme clusters*, which may have multiple numeric representations and may be represented by more than one code point. For example, `á` may be represented as a composition of `U+0061` + `U+0301` (the `a` + the accent combining mark) or as a single code point, `U+00E1`.
- Protocol buffers use UTF-8 ("Unicode Transformation Format"), a variable-length encoding scheme using up to 4 *code units* (8-bit bytes) per code point.
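A minimal Python sketch of these distinctions, using the standard-library `unicodedata` module (the two spellings of `á` here match the example above):

```python
import unicodedata

# Two representations of "á": a single composed code point vs. 'a' + combining accent.
composed = "\u00E1"          # á as one code point (U+00E1)
decomposed = "\u0061\u0301"  # á as 'a' (U+0061) + combining acute accent (U+0301)

print(len(composed), len(decomposed))    # 1 vs. 2 code points
print(composed == decomposed)            # False: different code point sequences
print(len(composed.encode("utf-8")))     # 2 UTF-8 code units (bytes)
print(len(decomposed.encode("utf-8")))   # 3 UTF-8 code units (bytes)

# Normalizing both to NFC yields the same single code point.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```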
Guidance
Character definition
TL;DR: In our APIs, "characters" means "Unicode code points".
In API documentation (e.g., API reference documents, blog posts, marketing documentation, billing explanations, etc.), "character" **must** be defined as a Unicode code point.
Length units
TL;DR: Set size limits in "characters" (as defined above).
All string field length limits defined in API comments **must** be measured and enforced in characters as defined above. This means that there is an underlying maximum limit of (4 * characters) bytes, though this limit will only be hit when using exclusively characters that consist of 4 UTF-8 code units (32 bits).
If you use a database system (e.g. Spanner) which allows you to define a limit in characters, it is safe to assume that this byte-defined requirement is handled by the underlying storage system.
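Counting code points is straightforward in most languages; a minimal Python sketch of enforcing such a limit at the API layer (the 1024-character cap is an example value, not a prescribed limit):

```python
MAX_CHARS = 1024  # example limit, measured in Unicode code points

def validate_length(value: str) -> None:
    # len() on a Python str counts Unicode code points, not bytes.
    if len(value) > MAX_CHARS:
        raise ValueError(
            f"Value is {len(value)} characters; the limit is {MAX_CHARS}."
        )

# Worst-case storage is 4 bytes per code point in UTF-8.
assert len("\U0001F600".encode("utf-8")) == 4  # one emoji code point, 4 bytes
```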
Billing units
APIs **may** use either code points or bytes (using the UTF-8 encoding) as the unit for billing or quota measurement (e.g., Cloud Translation chooses to use characters). If an API does not define this, the assumption is that the unit of billing is characters (e.g., $0.01 *per character*, not $0.01 *per byte*).
Unique identifiers
TL;DR: Unique identifiers **should** be limited to ASCII, generally only letters, numbers, hyphens, and underscores, and **should not** start with a number.
Strings used as unique identifiers **should** limit inputs to ASCII characters, typically letters, numbers, hyphens, and underscores (`[a-zA-Z][a-zA-Z0-9_-]*`). This ensures that there are never accidental collisions due to normalization. If an API decides to allow all valid Unicode characters in unique identifiers, the API **must** reject any inputs that are not in Normalization Form C. Generally, unique identifiers **should not** start with a number, as that prefix is reserved for Google-generated identifiers and gives us an easy way to check whether we generated a unique numeric ID or whether the ID was chosen by a user.
Unique identifiers **should** use a maximum length of 64 characters, though this limit may be expanded as necessary. 64 characters should be sufficient for most purposes, as even UUIDs only require 36 characters.
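A minimal sketch of such validation in Python, using the regular expression above and the 64-character cap (`unicodedata.is_normalized` requires Python 3.8+, and the Unicode-permitting variant is shown only for APIs that choose to allow it):

```python
import re
import unicodedata

_ID_PATTERN = re.compile(r"[a-zA-Z][a-zA-Z0-9_-]*")
_MAX_ID_CHARS = 64

def validate_ascii_id(value: str) -> None:
    # Preferred form: ASCII letters, numbers, hyphens, and underscores,
    # not starting with a number.
    if not _ID_PATTERN.fullmatch(value):
        raise ValueError(f"Invalid identifier: {value!r}")
    if len(value) > _MAX_ID_CHARS:
        raise ValueError(f"Identifier exceeds {_MAX_ID_CHARS} characters.")

def validate_unicode_id(value: str) -> None:
    # An API that allows arbitrary Unicode identifiers must reject
    # inputs that are not already in Normalization Form C.
    if not unicodedata.is_normalized("NFC", value):
        raise ValueError("Identifier is not in Normalization Form C.")
```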
Note: See AIP-122 for recommendations about resource ID segments.
Normalization
TL;DR: Unicode values **should** be stored in Normalization Form C.
Values **should** always be normalized into Normalization Form C. Unique identifiers **must** always be stored in Normalization Form C (see the next section).
Imagine we're dealing with Spanish input "estaré" (the accented part is the final `é`). This text has what we might visualize as 6 "characters" (in this case, they are grapheme clusters). It has two possible Unicode representations:
- Using 6 code points: `U+0065 U+0073 U+0074 U+0061 U+0072 U+00E9`
- Using 7 code points: `U+0065 U+0073 U+0074 U+0061 U+0072 U+0065 U+0301`
Further, when encoding to UTF-8, these code points have two different serialized representations:
- Using 7 code units (7 bytes): `0x65 0x73 0x74 0x61 0x72 0xC3 0xA9`
- Using 8 code units (8 bytes): `0x65 0x73 0x74 0x61 0x72 0x65 0xCC 0x81`
To avoid this discrepancy in size (both code units and code points), use Normalization Form C, which provides a canonical representation for strings.
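The counts above can be reproduced with a few lines of Python:

```python
import unicodedata

nfc = "estar\u00E9"   # 6 code points: é as the single code point U+00E9
nfd = "estare\u0301"  # 7 code points: e (U+0065) + combining acute (U+0301)

print(len(nfc), len(nfd))                                   # 6 7
print(len(nfc.encode("utf-8")), len(nfd.encode("utf-8")))   # 7 8
print(unicodedata.normalize("NFC", nfd) == nfc)             # True
```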
Uniqueness
TL;DR: Unicode values **must** be normalized to Normalization Form C before checking uniqueness.
For the purposes of unique identification (e.g., `name`, `id`, or `parent`), the value **must** be normalized into Normalization Form C (which happens to be the most compact). Otherwise, we may have what is essentially "the same string" used to identify two entirely different resources.
In our example above, there are two ways of representing what is essentially the same text. This raises the question about whether the two representations should be treated as equivalent or not. In other words, if someone were to use both of those byte sequences in a string field that acts as a unique identifier, would it violate a uniqueness constraint?
The W3C recommends using Normalization Form C for all content moving across the internet. It is the most compact normalized form of Unicode text, and avoids most interoperability problems. If we were to treat two Unicode byte sequences as different when they have the same representation in NFC, we'd be required to reply to possible "Get" requests with content that is *not* in normalized form. Since that is definitely unacceptable, we **must** treat the two as identical by transforming any incoming string data into Normalization Form C or rejecting identifiers not in the normalized form.
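A sketch of this policy at the service boundary, assuming a hypothetical in-memory `existing_ids` set standing in for the backing store:

```python
import unicodedata

existing_ids = set()  # hypothetical stand-in for the backing store

def create_resource(resource_id: str) -> None:
    # Normalize to NFC before any uniqueness check, so that both byte
    # sequences for "estaré" identify the same resource.
    canonical = unicodedata.normalize("NFC", resource_id)
    if canonical in existing_ids:
        raise ValueError(f"Resource {canonical!r} already exists.")
    existing_ids.add(canonical)

create_resource("estar\u00E9")       # stored in NFC
try:
    create_resource("estare\u0301")  # NFD form of the same text: rejected
except ValueError as err:
    print(err)
```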
There is some debate about whether we should view strings as sequences of code points represented as bytes (leading to uniqueness determined based on the byte representation of said string) or to interpret strings as a higher-level abstraction having many different possible byte representations. The stance taken here is that we already have a field type for handling that: `bytes`. Fields of type `string` already express an opinion about the validity of an input (it must be valid UTF-8). As a result, treating two inputs that have identical normalized forms as different due to their underlying byte representation seems to go against the original intent of the `string` type. This distinction typically doesn't matter for strings that are opaque to our services (e.g., `description` or `display_name`); however, when we rely on strings to uniquely identify resources, we are forced to take a stance.
Put differently, our goal is to allow someone with text in any encoding (ASCII, UTF-16, UTF-32, etc.) to interact with our APIs without a lot of "gotchas".
References
- Unicode normalization forms
- Datastore pricing "name and value of each property" doesn't clarify this.
- Natural Language pricing charges based on code points rather than UTF-8 code units.
- Text matching and normalization