Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Determine whether characters have the ID_Start or ID_Continue properties

License

Apache-2.0 and 2 other licenses found

Licenses found

Apache-2.0
LICENSE-APACHE
MIT
LICENSE-MIT
Unknown
LICENSE-UNICODE
NotificationsYou must be signed in to change notification settings

oxc-project/unicode-id-start

 
 

Important

This crate is a better optimized implementation of the olderunicode-id crate.This crate uses less static storage, and is able to classify both ASCII and non-ASCII codepoints with better performance, 2–10× faster thanunicode-id.

githubcrates.iodocs.rsbuild status

Implementation ofUnicode Standard Annex #31 for determining whichchar values are valid in programming language identifiers.

Changelog

1.3.0

  • Unicode 16.0.0

1.2.0

  • patch U+30FB KATAKANA MIDDLE DOT and U+FF65 HALFWIDTH KATAKANA MIDDLE DOT

Unicode 4.1 through Unicode 15 omitted these two characters from ID_Continue by accident.However, this accident was corrected in Unicode 15.1.Any JS VM that supports ES6+ but that uses a version of Unicode earlier than 15.1 will consider these to be a syntax error,so we deliberately omit these characters from the set of identifiers that are valid in both ES5 and ES6+.For more info see 2.2 inhttps://www.unicode.org/L2/L2023/23160-utc176-properties-recs.pdf

1.1.2

  • Unicode 15.1.0

1.1.1

  • Unicode 15.0.0

1.0.4

  • Unicode 14.0.0

Comparison of performance

The following table shows a comparison between five Unicode identifierimplementations.

Thestatic storage column shows the total size ofstatic tables that thecrate bakes into your binary, measured in 1000s of bytes.

The remaining columns show thecost per call to evaluate whether a singlechar has the ID_Start or ID_Continue Unicode property, comparing acrossdifferent ratios of ASCII to non-ASCII codepoints in the input data.

static storage0% nonascii1%10%100% nonascii
unicode-ident10.0 K0.96 ns0.95 ns1.09 ns1.55 ns
unicode-xid11.5 K1.88 ns2.14 ns3.48 ns15.63 ns
ucd-trie10.2 K1.29 ns1.28 ns1.36 ns2.15 ns
fst138 K55.1 ns54.9 ns53.2 ns28.5 ns
roaring66.1 K2.78 ns3.09 ns3.37 ns4.70 ns

Source code for the benchmark is provided in thebench directory of this repoand may be repeated by runningcargo criterion.


Comparison of data structures

unicode-id

They use a sorted array of character ranges, and do a binary search to look upwhether a given character lands inside one of those ranges.

staticID_Continue_table:[(char,char);763] =[('\u{30}','\u{39}'),// 0-9('\u{41}','\u{5a}'),// A-Z('\u{e0100}','\u{e01ef}'),];

The static storage used by this data structure scales with the number ofcontiguous ranges of identifier codepoints in Unicode. Every table entryconsumes 8 bytes, because it consists of a pair of 32-bitchar values.

In some ranges of the Unicode codepoint space, this is quite a sparserepresentation – there are some ranges where tens of thousands of adjacentcodepoints are all valid identifier characters. In other places, therepresentation is quite inefficient. A characater likeµ (U+00B5) which issurrounded by non-identifier codepoints consumes 64 bits in the table, while itwould be just 1 bit in a dense bitmap.

On a system with 64-byte cache lines, binary searching the table touches 7 cachelines on average. Each cache line fits only 8 table entries. Additionally, thebranching performed during the binary search is probably mostly unpredictable tothe branch predictor.

Overall, the crate ends up being about 10× slower on non-ASCII inputcompared to the fastest crate.

A potential improvement would be to pack the table entries more compactly.Rust'schar type is a 21-bit integer padded to 32 bits, which means everytable entry is holding 22 bits of wasted space, adding up to 3.9 K. They couldinstead fit every table entry into 6 bytes, leaving out some of the padding, fora 25% improvement in space used. With some cleverness it may be possible to fitin 5 bytes or even 4 bytes by storing a low char and an extent, instead of lowchar and high char. I don't expect that performance would improve much but thiscould be the most efficient for space across all the libraries, needing onlyabout 7 K to store.

ucd-trie

Their data structure is a compressed trie set specifically tailored for Unicodecodepoints. The design is credited to Raph Levien inrust-lang/rust#33098.

pubstructTrieSet{tree1_level1:&'static[u64;32],tree2_level1:&'static[u8;992],tree2_level2:&'static[u64],tree3_level1:&'static[u8;256],tree3_level2:&'static[u8],tree3_level3:&'static[u64],}

It represents codepoint sets using a trie to achieve prefix compression. Thefinal states of the trie are embedded in leaves or "chunks", where each chunk isa 64-bit integer. Each bit position of the integer corresponds to whether aparticular codepoint is in the set or not. These chunks are not just a compactrepresentation of the final states of the trie, but are also a form of suffixcompression. In particular, if multiple ranges of 64 contiguous codepoints havethe same Unicode properties, then they all map to the same chunk in the finallevel of the trie.

Being tailored for Unicode codepoints, this trie is partitioned into threedisjoint sets: tree1, tree2, tree3. The first set corresponds to codepoints [0,0x800), the second [0x800, 0x10000) and the third [0x10000, 0x110000). Thesepartitions conveniently correspond to the space of 1 or 2 byte UTF-8 encodedcodepoints, 3 byte UTF-8 encoded codepoints and 4 byte UTF-8 encoded codepoints,respectively.

Lookups in this data structure are significantly more efficient than binarysearch. A lookup touches either 1, 2, or 3 cache lines based on which of thetrie partitions is being accessed.

One possible performance improvement would be for this crate to expose a way toquery based on a UTF-8 encoded string, returning the Unicode propertycorresponding to the first character in the string. Without such an API, thecaller is required to tokenize their UTF-8 encoded input data intochar, handthechar intoucd-trie, only forucd-trie to undo that work by convertingback into the variable-length representation for trie traversal.

fst

Uses afinite state transducer. This representation is built intoucd-generate but I am not aware of any advantage over theucd-trierepresentation. In particularucd-trie is optimized for storing Unicodeproperties whilefst is not.

As far as I can tell, the main thing that causesfst to have large size andslow lookups for this use case relative toucd-trie is that it does notspecialize for the fact that only 21 of the 32 bits in achar are meaningful.There are some dense arrays in the structure with large ranges that could neverpossibly be used.

roaring

This crate is a pure-Rust implementation ofRoaring Bitmap, a data structuredesigned for storing sets of 32-bit unsigned integers.

Roaring bitmaps are compressed bitmaps which tend to outperform conventionalcompressed bitmaps such as WAH, EWAH or Concise. In some instances, they can behundreds of times faster and they often offer significantly better compression.

In this use case the performance was reasonably competitive but stillsubstantially slower than the Unicode-optimized crates. Meanwhile thecompression was significantly worse, requiring 6× as much storage for thedata structure.

I also benchmarked thecroaring crate which is an FFI wrapper around the Creference implementation of Roaring Bitmap. This crate was consistently about15% slower than pure-Rustroaring, which could just be FFI overhead. I did notinvestigate further.

unicode-ident

This crate is most similar to theucd-trie library, in that it's based onbitmaps stored in the leafs of a trie representation, achieving both prefixcompression and suffix compression.

The key differences are:

  • Uses a single 2-level trie, rather than 3 disjoint partitions of differentdepth each.
  • Uses significantly larger chunks: 512 bits rather than 64 bits.
  • Compresses the ID_Start and ID_Continue properties togethersimultaneously, rather than duplicating identical trie leaf chunks across thetwo.

The following diagram show the ID_Start and ID_Continue Unicode booleanproperties in uncompressed form, in row-major order:

ID_StartID_Continue
ID_Start bitmapID_Continue bitmap

Uncompressed, these would take 140 K to store, which is beyond what would bereasonable. However, as you can see there is a large degree of similaritybetween the two bitmaps and across the rows, which lends well to compression.

This crate stores one 512-bit "row" of the above bitmaps in the leaf level of atrie, and a single additional level to index into the leafs. It turns out thereare 124 unique 512-bit chunks across the two bitmaps so 7 bits are sufficient toindex them.

The chunk size of 512 bits is selected as the size that minimizes the total sizeof the data structure. A smaller chunk, like 256 or 128 bits, would achievebetter deduplication but require a larger index. A larger chunk would increaseredundancy in the leaf bitmaps. 512 bit chunks are the optimum for total size ofthe index plus leaf bitmaps.

In fact since there are only 124 unique chunks, we can use an 8-bit index with aspare bit to index at the half-chunk level. This achieves an additional 8.5%compression by eliminating redundancies between the second half of any chunk andthe first half of any other chunk. Note that this is not the same as usingchunks which are half the size, because it does not necessitate raising the sizeof the trie's first level.

In contrast to binary search or theucd-trie crate, performing lookups in thisdata structure is straight-line code with no need for branching.

is_id_start:moveax,edishreax,9learcx,[rip+ unicode_ident::tables::TRIE_START]addrcx,raxxoreax,eaxcmpedi,201728cmovbrax,rcxtestrax,raxlearcx,[rip+ .L__unnamed_1]cmovnercx,raxmovzxeax, byte ptr[rcx]shlrax,5movecx,edishrecx,3andecx,63addrcx,raxlearax,[rip+ unicode_ident::tables::LEAF]moval, byte ptr[rax+rcx]and dil,7movecx,edishral,clandal,1ret

License

Use of the Unicode Character Database, as this crate does, is governed by theUNICODE LICENSE V3.

All intellectual property within this crate that isnot generated using theUnicode Character Database as input is licensed under either ofApache License, Version 2.0 orMIT license at your option.

Thegenerated files incorporate tabular data derived from the UnicodeCharacter Database, together with intellectual property from the original sourcecode content of the crate. One must comply with the terms of both the UnicodeLicense Agreement and either of the Apache license or MIT license when thosegenerated files are involved.

Unless you explicitly state otherwise, any contribution intentionally submittedfor inclusion in this crate by you, as defined in the Apache-2.0 license, shallbe licensed as just described, without any additional terms or conditions.

About

Determine whether characters have the ID_Start or ID_Continue properties

Resources

License

Apache-2.0 and 2 other licenses found

Licenses found

Apache-2.0
LICENSE-APACHE
MIT
LICENSE-MIT
Unknown
LICENSE-UNICODE

Stars

Watchers

Forks

Sponsor this project

 

Packages

No packages published

Languages

  • Rust100.0%

[8]ページ先頭

©2009-2025 Movatter.jp