oxc-project/unicode-id-startPublic

forked fromdtolnay/unicode-ident

NotificationsYou must be signed in to change notification settings
Fork2
Star14

Determine whether characters have the ID_Start or ID_Continue properties

License

Apache-2.0 and 2 other licenses found

Licenses found

14 stars 18 forks Branches Tags Activity

Star

Notifications

You must be signed in to change notification settings

Branches Tags

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 165 Commits
.github		.github
benches		benches
diagram		diagram
generate		generate
src		src
tests		tests
.gitattributes		.gitattributes
.gitignore		.gitignore
Cargo.toml		Cargo.toml
LICENSE-APACHE		LICENSE-APACHE
LICENSE-MIT		LICENSE-MIT
LICENSE-UNICODE		LICENSE-UNICODE
MAINTENANCE.md		MAINTENANCE.md
README.md		README.md
rust-toolchain.toml		rust-toolchain.toml

Repository files navigation

Important

This crate is a better optimized implementation of the olderunicode-id crate.This crate uses less static storage, and is able to classify both ASCII and non-ASCII codepoints with better performance, 2–10× faster thanunicode-id.

Unicode ID_start

Implementation ofUnicode Standard Annex #31 for determining whichchar values are valid in programming language identifiers.

Changelog

1.3.0

Unicode 16.0.0

1.2.0

patch・ U+30FB KATAKANA MIDDLE DOT and･ U+FF65 HALFWIDTH KATAKANA MIDDLE DOT

Unicode 4.1 through Unicode 15 omitted these two characters from ID_Continue by accident.However, this accident was corrected in Unicode 15.1.Any JS VM that supports ES6+ but that uses a version of Unicode earlier than 15.1 will consider these to be a syntax error,so we deliberately omit these characters from the set of identifiers that are valid in both ES5 and ES6+.For more info see 2.2 inhttps://www.unicode.org/L2/L2023/23160-utc176-properties-recs.pdf

1.1.2

Unicode 15.1.0

1.1.1

Unicode 15.0.0

1.0.4

Unicode 14.0.0

Comparison of performance

The following table shows a comparison between five Unicode identifierimplementations.

unicode-id-start is this crate, which is a fork ofunicode-ident;
unicode-xid is a widely used crate run by the "unicode-rs" org,unicode-id is a fork ofunicode-xid;
ucd-trie andfst are two data structures supported by theucd-generate tool;
roaring is a Rust implementation of Roaring bitmap.

Thestatic storage column shows the total size ofstatic tables that thecrate bakes into your binary, measured in 1000s of bytes.

The remaining columns show thecost per call to evaluate whether a singlechar has the ID_Start or ID_Continue Unicode property, comparing acrossdifferent ratios of ASCII to non-ASCII codepoints in the input data.

	static storage	0% nonascii	1%	10%	100% nonascii
`unicode-ident`	10.0 K	0.96 ns	0.95 ns	1.09 ns	1.55 ns
`unicode-xid`	11.5 K	1.88 ns	2.14 ns	3.48 ns	15.63 ns
`ucd-trie`	10.2 K	1.29 ns	1.28 ns	1.36 ns	2.15 ns
`fst`	138 K	55.1 ns	54.9 ns	53.2 ns	28.5 ns
`roaring`	66.1 K	2.78 ns	3.09 ns	3.37 ns	4.70 ns

Source code for the benchmark is provided in thebench directory of this repoand may be repeated by runningcargo criterion.

Comparison of data structures

unicode-id

They use a sorted array of character ranges, and do a binary search to look upwhether a given character lands inside one of those ranges.

staticID_Continue_table:[(char,char);763] =[('\u{30}','\u{39}'),// 0-9('\u{41}','\u{5a}'),// A-Z    …('\u{e0100}','\u{e01ef}'),];

The static storage used by this data structure scales with the number ofcontiguous ranges of identifier codepoints in Unicode. Every table entryconsumes 8 bytes, because it consists of a pair of 32-bitchar values.

In some ranges of the Unicode codepoint space, this is quite a sparserepresentation – there are some ranges where tens of thousands of adjacentcodepoints are all valid identifier characters. In other places, therepresentation is quite inefficient. A characater likeµ (U+00B5) which issurrounded by non-identifier codepoints consumes 64 bits in the table, while itwould be just 1 bit in a dense bitmap.

On a system with 64-byte cache lines, binary searching the table touches 7 cachelines on average. Each cache line fits only 8 table entries. Additionally, thebranching performed during the binary search is probably mostly unpredictable tothe branch predictor.

Overall, the crate ends up being about 10× slower on non-ASCII inputcompared to the fastest crate.

A potential improvement would be to pack the table entries more compactly.Rust'schar type is a 21-bit integer padded to 32 bits, which means everytable entry is holding 22 bits of wasted space, adding up to 3.9 K. They couldinstead fit every table entry into 6 bytes, leaving out some of the padding, fora 25% improvement in space used. With some cleverness it may be possible to fitin 5 bytes or even 4 bytes by storing a low char and an extent, instead of lowchar and high char. I don't expect that performance would improve much but thiscould be the most efficient for space across all the libraries, needing onlyabout 7 K to store.

ucd-trie

Their data structure is a compressed trie set specifically tailored for Unicodecodepoints. The design is credited to Raph Levien inrust-lang/rust#33098.

pubstructTrieSet{tree1_level1:&'static[u64;32],tree2_level1:&'static[u8;992],tree2_level2:&'static[u64],tree3_level1:&'static[u8;256],tree3_level2:&'static[u8],tree3_level3:&'static[u64],}

It represents codepoint sets using a trie to achieve prefix compression. Thefinal states of the trie are embedded in leaves or "chunks", where each chunk isa 64-bit integer. Each bit position of the integer corresponds to whether aparticular codepoint is in the set or not. These chunks are not just a compactrepresentation of the final states of the trie, but are also a form of suffixcompression. In particular, if multiple ranges of 64 contiguous codepoints havethe same Unicode properties, then they all map to the same chunk in the finallevel of the trie.

Being tailored for Unicode codepoints, this trie is partitioned into threedisjoint sets: tree1, tree2, tree3. The first set corresponds to codepoints [0,0x800), the second [0x800, 0x10000) and the third [0x10000, 0x110000). Thesepartitions conveniently correspond to the space of 1 or 2 byte UTF-8 encodedcodepoints, 3 byte UTF-8 encoded codepoints and 4 byte UTF-8 encoded codepoints,respectively.

Lookups in this data structure are significantly more efficient than binarysearch. A lookup touches either 1, 2, or 3 cache lines based on which of thetrie partitions is being accessed.

One possible performance improvement would be for this crate to expose a way toquery based on a UTF-8 encoded string, returning the Unicode propertycorresponding to the first character in the string. Without such an API, thecaller is required to tokenize their UTF-8 encoded input data intochar, handthechar intoucd-trie, only forucd-trie to undo that work by convertingback into the variable-length representation for trie traversal.

fst

Uses afinite state transducer. This representation is built intoucd-generate but I am not aware of any advantage over theucd-trierepresentation. In particularucd-trie is optimized for storing Unicodeproperties whilefst is not.

As far as I can tell, the main thing that causesfst to have large size andslow lookups for this use case relative toucd-trie is that it does notspecialize for the fact that only 21 of the 32 bits in achar are meaningful.There are some dense arrays in the structure with large ranges that could neverpossibly be used.

roaring

This crate is a pure-Rust implementation ofRoaring Bitmap, a data structuredesigned for storing sets of 32-bit unsigned integers.

Roaring bitmaps are compressed bitmaps which tend to outperform conventionalcompressed bitmaps such as WAH, EWAH or Concise. In some instances, they can behundreds of times faster and they often offer significantly better compression.

In this use case the performance was reasonably competitive but stillsubstantially slower than the Unicode-optimized crates. Meanwhile thecompression was significantly worse, requiring 6× as much storage for thedata structure.

I also benchmarked thecroaring crate which is an FFI wrapper around the Creference implementation of Roaring Bitmap. This crate was consistently about15% slower than pure-Rustroaring, which could just be FFI overhead. I did notinvestigate further.

unicode-ident

This crate is most similar to theucd-trie library, in that it's based onbitmaps stored in the leafs of a trie representation, achieving both prefixcompression and suffix compression.

The key differences are:

Uses a single 2-level trie, rather than 3 disjoint partitions of differentdepth each.
Uses significantly larger chunks: 512 bits rather than 64 bits.
Compresses the ID_Start and ID_Continue properties togethersimultaneously, rather than duplicating identical trie leaf chunks across thetwo.

The following diagram show the ID_Start and ID_Continue Unicode booleanproperties in uncompressed form, in row-major order:

ID_Start	ID_Continue

Uncompressed, these would take 140 K to store, which is beyond what would bereasonable. However, as you can see there is a large degree of similaritybetween the two bitmaps and across the rows, which lends well to compression.

This crate stores one 512-bit "row" of the above bitmaps in the leaf level of atrie, and a single additional level to index into the leafs. It turns out thereare 124 unique 512-bit chunks across the two bitmaps so 7 bits are sufficient toindex them.

The chunk size of 512 bits is selected as the size that minimizes the total sizeof the data structure. A smaller chunk, like 256 or 128 bits, would achievebetter deduplication but require a larger index. A larger chunk would increaseredundancy in the leaf bitmaps. 512 bit chunks are the optimum for total size ofthe index plus leaf bitmaps.

In fact since there are only 124 unique chunks, we can use an 8-bit index with aspare bit to index at the half-chunk level. This achieves an additional 8.5%compression by eliminating redundancies between the second half of any chunk andthe first half of any other chunk. Note that this is not the same as usingchunks which are half the size, because it does not necessitate raising the sizeof the trie's first level.

In contrast to binary search or theucd-trie crate, performing lookups in thisdata structure is straight-line code with no need for branching.

is_id_start:moveax,edishreax,9learcx,[rip+ unicode_ident::tables::TRIE_START]addrcx,raxxoreax,eaxcmpedi,201728cmovbrax,rcxtestrax,raxlearcx,[rip+ .L__unnamed_1]cmovnercx,raxmovzxeax, byte ptr[rcx]shlrax,5movecx,edishrecx,3andecx,63addrcx,raxlearax,[rip+ unicode_ident::tables::LEAF]moval, byte ptr[rax+rcx]and dil,7movecx,edishral,clandal,1ret

License

Use of the Unicode Character Database, as this crate does, is governed by theUNICODE LICENSE V3.

All intellectual property within this crate that isnot generated using theUnicode Character Database as input is licensed under either ofApache License, Version 2.0 orMIT license at your option.

Thegenerated files incorporate tabular data derived from the UnicodeCharacter Database, together with intellectual property from the original sourcecode content of the crate. One must comply with the terms of both the UnicodeLicense Agreement and either of the Apache license or MIT license when thosegenerated files are involved.

Unless you explicitly state otherwise, any contribution intentionally submittedfor inclusion in this crate by you, as defined in the Apache-2.0 license, shallbe licensed as just described, without any additional terms or conditions.

About

Determine whether characters have the ID_Start or ID_Continue properties

Resources

Readme

License

Apache-2.0 and 2 other licenses found

Releases

12tags

Sponsor this project

Learn more about GitHub Sponsors

Packages

No packages published

Languages

Rust100.0%

Movatterモバイル変換

Uh oh!

License

Licenses found

oxc-project/unicode-id-start

Folders and files

Latest commit

History

Repository files navigation

Unicode ID_start

Changelog

1.3.0

1.2.0

1.1.2

1.1.1

1.0.4

Comparison of performance

Comparison of data structures

unicode-id

ucd-trie

fst

roaring

unicode-ident

License

About

Resources

License

Licenses found

Uh oh!

Stars

Watchers

Forks

Releases

Sponsor this project

Uh oh!

Packages0

Uh oh!

Languages

Packages