A lightweight implementation of the Unicode Text Segmentation (UAX #29)
kytta/unicode-segmenter
- Spec compliant: Up-to-date Unicode data, verified by the official Unicode test suites and fuzzed with the native `Intl.Segmenter`, and maintaining 100% test coverage.
- Excellent compatibility: It works well on older browsers, edge runtimes, React Native (Hermes), and QuickJS.
- Zero-dependencies: It doesn't bloat `node_modules` or the network bandwidth. Like a small minimal snippet.
- Small bundle size: It effectively compresses the Unicode data and provides a bundler-friendly format.
- Extremely efficient: It's carefully optimized for runtime performance, making it the fastest one in the ecosystem, outperforming even the built-in `Intl.Segmenter`.
- TypeScript: It's fully type-checked, and provides type definitions and JSDoc.
- ESM-first: It primarily supports ES modules, and still supports CommonJS.
Note
`unicode-segmenter` is now an e18e recommendation!
Unicode® 16.0.0
Unicode® Standard Annex #29 - Revision 45 (2024-08-28)
There are several entries for text segmentation.
- `unicode-segmenter/grapheme`: Segments and counts extended grapheme clusters
- `unicode-segmenter/intl-adapter`: `Intl.Segmenter` adapter
- `unicode-segmenter/intl-polyfill`: `Intl.Segmenter` polyfill
And extra utilities for combined use cases.
- `unicode-segmenter/emoji`: Matches single codepoint emojis
- `unicode-segmenter/general`: Matches single codepoint alphanumerics
- `unicode-segmenter/utils`: Some utilities for handling codepoints
Utilities for text segmentation by extended grapheme cluster rules.
    import { graphemeSegments } from 'unicode-segmenter/grapheme';

    [...graphemeSegments('a̐éö̲\r\n')];
    // 0: { segment: 'a̐', index: 0, input: 'a̐éö̲\r\n' }
    // 1: { segment: 'é', index: 2, input: 'a̐éö̲\r\n' }
    // 2: { segment: 'ö̲', index: 4, input: 'a̐éö̲\r\n' }
    // 3: { segment: '\r\n', index: 7, input: 'a̐éö̲\r\n' }

    import { splitGraphemes } from 'unicode-segmenter/grapheme';

    [...splitGraphemes('#️⃣*️⃣0️⃣1️⃣2️⃣')];
    // 0: #️⃣
    // 1: *️⃣
    // 2: 0️⃣
    // 3: 1️⃣
    // 4: 2️⃣

    import { countGraphemes } from 'unicode-segmenter/grapheme';

    '👋 안녕!'.length;
    // => 6
    countGraphemes('👋 안녕!');
    // => 5

    'a̐éö̲'.length;
    // => 7
    countGraphemes('a̐éö̲');
    // => 3
Note
`countGraphemes()` is a small wrapper around `graphemeSegments()`.
If you need the count more than once, consider memoization, or call `graphemeSegments()` or `splitGraphemes()` once and reuse the result.
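The memoization mentioned in the note could look roughly like the sketch below. `memoizeCount` is a hypothetical helper, not part of `unicode-segmenter`, and the stand-in counter uses the built-in `Intl.Segmenter` only so the snippet runs without installing anything; in practice you would pass `countGraphemes` instead.

```javascript
// Hypothetical helper (not part of unicode-segmenter): caches counts per input string.
function memoizeCount(countFn, maxEntries = 1000) {
  const cache = new Map();
  return function cachedCount(text) {
    if (cache.has(text)) return cache.get(text);
    const count = countFn(text);
    // Naive bound: drop the oldest entry once the cache is full.
    if (cache.size >= maxEntries) {
      cache.delete(cache.keys().next().value);
    }
    cache.set(text, count);
    return count;
  };
}

// Stand-in counter for illustration only; swap in `countGraphemes`
// from 'unicode-segmenter/grapheme' when the package is installed.
let calls = 0;
function standInCount(text) {
  calls += 1;
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  let n = 0;
  for (const _ of segmenter.segment(text)) n += 1;
  return n;
}

const cachedCount = memoizeCount(standInCount);
cachedCount('👋 안녕!');
cachedCount('👋 안녕!'); // served from the cache; `standInCount` ran only once
```

A cache keyed by the full string only pays off when the same inputs recur; for one-off strings, segmenting once and reusing the result is simpler.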
graphemeSegments() exposes some knowledge identified in the middle of the process to support some useful cases.
For example, knowing the `Grapheme_Cluster_Break` category at the beginning and end of a segment can help approximately infer the applied boundary rule.

    import { graphemeSegments, GraphemeCategory } from 'unicode-segmenter/grapheme';

    function* matchEmoji(str) {
      for (const { segment, _catBegin } of graphemeSegments(str)) {
        // `_catBegin` identified as Extended_Pictographic means the segment is emoji
        if (_catBegin === GraphemeCategory.Extended_Pictographic) {
          yield segment;
        }
      }
    }

    [...matchEmoji('1🌷2🎁3💩4😜5👍')]
    // 0: 🌷
    // 1: 🎁
    // 2: 💩
    // 3: 😜
    // 4: 👍
`Intl.Segmenter` API adapter (only `granularity: "grapheme"` is available so far)

    import { Segmenter } from 'unicode-segmenter/intl-adapter';

    // Same API as `Intl.Segmenter`
    const segmenter = new Segmenter();
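Because the adapter is described as having the same API as `Intl.Segmenter`, code written against the standard interface should work with either constructor. The sketch below relies on that stated compatibility; `firstGraphemes` is a hypothetical helper, and the built-in `Intl.Segmenter` is passed in only so the snippet runs without the package installed.

```javascript
// Hypothetical helper: returns the first `limit` grapheme clusters of `text`,
// using any constructor that follows the `Intl.Segmenter` interface.
function firstGraphemes(SegmenterCtor, text, limit) {
  const segmenter = new SegmenterCtor(undefined, { granularity: 'grapheme' });
  const result = [];
  for (const { segment } of segmenter.segment(text)) {
    if (result.length >= limit) break;
    result.push(segment);
  }
  return result;
}

// With the package installed, `Segmenter` from 'unicode-segmenter/intl-adapter'
// could be passed here instead of the built-in constructor (an assumption
// based on the "same API" statement above).
const preview = firstGraphemes(Intl.Segmenter, '👍🏽 ok!', 2);
```

Taking the constructor as a parameter keeps the calling code independent of whether the native API or the adapter is in use.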
`Intl.Segmenter` API polyfill (only `granularity: "grapheme"` is available so far)

    // Apply polyfill to the `globalThis.Intl` object.
    import 'unicode-segmenter/intl-polyfill';

    const segmenter = new Intl.Segmenter();
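If the polyfill should only be loaded where the native API is missing, a feature check before importing is one option. This is a sketch that assumes a runtime with dynamic `import()`; `needsSegmenterPolyfill` and `ensureSegmenter` are hypothetical helpers, not part of the package.

```javascript
// Hypothetical helper: true when the given `Intl`-like object lacks `Segmenter`.
function needsSegmenterPolyfill(intl = globalThis.Intl) {
  return typeof intl === 'undefined' || typeof intl.Segmenter !== 'function';
}

// Hypothetical helper: loads the polyfill entry only when it is needed,
// then returns a segmenter from the (possibly patched) global `Intl`.
async function ensureSegmenter() {
  if (needsSegmenterPolyfill()) {
    await import('unicode-segmenter/intl-polyfill');
  }
  return new Intl.Segmenter();
}
```

Loading the entry conditionally keeps it out of the critical path on runtimes that already ship `Intl.Segmenter`, at the cost of making the first use asynchronous.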
Utilities for matching emoji-like characters.
    import {
      isEmojiPresentation,    // match \p{Emoji_Presentation}
      isExtendedPictographic, // match \p{Extended_Pictographic}
    } from 'unicode-segmenter/emoji';

    isEmojiPresentation('😍'.codePointAt(0));
    // => true
    isEmojiPresentation('♡'.codePointAt(0));
    // => false

    isExtendedPictographic('😍'.codePointAt(0));
    // => true
    isExtendedPictographic('♡'.codePointAt(0));
    // => true
Utilities for matching alphanumeric characters.
    import {
      isLetter,       // match \p{L}
      isNumeric,      // match \p{N}
      isAlphabetic,   // match \p{Alphabetic}
      isAlphanumeric, // match [\p{N}\p{Alphabetic}]
    } from 'unicode-segmenter/general';
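The comments above name the Unicode property each predicate targets. For comparison, the same properties can be tested with Unicode property escapes in regular expressions, which require ES2018 support in the runtime. This is only an illustration of what the properties cover; `matchesProperty` is a hypothetical helper and says nothing about how the library implements its checks.

```javascript
// Hypothetical helper: tests a single code point against a Unicode-aware regex.
function matchesProperty(codePoint, pattern) {
  return pattern.test(String.fromCodePoint(codePoint));
}

// Regex counterparts of the properties named in the import comments.
const LETTER = /^\p{L}$/u;
const NUMERIC = /^\p{N}$/u;
const ALPHABETIC = /^\p{Alphabetic}$/u;
const ALPHANUMERIC = /^[\p{N}\p{Alphabetic}]$/u;

const hangulIsLetter = matchesProperty('가'.codePointAt(0), LETTER);          // Hangul syllable
const fractionIsNumeric = matchesProperty('½'.codePointAt(0), NUMERIC);       // vulgar fraction
const bangIsAlphanumeric = matchesProperty('!'.codePointAt(0), ALPHANUMERIC); // punctuation
```

The predicates in `unicode-segmenter/general` take code points directly, which avoids building a string per check and does not depend on the regex engine's Unicode data.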
You can access some internal utilities to deal with JavaScript strings.
    import {
      isHighSurrogate,
      isLowSurrogate,
      surrogatePairToCodePoint,
    } from 'unicode-segmenter/utils';

    const u32 = '😍';
    const hi = u32.charCodeAt(0);
    const lo = u32.charCodeAt(1);

    if (isHighSurrogate(hi) && isLowSurrogate(lo)) {
      const codePoint = surrogatePairToCodePoint(hi, lo);
      // => equivalent to u32.codePointAt(0)
    }

    import { isBMP } from 'unicode-segmenter/utils';

    const char = '😍'; // .length = 2
    const cp = char.codePointAt(0);

    // Parentheses are required: `===` binds tighter than `?:`
    char.length === (isBMP(cp) ? 1 : 2);
    // => true
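For readers unfamiliar with UTF-16, the arithmetic behind helpers like these can be sketched in plain JavaScript. The functions below are hypothetical re-implementations written for illustration, not the library's source.

```javascript
// Hypothetical re-implementations for illustration only.
const isHigh = (unit) => unit >= 0xd800 && unit <= 0xdbff;
const isLow = (unit) => unit >= 0xdc00 && unit <= 0xdfff;

// Combines a high/low surrogate pair into one code point:
// each surrogate carries 10 bits, offset by 0x10000.
function pairToCodePoint(hi, lo) {
  return ((hi - 0xd800) << 10) + (lo - 0xdc00) + 0x10000;
}

// Code points up to U+FFFF (the BMP) fit in a single UTF-16 code unit.
const fitsInOneUnit = (cp) => cp <= 0xffff;

const text = '😍';
const hi = text.charCodeAt(0);
const lo = text.charCodeAt(1);
const cp = isHigh(hi) && isLow(lo) ? pairToCodePoint(hi, lo) : hi;
// `cp` equals `text.codePointAt(0)`; `text.length` is 2 because `cp` lies outside the BMP.
```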
unicode-segmenter uses only fundamental features of ES2015, making it compatible with most browsers.
To ensure compatibility, the runtime should support:
If the runtime doesn't support these features, they can easily be supplied with tools like Babel.
Since Hermes doesn't support the `Intl.Segmenter` API yet, `unicode-segmenter` is a good alternative.
`unicode-segmenter` compiles to smaller and more efficient Hermes bytecode than other JavaScript libraries. See the benchmark for details.
`unicode-segmenter` aims to be lighter and faster than alternatives in the ecosystem while remaining fully spec compliant. The benchmark therefore tracks the performance, bundle size, and Unicode version compliance of several libraries:
- graphemer@1.4.0 (16.6M+ weekly downloads on NPM)
- grapheme-splitter@1.0.4 (5.7M+ weekly downloads on NPM)
- @formatjs/intl-segmenter@11.5.7 (5.4K+ weekly downloads on NPM)
- WebAssembly build of unicode-segmentation@1.12.0 with minimum bindings
- Built-in `Intl.Segmenter` API
| Name | Unicode® | ESM? | Size | Size (min) | Size (min+gzip) | Size (min+br) |
|---|---|---|---|---|---|---|
| unicode-segmenter/grapheme | 16.0.0 | ✔️ | 15,929 | 12,110 | 5,050 | 3,738 |
| graphemer | 15.0.0 | ✖️ | 410,435 | 95,104 | 15,752 | 10,660 |
| grapheme-splitter | 10.0.0 | ✖️ | 122,252 | 23,680 | 7,852 | 4,841 |
| @formatjs/intl-segmenter* | 15.0.0 | ✖️ | 491,043 | 318,721 | 54,248 | 34,380 |
| unicode-segmentation* | 16.0.0 | ✔️ | 56,529 | 52,443 | 24,110 | 17,343 |
| Intl.Segmenter* | - | - | 0 | 0 | 0 | 0 |
- `@formatjs/intl-segmenter` handles grapheme, word, and sentence, but it's not tree-shakable.
- `unicode-segmentation` size contains only the minimum WASM binary and the bindings needed to run the benchmark. It will increase if more features are exposed.
- `Intl.Segmenter`'s Unicode data depends on the host, and may not be up-to-date.
- `Intl.Segmenter` may not be available in some old browsers, edge runtimes, or embedded environments.
| Name | Bytecode size | Bytecode size (gzip)* |
|---|---|---|
| unicode-segmenter/grapheme | 22,019 | 11,513 |
| graphemer | 133,974 | 31,715 |
| grapheme-splitter | 63,855 | 19,133 |
- It would be compressed when included as an app asset.
Here is a brief explanation, and you can see archived benchmark results.
Performance in Node.js: `unicode-segmenter/grapheme` is significantly faster than alternatives.
- 6~15x faster than other JavaScript libraries
- 1.5~3x faster than the WASM binding of Rust's `unicode-segmentation`
- 1.5~3x faster than the built-in `Intl.Segmenter`
Performance in Bun: `unicode-segmenter/grapheme` has almost the same performance as the built-in `Intl.Segmenter`, with no performance degradation compared to other JavaScript libraries.
Performance in Browsers: The performance in browser environments varies greatly due to differences in browser engines and versions, which makes benchmarking less consistent. Despite these variations, `unicode-segmenter/grapheme` generally outperforms other JavaScript libraries in most environments.
Performance in React Native: `unicode-segmenter/grapheme` is significantly faster than alternatives when compiled to Hermes bytecode. It's 3~8x faster than `graphemer` and 20~26x faster than `grapheme-splitter`, with the performance gap increasing with input size.
Performance in QuickJS: `unicode-segmenter/grapheme` is the only usable library in terms of performance.
Instead of trusting these claims, you can try `yarn perf:grapheme` directly in your environment or build your own benchmark.
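For a quick comparison outside the project's own benchmark setup, a rough timing loop may be enough. The sketch below is a hypothetical harness, not the repository's `perf:grapheme` script. It assumes a global `performance.now()` (available in modern browsers and recent Node.js), and the built-in `Intl.Segmenter` candidate is only a placeholder for whichever implementations you want to compare.

```javascript
// Hypothetical micro-benchmark: average milliseconds per call of `fn(input)`.
function timePerCall(fn, input, iterations = 1000) {
  // Warm-up so JIT compilation does not dominate the measurement.
  for (let i = 0; i < 100; i++) fn(input);
  const start = performance.now();
  for (let i = 0; i < iterations; i++) fn(input);
  return (performance.now() - start) / iterations;
}

// Placeholder candidate; add others such as `countGraphemes`
// from 'unicode-segmenter/grapheme' to compare against it.
const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
function countWithIntl(text) {
  let n = 0;
  for (const _ of segmenter.segment(text)) n += 1;
  return n;
}

const sample = '👋 안녕! '.repeat(50);
const msPerCall = timePerCall(countWithIntl, sample);
console.log(`Intl.Segmenter: ${msPerCall.toFixed(4)} ms per call`);
```

Numbers from a loop like this are noisy, so it helps to run it several times, vary the input size, and compare candidates within the same process.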
- The Rust Unicode team (@unicode-rs): The initial implementation was ported manually from the `unicode-segmentation` library.
- Marijn Haverbeke (@marijnh): Inspired a technique that can greatly compress Unicode data tables, taken from his library.