kytta/unicode-segmenter (public)

Forked from cometkim/unicode-segmenter


unicode-segmenter

A lightweight implementation of the Unicode Text Segmentation (UAX #29)

Spec compliant: Up-to-date Unicode data, verified by the official Unicode test suites, fuzzed against the native Intl.Segmenter, and maintained at 100% test coverage.
Excellent compatibility: It works well on older browsers, edge runtimes, React Native (Hermes), and QuickJS.
Zero dependencies: It doesn't bloat node_modules or the network bandwidth, much like a small, minimal snippet.
Small bundle size: It effectively compresses the Unicode data and provides a bundler-friendly format.
Extremely efficient: It's carefully optimized for runtime performance, making it the fastest one in the ecosystem, outperforming even the built-in Intl.Segmenter.
TypeScript: It's fully type-checked, and provides type definitions and JSDoc.
ESM-first: It primarily supports ES modules, and still supports CommonJS.

Note

unicode-segmenter is now an e18e recommendation!

Unicode® Version

Unicode® 16.0.0

Unicode® Standard Annex #29 - Revision 45 (2024-08-28)

APIs

There are several entries for text segmentation.

unicode-segmenter/grapheme: Segments and counts extended grapheme clusters
unicode-segmenter/intl-adapter: Intl.Segmenter adapter
unicode-segmenter/intl-polyfill: Intl.Segmenter polyfill

And extra utilities for combined use cases.

unicode-segmenter/emoji: Matches single codepoint emojis
unicode-segmenter/general: Matches single codepoint alphanumerics
unicode-segmenter/utils: Some utilities for handling codepoints

Export `unicode-segmenter/grapheme`

Utilities for text segmentation by extended grapheme cluster rules.

Example: Get grapheme segments

    import { graphemeSegments } from 'unicode-segmenter/grapheme';

    [...graphemeSegments('a̐éö̲\r\n')];
    // 0: { segment: 'a̐', index: 0, input: 'a̐éö̲\r\n' }
    // 1: { segment: 'é', index: 2, input: 'a̐éö̲\r\n' }
    // 2: { segment: 'ö̲', index: 4, input: 'a̐éö̲\r\n' }
    // 3: { segment: '\r\n', index: 7, input: 'a̐éö̲\r\n' }

Example: Split graphemes

    import { splitGraphemes } from 'unicode-segmenter/grapheme';

    [...splitGraphemes('#️⃣*️⃣0️⃣1️⃣2️⃣')];
    // 0: #️⃣
    // 1: *️⃣
    // 2: 0️⃣
    // 3: 1️⃣
    // 4: 2️⃣

Example: Count graphemes

    import { countGraphemes } from 'unicode-segmenter/grapheme';

    '👋 안녕!'.length;
    // => 6
    countGraphemes('👋 안녕!');
    // => 5

    'a̐éö̲'.length;
    // => 7
    countGraphemes('a̐éö̲');
    // => 3

Note

countGraphemes() is a small wrapper around graphemeSegments().

If you need the count more than once for the same input, consider memoizing it, or call graphemeSegments() or splitGraphemes() once and reuse the result.
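The memoization suggested above could look like the following minimal sketch. `memoizeCounter` and its `Map` cache are hypothetical helpers, not part of unicode-segmenter. The stand-in counter uses the built-in `Intl.Segmenter` so the snippet runs without the library; in real code you would pass `countGraphemes` instead.

```javascript
// Hypothetical helper (not part of unicode-segmenter): caches results per input string.
function memoizeCounter(count, maxEntries = 1000) {
  const cache = new Map();
  return function memoized(input) {
    const cached = cache.get(input);
    if (cached !== undefined) return cached;
    const result = count(input);
    // Naive bound: drop the oldest entry once the cache is full.
    if (cache.size >= maxEntries) {
      cache.delete(cache.keys().next().value);
    }
    cache.set(input, result);
    return result;
  };
}

// Stand-in counter built on the native API; swap in `countGraphemes` from
// 'unicode-segmenter/grapheme' in real code.
const standInSegmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
function countWithIntl(input) {
  let n = 0;
  for (const _ of standInSegmenter.segment(input)) n += 1;
  return n;
}

const cachedCount = memoizeCounter(countWithIntl);
cachedCount('a\u0310e\u0301o\u0308\u0332'); // computed on the first call
cachedCount('a\u0310e\u0301o\u0308\u0332'); // served from the cache
```

The cache is keyed by the whole string, so it only pays off when the same inputs recur; for one-off inputs, iterating the segments once is the simpler choice.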

Example: Build an advanced grapheme matcher

graphemeSegments() exposes some knowledge identified in the middle of the process to support some useful cases.

For example, knowing the Grapheme_Cluster_Break category at the beginning and end of a segment can help approximately infer the applied boundary rule.

    import { graphemeSegments, GraphemeCategory } from 'unicode-segmenter/grapheme';

    function* matchEmoji(input) {
      for (const { segment, _catBegin } of graphemeSegments(input)) {
        // `_catBegin` identified as Extended_Pictographic means the segment is emoji
        if (_catBegin === GraphemeCategory.Extended_Pictographic) {
          yield segment;
        }
      }
    }

    [...matchEmoji('1🌷2🎁3💩4😜5👍')]
    // 0: 🌷
    // 1: 🎁
    // 2: 💩
    // 3: 😜
    // 4: 👍
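For comparison, a similar matcher can be approximated without the library wherever the built-in `Intl.Segmenter` and Unicode property escapes are available. This is a sketch under that assumption: `matchEmojiNative` is a hypothetical name, and testing the first code point of each segment against `\p{Extended_Pictographic}` only approximates the `_catBegin` check.

```javascript
// Matches a segment whose first code point has the Extended_Pictographic property.
const emojiStart = /^\p{Extended_Pictographic}/u;
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

// Hypothetical helper: yields the grapheme clusters that start with a pictographic.
function* matchEmojiNative(input) {
  for (const { segment } of graphemeSegmenter.segment(input)) {
    if (emojiStart.test(segment)) {
      yield segment;
    }
  }
}

const found = [...matchEmojiNative('1🌷2🎁3💩4😜5👍')];
// `found` holds the five emoji, in input order
```

Sequences that do not begin with a pictographic code point, such as flags built from regional indicators, are not matched by this approximation.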

Export `unicode-segmenter/intl-adapter`

Intl.Segmenter API adapter (only granularity: "grapheme" is available so far)

    import { Segmenter } from 'unicode-segmenter/intl-adapter';

    // Same API as `Intl.Segmenter`
    const segmenter = new Segmenter();
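Because the adapter mirrors the standard interface, code written against the built-in `Intl.Segmenter` shows the calls to expect. The sketch below uses only the native API, so it needs a runtime that provides it. The assumption, based on the note above, is that swapping in the adapter's `Segmenter` leaves the rest unchanged.

```javascript
// Native API; with the adapter you would construct `new Segmenter()` instead.
const nativeSegmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

// a + U+0310, e + U+0301, o + U+0308 + U+0332, then CR LF
const input = 'a\u0310e\u0301o\u0308\u0332\r\n';

// segment() returns an iterable of { segment, index, input } records.
const segments = [...nativeSegmenter.segment(input)];
const indexes = segments.map((s) => s.index); // [0, 2, 4, 7]
```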

Export `unicode-segmenter/intl-polyfill`

Intl.Segmenter API polyfill (only granularity: "grapheme" is available so far)

    // Apply polyfill to the `globalThis.Intl` object.
    import 'unicode-segmenter/intl-polyfill';

    const segmenter = new Intl.Segmenter();
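If the polyfill should load only where the native API is missing, a feature check along these lines could make that decision. `hasNativeSegmenter` is a hypothetical helper, and the conditional dynamic `import()` shown in the comment is an assumption about how you might wire it up, not something this README documents.

```javascript
// Hypothetical helper: true when the given Intl object provides a Segmenter constructor.
function hasNativeSegmenter(intl = globalThis.Intl) {
  return (
    typeof intl === 'object' &&
    intl !== null &&
    typeof intl.Segmenter === 'function'
  );
}

// Possible wiring (assumption; needs a runtime or bundler that supports dynamic import):
// if (!hasNativeSegmenter()) {
//   await import('unicode-segmenter/intl-polyfill');
// }
```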

Export `unicode-segmenter/emoji`

Utilities for matching emoji-like characters.

Example: Use Unicode emoji property matchers

    import {
      isEmojiPresentation,    // match \p{Emoji_Presentation}
      isExtendedPictographic, // match \p{Extended_Pictographic}
    } from 'unicode-segmenter/emoji';

    isEmojiPresentation('😍'.codePointAt(0));
    // => true
    isEmojiPresentation('♡'.codePointAt(0));
    // => false

    isExtendedPictographic('😍'.codePointAt(0));
    // => true
    isExtendedPictographic('♡'.codePointAt(0));
    // => true

Export `unicode-segmenter/general`

Utilities for matching alphanumeric characters.

Example: Use Unicode general property matchers

    import {
      isLetter,       // match \p{L}
      isNumeric,      // match \p{N}
      isAlphabetic,   // match \p{Alphabetic}
      isAlphanumeric, // match [\p{N}\p{Alphabetic}]
    } from 'unicode-segmenter/general';

Export `unicode-segmenter/utils`

You can access some internal utilities to deal with JavaScript strings.

Example: Handle UTF-16 surrogate pairs

    import {
      isHighSurrogate,
      isLowSurrogate,
      surrogatePairToCodePoint,
    } from 'unicode-segmenter/utils';

    const u32 = '😍';
    const hi = u32.charCodeAt(0);
    const lo = u32.charCodeAt(1);

    if (isHighSurrogate(hi) && isLowSurrogate(lo)) {
      const codePoint = surrogatePairToCodePoint(hi, lo);
      // => equivalent to u32.codePointAt(0)
    }
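For reference, the standard UTF-16 arithmetic behind helpers of this kind is shown below. These are independent re-implementations for illustration, with hypothetical names; they are not the library's source.

```javascript
// High (lead) surrogates occupy 0xD800..0xDBFF; low (trail) surrogates 0xDC00..0xDFFF.
const isHigh = (codeUnit) => codeUnit >= 0xd800 && codeUnit <= 0xdbff;
const isLow = (codeUnit) => codeUnit >= 0xdc00 && codeUnit <= 0xdfff;

// Each surrogate carries 10 bits of the code point's offset above 0x10000.
function pairToCodePoint(hi, lo) {
  return ((hi - 0xd800) << 10) + (lo - 0xdc00) + 0x10000;
}

const text = '\u{1F60D}'; // 😍
const hiUnit = text.charCodeAt(0); // 0xD83D
const loUnit = text.charCodeAt(1); // 0xDE0D
const codePoint = pairToCodePoint(hiUnit, loUnit); // 0x1F60D
```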

Example: Determine the length of a character

    import { isBMP } from 'unicode-segmenter/utils';

    const char = '😍'; // .length = 2
    const cp = char.codePointAt(0);

    // Parentheses are required: `===` binds tighter than `?:`
    char.length === (isBMP(cp) ? 1 : 2);
    // => true
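The same check extends to walking a string one code point at a time. In this sketch, `isBmpCodePoint` is a local stand-in for the library's `isBMP` (assumed to mean the code point is at most U+FFFF), and `countCodePoints` is a hypothetical helper.

```javascript
// Local stand-in for isBMP: BMP code points fit in a single UTF-16 code unit.
const isBmpCodePoint = (cp) => cp <= 0xffff;

// Hypothetical helper: counts code points by advancing 1 or 2 code units at a time.
function countCodePoints(input) {
  let count = 0;
  for (let i = 0; i < input.length; ) {
    const cp = input.codePointAt(i);
    i += isBmpCodePoint(cp) ? 1 : 2;
    count += 1;
  }
  return count;
}

countCodePoints('a\u{1F60D}b'); // 3 code points stored in 4 code units
```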

Runtime Compatibility

unicode-segmenter uses only fundamental features of ES2015, making it compatible with most browsers.

To ensure compatibility, the runtime should support:

If the runtime doesn't support these features, they can easily be provided with tools like Babel.

React Native Support

Since Hermes doesn't support the Intl.Segmenter API yet, unicode-segmenter is a good alternative.

unicode-segmenter compiles into smaller and more efficient Hermes bytecode than other JavaScript libraries. See the benchmark for details.

Comparison

unicode-segmenter aims to be lighter and faster than the alternatives in the ecosystem while remaining fully spec compliant. The benchmark therefore tracks the performance, bundle size, and Unicode version compliance of several libraries.

`unicode-segmenter/grapheme` vs

graphemer@1.4.0 (16.6M+ weekly downloads on NPM)
grapheme-splitter@1.0.4 (5.7M+ weekly downloads on NPM)
@formatjs/intl-segmenter@11.5.7 (5.4K+ weekly downloads on NPM)
WebAssembly build of unicode-segmentation@1.12.0 with minimum bindings
Built-in Intl.Segmenter API

JS Bundle Stats

| Name | Unicode® | ESM? | Size | Size (min) | Size (min+gzip) | Size (min+br) |
| --- | --- | --- | --- | --- | --- | --- |
| `unicode-segmenter/grapheme` | 16.0.0 | ✔️ | 15,929 | 12,110 | 5,050 | 3,738 |
| `graphemer` | 15.0.0 | ✖️ | 410,435 | 95,104 | 15,752 | 10,660 |
| `grapheme-splitter` | 10.0.0 | ✖️ | 122,252 | 23,680 | 7,852 | 4,841 |
| `@formatjs/intl-segmenter`* | 15.0.0 | ✖️ | 491,043 | 318,721 | 54,248 | 34,380 |
| `unicode-segmentation`* | 16.0.0 | ✔️ | 56,529 | 52,443 | 24,110 | 17,343 |
| `Intl.Segmenter`* | - | - | 0 | 0 | 0 | 0 |

@formatjs/intl-segmenter handles grapheme, word, and sentence segmentation, but it's not tree-shakable.
The unicode-segmentation size contains only the minimum WASM binary and the bindings needed to run the benchmark. It will increase if more features are exposed.
Intl.Segmenter's Unicode data depends on the host, and may not be up-to-date.
Intl.Segmenter may not be available in some old browsers, edge runtimes, or embedded environments.

Hermes Bytecode Stats

| Name | Bytecode size | Bytecode size (gzip)* |
| --- | --- | --- |
| `unicode-segmenter/grapheme` | 22,019 | 11,513 |
| `graphemer` | 133,974 | 31,715 |
| `grapheme-splitter` | 63,855 | 19,133 |

*The bytecode is compressed when it is included as an app asset.

Runtime Performance

Here is a brief summary; see the archived benchmark results for details.

Performance in Node.js: unicode-segmenter/grapheme is significantly faster than alternatives.

6~15x faster than other JavaScript libraries
1.5~3x faster than the WASM binding of Rust's unicode-segmentation
1.5~3x faster than the built-in Intl.Segmenter

Performance in Bun: unicode-segmenter/grapheme has almost the same performance as the built-in Intl.Segmenter, with no performance degradation compared to other JavaScript libraries.

Performance in Browsers: The performance in browser environments varies greatly due to differences in browser engines and versions, which makes benchmarking less consistent. Despite these variations, unicode-segmenter/grapheme generally outperforms other JavaScript libraries in most environments.

Performance in React Native: unicode-segmenter/grapheme is significantly faster than alternatives when compiled to Hermes bytecode. It's 3~8x faster than graphemer and 20~26x faster than grapheme-splitter, with the performance gap increasing with input size.

Performance in QuickJS: unicode-segmenter/grapheme is the only usable library in terms of performance.

Instead of trusting these claims, you can try yarn perf:grapheme directly in your environment or build your own benchmark.
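If you would rather build your own benchmark, a rough timing harness could start like this. It is a sketch, not the project's benchmark suite: `timeIt` and `countNative` are hypothetical names, and the only candidate measured here is the built-in `Intl.Segmenter`. Other libraries would be timed by passing their own functions.

```javascript
// Hypothetical harness: returns average milliseconds per call after a short warm-up.
function timeIt(fn, input, iterations = 200) {
  for (let i = 0; i < 50; i++) fn(input); // warm-up so the JIT can optimize
  const start = performance.now();
  for (let i = 0; i < iterations; i++) fn(input);
  return (performance.now() - start) / iterations;
}

const benchSegmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

// Candidate under test: counts graphemes with the native API.
function countNative(input) {
  let n = 0;
  for (const _ of benchSegmenter.segment(input)) n += 1;
  return n;
}

const sample = 'a\u0310e\u0301o\u0308\u0332\r\n'.repeat(100);
const msPerCall = timeIt(countNative, sample);
console.log(`Intl.Segmenter: ${msPerCall.toFixed(4)} ms per call`);
```

Timings from a loop like this are noisy, so compare candidates on the same machine and input, and repeat the run several times before drawing conclusions.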

Acknowledgments

The Rust Unicode team (@unicode-rs):
The initial implementation was ported manually from the unicode-segmentation library.
Marijn Haverbeke (@marijnh):
His library inspired a technique that greatly compresses the Unicode data table.

LICENSE

MIT
