cometkim/unicode-segmenter


A lightweight implementation of the Unicode Text Segmentation (UAX #29)

  • Spec compliant: Up-to-date Unicode data, verified by the official Unicode test suites, fuzzed against the native Intl.Segmenter, and maintained at 100% test coverage.

  • Excellent compatibility: It works well on older browsers, edge runtimes, React Native (Hermes) and QuickJS.

  • Zero dependencies: It doesn't bloat node_modules or network bandwidth. It behaves like a small, self-contained snippet.

  • Small bundle size: It effectively compresses the Unicode data and provides a bundler-friendly format.

  • Extremely efficient: It's carefully optimized for runtime performance, making it the fastest one in the ecosystem, outperforming even the built-in Intl.Segmenter.

  • TypeScript: It's fully type-checked, and provides type definitions and JSDoc.

  • ESM-first: It primarily supports ES modules, and still supports CommonJS.

Note

unicode-segmenter is now an e18e recommendation!

Unicode® Version

Unicode® 16.0.0

Unicode® Standard Annex #29 - Revision 45 (2024-08-28)

APIs

Entries for Unicode text segmentation.

And matchers for extra use cases.

Export unicode-segmenter/grapheme

Utilities for text segmentation by extended grapheme cluster rules.

Example: Get grapheme segments

    import { graphemeSegments } from 'unicode-segmenter/grapheme';

    [...graphemeSegments('a̐éö̲\r\n')];
    // 0: { segment: 'a̐', index: 0, input: 'a̐éö̲\r\n' }
    // 1: { segment: 'é', index: 2, input: 'a̐éö̲\r\n' }
    // 2: { segment: 'ö̲', index: 4, input: 'a̐éö̲\r\n' }
    // 3: { segment: '\r\n', index: 7, input: 'a̐éö̲\r\n' }
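The index field makes it possible to cut a string on a grapheme boundary. The following is a minimal sketch, not part of the library: truncateAtGrapheme is a hypothetical helper, and the built-in Intl.Segmenter stands in for graphemeSegments() so the snippet runs without installing anything. It assumes both sources yield { segment, index } objects, as the example output above suggests.

```javascript
// Hypothetical helper (not part of unicode-segmenter): cut `text` to at most
// `maxUnits` UTF-16 code units without splitting a grapheme cluster.
// `segments` is any iterable of { segment, index } objects.
function truncateAtGrapheme(text, segments, maxUnits) {
  let end = 0;
  for (const { segment, index } of segments) {
    const next = index + segment.length;
    if (next > maxUnits) break; // this cluster would exceed the budget
    end = next;
  }
  return text.slice(0, end);
}

// Stand-in segment source. Assumption: graphemeSegments(text) could be
// passed here instead, since it yields objects of the same shape.
const text = 'a\u0310e\u0301o\u0308\u0332\r\n';
const segments = new Intl.Segmenter(undefined, { granularity: 'grapheme' }).segment(text);

const truncated = truncateAtGrapheme(text, segments, 3);
console.log(truncated.length); // 2, because the second cluster would end at unit 4
```

Working with UTF-16 offsets keeps the helper compatible with String.prototype.slice, so no extra index conversion is needed.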

Example: Split graphemes

    import { splitGraphemes } from 'unicode-segmenter/grapheme';

    [...splitGraphemes('#️⃣*️⃣0️⃣1️⃣2️⃣')];
    // 0: #️⃣
    // 1: *️⃣
    // 2: 0️⃣
    // 3: 1️⃣
    // 4: 2️⃣

Example: Count graphemes

    import { countGraphemes } from 'unicode-segmenter/grapheme';

    '👋 안녕!'.length;
    // => 6
    countGraphemes('👋 안녕!');
    // => 5

    'a̐éö̲'.length;
    // => 7
    countGraphemes('a̐éö̲');
    // => 3

Note

countGraphemes() is a small wrapper around graphemeSegments().

If you need the count for the same input more than once, consider memoizing it, or call graphemeSegments() or splitGraphemes() once and reuse the result.
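One way to memoize the count is a small cache keyed by the input string. This is only a sketch: memoizeCounter is a hypothetical helper, and countWithIntl is a stand-in built on the native Intl.Segmenter so the snippet runs on its own. In a project using this package, countGraphemes would presumably be passed in its place.

```javascript
// Hypothetical helper: caches the result of any grapheme-counting function
// per input string.
function memoizeCounter(countFn) {
  const cache = new Map();
  return (text) => {
    if (!cache.has(text)) cache.set(text, countFn(text));
    return cache.get(text);
  };
}

// Stand-in counter (assumption: countGraphemes from
// 'unicode-segmenter/grapheme' can be used here instead).
const intlSegmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
let calls = 0;
function countWithIntl(text) {
  calls += 1;
  let count = 0;
  for (const _segment of intlSegmenter.segment(text)) count += 1;
  return count;
}

const cachedCount = memoizeCounter(countWithIntl);
cachedCount('👋 안녕!'); // segments the string
cachedCount('👋 안녕!'); // served from the cache
console.log(calls); // 1
```

The Map in this sketch grows without bound, so a long-running process would need some eviction policy.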

Example: Build an advanced grapheme matcher

graphemeSegments() also exposes some information gathered during segmentation, which supports a few advanced use cases.

For example, knowing the Grapheme_Cluster_Break category at the beginning and end of a segment helps to approximately infer which boundary rule was applied.

    import { graphemeSegments, GraphemeCategory } from 'unicode-segmenter/grapheme';

    function* matchEmoji(input) {
      for (const { segment, _catBegin } of graphemeSegments(input)) {
        // `_catBegin` identified as Extended_Pictographic means the segment is emoji
        if (_catBegin === GraphemeCategory.Extended_Pictographic) {
          yield segment;
        }
      }
    }

    [...matchEmoji('1🌷2🎁3💩4😜5👍')]
    // 0: 🌷
    // 1: 🎁
    // 2: 💩
    // 3: 😜
    // 4: 👍

Or build an even more advanced one, such as a Unicode-aware TTY string width utility.

Export unicode-segmenter/intl-adapter

Intl.Segmenter API adapter (only granularity: "grapheme" is available so far)

    import { Segmenter } from 'unicode-segmenter/intl-adapter';

    // Same API as `Intl.Segmenter`
    const segmenter = new Segmenter();
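Because the adapter mirrors the Intl.Segmenter API, code can take the constructor as a parameter and stay agnostic about which implementation it gets. The sketch below illustrates that idea. listGraphemes is a hypothetical helper, and it assumes the adapter's Segmenter accepts the same (locales, options) arguments as the native constructor, which is the default here.

```javascript
// Hypothetical helper: list graphemes using any Intl.Segmenter-compatible
// constructor. Defaults to the native implementation.
function listGraphemes(text, SegmenterCtor = Intl.Segmenter) {
  const segmenter = new SegmenterCtor(undefined, { granularity: 'grapheme' });
  // segment() returns an iterable of { segment, index, input } objects.
  return Array.from(segmenter.segment(text), (entry) => entry.segment);
}

const graphemes = listGraphemes('e\u0301\r\n!');
console.log(graphemes.length); // 3
```

With the package installed, the call would presumably become listGraphemes(text, Segmenter), using the Segmenter imported from 'unicode-segmenter/intl-adapter'.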

Export unicode-segmenter/intl-polyfill

Intl.Segmenter API polyfill (only granularity: "grapheme" is available so far)

    // Apply polyfill to the `globalThis.Intl` object.
    import 'unicode-segmenter/intl-polyfill';

    const segmenter = new Intl.Segmenter();
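Some applications may prefer to load the polyfill only on runtimes that lack a native implementation. The following is a rough sketch of that pattern, not a documented feature of the package: hasSegmenter and ensureSegmenter are hypothetical names, and the dynamic import assumes the polyfill module can be loaded lazily.

```javascript
// Hypothetical helper: reports whether an Intl-like object already provides
// a Segmenter constructor.
function hasSegmenter(intlObject) {
  return (
    typeof intlObject === 'object' &&
    intlObject !== null &&
    typeof intlObject.Segmenter === 'function'
  );
}

// Hypothetical helper: loads the polyfill only when it is needed.
// The dynamic import is reached only on runtimes without Intl.Segmenter.
async function ensureSegmenter() {
  if (!hasSegmenter(globalThis.Intl)) {
    await import('unicode-segmenter/intl-polyfill');
  }
  return new Intl.Segmenter(undefined, { granularity: 'grapheme' });
}
```

Conditional loading trades consistency for size: a native Intl.Segmenter uses whatever Unicode data the host ships, so results can differ between runtimes.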

Export unicode-segmenter/emoji

Utilities for matching emoji-like characters.

Example: Use Unicode emoji property matchers

    import {
      isEmojiPresentation, // match \p{Emoji_Presentation}
      isExtendedPictographic, // match \p{Extended_Pictographic}
    } from 'unicode-segmenter/emoji';

    isEmojiPresentation('😍'.codePointAt(0));
    // => true
    isEmojiPresentation('♡'.codePointAt(0));
    // => false

    isExtendedPictographic('😍'.codePointAt(0));
    // => true
    isExtendedPictographic('♡'.codePointAt(0));
    // => true
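Code point matchers like these combine naturally with grapheme segments: test the first code point of each segment and keep the ones that match. The sketch below shows the idea with stand-ins so that it runs without the package. filterByFirstCodePoint is a hypothetical helper, Intl.Segmenter replaces graphemeSegments(), and a regular expression with a Unicode property escape replaces isExtendedPictographic.

```javascript
// Hypothetical helper: yields the segments whose first code point satisfies
// `predicate`, where `predicate` takes a numeric code point.
function* filterByFirstCodePoint(segments, predicate) {
  for (const { segment } of segments) {
    if (predicate(segment.codePointAt(0))) yield segment;
  }
}

// Stand-in for isExtendedPictographic (assumption: same input, a code point).
const pictographic = /^\p{Extended_Pictographic}$/u;
const isPictographic = (codePoint) => pictographic.test(String.fromCodePoint(codePoint));

// Stand-in for graphemeSegments(input).
const emojiInput = '1🌷2🎁3💩4😜5👍';
const emojiSegments = new Intl.Segmenter(undefined, { granularity: 'grapheme' }).segment(emojiInput);

const found = [...filterByFirstCodePoint(emojiSegments, isPictographic)];
console.log(found.length); // 5
```

Checking only the first code point is an approximation; it treats the whole cluster as emoji whenever the cluster starts with a pictographic character.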

Export unicode-segmenter/general

Utilities for matching alphanumeric characters.

Example: Use Unicode general property matchers

    import {
      isLetter, // match \p{L}
      isNumeric, // match \p{N}
      isAlphabetic, // match \p{Alphabetic}
      isAlphanumeric, // match [\p{N}\p{Alphabetic}]
    } from 'unicode-segmenter/general';
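As a rough illustration of how such matchers might be applied, the sketch below counts the code points in a string that satisfy a predicate. It rests on an assumption not confirmed here: that these matchers take a numeric code point, like the emoji matchers do. countMatching is a hypothetical helper, and the regex-based predicates are stand-ins for isLetter and isNumeric so the snippet runs on its own.

```javascript
// Hypothetical helper: counts the code points of `text` accepted by
// `predicate`, where `predicate` takes a numeric code point.
function countMatching(text, predicate) {
  let count = 0;
  // for...of iterates by code point, not by UTF-16 code unit.
  for (const ch of text) {
    if (predicate(ch.codePointAt(0))) count += 1;
  }
  return count;
}

// Stand-ins for isLetter (\p{L}) and isNumeric (\p{N}).
const letterPattern = /^\p{L}$/u;
const numberPattern = /^\p{N}$/u;
const isLetterLike = (codePoint) => letterPattern.test(String.fromCodePoint(codePoint));
const isNumberLike = (codePoint) => numberPattern.test(String.fromCodePoint(codePoint));

console.log(countMatching('abc 123 안녕!', isLetterLike)); // 5
console.log(countMatching('abc 123 안녕!', isNumberLike)); // 3
```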

Runtime Compatibility

unicode-segmenter uses only fundamental features of ES2015, making it compatible with most browsers.

To ensure compatibility, the runtime should support:

If the runtime doesn't support these features, they can easily be supplied with tools like Babel.

React Native Support

Since Hermes doesn't support the Intl.Segmenter API yet, unicode-segmenter is a good alternative.

unicode-segmenter compiles into smaller and more efficient Hermes bytecode than other JavaScript libraries. See the benchmark for details.

Comparison

unicode-segmenter aims to be lighter and faster than alternatives in the ecosystem while staying fully spec compliant. The benchmark therefore tracks the performance, bundle size, and Unicode version compliance of several libraries.

unicode-segmenter/grapheme vs. alternatives

JS Bundle Stats

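| Name | Unicode® | ESM? | Size | Size (min) | Size (min+gzip) | Size (min+br) | Size (min+zstd) |
|---|---|---|---|---|---|---|---|
| unicode-segmenter/grapheme | 16.0.0 | ✔️ | 15,730 | 12,199 | 5,113 | 3,787 | 4,807 |
| graphemer | 15.0.0 | ✖️ | 410,435 | 95,104 | 15,752 | 10,660 | 15,911 |
| grapheme-splitter | 10.0.0 | ✖️ | 122,254 | 23,682 | 7,852 | 4,802 | 6,753 |
| @formatjs/intl-segmenter* | 15.0.0 | ✖️ | 603,301 | 369,576 | 72,225 | 49,483 | 67,964 |
| unicode-segmentation* | 16.0.0 | ✔️ | 56,529 | 52,439 | 24,108 | 17,343 | 24,375 |
| Intl.Segmenter* | - | - | 0 | 0 | 0 | 0 | 0 |
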
  • @formatjs/intl-segmenter handles grapheme, word, and sentence segmentation, but it's not tree-shakable.
  • The unicode-segmentation size includes only the minimum WASM binary and the bindings needed to run the benchmark. It will increase if more features are exposed.
  • Intl.Segmenter's Unicode data depends on the host, and may not be up-to-date.
  • Intl.Segmenter may not be available in some old browsers, edge runtimes, or embedded environments.

Hermes Bytecode Stats

| Name | Bytecode size | Bytecode size (gzip)* |
|---|---|---|
| unicode-segmenter/grapheme | 21,542 | 11,392 |
| graphemer | 134,089 | 31,766 |
| grapheme-splitter | 63,946 | 19,162 |

  • It would be compressed when included as an app asset.

Runtime Performance

Here is a brief summary. See the archived benchmark results for details.

Performance in Node.js/Bun/Deno: unicode-segmenter/grapheme has best-in-class performance.

Performance in Browsers: The performance in browser environments varies greatly due to differences in browser engines, which makes benchmarking inconsistent, but:

  • Still significantly faster than other JavaScript libraries.
  • Generally outperforms the built-in Intl.Segmenter in most browser environments, except Firefox.

Performance in React Native: unicode-segmenter/grapheme is still faster than alternatives when compiled to Hermes bytecode. It's 3 to 8 times faster than graphemer and 20 to 26 times faster than grapheme-splitter, with the performance gap increasing with input size.

Performance in QuickJS: unicode-segmenter/grapheme is the only usable library in terms of performance.

Instead of trusting these claims, you can try yarn perf:grapheme directly in your environment or build your own benchmark.
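For a quick home-grown check, a timing loop along these lines may be enough as a starting point. It is a deliberately naive sketch and not the project's benchmark suite: measure is a hypothetical helper, and only the native Intl.Segmenter is measured so that the snippet runs without extra packages. Other candidates, such as graphemeSegments(), would be added as further measure() calls.

```javascript
// Hypothetical micro-benchmark helper: runs `fn` repeatedly after a short
// warm-up and reports the elapsed wall-clock time.
function measure(label, fn, iterations = 1000) {
  for (let i = 0; i < 100; i++) fn(); // warm-up so the JIT can optimize
  const start = performance.now();
  for (let i = 0; i < iterations; i++) fn();
  const elapsedMs = performance.now() - start;
  return { label, iterations, elapsedMs };
}

const sample = 'a\u0310e\u0301 👋 안녕! '.repeat(50);
const nativeSegmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });

const result = measure('Intl.Segmenter', () => {
  let count = 0;
  for (const _entry of nativeSegmenter.segment(sample)) count += 1;
  return count;
});

console.log(`${result.label}: ${result.elapsedMs.toFixed(2)} ms for ${result.iterations} runs`);
```

Results from a loop like this are noisy, so it is worth repeating runs and varying the input size before drawing conclusions.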

Acknowledgments

LICENSE

MIT
