shogo82148/uniseg-forked

Forked from rivo/uniseg

Unicode Text Segmentation for Go (or: How to Count Characters in a String)

License

MIT license


Unicode Text Segmentation for Go

This Go package implements Unicode Text Segmentation according to Unicode Standard Annex #29, Unicode Line Breaking according to Unicode Standard Annex #14 (Unicode version 14.0.0), and monospace font string width calculation similar to wcwidth.

Background

Grapheme Clusters

In Go, strings are read-only slices of bytes. They can be turned into Unicode code points using a `for` ... `range` loop or by conversion: `[]rune(str)`. However, multiple code points may be combined into one user-perceived character, or what the Unicode specification calls a "grapheme cluster". Here are some examples:

| String | Bytes (UTF-8) | Code points (runes) | Grapheme clusters |
|--------|---------------|---------------------|-------------------|
| Käse | 6 bytes: `4b 61 cc 88 73 65` | 5 code points: `4b 61 308 73 65` | 4 clusters: `[4b],[61 308],[73],[65]` |
| 🏳️‍🌈 | 14 bytes: `f0 9f 8f b3 ef b8 8f e2 80 8d f0 9f 8c 88` | 4 code points: `1f3f3 fe0f 200d 1f308` | 1 cluster: `[1f3f3 fe0f 200d 1f308]` |
| 🇩🇪 | 8 bytes: `f0 9f 87 a9 f0 9f 87 aa` | 2 code points: `1f1e9 1f1ea` | 1 cluster: `[1f1e9 1f1ea]` |

This package provides tools to iterate over these grapheme clusters. This may be used to determine the number of user-perceived characters, to split strings in their intended places, or to extract individual characters which form a unit.
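The difference between bytes, code points, and clusters can be reproduced with the standard library alone. The sketch below uses the "Käse" string written with a combining diaeresis (U+0308). `naiveClusterCount` is a hypothetical helper, not part of this package: it only skips combining marks, so it handles this one case but not emoji sequences or flags, which need the full UAX #29 rules this package implements.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// naiveClusterCount counts every rune that is not a nonspacing combining
// mark (Unicode category Mn). This is a deliberate simplification and NOT
// the UAX #29 algorithm: ZWJ emoji sequences and regional-indicator flags
// are counted incorrectly.
func naiveClusterCount(s string) int {
	n := 0
	for _, r := range s {
		if !unicode.Is(unicode.Mn, r) {
			n++
		}
	}
	return n
}

func main() {
	s := "Ka\u0308se" // "Käse": 'a' followed by a combining diaeresis

	fmt.Println(len(s))                    // 6 (bytes)
	fmt.Println(utf8.RuneCountInString(s)) // 5 (code points)
	fmt.Println(naiveClusterCount(s))      // 4 (user-perceived characters)
}
```

For real text, prefer the package's own cluster functions, since the shortcut above breaks as soon as a cluster is built from anything other than combining marks.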

Word Boundaries

Word boundaries are used in a number of different contexts. The most familiar ones are selection (double-click mouse selection), cursor movement ("move to next word" control-arrow keys), and the dialog option "Whole Word Search" for search and replace. They are also used in database queries, to determine whether elements are within a certain number of words of one another. Searching may also use word boundaries in determining matching items. This package provides tools to determine word boundaries within strings.

Sentence Boundaries

Sentence boundaries are often used for triple-click or some other method of selecting or iterating through blocks of text that are larger than single words. They are also used to determine whether words occur within the same sentence in database queries. This package provides tools to determine sentence boundaries within strings.

Line Breaking

Line breaking, also known as word wrapping, is the process of breaking a section of text into lines such that it will fit in the available width of a page, window or other display area. This package provides tools to determine where a string may or may not be broken and where it must be broken (for example after newline characters).
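To make the distinction between "may break" and "must break" concrete, here is a toy sketch using only the standard library. `markBreaks` is a hypothetical helper, not part of this package, and its rules are an assumption chosen for illustration: a break is allowed after a run of spaces, and a break is mandatory after a newline and at the end of the text. The UAX #14 algorithm covers far more cases (punctuation, CJK text, hyphens, and so on).

```go
package main

import (
	"fmt"
	"strings"
)

// markBreaks copies s and inserts "|" where a line MAY be broken and "‖"
// where it MUST be broken. The rules are intentionally naive:
//   - must break: after '\n' and after the last rune
//   - may break:  after a space that is followed by a non-space rune
func markBreaks(s string) string {
	var b strings.Builder
	runes := []rune(s)
	for i, r := range runes {
		b.WriteRune(r)
		switch {
		case r == '\n' || i == len(runes)-1:
			b.WriteString("‖")
		case r == ' ' && runes[i+1] != ' ' && runes[i+1] != '\n':
			b.WriteString("|")
		}
	}
	return b.String()
}

func main() {
	fmt.Println(markBreaks("First line.\nSecond line."))
	// First |line.
	// ‖Second |line.‖
}
```

A word wrapper would then walk these markers, breaking at the last optional position that still fits the available width and always breaking at mandatory ones.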

Monospace Width

Most terminals or text displays / text editors using a monospace font (for example source code editors) use a fixed width for each character. Some characters such as emojis or characters found in Asian and other languages may take up more than one character cell. This package provides tools to determine the number of cells a string will take up when displayed in a monospace font. See the package documentation for more information.
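As a rough illustration of why the cell count differs from the rune count, the following standard-library sketch assigns widths by coarse rules. `roughWidth` is a hypothetical helper and its rules are assumptions made for this example: combining marks take no cell, runes in a few East Asian scripts take two cells, and everything else takes one. It ignores emoji, ambiguous-width characters, and conjoining Hangul jamo, all of which a real width calculation has to consider.

```go
package main

import (
	"fmt"
	"unicode"
)

// roughWidth estimates how many monospace cells s occupies.
// It is an approximation for illustration only, not this package's
// width algorithm.
func roughWidth(s string) int {
	w := 0
	for _, r := range s {
		switch {
		case unicode.Is(unicode.Mn, r):
			// Nonspacing combining marks share the cell of their base rune.
		case unicode.In(r, unicode.Han, unicode.Hiragana, unicode.Katakana, unicode.Hangul):
			w += 2 // commonly rendered as wide characters
		default:
			w++
		}
	}
	return w
}

func main() {
	fmt.Println(roughWidth("abc"))        // 3
	fmt.Println(roughWidth("世界"))         // 4
	fmt.Println(roughWidth("Ka\u0308se")) // 4
}
```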

Installation

go get github.com/rivo/uniseg

Examples

Counting Characters in a String

    n := uniseg.GraphemeClusterCount("🇩🇪🏳️‍🌈")
    fmt.Println(n)
    // 2

Calculating the Monospace String Width

    width := uniseg.StringWidth("🇩🇪🏳️‍🌈!")
    fmt.Println(width)
    // 5

Using the `Graphemes` Class

This is the most convenient method of iterating over grapheme clusters:

    gr := uniseg.NewGraphemes("👍🏼!")
    for gr.Next() {
    	fmt.Printf("%x ", gr.Runes())
    }
    // [1f44d 1f3fc] [21]

Using the `Step` or `StepString` Function

This is orders of magnitude faster than the `Graphemes` class, but it requires the handling of states and boundaries:

    str := "🇩🇪🏳️‍🌈"
    state := -1
    var c string
    for len(str) > 0 {
    	c, str, _, state = uniseg.StepString(str, state)
    	fmt.Printf("%x ", []rune(c))
    }
    // [1f1e9 1f1ea] [1f3f3 fe0f 200d 1f308]

Advanced Examples

Breaking into grapheme clusters and evaluating line breaks:

    str := "First line.\nSecond line."
    state := -1
    var (
    	c          string
    	boundaries int
    )
    for len(str) > 0 {
    	c, str, boundaries, state = uniseg.StepString(str, state)
    	fmt.Print(c)
    	if boundaries&uniseg.MaskLine == uniseg.LineCanBreak {
    		fmt.Print("|")
    	} else if boundaries&uniseg.MaskLine == uniseg.LineMustBreak {
    		fmt.Print("‖")
    	}
    }
    // First |line.
    // ‖Second |line.‖

If you're only interested in word segmentation, use `FirstWord` or `FirstWordInString`:

    str := "Hello, world!"
    state := -1
    var c string
    for len(str) > 0 {
    	c, str, state = uniseg.FirstWordInString(str, state)
    	fmt.Printf("(%s)\n", c)
    }
    // (Hello)
    // (,)
    // ( )
    // (world)
    // (!)

Similarly, use

- `FirstGraphemeCluster` or `FirstGraphemeClusterInString` for grapheme cluster determination only,
- `FirstSentence` or `FirstSentenceInString` for sentence segmentation only, and
- `FirstLineSegment` or `FirstLineSegmentInString` for line breaking / word wrapping (although using `Step` or `StepString` is preferred as it will observe grapheme cluster boundaries).

Finally, if you need to reverse a string while preserving grapheme clusters, use `ReverseString`:

    fmt.Println(uniseg.ReverseString("🇩🇪🏳️‍🌈"))
    // 🏳️‍🌈🇩🇪
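For contrast, the sketch below shows what goes wrong when a string is reversed rune by rune with the standard library only. `reverseRunes` is a hypothetical helper written for this illustration, not part of this package. With "Käse" spelled using a combining diaeresis (U+0308), the mark ends up after the "s" instead of the "a", so the reversed text renders with the diaeresis on the wrong letter.

```go
package main

import "fmt"

// reverseRunes reverses s one code point at a time. It does not keep
// grapheme clusters together, which is exactly the problem a
// cluster-aware reversal avoids.
func reverseRunes(s string) string {
	r := []rune(s)
	for i, j := 0, len(r)-1; i < j; i, j = i+1, j-1 {
		r[i], r[j] = r[j], r[i]
	}
	return string(r)
}

func main() {
	s := "Ka\u0308se" // K, a, U+0308, s, e
	fmt.Printf("%+q\n", reverseRunes(s))
	// "es\u0308aK"
}
```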

Documentation

Refer to https://pkg.go.dev/github.com/rivo/uniseg for the package's documentation.

Dependencies

This package does not depend on any packages outside the standard library.

Sponsor this Project

Become a Sponsor on GitHub to support this project!

Your Feedback

Add your issue here on GitHub, preferably before submitting any PRs. Feel free to get in touch if you have any questions.
