tc39/proposal-string-prototype-codepoints


String.prototype.codePoints proposal for ECMAScript (stage 1)

tc39.github.io/proposal-string-prototype-codepoints/

License

MIT license


String.prototype.codePoints

ECMAScript proposal for String.prototype.codePoints

Status

The proposal is in stage 1 of the TC39 process.

Motivation

Lexers for languages that involve code points above 0xFFFF (such as ECMAScript syntax itself) need to be able to tokenise a string into separate code points before handling them with their own state machine.

Currently, the language provides two ways to access entire code points:

codePointAt retrieves a code point at a known position. The issue is that the position is usually not known in advance when you are simply iterating over the string, so you have to calculate it yourself on each iteration with a manual for(;;) loop and a magic-looking expression like pos += currentCodePoint <= 0xFFFF ? 1 : 2.
String.prototype[Symbol.iterator] allows hassle-free iteration over a string's code points, but yields their string values, which are inefficient to work with in performance-critical lexers and still lack position information.
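
To make the trade-off concrete, here is a minimal sketch of both existing approaches using only APIs that are already in the language. The function names viaCodePointAt and viaIterator are illustrative and not part of any API.

```javascript
// Approach 1: codePointAt with manual position bookkeeping.
function viaCodePointAt(str) {
  const result = [];
  for (let pos = 0; pos < str.length; ) {
    const codePoint = str.codePointAt(pos);
    result.push({ position: pos, codePoint });
    // Code points above 0xFFFF occupy two UTF-16 code units.
    pos += codePoint <= 0xFFFF ? 1 : 2;
  }
  return result;
}

// Approach 2: the built-in string iterator. It yields one-code-point
// strings, so numeric values and positions have to be recovered separately.
function viaIterator(str) {
  return [...str];
}

const sample = "a\u{1D4B3}b"; // "a", U+1D4B3 (outside the BMP), "b"
console.log(viaCodePointAt(sample).map(entry => entry.position)); // [ 0, 1, 3 ]
console.log(viaIterator(sample).length); // 3
```

The first approach has positions and numbers but needs the manual increment. The second is convenient but returns strings only.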

Proposed solution

We propose the addition of a codePoints() method that is functionally similar to [@@iterator] but yields the positions and numerical values of code points instead of just string values, combining the benefits of both approaches presented above while avoiding the related pitfalls in consumer code.
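
As a rough illustration of the intended behaviour, the sketch below emulates the proposed method with a free-standing generator. It is an assumption-laden stand-in, not the specified algorithm: the function name codePoints and the choice to write it as a generator are hypothetical, and the final return value is modelled on the comment in the tokeniser example in this README, which says the position remains available once iteration is done.

```javascript
// Hypothetical stand-in for the proposed String.prototype.codePoints(),
// written as a free function so that no built-in prototype is modified.
function* codePoints(str) {
  let position = 0;
  while (position < str.length) {
    const codePoint = str.codePointAt(position);
    yield { position, codePoint };
    // Advance by one or two UTF-16 code units.
    position += codePoint <= 0xFFFF ? 1 : 2;
  }
  // Assumption: the final `{ done: true }` result still carries the
  // end-of-string position, as the tokeniser example expects.
  return { position };
}

for (const { position, codePoint } of codePoints("a\u{1D4B3}b")) {
  console.log(position, codePoint.toString(16));
}
// 0 61
// 1 1d4b3
// 3 62
```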

Naming

The name and casing of codePoints were chosen to be consistent with the existing codePointAt API.

Illustrative examples

Test if something is an identifier

    function isIdent(input) {
        let codePoints = input.codePoints();
        let first = codePoints.next();

        if (first.done || !isIdentifierStart(first.value.codePoint)) {
            return false;
        }

        for (let { codePoint } of codePoints) {
            if (!isIdentifierContinue(codePoint)) {
                return false;
            }
        }

        return true;
    }
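
The example above leaves isIdentifierStart and isIdentifierContinue undefined. One possible way to sketch them, assuming an engine with Unicode property escapes in regular expressions, is shown below. The regular expressions are an approximation of the ECMAScript identifier grammar and are not taken from the proposal.

```javascript
// Hypothetical helpers: approximate ECMAScript IdentifierStart and
// IdentifierPart using Unicode property escapes (requires the `u` flag).
const ID_START = /^[$_\p{ID_Start}]$/u;
const ID_CONTINUE = /^[$\u200C\u200D\p{ID_Continue}]$/u;

function isIdentifierStart(codePoint) {
  return ID_START.test(String.fromCodePoint(codePoint));
}

function isIdentifierContinue(codePoint) {
  return ID_CONTINUE.test(String.fromCodePoint(codePoint));
}

console.log(isIdentifierStart(0x61));    // true  ("a")
console.log(isIdentifierStart(0x31));    // false ("1")
console.log(isIdentifierContinue(0x31)); // true  ("1")
```

Converting each code point back to a string defeats part of the performance goal, so a real lexer would more likely use lookup tables. The regex version is only meant to make the example runnable.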

Full-blown tokeniser

    function toDigit(cp) {
        return cp - /* '0' */ 48;
    }

    // Generic helper
    class LookaheadIterator {
        constructor(inner) {
            this.inner = inner;
            this.next();
        }

        // The iterator is its own iterable, so `for...of` can consume it directly
        [Symbol.iterator]() {
            return this;
        }

        next() {
            let next = this.lookahead;
            this.lookahead = this.inner.next();
            return next;
        }

        skipWhile(cond) {
            while (!this.lookahead.done && cond(this.lookahead.value.codePoint)) {
                this.next();
            }
            // even when `done == true`, the returned `.value.position` is still valid
            // and represents position at the end of the string
            return this.lookahead.value.position;
        }
    }

    // Main tokeniser
    function* tokenise(input) {
        let iter = new LookaheadIterator(input.codePoints());

        for (let { position: start, codePoint } of iter) {
            if (isIdentifierStart(codePoint)) {
                yield {
                    type: 'Identifier',
                    start,
                    end: iter.skipWhile(isIdentifierContinue)
                };
            } else if (isDigit(codePoint)) {
                yield {
                    type: 'Number',
                    start,
                    end: iter.skipWhile(isDigit)
                };
            } else {
                throw new SyntaxError(`Expected an identifier or digit at ${start}`);
            }
        }
    }
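
The tokeniser relies on an isDigit helper that is not shown, and it defines toDigit without using it. The sketch below is one hypothetical way to fill those gaps: isDigit accepts ASCII digits only, and numberValue (an illustrative name) turns a Number token back into a numeric value. Because token positions are UTF-16 code unit offsets, they can be used directly with charCodeAt or slice.

```javascript
// Hypothetical helper: true for the ASCII digits '0'..'9'.
function isDigit(cp) {
  return cp >= 48 && cp <= 57;
}

// Same definition as in the tokeniser example, repeated to keep this
// snippet self-contained.
function toDigit(cp) {
  return cp - /* '0' */ 48;
}

// Hypothetical consumer: compute the value of a `Number` token.
// ASCII digits are single code units, so charCodeAt is sufficient here.
function numberValue(input, token) {
  let value = 0;
  for (let pos = token.start; pos < token.end; pos++) {
    value = value * 10 + toDigit(input.charCodeAt(pos));
  }
  return value;
}

console.log(numberValue("ab123", { type: 'Number', start: 2, end: 5 })); // 123
```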

FAQ

Why does the iterator emit an object instead of an array like other key-value iterators?
The [key, value] format is usually used for entries of collections that can be directly indexed by key.
Unlike those collections, strings in ECMAScript are indexed by 16-bit units of UTF-16 text and not by code points, so the emitted objects would not have consecutive indices but rather positions that may be 1 or 2 16-bit units apart.
To make explicit that positions and code points use different measurement units and string representations, we decided on the { position, codePoint } object format.
See #1 for more details.
What about iteration over different string representations, such as code units or grapheme clusters?
These are not covered by this particular proposal, but should be easy to add as separate methods or APIs. In particular, language-specific representations are being worked on as the Intl.Segmenter proposal.
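
The distinction between code unit indices and code points in the first answer can be observed with existing APIs. The following is a small sketch in which the measure function is an illustrative name introduced here:

```javascript
// Count the same string in two different units.
function measure(str) {
  return {
    codeUnits: str.length,       // 16-bit UTF-16 units
    codePoints: [...str].length, // whole code points
  };
}

const text = "a\u{1D4B3}b"; // the middle character lies outside the BMP

console.log(measure(text));                    // { codeUnits: 4, codePoints: 3 }
console.log(text.charCodeAt(1).toString(16));  // d835 (a lone high surrogate)
console.log(text.codePointAt(1).toString(16)); // 1d4b3
console.log(text.indexOf("b"));                // 3, not 2
```

The third code point sits at index 3, which is why a plain [key, value] pair with consecutive keys would be misleading.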

Specification

You can view the rendered spec here.

Implementations

Polyfill
