Commit ad6dc17

Fix some bugs in the diffWords regex (and errors & ambiguities in the comment above it) (#635)

1 parent 3e1774a commit ad6dc17

3 files changed, +103 -17 lines changed

- release-notes.md
- src/diff/word.ts
- test/diff/word.js

`‎release-notes.md‎`

Lines changed: 1 addition & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -3,6 +3,7 @@`
`3`	`3`	`## Future 8.0.3 release`
`4`	`4`
`5`	`5`	- [#631](https://github.com/kpdecker/jsdiff/pull/631) - fix support for using an `Intl.Segmenter` with `diffWords`. This has been almost completely broken since the feature was added in v6.0.0, since it would outright crash on any text that featured two consecutive newlines between a pair of words (a very common case).
	`6`	+- [#635](https://github.com/kpdecker/jsdiff/pull/635) - small tweaks to tokenization behaviour of `diffWords` when used without an `Intl.Segmenter`. Specifically, the soft hyphen (U+00AD) is no longer considered to be a word break, and the multiplication and division signs (`×` and `÷`) are now treated as punctuation instead of as letters / word characters.
`6`	`7`
`7`	`8`	`## 8.0.2`
`8`	`9`

`‎src/diff/word.ts‎`

Lines changed: 19 additions & 17 deletions

Original file line number	Diff line number	Diff line change
`@@ -4,23 +4,25 @@ import { longestCommonPrefix, longestCommonSuffix, replacePrefix, replaceSuffix,`
`4`	`4`
`5`	`5`	`// Based on https://en.wikipedia.org/wiki/Latin_script_in_Unicode`
`6`	`6`	`//`
`7`		`-// Ranges and exceptions:`
`8`		`-// Latin-1 Supplement, 0080–00FF`
`9`		`-// - U+00D7 × Multiplication sign`
`10`		`-// - U+00F7 ÷ Division sign`
`11`		`-// Latin Extended-A, 0100–017F`
`12`		`-// Latin Extended-B, 0180–024F`
`13`		`-// IPA Extensions, 0250–02AF`
`14`		`-// Spacing Modifier Letters, 02B0–02FF`
`15`		`-// - U+02C7 ˇ ˇ Caron`
`16`		`-// - U+02D8 ˘ ˘ Breve`
`17`		`-// - U+02D9 ˙ ˙ Dot Above`
`18`		`-// - U+02DA ˚ ˚ Ring Above`
`19`		`-// - U+02DB ˛ ˛ Ogonek`
`20`		`-// - U+02DC ˜ ˜ Small Tilde`
`21`		`-// - U+02DD ˝ ˝ Double Acute Accent`
`22`		`-// Latin Extended Additional, 1E00–1EFF`
`23`		`-const extendedWordChars = 'a-zA-Z0-9_\\u{C0}-\\u{FF}\\u{D8}-\\u{F6}\\u{F8}-\\u{2C6}\\u{2C8}-\\u{2D7}\\u{2DE}-\\u{2FF}\\u{1E00}-\\u{1EFF}';`
	`7`	`+// Chars/ranges counted as "word" characters by this regex are as follows:`
	`8`	`+//`
	`9`	`+// + U+00AD Soft hyphen`
	`10`	`+// + 00C0–00FF (letters with diacritics from the Latin-1 Supplement), except:`
	`11`	`+// - U+00D7 × Multiplication sign`
	`12`	`+// - U+00F7 ÷ Division sign`
	`13`	`+// + Latin Extended-A, 0100–017F`
	`14`	`+// + Latin Extended-B, 0180–024F`
	`15`	`+// + IPA Extensions, 0250–02AF`
	`16`	`+// + Spacing Modifier Letters, 02B0–02FF, except:`
	`17`	`+// - U+02C7 ˇ ˇ Caron`
	`18`	`+// - U+02D8 ˘ ˘ Breve`
	`19`	`+// - U+02D9 ˙ ˙ Dot Above`
	`20`	`+// - U+02DA ˚ ˚ Ring Above`
	`21`	`+// - U+02DB ˛ ˛ Ogonek`
	`22`	`+// - U+02DC ˜ ˜ Small Tilde`
	`23`	`+// - U+02DD ˝ ˝ Double Acute Accent`
	`24`	`+// + Latin Extended Additional, 1E00–1EFF`
	`25`	`+const extendedWordChars = 'a-zA-Z0-9_\\u{AD}\\u{C0}-\\u{D6}\\u{D8}-\\u{F6}\\u{F8}-\\u{2C6}\\u{2C8}-\\u{2D7}\\u{2DE}-\\u{2FF}\\u{1E00}-\\u{1EFF}';`
`24`	`26`
`25`	`27`	`// Each token is one of the following:`
`26`	`28`	`// - A punctuation mark plus the surrounding whitespace`
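
To see what the changed character class does in isolation, the sketch below wraps the old and new `extendedWordChars` strings from the diff above in a single-character regex. The helper `isWordChar` and the name `oldExtendedWordChars` are hypothetical and not part of jsdiff; the code only checks class membership and is not the library's tokenizer.

```javascript
// Character class strings copied from the removed and added lines of the diff.
const oldExtendedWordChars = 'a-zA-Z0-9_\\u{C0}-\\u{FF}\\u{D8}-\\u{F6}\\u{F8}-\\u{2C6}\\u{2C8}-\\u{2D7}\\u{2DE}-\\u{2FF}\\u{1E00}-\\u{1EFF}';
const extendedWordChars = 'a-zA-Z0-9_\\u{AD}\\u{C0}-\\u{D6}\\u{D8}-\\u{F6}\\u{F8}-\\u{2C6}\\u{2C8}-\\u{2D7}\\u{2DE}-\\u{2FF}\\u{1E00}-\\u{1EFF}';

// Hypothetical helper: true if `ch` (one code point) falls inside the class.
// The 'u' flag is required for the \u{...} escapes to be understood.
function isWordChar(charClass, ch) {
  return new RegExp(`^[${charClass}]$`, 'u').test(ch);
}

for (const ch of ['\u00AD', '\u00D7', '\u00F7', '\u00E9', '\u02C7']) {
  const code = 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
  console.log(code, 'old:', isWordChar(oldExtendedWordChars, ch), 'new:', isWordChar(extendedWordChars, ch));
}
// U+00AD old: false new: true   (soft hyphen)
// U+00D7 old: true new: false   (multiplication sign)
// U+00F7 old: true new: false   (division sign)
// U+00E9 old: true new: true    (é)
// U+02C7 old: false new: false  (caron)
```

The old string listed `\u{C0}-\u{FF}` as one range, which silently covered U+00D7 and U+00F7 even though the comment listed them as exceptions; splitting the range at U+00D6 is what makes the exceptions effective.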

`‎test/diff/word.js‎`

Lines changed: 83 additions & 0 deletions

Original file line number	Diff line number	Diff line change
`@@ -59,6 +59,89 @@ describe('WordDiff', function() {`
`59`	`59`	`'.'`
`60`	`60`	`]);`
`61`	`61`	`});`
	`62`	`+`
	`63`	`+// Test for various behaviours discussed at`
	`64`	`+// https://github.com/kpdecker/jsdiff/issues/634#issuecomment-3381707327`
	`65`	`+// In particular we are testing that:`
	`66`	`+// 1. single code points representing accented characters (most of range`
	`67`	`+// U+00C0 thru U+00FF) are treated as word characters`
	`68`	`+// 2. soft hyphens are treated as part of the word they appear in`
	`69`	`+// 3. the multiplication and division signs are punctuation`
	`70`	`+// 4. currency signs are punctuation`
	`71`	`+// 5. section symbol is punctuation`
	`72`	`+// 6. reserved trademark symbol is punctuation`
	`73`	`+// 7. fractions are punctuation`
	`74`	`+// The behaviour being tested for in points 4 thru 7 above is of debatable`
	`75`	`+// correctness; it is not totally obvious whether we SHOULD treat those`
	`76`	`+// things as punctuation characters or as word characters. Nonetheless, we`
	`77`	`+// have this test to help document the current behaviour.`
	`78`	`+it('should handle the 0080-00FF range the way we expect', () => {`
	`79`	`+expect(`
	`80`	`+wordDiff.tokenize(`
	`81`	`+'My daugh\u00adter, Am\u00E9lie, is 1½ years old and works for ' +`
	`82`	`+'Google® for £6 per hour (equivalently £6÷60=£0.10 per minute, or ' +`
	`83`	`+'£6×8=£48 per day), in violation of § 123 of the Child Labour Act.'`
	`84`	`+)`
	`85`	`+).to.deep.equal([`
	`86`	`+'My ',`
	`87`	`+' daugh\u00adter',`
	`88`	`+', ',`
	`89`	`+' Am\u00E9lie',`
	`90`	`+', ',`
	`91`	`+' is ',`
	`92`	`+' 1',`
	`93`	`+'½ ',`
	`94`	`+' years ',`
	`95`	`+' old ',`
	`96`	`+' and ',`
	`97`	`+' works ',`
	`98`	`+' for ',`
	`99`	`+' Google',`
	`100`	`+'® ',`
	`101`	`+' for ',`
	`102`	`+' £',`
	`103`	`+'6 ',`
	`104`	`+' per ',`
	`105`	`+' hour ',`
	`106`	`+' (',`
	`107`	`+'equivalently ',`
	`108`	`+' £',`
	`109`	`+'6',`
	`110`	`+'÷',`
	`111`	`+'60',`
	`112`	`+'=',`
	`113`	`+'£',`
	`114`	`+'0',`
	`115`	`+'.',`
	`116`	`+'10 ',`
	`117`	`+' per ',`
	`118`	`+' minute',`
	`119`	`+', ',`
	`120`	`+' or ',`
	`121`	`+' £',`
	`122`	`+'6',`
	`123`	`+'×',`
	`124`	`+'8',`
	`125`	`+'=',`
	`126`	`+'£',`
	`127`	`+'48 ',`
	`128`	`+' per ',`
	`129`	`+' day',`
	`130`	`+')',`
	`131`	`+', ',`
	`132`	`+' in ',`
	`133`	`+' violation ',`
	`134`	`+' of ',`
	`135`	`+' § ',`
	`136`	`+' 123 ',`
	`137`	`+' of ',`
	`138`	`+' the ',`
	`139`	`+' Child ',`
	`140`	`+' Labour ',`
	`141`	`+' Act',`
	`142`	`+'.'`
	`143`	`+]);`
	`144`	`+});`
`62`	`145`	`});`
`63`	`146`
`64`	`147`	`describe('#diffWords', function() {`
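
Part of the token list in the new test can be reproduced with a simplified tokenizer. The sketch below assumes a regex of the form "run of word characters, or run of whitespace, or one other character"; `sketchTokenize` and `oldClass`/`newClass` are hypothetical names. Unlike `wordDiff.tokenize`, this sketch does not attach surrounding whitespace to tokens, so it only illustrates where word boundaries fall and does not produce the expected array from the test.

```javascript
// Character classes copied from the old and new lines of src/diff/word.ts.
const oldClass = 'a-zA-Z0-9_\\u{C0}-\\u{FF}\\u{D8}-\\u{F6}\\u{F8}-\\u{2C6}\\u{2C8}-\\u{2D7}\\u{2DE}-\\u{2FF}\\u{1E00}-\\u{1EFF}';
const newClass = 'a-zA-Z0-9_\\u{AD}\\u{C0}-\\u{D6}\\u{D8}-\\u{F6}\\u{F8}-\\u{2C6}\\u{2C8}-\\u{2D7}\\u{2DE}-\\u{2FF}\\u{1E00}-\\u{1EFF}';

// Hypothetical simplified tokenizer (an assumption, not jsdiff's implementation):
// a run of word chars, a run of whitespace, or a single non-word character.
function sketchTokenize(charClass, text) {
  const re = new RegExp(`[${charClass}]+|\\s+|[^${charClass}]`, 'ug');
  return text.match(re) || [];
}

console.log(sketchTokenize(oldClass, '£6÷60'));
// [ '£', '6÷60' ]           ÷ glued the two numbers into one "word"
console.log(sketchTokenize(newClass, '£6÷60'));
// [ '£', '6', '÷', '60' ]   ÷ is now its own punctuation token

console.log(sketchTokenize(oldClass, 'daugh\u00adter').length);
// 3   the soft hyphen split the word in two
console.log(sketchTokenize(newClass, 'daugh\u00adter').length);
// 1   the soft hyphen stays inside the word
```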

0 commit comments
