Commit ad6dc17

Fix some bugs in the diffWords regex (and errors & ambiguities in the comment above it) (#635)
1 parent 3e1774a, commit ad6dc17

3 files changed (+103 -17 lines changed)


release-notes.md

Lines changed: 1 addition & 0 deletions

@@ -3,6 +3,7 @@
 ## Future 8.0.3 release

 - [#631](https://github.com/kpdecker/jsdiff/pull/631) - **fix support for using an `Intl.Segmenter` with `diffWords`**. This has been almost completely broken since the feature was added in v6.0.0, since it would outright crash on any text that featured two consecutive newlines between a pair of words (a very common case).
+- [#635](https://github.com/kpdecker/jsdiff/pull/635) - **small tweaks to tokenization behaviour of `diffWords`** when used *without* an `Intl.Segmenter`. Specifically, the soft hyphen (U+00AD) is no longer considered to be a word break, and the multiplication and division signs (`×` and `÷`) are now treated as punctuation instead of as letters / word characters.

 ## 8.0.2

src/diff/word.ts

Lines changed: 19 additions & 17 deletions

@@ -4,23 +4,25 @@ import { longestCommonPrefix, longestCommonSuffix, replacePrefix, replaceSuffix,

 // Based on https://en.wikipedia.org/wiki/Latin_script_in_Unicode
 //
-// Ranges and exceptions:
-// Latin-1 Supplement, 0080–00FF
-//   - U+00D7 × Multiplication sign
-//   - U+00F7 ÷ Division sign
-// Latin Extended-A, 0100–017F
-// Latin Extended-B, 0180–024F
-// IPA Extensions, 0250–02AF
-// Spacing Modifier Letters, 02B0–02FF
-//   - U+02C7 ˇ ˇ Caron
-//   - U+02D8 ˘ ˘ Breve
-//   - U+02D9 ˙ ˙ Dot Above
-//   - U+02DA ˚ ˚ Ring Above
-//   - U+02DB ˛ ˛ Ogonek
-//   - U+02DC ˜ ˜ Small Tilde
-//   - U+02DD ˝ ˝ Double Acute Accent
-// Latin Extended Additional, 1E00–1EFF
-const extendedWordChars = 'a-zA-Z0-9_\\u{C0}-\\u{FF}\\u{D8}-\\u{F6}\\u{F8}-\\u{2C6}\\u{2C8}-\\u{2D7}\\u{2DE}-\\u{2FF}\\u{1E00}-\\u{1EFF}';
+// Chars/ranges counted as "word" characters by this regex are as follows:
+//
+// + U+00AD Soft hyphen
+// + 00C0–00FF (letters with diacritics from the Latin-1 Supplement), except:
+//   - U+00D7 × Multiplication sign
+//   - U+00F7 ÷ Division sign
+// + Latin Extended-A, 0100–017F
+// + Latin Extended-B, 0180–024F
+// + IPA Extensions, 0250–02AF
+// + Spacing Modifier Letters, 02B0–02FF, except:
+//   - U+02C7 ˇ ˇ Caron
+//   - U+02D8 ˘ ˘ Breve
+//   - U+02D9 ˙ ˙ Dot Above
+//   - U+02DA ˚ ˚ Ring Above
+//   - U+02DB ˛ ˛ Ogonek
+//   - U+02DC ˜ ˜ Small Tilde
+//   - U+02DD ˝ ˝ Double Acute Accent
+// + Latin Extended Additional, 1E00–1EFF
+const extendedWordChars = 'a-zA-Z0-9_\\u{AD}\\u{C0}-\\u{D6}\\u{D8}-\\u{F6}\\u{F8}-\\u{2C6}\\u{2C8}-\\u{2D7}\\u{2DE}-\\u{2FF}\\u{1E00}-\\u{1EFF}';

 // Each token is one of the following:
 // - A punctuation mark plus the surrounding whitespace
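
To make the effect of the character-class change concrete, here is a minimal sketch that probes the old and new class bodies from this diff one character at a time. The two strings are copied from the removed and added `extendedWordChars` lines; the helper `isWordChar` is hypothetical and is not part of jsdiff.

```typescript
// Class bodies copied from the removed (old) and added (new) lines of this diff.
const oldWordChars =
  'a-zA-Z0-9_\\u{C0}-\\u{FF}\\u{D8}-\\u{F6}\\u{F8}-\\u{2C6}\\u{2C8}-\\u{2D7}\\u{2DE}-\\u{2FF}\\u{1E00}-\\u{1EFF}';
const newWordChars =
  'a-zA-Z0-9_\\u{AD}\\u{C0}-\\u{D6}\\u{D8}-\\u{F6}\\u{F8}-\\u{2C6}\\u{2C8}-\\u{2D7}\\u{2DE}-\\u{2FF}\\u{1E00}-\\u{1EFF}';

// Hypothetical helper (not a jsdiff API): wraps a class body in an anchored
// single-character regex. The "u" flag is needed for the \u{...} escapes.
function isWordChar(classBody: string, ch: string): boolean {
  return new RegExp(`^[${classBody}]$`, 'u').test(ch);
}

// Multiplication sign, division sign, soft hyphen, e-acute, caron.
const probes = ['\u00D7', '\u00F7', '\u00AD', '\u00E9', '\u02C7'];
const comparison = probes.map((ch) => ({
  codePoint: 'U+' + ch.codePointAt(0)!.toString(16).toUpperCase(),
  oldClass: isWordChar(oldWordChars, ch),
  newClass: isWordChar(newWordChars, ch),
}));
console.log(comparison);
```

The old class matched `×` and `÷` because its `\u{C0}-\u{FF}` range covered the whole block, despite the exceptions listed in the old comment. The new class splits that range at U+00D7 and relies on the existing gap at U+00F7, and it adds U+00AD explicitly.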

test/diff/word.js

Lines changed: 83 additions & 0 deletions

@@ -59,6 +59,89 @@ describe('WordDiff', function() {
         '.'
       ]);
     });
+
+    // Test for various behaviours discussed at
+    // https://github.com/kpdecker/jsdiff/issues/634#issuecomment-3381707327
+    // In particular we are testing that:
+    // 1. single code points representing accented characters (most of range
+    //    U+00C0 thru U+00FF) are treated as word characters
+    // 2. soft hyphens are treated as part of the word they appear in
+    // 3. the multiplication and division signs are punctuation
+    // 4. currency signs are punctuation
+    // 5. section symbol is punctuation
+    // 6. reserved trademark symbol is punctuation
+    // 7. fractions are punctuation
+    // The behaviour being tested for in points 4 thru 7 above is of debatable
+    // correctness; it is not totally obvious whether we SHOULD treat those
+    // things as punctuation characters or as word characters. Nonetheless, we
+    // have this test to help document the current behaviour.
+    it('should handle the 0080-00FF range the way we expect', () => {
+      expect(
+        wordDiff.tokenize(
+          'My daugh\u00adter, Am\u00E9lie, is 1½ years old and works for ' +
+          'Google® for £6 per hour (equivalently £6÷60=£0.10 per minute, or ' +
+          '£6×8=£48 per day), in violation of § 123 of the Child Labour Act.'
+        )
+      ).to.deep.equal([
+        'My ',
+        ' daugh\u00adter',
+        ', ',
+        ' Am\u00E9lie',
+        ', ',
+        ' is ',
+        ' 1',
+        '½ ',
+        ' years ',
+        ' old ',
+        ' and ',
+        ' works ',
+        ' for ',
+        ' Google',
+        '® ',
+        ' for ',
+        ' £',
+        '6 ',
+        ' per ',
+        ' hour ',
+        ' (',
+        'equivalently ',
+        ' £',
+        '6',
+        '÷',
+        '60',
+        '=',
+        '£',
+        '0',
+        '.',
+        '10 ',
+        ' per ',
+        ' minute',
+        ', ',
+        ' or ',
+        ' £',
+        '6',
+        '×',
+        '8',
+        '=',
+        '£',
+        '48 ',
+        ' per ',
+        ' day',
+        ')',
+        ', ',
+        ' in ',
+        ' violation ',
+        ' of ',
+        ' § ',
+        ' 123 ',
+        ' of ',
+        ' the ',
+        ' Child ',
+        ' Labour ',
+        ' Act',
+        '.'
+      ]);
+    });
   });

   describe('#diffWords', function() {
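
The token boundaries asserted in the new test can be approximated with a much simpler splitter built on the same character class. This is a rough sketch, not jsdiff's tokenizer: `roughSplit` is a hypothetical helper, and unlike `wordDiff.tokenize` it emits whitespace runs as separate pieces instead of attaching them to neighbouring tokens.

```typescript
// New class body, copied from the added extendedWordChars line in src/diff/word.ts.
const wordChars =
  'a-zA-Z0-9_\\u{AD}\\u{C0}-\\u{D6}\\u{D8}-\\u{F6}\\u{F8}-\\u{2C6}\\u{2C8}-\\u{2D7}\\u{2DE}-\\u{2FF}\\u{1E00}-\\u{1EFF}';

// Hypothetical helper (an assumption for illustration only): splits text into
// runs of word characters, runs of whitespace, and single punctuation characters.
function roughSplit(text: string): string[] {
  const re = new RegExp(`[${wordChars}]+|\\s+|[^${wordChars}\\s]`, 'gu');
  return text.match(re) || [];
}

// Currency and division signs fall outside the class, so each becomes its own piece.
const priced = roughSplit('\u00A36\u00F760=\u00A30.10');
console.log(priced); // ['£', '6', '÷', '60', '=', '£', '0', '.', '10']

// The soft hyphen (U+00AD) and é (U+00E9) are word characters, so neither splits a word.
const named = roughSplit('daugh\u00ADter, Am\u00E9lie');
console.log(named.length); // 4 pieces: the word, ',', ' ', and the name
```

Ignoring whitespace, the pieces for `£6÷60=£0.10` line up with the corresponding entries in the expected array of the test above.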
