gh-109559: Update unicodedata for Unicode 15.1 #109560
Changes from all commits
21e297c
122a732
6d5238e
cd9cbf5
818a36c
24088ca
110c552
d8d9f98
27b1c13
af730eb
0db6920
44f6770
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
Update :mod:`unicodedata` database to Unicode 15.1.0. |
Large diffs are not rendered by default.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -44,7 +44,7 @@ | ||
# * Doc/library/stdtypes.rst, and | ||
# * Doc/library/unicodedata.rst | ||
# * Doc/reference/lexical_analysis.rst (two occurrences) | ||
UNIDATA_VERSION = "15.1.0" | ||
UNICODE_DATA = "UnicodeData%s.txt" | ||
COMPOSITION_EXCLUSIONS = "CompositionExclusions%s.txt" | ||
EASTASIAN_WIDTH = "EastAsianWidth%s.txt" | ||
@@ -101,15 +101,16 @@ | ||
# these ranges need to match unicodedata.c:is_unified_ideograph | ||
cjk_ranges = [ | ||
('3400', '4DBF'), # CJK Ideograph Extension A CJK | ||
('4E00', '9FFF'), # CJK Ideograph | ||
('20000', '2A6DF'), # CJK Ideograph Extension B | ||
('2A700', '2B739'), # CJK Ideograph Extension C | ||
('2B740', '2B81D'), # CJK Ideograph Extension D | ||
('2B820', '2CEA1'), # CJK Ideograph Extension E | ||
('2CEB0', '2EBE0'), # CJK Ideograph Extension F | ||
('2EBF0', '2EE5D'), # CJK Ideograph Extension I | ||
Review comment: The range check that occurs later in this file implicitly assumes this list is in sorted order. It seems simpler to have an idiosyncratic order here than to try to introduce | |
('30000', '3134A'), # CJK Ideograph Extension G | ||
('31350', '323AF'), # CJK Ideograph Extension H | ||
] | ||
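A minimal sketch of why the ordering matters, assuming the later check walks the ranges in ascending code point order. The helper names `as_ints` and `is_sorted_and_disjoint` are hypothetical and are not part of the script under review; the list is copied from the hunk above.

```python
# Ranges as they appear in the hunk above (code point order, so
# Extension I sits before Extensions G and H).
cjk_ranges = [
    ('3400', '4DBF'),
    ('4E00', '9FFF'),
    ('20000', '2A6DF'),
    ('2A700', '2B739'),
    ('2B740', '2B81D'),
    ('2B820', '2CEA1'),
    ('2CEB0', '2EBE0'),
    ('2EBF0', '2EE5D'),  # Extension I
    ('30000', '3134A'),  # Extension G
    ('31350', '323AF'),  # Extension H
]


def as_ints(ranges):
    """Convert (hex_start, hex_end) string pairs to integer pairs."""
    return [(int(start, 16), int(end, 16)) for start, end in ranges]


def is_sorted_and_disjoint(ranges):
    """True if every range is well formed and starts after the previous one ends."""
    bounds = as_ints(ranges)
    well_formed = all(lo <= hi for lo, hi in bounds)
    ascending = all(prev_hi < lo
                    for (_, prev_hi), (lo, _) in zip(bounds, bounds[1:]))
    return well_formed and ascending


# Hypothetical alternative: the same ranges with Extension I moved to the
# end, so that the extension letters read alphabetically.
alphabetical = cjk_ranges[:7] + cjk_ranges[8:] + [cjk_ranges[7]]

print(is_sorted_and_disjoint(cjk_ranges))
print(is_sorted_and_disjoint(alphabetical))
```

Under that assumption, the list as written passes, while the alphabetical variant does not, because Extension I starts below the end of Extension H.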
@@ -1105,11 +1106,15 @@ def __init__(self, version, cjk_check=True): | ||
table[i].east_asian_width = widths[i] | ||
self.widths = widths | ||
for char, (propname, *propinfo) in UcdFile(DERIVED_CORE_PROPERTIES, version).expanded(): | ||
if propinfo: | ||
# this is not a binary property, ignore it | ||
continue | ||
Review comment on lines +1109 to +1112: All the properties defined in [...] As of Unicode 15.1, this file also includes definitions that use the [...] With this change, the loop skips over any non-binary properties, since we have nothing to do with them. | |
Reply: It seems like it would be safer to explicitly ignore [...] | |
Reply: Is there a particular failure mode you have in mind? My rationale here was that the current internalized DB only cares about binary properties in this file, but in practice any of the property types enumerated by UAX #44 could appear in a future revision. I'm not strongly opposed to ignoring the specific property that breaks the tool against the current revision, but it seems safer to prevent this class of failure in the future if/when additional non-binary properties are added. | |
if table[char]: | ||
# Some properties (e.g. Default_Ignorable_Code_Point) | ||
# apply to unassigned code points; ignore them | ||
table[char].binary_properties.add(propname) | ||
for char_range, value in UcdFile(LINE_BREAK, version): | ||
if value not in MANDATORY_LINE_BREAKS: | ||
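The `(propname, *propinfo)` unpacking discussed in the review thread can be sketched in isolation. This is only an illustration under stated assumptions: `parse_property_line` and `binary_properties` are hypothetical helpers, not the script's `UcdFile` API, and the sample lines (including the `Example_Enum` property) are made up rather than copied from the UCD.

```python
def parse_property_line(line):
    """Split one semicolon-separated data line into (range, fields).

    Returns None for blank lines and comment-only lines. Assumes every
    data line has at least one ';'-separated field after the range.
    """
    data = line.split('#', 1)[0].strip()
    if not data:
        return None
    char_range, *fields = [field.strip() for field in data.split(';')]
    return char_range, fields


def binary_properties(lines):
    """Collect (range, property name) pairs, skipping non-binary entries."""
    result = []
    for line in lines:
        parsed = parse_property_line(line)
        if parsed is None:
            continue
        char_range, (propname, *propinfo) = parsed
        if propinfo:
            # Extra fields mean the property carries a value, so it is
            # not a binary property; ignore it.
            continue
        result.append((char_range, propname))
    return result


# Illustrative input only; the second data line uses a hypothetical
# non-binary property to exercise the skip.
sample = [
    "# illustrative data",
    "0041..005A    ; Alphabetic # uppercase Latin letters",
    "0915..0939    ; Example_Enum; Some_Value # hypothetical entry",
    "",
]

print(binary_properties(sample))
```

Keying the skip on the presence of extra fields, rather than on a specific property name, is the design choice the author argues for in the thread: it covers any valued property a later revision might add.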