Sep 20, 2023 · Sep 19, 2023 · Sep 19, 2023 · Sep 19, 2023 · Sep 19, 2023 · Sep 19, 2023
diff --git a/Doc/library/stdtypes.rst b/Doc/library/stdtypes.rst

   The casefolding algorithm is
   `described in section 3.13 'Default Case Folding' of the Unicode Standard
-  <https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf>`__.
+  <https://www.unicode.org/versions/Unicode15.1.0/ch03.pdf>`__.

   .. versionadded:: 3.3

   property being one of "Lm", "Lt", "Lu", "Ll", or "Lo".  Note that this is different
   from the `Alphabetic property defined in the section 4.10 'Letters, Alphabetic, and
   Ideographic' of the Unicode Standard
-  <https://www.unicode.org/versions/Unicode15.0.0/ch04.pdf>`_.
+  <https://www.unicode.org/versions/Unicode15.1.0/ch04.pdf>`_.


 .. method:: str.isascii()

   The lowercasing algorithm used is
   `described in section 3.13 'Default Case Folding' of the Unicode Standard
-  <https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf>`__.
+  <https://www.unicode.org/versions/Unicode15.1.0/ch03.pdf>`__.


 .. method:: str.lstrip([chars])

   The uppercasing algorithm used is
   `described in section 3.13 'Default Case Folding' of the Unicode Standard
-  <https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf>`__.
+  <https://www.unicode.org/versions/Unicode15.1.0/ch03.pdf>`__.


 .. method:: str.zfill(width)
diff --git a/Doc/library/unicodedata.rst b/Doc/library/unicodedata.rst

 This module provides access to the Unicode Character Database (UCD) which
 defines character properties for all Unicode characters. The data contained in
-this database is compiled from the `UCD version 15.0.0
-<https://www.unicode.org/Public/15.0.0/ucd>`_.
+this database is compiled from the `UCD version 15.1.0
+<https://www.unicode.org/Public/15.1.0/ucd>`_.

 The module uses the same names and symbols as defined by Unicode
 Standard Annex #44, `"Unicode Character Database"

 .. rubric:: Footnotes

-.. [#] https://www.unicode.org/Public/15.0.0/ucd/NameAliases.txt
+.. [#] https://www.unicode.org/Public/15.1.0/ucd/NameAliases.txt

-.. [#] https://www.unicode.org/Public/15.0.0/ucd/NamedSequences.txt
+.. [#] https://www.unicode.org/Public/15.1.0/ucd/NamedSequences.txt
diff --git a/Doc/reference/lexical_analysis.rst b/Doc/reference/lexical_analysis.rst
 * *Nd* - decimal numbers
 * *Pc* - connector punctuations
 * *Other_ID_Start* - explicit list of characters in `PropList.txt
- <https://www.unicode.org/Public/15.0.0/ucd/PropList.txt>`_ to support backwards
+ <https://www.unicode.org/Public/15.1.0/ucd/PropList.txt>`_ to support backwards
  compatibility
 * *Other_ID_Continue* - likewise

 All identifiers are converted into the normal form NFKC while parsing; comparison
 of identifiers is based on NFKC.

 A non-normative HTML file listing all valid identifier characters for Unicode
-15.0.0 can be found at
-https://www.unicode.org/Public/15.0.0/ucd/DerivedCoreProperties.txt
+15.1.0 can be found at
+https://www.unicode.org/Public/15.1.0/ucd/DerivedCoreProperties.txt


 .. _keywords:

 .. rubric:: Footnotes

-.. [#] https://www.unicode.org/Public/15.0.0/ucd/NameAliases.txt
+.. [#] https://www.unicode.org/Public/15.1.0/ucd/NameAliases.txt
diff --git a/Misc/NEWS.d/next/Library/2023-09-19-01-22-43.gh-issue-109559.ijaycU.rst b/Misc/NEWS.d/next/Library/2023-09-19-01-22-43.gh-issue-109559.ijaycU.rst
+Update :mod:`unicodedata` database to Unicode 15.1.0.
diff --git a/Modules/unicodedata.c b/Modules/unicodedata.c
        (0x2B740 <= code && code <= 0x2B81D) || /* CJK Ideograph Extension D */
        (0x2B820 <= code && code <= 0x2CEA1) || /* CJK Ideograph Extension E */
        (0x2CEB0 <= code && code <= 0x2EBE0) || /* CJK Ideograph Extension F */
+       (0x2EBF0 <= code && code <= 0x2EE5D) || /* CJK Ideograph Extension I */
        (0x30000 <= code && code <= 0x3134A) || /* CJK Ideograph Extension G */
        (0x31350 <= code && code <= 0x323AF);   /* CJK Ideograph Extension H */
 }
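
For readers more at home in Python, the range test above can be mirrored roughly as follows. This is only a sketch and not CPython code: the name `is_unified_ideograph_py` is hypothetical, and the ranges are copied from the `cjk_ranges` list in `Tools/unicode/makeunicodedata.py` as updated by this change.

```python
# Hypothetical Python mirror of the C range test in Modules/unicodedata.c.
# Ranges are copied from cjk_ranges in Tools/unicode/makeunicodedata.py.
CJK_RANGES = [
    (0x3400, 0x4DBF),    # CJK Ideograph Extension A
    (0x4E00, 0x9FFF),    # CJK Ideograph
    (0x20000, 0x2A6DF),  # CJK Ideograph Extension B
    (0x2A700, 0x2B739),  # CJK Ideograph Extension C
    (0x2B740, 0x2B81D),  # CJK Ideograph Extension D
    (0x2B820, 0x2CEA1),  # CJK Ideograph Extension E
    (0x2CEB0, 0x2EBE0),  # CJK Ideograph Extension F
    (0x2EBF0, 0x2EE5D),  # CJK Ideograph Extension I (added for Unicode 15.1)
    (0x30000, 0x3134A),  # CJK Ideograph Extension G
    (0x31350, 0x323AF),  # CJK Ideograph Extension H
]


def is_unified_ideograph_py(code: int) -> bool:
    """Return True if the code point falls inside any listed CJK range."""
    return any(first <= code <= last for first, last in CJK_RANGES)


print(is_unified_ideograph_py(0x2EBF0))  # True: first code point of Extension I
print(is_unified_ideograph_py(0x2EBEF))  # False: gap between Extensions F and I
```

Extension I sits numerically between Extensions F and G, which is why the new line is inserted in the middle of the chain rather than appended at the end.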
diff --git a/Modules/unicodedata_db.h b/Modules/unicodedata_db.h
diff --git a/Modules/unicodename_db.h b/Modules/unicodename_db.h
diff --git a/Objects/unicodetype_db.h b/Objects/unicodetype_db.h
diff --git a/Tools/unicode/makeunicodedata.py b/Tools/unicode/makeunicodedata.py
 #   * Doc/library/stdtypes.rst, and
 #   * Doc/library/unicodedata.rst
 #   * Doc/reference/lexical_analysis.rst (two occurrences)
-UNIDATA_VERSION = "15.0.0"
+UNIDATA_VERSION = "15.1.0"
 UNICODE_DATA = "UnicodeData%s.txt"
 COMPOSITION_EXCLUSIONS = "CompositionExclusions%s.txt"
 EASTASIAN_WIDTH = "EastAsianWidth%s.txt"

 # these ranges need to match unicodedata.c:is_unified_ideograph
 cjk_ranges = [
-   ('3400', '4DBF'),
-   ('4E00', '9FFF'),
-   ('20000', '2A6DF'),
-   ('2A700', '2B739'),
-   ('2B740', '2B81D'),
-   ('2B820', '2CEA1'),
-   ('2CEB0', '2EBE0'),
-   ('30000', '3134A'),
-   ('31350', '323AF'),
+   ('3400', '4DBF'),    # CJK Ideograph Extension A CJK
+   ('4E00', '9FFF'),    # CJK Ideograph
+   ('20000', '2A6DF'),  # CJK Ideograph Extension B
+   ('2A700', '2B739'),  # CJK Ideograph Extension C
+   ('2B740', '2B81D'),  # CJK Ideograph Extension D
+   ('2B820', '2CEA1'),  # CJK Ideograph Extension E
+   ('2CEB0', '2EBE0'),  # CJK Ideograph Extension F
+   ('2EBF0', '2EE5D'),  # CJK Ideograph Extension I
+   ('30000', '3134A'),  # CJK Ideograph Extension G
+   ('31350', '323AF'),  # CJK Ideograph Extension H
 ]


                table[i].east_asian_width = widths[i]
        self.widths = widths

-       for char, (p,) in UcdFile(DERIVED_CORE_PROPERTIES, version).expanded():
+       for char, (propname, *propinfo) in UcdFile(DERIVED_CORE_PROPERTIES, version).expanded():
+           if propinfo:
+               # this is not a binary property, ignore it
+               continue
+
            if table[char]:
                # Some properties (e.g. Default_Ignorable_Code_Point)
                # apply to unassigned code points; ignore them
-               table[char].binary_properties.add(p)
+               table[char].binary_properties.add(propname)

        for char_range, value in UcdFile(LINE_BREAK, version):
            if value not in MANDATORY_LINE_BREAKS:
Doc/library/stdtypes.rst
		@@ -1641,7 +1641,7 @@ expression support in the :mod:`re` module).

		The casefolding algorithm is
		`described in section 3.13 'Default Case Folding' of the Unicode Standard
		<https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf>`__.
		<https://www.unicode.org/versions/Unicode15.1.0/ch03.pdf>`__.

		.. versionadded:: 3.3

		@@ -1805,7 +1805,7 @@ expression support in the :mod:`re` module).
		property being one of "Lm", "Lt", "Lu", "Ll", or "Lo". Note that this is different
		from the `Alphabetic property defined in the section 4.10 'Letters, Alphabetic, and
		Ideographic' of the Unicode Standard
		<https://www.unicode.org/versions/Unicode15.0.0/ch04.pdf>`_.
		<https://www.unicode.org/versions/Unicode15.1.0/ch04.pdf>`_.


		.. method:: str.isascii()
		@@ -1941,7 +1941,7 @@ expression support in the :mod:`re` module).

		The lowercasing algorithm used is
		`described in section 3.13 'Default Case Folding' of the Unicode Standard
		<https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf>`__.
		<https://www.unicode.org/versions/Unicode15.1.0/ch03.pdf>`__.


		.. method:: str.lstrip([chars])
		@@ -2290,7 +2290,7 @@ expression support in the :mod:`re` module).

		The uppercasing algorithm used is
		`described in section 3.13 'Default Case Folding' of the Unicode Standard
		<https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf>`__.
		<https://www.unicode.org/versions/Unicode15.1.0/ch03.pdf>`__.


		.. method:: str.zfill(width)
Doc/library/unicodedata.rst
		@@ -17,8 +17,8 @@

		This module provides access to the Unicode Character Database (UCD) which
		defines character properties for all Unicode characters. The data contained in
		this database is compiled from the `UCD version 15.0.0
		<https://www.unicode.org/Public/15.0.0/ucd>`_.
		this database is compiled from the `UCD version 15.1.0
		<https://www.unicode.org/Public/15.1.0/ucd>`_.

		The module uses the same names and symbols as defined by Unicode
		Standard Annex #44, `"Unicode Character Database"
		@@ -175,6 +175,6 @@ Examples:

		.. rubric:: Footnotes

		.. [#] https://www.unicode.org/Public/15.0.0/ucd/NameAliases.txt
		.. [#] https://www.unicode.org/Public/15.1.0/ucd/NameAliases.txt

		.. [#] https://www.unicode.org/Public/15.0.0/ucd/NamedSequences.txt
		.. [#] https://www.unicode.org/Public/15.1.0/ucd/NamedSequences.txt
Doc/reference/lexical_analysis.rst
		@@ -315,16 +315,16 @@ The Unicode category codes mentioned above stand for:
		* Nd - decimal numbers
		* Pc - connector punctuations
		* Other_ID_Start - explicit list of characters in `PropList.txt
		<https://www.unicode.org/Public/15.0.0/ucd/PropList.txt>`_ to support backwards
		<https://www.unicode.org/Public/15.1.0/ucd/PropList.txt>`_ to support backwards
		compatibility
		* Other_ID_Continue - likewise

		All identifiers are converted into the normal form NFKC while parsing; comparison
		of identifiers is based on NFKC.

		A non-normative HTML file listing all valid identifier characters for Unicode
		15.0.0 can be found at
		https://www.unicode.org/Public/15.0.0/ucd/DerivedCoreProperties.txt
		15.1.0 can be found at
		https://www.unicode.org/Public/15.1.0/ucd/DerivedCoreProperties.txt


		.. _keywords:
		@@ -1045,4 +1045,4 @@ occurrence outside string literals and comments is an unconditional error:

		.. rubric:: Footnotes

		.. [#] https://www.unicode.org/Public/15.0.0/ucd/NameAliases.txt
		.. [#] https://www.unicode.org/Public/15.1.0/ucd/NameAliases.txt
Misc/NEWS.d/next/Library/2023-09-19-01-22-43.gh-issue-109559.ijaycU.rst
		@@ -0,0 +1 @@
		Update :mod:`unicodedata` database to Unicode 15.1.0.
Modules/unicodedata.c
		@@ -1035,6 +1035,7 @@ is_unified_ideograph(Py_UCS4 code)
		(0x2B740 <= code && code <= 0x2B81D) || /* CJK Ideograph Extension D */
		(0x2B820 <= code && code <= 0x2CEA1) || /* CJK Ideograph Extension E */
		(0x2CEB0 <= code && code <= 0x2EBE0) || /* CJK Ideograph Extension F */
		(0x2EBF0 <= code && code <= 0x2EE5D) || /* CJK Ideograph Extension I */
		(0x30000 <= code && code <= 0x3134A) || /* CJK Ideograph Extension G */
		(0x31350 <= code && code <= 0x323AF); /* CJK Ideograph Extension H */
		}
Tools/unicode/makeunicodedata.py
		@@ -44,7 +44,7 @@
		# * Doc/library/stdtypes.rst, and
		# * Doc/library/unicodedata.rst
		# * Doc/reference/lexical_analysis.rst (two occurrences)
		UNIDATA_VERSION = "15.0.0"
		UNIDATA_VERSION = "15.1.0"
		UNICODE_DATA = "UnicodeData%s.txt"
		COMPOSITION_EXCLUSIONS = "CompositionExclusions%s.txt"
		EASTASIAN_WIDTH = "EastAsianWidth%s.txt"
		@@ -101,15 +101,16 @@

		# these ranges need to match unicodedata.c:is_unified_ideograph
		cjk_ranges = [
		('3400', '4DBF'),
		('4E00', '9FFF'),
		('20000', '2A6DF'),
		('2A700', '2B739'),
		('2B740', '2B81D'),
		('2B820', '2CEA1'),
		('2CEB0', '2EBE0'),
		('30000', '3134A'),
		('31350', '323AF'),
		('3400', '4DBF'), # CJK Ideograph Extension A CJK
		('4E00', '9FFF'), # CJK Ideograph
		('20000', '2A6DF'), # CJK Ideograph Extension B
		('2A700', '2B739'), # CJK Ideograph Extension C
		('2B740', '2B81D'), # CJK Ideograph Extension D
		('2B820', '2CEA1'), # CJK Ideograph Extension E
		('2CEB0', '2EBE0'), # CJK Ideograph Extension F
		('2EBF0', '2EE5D'), # CJK Ideograph Extension I
Review comment by SnoopJ (Contributor, Author), Sep 19, 2023: The range check that occurs later in this file implicitly assumes this list is in sorted order. It seems simpler to have an idiosyncratic order here than to try to introduce `sorted()` or somesuch.
		('30000', '3134A'), # CJK Ideograph Extension G
		('31350', '323AF'), # CJK Ideograph Extension H
		]
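
The ordering concern in the review comment above can be illustrated with a short sketch. It assumes the later check compares this list against ranges collected in code-point order; the helper name `in_code_point_order` is hypothetical and not part of the tool. It also shows why a plain `sorted()` would not help: the entries are hex strings, so they would be compared lexicographically rather than numerically.

```python
# The list as updated by this change (hex strings, as in makeunicodedata.py).
cjk_ranges = [
    ('3400', '4DBF'),
    ('4E00', '9FFF'),
    ('20000', '2A6DF'),
    ('2A700', '2B739'),
    ('2B740', '2B81D'),
    ('2B820', '2CEA1'),
    ('2CEB0', '2EBE0'),
    ('2EBF0', '2EE5D'),
    ('30000', '3134A'),
    ('31350', '323AF'),
]


def in_code_point_order(ranges):
    """Hypothetical helper: are the range starts ascending as integers?"""
    starts = [int(first, 16) for first, _last in ranges]
    return starts == sorted(starts)


print(in_code_point_order(cjk_ranges))   # True
# Sorting the tuples directly compares strings, so '20000' sorts before '3400'.
print(sorted(cjk_ranges)[0])             # ('20000', '2A6DF')
print(sorted(cjk_ranges) == cjk_ranges)  # False
```

Keeping the literal list in code-point order, with Extension I ahead of Extensions G and H, avoids converting the strings to integers just to sort them.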


		@@ -1105,11 +1106,15 @@ def __init__(self, version, cjk_check=True):
		table[i].east_asian_width = widths[i]
		self.widths = widths

		for char, (p,) in UcdFile(DERIVED_CORE_PROPERTIES, version).expanded():
		for char, (propname, *propinfo) in UcdFile(DERIVED_CORE_PROPERTIES, version).expanded():
		if propinfo:
		# this is not a binary property, ignore it
		continue
Comment on lines +1109 to +1112

SnoopJ (Contributor, Author), Sep 19, 2023: All the properties defined in `DerivedCoreProperties.txt` happened to be binary until the latest update, so this tool was getting away with the assumption that they are always binary. As of Unicode 15.1, this file also includes definitions that use the `Indic_Conjunct_Break` (`InCB`) property, which is an enumeration. With this change, the loop skips over any non-binary properties, since we have nothing to do with them.

benjaminp (Contributor), Sep 19, 2023: It seems like it would be safer to explicitly ignore `Indic_Conjunct_Break` rather than throw out everything with a second column.

SnoopJ (Contributor, Author), Sep 19, 2023: Is there a particular failure mode you have in mind? My rationale here was that the current internalized DB only cares about binary properties in this file, but in practice any of the property types enumerated by UAX #44 could appear in a future revision. I'm not strongly opposed to ignoring only the specific property that breaks the tool against the current revision, but it seems safer to prevent this class of failure if/when additional non-binary properties are added.

		if table[char]:
		# Some properties (e.g. Default_Ignorable_Code_Point)
		# apply to unassigned code points; ignore them
		table[char].binary_properties.add(p)
		table[char].binary_properties.add(propname)

		for char_range, value in UcdFile(LINE_BREAK, version):
		if value not in MANDATORY_LINE_BREAKS:
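
The starred unpacking discussed in the review thread above can be sketched in isolation. This is a simplified illustration, not the tool's code: `parse_fields` is a hypothetical stand-in for `UcdFile`, and the two sample lines are written in the style of `DerivedCoreProperties.txt` rather than copied from it.

```python
# Illustrative lines in the style of DerivedCoreProperties.txt (assumed format:
# semicolon-separated fields, with '#' starting a trailing comment).
SAMPLE = """\
0041..005A    ; Alphabetic # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
094D          ; InCB; Linker # Mn       DEVANAGARI SIGN VIRAMA
"""


def parse_fields(text):
    """Hypothetical stand-in for UcdFile: yield stripped fields per data line."""
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()
        if not line:
            continue
        yield [field.strip() for field in line.split(';')]


binary_properties = {}
for char_range, propname, *propinfo in parse_fields(SAMPLE):
    if propinfo:
        # A further field means the property carries a value (e.g. InCB),
        # so it is not a binary property; skip it.
        continue
    binary_properties.setdefault(char_range, set()).add(propname)

print(binary_properties)  # {'0041..005A': {'Alphabetic'}}

# The pre-change pattern, `(p,) = fields`, fails as soon as a value is present.
try:
    (p,) = ['InCB', 'Linker']
except ValueError as exc:
    print('old unpacking fails:', exc)
```

Skipping on "any extra field" rather than on a specific property name is the trade-off debated in the thread: it tolerates future non-binary properties, at the cost of not flagging them explicitly.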