Fix BUG: read_sql tries to convert blob/varbinary to string with pyarrow backend #60105
The change to `convert` (hunk `@@ -970,7 +970,17 @@ def convert(arr):`):

```python
if dtype_backend != "numpy" and arr.dtype == np.dtype("O"):
    new_dtype = StringDtype()
    arr_cls = new_dtype.construct_array_type()
    try:
        # Addressing (#59242)
        # Byte data that could not be decoded into
        # a string would throw a UnicodeDecodeError exception.
        # Try and greedily convert to string;
        # will fail if the object is bytes.
        arr = arr_cls._from_sequence(arr, dtype=new_dtype)
    except UnicodeDecodeError:
        pass
elif dtype_backend != "numpy" and isinstance(arr, np.ndarray):
    if arr.dtype.kind in "iufb":
        arr = pd_array(arr, copy=False)
```

Review discussion on this hunk:

- Reviewer: I think this is the wrong place to be doing this; in the sql.py module, can't we read in the type of the database and only try to convert BINARY types to Arrow binary types?
- Reviewer: Based on some local testing using the ADBC driver I can confirm that it yields a […]. Wondering if it makes sense to remove the code here trying to convert based on a […].
- Reviewer: I think the general problem is that pandas does not have a first-class "binary" data type, so I'm not sure how to solve this for anything but the pyarrow backend. With the pyarrow backend, I think you can still move this logic to […]. Not sure if @mroeschke has other thoughts on the general issue. This is likely another good use case to track in PDEP-13 (#58455).
- Reviewer: I agree that this is the incorrect place to handle this conversion logic, and this should only be a valid conversion for the pyarrow backend (…).
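The first suggestion above, keying the conversion off the declared database column type rather than the Python objects, can be pictured with a small standalone sketch. This is not pandas' actual sql.py code; the helper name `arrow_dtype_for_sql_type` and the exact set of SQLAlchemy binary types checked are assumptions made for illustration.

```python
import pandas as pd
import pyarrow as pa
import sqlalchemy as sa


def arrow_dtype_for_sql_type(sql_type):
    # Hypothetical helper: only binary-flavoured SQL column types map to an
    # Arrow binary dtype; everything else is left to the usual
    # dtype_backend handling.
    if isinstance(sql_type, (sa.LargeBinary, sa.BINARY, sa.VARBINARY)):
        return pd.ArrowDtype(pa.binary())
    return None


print(arrow_dtype_for_sql_type(sa.LargeBinary()))  # binary[pyarrow]
print(arrow_dtype_for_sql_type(sa.String()))       # None
```

Mapping from the declared column type avoids guessing from object contents at read time, which is what the review thread argues for.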
On the `arr = arr_cls._from_sequence(arr, dtype=new_dtype)` line:

- Reviewer: Shouldn't this ideally return […]?
- Author: That makes sense, thanks! It looks like the previous logic was not taking pyarrow types into account when doing this conversion, so I've added logic similar to my initial change, where we try to convert to a pyarrow string but fall back to binary if we run into an invalid error (i.e. we tried to parse but it failed due to an encoding error). Please let me know what you think! I was also considering trying to type-check the contents of […].
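The fallback behaviour the author describes can be sketched in isolation. This is an illustration of the idea for the pyarrow backend, not the PR's actual implementation; the helper name and the exact exceptions caught are assumptions.

```python
import pandas as pd
import pyarrow as pa


def to_arrow_string_or_binary(values):
    # Hypothetical helper: greedily try an Arrow string array, then fall back
    # to binary when the payload cannot be treated as valid UTF-8 text
    # (e.g. raw BLOB/VARBINARY bytes).
    try:
        return pd.arrays.ArrowExtensionArray(pa.array(values, type=pa.string()))
    except (pa.ArrowInvalid, pa.ArrowTypeError, UnicodeDecodeError):
        return pd.arrays.ArrowExtensionArray(pa.array(values, type=pa.binary()))


print(to_arrow_string_or_binary(["hello", "world"]).dtype)            # string[pyarrow]
print(to_arrow_string_or_binary([b"\x01#Eg\x89\xab\xcd\xef"]).dtype)  # binary[pyarrow]
```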
The new test in the SQL test suite (hunk `@@ -4352,3 +4352,17 @@ def test_xsqlite_if_exists(sqlite_buildin):`):

```python
        (5, "E"),
    ]
    drop_table(table_name, sqlite_buildin)


def test_bytes_column(sqlite_buildin):
    """
    Regression test for (#59242)
    Bytes being returned in a column that could not be converted
    to a string would raise a UnicodeDecodeError
    when using dtype_backend='pyarrow'
    """
    query = """
    select cast(x'0123456789abcdef0123456789abcdef' as blob) a
    """
    df = pd.read_sql(query, sqlite_buildin, dtype_backend="pyarrow")
    assert df.a.values[0] == b"\x01#Eg\x89\xab\xcd\xef\x01#Eg\x89\xab\xcd\xef"
```

Review discussion on this test:

- Reviewer, on the docstring: This is well intentioned, but can you remove the docstring? We don't use them in tests. Instead, you can just add a comment pointing to the GitHub issue number in the function body.
- Reviewer, on the assertion: Can you use our built-in test helpers instead? I think you can just do:

  ```python
  result = pd.read_sql(...)
  expected = pd.DataFrame({"a": ...}, dtype=pd.ArrowDtype(pa.binary()))
  tm.assert_frame_equal(result, expected)
  ```

  What data type does this produce currently with the […]?
- Author: For sure, changed the testing logic over to using this! For […].
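Rewritten along the lines suggested above, the test might look roughly like the following sketch. This is not the PR's final version; the expected `pd.ArrowDtype(pa.binary())` dtype is an assumption taken from the reviewer's snippet, and the actual dtype returned by `read_sql` could differ.

```python
import pyarrow as pa

import pandas as pd
import pandas._testing as tm


def test_bytes_column(sqlite_buildin):
    # GH#59242: reading a BLOB column with dtype_backend="pyarrow" used to
    # raise UnicodeDecodeError while coercing the bytes to string.
    query = """
    select cast(x'0123456789abcdef0123456789abcdef' as blob) a
    """
    result = pd.read_sql(query, sqlite_buildin, dtype_backend="pyarrow")
    expected = pd.DataFrame(
        {"a": [b"\x01#Eg\x89\xab\xcd\xef\x01#Eg\x89\xab\xcd\xef"]},
        dtype=pd.ArrowDtype(pa.binary()),
    )
    tm.assert_frame_equal(result, expected)
```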