python strings and binary data

Question 1

My question is about Python 3.0 strings.

1. My understanding is that for the line `str = "a"`, the character 'a' is encoded (using UTF-8, for example) and stored in the `str` object. If the UTF-8 representation of 'a' is 1 byte, the string is 1 byte long. Am I right?
2. If the above is true, what happens when we read a binary file using `read()`? Suppose I have a file containing two bytes of binary data and I read it into a string using `read()` like this:

   ```
   file = open(fileName, mode='rb')
   str = file.read()
   ```

   Now `str` will be two bytes long, and each byte will be what was stored in `fileName`. Am I right?
3. If I am right in the above point, then the `str` object is not in any particular encoding format (like UTF, etc.). So what does it mean that Python strings are always unicode? Also, what will happen if I call `str.encode()`? Will it make no sense?
4. As the `str` object read from the file is actually an array of bytes, is there any way to convert it to the `bytearray` type?

Comment 1

You really need to read the Python Unicode HOWTO, and this article for good measure.

Comment 2

You really don't want to name your strings `str`. Especially since you're asking about `str` and `bytes` objects, making it even more confusing than just shadowing the built-in.

Answer 1 (accepted)

You are confused. "Encodings" pertain to byte strings, not to unicode strings. Meaningful statements: "This byte string is utf-8 encoded." and "This byte string is 2 bytes long." Meaningless statements: "This unicode string is utf-8 encoded." and "This unicode string is 2 bytes long."
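To make that distinction concrete, here is a minimal sketch (the variable names are illustrative and not from the question). It shows that a unicode string has a length in code points, and only gets a length in bytes once you pick an encoding:

```python
# One unicode string, three different byte strings depending on the codec.
text = "\u00e9"                     # a single code point (e with acute accent)

utf8 = text.encode("utf-8")         # b'\xc3\xa9'
latin1 = text.encode("latin-1")     # b'\xe9'
utf16 = text.encode("utf-16-le")    # b'\xe9\x00'

print(len(text))                            # 1 (code points)
print(len(utf8), len(latin1), len(utf16))   # 2 1 2 (bytes)
```

Asking "how many bytes long is `text`?" has no answer until one of those `encode()` calls has been made.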

1. `str = "a"` means "create a unicode string 'a' and a reference to it named `str`". Unicode strings are of course stored in some encoding, because they need to exist as bytes in memory, but that is not relevant. All your code treats the string as if it has no encoding at all--it has been abstracted away from bytes. A unicode string is a sequence of unicode code points (i.e., of integers that represent characters).
2. Yes and no. `str` here (the return value of `read()`) is a byte string, not a unicode string. `"a" != b"a"`.
3. Your byte string `str` possesses an unknown encoding and must be decoded to produce a unicode string. Byte strings don't have an `encode()` method because it is meaningless--they are either already an encoding of a unicode string, or they are not representing a unicode string at all (e.g., an image).
4. It's not an array of bytes, it's a byte string. A `bytearray` is a mutable list of bytes. You can produce a bytearray with `bytearray(byte_string)`, but bytearrays are intended for fairly specialized uses (e.g., to avoid copying for send-recv buffers), not casual use. Normally you just want a byte string.
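The points about equality and `encode()` can be checked directly in Python 3. This is only a sketch: `data` is a hypothetical stand-in for what `read()` returns in `'rb'` mode, and ASCII is an assumed encoding chosen for the demo.

```python
data = b"a"   # stand-in for the bytes returned by f.read() in 'rb' mode

print("a" == data)                # False: str and bytes never compare equal
print(hasattr(data, "encode"))    # False: bytes objects have no encode()
print(hasattr(data, "decode"))    # True: bytes can only be decoded

# Decoding needs an encoding you know or assume; ASCII is assumed here.
text = data.decode("ascii")
print(text == "a")                # True
```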

Answer 2

When you read a file in binary mode, the value returned from the `read()` method is a `bytes` object, not a `str` object. The documentation covers this in depth.

```
>>> with open('foo', mode='rb') as f: s = f.read()
...
>>> s
b'abc\n'
>>> len(s)
4
>>> type(s)
<class 'bytes'>
```

Answer 3

Python strings store Unicode codepoints.

Codepoints are not the same thing as bytes. Bytes are a computer representation of numbers (most commonly between 0 and 255), and those numbers can be translated to codepoints through the process of decoding, and in the other direction with encoding. Python 3 strings contain codepoints, one for each character in the text.
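As a rough illustration of the difference (the sample string is just an assumption picked for the demo), `ord()` exposes the codepoints of a string, while iterating over the encoded result exposes the byte values:

```python
s = "a\u20ac"                           # 'a' followed by the euro sign

code_points = [ord(ch) for ch in s]     # [97, 8364]: one number per character
encoded = s.encode("utf-8")             # codepoints -> bytes
byte_values = list(encoded)             # [97, 226, 130, 172]: one number per byte
decoded = encoded.decode("utf-8")       # bytes -> codepoints again

print(len(s), len(encoded))             # 2 4
print(decoded == s)                     # True
```

The two-character string needs four bytes in UTF-8 because the euro sign takes three of them.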

Python source code defines string literals as a series of bytes, which the interpreter decodes to Unicode using the UTF-8 codec by default; you can declare a different codec at the top of the file. On disk, the letter `a` in UTF-8 encoding is indeed just one byte; that is the nature of the UTF-8 standard.

If you read a file in text mode, Python applies the decoding process for you automatically, but when you open it in binary mode, no decoding is done and you get a `bytes` object instead. The contents of that object should reflect the contents of the file exactly. Note that it is not of type `str`, it is not unicode, it is not even a Python string. To turn bytes into a string you'd need to explicitly decode with the `.decode()` method.
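Here is a small sketch of the two modes side by side. The temporary file and its two-byte contents are made up for the demo, and UTF-8 is assumed as the file's encoding:

```python
import os
import tempfile

# Write two bytes to a throwaway file created just for this demo.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\xc3\xa9")          # the UTF-8 encoding of U+00E9
    path = tmp.name

try:
    with open(path, mode="rb") as f:
        raw = f.read()              # bytes, exactly as stored on disk
    with open(path, mode="r", encoding="utf-8") as f:
        text = f.read()             # str, decoded by Python while reading
finally:
    os.remove(path)

print(type(raw), len(raw))          # <class 'bytes'> 2
print(type(text), len(text))        # <class 'str'> 1
print(raw.decode("utf-8") == text)  # True
```

The same two bytes on disk come back as a 2-byte `bytes` object in binary mode and as a 1-character `str` in text mode.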

A `bytearray` is trivially created from a `bytes` value; just call `bytearray()` on it.
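A short sketch of that conversion, with made-up sample data, also shows that the `bytearray` is a mutable copy and the original `bytes` value stays untouched:

```python
data = b"abc"

buf = bytearray(data)      # copies the bytes into a mutable buffer
buf[0] = 0x41              # in-place change: 'a' becomes 'A'
buf.extend(b"!")           # a bytearray can also grow

print(buf)                 # bytearray(b'Abc!')
print(data)                # b'abc'
print(bytes(buf))          # b'Abc!'
```

Calling `bytes()` on the result gives you back an immutable value when you are done modifying it.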

Accepted answer (the one beginning "You are confused.") by Francis Avila, 2013-04-03 15:27:25Z

