My question is about Python 3.0 strings.
1. My understanding is that for the line str = "a", the character 'a' is encoded (using UTF-8, for example) and stored in the str object. If the UTF-8 representation of 'a' is 1 byte, the string is 1 byte long. Am I right?
2. If the above is true, what happens when we read a binary file using read()? Suppose I have a file containing two bytes of binary data and I read it into a string using read(), like this:
    file = open(fileName, mode='rb')
    str = file.read()
   Now str will be two bytes long and each byte will be what was stored in the file fileName. Am I right?
3. If I am right on the above point, then the str object is not in any particular encoding format (like UTF, etc.). So what does it mean that Python strings are always Unicode? Also, what will happen if I call str.encode()? Will it make no sense?
4. As the str object read from the file is actually an array of bytes, is there any way to convert it to the bytearray type?
- You really need to read the Python Unicode HOWTO, and this article for good measure. – Martijn Pieters, Apr 3, 2013 at 15:12
- You really don't want to name your strings str. Especially since you're asking about str and bytes objects, making it even more confusing than just shadowing the built-in. – Wooble, Apr 3, 2013 at 15:20
3 Answers
You are confused. "Encodings" pertain to byte strings, not to unicode strings. Meaningful statements: "This byte string is utf-8 encoded.", "This byte string is 2 bytes long." Meaningless statements: "This unicode string is utf-8 encoded", "This unicode string is 2 bytes long".
1. str = "a" means "create a unicode string 'a' and a reference to it named str". Unicode strings are of course stored in some encoding, because they need to exist as bytes in memory, but that is not relevant. All your code treats the string as if it has no encoding at all; it has been abstracted away from bytes. A unicode string is a sequence of unicode code points (i.e. of integers that represent characters).
2. Yes and no. str here (the return value of read()) is a byte string, not a unicode string. "a" != b"a".
3. Your byte string str possesses an unknown encoding and must be decoded to produce a unicode string. Byte strings don't have an encode() method because it is meaningless: they are either already an encoding of a unicode string, or they are not representing a unicode string at all (e.g. an image).
4. It's not an array of bytes, it's a byte string. A bytearray is a mutable list of bytes. You can produce a bytearray with bytearray(byte_string), but bytearrays are intended for fairly specialized uses (e.g., to avoid copying for send-recv buffers), not casual use. Normally you just want a byte string.
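To make the distinction between the two types concrete, here is a minimal sketch, assuming Python 3. The variable names (text, raw, two_bytes and so on) are made up for illustration and are not from the question.

```python
text = "a"   # str: a sequence of Unicode code points
raw = b"a"   # bytes: a sequence of integers in range(256)

# The two types never compare equal, even for "the same" letter.
types_differ = text != raw

# encode() goes from str to bytes; decode() goes from bytes to str.
encoded = text.encode("utf-8")
decoded = raw.decode("utf-8")

# bytes has no encode() method, and str has no decode() method.
bytes_has_encode = hasattr(raw, "encode")
str_has_decode = hasattr(text, "decode")

# The same two bytes decode to different text under different codecs,
# which is why the encoding of a byte string has to be known or assumed.
two_bytes = b"\xc3\xa9"
as_utf8 = two_bytes.decode("utf-8")      # one code point: U+00E9
as_latin1 = two_bytes.decode("latin-1")  # two code points: U+00C3, U+00A9

print(types_differ, encoded, decoded, bytes_has_encode, str_has_decode)
print(len(as_utf8), len(as_latin1))
```

The last part is the practical meaning of "unknown encoding": the bytes alone do not say which decoding is the right one.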
Python strings store Unicode codepoints.
Codepoints are not the same thing as bytes. Bytes are a computer representation of numbers (most commonly between 0 and 255), and those numbers can be translated to codepoints through the process of decoding, and in the other direction with encoding. Python 3 strings contain codepoints, one for each character in the text.
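As a rough illustration of the difference between codepoints and bytes, the following sketch (assuming Python 3; the sample string and variable names are mine) counts both for the same text:

```python
word = "a\u00e9"   # 'a' followed by U+00E9 (e with acute accent)

# ord() gives the codepoint of each character; chr() is its inverse.
code_points = [ord(ch) for ch in word]

# Encoding turns codepoints into bytes; the byte count depends on the codec.
utf8_bytes = word.encode("utf-8")
utf16_bytes = word.encode("utf-16-le")   # little endian, no byte order mark

print(code_points)        # the integers behind the two characters
print(len(word))          # number of codepoints
print(len(utf8_bytes))    # number of bytes in UTF-8
print(len(utf16_bytes))   # number of bytes in UTF-16

# Decoding with the matching codec recovers the original string.
round_trip = utf8_bytes.decode("utf-8")
```

The string has a length in codepoints only; a length in bytes exists only once a codec has been chosen.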
Python source code can define string literals using a series of bytes, which the interpreter decodes to Unicode using the UTF-8 codec by default, but you can set other codecs at the top of the file. On disk, the letter a in UTF-8 encoding is indeed just one byte; that is the nature of the UTF-8 standard.
If you read a file in text mode, Python applies the decoding process for you automatically, but when you open it in binary mode, no decoding is done and you get a bytes object instead. The contents of that object should reflect the contents of the file exactly. Note that it is not of type str, it is not unicode, it is not even a Python string. To turn bytes into a string you'd need to explicitly decode with the .decode() method.
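Here is a small sketch of the two modes, assuming Python 3. It writes a two-byte temporary file first so that it is self-contained; the file contents and variable names are only an example, not the asker's actual file.

```python
import os
import tempfile

# Write two known bytes to a temporary file (the UTF-8 encoding of U+00E9).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"\xc3\xa9")

# Binary mode: no decoding is done, the result is a bytes object.
with open(path, "rb") as f:
    data = f.read()

# Text mode: Python decodes the bytes for us with the given encoding.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

os.remove(path)

print(type(data).__name__, len(data))   # two bytes, exactly as stored
print(type(text).__name__, len(text))   # one character after decoding

# Decoding the bytes by hand gives the same string as text mode did.
manual = data.decode("utf-8")
```

Passing encoding= explicitly in text mode avoids depending on the platform default.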
A bytearray is trivially created from a bytes value, just call bytearray() on it.
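For example, something along these lines should work in Python 3 (the byte values are arbitrary and the names are just for illustration):

```python
data = b"\x00\xff"

buf = bytearray(data)   # copies the bytes into a mutable buffer
buf[0] = 0x41           # item assignment works on a bytearray
buf.append(0x42)        # and it can grow

try:
    data[0] = 0x41      # bytes is immutable, so this raises TypeError
    bytes_mutable = True
except TypeError:
    bytes_mutable = False

frozen = bytes(buf)     # convert back to an immutable bytes object
print(buf, frozen, bytes_mutable)
```

Note that bytearray(data) makes a copy, so the original bytes object is left unchanged.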