0

My question is about Python 3.0 strings.

  1. My understanding is that for the line str = "a", the character 'a' is encoded (using UTF-8, for example) and stored in the str object. If the UTF-8 representation of 'a' is 1 byte, the string is 1 byte long. Am I right?

  2. If the above is true, what happens when we read a binary file using read()? Suppose I have a file containing two bytes of binary data and I read it into a string using read(), like this:

    file = open(fileName, mode='rb')
    str = file.read()

    Now str will be two bytes long and each byte will be what was stored in fileName. Am I right?

  3. If I am right on the above point, then the str object is not in any particular encoding format (like UTF, etc.). So what does it mean that Python strings are always Unicode? Also, what will happen if I call str.encode()? Will it make no sense?

  4. As the str object read from the file is actually an array of bytes, is there any way to convert it to the bytearray type?

asked Apr 3, 2013 at 15:09
Rohit
2
  • 1
    You really need to read the Python Unicode HOWTO, and this article for good measure. Commented Apr 3, 2013 at 15:12
  • 2
    You really don't want to name your strings str. Especially since you're asking about str and bytes objects, making it even more confusing than just shadowing the built-in. Commented Apr 3, 2013 at 15:20

3 Answers

2

You are confused. "Encodings" pertain to byte strings, not to unicode strings. Meaningful statements: "This byte string is utf-8 encoded.", "This byte string is 2 bytes long." Meaningless statements: "This unicode string is utf-8 encoded", "This unicode string is 2 bytes long".

  1. str = "a" means "create a unicode string 'a' and a reference to it named str". Unicode strings are of course stored in some encoding because they need to exist as bytes in memory, but that is not relevant. All your code treats the string as if it has no encoding at all--it has been abstracted away from bytes. A unicode string is a sequence of unicode code points (i.e. of integers that represent characters).
  2. Yes and no. str here (the return value of read()) is a byte string, not a unicode string. "a" != b"a".
  3. Your byte string str possesses an unknown encoding and must be decoded to produce a unicode string. Byte strings don't have an encode() method because it is meaningless--they are either already an encoding of a unicode string, or they are not representing a unicode string at all (e.g. an image).
  4. It's not an array of bytes, it's a byte string. A bytearray is a mutable list of bytes. You can produce a bytearray with bytearray(byte_string), but bytearrays are intended for fairly specialized uses (e.g., to avoid copying for send-recv buffers), not casual use. Normally you just want a byte string.
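
A minimal sketch of the distinction drawn in the points above, assuming Python 3. The variable names (text, data and so on) are chosen here for illustration and are not from the question:

```python
# Illustrative names only; the question used `str`, which shadows the built-in.
text = "a"    # str: a sequence of Unicode code points
data = b"a"   # bytes: a sequence of integers in range(256)

equal = (text == data)                      # str and bytes never compare equal
encoded = text.encode("utf-8")              # str -> bytes
decoded = data.decode("utf-8")              # bytes -> str (encoding assumed known)
bytes_has_encode = hasattr(data, "encode")  # bytes objects have no encode()
str_has_decode = hasattr(text, "decode")    # str objects have no decode()

print(equal, encoded, decoded, bytes_has_encode, str_has_decode)
```

The decode() call only gives a meaningful result here because the bytes are known to be UTF-8; for bytes read from an arbitrary file, the encoding has to come from somewhere else.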
answered Apr 3, 2013 at 15:27
Francis Avila

0

When you read a file in binary mode, the value returned from the read() method is a bytes object, not a str object. The documentation covers this in depth.

    >>> with open('foo', mode='rb') as f: s = f.read()
    ...
    >>> s
    b'abc\n'
    >>> len(s)
    4
    >>> type(s)
    <class 'bytes'>
answered Apr 3, 2013 at 15:21
Josh Lee


0

Python strings store Unicode codepoints.

Codepoints are not the same thing as bytes. Bytes are a computer representation of numbers (most commonly between 0 and 255), and those numbers can be translated to codepoints through the process of decoding, and in the other direction with encoding. Python 3 strings contain codepoints, one for each character in the text.
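
As a rough illustration of the difference between codepoints and bytes, the sketch below uses a three-character string picked for this example (it is not from the question) and compares its length in codepoints with its length once encoded as UTF-8:

```python
# Example string written with escapes so the source file encoding is irrelevant:
# "a", U+00E9 (e with acute accent), U+20AC (euro sign).
text = "a\u00e9\u20ac"

code_points = [ord(ch) for ch in text]   # one integer per character
utf8 = text.encode("utf-8")              # encoding: codepoints -> bytes
round_trip = utf8.decode("utf-8")        # decoding: bytes -> codepoints

print(len(text), code_points)
print(len(utf8), list(utf8))
```

The string has three codepoints, but its UTF-8 form takes 1 + 2 + 3 = 6 bytes, which is why "length in bytes" only makes sense after an encoding has been chosen.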

Python source code can define string literals using a series of bytes, which the interpreter decodes to unicode using the UTF-8 codec by default, but you can set other codecs at the top of the file. On disk, the letter a in UTF-8 encoding is indeed just one byte; that is the nature of the UTF-8 standard.

If you read a file in text mode, Python applies the decoding process for you automatically, but when you open it in binary mode, no decoding is done and you get a bytes object instead. The contents of that object should reflect the contents of the file exactly. Note that it is not of type str, it is not unicode, it is not even a Python string. To turn bytes into a string you'd need to explicitly decode with the .decode() method.
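
The two modes can be compared side by side. This is only a sketch: the temporary file and its two-byte contents are assumptions made for the example, standing in for the two-byte file described in the question.

```python
import os
import tempfile

# Create a throwaway file holding two bytes (the UTF-8 encoding of U+00E9).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"\xc3\xa9")

with open(path, mode="rb") as f:
    raw = f.read()    # binary mode: bytes, exactly as stored on disk

with open(path, mode="r", encoding="utf-8") as f:
    text = f.read()   # text mode: str, decoded by Python

os.remove(path)

print(type(raw), len(raw))
print(type(text), len(text))
```

Passing encoding="utf-8" explicitly avoids depending on the platform default; decoding raw by hand with raw.decode("utf-8") should give the same str as the text-mode read.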

A bytearray is trivially created from a bytes value; just call bytearray() on it.
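
A short sketch of that conversion, with made-up byte values, showing that the bytearray is a mutable copy and that the original bytes object is left unchanged:

```python
data = b"\x00\x01"       # immutable bytes, e.g. as returned by a binary read()

buf = bytearray(data)    # mutable copy of the same byte values
buf[0] = 0xFF            # item assignment works on bytearray, not on bytes
buf.append(0x02)         # a bytearray can also grow in place

frozen = bytes(buf)      # convert back to immutable bytes when done

print(data, buf, frozen)
```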

answered Apr 3, 2013 at 15:22
Martijn Pieters

