Unicode in Python: Working With Character Encodings (Summary)
Congratulations on learning more about character encodings! In this lesson, you’ll cover a few caveats to remember when you’re working with encodings and see some resources you can check out to keep learning.
In this course, you learned about:
- Fundamental concepts of character encodings and numbering systems
- Integer, binary, octal, hex, str, and bytes literals in Python
- Differences between Unicode code points and UTF-8 encoding
- Python’s built-in functions related to character encoding and numbering systems
- Other encoding formats included in Python’s Standard Library
It’s very important to know the encoding of any data you read. Using the wrong encoding may raise an exception or, worse, the read may succeed but give you the wrong content.
Wikipedia has some useful pages:
00:00 Well, you’ve made it through eight lessons on Unicode. You’ll recall that I started off with the basics of encoding, talked about the Python string module and the constants that are available to manipulate ASCII, took a detour down Computer Science Lane and talked about bits and bytes and how they can be represented in oct and hex.
00:19 And no Unicode course would be complete without a section on Unicode. Lesson 5 talked about how UTF-8 actually is represented in binary. Lesson 6 looked at digraphs and ligatures and other kinds of combined characters.
00:33 Lesson 7 gave a tour of built-in Python functions that are helpful when dealing with Unicode or byte conversion. And the last lesson was on encodings besides UTF-8.
00:45 In this lesson, I’m going to talk about a couple of remaining corner cases and point you at some references and possible future reading material. It’s important to remember that all input is bytes until it’s decoded.
00:57 If you assume the data’s encoding, you may run into trouble. Let’s say you were accessing a recipe site API, and you got the following chunk of data.
01:09 If you make an assumption about the decoding…
01:15 you could be in trouble. Hex bc by itself is not valid UTF-8.
01:23 Change the encoding to Latin-1, and all of a sudden the data makes an awful lot more sense.
01:31 The symbol for one quarter in UTF-8 isn’t bc, but c2 bc. There are worse cases than getting an exception. At least when you get an exception, you know something went wrong.
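The recipe data from the video isn’t captured in this transcript, so here is a minimal sketch of the same situation. The payload below is a hypothetical stand-in, not the lesson’s actual data; it only assumes a byte string that contains hex bc.

```python
# Hypothetical payload standing in for the recipe API response,
# which is not shown in the transcript.
raw = b"\xbc cup of flour"

try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    # 0xbc is a UTF-8 continuation byte, so it cannot start a character.
    utf8_failed = True

# Latin-1 maps every byte to a code point, so this decode cannot fail.
as_latin1 = raw.decode("latin-1")

print(utf8_failed)          # True
print(as_latin1)

# In UTF-8, the one-quarter symbol takes two bytes.
print("¼".encode("utf-8"))  # b'\xc2\xbc'
```

Because Latin-1 accepts any byte sequence, a successful Latin-1 decode doesn’t prove the data really was Latin-1. It only makes sense here because the result is readable.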
01:43 Consider the following piece of Norse. Encoding it…
01:51 and then decoding it in UTF-16 by accident results in a different character. No error, no exception. Your data is now dirty and wherever you put it, it’ll be wrong.
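The Norse text used on screen isn’t part of this transcript, so the sketch below substitutes the letter eth (ð) as an assumed example. It also uses the "utf-16-le" codec rather than plain "utf-16", an assumption made so the result doesn’t depend on the machine’s byte order.

```python
# Assumed stand-in for the on-screen text: LATIN SMALL LETTER ETH (U+00F0).
original = "ð"

raw = original.encode("utf-8")
print(raw)  # b'\xc3\xb0'

# Decoding with the wrong codec: the two UTF-8 bytes are read as a single
# little-endian 16-bit code unit, 0xB0C3, which is a valid character.
wrong = raw.decode("utf-16-le")

print(ascii(wrong))       # '\ub0c3'
print(wrong == original)  # False
```

No exception is raised at any point, which is what makes this kind of mistake hard to notice.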
02:05 A Python-specific problem is the open() command. open() takes an encoding argument, but it has a default, and the default is platform-specific. If you’re opening a text file, i.e. not specifying a binary mode, and you don’t explicitly name the encoding, you will get the operating system’s encoding.
02:26 On a Mac, that’s UTF-8. On older versions of Windows, it was cp1252. On more recent ones, it might be UTF-16. You can see what the default encoding is by calling the getpreferredencoding() function of the locale module.
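As a rough sketch of the point above, you can check the platform default and sidestep it by always naming the encoding. The file name and text here are made up for illustration.

```python
import locale
import os
import tempfile

# The encoding open() falls back to in text mode when none is given.
# The value printed depends on your platform and settings.
print(locale.getpreferredencoding(False))

text = "¼ cup of flour"

# Hypothetical file path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "recipe.txt")

# Naming the encoding on both write and read removes the platform dependency.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()

print(round_tripped == text)  # True
```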
02:42 Python ships with a module that represents the Unicode database. It’s called unicodedata. You can use this to do lookups on your characters or on your code points.
02:53 Let’s look at it in action.
02:58 The name() function takes a str of a single character and returns the Unicode name for that character.
03:10 The lookup() function does the opposite. Given the name 'EURO SIGN', it returns the corresponding character.
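The interactive session from the video isn’t reproduced in the transcript, but the two calls described above look roughly like this. The variable names are only for illustration.

```python
import unicodedata

# Character to official Unicode name.
euro_name = unicodedata.name("€")
print(euro_name)  # EURO SIGN

# Official Unicode name back to the character.
euro_char = unicodedata.lookup("EURO SIGN")
print(euro_char == "€")  # True

# Chaining the two calls returns the character you started with.
round_trip = unicodedata.lookup(unicodedata.name("¼"))
print(round_trip == "¼")  # True
```

Note that name() raises ValueError for characters with no assigned name unless you pass a default as the second argument.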
03:20 By using name() and lookup() together you can go back and forth. Wikipedia has a ton of content on Unicode. There’s the Unicode article itself, and then there are breakdowns on Unicode character lists, the different sections of Unicode and how they’re blocked together, how to do the combinations, and then, of course, specifics to the encodings like UTF-8. In addition to Wikipedia, unicode.org itself has a rich amount of material and examples that you can pull from. If you’re looking for other encodings: back to Wikipedia.
03:53 There’s plenty there on ASCII, extended ASCII, Latin-1, and Windows-1252. If my babbling about digraphs and ligatures was interesting to you, Wikipedia has got even more information there as well.
04:07 Joel on Software is a great source for programmers and his blog entry on the minimum you need to know for Unicode is quite in-depth. Additionally, David Zentgraf’s article and the Mozilla article on detecting encodings also cover lots of useful information. Specific to Python, you can look at the What’s New in Python 3.0 article that talks about how text and bytes have changed, and the default Unicode mechanisms in Python 3.
04:33 Understanding Unicode is so necessary that Python has a full how-to on it, and deep within the documentation, you can find a full listing of the supported encodings. Given the topic, it seems only appropriate to say merci, grazie, gracias.
04:49 Thanks for your attention. I hope it’s been informative.
Alain Rouleau on July 2, 2020
Very interesting, thanks!
Pradeep Kumar on July 6, 2020
Awesome Course!!!
Ranjit Shrivastva on Aug. 21, 2020
Interesting topic… Thanks for unicode detail.
sacsachin on Oct. 10, 2020
Great tutorial.
DoubleA on Jan. 24, 2021
Thank you for sharing your deep knowledge of the topic. For me as a beginner it’s hard to grasp 100% of the stuff just now, but the big picture has now become so much clearer!

Christopher Trudeau (RP Team) on Jan. 24, 2021
Glad you enjoyed it @DoubleA. Feel free to post questions if you need clarity on something.

