Unicode in Python: Working With Character Encodings (Overview)
Python’s Unicode support is strong and robust, but it takes some time to master. There are many ways of encoding text into binary data, and in this course you’ll learn a bit of the history of encodings. You’ll also spend time learning the intricacies of Unicode, UTF-8, and how to use them when programming in Python. You’ll practice with multiple examples and see how smooth working with text and binary data in Python can be!
By the end of this course, you’ll know:
- What an encoding is
- What ASCII is
- How binary displays as octal and hex values
- How UTF-8 encodes a code point
- How to combine code points into a single glyph
- Which built-in functions can help you
00:00 Welcome to Unicode and Character Encodings in Python. My name is Chris and I will be your guide. This course talks about what an encoding is and how it works, where ASCII came from and how it evolved, how binary bits can be described in oct and hex and how to use those to map to code points, the Unicode standard and the UTF-8 encoding thereof, how UTF-8 uses the underlying bits to encode a code point, how multiple code points can result in a single character or glyph, functions built into Python that can help you when you’re messing around with characters in Unicode, and other encodings. First off, the handling of strings and character encodings is one of the big changes between Python 2 and Python 3. In fact, it’s one of the better reasons to move from Python 2 to Python 3.
00:46 All the examples in this course will be Python 3 based. If you’re using a Python 2 interpreter, you’re not going to be able to follow along. It’s really easy to forget when you’re programming in a nice high-level language like Python that computers really only understand numbers.
01:00 When you’re dealing with text, you’re actually dealing with a mapping between a number and a character that is being displayed. The fundamental item that is being stored in memory is still a number. ASCII was one of the preeminent standards for this kind of mapping.
01:16 It specified that certain numbers represented certain letters, and so when the computer used those numbers in the context of a string it would produce the right letters.
01:26 The problem with ASCII was it really only encoded the Latin alphabet. It didn’t even include accented characters. It was invented by and for English speakers; it wasn’t until later that accents for other Western languages were added. By contrast, Unicode is an international standard and has enough space to encode all written languages. In fact, it has space to encode other things as well, like emojis. At one point in time, there was even a move to add Klingon to it, but it was turned down. But there’s still space left over if the standards body changes its mind. First off, a little history.
02:00 I think I mentioned that computers only understand numbers? Well, computers only understand numbers. In fact, it’s even worse than that: they really only understand binary. Everything is a 1 or a 0.
02:10 This goes down to how transistors work: they’re either on or off. So, inside of the computer, everything is represented as either True or False, on or off, or 1 or 0. Everything on top of that is an abstraction.
02:25 A byte is a grouping of bits. In the early history of computers, the size of a byte differed from machine to machine. By the time PCs came around, there were 8 bits to a byte, and that’s pretty common now.
02:37 Now, most processors deal with more than one byte at a time, but instead of redefining how big a byte is, they have other terms like word for groupings of bytes.
02:46 An 8-bit byte can hold 2^8 combinations, which is 256. The counting starts at 0, so the number range, instead of being from 1 to 256, is from 0 to 255.
03:00 Back in the olden times (and I’m talking about times so old that even an old man like me thinks they’re the past), IBM introduced BCD, or Binary Coded Decimal. This was an early encoding. It was very, very simple and very small. It used 6 bits to represent a character.
03:15 This wasn’t enough to even fully cover the English language, so IBM extended BCD with EBCDIC, the Extended Binary Coded Decimal Interchange Code.
03:25 This used a full 8 bits to describe a character and was so advanced it actually included lowercase letters. Around the same time that EBCDIC was being standardized, ASCII was introduced.
03:36 ASCII was put together by a standards body rather than by a single company and became more popular across different platforms. ASCII only required 7 bits, but at the time most computers were using an 8-bit byte, so the lead bit was just left as 0. Sometimes, with some transmission protocols, like those used over modems or terminals, that 8th bit would be used as a parity bit to make sure that the byte had been transmitted correctly.
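The bit counts mentioned so far can be checked with a little arithmetic. The sketch below is illustrative only: the helper name `with_even_parity` is hypothetical, and even parity is an assumption, since the transcript does not say which parity scheme any particular protocol used.

```python
# n bits can hold 2**n distinct values, numbered from 0 to 2**n - 1.
for name, bits in [("BCD", 6), ("ASCII", 7), ("8-bit byte", 8)]:
    print(f"{name}: {bits} bits -> {2 ** bits} values (0 to {2 ** bits - 1})")


def with_even_parity(code: int) -> int:
    """Hypothetical helper: set the 8th bit of a 7-bit code when needed
    so that the total number of 1 bits is even (even parity is assumed)."""
    ones = bin(code).count("1")
    return code | 0x80 if ones % 2 else code


# 'A' is 65: seven bits are enough, so the lead bit of the byte stays 0.
print(format(ord("A"), "08b"))                    # 01000001

# 'C' is 67, which has three 1 bits, so the parity bit is switched on.
print(format(with_even_parity(ord("C")), "08b"))  # 11000011
```

A receiver using the same scheme would count the 1 bits in each byte and treat an odd count as a transmission error.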
04:01 ASCII was adopted as an international standard in 1967, and quickly there were several iterations and extensions made on top of it. The extended ASCII format moved to a full 8 bits of description and added accented characters, allowing Western languages other than English to be described.
04:19 PCs used ASCII, so when they became the de facto standard, ASCII became the way of communicating between computers. For clarity’s sake, let’s establish some common terminology. First off, what’s a character?
04:31 This probably feels clear to you (it’s that one little single unit of text), but this term can actually get a little confusing depending on who you’re talking to. So for the purposes of this course, the word character is going to mean a minimal unit of text that has a semantic value. So, that includes things like emojis, or symbols in Han Chinese, as well as obvious stuff like the letter A.
04:52 A character set is just a collection of these characters, and these sets can be used across multiple languages. Think about the Latin character set that most European languages can use, the Greek character set that pretty much only the Greek language can use, and the Russian character set, which is used across certain Slavic languages.
05:10 A code point is a number that represents a single character in one of these sets of encoded characters. For example, in the ASCII standard, the capital letter 'A' is the decimal number 65.
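As a quick check of that mapping, Python’s built-in `ord()` and `chr()` convert between a character and its code point. This is just a minimal illustration, not code from the course files:

```python
code_point = ord("A")       # character -> code point
print(code_point)           # 65
print(chr(code_point))      # code point -> character: A

# The same number written in hex and octal notation.
print(hex(code_point), oct(code_point))  # 0x41 0o101
```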
05:25 A code unit, by contrast to a code point, is a sequence of bits that represents that code point. In ASCII, the code point 65 means 'A', and it’s stored in the computer using that number.
05:37 In other encoding standards that mapping may not apply. As I mentioned before, in the original ASCII standard, a code unit was 7 bits long, so that covered the numbers from 0 to 127. Unicode supports different kinds of encodings, and some of those even use a varying number of code units per code point.
05:55 UTF-8, one of those encodings, is an 8-bit encoding, but a single code point can map to 1, 2, 3, or 4 code units, so multiple bytes may be describing a single code point.
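To see the 1-to-4 code unit range in practice, you can encode a few characters with `str.encode("utf-8")` and count the resulting bytes. The helper name `utf8_units` and the sample characters are chosen here purely for illustration:

```python
def utf8_units(ch: str) -> list[int]:
    """Return the UTF-8 code units (bytes) for a single character."""
    return list(ch.encode("utf-8"))


for ch in ["A", "\u00e1", "\u0434", "\u20ac", "\U0001f600"]:
    units = utf8_units(ch)
    as_hex = " ".join(f"{b:02x}" for b in units)
    print(f"U+{ord(ch):04X} {ch!r}: {len(units)} code unit(s) -> {as_hex}")

# U+0041 'A': 1 code unit(s) -> 41
# U+00E1 'á': 2 code unit(s) -> c3 a1
# U+0434 'д': 2 code unit(s) -> d0 b4
# U+20AC '€': 3 code unit(s) -> e2 82 ac
# U+1F600 '😀': 4 code unit(s) -> f0 9f 98 80
```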
06:08 That’s enough background. Let’s look at some code. In order to inspect some strings, I’ve written a quick little method inside of a file called show.py. The core part of this method is line 5, which uses the built-in ord() function, returning the code point of the character that is passed in.
06:29 I’m going to import that function into the REPL and start with a simple string in English saying 'Hello there'. Calling code_points() on that prints out the code point for each one of the values in the str (string).
06:42 If you look at this, capital 'H' is 72 in ASCII, so it maps down below. Six characters in, you’ll see 32, which is a space (" ") in ASCII. Notice that every one of these numbers is below 128, which means they’re in the range of the original ASCII 7-bit standard.
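The contents of show.py are not reproduced in this transcript, so the following is only a guess at what such a function might look like. The `code_points()` name and the use of `ord()` come from the narration; the `code_point_values()` helper and the output layout are assumptions.

```python
def code_point_values(text: str) -> list[int]:
    """Hypothetical helper: the code point of each character in text."""
    return [ord(ch) for ch in text]


def code_points(text: str) -> None:
    """Print the code points of text, separated by spaces (assumed layout)."""
    print(" ".join(str(n) for n in code_point_values(text)))


code_points("Hello there")
# 72 101 108 108 111 32 116 104 101 114 101

# Every value is below 128, so the string fits in 7-bit ASCII.
print(max(code_point_values("Hello there")) < 128)  # True
```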
07:02 Let’s look at something a little more challenging.
07:05 Here’s some Russian that says “da svidaniya”, or at least, that’s what the web page I copied it from said it did. I hope it says that.
07:16 Running code_points() on it, you get a significantly larger set of numbers. Now, the third character in is 32, a space, just like in 'Hello there'. And if you look near the end here, there’s a character that’s 225, which is below 256, in the extended ASCII range.
07:34 That is the accented 'á'. Everything else here is from the Cyrillic alphabet, which has much higher code point numbers, above the ASCII range. All of these, as you’ll notice, are sort of around a thousand.
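The exact string used on screen is not part of this transcript. The sketch below assumes it was the Cyrillic phrase with a Latin 'á' (code point 225) in the middle of the second word, which matches the description above; treat the string itself as a reconstruction.

```python
# Assumed reconstruction of the on-screen string; "\u00e1" is the Latin 'á'.
text = "до свид\u00e1ния"

values = [ord(ch) for ch in text]
print(values)
# [1076, 1086, 32, 1089, 1074, 1080, 1076, 225, 1085, 1080, 1103]

# Sort each character into the ranges discussed in the narration.
for ch, n in zip(text, values):
    if n < 128:
        label = "7-bit ASCII"
    elif n < 256:
        label = "8-bit extended range"
    else:
        label = "beyond 8 bits"
    print(f"{ch!r}: {n} ({label})")
```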
07:48 That’s it for the introduction. Next up, I’ll dive deeper into Python strings and their relationship to ASCII.