
Unicode in Python: Working With Character Encodings (Overview)

Christopher Trudeau


Python’s Unicode support is strong and robust, but it takes some time to master. There are many ways of encoding text into binary data, and in this course you’ll learn a bit of the history of encodings. You’ll also spend time learning the intricacies of Unicode, UTF-8, and how to use them when programming Python. You’ll practice with multiple examples and see how smooth working with text and binary data in Python can be!

By the end of this course, you’ll know:

What an encoding is
What ASCII is
How binary displays as octal and hex values
How UTF-8 encodes a code point
How to combine code points into a single glyph
Which built-in functions can help you


00:00 Welcome to Unicode and Character Encodings in Python. My name is Chris and I will be your guide. This course talks about what an encoding is and how it works, where ASCII came from and how it evolved, how binary bits can be described in oct and hex and how to use those to map to code points, the Unicode standard and the UTF-8 encoding thereof, how UTF-8 uses the underlying bits to encode a code point, how multiple code points can result in a single character or glyph, functions built into Python that can help you when you’re messing around with characters in Unicode, and other encodings. First off, how strings and character encodings are handled is one of the big changes between Python 2 and Python 3. In fact, it’s one of the better reasons to move from Python 2 to Python 3.

00:46 All the examples in this course will be Python 3 based. If you’re using a Python 2 interpreter, you’re not going to be able to follow along. It’s really easy to forget when you’re programming in a nice high-level language like Python that computers really only understand numbers.

01:00 When you’re dealing with text, you’re actually dealing with a mapping between a number and a character that is being displayed. The fundamental item that is being stored in memory is still a number. ASCII was one of the preeminent standards for this kind of mapping.

01:16 It specified that certain numbers represented certain letters, and so when the computer used those numbers in the context of a string it would produce the right letters.

01:26 The problem with ASCII was it really only encoded the Latin alphabet. It didn’t even include accented characters. It was invented by and for English speakers; it wasn’t until later that accents for other Western languages were added. By contrast, Unicode is an international standard and has enough space to encode all written languages. In fact, it has space to encode other things as well, like emojis. At one point in time, there was even a move to add Klingon to it, but it was turned down. But there’s still space left over if the standards body changes its mind. First off, a little history.

02:00 I think I mentioned that computers only understand numbers? Well, computers only understand numbers. In fact, it’s even worse than that: they really only understand binary. Everything is a 1 or a 0.

02:10 This goes down to how transistors work: they’re either on or off. So, inside of the computer, everything is represented as either True or False, on or off, or 1 or 0. Everything on top of that is an abstraction.

02:25 A byte is a grouping of bits. In the early history of computers, the size of a byte differed from machine to machine. By the time PCs came around, there were 8 bits to a byte, and that’s pretty common now.

02:37 Now, most processors deal with more than one byte at a time, but instead of redefining how big a byte is, they have other terms like word for groupings of bytes.

02:46 An 8-bit byte can hold 2^8 combinations, which is 256. The counting starts at 0, so the number range, instead of being from 1 to 256, is from 0 to 255.

03:00 Back in the olden times (and I’m talking about times so old that even an old man like me thinks they’re the past), IBM introduced BCD, or Binary Coded Decimal. This was an early encoding. It was very, very simple and very small. It used 6 bits to represent a character.

03:15 This wasn’t enough to even fully cover the English language, so IBM extended BCD with EBCDIC, the Extended Binary Coded Decimal Interchange Code.

03:25 This used a full 8 bits to describe a character and was so advanced it actually included lowercase letters. Around the same time that EBCDIC was being standardized, ASCII was introduced.

03:36 ASCII was put together by a standards body rather than by a single company and became more popular across different platforms. ASCII only required 7 bits, but at the time most computers were using an 8-bit byte, so the lead bit was just left as 0. Sometimes, in transmission protocols such as those used over modems or terminals, that 8th bit would be used as a parity bit to check that the byte had been transmitted correctly.
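As a rough illustration of the parity idea described above, here is a minimal sketch. It assumes even parity (the sender sets the 8th bit so that the total count of 1 bits is even); real links could be configured differently, and the function name with_even_parity is made up for this example.

```python
def with_even_parity(code):
    """Return an 8-bit value: a 7-bit ASCII code with an even-parity bit on top.

    Hypothetical helper for illustration; even parity is an assumption.
    """
    if not 0 <= code < 128:
        raise ValueError("ASCII codes fit in 7 bits (0-127)")
    ones = bin(code).count("1")   # how many 1 bits the 7-bit code contains
    parity = ones % 2             # 1 only when that count is odd
    return (parity << 7) | code   # place the parity bit in the 8th (lead) position


# 'A' is 65 (0b1000001): two 1 bits, so the lead bit stays 0.
print(f"{with_even_parity(ord('A')):08b}")   # 01000001
# 'C' is 67 (0b1000011): three 1 bits, so the lead bit becomes 1.
print(f"{with_even_parity(ord('C')):08b}")   # 11000011
```

A receiver using the same convention would count the 1 bits in each received byte and treat an odd count as a transmission error.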

04:01 ASCII was adopted as an international standard in 1967, and quickly there were several iterations and extensions made on top of it. The extended ASCII format moved to a full 8 bits of description and added accented characters, allowing Western languages other than English to be described.
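To see the difference between 7-bit ASCII and an 8-bit extension in Python, the sketch below uses the latin-1 codec as a stand-in for "extended ASCII". That choice is an assumption: there were many 8-bit extensions, and the narration doesn't name a specific one.

```python
text = "caf\u00e9"   # "café"; the last character is é (code point 233)

try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False             # é has no code in the 7-bit ASCII range

latin1_bytes = text.encode("latin-1")   # one byte per character, values 0-255

print(ascii_ok)              # False
print(list(latin1_bytes))    # [99, 97, 102, 233]
```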

04:19 PCs used ASCII, so when they became the de facto standard, ASCII became the way of communicating between computers. For clarity’s sake, let’s establish some common terminology. First off, what’s a character?

04:31 This probably feels clear to you: it’s that one little single unit of text. But this term can actually get a little confusing depending on who you’re talking to. So for the purposes of this course, the word character is going to mean a minimal unit of text that has a semantic value. So, that includes things like emojis, or symbols in Han Chinese, as well as obvious stuff like the letter A.

04:52 A character set is just a collection of these characters, and these sets can be used across multiple languages. Think about the Latin character set that most European languages can use, the Greek character set that pretty much only the Greek language can use, and the Russian character set, which is used across certain Slavic languages.

05:10 A code point is a number that represents a single character in one of these sets of encoded characters. For example, in the ASCII standard, the capital letter 'A' is the decimal number 65.

05:25 A code unit, by contrast to a code point, is a sequence of bits that represents that code point. In ASCII, the code point 65 means 'A', and it’s stored in the computer using that number.

05:37 In other encoding standards that mapping may not apply. As I mentioned before, in the original ASCII standard, a code unit was 7 bits long, so that covered the numbers from 0 to 127. Unicode supports different kinds of encodings, and some of those even have varying-length code units.
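The code point and code unit ideas above can be checked in a Python 3 REPL. This is a small sketch using only the built-ins ord(), chr(), and str.encode(); the variable names are just for illustration.

```python
code_point = ord("A")            # the number that stands for the character
print(code_point)                # 65
print(chr(65))                   # A

# In ASCII the stored value is the code point itself, in a single byte.
encoded = "A".encode("ascii")
print(list(encoded))             # [65]

# Every original ASCII code fits in 7 bits, that is, below 2**7 = 128.
print(code_point < 2**7)         # True
```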

05:55 UTF-8, one of those encodings, is an 8-bit encoding, but a code point can map to 1, 2, 3, or 4 code units, so multiple bytes may be describing a single code point.
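A quick way to see the 1-to-4 byte range is to encode a few characters and count the resulting bytes. The four sample characters below are chosen for this sketch and don't come from the course.

```python
# One sample character for each UTF-8 length: A, é, €, and an emoji (U+1F600).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

utf8_lengths = {}
for ch in samples:
    encoded = ch.encode("utf-8")          # bytes object holding the code units
    utf8_lengths[ch] = len(encoded)
    print(ord(ch), list(encoded))

# 65 [65]
# 233 [195, 169]
# 8364 [226, 130, 172]
# 128512 [240, 159, 152, 128]
```

Each of those strings has a length of 1 as a Python str, because str counts code points rather than bytes.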

06:08 That’s enough background. Let’s look at some code. In order to inspect some strings, I’ve written a quick little function inside of a file called show.py. The core part of this function is line 5, which uses the built-in ord() function, returning the code point of the character that is passed in.

06:29 I’m going to import that function into the REPL and start with a simple string in English saying 'Hello there'. Calling code_points() on that prints out the code point for each one of the characters in the str (string).

06:42 If you look at this, capital 'H' is 72 in ASCII, which is the first number printed. Six characters in, you’ll see 32, which is a space (" ") in ASCII. Notice that every one of these numbers is below 128, which means they’re in the range of the original ASCII 7-bit standard.
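The show.py file itself isn’t reproduced in this transcript, so the following is only a sketch of what a code_points() helper along these lines might look like. The body, the return value, and the line layout are assumptions (line numbers here won’t match the "line 5" mentioned in the narration); only the name code_points() and the use of the built-in ord() come from the video.

```python
def code_points(text):
    """Print and return the code point of every character in text.

    Assumed reconstruction for illustration, not the course's actual show.py.
    """
    points = [ord(ch) for ch in text]   # ord() maps one character to its code point
    print(points)
    return points


hello = code_points("Hello there")
# [72, 101, 108, 108, 111, 32, 116, 104, 101, 114, 101]
```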

07:02 Let’s look at something a little more challenging.

07:05 Here’s some Russian that says “da svidaniya”, or at least, that’s what the web page I copied it from said it did. I hope it says that.

07:16 Running code_points() on it, you get a significantly larger set of numbers. Now, the third character in is 32, a space, just like in 'Hello there'. And if you look near the end, there’s a character that’s 225, which is below 256 and so in the extended ASCII range.

07:34 That is the accented 'á'. Everything else here is from the Cyrillic alphabet, which has much higher code point numbers, above the ASCII range. All of these, as you’ll notice, are sort of around a thousand.
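The exact Russian string isn’t included in this transcript. The sketch below uses a guessed reconstruction (Cyrillic letters with a Latin 'á', written as an escape, in the middle of the second word) purely to show the three ranges the narration points out, so treat the string itself as an assumption.

```python
# Assumed stand-in for the phrase shown in the video; \u00e1 is the accented á.
text = "до свид\u00e1ния"

points = [ord(ch) for ch in text]
print(points)
# [1076, 1086, 32, 1089, 1074, 1080, 1076, 225, 1085, 1080, 1103]

ascii_range = [p for p in points if p < 128]            # the space
extended_range = [p for p in points if 128 <= p < 256]  # the accented á
cyrillic = [p for p in points if 0x0400 <= p <= 0x04FF] # Cyrillic block, 1024-1279

print(ascii_range, extended_range, len(cyrillic))
# [32] [225] 9
```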

07:48 That’s it for the introduction. Next up, I’ll dive deeper into Python strings and their relationship to ASCII.


Course Contents

Unicode in Python: Working With Character Encodings (Overview) 07:56

Working With ASCII and the Python String Module 05:49

Working in Binary: Bits, Bytes, Oct, and Hex 06:26

Using Unicode 04:15

Encoding UTF-8 06:19

Combining Characters 05:40

Using Built-In Functions 05:38

Using Other Encodings 04:45

Unicode in Python: Working With Character Encodings (Summary) 04:53
