A binary-to-text encoding is an encoding of data in plain text. More precisely, it is an encoding of binary data in a sequence of printable characters. These encodings are necessary for transmission of data when the communication channel does not allow binary data (such as email or NNTP) or is not 8-bit clean. PGP documentation (RFC 9580) uses the term "ASCII armor" for binary-to-text encoding when referring to Base64.
The basic need for a binary-to-text encoding comes from a need to communicate arbitrary binary data over preexisting communications protocols that were designed to carry only English-language human-readable text. Those communication protocols may only be 7-bit safe (and within that avoid certain ASCII control codes), may require line breaks at certain maximum intervals, and may not maintain whitespace. Thus, only the 94 printable ASCII characters are "safe" to use to convey data.
The ASCII text-encoding standard uses 7 bits to encode characters. With this it is possible to encode 128 (i.e. 2⁷) unique values (0–127) to represent the alphabetic, numeric, and punctuation characters commonly used in English, plus a selection of control characters which do not represent printable characters. For example, the capital letter A is represented in 7 bits as 100 0001₂, 0x41 (101₈), the numeral 2 is 011 0010₂, 0x32 (62₈), the character } is 111 1101₂, 0x7D (175₈), and the control character RETURN is 000 1101₂, 0x0D (15₈).
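The values quoted above can be checked with a short Python sketch. The helper name `ascii_forms` is hypothetical and is introduced only for this illustration.

```python
def ascii_forms(ch: str) -> tuple[str, str, str]:
    """Return the 7-bit binary, hexadecimal, and octal forms of one ASCII character."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not a 7-bit ASCII character")
    return f"{code:07b}", f"0x{code:02X}", f"{code:o}"


# The four characters used as examples in the text.
for ch in ("A", "2", "}", "\r"):
    print(repr(ch), ascii_forms(ch))
```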
In contrast, most computers store data in memory organized in eight-bit bytes. Files that contain machine-executable code and non-textual data typically contain all 256 possible eight-bit byte values. Many computer programs came to rely on this distinction between seven-bit text and eight-bit binary data, and would not function properly if non-ASCII characters appeared in data that was expected to include only ASCII text. For example, if the value of the eighth bit is not preserved, the program might interpret a byte value above 127 as a flag telling it to perform some function.
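As a rough illustration of why this matters, the following sketch simulates a channel that discards the eighth bit of every byte. The function `strip_eighth_bit` and the sample byte values are assumptions made for this example; they do not describe any particular protocol.

```python
def strip_eighth_bit(data: bytes) -> bytes:
    """Simulate a channel that keeps only the low seven bits of each byte."""
    return bytes(b & 0x7F for b in data)


text = b"Hello, world"                          # every byte is below 128
binary = bytes([0x89, 0x50, 0x4E, 0x47, 0xFF])  # arbitrary sample bytes, some above 127

text_survives = strip_eighth_bit(text) == text        # ASCII text passes unchanged
binary_survives = strip_eighth_bit(binary) == binary  # high-bit bytes are altered

print(text_survives, binary_survives)
```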
It is often desirable, however, to be able to send non-textual data through text-based systems, such as when one attaches an image file to an e-mail message. To accomplish this, the data is encoded so that eight-bit data is represented by seven-bit ASCII characters (generally using only alphanumeric and punctuation characters, i.e. the ASCII printable characters). Upon safe arrival at its destination, it is then decoded back to its eight-bit form. This process is referred to as binary-to-text encoding. Many programs, such as PGP and GNU Privacy Guard, perform this conversion to allow for data transport.
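The encode, transmit, decode cycle can be sketched with Python's standard `base64` module. Base64 is used here only as one example of such an encoding; the variable names are illustrative.

```python
import base64

payload = bytes(range(256))          # every possible eight-bit byte value
encoded = base64.b64encode(payload)  # bytes object holding only ASCII characters

# Every encoded byte fits in seven bits, so a 7-bit channel can carry it.
seven_bit_safe = all(b < 128 for b in encoded)

decoded = base64.b64decode(encoded)  # back to the original eight-bit form
print(seven_bit_safe, decoded == payload)
```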
Binary-to-text encoding methods are also used as a mechanism for encoding plain text. For example:
By using a binary-to-text encoding on messages that are already plain text, then decoding on the other end, one can make such systems appear to be completely transparent. This is sometimes referred to as "ASCII armoring". For example, the ViewState component of ASP.NET uses base64 encoding to safely transmit text via HTTP POST, in order to avoid delimiter collision.
The table below compares the most used forms of binary-to-text encodings. The efficiency listed is the ratio between the number of bits in the input and the number of bits in the encoded output.
Encoding | Data type | Efficiency | Programming language implementations | Comments |
---|---|---|---|---|
Ascii85 | Arbitrary | 80% | awk (archived 2014-12-29 at the Wayback Machine), C, C (2), C#, F#, Go, Java, Perl, Python, Python (2) | There exist several variants of this encoding: Base85, btoa, etc. |
Base32 | Arbitrary | 62.5% | ANSI C, Delphi, Go, Java, C#, F#, Python | |
Base36 | Integer | ~64% | bash, C, C++, C#, Java, Perl, PHP, Python, Visual Basic, Swift, many others | Uses the Arabic numerals 0–9 and the Latin letters A–Z (the ISO basic Latin alphabet). Commonly used by URL redirection systems like TinyURL or SnipURL/Snipr as compact alphanumeric identifiers. |
Base45 | Arbitrary | ~67% (97%[a]) | Go, Python | Defined in IETF Specification RFC 9285 for including binary data compactly in a QR code.[1] |
Base56 | Integer | — | PHP, Python, Go | A variant of Base58 encoding which further sheds the '1' and the lowercase 'o' characters in order to minimise the risk of fraud and human error.[2] |
Base58 | Integer | ~73% | C, C++, Python, C#, Java | Similar to Base64, but modified to avoid both non-alphanumeric characters (+ and /) and letters that might look ambiguous when printed (0 – zero, I – capital i, O – capital o and l – lower-case L). Base58 is used to represent bitcoin addresses.[citation needed] Some messaging and social media systems break lines on non-alphanumeric strings. This is avoided by not using URI reserved characters such as +. For SegWit, it was replaced by Bech32, see below. |
Base62 | Arbitrary | ~74% | Rust, Python | Similar to Base64, but contains only alphanumeric characters. |
Base64 | Arbitrary | 75% | awk (archived 2014-12-29 at the Wayback Machine), C, C (2), Delphi, Go, Python, many others | An early and still-popular encoding, first specified as part of RFC 989 in 1987. |
Base85 | Arbitrary | 80% | C, Python, Python (2) | Revised version of Ascii85. |
Base91[3] | Arbitrary | 81% | C#, F# | Constant-width variant |
basE91[4] | Arbitrary | 81% | C, Java, PHP, 8086 Assembly, AWK, C#, F#, Rust | Variable-width variant |
Base94[5] | Arbitrary | 82% | Python, C, Rust | |
Base122[6] | Arbitrary | 87.5% | JavaScript, Python, Java, Base125 Python and JavaScript, Go, C | |
BaseXML[7] | Arbitrary | 83.5% | C, Python, JavaScript | |
Bech32 | Arbitrary | 62.5% + at least 8 chars (label, separator, 6-char ECC) | C, C++, JavaScript, Go, Python, Haskell, Ruby, Rust | Specification.[8] Used in Bitcoin and the Lightning Network.[9] The data portion is encoded like Base32 with the possibility to check and correct up to 6 mistyped characters using the 6-character BCH code at the end, which also checks/corrects the Human Readable Part. The Bech32m variant has a subtle change that makes it more resilient to changes in length.[10] |
BinHex | Arbitrary | 75% | Perl, C, C (2) | MacOS Classic |
Decimal | Integer | ~42% | Most languages | Usually the default representation for input/output from/to humans. |
Hexadecimal (Base16) | Arbitrary | 50% | Most languages | Exists in uppercase and lowercase variants |
Intel HEX | Arbitrary | ≲50% | C library, C++ | Typically used to program EPROM, NOR flash memory chips |
MIME | Arbitrary | See Quoted-printable and Base64 | See Quoted-printable and Base64 | Encoding container for e-mail-like formatting |
Percent-encoding | Text (URIs), Arbitrary (RFC1738) | ~40%[b] (33–70%[c]) | C, Python, probably many others | |
Quoted-printable | Text | ~33–100%[d] | Probably many | Preserves line breaks; cuts lines at 76 characters |
S-record (Motorola hex) | Arbitrary | 49.6% | C library, C++ | Typically used to program EPROM, NOR flash memory chips. 49.6% assumes 255 binary bytes per record. |
Tektronix hex | Arbitrary | | | Typically used to program EPROM, NOR flash memory chips. |
TxMS | Arbitrary | | TypeScript, CLI, Dart | TxMS compresses binary data into a readable text format using binary-to-text encoding and allows reversible conversion back to hexadecimal. |
Uuencoding | Arbitrary | ~60% (up to 70%) | Perl, C, Delphi, Java, Python, probably many others | An early encoding developed in 1980 for Unix-to-Unix Copy. Largely replaced by MIME and yEnc |
Xxencoding | Arbitrary | ~75% (similar to Uuencoding) | C, Delphi | Proposed (and occasionally used) as replacement for Uuencoding to avoid character set translation problems between ASCII and the EBCDIC systems that could corrupt Uuencoded data |
z85 (ZeroMQ spec:32/Z85) | Binary & ASCII | 80% (similar to Ascii85/Base85) | C (original), C#, Dart, Erlang, Go, Lua, Ruby, Rust and others | Specifies a subset of ASCII similar to Ascii85, omitting a few characters that may cause program bugs (` \ " ' _ , ; ). The format conforms to ZeroMQ spec:32/Z85. |
RFC 1751 (S/KEY) | Arbitrary | 33% | C,[11] Python | "A Convention for Human-readable 128-bit Keys". A series of small English words is easier for humans to read, remember, and type in than decimal or other binary-to-text encoding systems.[12] Each 64-bit number is mapped to six short words, of one to four characters each, from a public 2048-word dictionary.[11] |
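For encodings that map the input onto a fixed alphabet of N characters, with each output character occupying one eight-bit byte, the efficiency figures in the table are consistent with log2(N) / 8. The sketch below computes that ratio under this assumption; the function name `efficiency` is hypothetical, and the formula does not account for line breaks, padding, checksums, or escape-based schemes such as quoted-printable.

```python
import math


def efficiency(alphabet_size: int) -> float:
    """Input bits carried per output bit, assuming one 8-bit byte per output character."""
    return math.log2(alphabet_size) / 8


# A few alphabet sizes taken from the table above.
for name, size in [("Hexadecimal", 16), ("Base32", 32), ("Base58", 58),
                   ("Base64", 64), ("Base94", 94)]:
    print(f"{name}: {efficiency(size):.1%}")
```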
The 95 isprint codes 32 to 126 are known as the ASCII printable characters.
Some older and today uncommon formats include BOO, BTOA, and USR encoding.
Most of these encodings generate text containing only a subset of all ASCII printable characters: for example, the base64 encoding generates text that only contains upper-case and lower-case letters (A–Z, a–z), numerals (0–9), and the "+", "/", and "=" symbols.
Some of these encodings (quoted-printable and percent-encoding) are based on a set of allowed characters and a single escape character. The allowed characters are left unchanged, while all other characters are converted into a string starting with the escape character. This kind of conversion allows the resulting text to be almost readable, in that letters and digits are part of the allowed characters, and are therefore left as they are in the encoded text. These encodings produce the shortest plain ASCII output for input that is mostly printable ASCII.
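A minimal sketch of the escape-character approach, modelled loosely on percent-encoding, might look as follows. The set of allowed characters and the function names `escape_encode` and `escape_decode` are assumptions chosen for illustration; real formats define their own allowed sets and line-length rules.

```python
import string

# Bytes that pass through unchanged (an assumed set for this sketch).
ALLOWED = frozenset((string.ascii_letters + string.digits + "-._~").encode("ascii"))
ESCAPE = "%"


def escape_encode(data: bytes) -> str:
    """Leave allowed bytes as-is; write every other byte as ESCAPE plus two hex digits."""
    return "".join(chr(b) if b in ALLOWED else f"{ESCAPE}{b:02X}" for b in data)


def escape_decode(text: str) -> bytes:
    """Reverse escape_encode by expanding each escape sequence back into one byte."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == ESCAPE:
            out.append(int(text[i + 1:i + 3], 16))
            i += 3
        else:
            out.append(ord(text[i]))
            i += 1
    return bytes(out)


print(escape_encode(b"hi there!"))
```

Mostly-text input stays readable and grows only slightly, while input made up entirely of disallowed bytes triples in length.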
Some other encodings (base64, uuencoding) are based on mapping all possible sequences of six bits into different printable characters. Since there are more than 2⁶ = 64 printable characters, this is possible. A given sequence of bytes is translated by viewing it as a stream of bits, breaking this stream into chunks of six bits and generating the sequence of corresponding characters. The different encodings differ in the mapping between sequences of bits and characters and in how the resulting text is formatted.
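The six-bit chunking described above can be sketched in a few lines of Python. This version uses the standard base64 alphabet and "=" padding so that its output can be compared against the standard library; the function name `encode_six_bit` is hypothetical, and the bit-string approach favours clarity over speed.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def encode_six_bit(data: bytes) -> str:
    """Encode bytes by splitting the bit stream into six-bit chunks."""
    # View the input as one long stream of bits.
    bits = "".join(f"{byte:08b}" for byte in data)
    # Pad with zero bits so the stream divides evenly into six-bit chunks.
    bits += "0" * (-len(bits) % 6)
    # Map each six-bit chunk to its printable character.
    chars = "".join(ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6))
    # base64 pads the output with "=" to a multiple of four characters.
    return chars + "=" * (-len(chars) % 4)


for sample in (b"", b"M", b"Ma", b"Man", bytes(range(256))):
    assert encode_six_bit(sample) == base64.b64encode(sample).decode("ascii")
print(encode_six_bit(b"Man"))
```

Swapping `ALPHABET` for a different 64-character table changes the mapping while leaving the chunking step the same.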
Some encodings (the original version of BinHex and the recommended encoding for CipherSaber) use four bits instead of six, mapping all possible sequences of 4 bits onto the 16 standard hexadecimal digits. Using 4 bits per encoded character leads to a 50% longer output than base64, but simplifies encoding and decoding: expanding each byte in the source independently to two encoded bytes is simpler than base64's expanding 3 source bytes to 4 encoded bytes.
Out of PETSCII's first 192 codes, 164 have visible representations when quoted: 5 (white), 17–20 and 28–31 (colors and cursor controls), 32–90 (ASCII equivalent), 91–127 (graphics), 129 (orange), 133–140 (function keys), 144–159 (colors and cursor controls), and 160–192 (graphics).[13] This theoretically permits encodings, such as base128, between PETSCII-speaking machines.
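The simplicity of the four-bit approach, and its size cost relative to base64, can be seen in a short sketch. The helper names `hex_encode` and `hex_decode` are illustrative; Python's built-in `bytes.hex()` and `bytes.fromhex()` do the same job.

```python
import base64

HEX_DIGITS = "0123456789ABCDEF"


def hex_encode(data: bytes) -> str:
    """Expand each byte independently into two hexadecimal digits."""
    return "".join(HEX_DIGITS[b >> 4] + HEX_DIGITS[b & 0x0F] for b in data)


def hex_decode(text: str) -> bytes:
    """Combine each pair of uppercase hexadecimal digits back into one byte."""
    return bytes(
        HEX_DIGITS.index(text[i]) << 4 | HEX_DIGITS.index(text[i + 1])
        for i in range(0, len(text), 2)
    )


sample = bytes(range(256)) * 3  # 768 bytes, a multiple of 3 so base64 needs no padding
hex_len = len(hex_encode(sample))
b64_len = len(base64.b64encode(sample))
print(hex_len, b64_len)  # two characters per byte versus four characters per three bytes
```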
Even in Byte mode, a typical QR code reader tries to interpret a byte sequence as text encoded in UTF-8 or ISO/IEC 8859-1. ... Such data has to be converted into an appropriate text before that text could be encoded as a QR code. ... Base45 ... offers a more compact QR code encoding.