Null-terminated string

From Wikipedia, the free encyclopedia

Data structure

"CString" redirects here. For other uses, see C string (disambiguation).

In computer programming, a null-terminated string is a character string stored as an array containing the characters and terminated with a null character (a character with an internal value of zero, called "NUL" in this article, and not the same as the glyph zero). Alternative names are C string, which refers to the C programming language, and ASCIIZ^[1] (although C can use encodings other than ASCII).

The length of a string is found by searching for the (first) NUL. This can be slow, as it takes O(n) (linear) time with respect to the string length. It also means that a string cannot contain a NUL (there is a NUL in memory, but it is after the last character, not in the string).
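The linear search described above can be sketched as follows. The name nt_length is hypothetical and used only for illustration; the standard library function strlen performs the same job.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts the bytes before the first NUL.
 * Every byte of the string is visited once, hence O(n) time. */
size_t nt_length(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != '\0') {
        n++;
    }
    return n;
}
```

Because the scan stops at the first zero byte, calling nt_length on the bytes 'a', 'b', NUL, 'c', 'd', NUL reports 2: the bytes after the embedded NUL are not part of the string.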

History

Null-terminated strings were produced by the .ASCIZ directive of the PDP-11 assembly languages and the ASCIZ directive of the MACRO-10 macro assembly language for the PDP-10. These predate the development of the C programming language, but other forms of strings were often used.

At the time C (and the languages that it was derived from) was developed, memory was extremely limited, so using only one byte of overhead per string was attractive. The only popular alternative at that time, usually called a "Pascal string" (a more modern term is "length-prefixed"), used a leading byte to store the length of the string. This allowed the string to contain NUL and meant that finding the length needed only one memory access (O(1), constant time), but it limited string length to 255 characters. C designer Dennis Ritchie chose to follow the convention of null-termination to avoid the limitation on the length of a string and because maintaining the count seemed, in his experience, less convenient than using a terminator.^[2]
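For comparison, a length-prefixed string with a single leading length byte might be laid out as in the sketch below. The type pstring and its functions are hypothetical names chosen for this illustration, not the layout of any particular Pascal implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: one length byte followed by up to 255 data bytes. */
typedef struct {
    unsigned char len;
    char data[255];
} pstring;

/* Stores n bytes. Returns 0 on success, or -1 if n does not fit
 * in the one-byte length field. */
int pstring_set(pstring *p, const char *bytes, size_t n)
{
    assert(p != NULL && bytes != NULL);
    if (n > 255) {
        return -1;
    }
    p->len = (unsigned char)n;
    memcpy(p->data, bytes, n);   /* zero bytes are copied like any other */
    return 0;
}

/* O(1): the length is read directly, with no scan of the data. */
size_t pstring_length(const pstring *p)
{
    assert(p != NULL);
    return p->len;
}
```

The sketch shows both sides of the trade-off: a zero byte can be stored in the data, but a 256-byte input is rejected.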

This had some influence on CPU instruction set design. Some CPUs in the 1970s and 1980s, such as the Zilog Z80 and the DEC VAX, had dedicated instructions for handling length-prefixed strings. However, as the null-terminated string gained traction, CPU designers began to take it into account, as seen for example in IBM's decision to add the "Logical String Assist" instructions to the ES/9000 520 in 1992 and the vector string instructions to the IBM z13 in 2015.^[3]

FreeBSD developer Poul-Henning Kamp, writing in ACM Queue, referred to the victory of null-terminated strings over a 2-byte (not one-byte) length as "the most expensive one-byte mistake" ever.^[4]

Limitations

While simple to implement, this representation has been prone to errors and performance problems.

Null-termination has historically created security problems.^[5] A NUL inserted into the middle of a string will truncate it unexpectedly.^[6] A common bug was to not allocate the additional space for the NUL, so that it was written over adjacent memory. Another was to not write the NUL at all, which was often not detected during testing because the block of memory already contained zeros. Due to the expense of finding the length, many programs did not bother to check it before copying a string to a fixed-size buffer, causing a buffer overflow if it was too long.
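The first two bugs concern the extra byte for the NUL. A minimal sketch of a copy routine that accounts for it is shown below; nt_duplicate is a hypothetical name, and the POSIX function strdup offers comparable behavior.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap-allocated copy of s, or NULL if
 * allocation fails. The caller is responsible for calling free(). */
char *nt_duplicate(const char *s)
{
    assert(s != NULL);
    size_t len = strlen(s);
    char *copy = malloc(len + 1);   /* +1: room for the terminating NUL */
    if (copy == NULL) {
        return NULL;
    }
    memcpy(copy, s, len + 1);       /* len + 1 bytes: the NUL is copied too */
    return copy;
}
```

Writing malloc(len) here would reproduce the allocation bug described above, and copying only len bytes would leave the result unterminated.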

The inability to store a zero requires that text and binary data be kept distinct and handled by different functions (with the latter requiring the length of the data to also be supplied). This can lead to code redundancy and errors when the wrong function is used.
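The difference between the two families of functions can be illustrated with a comparison. In this sketch, bytes_equal is a hypothetical helper for binary data; strcmp and memcmp are the standard library functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: compares two binary buffers. The lengths must be
 * supplied by the caller because a zero byte does not mark the end. */
int bytes_equal(const void *a, size_t alen, const void *b, size_t blen)
{
    assert(a != NULL && b != NULL);
    return alen == blen && memcmp(a, b, alen) == 0;
}
```

Given the two five-byte buffers 'a', 'b', NUL, 'c', 'd' and 'a', 'b', NUL, 'x', 'y', the text function strcmp treats them as equal because it stops at the first zero byte, while bytes_equal with a length of 5 reports that they differ.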

The speed problems with finding the length can usually be mitigated by combining it with another operation that is O(n) anyway, such as in strlcpy. However, this does not always result in an intuitive API.
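Since strlcpy is not part of the C standard and is not available on every platform, the sketch below uses a hypothetical function, bounded_copy, written to follow the same convention: it copies with truncation and returns the length of the source, so the length is obtained in the same pass as the copy.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper following the strlcpy convention: copies at most
 * dstsize - 1 bytes of src into dst, always terminates dst when
 * dstsize > 0, and returns the full length of src. */
size_t bounded_copy(char *dst, const char *src, size_t dstsize)
{
    assert(dst != NULL && src != NULL);
    size_t i = 0;
    while (src[i] != '\0') {
        if (i + 1 < dstsize) {      /* keep one byte free for the NUL */
            dst[i] = src[i];
        }
        i++;
    }
    if (dstsize > 0) {
        dst[i < dstsize ? i : dstsize - 1] = '\0';
    }
    return i;
}
```

The caller detects truncation by checking whether the return value is greater than or equal to dstsize, which is the kind of less-than-intuitive contract the paragraph above refers to.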

Character encodings

Null-terminated strings require that the encoding does not use a zero byte (0x00) anywhere; therefore it is not possible to store every possible ASCII or UTF-8 string.^[7]^[8]^[9] However, it is common to store the subset of ASCII or UTF-8 (every character except NUL) in null-terminated strings. Some systems use "modified UTF-8", which encodes NUL as two non-zero bytes (0xC0, 0x80) and thus allows all possible strings to be stored. This is not allowed by the UTF-8 standard, because it is an overlong encoding, and it is seen as a security risk. Some other byte may be used as end of string instead, like 0xFE or 0xFF, which are not used in UTF-8.
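The substitution of 0xC0 0x80 for a zero byte can be sketched as follows. The function encode_embedded_nul is a hypothetical name, and the sketch handles only the NUL substitution; it does not validate the rest of the input as UTF-8, nor does it cover any other rule a particular "modified UTF-8" variant may define.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: writes a NUL-terminated copy of in[0..n) to out,
 * replacing each zero byte with the pair 0xC0 0x80. Returns the number
 * of bytes written, not counting the terminator, or (size_t)-1 if out
 * is too small. */
size_t encode_embedded_nul(const unsigned char *in, size_t n,
                           unsigned char *out, size_t outsize)
{
    assert(in != NULL && out != NULL);
    size_t w = 0;
    for (size_t i = 0; i < n; i++) {
        size_t need = (in[i] == 0x00) ? 2 : 1;
        if (w + need >= outsize) {  /* must leave room for the terminator */
            return (size_t)-1;
        }
        if (in[i] == 0x00) {
            out[w++] = 0xC0;
            out[w++] = 0x80;
        } else {
            out[w++] = in[i];
        }
    }
    if (w >= outsize) {
        return (size_t)-1;
    }
    out[w] = 0x00;
    return w;
}
```

The output contains no zero byte before its terminator, so it can be passed to functions that expect a null-terminated string; a strict UTF-8 decoder, however, would reject the 0xC0 0x80 pair as overlong.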

UTF-16 uses 2-byte integers, and since either byte may be zero (and in fact every other byte is, when representing ASCII text), UTF-16 text cannot be stored in a null-terminated byte string. However, some languages implement a string of 16-bit UTF-16 characters, terminated by a 16-bit NUL (0x0000).
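A length scan over such a string compares whole 16-bit units against zero rather than single bytes. The sketch below uses uint16_t as the code unit type and the hypothetical name utf16_length; real platforms typically provide their own types and functions for this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: counts the 16-bit code units before the first
 * 16-bit NUL (0x0000). A single zero byte inside a unit does not end
 * the string. */
size_t utf16_length(const uint16_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0x0000) {
        n++;
    }
    return n;
}
```

For the units 0x0048, 0x0069, 0x0000 ("Hi") the result is 2, even though one of the first two bytes in memory is zero regardless of byte order. The result counts code units, so a character encoded as a surrogate pair contributes 2.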

Improvements

Many attempts have been made to make C string handling less error-prone. One strategy is to add safer functions such as strdup and strlcpy, whilst deprecating the use of unsafe functions such as gets. Another is to add an object-oriented wrapper around C strings so that only safe calls can be made. However, it is possible to call the unsafe functions anyway.

Most modern libraries replace C strings with a structure containing a 32-bit or larger length value (far more than was ever considered for length-prefixed strings), and often add another pointer, a reference count, and even a NUL to speed up conversion back to a C string. Memory is far larger now, such that if the addition of 3 (or 16, or more) bytes to each string is a real problem, the software will have to be dealing with so many small strings that some other storage method will save even more memory (for instance, there may be so many duplicates that a hash table will use less memory). Examples include the C++ Standard Template Library std::string, the Qt QString, the MFC CString, and the C-based implementation CFString from Core Foundation as well as its Objective-C sibling NSString from Foundation, both by Apple. More complex structures may also be used to store strings, such as the rope.
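A structure of this general shape might look like the sketch below. The type lstring and its functions are hypothetical and deliberately minimal; this is not the actual layout of std::string, QString, or any other library named above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical string type: an explicit length plus a buffer that is
 * also kept NUL-terminated for cheap conversion to a C string. */
typedef struct {
    size_t len;    /* number of bytes, not counting the trailing NUL */
    size_t cap;    /* allocated size of data, including the NUL */
    char *data;
} lstring;

/* Copies n bytes (which may include zero bytes). Returns 0 on success,
 * -1 on allocation failure or if n + 1 would overflow. */
int lstring_init(lstring *s, const char *bytes, size_t n)
{
    assert(s != NULL && bytes != NULL);
    if (n == SIZE_MAX) {
        return -1;
    }
    char *buf = malloc(n + 1);
    if (buf == NULL) {
        return -1;
    }
    memcpy(buf, bytes, n);
    buf[n] = '\0';
    s->len = n;
    s->cap = n + 1;
    s->data = buf;
    return 0;
}

/* O(1) length lookup. */
size_t lstring_length(const lstring *s) { return s->len; }

/* No copy is needed: the buffer already ends with a NUL. */
const char *lstring_c_str(const lstring *s) { return s->data; }

void lstring_free(lstring *s)
{
    free(s->data);
    s->data = NULL;
    s->len = 0;
    s->cap = 0;
}
```

When the stored bytes contain a zero, lstring_length still reports the full length, while code that treats the result of lstring_c_str as a C string sees only the part before the first zero byte.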

References

[1] "Chapter 15 - MIPS Assembly Language" (PDF). Carleton University. Retrieved 9 October 2023.
[2] Ritchie, Dennis M. (1996). "The development of the C language". In Bergin, Jr., Thomas J.; Gibson, Jr., Richard G. (eds.). History of Programming Languages (2 ed.). New York: ACM Press. ISBN 0-201-89502-1 – via Addison-Wesley (Reading, Mass).
[3] IBM z/Architecture Principles of Operation.
[4] Kamp, Poul-Henning (25 July 2011). "The Most Expensive One-byte Mistake". ACM Queue. 9 (7): 40–43. doi:10.1145/2001562.2010365. ISSN 1542-7730. S2CID 30282393.
[5] Rain Forest Puppy (9 September 1999). "Perl CGI problems". Phrack Magazine. 9 (55). artofhacking.com: 7. Retrieved 3 January 2016.
[6] "Null byte injection on PHP?".
[7] Yergeau, François (November 2003). "UTF-8, a transformation format of ISO 10646". Retrieved 19 September 2013.
[8] "Unicode/UTF-8-character table". Retrieved 13 September 2013.
[9] Kuhn, Markus. "UTF-8 and Unicode FAQ". Retrieved 13 September 2013.
