Movatterモバイル変換

[0]ホーム

Jump to content

Bit array

Edit links

From Wikipedia, the free encyclopedia

Array data structure that compactly stores bits

This articleneeds additional citations forverification. Please helpimprove this article byadding citations to reliable sources. Unsourced material may be challenged and removed.
Find sources: "Bit array" – news ·newspapers ·books ·scholar ·JSTOR(December 2010) (Learn how and when to remove this message)

Abit array (also known asbit map,bit set,bit string, orbit vector) is anarray data structure that compactly storesbits. It can be used to implement a simpleset data structure. A bit array is effective at exploitingbit-level parallelism in hardware to perform operations quickly. A typical bit array storeskw bits, wherew is the number of bits in the unit of storage, such as abyte orword, andk is some nonnegative integer. Ifw does not divide the number of bits to be stored, some space is wasted due tointernal fragmentation.

Definition

[edit]

A bit array is a mapping from some domain (almost always a range of integers) to values in the set $\{{\texttt {0}},{\texttt {1}}\}$ . The values can be interpreted as dark/light, absent/present, locked/unlocked, valid/invalid, et cetera. The point is that there are only two possible values, so they can be stored in one bit. As with other arrays, the access to a single bit can be managed by applying an index to the array. Assuming its size (or length) to ben bits, the array can be used to specify a subset of the domain (e.g. $\{0,1,2,...,n-1\}$ ), where a1-bit indicates the presence and a0-bit the absence of a number in the set. This set data structure uses aboutn/w words of space, wherew is the number of bits in eachmachine word. Whether the least significant bit (of the word) or the most significant bit indicates the smallest-index number is largely irrelevant, but the former tends to be preferred (onlittle-endian machines).

A finitebinary relation may be represented by a bit array called alogical matrix. In thecalculus of relations, these arrays are composed withmatrix multiplication where the arithmetic is Boolean, and such a composition representscomposition of relations.^[1]

Basic operations

[edit]

Although most machines are not able to address individual bits in memory, nor have instructions to manipulate single bits, each bit in a word can be singled out and manipulated usingbitwise operations. In particular:

UseOR to set a bit to one:

      11101010 OR 00000100    = 11101110

AND to set a bit to zero:

         11101010AND 11111101      = 11101000

AND to determine if a bit is set, by zero-testing:

            11101010                     11101010   AND 00000001            AND 00000010         = 00000000                  = 00000010(=0 ∴ bit isn't set)     (≠0 ∴ bit is set)

XOR to invert or toggle a bit:

        11101010             11101110XOR 00000100    XOR 00000100     = 11101110         =  11101010

NOT to invert all bits:

NOT 10110010       = 01001101

To obtain thebit mask needed for these operations, we can use abit shift operator to shift the number 1 to the left by the appropriate number of places, as well asbitwise negation if necessary.

Given two bit arrays of the same size representing sets, we can compute theirunion,intersection, andset-theoretic difference usingn/w simple bit operations each (2n/w for difference), as well as thecomplement of either:

for ifrom 0to n/w-1    complement_a[i] :=not a[i]    union[i]        := a[i]or b[i]    intersection[i] := a[i]and b[i]    difference[i]   := a[i]and (not b[i])

If we wish to iterate through the bits of a bit array, we can do this efficiently using a doubly nested loop that loops through each word, one at a time. Onlyn/w memory accesses are required:

for ifrom 0 to n/w-1    index := 0    // if needed    word := a[i]for bfrom 0 to w-1        value := wordand 1 ≠ 0        word := word shift right 1        // do something with value        index := index + 1    // if needed

Both of these code samples exhibit ideallocality of reference, which will subsequently receive large performance boost from a data cache. If a cache line isk words, only aboutn/wk cache misses will occur.

More complex operations

[edit]

As withcharacter strings it is straightforward to definelength,substring,lexicographicalcompare,concatenation,reverse operations. The implementation of some of these operations is sensitive toendianness.

Population / Hamming weight

[edit]

If we wish to find the number of 1 bits in a bit array, sometimes called the population count or Hamming weight, there are efficient branch-free algorithms that can compute the number of bits in a word using a series of simple bit operations. We simply run such an algorithm on each word and keep a running total. Counting zeros is similar. See theHamming weight article for examples of an efficient implementation.

Inversion

[edit]

Vertical flipping of a one-bit-per-pixel image, or some FFT algorithms, requires flipping the bits of individual words (sob31 b30 ... b0 becomesb0 ... b30 b31).When this operation is not available on the processor, it's still possible to proceed by successive passes, in this example on 32 bits:

exchange two 16-bit halfwordsexchange bytes by pairs (0xddccbbaa -> 0xccddaabb)...swap bits by pairsswap bits (b31 b30 ... b1 b0 -> b30 b31 ... b0 b1)The last operation can be written ((x&0x55555555) << 1) | (x&0xaaaaaaaa) >> 1)).

Find first one

[edit]

Thefind first set orfind first one operation identifies the index or position of the 1-bit with the smallest index in an array, and has widespread hardware support (for arrays not larger than a word) and efficient algorithms for its computation. When apriority queue is stored in a bit array, find first one can be used to identify the highest priority element in the queue. To expand a word-sizefind first one to longer arrays, one can find the first nonzero word and then runfind first one on that word. The related operationsfind first zero,count leading zeros,count leading ones,count trailing zeros,count trailing ones, andlog base 2 (seefind first set) can also be extended to a bit array in a straightforward manner.

Compression

[edit]

A bit array is the most dense storage for "random" bits, that is, where each bit is equally likely to be 0 or 1, and each one is independent. But most data are not random, so it may be possible to store it more compactly. For example, the data of a typical fax image is not random and can be compressed.Run-length encoding is commonly used to compress these long streams. However, most compressed data formats are not so easy to access randomly; also by compressing bit arrays too aggressively we run the risk of losing the benefits due to bit-level parallelism (vectorization). Thus, instead of compressing bit arrays as streams of bits, we might compress them as streams of bytes or words (seeBitmap index (compression)).

Advantages and disadvantages

[edit]

Bit arrays, despite their simplicity, have a number of marked advantages over other data structures for the same problems:

They are extremely compact; no other data structures can storen independent pieces of data inn/w words.
They allow small arrays of bits to be stored and manipulated in the register set for long periods of time with no memory accesses.
Because of their ability to exploit bit-level parallelism, limit memory access, and maximally use thedata cache, they often outperform many other data structures on practical data sets, even those that are more asymptotically efficient.

However, bit arrays are not the solution to everything. In particular:

Without compression, they are wasteful set data structures for sparse sets (those with few elements compared to their range) in both time and space. For such applications, compressed bit arrays,Judy arrays,tries, or evenBloom filters should be considered instead.
Accessing individual elements can be expensive and difficult to express in some languages. If random access is more common than sequential and the array is relatively small, a byte array may be preferable on a machine with byte addressing. A word array, however, is probably not justified due to the huge space overhead and additional cache misses it causes, unless the machine only has word addressing.

Applications

[edit]

Because of their compactness, bit arrays have a number of applications in areas where space or efficiency is at a premium. Most commonly, they are used to represent a simple group of Boolean flags or an ordered sequence of Boolean values.

Bit arrays are used forpriority queues, where the bit at indexk is set if and only ifk is in the queue; this data structure is used, for example, by theLinux kernel, and benefits strongly from a find-first-zero operation in hardware.

Bit arrays can be used for the allocation ofmemory pages,inodes, disk sectors, etc. In such cases, the termbitmap may be used. However, this term is frequently used to refer toraster images, which may use multiplebits per pixel.

Another application of bit arrays is theBloom filter, a probabilisticset data structure that can store large sets in a small space in exchange for a small probability of error. It is also possible to build probabilistichash tables based on bit arrays that accept either false positives or false negatives.

Bit arrays and the operations on them are also important for constructingsuccinct data structures, which use close to the minimum possible space. In this context, operations like finding thenth 1 bit or counting the number of 1 bits up to a certain position become important.

Bit arrays are also a useful abstraction for examining streams ofcompressed data, which often contain elements that occupy portions of bytes or are not byte-aligned. For example, the compressedHuffman coding representation of a single 8-bit character can be anywhere from 1 to 255 bits long.

Ininformation retrieval, bit arrays are a good representation for theposting lists of very frequent terms. If we compute the gaps between adjacent values in a list of strictly increasing integers and encode them usingunary coding, the result is a bit array with a 1 bit in thenth position if and only ifn is in the list. The implied probability of a gap ofn is 1/2ⁿ. This is also the special case ofGolomb coding where the parameter M is 1; this parameter is only normally selected when−log(2 −p) / log(1 −p) ≤ 1, or roughly the term occurs in at least 38% of documents.

Examples

[edit]

Given a big file ofIPv4 addresses (more than 100 GB) — we need to count unique addresses. If we use genericmap[string]bool — we will need more than 64 GB ofRAM, so we use thebit map, inGo:

packagemainimport("bufio""fmt""math/bits""os")// bitsetSize is the number of bytes needed for 2^32 bits (512 MiB)constbitsetSize=1<<29funcmain(){file,err:=os.Open("ip_addresses")iferr!=nil{fmt.Println("Error opening file:",err)return}deferfile.Close()bitset:=[bitsetSize]byte{}// Use a buffered scanner with a larger bufferscanner:=bufio.NewScanner(file)constmaxBuffer=64*1024// 64 KB bufferbuf:=make([]byte,0,maxBuffer)scanner.Buffer(buf,maxBuffer)// Process each lineforscanner.Scan(){line:=scanner.Bytes()// Parse the IP address manually from bytesip:=parseIPv4(line)// Set the bitbyteIndex:=ip>>3// Divide by 8bitIndex:=ip&7// Bit position 0-7bitset[byteIndex]|=1<<bitIndex}// Check for scanning errorsiferr:=scanner.Err();err!=nil{fmt.Println("Error reading file:",err)return}varcountuint64fori:=0;i<bitsetSize;i++{count+=uint64(bits.OnesCount8(bitset[i]))}fmt.Println("Number of unique IPv4 addresses:",count)}funcparseIPv4(line[]byte)(ipuint32){i:=0// Octet 1n:=uint32(line[i]-'0')fori=1;line[i]!='.';i++{n=n*10+uint32(line[i]-'0')}ip|=n<<24i++// Skip the dot// Octet 2n=uint32(line[i]-'0')i++for;line[i]!='.';i++{n=n*10+uint32(line[i]-'0')}ip|=n<<16i++// Skip the dot// Octet 3n=uint32(line[i]-'0')i++for;line[i]!='.';i++{n=n*10+uint32(line[i]-'0')}ip|=n<<8i++// Skip the dot// Octet 4n=uint32(line[i]-'0')i++for;i<len(line);i++{n=n*10+uint32(line[i]-'0')}ip|=nreturnip}

Language support

[edit]

TheAPL programming language fully supports bit arrays of arbitrary shape and size as a Boolean datatype distinct from integers. All major implementations (Dyalog APL, APL2, APL Next, NARS2000, Gnu APL, etc.) pack the bits densely into whatever size the machine word is. Bits may be accessed individually via the usual indexing notation (for example,A[3]) as well as through all of the usual primitive functions and operators where they are often operated on using a special case algorithm such as summing the bits via a table lookup of bytes.

TheC programming language'sbit fields, pseudo-objects found in structs with size equal to some number of bits, are in fact small bit arrays; they are limited in that they cannot span words. Although they give a convenient syntax, the bits are still accessed using bytewise operators on most machines, and they can only be defined statically (like C's static arrays, their sizes are fixed at compile-time). It is also a common idiom for C programmers to use words as small bit arrays and access bits of them using bit operators. A widely available header file included in theX11 system,<xtrapbits.h>, is “a portable way for systems to define bit field manipulation of arrays of bits.” A more explanatory description of aforementioned approach can be found in thecomp.lang.c faq.

InC++, although individualbools typically occupy the same space as a byte or an integer, theSTL typestd::vector<bool> is apartial template specialization in which bits are packed as a space efficiency optimization. Since bytes (and not bits) are the smallest addressable unit in C++, the[] operator doesnot return a reference to an element, but instead returns aproxy reference. This might seem a minor point, but it means thatvector<bool> isnot a standard STL container, which is why the use ofvector<bool> is generally discouraged. Another unique STL class,std::bitset,^[2] creates a vector of bits fixed at a particular size at compile-time, and in its interface and syntax more resembles the idiomatic use of words as bit sets by C programmers. It also has some additional power, such as the ability to efficiently count the number of bits that are set. Likearray, the size ofbitset must be specified at compile time, and cannot be inferred by the compiler. TheBoost C++ Libraries provides aboost::dynamic_bitset class^[3] whose size is specified at run-time.

TheD programming language provides bit arrays in its standard library, Phobos, instd.bitmanip. As in C++, the [] operator does not return a reference, since individual bits are not directly addressable on most hardware, but instead returns abool.

InJava, the classBitSet (java.util.BitSet) creates a bit array that is then manipulated with functions named after bitwise operators familiar to C programmers. Unlike thebitset in C++, the JavaBitSet does not have a "size" state (it has an effectively infinite size, initialized with 0 bits); a bit can be set or tested at any index. In addition, there is a classEnumSet (java.util.EnumSet), which represents a Set of values of anenumerated type internally as a bit vector, as a safer alternative to bit fields.

The.NET Framework supplies aSystems.Collections.BitArray collection class. It stores bits using an array of typeint (each element in the array usually represents 32 bits).^[4] The class supports random access and bitwise operators, can be iterated over, and itsLength property can be changed to grow or truncate it.

AlthoughStandard ML has no support for bit arrays, Standard ML of New Jersey has an extension, theBitArray structure, in its SML/NJ Library. It is not fixed in size and supports set operations and bit operations, including, unusually, shift operations.

Haskell likewise currently lacks standard support for bitwise operations, but bothGHC and Hugs provide aData.Bits module with assorted bitwise functions and operators, including shift and rotate operations and an "unboxed" array over Boolean values may be used to model a Bit array, although this lacks support from the former module.

InPerl, strings can be used as expandable bit arrays. They can be manipulated using the usual bitwise operators (~ | & ^),^[5] and individual bits can be tested and set using thevec function.^[6]

InRuby, one can access (but not set) a bit of an integer (Fixnum orBignum) using the bracket operator ([]), as if it were an array of bits.

InRust 1.3.0, there existed astd::collections::BitSet object, however it was deprecated and removed for some reason. Instead, one must use thebit_set orbitvec libraries.^[7]^[8]

Apple'sCore Foundation library containsCFBitVector andCFMutableBitVector structures.

PL/I supports arrays ofbit strings of arbitrary length, which may be either fixed-length or varying. The array elements may bealigned— each element begins on a byte or word boundary— orunaligned— elements immediately follow each other with no padding.

PL/pgSQL and PostgreSQL's SQL supportbit strings as native type. There are two SQL bit types:bit(n) andbit varying(n), wheren is a positive integer.^[9]

Hardware description languages such asVHDL,Verilog, andSystemVerilog natively support bit vectors as these are used to model storage elements likeflip-flops, hardware busses and hardware signals in general. In hardware verification languages such as OpenVera,e andSystemVerilog, bit vectors are used to sample values from the hardware models, and to represent data that is transferred to hardware during simulations.

Common Lisp provides multi-dimensional bit arrays. A one-dimensionalbit-vector implementation is provided as a special case of the built-inarray, acting in a dual capacity as a class and a type specifier.^[10] Bit arrays (and thus bit vectors) relies on the generalmake-array function to be configured with an element type ofbit, which optionally permits a bit vector to be designated as dynamically resizable. Thebit-vector, however, is not infinite in extent. A more restrictedsimple-bit-vector type exists, which explicitly excludes the dynamic characteristics.^[11] Bit vectors are represented as, and can be constructed in a more concise fashion by, thereader macro#*bits.^[12] In addition to the general functions applicable to all arrays, dedicated operations exist for bit arrays. Single bits may be accessed and modified using thebit andsbit functions^[13] and an extensive number of logical operations is supported.^[14]

References

[edit]

^Irving Copilowish (December 1948) "Matrix development of the calculus of relations",Journal of Symbolic Logic 13(4): 193–203Jstor link
^"SGI.com Tech Archive Resources now retired".SGI. 2 January 2018.
^"dynamic_bitset<Block, Allocator> - 1.66.0".www.boost.org.
^".NET mscorlib source code".github.com/microsoft. 15 October 2021.
^"perlop - perldoc.perl.org".perldoc.perl.org.
^"vec - perldoc.perl.org".perldoc.perl.org.
^"bit_set - Rust".docs.rs. 16 July 2024.
^"bitvec - Rust".docs.rs. 10 July 2022.
^"8.10. Bit String Types". 30 September 2021.
^"CLHS: System Class BIT-VECTOR".www.lispworks.com.
^"CLHS: Type SIMPLE-BIT-VECTOR".www.lispworks.com.
^"CLHS: Section 2.4.8.4".www.lispworks.com.
^"CLHS: Accessor BIT, SBIT".www.lispworks.com.
^"CLHS: Function BIT-AND, BIT-ANDC1, BIT-ANDC2..."www.lispworks.com.

External links

[edit]

mathematical bases Archived 2019-10-16 at theWayback Machine by Pr. D.E.Knuth
vector<bool> Is Nonconforming, and Forces Optimization Choice
vector<bool>: More Problems, Better Solutions

v t e Data structures
Types	Collection Container
Abstract	Associative array Multimap Retrieval Data Structure List Stack Queue Double-ended queue Priority queue Double-ended priority queue Set Multiset Disjoint-set
Arrays	Bit array Circular buffer Dynamic array Hash table Hashed array tree Sparse matrix
Linked	Association list Linked list Skip list Unrolled linked list XOR linked list
Trees	B-tree Binary search tree AA tree AVL tree Red–black tree Self-balancing tree Splay tree Heap Binary heap Binomial heap Fibonacci heap R-tree R* tree R+ tree Hilbert R-tree Rope Trie Hash tree
Graphs	Binary decision diagram Directed acyclic graph Directed acyclic word graph
List of data structures