In computer science, a hash table is a data structure that implements an associative array, also called a dictionary or simply map; an associative array is an abstract data type that maps keys to values.[3] A hash table uses a hash function to compute an index, also called a hash code, into an array of buckets or slots, from which the desired value can be found. During lookup, the key is hashed and the resulting hash indicates where the corresponding value is stored. A map implemented by a hash table is called a hash map.
Most hash table designs employ an imperfect hash function. Hash collisions, where the hash function generates the same index for more than one key, therefore typically must be accommodated in some way. Common strategies to handle hash collisions include chaining, which stores multiple elements in the same slot using linked lists, and open addressing, which searches for the next available slot according to a probing sequence.[4]
In a well-dimensioned hash table, the average time complexity for each lookup is independent of the number of elements stored in the table. Many hash table designs also allow arbitrary insertions and deletions of key–value pairs, at amortized constant average cost per operation.[5][4]: 513–558 [6]
Hashing is an example of a space-time tradeoff. If memory is infinite, the entire key can be used directly as an index to locate its value with a single memory access. On the other hand, if infinite time is available, values can be stored without regard for their keys, and a binary search or linear search can be used to retrieve the element.[7]: 458
In many situations, hash tables turn out to be on average more efficient than search trees or any other table lookup structure. Because of their fast average-case performance, they are widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches, and sets.[8] Many programming languages provide built-in hash table structures, such as Python’s dictionaries, Java’s HashMap, and C++’s unordered_map, which abstract the complexity of hashing from the programmer.[9]
The idea of hashing arose independently in different places. In January 1953, Hans Peter Luhn wrote an internal IBM memorandum that used hashing with chaining. The first example of open addressing was proposed by A. D. Linh, building on Luhn's memorandum.[4]: 547 Around the same time, Gene Amdahl, Elaine M. McGraw, Nathaniel Rochester, and Arthur Samuel of IBM Research implemented hashing for the IBM 701 assembler.[10]: 124 Open addressing with linear probing is credited to Amdahl, although Andrey Ershov independently had the same idea.[10]: 124–125 The term "open addressing" was coined by W. Wesley Peterson in his article which discusses the problem of search in large files.[11]: 15
The first published work on hashing with chaining is credited to Arnold Dumey, who discussed the idea of using remainder modulo a prime as a hash function.[11]: 15 The word "hashing" was first published in an article by Robert Morris.[10]: 126 A theoretical analysis of linear probing was submitted originally by Konheim and Weiss.[11]: 15
An associative array stores a set of (key, value) pairs and allows insertion, deletion, and lookup (search), with the constraint of unique keys. In the hash table implementation of associative arrays, an array $A$ of length $m$ is partially filled with $n$ elements, where $m \ge n$. A key $x$ is hashed using a hash function $h$ to compute an index location $A[h(x)]$ in the hash table, where $h(x) < m$. The efficiency of a hash table depends on the load factor, defined as the ratio of the number of stored elements to the number of available slots, with lower load factors generally yielding faster operations.[12] At this index, both the key and its associated value are stored. Storing the key alongside the value ensures that lookups can verify the key at the index to retrieve the correct value, even in the presence of collisions. Under reasonable assumptions, hash tables have better time complexity bounds on search, delete, and insert operations in comparison to self-balancing binary search trees.[11]: 1
Hash tables are also commonly used to implement sets, by omitting the stored value for each key and merely tracking whether the key is present.[11]: 1
A load factor $\alpha$ is a critical statistic of a hash table, and is defined as follows:[2]
$$\text{load factor}\ (\alpha) = \frac{n}{m},$$
where
$n$ is the number of entries occupied in the hash table.
$m$ is the number of buckets.
The performance of the hash table deteriorates in relation to the load factor $\alpha$.[11]: 2 In the limit of large $m$ and $n$, each bucket statistically has a Poisson distribution with expectation $\alpha$ for an ideally random hash function.
The software typically ensures that the load factor $\alpha$ remains below a certain constant, $\alpha_{\max}$. This helps maintain good performance. Therefore, a common approach is to resize or "rehash" the hash table whenever the load factor $\alpha$ reaches $\alpha_{\max}$. Similarly, the table may also be resized if the load factor drops below $\alpha_{\max}/4$.[13]
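The resizing policy can be illustrated with a minimal sketch. The function names, the default upper bound of 0.75, the lower threshold of one quarter of the upper bound, and the `MIN_BUCKETS` floor are all assumptions chosen for illustration; real implementations pick their own constants.

```python
MIN_BUCKETS = 8  # hypothetical floor so that tiny tables are never shrunk


def load_factor(n_entries: int, n_buckets: int) -> float:
    """Return alpha = n / m for a table with n entries and m buckets."""
    if n_buckets <= 0:
        raise ValueError("a table needs at least one bucket")
    return n_entries / n_buckets


def resize_decision(n_entries: int, n_buckets: int, alpha_max: float = 0.75) -> str:
    """Return 'grow', 'shrink' or 'keep' for the given occupancy."""
    alpha = load_factor(n_entries, n_buckets)
    if alpha >= alpha_max:
        return "grow"      # load factor reached the upper bound
    if alpha < alpha_max / 4 and n_buckets > MIN_BUCKETS:
        return "shrink"    # table is mostly empty (assumed lower threshold)
    return "keep"
```

Keeping the shrink threshold well below the grow threshold avoids repeated resizing when insertions and deletions alternate near a boundary.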
With separate chaining hash tables, each slot of the bucket array stores a pointer to a list or array of data.[14]
Separate chaining hash tables suffer gradually declining performance as the load factor grows, and there is no fixed point beyond which resizing is absolutely needed.[13]
With separate chaining, the value of $\alpha_{\max}$ that gives best performance is typically between 1 and 3.[13]
With open addressing, each slot of the bucket array holds exactly one item. Therefore an open-addressed hash table cannot have a load factor greater than 1.[14]
The performance of open addressing becomes very bad when the load factor approaches 1.[13]
Therefore a hash table that uses open addressing must be resized or rehashed if the load factor approaches 1.[13]
With open addressing, acceptable figures of max load factor should range around 0.6 to 0.75.[15][16]: 110
A hash function $h : U \rightarrow \{0, \ldots, m-1\}$ maps the universe $U$ of keys to indices or slots within the table, that is, $h(x) \in \{0, \ldots, m-1\}$ for $x \in U$. The conventional implementations of hash functions are based on the integer universe assumption that all elements of the table stem from the universe $U = \{0, \ldots, u-1\}$, where the bit length of $u$ is confined within the word size of a computer architecture.[11]: 2
A hash function $h$ is said to be perfect for a given set $S$ if it is injective on $S$, that is, if each element $x \in S$ maps to a different value in $\{0, \ldots, m-1\}$.[17][18] A perfect hash function can be created if all the keys are known ahead of time.[17]
The scheme in hashing by multiplication is as follows:[11]: 2–3
$$h(x) = \lfloor m \bigl( (xA) \bmod 1 \bigr) \rfloor$$
where $A$ is a non-integer real-valued constant and $m$ is the size of the table. An advantage of hashing by multiplication is that the choice of $m$ is not critical.[11]: 2–3 Although any value $A$ produces a hash function, Donald Knuth suggests using the golden ratio.[11]: 3
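As an illustration, the multiplication scheme in its commonly stated form, $h(x) = \lfloor m((xA) \bmod 1) \rfloor$, can be sketched in a few lines. The function name is hypothetical, and the constant used is the fractional part of the golden ratio, $(\sqrt{5}-1)/2$; for integer keys this gives the same value of $(xA) \bmod 1$ as the golden ratio itself. Floating-point arithmetic is assumed to be precise enough for the key range used.

```python
import math

# Fractional part of the golden ratio, about 0.618 (an assumed choice of A).
GOLDEN_FRACTION = (math.sqrt(5) - 1) / 2


def multiplicative_hash(x: int, m: int, a: float = GOLDEN_FRACTION) -> int:
    """Map the integer key x to a slot in 0..m-1 by hashing by multiplication."""
    fractional = (x * a) % 1.0            # (xA) mod 1, a value in [0, 1)
    # min() guards against floating-point rounding pushing the product up to m.
    return min(m - 1, math.floor(m * fractional))
```

For example, with a table of size 10, the keys 1, 2 and 3 map to slots 6, 2 and 8, so consecutive keys are spread across the table.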
Commonly a string is used as a key to the hash function. Stroustrup[20] describes a simple hash function in which an unsigned integer that is initially zero is repeatedly left shifted one bit and then xor'ed with the integer value of the next character. This hash value is then taken modulo the table size. If the left shift is not circular, then the string length should be at least eight bits less than the size of the unsigned integer in bits. Another common way to hash a string to an integer is with a polynomial rolling hash function.
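Both string hashing approaches can be sketched as follows. This is an illustrative reading of the description above rather than Stroustrup's own code: the function names, the 32-bit word width, the UTF-8 encoding of characters and the polynomial base 31 are assumptions.

```python
def shift_xor_hash(key: str, table_size: int, bits: int = 32) -> int:
    """Shift-and-XOR string hash, reduced modulo the table size."""
    mask = (1 << bits) - 1                # emulate a fixed-width unsigned integer
    h = 0
    for byte in key.encode("utf-8"):
        h = ((h << 1) ^ byte) & mask      # non-circular left shift, then XOR
    return h % table_size


def polynomial_hash(key: str, table_size: int, base: int = 31) -> int:
    """Polynomial rolling hash evaluated with Horner's rule."""
    h = 0
    for byte in key.encode("utf-8"):
        h = (h * base + byte) % table_size
    return h
```

Because the shift is not circular, bits shifted past the word width are lost, which is why the description above limits the useful string length.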
Uniform distribution of the hash values is a fundamental requirement of a hash function. A non-uniform distribution increases the number of collisions and the cost of resolving them. Uniformity is sometimes difficult to ensure by design, but may be evaluated empirically using statistical tests, e.g., Pearson's chi-squared test for discrete uniform distributions.[21][22]
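A minimal sketch of such an empirical check is given below. It computes only the Pearson chi-squared statistic of the observed bucket counts against a uniform expectation; comparing the statistic with a critical value for $m-1$ degrees of freedom is left out. The function names are hypothetical.

```python
def bucket_counts(keys, m, hash_fn=hash):
    """Count how many keys fall into each of the m buckets."""
    counts = [0] * m
    for key in keys:
        counts[hash_fn(key) % m] += 1
    return counts


def chi_squared_statistic(counts):
    """Pearson statistic: sum of (observed - expected)^2 / expected."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no observations")
    expected = total / len(counts)        # uniform expectation per bucket
    return sum((c - expected) ** 2 / expected for c in counts)
```

A statistic near zero indicates counts close to uniform, while a large value (for instance when every key lands in one bucket) indicates a poor distribution for that key set.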
The distribution needs to be uniform only for table sizes that occur in the application. In particular, if one uses dynamic resizing with exact doubling and halving of the table size, then the hash function needs to be uniform only when the size is a power of two. Here the index can be computed as some range of bits of the hash function. On the other hand, some hashing algorithms prefer to have the size be a prime number.[23]
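When the table size is a power of two, taking the low-order bits of the hash code is one way to select "some range of bits". The sketch below assumes that choice; the function name is hypothetical.

```python
def index_for(hash_code: int, table_size: int) -> int:
    """Select a slot using the low-order bits of hash_code."""
    if table_size <= 0 or table_size & (table_size - 1):
        raise ValueError("table_size must be a power of two")
    # For a power-of-two size, masking is equivalent to hash_code % table_size.
    return hash_code & (table_size - 1)
```

Using only the low bits means that a hash function whose low bits are poorly distributed will cluster keys, which is one reason some designs prefer a prime table size.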
For open addressing schemes, the hash function should also avoid clustering, the mapping of two or more keys to consecutive slots. Such clustering may cause the lookup cost to skyrocket, even if the load factor is low and collisions are infrequent. The popular multiplicative hash is claimed to have particularly poor clustering behavior.[23][4]
K-independent hashing offers a way to prove a certain hash function does not have bad keysets for a given type of hash table. A number of K-independence results are known for collision resolution schemes such as linear probing and cuckoo hashing. Since K-independence can prove a hash function works, one can then focus on finding the fastest possible such hash function.[24]
A search algorithm that uses hashing consists of two parts. The first part is computing a hash function which transforms the search key into an array index. The ideal case is such that no two search keys hash to the same array index. However, this is not always the case and is impossible to guarantee for unseen data.[4]: 515 Hence the second part of the algorithm is collision resolution. The two common methods for collision resolution are separate chaining and open addressing.[7]: 458
Figure: Hash collision resolved by separate chaining.
Figure: Hash collision by separate chaining with head records in the bucket array.
In separate chaining, the process involves building a linked list of key–value pairs for each search array index. The collided items are chained together through a single linked list, which can be traversed to access the item with a unique search key.[7]: 464 Collision resolution through chaining with linked lists is a common method of implementing hash tables. Let $T$ and $x$ be the hash table and the node respectively; the operations are as follows:[19]: 258
Chained-Hash-Insert(T, k)
    insert x at the head of linked list T[h(k)]
Chained-Hash-Search(T, k)
    search for an element with key k in linked list T[h(k)]
Chained-Hash-Delete(T, k)
    delete x from the linked list T[h(k)]
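The three chaining operations can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the class and method names are hypothetical, Python's built-in `hash` stands in for $h$, the bucket count is fixed, and insertion updates an existing key instead of adding a duplicate.

```python
class Node:
    """One (key, value) pair in a bucket's singly linked list."""

    def __init__(self, key, value, next_node=None):
        self.key = key
        self.value = value
        self.next = next_node


class ChainedHashTable:
    def __init__(self, n_buckets=8):
        self.buckets = [None] * n_buckets   # each slot holds a list head or None
        self.size = 0

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def search(self, key):
        """Return the node holding key, or None if the key is absent."""
        node = self.buckets[self._index(key)]
        while node is not None:
            if node.key == key:
                return node
            node = node.next
        return None

    def insert(self, key, value):
        """Update an existing key, or insert a new node at the head of its chain."""
        node = self.search(key)
        if node is not None:
            node.value = value
            return
        i = self._index(key)
        self.buckets[i] = Node(key, value, self.buckets[i])
        self.size += 1

    def delete(self, key):
        """Unlink the node holding key; return True if a node was removed."""
        i = self._index(key)
        prev, node = None, self.buckets[i]
        while node is not None:
            if node.key == key:
                if prev is None:
                    self.buckets[i] = node.next
                else:
                    prev.next = node.next
                self.size -= 1
                return True
            prev, node = node, node.next
        return False
```

With eight buckets, the integer keys 1, 9 and 17 all hash to bucket 1 in CPython and therefore end up on the same chain, with the most recently inserted key at the head.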
If the element is comparable either numerically or lexically, and inserted into the list by maintaining the total order, it results in faster termination of the unsuccessful searches.[4]: 520–521
In dynamic perfect hashing, two-level hash tables are used to reduce the look-up complexity to a guaranteed $O(1)$ in the worst case. In this technique, the buckets of $k$ entries are organized as perfect hash tables with $k^2$ slots, providing constant worst-case lookup time and low amortized time for insertion.[25] A study shows array-based separate chaining to be 97% more performant when compared to the standard linked list method under heavy load.[26]: 99
Techniques such as using a fusion tree for each bucket also result in constant time for all operations with high probability.[27]
The linked list of a separate chaining implementation may not be cache-conscious because of poor spatial locality (locality of reference): when the nodes of the linked list are scattered across memory, the list traversal during insert and search may entail CPU cache inefficiencies.[26]: 91
Figure: Hash collision resolved by open addressing with linear probing (interval=1). Note that "Ted Baker" has a unique hash, but nevertheless collided with "Sandra Dee", that had previously collided with "John Smith".
Figure: This graph compares the average number of CPU cache misses required to look up elements in large hash tables (far exceeding size of the cache) with chaining and linear probing. Linear probing performs better due to better locality of reference, though as the table gets full, its performance degrades drastically.
Open addressing is another collision resolution technique in which every entry record is stored in the bucket array itself, and the hash resolution is performed through probing. When a new entry has to be inserted, the buckets are examined, starting with the hashed-to slot and proceeding in some probe sequence, until an unoccupied slot is found. When searching for an entry, the buckets are scanned in the same sequence, until either the target record is found, or an unused array slot is found, which indicates an unsuccessful search.[31]
Well-known probe sequences include:
Linear probing, in which the interval between probes is fixed (usually 1).[32]
Quadratic probing, in which the interval between probes is increased by adding the successive outputs of a quadratic polynomial to the value given by the original hash computation.[33]: 272
Double hashing, in which the interval between probes is computed by a secondary hash function.[33]: 272–273
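The probe sequences listed above, together with insertion and search under linear probing, can be sketched as follows. All names are hypothetical; the quadratic polynomial $i^2$, the use of Python's built-in `hash`, and the omission of deletion (which would need tombstone markers) are simplifying assumptions.

```python
def linear_probe(h, i, m):
    """i-th slot of a linear probe sequence with interval 1."""
    return (h + i) % m


def quadratic_probe(h, i, m):
    """i-th slot when the offset is the quadratic polynomial i*i (assumed)."""
    return (h + i * i) % m


def double_hash_probe(h, h2, i, m):
    """i-th slot when the interval h2 comes from a second hash; h2 must be nonzero."""
    return (h + i * h2) % m


_EMPTY = object()   # sentinel marking a slot that has never been used


class LinearProbingTable:
    def __init__(self, n_slots=8):
        self.keys = [_EMPTY] * n_slots
        self.values = [None] * n_slots
        self.size = 0

    def _probe(self, key):
        m = len(self.keys)
        start = hash(key) % m
        for i in range(m):              # at most m probes, so no infinite loop
            yield linear_probe(start, i, m)

    def insert(self, key, value):
        """Store the pair and return the slot index that was used."""
        for j in self._probe(key):
            if self.keys[j] is _EMPTY:
                self.keys[j], self.values[j] = key, value
                self.size += 1
                return j
            if self.keys[j] == key:     # key already present: overwrite value
                self.values[j] = value
                return j
        raise RuntimeError("table is full")

    def search(self, key):
        """Return the stored value, or None after reaching an unused slot."""
        for j in self._probe(key):
            if self.keys[j] is _EMPTY:
                return None
            if self.keys[j] == key:
                return self.values[j]
        return None
```

Bounding the probe loop by the table size is one way to avoid the endless probing that would otherwise occur on a completely filled table.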
The performance of open addressing may be slower compared to separate chaining since the probe sequence increases when the load factor approaches 1.[13][26]: 93 The probing results in an infinite loop if the load factor reaches 1, in the case of a completely filled table.[7]: 471 The average cost of linear probing depends on the hash function's ability to distribute the elements uniformly throughout the table to avoid clustering, since formation of clusters would result in increased search time.[7]: 472
Coalesced hashing is a hybrid of both separate chaining and open addressing in which the buckets or nodes link within the table.[34]: 6–8 The algorithm is ideally suited for fixed memory allocation.[34]: 4 The collision in coalesced hashing is resolved by identifying the largest-indexed empty slot on the hash table, then the colliding value is inserted into that slot. The bucket is also linked to the inserted node's slot which contains its colliding hash address.[34]: 8
Cuckoo hashing is a form of open addressing collision resolution technique which guarantees $O(1)$ worst-case lookup complexity and constant amortized time for insertions. The collision is resolved through maintaining two hash tables, each having its own hashing function: the collided slot gets replaced with the given item, and the preoccupied element of the slot gets displaced into the other hash table. The process continues until every key has its own spot in the empty buckets of the tables; if the procedure enters into an infinite loop, which is identified through maintaining a threshold loop counter, both hash tables get rehashed with newer hash functions and the procedure continues.[35]: 124–125
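A minimal sketch of cuckoo hashing with two tables is shown below. Several details are assumptions made for illustration: the class name, the `MAX_KICKS` threshold, the salted use of Python's built-in `hash` as the pair of hash functions, and the decision to double the table size on every rehash (the description above only requires new hash functions).

```python
import random


class CuckooHashTable:
    MAX_KICKS = 32   # hypothetical threshold for detecting a displacement loop

    def __init__(self, slots_per_table=8, seed=0):
        self._rng = random.Random(seed)
        self._m = slots_per_table
        self._tables = [[None] * self._m, [None] * self._m]
        self._new_hash_functions()

    def _new_hash_functions(self):
        # Two salts stand in for two independent hash functions (an assumption).
        self._salts = [self._rng.getrandbits(64), self._rng.getrandbits(64)]

    def _index(self, which, key):
        return hash((self._salts[which], key)) % self._m

    def search(self, key):
        """Inspect at most two slots, one per table."""
        for which in (0, 1):
            entry = self._tables[which][self._index(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def insert(self, key, value):
        for which in (0, 1):                 # overwrite if the key already exists
            i = self._index(which, key)
            entry = self._tables[which][i]
            if entry is not None and entry[0] == key:
                self._tables[which][i] = (key, value)
                return
        entry, which = (key, value), 0
        for _ in range(self.MAX_KICKS):
            i = self._index(which, entry[0])
            # Place the entry and pick up whatever occupied the slot before.
            entry, self._tables[which][i] = self._tables[which][i], entry
            if entry is None:
                return
            which = 1 - which                # displaced entry goes to the other table
        self._rehash(entry)

    def _rehash(self, pending):
        """Rebuild both tables with new hash functions, keeping every entry."""
        entries = [e for table in self._tables for e in table if e is not None]
        entries.append(pending)
        self._m *= 2                         # assumption: also grow the tables
        self._tables = [[None] * self._m, [None] * self._m]
        self._new_hash_functions()
        for k, v in entries:
            self.insert(k, v)
```

Lookups never probe beyond the two candidate slots, which is where the constant worst-case lookup bound comes from; the cost is shifted to insertion.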
Hopscotch hashing is an open addressing based algorithm which combines the elements of cuckoo hashing, linear probing and chaining through the notion of a neighbourhood of buckets: the subsequent buckets around any given occupied bucket, also called a "virtual" bucket.[36]: 351–352 The algorithm is designed to deliver better performance when the load factor of the hash table grows beyond 90%; it also provides high throughput in concurrent settings, and is thus well suited for implementing a resizable concurrent hash table.[36]: 350 The neighbourhood characteristic of hopscotch hashing guarantees that the cost of finding the desired item from any given bucket within the neighbourhood is very close to the cost of finding it in the bucket itself; the algorithm attempts to bring an item into its neighbourhood, with a possible cost involved in displacing other items.[36]: 352
Each bucket within the hash table includes an additional "hop-information": an H-bit bit array for indicating the relative distance of the item which was originally hashed into the current virtual bucket within H − 1 entries.[36]: 352 Let $k$ and $B_k$ be the key to be inserted and the bucket to which the key is hashed, respectively; several cases are involved in the insertion procedure such that the neighbourhood property of the algorithm is preserved:[36]: 352–353 if $B_k$ is empty, the element is inserted, and the leftmost bit of the bitmap is set to 1; if not empty, linear probing is used for finding an empty slot in the table, and the bitmap of the bucket gets updated followed by the insertion; if the empty slot is not within the range of the neighbourhood, i.e. H − 1, subsequent swap and hop-info bit array manipulation of each bucket is performed in accordance with its neighbourhood invariant properties.[36]: 353
Robin Hood hashing is an open addressing based collision resolution algorithm; the collisions are resolved by favouring the element that is farthest from its "home location" (that is, the element with the longest probe sequence length, PSL), which displaces a resident element that is closer to its own home location. The home location is the bucket to which the item was hashed.[37]: 12 Although Robin Hood hashing does not change the theoretical search cost, it significantly affects the variance of the distribution of the items on the buckets,[38]: 2 i.e. dealing with cluster formation in the hash table.[39] Each node within the hash table that uses Robin Hood hashing should be augmented to store an extra PSL value.[40] Let $x$ be the key to be inserted, $x.\mathrm{psl}$ be the (incremental) PSL length of $x$, $T$ be the hash table and $j$ be the index; the insertion procedure is as follows:[37]: 12–13 [41]: 5
If $x.\mathrm{psl} \le T[j].\mathrm{psl}$: the iteration goes into the next bucket without attempting an external probe.
If $x.\mathrm{psl} > T[j].\mathrm{psl}$: insert the item $x$ into the bucket $j$; swap $x$ with $T[j]$ (let it be $x'$); continue the probe from the $(j+1)$th bucket to insert $x'$; repeat the procedure until every element is inserted.
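The insertion rule, under which an incoming entry takes over a bucket whose resident has a smaller PSL and the displaced resident continues probing, can be sketched as follows. The class name, the list-based slot layout `[key, value, psl]`, linear probing with interval 1, and the fixed table size are assumptions made for illustration.

```python
class RobinHoodTable:
    def __init__(self, n_slots=8):
        self.slots = [None] * n_slots     # each slot: None or [key, value, psl]
        self.size = 0

    def search(self, key):
        """Return the slot entry for key, or None if the key is absent."""
        m = len(self.slots)
        j = hash(key) % m
        for psl in range(m):
            resident = self.slots[j]
            # A resident closer to home than we are means the key cannot be here.
            if resident is None or resident[2] < psl:
                return None
            if resident[0] == key:
                return resident
            j = (j + 1) % m
        return None

    def insert(self, key, value):
        found = self.search(key)
        if found is not None:             # existing key: just update the value
            found[1] = value
            return
        m = len(self.slots)
        if self.size == m:
            raise RuntimeError("table is full")
        entry = [key, value, 0]
        j = hash(key) % m
        while True:
            resident = self.slots[j]
            if resident is None:
                self.slots[j] = entry
                self.size += 1
                return
            if entry[2] > resident[2]:
                # The entry being inserted is farther from home: it takes the
                # bucket and the displaced resident continues probing.
                self.slots[j], entry = entry, resident
            entry[2] += 1
            j = (j + 1) % m
```

For example, in an eight-slot table where keys 1 and 0 are already at home, inserting key 8 (which hashes to slot 0 in CPython) displaces key 1 from slot 1 to slot 2, leaving PSL values 0, 1, 1.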
Repeated insertions cause the number of entries in a hash table to grow, which consequently increases the load factor; to maintain the amortized $O(1)$ performance of the lookup and insertion operations, a hash table is dynamically resized and the items of the table are rehashed into the buckets of the new hash table,[13] since the items cannot simply be copied over: a different table size results in a different hash value due to the modulo operation.[42] If a hash table becomes "too empty" after deleting some elements, resizing may be performed to avoid excessive memory usage.[43]
Generally, a new hash table with a size double that of the original hash table gets allocated privately and every item in the original hash table gets moved to the newly allocated one by computing the hash values of the items followed by the insertion operation. Rehashing is simple, but computationally expensive.[44]: 478–479
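A minimal sketch of rehashing all at once is given below. It assumes buckets are represented as Python lists of (key, value) pairs and that the index is the hash value modulo the table size; the function name is hypothetical.

```python
def rehash(buckets, new_size):
    """Move every (key, value) pair into a freshly allocated bucket array."""
    new_buckets = [[] for _ in range(new_size)]
    for bucket in buckets:
        for key, value in bucket:
            # The index must be recomputed because it depends on the table size.
            new_buckets[hash(key) % new_size].append((key, value))
    return new_buckets
```

With eight buckets, the integer keys 3 and 11 share bucket 3 in CPython; after doubling to sixteen buckets they separate into buckets 3 and 11, which shows why entries cannot simply be copied to the same positions.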
Some hash table implementations, notably in real-time systems, cannot pay the price of enlarging the hash table all at once, because it may interrupt time-critical operations. If one cannot avoid dynamic resizing, a solution is to perform the resizing gradually, both to avoid a storage blip during rehashing (typically at 50% of the new table's size) and to avoid memory fragmentation that triggers heap compaction due to deallocation of the large memory blocks used by the old hash table.[45]: 2–3 In such a case, the rehashing operation is done incrementally by extending the memory block previously allocated for the old hash table, such that the buckets of the hash table remain unaltered. A common approach for amortized rehashing involves maintaining two hash functions $h_\text{old}$ and $h_\text{new}$. The process of rehashing a bucket's items in accordance with the new hash function is termed cleaning, which is implemented through the command pattern by encapsulating operations such as $\mathrm{Add}(\mathrm{key})$, $\mathrm{Get}(\mathrm{key})$ and $\mathrm{Delete}(\mathrm{key})$ through a $\mathrm{Lookup}(\mathrm{key}, \text{command})$ wrapper, such that each element in the bucket gets rehashed. The procedure is as follows:[45]: 3 clean the $\mathrm{Table}[h_\text{old}(\mathrm{key})]$ bucket; clean the $\mathrm{Table}[h_\text{new}(\mathrm{key})]$ bucket; then execute the command.
The performance of a hash table is dependent on the hash function's ability to generate quasi-random numbers ($\sigma$) for entries in the hash table, where $K$, $m$ and $h(x)$ denote the key, the number of buckets and the hash function, such that $\sigma = h(K) \bmod m$. If the hash function generates the same $\sigma$ for distinct keys ($K_1 \ne K_2,\ h(K_1) = h(K_2)$), this results in a collision, which is dealt with in a variety of ways. The constant time complexity ($O(1)$) of the operation in a hash table is presupposed on the condition that the hash function doesn't generate colliding indices; thus, the performance of the hash table is directly proportional to the chosen hash function's ability to disperse the indices.[47]: 1 However, construction of such a hash function is practically infeasible; for this reason, implementations depend on case-specific collision resolution techniques in achieving higher performance.[47]: 2
The best performance is obtained in the case that the hash function distributes the elements of the universe uniformly, and the elements stored at the table are drawn at random from the universe. In this case, in hashing with chaining, the expected time for a successful search is $1 + \frac{\alpha}{2} + \Theta\left(\frac{1}{m}\right)$, and the expected time for an unsuccessful search is $e^{-\alpha} + \alpha + \Theta\left(\frac{1}{m}\right)$.[48]
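As a worked example, the classical leading-order estimates for hashing with chaining, $1 + \alpha/2$ for a successful search and $e^{-\alpha} + \alpha$ for an unsuccessful one, can be evaluated for a few load factors. The sketch below assumes these leading terms and ignores the lower-order terms that vanish as the number of buckets grows; the function name is hypothetical.

```python
import math


def expected_chaining_cost(alpha: float):
    """Return (successful, unsuccessful) leading-order search cost estimates."""
    successful = 1 + alpha / 2
    unsuccessful = math.exp(-alpha) + alpha
    return successful, unsuccessful


for alpha in (0.5, 1.0, 2.0):
    hit, miss = expected_chaining_cost(alpha)
    print(f"alpha={alpha}: successful ~ {hit:.3f}, unsuccessful ~ {miss:.3f}")
```

At a load factor of 1, the estimates are 1.5 for a successful search and about 1.368 for an unsuccessful one, so both remain small constants as long as the load factor is bounded.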
Hash tables may also be used as disk-based data structures and database indices (such as in dbm) although B-trees are more popular in these applications.[49]
Hash tables can be used to implement caches, auxiliary data tables that are used to speed up the access to data that is primarily stored in slower media. In this application, hash collisions can be handled by discarding one of the two colliding entries, usually erasing the old item that is currently stored in the table and overwriting it with the new item, so every item in the table has a unique hash value.[50][51]
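The overwrite-on-collision policy can be sketched as a small fixed-size cache. The class and method names are hypothetical, and Python's built-in `hash` modulo the slot count is assumed as the index function.

```python
class OverwritingCache:
    """Fixed-size cache in which a colliding entry replaces the stored one."""

    def __init__(self, n_slots=64):
        self.slots = [None] * n_slots     # each slot: None or (key, value)

    def _index(self, key):
        return hash(key) % len(self.slots)

    def put(self, key, value):
        # No collision resolution: the new item simply evicts the old one.
        self.slots[self._index(key)] = (key, value)

    def get(self, key, default=None):
        entry = self.slots[self._index(key)]
        if entry is not None and entry[0] == key:
            return entry[1]
        return default                    # miss: caller fetches from slower storage
```

Because an evicted item can always be fetched again from the slower medium, losing it on a collision affects only performance, not correctness.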
Hash tables can be used in the implementation of the set data structure, which can store unique values without any particular order; a set is typically used in testing the membership of a value in the collection, rather than element retrieval.[52]
Many programming languages provide hash table functionality, either as built-in associative arrays or as standard library modules.
In JavaScript, an "object" is a mutable collection of key-value pairs (called "properties"), where each key is either a string or a guaranteed-unique "symbol"; any other value, when used as a key, is first coerced to a string. Aside from the seven "primitive" data types, every value in JavaScript is an object.[54] ECMAScript 2015 also added the Map data structure, which accepts arbitrary values as keys.[55]
Martin Farach-Colton; Andrew Krapivin; William Kuszmaul. Optimal Bounds for Open Addressing Without Reordering. 2024 IEEE 65th Annual Symposium on Foundations of Computer Science (FOCS). arXiv:2501.02305. doi:10.1109/FOCS61266.2024.00045.
Owolabi, Olumide (February 2003). "Empirical studies of some hashing functions". Information and Software Technology. 45 (2): 109–112. doi:10.1016/S0950-5849(02)00174-X.
Demaine, Erik; Lind, Jeff (Spring 2003). "Lecture 2" (PDF). 6.897: Advanced Data Structures. MIT Computer Science and Artificial Intelligence Laboratory. Archived (PDF) from the original on June 15, 2010. Retrieved June 30, 2008.
Culpepper, J. Shane; Moffat, Alistair (2005). "Enhanced Byte Codes with Restricted Prefix Properties". String Processing and Information Retrieval. Lecture Notes in Computer Science. Vol. 3772. pp. 1–12. doi:10.1007/11575832_1. ISBN 978-3-540-29740-6.
Askitis, Nikolas; Sinha, Ranjan (October 2010). "Engineering scalable, cache and space efficient tries for strings". The VLDB Journal. 19 (5): 633–660. doi:10.1007/s00778-010-0183-9.
Askitis, Nikolas; Zobel, Justin (October 2005). "Cache-conscious Collision Resolution in String Hash Tables". Proceedings of the 12th International Conference, String Processing and Information Retrieval (SPIRE 2005). Vol. 3772/2005. pp. 91–102. doi:10.1007/11575832_11. ISBN 978-3-540-29740-6.
Tenenbaum, Aaron M.; Langsam, Yedidyah; Augenstein, Moshe J. (1990). Data Structures Using C. Prentice Hall. pp. 456–461, p. 472. ISBN 978-0-13-199746-2.
Celis, Pedro (March 28, 1988). External Robin Hood Hashing (PDF) (Technical report). Bloomington, Indiana: Indiana University, Department of Computer Science. 246. Archived (PDF) from the original on November 3, 2021. Retrieved November 2, 2021.
Baeza-Yates, Ricardo; Poblete, Patricio V. (1999). "Chapter 2: Searching". In Atallah (ed.). Algorithms and Theory of Computation Handbook. CRC Press. pp. 2–6. ISBN 0849326494.