Table of Contents
Invented over half a century ago, thehash table is a classicdata structure that has been fundamental to programming. To this day, it helps solve many real-life problems, such as indexing database tables, caching computed values, or implementing sets. It often comes up injob interviews, and Python uses hash tables all over the place to make name lookups almost instantaneous.
Even though Python comes with its own hash table calleddict
, it can be helpful to understand how hash tables work behind the curtain. A coding assessment may even task you with building one. This tutorial will walk you through the steps of implementing a hash table from scratch as if there were none in Python. Along the way, you’ll face a few challenges that’ll introduce important concepts and give you an idea of why hash tables are so fast.
In addition to this, you’ll get a hands-on crash course intest-driven development (TDD) and will actively practice it while building your hash table in a step-by-step fashion. You’re not required to have any prior experience with TDD, but at the same time, you won’t get bored even if you do!
In this tutorial, you’ll learn:
hash()
works behind the scenesIt’ll help if you’re already familiar withPython dictionaries and have basic knowledge ofobject-oriented programming principles. To get the complete source code and the intermediate steps of the hash table implemented in this tutorial, follow the link below:
Source Code:Click here to download the source code that you’ll use to build a hash table in Python.
Before diving deeper, you should familiarize yourself with the terminology because it can get slightly confusing. Colloquially, the termhash table orhash map is often used interchangeably with the worddictionary. However, there’s a subtle difference between the two concepts as the former is more specific than the latter.
In computer science, adictionary is anabstract data type made up ofkeys andvalues arranged in pairs. Moreover, it defines the following operations for those elements:
In a sense, this abstract data type resembles abilingual dictionary, where the keys are foreign words, and the values are their definitions or translations to other languages. But there doesn’t always have to be a sense of equivalence between keys and values. Aphone book is another example of a dictionary, which combines names with the corresponding phone numbers.
Note: Anytime youmap one thing to another orassociate a value with a key, you’re essentially using a kind of a dictionary. That’s why dictionaries are also known asmaps orassociative arrays.
Dictionaries have a few interesting properties. One of them is that you can think of a dictionary as amathematical function that projects one or more arguments to exactly one value. The direct consequences of that fact are the following:
There are related concepts that extend the idea of a dictionary. For example, amultimap lets you have more than one value per key, while abidirectional map not only maps keys to values but also provides mapping in the opposite direction. However, in this tutorial, you’re only going to consider the regular dictionary, which maps exactly one value to each key.
Here’s a graphical depiction of a hypothetical dictionary, which maps some abstract concepts to their corresponding English words:
It’s a one-way map of keys to values, which are two completely different sets of elements. Right away, you can see fewer values than keys because the wordbow happens to be ahomonym with multiple meanings. Conceptually, this dictionary still contains four pairs, though. Depending on how you decided to implement it, you could reuse repeated values to conserve memory or duplicate them for simplicity.
Now, how do you code such a dictionary in a programming language? The right answer is youdon’t, because most modern languages come with dictionaries as eitherprimitive data types or classes in theirstandard libraries. Python ships with a built-indict
type, which already wraps a highly optimized data structure written in C so that you don’t have to write a dictionary yourself.
Python’sdict
lets you perform all the dictionary operations listed at the beginning of this section:
>>>glossary={"BDFL":"Benevolent Dictator For Life"}>>>glossary["GIL"]="Global Interpreter Lock"# Add>>>glossary["BDFL"]="Guido van Rossum"# Update>>>delglossary["GIL"]# Delete>>>glossary["BDFL"]# Search'Guido van Rossum'>>>glossary{'BDFL': 'Guido van Rossum'}
With thesquare bracket syntax ([ ]
), you can add a new key-value pair to a dictionary. You can also update the value of or delete an existing pair identified by a key. Finally, you can look up the value associated with the given key.
That said, you may ask a different question. How does the built-in dictionary actually work? How does it map keys of arbitrary data types, and how does it do it so quickly?
Finding an efficient implementation of this abstract data type is known as thedictionary problem. One of the most well-known solutions takes advantage of thehash table data structure that you’re about to explore. However, note that it isn’t the only way to implement a dictionary in general. Another popular implementation builds on top of ared-black tree.
Have you ever wondered why accessingsequence elements in Python works so quickly, regardless of which index you request? Say you were working with a very long string of characters, like this one:
>>>importstring>>>text=string.ascii_uppercase*100_000_000>>>text[:50]# Show the first 50 characters'ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWX'>>>len(text)2600000000
There are 2.6 billion characters from repeatingASCII letters in thetext
variable above, which you can count withPython’slen()
function. Yet, getting the first, middle, last, or any other character from this string is equally quick:
>>>text[0]# The first element'A'>>>text[len(text)//2]# The middle element'A'>>>text[-1]# The last element, same as text[len(text) - 1]'Z'
The same is true for all sequence types in Python, such aslists and tuples. How come? The secret to such a blazing speed is that sequences in Python are backed by anarray, which is arandom-access data structure. It follows two principles:
When you know the memory address of an array, which is called theoffset, then you can get to the desired element in the array instantly by calculating a fairly straightforward formula:
You start at the array’s offset, which is also the address of the array’s first element, with the index zero. Next, you move forward by adding the required number of bytes, which you get by multiplying the element size by the target element’s index. It always takes the same amount of time to multiply and add a few numbers together.
Note: Unlike arrays, Python’s lists can contain heterogeneous elements of varying sizes, which would break the above formula. To mitigate this, Python adds another level of indirection by introducing an array ofpointers to memory locations rather than storing values directly in the array:
Pointers are merely integer numbers, which always take up the same amount of space. It’s customary to denote memory addresses using thehexadecimal notation. Python and some other languages prefix such numbers with0x
.
Okay, so you know that finding an element in an array is quick, no matter where that element is physically located. Can you take the same idea and reuse it in a dictionary? Yes!
Hash tables get their name from a trick calledhashing, which lets them translate an arbitrary key into an integer number that can work as an index in a regular array. So, instead of searching a value by a numeric index, you’ll look it up by an arbitrary key without a noticeable performance loss. That’s neat!
In practice, hashing won’t work with every key, but most built-in types in Python can be hashed. If you follow a few rules, then you’ll be able to create your ownhashable types too. You’ll learn more about hashing in the next section.
Ahash function performs hashing by turning any data into a fixed-size sequence of bytes called thehash value or thehash code. It’s a number that can act as a digital fingerprint or adigest, usually much smaller than the original data, which lets you verify its integrity. If you’ve ever fetched a large file from the Internet, such as a disk image of a Linux distribution, then you may have noticed anMD5 orSHA-2checksum on the download page.
Aside from verifying data integrity and solving the dictionary problem, hash functions help in other fields, includingsecurity andcryptography. For example, you typically store hashed passwords in databases to mitigate the risk of data leaks.Digital signatures involve hashing to create a message digest before encryption.Blockchain transactions are another prime example of using a hash function for cryptographic purposes.
Note: Acryptographic hash function is a special type of hash function that must meet a few additional requirements. In this tutorial, you’ll only encounter the most basic form of the hash function used in the hash table data structure, though.
While there are manyhashing algorithms, they all share a few common properties that you’re about to discover in this section. Implementing a good hash function correctly is a difficult task that may require the understanding of advanced math involvingprime numbers. Fortunately, you don’t usually need to implement such an algorithm by hand.
Python comes with a built-inhashlib module, which provides a variety of well-known cryptographic hash functions, as well as less secure checksum algorithms. The language also has a globalhash()
function, used primarily for quick element lookup indictionaries andsets. You can study how it works first to learn about the most important properties of hash functions.
hash()
Before taking a stab at implementing a hash function from scratch, hold on for a moment and analyze Python’shash()
to distill its properties. This will help you understand what problems are involved when designing a hash function of your own.
Note: The choice of a hash function can dramatically impact your hash table’s performance. Therefore, you’ll rely on the built-inhash()
function when building a custom hash table later on in this tutorial. Implementing a hash function in this section only serves as an exercise.
For starters, try your hand at callinghash()
on a few data type literals built into Python, such as numbers and strings, to see how the function behaves:
>>>hash(3.14)322818021289917443>>>hash(3.14159265358979323846264338327950288419716939937510)326490430436040707>>>hash("Lorem")7677195529669851635>>>hash("Lorem ipsum dolor sit amet, consectetur adipisicing elit,"..."sed do eiusmod tempor incididunt ut labore et dolore magna"..."aliqua. Ut enim ad minim veniam, quis nostrud exercitation"..."ullamco laboris nisi ut aliquip ex ea commodo consequat."..."Duis aute irure dolor in reprehenderit in voluptate velit"..."esse cillum dolore eu fugiat nulla pariatur. Excepteur sint"..."occaecat cupidatat non proident, sunt in culpa qui officia"..."deserunt mollit anim id est laborum.")1107552240612593693
There are already several observations that you can make by looking at the result. First, the built-in hash function may return different values on your end for some of the inputs shown above. While the numeric input always seems to produce an identical hash value, the string most likely doesn’t. Why is that? It may seem likehash()
is anon-deterministic function, but that couldn’t be further from the truth!
When you callhash()
with the same argument within your existing interpreter session, then you’ll keep getting the same result:
>>>hash("Lorem")7677195529669851635>>>hash("Lorem")7677195529669851635>>>hash("Lorem")7677195529669851635
That’s because hash values areimmutable and don’t change throughout an object’s lifetime. However, as soon as you exit Python and start it again, then you’ll almost certainly see different hash values across Python invocations. You can test this by trying the-c
option to run aone-liner script in your terminal:
C:\> python -c print(hash('Lorem'))6182913096689556094C:\> python -c print(hash('Lorem'))1756821463709528809C:\> python -c print(hash('Lorem'))8971349716938911741
$python-c'print(hash("Lorem"))'6182913096689556094$python-c'print(hash("Lorem"))'1756821463709528809$python-c'print(hash("Lorem"))'8971349716938911741
That’s expected behavior, which was implemented in Python as a countermeasure against aDenial-of-Service (DoS) attack that exploited aknown vulnerability of hash functions in web servers. Attackers could abuse a weak hash algorithm to deliberately create so-called hash collisions, overloading the server and making it inaccessible. Ransom was a typical motive for the attack as most victims made money through an uninterrupted online presence.
Today, Python enableshash randomization by default for some inputs, such as strings, to make the hash values less predictable. This makeshash()
a bit moresecure and the attack more difficult. You can disable randomization, though, by setting a fixed seed value through thePYTHONHASHSEED
environment variable, for example:
C:\>setPYTHONHASHSEED=1C:\> python -c print(hash('Lorem'))440669153173126140C:\> python -c print(hash('Lorem'))440669153173126140C:\> python -c print(hash('Lorem'))440669153173126140
$PYTHONHASHSEED=1python-c'print(hash("Lorem"))'440669153173126140$PYTHONHASHSEED=1python-c'print(hash("Lorem"))'440669153173126140$PYTHONHASHSEED=1python-c'print(hash("Lorem"))'440669153173126140
Now, each Python invocation yields the same hash value for a known input. This can help in partitioning or sharing data across a cluster of distributed Python interpreters. Just be careful and understand the risks involved in disabling hash randomization. All in all, Python’shash()
is indeed adeterministic function, which is one of the most fundamental features of the hash function.
Additionally,hash()
seems fairlyuniversal as it takes arbitrary inputs. In other words, it takes values of various types and sizes. The function accepts strings and floating-point numbers regardless of their length or magnitude without complaining. In fact, you can calculate a hash value of more exotic types too:
>>>hash(None)5904497366826>>>hash(hash)2938107101725>>>classPerson:...pass...>>>hash(Person)5904499092884>>>hash(Person())8738841746871>>>hash(Person())8738841586112
Here, you called the hash function onPython’sNone
object, thehash()
function itself, and even a customPerson
class with a few of its instances. That said, not all objects have a corresponding hash value. Thehash()
function will raise an exception if you try calling it against one of those few objects:
>>>hash([1,2,3])Traceback (most recent call last): File"<stdin>", line1, in<module>TypeError:unhashable type: 'list'
The input’s underlying type will determine whether or not you can calculate a hash value. In Python, instances of built-in mutable types—like lists, sets, and dicts—aren’t hashable. You’ve gotten a hint of why that is, but you’ll learn more in alater section. For now, you can assume that most data types should work with a hash function in general.
hash()
Another interesting characteristic ofhash()
is that it always produces afixed-size output no matter how big the input was. In Python, the hash value is an integer with a moderate magnitude. Occasionally, it may come out as a negative number, so take that into account if you plan to rely on hash values in one way or another:
>>>hash("Lorem")-972535290375435184
The natural consequence of a fixed-size output is that most of the original information gets irreversibly lost during the process. That’s fine since you want the resulting hash value to act as a unified digest of arbitrarily large data, after all. However, because the hash function projects a potentially infinite set of values onto a finite space, this can lead to ahash collision when two different inputs produce the same hash value.
Note: If you’re mathematically inclined, then you could use thepigeonhole principle to describe hash collisions more formally:
Givenm items andn containers, ifm >n, then there’s at least one container with more than one item.
In this context, items are a potentially infinite number of values that you feed into the hash function, while containers are their hash values assigned from a finite pool.
Hash collisions are an essential concept in hash tables, which you’llrevisit later in more depth when implementing your custom hash table. For now, you can think of them as highly undesirable. You should avoid hash collisions as much as possible because they can lead to very inefficient lookups and could be exploited by hackers. Therefore, a good hash function must minimize the likelihood of a hash collision for both security and efficiency.
In practice, this often means that the hash function must assignuniformly distributed values over the available space. You can visualize the distribution of hash values produced by Python’shash()
by plotting a textual histogram in your terminal. Copy the following block of code and save it in a file namedhash_distribution.py
:
# hash_distribution.pyfromcollectionsimportCounterdefdistribute(items,num_containers,hash_function=hash):returnCounter([hash_function(item)%num_containersforiteminitems])defplot(histogram):forkeyinsorted(histogram):count=histogram[key]padding=(max(histogram.values())-count)*" "print(f"{key:3}{'■'*count}{padding} ({count})")
It uses aCounter
instance to conveniently represent the histogram of hash values of the provided items. The hash values are spread over the specified number of containers by wrapping them with themodulo operator. Now, you can take one hundred printable ASCII characters, for example, then calculate their hash values and show their distribution:
>>>fromhash_distributionimportplot,distribute>>>fromstringimportprintable>>>plot(distribute(printable,num_containers=2))0 ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ (51)1 ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ (49)>>>plot(distribute(printable,num_containers=5))0 ■■■■■■■■■■■■■■■ (15)1 ■■■■■■■■■■■■■■■■■■■■■■■■■■ (26)2 ■■■■■■■■■■■■■■■■■■■■■■ (22)3 ■■■■■■■■■■■■■■■■■■ (18)4 ■■■■■■■■■■■■■■■■■■■ (19)
When there are only two containers, you should expect roughly a fifty-fifty distribution. Adding more containers should result in filling them more or less evenly. As you can see, the built-inhash()
function is pretty good but not perfect at distributing the hash values uniformly.
Related to that, the uniform distribution of hash values is typicallypseudo-random, which is especially important for cryptographic hash functions. This prevents potential attackers from using statistical analysis to try and predict the correlation between input and output of the hash function. Consider altering a single letter in a string, and check how that affects the resulting hash value in Python:
>>>hash("Lorem")1090207136701886571>>>hash("Loren")4415277245823523757
It’s a completely different hash value now, despite only one letter being different. Hash values are often subject to anavalanche effect, as even the smallest input change gets amplified. Nevertheless, this feature of the hash function isn’t essential for the sake of implementing a hash table data structure.
In most cases, Python’shash()
exhibits yet another nonessential feature of cryptographic hash functions, which stems from the pigeonhole principle mentioned earlier. It behaves like aone-way function because finding its inverse is next to impossible in the majority of cases. However, there are notable exceptions:
>>>hash(42)42
Hash values of small integers are equal to themselves, which is an implementation detail thatCPython uses for simplicity and efficiency. Bear in mind that the actual hash values don’t matter as long as you can calculate them in a deterministic way.
Last but not least, calculating a hash value in Python isfast, even for very big inputs. On a modern computer, callinghash()
with a string of 100 million characters as the argument returns instantaneously. If it weren’t so fast, then the additional overhead of the hash value computation would offset the benefits of hashing in the first place.
Based on what you’ve learned so far about Python’shash()
, you can now draw conclusions about the desired properties of a hash function in general. Here’s a summary of those features, comparing the regular hash function with its cryptographic flavor:
Feature | Hash Function | Cryptographic Hash Function |
---|---|---|
Deterministic | ✔️ | ✔️ |
Universal Input | ✔️ | ✔️ |
Fixed-Sized Output | ✔️ | ✔️ |
Fast to Compute | ✔️ | ✔️ |
Uniformly Distributed | ✔️ | ✔️ |
Randomly Distributed | ✔️ | |
Randomized Seed | ✔️ | |
One-Way Function | ✔️ | |
Avalanche Effect | ✔️ |
The goals of both hash function types overlap, so they share a few common features. On the other hand, a cryptographic hash function provides additional guarantees around security.
Before building your own hash function, you’ll take a look at another function built into Python that’s seemingly its most straightforward substitute.
Probably one of the most straightforward hash function implementations imaginable in Python is the built-inid()
, which tells you an object’s identity. In the standard Python interpreter, identity is the same as the object’s memory address expressed as an integer:
>>>id("Lorem")139836146678832
Theid()
function has most of the desired hash function properties. After all, it’s super fast and works with any input. It returns a fixed-size integer in a deterministic way. At the same time, you can’t easily retrieve the original object based on its memory address. The memory addresses themselves are immutable during an object’s lifetime and somewhat randomized between interpreter runs.
So, why does Python insist on using a different function for hashing then?
First of all, the intent ofid()
is different fromhash()
, so other Python distributions may implement identity in alternative ways. Second, memory addresses are predictable without having a uniform distribution, which is both insecure and severely inefficient for hashing. Finally, equal objects should generally produce the same hash code even if they have distinct identities.
Note: Later, you’ll learn more about thecontract between the equality of values and the corresponding hash codes.
With that out of the way, you can finally think of making your own hash function.
Designing a hash function that meets all requirements from scratch is hard. As mentioned before, you’ll be using the built-inhash()
function to create your hash table prototype in the next section. However, trying to build a hash function from the ground up is a great way of learning how it works. By the end of this section, you’ll only have a rudimentary hash function that’s far from perfect, but you’ll have gained valuable insights.
In this exercise, you can limit yourself to only one data type at first and implement a crude hash function around it. For example, you could consider strings and sum up theordinal values of the individual characters in them:
>>>defhash_function(text):...returnsum(ord(character)forcharacterintext)
You iterate over the text using agenerator expression, then turn each individual character into the correspondingUnicode code point with the built-inord()
function, and finallysum the ordinal values together. This will throw out a single number for any given text provided as an argument:
>>>hash_function("Lorem")511>>>hash_function("Loren")512>>>hash_function("Loner")512
Right away, you’ll notice a few problems with this function. Not only is it string-specific, but it also suffers from poor distribution of hash codes, which tend to form clusters at similar input values. A slight change in the input has little effect on the observed output. Even worse, the function remains insensitive to character order in the text, which meansanagrams of the same word, such asLoren andLoner, lead to a hash code collision.
To fix the first problem, try converting the input to a string with a call tostr()
. Now, your function will be able to deal with any type of argument:
>>>defhash_function(key):...returnsum(ord(character)forcharacterinstr(key))>>>hash_function("Lorem")511>>>hash_function(3.14)198>>>hash_function(True)416
You can callhash_function()
with an argument of any data type, including a string, a floating-point number, or a Boolean value.
Note that this implementation will only be as good as the corresponding string representation. Some objects may not have a textual representation suitable for the code above. In particular, custom class instances without the special methods.__str__()
and.__repr__()
properly implemented are a good example. Plus, you won’t be able to distinguish between different data types anymore:
>>>hash_function("3.14")198>>>hash_function(3.14)198
In reality, you’d want to treat the string"3.14"
and the floating-point number3.14
as distinct objects with different hash codes. One way to mitigate this would be to tradestr()
forrepr()
, which encloses the representation of strings with additional apostrophes ('
):
>>>repr("3.14")"'3.14'">>>repr(3.14)'3.14'
That’ll improve your hash function to some extent:
>>>defhash_function(key):...returnsum(ord(character)forcharacterinrepr(key))>>>hash_function("3.14")276>>>hash_function(3.14)198
Strings are now distinguishable from numbers. To tackle the issue with anagrams, likeLoren andLoner, you might modify your hash function by taking into consideration the character’s value as well as its position within the text:
>>>defhash_function(key):...returnsum(...index*ord(character)...forindex,characterinenumerate(repr(key),start=1)...)
Here, you take the sum of products derived from multiplying the ordinal values of characters and their corresponding indices. Notice youenumerate the indices from one rather than zero. Otherwise, the first character would always be discarded as its value would be multiplied by zero.
Now your hash function is fairly universal and doesn’t cause nearly as many collisions as before, but its output can grow arbitrarily large because the longer the string, the bigger the hash code. Also, it’s quite slow for larger inputs:
>>>hash_function("Tiny")1801>>>hash_function("This has a somewhat medium length.")60919>>>hash_function("This is very long and slow!"*1_000_000)33304504435500117
You can always address unbounded growth by taking the modulo (%
) of your hash code against a known maximum size, such as one hundred:
>>>hash_function("Tiny")%1001>>>hash_function("This has a somewhat medium length.")%10019>>>hash_function("This is very long and slow!"*1_000_000)%10017
Remember that choosing a smaller pool of hash codes increases the likelihood of hash code collisions. If you don’t know the number of your input values up front, then it’s best to leave that decision until later if you can. You may also impose a limit on your hash codes by assuming a reasonable maximum value, such assys.maxsize
, which represents the highest value of integers supported natively in Python.
Ignoring the function’s slow speed for a moment, you’ll notice another peculiar issue with your hash function. It results in suboptimal distribution of hash codes through clustering and by not taking advantage of all the available slots:
>>>fromhash_distributionimportplot,distribute>>>fromstringimportprintable>>>plot(distribute(printable,6,hash_function)) 0 ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ (31) 1 ■■■■ (4) 2 ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ (31) 4 ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■ (33) 5 ■ (1)
The distribution is uneven. Moreover, there are six containers available, but one is missing from the histogram. This problem stems from the fact that the two apostrophes added byrepr()
cause virtually all keys in this example to result in an even hash number. You can avoid this by removing the left apostrophe if it exists:
>>>hash_function("a"),hash_function("b"),hash_function("c")(350, 352, 354)>>>defhash_function(key):...returnsum(...index*ord(character)...forindex,characterinenumerate(repr(key).lstrip("'"),1)...)>>>hash_function("a"),hash_function("b"),hash_function("c")(175, 176, 177)>>>plot(distribute(printable,6,hash_function)) 0 ■■■■■■■■■■■■■■■■ (16) 1 ■■■■■■■■■■■■■■■■ (16) 2 ■■■■■■■■■■■■■■■ (15) 3 ■■■■■■■■■■■■■■■■■■ (18) 4 ■■■■■■■■■■■■■■■■■ (17) 5 ■■■■■■■■■■■■■■■■■■ (18)
The call tostr.lstrip()
will only affect a string if it starts with the specified prefix to strip.
Naturally, you can continue improving your hash function further. If you’re curious about the implementation ofhash()
for strings and byte sequences in Python, then it currently uses theSipHash algorithm, which might fall back to a modified version ofFNV if the former is unavailable. To find out which hash algorithm your Python interpreter is using, reach for thesys
module:
>>>importsys>>>sys.hash_info.algorithm'siphash24'
At this point, you have a pretty good grasp of the hash function, how it’s supposed to work, and what challenges you might face in implementing it. In the upcoming sections, you’ll use a hash function to build a hash table. The choice of a particular hash algorithm will influence the hash table’s performance. With that knowledge as your foundation, you can feel free to stick with the built-inhash()
from now on.
In this section, you’re going to create a custom class representing the hash table data structure. It won’t be backed by a Python dictionary, so you can build a hash table from scratch and practice what you’ve learned so far. At the same time, you’ll model your implementation after the built-in dictionary by mimicking its most essential features.
Note: This is just a quick reminder that implementing a hash table is only an exercise and an educational tool to teach you about the problems that this data structure solves. Just like coding a custom hash function before, a pure-Python hash table implementation has no practical use in real-life applications.
Below is a list of the high-level requirements for your hash table, which you’ll be implementing now. By the end of this section, your hash table will exhibit the followingcore features. It’ll let you:
In addition to these, you’ll implement a fewnonessential but still useful features. Specifically, you should be able to:
While implementing these features, you’ll actively exercisetest-driven development by gradually adding more features to your hash table. Note that your prototype will only cover the basics. You’ll learn how to cope with some more advanced corner cases a bit later in this tutorial. In particular, this section won’t cover how to:
Feel free to use the supplementary materials ascontrol checkpoints if you get stuck or if you’d like to skip some of the intermediate refactoring steps. Each subsection ends with a complete implementation stage and the corresponding tests that you can start from. Grab the following link and download the supporting materials with the complete source code and the intermediate steps used in this tutorial:
Source Code:Click here to download the source code that you’ll use to build a hash table in Python.
Now that you know the high-level properties of a hash function and its purpose, you can exercise atest-driven development approach to build a hash table. If you’ve never tried this programming technique before, then it boils down to three steps that you tend to repeat in a cycle:
Python comes with theunittest package out of the box, but the third-partypytest library has an arguably shallower learning curve, so you’ll use that in this tutorial instead. Go ahead and installpytest
in yourvirtual environment now:
(venv)C:\> python -m pip install pytest
(venv)$python-mpipinstallpytest
Remember that you can verify each implementation stage against several control checkpoints. Next, create a file namedtest_hashtable.py
and define a dummy test function in it to check if pytest will pick it up:
# test_hashtable.pydeftest_should_always_pass():assert2+2==22,"This is just a dummy test"
The framework leverages the built-inassert
statement to run your tests and report their results. That means you can just use regular Python syntax, without importing any specific API until absolutely necessary. It also detects test files and test functions as long as their names start with thetest
prefix.
Theassert
statement takes a Boolean expression as an argument, followed by an optional error message. When the condition evaluates toTrue
, then nothing happens, as if there were no assertion at all. Otherwise, Python will raise anAssertionError
and display the message on thestandard error stream (stderr). Meanwhile, pytest intercepts assertion errors and builds a report around them.
Now, open the terminal, change your working directory to wherever you saved that test file, and run thepytest
command without any arguments. Its output should look similar to this:
(venv)C:\> python -m pytest=========================== test session starts ===========================platform win32 -- Python 3.10.2, pytest-6.2.5, pluggy-1.0.0rootdir: C:\Users\realpython\hashtablecollected 1 itemtest_hashtable.py F [100%]================================ FAILURES =================================_________________________ test_should_always_pass _________________________ def test_should_always_pass():> assert 2 + 2 == 22,"This is just a dummy test"E AssertionError: This is just a dummy testE assert (2 + 2) == 22test_hashtable.py:4: AssertionError========================= short test summary info =========================FAILED test_hashtable.py::test_should_always_pass - AssertionError: This...============================ 1 failed in 0.03s ============================
(venv)$pytest=========================== test session starts ===========================platform linux -- Python 3.10.0, pytest-6.2.5, py-1.11.0, pluggy-1.0.0rootdir: /home/realpython/hashtablecollected 1 itemtest_hashtable.py F [100%]================================ FAILURES =================================_________________________ test_should_always_pass _________________________ def test_should_always_pass():> assert 2 + 2 == 22, "This is just a dummy test"E AssertionError: This is just a dummy testE assert (2 + 2) == 22test_hashtable.py:4: AssertionError========================= short test summary info =========================FAILED test_hashtable.py::test_should_always_pass - AssertionError: This...============================ 1 failed in 0.03s ============================
Uh-oh. There’s a failure in your test. To find the root cause, increase the verbosity of pytest’s output by appending the-v
flag to the command. Now you can pinpoint where the problem lies:
def test_should_always_pass():> assert 2 + 2 == 22, "This is just a dummy test"E AssertionError: This is just a dummy testE assert 4 == 22E +4E -22
The output shows what theactual andexpected values were for the failed assertion. In this case, adding two plus two results in four rather than twenty-two. You can fix the code by correcting the expected value:
# test_hashtable.pydeftest_should_always_pass():assert2+2==4,"This is just a dummy test"
When you rerun pytest, there should be no test failures anymore:
(venv)C:\> python -m pytest=========================== test session starts ===========================platform win32 -- Python 3.10.2, pytest-6.2.5, pluggy-1.0.0rootdir: C:\Users\realpython\hashtablecollected 1 itemtest_hashtable.py . [100%]============================ 1 passed in 0.00s ============================
(venv)$pytest=========================== test session starts ===========================platform linux -- Python 3.10.0, pytest-6.2.5, py-1.11.0, pluggy-1.0.0rootdir: /home/realpython/hashtablecollected 1 itemtest_hashtable.py . [100%]============================ 1 passed in 0.00s ============================
That’s it! You’ve just learned the essential steps in automating test cases for your hash table implementation. Naturally, you can use an IDE such asPyCharm or an editor likeVS Code that integrates with testing frameworks if that’s more convenient for you. Next up, you’re going to put this new knowledge into practice.
HashTable
ClassRemember to follow thered-green-refactor cycle described earlier. Therefore, you must start by identifying your first test case. For example, you should be able to instantiate a new hash table by calling the hypotheticalHashTable
class imported from thehashtable
module. This call should return a non-empty object:
# test_hashtable.pyfromhashtableimportHashTabledeftest_should_create_hashtable():assertHashTable()isnotNone
At this point, your test will refuse to run because of an unsatisfied import statement at the top of the file. You’re in thered phase, after all. The red phase is the only time when you’re allowed to add new features, so go on and create another module namedhashtable.py
and put theHashTable
class definition in it:
# hashtable.pyclassHashTable:pass
It’s a bare-bones class placeholder, but it should be just enough to make your test pass. By the way, if you’re using a code editor, then you can convenientlysplit the view into columns, displaying the code under test and the corresponding tests side by side:
If you’re curious about the color scheme depicted in the screenshot above, then it’s theDracula Theme. It’s available for many code editors and not just PyCharm.
Once running pytest succeeds, then you can start thinking of another test case since there’s barely anything to refactor yet. For example, a basic hash table should contain a sequence of values. At this early stage, you can assume the sequence to have afixed size established at the hash table’s creation time. Modify your existing test case accordingly:
# test_hashtable.pyfromhashtableimportHashTabledeftest_should_create_hashtable():assertHashTable(size=100)isnotNone
You specify the size using akeyword argument. However, before adding new code to theHashTable
class, rerun your tests to confirm that you’ve entered the red phase again. Witnessing a failing test can be invaluable. When you implement a piece of code later, you’ll know whether it satisfies a particular group of tests or if they remain unaffected. Otherwise, your tests could lie to you by verifying something different than you thought!
After confirming that you’re in the red phase, declare the.__init__()
method in theHashTable
class with the expected signature, but don’t implement its body:
# hashtable.pyclassHashTable:def__init__(self,size):pass
Boom! You’re back in the green phase once more, so how about a bit of refactoring this time? For instance, you could rename thesize
argument tocapacity
if that’s more descriptive to you. Don’t forget to change the test case first, then run pytest, and always update the code under test as the final step:
# hashtable.pyclassHashTable:def__init__(self,capacity):pass
As you can tell, the red-green-refactor cycle consists of brief stages, often lasting no more than a few seconds each. So, why don’t you continue by adding more test cases? It would be nice if your data structure could report back the hash table’s capacity using Python’s built-inlen()
function. Add another test case and observe how it fails miserably:
# test_hashtable.pyfromhashtableimportHashTabledeftest_should_create_hashtable():assertHashTable(capacity=100)isnotNonedeftest_should_report_capacity():assertlen(HashTable(capacity=100))==100
To handlelen()
correctly, you must implement thespecial method.__len__()
in your class and remember the capacity supplied through the class initializer:
# hashtable.pyclassHashTable:def__init__(self,capacity):self.capacity=capacitydef__len__(self):returnself.capacity
You may think that TDD isn’t moving you in the right direction. That’s not how you might have envisioned the hash table implementation. Where’s the sequence of values that you started with in the beginning? Unfortunately, takingbaby steps and making many course corrections along the way is something that test-driven development gets a lot of criticism for. Therefore, it may not be appropriate for projects involving lots of experimentation.
On the other hand, implementing a well-known data structure such as a hash table is a perfect application of this software development methodology. You have clear expectations that are straightforward to codify as test cases. Soon, you’ll see for yourself that taking the next step will lead to a slight change in the implementation. Don’t worry about it, though, because perfecting the code itself is less important than making your test cases pass.
As you keep adding more constraints through the test cases, you frequently have to rethink your implementation. For example, a new hash table should probably start with empty slots for the stored values:
# test_hashtable.py# ...deftest_should_create_empty_value_slots():assertHashTable(capacity=3).values==[None,None,None]
In other words, a new hash table should expose a.values
attribute having the requested length and being filled with theNone
elements. By the way, it’s common to use such verbose function names because they’ll appear in your test report. The more readable and descriptive your tests, the more quickly you’ll figure out what part needs fixing.
Note: As a rule of thumb, your test cases should be as independent and atomic as possible, which usually means using only one assertion per function. Nevertheless, your test scenarios may sometimes need a bit of setup or teardown. They may also involve a few steps. In such cases, it’s customary to structure your function around the so-calledgiven-when-then convention:
deftest_should_create_empty_value_slots():# Givenexpected_values=[None,None,None]hash_table=HashTable(capacity=3)# Whenactual_values=hash_table.values# Thenassertactual_values==expected_values
Thegiven part describes the initial state and preconditions to your test case, whilewhen represents the action of your code under test, andthen is responsible for asserting the resulting outcome.
This is one of many possible ways to satisfy your existing tests:
# hashtable.pyclassHashTable:def__init__(self,capacity):self.values=capacity*[None]def__len__(self):returnlen(self.values)
You replace the.capacity
attribute with a list of the requested length containing onlyNone
elements. Multiplying a number by a list or the other way around is a quick way to populate that list with the given value or values. Other than that, you update the special method.__len__()
so that all tests will pass.
Note: A Python dictionary has a length corresponding to the actual number of key-value pairs stored instead of its internal capacity. You’ll achieve a similar effect soon.
Now that you’re acquainted with TDD, it’s time to dive deeper and add the remaining features to your hash table.
Now that you can create a hash table, it’s time to give it some storage capabilities. The traditional hash table is backed by an array capable of storing only one data type. Because of this, hash table implementations in many languages, such as Java, require you to declare the type for their keys and values up front:
Map<String,Integer>phonesByNames=newHashMap<>();
This particular hash table maps strings to integers, for example. However, because arrays aren’t native to Python, you’ll keep using a list instead. As a side effect, your hash table will be able to accept arbitrary data types for both the keys and values, just like Python’sdict
.
Note: Python has an efficientarray
collection, but it’s intended for numeric values only. You may sometimes find it convenient for processing raw binary data.
Now add another test case for inserting key-value pairs into your hash table using the familiar square bracket syntax:
# test_hashtable.py# ...deftest_should_insert_key_value_pairs():hash_table=HashTable(capacity=100)hash_table["hola"]="hello"hash_table[98.6]=37hash_table[False]=Trueassert"hello"inhash_table.valuesassert37inhash_table.valuesassertTrueinhash_table.values
First, you create a hash table with one hundred empty slots and then populate it with three pairs of keys and values of various types, including strings, floating-point numbers, and Booleans. Finally, you assert that the hash table contains the expected values in whatever order. Note that your hash table only remembers thevalues but not the associated keys at the moment!
The most straightforward and perhaps slightly naive implementation that would satisfy this test is as follows:
# hashtable.pyclassHashTable:def__init__(self,capacity):self.values=capacity*[None]def__len__(self):returnlen(self.values)def__setitem__(self,key,value):self.values.append(value)
It completely ignores the key and just appends the value to the right end of the list, increasing its length. You may very well immediately think of another test case. Inserting elements into the hash table shouldn’t grow its size. Similarly, removing an element shouldn’t shrink the hash table, but you only take a mental note of that, because there’s no ability to remove key-value pairs yet.
Note: You could also write a placeholder test and tell pytest to skip it unconditionally until later:
importpytest@pytest.mark.skipdeftest_should_not_shrink_when_removing_elements():pass
It leverages one of thedecorators provided by pytest.
In the real world, you’d want to create separate test cases with descriptive names dedicated to testing these behaviors. However, because this is only a tutorial, you’re going to add a new assertion to the existing function for brevity:
# test_hashtable.py# ...deftest_should_insert_key_value_pairs():hash_table=HashTable(capacity=100)hash_table["hola"]="hello"hash_table[98.6]=37hash_table[False]=Trueassert"hello"inhash_table.valuesassert37inhash_table.valuesassertTrueinhash_table.valuesassertlen(hash_table)==100
You’re in the red phase now, so rewrite your special method to keep the capacity fixed at all times:
# hashtable.pyclassHashTable:def__init__(self,capacity):self.values=capacity*[None]def__len__(self):returnlen(self.values)def__setitem__(self,key,value):index=hash(key)%len(self)self.values[index]=value
You turn an arbitrary key into a numeric hash value and use the modulo operator to constrain the resulting index within the available address space. Great! Your test report lights up green again.
Note: The code above relies on Python’s built-inhash()
function, which has an element of randomization, as you already learned. Therefore, your test might fail in rare cases when two keys happen to produce an identical hash code by coincidence. Because you’ll deal with hash code collisions later, you can disable hash randomization or use a predictable seed when running pytest in the meantime:
(venv)C:\>setPYTHONHASHSEED=128(venv)C:\> python -m pytest
(venv)$PYTHONHASHSEED=128pytest
Make sure to pick a hash seed that won’t cause any collisions in your sample data. Finding one can involve a bit of a trial and error. In my case, value128
seemed to work fine.
But can you think of some edge cases, maybe? What about insertingNone
into your hash table? It would create a conflict between a legitimate value and the designatedsentinel value that you chose to indicate blanks in your hash table. You’ll want to avoid that.
As always, start by writing a test case to express the desired behavior:
# test_hashtable.py# ...deftest_should_not_contain_none_value_when_created():assertNonenotinHashTable(capacity=100).values
One way to work around this would be to replaceNone
in your.__init__()
method with another unique value that users are unlikely to insert. For example, you could define a special constant by creating a brand-newobject to represent blank spaces in your hash table:
# hashtable.pyBLANK=object()classHashTable:def__init__(self,capacity):self.values=capacity*[BLANK]# ...
You only need a single blank instance to mark the slots as empty. Naturally, you’ll need to update an older test case to get back to the green phase:
# test_hashtable.pyfromhashtableimportHashTable,BLANK# ...deftest_should_create_empty_value_slots():assertHashTable(capacity=3).values==[BLANK,BLANK,BLANK]# ...
Then, write a positive test case exercising thehappy path for handling the insertion of aNone
value:
deftest_should_insert_none_value():hash_table=HashTable(capacity=100)hash_table["key"]=NoneassertNoneinhash_table.values
You create an empty hash table with one hundred slots and insertNone
associated with some arbitrary key. It should work like a charm if you’ve been closely following the steps so far. If not, then look at the error messages because they often contain clues as to what went wrong. Alternatively, check the sample code available for download at this link:
Source Code:Click here to download the source code that you’ll use to build a hash table in Python.
In the next subsection, you’ll add the ability to retrieve values by their associated keys.
To retrieve a value from the hash table, you’ll want to leverage the same square brackets syntax as before, only without using the assignment statement. You’ll also need a sample hash table. To avoid duplicating the same setup code across the individual test cases in yourtest suite, you can wrap it in atest fixture exposed by pytest:
# test_hashtable.pyimportpytest# ...@pytest.fixturedefhash_table():sample_data=HashTable(capacity=100)sample_data["hola"]="hello"sample_data[98.6]=37sample_data[False]=Truereturnsample_datadeftest_should_find_value_by_key(hash_table):asserthash_table["hola"]=="hello"asserthash_table[98.6]==37asserthash_table[False]isTrue
Atest fixture is a function annotated with the@pytest.fixture
decorator. It returns sample data for your test cases, such as a hash table populated with known keys and values. Your pytest will automatically call that function for you and inject its result into any test function that declares an argument with the same name as your fixture. In this case, the test function expects ahash_table
argument, which corresponds to your fixture name.
To be able to find values by key, you can implement the element lookup through another special method called.__getitem__()
in yourHashTable
class:
# hashtable.pyBLANK=object()classHashTable:def__init__(self,capacity):self.values=capacity*[BLANK]def__len__(self):returnlen(self.values)def__setitem__(self,key,value):index=hash(key)%len(self)self.values[index]=valuedef__getitem__(self,key):index=hash(key)%len(self)returnself.values[index]
You calculate the index of an element based on the hash code of the provided key and return whatever sits under that index. But what about missing keys? As of now, you might return a blank instance when a given key hasn’t been used before, an outcome which isn’t all that helpful. To replicate how a Pythondict
would work in such a case, you should raise aKeyError
exception instead:
# test_hashtable.py# ...deftest_should_raise_error_on_missing_key():hash_table=HashTable(capacity=100)withpytest.raises(KeyError)asexception_info:hash_table["missing_key"]assertexception_info.value.args[0]=="missing_key"
You create an empty hash table and try accessing one of its values through a missing key. The pytest framework includes a special construct for testing exceptions. Up above, you use thepytest.raises
context manager to expect a specific kind of exception within the following code block. Handling this case is a matter of adding aconditional statement to your accessor method:
# hashtable.py# ...classHashTable:# ...def__getitem__(self,key):index=hash(key)%len(self)value=self.values[index]ifvalueisBLANK:raiseKeyError(key)returnvalue
If the value at the given index is blank, then you raise the exception. By the way, you use theis
operator instead of the equality test operator (==
) to ensure that you’re comparing theidentities rather than values. Although the default implementation of the equality test in custom classes falls back to comparing the identities of their instances, most built-in data types distinguish between the two operators and implement them differently.
Because you can now determine whether a given key has an associated value in your hash table, you might as well implement thein
operator to mimic a Python dictionary. Remember to write and cover these test cases individually to respect test-driven development principles:
# test_hashtable.py# ...deftest_should_find_key(hash_table):assert"hola"inhash_tabledeftest_should_not_find_key(hash_table):assert"missing_key"notinhash_table
Both test cases take advantage of the test fixture that you prepared earlier and verify the.__contains__()
special method, which you can implement in the following way:
# hashtable.py# ...classHashTable:# ...def__contains__(self,key):try:self[key]exceptKeyError:returnFalseelse:returnTrue
When accessing the given key raises aKeyError
, you intercept that exception and returnFalse
to indicate a missing key. Otherwise, you combine thetry
…except
block with anelse
clause and returnTrue
.Exception handling is great but can sometimes be inconvenient, which is whydict.get()
lets you specify an optionaldefault value. You can replicate the same behavior in your custom hash table:
# test_hashtable.py# ...deftest_should_get_value(hash_table):asserthash_table.get("hola")=="hello"deftest_should_get_none_when_missing_key(hash_table):asserthash_table.get("missing_key")isNonedeftest_should_get_default_value_when_missing_key(hash_table):asserthash_table.get("missing_key","default")=="default"deftest_should_get_value_with_default(hash_table):asserthash_table.get("hola","default")=="hello"
The code of.get()
looks similar to the special method you’ve just implemented:
# hashtable.py# ...classHashTable:# ...defget(self,key,default=None):try:returnself[key]exceptKeyError:returndefault
You attempt to return the value corresponding to the provided key. In case of an exception, you return the default value, which is anoptional argument. When the user leaves it unspecified, then it equalsNone
.
For completeness, you’ll add the capability to delete a key-value pair from your hash table in the upcoming subsection.
Python dictionaries let you delete previously inserted key-value pairs using the built-indel
keyword, which removes information about both the key and the value. Here’s how this could work with your hash table:
# test_hashtable.py# ...deftest_should_delete_key_value_pair(hash_table):assert"hola"inhash_tableassert"hello"inhash_table.valuesdelhash_table["hola"]assert"hola"notinhash_tableassert"hello"notinhash_table.values
First, you verify if the sample hash table has the desired key and value. Then, you delete both by indicating only the key and repeat the assertions but with inverted expectations. You can make this test pass as follows:
# hashtable.py# ...classHashTable:# ...def__delitem__(self,key):index=hash(key)%len(self)delself.values[index]
You calculate the index associated with a key and remove the corresponding value from the list unconditionally. However, you immediately remember your mental note from earlier about asserting that your hash table should not shrink when you delete elements from it. So, you add two more assertions:
# test_hashtable.py# ...deftest_should_delete_key_value_pair(hash_table):assert"hola"inhash_tableassert"hello"inhash_table.valuesassertlen(hash_table)==100delhash_table["hola"]assert"hola"notinhash_tableassert"hello"notinhash_table.valuesassertlen(hash_table)==100
These will ensure that the size of your hash table’s underlying list remains unaffected. Now, you need to update your code so that it marks a slot as blank instead of throwing it away completely:
# hashtable.py# ...classHashTable:# ...def__delitem__(self,key):index=hash(key)%len(self)self.values[index]=BLANK
Considering that you’re in the green phase again, you can take this opportunity to spend some time refactoring. The same index formula appears three times in different places. You can extract it and simplify the code:
# hashtable.py# ...classHashTable:# ...def__delitem__(self,key):self.values[self._index(key)]=BLANKdef__setitem__(self,key,value):self.values[self._index(key)]=valuedef__getitem__(self,key):value=self.values[self._index(key)]ifvalueisBLANK:raiseKeyError(key)returnvalue# ...def_index(self,key):returnhash(key)%len(self)
Suddenly, after applying only this slight modification, a pattern emerges. Deleting an item is equivalent to inserting a blank object. So, you can rewrite your deletion routine to take advantage of the mutator method:
# hashtable.py# ...classHashTable:# ...def__delitem__(self,key):self[key]=BLANK# ...
Assigning a value through the square brackets syntax delegates to the.__setitem__()
method. All right, that’s enough refactoring for now. It’s much more important to think of other test cases at this point. For example, what happens when you request to delete a missing key? Python’sdict
raises aKeyError
exception in such a case, so you can do the same:
# hashtable.py# ...deftest_should_raise_key_error_when_deleting(hash_table):withpytest.raises(KeyError)asexception_info:delhash_table["missing_key"]assertexception_info.value.args[0]=="missing_key"
Covering this test case is relatively straightforward as you can rely on the code that you wrote when implementing thein
operator:
# hashtable.py# ...classHashTable:# ...def__delitem__(self,key):ifkeyinself:self[key]=BLANKelse:raiseKeyError(key)# ...
If you find the key in your hash table, then you overwrite the associated value with the sentinel value to remove that pair. Otherwise, you raise an exception. All right, there’s one more basic hash table operation to cover, which you’ll do next.
The insertion method should already take care of updating a key-value pair, so you’re only going to add a relevant test case and check if it works as expected:
# test_hashtable.py# ...deftest_should_update_value(hash_table):asserthash_table["hola"]=="hello"hash_table["hola"]="hallo"asserthash_table["hola"]=="hallo"asserthash_table[98.6]==37asserthash_table[False]isTrueassertlen(hash_table)==100
After modifying the value,hello, of an existing key and changing it tohallo, you also check if other key-value pairs, as well as the hash table’s length, remain untouched. That’s it. You already have a basic hash table implementation, but a few extra features that are relatively cheap to implement are still missing.
It’s time to address the elephant in the room. Python dictionaries let you iterate over theirkeys,values, or key-value pairs calleditems. However, your hash table is really only a list of values with fancy indexing on top of it. If you ever wanted to retrieve the original keys put into your hash table, then you’d be out of luck because the current hash table implementation won’t ever remember them.
In this subsection, you’ll refactor your hash table heavily to add the ability to retain keys and values. Bear in mind that there will be several steps involved, and many tests will start failing as a result of that. If you’d like to skip those intermediate steps and see the effect, then jump ahead todefensive copying.
Wait a minute. You keep reading aboutkey-value pairs in this tutorial, so why not replace values with tuples? After all, tuples are straightforward in Python. Even better, you could usenamed tuples to take advantage of their named element lookup. But first, you need to think of a test.
Note: Remember to focus on the high-level user-facing functionality when figuring out a test case. Don’t go about testing a piece of code that you may anticipate based on your programmer’s experience or gut feeling. It’s the tests that should ultimately drive your implementation in TDD, not the other way around.
First of all, you’re going to need another attribute in yourHashTable
class to hold the key-value pairs:
# test_hashtable.py# ...deftest_should_return_pairs(hash_table):assert("hola","hello")inhash_table.pairsassert(98.6,37)inhash_table.pairsassert(False,True)inhash_table.pairs
The order of key-value pairs is unimportant at this point, so you can assume that they might come out in any order each time you request them. However, instead of introducing another field to the class, you could reuse.values
by renaming it to.pairs
and making other necessary adjustments. There are a few. Just be aware that this will temporarily make some tests fail until you fix the implementation.
Note: If you’re using a code editor, then you can conveniently rename a variable or a class member with a single click of a button by leveraging the refactoring capabilities. In PyCharm, for example, you can right-click on a variable, chooseRefactor from the context menu, and thenRename…. Or you can use the corresponding keyboard shortcut:
That’s the most straightforward and reliable way of changing the name of a variable in your project. The code editor will update all variable references across your project files.
When you’ve renamed.values
to.pairs
inhashtable.py
andtest_hashtable.py
, then you’ll also need to update the.__setitem__()
special method. In particular, it should store tuples of the key and the associated value now:
# hashtable.py# ...classHashTable:# ...def__setitem__(self,key,value):self.pairs[self._index(key)]=(key,value)
Inserting an element in your hash table wraps the key and the value in a tuple and then puts that tuple at the desired index in your list of pairs. Note that your list will initially contain only blank objects instead of tuples, so you’ll be using two different data types in your list of pairs. One is a tuple, while the other one could be anything but a tuple to denote a blank slot.
Because of that, you don’t need any special sentinel constant anymore to mark a slot as empty. You can safely remove yourBLANK
constant and replace it with the plainNone
again where necessary, so go ahead and do that now.
Note: Removing code may be hard to accept at first, but less is better! As you can see, test-driven development can sometimes make you run in circles.
You can take another step back to regain control over deleting an item:
# hashtable.py# ...classHashTable:# ...def__delitem__(self,key):ifkeyinself:self.pairs[self._index(key)]=Noneelse:raiseKeyError(key)
Unfortunately, your.__delitem__()
method can no longer take advantage of the square brackets syntax because this would result in wrapping whatever sentinel value you chose in an unnecessary tuple. You must use an explicit assignment statement here to avoid needlessly complicated logic down the road.
The last important piece of the puzzle is updating the.__getitem__()
method:
# hashtable.py# ...classHashTable:# ...def__getitem__(self,key):pair=self.pairs[self._index(key)]ifpairisNone:raiseKeyError(key)returnpair[1]
You peek at an index, expecting to find a key-value pair. If you get nothing at all, then you raise an exception. On the other hand, if you see something interesting, then you grab the tuple’s second element at index one, which corresponds to the mapped value. However, you can write this more elegantly using a named tuple, as suggested earlier:
# hashtable.pyfromtypingimportNamedTuple,AnyclassPair(NamedTuple):key:Anyvalue:AnyclassHashTable:# ...def__setitem__(self,key,value):self.pairs[self._index(key)]=Pair(key,value)def__getitem__(self,key):pair=self.pairs[self._index(key)]ifpairisNone:raiseKeyError(key)returnpair.value# ...
ThePair
class consists of the.key
and.value
attributes, which are free to take values ofany data type. Simultaneously, your class inherits all the regular tuple’s behaviors because it extends theNamedTuple
parent. Note that you have to explicitly callPair()
on your key and value in the.__setitem__()
method to take advantage of the named attribute access in.__getitem__()
.
Note: Despite using a custom data type to represent key-value pairs, you can write your tests to expect eitherPair
instances or regular tuples, thanks to the compatibility of both types.
Naturally, you have a few test cases to update at this point before the report can turn green again. Take your time and carefully review your test suite. Alternatively, look at the code from the supporting materials if you feel stuck, or take a peek here:
# test_hashtable.py# ...deftest_should_insert_key_value_pairs():hash_table=HashTable(capacity=100)hash_table["hola"]="hello"hash_table[98.6]=37hash_table[False]=Trueassert("hola","hello")inhash_table.pairsassert(98.6,37)inhash_table.pairsassert(False,True)inhash_table.pairsassertlen(hash_table)==100# ...deftest_should_delete_key_value_pair(hash_table):assert"hola"inhash_tableassert("hola","hello")inhash_table.pairsassertlen(hash_table)==100delhash_table["hola"]assert"hola"notinhash_tableassert("hola","hello")notinhash_table.pairsassertlen(hash_table)==100
There will be another test case that needs special care. Specifically, it’s about verifying that an empty hash table has noNone
values when created. You’ve just replaced a list of values with a list of pairs. To fish the values out again, you can use alist comprehension such as this:
# test_hashtable.py# ...deftest_should_not_contain_none_value_when_created():hash_table=HashTable(capacity=100)values=[pair.valueforpairinhash_table.pairsifpair]assertNonenotinvalues
If you’re concerned about stuffing too much logic into the test case, then you’d be absolutely right. After all, you want to test the hash table’s behavior. But don’t worry about that just yet. You’ll revisit this test case shortly.
Once you’re back in the green phase, try to figure out possible corner cases. For example,.pairs
are exposed as a public attribute that anyone could intentionally or unintentionally tamper with. In practice, accessor methods should never leak your internal implementation but should makedefensive copies to protect mutable attributes from external modifications:
# test_hashtable.py# ...deftest_should_return_copy_of_pairs(hash_table):asserthash_table.pairsisnothash_table.pairs
Whenever you request the key-value pairs from a hash table, you expect to get a brand-new object with a unique identity. You can hide a private field behind aPython property, so create one and replace every reference to.pairs
with._pairs
in yourHashTable
class only. The leadingunderscore is a standard naming convention in Python that indicates internal implementation:
# hashtable.py# ...classHashTable:def__init__(self,capacity):self._pairs=capacity*[None]def__len__(self):returnlen(self._pairs)def__delitem__(self,key):ifkeyinself:self._pairs[self._index(key)]=Noneelse:raiseKeyError(key)def__setitem__(self,key,value):self._pairs[self._index(key)]=Pair(key,value)def__getitem__(self,key):pair=self._pairs[self._index(key)]ifpairisNone:raiseKeyError(key)returnpair.valuedef__contains__(self,key):try:self[key]exceptKeyError:returnFalseelse:returnTruedefget(self,key,default=None):try:returnself[key]exceptKeyError:returndefault@propertydefpairs(self):returnself._pairs.copy()def_index(self,key):returnhash(key)%len(self)
When you request a list of key-value pairs stored in your hash table, you’ll get their shallow copy each time. Because you won’t have a reference to your hash table’s internal state, it’ll remain unaffected by potential changes to its copy.
Note: The order of class methods that you arrived at might slightly differ from that in the code block presented above. That’s okay because method ordering doesn’t matter from Python’s standpoint. However, it’s customary to start withstatic or class methods, followed by your class’s public interface, which you’re most likely to look at. The internal implementation should typically appear at the very end.
To avoid having to jump around your code, it’s a good idea to organize methods in a way that resembles a story. Specifically, a higher-level function should be listed before the low-level functions that are called.
To further mimicdict.items()
in your property, the resulting list of pairs shouldn’t include blank slots. In other words, there shouldn’t be anyNone
values in that list:
# test_hashtable.py# ...deftest_should_not_include_blank_pairs(hash_table):assertNonenotinhash_table.pairs
To satisfy this test, you can filter empty values out by adding a condition to the list comprehension in your property:
# hashtable.py# ...classHashTable:# ...@propertydefpairs(self):return[pairforpairinself._pairsifpair]
You don’t need an explicit call to.copy()
because the list comprehension creates a new list. For every pair in the original list of the key-value pairs, you check if that particular pair is truthy and keep it in the resulting list. However, this will break two other tests that you need to update now:
# test_hashtable.py# ...deftest_should_create_empty_value_slots():assertHashTable(capacity=3)._pairs==[None,None,None]# ...deftest_should_insert_none_value():hash_table=HashTable(100)hash_table["key"]=Noneassert("key",None)inhash_table.pairs
That’s not ideal because one of your tests reaches out for the internal implementation instead of focusing on public interfaces. Nevertheless, such testing is known aswhite-box testing, which has its place.
Do you remember the test case that you modified by adding a list comprehension to retrieve the values from your key-value pairs? Well, here it is again if you need to refresh your memory:
# test_hashtable.py# ...deftest_should_not_contain_none_value_when_created():hash_table=HashTable(capacity=100)values=[pair.valueforpairinhash_table.pairsifpair]assertNonenotinvalues
The highlighted line looks just like what you’d need to implement the.values
property, which you replaced with.pairs
earlier. You may update the test function to take advantage of.values
again:
# test_hashtable.py# ...deftest_should_not_contain_none_value_when_created():assertNonenotinHashTable(capacity=100).values
It might feel as though it was a wasted effort. But, these values are now evaluated dynamically through a getter property, whereas they were stored in a fixed-size list before. To satisfy this test, you can reuse part of its old implementation, which employs a list comprehension:
# hashtable.py# ...classHashTable:# ...@propertydefvalues(self):return[pair.valueforpairinself.pairs]
Notice that you no longer have to specify the optional filtering condition here, because there’s already one lurking behind the.pairs
property.
Purists might think of using aset comprehension instead of a list comprehension to communicate the lack of guarantees for the order of values. However, that would result in losing information about duplicate values in the hash table. Protect yourself from such a possibility up front by writing another test case:
# test_hashtable.py# ...deftest_should_return_duplicate_values():hash_table=HashTable(capacity=100)hash_table["Alice"]=24hash_table["Bob"]=42hash_table["Joe"]=42assert[24,42,42]==sorted(hash_table.values)
If you have a hash table with names and ages, for example, and more than one person has the same age, then.values
should keep all repeated age values. You can sort the ages to ensure repeatable test runs. While this test case doesn’t require writing new code, it’ll guard againstregressions.
It’s worthwhile to check the expected values, their types, and their number. However, you can’t compare two lists directly because the actual values in a hash table might appear in an unpredictable order. To disregard the order in your test, you could convert both lists to sets or sort them like before. Unfortunately, sets remove potential duplicates, while sorting isn’t possible when lists contain incompatible types.
To reliably check if two lists have exactly the same elements of arbitrary types with potential duplicates while ignoring their order, you might use the following Python idiom:
defhave_same_elements(list1,list2):returnall(elementinlist1forelementinlist2)
It leverages the built-inall()
function, but it’s quite verbose. You’re probably better off using thepytest-unordered
plugin. Don’t forget to install it into your virtual environment first:
(venv)C:\> python -m pip install pytest-unordered
(venv)$python-mpipinstallpytest-unordered
Next, import theunordered()
function into your test suite and use it to wrap the hash table’s values:
# test_hashtable.pyimportpytestfrompytest_unorderedimportunorderedfromhashtableimportHashTable# ...deftest_should_get_values(hash_table):assertunordered(hash_table.values)==["hello",37,True]
Doing so converts the values to an unordered list, which redefines the equality test operator so that it doesn’t take order into account when comparing list elements. Additionally, values of an empty hash table should be an empty list, while the.values
property should always return a new list copy:
# test_hashtable.py# ...deftest_should_get_values_of_empty_hash_table():assertHashTable(capacity=100).values==[]deftest_should_return_copy_of_values(hash_table):asserthash_table.valuesisnothash_table.values
On the other hand, the hash table’s keys must be unique, so it makes sense to emphasize that by returning a set rather than a list of keys. After all, sets are by definition unordered collections of items with no duplicates:
# test_hashtable.py# ...deftest_should_get_keys(hash_table):asserthash_table.keys=={"hola",98.6,False}deftest_should_get_keys_of_empty_hash_table():assertHashTable(capacity=100).keys==set()deftest_should_return_copy_of_keys(hash_table):asserthash_table.keysisnothash_table.keys
There’s no empty set literal in Python, so you have to call the built-inset()
function directly in this case. The implementation of the corresponding getter function will look familiar:
# hashtable.py# ...classHashTable:# ...@propertydefkeys(self):return{pair.keyforpairinself.pairs}
It resembles the.values
property. The difference is that you use curly brackets instead of square brackets and refer to the.key
attribute instead of.value
in your named tuple. Alternatively, you could usepair[0]
if you wanted to, but it would seem less readable.
This also reminds you about the need for a similar test case that you missed when covering the.pairs
attribute. It makes sense to return a set of pairs for the sake of consistency:
# test_hashtable.py# ...deftest_should_return_pairs(hash_table):asserthash_table.pairs=={("hola","hello"),(98.6,37),(False,True)}deftest_should_get_pairs_of_empty_hash_table():assertHashTable(capacity=100).pairs==set()
So, the.pairs
property will also use a set comprehension from now on:
# hashtable.py# ...classHashTable:# ...@propertydefpairs(self):return{pairforpairinself._pairsifpair}
You don’t need to worry about losing any information, because each key-value pair is unique. At this point, you should be in the green phase again.
Note that you can take advantage of the.pairs
property to convert your hash table to a plain old dictionary and use.keys
and.values
to test that:
deftest_should_convert_to_dict(hash_table):dictionary=dict(hash_table.pairs)assertset(dictionary.keys())==hash_table.keysassertset(dictionary.items())==hash_table.pairsassertlist(dictionary.values())==unordered(hash_table.values)
To disregard the order of elements, remember to wrap the dictionary keys and key-value pairs with sets before making the comparison. In contrast, your hash table’s values come out as a list, so be sure to use theunordered()
function to compare lists while ignoring the element order.
Okay, your hash table is really beginning to take shape now!
There’s one small detail that you intentionally left broken for simplicity until now. It’s the length of your hash table, which currently reports its maximum capacity even when there are only empty slots. Fortunately, it doesn’t take much effort to fix this. Find your function namedtest_should_report_capacity()
, rename as shown below, and check if an empty hash table’s length is equal to zero rather than one hundred:
# test_hashtable.py# ...deftest_should_report_length_of_empty_hash_table():assertlen(HashTable(capacity=100))==0
To make the capacity independent from the length, modify your special method.__len__()
so that it refers to the public property with filtered pairs instead of the private list of all slots:
# hashtable.py# ...classHashTable:# ...def__len__(self):returnlen(self.pairs)# ...
You just removed the leading underscore from the variable name. But that small change is now causing a whole bunch of tests to end abruptly with an error and a few to fail.
Note: Failed tests are less severe because their assertions evaluate toFalse
, while errors indicate utterly unexpected behavior in your code.
It seems that most test cases suffer from the same unhandled exception due to division by zero when mapping the key to an index. That makes sense because._index()
uses the hash table’s length to find the remainder from dividing the hashed key by the number of slots available. However, the hash table’s length has a different meaning now. You need to take the length of the internal list instead:
# hashtable.pyclassHashTable:# ...def_index(self,key):returnhash(key)%len(self._pairs)
That’s much better now. The three test cases that still fail use wrong assumptions about the hash table’s length. Change those assumptions to make the tests pass:
# test_hashtable.py# ...deftest_should_insert_key_value_pairs():hash_table=HashTable(capacity=100)hash_table["hola"]="hello"hash_table[98.6]=37hash_table[False]=Trueassert("hola","hello")inhash_table.pairsassert(98.6,37)inhash_table.pairsassert(False,True)inhash_table.pairsassertlen(hash_table)==3# ...deftest_should_delete_key_value_pair(hash_table):assert"hola"inhash_tableassert("hola","hello")inhash_table.pairsassertlen(hash_table)==3delhash_table["hola"]assert"hola"notinhash_tableassert("hola","hello")notinhash_table.pairsassertlen(hash_table)==2# ...deftest_should_update_value(hash_table):asserthash_table["hola"]=="hello"hash_table["hola"]="hallo"asserthash_table["hola"]=="hallo"asserthash_table[98.6]==37asserthash_table[False]isTrueassertlen(hash_table)==3
You’re back in the game, but theZeroDivisionError
was a red flag that should immediately make you want to add additional test cases:
# test_hashtable.py# ...deftest_should_not_create_hashtable_with_zero_capacity():withpytest.raises(ValueError):HashTable(capacity=0)deftest_should_not_create_hashtable_with_negative_capacity():withpytest.raises(ValueError):HashTable(capacity=-100)
Creating aHashTable
with a non-positive capacity doesn’t make much sense. How could you have a container with a negative length? So you should raise an exception if someone were to attempt that, either on purpose or by accident. The standard way to indicate such incorrect input arguments in Python is by raising aValueError
exception:
# hashtable.py# ...classHashTable:def__init__(self,capacity):ifcapacity<1:raiseValueError("Capacity must be a positive number")self._pairs=capacity*[None]# ...
If you’re diligent, then you should probably also check for invalid argument types, but that’s beyond the scope of this tutorial. You can treat it as a voluntary exercise.
Next, add another scenario for testing the length of a non-empty hash table provided as a fixture by pytest:
# test_hashtable.py# ...deftest_should_report_length(hash_table):assertlen(hash_table)==3
There are three key-value pairs, so the length of the hash table should also be three. This test shouldn’t require any additional code. Finally, because you’re in the refactoring phase now, you can add a bit ofsyntactic sugar by introducing the.capacity
property and using it where possible:
# test_hashtable.py# ...deftest_should_report_capacity_of_empty_hash_table():assertHashTable(capacity=100).capacity==100deftest_should_report_capacity(hash_table):asserthash_table.capacity==100
The capacity is constant and determined at the hash table’s creation time. You can derive it from the length of the underlying list of pairs:
# hashtable.py# ...classHashTable:# ...@propertydefcapacity(self):returnlen(self._pairs)def_index(self,key):returnhash(key)%self.capacity
As you introduce new vocabulary to your problem domain, it helps you discover new opportunities for more accurate, unambiguous naming. For example, you’ve seen the wordpairs used interchangeably to refer to both the actual key-value pairs stored in your hash table and an internal list ofslots for pairs. It might be a good opportunity to refactor your code by changing._pairs
to._slots
everywhere.
Then, one of your earlier test cases will communicate its intent more clearly:
# test_hashtable.py# ...deftest_should_create_empty_value_slots():assertHashTable(capacity=3)._slots==[None,None,None]
Perhaps it would make even more sense to rename this test by replacing the wordvalue withpair in it:
# test_hashtable.py# ...deftest_should_create_empty_pair_slots():assertHashTable(capacity=3)._slots==[None,None,None]
You might think that such philosophical musings aren’t necessary. However, the more descriptive your names, the more readable your code will be—if not for others, then certainly for you in the future. There are evenbooks andjokes about it. Your tests are a form of documentation, so it pays off to maintain the same level of attention to detail for them as for your code under test.
Python lets youiterate over a dictionary through the keys, the values, or the key-value pairs known as items. You want the same behavior in your custom hash table, so you start by sketching out a few test cases:
# test_hashtable.py# ...deftest_should_iterate_over_keys(hash_table):forkeyinhash_table.keys:assertkeyin("hola",98.6,False)deftest_should_iterate_over_values(hash_table):forvalueinhash_table.values:assertvaluein("hello",37,True)deftest_should_iterate_over_pairs(hash_table):forkey,valueinhash_table.pairs:assertkeyinhash_table.keysassertvalueinhash_table.values
Iteration by keys, values, or key-value pairs works out of the box with the current implementation because the underlying sets and lists can already handle that. On the other hand, to make instances of yourHashTable
classiterable, you must define another special method, which will let you cooperate directly withfor
loops:
# test_hashtable.py# ...deftest_should_iterate_over_instance(hash_table):forkeyinhash_table:assertkeyin("hola",98.6,False)
Unlike before, you hand over a reference to theHashTable
instance here, but the behavior is the same as if you were iterating over the.keys
property. This behavior is compatible with the built-indict
in Python.
The special method you need,.__iter__()
, must return aniterator object, which the loop uses internally:
# hashtable.py# ...classHashTable:# ...def__iter__(self):yield fromself.keys
This is an example of agenerator iterator, which takes advantage of theyield keyword in Python.
Note: Theyield from
expression delegates the iteration to a sub-generator, which can be another iterable object, such as a list or a set.
Theyield
keyword lets you define an in-place iterator using afunctional style without creating another class. The special method named.__iter__()
gets called by thefor
loop when it starts.
Okay, only a few nonessential features are missing from your hash table at this point.
The next bit that you’re going to implement in this section will make your hash table look pleasant whenprinted onto the standard output. The textual representation of your hash table will look similar to a Pythondict
literal:
# test_hashtable.py# ...deftest_should_use_dict_literal_for_str(hash_table):assertstr(hash_table)in{"{'hola': 'hello', 98.6: 37, False: True}","{'hola': 'hello', False: True, 98.6: 37}","{98.6: 37, 'hola': 'hello', False: True}","{98.6: 37, False: True, 'hola': 'hello'}","{False: True, 'hola': 'hello', 98.6: 37}","{False: True, 98.6: 37, 'hola': 'hello'}",}
The literal uses curly braces, commas, and colons, while keys and values have their respective representations. For example, strings are enclosed in single apostrophes. As you don’t know the exact order of the key-value pairs, you check if the string representation of your hash table conforms to one of the possible pair permutations.
To make your class work with the built-instr()
function, you must implement the corresponding special method inHashTable
:
# hashtable.py# ...classHashTable:# ...def__str__(self):pairs=[]forkey,valueinself.pairs:pairs.append(f"{key!r}:{value!r}")return"{"+", ".join(pairs)+"}"
You iterate over keys and values through the.pairs
property and usef-strings to format the individual pairs. Notice the!r
conversion flag in the template string, which enforces callingrepr()
instead of the defaultstr()
on keys and values. This ensures more explicit representation, which varies between data types. For example, it wraps strings in single apostrophes.
The difference betweenstr()
andrepr()
is subtler. In general, both are meant for converting objects to strings. However, while you can expectstr()
to return human-friendly text,repr()
often returns a valid piece of Python code that you can evaluate to recreate the original object:
>>>fromfractionsimportFraction>>>quarter=Fraction("1/4")>>>str(quarter)'1/4'>>>repr(quarter)'Fraction(1, 4)'>>>eval(repr(quarter))Fraction(1, 4)
In this example, the string representation of aPython fraction is'1/4'
, but the canonical string representation of the same object represents a call to theFraction
class.
You can achieve a similar effect in yourHashTable
class. Unfortunately, your class initializer doesn’t allow for creating new instances out of values at the moment. You’ll address this by introducing aclass method that’ll let you createHashTable
instances from Python dictionaries:
# test_hashtable.py# ...deftest_should_create_hashtable_from_dict():dictionary={"hola":"hello",98.6:37,False:True}hash_table=HashTable.from_dict(dictionary)asserthash_table.capacity==len(dictionary)*10asserthash_table.keys==set(dictionary.keys())asserthash_table.pairs==set(dictionary.items())assertunordered(hash_table.values)==list(dictionary.values())
This plays nicely with your earlier conversion, which was in the opposite direction. However, you’ll need to assume a sufficiently large capacity for the hash table to hold all the key-value pairs from the original dictionary. A reasonable estimate seems to be ten times the number of pairs. You can hard-code it for now:
# hashtable.py# ...classHashTable:@classmethoddeffrom_dict(cls,dictionary):hash_table=cls(len(dictionary)*10)forkey,valueindictionary.items():hash_table[key]=valuereturnhash_table
You create a new hash table and set its capacity using an arbitrary factor. Then, you insert key-value pairs by copying them from the dictionary passed as an argument to the method. You can allow for overriding the default capacity if you want to, so add a similar test case:
# test_hashtable.py# ...deftest_should_create_hashtable_from_dict_with_custom_capacity():dictionary={"hola":"hello",98.6:37,False:True}hash_table=HashTable.from_dict(dictionary,capacity=100)asserthash_table.capacity==100asserthash_table.keys==set(dictionary.keys())asserthash_table.pairs==set(dictionary.items())assertunordered(hash_table.values)==list(dictionary.values())
To make the capacity optional, you may take advantage of theshort-circuit evaluation of Boolean expressions:
# hashtable.py# ...classHashTable:@classmethoddeffrom_dict(cls,dictionary,capacity=None):hash_table=cls(capacityorlen(dictionary)*10)forkey,valueindictionary.items():hash_table[key]=valuereturnhash_table
If thecapacity
isn’t specified, then you fall back to the default behavior, which multiplies the dictionary’s length by ten. With this, you’re finally able to provide a canonical string representation for yourHashTable
instances:
# test_hashtable.py# ...deftest_should_have_canonical_string_representation(hash_table):assertrepr(hash_table)in{"HashTable.from_dict({'hola': 'hello', 98.6: 37, False: True})","HashTable.from_dict({'hola': 'hello', False: True, 98.6: 37})","HashTable.from_dict({98.6: 37, 'hola': 'hello', False: True})","HashTable.from_dict({98.6: 37, False: True, 'hola': 'hello'})","HashTable.from_dict({False: True, 'hola': 'hello', 98.6: 37})","HashTable.from_dict({False: True, 98.6: 37, 'hola': 'hello'})",}
As before, you check for all the possible attribute order permutations in the resulting representation. The canonical string representation of a hash table looks mostly the same as withstr()
, but it also wraps the dict literal in a call to your new.from_dict()
class method. The corresponding implementation delegates to the special method.__str__()
by calling the built-instr()
function:
# hashtable.py# ...classHashTable:# ...def__repr__(self):cls=self.__class__.__name__returnf"{cls}.from_dict({str(self)})"
Note that you don’t hard-code the class name, to avoid issues if you ever choose to rename your class at a later time.
Your hash table prototype is almost ready, as you’ve implemented nearly all core and nonessential features from the list mentioned in the introduction to this section. You’ve just added the ability to create a hash table from a Pythondict
and provided string representations for its instances. The last and final bit is ensuring that hash table instances can be compared by value.
Hash tables are like sets in the sense that they don’t impose any specific order for their contents. In fact, hash tables and sets have more in common than you might think, as they’re both backed by a hash function. It’s the hash function that makes the key-value pairs in a hash table unordered. However, remember that starting from Python 3.6,dict
does indeed retain insertion order as an implementation detail.
Two hash tables should compare equal if and only if they have the same set of key-value pairs. However, Python compares objectidentities by default because it doesn’t know how to interpretvalues of custom data types. So, two instances of your hash table would always compare unequal even if they shared the same set of key-value pairs.
To fix this, you can implement the equality test operator (==
) by providing the special.__eq__()
method in yourHashTable
class. Additionally, Python will call this method for you to evaluate the not equal operator (!=
) unless you also implement.__ne__()
explicitly.
You want the hash table to be equal to itself, its copy, or another instance with the same key-value pairs regardless of their order. Conversely, a hash table shouldnot be equal to an instance with a different set of key-value pairs or a completely different data type:
# test_hashtable.py# ...deftest_should_compare_equal_to_itself(hash_table):asserthash_table==hash_tabledeftest_should_compare_equal_to_copy(hash_table):asserthash_tableisnothash_table.copy()asserthash_table==hash_table.copy()deftest_should_compare_equal_different_key_value_order(hash_table):h1=HashTable.from_dict({"a":1,"b":2,"c":3})h2=HashTable.from_dict({"b":2,"a":1,"c":3})asserth1==h2deftest_should_compare_unequal(hash_table):other=HashTable.from_dict({"different":"value"})asserthash_table!=otherdeftest_should_compare_unequal_another_data_type(hash_table):asserthash_table!=42
You use.from_dict()
, introduced in the previous subsection, to quickly populate new hash tables with values. You can take advantage of the same class method to make a new copy of a hash table instance. Here’s the code that satisfies these test cases:
# hashtable.py# ...classHashTable:# ...def__eq__(self,other):ifselfisother:returnTrueiftype(self)isnottype(other):returnFalsereturnset(self.pairs)==set(other.pairs)# ...defcopy(self):returnHashTable.from_dict(dict(self.pairs),self.capacity)
The special method.__eq__()
takes some object to compare as an argument. If that object happens to be the current instance ofHashTable
, then you returnTrue
because the same identity implies value equality. Otherwise, you proceed by comparing the types and the sets of key-value pairs. The conversion to sets helps make the ordering irrelevant even if you decide to turn.pairs
into another ordered list in the future.
By the way, the resulting copy should have not only the same key-value pairs but also the same capacity as the source hash table. At the same time, two hash tables with different capacities should still compare equal:
# test_hashtable.py# ...deftest_should_copy_keys_values_pairs_capacity(hash_table):copy=hash_table.copy()assertcopyisnothash_tableassertset(hash_table.keys)==set(copy.keys)assertunordered(hash_table.values)==copy.valuesassertset(hash_table.pairs)==set(copy.pairs)asserthash_table.capacity==copy.capacitydeftest_should_compare_equal_different_capacity():data={"a":1,"b":2,"c":3}h1=HashTable.from_dict(data,capacity=50)h2=HashTable.from_dict(data,capacity=100)asserth1==h2
These two tests will work with your currentHashTable
implementation, so you don’t need to code anything extra.
Your custom hash table prototype is still missing a couple of nonessential features that the built-in dictionary provides. You may try adding them yourself as an exercise. For example, you could replicate other methods from Python dictionaries, such asdict.clear()
ordict.update()
. Other than that, you might implement one of thebitwise operators supported bydict
since Python 3.9, which allow for aunion operation.
Well done! This is the finished test suite for the tutorial, and your hash table has passed all unit tests. Give yourself a well-deserved pat on the back because that was a heck of a great job. Or was it?
Suppose you reduce the capacity of your hash table to only account for the inserted pairs. The following hash table should accommodate all three key-value pairs stored in the source dictionary:
>>>fromhashtableimportHashTable>>>source={"hola":"hello",98.6:37,False:True}>>>hash_table=HashTable.from_dict(source,capacity=len(source))>>>str(hash_table)'{False: True}'
However, when you reveal the keys and values of the resulting hash table, you’ll sometimes find that there are fewer items. To make this code snippet repeatable, run it with hash randomization disabled by setting thePYTHONHASHSEED
environment variable to0
.
More often than not, you’ll end up losing information even though there’s enough space available in the hash table. That’s because most hash functions aren’t perfect, causing hash collisions, which you’ll learn how to resolve next.
In this section, you’ll explore the two most common strategies for dealing with hash code collisions, and you’ll implement them in yourHashTable
class. If you’d like to follow the examples below, then don’t forget to set thePYTHONHASHSEED
variable to0
to disable Python’s hash randomization.
By now, you should already have a good idea of what a hash collision is. It’s when two keys produce identical hash code, leading distinct values to collide with each other at the same index in the hash table. For example, in a hash table with one hundred slots, the keys"easy"
and"difficult"
happen to share the same index when hash randomization is disabled:
>>>hash("easy")%10043>>>hash("difficult")%10043
As of now, your hash table can’t detect attempts at storing different values in the same slot:
>>>fromhashtableimportHashTable>>>hash_table=HashTable(capacity=100)>>>hash_table["easy"]="Requires little effort">>>hash_table["difficult"]="Needs much skill">>>print(hash_table){'difficult': 'Needs much skill'}>>>hash_table["easy"]'Needs much skill'
Not only do you end up overwriting the pair identified by the"easy"
key, but even worse, retrieving the value of this now-nonexistent key gives you an incorrect answer!
There are three strategies available to you for addressing hash collisions:
Strategy | Description |
---|---|
Perfect Hashing | Choose aperfect hash function to avoid hash collisions in the first place. |
Open Addressing | Spread the collided values in a predictable way that lets you retrieve them later. |
Closed Addressing | Keep the collided values in a separate data structure to search through. |
While perfect hashing is only possible when you know all the values up front, the other twohash collision resolution methods are more practical, so you’ll take a closer look at them in this tutorial. Note thatopen addressing can be represented by several specific algorithms, including:
In contrast,closed addressing is best known forseparate chaining. Additionally, there’s alsocoalesced hashing, which combines the ideas behind both open and closed addressing into one algorithm.
To follow test-driven development, you’re going to need to design a test case first. But how do you test a hash collision? Python’s built-in function uses hash randomization for some of its data types by default, which makes predicting its behavior extremely difficult. Choosing a hash seed manually with thePYTHONHASHSEED
environment variable would be impractical and make your test cases fragile.
The best way to work around this is using amocking library, such as Python’sunittest.mock
:
fromunittest.mockimportpatch@patch("builtins.hash",return_value=42)deftest_should_detect_hash_collision(hash_mock):asserthash("foobar")==42
Patching temporarily replaces one object with another. For example, you can substitute the built-inhash()
function with a fake one that always returns the same expected value, making your tests repeatable. This substitution only has an effect during the function call, after which the originalhash()
is brought back again.
You can apply the@patch
decorator against your whole test function or limit the mock object’s scope with a context manager:
fromunittest.mockimportpatchdeftest_should_detect_hash_collision():asserthash("foobar")notin[1,2,3]withpatch("builtins.hash",side_effect=[1,2,3]):asserthash("foobar")==1asserthash("foobar")==2asserthash("foobar")==3
With a context manager, you can access the built-inhash()
function as well as its mocked version within the same test case. You could even have multiple flavors of the mocked function if you wanted to. Theside_effect
parameter lets you specify an exception to raise or a sequence of values that your mock object will return on consecutive invocations.
In the remaining part of this tutorial, you’ll continue adding more features to yourHashTable
class without strictly following test-driven development. New tests will be omitted for brevity, while modifying the class will cause some of the existing test to fail. However, you’ll find a working test suite in the accompanying materials available for download.
Stop for a moment to understand the theory behind hash code collisions. When it comes to handling them,linear probing is one of the oldest, most straightforward, and most effective techniques, all at the same time. It requires a few extra steps for theinsert,lookup,delete, andupdate operations, which you’re about to learn.
Consider a sample hash table representing thePython glossary with common acronyms. It has a capacity of ten slots in total, but four of them are already taken by the following key-value pairs:
Index | Key | Value |
---|---|---|
0 | ||
1 | BDFL | Benevolent Dictator For Life |
2 | ||
3 | REPL | Read–Evaluate–Print Loop |
4 | ||
5 | ||
6 | ||
7 | ||
8 | PEP | Python Enhancement Proposal |
9 | WSGI | Web Server Gateway Interface |
Now you want to put another term into your glossary to define theMRO acronym, which stands formethod resolution order. You calculate the hash code of the key and truncate it with the modulo operator to get an index between zero and nine:
>>>hash("MRO")8199632715222240545>>>hash("MRO")%105
Instead of using the modulo operator, you could alternatively truncate the hash code with a properbitmask, which is how Python’sdict
works internally.
Note: To get consistent hash codes, set thePYTHONHASHSEED
environment variable to0
to disable hash randomization.
Great! There’s a free slot in your hash table at index five, where you can insert a new key-value pair:
Index | Key | Value |
---|---|---|
0 | ||
1 | BDFL | Benevolent Dictator For Life |
2 | ||
3 | REPL | Read–Evaluate–Print Loop |
4 | ||
5 | MRO | Method Resolution Order |
6 | ||
7 | ||
8 | PEP | Python Enhancement Proposal |
9 | WSGI | Web Server Gateway Interface |
So far, so good. Your hash table is still 50 percent free space, so you keep adding more terms until you try inserting theEAFP acronym. It turns out the hash code of EAFP truncates to one, which is the index of a slot occupied by theBDFL term:
>>>hash("EAFP")-159847004290706279>>>hash("BDFL")-6396413444439073719>>>hash("EAFP")%101>>>hash("BDFL")%101
The likelihood of hashing two different keys into colliding hash codes is relatively small. However, projecting those hash codes onto a small range of array indices is a different story. With linear probing, you can detect and mitigate such collisions by storing the collided key-value pairs next to each other:
Index | Key | Value |
---|---|---|
0 | ||
1 | BDFL | Benevolent Dictator For Life |
2 | EAFP | Easier To Ask For Forgiveness Than Permission |
3 | REPL | Read–Evaluate–Print Loop |
4 | ||
5 | MRO | Method Resolution Order |
6 | ||
7 | ||
8 | PEP | Python Enhancement Proposal |
9 | WSGI | Web Server Gateway Interface |
Although BDFL and EAFP keys give the same index equal to one, only the first inserted key-value pair winds up taking it. Whichever pair comes second will be placed next to the occupied index. Therefore, linear probing makes the hash table sensitive to the insertion order.
Note: When you use linear probing or other hash collision resolution methods, then you can’t rely merely on the hash code to find the corresponding slot. You also need to compare the keys.
Consider adding another acronym,ABC, forabstract base classes, whose hash code truncates to index eight. You can’t insert it in the following position this time because it’s already taken by WSGI. Under normal circumstances, you’d continue looking for a free slot further down, but because you reached the last index, you must wrap around and insert the new acronym at index zero:
Index | Key | Value |
---|---|---|
0 | ABC | Abstract Base Class |
1 | BDFL | Benevolent Dictator For Life |
2 | EAFP | Easier To Ask For Forgiveness Than Permission |
3 | REPL | Read–Evaluate–Print Loop |
4 | ||
5 | MRO | Method Resolution Order |
6 | ||
7 | ||
8 | PEP | Python Enhancement Proposal |
9 | WSGI | Web Server Gateway Interface |
Tosearch for key-value pairs stuffed into the hash table like this, follow a similar algorithm. Start by looking at the expected index first. For example, to find the value associated with the ABC key, calculate its hash code and map it to an index:
>>>hash("ABC")-4164790459813301872>>>hash("ABC")%108
There’s a key-value pair stored at index eight, but it has a different key equal to PEP, so you skip it by increasing the index. Again, that slot is occupied by an unrelated term, WSGI, so you bounce off and wrap around to finally find your pair with a matching key at index zero. That’s your answer.
In general, there are three possible stopping conditions for the search operation:
The last point makesdeleting an existing key-value pair more tricky. If you just removed an entry from the hash table, then you’d introduce a blank slot, which would stop the lookup there regardless of any previous collisions. To make the collided key-value pairs reachable again, you’d have to either rehash them or use alazy deletion strategy.
The latter is less difficult to implement but has the additional cost of increasing the number of necessary lookup steps. Essentially, rather than deleting a key-value pair, you replace it with a sentinel value, depicted by a red cross (❌) below, which makes finding entries that had previously collided possible. Say you wanted to delete the BDFL and PEP terms:
Index | Key | Value |
---|---|---|
0 | ABC | Abstract Base Class |
1 | ❌ | |
2 | EAFP | Easier To Ask For Forgiveness Than Permission |
3 | REPL | Read–Evaluate–Print Loop |
4 | ||
5 | MRO | Method Resolution Order |
6 | ||
7 | ||
8 | ❌ | |
9 | WSGI | Web Server Gateway Interface |
You’ve replaced the corresponding key-value pairs with two instances of the sentinel value. Later, when you look for the ABC key, for example, you bounce off the sentinel at index eight, then continue to WSGI, and finally arrive at index zero with the matching key. Without one of the sentinels, you’d stop the search much earlier, falsely concluding there’s no such key.
Note: Your hash table’s capacity remains unaffected because you’re free to overwrite sentinels when inserting new key-value pairs. On the other hand, if you were to fill up the hash table and delete most of its elements, then you’d practically end up with thelinear search algorithm.
So far, you’ve learned about insertion, deletion, and lookup. However, there’s one catch aboutupdating the value of an existing entry in a hash table with linear probing. When searching for a pair to update, you should only skip the slot if it’s occupied by another pair with a different key or if it contains a sentinel value. On the other hand, if the slot is empty or has a matching key, then you should set the new value.
In the next subsection, you’re going to modify yourHashTable
class to use linear probing for hash collision resolution.
HashTable
ClassAfter a brief detour into linear probing theory, you’re back to coding now. Because linear probing will be used in all four basicCRUD operations in the hash table, it helps to write a helper method in your class to encapsulate the logic of visiting the hash table’s slots:
# hashtable.py# ...classHashTable:# ...def_probe(self,key):index=self._index(key)for_inrange(self.capacity):yieldindex,self._slots[index]index=(index+1)%self.capacity
Given a key, you start by using the corresponding hash code to find its expected index in the hash table. Then, you loop through all the available slots in the hash table, starting from the calculated index. At each step, you return the current index and the associated pair, which might be empty or marked as deleted. Afterward, you increase the index, wrapping around the origin if necessary.
Next, you can rewrite your.__setitem__()
method. Don’t forget about the need for a new sentinel value. This value will let you differentiate between slots that have never been occupied and those that had collided before but are now deleted:
# hashtable.pyDELETED=object()# ...classHashTable:# ...def__setitem__(self,key,value):forindex,pairinself._probe(key):ifpairisDELETED:continueifpairisNoneorpair.key==key:self._slots[index]=Pair(key,value)breakelse:raiseMemoryError("Not enough capacity")
If the slot is empty or contains a pair with the matching key, then you reassign a new key-value pair at the current index and stop the linear probing. Otherwise, if another pair occupies that slot and has a different key, or the slot is marked as deleted, then you move forward until you find a free slot or exhaust all the available slots. If you run out of available slots, you raise aMemoryError
exception to indicate the hash table’s insufficient capacity.
Note: Coincidentally, the.__setitem__()
method also covers updating the value of an existing pair. Because pairs are represented by immutable tuples, you replace the entire pair with the matching key and not just its value component.
Getting and deleting key-value pairs from a hash table using linear probing works almost the same way:
# hashtable.pyDELETED=object()# ...classHashTable:# ...def__getitem__(self,key):for_,pairinself._probe(key):ifpairisNone:raiseKeyError(key)ifpairisDELETED:continueifpair.key==key:returnpair.valueraiseKeyError(key)def__delitem__(self,key):forindex,pairinself._probe(key):ifpairisNone:raiseKeyError(key)ifpairisDELETED:continueifpair.key==key:self._slots[index]=DELETEDbreakelse:raiseKeyError(key)
The only difference is in the highlighted lines. To delete a pair, you must know where it’s located in the hash table to replace it with the sentinel value. On the other hand, you’re only interested in the corresponding value when searching by key. If this code duplication bothers you, then you can try refactoring it as an exercise. However, having it written explicitly helps make the point.
Note: This is a textbook implementation of the hash table, which probes elements by comparing keys using the equality test operator (==
). However, it’s a potentially costly operation, which real-life implementations avoid by storing the hash code, along with keys and values, in triplets rather than pairs. Hash codes are cheap to compare, on the other hand.
You can leverage thehash-equal contract to speed things up. If two hash codes are different, then they’re guaranteed to stem from different keys, so there’s no need to perform a costly equality test in the first place. This trick dramatically reduces the number of key comparisons.
There’s yet another important detail to note. The hash table’s slots can no longer be in one of only two states—that is, empty or occupied. Inserting the sentinel value into the hash table to mark a slot as deleted messes up the hash table’s.pairs
,.keys
, and.values
properties and reports the length incorrectly. To fix this, you have to filter out bothNone
andDELETED
values when returning key-value pairs:
# hashtable.py# ...classHashTable:# ...@propertydefpairs(self):return{pairforpairinself._slotsifpairnotin(None,DELETED)}
With that little update, your hash table should now be able to deal with hash collisions by spreading the collided pairs and looking for them in a linear fashion:
>>>fromunittest.mockimportpatch>>>fromhashtableimportHashTable>>>withpatch("builtins.hash",return_value=24):...hash_table=HashTable(capacity=100)...hash_table["easy"]="Requires little effort"...hash_table["difficult"]="Needs much skill">>>hash_table._slots[24]Pair(key='easy', value='Requires little effort')>>>hash_table._slots[25]Pair(key='difficult', value='Needs much skill')
Despite having the same hash code equal to twenty-four, both collided keys,"easy"
and"difficult"
, appear next to each other on the._slots
list. Notice they’re listed in the same order in which you added them to the hash table. Try swapping the insertion order or changing the hash table’s capacity and observe how that affects the slots.
As of now, the hash table’s capacity remains fixed. Before implementing linear probing, you remained oblivious and kept overwriting collided values. Now, you can detect when there’s no more space in your hash table and raise a corresponding exception. However, wouldn’t it be better to let the hash table dynamically scale its capacity as necessary?
There are two different strategies when it comes to resizing a hash table. You can wait until the very last moment and only resize the hash table when it becomes full, or you can do so eagerly after reaching a certain threshold. Both ways have their pros and cons. Thelazy strategy is arguably more straightforward to implement, so you’ll take a closer look at it first. However, it can lead to more collisions and worse performance.
The only time you absolutely have to increase the number of slots in your hash table is when the insertion of a new pair fails, raising theMemoryError
exception. Go ahead and replace theraise
statement with a call to another helper method that you’ll create, followed by a recursive call to.__setitem__()
through the square brackets syntax:
# hashtable.py# ...classHashTable:# ...def__setitem__(self,key,value):forindex,pairinself._probe(key):ifpairisDELETED:continueifpairisNoneorpair.key==key:self._slots[index]=Pair(key,value)breakelse:self._resize_and_rehash()self[key]=value
When you determine that all slots are occupied, by either a legitimate pair or the sentinel value, you must allocate more memory, copy the existing pairs, and try inserting that new key-value pair again.
Note: Putting your old key-value pairs into a bigger hash table will make them hash to entirely different slots. Rehashing takes advantage of the extra slots that you just created, reducing the number of collisions in the new hash table. Moreover, it saves space by reclaiming slots that used to be marked as deleted with the sentinel value. You don’t need to worry about past collisions, because the key-value pairs will find new slots anyway.
Now, implement resizing and rehashing in the following way:
# hashtable.py# ...classHashTable:# ...def_resize_and_rehash(self):copy=HashTable(capacity=self.capacity*2)forkey,valueinself.pairs:copy[key]=valueself._slots=copy._slots
Create a local copy of the hash table. Because it’s difficult to predict how many more slots you may need, take a wild guess and double the capacity size. Then, iterate over your existing set of key-value pairs and insert them into the copy. Finally, reassign the._slots
attribute in your instance so that it points to the enlarged list of slots.
Your hash table can now dynamically increase its size when needed, so give it a spin. Create an empty hash table with a capacity of one and try inserting some key-value pairs into it:
>>>fromhashtableimportHashTable>>>hash_table=HashTable(capacity=1)>>>foriinrange(20):...num_pairs=len(hash_table)...num_empty=hash_table.capacity-num_pairs...print(...f"{num_pairs:>2}/{hash_table.capacity:>2}",...("X"*num_pairs)+("."*num_empty)...)...hash_table[i]=i... 0/ 1 . 1/ 1 X 2/ 2 XX 3/ 4 XXX. 4/ 4 XXXX 5/ 8 XXXXX... 6/ 8 XXXXXX.. 7/ 8 XXXXXXX. 8/ 8 XXXXXXXX 9/16 XXXXXXXXX.......10/16 XXXXXXXXXX......11/16 XXXXXXXXXXX.....12/16 XXXXXXXXXXXX....13/16 XXXXXXXXXXXXX...14/16 XXXXXXXXXXXXXX..15/16 XXXXXXXXXXXXXXX.16/16 XXXXXXXXXXXXXXXX17/32 XXXXXXXXXXXXXXXXX...............18/32 XXXXXXXXXXXXXXXXXX..............19/32 XXXXXXXXXXXXXXXXXXX.............
>>>fromhashtableimportHashTable>>>hash_table=HashTable(capacity=1)>>>foriinrange(20):...num_pairs=len(hash_table)...num_empty=hash_table.capacity-num_pairs...print(...f"{num_pairs:>2}/{hash_table.capacity:>2}",...("▣"*num_pairs)+("□"*num_empty)...)...hash_table[i]=i... 0/ 1 □ 1/ 1 ▣ 2/ 2 ▣▣ 3/ 4 ▣▣▣□ 4/ 4 ▣▣▣▣ 5/ 8 ▣▣▣▣▣□□□ 6/ 8 ▣▣▣▣▣▣□□ 7/ 8 ▣▣▣▣▣▣▣□ 8/ 8 ▣▣▣▣▣▣▣▣ 9/16 ▣▣▣▣▣▣▣▣▣□□□□□□□10/16 ▣▣▣▣▣▣▣▣▣▣□□□□□□11/16 ▣▣▣▣▣▣▣▣▣▣▣□□□□□12/16 ▣▣▣▣▣▣▣▣▣▣▣▣□□□□13/16 ▣▣▣▣▣▣▣▣▣▣▣▣▣□□□14/16 ▣▣▣▣▣▣▣▣▣▣▣▣▣▣□□15/16 ▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣□16/16 ▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣17/32 ▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣□□□□□□□□□□□□□□□18/32 ▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣□□□□□□□□□□□□□□19/32 ▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣▣□□□□□□□□□□□□□
In successfully inserting twenty key-value pairs, you never got any errors. With this rough depiction, you can clearly see the doubling of slots, which takes place when the hash table becomes full and needs more slots.
Thanks to the automatic resizing that you’ve just implemented, you can assume adefault capacity for new hash tables. This way, creating an instance of theHashTable
class will no longer require specifying the initial capacity, although doing so could improve performance. A common choice for the initial capacity is a small power of two, such as eight:
# hashtable.py# ...classHashTable:@classmethoddeffrom_dict(cls,dictionary,capacity=None):hash_table=cls(capacityorlen(dictionary))forkey,valueindictionary.items():hash_table[key]=valuereturnhash_tabledef__init__(self,capacity=8):ifcapacity<1:raiseValueError("Capacity must be a positive number")self._slots=capacity*[None]# ...
That makes it possible to create hash tables with a call to the parameterless initializerHashTable()
. Note that you can also update your class methodHashTable.from_dict()
to use the dictionary’s length as the initial capacity. Previously, you multiplied the dictionary’s length by an arbitrary factor, which was necessary to make your tests pass, due to unhandled hash collisions.
As stated before, there’s one problem with the lazy resizing strategy, and that’s the increased likelihood of collisions. You’re going to address that next.
Waiting until your hash table becomes saturated isn’t optimal. You can try aneager strategy to resize the hash table before reaching its total capacity, keeping the collisions from happening in the first place. How do you decide on the best moment to resize and rehash? Use the load factor!
Theload factor is the ratio of the number of currently occupied slots, including the deleted ones, to all slots in the hash table. The higher the load factor, the bigger the chance of a hash collision, which results in worse lookup performance. Therefore, you want the load factor to remain relatively small at all times. Bumping up the hash table’s size is due whenever the load factor reaches a certain threshold.
The choice of a particular threshold is a classic example of thespace–time trade-off in computer science. More frequent hash table resizing is cheaper and leads to better performance at the cost of more memory consumption. Conversely, waiting longer can save you some memory, but the key lookups will be slower. The chart below depicts the relationship between the amount of allocated memory and the average number of collisions:
The data behind this chart measures the average number of collisions caused by inserting one hundred elements into an empty hash table with an initial capacity of one. The measurement was repeated many times for various load factor thresholds, at which the hash table resized itself in discrete jumps by doubling its capacity.
The intersection of both plots, which appears at around 0.75, indicates the threshold’ssweet spot, with the lowest amount of memory and number of collisions. Using a higher load factor threshold doesn’t provide significant memory savings, but it increases the number of collisions exponentially. A smaller threshold improves the performance but for a high price of mostly wasted memory. Remember that all you really need is one hundred slots!
You can experiment with different load factor thresholds, but resizing the hash table when 60 percent of its slots are taken might be a good starting point. Here’s how you can implement the load factor calculation in yourHashTable
class:
# hashtable.py# ...classHashTable:# ...@propertydefload_factor(self):occupied_or_deleted=[slotforslotinself._slotsifslot]returnlen(occupied_or_deleted)/self.capacity
You start by filtering slots that are truthy, which would be anything butNone
, and then take the ratio according to the load factor’s definition. Note that if you decide to use a comprehension expression, then itmust be a list comprehension to count all sentinel value occurrences. In this case, using a set comprehension would filter out the repeated markers of deleted pairs, leaving only one instance and resulting in a wrong load factor.
Next, modify your class to accept an optionalload factor threshold and use it to eagerly resize and rehash the slots:
# hashtable.py# ...classHashTable:# ...def__init__(self,capacity=8,load_factor_threshold=0.6):ifcapacity<1:raiseValueError("Capacity must be a positive number")ifnot(0<load_factor_threshold<=1):raiseValueError("Load factor must be a number between (0, 1]")self._slots=capacity*[None]self._load_factor_threshold=load_factor_threshold# ...def__setitem__(self,key,value):ifself.load_factor>=self._load_factor_threshold:self._resize_and_rehash()forindex,pairinself._probe(key):ifpairisDELETED:continueifpairisNoneorpair.key==key:self._slots[index]=Pair(key,value)break
The load factor threshold defaults to 0.6, which means 60 percent of all slots are occupied. You use a weak inequality (>=
) instead of a strict one (>
) to account for the load factor threshold at its maximum value, which can never be greater than one. If the load factor equals one, then you must also resize the hash table before inserting another key-value pair.
Brilliant! Your hash table has just become a bit faster. That concludes the open addressing example in this tutorial. Next up, you’re going to resolve hash collisions using one of the most popular closed addressing techniques.
Separate chaining is another extremely popular hash collision resolution method, perhaps even more widespread than linear probing. The idea is to group similar items by a common feature into so-calledbuckets to narrow down the search space. For example, you could imagine harvesting fruits and collecting them into color-coded baskets:
Each basket contains fruits of roughly the same color. So, when you’re craving an apple, for example, you only need to search through the basket labeled with red. In an ideal world, each basket should contain no more than one element, making the search instantaneous. You can think of the labels as hash codes and the fruits with the same color as the collided key-value pairs.
A hash table based on separate chaining is a list of references to buckets, typically implemented aslinked lists that formchains of elements:
Each linked list contains key-value pairs whose keys share the same hash code due to a collision. When looking for a value by key, you need to locate the right bucket first, then traverse it usinglinear search to find the matching key, and finally return the corresponding value. Linear search just means going through each item in the bucket, one by one, until you find the right key.
Note: Linked lists’ elements have a small memory overhead because everynode contains a reference to the next element. At the same time, such a memory layout makes appending and removing elements very quick compared to a regular array.
To adapt yourHashTable
class to use separate chaining, start by removing the._probe()
method and replacing slots with buckets. Now, instead of having aNone
value or a pair at each index, you’ll make each index hold a bucket that might be empty or not. Each bucket will be a linked list:
# hashtable.pyfromcollectionsimportdeque# ...classHashTable:# ...def__init__(self,capacity=8,load_factor_threshold=0.6):ifcapacity<1:raiseValueError("Capacity must be a positive number")ifnot(0<load_factor_threshold<=1):raiseValueError("Load factor must be a number between (0, 1]")self._buckets=[deque()for_inrange(capacity)]self._load_factor_threshold=load_factor_threshold# ...def_resize_and_rehash(self):copy=HashTable(capacity=self.capacity*2)forkey,valueinself.pairs:copy[key]=valueself._buckets=copy._buckets
Instead of implementing a linked list from scratch, you may take advantage of Python’s double-ended queue, ordeque, available in thecollections
module, which uses adoubly-linked list under the hood. It lets you append and remove elements more efficiently than a plain list.
Don’t forget to update the.pairs
,.capacity
, and.load_factor
properties so that they refer to buckets instead of slots:
# hashtable.py# ...classHashTable:# ...@propertydefpairs(self):return{pairforbucketinself._bucketsforpairinbucket}# ...@propertydefcapacity(self):returnlen(self._buckets)# ...@propertydefload_factor(self):returnlen(self)/self.capacity
You no longer need the sentinel value to mark elements as deleted, so go ahead and remove theDELETED
constant. This makes the key-value pairs and the load factor’s definitions more straightforward. The capacity is synonymous with the number of buckets because you want to keep at most one key-value pair in each bucket, minimizing the number of hash code collisions.
Note: The load factor defined like this can become greater than one when the number of key-value pairs stored in the hash table exceeds the number of buckets.
By the way, allowing too many collisions will effectively degenerate your hash table into a flat list with lineartime complexity, significantly degrading its performance. Attackers might take advantage of this fact by artificially creating as many collisions as possible.
With separate chaining, all basic hash table operations boil down to finding the right bucket and searching through it, which makes the corresponding methods look similar:
# hashtable.py# ...classHashTable:# ...def__delitem__(self,key):bucket=self._buckets[self._index(key)]forindex,pairinenumerate(bucket):ifpair.key==key:delbucket[index]breakelse:raiseKeyError(key)def__setitem__(self,key,value):ifself.load_factor>=self._load_factor_threshold:self._resize_and_rehash()bucket=self._buckets[self._index(key)]forindex,pairinenumerate(bucket):ifpair.key==key:bucket[index]=Pair(key,value)breakelse:bucket.append(Pair(key,value))def__getitem__(self,key):bucket=self._buckets[self._index(key)]forpairinbucket:ifpair.key==key:returnpair.valueraiseKeyError(key)
Deque instances take care of updating their internal references when you delete an item by index. If you had used a custom linked list, you’d have to rewire the collided key-value pairs manually after each modification. As before, updating an existing key-value pair requires replacing the old one with a brand-new one because key-value pairs are immutable.
If you’d like to avoid repeating yourself, then try refactoring the three methods above usingstructural pattern matching, introduced in Python 3.10. You’ll find one possible solution in the accompanying materials.
Okay, you know how to cope with hash code collisions, and you’re now ready to move on. Next up, you’ll make your hash table return keys and values in their insertion order.
Because the classic hash table data structure useshashing to spread the keys uniformly and sometimes pseudo-randomly, it can’t guarantee their order. As a rule of thumb, you shouldnever assume that hash table elements will come in a consistent order when you request them. But it may sometimes be useful or even necessary to impose a specific order on your elements.
Up until Python 3.6, the only way to enforce order on dictionary elements was to use theOrderedDict
wrapper from the standard library. Later, the built-indict
data type started preserving theinsertion order of the key-value pairs. Regardless, it may still be wise to assume a lack of element order to make your code compatible with older or alternative Python versions.
How can you replicate a similar insertion order preservation in your customHashTable
class? One way could be to remember the sequence of keys as you insert them and iterate over that sequence to return keys, values, and pairs. Start by declaring another internal field in your class:
# hashtable.py# ...classHashTable:# ...def__init__(self,capacity=8,load_factor_threshold=0.6):ifcapacity<1:raiseValueError("Capacity must be a positive number")ifnot(0<load_factor_threshold<=1):raiseValueError("Load factor must be a number between (0, 1]")self._keys=[]self._buckets=[deque()for_inrange(capacity)]self._load_factor_threshold=load_factor_threshold
It’s an empty list of keys that’ll grow and shrink as you modify the hash table’s contents:
# hashtable.py# ...classHashTable:# ...def__delitem__(self,key):bucket=self._buckets[self._index(key)]forindex,pairinenumerate(bucket):ifpair.key==key:delbucket[index]self._keys.remove(key)breakelse:raiseKeyError(key)def__setitem__(self,key,value):ifself.load_factor>=self._load_factor_threshold:self._resize_and_rehash()bucket=self._buckets[self._index(key)]forindex,pairinenumerate(bucket):ifpair.key==key:bucket[index]=Pair(key,value)breakelse:bucket.append(Pair(key,value))self._keys.append(key)
You remove the key when there’s no longer an associated key-value pair in the hash table. On the other hand, you only add a key when you insert a brand-new key-value pair into the hash table for the first time. You don’t want to insert a key when you do an update, because that would result in multiple copies of the same key.
Next, you can change the three properties.keys
,.values
, and.pairs
so that they follow the same order of the inserted keys:
# hashtable.py# ...classHashTable:# ...@propertydefkeys(self):returnself._keys.copy()@propertydefvalues(self):return[self[key]forkeyinself.keys]@propertydefpairs(self):return[(key,self[key])forkeyinself.keys]
Make sure to return a copy of your keys to avoid leaking the internal implementation. Also, note that all of these three properties now return lists instead of sets because you want them to retain the right order. In turn, this lets youzip keys and values to make pairs:
>>>fromhashtableimportHashTable>>>hash_table=HashTable.from_dict({..."hola":"hello",...98.6:37,...False:True...})>>>hash_table.keys['hola', 98.6, False]>>>hash_table.values['hello', 37, True]>>>hash_table.pairs[('hola', 'hello'), (98.6, 37), (False, True)]>>>hash_table.pairs==list(zip(hash_table.keys,hash_table.values))True
Keys and values are always returned in the same order so that, for example, the first key and the first value map to each other. This means that you can zip them like in the example above.
Now you know how to keep the insertion order of the key-value pairs in your hash table. On the other hand, if you want to sort them by more advanced criteria, then you can follow the same techniques as withsorting a built-in dictionary. There’s no additional code to write at this point.
Your hash table is complete and fully functional now. It can map arbitrary keys to values using the built-inhash()
function. It can detect and resolve hash code collisions and even retain the insertion order of the key-value pairs. You could theoretically use it over Pythondict
if you wanted to, without noticing much difference apart from the poor performance and occasionally more verbose syntax.
Note: As mentioned before, you should rarely need to implement a data structure such as a hash table yourself. Python comes with many usefulcollections that have unparalleled performance and are tested in the field by countless developers. For specialized data structures, you should checkPyPI for third-party libraries before attempting to make one of your own. You’ll save yourself a lot of time and significantly reduce the risk of bugs.
Until now, you’ve taken it for granted that most built-in types in Python can work as hash table keys. However, to use any hash table implementation in practice, you’ll have to restrict keys to only hashable types and understand the implications of doing so. That’ll be especially helpful when you decide to bring custom data types into the equation.
You learned earlier that some data types, including most primitive data types in Python, arehashable, while others aren’t. The primary trait of hashability is the ability to calculate the hash code of a given object:
>>>hash(frozenset(range(10)))3895031357313128696>>>hash(set(range(10)))Traceback (most recent call last): File"<input>", line1, in<module>hash(set(range(10)))TypeError:unhashable type: 'set'
For example, instances of thefrozenset
data type in Python are hashable, while ordinary sets don’t implement hashing at all. Hashability directly impacts whether objects of certain types can becomedictionary keys orset members, as both of these data structures use thehash()
function internally:
>>>hashable=frozenset(range(10))>>>{hashable:"set of numbers"}{frozenset({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}): 'set of numbers'}>>>{hashable}{frozenset({0, 1, 2, 3, 4, 5, 6, 7, 8, 9})}>>>unhashable=set(range(10))>>>{unhashable:"set of numbers"}Traceback (most recent call last): File"<input>", line1, in<module>{unhashable:"set of numbers"}TypeError:unhashable type: 'set'>>>{unhashable}Traceback (most recent call last): File"<input>", line1, in<module>{unhashable}TypeError:unhashable type: 'set'
While an instance offrozenset
is hashable, the corresponding set with precisely the same values is not. Be aware that you can still use an unhashable data type for the dictionary values. It’s the dictionarykeys that must be able to calculate their corresponding hash codes.
Note: When you insert an unhashable object into a hashable container, such as afrozenset
, then that container also becomes unhashable.
Hashability is closely related tomutability, or the ability to change the internal state of an object during its lifetime. The relationship between the two is a bit like changing an address. When you move to another place, you’re still the same person, but your old friends might have a hard time trying to find you.
Unhashable types in Python, such as lists, sets, or dictionaries, happen to bemutable containers, as you can modify their values by adding or removing elements. On the other hand, most built-in hashable types in Python areimmutable. Does this mean mutable types can’t be hashable?
The correct answer is theycan be both mutable and hashable, but they rarely should be! Mutating a key would result in changing its memory address within a hash table. Consider this custom class as an example:
>>>classPerson:...def__init__(self,name):...self.name=name
This class represents a person with a name. Python provides a default implementation for the special method.__hash__()
in your classes, which merely uses the object’s identity to derive its hash code:
>>>hash(Person("Joe"))8766622471691>>>hash(Person("Joe"))8766623682281
Each individualPerson
instance has a unique hash code even when it’s logically equal to other instances. To make the object’s value determine its hash code, you can override the default implementation of.__hash__()
like so:
>>>classPerson:...def__init__(self,name):...self.name=name......def__hash__(self):...returnhash(self.name)>>>hash(Person("Joe"))==hash(Person("Joe"))True
You callhash()
on the.name
attribute so that instances of thePerson
class with equal names always have the same hash code. This is convenient for looking them up in a dictionary, for example.
Note: You can explicitly mark your class asunhashable by setting its.__hash__
attribute equal toNone
:
>>>classPerson:...__hash__=None......def__init__(self,name):...self.name=name>>>hash(Person("Alice"))Traceback (most recent call last): File"<input>", line1, in<module>hash(Person("Alice"))TypeError:unhashable type: 'Person'
This will preventhash()
from working on instances of your class.
Another trait of hashable types is the ability to compare their instances by value. Recall that a hash table compares keys with the equality test operator (==
), so you must implement another special method,.__eq__()
, in your class to allow for that:
>>>classPerson:...def__init__(self,name):...self.name=name......def__eq__(self,other):...ifselfisother:...returnTrue...iftype(self)isnottype(other):...returnFalse...returnself.name==other.name......def__hash__(self):...returnhash(self.name)
This code fragment should look familiar if you went through theequality test of hash tables before. In a nutshell, you check if the other object is the exact same instance, an instance of another type, or another instance of the same type and equal value to the.name
attribute.
Note: Coding special methods such as.__eq__()
and.__hash__()
can be repetitive, tedious, and error-prone. If you’re on Python 3.7 or above, then you can achieve the same effect more compactly by usingdata classes:
@dataclass(unsafe_hash=True)classPerson:name:str
While a data class generates.__eq__()
based on your class attributes, you must set theunsafe_hash
option to enable the correct.__hash__()
method generation.
Having implemented.__eq__()
and.__hash__()
, you can use thePerson
class instances as dictionary keys:
>>>alice=Person("Alice")>>>bob=Person("Bob")>>>employees={alice:"project manager",bob:"engineer"}>>>employees[bob]'engineer'>>>employees[Person("Bob")]'engineer'
Perfect! It doesn’t matter if you find an employee by thebob
reference that you created earlier or a brand-newPerson("Bob")
instance. Unfortunately, things get complicated when Bob suddenly decides to change his name and go by Bobby instead:
>>>bob.name="Bobby">>>employees[bob]Traceback (most recent call last): File"<input>", line1, in<module>employees[bob]KeyError:<__main__.Person object at 0x7f607e325e40>>>>employees[Person("Bobby")]Traceback (most recent call last): File"<input>", line1, in<module>employees[Person("Bobby")]KeyError:<__main__.Person object at 0x7f607e16ed10>>>>employees[Person("Bob")]Traceback (most recent call last): File"<input>", line1, in<module>employees[Person("Bob")]KeyError:<__main__.Person object at 0x7f607e1769e0>
You no longer have a way of retrieving the corresponding value even though you still use the original key object that you inserted before. What’s more surprising, though, is that you can’t access the value through a new key object with the updated person’s name or with the old one. Can you tell why?
The hash code of the original key determined which bucket the associated value got stored in. Mutating the state of your key made its hash code indicate a completely different bucket or slot, which doesn’t contain the expected value. But using a key with the old name doesn’t help either. While it points to the right bucket, the stored key has mutated, making equality comparison between"Bob"
and"Bobby"
evaluate toFalse
rather than matching.
Therefore, hash codes must beimmutable for the hashing to work as expected. Since hash codes are typically derived from an object’s attributes, their state should be fixed and never change over time. In practice, this also means that objects intended as hash table keys should be immutable themselves.
To sum up, a hashable data type has the following traits:
.__hash__()
method to calculate the instance’shash code.__eq__()
method tocompare instances byvalueThe fourth and final trait of hashable types is that they must comply with thehash-equal contract, which you’ll learn more about in the following subsection. In short, objects with equal values must have identical hash codes.
To avoid problems when using custom classes as hash table keys, they should comply with the hash-equal contract. If there’s one thing to remember about that contract, it’s that when you implement.__eq__()
, you shouldalways implement a corresponding.__hash__()
. The only time you don’t have to implement both methods is when you use a wrapper such as adata class or an immutablenamed tuple that already does this for you.
Also, not implementing both methods can be okay as long as you’re absolutely sure that you won’t ever use objects of your data type as dictionary keys or set members. But can you be so sure?
Note: If you can’t use a data class or a named tuple, and you’d like to manually compare and hash more than one field in a class, then wrap them in a tuple:
classPerson:def__init__(self,name,date_of_birth,married):self.name=nameself.date_of_birth=date_of_birthself.married=marrieddef__hash__(self):returnhash(self._fields)def__eq__(self,other):ifselfisother:returnTrueiftype(self)isnottype(other):returnFalsereturnself._fields==other._fields@propertydef_fields(self):returnself.name,self.date_of_birth,self.married
It makes sense to define a private property to return that tuple if there are relatively many fields in your class.
While you can implement both methods however you like, they must satisfy the hash-equal contract, which states that two equal objects must hash to equal hash codes. This makes it possible to find the rightbucket based on a provided key. However, the reverse isn’t true, becausecollisions may occasionally cause the same hash code to be shared by unequal values. You can express this more formally by using these two implications:
a = b
⇒hash(a) = hash(b)
hash(a) = hash(b)
⇏a = b
The hash-equal contract is aone-way contract. If two keys are logically equal, then their hash codes must also be equal. On the other hand, if two keys share the same hash code, it’s likely they’re the same key, but it’s also possible that they’re different keys. You need to compare them to be sure if they really match. By the law ofcontraposition, you can derive another implication from the first one:
hash(a) ≠ hash(b)
⇒a ≠ b
If you know that two keys have different hash codes, then there’s no point in comparing them. That can help improve performance, as the equality test tends to be costly.
Some IDEs offer to automatically generate the.__hash__()
and.__eq__()
methods for you, but you need to remember to regenerate them every time you modify your class attributes. Therefore, stick to Python’s data classes or named tuples whenever you can to guarantee the proper implementation of hashable types.
At this point, you canimplement a hash table from scratch in Python, using different strategies to resolvehash collisions. You know how to make your hash tableresize and rehash dynamically to accommodate more data and how to preserve theinsertion order of key-value pairs. Along the way, you practicedtest-driven development (TDD) by adding features to your hash table step by step.
Other than that, you have a deep understanding of Python’s built-inhash()
function and can create one of your own. You understand thehash-equal contract, the difference betweenhashable andunhashable data types, and their relationship withimmutable types. You know how to create custom hashable classes using various tools in Python.
In this tutorial, you learned:
hash()
works behind the scenesWith this knowledge, you’re prepared to answer most questions that you might get on ajob interview related to thehash table data structure.
You can download the complete source code and the intermediate steps used throughout this tutorial by clicking the link below:
Source Code:Click here to download the source code that you’ll use to build a hash table in Python.
🐍 Python Tricks 💌
Get a short & sweetPython Trick delivered to your inbox every couple of days. No spam ever. Unsubscribe any time. Curated by the Real Python team.
AboutBartosz Zaczyński
Bartosz is an experienced software engineer and Python educator with an M.Sc. in Applied Computer Science.
» More about BartoszMasterReal-World Python Skills With Unlimited Access to Real Python
Join us and get access to thousands of tutorials, hands-on video courses, and a community of expert Pythonistas:
MasterReal-World Python Skills
With Unlimited Access to Real Python
Join us and get access to thousands of tutorials, hands-on video courses, and a community of expert Pythonistas:
What Do You Think?
What’s your #1 takeaway or favorite thing you learned? How are you going to put your newfound skills to use? Leave a comment below and let us know.
Commenting Tips: The most useful comments are those written with the goal of learning from or helping out other students.Get tips for asking good questions andget answers to common questions in our support portal.
Keep Learning
Already have an account?Sign-In
Almost there! Complete this form and click the button below to gain instant access:
Build a Hash Table in Python With TDD (Source Code)