A Bug I Squashed: py-hopscotch-dict Part 2

Mon 03 October 2022

An Unwelcome Visitor Arrives

Like every ridiculously subtle bug this one seemed to appear out of nowhere. I always confirm my code passes tests locally before pushing changes, but I still found myself staring down a red CI build marker and a strange error:

        def __delitem__(self, key: Hashable) -> None:

                # Key not in dict
                if data_idx is None:
>                       raise KeyError(key)
E                       KeyError: None

During handling of the above exception, another exception occurred:

        def _set_lookup_index_info([...]) -> None:
>               if lookup_idx < 0:
E               TypeError: '<' not supported between instances of 'NoneType' and 'int'

I hadn't changed the currently-exploding functions in months, so I began debugging using the standard process [1], with satisfactory results. It made sense to chalk the failure up to a transient Hypothesis error - it only happened in one of the six environments I test against, and had only happened once in the hundreds of times I'd run the tests locally and in CI - but I still couldn't shake the feeling this failure was too weird to be a fluke. At least this bug had the decency to present itself without eating all my memory and crashing my computer. Better still, when Hypothesis runs in CI it emits a decorator for reproducing failures locally, so I should have been able to root-cause this bug in record time!

E   hypothesis.errors.DidNotReproduce: The shape of the test data has changed in some way from where this blob was defined. Are you sure you're running the same test?

No matter what I did, the decorator failed to reproduce the failure on my machine [2], so I continued debugging using the next step of the standard process [3]. After about an hour of rerunning my tests I finally managed to replicate the failure on my machine, allowing me to add some PDB breakpoints and get to work.

The Hunt Is On

I spent the next few days running down rabbit holes and into dead ends: 19 out of every 20 test runs finished without issue, and failures alternated between seeing the same key/value pair twice and being unable to find key/value pairs I knew were in the dict. I added additional checks to my tests to detect when these situations occurred, but never got a hit. Even with debugging sessions running as late as 4 AM, I still couldn't reproduce the bug's side effects reliably, much less isolate the actual bug and fix it.

I shifted my attention to _free_up(), a different function that spontaneously began exploding, when I caught a lucky break - I noticed the function only exploded when it tried to work on the last bucket in _indices, one of the two main housekeeping data structures in py-hopscotch-dict.

At this point I need to explain the two main concepts of hopscotch hashing. First, each bucket a key can hash to has what's called a neighborhood: a semi-fixed number of adjacent buckets (called neighbors) where that key can be stored in the event of a hash collision. Second, it is possible to empty a target bucket by pushing key data deeper into its neighborhood - if you've ever pushed things aside to get to something deep in a cabinet, you understand the basic concept here.
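The neighborhood idea can be sketched in a couple of lines. This is an illustrative stand-in, not py-hopscotch-dict's actual code, and the neighborhood size here is an assumption (footnote [4] below gives eight neighbors at 128 buckets):

```python
NBHD_SIZE = 8  # assumed neighborhood size, per footnote [4]

def neighborhood(home_idx: int, num_buckets: int) -> list[int]:
	"""Return the bucket indices in home_idx's neighborhood,
	wrapping around the end of the table."""
	return [(home_idx + i) % num_buckets for i in range(NBHD_SIZE)]

# In a 128-bucket table, the neighborhood of the last bucket wraps:
# neighborhood(127, 128) -> [127, 0, 1, 2, 3, 4, 5, 6]
```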

Since _free_up() had nothing to do if the bucket to empty already had an empty neighbor, the function could only fail in this situation if every bucket in the last bucket's neighborhood was occupied - the function didn't try to wrap around and empty buckets at the start of _indices for fear I might write an infinite loop. I hoped to trigger the bug manually by filling the last bucket's neighborhood, but this plan failed.
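The no-wrap constraint can be captured in a hedged sketch: since data only gets pushed toward higher indices, emptying a bucket requires an open slot somewhere at or after it. The function name and signature here are illustrative, not the library's:

```python
def can_free_up(target_idx: int, num_buckets: int, occupied: set) -> bool:
	"""True if some bucket at or after target_idx is open to push
	data into; _free_up()-style shuffling never wraps past the end."""
	return any(i not in occupied for i in range(target_idx, num_buckets))

# With every bucket from 120 onward occupied in a 128-bucket table,
# nothing near the end can be emptied without wrapping around.
```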

Taking a look up the call stack showed _free_up() was being called by _resize(), which made even less sense: that call sequence could only happen if a target bucket's neighborhood was overfull, but _resize() doesn't add data to the dict - if everything fit before resizing, everything should fit after. I racked my brain trying to figure out how I could overfill a neighborhood without causing a resize, again to no success.

Then, in a moment reminiscent of the previous bug, I had an errant thought race through my head as I was trying to get to sleep: "What if you delete something before resizing?" Unlike before, I was too anxious about forgetting this idea to sleep, and after five minutes of messing around I knew I'd struck gold:

>>> d.clear()
>>> for i in range(1, 7): d[i] = i
>>> d[127] = 127
>>> d[7] = 7
>>> d[255] = 255
>>> d[0] = 0
>>> del d[7]
>>> for i in range(8, 25): d[i] = i
Traceback (most recent call last):
        raise RuntimeError("Could not open index while maintaining invariant")

The Bug

At this point I need to explain two unfortunately relevant details of my implementation: the first is that keys are stored in their own list, and deleting a key from the dict is performed by moving the last key in the list into the index of the key being deleted; the second is that resizing works by reinserting data into the resized dict according to the order of the key list. Thus, before the deletion in the above snippet, the key list looks like this:

[1, 2, 3, 4, 5, 6, 127, 7, 255, 0]

Whereas after the deletion the key list looks like this:

[1, 2, 3, 4, 5, 6, 127, 0, 255]
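The swap-with-last deletion described above can be sketched in a few lines; `delete_key` is an illustrative stand-in for the library's actual method:

```python
def delete_key(keys: list, victim) -> None:
	"""Remove `victim` by overwriting it with the last key in the
	list, then dropping the now-duplicated last slot."""
	idx = keys.index(victim)
	keys[idx] = keys[-1]
	keys.pop()

keys = [1, 2, 3, 4, 5, 6, 127, 7, 255, 0]
delete_key(keys, 7)
# keys is now [1, 2, 3, 4, 5, 6, 127, 0, 255]
```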

This bug is so insidious because at this point the dict is still functional - until it resizes. After _indices grows to 128 buckets, the mapping from _keys to _indices is rebuilt, and the trouble starts when the rebuild reaches key 255.

I use the built-in hash() to hash my keys, which has the useful property that hash(x) == x for (small enough) integers; to turn hash(x) into a valid bucket index I take it modulo the size of _indices. Putting these two pieces of information together, the first hash collision is easy to spot: 127 % 128 == 255 % 128 == 127, so the dict has to put the key 255 in an open neighbor of the bucket at index 127.
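You can check both facts directly in a CPython REPL - the identity property of hash() for small integers, and the resulting collision at 128 buckets:

```python
num_buckets = 128  # size of _indices after the resize

# CPython's hash() is the identity for small integers...
assert hash(127) == 127 and hash(255) == 255

# ...so after the resize, 127 and 255 land on the same bucket.
assert hash(127) % num_buckets == hash(255) % num_buckets == 127
```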

But what is that neighbor? Neighborhoods for buckets at the end of _indices wrap around to the beginning, but at this size the neighbors of the bucket at index 127 are already occupied [4]. This shouldn't be a problem though - _free_up() can shuffle things around to open up a bucket to store the key 255, right? Yes - so long as there's an open bucket before the end of _indices to push data into, which there isn't because we are at the end of _indices. Thus, the end result is a RuntimeError out of nowhere despite the fact everything worked before resizing.
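We can verify this arithmetic with the repro above: by the time the rebuild reaches key 255, keys 1-6, 127, and 0 have already been reinserted, and together they occupy every bucket in 127's wrapped neighborhood. A quick sketch, with the neighborhood size taken from footnote [4]:

```python
num_buckets, nbhd_size = 128, 8

# Buckets claimed before the rebuild reaches key 255 (hash(x) == x
# for these small integers, taken modulo the table size):
occupied = {k % num_buckets for k in [1, 2, 3, 4, 5, 6, 127, 0]}

# The wrapped neighborhood of bucket 127: indices 127 through 6.
neighbors = [(127 + i) % num_buckets for i in range(nbhd_size)]

# Every neighbor is already taken, so 255 has nowhere to go.
assert all(n in occupied for n in neighbors)
```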

The Fix

Once I reached this point the solution was obvious: this bug only affects buckets whose neighborhoods wrap around, and every such neighborhood includes index 0, so I can call _free_up() on index 0 instead to give the function the best chance of emptying a bucket.

After multiple 16+ hour days of debugging, all I had to do was change some code from this:

nearest_neighbor = self._get_open_neighbor(expected_lookup_idx)
if nearest_neighbor is None:

to this:

nearest_neighbor = self._get_open_neighbor(expected_lookup_idx)
if nearest_neighbor is None:
        if new_size - expected_lookup_idx < self._nbhd_size:
                free_up_idx = 0
        else:
                free_up_idx = expected_lookup_idx

And thus, all was right with the world again - at least, until Hypothesis drops some other ridiculously subtle bug on me that I have to spend the better part of a week reproducing. I'm pretty proud of finding this bug - I think it was the hardest bug I've ever tracked down, including all of my professional work and personal projects.

[1] Rerunning the tests to see if the errors go away
[2] I know this was my fault for believing in computers in the first place
[3] Rerunning the tests until things failed locally
[4] When _indices is 128 buckets large, each bucket only has eight neighbors, so the neighborhood of the bucket at index 127 ranges from indices 127 to 6