A Bug I Squashed: py-hopscotch-dict Part 1

Tue 27 September 2022

This bug is a couple years old now, but I'm still proud of myself for finding it. Anyway, onto the story:

I was this close to the first public release of py-hopscotch-dict: I'd organized the repo just-so, built my first CI/CD pipeline - I'd even signed up for PyPI so it could be found via pip. I ran my tests locally one last time to get that sweet dopamine hit of 100% code coverage before version 1.0.0 went gold, but strangely my computer locked up running the last test. I hadn't made any changes to this test - a Hypothesis state machine meant to be my integration test - and it had been rock-solid over the hundreds of runs the few days prior. I immediately began a rigorous info collection process [1] to prepare for a debugging session, but the issue never resurfaced. I chalked it up to computers being terrible sometimes and created a release, but a few months later I saw the issue reappear in CI testing. Unfortunately this meant the problem was in the code instead of my computer, so I had a bug to squash.

Thankfully the failed CI tests had provided a breadcrumb in that the test run failed due to being OOM killed, but the fact that after hundreds of test runs the crash only appeared twice meant the odds of both reproducing the bug locally and getting a debugger on it when it was happening were slim. Days went by without progress, not helped by the fact that trying to find the bug without triggering its effect meant I was effectively guessing where it might be. After four or five days of 12+ hour bugfinding sessions, I was ready to throw in the towel - I even rewrote internal data structures in case the more obvious choices were causing some sort of bloat, but I was still as blind as when I began.

Dejected, I started thinking a full rewrite would get rid of the bug and searched for inspiration from other hopscotch hash implementations - there were only a couple on GitHub, none in Python, but the C implementation I looked at used a neat trick to solve a flaw I noticed in my implementation. Frustrated from chasing ghosts for the best part of a week trying to find a bug desperate to remain hidden, no small number of thoughts buzzed in my head as I tried to sleep that night. But right before I finally managed to fall asleep, one stood out: "what if two keys tried to occupy the last bucket?" I decided to risk forgetting my thought overnight by going to sleep instead of writing it down somewhere, but thankfully it was the first thought in my head the next morning and immediately led me to the bug.

At this point a quick overview of hopscotch hashing is necessary - in essence, dict keys map to buckets in an internal array, but in the event of a hash collision can instead occupy other adjacent buckets in what's called the original bucket's neighborhood. As my code was written, a bucket's neighborhood only extended as far as the last bucket in the array - so buckets less than one neighborhood away from the end of the array had fewer buckets in their neighborhoods, and the last bucket had a neighborhood of only itself. Additionally, in the event of a hash collision the code used a combination of the least significant bits of each hash and resizing the underlying array - and in a worst-case situation could require a number of resizes that makes the dict use so much memory the process gets OOM killed.

Once I found the bug I added support for wraparound neighborhoods and cut a new release. So what did I learn? Mostly that I would prefer to fix 10 obvious bugs over one very subtle one. Also that Hypothesis can trigger a bug for you, but that doesn't mean it'll be any help finding or fixing it.

[1]Rerunning the tests a few times to see if they keep failing