My Little Profiler Part 2: Profile Harder

Thu 31 August 2023

A year or so ago I optimized a library I wrote - I made it faster but couldn't make a dent in memory-efficiency. A week or so back I got some free time on my hands and decided to try squeezing some more blood out of that stone while also working on another library I wrote.

Prologue

I didn't give much thought to the computer I used for benchmarking the first time around - it was my daily-driver laptop with an i7-7660 and 32 gigs of DDR4-3200 - but lately I've gotten a bug up my butt about how modern software engineers don't care about their software's performance because they write/test their code on cutting-edge machines that drown their incompetency under an ocean of transistors. [1]

Therefore, this time around I decided to run all my tests on the oldest hardware I had lying around: my trusty Lenovo X201 with a Core i5-560M [2] and 8 gigs of DDR3-1600 RAM - I figured if the code runs fast on this machine it should run fast anywhere. To keep myself honest I also benchmarked on Python 3.10 instead of 3.11, which has significant performance improvements.

My benchmarking methodology hadn't changed: I reboot before collecting data to mitigate performance impacts from other applications, empty any caches and reset any CPU-internal statistics that may impact timing; I run the benchmark five times, dropping the best/worst runs to mitigate the worst of random variance; I use Scalene while benchmarking to find potential areas for improvement:

for i in {1..5}; do scalene --outfile "run_${i}".json --json --cli --use-virtual-time --profile-all --profile-exclude threading bench.py; done
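
For the record, the average runtimes quoted throughout this post are just the mean of the three middle runs - a minimal sketch of that arithmetic (the function name is mine, purely for illustration):

from statistics import mean

def trimmed_average(runtimes: list[float]) -> float:
    # Drop the fastest and slowest of the five runs, average the rest
    return mean(sorted(runtimes)[1:-1])

# e.g. the baseline 1,000,000-prime checkpoints from the table further down
print(trimmed_average([518.218, 510.928, 526.935, 513.607, 505.245]))  # 514.251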

The benchmarking script is the same as last time:

from pathlib import Path
from time import perf_counter
from zipfile import ZipFile

from py_fast_trie import XFastTrie

def main():
    t = XFastTrie()
    start = perf_counter()

    for i in range(1, 2):
        name = f"primes{i}"
        p = Path(f"~/bench/{name}.zip").expanduser()
        ctr = 0

        with ZipFile(p) as z:
            with z.open(f"{name}.txt") as f:
                print(f"opened {name}.txt")
                for l in f.readlines()[2:]:
                    for n in l.split():
                        t += int(n)
                        ctr += 1

                        if ctr % 100000 == 0:
                            print(f"{ctr}: {perf_counter() - start}")

if __name__ == "__main__": main()

Baseline

Before I got started, I wanted some new baseline numbers to measure my progress:

Primes Inserted Run 1 Run 2 Run 3 Run 4 Run 5
100,000 51.398 49.750 51.120 49.866 48.949
200,000 103.161 100.806 103.633 101.060 99.302
300,000 155.839 152.914 157.379 153.429 150.755
400,000 206.385 202.690 208.936 203.560 199.885
500,000 258.164 253.589 261.741 255.263 250.223
600,000 310.564 305.287 315.163 307.331 301.320
700,000 360.837 355.014 366.638 357.365 350.446
800,000 409.455 403.542 416.476 405.792 398.305
900,000 468.603 462.091 476.522 464.563 456.567
1,000,000 518.218 510.928 526.935 513.607 505.245

I knew running the benchmark on a CPU almost 7 years older than before, with an older memory architecture at half the speed, would impact the numbers - but after so much work to break the 300 second barrier last time, seeing an average runtime of 514.251 seconds with no runs faster than 500 seconds did hurt a bit. I turned that pain into fuel for improving my code and dove in.

__setitem__() Helper Function Inlining

I'd planned to focus solely on py-fast-trie this time around when I had a trademark Intrusive Thought - my X-fast trie implementation queries for values in dicts much more than it inserts them, but __setitem__() is still in my benchmark's critical path, so shouldn't I take a look at it since I didn't touch it last time? Sure enough, it wasn't hard to find some dead weight:

def __setitem__(self, key: Hashable, value: Any) -> None:
    [...]

    # The index of the key in _keys and its related value in _values
    _, data_idx = self._lookup(key)

    [...]

    nearest_nbr = self._get_open_neighbor(expected_lookup_idx)

Two function calls in the critical path that could easily be inlined - and as I've said before, calling Python functions is costly.
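
If you want a feel for just how costly, a quick micro-benchmark sketch makes the point (the helper here is purely illustrative, and the exact ratio varies by machine and Python version):

from timeit import timeit

def helper(x):
    return x + 1

def with_call():
    total = 0
    for i in range(100_000):
        total += helper(i)
    return total

def inlined():
    total = 0
    for i in range(100_000):
        total += i + 1
    return total

# Same work either way, but the second loop skips 100,000 frame setups and
# teardowns per pass; expect it to be noticeably faster
print(timeit(with_call, number=100))
print(timeit(inlined, number=100))

Inlining those functions netted a slight performance increase: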

Primes Inserted Run 1 Run 2 Run 3 Run 4 Run 5
100,000 50.082 49.233 48.365 50.747 49.613
200,000 101.585 99.765 98.068 102.945 102.066
300,000 154.028 151.189 148.803 155.969 153.971
400,000 204.098 200.310 197.377 206.738 203.776
500,000 255.665 250.640 247.193 258.789 254.766
600,000 307.780 301.796 297.742 311.408 306.524
700,000 357.761 351.148 346.327 362.116 356.393
800,000 406.153 398.758 393.499 411.255 404.527
900,000 465.131 456.253 450.472 470.748 462.860
1,000,000 514.331 504.381 498.860 520.585 511.680

As the meme goes, it ain't much - the new average runtime of 510.131 seconds was only a 0.8% decrease - but it's honest work. I wasn't expecting much since dict insertion is such a small part of the critical path; hopefully the other changes I had in mind would be more impactful.

Removing TrieNode Boilerplate

At this point I actually turned my attention to py-fast-trie and realized trying to be as Pythonic as possible was hurting more than helping - I made the variables in TrieNode "private" with an underscore, but they still needed to be manipulated from outside so every interaction with data in a TrieNode was mediated by some Java-esque property function:

def _get_left(self) -> TrieNode | None:
    return self._left

[...]

def _set_left(self, new_left: TrieNode | None) -> None:
    self._left = new_left

[...]

left = property(_get_left, _set_left)
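
What I wanted instead was plain attribute access - roughly the sketch below, with field names taken from the __slots__ listing later in this post (the real class carries a bit more than this):

class TrieNode:
    def __init__(self):
        # Direct, "public" attributes - no descriptor machinery in between
        self.value = None
        self.leaf = None
        self.left = None
        self.right = None
        self.parent = None

# Callers just touch the attributes directly:
# node.left = other_node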

I looked over my shoulder to make sure Guido couldn't watch me sin before I made my variables "public" and deleted the property functions; I also deleted some vanity properties that were differently-named wrappers around existing data. Let nobody tell you crime doesn't pay, my Python crimes netted me a tidy speedup:

Primes Inserted Run 1 Run 2 Run 3 Run 4 Run 5
100,000 45.102 44.205 60.581 43.652 43.334
200,000 91.503 92.636 106.579 88.792 88.165
300,000 137.844 139.566 173.454 134.950 133.790
400,000 181.679 201.431 218.022 178.840 177.351
500,000 227.057 249.523 277.933 223.998 222.159
600,000 273.066 307.030 330.565 269.907 268.277
700,000 317.079 359.461 383.385 313.717 311.832
800,000 359.415 406.239 437.969 356.072 353.734
900,000 412.286 475.429 491.165 408.735 405.686
1,000,000 455.490 520.598 534.803 451.822 448.320

My benchmark now ran in an average runtime of 475.970 seconds, a 6.7% decrease in runtime from inlining __setitem__() helper functions alone and a 7.4% decrease in runtime from baseline.

insert() Helper Function Inlining

Since my benchmark is focused on insertion, I figured I'd get the biggest return on investment by looking at insert() for the X-fast trie - and I found some more low-hanging fruit at the start of the function. As each leaf should have a reference to its predecessor and successor leaves, insert(), reasonably, calls predecessor() and successor() to find them:

leaf_pred = self.predecessor(value) if self._count > 0 else None
leaf_succ = self.successor(value) if self._count > 0 else None

However, both of those functions just call _get_closest_leaf() and take one step in opposite directions - and insert() needs to go in both directions anyway. Instead of making two function calls that funnel into the same place, I call _get_closest_leaf() directly and get everything I need at once:

if self._count > 0:
    neighbor = self._get_closest_leaf(value)

    # The closest leaf is the predecessor of the value to be inserted
    if neighbor.value < value:
        leaf_pred = neighbor
        leaf_succ = neighbor.right

    # The closest leaf is the successor of the value to be inserted
    elif neighbor.value > value:
        leaf_succ = neighbor
        leaf_pred = neighbor.left

This resulted in the first big impact of the session:

Primes Inserted Run 1 Run 2 Run 3 Run 4 Run 5
100,000 41.438 40.603 42.979 40.999 41.095
200,000 83.384 82.959 87.351 83.508 83.609
300,000 126.383 126.254 132.544 127.026 128.558
400,000 166.991 167.410 175.553 168.095 169.933
500,000 208.945 209.747 219.917 210.548 212.704
600,000 251.425 252.864 264.987 253.651 256.026
700,000 292.232 293.916 307.957 294.753 297.441
800,000 331.534 333.513 349.676 334.203 337.186
900,000 380.940 383.076 401.422 384.024 387.338
1,000,000 420.943 423.314 443.358 424.239 427.765

With an average runtime of 425.106 seconds this last change resulted in a 10.7% decrease in runtime from removing TrieNode boilerplate and a 17.3% decrease in runtime from baseline.

Removing Redundant Input Sanitization

To make sure the library never attempted to operate on invalid input, every function through which a user would theoretically interact with the library called a sanitization function on its input:

def __iadd__(self, value: int | bytes) -> XFastTrie:
    value = self._to_int(value, self._maxlen)
    self.insert(value)
    return self

However, because users could theoretically call insert() directly it also called the sanitization function:

def insert(self, value: int | bytes) -> None:
    value = self._to_int(value, self._maxlen)
    [...]

It didn't make sense to call that function twice, so I removed the redundant _to_int() calls from all the helper functions for a slight speedup:

Primes Inserted Run 1 Run 2 Run 3 Run 4 Run 5
100,000 40.213 41.987 40.749 40.266 40.611
200,000 81.800 85.372 83.088 81.974 84.468
300,000 124.271 129.844 126.365 124.635 127.621
400,000 164.377 171.911 167.344 165.100 168.547
500,000 205.895 215.486 209.621 206.893 210.853
600,000 248.038 259.710 252.534 249.308 253.889
700,000 288.276 302.006 293.618 289.834 294.889
800,000 326.877 342.684 333.332 328.799 334.416
900,000 375.741 393.699 383.036 377.808 383.985
1,000,000 415.021 435.202 423.252 417.456 424.107

This change dropped the average runtime to 421.605 seconds, a 0.8% runtime decrease from inlining predecessor/successor leaf discovery and an 18.0% decrease in runtime from baseline - once again yielding a small win for little work.

_get_closest_leaf() Value Conversion

I began this optimization session with two primary strategies I gained from the last go-around: inline as many function calls as is reasonable; and eliminate as much bitbanging as possible. Thankfully there wasn't much bitbanging in py-fast-trie to begin with, but there was some in _get_closest_leaf():

direction = value >> (self._maxlen - level - 2) & 1
descendant = ancestor.left if direction == 0 else ancestor.right

Since _get_closest_leaf() was in the critical path for insertion I converted the integer manipulation into string manipulation:

leg = bin(value)[2:].zfill(self._maxlen)[level + 1]
descendant = ancestor.left if leg == "0" else ancestor.right
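
A quick sanity check that the two forms pick the same bit, using hypothetical values for value, level, and _maxlen:

maxlen = 8
value = 0b10110010
level = 3

bit_version = value >> (maxlen - level - 2) & 1         # -> 0
str_version = bin(value)[2:].zfill(maxlen)[level + 1]   # -> "0"
assert str(bit_version) == str_version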

Given what this strategy did for _get_displaced_neighbors(), I expected a tidy performance boost:

Primes Inserted Run 1 Run 2 Run 3 Run 4 Run 5
100,000 42.789 40.969 43.094 40.826 42.622
200,000 86.834 83.419 87.190 83.228 87.070
300,000 131.933 126.828 132.348 126.575 132.157
400,000 174.629 168.065 174.953 167.681 175.010
500,000 218.733 210.559 218.979 210.098 219.153
600,000 263.500 253.780 263.586 253.346 263.841
700,000 306.331 295.005 306.311 294.614 306.430
800,000 347.549 334.739 347.465 334.353 347.597
900,000 399.028 384.396 398.805 384.203 398.941
1,000,000 440.930 424.744 440.521 424.613 440.773

I was rewarded with another quick win of an average runtime...increase...to 435.346 seconds, a 3.3% increase over removing redundant input sanitization but still a 15.3% reduction in runtime from baseline. To be honest these results threw me for a loop, and I didn't notice this "optimization" had actually made things worse until I was analyzing the data for this post - the final runtimes were "close enough" to the previous set that I assumed it was another small win. Had I noticed at the time I would have undone it, but since the change is in all my subsequent runs I kept it in.

Rewriting insert()

At this point I felt good about decreasing average runtimes by 80 seconds, but I was still a ways from my goal of a sub-300 second average. To make matters worse, since py-fast-trie had less code than py-hopscotch-dict the low-hanging fruit was pretty much gone by now - if I wanted any more gains I was going to have to start working for them. Thankfully I knew there was still one big bundle of suck waiting for me: the trie insertion function. [3]

I'd skimmed the function when I removed predecessor() and successor() calls earlier, but as I read it more thoroughly I realized there was no way to escape a complete rewrite. I didn't want to do it since I knew finding all the little bugs a rewrite would introduce would take longer than rewriting it to begin with; but then I remembered initially writing the function, and I remembered making it add intermediate nodes up from the leaf instead of down from the closest existing ancestor, and I knew what I had to do.
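
To keep the rewrite honest I leaned on property-based tests. Here's a sketch of the kind of invariant worth checking - it assumes predecessor() and successor() hand back leaf nodes rather than raw values, which is how the insert() snippet above treats them:

from hypothesis import given, strategies as st

from py_fast_trie import XFastTrie

# If only even numbers go into the trie, then for any odd probe between two of
# them, predecessor()/successor() should return the leaves on either side -
# the same question insert() asks when linking up a new leaf
@given(st.lists(st.integers(min_value=1, max_value=2**20).map(lambda n: n * 2),
                min_size=2, unique=True))
def test_missing_values_get_correct_neighbors(evens):
    t = XFastTrie()
    for v in evens:
        t += v

    ordered = sorted(evens)
    for below, above in zip(ordered, ordered[1:]):
        probe = below + 1  # odd, so never in the trie
        assert t.predecessor(probe).value == below
        assert t.successor(probe).value == above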

A day or so later Hypothesis said my new version of insert() was good enough, so I declared another small victory against computers. And what did I get for my efforts - was the agita over remembering how to handle descendant pointers worth it?

Primes Inserted Run 1 Run 2 Run 3 Run 4 Run 5
100,000 13.566 13.627 13.262 13.471 13.357
200,000 28.504 28.604 27.886 28.350 28.122
300,000 44.400 44.555 43.463 44.245 43.834
400,000 57.867 58.030 56.645 57.592 57.146
500,000 72.680 72.886 71.150 72.337 71.875
600,000 88.150 88.418 86.297 87.750 87.187
700,000 101.703 101.931 99.536 101.066 100.587
800,000 113.377 113.543 110.883 112.524 112.083
900,000 135.700 136.013 132.814 135.194 134.157
1,000,000 148.050 148.341 144.802 147.401 146.304

Yeah, I'd say so. Every run was almost twice as fast at every point as when I stopped trying to optimize last time - on slower hardware! Runtimes decreased to an average of 147.251 seconds, a 66.2% decrease over my bitbanging replacement and a 71.4% decrease from baseline. I said this the last time I was optimizing and I'll say it again here: it felt great to make numbers like this, but it meant I wrote some real dogshit code in the past - at least I'm atoning for my computer crimes now.

Adding __slots__

After the last set of results I felt like I could declare overall victory, but despite all this work to make everything run faster I was still using too damn much memory:

A snippet of a Scalene profile result; the XFastTrie object is using 1.083 GiB of space

To be honest, I had no idea where it was going - 1 million 64-bit numbers should only take up 8MB of space so even with 20x overhead the trie should use far less than 1GB, plus prior attempts to use Scalene's memory leak detection feature resulted in alerts that put the false in false positive.

A snippet of a Scalene profile result suggesting the builtin enumerate() function has a memory leak

I don't remember why the idea came back to me, but I had a thought: "__slots__ is supposed to help with memory and you used it in py-hopscotch-dict without issue, why not try it in py-fast-trie too?"

class TrieNode:
    __slots__ = (
        "leaf",
        "left",
        "parent",
        "right",
        "value",
    )
[...]
class XFastTrie:
    __slots__ = (
        "_count",
        "_level_tables",
        "_max",
        "_maxlen",
        "_min",
        "_root",
    )
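
For a rough sense of where the savings come from, here's a quick comparison sketch (class names hypothetical; exact byte counts vary by Python version) - a slotted instance carries no per-instance __dict__, which is most of the win:

import sys

class PlainNode:
    def __init__(self):
        self.leaf = self.left = self.parent = self.right = self.value = None

class SlottedNode:
    __slots__ = ("leaf", "left", "parent", "right", "value")

    def __init__(self):
        self.leaf = self.left = self.parent = self.right = self.value = None

plain, slotted = PlainNode(), SlottedNode()
# The plain instance pays for itself plus its attribute dict; the slotted one doesn't
print(sys.getsizeof(plain) + sys.getsizeof(plain.__dict__))
print(sys.getsizeof(slotted))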

I didn't have any better ideas and it was quick to implement, so I added __slots__ on all the classes in py-fast-trie to surprising results:

Primes Inserted Run 1 Run 2 Run 3 Run 4 Run 5
100,000 12.137 12.176 12.192 12.170 12.178
200,000 25.670 25.753 25.853 25.713 25.756
300,000 40.092 40.137 40.377 40.040 40.192
400,000 52.349 52.410 52.644 52.208 52.396
500,000 65.816 65.825 66.144 65.578 65.859
600,000 79.758 79.773 80.195 79.467 79.848
700,000 90.353 90.392 90.724 90.068 90.423
800,000 102.484 102.514 102.775 102.183 102.479
900,000 122.755 122.711 123.494 122.396 122.773
1,000,000 134.319 134.221 135.025 133.940 134.299

My new average runtime was 134.280 seconds, an 8.8% decrease from rewriting insert() and a 73.9% decrease from baseline; but more importantly I managed to reduce memory usage by over 300MB from baseline!

A snippet of a Scalene profile result; the XFastTrie object is using 761 MiB of space

_get_closest_leaf() Value Deconversion

Everybody makes mistakes, and I definitely made one when I thought getting rid of bitbanging would speed up _get_closest_leaf() - but to be fair that has been the truth basically every other time I've tried it. Anyway, reverting the change did what I expected:

Primes Inserted Run 1 Run 2 Run 3 Run 4 Run 5
100,000 12.153 12.111 12.317 12.109 12.073
200,000 25.714 25.662 26.068 25.574 25.584
300,000 40.102 40.022 40.645 39.871 39.918
400,000 52.322 52.177 52.993 52.034 52.098
500,000 65.732 65.545 66.562 65.379 65.468
600,000 79.652 79.449 80.672 79.203 79.334
700,000 90.270 89.917 91.403 89.734 89.858
800,000 102.362 101.868 103.609 101.786 101.852
900,000 122.628 122.275 124.173 121.907 121.997
1,000,000 134.152 133.727 135.824 133.348 133.410

Average runtimes decreased slightly to 133.763 seconds, 0.4% from adding __slots__ and 74.0% overall. I thought the speedup would be greater since moving away from bitbanging had caused a 3.3% increase, but I assume rewriting insert() decreased the benefit of a faster _get_closest_leaf().

__contains__() Helper Function Inlining

As I was rewriting insert(), it finally clicked that its first line calls __contains__() in py-hopscotch-dict, and that function is still essentially boilerplate:

def __contains__(self, key: Hashable) -> bool:
    _, idx = self._lookup(key)
    return idx is not None

I decided to return to py-hopscotch-dict and inline the _lookup() call into __contains__() like I'd done for most of the other dict access functions:

def __contains__(self, key: Hashable) -> bool:
    result = False
    home_idx = abs(hash(key)) % self._size

    nbhd = self._nbhds[home_idx]
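    # Walk the neighborhood bitmap LSB-first; the slice reverses the bits and drops the "0b" prefix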
    for nbhd_idx, nbr in enumerate(bin(nbhd)[:1:-1]):
        if nbr == "1":
            neighbor = (home_idx + nbhd_idx) % self._size
            data_idx = self._indices[neighbor]

            if data_idx < 0:
                raise RuntimeError(
                    (
                        "Index {} has supposed displaced neighbor that points to "
                        "free index"
                    ).format(home_idx)
                )

            if self._keys[data_idx] == key:
                result = True
                break

    return result

The benefit was less than expected, but still welcome:

Primes Inserted Run 1 Run 2 Run 3 Run 4 Run 5
100,000 11.852 11.913 11.697 11.960 11.787
200,000 25.203 25.313 24.785 25.326 25.017
300,000 39.435 39.555 38.695 39.544 39.077
400,000 51.381 51.550 50.450 51.553 50.905
500,000 64.576 64.808 63.352 64.776 63.963
600,000 78.305 78.566 76.742 78.481 77.540
700,000 88.572 88.879 86.838 88.837 87.703
800,000 100.363 100.694 98.437 100.669 99.373
900,000 120.675 121.089 118.113 120.821 119.443
1,000,000 131.940 132.335 129.160 132.158 130.580

Runtimes now averaged 131.559 seconds, a 1.6% decrease over returning to bitbanging in _get_closest_leaf() and a 74.4% decrease in runtime overall.

_resize() Refactoring

The Scalene reports suggested I was still spending a surprising amount of time in _resize(), so I decided to see what performance improvements were possible there:

for data_idx, key in enumerate(self._keys):
    expected_lookup_idx = abs(hash(key)) % new_size

    open_nbr_idx = self._get_open_neighbor(expected_lookup_idx)
    if open_nbr_idx is None:
        if new_size - expected_lookup_idx < self._nbhd_size:
            free_up_idx = 0
        else:
            free_up_idx = expected_lookup_idx

        self._free_up(free_up_idx)
        open_nbr_idx = cast(int, self._get_open_neighbor(expected_lookup_idx))
    nbhd_idx = (open_nbr_idx - expected_lookup_idx) % new_size
    self._set_neighbor(expected_lookup_idx, nbhd_idx)
    self._set_lookup_index_info(open_nbr_idx, data=data_idx)

I couldn't help thinking eating two calls to _get_open_neighbor() every time a neighborhood was too full was a bit ridiculous, so I inlined the functionality as best I could:

for data_idx, key in enumerate(self._keys):
    found_nbr = False
    home_idx = abs(hash(key)) % new_size

    for offset in range(self._nbhd_size):
        nbr = (home_idx + offset) % self._size
        if self._indices[nbr] == self.FREE_ENTRY:
            self._set_neighbor(home_idx, offset)
            self._set_lookup_index_info(nbr, data=data_idx)
            found_nbr = True
            break

    if not found_nbr:
        free_up_idx = 0 if new_size - home_idx < self._nbhd_size else home_idx
        self._free_up(free_up_idx)

        for offset in range(self._nbhd_size):
            nbr = (home_idx + offset) % self._size
            if self._indices[nbr] == self.FREE_ENTRY:
                self._set_neighbor(home_idx, offset)
                self._set_lookup_index_info(nbr, data=data_idx)
                break

I still don't like writing essentially the same loop twice, but I can't argue with the results:

Primes Inserted Run 1 Run 2 Run 3 Run 4 Run 5
100,000 11.669 11.642 11.647 11.684 11.590
200,000 24.610 24.656 24.608 24.687 24.453
300,000 38.380 38.435 38.350 38.436 38.045
400,000 50.085 50.158 50.023 50.206 49.678
500,000 62.948 62.992 62.840 63.056 62.399
600,000 76.343 76.312 76.166 76.403 75.635
700,000 86.521 86.421 86.273 86.624 85.765
800,000 98.242 98.058 97.906 98.357 97.328
900,000 117.489 117.430 117.345 117.531 116.343
1,000,000 128.543 128.437 128.356 128.608 127.282

I now had an average runtime of 128.445 seconds, a 2.4% decrease from inlining helper functions in __contains__() and a 75.0% decrease from baseline.

Neighborhood Management Inlining

While I was refactoring _resize() I took a look at some of the other helper functions to remind myself how much "help" they actually provide:

def _set_neighbor(self, lookup_idx: int, nbhd_idx: int) -> None:
    if lookup_idx < 0 or nbhd_idx < 0:
        raise ValueError("Indexes cannot be negative")
    elif lookup_idx >= self._size:
        raise ValueError(f"Index {lookup_idx} outside array")
    elif nbhd_idx >= self._nbhd_size:
        raise ValueError("Trying to set neighbor outside neighborhood")

    self._nbhds[lookup_idx] |= 1 << nbhd_idx

def _clear_neighbor(self, lookup_idx: int, nbhd_idx: int) -> None:
    if lookup_idx < 0 or nbhd_idx < 0:
        raise ValueError("Indexes cannot be negative")
    elif lookup_idx >= self._size:
        raise ValueError(f"Index {lookup_idx} outside array")
    elif nbhd_idx >= self._nbhd_size:
        raise ValueError("Trying to clear neighbor outside neighborhood")

    self._nbhds[lookup_idx] &= ~(1 << nbhd_idx)
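
Stripped of the guard rails, each helper is one bitmask operation on the neighborhood integer - here's a quick worked example with a hypothetical neighborhood value, showing the operations are also idempotent:

nbhd = 0b0101          # neighbors at offsets 0 and 2 are occupied
nbhd |= 1 << 3         # set: mark offset 3 as occupied  -> 0b1101
nbhd |= 1 << 3         # setting it again is a no-op     -> 0b1101
nbhd &= ~(1 << 0)      # clear: mark offset 0 as free    -> 0b1100
print(bin(nbhd))       # 0b1100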

I know I originally wrote my code like this to "be more professional" and "stay DRY", but eating a function call in a hot loop for one line of bitbanging is ridiculous and I would stand for it no longer. Even though I'm still not a fan of bitbanging, I couldn't think of a better way that was still idempotent, so I inlined the code as it was:

for data_idx, key in enumerate(self._keys):
    found_nbr = False
    home_idx = abs(hash(key)) % new_size

    for offset in range(self._nbhd_size):
        nbr = (home_idx + offset) % self._size
        if self._indices[nbr] == self.FREE_ENTRY:
            # Mark the neighbor storing the entry index as occupied
            self._nbhds[home_idx] |= 1 << offset
            self._set_lookup_index_info(nbr, data=data_idx)
            found_nbr = True
            break

    if not found_nbr:
        free_up_idx = 0 if new_size - home_idx < self._nbhd_size else home_idx
        self._free_up(free_up_idx)

        for offset in range(self._nbhd_size):
            nbr = (home_idx + offset) % self._size
            if self._indices[nbr] == self.FREE_ENTRY:
                # Mark the neighbor storing the entry index as occupied
                self._nbhds[home_idx] |= 1 << offset
                self._set_lookup_index_info(nbr, data=data_idx)
                break

At least one of my heuristics didn't let me down:

Primes Inserted Run 1 Run 2 Run 3 Run 4 Run 5
100,000 11.453 11.186 11.307 11.374 11.319
200,000 24.095 23.575 23.766 23.958 23.845
300,000 37.361 36.680 36.867 37.208 37.058
400,000 48.807 47.977 48.172 48.634 48.374
500,000 61.321 60.249 60.461 61.084 60.719
600,000 74.274 72.991 73.233 73.960 73.580
700,000 84.264 82.817 83.156 83.960 83.455
800,000 95.777 94.081 94.592 95.492 94.824
900,000 114.183 112.017 112.552 113.673 113.149
1,000,000 124.927 122.507 123.203 124.409 123.798

The average runtime decreased to 123.803 seconds, a 3.6% runtime decrease from refactoring _resize() and a 75.9% decrease from baseline.

Index Information Inlining

The keen-eyed reader may have noticed a second helper function in the snippet of _resize():

def _set_lookup_index_info(
    self, lookup_idx: int, data: Optional[int] = None, nbhd: Optional[int] = None
) -> None:
    if lookup_idx < 0:
        raise ValueError("Indexes cannot be negative")
    elif lookup_idx >= self._size:
        raise ValueError(f"Index {lookup_idx} outside array")

    if data is not None:
        self._indices[lookup_idx] = data

    if nbhd is not None:
        self._nbhds[lookup_idx] = nbhd

As it turned out, I'd built a "two-line" helper function but only ever used one of its legs, so it was effectively a one-line helper function as well. Obviously I got rid of it:

for data_idx, key in enumerate(self._keys):
    found_nbr = False
    home_idx = abs(hash(key)) % new_size

    for offset in range(self._nbhd_size):
        nbr = (home_idx + offset) % self._size
        if self._indices[nbr] == self.FREE_ENTRY:
            # Mark the neighbor storing the entry index as occupied
            self._nbhds[home_idx] |= 1 << offset
            self._indices[nbr] = data_idx
            found_nbr = True
            break

    if not found_nbr:
        free_up_idx = 0 if new_size - home_idx < self._nbhd_size else home_idx
        self._free_up(free_up_idx)

        for offset in range(self._nbhd_size):
            nbr = (home_idx + offset) % self._size
            if self._indices[nbr] == self.FREE_ENTRY:
                # Mark the neighbor storing the entry index as occupied
                self._nbhds[home_idx] |= 1 << offset
                self._indices[nbr] = data_idx
                break

The benefit here was also obvious:

Primes Inserted Run 1 Run 2 Run 3 Run 4 Run 5
100,000 10.723 11.064 10.861 10.850 11.022
200,000 22.405 23.174 22.694 22.719 23.078
300,000 34.683 35.831 35.096 35.128 35.725
400,000 45.415 46.918 45.917 45.973 46.790
500,000 56.952 58.872 57.622 57.696 58.728
600,000 68.932 71.273 69.730 69.862 71.103
700,000 78.363 81.036 79.275 79.456 80.858
800,000 89.294 92.324 90.341 90.546 92.087
900,000 105.790 109.434 106.981 107.250 109.142
1,000,000 115.745 119.785 117.074 117.404 119.458

At 117.979 seconds, the average benchmark runtime decreased 4.7% from inlining neighborhood management and 77.1% from baseline.

Wrapping Up

A line graph plotting average benchmark runtime after various optimization strategies; made using linegraphmaker.co

To be honest, I thought about stopping after I put the bitbanging back in _get_closest_leaf() - I wanted to see a sub-120 second run but had no idea how to shave off another 15 seconds - yet I'm glad I stuck with it. Now my last goal is to insert the first 100k primes in under 10 seconds on this hardware, but I've put too much time into this already. I'm never happy realizing I write such unperformant garbage code, but I am satisfied with the results I get when I sit down and try to improve it. I decided to play a data scientist and put all my data into a spreadsheet so I could do some math on it:

A line graph plotting benchmark runtime standard deviation after various optimization strategies; made using rapidtables.com

I'm glad I managed to ultimately reduce the standard deviation of my benchmark runs by roughly a factor of three, but I don't know what I changed from last time, when I cut it by a factor of 70.

Memory usage declined from 1109 MiB to 761 MiB, a 31% decrease that came pretty much entirely from using __slots__ on TrieNode. I didn't realize how many TrieNode objects I make in py-fast-trie; since there are so many of them a small savings goes a long way, but since there's basically nothing in them I don't have many options to make them smaller.

Just for funsies, I ran my benchmark script without Scalene on my homelab server with a Ryzen 5 5650 and DDR4-3200 memory:

Primes Inserted Runtime
100,000 2.576
200,000 5.332
300,000 8.235
400,000 10.803
500,000 13.563
600,000 16.421
700,000 19.040
800,000 21.339
900,000 25.175
1,000,000 27.512

Feels good man

[1] Perhaps I've read too many SwiftOnSecurity tweets and it's starting to poison my brain, yeesh
[2] The processor was built before they even put the generation in the model number!
[3] "Thankfully"; the trie deletion function also sucked and I rewrote it at the same time, but it doesn't get called as part of my benchmark and was less sucky