Until recently, profilers were a tool I'd heard people gushing over but had never gotten around to trying myself. I put ox_profile on my list of neat tools to try when I heard about it at PyCon 2019, then replaced it with Scalene when I heard about that at PyCon 2021, but I still didn't have any plans to see what a profiler could actually do for me.
A few months back, I got a bug up my butt about making a library I wrote run faster and use less memory; I have definitely made the library faster and hopefully with a couple more days of work I can get it to use less memory.
Prologue
Since I was trying to profile a library instead of a standard application, I needed something to invoke the library and give me data I could use; I came up with this:
    from pathlib import Path
    from time import perf_counter
    from zipfile import ZipFile

    from py_fast_trie import XFastTrie


    def main():
        t = XFastTrie()
        start = perf_counter()

        for i in range(1, 2):
            name = f"primes{i}"
            p = Path(f"~/{name}.zip").expanduser()
            ctr = 0

            with ZipFile(p) as z:
                with z.open(f"{name}.txt") as f:
                    print(f"opened {name}.txt")

                    for l in f.readlines()[2:]:
                        for n in l.split():
                            t += int(n)
                            ctr += 1

                            if ctr % 100000 == 0:
                                print(f"{ctr}: {perf_counter() - start}")


    if __name__ == "__main__":
        main()
In short, this script reads one million prime numbers from a zipfile into an x-fast trie. It also prints its elapsed runtime every 100,000 primes inserted, to spotlight any behavior during insertion that might not show up in the overall runtime.
My testing methodology was to reboot my computer before each set of runs to avoid interference from other processes and start with a completely cold cache, then let Scalene profile the script five times to generate data points. I dropped the best and worst runs before averaging the runtime of a given build, and used the Scalene profile data to find hotspots for optimization.
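The drop-the-best-and-worst averaging described above is a trimmed mean; a minimal sketch of a helper that reproduces it (the function name is mine, not the library's):

```python
def trimmed_mean(runtimes):
    """Average a set of runs after dropping the single best and worst."""
    if len(runtimes) < 3:
        raise ValueError("Need at least three runs")
    kept = sorted(runtimes)[1:-1]  # drop the fastest and slowest runs
    return sum(kept) / len(kept)

# The five baseline runtimes for all 1,000,000 primes, from the table below
baseline = [1281.768, 1274.578, 1273.444, 1275.811, 1306.384]
print(round(trimmed_mean(baseline), 3))  # → 1277.386
```

Running it on each table's last row reproduces the per-build averages quoted throughout.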
Baseline
Primes Inserted | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
---|---|---|---|---|---|
100,000 | 115.665 | 112.900 | 111.302 | 115.191 | 114.632 |
200,000 | 241.377 | 239.297 | 235.710 | 239.913 | 242.562 |
300,000 | 368.667 | 366.665 | 362.053 | 369.749 | 375.604 |
400,000 | 492.063 | 491.821 | 485.105 | 492.931 | 501.103 |
500,000 | 618.964 | 618.813 | 612.279 | 618.787 | 631.176 |
600,000 | 748.827 | 747.478 | 742.050 | 747.763 | 764.497 |
700,000 | 875.075 | 872.175 | 867.659 | 872.513 | 893.623 |
800,000 | 1004.142 | 1000.391 | 996.166 | 1000.819 | 1025.974 |
900,000 | 1153.979 | 1148.325 | 1145.739 | 1149.454 | 1178.279 |
1,000,000 | 1281.768 | 1274.578 | 1273.444 | 1275.811 | 1306.384 |
The baseline numbers weren't great at an average of 1277.386 seconds, but with nothing to compare them to, I didn't know whether I was expecting more than I should. Before I began profiling, I had a hunch most of my time and memory would be spent in a dependency I wrote - I tried something fancy with how I stored some underlying data, which I expected to come back to bite me - and I was right, but not where I thought:
I was spending 62.5% of my time in _get_displaced_neighbors(), which didn't have much code in it:
    def _get_displaced_neighbors(lookup_idx, nbhd, nbhd_size, max_size):
        if lookup_idx < 0:
            raise ValueError("Indexes cannot be negative")
        elif lookup_idx >= max_size:
            raise ValueError(f"Index {lookup_idx} outside array")

        result = []

        for i in range(nbhd_size):
            if nbhd & (1 << i) > 0:
                result.append((lookup_idx + i) % max_size)

        return result
The per-line data wasn't much nicer.
I swear the point of a profiler is to make you feel like an absolute idiot, so when you stop sucking you feel like a genius. Nevertheless, spending half my time bit-banging is about the lowest-hanging fruit I could possibly get, so I started there.
String Manipulation
While looking around for faster ways to pull displaced neighbors out of a bit field, I found a Stack Overflow post suggesting turning the field into an actual string and manipulating it that way; the function now looked like this:
    def _get_displaced_neighbors(lookup_idx, nbhd, max_size):
        if lookup_idx < 0:
            raise ValueError("Indexes cannot be negative")
        elif lookup_idx >= max_size:
            raise ValueError(f"Index {lookup_idx} outside array")

        result = []

        for nbhd_idx, neighbor in enumerate(bin(nbhd)[:1:-1]):
            if neighbor == "1":
                result.append((lookup_idx + nbhd_idx) % max_size)

        return result
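The slice bin(nbhd)[:1:-1] does the heavy lifting here: bin() renders the mask as a "0b..."-prefixed string, and stepping backwards while stopping before index 1 both strips the prefix and yields the bits least-significant first. A quick sanity check (helper names are mine) that it matches the bit-shifting loop:

```python
def neighbors_bits(nbhd, size=8):
    # The original approach: test each bit with a shift-and-mask
    return [i for i in range(size) if nbhd & (1 << i)]

def neighbors_str(nbhd):
    # The string approach: reversed binary string, "0b" prefix sliced away
    return [i for i, c in enumerate(bin(nbhd)[:1:-1]) if c == "1"]

# Both should report the same set bit positions for every 8-bit mask
for mask in range(256):
    assert neighbors_bits(mask) == neighbors_str(mask)
print("both approaches agree for every 8-bit mask")
```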
The results were very promising:
Primes Inserted | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
---|---|---|---|---|---|
100,000 | 65.006 | 61.806 | 62.424 | 64.088 | 62.385 |
200,000 | 130.040 | 127.776 | 126.299 | 130.893 | 127.129 |
300,000 | 195.861 | 193.759 | 192.098 | 196.506 | 192.607 |
400,000 | 260.300 | 257.007 | 256.125 | 259.970 | 256.474 |
500,000 | 325.722 | 321.910 | 321.469 | 324.689 | 321.622 |
600,000 | 391.856 | 387.077 | 387.087 | 389.936 | 386.920 |
700,000 | 456.190 | 452.615 | 450.729 | 453.175 | 450.606 |
800,000 | 519.202 | 517.478 | 515.014 | 516.446 | 514.334 |
900,000 | 591.787 | 592.855 | 588.375 | 589.005 | 587.216 |
1,000,000 | 654.960 | 660.119 | 652.161 | 651.986 | 650.770 |
The new average of 653.036 seconds was faster by 624.35 seconds, a 48.9% reduction in runtime. After this change the library was only spending 31.5% of its time in _get_displaced_neighbors().
Buoyed by the results of my first attempt at optimization, I pressed forward.
List Comprehensions
I was still concerned about what Scalene said was happening to memory inside _get_displaced_neighbors(). I was fairly certain those lines were flagged as potential issues simply because that's where the script spends most of its time, but in case the enumerate() calls were doing more than I thought, I decided to develop a new plan of attack.
I decided to take a flyer next go-round - comprehensions in Python are supposed to be faster than the corresponding imperative code, right? And hopefully they'd save a bit of memory too. The function now looked like this:
    def _get_displaced_neighbors(lookup_idx, nbhd, max_size):
        if lookup_idx < 0:
            raise ValueError("Indexes cannot be negative")
        elif lookup_idx >= max_size:
            raise ValueError(f"Index {lookup_idx} outside array")

        result = [(lookup_idx + i) % max_size
                  for (i, c) in enumerate(bin(nbhd)[:1:-1])
                  if c == "1"]

        return result
After rebooting and retesting, I got these results:
Primes Inserted | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
---|---|---|---|---|---|
100,000 | 73.705 | 72.608 | 75.885 | 73.100 | 73.176 |
200,000 | 148.738 | 147.913 | 153.808 | 149.307 | 149.385 |
300,000 | 224.068 | 223.060 | 231.448 | 225.558 | 226.757 |
400,000 | 296.912 | 296.410 | 306.960 | 300.250 | 302.188 |
500,000 | 371.301 | 371.032 | 384.256 | 376.168 | 379.173 |
600,000 | 446.013 | 446.353 | 461.598 | 452.560 | 455.946 |
700,000 | 518.918 | 520.006 | 537.578 | 527.141 | 531.602 |
800,000 | 593.010 | 593.645 | 613.945 | 601.947 | 607.021 |
900,000 | 676.714 | 677.744 | 700.153 | 686.662 | 692.907 |
1,000,000 | 749.768 | 750.165 | 776.237 | 760.944 | 768.211 |
Not all experiments pan out - this one certainly didn't: at 759.773 seconds, the average runtime is 16.3% longer than string manipulation, with no change in memory usage - but I didn't let that slow me down.
Inlining
While looking for the next place to make gains, I realized something: _get_displaced_neighbors() was made its own function for separation of duties, but it's only ever called from _get_lookup_index_info() and there are only a few lines to it - and function calls aren't cheap, so why not inline it? It would make sense:
    def _get_lookup_index_info(self, lookup_idx):
        if lookup_idx < 0:
            raise ValueError("Indexes cannot be negative")
        elif lookup_idx >= self._size:
            raise ValueError(f"Index {lookup_idx} outside array")

        neighbors = []
        lookup_offset = calcsize(self._pack_fmt) * lookup_idx
        data_idx, nbhd = unpack_from(self._pack_fmt, self._lookup_table, lookup_offset)

        for nbhd_idx, neighbor in enumerate(bin(nbhd)[:1:-1]):
            if neighbor == "1":
                neighbors.append((lookup_idx + nbhd_idx) % self._size)

        return data_idx, neighbors
In addition to saving a function call, _get_lookup_index_info() has the same argument checks as _get_displaced_neighbors(), so they wouldn't need to be done twice. Everything appeared to come up Milhouse, so all that was left was testing to confirm:
Primes Inserted | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
---|---|---|---|---|---|
100,000 | 61.012 | 59.153 | 57.793 | 57.753 | 58.787 |
200,000 | 122.970 | 120.708 | 116.476 | 120.183 | 120.297 |
300,000 | 184.217 | 184.045 | 176.358 | 180.544 | 181.876 |
400,000 | 244.400 | 243.895 | 234.348 | 238.553 | 240.313 |
500,000 | 305.208 | 305.069 | 293.614 | 297.827 | 300.458 |
600,000 | 367.373 | 366.513 | 353.343 | 357.399 | 360.990 |
700,000 | 426.911 | 426.363 | 411.766 | 415.699 | 419.735 |
800,000 | 486.667 | 484.390 | 470.073 | 474.072 | 478.545 |
900,000 | 555.432 | 551.842 | 538.063 | 542.677 | 546.490 |
1,000,000 | 615.373 | 610.409 | 597.570 | 602.015 | 605.454 |
Inlining neighbor detection resulted in an average of 605.959 seconds, over 52% faster than the baseline and a little over 7% faster than string manipulation alone. More importantly, it gave me my first run under 10 minutes - having gone from 21 minutes to 11 with just one change, I knew knocking off another minute was doable. The share of time spent on this work was also still shrinking: the combined 53.3% spent in _get_displaced_neighbors() and _get_lookup_index_info() under the fastest string-manipulation build dropped to 49.5% in the inlined _get_lookup_index_info().
Neighbor Toggling
At this point I was running kinda dry - there wasn't much to the inlined _get_lookup_index_info(), but Scalene still said almost half my time was spent there. There wasn't much left to remove other than some argument checks, while a quarter of my time was ostensibly spent searching for neighbors.
The worst part was that most calls to _get_lookup_index_info() didn't even use the neighbors - they were slowing the library down for essentially no benefit. I tried everything I could think of to compute them lazily, from generators to coroutines, and nothing worked - until I had a brain blast: why can't I just put them behind a flag? _get_lookup_index_info() is only called in about six places, so adding an argument for whether or not to run neighbor detection wouldn't be that hard to do, and I did it:
    def _get_lookup_index_info(self, lookup_idx, get_neighbors=False):
        if lookup_idx < 0:
            raise ValueError("Indexes cannot be negative")
        elif lookup_idx >= self._size:
            raise ValueError(f"Index {lookup_idx} outside array")

        neighbors = []
        lookup_offset = calcsize(self._pack_fmt) * lookup_idx
        data_idx, nbhd = unpack_from(self._pack_fmt, self._lookup_table, lookup_offset)

        if get_neighbors:
            for nbhd_idx, neighbor in enumerate(bin(nbhd)[:1:-1]):
                if neighbor == "1":
                    neighbors.append((lookup_idx + nbhd_idx) % self._size)

        return data_idx, neighbors
The results speak for themselves:
Primes Inserted | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
---|---|---|---|---|---|
100,000 | 50.270 | 52.114 | 51.133 | 51.797 | 49.237 |
200,000 | 102.333 | 106.391 | 103.681 | 106.344 | 101.553 |
300,000 | 155.588 | 159.445 | 157.069 | 159.732 | 152.880 |
400,000 | 207.121 | 211.264 | 209.120 | 212.322 | 203.195 |
500,000 | 258.304 | 264.065 | 260.925 | 265.330 | 254.578 |
600,000 | 309.204 | 317.347 | 313.710 | 319.027 | 306.253 |
700,000 | 358.971 | 369.373 | 365.078 | 371.287 | 357.013 |
800,000 | 408.148 | 421.080 | 416.441 | 422.898 | 406.914 |
900,000 | 465.355 | 480.437 | 475.625 | 482.984 | 464.682 |
1,000,000 | 515.270 | 531.929 | 527.056 | 534.965 | 514.848 |
With an average runtime of 524.751 seconds, the updated library runs almost 59% faster than baseline and over 13% faster than inlining alone.
Lookup Table Entry Size Precomputation
At this point Scalene said my library was still spending over 40% of its time in _get_lookup_index_info(), and there weren't many other stones left to turn over. Given the benefit from inlining _get_displaced_neighbors(), I'm somewhat surprised I ever doubted that eliminating the calcsize() call would yield much of an improvement, but I knew you have to shed as much dead weight as possible in your hot loops, so I stored the value in an instance variable and referenced that in the function:
    def _get_lookup_index_info(self, lookup_idx, get_neighbors=False):
        if lookup_idx < 0:
            raise ValueError("Indexes cannot be negative")
        elif lookup_idx >= self._size:
            raise ValueError(f"Index {lookup_idx} outside array")

        neighbors = []
        lookup_offset = self._lookup_size * lookup_idx
        data_idx, nbhd = unpack_from(self._pack_fmt, self._lookup_table, lookup_offset)

        if get_neighbors:
            for nbhd_idx, neighbor in enumerate(bin(nbhd)[:1:-1]):
                if neighbor == "1":
                    neighbors.append((lookup_idx + nbhd_idx) % self._size)

        return data_idx, neighbors
Let me never doubt how much a "small" function call costs in a hot loop again:
Primes Inserted | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
---|---|---|---|---|---|
100,000 | 48.860 | 49.114 | 48.528 | 48.684 | 50.097 |
200,000 | 99.665 | 99.705 | 99.455 | 98.915 | 100.216 |
300,000 | 148.832 | 150.283 | 150.723 | 150.419 | 150.644 |
400,000 | 196.822 | 199.569 | 200.046 | 199.414 | 199.440 |
500,000 | 246.204 | 249.753 | 251.155 | 249.415 | 249.322 |
600,000 | 296.087 | 300.836 | 303.075 | 299.705 | 299.609 |
700,000 | 344.516 | 350.515 | 353.604 | 349.321 | 348.919 |
800,000 | 392.337 | 398.952 | 403.745 | 398.809 | 398.923 |
900,000 | 447.934 | 455.667 | 461.380 | 456.021 | 456.071 |
1,000,000 | 496.665 | 504.922 | 512.022 | 504.274 | 506.819 |
With a new average of 505.338 seconds, the test script now runs just over 60% faster than baseline and 3.7% faster than toggling neighbor generation alone.
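The hoist is safe because struct.calcsize() is pure - the size depends only on the format string - so the value can be computed once at construction time. A rough sketch of what's being saved, using a representative format; the caching pattern here is my own illustration, not the library's code:

```python
from struct import calcsize
from timeit import timeit

fmt = ">i I"  # a representative index/neighborhood entry format
size = calcsize(fmt)  # computed once, e.g. in __init__
print(size)  # → 8: a 4-byte signed int plus a 4-byte unsigned int

# Per-call cost vs. reading a precomputed value; exact numbers vary by machine
per_call = timeit(lambda: calcsize(fmt), number=100_000)
cached = timeit(lambda: size, number=100_000)
print(f"calcsize each time: {per_call:.3f}s, cached: {cached:.3f}s")
```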
Splitting Index/Neighbor Data
At this point I shifted my focus to accessing the housekeeping data in self._lookup_table more quickly. I couldn't stop thinking that whatever space savings I got from combining lookup index and neighborhood data into a single data structure was more than lost in time spent calling unpack_from() every time I needed the data, so I split each out into its own array. When I began, the function creating the housekeeping data structure looked like this:
    def _make_lookup_table(table_size: int) -> tuple[bytearray, str, int]:
        if table_size < 0:
            raise ValueError("Lookup table cannot have negative length")

        table_log_size = table_size.bit_length()

        if table_log_size < 8:
            struct_fmt = "b B"
        elif table_log_size < 16:
            struct_fmt = ">h H"
        elif table_log_size < 32:
            struct_fmt = ">i I"
        else:
            struct_fmt = ">l L"

        return (
            bytearray(pack(struct_fmt, HopscotchDict.FREE_ENTRY, 0) * table_size),
            struct_fmt,
            calcsize(struct_fmt),
        )
After splitting lookup index data and neighborhood data apart, the function looked like this:
    def _make_lookup_tables(table_size: int) -> tuple[array[int], array[int]]:
        if table_size < 0:
            raise ValueError("Lookup table cannot have negative length")

        table_log_size = table_size.bit_length()

        if table_log_size < 8:
            index_fmt = "b"
            nbhd_fmt = "B"
        elif table_log_size < 16:
            index_fmt = "h"
            nbhd_fmt = "H"
        elif table_log_size < 32:
            index_fmt = "i"
            nbhd_fmt = "I"
        else:
            index_fmt = "l"
            nbhd_fmt = "L"

        lookup_table = array(index_fmt, (HopscotchDict.FREE_ENTRY,) * table_size)
        nbhd_table = array(nbhd_fmt, (0,) * table_size)

        return (lookup_table, nbhd_table)
After updating all the downstream functions, I received a nice reduction in runtime for my efforts:
Primes Inserted | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
---|---|---|---|---|---|
100,000 | 42.586 | 43.356 | 42.258 | 42.285 | 43.635 |
200,000 | 87.960 | 89.509 | 86.394 | 87.398 | 90.212 |
300,000 | 133.104 | 136.025 | 132.977 | 132.543 | 137.193 |
400,000 | 175.373 | 179.548 | 175.902 | 175.404 | 181.664 |
500,000 | 219.386 | 224.529 | 219.753 | 219.754 | 227.008 |
600,000 | 264.242 | 270.176 | 264.836 | 264.754 | 273.625 |
700,000 | 306.358 | 313.558 | 308.133 | 307.340 | 318.601 |
800,000 | 349.192 | 356.049 | 352.180 | 348.939 | 361.678 |
900,000 | 401.421 | 410.147 | 404.848 | 401.295 | 413.659 |
1,000,000 | 444.160 | 455.015 | 447.733 | 443.461 | 457.737 |
My test script now finishes in an average of 448.969 seconds, nearly 64.9% faster than the baseline and 11.1% faster than using the combined data structure.
More Inlining
Once I broke up the housekeeping data structure, I realized there wasn't much use for _get_lookup_index_info() anymore - most callers really wanted either the lookup index or the neighbors for a bucket, and now they could get each easily - so I got rid of it, inlining its code as needed. Even though I had already brought the average runtime of the test script from over 21 minutes to under 8, this change still astounded me:
Primes Inserted | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
---|---|---|---|---|---|
100,000 | 33.634 | 32.534 | 33.731 | 33.640 | 33.592 |
200,000 | 69.395 | 66.432 | 69.372 | 67.865 | 67.757 |
300,000 | 104.827 | 99.754 | 104.390 | 101.887 | 102.279 |
400,000 | 138.866 | 132.095 | 137.511 | 134.830 | 135.320 |
500,000 | 173.498 | 165.230 | 171.808 | 168.828 | 169.377 |
600,000 | 208.386 | 198.927 | 206.518 | 202.880 | 203.921 |
700,000 | 242.714 | 231.813 | 240.597 | 235.885 | 238.040 |
800,000 | 275.938 | 263.833 | 273.817 | 267.645 | 271.588 |
900,000 | 314.713 | 300.933 | 312.255 | 304.742 | 310.409 |
1,000,000 | 348.945 | 333.755 | 345.821 | 337.084 | 344.395 |
Inlining _get_lookup_index_info() brought my average down to 342.433 seconds, 73.1% faster than the baseline and a whopping 23.7% reduction from accessing lookup indices and neighbor data separately. I couldn't believe there were still optimizations to be found that could knock over 100 seconds off my average, and I knew there were somehow still more out there.
Even More Inlining
At this point I had two goals: to see a run insert its first 100,000 primes in under 30 seconds, and to see a run finish in under five minutes. With the first 100,000 primes currently averaging around 33 seconds, I knew the first goal was doable, but unless I found another big piece of low-hanging fruit I might not reach the second. The Scalene profiles from my last set of runs showed the test script spending over 15% of its time in get() and __getitem__(), even though there isn't much to them:
    def __getitem__(self, key: Hashable) -> Any:
        _, idx = self._lookup(key)

        if idx is not None:
            return self._values[idx]
        else:
            raise KeyError(key)

    def get(self, key: Hashable, default: Any = None) -> Any:
        out = default

        try:
            out = self.__getitem__(key)
        except KeyError:
            pass

        return out
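Part of what makes this shape of get() expensive is that every miss pays for both an extra function call and a raised-and-caught KeyError. A toy model of the two shapes, using standalone dict-based stand-ins rather than the library's classes:

```python
def lookup(table, key):
    # Stand-in for __getitem__: raises on a miss
    if key in table:
        return table[key]
    raise KeyError(key)

def get_via_call(table, key, default=None):
    # Original shape: delegate to the lookup and catch the exception
    try:
        return lookup(table, key)
    except KeyError:
        return default

def get_inlined(table, key, default=None):
    # Inlined shape: no extra call, no exception on a miss
    return table[key] if key in table else default

table = {2: "a", 3: "b"}
for key in (2, 3, 5):  # two hits and a miss behave identically
    assert get_via_call(table, key) == get_inlined(table, key)
print("inlined get() matches the call-and-catch version")
```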
As a new convert to the church of "function calls aren't free", I decided to inline these functions one at a time, so I could see whether one made more of a difference than the other. I started with __getitem__() since inlining it should speed up calls to get() as well:
    def __getitem__(self, key: Hashable) -> Any:
        target_idx = abs(hash(key)) % self._size
        nbhd = self._nbhds[target_idx]

        for nbhd_idx, nbr in enumerate(bin(nbhd)[:1:-1]):
            if nbr == "1":
                neighbor = (target_idx + nbhd_idx) % self._size
                nbr_data_idx = self._indices[neighbor]

                if nbr_data_idx < 0:
                    raise RuntimeError(
                        (
                            "Index {} has supposed displaced neighbor that points to "
                            "free index"
                        ).format(target_idx)
                    )

                if self._keys[nbr_data_idx] == key:
                    return self._values[nbr_data_idx]

        raise KeyError(key)
The results of this change were about what I expected:
Primes Inserted | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
---|---|---|---|---|---|
100,000 | 31.293 | 30.706 | 31.081 | 30.277 | 30.649 |
200,000 | 64.821 | 61.902 | 63.436 | 62.404 | 62.803 |
300,000 | 98.271 | 95.039 | 95.210 | 95.691 | 94.354 |
400,000 | 129.184 | 125.397 | 125.962 | 126.203 | 124.701 |
500,000 | 160.912 | 157.095 | 157.568 | 157.721 | 156.081 |
600,000 | 193.148 | 189.096 | 189.567 | 189.612 | 188.144 |
700,000 | 224.824 | 220.124 | 220.731 | 221.112 | 219.256 |
800,000 | 255.335 | 250.220 | 250.941 | 251.277 | 249.239 |
900,000 | 291.044 | 285.749 | 286.396 | 286.922 | 284.597 |
1,000,000 | 322.325 | 316.716 | 317.424 | 317.797 | 315.226 |
This set of runs had an average of 317.312 seconds, a 75.1% reduction from baseline and a 7.3% reduction from inlining _get_lookup_index_info(). Unfortunately no run managed to insert its first 100,000 primes in under 30 seconds, but I was so close I could taste it. Inlining get() was just as straightforward as inlining __getitem__():
    def get(self, key: Hashable, default: Any = None) -> Any:
        value = default
        target_idx = abs(hash(key)) % self._size
        nbhd = self._nbhds[target_idx]

        for nbhd_idx, nbr in enumerate(bin(nbhd)[:1:-1]):
            if nbr == "1":
                neighbor = (target_idx + nbhd_idx) % self._size
                nbr_data_idx = self._indices[neighbor]

                if nbr_data_idx < 0:
                    raise RuntimeError(
                        (
                            "Index {} has supposed displaced neighbor that points to "
                            "free index"
                        ).format(target_idx)
                    )

                if self._keys[nbr_data_idx] == key:
                    value = self._values[nbr_data_idx]
                    break

        return value
However, this codebase had one last surprise for me:
Primes Inserted | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
---|---|---|---|---|---|
100,000 | 27.323 | 28.238 | 29.217 | 28.369 | 28.267 |
200,000 | 57.053 | 57.541 | 59.872 | 57.111 | 57.497 |
300,000 | 86.559 | 86.707 | 90.534 | 86.132 | 86.896 |
400,000 | 113.862 | 114.659 | 119.797 | 113.779 | 114.805 |
500,000 | 141.902 | 143.490 | 149.893 | 142.246 | 143.362 |
600,000 | 170.320 | 171.977 | 180.275 | 171.141 | 172.053 |
700,000 | 198.440 | 200.167 | 209.973 | 199.397 | 200.114 |
800,000 | 225.147 | 227.097 | 238.692 | 226.594 | 227.143 |
900,000 | 257.204 | 259.495 | 272.634 | 259.093 | 259.584 |
1,000,000 | 284.555 | 287.787 | 301.939 | 286.972 | 287.585 |
Not only did every run insert its first 100,000 primes in under 30 seconds, almost every run completed in under five minutes; this additional inlining brought my average down to 287.448 seconds, a 77.4% runtime reduction from baseline and a 16% reduction from inlining _get_lookup_index_info(). I couldn't figure out why get() mattered so much until I remembered that x-fast tries call it in their hot loops.
Combining the last two inlined sections didn't do much more to improve runtimes, but it did noticeably reduce variance:
Primes Inserted | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 |
---|---|---|---|---|---|
100,000 | 27.469 | 27.623 | 26.969 | 27.326 | 27.369 |
200,000 | 57.077 | 56.781 | 55.529 | 55.596 | 55.659 |
300,000 | 86.460 | 85.585 | 84.221 | 85.187 | 84.415 |
400,000 | 113.868 | 114.721 | 111.533 | 113.197 | 111.709 |
500,000 | 141.722 | 142.840 | 139.822 | 141.662 | 139.905 |
600,000 | 169.838 | 171.516 | 168.496 | 170.604 | 168.483 |
700,000 | 197.348 | 199.727 | 196.434 | 198.824 | 196.432 |
800,000 | 224.039 | 226.883 | 223.438 | 225.998 | 223.248 |
900,000 | 255.873 | 259.314 | 255.681 | 258.438 | 255.577 |
1,000,000 | 283.175 | 287.230 | 283.378 | 286.249 | 283.121 |
Inlining both get() and __getitem__() resulted in an average test runtime of 284.267 seconds, a 77.74% reduction in runtime from baseline and a 16.98% reduction from inlining _get_lookup_index_info(). At this point I had no more stretch goals to hit, so I decided to stop here.
Wrapping Up
So what is there to say after beating my code with a stick for a couple of days? For starters, when I was recording 1200-second baseline runs I never thought runs anywhere near 300 seconds were possible - I'd hoped I was skilled enough not to write such unperformant code, but here I am. Next, make sure the program used to generate profiling data reflects real-world usage - I didn't do this: my use case for this library is lookup-heavy, but I figured there was fat to cut in my insertion logic and I wasn't going to save any RAM with my lookups. Lastly, use real tools to find where your actual hot loops are, and accept that you'll be surprised when you find them - I never would've guessed how glacial bit-banging to find displaced neighbors would be.
As for tools: after using Scalene for a couple of days I have some concerns about the accuracy of its memory tagging, and I wish I had the option to see which callers hit a particular function the most, but perhaps some of these gripes are signs I need more than just Scalene. I tried an additional profiler later in the process, but at the time it was just more information without more utility.
Unfortunately, despite all my runtime gains, I didn't seem to make a dent in memory usage. I suspect most of the memory growth comes from the x-fast trie itself, so I'll have to target it directly to see if I can bring its memory usage down.