My Little Profiler: Flamegraph is Magic

Mon 12 December 2022

Until recently, profilers were a tool I'd heard people gushing over but never gotten around to trying myself. I put ox_profile on my list of neat tools to try out when I heard of it at PyCon in 2019 and replaced it with Scalene when I heard of that at PyCon in 2021, but I still didn't have any plans to see what a profiler could actually do for me.

A few months back, I got a bug up my butt about making a library I wrote run faster and use less memory; I have definitely made the library faster and hopefully with a couple more days of work I can get it to use less memory.

Prologue

Since I was trying to profile a library instead of a standard application, I needed something to invoke the library and give me data I could use; I came up with this:

from pathlib import Path
from time import perf_counter
from zipfile import ZipFile

from py_fast_trie import XFastTrie

def main():
    t = XFastTrie()
    start = perf_counter()

    for i in range(1, 2):
        name = f"primes{i}"
        p = Path(f"~/{name}.zip").expanduser()
        ctr = 0

        with ZipFile(p) as z:
            with z.open(f"{name}.txt") as f:
                print(f"opened {name}.txt")
                for l in f.readlines()[2:]:
                    for n in l.split():
                        t += int(n)
                        ctr += 1

                        if ctr % 100000 == 0:
                            print(f"{ctr}: {perf_counter() - start}")

if __name__ == "__main__": main()

In short, this script reads one million prime numbers from a zipfile into an X-fast trie. Additionally, the script prints its elapsed runtime every 100,000 primes inserted, to spotlight any behavior during insertion that might not show up in the overall runtime.

My testing methodology was to reboot my computer before each set of runs, both to avoid interference from other processes and to start with a completely cold cache, then let Scalene profile the script five times to generate data points, then drop the best and worst runs before averaging the remaining runtimes for a given build. The Scalene profile data also pointed me at hotspots worth optimizing.
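In code, a build's headline number boiled down to something like this (a sketch: the trimmed-average helper is hypothetical, and each data point came from running the script above under Scalene's command-line profiler):

# Hypothetical helper matching the methodology above: five profiled runs per
# build, drop the best and worst, average the remaining three.
def trimmed_average(runtimes: list[float]) -> float:
    assert len(runtimes) == 5, "methodology expects exactly five runs"
    return sum(sorted(runtimes)[1:-1]) / 3

# e.g. the five final runtimes from the baseline table below
trimmed_average([1281.768, 1274.578, 1273.444, 1275.811, 1306.384])  # -> ~1277.386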

Baseline

Primes Inserted Run 1 Run 2 Run 3 Run 4 Run 5
100,000 115.665 112.900 111.302 115.191 114.632
200,000 241.377 239.297 235.710 239.913 242.562
300,000 368.667 366.665 362.053 369.749 375.604
400,000 492.063 491.821 485.105 492.931 501.103
500,000 618.964 618.813 612.279 618.787 631.176
600,000 748.827 747.478 742.050 747.763 764.497
700,000 875.075 872.175 867.659 872.513 893.623
800,000 1004.142 1000.391 996.166 1000.819 1025.974
900,000 1153.979 1148.325 1145.739 1149.454 1178.279
1,000,000 1281.768 1274.578 1273.444 1275.811 1306.384

The baseline numbers weren't great at an average of 1277.386 seconds, but with nothing to compare them to I couldn't tell whether I was expecting more than I should. Before I began profiling, I had a hunch most of my time and memory would be spent in a dependency I wrote - I tried something fancy with how I stored some underlying data, which I expected to come back to bite me - and I was right, but not where I thought:

A subset of frequently-called functions; _get_displaced_neighbors takes up 62.5% of runtime

I was spending 62.5% of my time in _get_displaced_neighbors(), which didn't have much code in it:

def _get_displaced_neighbors(lookup_idx, nbhd, nbhd_size, max_size):
    if lookup_idx < 0:
        raise ValueError("Indexes cannot be negative")
    elif lookup_idx >= max_size:
        raise ValueError(f"Index {lookup_idx} outside array")

    result = []

    for i in range(nbhd_size):
        if nbhd & (1 << i) > 0:
            result.append((lookup_idx + i) % max_size)

    return result
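To make the bit-banging concrete: each set bit i in nbhd marks a displaced neighbor stored i slots past lookup_idx, wrapping at max_size, so a call looks like this:

# bits 0 and 2 of the neighborhood are set, so the displaced neighbors sit at
# offsets 0 and 2 from the lookup index (mod the table size)
_get_displaced_neighbors(lookup_idx=3, nbhd=0b101, nbhd_size=8, max_size=8)  # -> [3, 5]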

The per-line data wasn't much nicer:

A subset of frequently-hit lines; finding neighbors takes up 49.5% of runtime

I swear the point of a profiler is to make you feel like an absolute idiot, so when you stop sucking you feel like a genius. Nevertheless, spending half my time bit-banging is about the lowest-hanging fruit I could possibly get, so I started there.

String Manipulation

I started looking around for faster ways to pull displaced neighbors out of a bit field when I found a StackOverflow post suggesting to turn the field into an actual string and manipulate it that way, so the function would now look like this:

def _get_displaced_neighbors(lookup_idx, nbhd, max_size):
    if lookup_idx < 0:
        raise ValueError("Indexes cannot be negative")
    elif lookup_idx >= max_size:
        raise ValueError(f"Index {lookup_idx} outside array")

    result = []

    for nbhd_idx, neighbor in enumerate(bin(nbhd)[:1:-1]):
        if neighbor == "1":
            result.append((lookup_idx + nbhd_idx) % max_size)

    return result
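The slightly cryptic bin(nbhd)[:1:-1] slice reverses the binary string and stops just short of the '0b' prefix, so the characters come out least-significant bit first, the same order the old shift-and-mask loop walked them in:

nbhd = 0b1100101
bin(nbhd)         # '0b1100101'
bin(nbhd)[:1:-1]  # '1010011' - prefix stripped, least-significant bit first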

The results were very promising:

Primes Inserted Run 1 Run 2 Run 3 Run 4 Run 5
100,000 65.006 61.806 62.424 64.088 62.385
200,000 130.040 127.776 126.299 130.893 127.129
300,000 195.861 193.759 192.098 196.506 192.607
400,000 260.300 257.007 256.125 259.970 256.474
500,000 325.722 321.910 321.469 324.689 321.622
600,000 391.856 387.077 387.087 389.936 386.920
700,000 456.190 452.615 450.729 453.175 450.606
800,000 519.202 517.478 515.014 516.446 514.334
900,000 591.787 592.855 588.375 589.005 587.216
1,000,000 654.960 660.119 652.161 651.986 650.770

The new average of 653.036 seconds was faster by 624.35 seconds, a 48.9% reduction in runtime. After this change the library was only spending 31.5% of its time in _get_displaced_neighbors().

A subset of frequently-hit lines after switching to string manipulation for neighbor detection

Buoyed by the results of my first attempt at optimization, I pressed forward.

List Comprehensions

I was still concerned about what Scalene said was happening to memory inside _get_displaced_neighbors() - I was fairly certain those lines were only marked as potential issues because that's where the script spends most of its time, but in case all those enumerate() objects were doing more than I thought, I decided to develop a new plan of attack.

I decided to take a flyer on the next go-round - comprehensions in Python are supposed to be faster than the corresponding imperative code, right? And hopefully they'd save a bit of memory too. In the end the function looked like this:

def _get_displaced_neighbors(lookup_idx, nbhd, max_size):
    if lookup_idx < 0:
        raise ValueError("Indexes cannot be negative")
    elif lookup_idx >= max_size:
        raise ValueError(f"Index {lookup_idx} outside array")

    result = [(lookup_idx + i) % max_size for (i, c) in enumerate(bin(nbhd)[:1:-1]) if c == "1"]

    return result

After rebooting and retesting, I got these results:

Primes Inserted Run 1 Run 2 Run 3 Run 4 Run 5
100,000 73.705 72.608 75.885 73.100 73.176
200,000 148.738 147.913 153.808 149.307 149.385
300,000 224.068 223.060 231.448 225.558 226.757
400,000 296.912 296.410 306.960 300.250 302.188
500,000 371.301 371.032 384.256 376.168 379.173
600,000 446.013 446.353 461.598 452.560 455.946
700,000 518.918 520.006 537.578 527.141 531.602
800,000 593.010 593.645 613.945 601.947 607.021
900,000 676.714 677.744 700.153 686.662 692.907
1,000,000 749.768 750.165 776.237 760.944 768.211

Not all experiments pan out, and this one certainly didn't: at 759.773 seconds, the average runtime was 16.3% longer than with string manipulation, with no change to memory usage. But I didn't let this slow me down.

Inlining

While looking for the next place to make gains, I realized something: _get_displaced_neighbors() was made its own function for separation of duties, but it's only ever called from _get_lookup_index_info() and there are only three lines to it - plus function calls aren't cheap, so why not inline it? It would make sense:

def _get_lookup_index_info(self, lookup_idx):
    if lookup_idx < 0:
        raise ValueError("Indexes cannot be negative")
    elif lookup_idx >= self._size:
        raise ValueError(f"Index {lookup_idx} outside array")

    neighbors = []
    lookup_offset = calcsize(self._pack_fmt) * lookup_idx

    data_idx, nbhd = unpack_from(self._pack_fmt, self._lookup_table, lookup_offset)

    for nbhd_idx, neighbor in enumerate(bin(nbhd)[:1:-1]):
        if neighbor == "1":
            neighbors.append((lookup_idx + nbhd_idx) % self._size)

    return data_idx, neighbors

In addition to saving a function call, _get_lookup_index_info() has the same argument checks as _get_displaced_neighbors(), so they wouldn't need to be done twice. Everything appeared to come up Milhouse, so all that was left was to test and confirm:

Primes Inserted Run 1 Run 2 Run 3 Run 4 Run 5
100,000 61.012 59.153 57.793 57.753 58.787
200,000 122.970 120.708 116.476 120.183 120.297
300,000 184.217 184.045 176.358 180.544 181.876
400,000 244.400 243.895 234.348 238.553 240.313
500,000 305.208 305.069 293.614 297.827 300.458
600,000 367.373 366.513 353.343 357.399 360.990
700,000 426.911 426.363 411.766 415.699 419.735
800,000 486.667 484.390 470.073 474.072 478.545
900,000 555.432 551.842 538.063 542.677 546.490
1,000,000 615.373 610.409 597.570 602.015 605.454

Inlining neighbor detection resulted in an average of 605.959 seconds, over 52% faster than the baseline and a little over 7% faster than string manipulation alone. More importantly, it gave me the first run under 10 minutes - when I went from 21 to 11 minutes with just one change, I knew knocking off another minute was doable. Time spent in the function was still dropping, too: the fastest string-manipulation run spent a combined 53.3% of its runtime in _get_displaced_neighbors() and _get_lookup_index_info(), while the inlined _get_lookup_index_info() accounted for 49.5%.

Neighbor Toggling

At this point I was running kinda dry - there wasn't much left in the inlined _get_lookup_index_info(), but Scalene still said almost half my time was spent there. There wasn't much else to remove other than some argument checks, and meanwhile a quarter of my time was ostensibly spent searching for neighbors:

A subset of frequently-hit lines after inlining neighbor detection in _get_lookup_index_info

The worst part was that most calls to _get_lookup_index_info() didn't even use the neighbors; they were slowing the library down for essentially no benefit. I tried everything I could think of to compute them lazily, from generators to coroutines, and nothing worked - until I had a brain blast: why can't I just put them behind a flag? _get_lookup_index_info() is only called in about six places, so adding an argument for whether or not to run neighbor detection wouldn't be that hard to do - so I did it:

def _get_lookup_index_info(self, lookup_idx, get_neighbors=False):
    if lookup_idx < 0:
        raise ValueError("Indexes cannot be negative")
    elif lookup_idx >= self._size:
        raise ValueError(f"Index {lookup_idx} outside array")

    neighbors = []
    lookup_offset = calcsize(self._pack_fmt) * lookup_idx

    data_idx, nbhd = unpack_from(self._pack_fmt, self._lookup_table, lookup_offset)

    if get_neighbors:
        for nbhd_idx, neighbor in enumerate(bin(nbhd)[:1:-1]):
            if neighbor == "1":
                neighbors.append((lookup_idx + nbhd_idx) % self._size)

    return data_idx, neighbors
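For contrast, the lazy approach I'd been fighting with looks roughly like this (a hypothetical sketch, not the exact code I tried):

def _iter_displaced_neighbors(self, lookup_idx, nbhd):
    # Hypothetical lazy variant: yields neighbors on demand instead of building
    # a list, but callers that want them all still pay per-value overhead
    for nbhd_idx, neighbor in enumerate(bin(nbhd)[:1:-1]):
        if neighbor == "1":
            yield (lookup_idx + nbhd_idx) % self._size

Every neighbor a generator like this produces still pays for a resume on top of the same bit test; the flag, by contrast, skips the work entirely when nobody asks for neighbors.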

The results speak for themselves:

Primes Inserted Run 1 Run 2 Run 3 Run 4 Run 5
100,000 50.270 52.114 51.133 51.797 49.237
200,000 102.333 106.391 103.681 106.344 101.553
300,000 155.588 159.445 157.069 159.732 152.880
400,000 207.121 211.264 209.120 212.322 203.195
500,000 258.304 264.065 260.925 265.330 254.578
600,000 309.204 317.347 313.710 319.027 306.253
700,000 358.971 369.373 365.078 371.287 357.013
800,000 408.148 421.080 416.441 422.898 406.914
900,000 465.355 480.437 475.625 482.984 464.682
1,000,000 515.270 531.929 527.056 534.965 514.848

With an average runtime of 524.751 seconds, the updated library runs almost 59% faster than baseline and over 13% faster than inlining alone.

Lookup Table Entry Size Precomputation

At this point Scalene said my library was still spending over 40% of its time in _get_lookup_index_info(), and there weren't many other stones left for me to turn over. Given the benefit from inlining _get_displaced_neighbors(), I'm somewhat surprised I didn't expect doing the same for calcsize() to yield much of an improvement, but I knew you have to shed as much dead weight as possible in your hot loops, so I stored its value in an instance variable and referenced that in the function:

def _get_lookup_index_info(self, lookup_idx, get_neighbors=False):
    if lookup_idx < 0:
        raise ValueError("Indexes cannot be negative")
    elif lookup_idx >= self._size:
        raise ValueError(f"Index {lookup_idx} outside array")

    neighbors = []
    lookup_offset = self._lookup_size * lookup_idx

    data_idx, nbhd = unpack_from(self._pack_fmt, self._lookup_table, lookup_offset)

    if get_neighbors:
        for nbhd_idx, neighbor in enumerate(bin(nbhd)[:1:-1]):
            if neighbor == "1":
                neighbors.append((lookup_idx + nbhd_idx) % self._size)

    return data_idx, neighbors
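The other half of the change is trivial and lives outside the hot path - roughly, a one-liner wherever self._pack_fmt gets assigned:

# Cache the packed-entry size whenever the pack format is (re)assigned, so
# _get_lookup_index_info() never has to call calcsize() per lookup
self._lookup_size = calcsize(self._pack_fmt)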

Let me never doubt how much a "small" function call costs in a hot loop again:

Primes Inserted Run 1 Run 2 Run 3 Run 4 Run 5
100,000 48.860 49.114 48.528 48.684 50.097
200,000 99.665 99.705 99.455 98.915 100.216
300,000 148.832 150.283 150.723 150.419 150.644
400,000 196.822 199.569 200.046 199.414 199.440
500,000 246.204 249.753 251.155 249.415 249.322
600,000 296.087 300.836 303.075 299.705 299.609
700,000 344.516 350.515 353.604 349.321 348.919
800,000 392.337 398.952 403.745 398.809 398.923
900,000 447.934 455.667 461.380 456.021 456.071
1,000,000 496.665 504.922 512.022 504.274 506.819

With a final average of 505.338 seconds, the test script now runs just over 60% faster than baseline and 3.7% faster than toggling neighbor generation alone.

Splitting Index/Neighbor Data

At this point I shifted my focus to accessing the housekeeping data in self._lookup_table more quickly, and I couldn't stop thinking that whatever space savings I got from combining lookup index and neighborhood data into a single data structure were more than lost in the time spent calling unpack_from() every time I needed the data - so I split each out into its own array. When I began, the function creating the housekeeping data structure looked like this:

def _make_lookup_table(table_size: int) -> tuple[bytearray, str, int]:
    if table_size < 0:
        raise ValueError("Lookup table cannot have negative length")

    table_log_size = table_size.bit_length()

    if table_log_size < 8:
        struct_fmt = "b B"
    elif table_log_size < 16:
        struct_fmt = ">h H"
    elif table_log_size < 32:
        struct_fmt = ">i I"
    else:
        struct_fmt = ">l L"

    return (
        bytearray(pack(struct_fmt, HopscotchDict.FREE_ENTRY, 0) * table_size),
        struct_fmt,
        calcsize(struct_fmt),
    )

After splitting lookup index data and neighborhood data apart, the function looked like this:

def _make_lookup_tables(table_size: int) -> tuple[array[int], array[int]]:
    if table_size < 0:
        raise ValueError("Lookup table cannot have negative length")

    table_log_size = table_size.bit_length()

    if table_log_size < 8:
        index_fmt = "b"
        nbhd_fmt = "B"
    elif table_log_size < 16:
        index_fmt = "h"
        nbhd_fmt = "H"
    elif table_log_size < 32:
        index_fmt = "i"
        nbhd_fmt = "I"
    else:
        index_fmt = "l"
        nbhd_fmt = "L"

    lookup_table = array(index_fmt, (HopscotchDict.FREE_ENTRY,) * table_size)
    nbhd_table = array(nbhd_fmt, (0,) * table_size)

    return (lookup_table, nbhd_table)
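The payoff shows up at every read site: pulling an entry out of the housekeeping data becomes plain indexing instead of a struct unpack. Roughly (a sketch, assuming the two tables end up stored as self._indices and self._nbhds, the names the inlined __getitem__() later in this post uses):

# Before: decode one packed (index, neighborhood) entry out of a bytearray
data_idx, nbhd = unpack_from(self._pack_fmt, self._lookup_table, self._lookup_size * lookup_idx)

# After: index the two array() tables directly, no struct module involved
data_idx = self._indices[lookup_idx]
nbhd = self._nbhds[lookup_idx]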

After updating all the downstream functions, I received a nice reduction in runtime for my efforts:

Primes Inserted Run 1 Run 2 Run 3 Run 4 Run 5
100,000 42.586 43.356 42.258 42.285 43.635
200,000 87.960 89.509 86.394 87.398 90.212
300,000 133.104 136.025 132.977 132.543 137.193
400,000 175.373 179.548 175.902 175.404 181.664
500,000 219.386 224.529 219.753 219.754 227.008
600,000 264.242 270.176 264.836 264.754 273.625
700,000 306.358 313.558 308.133 307.340 318.601
800,000 349.192 356.049 352.180 348.939 361.678
900,000 401.421 410.147 404.848 401.295 413.659
1,000,000 444.160 455.015 447.733 443.461 457.737

My test script now finishes in an average of 448.969 seconds, nearly 64.9% faster than the baseline and 11.1% faster than using the combined data structure.

More Inlining

Once I broke up the housekeeping data structure, I realized there wasn't much use for _get_lookup_index_info() anymore - most callers really wanted either the lookup index or the neighbors for a bucket, and now they could get each easily - so I got rid of it, inlining its code as needed. Even though I had already managed to bring the average runtime of the test script from over 21 minutes to under 8 minutes, this change still astounded me:

Primes Inserted Run 1 Run 2 Run 3 Run 4 Run 5
100,000 33.634 32.534 33.731 33.640 33.592
200,000 69.395 66.432 69.372 67.865 67.757
300,000 104.827 99.754 104.390 101.887 102.279
400,000 138.866 132.095 137.511 134.830 135.320
500,000 173.498 165.230 171.808 168.828 169.377
600,000 208.386 198.927 206.518 202.880 203.921
700,000 242.714 231.813 240.597 235.885 238.040
800,000 275.938 263.833 273.817 267.645 271.588
900,000 314.713 300.933 312.255 304.742 310.409
1,000,000 348.945 333.755 345.821 337.084 344.395

Inlining _get_lookup_index_info() brought my average down to 342.433 seconds, 73.1% faster than the baseline and a whopping 23.7% reduction from accessing lookup indices and neighbor data separately. I couldn't believe there were still optimizations to be found that could knock over 100 seconds off my average, but I knew there were somehow still more out there.

Even More Inlining

At this point I had two goals: to see a run insert the first 100,000 primes in under 30 seconds and to see a run finish in under five minutes. I knew with a current average of around 33 seconds to the first 100,000 primes the first goal was doable, but unless I found another big piece of low-hanging fruit I might not hit the second. Checking the Scalene profiles from my last set of runs showed the test script was spending over 15% of its time in get() and __getitem__(), even though there isn't much to them:

def __getitem__(self, key: Hashable) -> Any:
    _, idx = self._lookup(key)
    if idx is not None:
        return self._values[idx]
    else:
        raise KeyError(key)

def get(self, key: Hashable, default: Any = None) -> Any:
    out = default
    try:
        out = self.__getitem__(key)
    except KeyError:
        pass
    return out

As a new convert to the church of "Function calls aren't free", I decided to inline these functions, but one at a time so I could see whether one made more of a difference than the other. I started with __getitem__() since it should speed up calls to get() as well:

def __getitem__(self, key: Hashable) -> Any:
    target_idx = abs(hash(key)) % self._size

    nbhd = self._nbhds[target_idx]
    for nbhd_idx, nbr in enumerate(bin(nbhd)[:1:-1]):
        if nbr == "1":
            neighbor = (target_idx + nbhd_idx) % self._size
            nbr_data_idx = self._indices[neighbor]

            if nbr_data_idx < 0:
                raise RuntimeError(
                    (
                        "Index {} has supposed displaced neighbor that points to "
                        "free index"
                    ).format(target_idx)
                )

            if self._keys[nbr_data_idx] == key:
                return self._values[nbr_data_idx]

    raise KeyError(key)

The results of this change were about what I expected:

Primes Inserted Run 1 Run 2 Run 3 Run 4 Run 5
100,000 31.293 30.706 31.081 30.277 30.649
200,000 64.821 61.902 63.436 62.404 62.803
300,000 98.271 95.039 95.210 95.691 94.354
400,000 129.184 125.397 125.962 126.203 124.701
500,000 160.912 157.095 157.568 157.721 156.081
600,000 193.148 189.096 189.567 189.612 188.144
700,000 224.824 220.124 220.731 221.112 219.256
800,000 255.335 250.220 250.941 251.277 249.239
900,000 291.044 285.749 286.396 286.922 284.597
1,000,000 322.325 316.716 317.424 317.797 315.226

This set of runs had an average of 317.312 seconds, a 75.1% reduction from baseline and a 7.3% reduction from inlining _get_lookup_index_info(). Unfortunately no run managed to insert the first 100,000 primes in under 30 seconds, but I was so close I could taste it. Inlining get() was just as straightforward as inlining __getitem__():

def get(self, key: Hashable, default: Any = None) -> Any:
    value = default
    target_idx = abs(hash(key)) % self._size

    nbhd = self._nbhds[target_idx]
    for nbhd_idx, nbr in enumerate(bin(nbhd)[:1:-1]):
        if nbr == "1":
            neighbor = (target_idx + nbhd_idx) % self._size
            nbr_data_idx = self._indices[neighbor]

            if nbr_data_idx < 0:
                raise RuntimeError(
                    (
                        "Index {} has supposed displaced neighbor that points to "
                        "free index"
                    ).format(target_idx)
                )

            if self._keys[nbr_data_idx] == key:
                value = self._values[nbr_data_idx]
                break
    return value

However, this codebase had one last surprise for me:

Primes Inserted Run 1 Run 2 Run 3 Run 4 Run 5
100,000 27.323 28.238 29.217 28.369 28.267
200,000 57.053 57.541 59.872 57.111 57.497
300,000 86.559 86.707 90.534 86.132 86.896
400,000 113.862 114.659 119.797 113.779 114.805
500,000 141.902 143.490 149.893 142.246 143.362
600,000 170.320 171.977 180.275 171.141 172.053
700,000 198.440 200.167 209.973 199.397 200.114
800,000 225.147 227.097 238.692 226.594 227.143
900,000 257.204 259.495 272.634 259.093 259.584
1,000,000 284.555 287.787 301.939 286.972 287.585

Not only did every run insert its first 100,000 primes in under 30 seconds, but almost every run also managed to complete in under five minutes; this additional inlining brought my average down to 287.448 seconds, a 77.4% runtime reduction from baseline and a 16% reduction from inlining _get_lookup_index_info(). I couldn't figure out why this was until I remembered x-fast tries call get() in their hot loops.

Combining the last two inlined sections together didn't do that much to improve runtimes, but did noticeably reduce variance:

Primes Inserted Run 1 Run 2 Run 3 Run 4 Run 5
100,000 27.469 27.623 26.969 27.326 27.369
200,000 57.077 56.781 55.529 55.596 55.659
300,000 86.460 85.585 84.221 85.187 84.415
400,000 113.868 114.721 111.533 113.197 111.709
500,000 141.722 142.840 139.822 141.662 139.905
600,000 169.838 171.516 168.496 170.604 168.483
700,000 197.348 199.727 196.434 198.824 196.432
800,000 224.039 226.883 223.438 225.998 223.248
900,000 255.873 259.314 255.681 258.438 255.577
1,000,000 283.175 287.230 283.378 286.249 283.121

Inlining both get() and __getitem__() resulted in an average test runtime of 284.267 seconds, a 77.74% reduction in runtime from baseline and a 16.98% reduction from inlining _get_lookup_index_info(). At this point I had no more stretch goals to hit, so I decided to stop here.

Wrapping Up

A line graph plotting runtime of the script at the top of this post after various optimization strategies; made using linegraphmaker.co

So what is there to say after beating my code with a stick for a couple of days? For starters, when I was recording 1,200-second baseline runs I never thought anywhere near 300-second runs were possible - I'd hoped I was skilled enough not to write such unperformant code, but here I am. Next, make sure the program used to generate data reflects real-world usage - I didn't do this, as my use case for this library is lookup-heavy, but I figured there was some fat to cut in my insertion logic and I wasn't going to save any RAM with lookups alone. Lastly, use real tools to find where your actual hot loops are, and accept that you'll be surprised when you find them - I never would've guessed how glacial bit-banging to find displaced neighbors would be.

As for tools - after using Scalene for a couple days I have some concerns about the accuracy of its memory tagging and I wish I had the option to see which functions call a particular function the most, but perhaps some of these gripes are signs I need more than just Scalene. I tried an additional profiler later in this process, but I think at the time it was just more information without more utility.

Unfortunately, despite all my gains in performance, I didn't seem to make a dent in memory usage. I suspect most of the memory growth comes from the X-fast trie itself, so I'll have to target it directly to see if I can bring memory usage down.