Function Calls Considered Expensive

Thu 21 December 2023

When I wrote both of my blog posts on increasing performance through code profiling, I linked to a blog post that measured function call overhead from Python 3.5 to 3.8. That post was enlightening, but it was written in 2020, and significant effort has gone into improving CPython since then. Since the phrase "function calls are expensive" now lives rent-free in my head, I began to wonder how much of that post still holds up.

Testing Setup

I used a Raspberry Pi as the computer for my experiments: a 3B (non-plus) model with a quad-core 1.2GHz processor, booted from a USB3-to-SATA adapter that gives me constant undervolting issues. I chose 64-bit Alpine 3.19.0 for the OS because it runs few extraneous processes by default; I installed the OS fresh, updated all the base system packages before testing, and made no tweaks to the boot config for additional performance. For the Python interpreter, I used the most recent version of each of the three most recent minor releases available in Alpine's packages: 3.9.18, 3.10.13 and 3.11.6.

I chose to run this test on old, slow hardware because I have a bug up my butt about modern developers relying on super-fast modern hardware to conceal their inability to write performant code, and I wanted to minimize the effect of hardware assistance; I'll save my rant about how nobody writing JavaScript has heard of a profiler (and it shows) for another post.

Testing Methodology

The original post used empty functions to capture the overhead of a function call directly, but I believe a function that performs an actual task yields a more useful real-world result, even if it means the comparisons must be relative instead of absolute.

I settled on converting integers to Roman numerals as my test function, writing the first version recursively to generate function calls:

def rn_recursive(num):
    # peel the largest Roman symbol off the front, then recurse on the remainder
    if num >= 1000:
        return "M" + rn_recursive(num - 1000)
    elif num >= 900:
        return "CM" + rn_recursive(num - 900)
    elif num >= 500:
        return "D" + rn_recursive(num - 500)
    elif num >= 400:
        return "CD" + rn_recursive(num - 400)
    elif num >= 100:
        return "C" + rn_recursive(num - 100)
    elif num >= 90:
        return "XC" + rn_recursive(num - 90)
    elif num >= 50:
        return "L" + rn_recursive(num - 50)
    elif num >= 40:
        return "XL" + rn_recursive(num - 40)
    elif num >= 10:
        return "X" + rn_recursive(num - 10)
    elif num >= 9:
        return "IX" + rn_recursive(num - 9)
    elif num >= 5:
        return "V" + rn_recursive(num - 5)
    elif num >= 4:
        return "IV" + rn_recursive(num - 4)
    elif num >= 1:
        return "I" + rn_recursive(num - 1)
    else:
        return ""

I then rewrote the function iteratively as a control, taking care to replicate the recursive code flow as much as possible:

def rn_loop(num):
    # append the largest Roman symbol, subtract its value, and go around again
    result = ""
    while num > 0:
        if num >= 1000:
            result += "M"
            num -= 1000
            continue
        if num >= 900:
            result += "CM"
            num -= 900
            continue
        if num >= 500:
            result += "D"
            num -= 500
            continue
        if num >= 400:
            result += "CD"
            num -= 400
            continue
        if num >= 100:
            result += "C"
            num -= 100
            continue
        if num >= 90:
            result += "XC"
            num -= 90
            continue
        if num >= 50:
            result += "L"
            num -= 50
            continue
        if num >= 40:
            result += "XL"
            num -= 40
            continue
        if num >= 10:
            result += "X"
            num -= 10
            continue
        if num >= 9:
            result += "IX"
            num -= 9
            continue
        if num >= 5:
            result += "V"
            num -= 5
            continue
        if num >= 4:
            result += "IV"
            num -= 4
            continue
        if num >= 1:
            result += "I"
            num -= 1
            continue
    return result
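
As a quick sanity check that the two implementations agree, something along these lines covers the whole tested range:

# both strategies should produce identical numerals for every tested value
assert all(rn_recursive(n) == rn_loop(n) for n in range(1, 4000))
assert rn_loop(2023) == "MMXXIII"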

Once both functions were ready I began the experiment: for every number from 1 to 3999, 100 timed trials of each function, with each trial converting the number 100 times: [1]

from json import dumps
from pathlib import Path
from time import process_time_ns
from timeit import Timer

recursive_times = {}
loop_times = {}
recursive_log = Path.home().joinpath("recursive.times")
loop_log = Path.home().joinpath("loop.times")

for i in range(1, 4000):
    loop_timer = Timer(
        stmt=f"rn_loop({i})",
        setup="from __main__ import rn_loop",
        timer=process_time_ns,
    )

    recursive_timer = Timer(
        stmt=f"rn_recursive({i})",
        setup="from __main__ import rn_recursive",
        timer=process_time_ns,
    )

    loop_times[i] = loop_timer.repeat(number=100, repeat=100)
    recursive_times[i] = recursive_timer.repeat(number=100, repeat=100)

recursive_log.write_text(dumps(recursive_times))
loop_log.write_text(dumps(loop_times))

Analysis

I was going to go full data science with standard deviations and stuff but instead followed the suggestion in the Python docs to focus on the minimum value from each run. I'll begin with the overall comparisons:

Average Minimum Run Times

A bar chart comparing average minimum run times across all strategies and interpreter versions

Python Version    Iterative (μs)    Recursive (μs)
3.9.18                    36.237            48.444
3.10.13                   28.234            38.967
3.11.6                    15.452            19.497

The average minimum run time is the mean of the fastest 100-run trial for each number converted - the fastest trial for the number 1, averaged with the fastest trial for the number 2, and so on - with the mean then normalized to a single run and converted to microseconds.
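
In code, the metric looks something like this - a minimal sketch assuming the JSON logs written earlier are loaded back in (the variable names are mine). Recall that each list in the logs holds 100 trials, each the total nanoseconds for 100 conversions:

from json import loads
from pathlib import Path
from statistics import mean

# JSON keys come back as strings, so convert them to ints on load
loop_times = {
    int(num): trials
    for num, trials in loads(Path.home().joinpath("loop.times").read_text()).items()
}

# fastest trial per number, averaged across numbers, then normalized to a
# single run (÷100) and converted from nanoseconds to microseconds (÷1000)
avg_min_us = mean(min(trials) for trials in loop_times.values()) / 100 / 1000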

The preferable strategy is apparent from this comparison: iterative runs take 74% of the time of recursive runs on Python 3.9, 72% on Python 3.10 and 79% on Python 3.11.

A welcome surprise is the benefit of a newer interpreter: iterative runs take 77% of the time and recursive runs 80% of the time after upgrading from 3.9 to 3.10; after upgrading from 3.10 to 3.11, iterative and recursive runs take 54% and 50% of the time respectively.

Worst-Case Average Run Times

A bar chart comparing worst-case average run times across all strategies and interpreter versions

Python Version    Iterative (μs)    Recursive (μs)
3.9.18                    78.751           100.638
3.10.13                   61.675            82.254
3.11.6                    33.938            41.434

The worst-case average run time is the mean of all runs of the number with the largest minimum run time; the mean is normalized to a single run and converted to microseconds. [2]
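
Sticking with the sketch above, the worst case picks the number whose fastest trial is largest, then averages all 100 of its trials:

# reusing loop_times from the earlier sketch
worst = max(loop_times, key=lambda num: min(loop_times[num]))
worst_avg_us = mean(loop_times[worst]) / 100 / 1000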

I would have expected the iterative strategy to increase its lead over the recursive strategy in the worst case, but the gap tightened slightly: iterative runs took 78%, 74% and 81% of the time of recursive runs on Python 3.9, 3.10 and 3.11 respectively.

The interpreter-upgrade benefit was also reduced in the worst case, though slightly less so than the inter-strategy benefit: iterative runs take 78% of the time and recursive runs 81% after upgrading from 3.9 to 3.10; they take 55% and 50% respectively after upgrading from 3.10 to 3.11.

Average Bucketed Minimum Run Times

A graph comparing average bucketed minimum run times across all strategies and interpreter versions

Loops Performed    3.9.18 (ns)               3.10.13 (ns)              3.11.6 (ns)
                   Iterative    Recursive    Iterative    Recursive    Iterative    Recursive
1                       6873        12603         5381         9823         2779         4566
2                      12032        18764         9352        14671         5145         7128
3                      16955        24669        13261        19547         7274         9643
4                      21715        30525        16936        24336         9312        12121
5                      26508        36530        20658        29162        11387        14648
6                      31436        42610        24489        34096        13464        17159
7                      36563        48872        28433        39251        15610        19700
8                      41699        55111        32418        44346        17738        22271
9                      46901        61469        36530        49720        19917        24826
10                     52109        67918        40703        55057        22132        27458
11                     57292        74360        44797        60475        24306        30013
12                     62598        80961        48968        65845        26572        32665
13                     67881        87567        53053        71215        28746        35312
14                     72968        93800        57008        76532        30940        37822
15                     77972       100202        61043        81861        33215        40619

The average bucketed minimum run time is the mean of the smallest run time for each number requiring a given number of loops for conversion - the minimum run times for the numbers 1, 4, 5, 9, 10, 40, 50, 90, 100, 400, 500, 900 and 1000 are averaged together, as each can be converted in one loop, and so on. All run times are averaged before the mean is normalized to a single run.
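
Bucketing the numbers is mechanical; a hypothetical helper like the one below (not part of the test script) counts how many symbols either strategy appends for a given number:

def loops_needed(num):
    # mirrors the branch order of rn_loop / rn_recursive
    count = 0
    for value in (1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1):
        while num >= value:
            num -= value
            count += 1
    return count

# the thirteen one-loop numbers from the paragraph above
assert sorted(n for n in range(1, 4000) if loops_needed(n) == 1) == [
    1, 4, 5, 9, 10, 40, 50, 90, 100, 400, 500, 900, 1000]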

The graph shows the time needed to complete a single run appears to increase linearly with the number of loops for each strategy and interpreter, with the data points hewing very close to the trend line; I believe this suggests there was little outside interference during testing, although the somewhat large error bars do raise an eyebrow.

While it's no surprise the iterative strategies are faster than the recursive strategies everywhere, one new piece of information is that the gap between the strategies widens as the number of loops increases - this surprised me, so I dug into it more:

A graph comparing average bucketed minimum run time differences across all strategies and interpreter versions

Loops Performed    3.9.18 (ns)    3.10.13 (ns)    3.11.6 (ns)
1                         5730            4442           1787
2                         6732            5319           1983
3                         7714            6286           2369
4                         8810            7400           2809
5                        10022            8504           3261
6                        11174            9607           3695
7                        12309           10818           4090
8                        13412           11928           4533
9                        14568           13190           4909
10                       15809           14354           5326
11                       17068           15678           5707
12                       18363           16877           6093
13                       19686           18162           6566
14                       20832           19524           6882
15                       22230           20818           7404

The average bucketed minimum run time difference is the difference between the iterative and recursive means for each interpreter - the first graph was a bit cluttered, so I made a second to highlight it. While the average bucketed minimum run time graph shows how much faster Python 3.11 is than the older versions, this graph shows how much tighter Python 3.11's execution is: each additional loop adds roughly a third of the extra runtime the other two versions need.

I believe this graph is the most important to my hypothesis: the code paths of the iterative and recursive strategies are so similar that taking their difference should leave only the implicit differences, which should be nothing but the function calls from recursion.

One additional piece of information this graph provides is that the run time differences on Python 3.9 and 3.10 appear to follow a shallow parabola rather than a line, while Python 3.11 appears to remain linear. If that's the case, it suggests a recursive strategy may be even worse on those versions than initially expected.
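
A crude way to test that impression without curve fitting is to look at the consecutive gaps in each difference column: a linear trend keeps the gaps flat, while a parabola makes them creep upward. Using the 3.9.18 and 3.11.6 columns from the table above:

# difference columns copied from the table above
diff_39 = [5730, 6732, 7714, 8810, 10022, 11174, 12309, 13412,
           14568, 15809, 17068, 18363, 19686, 20832, 22230]
diff_311 = [1787, 1983, 2369, 2809, 3261, 3695, 4090, 4533,
            4909, 5326, 5707, 6093, 6566, 6882, 7404]

def gaps(series):
    # consecutive differences; a flat result suggests a linear trend
    return [b - a for a, b in zip(series, series[1:])]

print(gaps(diff_39))   # creeps from roughly 1000 up toward 1400
print(gaps(diff_311))  # mostly hovers around 400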

Conclusion

So after all this effort, did I manage to prove function calls were the primary factor behind the decreased performance of the recursive strategy? I believe I did: the test functions were similar enough to make differing code path issues unlikely; the test environment was isolated and built specifically for this experiment to minimize noise; and finally the test data trends replicated across interpreter versions. [3]

Barring the possibility of me completely missing something incredibly obvious, I believe the average bucketed minimum run time difference shows the effect of additional function calls and I can now say it's faster to write code in an iterative style and inline small functions where possible. [4]

Appendix

While writing my conclusion, I realized the recursive function creates new integer objects to pass around, and that overhead will show up in my data even though it isn't strictly part of a function call. Then I remembered two things: the substantial bucketed run time differences at a loop value of 1 suggest the difference can't come from integer creation alone, and the trend still held for small integers even though CPython caches small integers (-5 through 256), so no new integer objects are created for them.
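
The cache is easy to demonstrate, though the example below leans on a CPython implementation detail, so don't rely on it in real code:

# int() at runtime sidesteps compile-time constant folding, which would
# otherwise make both names point at the same constant object
a, b = int("238"), int("238")
print(a is b)  # True on CPython: 238 sits inside the small-integer cache

c, d = int("3707"), int("3707")
print(c is d)  # False on CPython: each call creates a fresh int object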

As a sanity check, I compared the minimum run times for two groups of integers that required the same number of loops to convert: ones inside the small-integer cache (188, 238) and ones outside it (3707, 3998); somehow the integers outside the cache had significantly lower minimum run times. I don't know why that would be the case, but I've spent too much time on this post already and want to do something else - so in lieu of making this post even longer, have a collection of graphs I couldn't find a place for.

A graph comparing minimum run times for the iterative strategy across all interpreter versions

A graph comparing minimum run times for the recursive strategy across all interpreter versions

A graph comparing minimum run times across strategies for Python 3.9.18

A graph comparing minimum run times across strategies for Python 3.10.13

A graph comparing minimum run times across strategies for Python 3.11.6

[1] I thought the natural smoothing of aggregating 100 runs would be beneficial; during analysis I wished I'd made 10000 individual runs instead
[2] The most time-intensive number to convert is 3888 (MMMDCCCLXXXVIII - fifteen loops), which is a bit strange until you think about it
[3] I suppose I could throw the functions into a disassembler and read the Python bytecode to confirm the code paths are similar, but I've spent too much time on this post already
[4] Will the code maintain its readability and maintainability after doing these things? Who can say