Four Cores Are Better Than One

Wed 29 March 2023

Occasionally my mind regains the ability to recall something from more than two weeks in the past; the latest such item is a blog post I read on the relative power of shell parallelization. Recently I found myself dreading an enormously parallelizable task and decided that, in lieu of spending 45 seconds performing it by hand, I would spend 15 minutes figuring out how to make computers do it for me.

I like to read comics on my tablet, but I'm lazy about transferring them over, so when I do I typically have a bunch of them to move at once. That wouldn't be a problem if I didn't also reoptimize them to save some extra disk space. Running the optimization program on each comic folder and repackaging them into CBZs isn't hard, but it is the type of tedious task that's perfect for fobbing off on a computer, and with some additional knowledge from the parallelization blog post above it's also the type of task a computer can complete very quickly.

After letting my comic backlog build up for a while, the test data I had to work with was 13 comics comprising 357 pages with a total size of 483.3 MB. I was going to compare the serial and parallel workflow on two tasks: reoptimizing all the images and repackaging each comic into its own CBZ.

This preamble is getting too long, so lemme move to the results:

Image Optimization

The serial workflow was mostly straightforward; the one trick I used to hopefully speed things up was to pass all the folders to sh at once and handle the looping myself instead of spawning a new sh for each folder:

find . -type d ! -path . -exec sh -c 'for d; do cd "$d"; pwd; jpegoptim --strip-all --all-progressive -m80 *.jpg; cd ..; done' - {} +
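One subtlety with sh -c worth calling out: the first argument after the quoted script becomes $0, not $1, so a bare "for d" loop (which iterates over $1 onward) never sees it. A quick demo:

```shell
# sh -c assigns the first argument after the script to $0, not $1;
# a plain "for d" loop iterates over $1..$n, so $0 is effectively skipped
sh -c 'for d; do echo "$d"; done' - a b c   # the - soaks up $0; prints a, b, c
sh -c 'for d; do echo "$d"; done' a b c     # a is lost to $0; prints only b, c
```

This is the job of the lone - you'll see before the folder list in these invocations: it soaks up $0 so that every folder actually makes it into the loop.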

Whereas the parallel workflow was much more involved:

find . -type d ! -path . -print -execdir sh -c 'for d; do cd "$d"; ls -1 | xargs -P 4 -n 1 jpegoptim --strip-all --all-progressive -m80; done' - {} +
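One caveat with this version: piping ls into xargs falls over if any page names contain spaces. A NUL-delimited sketch of the same idea, with echo standing in for the jpegoptim call since this is only about the plumbing, and a throwaway sandbox so it's self-contained:

```shell
# throwaway sandbox with a space in both the folder and the page name
tmp=$(mktemp -d)
mkdir "$tmp/comic one"
touch "$tmp/comic one/page 01.jpg"
cd "$tmp"

# NUL-delimit the file list so spaces survive the trip through xargs;
# echo stands in for jpegoptim --strip-all --all-progressive -m80
find . -mindepth 1 -type d -exec sh -c '
  for d; do
    find "$d" -maxdepth 1 -name "*.jpg" -print0 |
      xargs -0 -P 4 -n 1 echo
  done' - {} +
```

This prints ./comic one/page 01.jpg intact, where the ls-based pipeline would have split it into two bogus arguments.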

I only did one run of each so the times aren't necessarily statistically valid, but they're still pretty indicative of the difference between the two approaches:

Time (s)    Serial    Parallel
Real        75.922      38.921
User        73.526     108.964
System       1.586      15.616

Despite the fact that the parallelized workflow used four times the processing power, the real effect was only to cut the runtime roughly in half [1]. The most surprising thing about these numbers to me is definitely the time spent in syscalls for the parallel workflow; I'm assuming most of it comes from optimizing each image individually instead of a folder at a time as in the serial workflow, but I can't be sure. The user time makes sense-ish if you quarter it across the CPUs, but unless the kernel only works in a single-threaded fashion [2] the sums of user and system time don't add up.

Compressing/Repackaging

Again, apart from my too-clever-by-half trick to hopefully only shell out once, the serial workflow was straightforward:

find . -type d ! -path . -exec sh -c 'for d; do zip -r "$d.cbz" "$d"; done' - {} +

The parallelized workflow, however, is far less janky:

find . -type d ! -path . -print0 | xargs -P 4 -n 1 -0 sh -c 'zip -r "$1.cbz" "$1"' -
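The division of labor here is worth spelling out: -n 1 hands each sh invocation exactly one folder as $1, and -P 4 lets up to four of those zips run at once. The same plumbing with echo in place of zip:

```shell
# each -n 1 invocation receives exactly one argument as $1 ($0 is the -),
# and -P 4 allows up to four invocations to run concurrently,
# so the output order isn't guaranteed
printf '%s\0' "comic one" "comic two" "comic three" |
  xargs -0 -P 4 -n 1 sh -c 'echo "would create: $1.cbz"' -
```

Each folder gets its own zip process, which is what lets the four cores actually overlap the compression work.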

I think the lack of jankiness really shows when comparing the timing differences:

Time (s)    Serial    Parallel
Real        11.749       6.782
User        11.340      17.976
System       0.316       0.561

Again the runtime is only halved despite using 4 CPUs, but the increase in system time from the serial to the parallel workflow is only ~75%, as opposed to almost 900% during optimization.

Conclusion

I don't really know how to wrap this up: using more than one core makes a trivially parallelizable task much faster, surprise? I don't even know if the general increase in system time between the serial and parallel workflows is that surprising, but in all of this blog post I failed to mention that the parallel runs turned the terminal output into all kinds of garbage. I guess it doesn't really matter; for my purposes the only numbers that mattered were the wall times, and I proved that loading up all my cores does make things faster. Now I'm glad I have two commands I can run every month or so for 30 seconds instead of two other commands I can run every month or so for 50 seconds.

[1] I think this is due to the fact that there are only two physical cores in my system, but I'm not entirely sure.
[2] Which would make sense, now that I think about it.