Wednesday, April 12, 2023

Improving build times for derive macros by 3x or more

I was dissatisfied with argument parsing crates. There didn't seem to be one that checked all the boxes I cared about:

  • Simple. Don't overcomplicate it! All I want to do is provide a rudimentary human interface to my app. If I wanted interface complexity, I'd put a whole TUI library in it.
  • Don't make me repeat myself, and don't make me write unnecessary boilerplate. The only thing a parser needs to do is compare given args to a known list and populate some state based on it. It shouldn't require adding a bunch of attributes to define how that state is populated, and it shouldn't require writing the application "help text" that describes the arguments separately from the struct declaration.
  • Correctness. If I give my app invalid UTF-8 and it panics or returns an error inappropriately, I'm going to be furious.
  • Small! I once used clap as a young Rustacean and a third of my binary size was for parsing arguments! The app was fairly sophisticated, doing things like parsing ELF files. A whole third of the size of the binary was only used in the first few milliseconds of the app's entire runtime. (I may be exaggerating a bit, and I can't find any evidence to back the claim. The very first commit in the project in question doesn't contain any argument parsing library. No doubt due to the experience described.)
  • Fast! Build times are terrible for all arg parsing #[derive] macros that I've seen. Even on my relatively modern/high-end desktop (with a Ryzen 9 5900X running Windows 11 and Ubuntu 22.04 in WSL2), the fastest derive crate (argh) takes 2.15 seconds for the initial build. For comparison, the fastest all-around crate is pico-args at 370 ms.
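To make the "compare given args to a known list and populate some state" idea concrete, here is a minimal hand-written sketch using only the standard library. The `Args` struct, its fields, and the flag names are hypothetical and chosen purely for illustration; this is not the API of any crate discussed in this post. Values are kept as `OsString`, so arguments that are not valid UTF-8 are passed through instead of causing a panic.

```rust
use std::ffi::OsString;
use std::path::PathBuf;

/// Hypothetical application state; the field names are illustrative only.
#[derive(Debug, Default, PartialEq)]
struct Args {
    verbose: bool,
    output: Option<PathBuf>,
    inputs: Vec<OsString>,
}

/// Compare each argument to a known list and populate `Args`.
/// Values stay as `OsString`, so invalid UTF-8 is preserved rather than rejected.
fn parse<I: IntoIterator<Item = OsString>>(args: I) -> Result<Args, String> {
    let mut parsed = Args::default();
    let mut iter = args.into_iter();

    while let Some(arg) = iter.next() {
        if arg == "-v" || arg == "--verbose" {
            parsed.verbose = true;
        } else if arg == "-o" || arg == "--output" {
            let value = iter.next().ok_or("missing value for --output")?;
            parsed.output = Some(PathBuf::from(value));
        } else if arg.to_str().map_or(false, |s| s.starts_with('-')) {
            // Anything else that looks like a flag is an error.
            return Err(format!("unknown argument: {}", arg.to_string_lossy()));
        } else {
            // Everything else is a positional argument, valid UTF-8 or not.
            parsed.inputs.push(arg);
        }
    }
    Ok(parsed)
}

fn main() {
    // Skip the program name, then parse the rest.
    match parse(std::env::args_os().skip(1)) {
        Ok(args) => println!("{args:?}"),
        Err(err) => eprintln!("error: {err}"),
    }
}
```

Writing this by hand for every app is exactly the boilerplate a derive macro is supposed to remove, which is why the derive requirement matters.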

There's a list of popular argument parsing crates along with various benchmarks at https://github.com/rosetta-rs/argparse-rosetta-rs. These results sparked my curiosity. If we eliminate all crates that don't support derive and invalid UTF-8 from this list, we're left with just two options: bpaf_derive and clap_derive. Both of these fail the "small" and "fast" requirements, though bpaf_derive is the better of the two.

For comparison's sake, the bpaf_derive build time on my high-end machine is 3.18 seconds. This machine is clearly faster than the one used for the results reported in the benchmarks repo (currently standing at 5.91 seconds with rustc 1.68.1 (8460ca823 2023-03-20)).

In fairness, I will also use my 2017 MacBook Pro, which is slower than either of the other machines. It was high-end 6 years ago, and today I consider it comparable to a low-to-mid-range system. It should provide a decent contrast and help establish an expected range for build times. The laptop builds the bpaf_derive example in 8.57 seconds and the pico-args example in 1.22 seconds.

A note on benchmarking methodology

The build times that I will focus on here are "full" build times such as those experienced after running cargo clean. Notably this does not include the time taken to download dependencies. I will also be building with the default debug profile. Finally, I will not entirely ignore incremental builds, but I must comment that I'm personally less interested in incremental build times, for reasons discussed below.

System quiescence is controlled by ensuring that things like web browsers (including Discord and Spotify) and automated backups are not running in the background while performing the benchmarks.

The benchmarking tool used was hyperfine, which reports the arithmetic mean of all runs with standard deviation following a single warmup run. Some of this info is absent from the argparse-rosetta-rs summary tables, but it can all be found in the raw JSON data in the runs/ directory.
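For readers who want to reproduce the "mean ± standard deviation" figures from the raw run times, the calculation can be sketched as follows. This assumes the sample standard deviation (dividing by n - 1); I have not verified which form hyperfine uses internally, and the run times in `main` are made up for the example.

```rust
/// Arithmetic mean and sample standard deviation (n - 1 denominator).
/// Returns `None` when there are fewer than two samples.
fn mean_and_stddev(samples: &[f64]) -> Option<(f64, f64)> {
    if samples.len() < 2 {
        return None;
    }
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let variance = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    Some((mean, variance.sqrt()))
}

fn main() {
    // Made-up wall-clock times in seconds, not measurements from this post.
    let runs = [0.590, 0.592, 0.591, 0.593, 0.589];
    let (mean, stddev) = mean_and_stddev(&runs).expect("need at least two runs");
    println!("{mean:.3} s ± {stddev:.3} s");
}
```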

The specific compiler version used was rustc 1.70.0-nightly (9df3a39fb 2023-04-11) on all platforms, and the linker was rust-lld on Ubuntu/Windows, and zld 1.3.9 on macOS.

Why "full" build times matter

There are at least three significant scenarios I can think of where the "full" build times are important. The first is cargo install. When you install an application with cargo install (as I did with hyperfine, for instance) it has the same effect as cloning the repo and building it fresh. Cargo caches the crates it downloads, but it does not reuse compiled object files between projects. (I do not know the technical feasibility of doing so. It just doesn't, and that has consequences. Apparently sccache can do this?) Every dependency needs to be built from scratch in these cases.

The second scenario is when using Rust as a "system scripting language" as in Bash, Python, or NodeJS. In other words, making Rust source code directly "executable". https://internals.rust-lang.org/t/pre-rfc-cargo-script-for-everyone/18639 is a recent discussion in this space.

And third would be CI. As with the others, caching can help, but only goes so far and still requires populating the cache in the first place.

One can argue that the initial build time for installing an application is just a one-time cost. This would be true if software never updates (which I have written about at length in the past). I only update my cargo-managed applications infrequently, but it's always a coffee break while updating half a dozen of them.

But the "cargo script" scenario should really bring the issue of build times to the forefront. Even for a one-off script that just needs the most basic argument parsing, depending on bpaf_derive means that the first run on my laptop will take about 9 seconds. This is unacceptable! I may as well just stick with Python for scripting. (Inevitably, using Python will consume more time in the long run debugging runtime errors, but first impressions cannot be discounted.)

What can be done?

In short, I want my cake and I want to eat it too! I want the convenience of bpaf_derive with the low overhead and build times of pico-args. So, I did what anyone in my situation would do.

"Fine, I'll do it myself." ~ Thanos from the end credits in Avengers: Age of Ultron.

My first attempt was a derive macro with build time 2.079 s ± 0.006 s on the high-end machine (roughly 2.7 seconds on Windows on the same hardware) and 5.955 s ± 0.068 s on the laptop. This is about 1.53x faster than bpaf_derive on the high-end machine and about 1.44x faster on the laptop.

But it's still 5.6x slower than pico-args on the high-end machine (Ubuntu) and 4.9x slower on the laptop (0.370 s ± 0.002 s high-end Ubuntu / 1.216 s ± 0.046 s laptop). Obviously, there was more work to do. To start with, profiling with cargo build --timings pointed squarely at syn as the dependency taking the most time:

`cargo build --timings` on the high-end machine with Ubuntu
`cargo build --timings` on the high-end machine with Windows
`cargo build --timings` on the laptop with macOS

The macro crate in question, onlyargs_derive, took about 300 ms itself, which is already pushing against the total build time for the target goal (370 ms). But let's focus first on syn and its dependency tree. The fastest code is the code that doesn't run, and honestly my little derive macro does not use much of syn. I can get rid of it entirely as a dependency by replacing it with a hand-written syntax parser.

Without spoiling too much, here are the updated graphs for the rewrite:

`cargo build --timings` on the high-end machine with Ubuntu after optimizations
`cargo build --timings` on the high-end machine with Windows after optimizations
`cargo build --timings` on the laptop with macOS after optimizations

The graphs are getting kind of absurd now, but they show build times of 0.591 s ± 0.001 s on Ubuntu, 0.779 s ± 0.006 s on Windows, and 1.889 s ± 0.075 s on macOS. That represents a 3.52x improvement on Ubuntu over the syn-based version, a 3.46x improvement on Windows, and a 3.15x improvement on macOS. Not too shabby!

Compared to bpaf_derive (which if you'll remember was the "best bet" crate listed in the argparse-rosetta-rs repo, given our constraints) we get the following speedups:

  • 5.39x faster on Ubuntu
  • 5.46x faster on Windows
  • 4.54x faster on macOS

And compared to pico-args, which is our target for build times:

  • 1.59x slower on Ubuntu
  • 1.53x slower on Windows
  • 1.55x slower on macOS

The difference is almost entirely from the additional onlyargs_derive crate, which adds about 220 ms to the build time on Ubuntu. (0.591 - 0.22 = 0.371, exactly on target!) It is most certainly worth optimizing this crate further. But I am thrilled to get even this far in such a short time!
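The ratios quoted above are plain divisions of the mean build times. The following sketch recomputes them from the numbers given in the text. Windows is left out because the bpaf_derive and pico-args times on Windows are not listed here, and the last digit may differ slightly from the figures above because the inputs are already rounded.

```rust
/// Ratio of two build times: how many times longer `slow` takes than `fast`.
fn ratio(slow: f64, fast: f64) -> f64 {
    slow / fast
}

fn main() {
    // (platform, bpaf_derive, syn-based derive, myn-based derive, pico-args), in seconds.
    let rows = [
        ("Ubuntu", 3.18, 2.079, 0.591, 0.370),
        ("macOS", 8.57, 5.955, 1.889, 1.216),
    ];

    for (os, bpaf, syn_based, myn_based, pico) in rows {
        println!("{os}:");
        println!("  vs syn-based version: {:.2}x faster", ratio(syn_based, myn_based));
        println!("  vs bpaf_derive:       {:.2}x faster", ratio(bpaf, myn_based));
        println!("  vs pico-args:         {:.2}x slower", ratio(myn_based, pico));
    }
}
```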

That's not all! The binary overhead beats even pico-args. (At least on Ubuntu and macOS. I didn't check Windows because I kind of gave up trying to document it after doing all of the above.) Here is the comparison table against the top "competitors":

Crate             Ubuntu    macOS
pico-args         28 KiB    23 KiB
bpaf_derive       242 KiB   191 KiB
onlyargs          20 KiB    18 KiB
onlyargs_derive   24 KiB    19 KiB

What about incremental builds? Here's another table:

Crate             Ubuntu                macOS
pico-args         0.152 s ± 0.002 s     0.438 s ± 0.006 s
bpaf_derive       0.174 s ± 0.003 s     0.434 s ± 0.012 s
onlyargs          0.153 s ± 0.003 s     0.433 s ± 0.005 s
onlyargs_derive   0.155 s ± 0.003 s     0.445 s ± 0.024 s

Not terribly interesting since this only measures build time for the example apps and link time. But somehow onlyargs still wins compared to bpaf_derive, at least on Ubuntu.

The results

You've seen the benchmarks, so now it's time for the big announcement! Introducing four new crates for build-time-and-overhead-minded Rustaceans:

  1. myn: Minimalist Rust syntax parsing for procedural macros.
  2. onlyerror: Obsessively tiny error derive macro.
  3. onlyargs: Only argument parsing! Nothing more.
  4. onlyargs_derive: Derive macro for onlyargs.

You saw some mention of myn, onlyargs, and onlyargs_derive before. New to the scene is onlyerror, which wants to live in the same realm as the (absolutely incredible) thiserror crate. It remains to be seen if it can live up to this lofty goal (and associated expectations). But it satisfies the same fast build times and low overhead checkboxes that motivated onlyargs.

In addition, onlyerror supports no_std out of the box with nightly compilers! I think there is still some work required for thiserror to provide similar functionality; see "add #![no_std] support", issue #196 in dtolnay/thiserror on GitHub.
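For context, this is roughly the kind of boilerplate that an error derive macro saves you from writing. The sketch below uses only the standard library and a hypothetical `ConfigError` enum. It is not the actual expansion produced by onlyerror or thiserror, just the hand-written equivalent of what such macros aim to generate.

```rust
use std::fmt;

/// Hypothetical error type, for illustration only.
#[derive(Debug)]
enum ConfigError {
    Missing(String),
    Invalid { line: usize },
}

// The `Display` impl is the part an error derive typically generates
// from per-variant messages.
impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::Missing(key) => write!(f, "missing key `{key}`"),
            Self::Invalid { line } => write!(f, "invalid syntax on line {line}"),
        }
    }
}

// `Error` has no required methods, but it still has to be implemented.
impl std::error::Error for ConfigError {}

fn main() {
    let err: Box<dyn std::error::Error> = Box::new(ConfigError::Invalid { line: 3 });
    println!("{err}");
}
```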

And finally here's my raw data for the benchmark comparisons, and a summary included in the myn repo:

How it works

There's really no magic or silver bullet. The only thing I did was make a calculated tradeoff. Since I do not have a need for the full expressive power of syn and its related crates, I was able to remove them from my dependency tree, netting more than a 3x speedup in build times across the board.

The primary tradeoff is that my alternative, myn, is very thin. It's a no-frills parser which offers no opinion on the AST types that it will parse into (the user, e.g. onlyargs_derive, does that on their own). It also does not support the entire Rust language syntax by design. It only supports the subset that I have (so far) found useful for derive macros. This lets us get away with the bare minimum needed to write a procedural macro while still benefitting from some code reuse.

This is more or less the same approach taken by the nanoserde crate, which I discovered after publishing all of the resulting crates shown above. They could hypothetically make use of myn in nanoserde but I'm not sure if it's something they would want to depend on.

Right about now is a good time to mention that myn (and similar roll-your-own approaches) is not only intentionally limited in scope, but also takes the "worse is better" mantra to the extreme. This is the opposite design strategy to syn, which has a very rigid and thorough engineering philosophy. In other words, they just have different goals.
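To give a feel for what "only the subset needed for derive macros" can mean, here is a deliberately tiny sketch of a hand-written parser that extracts a struct's name and its fields. It is not myn's API, and all names in it are hypothetical. A real procedural macro would walk a `proc_macro::TokenStream` rather than a string, but a string keeps the example runnable on its own. It ignores attributes, lifetimes, `pub(crate)`, and everything else a thorough parser would handle.

```rust
/// Just enough information about a struct for a simple derive macro.
#[derive(Debug, PartialEq)]
struct StructInfo {
    name: String,
    fields: Vec<(String, String)>, // (field name, type as written)
}

/// Split source text into identifiers and single-character punctuation.
fn tokenize(src: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut ident = String::new();
    for ch in src.chars() {
        if ch.is_alphanumeric() || ch == '_' {
            ident.push(ch);
            continue;
        }
        if !ident.is_empty() {
            tokens.push(std::mem::take(&mut ident));
        }
        if !ch.is_whitespace() {
            tokens.push(ch.to_string());
        }
    }
    if !ident.is_empty() {
        tokens.push(ident);
    }
    tokens
}

/// Parse `struct Name { field: Type, ... }` and nothing else.
fn parse_struct(src: &str) -> Result<StructInfo, String> {
    let tokens = tokenize(src);
    let mut iter = tokens.iter().map(String::as_str);

    // Skip everything (e.g. `pub`) up to and including the `struct` keyword.
    if !iter.any(|tok| tok == "struct") {
        return Err("expected `struct`".into());
    }
    let name = iter.next().ok_or("expected struct name")?.to_string();
    if iter.next() != Some("{") {
        return Err("expected `{`".into());
    }

    let mut fields = Vec::new();
    loop {
        let field = match iter.next() {
            Some("}") => break,
            Some("pub") => continue, // ignore simple field visibility
            Some(tok) => tok.to_string(),
            None => return Err("unexpected end of input".into()),
        };
        if iter.next() != Some(":") {
            return Err(format!("expected `:` after `{field}`"));
        }
        // Collect the type up to the next `,` or `}` outside of `<...>`.
        let mut ty = String::new();
        let mut depth = 0usize;
        let mut closed = false;
        for tok in iter.by_ref() {
            match tok {
                "<" => depth += 1,
                ">" => depth = depth.saturating_sub(1),
                "," if depth == 0 => break,
                "}" if depth == 0 => {
                    closed = true;
                    break;
                }
                _ => {}
            }
            ty.push_str(tok);
        }
        fields.push((field, ty));
        if closed {
            break;
        }
    }
    Ok(StructInfo { name, fields })
}

fn main() {
    let src = "pub struct Args { verbose: bool, output: Option<PathBuf>, }";
    println!("{:?}", parse_struct(src));
}
```

The point is not that this parser is good (it is not), but that a derive macro which only needs names and types can get by with very little code, and very little code compiles quickly.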

When to choose these over established alternatives

I suppose this decision will come down to whether you give more importance to fast build times and low overhead than features. These crates are not going to be applicable to everyone! And as mentioned in the argparse-rosetta-rs README, sharing dependencies will bring down both build times and overhead.

In other words, if you already have syn in your dependency tree (which is quite likely), you are not going to gain much by switching to any of these crates. Likewise, if you can't live without all the stuff that clap does, don't bother trying to replace it with onlyargs! You will only experience suffering.

That said, do I believe there will suddenly be a mass syn exodus with most procedural macros migrating to lightweight alternatives like myn? Nope! I would really love that, but I'm not going to insist. I might be willing to continue NIHing alternative crates in specific cases, though. Because in 2023, I still care more about fast build times than batteries-included libraries, and I'm willing to make certain tradeoffs for it.

Conclusion

I made my best effort to be fair and unbiased with the comparisons, but of course some of my personality will have leaked into the prose. None of this is meant to be disrespectful to other Rustaceans or the hard work they put into their crates. In fact, I have only praise for the authors that came before me! I can only hope that whatever small contributions I make will have inspired others to a similar degree.

With some non-trivial effort and more than a little tuning of constraints (if you squint and hold it at the right angle), it is possible to get more than 3x speedup on initial build times for procedural macros and similar crates. Whether it's worth the trouble is left as a decision to the reader.

Follow up (2023-04-13)

I was informed by matklad on the Reddit thread of a crate called venial that is also a smaller alternative to syn. I created an experimental branch that rewrites onlyargs_derive with this crate to compare the build times. The benchmark result with this branch is 1.539 s ± 0.006 s on the high-end machine with Ubuntu. The --timings graph is shown below.

`cargo build --timings` on the high-end machine with Ubuntu using `venial`

Build times are improved over syn (about 1.35x faster) but worse than myn (about 2.6x slower).
