BlogWerx: Tiny MCU 3D Renderer Part 10: Benchmarking, profiling, and optimizations

The reason I have been so quiet recently is that I've been working on a very dry subject: testing. Unit tests are probably the most uninteresting thing I do on a daily basis as a software architect. Create some fixtures, write some assertions, run the test suite, fix any issues, repeat. It makes me cringe just thinking about it.

But here I am, blogging about how I've been writing tests in my fun personal project, too! I am taking this project very seriously, so tests are a necessary evil. I haven't checked coverage yet (because I hit this bug with tarpaulin vs serde_derive), but it's probably somewhere around 70~80%.

Update 2017-11-05: serde_derive is only used in the app. The lib doesn't have this dependency, so I can run coverage on the lib separately. The result is even better than I imagined: 85.14% coverage, with 573/673 lines covered.

Surprise! There are only 26 tests (ignoring the 4 benchmark tests), so how can coverage be so high? This is one of the most interesting things about 3D rasterization: it doesn't require a whole lot of code! To be honest, though, I'm only exposing the bare minimum set of APIs; for example, there's no support in the fixed function pipeline for stencil/scissor, alpha blending, etc.

There are two tests that are particularly interesting: renderer::tests::rasterize_triangle and renderer::tests::render_mesh. These tests are almost integration tests; they run the complete pipeline, including a simplified shader program. They respectively rasterize a single triangle, and an attribute buffer containing two triangles. The test works by comparing the resulting pixel buffer with a buffer captured from a previous run. That makes these more or less regression tests.
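For readers who haven't written this kind of test before, here is a minimal sketch of the compare-against-a-captured-buffer idea. The names `render_to_buffer` and `first_mismatch` are hypothetical stand-ins, not the project's actual API, and the "golden" buffer is regenerated in place rather than loaded from a file captured on a previous run.

```rust
/// Hypothetical stand-in for running the full pipeline into a pixel buffer.
/// It just fills a deterministic gradient so the example is self-contained.
fn render_to_buffer(width: usize, height: usize) -> Vec<u8> {
    (0..width * height).map(|i| (i % 256) as u8).collect()
}

/// Index of the first byte that differs between two buffers, if any.
/// A length mismatch is reported at the end of the shorter buffer.
fn first_mismatch(actual: &[u8], expected: &[u8]) -> Option<usize> {
    if actual.len() != expected.len() {
        return Some(actual.len().min(expected.len()));
    }
    actual.iter().zip(expected).position(|(a, b)| a != b)
}

fn main() {
    // In a real regression test the golden buffer would come from a file
    // captured on a previous, known-good run.
    let golden = render_to_buffer(8, 8);
    let actual = render_to_buffer(8, 8);
    assert_eq!(first_mismatch(&actual, &golden), None);
}
```

Reporting the first mismatching index, rather than only pass/fail, makes it easier to tell which part of the image regressed.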

These same tests are also included in a small set of benchmark tests:

And wow, they are pretty damn fast! Under 200 microseconds in both cases, on my laptop. In terms of fill-rate, these tests both rasterize between 7,000 and 8,000 pixels. In terms of percentage of the frame buffer, it's about 13%. So we can extrapolate this simple example to about 1.3ms to fill the entire frame buffer. Which is perfect. It falls within the 16.667ms budget for rendering at 60 frames per second. However, this also assumes ideal behavior where every pixel is touched once and only once, and that it uses the simplest shader program possible with almost no logic.
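The extrapolation is plain proportional scaling, and it can be sketched as below. The function name is hypothetical, and the inputs are the rounded figures from the paragraph above. Plugging in the 200 microsecond upper bound gives roughly 1.5 ms; the 1.3 ms figure corresponds to a measured time nearer 170 microseconds.

```rust
/// Time to fill the whole frame buffer, assuming fill time scales linearly
/// with the fraction of the buffer that was covered in the benchmark.
fn full_frame_micros(measured_micros: f64, coverage: f64) -> f64 {
    measured_micros / coverage
}

fn main() {
    let frame_budget_micros = 1_000_000.0 / 60.0; // 60 fps budget
    // Upper bound from the text: under 200 us for about 13% of the buffer.
    let full = full_frame_micros(200.0, 0.13);
    println!("full frame: {:.0} us", full);
    println!("budget:     {:.0} us", frame_budget_micros);
    println!("fits at 60 fps: {}", full < frame_budget_micros);
}
```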

So this establishes a good baseline. It is certainly possible to do 3D rendering at this small resolution on a single laptop-class CPU core. How well it works on a low-power ARM MCU is still a mystery. Even if it's 10x slower, it's still within the 60fps timing budget, but again that's under ideal conditions.

One of the tricks I've kept up my sleeve is SIMD; there are a lot of potential optimizations to do with vector instructions (like SSE or AVX on Intel CPUs, and NEON on ARM). There's just one problem with that idea: NEON isn't available on the low-power Cortex-M architecture. If fill rate remains a bottleneck, then I could always switch over to a high-power Cortex-A, where NEON vector instructions could rasterize up to 4 pixels per iteration!
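To make the "4 pixels per iteration" idea concrete, here is a rough sketch that tests four horizontally adjacent pixels against a triangle in lockstep. It assumes an edge-function style rasterizer, which may not match how this renderer actually works, and it uses plain 4-element arrays instead of real NEON or SSE intrinsics; `edge4` and `coverage4` are hypothetical names.

```rust
type Point = (f32, f32);

/// Edge function of the edge a -> b, evaluated at four adjacent pixels
/// starting at (x0, y). Each array slot stands in for one SIMD lane.
fn edge4(a: Point, b: Point, x0: f32, y: f32) -> [f32; 4] {
    let mut out = [0.0; 4];
    for (i, e) in out.iter_mut().enumerate() {
        let x = x0 + i as f32;
        *e = (x - a.0) * (b.1 - a.1) - (y - a.1) * (b.0 - a.0);
    }
    out
}

/// Which of the four pixels lie inside the triangle (edges included).
/// The winding is assumed to make all three edge functions non-negative inside.
fn coverage4(tri: [Point; 3], x0: f32, y: f32) -> [bool; 4] {
    let e0 = edge4(tri[0], tri[1], x0, y);
    let e1 = edge4(tri[1], tri[2], x0, y);
    let e2 = edge4(tri[2], tri[0], x0, y);
    let mut inside = [false; 4];
    for i in 0..4 {
        inside[i] = e0[i] >= 0.0 && e1[i] >= 0.0 && e2[i] >= 0.0;
    }
    inside
}

fn main() {
    let tri = [(0.0, 0.0), (0.0, 4.0), (4.0, 0.0)];
    println!("{:?}", coverage4(tri, 0.0, 2.0));
}
```

With real vector instructions, each loop over the four slots would collapse into a single operation on a 128-bit register.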

Profiling

There aren't any good profiling tools in the Rust ecosystem available for macOS. I had to do all of my profiling in Docker with cargo-profiler. It's not a bad setup, but I do wish I could profile natively. This crate runs the app (i.e. not the tests) under valgrind. The app depends on SDL2, which in turn needs X11. To get around this, my Dockerfile uses xvfb like this:

xvfb-run -s "-screen 0 1024x768x24" cargo profiler callgrind -n 25 --release

The initial profiling pointed me to the f32::powf() function as one of the hottest code paths with my crazy sunbeam and Gouraud shaders. This function is used exclusively in the gamma correction code, where it raises a normalized floating point number to the power of γ (2.2) as well as its inverse (1.0 / 2.2). I was able to optimize the powf() function with some IEEE magic from Jim Blinn. Here's the implementation I ended up with, including benchmarks:

#![feature(test)]

extern crate test;

#[inline]
fn std_powf(x: f32, y: f32) -> f32 {
    x.powf(y)
}

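#[inline]
fn fast_powf(x: f32, y: f32) -> f32 {
    // Bit pattern of 1.0f32 (zero mantissa, biased exponent 127).
    let one = 0x3F800000;
    // Reinterpret the bits of `x` as an integer. For positive floats this
    // integer is roughly a scaled, offset log2(x).
    let px: *const f32 = &x;
    let ix = unsafe { *(px as *const i32) };

    // Scale the approximate logarithm by `y`, then restore the offset.
    let t = (y * (ix - one) as f32) as i32 + one;
    // Reinterpret the integer result as a float again.
    let pt: *const i32 = &t;
    unsafe { *(pt as *const f32) }
}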

#[cfg(test)]
mod tests {
    use super::*;
    use test::Bencher;

    #[bench]
    fn bench_std_powf(b: &mut Bencher) {
        b.iter(|| std_powf(0.334, test::black_box(2.2)));
    }

    #[bench]
    fn bench_fast_powf(b: &mut Bencher) {
        b.iter(|| fast_powf(0.334, test::black_box(2.2)));
    }
}

And the corresponding benchmark results:

No joke. It's so fast, it can't even be measured with a granularity of nanoseconds. That's because it compiles down into seven instructions:

c7 85 70 fc ff ff cd cc 0c 40   movl        $1074580685, -912(%rbp)
48 8d 85 70 fc ff ff            leaq        -912(%rbp), %rax
f3 0f 10 85 70 fc ff ff         movss       -912(%rbp), %xmm0
f3 0f 59 05 9a 36 08 00         mulss       538266(%rip), %xmm0
f3 0f 2c c8                     cvttss2si   %xmm0, %ecx
81 c1 00 00 80 3f               addl        $1065353216, %ecx
89 8d 70 fc ff ff               movl        %ecx, -912(%rbp)

The only question is what kind of accuracy this optimization has. Well, I've only tested a very small sample size so far, but it seems to have a margin of error around 0.6%. That's completely acceptable for my use case! And I have seen a decent perf improvement with this optimization: roughly 3~4% lower CPU utilization with the little sunbeam shader scene shown in the last video. The powf() function has also disappeared from the profiler view, as you would expect.
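A small sample can be widened into a sweep fairly cheaply. The sketch below is one way to do that, not what I actually ran: it compares the approximation against f32::powf() for every non-zero value of a normalized 8-bit channel, and the worst case it reports may well differ from a spot check. `max_relative_error` is a hypothetical helper, and `fast_powf` is rewritten here with `f32::to_bits` and `f32::from_bits`, which assumes a Rust version where those are stable.

```rust
/// Safe equivalent of the pointer-cast version above, using `f32::to_bits`
/// and `f32::from_bits` instead of raw pointer reinterpretation.
fn fast_powf(x: f32, y: f32) -> f32 {
    let one: i32 = 0x3F80_0000; // bit pattern of 1.0f32
    let ix = x.to_bits() as i32;
    let t = (y * (ix - one) as f32) as i32 + one;
    f32::from_bits(t as u32)
}

/// Largest relative error of `fast_powf` against `f32::powf` over the 255
/// non-zero values of a normalized 8-bit channel (1/255 ..= 255/255).
/// Zero is skipped because the exact result is 0 there, which makes the
/// relative error undefined.
fn max_relative_error(y: f32) -> f32 {
    (1..=255u32)
        .map(|i| {
            let x = i as f32 / 255.0;
            let exact = x.powf(y);
            ((fast_powf(x, y) - exact) / exact).abs()
        })
        .fold(0.0, f32::max)
}

fn main() {
    let gamma = 2.2_f32;
    println!("x^(1/2.2) max relative error: {:.4}", max_relative_error(1.0 / gamma));
    println!("x^2.2     max relative error: {:.4}", max_relative_error(gamma));
}
```

Measuring both exponents matters because the gamma code uses both γ and its inverse, and the error of the bit trick depends on the exponent.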

But it doesn't end there! The profiler now gives a really unhelpful result with 33.3% of the time spent in an unknown function, and another 21.1% spent in a second unknown function. Missing debug symbols, probably! I'm not sure where to go from here, as far as profiling and testing are concerned. So I guess that wraps up the MCU 3D Renderer blog series for now! Maybe I'll revisit the topic when it comes time to vectorize the rasterizer with NEON. In the meantime, I'll start to focus more on the development environment, which means finally implementing imgui!

Overall, I'm very happy with this stopping point, and with the progress on this project over the last three months. The renderer is just one small part of a much bigger vision. And I will continue writing about it as frequently as I always have.

Onward to LoFi 3D gaming!

Friday, November 3, 2017
