Sunday, February 15, 2015

melonJS should be *All About Speed* Part 6

Let's continue talking WebGL.

The original WebGL article was split into two parts, as the material covered too much ground. In the first half, WebGL was described in agonizing detail. In this second half, the WebGL theory takes a backseat to melonJS. This time around, there are a lot of high-level details revolving around rendering a scene in melonJS using all those crazy triangles.

As always, you can revisit previous articles in the series. Make sure you don't miss Part 5, as this article assumes you've read it and have an intimate understanding of WebGL.

Part 1 : http://blog.kodewerx.org/2014/03/melonjs-should-be-all-about-speed.html
Part 2 : http://blog.kodewerx.org/2014/04/melonjs-should-be-all-about-speed-part-2.html
Part 3 : http://blog.kodewerx.org/2014/10/melonjs-should-be-all-about-speed-part-3.html
Part 4 : http://blog.kodewerx.org/2014/12/melonjs-should-be-all-about-speed-part-4.html
Part 5 : http://blog.kodewerx.org/2015/02/melonjs-should-be-all-about-speed-part-5.html

Here comes the wall of text.

melonJS Gets Fast

The upcoming melonJS v2.1.0 release contains significant improvements in its WebGL support. The most obvious improvement is in rendering speed, which was achieved using the research that led to the WebGL primer writeup in Part 5 of this series. After gaining an understanding of how WebGL works, it was pretty clear that we had to do some things quite a bit differently from the typical 3D scene rendering paradigms used today.

melonJS only renders 2D scenes, and it is historically tied to the HTML5 Canvas API. So what we've developed is a compromise on both sides: an API that is mostly compatible with Canvas, and mostly compatible with WebGL. "Mostly compatible" refers to the fact that there are differences from the pure Canvas API, and some features are missing entirely (gradients, for example). The team believes this "best of both worlds" approach is the right thing to do for v2.1.0. In future releases, we will be focusing on WebGL as the primary renderer, and slowly deprecating Canvas.

What's New

In melonJS, we now have WebGL rendering frames using only two native calls: bufferData (send elements to the GPU) and drawElements (draw the data that was just sent). All of the draw operations are batched up into a series of primitives (quads or lines), and melonJS flushes the batch when it needs to switch the primitive type (or when everything has already been computed for the entire scene).
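To make the batching idea concrete, here is a rough sketch of my own; it is not the melonJS Compositor. The names `QuadBatcher`, `addQuad`, `flush`, and `createCountingGL` are hypothetical, and only vertex positions are streamed to keep it short. The `gl` object only needs `bufferData` and `drawElements` with their usual WebGL signatures, so a counting stand-in is used here to let the sketch run outside a browser.

```javascript
// Hypothetical sketch of a batching compositor; not the melonJS implementation.
// Assumes the vertex and index buffers are already bound on `gl`.
const FLOATS_PER_QUAD = 4 * 2; // position only (x, y per corner)

class QuadBatcher {
    constructor(gl, maxQuads) {
        this.gl = gl;
        this.stream = new Float32Array(maxQuads * FLOATS_PER_QUAD);
        this.quads = 0;
    }

    // Queue one axis-aligned quad; nothing is sent to the GPU yet.
    addQuad(x, y, w, h) {
        if ((this.quads + 1) * FLOATS_PER_QUAD > this.stream.length) {
            this.flush(); // stream buffer is full: draw what we have, then reuse it
        }
        this.stream.set(
            [x, y, x + w, y, x, y + h, x + w, y + h],
            this.quads * FLOATS_PER_QUAD
        );
        this.quads++;
    }

    // Two native calls, no matter how many quads were queued.
    flush() {
        if (this.quads === 0) return;
        const gl = this.gl;
        const used = this.stream.subarray(0, this.quads * FLOATS_PER_QUAD);
        gl.bufferData(gl.ARRAY_BUFFER, used, gl.STREAM_DRAW);
        gl.drawElements(gl.TRIANGLES, this.quads * 6, gl.UNSIGNED_SHORT, 0);
        this.quads = 0;
    }
}

// Stand-in for a WebGLRenderingContext that only counts calls.
function createCountingGL() {
    return {
        ARRAY_BUFFER: 0x8892,
        STREAM_DRAW: 0x88E0,
        TRIANGLES: 0x0004,
        UNSIGNED_SHORT: 0x1403,
        calls: 0,
        bufferData() { this.calls++; },
        drawElements() { this.calls++; },
    };
}

const countingGL = createCountingGL();
const batcher = new QuadBatcher(countingGL, 16000);
for (let i = 0; i < 140; i++) {
    batcher.addQuad(i * 8, 0, 8, 8);
}
batcher.flush();
console.log(countingGL.calls); // 2
```

In a browser, the same class would take a real context from `canvas.getContext("webgl")`. A real compositor would also flush when the primitive type changes, which this sketch leaves out.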

This new "Compositor" replaces the alpha WebGLRenderer that we shipped in melonJS v2.0.0. I guess you could call the WebGLRenderer in v2.1.0 a beta, now. Building the Compositor required the addition of two new classes, and the simplification of two others:

  • me.WebGLRenderer.Compositor is the new Compositor class that stores WebGL operations into a single streaming buffer, manages all of the texture units and shaders, etc.
  • me.Renderer.TextureCache is a new class that caches Texture objects, providing texture lookups by image reference, texture unit (index) lookups, and bounds the total number of texture units to hardware limitations.
  • me.video.renderer.Texture is the new name for an old class (me.TextureAtlas) that handles texture mapping; an image and its associated regions for e.g. animation sheets and tile sets.
  • me.Matrix2d was overhauled to make it safe for use with WebGL, and me.Matrix3d has been removed (it was not a 3D matrix implementation).

The Compositor is responsible for sending all commands to the GPU. This was implemented as its own class (outside of the WebGLRenderer class) to satisfy a need for custom shaders. Because the Compositor is tied directly to the shaders, game developers who wish to use custom shaders will need a custom Compositor. The WebGLRenderer simply forwards all draw operations to the Compositor, and the Compositor contains the logic to batch these operations into the smallest possible number of calls to the GPU.

Why WebGL is Fast

That last statement is part of what makes WebGL fast. Having the hardware do the rasterization (putting pixels into a frame buffer) is where most of the speed comes from, but you can't take full advantage of that hardware unless you can quickly send it information about "what to draw". Compositing an entire scene on the CPU and sending the information in one large batch is the best way to reduce overhead in vertex memory bandwidth and the User Agent/GPU driver.

But reducing that overhead is only part of the solution. Higher performance can also be achieved by reducing the total size of payload data sent to the GPU. It follows that sending less data allows the GPU to begin rendering sooner. In fact, a lot of the "WebGL benchmarks" I have seen were written specifically to benchmark the GPU by only sending the minimal amount of information necessary to render a scene.

It's pretty straightforward to design a complex vertex shader which accepts only a single variable (the time delta) to render an animated scene entirely on the GPU without any vertex bandwidth concerns. But that doesn't qualify as a game by any stretch. Sure, throw in some more variables like gamepad inputs and such to make it interactive. Now you have a custom shader that performs incredibly well, but will only work for one game (or in the best case, one style of game). Some notable examples are a Flappy Bird Clone and a Legend of Zelda Clone.

How melonJS Benefits From WebGL

Unfortunately, the non-interactive and inflexible shader approaches are not at all compatible with the melonJS vision. We need shaders that work with any kind of generic renderable (a 2D image) that can be translated, scaled, and rotated arbitrarily and independently of all other renderables. This requirement necessitates a different kind of design, which I will get to later in the article. Suffice to say we've put our WebGL support through several revisions and experiments to get the most out of WebGL in a way that matches our framework.

That's something that can't be stressed enough: what we have built significantly lowers the entry barrier for any melonJS developer to start using WebGL right away. Just flip the switch, and it works! Not only does it work, but it really is faster than the 2D canvas renderer. And getting to that point was a lot of hard work.

How We Got Here

It didn't happen overnight! In November, we launched v2.0.0 with an alpha quality WebGLRenderer. It's "alpha quality" because it's actually slower than the CanvasRenderer! That may seem shocking since the term "WebGL" is almost a buzzword for "fast 3D graphics on the web". I opened the Part 5 article with a bit of a disclaimer that flipping on the WebGL switch won't magically make your game run faster. And the sad truth is that this is exactly what I meant. Using WebGL doesn't just grant 60fps high resolution, multi-textured, antialiased, dynamically lit, billion-polygon-count rendered 3D scenes for free; you have to get there through sheer will and determination.

The alpha WebGLRenderer is naïve. For everything that needs to be drawn:

  • vertices are computed and uploaded to an attribute buffer (!)
  • texture coordinates are uploaded to an attribute buffer (!)
  • the index buffer is bound to the element buffer (!)
  • a texture is bound to texture unit 0 (!)
  • the transformation matrix is uploaded as a uniform variable
  • the color is uploaded as a uniform variable (!)
  • and 6 vertices (two triangles) are drawn immediately (!)

Everything I marked with (!) is an operation that equates to unnecessary overhead. Of all the steps in that list, only the transformation matrix changes often (representing the result of translate, rotate, scale). This ends up being something on the order of 10-15 calls per renderable. Even the simple platformer example routinely draws about 140 renderables at a time, making roughly 2,000 WebGL calls per frame, or 120,000 calls per second. That's a lot of overhead for a very simple game.

Internally, the Canvas API in the browser is doing its own compositing, which results in better performance than the naïve GPU-poking approach. Clearly, we had to do better. With the new Compositor, melonJS makes 2 WebGL calls per frame (about 120 per second) in the best case scenario. This is a significant reduction in the number of calls (and driver overhead) but we now have new bandwidth requirements; the payload size for each call has increased. Reducing the payload size is the way forward to get even more performance out of the hardware.
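For anyone who wants to check the call-count arithmetic, here it is spelled out. The 14 calls per renderable is my assumption (a value inside the 10-15 range quoted above), as is a steady 60 frames per second.

```javascript
// Back-of-the-envelope call counts; 14 calls per renderable is an assumed
// value inside the 10-15 range, and 60 fps is assumed to be constant.
const renderables = 140;
const callsPerRenderable = 14;
const framesPerSecond = 60;

const naivePerFrame = renderables * callsPerRenderable;
const naivePerSecond = naivePerFrame * framesPerSecond;
const batchedPerFrame = 2; // bufferData + drawElements, best case
const batchedPerSecond = batchedPerFrame * framesPerSecond;

console.log(naivePerFrame);    // 1960 (roughly 2,000)
console.log(naivePerSecond);   // 117600 (roughly 120,000)
console.log(batchedPerSecond); // 120
```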

The Architecture of the melonJS Compositor

And now the moment you've all been waiting for! (amirite?) Now that you know how WebGL works, you can probably imagine the difficulty of getting good results out of it while hanging on to a legacy API. We're sticking with the same renderer API that we introduced in melonJS v2.0.0 to ease the transition for game developers familiar with the canvas API. This was an important decision we made so that WebGL can be used and taken advantage of with the least amount of resistance. This is how we did it.

Recall that earlier in this article, I mentioned that the shaders used in melonJS need to be interactive and flexible; they need to support hundreds of individually moving images, each with its own rotation angle, scaling factor, and positioning information. This is all state that needs to be provided to the GPU, and for that number of images, it can only be provided as part of the vertex attributes. Yes, every vertex sent to the GPU contains information about rotation, scaling, etc.

Let's start with one of the Compositor's primitive rendering components, the quad. Quad is short for quadrilateral, a four-sided shape: an image has four corners, a quad has four vertices, an image is a quad. Since the GPU works with triangles (not quads), we have to describe a quad as a set of two triangles. We use the ELEMENT_ARRAY_BUFFER to describe the triangles in our quad; every six elements in the ELEMENT_ARRAY_BUFFER point to the four vertices in the quad, in the following order:
[ 0, 1, 2, 2, 1, 3 ]
The first three elements are triangle 1, and the second three are triangle 2. You can see that both triangles share two vertices. That lets us send just the four vertices in our quad using the ARRAY_BUFFER. During initialization, the Compositor creates a large "index buffer" containing indices for 32,000 triangles (16,000 quads) like above. The index array above describes the first two triangles, the second two triangles are described by [ 4, 5, 6, 6, 5, 7 ], and so on, with 16,000 of these blocks in total. (That's 96,000 total indices...) This large index buffer is created once and uploaded to the GPU as the ELEMENT_ARRAY_BUFFER. It is never touched by JavaScript again. (Though other index buffers may be bound in its place! The line shader's index buffer, for example.)
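Generating that index list takes only a few lines. This is a sketch rather than the actual melonJS code: `createQuadIndices` is a hypothetical name, and I'm assuming 16-bit unsigned indices, which works because 16,000 quads only need vertex numbers up to 63,999.

```javascript
// Build the shared quad index list: 6 indices and 4 vertices per quad.
// Assumes 16-bit indices, so the highest vertex number must stay below 65536.
function createQuadIndices(maxQuads) {
    const indices = new Uint16Array(maxQuads * 6);
    for (let q = 0, v = 0; q < maxQuads; q++, v += 4) {
        // Two triangles sharing the edge between vertices v+1 and v+2.
        indices.set([v, v + 1, v + 2, v + 2, v + 1, v + 3], q * 6);
    }
    return indices;
}

const indices = createQuadIndices(16000);
console.log(indices.length);                   // 96000
console.log(Array.from(indices.slice(0, 6)));  // 0, 1, 2, 2, 1, 3
console.log(Array.from(indices.slice(6, 12))); // 4, 5, 6, 6, 5, 7

// In a browser this would be uploaded once, for example:
// gl.bufferData(gl.ELEMENT_ARRAY_BUFFER, indices, gl.STATIC_DRAW);
```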

Each quad is made of four pieces of information, currently:

  • Vertex (vec2) : A point in pixel coordinate space.
  • Color (vec4) : A color sent to the fragment shader for blending.
  • Texture (float) : The texture unit index used by the fragment shader as the Sampler2D selector.
  • Region (vec2) : Texture coordinates for the Sampler2D.

You can count 9 floats (per vertex) that need to be streamed to the GPU, or 36 floats per quad. Unfortunately, the last three of those per-vertex bits are static for all four vertices in the quad! So we end up sending a lot of duplicated information to the GPU. The good news is that there's a lot of vertex memory bandwidth available, so it's a good tradeoff. (See the "Experiments" section below for our plans to reduce the number of floats per quad.)
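The float counting above can be turned into the stride and byte offsets that `gl.vertexAttribPointer` expects. The attribute names (`aVertex`, `aColor`, `aTexture`, `aRegion`) and the `describeLayout` helper are made up for this sketch, and it assumes all four attributes are interleaved in one buffer in the order listed. The 2,500 quad figure is borrowed from the particle benchmark later in this article.

```javascript
// Hypothetical interleaved layout for one vertex, in floats.
const BYTES_PER_FLOAT = Float32Array.BYTES_PER_ELEMENT; // 4

const attributes = [
    { name: "aVertex", size: 2 },  // vec2 position in pixel space
    { name: "aColor", size: 4 },   // vec4 blend color
    { name: "aTexture", size: 1 }, // float texture unit index
    { name: "aRegion", size: 2 },  // vec2 texture coordinates
];

// Compute the byte offset of each attribute and the total stride.
function describeLayout(attrs) {
    let offset = 0;
    const entries = attrs.map((attr) => {
        const entry = { name: attr.name, size: attr.size, offset: offset };
        offset += attr.size * BYTES_PER_FLOAT;
        return entry;
    });
    return { stride: offset, entries: entries };
}

const layout = describeLayout(attributes);
const floatsPerQuad = (layout.stride / BYTES_PER_FLOAT) * 4;

console.log(layout.stride);                          // 36 bytes per vertex (9 floats)
console.log(layout.entries.map((e) => e.offset));    // offsets 0, 8, 24, 28
console.log(floatsPerQuad);                          // 36
console.log(2500 * floatsPerQuad * BYTES_PER_FLOAT); // 360000 bytes for 2,500 quads

// Each entry maps onto one call of the form:
// gl.vertexAttribPointer(location, entry.size, gl.FLOAT, false, layout.stride, entry.offset);
```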

Architecture Rationale

This vertex streaming approach is in comparison to using uniform variables for the blend color, texture index, and texture coordinates. As you know, uniforms are constant across the draw call, and we don't necessarily have the opportunity to share these values across every quad in the scene. Hypothetically, we could have used uniforms and individual draw calls, but it would have certainly degenerated to the worst-case scenario nearly every time, with one draw call per quad (due to lack of sharing, and draw order requirements). The cost would have been too great.

If you're familiar with WebGL, you might be wondering what happened to the matrices? Well, there's only one matrix used in our vertex shader today, and that's the projection matrix! (The projection matrix transforms our pixel coordinate space into WebGL clip space.) There is no concept of a view matrix or model matrix in the current iteration of the melonJS WebGL Compositor. Instead, those matrices are premultiplied before they ever reach the Compositor (this was done historically as we only supported the canvas renderer when the code was written). Once inside the compositor, the global "ModelView" matrix is multiplied with every vertex, and that's what gets sent to the vertex shader!

The vertex shader is very simple; it just multiplies the projection matrix (a uniform variable) with the vertex, and sends the remaining attributes to the fragment shader through varying variables.
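A sketch of what such a projection and vertex shader could look like follows. None of this is lifted from melonJS: the 3x3 matrix form, the helper names, and the attribute, uniform, and varying names are all assumptions made for illustration.

```javascript
// Hypothetical helper: a column-major 3x3 matrix that maps pixel coordinates
// (origin top-left, y pointing down) to WebGL clip space (-1..1, y pointing up).
function createProjection(width, height) {
    return new Float32Array([
        2 / width, 0, 0,
        0, -2 / height, 0,
        -1, 1, 1,
    ]);
}

// CPU-side equivalent of `uMatrix * vec3(x, y, 1.0)` in the shader below.
function project(m, x, y) {
    return [m[0] * x + m[3] * y + m[6], m[1] * x + m[4] * y + m[7]];
}

// Assumed GLSL ES 1.00 vertex shader: project the vertex, pass the rest along.
const VERTEX_SHADER = `
attribute vec2 aVertex;
attribute vec4 aColor;
attribute float aTexture;
attribute vec2 aRegion;

uniform mat3 uMatrix;

varying vec4 vColor;
varying float vTexture;
varying vec2 vRegion;

void main(void) {
    gl_Position = vec4((uMatrix * vec3(aVertex, 1.0)).xy, 0.0, 1.0);
    vColor = aColor;
    vTexture = aTexture;
    vRegion = aRegion;
}
`;

const projection = createProjection(640, 480);
console.log(project(projection, 0, 0));     // top-left lands at roughly (-1, 1)
console.log(project(projection, 320, 240)); // center lands at roughly (0, 0)
console.log(project(projection, 640, 480)); // bottom-right lands at roughly (1, -1)

// In a browser, the matrix would be uploaded with:
// gl.uniformMatrix3fv(location, false, projection);
```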

The fragment shader selects the correct Sampler2D based on the texture unit index, samples it with the texture coordinates, and finally combines the color. This shader is more interesting because of how difficult it is to do Sampler2D selection. It is not possible to dynamically index arrays within the fragment shader (see the WebGL Spec). So instead we use a method pioneered by Kenneth Russell and Nat Duca from the Chromium team (see: http://webglsamples.org/sprites/readme.html), which uses a series of if-then-else statements to select the correct Sampler2D.

But wait! Their example only supports four textures. Surely that's not going to be enough for melonJS?! As it turns out, WebGL requires a minimum of 8 texture units. But what about hardware that supports more than 8? We don't want anything bad to happen like crashing the GPU process when attempting to index too many Sampler2Ds, assuming the shader GLSL will even compile! And in the case that it does compile and doesn't crash, we would just be wasting GPU memory with a bunch of useless if-statements that never run!

The solution is compiling your GLSL before compiling your GLSL. ;) The GLSL is just a string, after all; we can manipulate it in any way we want at runtime before it is compiled into a working shader program. The best thing I came up with for doing the GLSL preprocessing is running it through a template engine. I've used doT before with great success, so it seemed like the obvious choice; it's tiny, it's fast, and it's extremely expressive.

We now have all of our GLSL sources written as doT templates. The fragment shader template in particular uses JavaScript evaluation to create a series of if-then-else statements in a loop (according to compiler theory, it creates an unrolled loop!). The templates are compiled to functions by doT at build-time. At runtime, melonJS passes template variables to the template functions, which produces the final GLSL source, and finally the UA compiles that source into a usable shader program.
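To show the idea without pulling in a template engine, here is a sketch that builds the same kind of if-then-else chain with plain string concatenation. To be clear, this swaps doT for ordinary JavaScript, and the function name, uniform name, and varying names are all hypothetical. In a browser, the texture count would come from `gl.getParameter(gl.MAX_TEXTURE_IMAGE_UNITS)`.

```javascript
// Generate a fragment shader with one branch per available texture unit.
// Plain string building stands in for the doT templates used by melonJS.
function buildFragmentShader(maxTextures) {
    const branches = [];
    for (let i = 0; i < maxTextures; i++) {
        const keyword = i === 0 ? "if" : "else if";
        branches.push(
            "    " + keyword + " (unit == " + i + ") {\n" +
            "        color = texture2D(uSampler[" + i + "], vRegion);\n" +
            "    }"
        );
    }

    return [
        "precision mediump float;",
        "uniform sampler2D uSampler[" + maxTextures + "];",
        "varying vec4 vColor;",
        "varying float vTexture;",
        "varying vec2 vRegion;",
        "",
        "void main(void) {",
        "    vec4 color = vec4(0.0);",
        "    int unit = int(vTexture + 0.5);", // round the interpolated index
        branches.join("\n"),
        "    gl_FragColor = color * vColor;",
        "}",
    ].join("\n");
}

// 8 is the minimum number of texture units WebGL guarantees.
const source = buildFragmentShader(8);
console.log(source);
```

Each sampler is indexed with a literal constant, so the generated source never relies on dynamic array indexing, and hardware with more texture units simply gets a longer chain.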


Experiments for a Faster Future

What we have now is pretty good, but it can get a lot better. Some experimentation was done that attempts to classify vertex attributes by how often they change, and only send them to the GPU when necessary. What we found is that the additional bookkeeping required to "detect changes" is in fact a lot of wasted CPU effort. It's more efficient to just send everything regardless.

Our experiments were complicated by the nature of the melonJS rendering pipeline, which already attempts to optimize by only drawing objects that are known to be visible in the scene. As things in the scene move, leaving and entering the viewport, the position of renderables within the stream buffer changes. That explains the extra bookkeeping requirement.

At first glance, it appears to be a great deal easier uploading everything to the GPU regardless of its visibility, and only sending changes as they are performed by the game. In other words, keeping the entity state within the GPU and synchronizing changes. Problems arise when entities get added and removed from the scene, especially particles. That puts you right back into a memory management role, with plenty of bookkeeping. This experiment led to seeking other ways to make the compositor more efficient.

Another, simpler approach to working around bandwidth limitations is reducing the size of each vertex element. It's common to send a color as a vec4: R, G, B, and A components in the range 0..1. But that's considerably wasteful when a color needs to be included with every vertex. To reduce the size of the color information, these components can be packed into a single float (similar to the more common 32-bit RGBA unsigned integer that most game developers are familiar with). And the GPU can unpack the color into a vec4 within the vertex shader. Some precision is lost in the packing/unpacking process, so that needs to be taken into consideration.
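One possible packing scheme, purely as an illustration: give each channel 6 bits, so the four channels fit in the 24 bits of integer precision that a 32-bit float can hold exactly. The 6-bit choice and both function names are my assumptions, not something melonJS has settled on. The unpacking is written in JavaScript here, but it only uses floor and modulo, which a vertex shader can do just as well.

```javascript
// Pack four 0..1 channels into one number using 6 bits per channel.
// 4 x 6 = 24 bits, which a 32-bit float represents without rounding.
const LEVELS = 64; // 2^6 steps per channel

function packColor(r, g, b, a) {
    const quantize = (c) => Math.round(Math.min(1, Math.max(0, c)) * (LEVELS - 1));
    return ((quantize(r) * LEVELS + quantize(g)) * LEVELS + quantize(b)) * LEVELS + quantize(a);
}

// Mirror of what the shader would do with floor() and mod().
function unpackColor(packed) {
    const a = packed % LEVELS;
    packed = Math.floor(packed / LEVELS);
    const b = packed % LEVELS;
    packed = Math.floor(packed / LEVELS);
    const g = packed % LEVELS;
    const r = Math.floor(packed / LEVELS);
    return [r, g, b, a].map((c) => c / (LEVELS - 1));
}

const packed = packColor(1.0, 0.5, 0.25, 1.0);
console.log(packed === Math.fround(packed)); // true: survives storage in a Float32Array
console.log(unpackColor(packed));            // close to the input, within 1/126 per channel
```

The precision loss mentioned above shows up as the quantization step: with 6 bits, each channel can be off by up to half a step, or 1/126.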

The next step is to rework all of the drawing code outside of the compositor, getting away from the "old way" of doing things like 2D Canvas. The Canvas API has a consistent global state for color and transformation matrix. The matrix, as I mentioned before, is a bit like a combined "ModelView matrix"; it handles the camera position and entity position in one. Replacing that with just a view matrix means we can remove a lot of heavy math operations from JavaScript, and move them to the GPU! A model matrix is not necessary for quads, because it would only apply to six vertices! (It would make sense for much larger meshes like Spine.) This work will take place in ticket #637. The API will change enough that WebGL will benefit in terms of better performance, but will still work with 2D Canvas.

Strict requirements for draw order have further complicated our WebGL architecture. We've disabled the depth buffer and draw everything with the same depth to simulate a pure 2D rendering environment. With the addition of extra metadata afforded by the work in #637, we will be able to make use of the depth buffer after all; every draw operation can be provided with z-index information. This in turn means the compositor can be transformed from a single stream buffer with many attributes to a series of buffers with fewer attributes and more use of uniforms (all the while being mindful of draw order for proper transparency/blending).

WebGL 2.0 is an update that will be coming to UAs in the future, and it contains a lot of goodies that we can make use of: Array Textures, Vertex Array Objects, Multiple Render Targets, Instanced Drawing, etc. I'm sure we'll have some very interesting things to look forward to, and some good optimizations by using them.

Benchmarks

It's time for some pretty pictures and geeky numbers! The hardware I used for these completely unscientific benchmarks is my Late 2011 MacBook Pro: Radeon HD 6770M, 2.5GHz quad core i7 Sandy Bridge, 8GB DDR3 RAM. The UA is Chrome 40.0.2214.111 (64-bit).

Since we don't have usable numbers from the particle debug panel (at the time of writing) I used the small Stats.js library and hacked it into the melonJS RAF. Then I added a particle emitter to the platformer example and configured it to spew 2,500 particles. First, some screenshots of this configuration with the CanvasRenderer.

CanvasRenderer

melonJS v2.1.0 CanvasRenderer (FPS)
melonJS v2.1.0 CanvasRenderer (MS)

Ouch! Right away, the CanvasRenderer is under heavy stress, running at an average 50fps. The particle motion was very obviously not smooth during this test. The second screen shows the reason for the lower FPS: each me.game.draw() call takes between 16 and 20ms! It's right on the threshold for 60FPS. Combined with the particle updates, CPU time goes over 16.7ms on many frames.

Still, the results for the CanvasRenderer are very impressive. I expected it to perform a lot worse in this test. Props to the Chrome team for that!

WebGLRenderer

Next, the same exact code running with WebGL. Note that the only difference is the URI! The code itself is unchanged. Also notice the bilinear filtering, especially around the spaceman's helmet. This is caused by Chrome lacking support for the CSS3 image-rendering: pixelated; property (I'm using Chrome 40; pixelated will be added to Chrome 41). melonJS is already using this CSS property.

melonJS v2.1.0 WebGLRenderer (FPS)
melonJS v2.1.0 WebGLRenderer (MS)

Much better! A solid 60FPS, with each frame draw taking about 11 to 13ms. The FPS screen shows "59", and even stranger the range shows it hit "61" at some point. While I could just explain it away as a known issue in Chrome, I'm going to be very forward and say that I am not affected by that bug on this machine. However, it does manifest on a MacBook Air (especially with an external monitor attached) that I use at work. So it's actually a thing. Instead I'm going to explain it away as an acceptable margin of error. ;)

One thing you can notice in the FPS graph (far left) is that it starts out pretty rocky, dropping as low as 37FPS! This is from a combination of JIT warmup and Compositor behavior as it resets and uploads new textures to the GPU when switching between the loading screen and play screen states. The particle emitter is also a bit of a beast that instantaneously creates 2,500 particles and sleeps until it can create more (as older particles die off). This causes a bit of a CPU spike until the particle distribution evens out.

In case you were wondering, the little red "GL" icon on the right side of the address bar is the WebGL Inspector Chrome extension. Very handy for debugging (disabled in these tests).

If we take it at face value, the 18ms → 11ms change marks a 39% improvement over CanvasRenderer. In other words, WebGLRenderer is 1.6x faster. That's not bad for a beta! Still more improvements to come.
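The two figures come from the same pair of numbers, just divided differently. A quick check, taking 18ms and 11ms as representative draw times from the screenshots:

```javascript
// Representative per-frame draw times read off the two benchmarks.
const canvasMs = 18;
const webglMs = 11;

const improvement = (canvasMs - webglMs) / canvasMs; // share of time saved
const speedup = canvasMs / webglMs;                  // how many times faster

console.log(Math.round(improvement * 100) + "%"); // 39%
console.log(speedup.toFixed(1) + "x");            // 1.6x
```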

All Done

While we don't get the raw GPU performance of a non-interactive benchmark, we still get really good results with fully interactive content and unlimited potential flexibility. We also afford the ability to replace the Compositor and shaders with completely custom code, just in case you want to create a non-interactive benchmark in melonJS. ;) Or more likely if you just want to pull off some crazy special effects that require additional attributes to be passed to your shaders. It would also be interesting to see other Compositors designed to be even faster than what we've built!

So far, WebGL support in melonJS is finally starting to shape up. It takes a backward-compatible approach to drawing the same way as the 2D Canvas API, which is limiting in terms of WebGL's strengths. However, it does lower the bar for melonJS users to get their games performing better with less work, and it also provides the kind of flexibility that a general purpose game engine cannot survive without.

In closing, there are a lot of tricky details to deal with when using WebGL. It's not all unicorns and rainbows. New algorithms need to be created to get the most out of the API. The biggest performance gains come from application-specific shaders, if you can manage. For a general purpose game engine, there's only so far you can stretch it. The parallelism really needs to be shared by application code (on the CPU) and shader code (on the GPU).
