Thursday, 14 May 2009

Faster, Pussycat! Kill! Kill!

One of the unending background tasks round here is to make Moviestorm run faster. I managed to persuade Julian to take a few minutes out to tell you what he's been doing. If you get bogged down in the techspeak, just skip to the last paragraph. ;)

Recently, I have used the small amounts of time not devoted to watching Russ Meyer films to attempt to make Moviestorm run faster. I don't even want to count the number of changes I've made, but it's quite a number.

One of the things about performance improvement is that not everything you think makes things faster actually does. In a previous job I was reliably informed by a Games Industry "Guru" that "I shouldn't need to profile my code because I should know where the bottlenecks are". Let's leave aside the distinct possibility that this was to avoid him having to buy me a profiler. The real moral of the tale is "what you assume makes an ASS out of U and ME". I'm frequently wrong about such things (most devs I know are; to err is human, etc.), and in a complex app with threads and the like, such as Moviestorm, it is very easy for us mortal non-Gurus to miss the wrong end of the wrong stick.

So my life has involved the use of Java profilers for a 30,000ft view of what's going on, and an OpenGL debugger to see what the individual molecules are up to within our favourite application. First off, I've identified some of the critical code, the 10% that runs 90% of the time. (I say "some of" because Moviestorm is an open system, and adding props and characters with new materials and activities can drastically change the balance of the call state.) Once I know what they are, there are a number of strategies I've employed to speed things up:

* OpenGL state management. I've written a GL wrapper that tracks state and rejects redundant changes. Using my magic tools, I've seen that the graphics card on the test machine was hardly breaking a sweat on a scene with 7 characters. Therefore I don't expect to see this optimisation making a big impact YET. I guess it's good news - we can do lots more pretties without hurting the frame rate provided they don't cost a lot of CPU to set up.
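
For the curious, here is a minimal sketch of the redundant-change filtering idea. It is not Moviestorm's actual wrapper: the GlBackend interface, the StateCachingGl class and the capability number are all hypothetical stand-ins for a real OpenGL binding. The only point is that a state change reaches the driver when it differs from the last value the wrapper set.

```java
import java.util.HashMap;
import java.util.Map;

final class GlStateCacheDemo {
    /** Hypothetical stand-in for a real OpenGL binding. */
    interface GlBackend {
        void enable(int capability);
        void disable(int capability);
        void bindTexture(int textureId);
    }

    /** Fake driver that just counts the calls that actually reach it. */
    static final class CountingBackend implements GlBackend {
        int calls;
        public void enable(int capability) { calls++; }
        public void disable(int capability) { calls++; }
        public void bindTexture(int textureId) { calls++; }
    }

    /** Remembers the last state it set and drops changes that would be no-ops. */
    static final class StateCachingGl {
        private final GlBackend backend;
        private final Map<Integer, Boolean> capabilities = new HashMap<>();
        private int boundTexture = -1; // -1 means "unknown"; assumes ids are never negative

        StateCachingGl(GlBackend backend) { this.backend = backend; }

        void setEnabled(int capability, boolean enabled) {
            Boolean current = capabilities.get(capability);
            if (current != null && current == enabled) {
                return; // redundant change: skip the driver call
            }
            capabilities.put(capability, enabled);
            if (enabled) backend.enable(capability); else backend.disable(capability);
        }

        void bindTexture(int textureId) {
            if (textureId == boundTexture) return; // already bound
            boundTexture = textureId;
            backend.bindTexture(textureId);
        }
    }

    public static void main(String[] args) {
        final int blend = 1; // illustrative capability number, not a real GL constant
        CountingBackend driver = new CountingBackend();
        StateCachingGl gl = new StateCachingGl(driver);
        for (int i = 0; i < 100; i++) {
            gl.setEnabled(blend, true);
            gl.bindTexture(7);
        }
        System.out.println("driver calls: " + driver.calls);
    }
}
```

A cache like this is only trustworthy if every state change goes through the wrapper; anything that talks to GL directly leaves the cached values stale.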

* Workspace variables in bottleneck systems. You'll see a lot of maths code using static variables named things like vWork1 to avoid the cost of doing eg

Vector3f pos = new Vector3f();

Yes, the code is less readable, but that's optimisation for ya (in more complex cases I've preserved the semantics so that instead of writing

Vector3f pos = new Vector3f();

I write

Vector3f pos = vWork1;

and hopefully the JIT can optimise, but who knows?)

Why is this inefficient? Firstly, the memory allocation takes time; there are also THREE constructors called when you construct a Vector3f for instance - Vector3f, Tuple3f and Object; plus there is time to zero the x/y/z fields AND there is a hit on the GC side because lots of small short-lived objects can be a nightmare. In frequently-called code, this can accumulate and deferring the calculations to pre-allocated workspace saves these overheads.

As a general point of advice, it's only worth doing this if you know the code is causing a performance problem - ie you've profiled it some way (either using NetBeans' profiler or timing method calls manually). Premature optimisation is the root of all evil, remember! (Donald Knuth, Mr Computer Science, circa 197x).
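
To make the workspace idea concrete, here is a small sketch. The Vector3f below is a cut-down stand-in written for this example, not the real class, and the distance methods are hypothetical. Both versions return the same result; the second one simply never allocates.

```java
final class WorkspaceDemo {
    /** Minimal stand-in for a vecmath-style Vector3f; not the real class. */
    static final class Vector3f {
        float x, y, z;
        void set(float x, float y, float z) { this.x = x; this.y = y; this.z = z; }
        void sub(Vector3f a, Vector3f b) { x = a.x - b.x; y = a.y - b.y; z = a.z - b.z; }
        float length() { return (float) Math.sqrt(x * x + y * y + z * z); }
    }

    // Pre-allocated scratch vector. NOT thread-safe: only one thread may use it at a time.
    private static final Vector3f vWork1 = new Vector3f();

    /** Allocating version: creates a short-lived Vector3f on every call. */
    static float distanceAllocating(Vector3f a, Vector3f b) {
        Vector3f delta = new Vector3f();
        delta.sub(a, b);
        return delta.length();
    }

    /** Workspace version: same arithmetic, no allocation. */
    static float distanceWorkspace(Vector3f a, Vector3f b) {
        Vector3f delta = vWork1; // keeps the readable local name
        delta.sub(a, b);
        return delta.length();
    }

    public static void main(String[] args) {
        Vector3f a = new Vector3f();
        Vector3f b = new Vector3f();
        a.set(1f, 2f, 3f);
        b.set(4f, 6f, 3f);
        System.out.println(distanceAllocating(a, b));
        System.out.println(distanceWorkspace(a, b));
    }
}
```

Two caveats: a shared static workspace is only safe if one thread at a time touches it, and a JIT that performs escape analysis may already remove the allocation in the first version, so it is worth measuring before and after.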

* In-place calculations. In some functions, the results are allocated dynamically (which exacerbates the problems mentioned above). So I've written versions that use pre-allocated arguments, and inlined the relevant code where appropriate.
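
A sketch of what an in-place variant can look like, using a cross product as the example. The Vector3f here is again a stand-in written for illustration, and the method names are assumptions rather than anything from the Moviestorm code.

```java
final class InPlaceDemo {
    /** Minimal stand-in for a vecmath-style Vector3f; not the real class. */
    static final class Vector3f {
        float x, y, z;
        Vector3f() { }
        Vector3f(float x, float y, float z) { this.x = x; this.y = y; this.z = z; }
    }

    /** Allocating style: the result is a fresh object on every call. */
    static Vector3f cross(Vector3f a, Vector3f b) {
        Vector3f result = new Vector3f();
        cross(a, b, result);
        return result;
    }

    /** In-place style: the caller supplies the destination, which may alias a or b. */
    static void cross(Vector3f a, Vector3f b, Vector3f out) {
        // Compute into locals first so the call stays correct when out == a or out == b.
        float cx = a.y * b.z - a.z * b.y;
        float cy = a.z * b.x - a.x * b.z;
        float cz = a.x * b.y - a.y * b.x;
        out.x = cx;
        out.y = cy;
        out.z = cz;
    }

    public static void main(String[] args) {
        Vector3f xAxis = new Vector3f(1f, 0f, 0f);
        Vector3f yAxis = new Vector3f(0f, 1f, 0f);
        Vector3f out = new Vector3f(); // allocated once, reusable for every later call
        cross(xAxis, yAxis, out);
        System.out.println(out.x + ", " + out.y + ", " + out.z);
    }
}
```

The thing to watch with this style is aliasing: writing straight into the output while still reading the inputs gives wrong answers when the caller passes the same object for both.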

* Removed redundant field settings in constructors. Eg in Vector3f(), the fields were set to 0 even though Java guarantees to have done this already.

* Caching of invariant state. Many bits of code do searches for things which never change, wasting valuable cycles.
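
As an illustration, here is a sketch of caching a lookup whose answer never changes. The Skeleton class and its bone names are invented for this example; the searches counter is only there to show how often the expensive path runs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class InvariantCacheDemo {
    /** Hypothetical skeleton whose bone list never changes after construction. */
    static final class Skeleton {
        private final List<String> boneNames;
        private final Map<String, Integer> indexCache = new HashMap<>();
        int searches; // number of linear searches actually performed

        Skeleton(List<String> boneNames) {
            // Defensive copy so the "never changes" assumption really holds.
            this.boneNames = Collections.unmodifiableList(new ArrayList<>(boneNames));
        }

        /** Uncached: a linear search on every call. */
        int findBoneSlow(String name) {
            searches++;
            return boneNames.indexOf(name);
        }

        /** Cached: the search runs once per name, which is valid only because the list is invariant. */
        int findBone(String name) {
            Integer cached = indexCache.get(name);
            if (cached == null) {
                cached = findBoneSlow(name);
                indexCache.put(name, cached);
            }
            return cached;
        }
    }

    public static void main(String[] args) {
        Skeleton skeleton = new Skeleton(Arrays.asList("hip", "spine", "head"));
        for (int frame = 0; frame < 1000; frame++) {
            skeleton.findBone("head");
        }
        System.out.println("searches: " + skeleton.searches);
    }
}
```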

* Lazy evaluation of dynamic state. Some states change far less often than they are polled, for instance the world transform of a scene object. This gives us an opportunity to only recalculate such state values when we really need to (in the SceneObject's case when its local transform is modified in some way).
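
The usual way to do this is a "dirty flag". The sketch below is a guess at the general shape, not the real SceneObject: transforms are reduced to a single x offset to keep it short, and the field and method names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

final class LazyTransformDemo {
    /** Hypothetical scene node; a transform is just an x offset here. */
    static final class SceneObject {
        private final SceneObject parent;
        private final List<SceneObject> children = new ArrayList<>();
        private float localX;
        private float worldX;
        private boolean worldDirty = true;
        int recalculations; // how often the world value was really recomputed

        SceneObject(SceneObject parent) {
            this.parent = parent;
            if (parent != null) parent.children.add(this);
        }

        void setLocalX(float x) {
            localX = x;
            markDirty();
        }

        private void markDirty() {
            // A node is only ever cleaned after its ancestors, so a dirty node
            // implies its whole subtree is already dirty and we can stop here.
            if (worldDirty) return;
            worldDirty = true;
            for (SceneObject child : children) child.markDirty();
        }

        float getWorldX() {
            if (worldDirty) {
                worldX = (parent == null ? 0f : parent.getWorldX()) + localX;
                worldDirty = false;
                recalculations++;
            }
            return worldX;
        }
    }

    public static void main(String[] args) {
        SceneObject root = new SceneObject(null);
        SceneObject child = new SceneObject(root);
        root.setLocalX(10f);
        child.setLocalX(5f);
        for (int frame = 0; frame < 100; frame++) {
            child.getWorldX(); // polled every frame, recomputed only when dirty
        }
        System.out.println("recalculations: " + child.recalculations);
    }
}
```

Note that changing a parent has to invalidate its children too, otherwise they keep serving a stale world transform.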

* Map iteration. I've changed some iterations over hash maps to use the entry sets rather than the key sets as then the values and keys come for free (as opposed to doing a lookup for the value).
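
The difference looks like this in plain Java. The map contents and method names are made up for the example; Map.keySet(), Map.get() and Map.entrySet() are the standard java.util calls.

```java
import java.util.HashMap;
import java.util.Map;

final class MapIterationDemo {
    /** Key-set iteration: every value costs an extra hash lookup. */
    static int totalViaKeySet(Map<String, Integer> counts) {
        int total = 0;
        for (String key : counts.keySet()) {
            total += counts.get(key); // second trip into the map
        }
        return total;
    }

    /** Entry-set iteration: key and value arrive together, no lookup needed. */
    static int totalViaEntrySet(Map<String, Integer> counts) {
        int total = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            total += entry.getValue(); // entry.getKey() is available here too
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> propsPerMaterial = new HashMap<>();
        propsPerMaterial.put("wood", 1);
        propsPerMaterial.put("metal", 2);
        propsPerMaterial.put("glass", 3);
        System.out.println(totalViaKeySet(propsPerMaterial));
        System.out.println(totalViaEntrySet(propsPerMaterial));
    }
}
```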

* Many of our skeleton operations are faster now that I've speeded up traversal of bones, and eliminated a number of tests that only need to be performed once.

The first item aside, these are all on the CPU side of things. I'm currently working on some graphics optimisation. I've changed the way our primitives sort and collected them up into batches. Batches require only one setup and teardown, so we win when the batches are bigger than 1 prim, and never lose (the overhead for batch generation being small). It's going well though it has required a big overhaul of the render code.
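
A rough sketch of the sort-then-batch idea follows. The Prim class and its materialId field are hypothetical; materialId stands in for whatever render state the real primitives share, and the real render code is certainly more involved than this.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class BatchingDemo {
    /** Hypothetical primitive; materialId stands in for shared render state. */
    static final class Prim {
        final int materialId;
        final String name;
        Prim(int materialId, String name) {
            this.materialId = materialId;
            this.name = name;
        }
    }

    /** Sorts prims by material, then groups each run of equal materials into one batch. */
    static List<List<Prim>> buildBatches(List<Prim> prims) {
        List<Prim> sorted = new ArrayList<>(prims); // leave the caller's list untouched
        sorted.sort(Comparator.comparingInt((Prim p) -> p.materialId));
        List<List<Prim>> batches = new ArrayList<>();
        List<Prim> current = null;
        for (Prim prim : sorted) {
            if (current == null || current.get(0).materialId != prim.materialId) {
                current = new ArrayList<>(); // material changed: start a new batch
                batches.add(current);
            }
            current.add(prim);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Prim> prims = new ArrayList<>();
        prims.add(new Prim(2, "chair"));
        prims.add(new Prim(1, "table"));
        prims.add(new Prim(2, "stool"));
        prims.add(new Prim(1, "desk"));
        prims.add(new Prim(3, "lamp"));
        List<List<Prim>> batches = buildBatches(prims);
        // One setup and teardown per batch instead of one per prim.
        System.out.println(prims.size() + " prims in " + batches.size() + " batches");
    }
}
```

With one setup and teardown per batch, the worst case (every prim has its own material) costs the same number of setups as before, plus the sort.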

OK, so far, so much techno-babble. How has this improved speed? Well, on our test machine, we were rendering at 10.6fps on my test scene with 7 characters, lights and quite a few props. After 2 weeks of optimisation, the same scene ran at 32fps. Cool! But the machine is a monster, and it remains a priority to get the speed up on lesser beasts. Optimisation is a war of attrition: lots of small changes can be hardly noticeable; but put one more in, and suddenly things start to flow. Then again, when one bottleneck is removed, another will surely rise to take its place. Slowly slowly catchee pussycat (oh, how I love mixing antonymic metaphors). There are a number of areas we are aware of where we can make yet more speed gains; my notional target is 20-25fps for a moderate scene on an average PC / Mac, and I'm optimistic we can hit this. The only way is Up!

So there you have it. With luck, I'll get Dave to tell you something next week about the work he's been doing on the launcher, which is already giving us much faster load times on our test machines. We've obviously got to do a shedload of testing on the new code before we ship it to you, as there's always a risk that we've broken something major, so we've got this pencilled in for the release after next, assuming all goes well.

Oh, and please don't ask me what any of this means. All I know is Moviestorm runs a lot faster now.


Walvince said...

Excellent news! It's really an important point, that's for sure. Maybe the most important. Keep it up!

snorkel said...

Well, I've got the shader batching up and running. Hurrah! I had some real nasty bugs to fight to get there, and I am happy in all but one respect: there is no notable improvement in performance. Yes, sometimes the universe defies one's abilities to understand it; and just as the rotation curves of galaxies lead inexorably to the hypothesis of Dark Matter, so the performance of Moviestorm leads me to the conclusion that there must exist a thing such as Dark Code. This is code that is invisible to all observational techniques of debugging, but whose presence can be inferred by its ability to grind away at frame rate. Having spent some time chasing the ghosts of "Heisenberg Code" (the bugs that only appear when you are not looking for them) I am now looking earnestly for the presence of an analogue to Dark Energy which will cause inexplicable speed-ups in the application. Some may claim this all points to a fundamental misunderstanding of the universe; I say it's more likely to be because it's Friday and my brain is aching, and when I come in Monday, everything will be as it should. Om Mani Padme Hum...

Norrie said...

"Heisenberg Code" made me laugh far more than I should admit to.

"Dark Code" sent me over the geek ledge.