C++ – CodeItNow

Direct3D 11 Multithreading

Rory — Wed, 22 Apr 2009 04:11:02 +0000

I’ve been putting it off for a while, but with my recent trip to GDC and the arrival of the Direct3D 11 beta, I thought it was about time I switched my renderer to be multithreaded. One of the things I learned at a Direct3D 11 talk at GDC is that it works on ‘down-level hardware’, which means DirectX 9 & 10 cards. Of course, you don’t get the snazzy new hardware features, but you do get some of the benefits of the new API, like multithreading and limited compute shaders (albeit not as fast as it will be on the real hardware).

There has been some multithreading support in earlier DirectX versions for a while now by using the multithreaded flag when creating the device. Typically though, the pattern has been to run a dedicated rendering thread and submit objects to be rendered to that thread. This allows the device to stay in single threaded mode where it is faster.

Things have changed a lot with Direct3D 11. The rendering API has been separated from the factory functions into a separate object called the device context. The factory functions on the device are all free threaded, meaning that they can be called from any thread. A device context, on the other hand, is not thread safe, so each context is meant to be used from one thread at a time.

The basic idea behind multithreading in Direct3D 11 is that you create an immediate device context on the main thread. Then, for each thread on which you’d like to be able to render, you create a deferred context. As you can probably guess from the names, commands executed on the immediate context get executed immediately, but those on the deferred context just get saved off into a command list. You then execute the deferred command lists on the main thread using the immediate device context. Sounds easy enough.

Thread Pools

Given that you can submit draw calls to deferred contexts on multiple threads, it makes sense to ditch the single rendering thread concept and switch to using something like a thread pool for issuing the draw calls. This scales far better than a dedicated rendering thread. It’s also pretty easy to set up a simple thread pool, and give each worker thread a deferred render context.

There are plenty of places on the internet to read about thread pools so I’m not going to get into it here, but one thing I can’t stress enough is to make sure that you get your synchronization right! In my initial implementation, I used my normal queue data structure, but wrapped it up in mutexes (mutices?) to make sure it was thread-safe. This worked out well since I was very confident that things were working correctly, but a quick foray into VTune told me that I was spending 40% of the time waiting on synchronization points!

After some quick digging around, I came across a few articles that Herb Sutter wrote for Dr Dobb’s Journal about producer/consumer queues. I implemented the low-lock queue recommended by Sutter, and got a good speedup of at least 30% (that number is off the top of my head, but I remember it was a lot). The relevant articles I read are single producer/consumer queue, generalized concurrent queue, and measuring performance. I still use events for sending the worker threads to sleep when there is nothing left to work on, and to wake them up when data is added to the queue.
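
To give a feel for what a low-lock queue looks like, here is a minimal sketch of a bounded single-producer/single-consumer queue. It is not Sutter's linked-list design and not the queue from my renderer: it swaps in a fixed-size ring buffer because that is shorter to show, and it uses `std::atomic` from C++11. The class name `SpscQueue` and its `TryPush`/`TryPop` functions are made up for this example.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bounded queue: exactly one thread may call TryPush and
// exactly one other thread may call TryPop. No mutex is taken on either side.
template <typename T>
class SpscQueue
{
public:
    explicit SpscQueue(std::size_t capacity)
        : m_slots(capacity + 1) // one slot stays empty to tell full from empty
    {
        assert(capacity > 0);
    }

    // Producer thread only. Returns false when the queue is full.
    bool TryPush(const T& value)
    {
        const std::size_t tail = m_tail.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % m_slots.size();
        if (next == m_head.load(std::memory_order_acquire))
            return false;
        m_slots[tail] = value;
        m_tail.store(next, std::memory_order_release); // publish the item
        return true;
    }

    // Consumer thread only. Returns false when the queue is empty.
    bool TryPop(T& out)
    {
        const std::size_t head = m_head.load(std::memory_order_relaxed);
        if (head == m_tail.load(std::memory_order_acquire))
            return false;
        out = m_slots[head];
        m_head.store((head + 1) % m_slots.size(), std::memory_order_release);
        return true;
    }

private:
    std::vector<T> m_slots;
    std::atomic<std::size_t> m_head{0}; // next slot to read
    std::atomic<std::size_t> m_tail{0}; // next slot to write
};
```

A worker would typically spin on `TryPop` a few times and then go to sleep on an event, which is where the events mentioned above come back in.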

My application already stores up all of the state needed for a draw call in an object called a RenderContext, so instead of being passed off to the renderer on the main thread, the render context just gets enqueued to be rendered by one of the threads in the thread pool. When the worker thread gets to it, it passes the render context off to a thread-local renderer object initialized with a deferred device context. This renderer sets all of the changed state and issues the final draw call.

Finally, back on the main thread, it waits until all of the render contexts have been submitted to the deferred device contexts, and then executes each of these on the immediate device context.
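
The record-then-replay flow can be modelled in plain C++ without touching Direct3D at all. The sketch below is only a stand-in under loud assumptions: a "command list" is just a vector of callables, `RenderFrame` is a made-up function, and it spawns fresh threads per call instead of keeping a pool, purely to stay short. What it does show is the shape of the frame: each worker records into its own list, the main thread waits, and then everything is executed on one thread.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Stand-in for a recorded command list: just a list of callables.
using CommandList = std::vector<std::function<void()>>;

// Records drawCount fake draw calls across workerCount threads, then replays
// them on the calling thread. Returns the draw ids in execution order.
std::vector<int> RenderFrame(int drawCount, int workerCount)
{
    assert(workerCount > 0);
    std::vector<int> executed; // touched only during replay on the calling thread
    std::vector<CommandList> lists(static_cast<std::size_t>(workerCount));
    std::vector<std::thread> workers;

    for (int w = 0; w < workerCount; ++w)
    {
        workers.emplace_back([&, w]
        {
            // "Deferred" phase: record commands into this worker's own list.
            for (int id = w; id < drawCount; id += workerCount)
                lists[w].push_back([&executed, id] { executed.push_back(id); });
        });
    }

    // The main thread waits until every worker has finished recording.
    for (std::thread& t : workers)
        t.join();

    // "Immediate" phase: replay each list in turn on the main thread.
    for (const CommandList& list : lists)
        for (const auto& command : list)
            command();

    return executed;
}
```

Because every worker writes only to its own list, the recording phase needs no locking; the only synchronization point is the wait before replay.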

Test Scenario

In order to stress my renderer a bit, I fabricated a scenario with 10,000 models. Each model has a sphere and a ground plane with their own material. I use a loose octree for culling out the models outside of the frustum, but I don’t do any sorting of any kind. This means that the alternating materials that get rendered for the sphere and then the ground put a fair amount of stress on the CPU side of the renderer.

My single threaded renderer took about 50 ms to render the initial view of the scene. By switching to using the thread pool, this went down to about 30 ms. A nice improvement, that’s for sure. Obviously, as fewer objects are visible, the gains of using the multithreaded renderer disappear.

Profiling

I was happy that the multithreading appeared to be doing its job, but I wasn’t quite satisfied because I couldn’t really tell how well it was doing. Time for some profiling!

There appear to be quite a few CPU profilers out there. First of all I downloaded an evaluation of Intel VTune. It’s pretty overwhelming, but it gave me a lot of pertinent information. The bugger is that you have to pay a hefty sum for it, so I tossed it out of the window. I also tried out Microsoft xperf. This sampling profiler gave me a pretty good overview of what was expensive with the standard inclusive/exclusive view. It was a great help for quickly tracking down some areas of the code that I could very easily improve. I still use this.

The trouble with most of the sampling profilers is that they don’t know about frames. They just add up all of the samples over the given time period which gives you an idea on average what is happening. I wanted to get information about what was happening within the frame, so I implemented a really simple frame profiler.

Frame Profiler

An in-game profiler is a really handy tool to have. It lets you see in real-time exactly how your CPU time is being spent in one frame on each of your threads. It’s also pretty easy to set up.

First of all, I created a class called ThreadProfiler. As the name suggests, the ThreadProfiler class is responsible for recording events on a specific thread. This class has functions to notify it of the beginning and end of the frame, as well as when a profiling event begins and ends. All it really does is to record the name of the event, a color for display, and the timestamps when the event begins and ends. The events can be nested, so it maintains a stack of active events and records the depth of the stack for each event.
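
As a rough illustration of that description, a per-thread recorder could look something like the sketch below. It is written against the standard library rather than my Core types, and every name in it (`ThreadProfilerSketch`, `Event`, `Events`) is an assumption made for the example rather than the real class.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-thread event recorder; names and layout are assumptions.
class ThreadProfilerSketch
{
public:
    using Clock = std::chrono::steady_clock;

    struct Event
    {
        std::string name;
        std::uint32_t color;
        std::size_t depth; // nesting level when the event began
        Clock::time_point begin;
        Clock::time_point end;
    };

    void BeginFrame()
    {
        m_events.clear();
        m_open.clear();
    }

    void BeginEvent(const std::string& name, std::uint32_t color)
    {
        m_events.push_back({name, color, m_open.size(), Clock::now(), Clock::time_point()});
        m_open.push_back(m_events.size() - 1); // remember which event is open
    }

    void EndEvent()
    {
        assert(!m_open.empty());
        m_events[m_open.back()].end = Clock::now();
        m_open.pop_back();
    }

    const std::vector<Event>& Events() const { return m_events; }

private:
    std::vector<Event> m_events;     // every event recorded this frame
    std::vector<std::size_t> m_open; // stack of indices into m_events
};
```

The depth stored with each event is what decides which row of the timeline an event gets drawn on.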

Next I created the singleton FrameProfiler class. The idea for this class is to hold all of the ThreadProfiler objects, and to forward events onto those classes based on the current thread ID. Threads are required to register their thread ID with the frame profiler in order for events to be recorded.

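        class FrameProfiler : public Core::Singleton<FrameProfiler>
        {
        public:

            FrameProfiler();

            // Threads must register before their events are recorded.
            void RegisterThread(int threadId);

            void BeginFrame(bool enabled);
            void EndFrame();

            // Forwarded to the ThreadProfiler registered for threadId.
            void BeginEvent(int threadId, const Core::String& name, uint32 color);
            void EndEvent(int threadId);

            DataStructures::ArrayList<ThreadProfiler>& GetThreadProfilers();
            const DataStructures::ArrayList<ThreadProfiler>& GetThreadProfilers() const;

        private:

            // One profiler per registered thread (element type assumed).
            DataStructures::ArrayList<ThreadProfiler> m_threadProfilers;
        };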

The final piece is a really simple macro which grabs the function name and creates an object which tells the FrameProfiler when it is created and destroyed. This is the macro that I place into whatever function or loop I’d like to profile.
class ScopedProfileEvent
{
public:

        ScopedProfileEvent(const Core::String& name, uint32 color)
        {
                if (FrameProfiler::IsCreated())
                {
                        FrameProfiler::Instance().BeginEvent(Core::Platform::GetCurrentThreadId(), name, color);
                }
        }

        ~ScopedProfileEvent()
        {
                if (FrameProfiler::IsCreated())
                {
                        FrameProfiler::Instance().EndEvent(Core::Platform::GetCurrentThreadId());
                }
        }
};
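// __LINE__ only expands when it is pasted through a second macro level.
#define PROFILE_JOIN_INNER(a, b) a##b
#define PROFILE_JOIN(a, b) PROFILE_JOIN_INNER(a, b)
#define PROFILE(X) const Profile::ScopedProfileEvent PROFILE_JOIN(event, __LINE__)(String(__FUNCTION__), (uint32)(X))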

Unlike a sampling profiler, this kind of profiling has a certain amount of processing overhead. There are a couple of quick things you can do to help with this though. The first is just to make sure that you don’t always have the overhead, and compile it out for your final builds. It’s important to do your profiling on an optimized build, so I would recommend debug, release, and final configurations or something similar. The second thing you can do is to just not run it every frame. I have it set on a key press so that I can get to the area I’d like to profile without the overhead, then hit the button to profile the next frame and display the results.

I’m not sure about how accurate this would be, but you could probably compare the previous frame’s duration to the profiled frame to get a rough estimate of the overhead that the profiling functions added. I wouldn’t rely on that though.

There’s actually quite a bit of information that can be gleaned from these profiling events, but the first thing I did was to render out the events as rectangles on a timeline. In the image below, I have two threads running. The main thread at the bottom has three levels of nested events being shown, and the top worker thread just has one.

Ok, there’s no legend right now, but I’m working on it. Each black/grey bar in the background represents one millisecond of frame time.

The bottom row on the main thread represents the update in green, the render in blue, and the call to Device::Present in red. Given the long red bar, I’d say I’m GPU limited in this scene.

The row above represents the breakdown of the render function from the bottom row. The cyan sliver is shadow rendering (actually I’m not rendering any shadows which is why it’s tiny). The huge magenta bar is the model rendering, and the yellow bar is post-processing.

The top row in the bottom thread represents the breakdown of the model rendering function. The green slivers are models being found in the octree and the red blocks are models being prepared for rendering. The large white bar is actually the command list from the worker thread being executed on the immediate device context. I was pretty surprised to see this segment so large, since I didn’t notice it in the other profilers at all.

Experiments

Now that I have a frame profiler, I can really experiment with my thread pool setup to see how it affects the frame. My computer has a dual core processor, so based on Sutter’s articles, I was expecting that one main thread and one worker would be the best setup. Even so, I tried running a variety of numbers of worker threads to see how it looked. Here’s what four threads looks like:

The first thing I noticed was just how much worse all of the threads fared. Each worker thread appeared to perform a tiny bit of work, and then get swapped out for another thread. The main thread really suffered due to this too. This is a great example of how visualizing this data is really illuminating. The scene was already GPU bound, so even though the rendering code was performing far worse, the frame rate actually stayed the same.

Another experiment I wanted to run was to see just how much other applications could affect the frame rate of my application. In this case, I just had Sysinternals Process Explorer running and polling the system processes every half second. It only took me a few tries to hit a frame where I could see the effect:

Notice the scale of the millisecond bars now – this frame took over twice as long to run as my first example with the exact same setup. You can see a big gap on the worker thread where another process stole its time. Even when it did get some time, it appears to be running very slowly.

Also, you can now see a large grey bar in the middle row of the main thread which shows the main thread waiting for the worker thread to finish.

The execution of the command list is pretty consistently taking up three and a half milliseconds or so. This is much higher than I had thought it would be. I really hope that this time gets reduced with newer drivers or hardware.

One last thing I’ve done to investigate what is happening in my application is to display the frame rate history. I use a moving average to calculate the frame time, so I have the last 100 frames stored anyway. It’s a simple enough task to just display this.
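
For anyone who hasn't set one up, a moving average over a fixed window is only a few lines. This is a sketch with made-up names (`FrameTimeHistory`, `Add`, `Average`, `Samples`), assuming frame times arrive as milliseconds in a float; it keeps a running sum so the average doesn't need a loop over the window.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fixed-size history of frame times with a running average.
class FrameTimeHistory
{
public:
    explicit FrameTimeHistory(std::size_t capacity) : m_samples(capacity, 0.0f)
    {
        assert(capacity > 0);
    }

    void Add(float milliseconds)
    {
        m_sum += milliseconds - m_samples[m_next]; // swap the oldest sample for the new one
        m_samples[m_next] = milliseconds;
        m_next = (m_next + 1) % m_samples.size();
        if (m_count < m_samples.size())
            ++m_count;
    }

    // Average over the frames recorded so far (at most capacity of them).
    float Average() const
    {
        return m_count ? m_sum / static_cast<float>(m_count) : 0.0f;
    }

    // Raw samples in storage order (not chronological), e.g. for a history graph.
    const std::vector<float>& Samples() const { return m_samples; }

private:
    std::vector<float> m_samples;
    std::size_t m_next = 0;  // slot that holds the oldest sample
    std::size_t m_count = 0; // number of valid samples
    float m_sum = 0.0f;
};
```

A float running sum will drift slightly over a long session, so recomputing the sum from the stored samples every so often is a cheap safeguard.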

You can see how varied the frame times are even though the camera isn’t moving. I’d imagine this is due to other processes on my computer interfering.

Final Thoughts

It was a fun adventure porting my code to Direct3D 11, particularly implementing a multithreaded renderer using a thread pool. I would recommend trying it out to those of you who have Direct3D 10 engines at the moment.

The jump from Direct3D 10 to 11 is nowhere near as bad as the previous jump from 9 to 10. It took me about three hours to change my rendering code to deal with the changes. The most awkward part was probably having to pass in the device context to functions which need to map buffers, since these functions are no longer on the buffers themselves.

Visualizing profiling data in real time can be a real eye-opener for understanding how your code is actually running rather than how you think it may be running. It has really helped me identify good candidates for moving to using the thread pool as well as pointing out areas of the code that are taking a surprisingly large amount of frame time.

Irradiance Caching: Part 1

Rory — Mon, 19 Jan 2009 01:35:35 +0000

Solving the rendering equation with even just one bounce of indirect lighting can take a long time. The majority of time spent rendering a frame is in estimating the lighting integral. For example, rendering a single bounce of indirect lighting at 720p resolution with 256 sample rays for a Monte Carlo estimator requires about 236 million rays to be cast (1280 × 720 pixels × 256 rays). This doesn’t even include the rays needed for sampling the lights for direct lighting, so in practice, the total will be even higher.

One interesting observation made by Greg Ward in his Siggraph ’88 paper is that contrary to direct lighting, where shadows and lights can cause harsh changes, the indirect lighting on a surface tends to vary relatively slowly. One way to picture why this is, is to imagine computing the average color of what you can see from each of your eyes. Even though each eye has a slightly different view of the world, the images they see are nearly identical, and so the average colors are also nearly the same.

The image below shows the same scene from my previous post with just the indirect irradiance, and it’s pretty clear that for each surface, the lighting varies in a very smooth fashion.

Ward proposed using this knowledge to reduce the number of times that the Monte Carlo estimator was evaluated by interpolating between nearby previously calculated values. At the time he just called it ‘lazy evaluation’, which I personally think is a good way to picture the idea. Later it became known as irradiance caching.

Irradiance Caching

The basic concept for irradiance caching is really simple: For each point on a surface at which you want to evaluate irradiance, if the cache contains any valid entries then interpolate between them. Otherwise, calculate a new irradiance entry, and add it to the cache.

A cache entry contains the position and normal for the point on the surface where the irradiance was evaluated as well as the irradiance value itself. One important additional piece of information that the cache requires is the range over which the entry is considered potentially valid. This range could be calculated in a number of ways, but the most common one is to use the harmonic mean of the hit distance of the rays used for the estimator. For n estimator samples, with sample k having hit distance d_k, the harmonic mean R is simply:

$$ R = \frac{n}{\sum_{k=1}^{n} \frac{1}{d_k}} $$

Using the harmonic mean distance makes the cache entry distribution very dense in corners and crevices, and sparse in open spaces. This matches up very well with where the indirect irradiance is likely to be changing the fastest. To get an idea of how the cache entry distribution looks, here’s the scene above with the cache entry positions shown as red dots:

Once you can add entries into the cache, you need to know how to find whether or not a particular cache entry can be used for interpolating the irradiance at a sample point. There are potentially quite a few ways that you can discard invalid cache entries depending on how fancy you want to get. For now, I’m using three simple tests.

Discard the entry if any of the following are true:

It is out of range of the sample point.
It has a normal that is too different from the sample normal.
It is in front of the sample point.

Once you have a valid cache entry, you need to calculate a weight for that entry, then carry on looking for other entries that are potentially valid. As you come across each valid cache entry, you need to keep the sum of the weighted irradiance values, and the sum of the weights themselves. From these two sums, you can calculate the final interpolated irradiance, where E_k is the irradiance stored in entry k and w_k is its weight at the sample point p:

$$ E(\mathbf{p}) = \frac{\sum_{k} w_k(\mathbf{p}) \, E_k}{\sum_{k} w_k(\mathbf{p})} $$

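The weight for a particular cache entry is another part of the algorithm that can potentially be calculated in many different ways. For now, I’m using the weight that Ward proposes, but there’s some interesting information about the weights used at Dreamworks in this paper. Here’s Ward’s initial weighting function, where the sample point has position p and normal n, and cache entry k has position p_k, normal n_k and range R_k:

$$ w_k(\mathbf{p}) = \frac{1}{\dfrac{\lVert \mathbf{p} - \mathbf{p}_k \rVert}{R_k} + \sqrt{1 - \mathbf{n} \cdot \mathbf{n}_k}} $$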

Note that you have to be a little bit wary of this function, since it is unbounded. When the sample point lies exactly on the cache entry and the normals match, you will get a divide by zero.

Typically, you would also discard cache entries that are below some weight threshold as specified by the user. This effectively scales the density of the cache entries and allows the user to make the trade off between speed and quality.
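
Putting the three rejection tests, the weight and the two running sums together, a linear search over the cache might be sketched as follows. This is not my actual implementation: the `Vec3` and `CacheEntry` types, the function name `Interpolate`, the threshold values and the single-channel irradiance are all assumptions chosen to keep the example small. The "in front" test uses one common formulation (the sign of the offset along the average of the two normals), and the weight is clamped rather than allowed to divide by zero.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };

static Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
static float Dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static float Length(Vec3 v) { return std::sqrt(Dot(v, v)); }

struct CacheEntry
{
    Vec3 position;
    Vec3 normal;      // unit length
    float irradiance; // one channel for brevity
    float range;      // harmonic mean hit distance
};

// Returns true and writes the interpolated irradiance if any entry is usable,
// false on a cache miss. All thresholds here are illustrative assumptions.
bool Interpolate(const std::vector<CacheEntry>& cache, Vec3 p, Vec3 n,
                 float minWeight, float& result)
{
    const float kMinNormalDot = 0.9f;   // reject normals that differ too much
    const float kBehindEpsilon = 0.01f; // tolerance for the "in front" test
    const float kMaxWeight = 1.0e6f;    // clamp instead of dividing by zero

    float weightedSum = 0.0f;
    float weightSum = 0.0f;

    for (const CacheEntry& e : cache)
    {
        const Vec3 delta = p - e.position;
        const float distance = Length(delta);
        const float cosNormals = Dot(n, e.normal);

        if (distance > e.range) continue;         // out of range
        if (cosNormals < kMinNormalDot) continue; // normal too different
        if (0.5f * Dot(delta, n + e.normal) < -kBehindEpsilon) continue; // entry in front of p

        // Ward's weight: 1 / (distance / range + sqrt(1 - n . n_k)).
        const float denom = distance / e.range +
                            std::sqrt(std::fmax(0.0f, 1.0f - cosNormals));
        const float weight = denom > 1.0f / kMaxWeight ? 1.0f / denom : kMaxWeight;
        if (weight < minWeight) continue;         // user quality threshold

        weightedSum += weight * e.irradiance;
        weightSum += weight;
    }

    if (weightSum <= 0.0f) return false;
    result = weightedSum / weightSum;
    return true;
}
```

A `false` return is the cue to run the full Monte Carlo estimator at that point and add the result to the cache as a new entry.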

Implementation

I’ve made a very bare bones implementation of irradiance caching as outlined above. At the moment I’m not using a quad tree to store the cache entries, so each cache check requires iterating through an array of entries. Clearly this is a very slow way to process the cache entries, but for now it does a decent enough job to allow me to focus on the irradiance caching algorithm itself. Here are the results:

Not very impressive, or smooth, is it? I was hoping that the simple implementation I have made would provide better results than this, but apparently not. At the moment there’s one crucial improvement to the algorithm that my implementation is missing though – Irradiance Gradients. Irradiance Gradients basically give a better clue as to how to interpolate the irradiance cache entries, both positionally and rotationally. I’m hoping that they will significantly reduce the artefacts visible at the moment.

One problem that can occur when using an irradiance cache is that later cache entries don’t contribute to previously rendered pixels. When this happens, you can see blocky artefacts where the irradiance values have been interpolated differently. Something like this:

One thing you can do to avoid this situation is to perform an irradiance gathering pass before doing the final render. When you perform the final render, you should have no cache misses. In my case, I am using a progressive renderer, so the cache is actually fairly well primed before rendering the 1×1 pixel size.

Improvements

In addition to irradiance gradients, there have been a load of improvements made to irradiance caching since the initial paper. The course notes for the Siggraph 2008 course provide details of many of these. I’ll post up some screenshots when I’ve added the irradiance gradients.

MockItNow: Throwing Exceptions

Rory — Thu, 15 Jan 2009 08:53:44 +0000

I’ve made a small update to MockItNow to allow you to throw exceptions when replaying function calls. You basically record the function call as normal, and provide the exception object that you want to throw during the replay using the EXPECT_THROW macro. You can also make a function default to throwing an exception at registration time using REGISTER_THROW.

If you want to see a couple of examples of this feature, take a look at the bottom of the sample file here.

Better Sampling

Rory — Thu, 08 Jan 2009 07:33:51 +0000

A couple of days ago, I compared the images my ambient occlusion integrator produced with those of Modo using similar settings. I noticed immediately how much ‘cleaner’ the render from Modo was. Clearly there was an issue with the way I was picking my samples, so I set about improving things.

My approach for generating the ambient occlusion rays was to generate uniform random samples over the hemisphere about the normal. Based on two random numbers in the range [0,1), I calculate the normalized sample direction using the following function:

Vector3 Sample::UniformSampleHemisphere(float u1, float u2)
{
	const float r = Sqrt(1.0f - u1 * u1);
	const float phi = 2 * kPi * u2;

	return Vector3(Cos(phi) * r, Sin(phi) * r, u1);
}

This generates points on a hemisphere from uniform variables u1 and u2, where each point has equal probability of being selected. The following image was generated with 256 random uniform samples:

It looks pretty noisy, that’s for sure. Part of the trouble comes from the fact that there’s no way to ensure that there’s an even distribution of the rays. A common way to alleviate this problem is to do stratified sampling instead of fully random sampling. The idea of stratified sampling is to split up the domain into evenly sized segments, and then to pick a random point from within each of those segments. You still get some randomness, but the points are more evenly distributed, which in turn reduces the variance. Less variance means less noise. Here’s the scene again, using 256 rays, but this time using stratified sampling:

As expected, it’s much less noisy, and for the same amount of computation!
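
A jittered grid is probably the simplest way to stratify the unit square. The sketch below assumes a square number of samples (16 × 16 gives the 256 used here) and uses `std::mt19937` from the standard library as the random source; the function name `StratifiedSamples` is made up for the example. Each returned pair would then be fed to the hemisphere sampling function as u1 and u2.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Generates gridSize * gridSize points in the unit square, one jittered
// point per grid cell, in row-major cell order.
std::vector<std::pair<float, float>> StratifiedSamples(std::size_t gridSize,
                                                       std::mt19937& rng)
{
    assert(gridSize > 0);
    std::uniform_real_distribution<float> jitter(0.0f, 1.0f);
    const float cell = 1.0f / static_cast<float>(gridSize);

    std::vector<std::pair<float, float>> samples;
    samples.reserve(gridSize * gridSize);
    for (std::size_t y = 0; y < gridSize; ++y)
    {
        for (std::size_t x = 0; x < gridSize; ++x)
        {
            // A random offset inside cell (x, y) so every stratum gets one sample.
            const float u1 = (static_cast<float>(x) + jitter(rng)) * cell;
            const float u2 = (static_cast<float>(y) + jitter(rng)) * cell;
            samples.emplace_back(u1, u2);
        }
    }
    return samples;
}
```

The samples come out in grid order, so if consecutive samples are ever used for different purposes it is worth shuffling them first.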

Sampling for Diffuse Monte Carlo Estimator

The stratified sampler helps out with the indirect diffuse lighting calculation too, but one other thing you can do to reduce noise for the Monte Carlo estimator is to choose random values that have a similar ‘shape’ to the integral you are estimating. Looking at the integral for diffuse reflections, you will see the familiar cosine term inside the integral:

$$ L_o = \frac{c}{\pi} \int_{\Omega} L_i(\omega) \cos\theta \, d\omega $$

where c is the diffuse material color, L_i is the incoming radiance, θ is the angle between the incoming direction ω and the normal, and π is the energy conservation constant.

Rather than wasting samples on areas of the integral where they will get multiplied out by the cosine term, why not just choose proportionally fewer samples in those areas?

Recall that the Monte Carlo estimator for the integral of the function f(x), with probability density function p(x) and N samples x_k, is:

$$ F_N = \frac{1}{N} \sum_{k=1}^{N} \frac{f(x_k)}{p(x_k)} $$

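The probability density function is just a function that returns the probability that a particular value will be chosen. For the uniform hemisphere sampling function above, the pdf is just a constant, (1 / (2 * pi)). This makes the Monte Carlo estimator for the diffuse integral:

$$ L_o \approx \frac{c}{\pi} \cdot \frac{1}{N} \sum_{k=1}^{N} \frac{L_i(\omega_k) \cos\theta_k}{1 / (2\pi)} = \frac{2c}{N} \sum_{k=1}^{N} L_i(\omega_k) \cos\theta_k $$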

Rather than multiply by the cosine term above, we just want to generate proportionally fewer rays at the bottom of the hemisphere. The integral of the pdf over the hemisphere must equal one, so by switching to a cosine-weighted sample distribution, the pdf becomes (cos(theta) / pi).

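This makes the estimator:

$$ L_o \approx \frac{c}{\pi} \cdot \frac{1}{N} \sum_{k=1}^{N} \frac{L_i(\omega_k) \cos\theta_k}{\cos\theta_k / \pi} $$

Which cleans up rather nicely to:

$$ L_o \approx \frac{c}{N} \sum_{k=1}^{N} L_i(\omega_k) $$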

Normally I would post a couple of images up for comparison’s sake, but in this case, the difference is pretty difficult to perceive without being able to compare one on top of the other. The difference is small, but it is definitely worth it!

The common way to generate a cosine weighted hemisphere sampler is to generate uniform points on a disk, and then project them up to the hemisphere. Here’s some code:

Vector3 Sample::CosineSampleHemisphere(float u1, float u2)
{
	const float r = Sqrt(u1);
	const float theta = 2 * kPi * u2;

	const float x = r * Cos(theta);
	const float y = r * Sin(theta);

	return Vector3(x, y, Sqrt(Max(0.0f, 1 - u1)));
}
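
To check that the two estimators really agree, here is a small self-contained sketch that evaluates both against a made-up incoming radiance function instead of tracing rays. Everything in it is an assumption for illustration: `Dir`, `IncomingRadiance` and the two `Estimate` functions are hypothetical, and the sampling functions are re-typed against `<cmath>` rather than my own math library. For the radiance chosen here, 0.5 + 0.5 cos(theta), the integral works out analytically to 5c/6, so both estimates should land near that value.

```cpp
#include <cassert>
#include <cmath>
#include <random>

struct Dir { float x, y, z; }; // z is the cosine of the angle to the normal

static const float kPiF = 3.14159265358979f;

Dir UniformHemisphere(float u1, float u2)
{
    const float r = std::sqrt(std::fmax(0.0f, 1.0f - u1 * u1));
    const float phi = 2.0f * kPiF * u2;
    return {std::cos(phi) * r, std::sin(phi) * r, u1};
}

Dir CosineHemisphere(float u1, float u2)
{
    const float r = std::sqrt(u1);
    const float phi = 2.0f * kPiF * u2;
    return {std::cos(phi) * r, std::sin(phi) * r,
            std::sqrt(std::fmax(0.0f, 1.0f - u1))};
}

// Stand-in for tracing a ray: a sky that is brighter towards the normal.
float IncomingRadiance(Dir d) { return 0.5f + 0.5f * d.z; }

// Uniform sampling, pdf = 1 / (2 pi): estimate = (2c / N) * sum(Li * cos).
float EstimateUniform(float c, int n, std::mt19937& rng)
{
    std::uniform_real_distribution<float> u(0.0f, 1.0f);
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
    {
        const float u1 = u(rng);
        const float u2 = u(rng);
        const Dir d = UniformHemisphere(u1, u2);
        sum += IncomingRadiance(d) * d.z;
    }
    return static_cast<float>(2.0 * c * sum / n);
}

// Cosine-weighted sampling, pdf = cos / pi: estimate = (c / N) * sum(Li).
float EstimateCosine(float c, int n, std::mt19937& rng)
{
    std::uniform_real_distribution<float> u(0.0f, 1.0f);
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
    {
        const float u1 = u(rng);
        const float u2 = u(rng);
        sum += IncomingRadiance(CosineHemisphere(u1, u2));
    }
    return static_cast<float>(c * sum / n);
}
```

Running both with a small sample count and a few different seeds is a quick way to see the cosine-weighted version settle down faster.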

Just by doing these two small steps, I’ve been able to clean up my images significantly. Here’s the scene from above again, this time with single bounce final gather with 256 rays, stratified cosine-sampled:

Next on my list is to take a look at path tracing, followed by irradiance caching (wasn’t that the point of all this?). This should allow me to get fairly cheap multi-bounce diffuse lighting.

The Holidays: Time for fun work!

Rory — Sun, 04 Jan 2009 01:48:52 +0000

For the first time in about three years, I’ve had two weeks off work. I’ve spent a lot of time just relaxing and taking a break from things, but I’ve also been able to get back to doing some graphics work. Ever since Vivendi bought Activision, the project that I was leading has been “put on hold”, so I’ve been back on the game team. It’s not as fun for me, that’s for sure, but luckily, I have my code at home to play with, so all is not lost! With the holidays, I’ve found some motivation to get back to it.

What have I been doing? Well, as I was approaching the break, I read through the course notes from the Practical Global Illumination with Irradiance Caching course at Siggraph last year. I thought the course itself was really good, and very clearly presented. After blitzing through the notes again, I thought I’d have a go at writing a ray tracer. It seemed simple enough at the time, but like most things, the devil is in the details.

The first thing I did was to set up a really simple single-threaded ray tracer that just displayed the color of the surface it hit. This was fairly quick to get up and running once I had written a few supporting classes for the cameras and shapes. It’s not very glamorous, but it’s a start:

Next, I added point lights and directional lights, and wrote a new integrator to calculate direct diffuse lighting. Once you have a function to trace rays around the scene, it’s really easy to add hard-edged shadows. It looks a lot better than the solid color integrator I first used, but it’s still not very impressive.

Here’s the scene with a single directional light and hard-edged shadows:

I wanted to flex the ray tracer a little bit, so an easy next step was to add an ambient occlusion integrator. Initially, I just used a function to generate random uniform rays on the hemisphere around the hit normal, and used the ratio of misses to hits as the occlusion value. I found that this was really pretty noisy, so I tried using the length of the ray hits to weight the occlusion values. This definitely improved things, but it’s still pretty noisy. The obvious way to reduce the noise is to use more rays, but I’d like to find a cheaper way to do this if possible.

Here’s the scene rendered with the ambient occlusion integrator using 4096 rays per hit:

The first time I tried to render this scene using 4096 ambient occlusion rays per pixel, it took about thirteen minutes. I’ve never really used release builds at home and the settings weren’t great, so I tweaked some of the project settings, and defined out asserts. This got the time down to about ten minutes. I’m running these renders on my Macbook Pro, so I have a whole other core just sitting there doing nothing. Switching to using a multithreaded renderer basically sped the renders up by a factor of two.

Combining some of the concepts of the ambient occlusion integrator, and the direct diffuse integrator, I created a multi-bounce diffuse integrator. Like the direct diffuse integrator, it calculates the direct diffuse lighting at the hit point. Additionally though, it uses a Monte Carlo estimator to approximate the diffuse lighting integral over the hemisphere about the normal of the hit point. It can handle any number of bounces of indirect light, but the render time increases exponentially with each bounce added. Like the ambient occlusion integrator, it requires a large number of sample rays to get an acceptable level of noise.

Here’s the scene again with one bounce of indirect light, and 4096 rays per hit:

When a ray misses the scene, it looks up an environment color, which you can see in the background. Most of the indirect rays actually miss the scene, so this background color actually has a huge effect over the look of the scene. I should mention as well that I’m using a really simple tone mapping operator to map the HDR ray tracer values down to the 8 bit per channel texture.

While working on the ray tracer, I would often be playing around with the objects and lights in the scene. I quickly found out that it’s really not very fun to wait for the ray trace to complete before getting some feedback. I can reduce the number of indirect rays to make things quicker, but even at relatively low values, it can take a while to render the final scene.

I had already split the rendering of the scene into 32 by 32 blocks when I switched to a multi-threaded ray tracer, so it was a really simple extension to change the resolution in each of these blocks on the fly. I basically start things off by rendering with each ray covering 32 by 32 pixels, then when that completes, I immediately kick off another render at 16 by 16, and so on. Each successive render takes four times as long as the previous render, so if the 1 by 1 render takes about a minute, then you get the 8 by 8 render in about a second!
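
The arithmetic behind that is easy to sketch. The code below assumes the cost of a pass is proportional to the number of primary rays and nothing else, which is only a rough model; `Pass`, `EstimatePasses` and `TotalSeconds` are names invented for the example.

```cpp
#include <cassert>
#include <vector>

struct Pass
{
    int pixelSize;  // each ray covers pixelSize by pixelSize pixels
    double seconds; // estimated time for this pass alone
};

// Estimated time of each progressive pass, coarsest first, assuming cost is
// proportional to ray count (so each halving of pixelSize costs four times more).
std::vector<Pass> EstimatePasses(int startPixelSize, double finalPassSeconds)
{
    assert(startPixelSize > 0);
    std::vector<Pass> passes;
    for (int size = startPixelSize; size >= 1; size /= 2)
    {
        // A size-by-size pass traces 1 / (size * size) as many rays as the 1 by 1 pass.
        passes.push_back({size, finalPassSeconds / (static_cast<double>(size) * size)});
    }
    return passes;
}

// Cumulative time for the whole progressive render.
double TotalSeconds(const std::vector<Pass>& passes)
{
    double total = 0.0;
    for (const Pass& pass : passes)
        total += pass.seconds;
    return total;
}
```

Under this model all of the coarse passes together add roughly a third on top of the final pass, so a 60 second final pass comes to about 80 seconds in total.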

Here’s the scene rendered using 512 indirect samples, paused at the 4 by 4 resolution:

And here’s the scene at the conclusion of rendering (note that the time is cumulative of all the previous renders):

It’s pretty clear at the 4 by 4 resolution how the render is going to look, and it only took four seconds to get there, whereas the final scene took nearly a minute. The 1 by 1 resolution actually took only 40 seconds of that minute to render, but still, having the feedback within a tenth of the final render time seems worth the extra wait at the end.

That’s basically as far as I got over the past couple of weeks. Like many things I do, there seems to be more to do now than at the beginning. One of the things I’d really like to do is to be able to render out the lighting to radiosity normal maps. This would allow me to combine the static precomputed lighting in my DirectX10 engine. I could also output spherical harmonic coefficients for light probes which would allow me to render dynamic objects using the precomputed lighting.

Well, work starts back up in a couple of days, so the amount of time I can spend on this is going to be limited again, but I’ll post any significant updates. I have another article about the lighting calculation on the way, but it’s competing for my time!

Minor Update to MockItNow

Rory — Sat, 15 Nov 2008 18:01:07 +0000

This is just a quick note to say that I’ve updated MockItNow on Google Code to allow you to define storage types on a per-class basis using the DECLARE_STORAGE_TYPE macro. I did this so that the Mocker can deal with abstract class parameters. Please note that the macro must be declared at global scope because it uses partial template specialization.

I updated the download, and the source. You can see the new test at the end of the file here, and the only other affected file is Storage.h.

Thanks to Lance for pointing this problem out.