Walking through the heap properties in DirectX 12

Indifferent to the difference?

Back in the old days, you never worried about how physical memory was allocated when you were dealing with OpenGL and DirectX 11. The GPU and the video memory were hidden behind the driver so well that you might not even realize they were there. Nowadays we have Vulkan and DirectX 12 (and Metal of course, but…never mind), and the “zero driver-overhead” slogan (not “Make XXX Great Again”, sadly) has become reality on desktop platforms. Along with it came “ohh, I didn’t know we need to handle the synchronization manually” and “ohh, why can’t I bind those resources directly” and so on and on. Of course, the new generation (not that new anymore, actually) of graphics APIs is not meant for casual usage: your hands get dirtier and your head gets dizzier, while a bunch of debug layer warnings and error messages keep popping up. Long story short, if you want something pretty and simple, turn around, rush to modern OpenGL (4.3+) or DirectX 11, and happy coding; if you want something pretty and fast, then stay with me for a while and let’s see what’s going on with the new D3D12 memory model.

The fundamental CPU-GPU communication architecture is quite similar across different machines: you have a CPU chip, you have a GPU chip, you have some memory chips, gotcha! A typical PC with a dedicated graphics card has 2 kinds of memory chips, one we often refer to as the main memory and another as the dedicated graphics card memory, or by their more common (not so strict) names, RAM and VRAM. On other architectures such as game consoles, the main memory and the video memory are the same physical one; we call this kind of memory access model UMA - Uniform Memory Access. In some designs (the PS4 for example) the CPU and GPU are also put on the same chip or closer together to optimize communication performance. You remember paying for that fancy 16GB of DDR4 “memory” when you were crafting your state-of-the-art PC, right? That is the “main” RAM for general purposes, like loading your OS after power-up or holding the elements of your std::vector<T>. But if you also purchased an AMD or NVIDIA dedicated graphics card, you might have noticed on the package box that it carries another few gibibytes of some sort of memory. That’s the VRAM, where the raw texture and mesh data stay while you are playing CS:GO and swearing random Ruglish in front of your screen.

An over-simplified PC

So, if you want to render the nice SWAT soldier mesh in CS:GO, you need to load it from your disk into VRAM and then ask the GPU to schedule some parallel rasterization work to draw it. But unfortunately, you can’t access VRAM directly in your C++ code due to the physical design of the hardware. You can reference a main memory virtual address with semantics like Foo* bar = nullptr in C++, because it is finally compiled into machine instructions like movq $0, 0x108 (it should be binary instruction data actually; for the sake of human readability I use assembly language here) that your CPU can execute. But generally speaking, you can’t expect the same programming experience on the GPU: it is designed for highly parallel computational tasks, so you can’t refer to fine-grained global memory addresses directly (there are always some exceptions, but let’s stay foolish for now). Instead, the start offset of a chunk of raw VRAM data has to be made available to you, so that you can create a context for the GPU to prepare and execute work. If you are familiar with OpenGL or D3D11, you have already used interfaces such as glBindTextures or ID3D11DeviceContext::PSSetShaderResources. These 2 APIs don’t expose VRAM explicitly to developers; instead, you get indirect objects at runtime, like an integer handle in OpenGL or a COM object pointer in D3D11.

A step closer

The GPU is a heavily history-influenced peripheral product; as time goes by, its abilities become more and more general and flexible. As you might know, the arrival of the unified scalar shader architecture and highly data-parallel stream processing made the GPU suitable for almost every kind of parallel computation work, so the only thing lying between the developer and the hardware is the API. The older generation of graphics APIs like OpenGL and DirectX 11 were designed to steer developers toward spending their time on the specific computer graphics tasks they wanted to get their hands on, rather than on low-level hardware details. But experience told us: the more abstraction, the more overhead. So when the clock ticked around 2015 and the latest generation of graphics APIs was released to mass developers like me, a brand new, or I’d rather say “retro”, design philosophy appeared among them: no more pre-defined “texture” or “mesh” or “constant buffer” object models; instead we get some new but lower-level objects such as “resources”, “buffers” and “commands”.

Honestly speaking, it’s a little bit painful to move your programming mindset from the OpenGL/D3D11 era to Vulkan/D3D12. It’s quite like a 3-year-old kid who used to ride his cute tiny bike with training wheels and now needs to drive a 6-speed manual 4WD car. Previously you called a glGen* or ID3D11Device::Create* interface and got the resource handle back within a few milliseconds at most. Now you can’t even “invoke” functions to let the GPU do this work! But wait, could we ever actually ask the GPU to allocate a VRAM range for us and put some nice AK-47 textures inside? It was just the graphics card vendor’s implementation handling the underlying dirty business for us: all the synchronization of CPU-GPU communication, all the VRAM allocation and management, all the resource binding details; we had never even taken a glimpse at them before! But it’s not as bad as I exaggerated. You just have to take care of additional steps that you were not obliged to do previously, and if you succeed you’ll get not only more code in your repo but also a tremendous performance boost in your applications.

Let’s forget about the API problems for a few minutes and take a look back at the hardware architecture, to better understand the physical structure that triggered the API “revolution”. The actual memory management relies on the hardware memory bus (it’s part of the I/O bridge) and the MMU - Memory Management Unit. They work together to transfer data between the processor, the different external adapters and RAM, and to map virtual memory addresses to physical ones. So when you want to load some data from your HDD into RAM, the data travels through the I/O bridge to the CPU, and after some processing it is stored into RAM. If you have a performance-focused attitude when writing code, you may wonder whether there is any optimization for use cases like simply loading an executable binary file into RAM, which doesn’t require any additional processing of the data itself. And yes, we have DMA - Direct Memory Access! With DMA the data doesn’t need to travel through the CPU anymore; instead it is loaded directly from the HDD into RAM.

A closer look at the Processor-Memory communication model

As we could imagine, the CPU and the GPU can have their own RAMs, MMUs and memory buses, so they can execute and load-store data in their RAMs individually. That’s perfect, two kingdoms living peacefully with each other. But the problems emerge as soon as they start to communicate: data needs to be transferred from the CPU side to the GPU side or vice versa, and we need to build a “highway” for it. One of the “highway” hardware communication protocols widely used today is PCI-E. I’ll omit the details and focus on what we care about here. It’s basically another bus-like design that lets us transfer data between different adapters, such as a dedicated graphics card and main memory. With its help, we can almost freely (sadly the highway still charges a toll, it’s not a freeway yet) write something that utilizes the CPU and GPU together.

Too many bridges (omit MMU and Memory Bus for simplification)!

That’s a few too many bridges, isn’t it? Remember that I briefly introduced a memory architecture called UMA earlier; it basically looks like merging RAM and VRAM together. Its design requires the chip and memory manufacturers to produce such products, and so far I’ve never seen one on the consumer hardware market, so we can’t craft it by ourselves. But still, if you have an Xbox One or a PS4, you’ve already enjoyed the benefits of UMA.

UMA, Umami

Heap creation

So now it’s time to open your favorite IDE and #include some headers. In D3D12, all resources reside inside explicitly specified memory pools, and the responsibility for managing those memory pools now belongs to the developer. This is how the interface

HRESULT ID3D12Device::CreateHeap(
  const D3D12_HEAP_DESC *pDesc,
  REFIID                riid,
  void                  **ppvHeap
);

comes. If you’re familiar with D3D11 or other COM-style Windows APIs, you can easily understand the function signature style. It combines a pointer to a description structure instance, the GUID of a COM interface, and a pointer that receives the address of the created object. The return value of the function is the execution result.

Now let’s take a look at the description structure:

typedef struct D3D12_HEAP_DESC {
  UINT64                SizeInBytes;
  D3D12_HEAP_PROPERTIES Properties;
  UINT64                Alignment;
  D3D12_HEAP_FLAGS      Flags;
} D3D12_HEAP_DESC;

It apparently follows the consistent code style of the D3D12 API, and here we get another property structure to fill in:

typedef struct D3D12_HEAP_PROPERTIES {
  D3D12_HEAP_TYPE         Type;
  D3D12_CPU_PAGE_PROPERTY CPUPageProperty;
  D3D12_MEMORY_POOL       MemoryPoolPreference;
  UINT                    CreationNodeMask;
  UINT                    VisibleNodeMask;
} D3D12_HEAP_PROPERTIES;

This structure informs the device which kind of physical memory the heap should refer to. Since the D3D12 documentation is comprehensible enough, I’d rather not repeat too many things that are already listed there. When D3D12_HEAP_TYPE Type is not D3D12_HEAP_TYPE_CUSTOM, D3D12_CPU_PAGE_PROPERTY CPUPageProperty should always be D3D12_CPU_PAGE_PROPERTY_UNKNOWN, because the CPU accessibility of the heap is already indicated by the D3D12_HEAP_TYPE, so you shouldn’t repeat the information. For a similar reason, D3D12_MEMORY_POOL MemoryPoolPreference should always be D3D12_MEMORY_POOL_UNKNOWN when D3D12_HEAP_TYPE Type is not D3D12_HEAP_TYPE_CUSTOM.

On a UMA architecture there is only one physical memory pool, shared by both the CPU and the GPU; the most common case is that you got an Xbox One and started to write some D3D12 games on it. In that case only D3D12_MEMORY_POOL_L0 is available, so we don’t need to take care of it at all.

Most desktop PCs with a dedicated graphics card are NUMA memory architectures (although in recent years something like AMD’s hUMA appeared and went away); in that case D3D12_MEMORY_POOL_L0 is the RAM and D3D12_MEMORY_POOL_L1 is the VRAM.

So if we set the heap type to D3D12_HEAP_TYPE_CUSTOM, we get more flexible control over the heap configuration. The chart below lists what the different combinations of D3D12_CPU_PAGE_PROPERTY and D3D12_MEMORY_POOL finally look like on NUMA architectures.

| | D3D12_CPU_PAGE_PROPERTY_NOT_AVAILABLE | D3D12_CPU_PAGE_PROPERTY_WRITE_COMBINE | D3D12_CPU_PAGE_PROPERTY_WRITE_BACK |
|---|---|---|---|
| L0 | Similar to D3D12_HEAP_TYPE_DEFAULT, a GPU access-only RAM (but a somewhat nonsensical configuration for common use cases) | Similar to D3D12_HEAP_TYPE_UPLOAD: it is uncached for CPU reads, so the reading result won’t always stay coherent, but writes are faster because the memory ordering is now trivial and irrelevant; perfect for the GPU to read | Similar to D3D12_HEAP_TYPE_READBACK: all GPU writes are cached, and CPU reads get a coherent and consistent result |
| L1 | Similar to D3D12_HEAP_TYPE_DEFAULT, a GPU access-only VRAM | Invalid, the CPU can’t access VRAM directly | Invalid, the CPU can’t access VRAM directly |

It looks like we don’t need a custom heap property structure on NUMA architectures (or in the single-engine/single-adapter case): all the possible heap types are already provided by the pre-defined types, and there is not much room left for us to maneuver toward some advanced optimization. But if your application wants better customization for all the possible hardware it would run on, then custom heap properties are still worth investigating.
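To make the rule and the NUMA chart above a bit more tangible, here is a minimal sketch that encodes them as two plain functions. The enums and the struct are hypothetical stand-ins for D3D12_HEAP_TYPE, D3D12_CPU_PAGE_PROPERTY, D3D12_MEMORY_POOL and D3D12_HEAP_PROPERTIES so that it compiles without d3d12.h, and the requirement that a custom heap must spell out both fields is my assumption rather than something stated above.

```cpp
#include <cassert>

// Hypothetical stand-ins for the D3D12 enums, so the sketch needs no d3d12.h.
enum class HeapType { Default, Upload, Readback, Custom };
enum class CpuPageProperty { Unknown, NotAvailable, WriteCombine, WriteBack };
enum class MemoryPool { Unknown, L0, L1 };

struct HeapPropertiesSketch {
    HeapType        type;
    CpuPageProperty cpuPageProperty;
    MemoryPool      memoryPoolPreference;
};

// What a custom combination resembles on a NUMA adapter, following the chart.
enum class NumaEquivalent { Default, Upload, Readback, Invalid };

// Pre-defined heap types must leave both fields UNKNOWN.
bool IsConsistent(const HeapPropertiesSketch& p) {
    const bool cpuKnown  = p.cpuPageProperty != CpuPageProperty::Unknown;
    const bool poolKnown = p.memoryPoolPreference != MemoryPool::Unknown;
    if (p.type != HeapType::Custom) {
        return !cpuKnown && !poolKnown;
    }
    // Assumption: a custom heap has to specify both fields explicitly.
    return cpuKnown && poolKnown;
}

// Looks up the chart cell for a custom heap on a NUMA adapter.
NumaEquivalent ClassifyCustomOnNuma(const HeapPropertiesSketch& p) {
    assert(p.type == HeapType::Custom);
    if (!IsConsistent(p)) {
        return NumaEquivalent::Invalid;
    }
    if (p.memoryPoolPreference == MemoryPool::L1) {
        // VRAM row: only the GPU-only configuration is usable.
        return p.cpuPageProperty == CpuPageProperty::NotAvailable
                   ? NumaEquivalent::Default
                   : NumaEquivalent::Invalid;
    }
    // RAM row.
    switch (p.cpuPageProperty) {
        case CpuPageProperty::NotAvailable: return NumaEquivalent::Default;
        case CpuPageProperty::WriteCombine: return NumaEquivalent::Upload;
        case CpuPageProperty::WriteBack:    return NumaEquivalent::Readback;
        default:                            return NumaEquivalent::Invalid;
    }
}
```

A real application would of course fill in an actual D3D12_HEAP_PROPERTIES and let the debug layer complain; the point here is only that the chart is small enough to be a lookup.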

The Processor-Memory model in D3D12

And finally, we have a misc flag mask to indicate the detailed usage of the heap:

typedef enum D3D12_HEAP_FLAGS {
} ;

Depending on the specific D3D12_RESOURCE_HEAP_TIER that the hardware supports, certain D3D12_HEAP_FLAGS are not allowed to be used alone or combined together. The details are well documented on the official website, so I’ll not discuss them here. Because some of the enumerators are just aliases of others, there are fewer actual heap flags than defined names, and the chart below lists different usage cases and the corresponding flags.

| Usage case | Tier1 | Tier2 |
|---|---|---|
| Swap-chain surface only | D3D12_HEAP_FLAG_ALLOW_DISPLAY | Same as Tier1 |
| Shared heap (multi-process) | D3D12_HEAP_FLAG_SHARED | Same as Tier1 |
| Shared heap (multi-adapter) | D3D12_HEAP_FLAG_SHARED_CROSS_ADAPTER | Same as Tier1 |
| Memory write tracking | D3D12_HEAP_FLAG_ALLOW_WRITE_WATCH | Same as Tier1 |
| Atomic primitive | D3D12_HEAP_FLAG_ALLOW_SHADER_ATOMICS | Same as Tier1 |

As you can see above, the only meaningful difference between Tier1 and Tier2 here is that Tier2 supports the D3D12_HEAP_FLAG_ALLOW_ALL_BUFFERS_AND_TEXTURES flag, so we can put all the common resources into one heap. It again depends on the specific task you would like to accomplish: sometimes you want an all-in-one heap, sometimes it’s better to separate resources into different heaps by usage case.
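As a rough illustration of the alias relationship and the tier difference, the sketch below picks the resource-category flags for a heap. The constants are local copies whose values mirror the D3D12_HEAP_FLAG_DENY_* and D3D12_HEAP_FLAG_ALLOW_* enumerators in d3d12.h; the enum names, ResourceCategory, HeapTier and HeapFlagsFor are hypothetical helpers of mine, not part of the API.

```cpp
#include <cassert>
#include <cstdint>

// Local copies mirroring D3D12_HEAP_FLAGS values; the names are hypothetical.
enum HeapFlagsSketch : std::uint32_t {
    kNone                = 0x0,
    kDenyBuffers         = 0x4,
    kDenyRtDsTextures    = 0x40,
    kDenyNonRtDsTextures = 0x80,
    // The ALLOW_* names are only aliases built from the DENY_* bits.
    kAllowAllBuffersAndTextures = 0x0,
    kAllowOnlyBuffers           = kDenyRtDsTextures | kDenyNonRtDsTextures,
    kAllowOnlyNonRtDsTextures   = kDenyBuffers | kDenyRtDsTextures,
    kAllowOnlyRtDsTextures      = kDenyBuffers | kDenyNonRtDsTextures
};

enum class ResourceCategory { Buffer, NonRtDsTexture, RtDsTexture };
enum class HeapTier { Tier1, Tier2 };

// Tier2 may mix every category in one heap; Tier1 heaps hold a single category.
std::uint32_t HeapFlagsFor(HeapTier tier, ResourceCategory category) {
    if (tier == HeapTier::Tier2) {
        return kAllowAllBuffersAndTextures;
    }
    switch (category) {
        case ResourceCategory::Buffer:         return kAllowOnlyBuffers;
        case ResourceCategory::NonRtDsTexture: return kAllowOnlyNonRtDsTextures;
        case ResourceCategory::RtDsTexture:    return kAllowOnlyRtDsTextures;
    }
    assert(false && "unknown resource category");
    return kNone;
}
```

In a real renderer the tier would come from ID3D12Device::CheckFeatureSupport, and you may still prefer separate heaps on Tier2 when the usage patterns differ.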

Resource creation

After you have created a heap successfully, you can start to create resources inside it. There are 3 different ways to create a resource:

  1. Create a resource that only has a virtual address range; we have to map it to physical memory in an already created heap manually later. ID3D12Device::CreateReservedResource is the interface for this task;
  2. Create a resource that has both a virtual address and mapped physical memory inside an already created heap; the most commonly used resources are of this type. ID3D12Device::CreatePlacedResource is the interface for this task;
  3. Create a placed resource and an implicit heap at the same time. ID3D12Device::CreateCommittedResource is the interface for this task.

If you don’t want to manage heap memory manually at all, you can choose committed resources at some cost in performance, but naturally it’s not a good idea to rely heavily on committed resources in production code (unless you’re lazy like me and don’t want to write more code in show-case projects). The more mature choice is to use placed resources: since we can already create heaps, the only thing left to do is to design a heap memory management module with some efficient strategies. You can reuse as many design patterns and architectures as you like from your experience of implementing main RAM heap memory management systems (still calling malloc() inside a 16ms frame? No way!). A ring buffer or a double buffer for the Upload heap, some linked list for the Default heap, or whatever; there are no limits to the imagination, just analyze your application’s requirements and figure out a suitable solution (but don’t write a messy GC system for it :). There shouldn’t be too many choices, since in most D3D12 applications such as games, most of the resources are CPU write-once, and the rest are dynamic buffers that don’t occupy much space but update frequently.
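One possible shape of the double-buffer idea for an Upload heap is sketched below. It only hands out byte offsets (the kind of value you would later pass as the heap offset of a placed resource); the class name, kInvalidOffset and the frame protocol are all assumptions of mine, and the fence that must guarantee the GPU has finished with a half before it is reused is deliberately left out.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kInvalidOffset = ~0ull;

// Rounds value up to the next multiple of a power-of-two alignment.
constexpr std::uint64_t AlignUp(std::uint64_t value, std::uint64_t alignment) {
    return (value + alignment - 1) & ~(alignment - 1);
}

// Hypothetical sub-allocator: the heap is split in two halves, and each frame
// bump-allocates from one half while the GPU may still read the other one.
class DoubleBufferedUploadSketch {
public:
    explicit DoubleBufferedUploadSketch(std::uint64_t heapSizeInBytes)
        : halfSize_(heapSizeInBytes / 2) {}

    // Switches halves. The caller must first wait on a fence so that the GPU
    // is done with the half being recycled (not modeled here).
    void BeginFrame() {
        current_ ^= 1u;
        used_ = 0;
    }

    // Returns a byte offset into the heap, or kInvalidOffset when the current
    // half has no room left. For example, alignment is 256 for constant
    // buffer data.
    std::uint64_t Allocate(std::uint64_t size, std::uint64_t alignment) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        if (size > halfSize_) {
            return kInvalidOffset;
        }
        const std::uint64_t base  = current_ * halfSize_;
        const std::uint64_t start = AlignUp(base + used_, alignment);
        if (start + size > base + halfSize_) {
            return kInvalidOffset;
        }
        used_ = start + size - base;
        return start;
    }

private:
    std::uint64_t halfSize_;
    std::uint64_t used_    = 0;
    std::uint64_t current_ = 0;
};
```

A ring buffer trades this simplicity for finer-grained reuse; which one fits depends on how bursty your per-frame uploads are.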

The more advanced situations rely on a tremendous memory size, such as mega-textures (maybe you need a 64x64 km^2 terrain albedo texture?) or sparse-tree volume textures (maybe you need a voxel-cone-traced irradiance volume?), which would easily exceed the physical VRAM address range, or whose actual texture size is beyond the maximum the hardware supports. In such cases a dynamic virtual memory address mapping technique is necessary. Developers tended to implement a software cache solution for this problem in the past, because the APIs didn’t provide any reliable functionality at that time (before D3D11.2 and OpenGL 4.4, which started to support tiled/sparse textures). The reserved resources in D3D12 are the fresh one-for-all solution today; they inherit the design of the tiled-resources architecture in D3D11 but also provide more flexibility. But still, it depends on hardware support: when you wonder how to fit your elegant and complex SVOGI volume texture into VRAM, it’s better to first query D3D12_TILED_RESOURCES_TIER and see whether the target hardware supports tiled resources.
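To get a feeling for the numbers involved, here is a small back-of-the-envelope sketch that counts the tiles of a 2D reserved resource. The 64 KB tile size matches D3D12_TILED_RESOURCE_TILE_SIZE_IN_BYTES; the default 128x128 texel tile shape is an assumption that holds for 32-bit-per-texel 2D formats, and real code should rather ask ID3D12Device::GetResourceTiling. TileGridSketch and TileGridFor are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Size of one tile in bytes (64 KB).
constexpr std::uint64_t kTileSizeInBytes = 65536;

constexpr std::uint32_t CeilDiv(std::uint32_t a, std::uint32_t b) {
    return (a + b - 1) / b;
}

struct TileGridSketch {
    std::uint32_t tilesX;
    std::uint32_t tilesY;

    std::uint64_t TotalTiles() const {
        return static_cast<std::uint64_t>(tilesX) * tilesY;
    }
    // Physical memory needed if every tile were mapped at once.
    std::uint64_t TotalBytes() const {
        return TotalTiles() * kTileSizeInBytes;
    }
};

// Assumption: 128x128 texels per tile, i.e. a 32-bit-per-texel 2D texture.
TileGridSketch TileGridFor(std::uint32_t widthInTexels,
                           std::uint32_t heightInTexels,
                           std::uint32_t tileWidth = 128,
                           std::uint32_t tileHeight = 128) {
    assert(tileWidth != 0 && tileHeight != 0);
    return TileGridSketch{CeilDiv(widthInTexels, tileWidth),
                          CeilDiv(heightInTexels, tileHeight)};
}
```

A 16384x16384 RGBA8 texture comes out at 16384 tiles, or 1 GiB if fully mapped, while keeping only 256 tiles resident costs 16 MiB of heap space; that gap is exactly what reserved resources let you exploit.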

Published Sep 18, 2019