Real-time wavelet decomposition using compute shader

Check out this post to have a short introduction about Haar wavelets.

In order to do IBL using wavelet triple product integral, we need to represent env light, BRDF, and per-vertex visibility by wavelet basis functions, meaning we need to do wavelet transform on these signals.

For BRDF and visibility in static scenes, we could precompute their wavelet transform using CPU, in offline; for lighting we have to compute the wavelet transform in run-time, because the lighting is dynamically-sampled in each frame.

OpenGL compute shader gives me good incentives to implement 2D Haar wavelet transform in it. Firstly it is an independent compute stage outside graphics pipeline, so I don’t need to supply primitives or setup those graphics pipeline states that are unrelated to the core algorithm, secondly it provides shared memory access for inter-thread communication, just like CUDA or OpenCL, and finally, unlike CUDA or OpenCL, it can access OpenGL buffer objects just like other GLSL shaders.

In my implementation, in each frame the light is firstly sampled into an octahedral parameterization using fragment shader, and non-standard wavelet decomposition is performed using compute shader.

Here is the result of 1-level decomposition on the dynamically-sampled light:

(Note that right now the 3D scene you see is still rendered using pre-filtered env map)

And the following is the full-level decomposition:

Short intro. to Haar wavelet transform

Short introduction to Haar Wavelets:

Haar basis was developed by Alfred Haar in 1909. It is the simplest yet the most widely adopted wavelet basis. For clarity, we first start from one dimensional Haar basis. Given a 1D discrete signal, we can represent it by a combination of average and difference (or detail), like the following example:

1D average and difference

The functions in the right hand side  can be expressed as

5 \times \phi(x) + 4 \times \psi(x) ,

basis def

\phi(x) and \psi(x) are called mother scale function and mother wavelet function, respectively. Other basis functions could be derived by scaling and translating them. Here are some examples:

basis example

For a 1-D discrete signal with length N (N>2), we can repeat this average & difference  process to obtain 1 scale coefficient and N-1 detail coefficients.

The following is an example of transforming a 1-D discrete signal [ 9 7 3 5 ]

size4 example

Thus the wavelet transform of [ 9 7 3 5 ] is given by  [ 6 2 1 -1].

Less significant detail coefficients could be discarded for data compression purposes, like the following image shows:

1d compress example

To perform 2D Haar wavelet transform, we can simply do full 1D transform along one dimension and then do another full 1D transform along the other dimension. This is called standard decomposition.


Or, we can perform 1D Haar transform alternatively along different dimensions, in each level. This is called nonstandard decomposition.


Non-standard decomposition has some desired properties, such fewer assignments are required than in standard decomposition, and square local supports for every basis functions.

2D haar basis

One application of 2D Haar wavelet is image compression. Here is an image reconstruction example using different numbers of coefficients:

2D haar compression

It’s an example produced by my own implementation, and the coefficient pickup scheme is naive (just pick those with large magnitude), so it does not yield best compression quality.


1. Eric J. Stollnitz Tony D. DeRose David H. Salesin, Wavelets for Computer Graphics: A Primer.

Project cubemap to octahedral map

Because compressing a spherical function sampled in cubemap form requires us to do 2D Haar Wavelet transform 6 times, one for each cubemap face, to simplify this transformation step, Wang et al. [1] propose using a single 2D octahedral map to represent a spherical function. So by using octahedral map to store our per-vertex visibility, we only need to do Haar Wavelet transform (# of vertices) times, whereas cubemap representation requires (6 x # of vertices) times.


In my implementation, I simply use the converting method provided in [2] and [3] to convert 2D coordinates in an octahedral map to 3D direction vectors.

The following screenshot shows the visibility cubemap of a vertex, and the octahedral map converted from that cubemap.


It’s not easy to imagine what the transformation looks like on a black and white image; I also made a screenshot showing the converted environment cubemap.



1. Rui Wang, Ren Ng, David Luebke, Greg Humphreys, Efficient Wavelet Rotation for Environment Map Rendering, Eurographics Symposium on Rendering (2006)

2.  Cigolle, Donow, Evangelakos, Mara, McGuire, Meyer, A Survey of Efficient Representations for Independent Unit Vectors, Journal of Computer Graphics Techniques (JCGT), vol. 3, no. 2, 1-30, 2014

3. Quirin Meyer, Jochen Süßmuth, Gerd Sußner, Marc Stamminger, and Günther Greiner. 2010. On floating-point normal vectors. In Proceedings of the 21st Eurographics conference on Rendering

Per-vertex visibility cubemaps generation

Image-based lighting (IBL) using triple-product integral technique factors lighting, BRDF, and visibility into three independently sampled functions. The rendering equation is evaluated per-vertex, that is, doing triple-product integral of those three functions on each vertex position. Usually one Lighting function, represented in the form of an environment map, is sampled at run time and applied to all shading vertices, and by utilizing precomputed wavelet-rotation matrices (see my previous post, or the original paper), we only need one copy of BRDF function defined in local frame for each material. Only visibility have to be sampled individually on each vertex position. Graphics hardware could be used to help us generate these visibility “cubemap” on each vertex position. Like the traditional old way of generating a reflection environment map in a certain position, to generate the visibility cubemap for a vertex we could position the camera in that vertex’s position, and render the scene 6 times facing 6 different directions. (+x,-x,+y,-y,+z,-z) We can improve the efficiency of the creation of these per-vertex visibility cubemaps, by utilizing the newer GPU and OpenGL capabilities. First, we can use layered rendering to render the 6 cubemap faces in one pass. By attaching a cubemap texture to the FBO, we can, in geometry shader, emit primitives to all cubemap faces. Here is a simple geometry shader code:

void main()
for( int i = 0; i < 6; ++i )
 gl_Layer = i;
 gl_Position = u_mvps[i] * gl_in[0].gl_Position;

 gl_Position = u_mvps[i] * gl_in[1].gl_Position;

 gl_Position = u_mvps[i] * gl_in[2].gl_Position;


Built-in output variable gl_Layer controls which layer (or cubemap face, if the render target is a cubemap texture) the primitive will be sent to. Furthermore, we could potentially increase the performance by using instancing geometry shader. Instancing makes the geometry shader execute multiple time on the same input primitive. Here is a sample code:

layout ( triangles, invocations=6 ) in; 
layout ( triangle_strip, max_vertices = 3 ) out;


void main()
 gl_Layer = gl_InvocationID;

 gl_Position = u_mvps[gl_InvocationID] * gl_in[0].gl_Position;

gl_Position = u_mvps[gl_InvocationID] * gl_in[1].gl_Position;

gl_Position = u_mvps[gl_InvocationID] * gl_in[2].gl_Position;



You can see in the input layout qualifier we specify the GS to be invoked 6 times for each primitive. The invocation id (0~5) could be retrieved via gl_InvocationID. I say “potentially increase the performance” because if the number of input primitives is far larger than that of shader cores, instancing seems not that helpful. I need to do a timing test see if there is difference. Finally, for per-vertex visibility cubemaps generation, since we know all the rendering parameters beforehand, we can store them in a shader storage buffer or texture buffer, and use instancing rendering to eliminate the redundant API calls. For a scene containing 150000 vertices, using naive rendering loops requires 150000 draw API calls. Ideally, just one API draw call is ever needed when instanced rendering is used. However, because we need to retain each vertex’s visibility map, we need to switch FBO or FBO attachments within the rendering iterations. Allocating 150000 textures is also not a good idea. Because 6x32x32 resolution is usually enough for a vertex’s visibility cubemap, we can somewhat alleviate this problem by using large-dimension, like 6x8192x8192, textures and groups a number of visibility fields into one large texture. And we do instancing rendering with these large textures serving as render target. The number of instancing draw API calls is that of the large textures.

In my test scene, which has 175000 vertices, the visibility cubemap generation only takes a few seconds in a machine featuring a nVidia GTX670;  I haven’t precisely measured it though.

Visibility field for a vertex

Visibility field for a vertex

Wavelet rotation for image-based lighting, and per-vertex visibility precomputation using OpenGL 4.x features

The triple product wavelet integral IBL proposed by Ng et al. [1] enables changing lighting as well as viewing directions by separately precomputing BRDF and visibility functions. In their implementation, BRDF, visibility, and lighting functions are all sampled in the same global frame. Because BRDFs are usually defined in local frame, Ng et al. precompute BRDFs for a set of orientations (defined in global frame), that is, each single material BRDF has multiple copies for different orientations. During runtime, when shading a vertex, the surface normal is used for choosing and interpolating BRDFs with closest orientations. This approach requires huge memory for storing precomputed BRDFs set.

Wang et al. [2] proposes a computational approach for efficient wavelet rotation. First, they choose octahedral mapping for parameterizing the spherical domain, because it has good uniformity and reduces wavelet rotation to translation in 2D.


Second,  they precompute a wavelet rotation matrix R per orientation (So they get a bunch of rotation matrices). A wavelet rotation matrix is very sparse because of the compact local support of wavelet bases. Quantization could further reduce the number of non-zero elements. If we have 32×32 predefined normal orientations, we got 1024 rotation matrices.

These wavelet rotation matrices are independent of BRDF or lighting or visibility, so they could be computed, stored in disk, and loaded into memory when the application starts. And we only need one BRDF copy for each material.

In runtime, the light function is dynamically sampled and wavelet-transformed, and then rotated by those precomputed wavelet rotation matrices to obtain different rotated versions of lighting functions (or more specifically, wavelet coefficients).

These rotated versions of  lighting wavelet coefficients will then be used for performing shading in each vertex, that is, doing dot-products with BRDFs defined in local frame.

Wang et al. [2] doesn’t take into account visibility, so they only deal with dot-products when doing shading. They perform shading entirely on CPU; SSE instructions is enough to achieve near-interactive frame rate.


In order to incorporate visibility and perform wavelet triple product integral, we have to precompute the visibility function on each vertex, that is, sampling visibility in spatial domain and wavelet-transforming it. With OpenGL 4.x features, we might be able to shrink the precomputation time to acceptable levels.

Back in mid 2000 we had to make 6 drawing API calls per vertex position to generate visibility cubemaps for each vertex; today we could use multi-draw indirect to greatly reduce the number of CPU making expensive API calls.

Second, layered rendering with geometry shader make it possible render all 6 cubemap faces in one draw.

Third, bindless textures reduces the overhead of repeatedly binding/unbinding those visibility textures when doing wavelet transform on GPU.


1. Ren Ng, Ravi Ramamoorthi, Pat Hanrahan, Triple Product Wavelet Integrals for All-Frequency Relighting, Siggraph 2004

2. Rui Wang, Ren Ng, David Luebke, Greg Humphreys, Efficient Wavelet Rotation for Environment Map Rendering, Eurographics Symposium on Rendering (2006)

Prefiltered Environment Maps On Glossy Materials

For Phong BRDF materials, we can do convolution on each environment map texel over its surrounding texels covered by a designated specular lobe. We can store different prefiltered environment maps corresponding to different specular powers in a mipmap chain. Mipmap level 0 contains the envmap w.r.t. highest specular power, mipmap level 1 contains the envmap w.r.t. 2nd highest specular power, and so forth.

Again we can use cubemapgen to help us generate this multilevel cubemap texture.

In a GLSL shader, we use reflective vector to sample the environment map, as opposed to the use of normal vector in irradiance maps for Lambertian surfaces. There are different ways to derive the desired mipmap level, based on how the mipmap chain is generated. For instance, cubemapgen offers a method by which you specify the maximum specular power (mipmap level 0), and the decreasing ratio for the subsequent mipmap levels. Then, the desired mipmap level could be derived using a statement like the following:

 float mipmapLevel = log( materialSpecular/ MAX_SPECULAR_POWER) / log(POWER_DEC_RATIO); 

Then we can sample the texture using texturLod, which allows us to specify the desired mipmap level.

To enable smooth transition, we also have to enable linear filtering between mipmap levels for this cubemap texture.

The following screenshots are my rendering results after adding Phong specular contribution to the final shading.

Specular power 10:


Specular power 60:



1. Jan Kautz, Pere-Pau Vázquez, Wolfgang Heidrich, and Hans-Peter Seidel. 2000. Unified Approach to Prefiltered Environment Maps. In Proceedings of the Eurographics Workshop on Rendering Techniques 2000