Real-time wavelet decomposition using compute shader

Check out this post for a short introduction to Haar wavelets.

In order to do IBL using wavelet triple-product integrals, we need to represent the environment light, the BRDF, and per-vertex visibility in a wavelet basis, which means performing a wavelet transform on each of these signals.

For the BRDF and for visibility in static scenes, we can precompute the wavelet transforms offline on the CPU; for lighting we have to compute the wavelet transform at run time, because the light is dynamically sampled every frame.

The OpenGL compute shader is a good fit for implementing the 2D Haar wavelet transform. First, it is an independent compute stage outside the graphics pipeline, so I don't need to supply primitives or set up graphics pipeline state that is unrelated to the core algorithm. Second, it provides shared memory for inter-thread communication, just like CUDA or OpenCL. Finally, unlike CUDA or OpenCL, it can access OpenGL buffer objects just like any other GLSL shader.

In my implementation, in each frame the light is first sampled into an octahedral parameterization using a fragment shader, and then a non-standard wavelet decomposition is performed using a compute shader.
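To give a feel for the compute-shader part, here is a minimal sketch of one decomposition level. It omits the shared-memory optimization, and the image bindings, uniform name, and single-channel format are assumptions made for this sketch:

#version 430

// One level of the non-standard 2D Haar decomposition. Each invocation
// consumes one 2x2 block of the source image and emits one average plus
// three detail coefficients (one quadrant holds the averages, the other
// three quadrants hold the details).
layout(local_size_x = 8, local_size_y = 8) in;
layout(r32f, binding = 0) uniform readonly  image2D u_src;
layout(r32f, binding = 1) uniform writeonly image2D u_dst;
uniform int u_size;   // width/height of the region being decomposed

void main()
{
    ivec2 id = ivec2(gl_GlobalInvocationID.xy);
    int   h  = u_size / 2;
    if (id.x >= h || id.y >= h)
        return;

    ivec2 p = id * 2;
    float a = imageLoad(u_src, p).r;
    float b = imageLoad(u_src, p + ivec2(1, 0)).r;
    float c = imageLoad(u_src, p + ivec2(0, 1)).r;
    float d = imageLoad(u_src, p + ivec2(1, 1)).r;

    // Unnormalized Haar: average plus horizontal/vertical/diagonal details.
    float avg = (a + b + c + d) * 0.25;
    float dh  = (a - b + c - d) * 0.25;
    float dv  = (a + b - c - d) * 0.25;
    float dd  = (a - b - c + d) * 0.25;

    imageStore(u_dst, id,               vec4(avg));
    imageStore(u_dst, id + ivec2(h, 0), vec4(dh));
    imageStore(u_dst, id + ivec2(0, h), vec4(dv));
    imageStore(u_dst, id + ivec2(h, h), vec4(dd));
}

The host dispatches this once per level with u_size halved each time, re-processing only the average quadrant; the detail quadrants written by earlier levels have to be carried over when ping-ponging between the two images, a bookkeeping detail omitted here.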

Here is the result of 1-level decomposition on the dynamically-sampled light:

(Note that the 3D scene you see here is still rendered using a pre-filtered environment map.)

And the following is the full-level decomposition:

Short intro. to Haar wavelet transform

The Haar basis was developed by Alfréd Haar in 1909. It is the simplest and one of the most widely adopted wavelet bases. For clarity, we start with the one-dimensional Haar basis. Given a 1D discrete signal, we can represent it by a combination of an average and a difference (or detail), as in the following example:

1D average and difference

The functions on the right-hand side can be expressed as

5 \times \phi(x) + 4 \times \psi(x) ,

basis def

\phi(x) and \psi(x) are called the mother scaling function and the mother wavelet function, respectively. Other basis functions can be derived by scaling and translating them. Here are some examples:

basis example

For a 1D discrete signal of length N (with N a power of two), we can repeat this average-and-difference process to obtain 1 scaling coefficient and N-1 detail coefficients.

The following is an example of transforming the 1D discrete signal [ 9 7 3 5 ]:

size4 example
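Written out, the two averaging-and-differencing passes are:

[\,9\ \ 7\ \ 3\ \ 5\,] \rightarrow \text{averages } [\,\tfrac{9+7}{2}\ \ \tfrac{3+5}{2}\,] = [\,8\ \ 4\,],\quad \text{details } [\,\tfrac{9-7}{2}\ \ \tfrac{3-5}{2}\,] = [\,1\ \ {-1}\,]

[\,8\ \ 4\,] \rightarrow \text{average } [\,\tfrac{8+4}{2}\,] = [\,6\,],\quad \text{detail } [\,\tfrac{8-4}{2}\,] = [\,2\,]

Concatenating the final average with the details, from coarse to fine, gives the result below.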

Thus the wavelet transform of [ 9 7 3 5 ] is given by  [ 6 2 1 -1].

Less significant detail coefficients can be discarded for data compression, as the following image shows:

1d compress example

To perform a 2D Haar wavelet transform, we can simply do a full 1D transform along one dimension and then another full 1D transform along the other dimension. This is called standard decomposition.

standardCompose

Alternatively, we can alternate 1D Haar transforms between the two dimensions at each level. This is called non-standard decomposition.

nonstandardDecomp

Non-standard decomposition has some desirable properties: it requires fewer assignments than standard decomposition, and every basis function has a square local support.

2D haar basis

One application of the 2D Haar wavelet transform is image compression. Here is an image-reconstruction example using different numbers of coefficients:

2D haar compression

This example was produced by my own implementation, and the coefficient-selection scheme is naive (just pick those with the largest magnitudes), so it does not yield the best compression quality.

References:

1. Eric J. Stollnitz, Tony D. DeRose, David H. Salesin, Wavelets for Computer Graphics: A Primer.

Project cubemap to octahedral map

Compressing a spherical function sampled in cubemap form requires six 2D Haar wavelet transforms, one per cubemap face. To simplify this step, Wang et al. [1] propose using a single 2D octahedral map to represent a spherical function. By using octahedral maps to store per-vertex visibility, we only need to perform the Haar wavelet transform (# of vertices) times, whereas the cubemap representation requires (6 x # of vertices) transforms.

octahedral

In my implementation, I simply use the conversion method described in [2] and [3] to map 2D coordinates in an octahedral map to 3D direction vectors.
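For reference, the decode direction (2D octahedral coordinate to unit vector) is essentially the octDecode listing surveyed in [2]; a GLSL version looks roughly like this, with the remapping of texel coordinates from [0,1] to [-1,1] left to the caller:

// Map a 2D point e in [-1,1]^2 on the octahedral map to a unit direction.
vec2 signNotZero(vec2 v)
{
    return vec2(v.x >= 0.0 ? 1.0 : -1.0, v.y >= 0.0 ? 1.0 : -1.0);
}

vec3 octDecode(vec2 e)
{
    vec3 v = vec3(e.x, e.y, 1.0 - abs(e.x) - abs(e.y));
    if (v.z < 0.0)
        v.xy = (1.0 - abs(v.yx)) * signNotZero(v.xy);   // fold the lower hemisphere
    return normalize(v);
}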

The following screenshot shows the visibility cubemap of a vertex, and the octahedral map converted from that cubemap.

oct_visibility

It's not easy to see what the transformation looks like on a black-and-white image, so I also made a screenshot showing a converted environment cubemap.

oct_envmap

References:

1. Rui Wang, Ren Ng, David Luebke, Greg Humphreys, Efficient Wavelet Rotation for Environment Map Rendering, Eurographics Symposium on Rendering (2006)

2.  Cigolle, Donow, Evangelakos, Mara, McGuire, Meyer, A Survey of Efficient Representations for Independent Unit Vectors, Journal of Computer Graphics Techniques (JCGT), vol. 3, no. 2, 1-30, 2014

3. Quirin Meyer, Jochen Süßmuth, Gerd Sußner, Marc Stamminger, and Günther Greiner. 2010. On floating-point normal vectors. In Proceedings of the 21st Eurographics conference on Rendering

Per-vertex visibility cubemap generation

Image-based lighting (IBL) using the triple-product integral technique factors lighting, BRDF, and visibility into three independently sampled functions. The rendering equation is evaluated per vertex, that is, the triple-product integral of those three functions is computed at each vertex position. Usually one lighting function, represented as an environment map, is sampled at run time and applied to all shaded vertices, and by utilizing precomputed wavelet-rotation matrices (see my previous post, or the original paper), we only need one copy of the BRDF function, defined in the local frame, for each material. Only visibility has to be sampled individually at each vertex position.

Graphics hardware can help us generate these visibility "cubemaps" at each vertex position. As in the traditional way of generating a reflection environment map at a given position, to generate the visibility cubemap for a vertex we can place the camera at that vertex's position and render the scene 6 times, facing the 6 directions (+x, -x, +y, -y, +z, -z).

We can make the creation of these per-vertex visibility cubemaps more efficient by utilizing newer GPU and OpenGL capabilities. First, we can use layered rendering to render the 6 cubemap faces in one pass. By attaching a cubemap texture to the FBO, we can emit primitives to all cubemap faces from the geometry shader. Here is a simple geometry shader:


layout(triangles) in;
layout(triangle_strip, max_vertices = 18) out;   // 6 faces x 3 vertices

uniform mat4 u_mvps[6];   // view-projection matrix for each cubemap face

void main()
{
    for (int i = 0; i < 6; ++i)
    {
        // gl_Layer selects the cubemap face this primitive is rendered to.
        gl_Layer = i;

        gl_Position = u_mvps[i] * gl_in[0].gl_Position;
        EmitVertex();

        gl_Position = u_mvps[i] * gl_in[1].gl_Position;
        EmitVertex();

        gl_Position = u_mvps[i] * gl_in[2].gl_Position;
        EmitVertex();

        EndPrimitive();
    }
}

The built-in output variable gl_Layer controls which layer (or cubemap face, if the render target is a cubemap texture) the primitive is sent to. Furthermore, we could potentially increase performance by using geometry shader instancing. Instancing makes the geometry shader execute multiple times on the same input primitive. Here is a sample:


layout(triangles, invocations = 6) in;
layout(triangle_strip, max_vertices = 3) out;

...

void main()
{
    gl_Layer = gl_InvocationID;

    gl_Position = u_mvps[gl_InvocationID] * gl_in[0].gl_Position;
    EmitVertex();

    gl_Position = u_mvps[gl_InvocationID] * gl_in[1].gl_Position;
    EmitVertex();

    gl_Position = u_mvps[gl_InvocationID] * gl_in[2].gl_Position;
    EmitVertex();

    EndPrimitive();
}

You can see in the input layout qualifier that we specify the GS to be invoked 6 times for each primitive. The invocation id (0-5) can be retrieved via gl_InvocationID. I say "potentially increase the performance" because if the number of input primitives is far larger than the number of shader cores, instancing does not seem that helpful; I need to do a timing test to see if there is a difference.

Finally, for per-vertex visibility cubemap generation, since we know all the rendering parameters beforehand, we can store them in a shader storage buffer or a texture buffer and use instanced rendering to eliminate the redundant API calls. For a scene containing 150000 vertices, naive rendering loops require 150000 draw API calls; ideally, just one draw call is needed when instanced rendering is used. However, because we need to retain each vertex's visibility map, we have to switch the FBO or FBO attachments between rendering iterations, and allocating 150000 textures is also not a good idea. Because 6x32x32 resolution is usually enough for a vertex's visibility cubemap, we can alleviate this problem by using large textures, such as 6x8192x8192, and grouping a number of visibility fields into each large texture. We then do instanced rendering with these large textures serving as render targets, so the number of instanced draw calls equals the number of large textures.

In my test scene, which has 175000 vertices, the visibility cubemap generation only takes a few seconds on a machine with an NVIDIA GTX 670; I haven't measured it precisely, though.

Visibility field for a vertex

Wavelet rotation for image-based lighting, and per-vertex visibility precomputation using OpenGL 4.x features

The triple-product wavelet integral IBL proposed by Ng et al. [1] enables changing the lighting as well as the viewing direction by separately precomputing the BRDF and visibility functions. In their implementation, the BRDF, visibility, and lighting functions are all sampled in the same global frame. Because BRDFs are usually defined in a local frame, Ng et al. precompute BRDFs for a set of orientations (defined in the global frame); that is, each material BRDF has multiple copies for different orientations. At runtime, when shading a vertex, the surface normal is used to choose and interpolate among the BRDFs with the closest orientations. This approach requires a huge amount of memory for storing the precomputed BRDF set.

Wang et al. [2] propose a computational approach for efficient wavelet rotation. First, they choose the octahedral mapping to parameterize the spherical domain, because it has good uniformity and reduces wavelet rotation to 2D translation.

octahedral

Second, they precompute a wavelet rotation matrix R per orientation (so they get a set of rotation matrices). A wavelet rotation matrix is very sparse because of the compact local support of the wavelet bases, and quantization can further reduce the number of non-zero elements. If we have 32×32 predefined normal orientations, we get 1024 rotation matrices.

These wavelet rotation matrices are independent of the BRDF, lighting, and visibility, so they can be computed once, stored on disk, and loaded into memory when the application starts. And we only need one BRDF copy for each material.

At runtime, the light function is dynamically sampled and wavelet-transformed, and then rotated by the precomputed wavelet rotation matrices to obtain differently rotated versions of the lighting function (or more specifically, of its wavelet coefficients).

These rotated lighting wavelet coefficients are then used to shade each vertex, that is, to compute dot products with the BRDF defined in the local frame.

Wang et al. [2] do not take visibility into account, so they only need dot products when shading. They perform shading entirely on the CPU; SSE instructions are enough to achieve near-interactive frame rates.
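If one wanted to do the rotation on the GPU instead, it is just a sparse matrix-vector multiply per orientation. Here is a minimal CSR-style compute shader sketch; the buffer names and layout are assumptions of this sketch, not the data format used by Wang et al.:

#version 430

// rotated = R * light, with the rotation matrix R stored in CSR form.
// One invocation computes one output coefficient.
layout(local_size_x = 64) in;

layout(std430, binding = 0) readonly  buffer RowPtr   { uint  rowPtr[];   };
layout(std430, binding = 1) readonly  buffer ColIdx   { uint  colIdx[];   };
layout(std430, binding = 2) readonly  buffer Values   { float value[];    };
layout(std430, binding = 3) readonly  buffer LightIn  { float lightIn[];  };
layout(std430, binding = 4) writeonly buffer LightOut { float lightOut[]; };

uniform uint u_numCoeffs;   // number of wavelet coefficients

void main()
{
    uint row = gl_GlobalInvocationID.x;
    if (row >= u_numCoeffs)
        return;

    float sum = 0.0;
    for (uint k = rowPtr[row]; k < rowPtr[row + 1u]; ++k)
        sum += value[k] * lightIn[colIdx[k]];

    lightOut[row] = sum;
}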

——————————————————————————————————————

In order to incorporate visibility and perform the wavelet triple-product integral, we have to precompute the visibility function at each vertex, that is, sample the visibility in the spatial domain and wavelet-transform it. With OpenGL 4.x features, we might be able to shrink the precomputation time to an acceptable level.

First, back in the mid-2000s we had to make 6 draw API calls per vertex position to generate each vertex's visibility cubemap; today we can use multi-draw indirect to greatly reduce the number of expensive API calls the CPU makes.

Second, layered rendering with a geometry shader makes it possible to render all 6 cubemap faces in one draw.

Third, bindless textures reduce the overhead of repeatedly binding and unbinding the visibility textures when doing the wavelet transform on the GPU.

References:

1. Ren Ng, Ravi Ramamoorthi, Pat Hanrahan, Triple Product Wavelet Integrals for All-Frequency Relighting, Siggraph 2004

2. Rui Wang, Ren Ng, David Luebke, Greg Humphreys, Efficient Wavelet Rotation for Environment Map Rendering, Eurographics Symposium on Rendering (2006)

Prefiltered Environment Maps On Glossy Materials

For Phong BRDF materials, we can convolve each environment map texel with the surrounding texels covered by a given specular lobe. We can store the prefiltered environment maps corresponding to different specular powers in a mipmap chain: mipmap level 0 contains the map for the highest specular power, mipmap level 1 the map for the second-highest specular power, and so forth.

Again we can use CubemapGen to generate this multi-level cubemap texture.

In a GLSL shader, we use the reflection vector to sample the environment map, as opposed to the normal vector used with irradiance maps for Lambertian surfaces. There are different ways to derive the desired mipmap level, depending on how the mipmap chain is generated. For instance, CubemapGen offers a mode in which you specify the maximum specular power (mipmap level 0) and the ratio by which the power decreases at each subsequent mipmap level. The desired mipmap level can then be derived with a statement like the following:

float mipmapLevel = log(materialSpecular / MAX_SPECULAR_POWER) / log(POWER_DEC_RATIO);

Then we can sample the texture using textureLod, which allows us to specify the desired mipmap level.

To get smooth transitions, we also have to enable linear filtering between mipmap levels for this cubemap texture.
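Putting it together, the fragment-shader side looks roughly like this sketch; MAX_SPECULAR_POWER and POWER_DEC_RATIO are placeholder values that must match whatever was used to build the mipmap chain:

// Fragment-shader sketch of sampling the prefiltered glossy cubemap.
uniform samplerCube u_prefilteredEnv;
uniform float       u_materialSpecular;          // Phong specular power

const float MAX_SPECULAR_POWER = 2048.0;         // power at mip level 0
const float POWER_DEC_RATIO    = 0.25;           // power ratio between levels

vec3 glossyIBL(vec3 worldNormal, vec3 viewDir)   // viewDir: surface -> eye
{
    vec3  r        = reflect(-viewDir, worldNormal);
    float mipLevel = log(u_materialSpecular / MAX_SPECULAR_POWER)
                   / log(POWER_DEC_RATIO);
    return textureLod(u_prefilteredEnv, r, mipLevel).rgb;
}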

The following screenshots are my rendering results after adding Phong specular contribution to the final shading.

Specular power 10:

phong_glossy_pow10

Specular power 60:

phong_glossy_pow60

References:

1. Jan Kautz, Pere-Pau Vázquez, Wolfgang Heidrich, and Hans-Peter Seidel. 2000. Unified Approach to Prefiltered Environment Maps. In Proceedings of the Eurographics Workshop on Rendering Techniques 2000

2. https://seblagarde.wordpress.com/2012/06/10/amd-cubemapgen-for-physically-based-rendering/

Shadowing In IBL Using Sparse 3D Textures

To accurately compute shadows in a scene with omni-directional light sources, each shaded point has to shoot many shadow rays toward the surrounding environment to test light visibility (i.e., whether a shadow ray can reach the light source without hitting blocking geometry). To do these occlusion tests we need to know the whole scene geometry, and accessing that information from a shader is not trivial.

In an effort to combine voxel-based GI with IBL, I experimented with casting shadows against scene geometry represented by sparse 3D textures:

voxel_buddha

Representing this "buddha on a plane" scene with an ordinary 512x512x512, 8-bit 3D texture requires 128 MB of memory, whereas a sparse 3D texture consumes only 12.25 MB: only 196 texel pages (each page holds 64x32x32 texels for an 8-bit 3D texture) are resident, i.e., actually occupy memory. By using sparse textures, we can use the maximum 3D texture resolution without consuming too much memory. Furthermore, 1 bit per voxel is enough to store the occlusion information, so an 8-bit texel can store 2x2x2 occlusion voxels, which further reduces memory usage.

With this scene representation, I can cast shadow rays from each surface point. The following screenshot is the result of casting 18 shadow rays per surface point:

18_shadowray

Clearly 18 shadow rays is not enough for accurate shadowing.
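For reference, the occlusion test for a single shadow ray is just a march through the occlusion volume until it either hits a filled voxel or leaves the scene bounds. A minimal sketch, assuming the volume stores 1.0 for occupied voxels, the scene is mapped to [0,1]^3 texture coordinates, and without the 1-bit packing described above:

// Returns 1.0 if the ray escapes the scene, 0.0 if it hits a filled voxel.
uniform sampler3D u_occlusionVolume;   // 1.0 = occupied, 0.0 = empty
uniform float     u_voxelSize;         // step size in texture space

float shadowRayVisibility(vec3 originTex, vec3 dirTex)
{
    // Offset the start a little to avoid self-occlusion at the surface.
    vec3 p = originTex + dirTex * u_voxelSize * 2.0;
    for (int i = 0; i < 512; ++i)
    {
        if (any(lessThan(p, vec3(0.0))) || any(greaterThan(p, vec3(1.0))))
            return 1.0;                // left the volume: unoccluded
        if (textureLod(u_occlusionVolume, p, 0.0).r > 0.5)
            return 0.0;                // hit a filled voxel: occluded
        p += dirTex * u_voxelSize;
    }
    return 1.0;
}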

Right now this naive approach has several obvious drawbacks:

1. Unlike ambient occlusion, which only tests occlusion within a short distance, shadow rays for environment map lighting have to traverse all the way to the scene boundary, which makes the traversal costly.

2. A very large number of texture fetches is required in the fragment program, which is also very costly.

3. This approach can only be used on diffuse materials. To accurately compute lighting integrals on glossy materials, visibility functions have to be incorporated into the integrals.

A brief introduction to triple-product wavelet image-based relighting

Precomputed Radiance Transfer (PRT), proposed by Sloan et al. [1], enables interactive image-based lighting that takes soft shadows and indirect illumination into account under low-frequency dynamic lighting environments. PRT precomputes light transport functions capturing the way an object shadows, scatters, and reflects light, and encodes them with low-order spherical harmonics (SH) basis functions, which allows them to be represented in compact vector form. Rendering then becomes a simple inner product of a light vector, sampled at run time and also represented in the SH basis, with the precomputed transport vector. For glossy BRDFs, a transport matrix instead of a vector is precomputed for every vertex to allow view-dependent shading.

Similar to a Fourier series, low-order SH can only approximate low-frequency signals, and is not an ideal candidate for efficiently encoding high-frequency signals like sharp lighting and hard shadows.

Ng et al. [2] broke the low-frequency limitation by using non-linear wavelet approximation to represent the lighting and transport vectors, achieving all-frequency illumination and shadows at interactive rates. Wavelets require far fewer coefficients than SH to approximate high-frequency data. This technique was developed primarily for image relighting. With a changing viewpoint, it was limited to diffuse objects because of the need to sample the viewing direction; this additional sampling would be required for every vertex, which is too costly.

The following image is a comparison between SH (left) and Haar wavelet (right) PRT from Ng et al. [2] (both use 100 coefficients):

SH_vs_wavelets

Ng et al. [3] factor light, visibility, and BRDF into 3 independent components, and thus achieve all-frequency dynamic lighting as well as changing viewpoints.

The following is a brief explanation of their work.

Consider the rendering equation without indirect illumination. The equation

L_o(x,\omega_o)=\int_{\Omega}L(x,\omega_i) f_r(x,\omega_i,\omega_o)V(x,\omega_i)(n \cdot \omega_i)d\omega_i

describes the outgoing radiance L_o at surface location x in viewing direction \omega_o , where L is the incident lighting, \omega_i is the incident light direction, f_r is the BRDF, n is the surface normal, and V is the visibility. The integral is over the visible (upper) hemisphere \Omega. Here we make several simplifications. First, we fold the cosine term into the BRDF definition. Second, we assume distant illumination, making L independent of surface location. Third, we only consider spatially uniform BRDFs. Finally, we define all functions in a global coordinate system; however, BRDFs are typically defined in local coordinates with respect to the surface normal, which makes rotation operations necessary to apply the BRDF to different surface normal orientations in the global coordinate system. With these simplifications, the rendering equation can be expressed as

L_o(x,\omega_o)=\int_\Omega L(\omega_i) f_r (\omega_i, \omega_o) V(x,\omega_i) d\omega_i

where the integrand is a product of three functions. If we consider a single vertex and a fixed viewpoint, the equation can be further expressed as

L_o = \int_\Omega L(\omega_i)f_r(\omega_i)V(\omega_i)d\omega_i

The functions L, f_r, and V can then be expanded in an appropriate set of orthonormal basis functions B(\omega):

L(\omega)=\sum\limits_{i} L_{i} B_{i} (\omega),

f_r(\omega) = \sum\limits_j f_{rj}B_j (\omega),

V(\omega) = \sum\limits_k V_k B_k(\omega),

where L_i, f_{rj}, and V_k are coefficients of corresponding basis functions. With this basis expansion, we can write the simplified rendering equation as:

L_o = \int_{\Omega} \left( \sum\limits_i L_i B_i (\omega) \right) \left( \sum\limits_{j} f_{rj} B_j (\omega) \right) \left( \sum\limits_{k} V_k B_k (\omega) \right) d\omega

=\sum\limits_{i}\sum\limits_{j}\sum\limits_{k} L_{i} f_{rj} V_k \int_{\Omega} B_{i}(\omega) B_{j}(\omega) B_{k} (\omega) d\omega

=\sum\limits_{i}\sum\limits_{j}\sum\limits_{k} C_{ijk}L_{i}f_{rj}V_{k} ,

where the triple product tensor, C_{ijk}, is defined by

C_{ijk}=\int_{\Omega}B_{i}(\omega)B_{j}(\omega)B_{k} (\omega) d\omega

Ng et al. [3] call C_{ijk} the tripling coefficients. Because of these tripling coefficients, evaluating the above equation is expensive: in general we need to consider all (i,j,k) combinations, making the complexity O(N^3). Ng et al. [3] proved that a tripling coefficient of the 2D Haar basis is nonzero if and only if certain criteria are met. So instead of O(N^3), the complexity of evaluating the equation in the Haar basis is O(N log N); specifically, the exact number of non-zero Haar tripling coefficients for the first N basis functions is 2 - N + 3 N \log_4 N . Furthermore, the evaluation can be reduced to linear time by dynamic programming.
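To make the final summation concrete, here is a minimal GLSL-style sketch that evaluates it from a precomputed list of the non-zero tripling coefficients; the packed (i, j, k, C) buffer layout is an assumption of this sketch, and it ignores the dynamic-programming optimization just mentioned:

// L_o = sum over the non-zero tripling coefficients of C * L_i * f_j * V_k.
struct TriplingEntry { uint i; uint j; uint k; float c; };

layout(std430, binding = 0) readonly buffer Tripling { TriplingEntry entries[]; };
layout(std430, binding = 1) readonly buffer LightW   { float L[]; };  // light coefficients
layout(std430, binding = 2) readonly buffer BrdfW    { float F[]; };  // BRDF coefficients
layout(std430, binding = 3) readonly buffer VisW     { float V[]; };  // visibility coefficients

uniform uint u_numEntries;

float tripleProduct()
{
    float Lo = 0.0;
    for (uint n = 0u; n < u_numEntries; ++n)
    {
        TriplingEntry e = entries[n];
        Lo += e.c * L[e.i] * F[e.j] * V[e.k];
    }
    return Lo;
}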

In the work of Ng et al. [3], the light function is sampled at run time from an environment map; the visibility function is precomputed by sampling an occlusion cubemap at each vertex position; and the BRDF is precomputed for a predefined set of orientations. When rendering, the BRDF for a specific orientation is obtained by interpolating the precomputed set of BRDFs.

Here is a sample of these 3 functions, from [3],  shown in cubemap form:

functions

And the following is a sample rendered result from the work of Ng et al.[3]:

wavelet sample img

Ma et al. [5] use spherical wavelets and compute the triple product integrals in the local frame to eliminate the need for BRDF interpolation, greatly reducing the data size.

Although higher fidelity and run-time interactivity can be achieved with triple-product integrals in the Haar wavelet basis, the high precomputation cost and data size have kept the technique from being adopted by the graphics industry, which nowadays widely uses SH-based rendering because of its simplicity and low cost.

References:

1. Peter-Pike Sloan, Jan Kautz, John Snyder, Precomputed Radiance Transfer for Real-Time Rendering in Dynamic, Low-Frequency Lighting Environments, Siggraph 2002

2. Ren Ng, Ravi Ramamoorthi, Pat Hanrahan, All-Frequency Shadows Using Non-linear Wavelet Lighting Approximation, Siggraph 2003

3. Ren Ng, Ravi Ramamoorthi, Pat Hanrahan, Triple Product Wavelet Integrals for All-Frequency Relighting, Siggraph 2004

4. Ravi Ramamoorthi, Precomputation-Based Rendering, Foundations and Trends in Computer Graphics and Vision, Vol. 3, No. 4 (2007), 281-369

5. Wan-Chun Ma, Chun-Tse Hsiao, Ken-Yi Lee, Yung-Yu Chuang, Bing-Yu Chen, Real-time Triple Product Relighting Using Spherical Local-frame Parameterization, The Visual Computer, Vol. 22, No. 9-11 (2006), 682-692

IBL on Lambertian surfaces using irradiance environment map

This is the first stage of my real-time image-based lighting (IBL) renderer. In this part, a basic asset-loading and camera system is constructed, and irradiance environment maps are used to shade diffuse surfaces.

Environment maps are commonly used as light sources in image-based lighting (IBL). In an unfiltered environment map, each texel represents a single discrete light source.

grace_cross

If the object being shaded is a perfect mirror, only one texel of the environment map needs to be sampled for each surface point on that object (the texel in the direction of specular reflection).

perfect-mirror

However, a whole range of light directions needs to be sampled to shade other types of materials (if we don't consider refraction). For example, to shade a Lambertian surface, the region of the light source that needs to be sampled is the entire upper hemisphere above the shaded point:

L = K_d \sum_{i \in \Omega} \max(0, D_i \cdot N)\, l_i

For Lambertian surfaces, the region of light that needs to be sampled is determined solely by the surface normal. So we can create a "pre-filtered" environment map in which each texel stores \sum_{i \in \Omega} \max(0, D_i \cdot N)\, l_i . Basically this is a convolution over the whole environment map for each environment map texel. Then, at run time, only one texel lookup in the prefiltered environment map is needed to shade a Lambertian surface.

Doing this convolution in the spatial domain is very slow. Because diffuse irradiance varies very slowly with respect to the normal, we can transform the environment map into frequency space and keep only the low-frequency information. A spherical function (e.g., an environment map) can be represented in the frequency domain using the spherical harmonic transform, and a few SH basis functions are enough to represent the low-frequency data of the whole environment map required for diffuse lighting. This reduces the convolution complexity from 6 \times m \times m \times 6 \times m \times m to k \times 6 \times m \times m (where m is the dimension of a cubemap face and k is the number of preserved SH coefficients).

So what we need to do is transform the spatial-domain environment map to the frequency domain with an SH transform, keep only the few most significant basis coefficients, perform the convolution with the Lambertian reflectance function (precomputed, also represented in the frequency domain), and transform the result back to the spatial representation.
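In the SH domain, the convolution with the Lambertian (clamped-cosine) kernel reduces to scaling each band l of lighting coefficients by a constant; the values below are the standard ones from Ramamoorthi and Hanrahan's irradiance environment map work:

E_{lm} = \hat{A}_l L_{lm} , \quad \hat{A}_0 = \pi , \quad \hat{A}_1 = \frac{2\pi}{3} , \quad \hat{A}_2 = \frac{\pi}{4} , \quad \hat{A}_l = 0 \text{ for odd } l > 1

Since 25 coefficients cover bands l = 0..4, they capture essentially all of the diffuse response.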

CubemapGen, an environment map manipulation tool, can do this pre-filtering for you.

Here is the result with 25 SH coefficients preserved:

grace_cross_irradiance

And this is the rendering result of Lambertian surfaces:

buddha_diffuse