This page introduces my final
project for the CS384G class, taught in Fall 2008 by Prof. Don Fussell
at the University of Texas at Austin.
Dec 12 2008, Min Kyu Jeong
1. Introduction
In this project, I ported the ray-tracer from our class assignment to NVIDIA graphics processors.
2. Necessary background
A. Programmable Graphics Processors
Recent graphics processing units (GPUs) have replaced the fixed-function units for vertex processing (vertex transformation) and fragment processing (pixel shading) with programmable processors. These programmable processors are designed to sustain high computational throughput, with nearly an order of magnitude more peak floating-point operations per second (FLOPS) than contemporary CPUs. This high peak performance comes from their highly parallel design, in which a large number of simple processing units work simultaneously. Applications with abundant parallelism can be accelerated by running them on GPUs.
B. CUDA
CUDA is the programming model for NVIDIA GPUs. It provides an abstraction of the GPU for general-purpose computation. CUDA models the GPU as a coprocessor that assists the CPU. When an application executing on the CPU reaches a point where a large amount of work needs to be done (for example, a loop computing a large matrix-matrix multiplication), the work is sent to the GPU for acceleration. Once the work is finished, the result is copied back to the CPU and the application continues.
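This offload flow can be sketched as follows; scale_kernel, the scaling factor, and the 256-thread block size are illustrative choices, not details from the project.

```cuda
#include <cuda_runtime.h>

// Kernel executed on the GPU: each thread scales one array element.
__global__ void scale_kernel(float *data, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= s;
}

// Host-side driver: copy input to the GPU, run the kernel, copy back.
void scale_on_gpu(float *host_data, int n)
{
    float *dev_data;
    size_t bytes = n * sizeof(float);

    cudaMalloc((void **)&dev_data, bytes);
    cudaMemcpy(dev_data, host_data, bytes, cudaMemcpyHostToDevice);
    scale_kernel<<<(n + 255) / 256, 256>>>(dev_data, 2.0f, n);
    cudaMemcpy(host_data, dev_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_data);
}
```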
A GPU is modeled as a collection of multi-threaded processors, called Streaming Multiprocessors (SMs) in NVIDIA terms, and is programmed in a Single Program Multiple Data (SPMD) fashion. A programmer writes a kernel and spawns many threads that all execute the same kernel; each thread works on a different set of data in parallel. When the threads are spawned, they are grouped into thread-blocks, and the blocks are distributed among the SMs. Threads mapped to the same SM are executed in a time-multiplexed fashion to maximize processor utilization. As in the usual SPMD programming model, each thread identifies its working dataset using its thread id. In CUDA, the thread id consists of a 2-dimensional thread-block id and a 3-dimensional thread id local to the block; the quadruple <blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y> identifies a thread.
The programmer specifies how the threads are grouped into thread-blocks. When the kernel is invoked, the geometry of the block grid (2x3 in the above figure) and of each thread-block (3x4 in the above figure) must be provided together. These dimensions should be chosen carefully to achieve good load balancing between SMs. The optimal blocking scheme depends on how much resource (registers and shared memory) each kernel instance requires and how much resource the GPU provides.
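In CUDA source, this geometry is supplied at the kernel call site; render_kernel and its argument are placeholder names used here for illustration.

```cuda
dim3 grid(2, 3);   // 2x3 grid of thread-blocks, as in the figure
dim3 block(3, 4);  // 3x4 threads per thread-block
render_kernel<<<grid, block>>>(framebuffer);
// Inside the kernel, <blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y>
// identifies each of the 2*3 * 3*4 = 72 threads.
```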
C. Programming restrictions
Kernels can be written in a subset of C. The following is a partial list of the restrictions that a kernel function has to follow.
1. No recursion
2. No static variables declared inside a function body
3. No variable number of arguments
4. No function pointers
5. The parameters of a kernel function are limited to 256 bytes in total
3. What has been accomplished
A. Phong shading with shadows
Currently, the ported ray-tracer can trace a scene consisting of spheres, point lights, and directional lights. It uses the Phong shading model with hard shadows.
B. Implementation
The most labor-intensive challenge in porting the existing ray-tracer to the GPU was eliminating the C++ features that are not supported in C. Templates and class inheritance are used extensively and had to be rewritten as equivalent C structures. (It turned out later that the CUDA compiler supports templates and limited C++ class features, including member functions and operator overloading.) For each non-virtual class, a corresponding struct was written. Conversion functions between each class and its matching struct were written instead of modifying the parser. All member functions were rewritten as global functions. The template vector and matrix types were explicitly instantiated. The following code snippet shows a vector-scalar multiplication example.
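A minimal sketch of what such an explicit instantiation might look like; the names vec3f and vec3f_scale are illustrative, not the project's actual identifiers.

```c
/* C struct replacing the templated C++ vector class: a Vec3 of
   doubles in the original becomes three floats on the GPU. */
typedef struct {
    float x, y, z;
} vec3f;

/* The member operator* is rewritten as a global function. */
vec3f vec3f_scale(vec3f v, float s)
{
    vec3f r;
    r.x = v.x * s;
    r.y = v.y * s;
    r.z = v.z * s;
    return r;
}
```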
Once all the groundwork was finished, the most important modification to the algorithm remained: parallelization. The original ray-tracer traces one ray at a time in a for loop, but the GPU ray-tracer traces them in parallel. As described earlier, this is expressed as spawning multiple threads. In this implementation, each thread is statically assigned a set of pixels to render. In a more complex scheme, a work pool of rays being traced could be maintained, with threads pushing and pulling work dynamically from this pool. This is left as future work.
The image is partitioned into 64 blocks. Each block is assigned to a single thread-block consisting of 11x11 threads. Each thread-block shades 11x11 pixels at a time, one pixel per thread.
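This static pixel assignment can be illustrated with the index arithmetic below; the function and parameter names are hypothetical, and on the GPU the block and thread coordinates would come from blockIdx and threadIdx.

```c
#define TILE 11  /* threads per thread-block side */

/* Map a thread to the pixel it shades in one 11x11 pass.
   block_pixels is the width of one image block (image width / 8);
   sub_x, sub_y select which 11x11 subblock is being processed. */
void pixel_for_thread(int block_x, int block_y,
                      int thread_x, int thread_y,
                      int sub_x, int sub_y,
                      int block_pixels,
                      int *px, int *py)
{
    *px = block_x * block_pixels + sub_x * TILE + thread_x;
    *py = block_y * block_pixels + sub_y * TILE + thread_y;
}
```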
(Figure: the image is partitioned into 8x8 blocks, where Bxx labels a block; within each block, one 11x11 subblock is processed at a time, where txx labels a thread. The figure shows the case of 22x22 pixels per block, i.e. an image of 176x176 pixels.)
The GPU uses a variant of the IEEE floating-point standard, and the hardware is designed for single-precision floating point. Therefore, every double in the original ray-tracer was converted to float. One modification made necessary by this change is the RAY_EPSILON value used to prevent secondary rays from intersecting their originating objects: it was raised from 0.00001 to 0.001.
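The usual idiom (assumed here, not quoted from the project code) is to reject intersections closer to the ray origin than RAY_EPSILON, so that a secondary ray does not immediately re-hit the surface it starts from; in single precision the original threshold of 0.00001 was too small.

```c
#define RAY_EPSILON 0.001f  /* raised from 0.00001 for single precision */

/* Accept an intersection only if it is far enough from the ray
   origin to not be a numerical self-hit. */
int accept_hit(float t)
{
    return t > RAY_EPSILON;
}
```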
C. CUDA programming pitfalls
Although they are rather peripheral to this project, I believe some GPU programming pitfalls I found are worth mentioning here. A substantial amount of implementation time was spent debugging unexpected behavior. An emulation mode is provided for debugging, because the GPU itself offers no debugging facilities. When a program works correctly in emulation mode but not on the GPU, it is therefore hard to track down the problem.
Such differences between emulation mode and GPU mode can occur for the following reasons. First, the emulation mode emulates the parallel execution of the GPU by creating multiple CPU threads, but the thread executions are serialized, so problems caused by uncontrolled, simultaneous accesses to the same memory address do not manifest themselves. Second, the memory spaces of the CPU and the GPU are separate, but the emulation mode uses the CPU memory space to emulate GPU memory; even if GPU code references CPU memory addresses, the emulation mode does not raise any error. Third, there is little documentation about which features of the language are and are not supported. Even worse, the compiler may generate code that is not what the programmer intended, without any warning. At that point, the programmer has to sanity-check the basic features one at a time. A partial list of what I examined:
1. Is a C structure correctly passed as a kernel function argument?
2. Can a pointer to a local variable be used?
3. Are byte-granularity accesses to GPU memory handled correctly?
The problem I finally found, after 4 days of eliminating one suspect at a time, was a compiler issue: the compiler incorrectly assumes the memory alignment of structures. The Sphere struct was 204 bytes in size, but the GPU code generated array element addresses in increments of 208 bytes; for example, &(sphere[1]) - &(sphere[0]) was 208. Accesses to the data members were therefore all wrong. This problem has been reported and discussed in the CUDA forum (http://forums.nvidia.com/index.php?showtopic=73806). One solution is to manually pad the structures so that they are aligned as the compiler assumes.
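A sketch of the manual-padding workaround, assuming a Sphere-like struct whose members total 204 bytes (the 51-float body below is a stand-in for the real members):

```c
/* The CUDA compiler addressed array elements with a 208-byte stride
   even though the struct's natural size was 204 bytes. Padding the
   struct to 208 bytes makes the CPU and GPU layouts agree. */
typedef struct {
    float data[51];  /* stand-in for the real members: 204 bytes */
    char  pad[4];    /* manual padding up to the 208-byte stride */
} Sphere;
```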
D. Performance comparison to the CPU version
No performance optimization effort was put into the current implementation. To achieve good performance from a GPU, the memory access patterns of nearby threads should be carefully tuned. The texture cache could also be used to stage scene objects.
The GPU execution time includes all the setup time, such as data structure conversion and data transfer between the CPU and the GPU. It is debatable what should be counted as execution time, but I believe all the overhead necessary for the acceleration should be included. Because this fixed overhead is substantial, the execution time of the GPU version does not increase much as the image size grows. Larger images or anti-aliasing would highlight the benefit of the GPU even more.
4. Artifacts
All images were rendered on an NVIDIA GeForce 9800 GX2 graphics card, which features two G92 GPUs.
A. Sphere (sphere.ray)
176x176 pixels
1 directional light, 1 point light and 1 sphere
2 light sources create 2 highlight spots on the sphere
B. Spheres (spheres.ray)
528x528 pixels
2 directional lights and 7 spheres
One directional light parallel to the camera direction gives the overall color, and the other directional light from the bottom-right gives the shadows and secondary highlights on the spheres.
5. References
[1] NVIDIA CUDA Programming Guide Version 2.0, http://www.nvidia.com/object/cuda_develop.html, July 2008
[2] CUDA programming forum, http://forums.nvidia.com