This page introduces my final
project for the CS384G class, taught in Fall 2008 by Prof. Don Fussell
at the University of Texas at Austin.
Dec 12 2008, Min Kyu Jeong
1. Introduction
In this project, I ported the ray-tracer from our class assignment to NVIDIA graphics processors.
2. Necessary background
A. Programmable Graphics Processors
Recent graphics processing units (GPUs) have replaced the fixed-function units for vertex processing (vertex transformation) and fragment processing (pixel shading) with programmable processors. These programmable processors are designed to sustain high computational throughput, with nearly an order of magnitude more peak floating-point operations per second (FLOPS) than contemporary CPUs. This high peak performance comes from their highly parallel design, in which a large number of simple processing units work simultaneously. Applications with abundant parallelism can be accelerated by running them on GPUs.
B. CUDA
CUDA is the programming model for NVIDIA GPUs. It provides an abstraction of the GPU for general-purpose computation. CUDA models the GPU as a coprocessor that assists the CPU. When an application executing on the CPU reaches a point where a large amount of work needs to be done (for example, a loop computing a large matrix-matrix multiplication), the work is sent to the GPU for acceleration. Once the work is finished, the result is copied back to the CPU and the application continues.
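This offload flow can be sketched as follows; scale_kernel, the scaling factor, and the 256-thread block size are illustrative choices, not details from the project.

```cuda
#include <cuda_runtime.h>

// Kernel executed on the GPU: each thread scales one array element.
__global__ void scale_kernel(float *data, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= s;
}

// Host-side driver: copy input to the GPU, run the kernel, copy back.
void scale_on_gpu(float *host_data, int n)
{
    float *dev_data;
    size_t bytes = n * sizeof(float);

    cudaMalloc((void **)&dev_data, bytes);
    cudaMemcpy(dev_data, host_data, bytes, cudaMemcpyHostToDevice);
    scale_kernel<<<(n + 255) / 256, 256>>>(dev_data, 2.0f, n);
    cudaMemcpy(host_data, dev_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_data);
}
```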
A GPU is modeled as a collection of multi-threaded processors, called Streaming Multiprocessors (SMs) in NVIDIA terms, and is programmed in a Single Program Multiple Data (SPMD) fashion. A programmer writes a kernel and spawns many threads that all execute the same kernel; each thread works on a different set of data in parallel. When the threads are spawned, they are grouped into thread-blocks, and the blocks are distributed among the SMs. Threads mapped to the same SM are executed in a time-multiplexed fashion to maximize processor utilization. As in the usual SPMD programming model, each thread identifies its working dataset using its thread id. In CUDA, the thread id consists of a 2-dimensional thread-block id and a 3-dimensional thread id local to the block; the quadruple <blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y> identifies a thread.
The programmer specifies how the threads are grouped into thread-blocks. When the kernel is invoked, the geometry of the block grid (2x3 in the above figure) and of each thread-block (3x4 in the above figure) must be provided together. These dimensions should be chosen carefully to achieve good load balancing between SMs. The optimal blocking scheme depends on how much resource (registers and shared memory) each kernel instance requires and how much resource the GPU provides.
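In CUDA source, this geometry is supplied at the kernel call site; render_kernel and its argument are placeholder names used here for illustration.

```cuda
dim3 grid(2, 3);   // 2x3 grid of thread-blocks, as in the figure
dim3 block(3, 4);  // 3x4 threads per thread-block
render_kernel<<<grid, block>>>(framebuffer);
// Inside the kernel, <blockIdx.x, blockIdx.y, threadIdx.x, threadIdx.y>
// identifies each of the 2*3 * 3*4 = 72 threads.
```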
C. Programming restrictions
Kernels can be written in a subset of C. The following is a partial list of the restrictions that a kernel function has to follow.
1. No recursion
2. No static variables declared inside a function body
3. No variable number of arguments
4. No function pointers
5. The parameters of a kernel function are limited to 256 bytes in total
3. What has been accomplished
A. Phong shading with shadows
Currently, the ported ray-tracer can trace a scene consisting of spheres, point lights, and directional lights. It uses the Phong shading model with hard shadows.
B. Implementation
The most labor-intensive challenge in porting the existing ray-tracer to the GPU was eliminating the C++ features that are not supported in C. Templates and class inheritance are used extensively and had to be rewritten as equivalent C structures. (It turned out later that the CUDA compiler supports templates and limited C++ class features, including member functions and operator overloading.) For each non-virtual class, a corresponding struct was written. Conversion functions between each class and its matching struct were written instead of modifying the parser. All member functions were rewritten as global functions. The template vector and matrix types were explicitly instantiated. The following code snippet shows a vector-scalar multiplication example.
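A minimal sketch of what such an explicit instantiation might look like; the names vec3f and vec3f_scale are illustrative, not the project's actual identifiers.

```c
/* C struct replacing the templated C++ vector class: a Vec3 of
   doubles in the original becomes three floats on the GPU. */
typedef struct {
    float x, y, z;
} vec3f;

/* The member operator* is rewritten as a global function. */
vec3f vec3f_scale(vec3f v, float s)
{
    vec3f r;
    r.x = v.x * s;
    r.y = v.y * s;
    r.z = v.z * s;
    return r;
}
```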
Once all the groundwork was finished, the most important modification to the algorithm remained: parallelization. The original ray-tracer traces one ray at a time in a for loop, but the GPU ray-tracer traces them in parallel. As described earlier, this is expressed as spawning multiple threads. In this implementation, each thread is statically assigned a set of pixels to render. In a more complex scheme, a work pool of rays being traced could be maintained, with threads pushing and pulling work dynamically from this pool. This is left as future work.
The image is partitioned into 64 blocks. Each block is assigned to a single thread-block consisting of 11x11 threads. Each thread-block shades 11x11 pixels at a time, one pixel per thread.
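This static pixel assignment can be illustrated with the index arithmetic below; the function and parameter names are hypothetical, and on the GPU the block and thread coordinates would come from blockIdx and threadIdx.

```c
#define TILE 11  /* threads per thread-block side */

/* Map a thread to the pixel it shades in one 11x11 pass.
   block_pixels is the width of one image block (image width / 8);
   sub_x, sub_y select which 11x11 subblock is being processed. */
void pixel_for_thread(int block_x, int block_y,
                      int thread_x, int thread_y,
                      int sub_x, int sub_y,
                      int block_pixels,
                      int *px, int *py)
{
    *px = block_x * block_pixels + sub_x * TILE + thread_x;
    *py = block_y * block_pixels + sub_y * TILE + thread_y;
}
```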
(Figure: the image is partitioned into 8x8 blocks, where Bxx labels a block; within each block, one 11x11 subblock is processed at a time, where txx labels a thread. The figure shows the case of 22x22 pixels per block, i.e. an image of 176x176 pixels.)
The GPU uses a variant of the IEEE floating-point standard, and the hardware is designed for single-precision floating point. Therefore, every double in the original ray-tracer was converted to float. One modification made necessary by this change is the RAY_EPSILON value used to prevent secondary rays from intersecting their originating objects: it was raised from 0.00001 to 0.001.
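The usual idiom (assumed here, not quoted from the project code) is to reject intersections closer to the ray origin than RAY_EPSILON, so that a secondary ray does not immediately re-hit the surface it starts from; in single precision the original threshold of 0.00001 was too small.

```c
#define RAY_EPSILON 0.001f  /* raised from 0.00001 for single precision */

/* Accept an intersection only if it is far enough from the ray
   origin to not be a numerical self-hit. */
int accept_hit(float t)
{
    return t > RAY_EPSILON;
}
```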
C. CUDA programming pitfalls
Although they are rather peripheral to this project, I believe some GPU programming pitfalls I found are worth mentioning here. A substantial amount of implementation time was spent debugging unexpected behavior. An emulation mode is provided for debugging, because the GPU itself offers no debugging facilities. When a program works correctly in emulation mode but not on the GPU, it is therefore hard to track down the problem.
Such differences between emulation mode and GPU mode can occur for the following reasons. First, the emulation mode emulates the parallel execution of the GPU by creating multiple CPU threads, but the thread executions are serialized, so problems caused by uncontrolled, simultaneous accesses to the same memory address do not manifest themselves. Second, the memory spaces of the CPU and the GPU are separate, but the emulation mode uses the CPU memory space to emulate GPU memory; even if GPU code references CPU memory addresses, the emulation mode does not raise any error. Third, there is little documentation about which features of the language are and are not supported. Even worse, the compiler may generate code that is not what the programmer intended, without any warning. At that point, the programmer has to sanity-check the basic features one at a time. A partial list of what I examined:
1. Is a C structure correctly passed as a kernel function argument?
2. Can a pointer to a local variable be used?
3. Are byte-granularity accesses to GPU memory handled correctly?
The problem I finally found, after 4 days of eliminating one suspect at a time, was a compiler issue: the compiler incorrectly assumes the memory alignment of structures. The Sphere struct was 204 bytes in size, but the GPU code generated array element addresses in increments of 208 bytes; for example, &(sphere[1]) - &(sphere[0]) was 208. Accesses to the data members were therefore all wrong. This problem has been reported and discussed in the CUDA forum (http://forums.nvidia.com/index.php?showtopic=73806). One solution is to manually pad the structures so that they are aligned as the compiler assumes.
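A sketch of the manual-padding workaround, assuming a Sphere-like struct whose members total 204 bytes (the 51-float body below is a stand-in for the real members):

```c
/* The CUDA compiler addressed array elements with a 208-byte stride
   even though the struct's natural size was 204 bytes. Padding the
   struct to 208 bytes makes the CPU and GPU layouts agree. */
typedef struct {
    float data[51];  /* stand-in for the real members: 204 bytes */
    char  pad[4];    /* manual padding up to the 208-byte stride */
} Sphere;
```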
D. Performance comparison to the CPU version
No performance optimization effort was put into the current implementation. To achieve good performance from a GPU, the memory access patterns of nearby threads should be carefully tuned. The texture cache could also be used to stage scene objects.
The GPU execution time includes all the setup time, such as data structure conversion and data transfer between the CPU and the GPU. It is debatable what should be counted as execution time, but I believe all the overhead necessary for the acceleration should be included. Because this fixed overhead is substantial, the execution time of the GPU version does not increase much as the image size grows. Larger images or anti-aliasing would highlight the benefit of the GPU even more.
4. Artifacts
All images were rendered on an NVIDIA GeForce 9800 GX2 graphics card, which features two G92 GPUs.
A. Sphere (sphere.ray)
176x176 pixels
1 directional light, 1 point light and 1 sphere
2 light sources create 2 highlight spots on the sphere
B. Spheres (spheres.ray)
528x528 pixels
2 directional lights and 7 spheres
One directional light parallel to the camera direction gives the overall color, and the other directional light from the bottom-right gives the shadows and secondary highlights on the spheres.
5. References
[1] NVIDIA CUDA Programming Guide Version 2.0, http://www.nvidia.com/object/cuda_develop.html, July 2008
[2] CUDA programming forum, http://forums.nvidia.com