It is difficult to achieve good performance in CUDA and OpenCL. It really feels like lining all your ducks in a row. If anything goes out of line, you get worse performance than an optimized CPU implementation. And very easily things go out of line.
First, you must be aware of how memory will be accessed by your algorithm. You must access your data in the same way it is being fetched by the GPU. You may find yourself doing weird things, like using a space-filling curve to linearize access to data in multiple dimensions. Most likely you will need to use the workgroup's local memory to cache values each workitem would normally get from global memory.
You must pick workload sizes that will result in an optimal processor occupancy. But be very careful here. All your local variables will be using GPU registers. If you cramp too many threads at once you may run out of registers. Your code will still run, but the variables that did not fit in registers will be stored in global memory. This will make the performance orders of magnitude slower than a CPU implementation. Different cards will have different numbers of registers. If you want an optimal workload size, your implementation must become aware of the different cards out there.
And then you need to think of what your code actually does. It pays to use vector operations whenever possible, to sacrifice precision when it doesn't really matter. Control flow instructions are very expensive so use them if there is no other choice. OpenCL and CUDA are SIMT (Single Instruction Multiple Thread). What this means is that all concurrent threads are running the same instruction. Whenever your code branches the processor will have to serialize the different branches and run them one after the other. If you want to get over that, your code then needs to become aware of processor wavefronts, or warps like the are called in CUDA.
As you see, you must be fully aware of the architecture and how your solution is being mapped into it. If you are a GPU developer who was doing generic computing using vertex and fragment shaders, you will find OpenCL and CUDA refreshing and liberating. If you are a CPU developer, well, I hope you enjoy learning new things.
On top of everything there is Amdahl's Law . This is just a way to compute how much parallelization will improve performance. It tells you that any portions of your code that need to remain serial will have a large impact on the overall performance, no matter how many processors or threads are running the rest. Real-life solutions cannot be fully parallel. Often there are large fragments of your logic that you do not want to run in parallel, either because it would be expensive to port them, or because they are out of your control. So in some cases the massive parallelization offered by OpenCL and CUDA, even if exploited properly, improves very little.
After my initial experiment porting Dual Contouring to OpenCL, I realized some improvements were necessary. My first attempt was the naive approach: use a regular grid and run the kernels on every element of the grid, regardless of it was part of the surface or not. The results were better than a single threaded CPU implementation, but the speed up was about two times.
For the second iteration, it was clear I needed to be selective about which elements to compute. This is how I did it:
- Run the density function for every point in the regular grid
- Classify which edges in the grid contain a sign change between endpoints
- Scan the edge classification grid and create a compact array of intersected edges
- Using the edge array, compute intersection point for each edge
- Compute the normal of the surface at each intersection point
- Classify which voxels in the grid have at least one intersection in one of its 12 edges
- Scan the voxel classification and create a compact array of voxels
- For each voxel in the compacted array, run the QEF solution
As you can see, that's a lot of kernels. Everything needs to be classified, scanned and compacted. The scanning alone may involve more than one call to a single kernel. If you are not clear about this operation, please check an earlier post: Scan or Die. As you must remember the scan operation operated in several branches of a tree. This solution took more memory as well, since we need classification grids for edges and voxels, and also the arrays to store the edges and voxels that really needed to be processed.
The million dollar question is, was it faster? It was. It was really fast. This optimization hit a nerve, which was computing the surface normals. The density functions I was running were very complex to evaluate. At any point the terrain had dozens of layers, each one with five to ten octaves of Perlin noise. In total over a hundred octaves of Perlin for each point. To compute a single normal I had to evaluate the density function another three times. By being selective and computing normals only when needed I was able to compute many more normals at the same time.
But this was not the last iteration I did. As I had suspected, evaluating the density function was the bottleneck of the entire algorithm. I was still evaluating it for every point in the grid. Improving that took me to the next iteration, which may come with a surprise (or not, if you consider my earlier rant about OpenCL complexity, Amdahl et al).