CUDA C++ is a parallel programming language for NVIDIA GPUs. Using the CUDA Toolkit (http://developer.nvidia.com/object/cuda_3_2_toolkit_rc.html), developers can write programs that take advantage of the large parallel computing potential of the GPU, in some cases speeding up their programs by several orders of magnitude. CUDA programs are executed on the GPU, which works as a coprocessor to the CPU and has its own memory and instruction set. But the question is: after a developer has invested the time to parallelize a program, can the CUDA program run on a PC without an NVIDIA GPU? Or does the developer have to redo all of the software?
As a software developer, you may be faced with the problem of modifying the behavior of a third-party program. For example, suppose you are writing a debugger that checks how the program being debugged uses memory allocation. You often do not have the source for the program, so you cannot change it. And you do not have the source for the memory allocation library that the program uses, so you cannot change that code either. This is exactly the problem I was facing with NVIDIA's CUDA API. Although it is not a particularly hard problem, it turned out to be yet another example of how frustrating it is to look for a solution that should be well described, but is not.
Last month, I took a short course called Introduction to Multicore Programming, which was taught by James T. Demers of Physical Sciences Inc. This course introduced me to current hardware and software systems for parallel computing. Personally, the offering of this course was timely: I had been reading about how important parallel programming is becoming, and I was starting to become interested in parallel algorithms, but my knowledge of parallel computing was embarrassingly sparse. The last time I looked at the subject was in the early 1990s, when I wrote a program that used MPI. Though I have an Intel Q6600 quad-core CPU machine (Figure 1, Figure 2), a byproduct of the rivalry between Intel and AMD, I never really bothered to program it for parallel computing because I did not think four cores would offer that much over one core. What I learned from this course was that I had my head in the sand. In fact, I was surprised to learn that the graphics card I owned, an NVIDIA GeForce 9800 GT (Figure 3), was a parallel computing system which I could program. So, I decided to apply what I learned in the class by solving two programming exercises on my system (Figure 4): matrix multiplication and graph breadth-first search. In this post, I describe the first problem, matrix multiplication.
For several months, I had been editing a new edition of a textbook (Atlas of the Canine Brain, ISBN 978-0-916182-17-5). This book was first published in Russian in 1959, then translated and published in English in 1964. Although the English book was for sale, the publishing company (NPP Books) had only a limited number of copies left. So, a Print-On-Demand (POD) version of the book was needed. Of course, in 1964 there were no personal computers. (Even in the early '70s, I was still using punched cards.) The book was typewritten on 8.5" by 11" paper, but the original manuscript, which also included the figures, was lost. Fortunately, the text and figures were recovered from the Russian and English books using a scanner and optical character recognition (OCR). Call me old-fashioned, but it still seems quite remarkable that the technology exists to recover text from old books.
Visual Studio 2010 has some interesting new features, one of which is support for parallel programming. This simple example, written in C# and F#, tests the difference between serial and parallel computations of Fibonacci numbers. The example uses the System.Threading.Tasks.Parallel.For method. Indeed, there is about a 4-fold improvement on my quad-core machine for long computations, but not for short ones. There is a trade-off between the size of the computation and the overhead of creating, scheduling, and synchronizing threads.
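To make the serial-versus-parallel comparison concrete, here is a minimal sketch in standard C++ rather than the C# and F# of the original example. It is not the code from the post: it swaps Parallel.For for plain std::thread workers that each take a strided share of the indices, and the names fib, fib_serial, and fib_parallel are hypothetical, chosen only for illustration.

```cpp
#include <cstdint>
#include <thread>
#include <vector>

// Naive recursive Fibonacci, deliberately slow so each call is real work.
std::uint64_t fib(unsigned n) {
    return n < 2 ? n : fib(n - 1) + fib(n - 2);
}

// Serial baseline: compute fib(0) .. fib(count - 1) one after another.
std::vector<std::uint64_t> fib_serial(unsigned count) {
    std::vector<std::uint64_t> out(count);
    for (unsigned i = 0; i < count; ++i) {
        out[i] = fib(i);
    }
    return out;
}

// Parallel version (assumed stand-in for Parallel.For): worker w handles
// indices w, w + workers, w + 2 * workers, ... Each index is written by
// exactly one thread, so no locking is needed.
std::vector<std::uint64_t> fib_parallel(unsigned count, unsigned workers) {
    if (workers == 0) {
        workers = 1;  // fall back to a single worker
    }
    std::vector<std::uint64_t> out(count);
    std::vector<std::thread> pool;
    pool.reserve(workers);
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&out, count, workers, w] {
            for (unsigned i = w; i < count; i += workers) {
                out[i] = fib(i);
            }
        });
    }
    for (auto& t : pool) {
        t.join();  // wait for every worker before returning the results
    }
    return out;
}
```

Timing fib_serial(n) against fib_parallel(n, 4) for a small and a large n is one way to see the trade-off: for small n the cost of creating and joining threads is likely to dominate, while for large n the work per thread should outweigh it. The strided split is a design choice to spread the expensive high indices across workers; a real Parallel.For uses its own partitioning and a thread pool.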