In CUDA, OpenCL, and C++ AMP, a group is a collection of threads that execute in parallel (on current hardware, in lock-step sub-groups called warps or wavefronts). In CUDA, it is called a block; in OpenCL, it is called a work-group; in C++ AMP, it is called a tile. The purpose of a group is to allow threads within the group to communicate with each other using synchronization and/or shared memory. The size of a thread group is set by the programmer, but hardware constraints limit the maximum size, typically to 512 or 1024 threads. While programmers usually need to tailor algorithms to be aware of thread groups, there are a few tricks that can make programming easier.
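As a concrete illustration of group-level communication, here is a minimal CUDA sketch (names and the block size of 256 are my own, not from the text): each block sums 256 input values by staging them in shared memory and synchronizing with `__syncthreads()` between reduction steps.

```cuda
#include <cstdio>

// Each block sums 256 consecutive floats from `in` and writes one
// partial sum to `out[blockIdx.x]`. Launch with 256 threads per block.
__global__ void blockSum(const float *in, float *out)
{
    __shared__ float partial[256];   // visible to every thread in this block
    unsigned t = threadIdx.x;
    partial[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();                 // barrier: all loads are now visible

    // Tree reduction within the block, halving the active threads each step.
    for (unsigned stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride)
            partial[t] += partial[t + stride];
        __syncthreads();             // wait before reading neighbors' results
    }
    if (t == 0)
        out[blockIdx.x] = partial[0];
}
```

The same pattern exists in the other two environments: OpenCL uses `__local` memory with `barrier(CLK_LOCAL_MEM_FENCE)`, and C++ AMP uses `tile_static` storage with `tile_barrier::wait()`. The barrier is what makes the shared array safe to read after other threads have written it.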
Getting information about the underlying design of a GPGPU programming environment and its hardware can be difficult. Companies will not publish design information because they do not want you or other companies to copy the technology. But sometimes you need unpublished details of a technology in order to use it effectively. If they won’t tell you how the technology works, the only recourse is experimentation [1, 2]. What is the performance of OpenCL, CUDA, and C++ AMP? What can we learn from this information?
Every time Microsoft releases a new version of Visual Studio C++, NVIDIA releases a new version of CUDA to work with it. Unfortunately, that took NVIDIA an incredibly long time last time around. After Visual Studio 2010 was released on April 4, 2010, CUDA integration with Visual Studio 2010 didn’t arrive until CUDA 4.0 RC1 on March 4, 2011–almost a year later! And, to this day, the build rules have never worked cleanly for me (http://stackoverflow.com/questions/6156037/issue-with-production-release-of-cuda-toolkit-4-0-and-nsight-2-0). Because I’m developing C++ AMP and CUDA side by side, and cannot wait for NVIDIA, I decided to develop the build rules myself, and work out the details of calling CUDA from Visual Studio C++ 2012.