Justin's Code Haus
Ramblings of a compiler engineer

Author: Justin Holewinski

LLVM 3.0: PTX Backend

NOTE: The information is this article only applies to LLVM 3.0 and 3.1. As of LLVM 3.2, the PTX back-end has been replaced with the NVPTX back-end.

With the release of LLVM 3.0, the PTX back-end is now in a fairly usable state.  It even integrates with the Clang OpenCL front-end to produce correct PTX code usable by the nVidia OpenCL run-time.  However, please note that the back-end is still experimental and there are unimplemented features.  As always, please post any questions to the llvm-dev mailing list.

In this post, I aim to give a quick overview of how to use the back-end to compile OpenCL kernels.

As an example, consider the following matrix multiplication routine written in OpenCL:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
#define BLOCK_SIZE 16

__kernel
void matmul(__global float* A, __global float* B, __global float* C) {

  __local float scratchA[BLOCK_SIZE][BLOCK_SIZE];
  __local float scratchB[BLOCK_SIZE][BLOCK_SIZE];

  int globalX = get_global_id(0);
  int globalY = get_global_id(1);
  int size = get_global_size(0);
  int k;
  float sum = 0.0f;
  int numBlocks = size / BLOCK_SIZE;
  int b;

  int tidX = get_local_id(0);
  int tidY = get_local_id(1);

  for(b = 0; b < numBlocks; ++b) {
    // Populate a cache for A/B
    int x;
    int y;

    x = b * BLOCK_SIZE + tidX;
    y = globalY;

    scratchA[tidY][tidX] = A[y * size + x];

    x = globalX;
    y = b * BLOCK_SIZE + tidY;

    scratchB[tidY][tidX] = B[y * size + x];

    barrier(CLK_LOCAL_MEM_FENCE);

    for(k = 0; k < BLOCK_SIZE; ++k) {
      float myA;
      float myB;

      myA = scratchA[tidY][k];
      myB = scratchB[k][tidX];

      sum += myA * myB;
    }

    barrier(CLK_LOCAL_MEM_FENCE);
  }

  C[globalY * size + globalX] = sum;
}

We can use the libclc library, written by Peter Collingbourne, to provide the OpenCL built-in functions for Clang.  This library will map OpenCL built-in functions to target-specific functions in the LLVM IR that the PTX back-end knows how to handle.  If $LIBCLC points to the download of libclc, then you can invoke Clang with:

clang -ccc-host-triple ptx32
  -Xclang -target-feature -Xclang +ptx23
  -Xclang -target-feature -Xclang +sm20
  -I$LIBCLC/include/generic -I$LIBCLC/include/ptx
  -include clc/clc.h -Dcl_clang_storage_class_specifiers
  -O3 matmul_kernel.cl -S -o matmul_kernel.ptx

The options can be a bit verbose at the moment, but practically all of them can be placed in a wrapper script.  Clang will compile the kernel and emit the generated PTX code to matmul_kernel.ptx.  This code can then be loaded as an OpenCL binary kernel using the nVidia OpenCL SDK, using the clCreateProgramWithBinary function.  As an added bonus, the performance is about the same as if the kernel was compiled using the nVidia OpenCL compiler!

Posted Fri 02 December 2011 by Justin Holewinski in Programming (GPU, LLVM, OpenCL)

Space Hogs Binary Release

I've converted my old Space Hogs game project to XNA 4.0 (it was originally written in XNA 1.0). There were enough API changes to make it a pain, but I think I have everything working now.

This game was developed by myself, Jason Kim, Joseph Ahn, Vjekoslav Kovacevic, and Daniel Guinn for a computer animation class during Winter Quarter 2007.

Spacehogs Screenshot

You can find a zip file here. This requires XNA 4.0 and the February 2010 DX packages to be installed on your machine. For convenience, I've included both of the redistributable packages in the zip file.

The source can be found on BitBucket.

Enjoy!

Posted Fri 17 February 2012 by Justin Holewinski in Programming (Windows, Games)

The Beauty of C++ Templates

Every so often, I'll get a random C++ question from a friend or colleague.  Most of the time the answers are trivial, at least for someone who has a history with the language.  Other questions make me stop and ponder, searching for the best "C++" way to do something.  Yesterday, the question was simple and the solution turned out to be equally simple, but getting to the solution made me stop and appreciate some of the cool things one can do with C++ templates.

The Problem

The problem was simple.  Suppose you have a C++ template class/struct that is parameterized by a single type, e.g.

template<typename T>
class my_data {
  // ...
private:
  T element_;
};

The Solution

Now, the question is, "how do I write a method for this class/struct that maps the type of T to an enumeration value?"  For context, the real problem involved mapping T to an MPI data type, e.g. (float -> MPI_FLOAT), (double -> MPI_DOUBLE), etc..

The first thought for anyone familiar with containers may be to explicitly generate a map, e.g. std::map in this case, to hold all possible mappings from the C++ type (via typeid()) to the MPI type (really just an integer).  Such a solution is certainly valid and may be the best way to approach the problem in another language such as C# or Java.  After pondering the "C++" solution to the problem for a few minutes, my colleague and I came up with a fairly elegant solution involving templates.  Or, at least I found it quite elegant.

/**
 * This struct wrappers the MPI data type value for the given C++ type.
 *
 * Any valid MPI data type value must have a corresponding explicit template
 * instantiation below.
 */
template<typename T>
struct mpi_type_wrapper {
  int mpi_type;
  mpi_type_wrapper();
};

// Explicit instantiation for `float'
template <>
mpi_type_wrapper::mpi_type_wrapper()
: mpi_type(MPI_FLOAT) {}

// Explicit instantiation for `double'
template <>
mpi_type_wrapper::mpi_type_wrapper()
: mpi_type(MPI_DOUBLE) {}

The mpi_type_wrapper struct is a convenient way to convert an arbitrary C++ type to an equivalent MPI type.  All one has to do is declare a local variable of type mpi_type_wrapper<T> (with appropriate T) and read the value of its mpi_type field.  Of course, none of this is specific to MPI in any way.  The only requirement is that an explicit instantiation of the constructor must be provided for any C++ types that are to be converted.

Why This Solution?

This solution strikes me as elegant for two reasons.  First, it is a solution that would be difficult, if not impossible, to express in many other languages.  Second, and most interesting to me, there is no run-time overhead associated with this solution.  You can even compile this with RTTI turned off.  Any reasonable compiler automatically inlines the appropriate constructor, then constant propagation replaces any uses of the mpi_type field with the appropriate MPI_* enumeration value.  There is no memory overhead associated with explicitly keeping a map at run-time, nor any time overhead of performing a map look-up.  The final code just uses the constant value!  If you do not believe me, check out this example:

/**
 * Some template class that needs to know the MPI_DataType value for its
 * template parameter type.
 */
template<typename T>
struct some_type {
  void printType() {
    mpi_type_wrapper<T> wrap;

    printf("My Type: %d", wrap.mpi_type);
  };
};

int main() {
  some_type<float> floatClass;
  some_type<double> doubleClass;

  floatClass.printType();
  doubleClass.printType();

  return 0;
}

And the generated code?

_main:
  pushq %rbx
  leaq L_.str(%rip), %rbx
  movq %rbx, %rdi
  xorl %esi, %esi
  xorb %al, %al
  callq _printf
  movl $1, %esi
  movq %rbx, %rdi
  xorb %al, %al
  callq _printf
  xorl %eax, %eax
  popq %rbx
  ret

Conclusion

While this example is probably trivial for most experienced C++ programmers out there, including myself, I always find myself stopping and appreciating such solutions.  In this case, C++ templates provide such an elegant and efficient solution that I cannot help feeling giddy.

Posted Fri 01 April 2011 by Justin Holewinski in Programming (C++)

UnrealScript: Brace Placement Matters!

I was playing around with the Unreal Development Kit this evening, and discovered a rather interesting quirk in the handling of braces within UnrealScript.  All of the sample code I read use a syntax style that places opening braces on the following line:

event PostBeginPlay()
{
  // Do something
}

However, my typical style places the opening brace on the current line:

event PostBeginPlay() {
  // Do something
}

Unfortunately, this does not seem to work for defaultproperties blocks. If you place the brace on the same line, the compiler will not give you any warnings or errors, but the entire defaultproperties block is just ignored!

So this code works:

defaultproperties
{
  PlayerControllerClass=class'MyPlayerController'
}

while the following code compiles but silently just ignores all of the contained settings:

defaultproperties {
  PlayerControllerClass=class'MyPlayerController'
}

I was banging my head on the wall for at least an hour figuring this one out!

I hope this can help prevent someone else from repeating my mistake.

Posted Sat 17 March 2012 by Justin Holewinski in Programming (Unreal)