# AMD APP: Getting Device Assembly

Sometimes it is useful to look at the intermediate and assembly code for GPU programs.  This can lead to some interesting performance insights, especially for compiler writers.  Unfortunately, the AMD APP SDK is a bit limited on Linux, and the AMD APP KernelAnalyzer, which conveniently dumps the AMDIL and Device ISA for an OpenCL kernel, is not available on Linux.  However, digging through the AMD APP OpenCL Programming Guide, one finds an environment variable that can be used for the same purpose: GPU_DUMP_DEVICE_KERNEL.

According to the programming guide, this environment variable can take one of three values:

 1 Save intermediate IL files in local directory. 2 Disassemble ISA file and save in local directory. 3 Save both the IL and ISA files in local directory.

Therefore, if you run your OpenCL program with:

$GPU_DUMP_DEVICE_KERNEL=3 ./my-program  You will get two files in your local directory: [kernel-name]_[device-name].il and [kernel-name]_[device-name].isa, which contain AMDIL and Device ISA disassembly, respectively. Posted Thu 09 February 2012 by Justin Holewinski in Programming (GPU, OpenCL, AMD) # LLVM 3.0: PTX Backend NOTE: The information is this article only applies to LLVM 3.0 and 3.1. As of LLVM 3.2, the PTX back-end has been replaced with the NVPTX back-end. With the release of LLVM 3.0, the PTX back-end is now in a fairly usable state. It even integrates with the Clang OpenCL front-end to produce correct PTX code usable by the nVidia OpenCL run-time. However, please note that the back-end is still experimental and there are unimplemented features. As always, please post any questions to the llvm-dev mailing list. In this post, I aim to give a quick overview of how to use the back-end to compile OpenCL kernels. As an example, consider the following matrix multiplication routine written in OpenCL:   1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 #define BLOCK_SIZE 16 __kernel void matmul(__global float* A, __global float* B, __global float* C) { __local float scratchA[BLOCK_SIZE][BLOCK_SIZE]; __local float scratchB[BLOCK_SIZE][BLOCK_SIZE]; int globalX = get_global_id(0); int globalY = get_global_id(1); int size = get_global_size(0); int k; float sum = 0.0f; int numBlocks = size / BLOCK_SIZE; int b; int tidX = get_local_id(0); int tidY = get_local_id(1); for(b = 0; b < numBlocks; ++b) { // Populate a cache for A/B int x; int y; x = b * BLOCK_SIZE + tidX; y = globalY; scratchA[tidY][tidX] = A[y * size + x]; x = globalX; y = b * BLOCK_SIZE + tidY; scratchB[tidY][tidX] = B[y * size + x]; barrier(CLK_LOCAL_MEM_FENCE); for(k = 0; k < BLOCK_SIZE; ++k) { float myA; float myB; myA = scratchA[tidY][k]; myB = scratchB[k][tidX]; sum += myA * myB; } barrier(CLK_LOCAL_MEM_FENCE); } C[globalY * size + globalX] = sum; }  We can use the libclc library, written by Peter Collingbourne, to provide the OpenCL built-in functions for Clang. This library will map OpenCL built-in functions to target-specific functions in the LLVM IR that the PTX back-end knows how to handle. If$LIBCLC points to the download of libclc, then you can invoke Clang with:

clang -ccc-host-triple ptx32
-Xclang -target-feature -Xclang +ptx23
-Xclang -target-feature -Xclang +sm20
-I$LIBCLC/include/generic -I$LIBCLC/include/ptx
-include clc/clc.h -Dcl_clang_storage_class_specifiers
-O3 matmul_kernel.cl -S -o matmul_kernel.ptx


The options can be a bit verbose at the moment, but practically all of them can be placed in a wrapper script.  Clang will compile the kernel and emit the generated PTX code to matmul_kernel.ptx.  This code can then be loaded as an OpenCL binary kernel using the nVidia OpenCL SDK, using the clCreateProgramWithBinary function.  As an added bonus, the performance is about the same as if the kernel was compiled using the nVidia OpenCL compiler!

Posted Fri 02 December 2011 by Justin Holewinski in Programming (GPU, LLVM, OpenCL)