Monday, October 12, 2009

Hacking CUDA and OpenCL on Fedora 10

I discovered Fedora 11 is not compatible with NVIDIA's CUDA toolkit (now on version 2.3; see note about driver version below) because the latter requires GCC 4.3 where Fedora 11 provides GCC 4.4. So I'll have to back down to Fedora 10. Here are some handy notes for setting up Fedora 10. I installed a number of RPMs to get CUDA to build.
sudo yum install eclipse-jdt eclipse-cdt \
freeglut freeglut-devel kernel-devel \
mesa-libGLU-devel libXmu-devel libXi-devel

The Eclipse stuff wasn't all necessary for CUDA but I wanted it.

In a comment to an earlier posting, Jesper told me about OpenCL, a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. NVIDIA supports this and has an OpenCL implementation which required updating my NVIDIA drivers to version 190.29, more recent than the version 190.18 drivers on NVIDIA's CUDA 2.3 page. When I installed 190.29, it warned me that it was uninstalling the 190.18 drivers.

Python enthusiasts will be interested in PyOpenCL.

NVIDIA provides a lot of resources and literature for getting started with OpenCL.

Friday, October 09, 2009

Sampled profiling in Linux

A few years back I was doing some cross-platform work on a mostly-Python package with a bunch of math libraries. I happened to be jumping back and forth between Linux and Mac, and I discovered that the Mac had this magical performance measurement tool that, as far as I knew, didn't exist for Linux. How could that be? How could such a great idea, obviously amenable to open source, NOT have made it into the Linux world? I was flabbergasted.

That tool was a sampling profiler. It works like this. At regular intervals, you stop the processor and say, what are you doing? what process are you in? what thread are you in? what does your call stack look like? and you collect statistics on all this stuff, figure out the percentages of time spent in different places, and present it in an easily understandable visual tree format.

Thinking about it more, I saw that to make this work, you need to have built your code with debug enabled, so that all the symbols are present in the compiled binary. Most Linux software is built without debug symbols (or at least that was the common practice a few years ago) which is why this idea hadn't gotten traction in the Linux world.

So today I stumbled across a cool review of Fedora 11 Linux and near the bottom, it talks about some of the development work that Red Hat has been doing on the Eclipse IDE, including a sampling profiler called OProfile. There's a nice little video of a Red Hat engineer demonstrating OProfile. Very cool stuff. One of the most impressive things is that in addition to simply sampling at fixed time intervals, you can also choose to sample on events happening in the computer hardware, like when the processor drives data onto the front-side bus. Wow, yikes. I don't have a clear idea what that would help with, but I guess some performance issues can correlate to things happening in the hardware.

Here are some of my old notes I found on this topic. These notes are old and I didn't know about OProfile at the time, though OProfile originated in the same group at HP that did the QProf work described below. The old notes will be in italics to avoid confusion with more current information.

Finally, something the Mac has that Linux doesn't have

The Mac has this cool little utility called sample. Here's how it works. Suppose some program is running as process ID 1234. You type "sample 1234 60" in another terminal, and it interrupts that process 100 times per second for 60 seconds, getting a stack trace each time. You end up with a histogram showing the different threads and C call stacks, and it becomes very easy to figure out where the program is spending most of its time. It's very cool. It doesn't even significantly slow down the process under study.

I started looking to see if there was anything similar for Linux. There ought to be. This isn't so different from what GDB does. The closest thing I could find was something called pstack, which does this once, but doesn't seem so good at extracting the names of C routines. I've never seen pstack get a stack that was more than three calls deep. I also found a variant called lsstack.

I think the Mac can do this because Apple can dictate that every compile must have the -g switch turned on. The pstack utility came originally from the BSD world.

If there's ever a working sample utility for Linux, it will be a brilliant thing.

First you need a timer process that interrupts the target process. On each interrupt, you collect a stack trace. You do that by having an interrupt handler that gets the stack above it and throws it in some kind of data structure. In Mac OS, you can do all this without recompiling any of the target process code.

You run the timer for however long and collect all this stuff. You end up with a data structure with all these stacks in it. Now you can create a tree with histogram numbers for each node in the tree, where each node in the tree represents a program counter. Next you use map files to convert the program counters to line numbers in source files, and if you want brownie points, you even include the source line for each of them.

This should be feasible in Linux. Check out "man request_irq".

The only issue with making this work as well in Linux as it does on the Mac is that code that isn't built with "-g" will not have the debug information for pulling out function names and line numbers. Somebody (probably David Mosberger-Tang) has already done this project: