Whether you look inside a mobile device or a PC, you'll increasingly spy the CPU and GPU sitting together on the same silicon die. You might call this "integrated graphics" on a desktop or laptop, or a "system-on-chip" (SoC) in a smartphone or tablet. But don't let the cozy proximity fool you. These two guys have different personalities, speak different languages, boast different strengths and weaknesses, and previous efforts to blend one into the other have failed miserably. To understand the distinction, you only need to take a peek at their architecture.
Let's use a quad-core Ivy Bridge processor as an example. Even without labeling up the picture of the silicon die layout above, you might be able to guess where the CPU ends and the GPU begins. In the two-thirds of the chip on the left-hand side, you can see that the majority of the CPU's elements are broadly bunched into four identical groups. Each group represents a core and, in turn, each core consists of many intricate little features, as well as being surrounded by a wealth of shared resources that span all the cores. That's how it is with the CPU: its cores are rich, well-connected and big-brained enough to wade into even the most complex types of sequential problems.
By contrast, the GPU on the right side of the diagram is much dumber-looking. It breaks down into 16 identical groups, reflecting the fact that Ivy Bridge contains a 16-core GPU. But just look at how small and simple those cores are compared to the CPU cores! And they don't have a great deal of shared infrastructure, either.
Discrete graphics cards are more advanced, while mobile class GPUs are generally less so, but they all share this basic layout: many rows of small, identical clusters. The clusters are allowed to be small because they're specialized. They only respond to certain types of mathematical challenges, and they're mainly focused on putting the right color on the right pixel -- that's why they've historically been called "shaders."
Since they have to output many pixel and vertex values at the same time, shaders are designed to function in parallel. If you could zoom right into a graphics core like that on Ivy Bridge above or Kepler on the left, you'd just see more "parallelness" because they tend to be built out of row upon row of identical smaller units, like "arithmetic logic units" and "stream processors." To add a twist of humiliation, the type of work a GPU does may even be called "embarrassingly parallel" because it can be easily divided into streams and there's little need for cross-talk between the streams.
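To make that "embarrassingly parallel" idea concrete, here's a minimal Python sketch. It isn't real shader code: the tiny framebuffer, the shade function and the worker count are all hypothetical, and ordinary CPU threads stand in for GPU cores. The point is only that each pixel depends on nothing but its own coordinates, so the work can be split into rows and handed out in any order without changing the result.

```python
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT = 8, 4  # tiny, hypothetical framebuffer


def shade(x, y):
    """Hypothetical per-pixel 'shader': uses only its own coordinates."""
    r = (x * 255) // (WIDTH - 1)   # red ramps from left to right
    g = (y * 255) // (HEIGHT - 1)  # green ramps from top to bottom
    return (r, g, 128)


def shade_row(y):
    # One independent stream of work: no row needs data from any other row.
    return [shade(x, y) for x in range(WIDTH)]


def render_sequential():
    return [shade_row(y) for y in range(HEIGHT)]


def render_parallel(workers=4):
    # Rows may be processed in any order; map() hands them back in row order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(shade_row, range(HEIGHT)))


if __name__ == "__main__":
    assert render_parallel() == render_sequential()
    print("parallel and sequential frames match")
```

Because there's no cross-talk between rows, adding more workers never forces a change to the shading logic itself, which is the property GPUs are built to exploit.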
Anyway, that's how it used to be. We're here to tell you how it is. Because a few things have changed over the past few years, and the kid that was once the butt of playground jokes has now attracted a following of engineers and marketers who are convinced of his genius.
These are the people who used to tell us about how many polygons or triangles a GPU could process per second, but who now insist that we acknowledge its compute performance as well. This performance is measured in floating-point operations per second, usually counted in gigaFLOPS in a phone or tablet GPU, and teraFLOPS in a PC, plus obscenely advanced double-precision floating-point operations per second (which are also counted in FLOPS, though there tend to be fewer of them).
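For a feel of where those headline FLOPS figures come from, here's a back-of-the-envelope sketch. It assumes the usual rule of thumb that theoretical peak equals cores times clock speed times operations per core per cycle, with two operations per cycle standing for one fused multiply-add. The core counts, clock speeds and the 1/24 double-precision ratio below are invented for illustration and don't describe any particular product.

```python
def peak_gflops(cores, clock_ghz, flops_per_core_per_cycle=2):
    """Theoretical peak in gigaFLOPS (2 = one fused multiply-add per cycle)."""
    return cores * clock_ghz * flops_per_core_per_cycle


# Hypothetical parts: every number here is an assumption for illustration.
phone_gpu = peak_gflops(cores=16, clock_ghz=0.5)
pc_gpu = peak_gflops(cores=1536, clock_ghz=1.0)

# Double precision usually runs at some fraction of the single-precision
# rate; the 1/24 ratio is an assumed value, not a spec.
pc_gpu_double = pc_gpu / 24

if __name__ == "__main__":
    print(f"phone GPU: {phone_gpu} GFLOPS")                # 16.0
    print(f"PC GPU:    {pc_gpu / 1000} TFLOPS")            # 3.072
    print(f"PC GPU, double precision: {pc_gpu_double} GFLOPS")  # 128.0
```

These are theoretical ceilings; real workloads land below them, which is one reason benchmark results matter more than the spec sheet.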
Further emphasizing the powers of compute, NVIDIA's graphics cores are now called CUDA cores (Compute Unified Device Architecture). Intel too, despite having once been publicly skeptical of CUDA's prowess versus the traditional CPU, has specifically reworked the Ivy Bridge graphics component in order to embrace GPU compute. AMD, with perhaps a little inspiration from Pepsi, talks about Graphics Core Next (GCN -- shown below) powering its cards, and also spearheads the HSA Foundation to promote "heterogeneous computing" (yet another term for what we're talking about here). ARM -- which happens to be a member of the HSA Foundation -- promises that its new Mali-T604 will be the first "significant" compute-capable GPU in the mobile space, and will, for example, allow mobile phones to deliver the type of image processing and stabilization performance you'd find in a "$1,000 DSLR."
This isn't all just spin. Something real is going on beneath all this new enthusiasm and jargon, and it boils down to this: Although they remain specialized compared to CPU cores, graphics cores are now much more flexible than they used to be. Instead of having a core just for shading pixels and another just for calculating vertices, all the cores make themselves available to carry out a range of increasingly complex mathematical tasks. This is often called a unified shader architecture.
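Why does unification matter? A toy model helps. The sketch below assumes every job takes exactly one cycle on any unit, and the job and unit counts are made up. It compares a design with separate vertex and pixel units against a unified pool of the same total size when the workload is lopsided.

```python
from math import ceil


def cycles_fixed(vertex_jobs, pixel_jobs, vertex_units, pixel_units):
    """Dedicated units: each pool can only work on its own kind of job."""
    return max(ceil(vertex_jobs / vertex_units),
               ceil(pixel_jobs / pixel_units))


def cycles_unified(vertex_jobs, pixel_jobs, units):
    """Unified units: any unit can pick up any job."""
    return ceil((vertex_jobs + pixel_jobs) / units)


if __name__ == "__main__":
    # A pixel-heavy frame: 2 vertex jobs, 14 pixel jobs, 8 units in total.
    print(cycles_fixed(2, 14, vertex_units=4, pixel_units=4))  # 4
    print(cycles_unified(2, 14, units=8))                      # 2
```

In the fixed design, the vertex units sit idle while the pixel units grind through their backlog. The unified pool keeps every unit busy, and the same flexibility is what lets those units take on non-graphics math.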
And how does a programmer talk to the cores? It's done through an application programming interface (API) such as Open Compute Language (OpenCL), which is meant to be able to communicate with unified shaders in any brand of product, and which is strongly supported by AMD. In addition to OpenCL, NVIDIA also has an API specifically for talking to its CUDA cores. Microsoft has a collection of APIs called DirectX, which -- since version 10, but especially version 11 -- have insisted that compatible GPUs are able to understand their instructions. There's also Google's RenderScript, which does a similar job specifically for Android devices.
So, now that the programmer can take the reins of the GPU and make it do more interesting things, what are the results? The two greatest impacts have undoubtedly been more grey hairs and less time spent with families, since programming for the GPU is notoriously difficult. Aside from that, however, the examples that are usually held up to prove the power of GPU compute include scientific applications, such as engineering simulations running on AMD's and NVIDIA's professional FirePro and Quadro graphics cards. Beyond that, proponents often point to smaller productivity apps like WinZip that are able to use OpenCL to speed up tasks, or just regular games.
NVIDIA and AMD like to show off the benefits of the DirectCompute API that comes with DirectX 11, including effects like ambient occlusion (AO), in which the GPU calculates realistic shadows by tracing out the paths of light from different sources. There's also depth-of-field trickery in games like Metro 2033 (shown below), where DirectCompute is used to blur more distant parts of the landscape. What about mobile gaming? The benefits of GPU compute in a smartphone or tablet have barely been glimpsed as yet, but we did recently catch a preview of enhanced OpenCL physics and lighting effects on ARM's new Mali-T604 GPU, running as part of a Samsung Exynos 5 chip that could very likely power the successor to the Galaxy S III.
All of this is great for gamers, but it's not quite living up to GPU compute's promise of providing more than just graphics. For that, we can turn to one last gaming example, whose cunning use of the GPU isn't just about visuals, but also about the deeper mathematical power of what highly parallel cores are capable of.
Civilization V came out towards the end of 2010, but its achievements in this area have yet to be bettered, and it's AnandTech's game of choice for testing a GPU's compute ability. Firaxis's philosophy is simple: It'll divert any task to the GPU that can be readily broken down into "buckets" or "work chunks" that are able to be processed by the new and "exotic" algorithms allowed by GPU compute. This includes not only the way the game looks, but also the way it is -- in other words, what it is able to accomplish with available system resources.
For example, one of Civ V's biggest selling points is its vast, randomly generated maps. The size of the game terrain would normally be limited by available system resources, since just the contours of a single map can weigh up to 128MB -- too burdensome for most systems to store, process and display in real-time. But GPU compute offers a smart way around this.
Instead of randomly generating an entire, playable map, Firaxis's designers only use the CPU to generate a "height map" that contains simple coordinates and 8-bit height values. This weighs a much friendlier 64MB, but it cannot be played on because it's too blocky -- there would be visible steps between points with different heights. So, this approach requires one more layer of effort: converting the height map (or at least its visible tiles) into a "normal map" which is smooth and natural enough to be subjected to lighting effects and then displayed on screen. The problem with this is that it's mathematically intense -- it requires sophisticated filters that are too much for the CPU.
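To give a rough idea of what turning heights into normals involves, here's a small Python sketch. It uses central differences, one common way to estimate a surface's slope from a grid of heights. We're assuming that technique for illustration; it isn't necessarily what Firaxis does, and the function name, the scale parameter and the tiny test grid are all hypothetical.

```python
from math import sqrt


def normal_map(heights, scale=1.0):
    """Estimate a unit surface normal at each point of a 2D height grid."""
    rows, cols = len(heights), len(heights[0])

    def h(r, c):
        # Clamp at the borders so edge points reuse their nearest neighbor.
        r = min(max(r, 0), rows - 1)
        c = min(max(c, 0), cols - 1)
        return heights[r][c]

    normals = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # Slope in each direction, from the neighbors on either side.
            dx = (h(r, c + 1) - h(r, c - 1)) * 0.5 * scale
            dy = (h(r + 1, c) - h(r - 1, c)) * 0.5 * scale
            length = sqrt(dx * dx + dy * dy + 1.0)
            # The normal leans away from the uphill direction.
            row.append((-dx / length, -dy / length, 1.0 / length))
        normals.append(row)
    return normals


if __name__ == "__main__":
    ramp = [[0, 1, 2],
            [0, 1, 2],
            [0, 1, 2]]  # rises steadily from left to right
    print(normal_map(ramp)[1][1])
```

Each output value depends only on a handful of neighboring heights, so every point can be computed independently, which is why this kind of job splits so neatly across many small cores.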
The first filter that is applied to the height map is a simple Gaussian blur, which the GPU handles with ease, but which -- if left to its own devices -- would simply smooth out all the dramatic extremes in the height map and leave us with a bunch of bland rolling hills. Solving this necessitates a whole new level of math: a filter called bilateral weighting, which passes over the scene and selectively reduces blurring at points of extreme contour changes, thereby preserving edges and rescuing detail. (See the slide below, courtesy of Firaxis and AMD.) The CPU would struggle to complete that kind of non-linear task in real-time. In fact, a recent comparison by Tom's Hardware (see More Coverage) showed that even the best CPUs struggle with smart blurring effects compared to even a budget GPU, which will generally complete the work between two and five times faster.
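The difference between the two filters is easier to see in one dimension. The sketch below is a textbook-style Gaussian blur and bilateral filter run over a list of heights containing a single cliff. It's an illustration of the general idea rather than Firaxis's actual code, and the radius and sigma values are arbitrary assumptions.

```python
from math import exp


def gaussian_blur(signal, radius=2, sigma_space=1.0):
    """Weight each neighbor purely by how far away it is."""
    n, out = len(signal), []
    for i in range(n):
        total = weight_sum = 0.0
        for k in range(-radius, radius + 1):
            j = min(max(i + k, 0), n - 1)  # clamp at the ends
            w = exp(-(k * k) / (2 * sigma_space ** 2))
            total += w * signal[j]
            weight_sum += w
        out.append(total / weight_sum)
    return out


def bilateral_blur(signal, radius=2, sigma_space=1.0, sigma_range=10.0):
    """Also down-weight neighbors whose height differs a lot from ours."""
    n, out = len(signal), []
    for i in range(n):
        total = weight_sum = 0.0
        for k in range(-radius, radius + 1):
            j = min(max(i + k, 0), n - 1)
            diff = signal[j] - signal[i]
            w = (exp(-(k * k) / (2 * sigma_space ** 2))
                 * exp(-(diff * diff) / (2 * sigma_range ** 2)))
            total += w * signal[j]
            weight_sum += w
        out.append(total / weight_sum)
    return out


if __name__ == "__main__":
    cliff = [0, 0, 0, 0, 100, 100, 100, 100]
    print([round(v, 1) for v in gaussian_blur(cliff)])
    print([round(v, 1) for v in bilateral_blur(cliff)])
```

On this input, the Gaussian version smears the cliff into a slope, while the bilateral version leaves it almost untouched because neighbors on the far side of the drop get next to no weight. That extra per-neighbor weight depends on the data itself, which is what makes the filter non-linear and more expensive.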
Without GPU compute, maps in Civ V would either be smaller, less random, less realistic or they'd require an extremely powerful system. Perhaps the only thing we can say against this type of use of GPU compute is that it hasn't gone further, and unburdened the CPU enough to make the game run smoothly on systems with even less power than the current minimum specs. We have a $700 AMD-powered Samsung Series 5 laptop that contains a relatively healthy DirectX 11- and OpenCL-compatible GPU -- the Radeon 7500G. But it struggles to run Civ V at over 25 fps, even when there's hardly anything on the map. That's a 2010 game running on a 2012 system, so there are clearly limits to what GPU compute has so far achieved.
We have to fit one more example in here, because it's important to end on a high note -- even if it's slightly discordant. The final proof that GPU compute has the power to change how people do things is demonstrated in the video below. It shows how either OpenCL or CUDA can be used to allow real-time rendering of very complex effects in the latest version of Adobe's video editing software, Premiere Pro CS6, so long as you have a supported graphics card. It wasn't so long ago that we'd apply an effect -- such as color correction -- to a clip and then have to sit back and watch the computer render our changes before we could play them back. But Adobe has been able to shift its entire rendering engine over to the GPU and, in the process, demonstrate just what those parallel, math-crunching transistors are truly capable of.
Now for that minor touch of discord: the video itself is pretty self-explanatory, but there's something it doesn't show. After doing the filming, we tried to get Adobe's GPU-accelerated playback engine to work on a Retina MacBook Pro, which contains an NVIDIA GT 650M graphics card. Officially, this graphics processor contains 384 CUDA cores and it supports OpenCL too, so by rights it ought to work -- even though Adobe hasn't certified it for CS6. Undeterred, we did a little hackery in order to convince CS6 that our graphics card really was CUDA-compatible, and this allowed us to activate GPU acceleration on the Mercury Engine. The result? Instability. Plus much, much slower performance. We don't lay an ounce of blame at Adobe's door, since the company is perfectly clear about which cards are supported, and the 650M isn't yet on that list. Nevertheless, this fussiness over devices shows OpenCL and CUDA aren't quite mature and pan-device yet -- if they were, Adobe engineers wouldn't have to bother with certifying individual cards.
It's easy to get lulled into a feeling that, just because billion-dollar companies and extremely smart engineers put their weight behind something, it's going to work. Reality doesn't always bear such confidence out. Just look at that other alleged GPU revolution: hardware-accelerated video transcoding. We won't go into it in detail here, but suffice it to say that this trend started out with similar claims to GPU compute -- albeit more niche -- and Intel, AMD and NVIDIA all adjusted their silicon to support it. But recent tests by The Tech Report found that the whole thing was a "mess" and that, despite everyone's best efforts, the smartest way to decode video is to simply arm yourself with a fast CPU and decent software -- in other words, to ignore the GPU just like we did in the past.
We can believe that dismal verdict, and that's why we also reckon that GPU compute needs a lot more attention before a happy future can be guaranteed. Developers shouldn't have to spend ages tweaking GPU algorithms to run on different devices that are already supposed to be compatible. Crucially, instead of demanding the most expensive hardware, GPU compute needs to get to a point where it actually saves people money.
That said, this project has already come too far for it to fail completely. We've seen how terms like "CUDA" and "Graphics Core Next" relate to genuine changes in the nature of a GPU. Highly specialized areas of silicon have been transformed into unified compute cores that can be exploited by programmers to pull off useful tricks in a range of applications, from gaming to digital photography. Sure, GPU compute is messy in some areas, but it's hardly a mess. When we come to check out new graphics cards, or mobile phone processors, or the next big thing in gaming, we'll be looking with even keener eyes to make sure that those GPU compute credentials are present and correct.