It’s time to pull back the curtain on the Turing GPU inside Nvidia’s radical new GeForce RTX 20-series, the first-ever graphics cards designed to handle real-time ray tracing thanks to the inclusion of dedicated tensor and RT cores. But the GeForce RTX 2080 and RTX 2080 Ti were also designed to significantly improve performance in traditionally rendered games, with enough power to feed those blazing-fast 4K, 144Hz G-Sync HDR gaming monitors.
Nvidia revealed plenty of numbers during the GeForce RTX 2080 Ti announcement. Clock speeds, memory bandwidth, CUDA core counts—it was all there. This deeper dive explains the underlying, architectural changes that make Nvidia’s Turing GPU more potent than its Pascal predecessor. We’ll also highlight some new Nvidia tools that developers can embrace to speed up performance even more, or bring the AI-boosted power of Nvidia’s Saturn V supercomputer into your graphics card.
This isn’t a review of the GeForce RTX 2080 or GeForce RTX 2080 Ti. It’s an examination of the Turing architecture itself. Head over to our exhaustive GeForce RTX 2080 and 2080 Ti review for a full benchmark-backed evaluation of their speeds, feeds, and promises.
Nvidia Turing GPU overview
Before we dig in, here’s a high-level specifications overview for the Turing TU102 GPU inside the flagship GeForce RTX 2080 Ti.
Here’s Nvidia’s high-level overview, in case what you’re looking at isn’t clear:
“The TU102 GPU includes six Graphics Processing Clusters (GPCs), 36 Texture Processing Clusters (TPCs), and 72 Streaming Multiprocessors (SMs). Each GPC includes a dedicated raster engine and six TPCs, with each TPC including two SMs. Each SM contains 64 CUDA Cores, eight Tensor Cores, a 256 KB register file, four texture units, and 96 KB of L1/shared memory which can be configured for various capacities depending on the compute or graphics workloads… Tied to each memory controller are eight ROP units and 512 KB of L2 cache.”
You’ll also find a single RT processing core within each SM, so there are 72 in the full TU102 GPU. The GeForce RTX 2080 Ti ships with 68 of those SMs enabled, which gives it 68 RT cores and 544 tensor cores. Because the RT and tensor cores are baked right into each streaming multiprocessor, the lower you go in the GeForce RTX 20-series lineup, the fewer you’ll find of each. The RTX 2080 has 46 RT cores and 368 tensor cores, for example, and the RTX 2070 will have 36 RT cores and 288 tensor cores.
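If you want to sanity-check those counts, the arithmetic is simple enough to sketch in a few lines of Python. The per-SM figures come straight from Nvidia's description above; the SM counts for the RTX 2080 and RTX 2070 are inferred from their RT core counts (one RT core per SM), and the function names are just illustrative.

```python
# Per-SM resources, taken from Nvidia's description quoted above.
CUDA_CORES_PER_SM = 64
TENSOR_CORES_PER_SM = 8
RT_CORES_PER_SM = 1


def full_tu102_sm_count(gpcs=6, tpcs_per_gpc=6, sms_per_tpc=2):
    """SMs in a fully enabled TU102: GPCs x TPCs per GPC x SMs per TPC."""
    return gpcs * tpcs_per_gpc * sms_per_tpc


def unit_counts(sm_count):
    """Scale the per-SM resources up to a whole GPU with sm_count SMs."""
    return {
        "sms": sm_count,
        "cuda_cores": sm_count * CUDA_CORES_PER_SM,
        "tensor_cores": sm_count * TENSOR_CORES_PER_SM,
        "rt_cores": sm_count * RT_CORES_PER_SM,
    }


if __name__ == "__main__":
    # SM counts for the RTX 2080 and 2070 are inferred from their RT core counts.
    for name, sms in [("Full TU102", full_tu102_sm_count()),
                      ("RTX 2080", 46),
                      ("RTX 2070", 36)]:
        print(name, unit_counts(sms))
```

Because every count is a multiple of the SM count, trimming SMs from a cheaper card trims CUDA, tensor, and RT cores in lockstep.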
With all that cutting-edge stuff packed in, all requiring dedicated hardware, it shouldn’t come as a surprise that Turing is absolutely massive. The die measures in at a whopping 754mm², compared to the 471mm² Pascal GPU inside the GTX 1080 Ti.
Inside Nvidia Turing GPU: Shading and memory improvements
Let’s explain the improvements to the long-established stuff before digging into the exotic new tensor and RT cores.
Nvidia says the GeForce RTX 2080 can be roughly 50 percent faster than the GTX 1080 in traditional games. Many of the comparisons occur in games with HDR enabled, which take a performance hit on current GTX 10-series cards. The GeForce RTX 2080 can be more than twice as fast as the GTX 1080 in games that support Nvidia’s DLSS technology, Nvidia claims (we’ll talk more about DLSS later), and surpass 60 frames per second in several triple-A games at 4K resolution with HDR visuals enabled.
The real-world performance of Nvidia’s high-end RTX duo remains to be seen. (Update: Read our GeForce RTX 2080 and 2080 Ti review.) Nvidia hasn’t uttered a peep about the GeForce RTX 2080 Ti’s performance in traditional games. We still have no idea how the GeForce RTX 2080 compares against the older GTX 1080 Ti in non-HDR games. Nvidia’s frame rates in the 4K/60 HDR games listed above don’t mention what graphics settings they were tested with.
Because the RTX 2080 is at least in the same performance ballpark as the GTX 1080 Ti despite having nearly 20 percent fewer overall CUDA cores, it’s clear those CUDA cores have been upgraded.
Scratch that: The Turing GPU’s streaming multiprocessors haven’t just been upgraded, they’ve been overhauled. Beyond the addition of tensor and RT cores, Nvidia also added a new integer pipeline (INT32) alongside the floating point pipeline (FP32) traditionally used to process shading.
When Nvidia examined how real-world games behaved, it found that for every 100 floating point instructions performed, an average of 36 and as many as 50 non-floating point instructions were also processed, jamming things up. The new integer pipeline handles those extra instructions separately from and concurrently with the FP32 pipeline. Executing the two tasks at the same time results in a big speed boost, according to Jonah Alben, Nvidia’s VP of GPU engineering.
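To get a feel for why running the two pipelines side by side matters, here's a deliberately simplified model. It assumes every instruction occupies exactly one issue slot and that integer work can always overlap with floating point work, which real shaders won't manage perfectly because of dependencies and memory stalls. Treat the result as a best-case ceiling, not a benchmark.

```python
def issue_slots(fp_instructions, int_instructions, concurrent):
    """Toy model: each instruction occupies one issue slot on its pipeline.

    With a single shared pipeline the two instruction types queue up one
    after another. With separate FP32 and INT32 pipelines they overlap,
    so the longer of the two streams sets the total.
    """
    if concurrent:
        return max(fp_instructions, int_instructions)
    return fp_instructions + int_instructions


def ideal_speedup(fp_instructions, int_instructions):
    """Best-case speedup from overlapping INT32 work with FP32 work."""
    shared = issue_slots(fp_instructions, int_instructions, concurrent=False)
    split = issue_slots(fp_instructions, int_instructions, concurrent=True)
    return shared / split


if __name__ == "__main__":
    # Nvidia's quoted mix: 36 (average) to 50 (worst case) per 100 FP instructions.
    print(ideal_speedup(100, 36))
    print(ideal_speedup(100, 50))
```

Under these assumptions, the average mix of 36 integer instructions per 100 floating point ones works out to an ideal gain of 36 percent, and the worst-case mix of 50 works out to 50 percent.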
Nvidia also rejiggered how the memory caches inside its streaming multiprocessors work. Now, smaller SMs each feed into a unified pool of L1 and shared memory, which in turn feeds into an L2 cache that’s twice as large as before. The shake-up means Turing has almost three times more L1 memory available than the Pascal GPUs in the GTX 10-series, with twice as much bandwidth and lower latency.
Add it all up and Nvidia claims that the Turing GPU performs traditional shading a whopping 50 percent better than Pascal. That’s a massive architectural improvement, though the actual gains will vary from game to game, as shown in the slide above.
But games aren’t bound by shading performance alone. Memory bandwidth can directly affect how well your games play. Turing improves upon Pascal’s superb memory compression technology, and the GeForce RTX 2080 and 2080 Ti build atop that with the introduction of Micron’s next-gen GDDR6 memory—the first time it’s appeared in a GPU. GDDR6 blazes along at 14Gbps despite being 20 percent more power-efficient than GDDR5X, and Nvidia optimized Turing’s RAM for 40 percent lower crosstalk than in its predecessor.
The grab-bag of improvements gives the RTX 2080 Ti a 50-percent increase in effective memory bandwidth over the GTX 1080 Ti, Nvidia says. In real-world terms, the GeForce RTX 2080 Ti hits a total memory bandwidth of 616GBps, versus the GTX 1080 Ti’s 484GBps, even though both cards offer identical memory capacities and bus sizes. That’s the power of GDDR6.
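Those raw bandwidth numbers fall out of a simple formula: bus width times per-pin data rate, divided by eight to convert bits to bytes. The sketch below assumes the 352-bit bus and the GTX 1080 Ti's 11Gbps GDDR5X from the cards' published spec sheets, neither of which is spelled out in the text above.

```python
def peak_bandwidth_gbytes(bus_width_bits, data_rate_gbps):
    """Peak memory bandwidth in GB/s: bus width x per-pin rate / 8 bits per byte."""
    return bus_width_bits * data_rate_gbps / 8


if __name__ == "__main__":
    # Assumed from published specs: both cards use a 352-bit memory bus.
    rtx_2080_ti = peak_bandwidth_gbytes(352, 14)  # GDDR6 at 14Gbps
    gtx_1080_ti = peak_bandwidth_gbytes(352, 11)  # GDDR5X at 11Gbps
    print(rtx_2080_ti, gtx_1080_ti, rtx_2080_ti / gtx_1080_ti)
```

The raw uplift from the faster memory alone comes to roughly 27 percent, which suggests the rest of Nvidia's 50 percent "effective" figure leans on the compression and signaling improvements.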
Turing’s new shading technologies
As with most major GPU architecture launches, Nvidia also introduced some new shading technologies that developers can take advantage of to improve performance, visuals, or both.
Mesh shading helps take some of the burden off your CPU during very visually complex scenes, with tens or hundreds of thousands of objects. It consists of two new shader stages. Task shaders perform object culling to determine which elements of a scene need to be rendered. Once that’s decided, mesh shaders determine the level of detail at which the visible objects should be rendered. Ones that are farther away need a much lower level of detail, while closer objects need to look as sharp as possible.
Nvidia showed off mesh shading with an impressive, playable demo where you flew a spaceship through a massive field of 300,000 asteroids. The demo ran around 50 frames per second despite that gargantuan object count because mesh shading reduced the number of drawn triangles at any given point down to around 13,000, from a maximum of 3 trillion potential drawn triangles. Intriguing stuff.
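The two-stage idea is easier to picture with a toy version. The sketch below runs on the CPU in plain Python, so it isn't how the real GPU shader stages are written; it only mirrors the logic of culling first, then picking a level of detail. The distance-only culling test, the `LOD_LEVELS` table, and every name in it are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Asteroid:
    position: tuple  # (x, y, z) in world space


# Hypothetical level-of-detail table: (maximum distance, triangle count).
LOD_LEVELS = [(50.0, 5000), (200.0, 500), (1000.0, 50)]


def distance(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def task_stage(objects, camera, max_distance):
    """Culling step: keep only objects within draw distance.

    A real task shader would also test against the view frustum and
    occluders; distance alone stands in for that here.
    """
    return [o for o in objects if distance(o.position, camera) <= max_distance]


def mesh_stage(visible, camera):
    """Level-of-detail step: assign a triangle budget to each visible object."""
    budgets = []
    for obj in visible:
        d = distance(obj.position, camera)
        triangles = LOD_LEVELS[-1][1]  # fall back to the coarsest level
        for max_d, tris in LOD_LEVELS:
            if d <= max_d:
                triangles = tris
                break
        budgets.append((obj, triangles))
    return budgets


if __name__ == "__main__":
    camera = (0.0, 0.0, 0.0)
    field = [Asteroid((10.0, 0.0, 0.0)), Asteroid((100.0, 0.0, 0.0)),
             Asteroid((500.0, 0.0, 0.0)), Asteroid((5000.0, 0.0, 0.0))]
    visible = task_stage(field, camera, max_distance=1000.0)
    print(sum(tris for _, tris in mesh_stage(visible, camera)))
```

The payoff is the same one Nvidia's demo showed: far-off or invisible objects never reach the expensive part of the pipeline, and the ones that do are drawn with only as many triangles as their distance justifies.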
Variable rate shading is sort of like a supercharged version of the multi-resolution shading that Nvidia’s supported for years now. Human eyes only see the focal points of what’s in their vision at full detail; objects at the periphery or in motion aren’t as sharp. Variable rate shading takes advantage of that to shade primary objects at full resolution, but secondary objects at a lower rate, which can improve performance.
One potential use case for this is Motion Adaptive Shading, where non-critical parts of a moving scene are rendered with less detail. The image above shows how it could be handled in Forza Horizon. Traditionally, every part of the screen would be rendered at full detail, but with Motion Adaptive Shading, only the blue sections of the scene get such lofty treatment.
Content Adaptive Shading applies the same principles, but it dynamically identifies portions of the screen that have low detail or large swathes of similar colors, and shades those at lower detail, and more so when you’re in motion. It looked damned fine in action during a playable Wolfenstein II demo that let you toggle the feature on and off. I couldn’t perceive any change in visual quality, but Nvidia’s Alben says that activating CAS boosts imaging speed by 20 fps or more in situations where you’re targeting 60 fps on a mainstream GPU. Fingers crossed developers support this sort of technology with more gusto than they did multi-res shading, which blew me away in Shadow Warrior 2 but didn’t gain any traction beyond that.
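Here's a minimal sketch of how a game might pick a shading rate for each screen tile using the two signals described above, content detail and motion. The thresholds, the 16-pixel tile size, and the function names are illustrative assumptions, not anything Nvidia has published about how Wolfenstein II or Forza Horizon make the call.

```python
def pick_shading_rate(contrast, motion_px_per_frame,
                      contrast_threshold=0.1, motion_threshold=8.0):
    """Return the coarse pixel size (width, height) for one screen tile.

    (1, 1) means every pixel is shaded individually; (2, 2) means one
    shading result is shared across a 2x2 block, and so on.
    Thresholds are hypothetical tuning values.
    """
    low_detail = contrast < contrast_threshold
    fast_moving = motion_px_per_frame > motion_threshold
    if low_detail and fast_moving:
        return (4, 4)
    if low_detail or fast_moving:
        return (2, 2)
    return (1, 1)


def shading_cost(tiles, tile_size=16):
    """Count shader invocations for a list of (contrast, motion) tiles."""
    total = 0
    for contrast, motion in tiles:
        width, height = pick_shading_rate(contrast, motion)
        total += (tile_size // width) * (tile_size // height)
    return total


if __name__ == "__main__":
    # One detailed static tile, one flat static tile, one flat fast-moving tile.
    tiles = [(0.5, 0.0), (0.05, 0.0), (0.05, 20.0)]
    print(shading_cost(tiles), "invocations vs", 3 * 16 * 16, "at full rate")
```

In this toy setup, a flat patch of sky whipping past the camera costs a sixteenth of the shading work of a detailed, stationary tile, which is where the frame rate gains would come from.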
Variable rate shading can also help in virtual reality workloads by tailoring the level of detail to where you’re looking. Another new VR tech, Multi-View Rendering, expands upon the Simultaneous Multi-Projection technology introduced with the GTX 10-series to allow “developers to efficiently draw a scene from multiple viewpoints or even draw multiple instances of a character in varying poses, all in a single pass.”
Finally, Nvidia also introduced Texture Space Shading, which shades an area around an object rather than a single scene to let developers reuse shading in multiple perspectives and frames.
What happens inside a frame with Turing
For a standard GPU architecture, that’d be all you need to know. But we’re just getting started with Turing. In fact, before we dig in, let’s quickly cover how Turing handles workloads that take advantage of all its capabilities, such as a ray traced game.
The lines up top marked “1 Turing frame” are the key. For Nvidia’s other GPUs, it would consist of just one portion—the yellow FP32 line. But it’s more complicated in Turing.
While that standard FP32 shader processing takes place, the dedicated RT cores and integer pipeline are executing their own specialized tasks at the same time. Once all that’s done, everything’s handed off to the tensor cores for the final 20 percent of a frame, which perform their machine learning-enhanced magic—such as denoising ray traced images or applying Deep Learning Super Sampling—in games that utilize Nvidia’s RTX technology stack.
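That overlap can be roughed out with a back-of-the-envelope model: the FP32, INT32, and RT work run side by side, so the slowest of the three sets the pace, and the tensor core pass is tacked on at the end. The millisecond figures below are made-up placeholders chosen so the tensor pass lands at 20 percent of the frame; they are not measurements from any real game.

```python
def turing_frame_ms(fp32_ms, int32_ms, rt_ms, tensor_ms):
    """Concurrent FP32/INT32/RT work, followed by the tensor core pass."""
    return max(fp32_ms, int32_ms, rt_ms) + tensor_ms


def serialized_frame_ms(fp32_ms, int32_ms, rt_ms, tensor_ms):
    """The same work if every stage had to run back to back."""
    return fp32_ms + int32_ms + rt_ms + tensor_ms


if __name__ == "__main__":
    # Hypothetical per-stage costs in milliseconds, for illustration only.
    work = dict(fp32_ms=8.0, int32_ms=3.0, rt_ms=6.0, tensor_ms=2.0)
    overlapped = turing_frame_ms(**work)
    print(overlapped, serialized_frame_ms(**work), work["tensor_ms"] / overlapped)
```

The point of the model is that the specialized hardware only pays off if its work can hide behind the FP32 shading; anything that can't overlap adds directly to the frame time.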