How memory bandwidth is killing AMD's 32-core Threadripper performance

Here's just how much memory bandwidth constraints might be hurting the performance.

Gordon Mah Ung

AMD's 32-core Threadripper 2990WX is the fastest consumer CPU ever sold. And let's be clear: We're in full agreement with anyone who said that. But we would also be the first ones to say it has its limitations, too. 

The most glaring is the lack of consumer applications that can truly exploit the cores available. The other limitation is apparent in the diagram below, which shows how AMD built this 32-core monster. Rather than a single chip with every single CPU core on it, AMD connects four dies using its high-speed Infinity Fabric.

Why memory bandwidth affects the 32-core Threadripper

If you look closer at the diagram, you can see that two of the dies don't have their own memory controllers or PCIe access. Instead, they have to talk to an adjacent CPU die.

It is, essentially, like having a two-apartment unit where the second one must access the hallway outside by going through the first apartment.


AMD says the four-die Threadripper has 25GBps of bandwidth shared among all of the chips.

Perhaps more important is the overall bandwidth available. AMD had initially said the total bandwidth available between the four CPU dies was 25GBps bi-directional. The company amended its original documentation to state it was total bandwidth. Compare that with the 16-core Threadripper 2950X, with its 50GBps of bandwidth and two links between the two dies (also updated information from AMD).


A two-die 16-core Threadripper 2950X has 50GBps and two links between two dies, vs. the 25GBps among four dies that AMD originally claimed (and then amended).

Many believe this is Threadripper 2990WX's main weakness: Lack of memory bandwidth per core is impacting it in memory-intensive tasks such as compression and encoding. Even worse for Threadripper 2990WX is that bandwidth has to be shared on a CPU with 14 more cores than Intel's Core i9-7980XE.

Below, you can see the result of Sandra 2018 Titanium's memory bandwidth test and the available bandwidth per core. As you can see, the bandwidth per core plummets from almost 5GBps with 8 and 16 cores to just 2GBps when you utilize all 32 cores.


SiSoft Sandra 2018 Titanium's per-core memory bandwidth results say the Threadripper has only 2GBps per core available.

Synthetic memory bandwidth tests are one thing. To dig further into performance in memory-intensive tests, we fired up the newest version of the free and popular 7-Zip application. Written by Igor Pavlov, this open-source compression and decompression utility is popular and generally awesome. For example, when I run tests on a laptop and decompress Cinebench R15.08 and its thousands of small files with Windows 10's built-in utility, it takes several minutes to finish. I can actually connect to the Internet, download 7-Zip, and decompress the contents of Cinebench R15.08 with it in less time than it takes the built-in Windows utility to do its thing.

The GUI version runs two tests, for compression and decompression. The overall score looks like a simple average of the two results.

What 7-Zip tests

You can read more about the test on the web site, but we've highlighted some of the key information about the tests here. Regarding the Compression test, the website discusses the factors that influence the test results, saying it "strongly depends from memory (RAM) latency, Data Cache size/speed and TLB. Out-of-Order execution feature of CPU is also important for that test." The site goes on: "The compression test has big number of random accesses to RAM and Data Cache. So big part of execution time the CPU waits the data from Data Cache or from RAM."

About the Decompression test, the website says it "strongly depends on CPU integer operations. The most important things for that test are: branch misprediction penalty (the length of pipeline) and the latencies of 32-bit instructions ('multiply', 'shift', 'add' and other). The decompression test has very high number of unpredictable branches."

How we retested Threadripper vs. Core i9

For our retest, we decided to lock both the Threadripper 2990WX and the Core i9-7980XE at 3GHz to remove the variables introduced by each CPU's boost scheme. This was done to make the comparison more dependent on the test rather than the clock speed differences between the two. We also set both to DDR4/3200 clocks, and both were run in quad-channel mode except where noted. To be up-front: The Threadripper system had a slight edge in CAS latency at CL14 and 1T, while the Core i9 was running at CL15 and 2T. As in our original review, both were running Founders Edition GTX 1080 cards using the same drivers and the same version of Windows 10 Enterprise Edition.

Because much of the concern over Threadripper is its per-core memory bandwidth performance, we decided to run from 1 thread to the maximum number of threads on each CPU. We also decided to see whether performance of the Threadripper would change if you turned off dies, so we ran it with a single die (8 cores/16 threads), two dies (16 cores/32 threads), and all four (32 cores/64 threads).

In the integer-focused decompression component of 7-Zip, the performance was quite nice. Although we don't see perfect scaling, there's little difference in 7-Zip decompression performance as you switch off dies.

All of the tests were also completed using the GUI version of 7-Zip 18.05 with the default dictionary size of 32MB (although we did decide to recompile our own version, too).
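If you want to reproduce a sweep like this without clicking through the GUI, here is a minimal sketch in Python that builds one 7-Zip benchmark command per thread count. It assumes the `7z` command-line executable is on your PATH and relies on its `b` benchmark command with the `-mmt` (thread count) and `-md` (dictionary size as a power of two, so 25 should correspond to 32MB) switches. Our own runs used the GUI, so treat this as an approximation of the method rather than the exact procedure. The function names are hypothetical.

```python
import subprocess


def benchmark_command(threads, dict_log2=25, exe="7z"):
    """Return the argument list for one 7-Zip benchmark run."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    # -mmtN sets the thread count; -mdN sets the dictionary to 2**N bytes.
    return [exe, "b", f"-mmt{threads}", f"-md{dict_log2}"]


def sweep_commands(max_threads, dict_log2=25, exe="7z"):
    """Build one benchmark command for every thread count from 1 to max_threads."""
    return [benchmark_command(t, dict_log2, exe) for t in range(1, max_threads + 1)]


def run_sweep(max_threads, dict_log2=25, exe="7z"):
    """Run each benchmark in turn and return {threads: raw text output}.

    Requires 7-Zip to be installed; parsing the scores out of the text
    output is left to the reader.
    """
    results = {}
    for threads, cmd in enumerate(sweep_commands(max_threads, dict_log2, exe), start=1):
        completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
        results[threads] = completed.stdout
    return results
```

Calling `sweep_commands(64)` would cover every thread count on the 32-core Threadripper, and `sweep_commands(36)` would do the same for the 18-core Core i9.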


There's no apparent change in the decompression performance by moving between one, two, or four dies on the 32-core Threadripper.

You're probably more interested in the Core i9 vs. Threadripper 2990WX, so we ran that, of course. For the most part, it's not bad for either part. Interestingly, Threadripper 2990WX seems to have that slight fall-off in decompression performance as you cross the threshold of 8 cores. Core i9 has a decent performance advantage up to about 16 cores, but after that it runs out of steam and ends up losing to the 32-core Threadripper 2990WX CPU.


The 7-Zip LZMA decompression is more sensitive to integer, branch prediction, and instruction latency. Although Core i9 has some advantage, it's clear that more cores are better in the end.

This shouldn't surprise too many, though. The CPU performance when you don't run out of memory bandwidth is a known quantity of the Threadripper 2990WX. You only have to look at our multi-threaded rendering tests to see how it's simply a monster.

The question is, what happens under memory bandwidth or memory latency tests? Here are the results of the Threadripper 2990WX in 7-Zip's compression test. It's not pretty, but the good news is switching dies off didn't seem to matter. As you can see, the CPU appears to hit a ceiling at 26 threads, and then it just gets worse from there.


We ran the Threadripper 2990WX in single-die, dual-die and quad-die configuration to see if memory bandwidth issues would ease. 
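Spotting a ceiling like that in a table of per-thread scores is easy to automate. The sketch below, in Python, finds the thread count with the highest score and reports how far a given result falls short of ideal linear scaling from the single-thread score. The helper names are hypothetical, and the numbers in `example_scores` are placeholders for illustration only, not our measured results.

```python
def peak_thread_count(scores):
    """Return (threads, score) for the best score; ties go to the fewest threads."""
    if not scores:
        raise ValueError("scores is empty")
    best = min(scores, key=lambda t: (-scores[t], t))
    return best, scores[best]


def scaling_efficiency(scores, threads):
    """Score at `threads` divided by perfect linear scaling of the 1-thread score."""
    return scores[threads] / (scores[1] * threads)


# Placeholder values for illustration only; NOT measured benchmark results.
example_scores = {1: 100, 2: 190, 4: 360, 8: 600, 16: 500}

peak_threads, peak_score = peak_thread_count(example_scores)
print(f"peak at {peak_threads} threads (score {peak_score})")
print(f"efficiency at 8 threads: {scaling_efficiency(example_scores, 8):.2f}")
```

An efficiency of 1.0 would mean perfect scaling. Anything that falls as threads are added, and especially a peak below the maximum thread count, points to a shared bottleneck rather than a lack of cores.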

Perhaps worse is when you compare it to the Core i9-7980XE. Again, remember that both CPUs were at a fixed clock speed of 3GHz and DDR4/3200.


7-Zip's compression test is said to be sensitive to memory latency, cache, and out-of-order efficiency. Obviously, it doesn't do great on the 32-core Threadripper.

That's just not a good look for the 32-core Threadripper 2990WX and does seem to confirm that memory latency and bandwidth chores suffer greatly.

But can memory bandwidth also hurt Core i9? To find out, we switched the Core i9 system from quad-channel mode into single-channel mode. Unfortunately, for our test, we did have to lower total memory to 16GB rather than 32GB due to lack of density on modules. The good news is the 7-Zip with the default dictionary fits fine, and we don't believe overall memory capacity was the issue. We can say that overall memory bandwidth as measured in Sandra 2018 was cut from 77GBps in quad-channel memory mode to 18.5GBps in single-channel mode on the Intel part. Per-core memory bandwidth went from 4.8GBps in quad-channel to 1GBps in single-channel mode.
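As a back-of-the-envelope check on those numbers, this short Python sketch simply divides the total bandwidth by the Core i9-7980XE's 18 cores. It assumes bandwidth is split evenly across cores, which is a simplification. The single-channel result lands close to the 1GBps Sandra reported, while the quad-channel result comes out somewhat below Sandra's 4.8GBps, so Sandra's per-core figure is presumably measured rather than derived by simple division.

```python
def per_core_bandwidth(total_gbps, cores):
    """Evenly split total memory bandwidth (GBps) across all cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return total_gbps / cores


CORE_I9_CORES = 18  # Core i9-7980XE

# Totals are the Sandra 2018 figures quoted above.
quad_channel = per_core_bandwidth(77.0, CORE_I9_CORES)
single_channel = per_core_bandwidth(18.5, CORE_I9_CORES)

print(f"quad-channel:   {quad_channel:.2f} GBps per core")
print(f"single-channel: {single_channel:.2f} GBps per core")
```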


Does cutting memory bandwidth on the Core i9-7980XE also kill its 7-Zip compression performance? Yup.

As you can see, the performance of Core i9-7980XE also suffers when its memory bandwidth is drastically cut. It doesn't suffer as much as the Threadripper 2990WX, but this doesn't appear to be the fault of some pro-Intel code at work.


Linux tests show how Windows 10 affects results

I'd normally say, okay, memory bandwidth and latency are the real issues, but there is that Linux thing. That is, in tests run by Michael Larabel at Linux-focused site Phoronix, the Threadripper 2990WX actually performs on a par with the Core i9-7980XE rather than heavily trailing it. Phoronix runs a slightly older version of 7-Zip, but it's clear that moving to Linux helps Threadripper 2990WX. A lot. Phoronix even tested it using Windows 10 Server.


Maybe it's not the Threadripper after all?

Phoronix's Linux test shows issues not just with 7-Zip, but also several other tests where Windows 10 underperformed the Linux version. So it's clear Windows has an issue right now. But if you're in the crowd that wholesale dismisses memory bandwidth as a weakness at all, I'm not so sure.

One Linux vs. Windows comparison that would back up memory bandwidth and latency as issues comes from Steve Walton over at Techspot. Walton tested Windows and Linux performance using the latest 7-Zip version and found Core i9 still ahead despite having fewer cores. Greatly improved for Threadripper? Yes. But still clearly slower in a multi-threaded test that does scale to all available cores.


Techspot's Linux vs. Windows test still puts Threadripper behind the Core i9.

The compiler is another factor

In searching for more answers on Threadripper's 7-Zip performance, we wondered whether the compiler was at fault. If an outdated compiler was used to build the 7-Zip executable, it could certainly hurt the Threadripper's performance. To find out, we downloaded the source code for 7-Zip, the latest version of Microsoft's Visual Studio 2017, and compiled it into an executable.

We ended up with basically the same result, and it looks like the latest version of 7-Zip is actually on the latest available Visual C++ compiler. This doesn't completely dismiss compilers, as different compilers do matter. If, for example, the applications on Linux were compiled with the GCC or Intel compiler, it might explain the performance differences.


We recompiled the source code for 7-Zip 18.05 using the latest version of Visual Studio 2017 and found that, well, that's probably what 7-Zip was recently compiled with.

HandBrake test brings up more questions

While Windows 10 clearly, clearly has issues with the design of Threadripper, it would be wrong to say memory bandwidth and latency aren't in play.

To see just how much memory bandwidth helps or hurts both CPUs, we took VeraCrypt and ran it with the larger 1GB workload. As we saw with 7-Zip, the Core i9's VeraCrypt performance drops off a cliff when its memory bandwidth is cut and is actually worse than the Threadripper's (albeit with quad-channel memory), as you can see from the blue bars below.

The Threadripper 2990WX does suffer greatly with the 1GB workload. But if the issue is how Windows handles the memory configuration on the Threadripper, it should get better after shutting off two dies, right? It does, but as you can see in the green bars below, performance increases only slightly when limiting it to just 16 cores and 32 threads. The result is again confusing, because if Windows 10 is at fault for the poor performance of the shared memory controller design, why is the performance of the Threadripper 2990WX not as fast as the Core i9's? Remember that both CPUs are locked at 3GHz.


Cutting memory bandwidth just kills performance of the Core i9 (blue) but oddly the Threadripper's performance doesn't bump up when two of the dies are switched off.

Our last test used HandBrake 1.1.1 to encode a 4K video file using the 1080p Chromecast preset. Note: This HandBrake result is different from others we've run, so it can't be compared to previous results.

Video encoding is often associated with increased memory bandwidth. While it does matter, we can see it's not a big deal even when you go from 77GBps to 18GBps on the Core i9 on this particular preset.

Cutting the Threadripper's die use from four to two also isn't a big deal. It's actually slightly faster with two dies turned off, but almost within the margin for error in HandBrake encodes.

This leads us to believe that the reason a 32-core Threadripper is slightly slower than an 18-core Core i9 in this particular HandBrake run is likely the vagaries of HandBrake itself, and how well it runs on each processor. We should also note that the app itself is multi-threaded, but doesn't scale with core counts.


Gutting memory bandwidth on the Core i9 didn't produce as drastic a change in performance as you'd expect, which tells you that video encoding isn't as dependent on memory bandwidth as you might think.

There's no easy answer

If you were hoping for an easy answer to your lingering Threadripper performance questions—take a number. Based on our tests, the answer is, it's complicated.

While we didn't do Linux testing, we've seen enough results run by others now to say that Windows 10 is handcuffing performance in certain applications (although the compiler used for those particular tests might share some blame, too).

We also believe that Threadripper 2990WX can be handcuffed by memory bandwidth and latency in some workloads. It just makes sense when you're talking about sharing quad-channel memory among 32 cores, versus sharing quad-channel memory among 18 cores.

In the end, we think you should still choose your high-performance CPU based on the task it'll do. Our results from our original review still basically apply. If you do thread-heavy tasks such as 3D rendering or modelling or tend to multi-task, having 32 cores and 64 threads in a Threadripper 2990WX ($1,749 on Amazon) will be unlike anything you've ever had before.

If, however, you tend to stick to workloads that aren't as heavily threaded, such as most video encoding chores, need higher clock speeds for lightly threaded applications, and run tasks that are very dependent on memory bandwidth, the Core i9-7980XE ($2,000 on Amazon) might be the better choice for you.


If your applications tend to use fewer threads and prefer higher clock speeds, you live on the left side of this chart, and Core i9 makes more sense. If, however, you need more cores, you live on the right side of this chart, and Threadripper is the better choice.