Tag Archives: Ryzen

AMD Ryzen 9 3950X Folding@Home Review: Part 3: SMT (Hyperthreading)

Hi all. In my last post, I showed that the AMD Ryzen 9 3950x is quite a good processor for fighting diseases like Cancer, Alzheimer’s, and COVID-19. Folding@Home, the distributed computing project helping researchers understand various diseases, definitely makes good use of the 16 cores / 32 threads on the 3950x.

In this article, I’m taking a look at how virtualized CPU cores (Simultaneous Multithreading in AMD speak or Hyperthreading for you Intel fans) help computational performance and efficiency when running Folding@Home on a high-end CPU such as the Ryzen 9 3950x.

Instead of regurgitating all of the previous information, here are some links to bring you up to speed if you haven’t read the previous posts.

Socket AM4 Benchmark Machine

AMD Ryzen 9 3950x Review: Part 1 (Overview)

AMD Ryzen 9 3950X Review: Part 2 (Average Results vs. # of Threads)

Test Setup

For this test, I used the same settings as in Part 2, except that I disabled SMT in the BIOS on my motherboard. Thus, Windows 10 will only see the 16 physical CPU cores, and will not be able to run two logical threads per CPU core. As before, I ran all testing using Folding@Home’s V7 client. I set the CPU slot configuration for a thread value of 1-16. At each setting, I ran five work units and averaged the results. Note that AMD’s core performance boost was turned off for all tests, so at all times the processor ran at 3.5 GHz.

Performance

As expected, as you throw more CPU cores at a problem, the computer can chew through the math faster. Thus, more science gets done in a given amount of time. In the case of Folding@Home, this performance is rated in terms of Points Per Day (PPD). The following plot shows the increase in computational performance as a function of # of threads utilized by the solver. Unlike in my previous testing on the 3950x, here an increase of 1 thread corresponds to an increase of 1 engaged CPU core, since virtual threads (SMT / Hyperthreading) are disabled.

The plot below includes the individual samples at each data point as light gray dots, as well as a + / – 2 sigma (95%) confidence interval. This means that 95% of the results for a given thread setting are statistically predicted to fall within the dashed lines.
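For anyone who wants to build the same kind of band from their own runs, here is a minimal sketch. It assumes the dashed lines are simply the sample mean plus or minus two sample standard deviations of the five work units at one thread setting. The PPD values and the function name `two_sigma_band` are made up for illustration and are not my measured data.

```python
from statistics import mean, stdev


def two_sigma_band(ppd_samples):
    """Return (low, average, high) for a mean +/- 2 sigma band."""
    avg = mean(ppd_samples)
    sigma = stdev(ppd_samples)  # sample standard deviation (divides by n - 1)
    return avg - 2 * sigma, avg, avg + 2 * sigma


# Hypothetical PPD values for five work units at a single thread setting
samples = [100_000, 110_000, 90_000, 105_000, 95_000]
low, avg, high = two_sigma_band(samples)
print(f"average = {avg:.0f} PPD, band = {low:.0f} to {high:.0f} PPD")
```

With only five samples per setting the standard deviation itself is a rough estimate, so the band is best read as a guide to the spread rather than a hard limit.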

AMD Ryzen 9 3950x Performance SMT Off

As a side note, certain settings of thread count actually result in the exact same performance, because the Folding@Home client is internally using a different number than the specified value. For example, setting the CPU slot to 5 threads will still result in a 4-thread solve, because the solver is avoiding the numerical issues that occur when trying to stitch the solution together with 5 threads (5 is a tricky prime number to work with numerically). I noted these regions on the plot. If you would like more detail about this, please read the previous part of this review (part 2).

One interesting observation is that the maximum performance occurs with 15 CPU cores enabled, not the complete 16! This is somewhat similar to what was observed in Part 2 of this review (SMT enabled), where 30 threads provided slightly more points than 32 threads. More on that in a moment…

Power Consumption

Using my P3 Kill A Watt Power Meter, I measured the power consumption of the entire computer at the wall. As expected, as you increase the number of CPUs engaged, the instantaneous power consumption goes up. The power numbers reported here are averaged by “the eyeball method”, since the actual instantaneous power goes up and down by a few watts as the computer does its thing. I’d estimate that these numbers are accurate within 5 watts.

AMD Ryzen 9 3950x Power Consumption SMT Off

Efficiency

The ultimate goal of this blog is to find the most efficient settings for computer hardware, so that we can do the most scientific research for a given amount of power consumption. Thus, this next plot is just performance (in PPD) divided by power consumption (in watts). I left off all the work unit variation and confidence interval lines, since it looks about the same as the performance plot, and it’s cleaner with just the one average line.

AMD Ryzen 9 3950x Efficiency SMT Off

As with performance, setting Folding@Home to use 15 CPUs instead of the full 16 is surprisingly the best option for efficiency. The difference is pretty profound here, as the processor used more power at 16 threads than at 15 threads while producing fewer points.

Comparison to Hyperthreaded Results

To get a better idea of what’s going on, here are the same three plots again with the average results overlaid on the previous results from when SMT was enabled. Of course the SMT results go up to 32 threads, since with virtual cores enabled, the 16-core Ryzen 9 3950x can support 32 total threads.

AMD Ryzen 9 3950x Performance SMT Off vs On

AMD Ryzen 9 3950X Performance: SMT Study

AMD Ryzen 9 3950x Power SMT Off vs On

AMD Ryzen 9 3950X Power Consumption: SMT Study

AMD Ryzen 9 3950x Efficiency SMT Off vs On

AMD Ryzen 9 3950X Efficiency: SMT Study

Conclusion

Disabling SMT (aka Hyperthreading) essentially limits the Ryzen 9 3950x to a maximum thread count of 16 (one thread per physical core). The results from 1-16 threads are very similar to those results obtained with SMT enabled. Due to work unit variation, the performance and efficiency plots show what I would say is effectively the same result with SMT on vs. off, up to 16 threads. One thing to note was that the power consumption in the 12-16 thread range did trend higher for the SMT off case, although the offset was small (about 5-10 watts). This is likely due to Windows scheduling work to a new physical core to handle the higher thread count when SMT is disabled, as opposed to virtualizing the work onto an already-running core using SMT. Ultimately, this slightly higher power consumption didn’t have a noticeable effect on the efficiency plot.

The big takeaway is that for thread counts above 16 (the physical core count), the Ryzen 9 3950x can utilize thread virtualization very well. The logical processors that Windows sees don’t work quite as well as true physical cores (hence the decrease in slope on the performance and efficiency plots above 16 CPUs). However, when the thread count is doubled, SMT still does allow the processor to eke out an extra 100K PPD (about 33% more) and run more efficiently than when it is limited to scheduling work to physical CPUs.

Pro Tip #1: Turn on Hyperthreading / SMT and run with high core counts to get the most out of Folding@Home!

The final observation worth noting is that in both cases, setting the F@H client to use the maximum available number of threads (16 for SMT off, 32 for SMT on) is not the fastest or most efficient setting. Backing the physical core count down to 15 (and, similarly, the SMT core count down to 30) results in the fastest and most efficient solver performance.

My theory is that by leaving one physical core free (one physical core = 2 threads with SMT on), the computer has enough spare capacity to run all the crap that Windows 10 does in the background. Thus, there is less competition for CPU resources, and everything just works better. The computer is also easier to use for other tasks when you don’t fully max out the CPU core count. This is also especially valuable for those people also trying to fold on a GPU while CPU folding (more on that in the next article).

Pro Tip #2: For high core count CPUs, don’t fold at 100% of your processor’s core capacity. Go right to the limit, and then back it off by a core.

Since you’re using SMT / Hyperthreading due to Pro Tip #1, this means setting the CPUs box in the client to 2 less than the maximum allowed. On my 16-core, 32-thread Ryzen 9 3950x, this means CPUs = 32 (theoretical max) – 2 (2 threads per core) = 30
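As a quick sketch of that arithmetic, the snippet below computes the suggested CPUs value from the logical CPU count. The function name `recommended_cpu_threads` is my own invention for illustration and is not part of the Folding@Home client. The last call assumes two threads per core on whatever machine runs it, which will not be true everywhere.

```python
import os


def recommended_cpu_threads(logical_cpus, threads_per_core=2):
    """Leave one physical core free: max threads minus threads per core."""
    return max(1, logical_cpus - threads_per_core)


print(recommended_cpu_threads(32, 2))  # 16-core 3950x with SMT on -> 30
print(recommended_cpu_threads(16, 1))  # same chip with SMT off -> 15

# On your own machine: os.cpu_count() reports logical CPUs (it may return None)
logical = os.cpu_count() or 1
print(recommended_cpu_threads(logical, threads_per_core=2))
```

The result still has to be typed into the CPUs box by hand; this only does the subtraction.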

CPU Slot Config

This result will be different on CPUs with different numbers of cores, so YMMV…I always recommend testing out your individual processor. For lower core count processors such as Intel’s quad core Q6600, running with the maximum number of cores offers the best performance. I previously showed this here.

Future Work

In the next article, I’m going to kick off folding on the GPU, an Nvidia GeForce 1650, which I previously tested by its lonesome here. In a CPU + GPU folding configuration, it’s important to make sure the CPU has enough resources free to “feed” the GPU, or else points will suffer.

I’ve also started re-running the thread tests with Core Performance Boost enabled. This allows the processor to scale up in frequency automatically based on the power and thermal headroom. This should significantly change the character of the SMT On and SMT Off plots, since everything up till now has been run at the stock speed of 3.5 GHz.

Support My Blog (please!)

If you are interested in measuring the power consumption of your own computer (or any device), please consider purchasing a P3 Kill A Watt Power Meter from Amazon. You’ll be surprised what a $35 investment in a watt meter can tell you about your home’s power usage, and if you make a few changes based on what you learn you will save money every year! Using this link won’t cost you anything extra, but will provide me with a small percentage of the sale to support the site hosting fees of GreenFolding@Home.

If you enjoyed this article, perhaps you are in the market for an AMD Ryzen 9 3950x or similar Ryzen processor. If so, please consider using one of the links below to buy one from Amazon. Thanks for reading!

AMD Ryzen 9 3950x Direct Link

AMD Ryzen (Amazon Search)

AMD Ryzen 9 3950X Folding@Home Review: Part 2: Averaging, Efficiency, and Variation

Welcome back everyone! In my last post, I used my rebuilt benchmark machine to revisit CPU folding on my AMD Ryzen 9 3950x 16-core processor. This article is a follow-on. As promised, this includes the companion power consumption and efficiency plots for thread settings of 1-32 threads. As a quick reminder, I did this test with multi-threading (SMT) on, but with Core Performance Boost disabled, so all cores are running at the base 3.5 GHz setting.

Performance

The Folding@Home distributed computing project has come a long way from its humble disease-fighting beginnings back in 2000. The purpose of this testing is to see just how well the V7 CPU client scales on a modern, high core-count processor. With all the new Folding@Home donors coming onboard to fight COVID, having some insight into how to set up the configuration for the most performance is hopefully helpful.

For this test, I simply set the # of threads the client can use to a value and ran five sequential work units. I averaged the performance (Points Per Day), but I also plot the individual work unit performance values to give you a sense of the variation. Since the Ryzen 9 3950x supports 32 threads, I essentially ran 160 tests. Since I wanted the Folding@Home Consortium to get useful data in their fight against COVID-19, I let each work unit run to completion, even though I only need them to run to about 10-20% complete to get an accurate PPD estimate from the client.

So, without further blabbing on my part, here is the graph of Folding@Home performance vs. thread count in Windows 10 on the Ryzen 9 3950x

Ryzen_3950x_Performance_SMT_Off_CPB_On

Here, the solid blue line is the averaged performance, and the gray circles are the individual tests. The dashed blue lines represent a statistical 95% confidence interval, which is computed based on the variation. The Points Per Day (PPD) of a work unit run on the 3950x is expected to fall within this band 95% of the time.

My first observation is, holy crap! This is a fast processor. Some work units at high thread counts get really close to 500K PPD, which for me has only been achievable by GPU folding up to this point.

My second observation is that there is a lot of variation between different work units. This makes sense, because some work units have much larger molecules to solve than others. In my testing, I found the average variation of all 160 tests to be 12.78%, with individual variance up to 25%.

My third observation is that there seem to be two different regions on this plot. For the first half, the thread count setting is less than the number of physical cores on the chip, and the results are fairly linear. For the second half, the thread count setting is higher than the number of physical cores on the chip (thus forcing the CPU to virtualize those cores using SMT). Performance seems to fall off when the CPU cores become fully saturated (threads = 16), and it takes a while to climb out of the hole (threads = 24 starts showing some more gains).

As a side note, the client does not actually run all of these thread count settings, since some prime numbers, especially large primes (7, 11) and multiples thereof cause numerical issues. For example, when you try to run a 7-thread solve, the client automatically backs the thread count down to 6. You can see warnings in the log file about this when it happens.

Prime Number Thread Adjust

I noted all the relevant thread counts where this happens on the x-axis of the plot. Theoretically, these should be equivalent settings. The fact that the average performance varies a bit between them is just due to work unit variation (I’d have to run hundreds of averages to cancel all the variation out).

Finally, I noticed that the highest PPD actually occurred with a thread count of 30 (PPD = 407200) vs a thread count of 32 (PPD = 401485). This is a small but interesting difference, and is within the range of statistical variation. Thus I would say that setting the thread count to 30 vs 32 provides the same performance, while leaving two CPU threads free for other tasks (such as GPU folding…more on that later!).
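To put a number on "small", here is a short sketch that compares the two averages against the 12.78% average work unit variation quoted earlier. The helper `percent_difference` is just an illustrative name, and comparing the gap directly to the average variation is a rough sanity check rather than a proper significance test.

```python
def percent_difference(a, b):
    """Percent by which a exceeds b."""
    return (a - b) / b * 100.0


ppd_30_threads = 407_200
ppd_32_threads = 401_485
average_variation_pct = 12.78  # average work unit variation from this test

gap = percent_difference(ppd_30_threads, ppd_32_threads)
print(f"30 vs 32 threads: {gap:.2f}% apart")
print("smaller than work unit variation:", gap < average_variation_pct)
```

The gap works out to roughly 1.4%, far smaller than the spread between individual work units.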

Power Consumption

Power consumption numbers for each thread setting were taken at the wall, using my P3 Kill A Watt meter. Since the power numbers tend to walk around a bit as the computer works, it’s hard to get an instantaneous reading. Thus these are “eyeball averaged”. There was enough change at each CPU thread setting to clearly see a difference (not counting those thread settings that are actually equivalent to an adjacent setting).

Ryzen_3950x_Power_SMT_Off_CPB_On

The total measured power consumption rose fairly linearly from just under 80 watts to just under 160 watts. There’s not too much surprising here. As you throw more threads at the CPU, it clocks up idle cores and does more work (which causes more transistors to switch, which thus takes more power). This seems pretty believable to me. At the high end, the system is drawing just under 160 watts of power. The AMD Ryzen 9 3950x is rated at a 105 watt TDP, and with CPB turned off it should be pretty close to this number. My rough back of the hand calculation for this rig was as follows:

  1. CPU Loaded Power = 105 Watts
  2. GPU Idle Power (Nvidia GTX 1650) = 10 Watts
  3. Motherboard Power = 15 Watts
  4. Ram Power = 2 watts * 4 sticks = 8 watts
  5. NVME Power = 2 watts * 2 drives = 4 watts
  6. SSD Power = 2 watts

Total Estimated Watts @ F@H CPU Load = 144 Watts

Factor in a boat load of case fans, some silly LED lights, and a bit of PSU efficiency hit (about 90% efficient for my Seasonic unit) and it’ll be close to the 160 watts as measured.
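The same rough estimate can be written out as a few lines of Python. This is only a sketch under my assumptions: the component wattages are the ballpark guesses listed above, they are treated as loads on the DC side of the power supply, and the 90% PSU efficiency is approximate. Fans and LEDs are left out because I didn’t put a number on them.

```python
# Ballpark component power, in watts, from the list above
component_watts = {
    "CPU (loaded)": 105,
    "GPU (idle GTX 1650)": 10,
    "Motherboard": 15,
    "RAM (4 sticks x 2 W)": 4 * 2,
    "NVMe (2 drives x 2 W)": 2 * 2,
    "SSD": 2,
}
PSU_EFFICIENCY = 0.90  # approximate

dc_load = sum(component_watts.values())
wall_estimate = dc_load / PSU_EFFICIENCY  # wall draw = DC load / efficiency

print(f"component total = {dc_load} W")
print(f"estimated wall draw = {wall_estimate:.0f} W")
```

The PSU loss alone takes the 144 watt component total to about 160 watts at the wall, which is where the Kill A Watt reading ended up.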

Efficiency

This being a blog about saving the planet while still doing science with computers, I am very interested in energy efficiency. For Folding@Home, this means doing the most work (PPD) for the least amount of power (watts). So, this plot is just PPD/Watts. Easy!
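As a worked example of that division, the sketch below uses the 30-thread average PPD reported above. The wall power is rounded up to 160 watts (the measured value was just under that), so treat the result as approximate. The function name `efficiency_ppd_per_watt` is only illustrative.

```python
def efficiency_ppd_per_watt(ppd, wall_watts):
    """Folding efficiency: points per day divided by wall power."""
    if wall_watts <= 0:
        raise ValueError("wall power must be positive")
    return ppd / wall_watts


# 30-thread average PPD from this test; 160 W is a rounded-up wall reading
eff = efficiency_ppd_per_watt(407_200, 160)
print(f"{eff:.0f} PPD/Watt")
```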

Similar to the PPD plot, this efficiency plot averages five data points for each thread setting. I chose to leave off the individual points and the confidence interval, because that looks about the same on this plot as it does on the PPD plot, and leaving all the clutter off makes this easier to read.

Ryzen_3950x_Efficiency_SMT_Off_CPB_On

As with the PPD plot, there seem to be two regions on the efficiency curve. The first region (threads less than 16) shows a pretty good linear ramp-up in efficiency as more threads are added. The second region (threads 16 or greater) is what I’m calling the “core saturation” region. Here, there are more threads than physical cores, and efficiency stays relatively flat. It actually drops off at 16 cores (similar to the PPD plot), and doesn’t start improving again until 24 or more threads are allocated to the solver.

This plot, at first glance, suggests that the maximum efficiency is realized at # of threads = 30. However, it should be noted that work unit variation still has a lot of influence, even with reporting results of a 5-sample average. You can see this effect by looking at the efficiency drop at threads = 31. Theoretically, the efficiency should be the same at threads = 31 and threads = 30, because the solver runs a 30-thread solution even when set to 31 to avoid domain decomposition problems.

Thus, similar to the PPD plot, I’d say the max efficiency is effectively achieved at thread counts of 30 and 32. My personal opinion is that you might as well run with # of threads = 30 (leaving two threads free for other tasks). This setting results in the maximum PPD as well.

Weird Results at Threads = 16-23

Some of you might be wondering why the performance and efficiency drop off when the thread count is set to the actual number of cores (16) or higher. I was too, so I re-ran some tests and looked at what was happening with AMD’s built-in Ryzen Master tool. As you can see in the screen shot below, even though the # of threads was set to 18 in Folding@Home (a number greater than the 16 physical cores), not all 16 cores were fully engaged on the processor. In fact, only 14 were clocked up, and two were showing relatively lazy clock rates.

Two Cores are Lazy!

Folding@Home 18-Thread CPU Solve on 16-Core Processor

I suspect what is happening is that some of the threads were loaded onto “virtual” CPU cores (i.e. SMT / hyper threading). This might be something Windows 10 does to preserve a few free CPU cores for other tasks. In fact, I didn’t see all of the cores turbo up to full speed until I set Folding@Home’s thread count to 24. This incidentally is when performance starts coming back in on the plots above.

This weird SMT / Hyper-threading behavior is likely what is responsible for the large drop-off / flat part of the performance and efficiency curves that exists from thread count = 16 to 23. As you can see in the picture below, once you fully load all the available threads, the CPU frequencies on each core all hit the maximum value, as expected.

Ryzen_Master_32_Thread_Solve

Folding@Home 32-Thread CPU Solve on 16-Core Processor

Results Comparison

The following plots compare overall performance, power consumption, and efficiency of my new AMD Ryzen 9 3950x Folding@Home rig to other hardware configurations I have tested so far.

Performance

As you can see from the plot below, the Ryzen 9 3950x running a 32-thread Folding@Home solve can compete with relatively modern graphics cards in terms of raw performance. High-end GPUs will still offer more performance, but for a processor, getting over 400K PPD is very impressive. This is significantly more PPD than the previous processors I have tested (AMD Bulldozer-based FX-8320e, AMD Phenom II X6 1100t, Intel Core2Quad Q6600, etc). Admittedly I have not tested very many CPUs, since this is much more involved than just swapping out graphics cards to test.

AMD Ryzen 9 3950x Performance

Power Consumption

From a total system power consumption standpoint, my new benchmark machine with the AMD Ryzen 9 3950x has a surprisingly low total power draw when running Folding. Another interesting point is that since the 3950x lacks onboard graphics, I had to have a graphics card installed to get display. In my case, I had the Nvidia GTX 1650 installed, since this is a relatively low power consumption card that should provide minimal overhead. As you can see below, folding on the 3950x CPU (with the 1650 GPU idle) uses nearly the same amount of power as folding on the 1650 GPU (with the 3950x idle).

AMD Ryzen 9 3950x Power Consumption

Efficiency

Efficiency is the point of this blog, and in this respect the 3950x comes in towards the upper middle of the pack of hardware configurations I have tested. It’s definitely the most efficient processor I have tested so far, but graphics cards such as the 1660 Super and 1080 Ti are more efficient. Despite drawing more total power from the wall, these high-end GPUs do a lot more science.

Still, a PPD/Watt of over 2500 is not bad, and in this case the 3950x is more efficient than folding on the modest GPU installed in the same box (the Nvidia GTX 1650). Compared to the much older AMD FX-8320e, the Ryzen 9 3950x is 14x more efficient! What a difference 7 years can make!

AMD Ryzen 9 3950x Efficiency

Conclusion

The 16-core, 32-thread AMD Ryzen 9 3950x is one fast processor, and can do a lot of science for the Folding@Home distributed computing project. Although mid to high-end graphics cards such as the 1080 Ti ($450 on the used market) can outperform the $700 3950x in terms of performance and efficiency, it is still important to have a smattering of high-end CPU folding rigs on the Folding@Home network, because some molecules can only be solved on CPUs.

There is a general trend of increasing efficiency and performance as the # of CPU threads allocated to Folding@Home increases. For the Ryzen 9 3950x, using a setting of 30 or 32 threads is recommended for maximum performance and efficiency. If you plan on using your computer for other tasks, or for simultaneously folding on the GPU, 30 threads is the ideal CPU slot setting.

Future Work

In the next article, I’ll disable multithreading (SMT) to see the effect of virtualized CPU cores on Folding@Home performance.

Later, I plan to enable Core Performance Boost on the 3950x to see what effect the automatic clock frequency and voltage overclocking has on Folding@Home performance and efficiency.

AMD Ryzen 9 3950X Folding@Home Review: Part 1: PPD vs # of Threads

Welcome back everyone. Over the last month, I’ve been experimenting with my new Folding@Home benchmark machine to see how effectively AMD’s flagship Ryzen processor (Ryzen 9 3950X) can fight diseases such as COVID-19, Cancer, and Alzheimer’s. I’ve been running Folding@Home, a charitable distributed computing project, which provides scientists with valuable computing resources to study diseases and learn how to combat them.

This blog is typically focused on energy efficiency, where I try to show how to do the most science for the least amount of power consumption possible. In this post, I’m stepping away from that (at least for now) in order to understand something much simpler: how does the Folding@Home CPU client scale with # of processor threads?

I’d previously investigated Folding@Home performance and efficiency vs. # of CPU cores on an old Intel Q6600. I’ve also done a few CPU articles on AMD’s venerable Phenom II X6 1100T and my previous processor, the AMD FX-8320e. These CPU articles were few and far-between however, as I typically focus on using graphics cards (GPUs). The reason is twofold. Historically, graphics cards have produced many more points per day (PPD) for a given amount of power, thanks to their massively parallel architecture, which is well-suited for running single precision molecular dynamics problems such as those used by Folding@Home. Also, graphics cards are much easier to swap out, so it was relatively easy to make a large database of GPU performance and efficiency.

Still, CPU folding is just as important, because there are certain classes of problems that can only be efficiently computed on the CPU. Folding@Home, while originally a project that ran exclusively on CPUs, obtains the bulk of its computational power from GPU donors these days. However, the CPU folders still play a key part, running work units that cannot be solved on GPUs, thus providing a complete picture of the molecular dynamics.

In my last article, I highlighted the need for me to build a new benchmark machine for testing out GPUs, since my old rig would soon become a bottleneck and slow the GPUs down (thus potentially affecting any comparison plots I make). Now that this Ryzen-based 16-core monster of a desktop is complete, I figured I’d revisit CPU folding once more to see just what a modern enthusiast-class processor like the $749 Ryzen 9 3950X is capable of. For this first part of a multi-part review, I am simply looking at the preliminary results from running Folding@Home on the CPU. Instead of running with the default thread settings, I manually set up the client, examining just how performance results scale from the 1 to 32 available threads on the Ryzen 9 3950x.

Test Setup

Testing was performed in Windows 10 Home, using the latest Folding@Home client (7.6.13). Points Per Day were estimated from the client window for each setting of # of CPU threads. These instantaneous estimates have a lot of variability, so future testing will investigate the effect of averaging (running multiple tests at each setting) on the results.

Benchmark Machine Hardware:

Case: Raidmax Sagitta (2006)
Power Supply: Seasonic Prime 750 Titanium
Fresh Air: 2 x 120 mm Enermax Front Intake
Rear Exhaust: 1 x 120 mm Scythe Gentle Typhoon
Side Exhaust: 1 x 80 mm Noctua
Top Exhaust: 1 x 120 mm (Seasonic PSU)
CPU Cooler: Noctua NH-D15 SE AM4
Thermal Paste: Arctic MX-4
CPU: AMD Ryzen 9 3950X 16 Core 32 Thread (105W TDP)
Motherboard: ASUS Prime X570-P Socket AM4
Memory: 32 GB (4 x 8 GB) Corsair Vengeance LPX DDR4 3600 MHz
GPU: Zotac Nvidia GeForce 1650
OS Drive: Samsung 970 Evo Plus 512 GB NVME SSD
Storage #1: Samsung 860 Evo 2 TB SSD
Storage #2: Western Digital Blue 256 GB NVME SSD (for Linux)
Optical: Samsung SH-B123L Blu-Ray Drive
OS: Windows 10 Home, Ubuntu Linux (on 2nd NVME)

Processor Settings:

The AMD Ryzen 9 3950x is a beast. With 16 cores and 32 threads, it has a nominal power consumption of 105 watts, but can easily double that when overclocked. With the factory Core Performance Boost (CPB) enabled, the processor will routinely draw 150+ watts when loaded due to the individual cores turboing as high as 4.7 GHz, up from the 3.5 GHz base clock. Under heavy multi-threaded work loads, the processor supports an all-core overclock of up to 4.3 GHz, assuming sufficient cooling and motherboard power delivery.

This automatic core turbo behavior is problematic for creating a plot of Folding@Home performance (PPD) vs # of threads, since for lightly threaded loads, the processor will scale up individual cores to much higher speeds. In order to make an apples to apples comparison, I disabled CPB, so that all CPU cores run at the base speed of 3.5 GHz when loaded. In future testing, I will perform this study with CPB on in order to see the effect of the factory automatic overclocking.

A note about Cores vs. Threads

Like many Intel processors with Hyper-Threading, AMD supports running multiple code execution strings (known as threads) on one CPU core. The Simultaneous Multi-Threading (SMT) on the Ryzen 9 3950x is simply AMD’s term for the same thing: a doubling of certain parts within each processor core (or sometimes the virtualization of multiple threads within one CPU core) to allow multiple thread execution (two threads per core, in this case). The historical problem with both Hyper-Threading and SMT is that it does not actually double a CPU core’s capacity to perform complex floating point mathematics, since there is only one FPU per CPU core. SMT and Hyperthreading work best when there is one large job hogging a core, and the smaller job can execute in the remaining part of the core as a second thread. Two equally intensive threads can end up competing for resources within a core, making the SMT-enabled processor actually slower. For example: https://www.techspot.com/review/1882-ryzen-9-smt-on-vs-off/

For the purposes of this article, I left SMT on in order to make the coolest plot possible (1-32 threads!). However, I suspect that SMT might actually hurt Folding@Home performance, for the reasons mentioned above. Thus in future testing, I will also try disabling this to see the effect.

Preliminary Results: PPD vs # Threads on Ryzen 9 3950x

So, to summarize the caveats, this test was performed once under each test condition (# of threads), so there are 32 data points for 32 threads. SMT was on (so Folding@Home can run two threads on one CPU core). CPB was off (all cores set to 3.5 GHz).

The figure below shows the results. As you can see, there is a general trend of increasing performance with # of threads, up to around the halfway point. Then, the trend appears to get messy, although by the end of the plot, it is clear that the higher thread counts realize a higher PPD.

Ryzen 9 3950X PPD vs Thread Count 1

Observations

It is clear that, at least initially, adding threads to the solution makes a fairly linear improvement in points per day. Eventually, however, the CPU cores are likely becoming saturated, and more of the work is being executed via SMT. Due to the significant work unit variability in Folding@Home (as much as 10-20% between molecules), these results should be taken with a grain of salt. I am currently re-running all of these tests, so that I can show a plot of average PPD vs. # of Threads. I am also logging power using my watt meter, so that we can make wall power consumption and efficiency plots.

Conclusions

Seeing a processor produce nearly half a million points per day in Folding@Home was insane! My previous testing with old 4, 6, and 8-core processors was lucky to show numbers over 20K PPD. In general, allowing Folding@Home to use more processor threads increases performance, but there is significant additional work needed to verify a statistical trend. Stay tuned for Part II (averaging).

P.S.

Man, that’s a lot of cores! You’d better be scared, COVID-19…I’m coming for you!

Cores!

So Many Cores!

New Folding@Home Benchmark Machine: It’s RYZEN TIME!

Folding@Home, the distributed computing project that fights diseases such as COVID-19 and cancer, has hit an all-time high in popularity. I’m stunned to find that my blog is now getting more views every day than it did every month last year. With that said, this is a perfect opportunity to reach out and see if all the new donors are interested in tuning their computers for efficiency, to save a little on power, lighten the burden on your wallet, and hopefully produce nearly the same amount of science. If this sounds interesting to you, let me know in the comments below!

In my last post, I noted that the latest generation of graphics cards are starting to push the limits of what my primary GPU Folding@Home benchmark rig can do. That computer is based on an 11-year-old chipset (AMD 880), and only supports PCI-Express 2.0. In order for me to keep testing modern fast graphics cards in Windows 10, I wanted to make sure that PCI-Express slot bandwidth wasn’t going to artificially bottleneck me.

So, without further ado, let me present the new, re-built Folding@Home rig, SAGITTA:

Sagitta Desktop

I’ve (re)created a monster!

This build leverages the Raidmax Sagitta case that I’ve had since 2006. This machine has hosted multiple builds (Pentium D 805, Core 2 Duo E8600, Core 2 Quad Q6600, Phenom II X6 1100T, and the most recent FX-8320e Bulldozer). There have been too many graphics cards to count, but the latest one (Nvidia GTX 1650 by Zotac) was carried over for some continuity testing. The case fans and power supply (initially) were also carried over from the previous FX build (they aren’t the same ones from back in 2006…those got loud and died long ago). I also kept my Blu-Ray drive and 3.5 inch card reader. That’s where the similarities end. Here is a specs comparison:

Sagitta Rebuild Benchmark Machine Specs

  • Note I ended up updating the power supply to the one shown in the table. More on that below…

System Power Consumption

Initially, the power consumption at idle of the new Ryzen 9 build, measured with my P3 Kill A Watt Meter, was 86 watts. The power consumption while running GPU Folding was 170 watts (and the all-core CPU folding was over 250 watts, but that’s another article entirely).

Using the same Nvidia GeForce GTX 1650 graphics card, these idle and GPU folding power numbers were unfortunately higher than those of the old benchmark machine, which came in at 70 watts idle and 145 watts load. This is likely due to the overkill hardware that I put into the new rig (X570 motherboards alone are known to draw twice the power of a more normal board). The system’s power consumption difference of 25 watts while folding was especially problematic for my efficiency testing, since new efficiency plots would not be directly comparable to those for graphics cards tested on the old benchmark machine.

To solve this, I could either:

A: Use a 25 watt offset to scale the new GPU F@H efficiency plots

B: Do nothing and just have less accurate efficiency comparisons to previous tests

C: Reduce the power consumption of the new build so that it matches the old one

This being a blog about energy efficiency, I decided to go with Option C, since that’s the one that actually helps the environment. Let’s see if we can trim the fat off of this beast of a computer!

Efficiency Boost #1: Power Supply Upgrade

The first thing I tried was to upgrade the power supply. As noted here, the power supply’s efficiency rating is a great place to start when building an energy efficient machine. My old Seasonic X-650 is a very good power supply, and carries an 80+ Gold rating. Still, things have come a long way, and switching to an 80+ Titanium PSU can gain a few efficiency percentage points, especially at low loads.

80+ Table

80+ Efficiency Table

With that 3-5% efficiency boost in mind, I picked up a new Seasonic 750 Watt Prime 80+ Titanium modular power supply. At $200, this PSU isn’t cheap, but it provides a noticeable efficiency improvement at both idle and load. Other nice features were the additional 100 watts of capacity, and the fact that it supported my new motherboard’s dual pin (8 + 4) CPU aux power connection. That extra 4-pin isn’t required to make the X570 board work, but it does allow for more overclocking headroom.

Disclaimer: Before we get into it, I should note that these power readings are “eyeball” readings, taken by glancing at the watt meter and trying to judge the average usage. The actual number jumps around a bit (even at idle) as the computer executes various background tasks. I’d say the measurement precision on any eyeball watt meter readings is +/- 5 watts, so take the below with a grain of salt. These are very small efficiency improvements that are difficult to measure, and your mileage may vary. 

After upgrading the power supply, idle power dropped an impressive 10 watts, from 86 watts to 76. This is an awesome reduction of roughly 12% in idle power. This might be due to the new 80+ Titanium power supply having an efficiency target at very low loads (90% efficiency at 10% load), whereas the old 80+ Gold spec did not have a low load efficiency requirement. Thus, even though I used a large 750 watt power supply, the machine can still remain relatively efficient at idle.

Under moderate load (GPU folding), the new 80+ Titanium PSU provided a 4% efficiency improvement, dropping the power consumption from 170 watts to 163. This is more in line with expectations.
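As a quick check on the arithmetic, here is a minimal Python sketch that turns the before/after wattages quoted above into percentage reductions. The function name `percent_reduction` is made up for this illustration, and the inputs are the eyeball readings from this post, so the outputs carry the same +/- 5 watt uncertainty.

```python
def percent_reduction(before_watts, after_watts):
    """Return the drop from before to after as a percentage of before."""
    return 100.0 * (before_watts - after_watts) / before_watts

# Wall-power readings quoted in this post (eyeball values, +/- 5 W)
idle_drop = percent_reduction(86, 76)     # old PSU -> new PSU at idle
load_drop = percent_reduction(170, 163)   # old PSU -> new PSU, GPU folding

print(f"Idle: {idle_drop:.1f}% less wall power")
print(f"GPU folding: {load_drop:.1f}% less wall power")
```

With readings this close together, a watt or two of reading error moves the percentages noticeably, which is why they are best treated as ballpark numbers.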

Efficiency Boost #2: Processor Underclock / Undervolt

Thanks to video gaming mentality, enthusiast-grade desktop processors and motherboards are tuned out of the box for performance. We’re talking about blistering fast, competition-crushing benchmark scores. For most computing tasks (such as running Folding@Home on a graphics card), this aggressive CPU behavior is wasting electricity while offering no discernible performance benefit. Despite what my kid’s shirt says, we need to reel these power hungry CPUs in for maximum GPU folding efficiency.

Never Slow Down

Kai Says: Never Slow Down

One way to improve processor efficiency is to reduce the clock rate and associated voltage. I’d previously investigated this here. High frequencies need disproportionately more voltage, and power rises roughly with the square of the voltage, so just by dropping the clock rate by 100 MHz or so, you can lower the voltage a bunch and save on power.
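To see why a small clock reduction can pay off, a common first-order model says dynamic CPU power scales with switched capacitance times voltage squared times frequency. The Python sketch below applies that model. The two voltage/frequency pairs are invented for illustration and are not measurements from this machine, and the function name is hypothetical.

```python
def relative_dynamic_power(volts, ghz, ref_volts, ref_ghz):
    """Dynamic power relative to a reference point, using P ~ C * V^2 * f.

    The switched capacitance C cancels out because it is the same chip.
    """
    return (volts / ref_volts) ** 2 * (ghz / ref_ghz)

# Hypothetical operating points (NOT measured on the 3950x)
boost = {"volts": 1.40, "ghz": 4.0}
base = {"volts": 1.10, "ghz": 3.5}

ratio = relative_dynamic_power(base["volts"], base["ghz"],
                               boost["volts"], boost["ghz"])
print(f"Base point uses about {ratio:.0%} of the boost point's dynamic power")
```

In this made-up case, a 12.5% clock reduction cuts dynamic power by roughly 46%, because the voltage term is squared. The model ignores leakage and the rest of the system, so wall power will not fall by the same fraction.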

With the advent of processors that up-clock and up-volt themselves (as well as going in the other direction), manual tuning can be a bit more difficult. It’s far easier to first try the automatic settings, to see if some efficiency can be gained.

But wait, this is a GPU folding benchmark rig? Why do the CPU’s frequency and power settings matter?

For GPU folding with an Nvidia graphics card, one CPU core is fully loaded per GPU slot in order to “feed” the card. This is because Nvidia’s implementation of OpenCL support uses a polling (checking) method. In order to keep the graphics card chugging along, the CPU constantly checks on the GPU to see if it needs any data. This polling loop is not efficient and burns unnecessary power. You can read more about it here: https://foldingforum.org/viewtopic.php?f=80&t=34023. In contrast, AMD’s method (interrupts) is a much more graceful implementation that doesn’t lock up a CPU core.

The constant polling loop drives modern gaming-oriented processors to clock up their cores unnecessarily. For the most part, the GPU does not need work at every waking moment. To save power, we can turn down the frequency, so that the CPU is not constantly knocking on the GPU’s metaphorical door.
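The difference between polling and interrupt-style waiting can be shown with a toy Python example. This is only an analogy built on `threading.Event`, not a model of how either vendor's driver actually works. The function names and the 50 ms "GPU work" delay are made up for illustration.

```python
import threading

def wait_by_polling(done):
    """Busy-wait: keep asking 'are you done yet?' and count the checks."""
    checks = 0
    while not done.is_set():
        checks += 1          # every pass through this loop burns CPU time
    return checks

def wait_by_blocking(done):
    """Sleep until signalled, similar in spirit to waiting on an interrupt."""
    done.wait()              # the thread is parked until the event is set
    return 1                 # one wake-up, no repeated checks

def run(waiter, work_seconds=0.05):
    """Pretend the 'GPU' finishes after work_seconds, then report the checks."""
    done = threading.Event()
    timer = threading.Timer(work_seconds, done.set)
    timer.start()
    checks = waiter(done)
    timer.join()
    return checks

polling_checks = run(wait_by_polling)
blocking_checks = run(wait_by_blocking)
print(f"polling: {polling_checks} checks, blocking: {blocking_checks} wake-up")
```

The polling version keeps a core busy for the whole wait, which is the behavior that pushes a boost-happy CPU to clock up. The blocking version does almost nothing until it is signalled.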

To do this, I disabled AMD’s Core Performance Boost (CPB) in the AMD Overclocking section of the BIOS (same thing as Intel’s Turbo Boost). This caps the processor speed at the base clock rate (3.5 GHz for the Ryzen 9 3950x), and also eliminates any high voltage values required to support the boost clocks.

Success! GPU folding total system power consumption is now much lower. With less superfluous power draw from the CPU, the wattage is much more comparable to the old Bulldozer rig.

Ryzen 9 3950x Power Reduction Table

It is interesting that idle power consumption came down as well. That wasn’t expected. When the computer isn’t doing anything, the CPU cores should be down-clocked / slept out. Perhaps my machine was doing something in the background during the earlier tests, thus throwing the results off. More investigation is needed.

GPU Benchmark Consistency Check

I fired up GPU folding on the Nvidia GeForce GTX 1650, a card that I have performance data for from my previous benchmark desktop. After monitoring it for a week, the Folding@Home Points Per Day performance was so similar to the previous results that I ended up using the same value (310K PPD) as the official estimate for the 1650’s production. This shows that the old benchmark rig was not a bottleneck for a budget card like the GeForce GTX 1650.

Using the updated system power consumption of nominally 140 watts (vs 145 watts of the previous benchmark machine), the efficiency plots (PPD/Watt) come out very nearly the same. I typically consider power measurements of +/- 5 watts to be within the measurement accuracy of my eyeball on the watt meter anyway, due to normal variations as the system runs. The good news is that even with this variation, it doesn’t change the conclusion of the figure (in terms of graphics card efficiency ranking).
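To put the +/- 5 watt reading error in perspective, this small Python sketch computes PPD/Watt for the GTX 1650 at the nominal 140 watt reading and at both ends of that band. The 310K PPD and 140 watt figures come from this post. The helper name `ppd_per_watt` is just for illustration.

```python
def ppd_per_watt(ppd, watts):
    """Folding efficiency: points per day per watt of wall power."""
    return ppd / watts

PPD_1650 = 310_000        # GTX 1650 estimate from this post
NOMINAL_WATTS = 140       # new rig, GPU folding
READING_ERROR = 5         # eyeball watt meter tolerance, in watts

low = ppd_per_watt(PPD_1650, NOMINAL_WATTS + READING_ERROR)   # 145 W case
mid = ppd_per_watt(PPD_1650, NOMINAL_WATTS)
high = ppd_per_watt(PPD_1650, NOMINAL_WATTS - READING_ERROR)  # 135 W case

print(f"{low:.0f} to {high:.0f} PPD/Watt (nominal {mid:.0f})")
```

The 145 watt end of the band is the same as the old benchmark machine's reading, so the two rigs land within the measurement noise of each other.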

GTX 1650 Efficiency on Ryzen 9

* Benchmark performed on updated Ryzen 9 build

Conclusion

I have a new 16-core beast of a benchmark machine. This computer wasn’t built exclusively for efficiency, but after a few tweaks, I was able to improve energy efficiency at low CPU loads (such as Windows Idle + GPU Folding).

For most of the graphics cards I have tested so far, the massive upgrade in system hardware will not likely affect performance or efficiency results. Very fast cards, such as the 1080 Ti, might benefit from the new benchmark rig’s faster hardware, especially that PCI-Express 4.0 x16 graphics card slot. Most importantly, future tests of blistering fast graphics cards (2080 Ti, 3080 Ti, etc) will probably not be limited by the benchmark machine’s background hardware.
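For a rough sense of the headroom involved, the theoretical per-direction bandwidth of a PCI-Express link can be estimated from the per-lane transfer rate and line encoding. The Python sketch below assumes a full x16 link on both the old PCI-Express 2.0 board and the new 4.0 board (the old slot's width is an assumption here), and real-world throughput will be lower than these peak figures.

```python
def pcie_bandwidth_gb_s(gigatransfers_per_s, payload_bits, total_bits, lanes):
    """Theoretical one-direction bandwidth in GB/s for a PCIe link."""
    usable_gbit_per_lane = gigatransfers_per_s * payload_bits / total_bits
    return usable_gbit_per_lane / 8 * lanes   # bits -> bytes, then all lanes

# PCIe 2.0: 5 GT/s per lane with 8b/10b encoding
pcie2_x16 = pcie_bandwidth_gb_s(5, 8, 10, lanes=16)
# PCIe 4.0: 16 GT/s per lane with 128b/130b encoding
pcie4_x16 = pcie_bandwidth_gb_s(16, 128, 130, lanes=16)

print(f"PCIe 2.0 x16: {pcie2_x16:.1f} GB/s")
print(f"PCIe 4.0 x16: {pcie4_x16:.1f} GB/s")
```

Whether a given card can actually use the extra bandwidth while folding is a separate question that only a re-test can answer.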

Oh, I can also now encode my backup copies of my blu-ray movies at 40 fps in H.265 in Handbrake (old speed was 6.5 fps on the FX-8320e). That’s a nice bonus too.

Efficiency Note (for GPU Folding@Home Users)

Disabling the automatic processor frequency and voltage scaling (Turbo Boost / Core Performance Boost) didn’t have any effect on the PPD being generated by the graphics card. This makes sense; even relatively slow 2.0 GHz CPU cores are still fast enough to feed most GPUs, and my modern Ryzen 9 at 3.5 GHz is no bottleneck for feeding the 1650. By disabling CPB, I shaved 23 watts off of the system’s power consumption for literally no performance impact while running GPU folding. This is a 16 percent boost in PPD/Watt efficiency, for free!
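The 16 percent figure can be reproduced with a few lines of Python. The 163 watt starting point is the GPU folding reading after the power supply upgrade, and PPD is held at 310K because disabling CPB did not change it. The variable names are illustrative.

```python
PPD = 310_000               # GTX 1650 production, unchanged by CPB setting
watts_cpb_on = 163          # GPU folding, new PSU, CPB enabled
watts_cpb_off = 163 - 23    # 23 W saved by disabling CPB -> 140 W

eff_on = PPD / watts_cpb_on
eff_off = PPD / watts_cpb_off
boost_percent = 100.0 * (eff_off / eff_on - 1.0)

print(f"CPB on:  {eff_on:.0f} PPD/Watt")
print(f"CPB off: {eff_off:.0f} PPD/Watt")
print(f"Efficiency gain: {boost_percent:.1f}%")
```

Because PPD is the same in both cases, the gain depends only on the ratio of the two wattages (163 / 140).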

This also dropped CPU temps from 70 degrees C to 55, and resulted in a lower CPU cooler fan speed / quieter machine. This should promote longevity of the hardware, and reduce how much my computer fights my air conditioning in the summer, thus having a compounding positive effect on my monthly electric bill.

Future Articles

  • Re-Test the 1080 Ti to see if a fast graphics card makes better use of the faster PCI-Express bus on the AM4 build
  • Investigate CPU folding efficiency on the Ryzen 9 3950x

 

Shout out to the helpers…Kai and Sam