Nvidia Rubin GPUs: 200 teraFLOPS FP64 from software emulation

Nvidia’s new Rubin GPUs use software emulation to boost FP64 performance for HPC, challenging AMD’s recent lead in this area. AMD, for its part, has voiced reservations about how well the method holds up in real-world applications.

Double-precision floating-point computation (FP64) is essential for modern HPC and scientific computing applications. Nvidia’s newly unveiled Rubin GPUs provide 33 teraFLOPS of peak FP64 performance without emulation, one teraFLOPS less than the four-year-old H100. With software emulation enabled in Nvidia’s CUDA libraries, the chip can reportedly achieve up to 200 teraFLOPS of FP64 matrix performance, a 4.4x increase over the hardware capabilities of Nvidia’s outgoing Blackwell accelerators.

Dan Ernst, senior director of supercomputing products at Nvidia, stated, “What we found is, through many studies with partners and with our own internal investigations, is that the accuracy that we get from emulation is at least as good as what we would get out of a tensor core piece of hardware.”

Nicholas Malaya, an AMD fellow, noted, “It’s quite good in some of the benchmarks, it’s not obvious it’s good in real, physical scientific simulations.” Malaya suggested FP64 emulation requires further research and experimentation.

FP64 remains the standard for scientific computing because of its dynamic range and precision: its 64 bits can encode more than 18.44 quintillion (2⁶⁴) distinct bit patterns. Modern AI models like DeepSeek R1, by contrast, are frequently trained at FP8, which has only 256 (2⁸). HPC simulations rely on fundamental physical principles, which makes them far less tolerant of error than AI workloads. Malaya explained, “As soon as you start incurring errors, these finite errors propagate, and they cause things like blow ups.”
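To make these figures concrete, the short Python sketch below counts the encodings available in each format and then illustrates how a small rounding error can grow. Two assumptions are ours, not the article's: the logistic map is only a stand-in for a sensitive simulation, and binary32 rounding stands in for "lower precision" because the Python standard library has no FP8 type.

```python
import struct

# A format with n bits has at most 2**n distinct encodings.
FP64_PATTERNS = 2 ** 64   # 18,446,744,073,709,551,616
FP8_PATTERNS = 2 ** 8     # 256


def round_to_fp32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def logistic_divergence(x0=0.2, steps=100):
    """Iterate x -> 4x(1-x) in binary64 and with binary32 rounding after each step.

    Returns the absolute difference between the two trajectories at every step.
    """
    hi, lo = x0, round_to_fp32(x0)
    diffs = []
    for _ in range(steps):
        hi = 4.0 * hi * (1.0 - hi)
        lo = round_to_fp32(4.0 * lo * (1.0 - lo))
        diffs.append(abs(hi - lo))
    return diffs


diffs = logistic_divergence()
print(f"FP64 encodings: {FP64_PATTERNS:,}; FP8 encodings: {FP8_PATTERNS}")
print(f"difference after 1 step: {diffs[0]:.1e}; largest over 100 steps: {max(diffs):.2f}")
```

The two trajectories start out agreeing to many digits and end up unrelated, which is the kind of propagation Malaya describes, shown here on a toy problem rather than a real physics code.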

The concept of using lower-precision data types to emulate FP64 is not new. Ernst mentioned, “Emulation is old as dirt. We had emulation in the mid ’50s before we had hardware for floating point.” In early 2024, researchers at the Tokyo and Shibaura institutes of technology published a paper exploring this concept. Their method showed that FP64 matrix operations could be decomposed into multiple INT8 operations, achieving higher-than-native performance on Nvidia’s tensor cores. This approach, known as the Ozaki scheme, forms the basis for Nvidia’s FP64 emulation libraries, released late last year. Ernst clarified, “it’s still FP64. It’s not mixed precision. It’s just done and constructed in a different way from the hardware perspective.”
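The idea of building an FP64 result out of small integer products can be sketched in a few lines of Python. This is a simplified illustration in the spirit of the Ozaki scheme, not Nvidia's library code or the published algorithm: the 6-bit slice width, truncation-based splitting, and the names `split_vector` and `emulated_dot` are assumptions made for this example, and Python's arbitrary-precision integers stand in for INT8 inputs with wider integer accumulation. A matrix multiply is many such dot products.

```python
import math

SLICE_BITS = 6  # each slice holds values in [-63, 63], which fits in a signed INT8


def split_vector(vec, num_slices):
    """Split floats into integer slices that share one power-of-two exponent.

    Returns (exp, slices) with
    vec[k] ~= sum over s of slices[s][k] * 2**(exp - (s + 1) * SLICE_BITS).
    """
    amax = max(abs(x) for x in vec)
    exp = math.frexp(amax)[1] if amax > 0.0 else 0  # every |x| < 2**exp
    residual = list(vec)
    slices = []
    for s in range(1, num_slices + 1):
        shift = exp - s * SLICE_BITS
        ints = [int(math.ldexp(r, -shift)) for r in residual]  # truncate toward zero
        # Subtracting the extracted leading bits is exact in binary64.
        residual = [r - math.ldexp(float(q), shift) for r, q in zip(residual, ints)]
        slices.append(ints)
    return exp, slices


def emulated_dot(a, b, num_slices=9):
    """Dot product assembled from exact small-integer dot products."""
    ea, sa = split_vector(a, num_slices)
    eb, sb = split_vector(b, num_slices)
    total = 0.0
    for s, a_ints in enumerate(sa, start=1):
        for t, b_ints in enumerate(sb, start=1):
            int_dot = sum(x * y for x, y in zip(a_ints, b_ints))  # exact integer math
            total += math.ldexp(float(int_dot), ea + eb - (s + t) * SLICE_BITS)
    return total


a = [1.5, -2.25, 3.141592653589793, 0.1]
b = [0.3, 4.0, -1.0e-3, 7.7]
reference = math.fsum(x * y for x, y in zip(a, b))
print("absolute error vs. FP64 reference:", abs(emulated_dot(a, b) - reference))
```

With nine 6-bit slices the split covers 54 bits below the largest element's exponent, so the result tracks the FP64 reference closely here. Fewer slices mean fewer integer products but a coarser result, which is the accuracy-versus-work trade-off at the heart of the debate.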

Modern GPUs contain low-precision tensor cores. Rubin’s tensor cores, for example, are capable of 35 petaFLOPS of dense FP4 compute. In FP64, these chips are more than 1,000x slower. Ernst explained that the efficiency of building and running these low-precision tensor cores prompted exploration into their use for FP64 computation. “We have the hardware, let’s try use it. That’s the history of supercomputing,” he said.
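The size of that gap follows directly from the figures quoted in this article. The snippet below is only that arithmetic; the constant names are ours.

```python
FP4_DENSE_FLOPS = 35e15       # 35 petaFLOPS of dense FP4, as quoted
FP64_NATIVE_FLOPS = 33e12     # 33 teraFLOPS of FP64 without emulation
FP64_EMULATED_FLOPS = 200e12  # up to 200 teraFLOPS of emulated FP64 matrix math

fp4_vs_native = FP4_DENSE_FLOPS / FP64_NATIVE_FLOPS
emulated_vs_native = FP64_EMULATED_FLOPS / FP64_NATIVE_FLOPS

print(f"FP4 vs native FP64: about {fp4_vs_native:.0f}x")
print(f"emulated vs native FP64: about {emulated_vs_native:.1f}x")
```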

AMD expressed concerns about the accuracy of FP64 emulation. Malaya indicated that FP64 emulation performs well for well-conditioned numerical systems, citing the High Performance Linpack (HPL) benchmark. However, “when you look at material science, combustion codes, banded linear algebra systems, things like that, they are much less well conditioned systems, and suddenly it starts to break down,” he said. Malaya noted that FP64 emulation is not fully IEEE compliant, as Nvidia’s algorithms do not account for nuances such as positive versus negative zeros, not-a-number (NaN) errors, or infinity errors. Small errors in the intermediate operations used for emulation can accumulate into inaccurate results, and adding operations to compensate can erase the performance advantage. Malaya also reported, “We have data that shows you’re using about twice the memory capacity in Ozaki to emulate that FP64 matrices.” AMD is therefore focusing on specialized hardware for double and single precision, with its upcoming MI430X using a chiplet architecture to bolster performance.
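Why conditioning matters can be seen in a toy example. The 2x2 system below is our own illustration, not AMD's data or one of the application codes Malaya mentions: it is nearly singular, so nudging one input by less than one part in a billion changes the answer completely. The helper `solve_2x2` and the constant `D` are hypothetical names for this sketch.

```python
def solve_2x2(a, b):
    """Solve a 2x2 linear system a @ x = b with Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    x0 = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    x1 = (a[0][0] * b[1] - a[1][0] * b[0]) / det
    return [x0, x1]


D = 2.0 ** -30                    # about 9.3e-10, exactly representable in FP64
A = [[1.0, 1.0], [1.0, 1.0 + D]]  # rows are almost identical: poorly conditioned

x_clean = solve_2x2(A, [2.0, 2.0 + D])
x_nudged = solve_2x2(A, [2.0, 2.0 + 2.0 * D])  # right-hand side off by only D

print("solution:", x_clean)                 # [1.0, 1.0]
print("after a tiny nudge:", x_nudged)      # [0.0, 2.0]
```

In a well-conditioned system the same nudge would move the solution by a similarly tiny amount, which is why a benchmark like HPL can look fine while harder problems expose small intermediate errors.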

Ernst acknowledged gaps in Nvidia’s implementation. He contended that the distinction between positive and negative zeros is not critical for most HPC practitioners, and said Nvidia has developed supplemental algorithms to detect and mitigate issues such as not-a-number (NaN) values and infinities. Regarding memory consumption, Ernst conceded it can be higher, but said the overhead is relative to the operation, not the application, and typically involves matrices of a few gigabytes. He also argued that IEEE compliance issues often do not arise in matrix multiplication cases. “Most of the use cases where IEEE compliance ordering rules are in play don’t come up in matrix, matrix multiplication cases. There’s not a DGEMM that tends to actually follow that rule anyway,” Ernst said.
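One plausible shape for such a safeguard is a cheap pre-check that routes inputs containing special values to a native path. The sketch below is a hypothetical policy written for illustration, not a description of Nvidia's supplemental algorithms; `toy_integer_dot` is a stand-in for any integer-based path and `guarded_dot` is an assumed name.

```python
import math


def native_dot(a, b):
    """Plain floating-point dot product; NaN and infinity propagate per IEEE 754."""
    return sum(x * y for x, y in zip(a, b))


def toy_integer_dot(a, b, frac_bits=40):
    """Stand-in integer path: fixed-point integers cannot encode NaN or infinity.

    int() raises on NaN or infinity, and int(-0.0) is plain 0, so the sign
    of a negative zero is lost as well.
    """
    scale = 2 ** frac_bits
    ai = [int(x * scale) for x in a]
    bi = [int(y * scale) for y in b]
    return sum(p * q for p, q in zip(ai, bi)) / scale ** 2


def has_non_finite(values):
    return any(not math.isfinite(v) for v in values)


def guarded_dot(a, b):
    """Use the integer path only when every input is finite (hypothetical policy)."""
    if has_non_finite(a) or has_non_finite(b):
        return native_dot(a, b)
    return toy_integer_dot(a, b)


print(guarded_dot([1.0, 2.0], [3.0, 4.0]))       # 11.0
print(guarded_dot([1.0, math.inf], [1.0, 1.0]))  # inf
```

A check like this costs one pass over the inputs, which is small next to the cost of a large matrix multiply.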

FP64 emulation is primarily effective for the subset of HPC applications that rely on dense general matrix multiply (DGEMM) operations. Malaya estimated that for 60 to 70 percent of HPC workloads, emulation offers minimal benefit. “In our analysis the vast majority of real HPC workloads rely on vector FMA, not DGEMM,” he said. For vector-heavy tasks, like computational fluid dynamics, Rubin GPUs must fall back on the slower FP64 vector units in their CUDA cores. Ernst pointed out that more FLOPS do not always mean more useful FLOPS, because memory bandwidth often limits real-world performance. He cited the TOP500’s vector-heavy High Performance Conjugate Gradient (HPCG) benchmark, where CPUs often lead because their memory subsystems deliver more bits per FLOPS.
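A simple roofline-style estimate shows why extra matrix FLOPS do little for vector-heavy code. The peak figures below come from this article, but the memory bandwidth is a made-up placeholder chosen for illustration, not a Rubin specification, and the byte counts assume each operand moves through memory exactly once.

```python
def attainable_flops(peak_flops, bandwidth_bytes_per_s, flops_per_byte):
    """Roofline model: throughput is capped by compute or by memory traffic."""
    return min(peak_flops, bandwidth_bytes_per_s * flops_per_byte)


def dgemm_intensity(n):
    """FLOPs per byte for an n x n FP64 GEMM: 2n^3 FLOPs, read A and B, write C."""
    return (2 * n ** 3) / (3 * n ** 2 * 8)


# y = a*x + y: one FMA (2 FLOPs) per element for 24 bytes (read x, read y, write y).
AXPY_INTENSITY = 2 / 24

HYPOTHETICAL_BANDWIDTH = 10e12  # 10 TB/s, illustrative placeholder only

vector_rate = attainable_flops(33e12, HYPOTHETICAL_BANDWIDTH, AXPY_INTENSITY)
matrix_rate = attainable_flops(200e12, HYPOTHETICAL_BANDWIDTH, dgemm_intensity(8192))

print(f"vector FMA: {vector_rate / 1e12:.2f} teraFLOPS attainable")
print(f"large DGEMM: {matrix_rate / 1e12:.0f} teraFLOPS attainable")
```

Under these assumptions the vector kernel is limited by memory long before it reaches the 33 teraFLOPS peak, while a large DGEMM reuses each value enough to be limited by compute instead.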

With new supercomputers integrating Nvidia’s Blackwell and Rubin GPUs, the viability of FP64 emulation will be tested. The algorithms’ inherent independence from specific hardware allows for potential improvements over time. Malaya confirmed AMD is also exploring FP64 emulation on chips like the MI355X via software flags to identify suitable applications. He indicated that IEEE compliance would validate the approach by ensuring consistent results between emulation and dedicated silicon. Malaya stated, “If I can go to a partner and say run these two binaries: this one gives you the same answer as the other and is faster, and yeah under the hood we’re doing some scheme — think that’s a compelling argument that is ready for prime time.” He added that specific applications might be more reliable with emulation, suggesting, “We should, as a community, build a basket of apps to look at.”
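Malaya's "two binaries" test amounts to checking that both runs produce identical results. A minimal sketch of a strict, bit-for-bit comparison is shown below; the function names are ours, and real validation suites may well use tolerances rather than exact equality.

```python
import math
import struct


def fp64_bits(x):
    """Return the 8-byte IEEE 754 binary64 encoding of a Python float."""
    return struct.pack("<d", x)


def bitwise_equal(xs, ys):
    """True only if both result lists match bit for bit."""
    return len(xs) == len(ys) and all(
        fp64_bits(x) == fp64_bits(y) for x, y in zip(xs, ys)
    )


print(bitwise_equal([1.0, 2.5], [1.0, 2.5]))  # True
print(0.0 == -0.0)                            # True: ordinary comparison ignores the sign
print(bitwise_equal([0.0], [-0.0]))           # False: the encodings differ
```

Comparing encodings rather than values is what catches signed-zero mismatches, and it also treats a NaN as equal to an identically encoded NaN, which ordinary `==` never does.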
