Two Chinese supercomputers reportedly break exascale barrier


Sunway Oceanlite Supercomputer

Two Chinese supercomputers have reportedly broken the exascale barrier, but their developers prefer to keep quiet about it for now.

One system is reportedly based on China's homegrown Sunway processors and the other on Phytium chips, so neither uses crucial technologies developed outside of China.

If the information is correct, then China is ahead of the U.S. in exascale supercomputing, but there is a catch.

Almost a Year Ahead

Two systems in China achieved 1.3 ExaFLOPS peak performance and around 1.05 ExaFLOPS (or higher) sustained performance in the Linpack benchmark in March 2021, NextPlatform reports.

However, neither machine is currently listed in the global TOP500 list of supercomputers, because their developers do not want their partners' subcontractors to get into trouble with the U.S. government.

NextPlatform says it got the information from a U.S.-based source who knows what is going on in China. If the information is accurate, China has beaten the U.S. to exascale by almost a year, as the Oak Ridge Leadership Computing Facility's Frontier supercomputer is not expected to start operations until late 2021.

Yet there are some factors to consider. Frontier's target performance is about 1.5 ExaFLOPS, which is almost 50% higher than the sustained performance of China's exascale supercomputers.
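The size of that gap is simple arithmetic on the reported figures. The Python sketch below assumes Frontier's 1.5 ExaFLOPS target and the roughly 1.05 ExaFLOPS sustained Linpack figure reported for the Chinese systems; keep in mind that one number is a target and the other a reported measurement, so this is only a rough comparison.

```python
def percent_higher(a: float, b: float) -> float:
    """Return how much larger a is than b, as a percentage of b."""
    return (a / b - 1.0) * 100.0

frontier_target_ef = 1.5    # Frontier's target performance, ExaFLOPS
china_sustained_ef = 1.05   # reported sustained Linpack performance, ExaFLOPS

gap = percent_higher(frontier_target_ef, china_sustained_ef)
print(f"Frontier target is {gap:.1f}% higher")
```

With these inputs the gap comes out to about 43%, which the article rounds to "almost 50%".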

Furthermore, Frontier is projected to consume around 30 MW of power, whereas one of its rivals from China has a power consumption of about 35 MW. Last but not least, Chinese developers use existing architectures developed for PetaFLOPS-scale systems and workloads, which may not be optimal in the future.

Sunway Architecture

The first Chinese exascale system is located at the National Supercomputing Center in Wuxi. The supercomputer, called Sunway Oceanlite, was designed by the National Research Center of Parallel Computer Engineering and Technology (NRCPC) and is based on proprietary hybrid manycore Sunway processors discussed in connection with exascale machines earlier this year.

The Shenwei/Sunway CPU architecture has been around since 2016, when the Sunway TaihuLight supercomputer, powered by 40,960 Sunway SW26010 processors, was launched. The SW26010 CPU uses four heterogeneous clusters, called core groups (CGs), interconnected by a high-performance network-on-chip.

Each CG features a protocol processing unit (PPU), a DDR3 memory controller, one management processing element (MPE) with a 256-bit vector engine, and 64 compute processing elements (CPEs) with the same 256-bit vector engine. In total, each SW26010 has four MPEs and 256 CPEs (260 cores) that support coherency and run at around 1.5 GHz.
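The per-chip core count follows directly from the core-group layout. The Python sketch below models it as a small data structure; the class and field names are hypothetical and chosen only for illustration, and the PPU and memory controller are left out because they are not compute cores.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class CoreGroup:
    """One SW26010 core group (CG): one MPE plus 64 CPEs."""
    mpes: int = 1   # management processing elements
    cpes: int = 64  # compute processing elements

    @property
    def cores(self) -> int:
        return self.mpes + self.cpes

@dataclass(frozen=True)
class Processor:
    """A Sunway processor built from identical core groups."""
    core_groups: int = 4
    group: CoreGroup = field(default_factory=CoreGroup)

    @property
    def cores(self) -> int:
        return self.core_groups * self.group.cores

sw26010 = Processor()
taihulight_sockets = 40_960

print(sw26010.cores)                       # 260 (4 MPEs + 256 CPEs)
print(sw26010.cores * taihulight_sockets)  # 10649600 cores in TaihuLight
```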

China envisioned that by increasing the number of MPE and CPE cores per CPU and altering their architecture (e.g., by adding support for 512-bit vector instructions to CPEs), it would be possible to build a foundation for up to a 4 ExaFLOPS supercomputer using Sunway architecture.
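Why wider vectors matter for FP64 throughput can be shown with back-of-the-envelope arithmetic. This Python sketch assumes, hypothetically, that each core has one fused multiply-add (FMA) unit completing one FMA (counted as 2 FLOPs) per 64-bit lane per cycle; the report does not describe the actual Sunway pipeline, so the absolute numbers are illustrative and only the 2x ratio matters.

```python
FP64_BITS = 64

def fp64_lanes(vector_bits: int) -> int:
    """Number of 64-bit (FP64) values one vector register holds."""
    return vector_bits // FP64_BITS

def fp64_flops_per_cycle(vector_bits: int, fma_units: int = 1) -> int:
    """Peak FP64 FLOPs per core per cycle, assuming each FMA unit retires
    one fused multiply-add (2 FLOPs) per lane per cycle (an assumption)."""
    return fp64_lanes(vector_bits) * 2 * fma_units

for width in (256, 512):
    print(width, fp64_lanes(width), fp64_flops_per_cycle(width))
```

Under this assumption, moving the CPEs from 256-bit to 512-bit vectors doubles peak FP64 work per core per cycle (4 lanes to 8 lanes) without adding cores.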

The report says that NRCPC engineers doubled the number of cores per processor (presumably to 520, up from 260) to double the performance per socket, and produced the new CPU on a modern process technology to keep power consumption in check. They then doubled the number of nodes and introduced a new interconnect and possibly a new storage system to reach 1.05 ExaFLOPS sustained using 42 million 64-bit RISC cores.
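A quick sanity check shows that the reported doubling lines up with the 42 million core figure. The Python sketch below starts from TaihuLight's 40,960 SW26010 processors at 260 cores each and assumes one processor per node, so doubling the nodes doubles the sockets; the 520-core figure is itself a guess from the report.

```python
taihulight_sockets = 40_960  # SW26010 processors in TaihuLight
cores_per_sw26010 = 260      # 4 MPEs + 256 CPEs

# Scaling as described in the report (assumes one CPU per node).
cores_per_new_cpu = cores_per_sw26010 * 2   # doubled cores per processor
oceanlite_sockets = taihulight_sockets * 2  # doubled number of nodes

total_cores = cores_per_new_cpu * oceanlite_sockets
print(cores_per_new_cpu, oceanlite_sockets, total_cores)
```

The result, about 42.6 million cores, matches the reported 42 million, which makes the 520-core guess plausible but does not confirm it.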

A clear advantage of this approach is that NRCPC retained a familiar architecture that can run both existing and upcoming high-performance computing (HPC) workloads that require FP64 and AI/ML workloads that use mixed precision.

Meanwhile, doubling both the cores per socket and the number of sockets pushed power consumption to about 35 MW. That level is not extreme, but it shows that the Oceanlite supercomputer is considerably less energy efficient than ORNL's Frontier.
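The efficiency gap can be expressed in GFLOPS per watt. This Python sketch plugs in the article's figures: about 1.05 ExaFLOPS sustained at 35 MW for Oceanlite, and Frontier's 1.5 ExaFLOPS target at a projected 30 MW. Because Frontier's numbers are a target and a projection rather than measurements, treat the result as a rough illustration.

```python
def gflops_per_watt(exaflops: float, megawatts: float) -> float:
    """Convert a performance figure in ExaFLOPS and a power draw in MW
    into energy efficiency in GFLOPS per watt."""
    flops = exaflops * 1e18
    watts = megawatts * 1e6
    return flops / watts / 1e9

oceanlite = gflops_per_watt(1.05, 35)  # reported sustained, reported power
frontier = gflops_per_watt(1.5, 30)    # target performance, projected power
print(f"Oceanlite ~{oceanlite:.0f} GFLOPS/W, Frontier ~{frontier:.0f} GFLOPS/W")
```

On these figures Oceanlite delivers roughly 30 GFLOPS/W versus roughly 50 GFLOPS/W for Frontier.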

Phytium Architecture

China's second exascale supercomputer is the Tianhe-3 machine located at the National Supercomputer Center in Guangzhou, China. The system is powered by Armv8-based Phytium 2000+ (FTP) processors, primarily designed for traditional HPC workloads with full FP64 precision, and Matrix 2000+ (MTP) DSP accelerators.

There are few details about Tianhe-3's performance, but its Rpeak is reportedly around 1.3 ExaFLOPS and its Rmax (sustained Linpack performance) is comfortably above 1 ExaFLOPS. It is also unclear how much power the supercomputer consumes.
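The ratio of Rmax to Rpeak, often called Linpack (HPL) efficiency, shows how much of a machine's theoretical peak it actually sustains. This Python sketch uses the reported figures; since Tianhe-3's Rmax is only described as "comfortably above" 1 ExaFLOPS, using 1.0 yields a lower bound, and the Oceanlite line uses the roughly 1.05 ExaFLOPS sustained and 1.3 ExaFLOPS peak figures reported earlier.

```python
def hpl_efficiency(rmax: float, rpeak: float) -> float:
    """Fraction of theoretical peak (Rpeak) sustained on Linpack (Rmax)."""
    return rmax / rpeak

# Tianhe-3: Rmax reportedly above 1 ExaFLOPS, so 1.0 gives a lower bound.
tianhe3_lower_bound = hpl_efficiency(1.0, 1.3)
# Oceanlite: about 1.05 ExaFLOPS sustained out of 1.3 ExaFLOPS peak.
oceanlite_eff = hpl_efficiency(1.05, 1.3)

print(f"Tianhe-3 HPL efficiency > {tianhe3_lower_bound:.0%}")
print(f"Oceanlite HPL efficiency ~ {oceanlite_eff:.0%}")
```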

Architecturally, Tianhe-3 resembles Tianhe-2A (launched in 2015), which relied on Phytium's FT-2000 CPUs and Matrix 2000 DSP accelerators. To get above 1 ExaFLOPS, developers had to increase the number of processors and accelerators, which probably involved designing new silicon with more cores and processing elements, made on a finer fabrication process.

With so few details about Tianhe-3 available, it is hard to say exactly how it reached exascale class; all we can say is that Phytium's architecture from 2015 proved scalable enough.

To develop two of the world's first exascale supercomputers, scientists from the National Supercomputing Center in Wuxi and the National Supercomputing Center in Guangzhou decided to play it safe and rely on existing architectures.

Developers from Sunway Microelectronics (also known as Shenwei Microelectronics) and Tianjin Phytium Information Technology then successfully designed suitable chips and produced them on contemporary process nodes.

It is unclear which process technologies were used to make the new chips, though one can speculate that they are proven 14nm/16nm-class processes that offer good yields and whose use is not under close watch by the U.S. government. It is also unknown whether China-based SMIC or Taiwan-based TSMC makes the chips.

Still, both foundries have their advantages: SMIC is beyond the direct control of U.S. authorities, whereas TSMC has proven HPC-oriented libraries for its N16 node.

China's exascale supercomputers may not be very energy efficient, but if they are used to develop new weapons, power consumption is the least of their operators' concerns. They also may not scale efficiently to 2 ExaFLOPS or 4 ExaFLOPS, but they offer plenty of performance today. Furthermore, if their manufacturing is primarily localized, China can build more exascale supercomputers to become more competitive in various spheres.

Tianjin Phytium Information Technology and Sunway Microelectronics (or Shenwei Microelectronics) are on the U.S. Commerce Department's Entity List, which makes it extremely hard for them to develop brand-new architectures and state-of-the-art chips for future exascale supercomputers. So while China might be the first to reach 1 ExaFLOPS, it may stay at that level for a while.

1.05 ExaFLOPS is plenty of computing horsepower, but the supercomputing race is accelerating, and only time will tell how quickly companies like AMD, Intel, and Nvidia develop technologies enabling 4 ExaFLOPS or 10 ExaFLOPS systems for the U.S. and Europe. Still, China's supercomputing capabilities will remain quite formidable for at least the next couple of years.

Tony Simon
