The dirty secret of high performance computing

In the decades since Seymour Cray developed what is considered the world's first supercomputer, the CDC 6600, an arms race has been waged in the high-performance computing (HPC) community. The goal: improve performance, by any means, at any cost.

Driven by advances in computing, storage, networking, and software, the performance of leading systems has increased trillions of times since the introduction of the CDC 6600 in 1964, from millions of floating point operations per second (megaFLOPS) to quintillions (exaFLOPS).

The current holder of the crown, a colossal American supercomputer called Frontier, is capable of 1.102 exaFLOPS according to the High Performance Linpack (HPL) benchmark. But even more powerful machines are believed to be operating elsewhere, behind closed doors.

The arrival of so-called exascale supercomputers is expected to benefit virtually every industry – from science to cybersecurity, from healthcare to finance – and pave the way for powerful new AI models that would otherwise have taken years to develop.

The CDC 6600, widely considered to be the world's first supercomputer. (Image credit: Computer History Museum)

However, an increase in speed of this magnitude comes at a cost: power consumption. At full speed, Frontier consumes up to 40 MW of power, roughly the same as 40 million desktop computers.

Supercomputing has always been about pushing the limits of what is possible. But as the need to minimize emissions becomes increasingly apparent and energy prices continue to rise, the HPC industry will need to reassess whether its original guiding principle is still worth following.

Performance vs. efficiency

One organization operating at the forefront of this problem is the University of Cambridge, which, in partnership with Dell Technologies, has developed several state-of-the-art energy-efficient supercomputers.

The Wilkes3, for example, ranks only 100th in the overall performance rankings, but sits 3rd in the Green500, a ranking of HPC systems based on performance per watt of energy consumed.

In a conversation with TechRadar Pro, Dr. Paul Calleja, Director of Research Computing Services at the University of Cambridge, explained that the institution is much more concerned with building highly productive and efficient machines than with extremely powerful ones.

“We are not really interested in large systems, because they are very specific point solutions. But the technologies deployed inside them have much broader application and will allow systems running an order of magnitude slower to operate much more cheaply and energy efficiently,” says Dr. Calleja.

“By doing so, you democratize access to computing for many more people. We are interested in using the technologies designed for these big flagship systems to create much more sustainable supercomputers for a broader audience.”

The Wilkes3 supercomputer may not be the fastest in the world, but it is among the most energy efficient. (Image credit: University of Cambridge)

In the coming years, Dr. Calleja also predicts an increasingly fierce drive for energy efficiency in the HPC industry and in the wider data center community, where energy consumption accounts for more than 90% of costs, we are told.

Recent changes in energy prices linked to the war in Ukraine have also made running supercomputers considerably more expensive, particularly in the context of exascale computing, further illustrating the importance of performance per watt.

In the case of Wilkes3, the university found a number of optimizations that helped improve efficiency. For example, by lowering the clock speed at which certain components were running, depending on the workload, the team was able to cut power consumption by roughly 20-30%.

“Within a particular architectural family, clock speed has a linear relationship with performance, but a squared relationship with power consumption. That's the killer,” explained Dr. Calleja.

“Reducing the clock speed reduces power consumption at a much faster rate than performance, but it also lengthens the time it takes to complete a job. So what we should be looking at is not the power consumption during a run, but the energy consumed by the job as a whole. There is a sweet spot.”
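
To make the arithmetic behind that sweet spot concrete, the sketch below applies the scaling Dr. Calleja describes: performance roughly linear in clock frequency, dynamic power roughly proportional to its square, plus an assumed fixed draw for the rest of the system. The constants are illustrative placeholders, not measured Wilkes3 figures.

```cpp
#include <cstdio>
#include <cmath>

int main() {
    // Assumed, illustrative constants -- not real Wilkes3 data.
    const double work = 1.0;        // units of computation in one job
    const double p_static = 100.0;  // fixed system power draw, in watts
    const double k_dyn = 200.0;     // dynamic power at normalised clock f = 1.0, in watts

    std::printf("%6s %10s %10s %12s\n", "clock", "power(W)", "time(s)", "energy(J)");
    for (double f = 0.4; f <= 1.001; f += 0.1) {
        const double power  = p_static + k_dyn * f * f;  // power ~ static + k * f^2 (squared law)
        const double time   = work / f;                   // runtime ~ 1/f (linear performance)
        const double energy = power * time;               // energy consumed by the whole job
        std::printf("%6.1f %10.1f %10.2f %12.1f\n", f, power, time, energy);
    }

    // Under this model, energy per job is minimised at f = sqrt(p_static / k_dyn),
    // i.e. well below the maximum clock.
    std::printf("sweet spot at f = %.2f\n", std::sqrt(p_static / k_dyn));
    return 0;
}
```

In this toy model, running flat out wastes dynamic power, while running too slowly pays the fixed draw for longer; the minimum energy per job sits somewhere in between, which is why the right clock setting depends on the workload.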

Software is king

Beyond fine-tuning hardware configurations for specific workloads, there are also a number of optimizations to be made elsewhere: in storage and networking, and in related disciplines such as cooling and rack design.

However, when asked where specifically he would like to see resources allocated in the quest to improve energy efficiency, Dr. Calleja explained that the focus should be on software first and foremost.

“The hardware is not the problem; it is about the efficiency of the applications. This will be the main bottleneck moving forward,” he said. “Today's exascale systems are based on GPU architectures, and the number of applications that can run efficiently at scale on GPU systems is small.”

“To really take advantage of today's technology, we need to focus on application development. The development life cycle spans decades; the software in use today was developed 20 or 30 years ago, and it's difficult when you have code that old that needs to be redesigned.”

The problem, however, is that the HPC industry is not in the habit of thinking software first. Much more attention has historically been paid to hardware because, in the words of Dr. Calleja, “it's easy; you just buy a faster chip. You don't need to think smart.”

"When we had Moore's Law, with processor performance doubling every eighteen months, you didn't have to do anything to increase performance. But that era is over. Now, if we want to make progress, we have to go back and retool the software. »

As Moore's Law begins to weaken, advances in CPU architecture can no longer be relied upon as a source of performance improvement. (Image credit: Alexander_Safonov/Shutterstock)

Dr. Calleja reserved some praise for Intel in this regard. As the server hardware space becomes more diverse from a vendor perspective (in many ways a positive development), application compatibility can become an issue, but Intel is working on a solution.

“One differentiator I see for Intel is that they are investing heavily in the oneAPI ecosystem, to develop code portability between silicon types. It is these kinds of toolchains that we need to enable the applications of tomorrow to take advantage of emerging silicon,” he notes.
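
As a rough illustration of the portability such toolchains aim for, here is a minimal SYCL/DPC++ sketch: a generic vector addition whose single source can be compiled for CPUs or GPUs from different vendors. It is a hypothetical example for context, not code drawn from the Cambridge or Wilkes3 software stack.

```cpp
#include <sycl/sycl.hpp>  // SYCL 2020 header shipped with oneAPI's DPC++ compiler
#include <iostream>
#include <vector>

int main() {
    constexpr size_t N = 1024;
    std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N, 0.0f);

    sycl::queue q;  // selects whatever device is available: CPU, GPU, other accelerator
    std::cout << "Running on: "
              << q.get_device().get_info<sycl::info::device::name>() << "\n";
    {
        sycl::buffer<float> A(a.data(), sycl::range<1>(N));
        sycl::buffer<float> B(b.data(), sycl::range<1>(N));
        sycl::buffer<float> C(c.data(), sycl::range<1>(N));

        q.submit([&](sycl::handler& h) {
            sycl::accessor xa(A, h, sycl::read_only);
            sycl::accessor xb(B, h, sycl::read_only);
            sycl::accessor xc(C, h, sycl::write_only, sycl::no_init);
            // The same kernel source runs on any device the runtime exposes.
            h.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
                xc[i] = xa[i] + xb[i];
            });
        });
    }  // buffers go out of scope here and copy results back to the host vectors

    std::cout << "c[0] = " << c[0] << "\n";  // expected: 3
    return 0;
}
```

Built with a SYCL-capable compiler such as oneAPI's icpx with the -fsycl flag, the same source can target different back ends, which is the kind of cross-silicon portability Dr. Calleja describes.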

Separately, Dr. Calleja called for a greater focus on “scientific necessity.” Too often, things “get lost in translation,” creating a mismatch between hardware and software architectures and the actual needs of the end user.

According to him, a more forceful approach to cross-industry collaboration would create a "virtuous circle" of users, service providers and vendors, yielding both performance and efficiency benefits.

A zetta-scale future

Inevitably, with the symbolic milestone of exascale now having fallen, attention will turn to the next one: zettascale.

"Zettascale is just the next flag in the ground," said Dr. Calleja, "a totem pole that highlights the technologies needed to reach the next stage of computing advancement that cannot be obtained today."

“The fastest systems in the world are extremely expensive for what you get out of them in terms of scientific output. But they are important because they demonstrate the art of the possible and move the industry forward.”

Pembroke College, University of Cambridge, home to the Open Zettascale Lab. (Image credit: University of Cambridge)

Whether systems capable of achieving zettaFLOPS of performance, a thousand times more powerful than the current crop, can be developed in a way that aligns with sustainability goals will depend on the inventiveness of the industry.

Performance and power efficiency are not an either/or choice, but it will take a fair amount of skill in every sub-discipline to deliver the necessary performance boost within a suitable power envelope.

In theory, there is a golden ratio of performance to energy consumption, at which the benefits to society generated by HPC can be said to justify the cost of the carbon emissions.

The precise figure will remain elusive in practice, of course, but pursuing the idea is by definition a step in the right direction.