Graphcore Right On The Money In First MLPerf Appearance
When it comes to a silicon startup bringing a product to market in a tough competitive environment, nothing is easy. The list of challenges is long, but to be taken seriously against the incumbents, a strong MLPerf showing is now paramount.
As usual, Nvidia and Google swept the MLPerf results this time around with the DGX A100 and TPU4 systems, respectively. While the full results are available elsewhere, we wanted to set the clear performance leaders aside this time and look at the results from startup Graphcore which, while not at the top of the charts, has notable price/performance advantages.
Every company tries to do something clever with how it reports benchmark results when it is at a pure performance disadvantage. But what Graphcore did is actually smart. They said, in effect, “Look at what you can do for half the cost.” Is it the best? No. Can you imagine scaling that out from a cost perspective? All things being relative, yes.
Since this benchmark is ostensibly used to help potential sites decide what to deploy, this is a big deal, especially with training costs skyrocketing along with model sizes. We can always deduce the cost of a system used for MLPerf, but for this training iteration Graphcore spells it out: its POD16 system costs half of what a DGX A100 does and delivers comparable performance. That holds for two models, only one of which (the NLP-focused one) has a direct real-world correlation these days.
Performance may be of the utmost importance for the largest users of AI/ML training systems, but we are quickly reaching a point where price/performance will be the defining metric, especially as more and more expensive hardware hits the market in the years to come. AI training is already an elite pursuit, but how will the marketplace view purchases of dedicated training hardware five years from now?
It is worth Graphcore spending a lot of time and software and hardware resources to do the MLPerf dance. It’s legit, and it lets them tell a price story that is more urgent these days than scalable performance alone: sacrifice just a little on performance to get something you can afford now, something designed to scale with the addition of more boxes, and keep costs as low as possible. And at the same time, by going through the ordeal of benchmarking with limited internal resources, they learn more about how far they can push their system and software stack.
Matt Fyles of Graphcore says it took several months of optimization, done alongside ongoing work for existing users on their internal hardware, but they eventually managed to deliver results in both the open and closed divisions for two models: ResNet-50 for computer vision and BERT for natural language processing. “When MLPerf was first introduced, we were reluctant to contribute because we weren’t sure of the benefits for our customers. But as it got bigger, we knew we had to do MLPerf. We have to show that our technology can play in all the same boxes with all the same constraints as competing technologies.”
And by competing technologies, Fyles certainly means Nvidia, which he referred to often on a call to review the results. The GPU maker’s DGX A100 results are the standard against which Graphcore measures its own. And while there aren’t necessarily huge gains for the POD16 over the DGX A100, it’s the price/performance difference that Graphcore wants to push.
As a startup, it was not really feasible for Graphcore to run its MLPerf submissions on anything larger than its POD64 system (shown below) for BERT, with the POD16 machine used for the less compute-intensive ResNet-50 runs. Google and Nvidia bring complete supercomputers to the MLPerf game. “We can scale up to 64,000. It’s not something we’ll do next week, but we’re definitely looking to join the others sooner,” says Fyles. “The sweet spot for our current customers is around POD16 and POD64; that’s where people get IPUs to experiment. Ease of scaling is built into our architecture.”
“We could have done more with a larger-scale system, but we show in these benchmarks how we configured the machines to take advantage of disaggregation,” says Fyles. He adds that by choosing different configurations to meet the needs of different workloads (ResNet versus the preprocessing/data-intensive BERT), they can show how different systems deliver optimized performance and efficiency, depending on training requirements.
Graphcore is directly targeting Nvidia’s DGX A100. “We want to bring a different hardware architecture and platform that is not just another HBM-based accelerator system. Nvidia hasn’t released a price for the A100 system, but based on market information and what resellers tell us, it’s around $300,000. A POD16 from us costs $149,000.”
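The price comparison in that quote comes down to simple arithmetic. As a rough sketch in Python, taking the roughly $300,000 DGX A100 estimate and the $149,000 POD16 price at face value, the following works out the price ratio and the break-even slowdown, meaning how much longer a POD16 could take to train a model and still come out ahead on performance per dollar. The function names, and the definition of performance per dollar as the inverse of training time multiplied by system price, are illustrative assumptions, not Graphcore's or MLPerf's methodology. The training times in the usage section are hypothetical placeholders, not MLPerf results.

```python
# Prices as quoted in the article; the DGX A100 figure is a reseller-based estimate.
DGX_A100_PRICE_USD = 300_000
POD16_PRICE_USD = 149_000


def price_ratio(system_price: float, reference_price: float) -> float:
    """Price of the system as a fraction of the reference system's price."""
    return system_price / reference_price


def break_even_slowdown(system_price: float, reference_price: float) -> float:
    """How many times longer the cheaper system may take to train and still
    tie the reference on performance per dollar (assumed definition:
    performance per dollar = 1 / (training time * system price))."""
    return reference_price / system_price


def perf_per_dollar_advantage(system_minutes: float, system_price: float,
                              reference_minutes: float, reference_price: float) -> float:
    """Ratio of performance per dollar, system over reference.
    A value above 1.0 means the system wins on this metric."""
    return (reference_minutes * reference_price) / (system_minutes * system_price)


if __name__ == "__main__":
    print(f"POD16 price as a share of DGX A100: "
          f"{price_ratio(POD16_PRICE_USD, DGX_A100_PRICE_USD):.3f}")
    print(f"Break-even slowdown: "
          f"{break_even_slowdown(POD16_PRICE_USD, DGX_A100_PRICE_USD):.2f}x")

    # Hypothetical training times, for illustration only (not MLPerf results).
    hypothetical_pod16_minutes = 12.0
    hypothetical_dgx_minutes = 10.0
    advantage = perf_per_dollar_advantage(
        hypothetical_pod16_minutes, POD16_PRICE_USD,
        hypothetical_dgx_minutes, DGX_A100_PRICE_USD,
    )
    print(f"Performance-per-dollar advantage at those times: {advantage:.2f}x")
```

At the quoted prices, the POD16 costs just under half as much as the DGX A100, so under this definition it stays ahead on performance per dollar as long as its training time is less than about twice the DGX A100's.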
Graphcore plans to submit MLPerf training runs that go beyond the POD64 setup in future iterations of the benchmark, with systems that are more on par with what Google and Nvidia are submitting, although this is a stretch for the moment. “We want to do it in 128, 256, and although it won’t be next week, 512,” Fyles says. “It’s an investment to show that we are here, that we are in this space and that we compete on benchmarks and applications in the same way as the others.”
Scaled-out systems that still roughly match the DGX A100 or TPU4 on pure performance once results are normalized are nothing to sneeze at. Seeing the software scalability alone of a 512-class system in the wild will be interesting, although we wonder if it will be ready by the next MLPerf training round.
Remember what was said above about finding an angle when you’re at a slight performance disadvantage? Find a unique one. Another proven tactic is to choose which benchmark segments you want to report, which MLPerf submitters are known to do. Graphcore went outside the MLPerf suite entirely today and chose a particular benchmark, EfficientNet, which they believe matters to some of their early customers. Here they were able to show even more dramatic performance-per-dollar advantages over Nvidia’s DGX A100.
“We are delighted to share our outstanding results: a BERT training time of just over 9 minutes and a ResNet-50 training time of 14.5 minutes on our IPU-POD64,” said Fyles. “These are supercomputer levels of AI performance, which puts us well ahead of NVIDIA in performance per dollar.”