As AI goes multimodal and model sizes grow, expanding the storage fabric becomes essential
NVIDIA Spectrum-X accelerates AI storage by up to 48%
Storage performance plays a key role in many stages of the AI lifecycle. Just as the network that connects GPUs is critical to AI application performance, the storage fabric that connects high-speed storage arrays is becoming increasingly important.
NVIDIA announced on the 6th that, together with its storage ecosystem partners, it is extending the NVIDIA Spectrum-X networking platform to data storage fabrics. The company emphasized that Spectrum-X accelerates AI storage by up to 48%.
Spectrum-X accelerates read bandwidth by up to 48% and write bandwidth by up to 41%. This increased bandwidth speeds up the completion of storage-dependent steps in AI workflows, reducing task completion times during training and token-to-token latency during inference.
To meet market demand, NVIDIA and the storage ecosystem are extending the NVIDIA Spectrum-X networking platform into data storage fabrics, leading early Spectrum-X deployments to deliver higher AI performance and faster time to deployment.
Spectrum-X adaptive routing mitigates flow collisions and increases effective bandwidth. As a result, NVIDIA emphasizes, storage performance is higher than with RoCE v2, the Ethernet networking protocol most data centers use for AI compute and storage fabrics.
AI workloads are growing in size and complexity. Models are getting bigger, data is becoming more multimodal, and AI factories often comprise an enormous number of switches, cables, and transceivers, so even a single down link can significantly degrade network performance.
In response, leading storage vendors are working with NVIDIA to integrate and optimize their solutions with Spectrum-X, bringing cutting-edge capabilities to the AI storage fabric.
To optimize Spectrum-X performance, NVIDIA built Israel-1, a generative AI supercomputer. Israel-1 provides a pre-tested, validated blueprint for AI fabrics, simplifying network deployments. It also serves as a test bed for measuring the impact of Spectrum-X on storage workloads, demonstrating the network's effect on storage performance in a real-world supercomputer operating environment.
The Israel-1 team obtained these bandwidth figures by measuring read and write bandwidth while NVIDIA HGX H100 GPU server clients accessed storage. Varying the number of GPU servers from 40 to 800, they saw improvements ranging from 20% to 48% for read bandwidth and from 9% to 41% for write bandwidth.
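To make the percentages concrete, here is a back-of-the-envelope illustration. The baseline bandwidth and data size below are assumed purely for the example, not taken from NVIDIA's tests; it shows only how a 48% read uplift shortens a storage-bound step such as a checkpoint restore.

```python
# Toy calculation: all figures below are assumptions for illustration,
# not measurements from the Israel-1 cluster.

def transfer_seconds(data_gb: float, bandwidth_gb_per_s: float) -> float:
    """Time to move data_gb gigabytes at the given aggregate bandwidth."""
    return data_gb / bandwidth_gb_per_s

baseline = transfer_seconds(4000, 100)         # 4 TB at an assumed 100 GB/s
improved = transfer_seconds(4000, 100 * 1.48)  # same read with a 48% uplift
print(round(baseline), round(improved))  # 40 27
```

The same fraction of time is recovered regardless of the absolute numbers: a 48% bandwidth uplift cuts a purely storage-bound step to about 1/1.48 ≈ 68% of its original duration.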
To see why Spectrum-X makes such a big difference, it helps to look at the impact of storage on AI. AI performance is not determined simply by the time it takes each large language model (LLM) stage to complete.
For example, model training often takes anywhere from several days to several months to complete. Therefore, it is reasonable to save the partially trained model as a checkpoint to storage every few hours during training. This has the advantage that training progress is not lost even if a system outage occurs.
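The checkpointing pattern described above can be sketched as a training loop that persists state at a fixed interval. This is a minimal, framework-free sketch (the interval, file layout, and pickle format are assumptions for illustration, not NVIDIA's or any framework's implementation); a crash loses at most one interval of progress.

```python
# Minimal periodic-checkpointing sketch. The save interval, file path, and
# pickle serialization are assumptions for this example only.
import os
import pickle
import tempfile

def train(total_steps: int, interval: int, ckpt_path: str) -> dict:
    state = {"step": 0, "weights": [0.0]}
    for step in range(1, total_steps + 1):
        state["weights"][0] += 0.1  # stand-in for a real optimizer update
        state["step"] = step
        if step % interval == 0:
            # Write to a temp file, then atomically rename, so the checkpoint
            # on disk is always complete even if the process dies mid-save.
            tmp = ckpt_path + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump(state, f)
            os.replace(tmp, ckpt_path)
    return state

path = os.path.join(tempfile.gettempdir(), "model.ckpt")
final = train(total_steps=10, interval=4, ckpt_path=path)
with open(path, "rb") as f:
    print(pickle.load(f)["step"])  # 8: the last step that was checkpointed
```

After a failure, training would resume from the checkpointed state rather than from step 0, which is exactly why the dump-and-restore traffic matters to the fabric.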
Checkpoint states for models with billions to trillions of parameters can reach several terabytes for today's largest LLMs. Saving or restoring them produces 'elephant flows': massive surges of data that can overwhelm switch buffers and links.
To eliminate elephant flow collisions and mitigate the network traffic generated during checkpointing, adaptive routing is used to dynamically load balance flows on a packet-by-packet basis across the network.
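The difference between per-flow and per-packet balancing can be shown with a toy model. The sketch below is not NVIDIA's routing algorithm; it only contrasts static per-flow hashing, which pins an entire elephant flow to one link, with a greedy per-packet scheme that spreads the same packets across all links.

```python
# Toy link-load model (not NVIDIA's algorithm): each packet is (flow_id, size).

def static_hash(packets, n_links):
    """Per-flow ECMP-style hashing: every packet of a flow takes one link."""
    load = [0] * n_links
    for flow_id, size in packets:
        load[hash(flow_id) % n_links] += size
    return load

def adaptive(packets, n_links):
    """Per-packet balancing: each packet goes to the least-loaded link."""
    load = [0] * n_links
    for _, size in packets:
        load[load.index(min(load))] += size
    return load

# One 8-unit elephant flow split into packets, plus four small mice flows.
packets = [("elephant", 1)] * 8 + [("mouse%d" % i, 1) for i in range(4)]
print(max(static_hash(packets, 4)))  # >= 8: one link carries the elephant
print(max(adaptive(packets, 4)))     # 3: 12 units spread evenly over 4 links
```

With hashing, the hottest link carries the whole elephant flow while others may sit idle; per-packet spraying flattens the load, which is the effective-bandwidth gain described above. The cost is out-of-order arrival, which the receiver must repair.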
With Spectrum-X, the SuperNIC or data processing unit (DPU) in the destination host determines the correct packet order and places the packets in host memory in order, keeping adaptive routing transparent to applications. This raises fabric utilization, increasing effective bandwidth and delivering predictable, consistent results for checkpointing, data fetches, and other storage operations.
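The in-order placement described above can be sketched in software. This is a simplified model, not the SuperNIC/DPU data path: packets sprayed across paths arrive out of order carrying sequence numbers, and the receiver buffers them, releasing a contiguous prefix as soon as it is complete.

```python
# Simplified in-order delivery model (not the actual SuperNIC/DPU logic).
import heapq

def reorder(arrivals):
    """Yield payloads in sequence order as out-of-order packets arrive."""
    pending, next_seq = [], 0
    for seq, payload in arrivals:
        heapq.heappush(pending, (seq, payload))
        # Release every payload whose sequence number is now contiguous.
        while pending and pending[0][0] == next_seq:
            yield heapq.heappop(pending)[1]
            next_seq += 1

out = list(reorder([(2, "c"), (0, "a"), (1, "b"), (3, "d")]))
print(out)  # ['a', 'b', 'c', 'd']
```

Doing this in NIC/DPU hardware rather than host software is what keeps the reordering invisible to the application while per-packet spraying runs underneath.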
Spectrum-X is complemented by NVIDIA SDKs, libraries, and software products, including NVIDIA Air, Cumulus Linux, DOCA, NetQ, and GPUDirect Storage.