NVIDIA Fixes Blackwell: A Swift Response to the GPU Issue
12:57, 24.10.2024
NVIDIA CEO Jensen Huang acknowledged a design flaw in the Blackwell series GPUs that delayed the supply of AI chips. The flaw was a functional defect that resulted in a low yield of working chips. According to Huang, the fault lay entirely with NVIDIA and not with its manufacturing partner TSMC, as some sources had suggested. He emphasized that TSMC did not cause the problem and played an active role in helping to fix it.
Chip Improvements and TSMC’s Role
The issue was resolved by modifying the upper metal layers and the bumps of the GPU silicon, which raised the yield of working chips. The fix required significant effort, because seven different types of chips, all designed from scratch, had to be brought into production at the same time. The main challenges were associated with the CoWoS-L packaging technology, which uses an RDL interposer with LSI (local silicon interconnect) bridges to link the GPU chiplets. Problems arose because the components expand differently when heated, which deformed the system. Such fixes typically take around 10 cycles, but NVIDIA and TSMC managed to resolve the issue in record time.
Mass Production of the Updated Chips
The updated Blackwell B100 and B200 GPUs are set to enter mass production by the end of October 2024, with shipments expected to begin in early 2025. Even as production of the improved chips ramps up, NVIDIA still anticipates some shortage of high-performance GPUs in 2024, particularly for major cloud providers such as AWS, Google, and Microsoft.