Since ChatGPT's explosive rise, new AI large models have been appearing one after another. Amid this "war of a hundred models," the American chip company Nvidia has profited handsomely from the outstanding performance of its GPUs in large-model computing.
However, a recent move by Apple has taken some of the heat out of Nvidia's run.
01
For AI model training, Apple chooses TPU over GPU
Nvidia has long been the leader in AI computing infrastructure. In the AI hardware market, and especially in AI training, its share exceeds 80%, and Nvidia GPUs have been the preferred computing solution for AI and machine learning at tech giants such as Amazon, Microsoft, Meta, and OpenAI.
Precisely because of this dominance, Nvidia faces constant challenges from across the industry: its competitors include established players developing their own GPUs as well as pioneers exploring novel architectures. With its distinctive advantages, Google's TPU has become an opponent Nvidia cannot afford to ignore.
On July 30, Apple released a research paper introducing the two models behind Apple Intelligence: AFM-on-device (AFM stands for Apple Foundation Model), a language model with about 3 billion parameters designed to run on device, and AFM-server, a larger language model that runs on Apple's servers.
Apple said in the paper that, to train its AI models, it used two types of Google tensor processing units (TPUs) assembled into large chip clusters. To build AFM-on-device, the AI model that runs on the iPhone and other devices, Apple used 2,048 TPUv5p chips; for its server-side model, AFM-server, it deployed 8,192 TPUv4 processors.
Apple's strategic choice to pass over Nvidia GPUs in favor of Google's TPUs landed like a bombshell in the tech industry. Nvidia's share price fell more than 7% that day, its biggest drop in three months, wiping out $193 billion in market value.
Industry insiders say Apple's decision suggests that some large technology companies are looking for alternatives to Nvidia's graphics processing units for AI training.
02
TPU vs. GPU: which is better suited to large models?
Before discussing which of the two is better suited to large models, we need a basic understanding of each.
Comparison of TPU and GPU
TPU, short for Tensor Processing Unit, is a specialized chip designed by Google to accelerate machine learning workloads, used mainly for training and inference of deep learning models. Notably, the TPU is an ASIC (application-specific integrated circuit), a chip custom-built for a particular purpose.
The GPU is more familiar: it is a processor originally designed for graphics rendering that was later widely adopted for parallel computing and deep learning. Its strong parallel processing capability, with suitable optimization, makes it well suited to highly parallel tasks such as deep learning and scientific computing.
In other words, the two chips were designed with different goals from the outset.
Compared with traditional CPUs, the GPU's parallel computing capability makes it particularly well suited to processing large datasets and complex computational tasks. That is why, over the past few years of explosive growth in large AI models, GPUs became the default hardware for AI training.
However, as large AI models keep developing, computational workloads are growing in size and complexity at an exponential rate, placing new demands on compute and resources. The GPU's relatively low compute utilization and high energy consumption in AI workloads, together with the high prices and tight supply of Nvidia's GPU products, have drawn more attention to the TPU architecture, which was designed for deep learning and machine learning from the start. The GPU's dominant position in this field is beginning to face challenges.
Reportedly, Google began developing chips dedicated to AI machine learning algorithms internally as early as 2013, and the in-house chip, named the TPU, was not officially announced until 2016. AlphaGo, which defeated Lee Sedol in March 2016 and Ke Jie in May 2017, was trained on Google's TPU chips.
To argue that the TPU is better suited to training large AI models, we have to spell out exactly what it can do.
Why is the TPU suited to large-model training?
First, the TPU uses multi-dimensional compute units to raise computational efficiency. Where CPUs rely on scalar units and GPUs on vector units, the TPU performs its work on two-dimensional (or higher-dimensional) compute arrays. By unrolling convolution loops it maximizes data reuse, cutting data-movement costs and improving acceleration efficiency (see the code sketch below).
Second, the TPU features faster data transfer and a leaner control unit. The memory-wall problem of the von Neumann architecture is especially acute in deep learning, so the TPU takes a more aggressive approach to data-transfer design, and its smaller control unit leaves more die area for on-chip memory and compute units.
Lastly, the TPU is purpose-built to accelerate AI/ML computation. With a focused positioning, a simple architecture, single-threaded control, and a custom instruction set, the TPU is highly efficient at deep learning operations and easy to scale, making it well suited to large-scale AI training.
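To make the data-reuse point above a bit more concrete, here is a minimal, hypothetical sketch in JAX, the framework most commonly paired with TPUs. It simply jit-compiles a bfloat16 matrix multiplication; on a TPU, the XLA compiler lowers this kind of matmul onto the chip's two-dimensional matrix units, which is where the data reuse described above takes place. The shapes and function name are purely illustrative and are not taken from Apple's or Google's actual training code.

```python
# Minimal, illustrative sketch (not Apple's or Google's actual code):
# a jit-compiled bfloat16 matrix multiplication, the kind of operation
# that XLA maps onto a TPU's two-dimensional matrix units.
import jax
import jax.numpy as jnp

@jax.jit
def dense_layer(x, w):
    # A single matmul; on a TPU each loaded tile of x and w is reused
    # across many multiply-accumulates inside the 2-D compute array.
    return jnp.dot(x, w)

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (1024, 4096), dtype=jnp.bfloat16)
w = jax.random.normal(key, (4096, 4096), dtype=jnp.bfloat16)

y = dense_layer(x, w)
print(y.shape, y.dtype)  # (1024, 4096) bfloat16
```

The same code runs unchanged on CPU or GPU; the hardware-specific mapping is left entirely to the compiler.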
Reportedly, Google's TPUv4 consumes 1.3-1.9 times less power than Nvidia's A100, and in workloads such as BERT and ResNet its efficiency is 1.2-1.9 times higher than the A100's. Its TPUv5 and TPU Trillium products are said to raise computational performance a further 2x and nearly 10x, respectively, over TPUv4. Google's TPU products therefore hold an edge over Nvidia's in cost and power consumption.
At the Google I/O 2024 developer conference in May this year, Alphabet CEO Sundar Pichai announced Trillium, the sixth generation of its data-center AI chip, the Tensor Processing Unit (TPU), saying the product is almost five times as fast as its predecessor and is expected to launch later this year.
Google says the sixth-generation Trillium chip delivers 4.7 times the computing performance of the TPU v5e and is 67% more energy efficient. The chip is designed to power the technology that generates text and other content from large models, and Google says it will be available to its cloud customers by the end of this year.
Google's engineers achieved the additional performance by increasing high-bandwidth memory capacity and overall bandwidth; AI models require large amounts of advanced memory, which has long been a bottleneck to further performance gains.
It is worth noting that Google does not sell its TPU chips as a standalone product, but instead provides TPU-based computing services to external customers through Google Cloud Platform (GCP).
There is a certain shrewdness to this approach: selling hardware directly means high costs and complex supply-chain management, whereas offering TPUs through cloud services lets Google simplify installation, deployment, and management, reducing uncertainty and overhead. It also streamlines the sales process and removes the need for a separate hardware sales team. Moreover, Google is locked in fierce competition with OpenAI in generative AI; if it started selling TPUs outright, it would be taking on two powerful opponents at once, Nvidia and OpenAI, which may not be the wisest strategy right now.
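As a rough, hypothetical illustration of what "TPUs through Google Cloud" looks like from the customer side, the JAX sketch below simply asks the runtime which accelerators the VM exposes and runs a small jit-compiled computation on them. On a Cloud TPU VM the listed devices would be TPU cores; the identical code falls back to GPU or CPU elsewhere. None of this reflects Apple's actual setup.

```python
# Hypothetical sketch: the same JAX code runs on whatever accelerator the
# cloud VM exposes (TPU cores on a Cloud TPU VM, otherwise GPU or CPU).
import jax
import jax.numpy as jnp

print("Backend:", jax.default_backend())  # e.g. "tpu", "gpu", or "cpu"
print("Devices:", jax.devices())          # the attached accelerator cores

@jax.jit
def step(x):
    # Stand-in for one unit of work: a matmul followed by a nonlinearity.
    return jax.nn.relu(x @ x.T)

x = jnp.ones((512, 512), dtype=jnp.bfloat16)
print(step(x).shape)  # (512, 512)
```

The point of the cloud model described above is that the hardware behind such a call is provisioned and managed by Google rather than purchased and installed by the customer.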
At this point, some readers may ask: if the TPU has such performance advantages, will it replace the GPU in the near future?
03
Is it too early to talk about replacing GPUs now?
The question is not that simple.
It would be one-sided to talk only about the TPU's strengths and ignore the GPU's. So we also need to look at what makes GPUs suitable for today's large-model training compared with TPUs.
The TPU's advantages lie in its outstanding energy efficiency and compute per unit of cost. As an ASIC, however, it also carries a clear drawback: the cost of trial and error in developing such a purpose-built chip is high.
There is also the question of ecosystem maturity. After years of development, GPUs have a large, mature ecosystem of software and development tools. Many developers and research institutions have long built and optimized on GPUs, accumulating a wealth of libraries, frameworks, and algorithms. The TPU ecosystem, by contrast, is relatively young, and its available resources and tools are not yet as rich, which can raise the cost of adaptation and optimization for developers.
In terms of versatility, GPUs were originally designed for graphics rendering, but their architecture is highly flexible and adapts to many kinds of computing tasks beyond deep learning, making them more adaptable across diverse application scenarios. TPUs, being purpose-built for machine learning workloads, may not handle other, non-machine-learning computation as effectively.
The GPU market is also fiercely competitive: vendors keep pushing technological innovation and product refreshes, so new architectures and performance gains arrive frequently. TPU development, by contrast, is led mainly by Google, and its pace of iteration may be relatively slower.
Overall, Nvidia and Google pursue different AI chip strategies: Nvidia pushes the performance limits of AI models by providing powerful raw compute and broad developer support, while Google improves the efficiency of large-scale AI model training through an efficient distributed computing architecture. These different paths give each company distinct advantages in its respective application areas.
The reasons why Apple chose Google's TPU may include the following: first, TPU performs well in handling large-scale distributed training tasks, providing efficient, low-latency computing power; second, by using the Google Cloud platform, Apple can reduce hardware costs, flexibly adjust computing resources, and optimize the overall cost of AI development. In addition, Google's AI development ecosystem also provides a wealth of tools and support, enabling Apple to develop and deploy its AI models more efficiently.
Apple's example demonstrates what the TPU can do in large-model training. Compared with Nvidia, however, the TPU is still used in relatively few large-model deployments, and most large-model companies, including giants such as OpenAI, Tesla, and ByteDance, still run their main AI data centers primarily on Nvidia GPUs.
So it may be too early to say Google's TPU can defeat Nvidia's GPU, but the TPU is undoubtedly a formidable challenger.
04
The TPU is not the GPU's only challenger
A Chinese company, Zhonghao Xinying, has also bet on TPU chips. Its founder, Yang Gong Yifan, was a core chip R&D engineer at Google and was deeply involved in the design and development of Google's TPU 2/3/4. In his view, the TPU is an advantageous architecture born for large AI models.
In 2023, Zhonghao Xinying's "Shunian" chip was officially unveiled. With a high-speed interconnect that links up to 1,024 chips, "Shunian" forms the basis of a large-scale computing cluster called "Tai Ze," whose system-level performance is claimed to be dozens of times that of traditional GPU clusters, providing computing power for the training and inference of AIGC large models with more than 100 billion parameters. The achievement not only reflects Zhonghao Xinying's accumulated expertise in AI computing but also wins domestic chips a place on the international stage.
Meanwhile, in the current AI gold rush, with Nvidia's H100 both scarce and expensive, companies of every size are seeking alternative AI chips to replace Nvidia's, including firms following the traditional GPU route as well as firms exploring new architectures.
The challengers facing the GPU go well beyond the TPU.
On the GPU path itself, Nvidia's arch-rival is AMD. In January this year, researchers trained a GPT-3.5-scale large model using about 8% of the GPUs on the Frontier supercomputer, a cluster built entirely on AMD hardware with 37,888 MI250X GPUs and 9,472 Epyc 7A53 CPUs. The work also cracked the hard problems of advanced distributed training on AMD hardware, proving that training large models on the AMD platform is feasible.
At the same time, the CUDA moat is gradually being breached. In July this year, the British company Spectral Compute launched a solution that natively compiles CUDA source code for AMD GPUs, greatly improving AMD GPUs' compatibility with CUDA code.
Intel's Gaudi 3, likewise, took direct aim at Nvidia's H100 at launch. Intel introduced Gaudi 3 in April this year for deep learning and large generative AI models, claiming four times the BF16 floating-point AI compute of the previous generation, 1.5 times the memory bandwidth, and double the networking bandwidth for large-scale system scale-out. Against Nvidia's H100, Intel expects Gaudi 3 to cut training time by an average of 50% on Meta's Llama 2 models with 7B and 13B parameters and on OpenAI's 175B-parameter GPT-3 model.
In addition, on Llama models with 7B and 70B parameters and the open-source 180B-parameter Falcon model, Gaudi 3's inference throughput is expected to average 50% higher than the H100's and its inference efficiency about 40% better, with a larger advantage on longer input and output sequences. On the same models, Gaudi 3's inference speed is also about 30% faster than Nvidia's H200.
Intel said Gaudi 3 would reach customers in the third quarter of this year, after shipping to OEMs including Dell, HPE, Lenovo, and Supermicro in the second quarter, but it did not disclose Gaudi 3's price range.
In November last year, at its Ignite conference, Microsoft unveiled its first in-house AI chip, Azure Maia 100, along with Azure Cobalt, a chip for cloud software services. Both will be manufactured by TSMC on a 5nm process.
Reportedly, Nvidia's high-end products can sell for $30,000 to $40,000 apiece, and ChatGPT is estimated to need about 10,000 such chips, a huge cost for AI companies. Tech giants with heavy demand for AI chips are therefore actively seeking alternative sources of supply, and Microsoft's decision to build its own is aimed at boosting the performance of generative AI products such as ChatGPT while cutting costs.
Cobalt is a general-purpose chip built on the Arm architecture with 128 cores. Maia 100 is an application-specific integrated circuit (ASIC) designed for Azure cloud services and AI workloads, used for training and inference in the cloud, with 105 billion transistors. Both chips will be deployed in Microsoft Azure data centers to support services such as OpenAI and Copilot.
Rani Borkar, the vice president who leads Azure's chip division, said Microsoft has begun testing Maia 100 with Bing and Office AI products, and that OpenAI, Microsoft's key AI partner and the developer of ChatGPT, is testing it as well. Market commentators see the timing of Microsoft's chip push as apt, coinciding with the takeoff of the large language models being cultivated by Microsoft, OpenAI, and others.
However, Microsoft itself does not expect its AI chips to broadly replace Nvidia's products. Some analysts suggest that if the effort succeeds, it could nonetheless give Microsoft more leverage in future negotiations with Nvidia.
Beyond the chip giants, startups are making their own dents: Groq has introduced its LPU (Language Processing Unit), Cerebras has launched the Wafer Scale Engine 3, and Etched has released Sohu, among others.
Today Nvidia controls roughly 80% of the AI data-center chip market, and most of the remaining 20% belongs to various generations of Google's Tensor Processing Unit (TPU). Will the TPU's share keep climbing, and by how much? Will yet another AI chip architecture carve the market into three? The answers should unfold over the next few years.