ChinAI #141: The PanGu Origin Story

Notes from an informative Zhihu Thread on PanGu

Greetings from a world where…

When asked to name a prominent Asian American, 42 percent of American adults answered “don’t know” (most common answer), followed by Jackie Chan (11 percent), and Bruce Lee (9 percent).

…Please please subscribe here to support ChinAI under a Guardian/Wikipedia-style tipping model (everyone gets the same content but those who can pay support access for all AND compensation for awesome ChinAI contributors). As always, the searchable archive of all past issues is here.

Feature Translation: Zhihu Thread on PANGU-α

*This week’s newsletter includes more text because it’s essentially “noteblogging” — making my note-taking on PanGu available to the public. It will be longer and include more technical language than usual. Whenever I venture into topics that require more domain expertise, I rely on readers with more technical chops to give feedback and share critiques, so please let me know what I get wrong and what’s missing.

Earlier this month, Jack Clark’s ImportAI newsletter flagged the Chinese-language equivalent of GPT-3:

A team of Chinese researchers have created 'PanGu', a large-scale pre-trained language model with around ~200 billion parameters, making it equivalent to GPT3 (175 billion parameters) in terms of parameter complexity. PanGu is trained on 1.1TB of Chinese text (versus 570GB of text for GPT-3), though in the paper they train the 200B model for a lot less time (on way fewer tokens) than OpenAI did for GPT-3. PanGu is the second GPT-3-esque model to come out of China, following the Chinese Pre-trained Language Model (CPM, Import AI 226), which was trained on 100GB of text and was only a few billion parameters, compared to a couple of hundred!

Context: Last month, PanGu was released at Huawei’s HDC.Cloud event, its flagship event for IT developers. The image below shows Huawei Cloud CEO Yu Chengdong presenting PanGu’s excellent performance on various Chinese Language Understanding Evaluation (CLUE) benchmarks. The name PanGu refers to the creator of the universe in Daoist legends.

Crucially, PanGu was a joint effort by researchers from both Huawei and Recurrent AI (循环智能), a provider of AI enterprise services. I was curious about PanGu. A simple search led me to a Zhihu thread titled: “What do you think of the PanGu model released by Huawei on April 25?” Zhihu, known as China’s Quora, is the country’s largest Q&A forum. The initial post linked to an article by Recurrent AI on PanGu. Plus, there were 40 responses to the thread, many of which were very insightful.

Key Takeaways from the article linked in the initial Zhihu post:

In the article, Recurrent AI claims that PanGu improves on GPT-3 in three aspects. The key word here is “claims” as I wasn’t able to trace many of these points to the results reported in the PanGu article itself (arxiv download link).

  • First, it supposedly “surpasses GPT-3 in few-shot learning tasks, addressing issues the latter faces in dealing with complex commercial scenarios with few (training data) samples. For example, in scenarios involving customer voice analysis and analysis of employees’ ability to carry out tasks, when the Pangu NLP large model is used to produce semantic analysis, the sample size required to obtain the target result is only one-tenth of the GPT-3 model. That is, AI’s production efficiency can be increased tenfold.”

  • Second, the Pangu team added prompt-based tasks in the pre-training phase, which greatly reduced the difficulty of fine-tuning. Fine-tuning previous large models for different industry scenarios has been difficult. One example from the article: “In a scenario about finding more target customers to increase the conversion rate, in which companies use communication content to determine customer purchase intentions, we found that the PanGu model can increase the order conversion rate by 27% compared to GPT-3.”

  • I’m not completely sure what Recurrent AI is arguing on the third innovation that PanGu makes on top of GPT-3. They write, “PanGu can recognize intent (of customers?) through few-shot learning, and transform them into queries of knowledge bases and databases, which addresses the issue that large models are difficult to integrate with industry knowledge and data in the past.” My best guess is that they are arguing PanGu can adapt better to industry-specific vocabularies and communications.
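For readers less familiar with the term, here is a minimal sketch of what “few-shot” means in practice: a handful of labeled examples are placed in the prompt, and the model is asked to complete the label for a new input. Everything in it is an assumption for illustration. The `build_few_shot_prompt` helper, the “Message/Intent” template, and the intent labels are hypothetical and are not taken from the PanGu paper or the Recurrent AI article.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Format labeled examples, then the unlabeled query, as one prompt string.

    The "Message/Intent" template is a hypothetical format, not PanGu's.
    """
    # One block per labeled example.
    blocks = [f"Message: {text}\nIntent: {label}" for text, label in examples]
    # The query is left unlabeled so the model's completion is the prediction.
    blocks.append(f"Message: {query}\nIntent:")
    return "\n\n".join(blocks)


# Hypothetical customer messages and intent labels.
examples = [
    ("Can you send me the price list?", "purchase_interest"),
    ("Please stop calling me.", "no_interest"),
]
prompt = build_few_shot_prompt(examples, "Do you offer a discount for bulk orders?")
print(prompt)
```

The claim about needing one-tenth the samples would correspond to getting the same accuracy with a much shorter `examples` list, though the article does not say how that comparison was measured.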

Let’s dive into some of the top-rated Zhihu responses. First, the top-rated response (662 likes) is by Xuefeng Jin, who is the chief architect of Huawei’s MindSpore, an open-source deep learning framework that Huawei hopes will rival Google’s TensorFlow and Facebook’s PyTorch. PanGu was developed under the MindSpore framework.

Key Takeaways from Jin’s post on the thread:

  • Right at the beginning of his post, Jin clarifies that Huawei actually released two large NLP models at the HDC conference (both named after PanGu). The other one was an encoder-decoder Transformer. Here’s the key point: the training of both 100-billion parameter scale models was a collaboration between various Huawei divisions and Peng Cheng Lab (PCL), which provided computing power support. Here’s what I wrote about PCL back in Nov 2019 (ChinAI #73):

PCL is the more interesting body to me: It’s a Provincial Research Lab opened in 2018 by the Guangdong Provincial Government and funded and managed by the Shenzhen Municipal Government. In partnership with the Harbin Institute of Technology (Shenzhen), PCL also cooperates with local industry and research institutions such as the Shenzhen branches of Peking and Tsinghua Universities, the National Supercomputing Center in Shenzhen, Huawei, Tencent, and ZTE.

PCL’s CloudBrain (p. 8-10 of full translation) brings real value-add through access to compute: CloudBrain 1 is a large-scale cluster system with 100 Petaflops of computing power, including NVIDIA GPUs, Huawei GPUs, and Cambrian AI chips. A machine of 1000 Petaflops will probably be built next year, which can be used by universities, research institutes, and SMEs for training models. The goal of 1000 Petaflops (an exaflop) is generally considered a big milestone for compute over the next few years, which the DOE is heading towards.

  • However, it’s not just about raw quantity of computing power. As Jin writes, “Even if you give us sufficient computing power, the training of super large models is still extremely complicated and far more difficult than imagined. For general algorithm engineers, when it comes to certain tasks, models with hundreds of millions of parameters are already considered large, but we don’t experience any difficulty with training them . . . but if the model continues to grow to the tens of billions, hundreds of billions, or even trillions, the complexity of parallelization and optimization strategies will sharply rise.” At that level, you have to partition the model across individual processors (e.g., the PanGu model uses 2048 Ascend AI processors). Jin emphasizes the importance of well-built “infrastructure,” or the overall architectural design of the training process, that enables collaboration between software and hardware, e.g. through things like memory-efficient optimizers.

  • He concludes with his expectation of future trends: “In order to gain more knowledge from pre-training, models such as GPT-3 and Pangu will become larger and larger. After all, we have not seen the limit of the pre-training benefits for large models. At that time, this type of model will have greater infrastructure requirements, and data parallelism and optimization strategies will be more complex . . . in the future, we need more researchers to devote themselves to the research of general intelligence and large-scale distributed computing.”
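Jin’s point about having to partition the model can be made concrete with some back-of-the-envelope arithmetic. The sketch below is my own illustration, not something from Jin’s post. It assumes mixed-precision training with the Adam optimizer, which is commonly accounted as 16 bytes of model state per parameter, and it assumes 32 GB of memory per accelerator. The function names and both constants are assumptions, so please correct me if the per-device figure is off.

```python
# Rough memory estimate for the model state of a dense model trained with
# Adam in mixed precision. All constants are assumptions for illustration.

BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam momentum": 4,
    "fp32 Adam variance": 4,
}  # 16 bytes per parameter in total


def model_state_bytes(n_params: int) -> int:
    """Bytes needed for weights, gradients, and optimizer state."""
    return n_params * sum(BYTES_PER_PARAM.values())


def min_devices(n_params: int, device_mem_gb: float) -> int:
    """Fewest devices whose combined memory holds the model state alone."""
    device_bytes = int(device_mem_gb * 1e9)
    return -(-model_state_bytes(n_params) // device_bytes)  # ceiling division


if __name__ == "__main__":
    n_params = 200_000_000_000  # ~200 billion, the figure quoted for PanGu
    device_mem_gb = 32          # assumed per-device memory
    print(f"model state: {model_state_bytes(n_params) / 1e12:.1f} TB")
    print(f"devices needed for model state alone: {min_devices(n_params, device_mem_gb)}")
```

Under these assumptions the model state alone is about 3.2 TB, far more than any single device can hold. This lower bound ignores activations and communication buffers and says nothing about training time, so it does not explain why the PanGu team used 2048 processors. It only shows why a model at this scale has to be split up in the first place, and why memory-efficient optimizers matter.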

Other interesting reactions on the Zhihu thread:

  • In a post upvoted 114 times, Cobot, who holds a PhD in Philosophy from Nanyang Technological University in Singapore, gives his thoughts:

  • Cobot argues that AI is moving into a new era of centralized competition among giants who can afford the computing power resources to train models like PanGu. These points remind me of Jack Clark’s writing on AI entering its industrial era. Cobot writes, “Now, the entire industry has entered the AI 2.0 era. In the previous 1.0 era, the main competition was over algorithms. It was so much fun watching the four little AI dragons (Sensetime, Megvii, Cloudwalk, Yitu) compete over various rankings and point systems.”

  • Cobot also questions the practical value of systems like GPT-3. “Even with the strength of OpenAI, it seems that no suitable profit points have been found (with GPT-3). After a while, the public demonstrations and showboat activities ended. It’s still an open question regarding the cost-effectiveness of a model that costs several orders of magnitude more than an ordinary model and brings only a few percentage points of accuracy improvement. Of course, the industry is still developing, so you don't need to pay too much attention to optimization and cost issues in the early stage. Finally, the security and privacy issues in large models are more difficult to clarify. How to avoid the issue of GPT-3 directly spitting out user account passwords is still a matter of concern.”

  • Another reply to this thread (115 upvotes) is by Yang Jun, rated by Zhihu as giving excellent answers on the ML topic, who also works on the development of large-scale ML systems. I think Jun’s thoughts on different engineering implementations for training large models contain a lot of insights, but they’re way too in the weeds for me to fully digest. Those who are interested in Microsoft’s ZeRO-Infinity work, Megatron-LM vs. Mesh-TF, etc. can try to comprehend the Google translation of Jun’s post here.

ChinAI Links (Four to Forward)

Must-read: China’s “Involuted” Generation

Yi-Ling Liu on neijuan, the Chinese term for involution, a concept anthropologist Xiang Biao describes as “the experience of being locked in competition that one ultimately knows is meaningless.” Liu writes:

“Last December, a twenty-two-year-old employee surnamed Zhang at the e-commerce company Pinduoduo collapsed on the ground in the middle of the night, on her way home from work, and died six hours later, apparently from exhaustion and overwork. Two weeks later, another Pinduoduo employee leaped to his death, during a visit to his parents, reportedly after he was fired for criticizing the company’s work culture. In response to an outpouring of anger and grievance, the company appeared to dismiss Zhang’s death, posting a comment on its official social-media account: “Who hasn’t exchanged their life for money?”

Involution came to the fore once again, as online commentators tried to make sense of the deaths of two young people through its lens. I compulsively read posts on WeChat with titles such as “A Pinduoduo employee has died, why did we descend into an era of involution?” and “Workplace involution under the Pinduoduo model.” In contrast to exploitation or suppression or even alienation, involution is presented as part of the natural order of things—like bad weather. You can’t point fingers at an abstraction or rally against a fusty term from an anthropology text.”

Must-read: Nothing Breaks like A.I. Heart

In The Pudding, Pamela Mishkin writes a beautiful essay about AI and emotional intelligence, with an assist from GPT-3. Throughout the essay, there are places where GPT-3-generated text is highlighted and the reader can click to toggle through alternative generated text options. You can also pick between different options that shape how the narrative path of the story unfolds. Read all the way through to the end, including the methodology and author’s note.

Should-read: China’s Public Diplomacy Operations

For the Oxford Programme on Democracy & Technology, Marcel Schliebs, Hannah Bailey, Jonathan Bright, and Philip N. Howard expose the “inauthentic amplification of PRC Diplomats” on Facebook and Twitter, based on their review of social media activity by PRC diplomats and state-backed media outlets.

Should-read: ImportAI #247 — China makes its own GPT3

Have recommended Jack Clark’s ImportAI newsletter many times in the past but want to re-plug it this week. I was first inspired to look deeper into PanGu by this issue of ImportAI. It’s an essential resource for staying up-to-date on the latest technical trends in AI.

Thank you for reading and engaging.

These are Jeff Ding's (sometimes) weekly translations of Chinese-language musings on AI and related topics. Jeff is a PhD candidate in International Relations at the University of Oxford and a Predoctoral Fellow at Stanford’s Center for International Security and Cooperation, sponsored by Stanford’s Institute for Human-Centered Artificial Intelligence.

Check out the archive of all past issues here & please subscribe here to support ChinAI under a Guardian/Wikipedia-style tipping model (everyone gets the same content but those who can pay for a subscription will support access for all).

Any suggestions or feedback? Let me know at or on Twitter at @jjding99