Greetings from a world where…
we all make sure to drag our favorite newsletters from the promotions tab to the primary tab, right?
…As always, the searchable archive of all past issues is here. Please please subscribe here to support ChinAI under a Guardian/Wikipedia-style tipping model (everyone gets the same content but those who can pay support access for all AND compensation for awesome ChinAI contributors).
Feature Translation: The latest SuperCLUE ranking of Chinese and international large models
Context: Back in ChinAI #224, we highlighted the SuperCLUE benchmark, released in May, which aimed to test large language models from Chinese and international labs along three main dimensions: 1) foundational capabilities such as dialogue and coding; 2) specialized and academic capabilities like physics knowledge; and 3) capabilities in Chinese-language particularities such as knowledge of classical Chinese literature and Chinese idioms. Last week, the SuperCLUE team released its July rankings (link to original Chinese), updated with 3700 confidential test questions and 20 total participating models.
Key Takeaways: Among models from Chinese labs, Baidu’s ErnieBot rises to the top
When analyzing the previous SuperCLUE ranking, I expressed confusion as to why Baidu’s ErnieBot did not make it: “Based on my co-authored GovAI report on China’s large language model landscape, I see Baidu’s Ernie models as the strongest Chinese LLMs, so it’s puzzling to not see them on this list.” Well, two months later, we see Baidu’s ErnieBot (v2.2.0) at the top of the SuperCLUE list, behind only OpenAI’s GPT-4 and humans.
Baidu’s ErnieBot even surpasses Anthropic’s Claude-2 on the overall SuperCLUE score, though this gap was primarily a product of its superior performance on Chinese-language particularities.
A few notes on interpreting the various lists in the full translation, such as the overall SuperCLUE ranking above:
1st column is ranking; 2nd is model name; 3rd is lab/group; then come the scores (total plus the three specific categories); finally, whether the model is a proprietary service (专有服务) or open source and available for commercial use (开源可商用)
Note: Representative non-open-source foreign models (GPT-4/Claude/GPT-3.5) appear on the list but are not ranked numerically, nor do they receive the gold, silver, and bronze medal symbols. If you have further questions about interpreting the lists, feel free to comment in the Google doc.
Other important changes in the July update:
Two new entrants from Chinese labs — Baichuan Intelligence’s Baichuan-13B-Chat; Shanghai Artificial Intelligence Laboratory and SenseTime’s internlm-chat-7b — showcase good but not great results. There’s still a significant gap in overall SuperCLUE score between Baichuan and the leading Chinese models, let alone GPT-4.
Recall: Baichuan Intelligence is the startup that Wang Xiaochuan, founder and former CEO of Sogou, launched after ChatGPT's debut in November 2022. Wang declared that China needed its own OpenAI.
FULL TRANSLATION: The latest July ranking of large models! 3700 confidential test questions and 20 large models participated in the evaluation|SuperCLUE
ChinAI Links (Four to Forward)
Should-watch: The U.S.-China AI Race: Where do both countries stand?
Thanks to the National Committee on U.S.-China Relations for inviting me to talk about the AI race between the U.S. and China. They asked me to discuss the role of AI in U.S.-China technological competition, the current stage of the competition, and Chinese views on the risks of AI.
Should-watch: ‘A certain danger lurks there’
How did the inventor of the first chatbot turn against AI? A fascinating, almost thrilling, long read by Ben Tarnoff for The Guardian.
Two other articles I considered translating this week:
科技新知 provides an overview of different benchmarking efforts in large language models, including SuperCLUE. The piece explores how benchmarks have gradually become a key part of the overall AI ecosystem.
隐私护卫队, a portal connected to the Nandu Personal Information Protection Research Center, covers the controversy over Miaoya Xiangji (妙鸭相机), an app that recently went viral but required users to upload 21 photos of their face to create an exclusive digital avatar.
Thank you for reading and engaging.
These are Jeff Ding's (sometimes) weekly translations of Chinese-language musings on AI and related topics. Jeff is an Assistant Professor of Political Science at George Washington University.
Check out the archive of all past issues here & please subscribe here to support ChinAI under a Guardian/Wikipedia-style tipping model (everyone gets the same content but those who can pay for a subscription will support access for all).
Any suggestions or feedback? Let me know at chinainewsletter@gmail.com or on Twitter at @jjding99