ChinAI #224: Comparing Chinese large language models with SuperCLUE

May 22, 2023

Greetings from a world where…

we get high on SuperCLUE, not superglue

…As always, the searchable archive of all past issues is here. Please please subscribe here to support ChinAI under a Guardian/Wikipedia-style tipping model (everyone gets the same content but those who can pay support access for all AND compensation for awesome ChinAI contributors).

Feature Translation: SuperCLUE Benchmark Rankings

Context: From the team that brought you CLUE, a Chinese language understanding evaluation benchmark that tests the capabilities of language models, comes SuperCLUE, a more comprehensive benchmark released on May 9. The SuperCLUE team recently tested 10 models from Chinese and international labs along three different dimensions: 1) basic capabilities such as logical reasoning and coding; 2) specialized and professional capabilities like physics knowledge; and 3) capabilities in Chinese-language particularities such as idioms and knowledge of classic literature.

Each of these three dimensions contains different sub-categories. Here’s an example SuperCLUE test question in the Chinese idiom sub-category:

Choose one of the following sentences where the idiom is used incorrectly
A. 这个项目时间紧任务重，大家都在马不停蹄地奔波劳碌 [This project has a tight schedule and heavy tasks, and everyone is working non-stop].
B. 他常常口是心非，让人难以相信他说的话 [He often says one thing but means another, making it hard to believe what he says].
C. 两人是同学三年，一直保持着良好的关系，相互尊重、相敬如宾 [The two have been classmates for three years, and have maintained a good relationship with each other, and treat each other with the respect due to a guest].
D. 当地突发大火，整个村庄都鸡犬不宁，局势十分危急 [A fire broke out in the local area, stirring the whole village into pandemonium, and the situation was very critical].

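*The idiom 相敬如宾 is used incorrectly in C: it conventionally describes the mutual respect between a husband and wife, so it does not fit a relationship between classmates.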
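For readers who want to see how a question like this could be scored mechanically, here is a minimal Python sketch. It is not the SuperCLUE team's code: the `MultipleChoiceItem` class and the `accuracy` function are hypothetical names, and exact-match scoring on the option letter is an assumption about how such an item might be graded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MultipleChoiceItem:
    """Hypothetical container for one multiple-choice question."""
    prompt: str
    options: dict  # option letter -> option text
    answer: str    # letter of the correct option


def accuracy(items, predictions):
    """Return the percentage of items whose predicted letter matches the key.

    Assumes exact-match grading on the option letter, ignoring case and
    surrounding whitespace.
    """
    if not items or len(items) != len(predictions):
        raise ValueError("need one prediction per item, and at least one item")
    correct = sum(
        pred.strip().upper() == item.answer
        for item, pred in zip(items, predictions)
    )
    return 100.0 * correct / len(items)


# The idiom question above, abbreviated to the idiom tested in each option.
idiom_item = MultipleChoiceItem(
    prompt="Choose one of the following sentences where the idiom is used incorrectly",
    options={"A": "马不停蹄", "B": "口是心非", "C": "相敬如宾", "D": "鸡犬不宁"},
    answer="C",
)

print(accuracy([idiom_item], ["C"]))
```

A model that answers "C" gets full credit on this item; any other letter scores zero.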

Key Takeaways: There is still a large gap between international models like GPT-4 and Chinese large language models, according to the SuperCLUE leaderboard. GPT-4’s total score on SuperCLUE (76.67) is about 23 points higher than that of the top-performing Chinese model, iFlytek’s SparkDesk model [星火认知大模型], which sits at 53.58 points (see ranking list in image below).
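As a quick check on the size of that gap, the two total scores quoted above can be compared directly. This is only a back-of-the-envelope sketch in Python: the dictionary keys and variable names are my own, and the relative figure is derived from the two quoted scores rather than reported by SuperCLUE.

```python
# Total SuperCLUE scores as quoted in the text above.
scores = {"GPT-4": 76.67, "SparkDesk": 53.58}

# Absolute gap in leaderboard points.
gap = round(scores["GPT-4"] - scores["SparkDesk"], 2)

# Gap expressed relative to SparkDesk's score, in percent.
relative_gap = round(100 * gap / scores["SparkDesk"], 1)

print(f"Absolute gap: {gap} points")
print(f"Relative gap: {relative_gap}% of SparkDesk's score")
```

The absolute gap works out to 23.09 points, or roughly 43% of SparkDesk's own score.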

Does the “L” in NLP (natural language processing) matter? The SuperCLUE team still thinks there is a great need for models that optimize for performance on Chinese-language tasks.

Their reasoning: Vicuna-13B, an open-source alternative to ChatGPT launched by researchers from U.S. universities, is a pretty good language model, but it ranks toward the bottom on tests of Chinese-language particularities. Models developed by Chinese organizations, or those that have been trained on Chinese-language datasets and tasks, have greatly surpassed Vicuna-13B’s performance on this dimension.

Here’s what I still can’t wrap my head around, though: ChatGPT performs so well in Chinese, despite being trained almost exclusively on English-language data. Jan Leike, an OpenAI researcher, doesn’t know why either.

Caveats

I think things like SuperCLUE are indicators of a healthy ecosystem for diffusing information about large language models. Still, benchmarks are not without drawbacks. Some AI benchmarks, including the English-language SuperGLUE benchmark, have hit saturation. One reason this can happen is that labs optimize for doing well on the benchmarks rather than for what these metrics are supposed to measure.

Based on my co-authored GovAI report on China’s large language model landscape, I see Baidu’s Ernie models as the strongest Chinese LLMs, so it’s puzzling not to see them on this list. Also puzzling: Baidu’s Ernie models have used CLUE and FewCLUE (a version of CLUE for few-shot learning evaluation) as benchmarks, so I would expect them to undergo the SuperCLUE test as well. Even more mysterious: it seems that an earlier version of the SuperCLUE leaderboard had Baidu’s ErnieBot ranked last. There’s also some scuttlebutt that one of the SuperCLUE team members is connected to a joint lab of iFlytek.

We’ll check back in on this next month: the SuperCLUE team plans to update the benchmark monthly, so the next iteration might include Baidu’s ErnieBot. What I’ll say is that this causes me to update a little in the direction of iFlytek’s SparkDesk model being pretty strong.

***We’ve had some really good discussions on the last few Google docs of full translations, and I know some readers are familiar with the English-language SuperGLUE benchmarks, so I’d especially welcome your annotations in the FULL TRANSLATION: Chinese general large model evaluation benchmark SuperCLUE released an update, adding Claude and Tsinghua GLM 100 billion (parameter) models

ChinAI Links (Four to Forward)

Should-listen: Yours truly on the ChinaTalk Podcast

Many thanks to Jordan Schneider for having me on a recent episode of the ChinaTalk podcast to talk about my recently published diffusion deficit paper and book manuscript. Teddy Collins, formerly of the White House Office of Science and Technology Policy and DeepMind, also contributed a lot of good stuff to the conversation. If you’re interested in digging deeper, Miles Brundage, OpenAI’s head of policy research, posted an insightful Twitter thread pushing back on one of the themes we discussed: the slow pace at which general-purpose technologies diffuse.

Should-read: China's smart cities and the future of geopolitics

A paper by Valentin Weber, a research fellow at the German Council on Foreign Relations, on the use of AI for urban governance in Chinese cities. The report highlights some of the security risks of smart cities built by Chinese companies abroad.

Should-read: Familiarity breeds both trust and contempt in AI adoption

Mike Horowitz, Lauren Kahn, Julia MacDonald and Jacquelyn Schneider have a new article forthcoming in AI & Society on the effects of familiarity on AI adoption:

Those with familiarity and expertise with AI and similar technologies were more likely to support all of the autonomous applications we tested (except weapons) than those with a limited understanding of the technology…However, familiarity cut both ways; individuals are also less likely to support AI-enabled technologies when applied directly to their life, especially if technology automates tasks they are already familiar with operating.

Should-read: GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models

Tyna Eloundou, Sam Manning, Pamela Mishkin, and Daniel Rock (researchers from OpenAI, OpenResearch, and UPenn) have a working paper on the potential implications of generative pre-trained transformers as general-purpose technologies (GPTs). They find: “Around 80% of the U.S. workforce could have at least 10% of their work tasks affected by the introduction of LLMs, while approximately 19% of workers may see at least 50% of their tasks impacted.”

H/t to Nathan Labenz for the recommendation.

Thank you for reading and engaging.

These are Jeff Ding's (sometimes) weekly translations of Chinese-language musings on AI and related topics. Jeff is an Assistant Professor of Political Science at George Washington University.

Any suggestions or feedback? Let me know at chinainewsletter@gmail.com or on Twitter at @jjding99

Lao Mein

May 25, 2023

Human performance on SuperCLUE (96.5%!!!) was unreasonably high compared to similar benchmarks: SuperGLUE has an estimated human performance of 88% in their paper. Even Winograd Schemas only have human performance of ~92%. The GitHub page for SuperCLUE notes that the human score was based on 3 college/grad students with access to the internet. Still, they must have been very talented and motivated individuals, because every single one of them got a score of 100% on the Classical Chinese section, a notoriously difficult subject. Do they even teach that in Chinese colleges to non-majors?
