ChinAI #279: A reporter tests Chinese LLMs on how they help with her job
Greetings from a world where…
Technology and the Rise of Great Powers was the #7 new release in the Economics category on Amazon. Thanks everyone for helping diffuse the word about the book
…As always, the searchable archive of all past issues is here. Please please subscribe here to support ChinAI under a Guardian/Wikipedia-style tipping model (everyone gets the same content but those who can pay support access for all AND compensation for awesome ChinAI contributors).
Feature Translation: A reporter tests Chinese LLMs on how they can help with her job
Context: Ruilei Ma is a reporter who covers AI applications for AItechtalk[AI科技评论]. A large part of her job involves digesting large amounts of information, including AI-related research articles and reports. Half a year ago, she started using Moonshot AI’s Kimi app to help her browse documents and web pages; however, Kimi can’t read images, so it struggles to comprehend many of the statistical graphs in research papers. This time, to help her pick a model that is just as capable as Kimi at long-text comprehension but also boasts the ability to understand images, Ma decides to conduct a four-part test of four other Chinese models: Alibaba's Tongyi Qianwen, Baidu’s Erniebot, Bytedance’s Doubao, and Tencent’s Yuanbao. She also plays around with Anthropic’s Claude though she states, “It is better to be a Chinese model, so that I can use it at any time.”
Key Takeaways: The first test is to interpret a cartoon image from Guangdong’s 2016 gaokao (college entrance exam).
Tongyi, Erniebot, and Doubao all fail to convey reasonable interpretations of the slap and kiss marks. As for Tencent’s Yuanbao, Ma writes, “To be honest, I had given up on Yuanbao, because in my impression, Wenxin Yiyan, Tongyi Qianwen, and Doubao were all released at least half a year earlier than Yuanbao, and Yuanbao really had no presence in my life.”
Yet, Yuanbao performed the best on this test: “Tencent Yuanbao understands the core idea behind the cartoon – that the slap means the person did not meet their expectations about their grade and that the kiss means the person did exceed their expectations.”
The second test evaluates the ability of these large language models (LLMs) to synthesize long articles with graphs. Specifically, Ma wants the AI models to help her digest a recent Nature article, titled “An evolutionary model of personality traits related to cooperative behavior using a large language model.”
This paper uses LLMs to simulate the development of human society for 1,000 generations. In assessing the ability of these Chinese models to summarize the article’s key findings, Ma wanted to see if any of the models could identify the key trend in one of the paper’s graphs: a rapid drop in cooperation around the 900th generation.
Again, Tencent Yuanbao impressed. But it wasn’t just about the accuracy of the summary. The interface mattered: “The visual design of the entire user interface is very consistent with reading habits. There is an outline of the paper on the left, and the main text is combined with the pictures to read the paper. If you don’t understand, you can also ask questions about the content in real time.”
Her review of Claude-3.5 on this task: “At first glance, Claude's reply is really concise. It mainly summarizes some key points of the paper. It is not particularly systematic, but I have to say that I actually read it because of the small number of words. But it is too concise. After reading it, I have no more to follow-up on. It is not very friendly for me, a beginner.”
The third test is to analyze a research report on social media trends from the 2024 Paris Olympics.
Again, Yuanbao provided a more comprehensive summary, including key figures (e.g., Douyin’s interactive content accounted for 70% of the interaction volume about the Paris Olympics). Interestingly, the most popular platform for product placements was Xiaohongshu (see ChinAI #277) because its hot topics concentrated on athletes, whereas Douyin focuses on patriotic topics.
Let’s conclude with two additional notes. First, this exercise offers a way of differentiating between Chinese AI models that differs from our past coverage of benchmarks like SuperCLUE. It’s useful to see how these models perform out in the wild in particular application scenarios. Second, it will be interesting to track how Tencent Yuanbao develops. When I last checked a Chinese AI products ranking portal (aicpb.com), Tencent Yuanbao had 836.29k visits, which was only 35th in the overall ranking, though it had an increase of 40.53 percent in usage over the month of July.
For more, including the fourth and final test on the ability of AI to understand memes, see FULL TRANSLATION: Tencent Yuanbao Cured My Information Anxiety
ChinAI Links (Four to Forward)
Should-attend: First two stops on the book tour
Book launch event, Tuesday, September 3, 12-1:30PM at GW Elliott School. Discussant is Richard Danzig, former Secretary of the Navy. RSVP to secure your in-person spot or get online Zoom info.
Book talk at Foreign Policy Research Institute, September 23, 5-6:30PM, in Philadelphia. Moderated by Mike Beckley, Associate Professor of Political Science at Tufts, whose work has inspired a lot of my research.
Should-read: Managing the Sino-American AI Race
For Project Syndicate, Karman Lucero offers an excellent analysis of the limitations to official dialogues between the U.S. and China on AI governance. His conclusion points toward a way forward, though: “Much more can be achieved through unofficial channels that connect experts from across both societies. At the very least, we can gain a better understanding of each other’s institutions and their purposes, as well as develop the infrastructure to act if hypothetical scenarios become real.”
Should-read: Interoperability in AI Governance: A Work in Progress
Interoperability has become a buzzword in global AI governance discussions. In Tech Policy Press, Angela Onikepe unpacks the concept by examining how it functions in various institutions, including the UN, the network of AI safety institutes, and regional/national settings.
Should-read: The Use of Trade Remedies Against China
I’ve enjoyed exploring China Trade Monitor, a new (to me) source on China’s economic relations with the world, published by Simon Lester and Huan Zhu. This piece assesses whether there’s been a surge in the use of trade remedies against Chinese products, based on data since China’s entry to the WTO.
Thank you for reading and engaging.
These are Jeff Ding's (sometimes) weekly translations of Chinese-language musings on AI and related topics. Jeff is an Assistant Professor of Political Science at George Washington University.
Check out the archive of all past issues here & please subscribe here to support ChinAI under a Guardian/Wikipedia-style tipping model (everyone gets the same content but those who can pay for a subscription will support access for all).
Also! Listen to narrations of the ChinAI Newsletter in podcast format here.
Any suggestions or feedback? Let me know at chinainewsletter@gmail.com or on Twitter at @jjding99