Greetings from a world where…
“Did everyone sever their balls in the elevator this morning?”
…As always, the searchable archive of all past issues is here. Please please subscribe here to support ChinAI under a Guardian/Wikipedia-style tipping model (everyone gets the same content but those who can pay support access for all AND compensation for awesome ChinAI contributors).
Feature Translation: DeepSeek-R1 Stability on Third-party Platforms Report— 18 web-based evaluations
Context: SuperCLUE, an organization that benchmarks large language models from Chinese and international labs, recently released a report (link to original Chinese) that evaluated third-party platforms that integrate DeepSeek’s R1 model. What do we mean by third-party deployments? In the case of first-party deployments, users access models from the original developers themselves (e.g., OpenAI’s ChatGPT). For third-party deployments, in contrast, users access models from an org that is external to the original developer.
For instance, Perplexity AI provides an AI-powered search engine. Its paid subscription plan (Perplexity Pro) lets users choose among various AI models from other companies, including GPT-4 Omni and DeepSeek R1, for tailored queries. Perplexity clarifies its access to DeepSeek’s R1 model:
Because it's open sourced, it can be fine-tuned and utilized on various platforms. We are not using the DeepSeek API to provide answers through Perplexity. The model runs on secure U.S.-based data centers.
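To make the first-party/third-party distinction concrete, here is a minimal sketch (mine, not from the SuperCLUE report) of calling the same R1 model two ways through an OpenAI-compatible Python client. The endpoints and model IDs below are assumptions for illustration only; check DeepSeek’s and together.ai’s documentation for current values.

```python
# Sketch: "first-party" vs. "third-party" access to the same open-weights model.
# Endpoints and model IDs are assumed for illustration -- verify against provider docs.
from openai import OpenAI

question = "Solve: what is 17 * 23?"

# First-party deployment: calling R1 through DeepSeek's own API.
deepseek = OpenAI(api_key="DEEPSEEK_API_KEY",
                  base_url="https://api.deepseek.com")  # assumed endpoint
first_party = deepseek.chat.completions.create(
    model="deepseek-reasoner",  # assumed ID for R1 on DeepSeek's own platform
    messages=[{"role": "user", "content": question}],
)

# Third-party deployment: the same open-weights model, hosted by a different company.
together = OpenAI(api_key="TOGETHER_API_KEY",
                  base_url="https://api.together.xyz/v1")  # assumed endpoint
third_party = together.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1",  # assumed ID on together.ai
    messages=[{"role": "user", "content": question}],
)

print(first_party.choices[0].message.content)
print(third_party.choices[0].message.content)
```

The point is simply that, because the weights are open, the very same model can sit behind many different hosting stacks, and it is the hosting stack that SuperCLUE is evaluating.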
SuperCLUE tested R1 integrations on 18 third-party platforms, using 20 Math Olympiad problems at the elementary school level. Let’s check out the results.
Key Takeaways: Based on this preliminary analysis, international third-party platforms that have R1 integrations (Perplexity, together.ai, and Poe) perform better than Chinese third-party platforms that have R1 integrations (SiliconFlow’s pro edition and Luchen Tech Cloud).
The specific metrics: “On average, the complete response rate of international third-party platforms with paid subscriptions — Perplexity, together.ai and Poe — reached 92%, while the average complete response rate of paid subscription versions of Chinese third-party platforms, Silicon Flow Pro and Luchen Tech Cloud, was 83%; the average truncation rate of international paid third-party platforms was only 2%, while that of Chinese platforms reached 18%. In terms of inference time, international platforms only took 109 seconds per question on average, while Chinese paid subscription versions required 263 seconds on average.”
Three important points of context for these figures. First, Perplexity, together.ai, and Poe are all California-based platforms that offer access to various advanced AI models (Poe is a subsidiary of Quora). Second, these metrics compare the paid subscription offerings of Chinese and international companies, as free versions sometimes can’t provide complete responses to demanding questions, due to token limitations. Third, the complete response rate refers to the percentage of outputs in which the model gives a full response, without getting cut off (truncation rate) or failing to provide an answer (e.g., a request error).
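For readers who want to see how these three rates relate to one another, here is a small sketch (mine, with made-up outcome labels and numbers, not SuperCLUE’s actual scoring code) of tallying complete response, truncation, and error rates over a 20-question run on one platform.

```python
from collections import Counter

# Hypothetical outcome labels for 20 test questions on one platform (illustrative only):
# "complete"  = full answer returned
# "truncated" = output cut off mid-response
# "error"     = request error / no answer returned
outcomes = ["complete"] * 17 + ["truncated"] * 2 + ["error"] * 1

counts = Counter(outcomes)
total = len(outcomes)

complete_rate = counts["complete"] / total    # SuperCLUE's "complete response rate"
truncation_rate = counts["truncated"] / total # "truncation rate"
error_rate = counts["error"] / total

print(f"complete response rate: {complete_rate:.0%}")   # 85%
print(f"truncation rate:        {truncation_rate:.0%}") # 10%
print(f"error rate:             {error_rate:.0%}")      # 5%
```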
These findings underscore one of the points I’m starting to get tired of repeating: look at AI through a diffusion-centered lens, not an innovation-centered one. If you do, it’s not that provocative to posit that the US AI ecosystem could benefit more from DeepSeek’s R1 than the Chinese one.
It should be noted that some free third-party deployments of R1 (e.g., Bytedance’s Volcengine) do provide a higher rate of complete responses than the subscription-based platforms. However, look at the second column from the right (see image below): their inference time is substantially higher (392 seconds for Bytedance’s Volcengine, compared to 86 seconds for Perplexity’s paid version).
Bonus: Can you answer one of the math questions from SuperCLUE’s test set?
Try this: “At 6AM, a frog climbs up from the bottom of a 10-meter-deep well. For every 2 meters it climbs up, it will slide down 0.5 meters because the well wall is slippery. The time it takes to slide down 0.5 meters is half the time it takes to climb up 2 meters. At 6:12, the frog climbs to 2.5 meters from the wellhead. How many minutes does it take for the frog to climb from the bottom of the well to the wellhead?”
For the answer (given by the Gemini-2.0-Flash-Exp model), see the FULL TRANSLATION: DeepSeek-R1 Stability on Third-party Platforms Report— 18 web-based evaluations.
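If you want to check your own working, here is a quick simulation sketch (mine, not from the report) that steps through the climb-and-slide cycles. It assumes one particular reading of the problem: that 6:12 is the moment the frog, having just slid back, sits 2.5 meters from the wellhead, and that the frog stops as soon as it reaches the top mid-climb.

```python
# Rough simulation of the frog problem -- my own sketch, not SuperCLUE's solution.
# Work in "climb units": 1 unit = the (unknown) time to climb 2 meters,
# so a 0.5-meter slide takes 0.5 units (half the time of a full climb).

WELL_DEPTH = 10.0      # meters
position = 0.0         # frog starts at the bottom at 6:00
elapsed_units = 0.0
units_at_7_5 = None    # time when the frog sits 2.5 m from the wellhead after a slide

while True:
    # Climb phase: up to 2 m at 2 m per unit; stop as soon as the frog reaches the wellhead.
    climb = min(2.0, WELL_DEPTH - position)
    position += climb
    elapsed_units += climb / 2.0
    if position >= WELL_DEPTH:
        break
    # Slide phase: 0.5 m back down, taking 0.5 units.
    position -= 0.5
    elapsed_units += 0.5
    if units_at_7_5 is None and abs(position - 7.5) < 1e-9:
        units_at_7_5 = elapsed_units

# Assumed reading: the post-slide 7.5 m mark corresponds to 6:12, i.e. 12 minutes elapsed.
minutes_per_unit = 12.0 / units_at_7_5            # 12 / 7.5 = 1.6 minutes per 2 m climb
total_minutes = elapsed_units * minutes_per_unit
print(f"One 2 m climb takes {minutes_per_unit:.1f} minutes")          # 1.6
print(f"Frog reaches the wellhead after {total_minutes:.1f} minutes") # 15.2 under this reading
```

Under that reading, one 2-meter climb takes 1.6 minutes and the frog reaches the wellhead after 15.2 minutes. Whether 6:12 should instead be matched to an earlier moment when the frog first passes the 7.5-meter mark mid-climb is exactly the sort of ambiguity that can trip up a model, so treat this as one interpretation rather than the official answer.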
ChinAI Links (Four to Forward)
Must-read: AI Proem — Grace Shao’s substack
Grace Shao, an AI researcher and former journalist, publishes the AI Proem substack that “provides reports and analyses on global AI x infrastructure, AI innovation, Physical AI, and big-tech AI. With a focus on U.S.-China.” I learned a lot from her recent post on DeepSeek’s integration into WeChat, including details about Tencent’s recent orders of H20 GPUs. H/t to Mary Clare McMahon for recommending.
Should-read: Generative AI at Work
Related to last week’s ChinAI issue on “artificial challenged intelligence” in the customer service industry, Erik Brynjolfsson, Danielle Li, and Lindsey Raymond studied the introduction of an AI tool in the workflow of 5,000+ customer service agents. Their Quarterly Journal of Economics article finds: “Access to AI assistance increases worker productivity, as measured by issues resolved per hour, by 15% on average, with substantial heterogeneity across workers.” *Note: the context of these findings is AI tools used to augment workers, not displace them.
Should-read: U.S. Open-Source AI Governance
Claudia Wilson and Emmie Hine’s report identifies debates among U.S. policymakers on open-source AI policy. One of their findings: “blanket export controls on all open-source AI models would likely be sub-optimal and counterproductive. Requiring every user of every open model to undergo a know-your-customer (KYC) process would be highly disruptive to the development of specific-use applications, though it would have limited impact on frontier capabilities. It would also likely have limited efficacy in mitigating misuse risks by China.”
Should-apply: Tarbell Fellowship (Center for AI Journalism)
The South China Morning Post (SCMP) and ChinaTalk are both looking to host exceptional fellows as part of the 2025 Tarbell Fellowship (a one-year program for early-career journalists interested in covering artificial intelligence).
Thank you for reading and engaging.
These are Jeff Ding's (sometimes) weekly translations of Chinese-language musings on AI and related topics. Jeff is an Assistant Professor of Political Science at George Washington University.
Check out the archive of all past issues here & please subscribe here to support ChinAI under a Guardian/Wikipedia-style tipping model (everyone gets the same content but those who can pay for a subscription will support access for all).
Also! Listen to narrations of the ChinAI Newsletter in podcast format here.
The answer to the 井底之蛙 (frog at the bottom of the well) maths problem was incorrect, interestingly. (Challenge to other readers to work out why!)
I told ChatGPT 4o and DeepSeek R1 that it was incorrect, and asked them to explain why and give me the correct answer. Neither could manage it, but ChatGPT o3-mini got it (after reasoning for 4 minutes!).
Interesting translation, and thanks for the shout-out!