ChinAI #223: The Labor of AI Trainers
A data annotator's view from a small county town in northwest China
Greetings from a world where…
Is there anyone better at covering her field than Claire McNear on the Jeopardy beat?
…As always, the searchable archive of all past issues is here. Please please subscribe here to support ChinAI under a Guardian/Wikipedia-style tipping model (everyone gets the same content but those who can pay support access for all AND compensation for awesome ChinAI contributors).
Feature Translation: The laborers behind ChatGPT, with a monthly salary of 3000
Context: In recent weeks, we’ve covered the large language models (LLMs) launched by Chinese AI labs in response to ChatGPT (e.g., ChinAI #219: Ernie Bot vs. GPT-4). In fact, last week, I published a GovAI report, co-authored with Jenny Xiao, on recent trends in China’s LLM ecosystem:
As large-scale pre-trained AI models gain popularity in the West, many Chinese AI labs have developed their own models capable of generating coherent text and realistic images and videos. These models represent the frontier of AI research and have significant implications for AI ethics and governance in China. Yet, to the best of our knowledge, there has been no in-depth English-language analysis of such models. Studying a sample of 26 large-scale pre-trained AI models developed in China, our review describes their general capabilities and highlights the role of collaboration between the government, industry, and academia in supporting these projects. It also sheds light on Chinese discussions related to techno-nationalism, AI governance, and ethics.
This week’s feature translation spotlights something else involved in building these LLMs that often gets lost: the data labeling and verification process. Keqi Yang (pseudonym) works as an AI “trainer” (训练师) in the data labeling industry in a small county town in northwest China. He shares his thoughts on trends in this industry, including pay, turnover, and professionalization. It comes from a New Weekly (新周刊) post (link to original Chinese), which I saw shared on Huxiu, a well-known platform that aggregates content on China’s science and technology ecosystem.
Key Passages: Keqi Yang shares his experience working for a data annotation company that helps AI companies with data preparation, cleaning, and labeling:
“1,000 workstations, 1,000 computers, and currently 800 AI trainers. They sit in front of the computer every day to draw frames, zoom in on the screen, adjust the frame line, and submit for review... There is air conditioning and Internet here. Each person’s office space is just less than 2 square meters.”
On the wide range of projects they’ve worked on, including an intriguing one related to reading palms: “We have counted sheep, wood, and iron blocks. The industries involved include medicine, security, and now autonomous driving. We also took on a palmistry project: the client asked us to mark the various lines on palms, and many employees have since begun to study palmistry, which is very interesting. Generally speaking, for visual content to be accurately recognized by machines, at least 100,000 (labeled) pictures are required.”
On the tradeoff between a decent salary and repetitive, mundane work: “Depending on the difficulty of the project, a frame pays 3-8 cents, and more than 2,000 frames need to be drawn in an 8-hour working day. Per capita monthly income is 3,000-4,000 RMB… Take our company as an example: the staff turnover rate is 30%-40%, because the work is relatively simple, sitting in front of the computer for 8 hours a day doing repetitive work. For some, it is a comfortable job; for others, it is very boring and uninteresting.”
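A rough back-of-envelope check (my own calculation, assuming “cents” here means fen, i.e., 0.03-0.08 RMB per frame, and roughly 22 working days per month):

2,000 frames/day x 0.03-0.08 RMB/frame ≈ 60-160 RMB/day
60-160 RMB/day x ~22 working days ≈ 1,300-3,500 RMB/month

That lines up with the reported 3,000-4,000 RMB range only toward the higher per-frame rates or with above-quota output, which underscores how thin the per-frame margins are for these workers.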
Yang also discusses macro-level trends in the data labeling industry related to professionalization and skill standardization:
In 2020, China’s national occupational classification catalog included AI trainers for the first time. Based on the most recent occupational skill standards for AI trainers: “AI trainers in the data labeling industry should be considered ‘primary workers’ (初级工) in this skill-level certification. There are four higher professional skill levels above it.”
On the maturation of the market: “In the early days of data labeling, the ‘crowdsourcing’ model emerged…which was similar to the current Meituan model…But as data labeling transitions from a stage of wild growth to standardized development, the number of part-timers in the market is decreasing. More and more of the part-time work is being taken over by labeling companies in county towns like ours.”
The very last section — The next stop of AI is the county town — was the most interesting to me:
Less urbanized than provincial capitals and large cities, but more urbanized than rural villages, county towns have become data service bases for large tech companies. According to Yang, Baidu has the largest self-built data labeling team in the industry, with bases in 10 different county towns.
Drivers behind this trend: lower rent and labor costs AND government subsidies that encourage these data labeling companies to hire college graduates.
More details in FULL TRANSLATION: The laborers behind ChatGPT, with a monthly salary of 3000
ChinAI Links (Four to Forward)
Must-read: Recent Trends in China's Large Language Model Landscape
I’m excited to share this new GovAI report by me and Jenny Xiao based on our analysis of 26 large-scale pre-trained AI models developed by Chinese labs between 2020 and 2022. The report tracks key metrics across these 26 models, including their benchmarks, funding support, and training compute — marking the first in-depth study of China’s large language model landscape.
See my Twitter thread summarizing the key takeaways: https://twitter.com/jjding99/status/1653413881354616832
Should-read: China Is Blazing a Trail in Regulating Generative AI – on the CCP’s Terms
For The Diplomat, MERICS researchers Rebecca Arcesati and Wendy Chang provide an insightful take:
Information controls aside, China’s forward-thinking approach to regulating the input and output of large language models (LLMs) may still lend lawmakers elsewhere – including in Europe – interesting angles to consider when developing their own regulatory framework.
Should-read: Another “China” hiding in county towns
This week’s article on data labelers highlighted county towns (县城). While researching this topic last week, I was glad to stumble across Jinglin Gao and Jiang Jiang’s Ginger River Review. This post translates a popular article on the “economy, politics, and social dynamics of county towns.”
Should-read: Five Recommendations for Improving China’s Generative AI Services Draft Regulations
Re-upping my translation for last week’s ChinAI. Over the past week, the Google doc has accumulated good annotations and comments from folks with a lot of knowledge in this space, including Graham Webster (DigiChina editor-in-chief) and Tom Nunlist (senior analyst at Trivium China).
Thank you for reading and engaging.
These are Jeff Ding's (sometimes) weekly translations of Chinese-language musings on AI and related topics. Jeff is an Assistant Professor of Political Science at George Washington University.
Check out the archive of all past issues here & please subscribe here to support ChinAI under a Guardian/Wikipedia-style tipping model (everyone gets the same content but those who can pay for a subscription will support access for all).
Any suggestions or feedback? Let me know at chinainewsletter@gmail.com or on Twitter at @jjding99