ChinAI #47: The Sensenet Data Leak - What Actually Happened
Welcome to the ChinAI Newsletter!
These are Jeff Ding's (sometimes) weekly translations of writings on AI policy and strategy from Chinese thinkers. I'll also include general links to all things at the intersection of China and AI. Please share the subscription link if you think this stuff is cool. Here's an archive of all past issues. *Subscribers are welcome to share excerpts from these translations as long as my original translation is cited.
I'm a grad student at the University of Oxford where I'm based at the Center for the Governance of AI, Future of Humanity Institute.
An AI Data Leak Case Misunderstood by Multiple Parties: 5 Key Clues to Get to the Truth
Last month, security researcher Victor Gevers discovered an exposed database belonging to a Chinese facial-recognition company called SenseNets containing 6.7 million GPS locations, logged over a 24-hour period, of 2.6 million people (almost all in Xinjiang). Obviously this discovery struck a chord, and it was covered by high-profile outlets such as the Financial Times and the Washington Post editorial board. This week's 6,000-word-plus translation of an investigative report by Chinese S&T media platform Jiqizhixin represents the best Chinese-language reporting on the case. It's not without flaws (any mention of Xinjiang is glaringly absent), but it sheds light on two key areas: 1) the growing reflexivity of Chinese media sources, as the author calls out "publicity stunt" reports and news reports that generated collective panic about the incident; 2) the impressive quality of Chinese reporting on tech issues (again with the caveat that key aspects of the story are off-limits), especially its command of the technical details, helped in large part by access to interviews with insiders.
Specifically, the report draws on interviews with researchers involved in cooperative projects between Chinese university laboratories and the public security system, engineers at well-known security AI companies, security engineers, and public security system personnel. It answers five basic questions, giving insights that I haven't seen in any English-language coverage of the case:
What was the nature of the leaked data? There was confusion in Chinese reporting over whether the leak included "face recognition images" data (stills from security camera footage with frames around faces of interest), but this was not the case. In fact, the leaked data was even more sensitive, including personal identity information.
Where did the data come from? A researcher involved in cooperative projects between Chinese university laboratories and the public security system believes that because ID card data was involved, "there is a large probability that this flowed out of the public security system." An engineer at a well-known security AI company in China said there is also a possibility the data came from other sources (hotels, banks, etc.) that collect user identity information, which then gets sold in underground markets. Another key distinction: sometimes AI companies/research labs only get access to train their algorithms on a dataset but don't get to directly copy and take the original data. This seems to be a case where SenseNets and the relevant public security bureau negotiated the data arrangement on "You open the data to us, we will guarantee the security of these data" terms.
Notably, the reporter investigates the connection between SenseNets and SenseTime, which had previously invested in SenseNets and provided algorithmic support but withdrew its investment in November 2018.
How was the leaked database accessed? SenseNets stored the data in MongoDB, a NoSQL database that, when left on default settings (no authentication, accepting connections from external IPs), is well known to be very vulnerable and easily broken into. One researcher said that while SenseNets is a smaller company with weak security protection, most AI algorithm companies do not physically isolate their databases (i.e. they allow external IP access, and critical datasets are not air-gapped). They recount another instance when a student at their lab was an intern at an AI company and used a single command to remotely send the company's database to the university lab.
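The exposure pattern described here is a long-documented MongoDB pitfall: a server bound to all network interfaces with authorization disabled can be read by anyone who finds its IP. As a minimal illustration (these are standard `mongod.conf` options, not SenseNets' actual configuration, which is unknown), hardening would start with something like:

```yaml
# mongod.conf -- minimal hardening sketch, not SenseNets' real config
net:
  bindIp: 127.0.0.1        # listen only on localhost; no external IP access
  port: 27017
security:
  authorization: enabled   # clients must authenticate before reading data
```

Even with these settings, a database holding sensitive identity data would normally sit behind a firewall or on an isolated network segment, which is exactly the air-gapping the researcher says most AI companies skip.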
What are the implications of collecting "location information captured by cameras in the past 24 hours"? There are two scenarios to digest: in the first, cameras track real-time location information matched to a pedestrian's actual identity information; in the second, the location information corresponds to a pedestrian identified only by a code (pedestrian A, B, C, etc.). The first scenario can only be achieved through checkpoint monitoring (e.g. at airports), whereas the security system as a whole can only realize the second scenario, due to camera costs, lack of demand from public security systems themselves, and the technical challenges of facial recognition with non-cooperative subjects (i.e. people not being asked to turn and face the camera).
Notably, the reporter also writes, “even if the public security can get our ‘location information based on the cameras we have passed in the past 24 hours,’ there is some controversy over whether the public security system has the right to monitor the life trajectory of each of us, and what places we have passed each day; compared with identity information, which is information necessary to maintain law and order, and there is constant need to register (the identity information). But the monitoring of the former (real-time location in the past 24 hours) is very likely to violate our privacy.” PLEASE STOP with the notion that Chinese people don’t care about privacy.
How can the security of sensitive data in the AI security domain be improved? A variety of methods are proposed: physical separation of the intranet from external networks (a method employed by Huawei), multiple levels of approval to access/copy and take data, and establishment of a special security research team to actively conduct attack/defense experiments on different levels of the AI company. China has a long way to go in this sphere: "During the course of interviews, several experienced security practitioners felt that compared with traditional established IT companies, the new generation of Internet companies are too slack in user data, and the artificial intelligence companies that inherited the genes of these Internet companies are even worse in their awareness of sensitive data."
This Week's ChinAI Links
Chinese phrase of the Week: 抽丝剥茧 (chou1si1bao1jian3) -- to spin silk from cocoons — fig. to conduct a painstaking investigation of an incident
Georgetown's new Center for Security and Emerging Technology is looking to hire 2 full-time Chinese-to-English translators.
I wish everyone over-inflating China’s AI capabilities would read more articles like this one from Technode on why China is not prepared for a widespread AI rollout, especially in places outside first-tier cities like Beijing. A year ago, my report Deciphering China’s AI Dream called out over-inflation of China’s AI capabilities as one of the four common myths associated with China’s AI development.
Another flawed tendency is to see the global politics of AI as a two-agent US-China game. A useful corrective is to check out this survey of the EU’s AI ecosystem by Charlotte Stix here, and subscribe to her excellent EuropeanAI newsletter.
Currently available open access, read Julian Gewirtz's article, "The Futurists of Beijing: Alvin Toffler, Zhao Ziyang, and China's 'New Technological Revolution,'" which examines the Chinese Communist Party's modernization policies in the post-Mao period, particularly the Party's response to the global rise of information technology and the surprising influence of American writer Alvin Toffler and futurist ideas.
Thank you for reading and engaging.
Shout out to everyone who is commenting on the translations; the idea is to build up a community of people interested in this stuff. You can contact me at jeffrey.ding@magd.ox.ac.uk or on Twitter at @jjding99