Researchers have used the technology behind the artificial intelligence (AI) chatbot ChatGPT to create a fake clinical-trial data set to support an unverified scientific claim.
In a paper published in JAMA Ophthalmology on 9 November1, the authors used GPT-4 — the latest version of the large language model on which ChatGPT runs — paired with Advanced Data Analysis (ADA), a model that incorporates the programming language Python and can perform statistical analysis and create data visualizations. The AI-generated data compared the outcomes of two surgical procedures and indicated — wrongly — that one treatment is better than the other.
“Our aim was to highlight that, in a few minutes, you can create a data set that is not supported by real original data, and it is also opposite or in the other direction compared to the evidence that are available,” says study co-author Giuseppe Giannaccare, an eye surgeon at the University of Cagliari in Italy.
The ability of AI to fabricate convincing data adds to concern among researchers and journal editors about research integrity. “It was one thing that generative AI could be used to generate texts that would not be detectable using plagiarism software, but the capacity to create fake but realistic data sets is a next level of worry,” says Elisabeth Bik, a microbiologist and independent research-integrity consultant in San Francisco, California. “It will make it very easy for any researcher or group of researchers to create fake measurements on non-existent patients, fake answers to questionnaires or to generate a large data set on animal experiments.”
The authors describe the results as a “seemingly authentic database”. But when examined by specialists, the data failed authenticity checks, and contained telltale signs of having been fabricated.
The authors asked GPT-4 ADA to create a data set concerning people with an eye condition called keratoconus, which causes thinning of the cornea and can lead to impaired focus and poor vision. For 15–20% of people with the disease, treatment involves a corneal transplant, performed using one of two procedures.
The first method, penetrating keratoplasty (PK), involves surgically removing all the damaged layers of the cornea and replacing them with healthy tissue from a donor. The second procedure, deep anterior lamellar keratoplasty (DALK), replaces only the front layer of the cornea, leaving the innermost layer intact.
The authors instructed the large language model to fabricate data to support the conclusion that DALK results in better outcomes than PK. To do that, they asked it to show a statistical difference in an imaging test that assesses the cornea’s shape and detects irregularities, as well as a difference in how well the trial participants could see before and after the procedures.
The AI-generated data included 160 male and 140 female participants and indicated that those who underwent DALK scored better in both vision and the imaging test than did those who had PK, a finding that is at odds with what genuine clinical trials show. In a 2010 report of a trial with 77 participants, the outcomes of DALK were similar to those of PK for up to 2 years after the surgery2.
“It seems like it’s quite easy to create data sets that are at least superficially plausible. So, to an untrained eye, this certainly looks like a real data set,” says Jack Wilkinson, a biostatistician at the University of Manchester, UK.
Wilkinson, who has an interest in methods to detect inauthentic data, has examined several data sets generated by earlier versions of the large language model, which he says lacked convincing elements when scrutinized, because they struggled to capture realistic relationships between variables.
At the request of Nature’s news team, Wilkinson and his colleague Zewen Lu assessed the fake data set using a screening protocol designed to check for authenticity.
This revealed a mismatch in many ‘participants’ between designated sex and the sex that would typically be expected from their name. Furthermore, no correlation was found between preoperative and postoperative measures of vision capacity and the eye-imaging test. Wilkinson and Lu also inspected the distribution of numbers in some of the columns in the data set to check for non-random patterns. The eye-imaging values passed this test, but some of the participants’ age values clustered in a way that would be extremely unusual in a genuine data set: there was a disproportionate number of participants whose age values ended with 7 or 8.
The study authors acknowledge that their data set has flaws that could be detected with close scrutiny. But nevertheless, says Giannaccare, “if you look very quickly at the data set, it’s difficult to recognize the non-human origin of the data source”.
Bernd Pulverer, chief editor of EMBO Reports, agrees that this is a cause for concern. “Peer review in reality often stops short of a full data re-analysis and is unlikely to pick up on well-crafted integrity breaches using AI,” he says, adding that journals will need to update quality checks to identify AI-generated synthetic data.
Wilkinson is leading a collaborative project to design statistical and non-statistical tools to assess potentially problematic studies. “In the same way that AI might be part of the problem, there might be AI-based solutions to some of this. We might be able to automate some of these checks,” he says. But he warns that advances in generative AI could soon offer ways to circumvent these protocols. Pulverer agrees: “These are things the AI can be easily weaponized against as soon as it is known what the screening looks for.”