Image-generating AI models like DALL-E 2 and Stable Diffusion can — and do — replicate aspects of images from their training data, researchers show in a new study, raising concerns as these services enter wide commercial use.
Co-authored by scientists at the University of Maryland and New York University, the research identifies cases where image-generating models, including Stable Diffusion, “copy” from the public internet data – including copyrighted images – on which they were trained.
The study has not yet been peer-reviewed, and the co-authors submitted it to a conference whose rules prohibit media interviews until the research has been accepted for publication. But one of the researchers, who asked not to be identified by name, shared high-level thoughts with TechCrunch via email.
“Even though diffusion models like Stable Diffusion produce beautiful images, and often images that look highly original and tailored to a specific text prompt, we show that these images can actually be copied from their training data, either wholesale or by copying only parts of training images,” the researcher said. “Companies that generate data with diffusion models may need to reconsider wherever intellectual property laws are concerned. It is virtually impossible to verify that any particular image generated by Stable Diffusion is new and not stolen from the training set.”
Images of noise
Modern image-generating systems such as Stable Diffusion are what are known as “diffusion” models. Diffusion models learn to create images from text prompts (e.g., “a sketch of a bird sitting on a windowsill”) as they work through massive training datasets. The models, trained to “recreate” images rather than draw them from scratch, start with pure noise and refine an image over time to bring it incrementally closer to the text prompt.
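As a rough illustration of that noise-to-image loop, here is a toy sketch in Python. It uses a deterministic, DDIM-style update on a short list of numbers rather than a real image. Everything in it is an assumption made for illustration: the noise schedule values are common defaults rather than anything taken from the study, and `oracle_noise_predictor` is a hypothetical stand-in for the trained neural network, which in a real system is conditioned on the text prompt and has no direct access to a target.

```python
import math
import random


def make_alpha_bars(num_steps):
    """Cumulative signal fractions: close to 1 at step 0, close to 0 at the last step."""
    betas = [1e-4 + (0.02 - 1e-4) * t / (num_steps - 1) for t in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def oracle_noise_predictor(x_t, alpha_bar, target):
    """Hypothetical stand-in for a trained network: returns the noise that
    would turn `target` into `x_t` at this noise level."""
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [(x - signal * t) / noise for x, t in zip(x_t, target)]


def sample(target, num_steps=1000, seed=0):
    """Start from pure Gaussian noise and denoise step by step."""
    rng = random.Random(seed)
    alpha_bars = make_alpha_bars(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # pure noise
    for step in reversed(range(num_steps)):
        a_t = alpha_bars[step]
        a_prev = alpha_bars[step - 1] if step > 0 else 1.0
        eps = oracle_noise_predictor(x, a_t, target)
        # Estimate the clean sample implied by the predicted noise.
        x0_estimate = [
            (xi - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t)
            for xi, e in zip(x, eps)
        ]
        # Move to the next, slightly less noisy level.
        x = [
            math.sqrt(a_prev) * x0 + math.sqrt(1.0 - a_prev) * e
            for x0, e in zip(x0_estimate, eps)
        ]
    return x


if __name__ == "__main__":
    toy_image = [0.2, -0.5, 0.9, 0.0]
    print(sample(toy_image, seed=1))
    print(sample(toy_image, seed=2))
```

Because the stand-in predictor only "knows" one target, every random starting point ends up at the same output. That is a loose analogy, not a model, of what it means for a system to reproduce a training image.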
It’s not very intuitive technology. But it’s exceptionally good at generating artwork in virtually any style, including photorealistic art. Indeed, diffusion has enabled a host of attention-grabbing applications, from synthetic avatars in Lensa to art tools in Canva. DeviantArt recently released a Stable Diffusion-powered app for creating custom artwork, while Microsoft is tapping DALL-E 2 to power a generative art feature coming to Microsoft Edge.
To be clear, it has not been a mystery that diffusion models replicate elements of training images, which are usually indiscriminately scraped from the web. Character designers like Hollie Mengert and Greg Rutkowski, whose classical painting styles and fantasy landscapes have become some of the most commonly used prompts in Stable Diffusion, have decried what they see as poor AI imitations that are nonetheless attached to their names.
But it has been difficult to empirically measure how often copying occurs, given that diffusion systems are trained on upwards of billions of images drawn from a range of different sources.
To study Stable Diffusion, the researchers randomly sampled 9,000 images from a dataset called LAION-Aesthetics, one of the image sets used to train Stable Diffusion, along with the images’ corresponding captions. LAION-Aesthetics features images paired with text captions, including images of copyrighted characters (e.g., Luke Skywalker and Batman), images from IP-protected sources such as iStock, and art from living artists such as Phil Koch and Steve Henderson.
The researchers fed the captions to Stable Diffusion to have the system create new images. They then wrote new captions for each image, attempting to get Stable Diffusion to replicate the synthetic images. After using an automated similarity-spotting tool to compare the two sets of generated images (the set created from the LAION-Aesthetics captions and the set created from the researchers’ own prompts), the researchers say they found a “significant amount of copying” by Stable Diffusion across the results, including backgrounds and objects recycled from the training set.
One prompt, “Canvas Wall Art Print,” consistently produced images showing a particular couch, a relatively mundane example of the way diffusion models associate semantic concepts with images. Others containing the words “painting” and “wave” generated images with waves resembling those in the painting “The Great Wave off Kanagawa” by Katsushika Hokusai.
Across all their experiments, Stable Diffusion “copied” from its training data set about 1.88% of the time, the researchers say. That might not sound like much, but considering the reach of diffusion systems today (Stable Diffusion had created more than 170 million images as of October, according to one ballpark estimate), it’s hard to ignore.
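The article does not describe how the similarity-spotting tool works internally. One common way to flag near-duplicates is to turn each image into a feature vector and compare vectors by cosine similarity. The sketch below assumes that approach. The function names, the toy vectors and the 0.95 threshold are all hypothetical, and a real pipeline would get its vectors from a learned image encoder instead of hand-typed numbers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_likely_copies(generated, training, threshold=0.95):
    """For each generated image, find its closest training image and
    report the pair if the similarity meets the (assumed) threshold."""
    matches = []
    for gen_id, gen_vec in generated.items():
        best_id, best_score = None, -1.0
        for train_id, train_vec in training.items():
            score = cosine_similarity(gen_vec, train_vec)
            if score > best_score:
                best_id, best_score = train_id, score
        if best_id is not None and best_score >= threshold:
            matches.append((gen_id, best_id, best_score))
    return matches


if __name__ == "__main__":
    # Toy feature vectors; purely illustrative.
    generated = {"gen_a": [1.0, 0.0, 0.0], "gen_b": [0.0, 1.0, 1.0]}
    training = {"train_1": [0.9, 0.1, 0.0], "train_2": [0.0, 1.0, -1.0]}
    for gen_id, train_id, score in find_likely_copies(generated, training):
        print(f"{gen_id} closely resembles {train_id} (similarity {score:.3f})")
```

The threshold is the key design choice in a scheme like this: set it too low and stylistic resemblance gets counted as copying, set it too high and partial copies such as a reused background slip through.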
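For a sense of scale, the two figures above can be multiplied together. This is back-of-the-envelope arithmetic only: it assumes, purely for illustration, that the 1.88% rate measured in the study's experiments applies to every image in the 170 million estimate, which the study does not claim.

```python
# Figures quoted in the article; applying one to the other is an assumption.
ESTIMATED_IMAGES = 170_000_000    # ballpark count of Stable Diffusion images
COPY_RATE_BASIS_POINTS = 188      # 1.88% expressed in hundredths of a percent


def estimated_copies(total_images, rate_basis_points):
    """Integer arithmetic avoids floating-point rounding on large counts."""
    return total_images * rate_basis_points // 10_000


if __name__ == "__main__":
    copies = estimated_copies(ESTIMATED_IMAGES, COPY_RATE_BASIS_POINTS)
    print(f"{copies:,}")  # prints 3,196,000
```

On that assumption, the result is roughly 3.2 million images.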
“Artists and content creators should absolutely be concerned that others may be profiting from their content without permission,” the researcher said.
In the study, the co-authors note that none of the Stable Diffusion generations matched their respective LAION-Aesthetics source image and that not all of the models they tested were equally prone to copying. How often a model copied depended on several factors, including the size of the training data set; smaller sets tended to lead to more copying than larger sets.
One system the researchers examined, a diffusion model trained on the open-source ImageNet dataset, showed “no significant copying in any of the generations,” they wrote.
The co-authors also advised against excessive extrapolation of the study’s findings. Limited by the cost of computation, they could only sample a small portion of Stable Diffusion’s complete training set in their experiments.
Still, they say the results should prompt companies to rethink the process of compiling datasets and training models on them. Vendors behind systems like Stable Diffusion have long claimed that fair use — the doctrine in US law that allows the use of copyrighted material without first obtaining permission from the rights holder — protects them in the event that their models are trained on licensed content. But this is an untested theory.
“At the moment, the data is compiled blindly, and the datasets are so large that human screening is infeasible,” the researcher said. “Diffusion models are amazing and powerful, and have shown such impressive results that we can’t abandon them, but we have to think about how to maintain their performance without compromising privacy.”
For the businesses that use diffusion models to power their applications and services, the research may give pause. In a previous interview with TechCrunch, Bradley J. Hulbert, a founding partner at the law firm MBHB and an expert on IP law, said he believed it was unlikely that a judge would see the copies of copyrighted works in AI-generated art as fair use, at least in the case of commercial systems like DALL-E 2. Motivated by the same concerns, Getty Images has banned AI-generated artwork from its platform.
The issue will soon play out in the courts. In November, a software developer filed a class-action lawsuit against Microsoft, its subsidiary GitHub and business partner OpenAI for allegedly violating copyright laws with Copilot, GitHub’s AI-powered code-generating service. The case hinges on the fact that Copilot, which was trained on millions of examples of code from the internet, reproduces portions of licensed code without providing credit.
Beyond the legal ramifications, there is reason to fear that prompts could reveal, whether directly or indirectly, some of the more sensitive data embedded in the image training datasets. As a recent Ars Technica report revealed, private medical records (as many as thousands) are among the photos hidden in Stable Diffusion’s set.
The co-authors propose a solution in the form of a technique called differentially private training, which would “desensitize” diffusion models to the data used to train them, preserving the privacy of the original data in the process. Differentially private training usually hurts performance, but it may be the price to pay to protect privacy and intellectual property going forward if other methods fail, the researchers say.
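The article does not say which differentially private recipe the co-authors have in mind. A widely used one is DP-SGD, which limits how much any single training example can move the model by clipping each example's gradient and then adding random noise to the total. The sketch below shows just that one step, on plain lists of numbers. The function names and parameter values are hypothetical, and a real system would also track the cumulative privacy budget, which is omitted here.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale a gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0 or norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum them, add Gaussian noise, then average.

    Clipping bounds any one example's influence; the noise masks what remains.
    """
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm  # noise scale tied to the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy per-example gradients; the first is an outlier that gets clipped.
    grads = [[3.0, 4.0], [0.3, 0.4], [-0.2, 0.1]]
    print(private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng))
```

The performance cost mentioned above comes from the same two steps: clipping discards part of the signal from unusual examples, and the added noise makes every update less precise.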
“Once the model has memorized data, it is very difficult to verify that a generated image is original,” the researcher said. “I think content creators are becoming aware of this risk.”