Finding Evidence of Memorized News Content in GPT Models
Effectively training large language models (LLMs) like ChatGPT requires a substantial amount of data. Model developers like OpenAI, Meta, and others collect data from many different sources, including Wikipedia, books, and online crawls of web pages. Some of this data is undoubtedly copyrighted, and lawsuits have already been filed. Recognizing that their content has also been fed into these models, some news organizations have begun negotiating and signing license agreements, calling for more transparency, or even considering their own lawsuits.
For the most part there’s little transparency from model developers about what data they used to train a model. New legislation in the EU may address this by requiring developers of such models to publish a summary of the copyrighted data used in training. But in the meantime, it’s hard to know if your site’s data has been ingested. A recent report from the Washington Post evaluated one dataset, called C4, that is known to have been used in training some of Google’s and Meta’s models. The “News and Media” category of sites was the third largest in the dataset, and the New York Times alone accounted for 0.06% of all data in the dataset. The original article describing GPT-3 also notes that the model was trained on Common Crawl, the same root dataset underlying the C4 dataset investigated by the Washington Post. GPT-3 was further trained on a dataset constructed from millions of outbound links posted on Reddit, which we can assume included a lot of news. It seems likely, then, that a substantial amount of news content has been ingested by popular models like GPT-3 or 4.
In the absence of transparency, how might you determine if your site’s content has been incorporated into a model like GPT-3 or 4? In this post I’ll demonstrate a method that can be used to provide evidence in some cases.
Previous research has explored how models can memorize and regurgitate verbatim copies of text strings that they’ve ingested during training. In particular, the research has shown that (1) larger models, with more parameters, are more likely to memorize, (2) memorization is more likely the more a text has been repeated in the training data, and (3) memorized data is more discoverable if the model is prompted with longer strings of context.
These findings suggest concrete strategies for discovering memorization. First, we should see more memorization in GPT-4 than GPT-3 as it is a larger model, although this may be complicated by observations that GPT-4 is non-deterministic. Second, we should look for memorization of text that has been repeated many many times in the training data, such as boilerplate or repeated phrases that may be unique to a publication. We know that training data cutoffs were in 2021 (June for GPT-3 and September for GPT-4) and so any memorized text would come from before those cutoffs. And finally, to discover memorization we need to prompt the model with sufficient context so that it completes the text using material it has memorized.
I first set up an experiment to test whether GPT-3 and GPT-4 had memorized some boilerplate from the New York Times. The NYT Opinion section includes the following text at the bottom of almost every opinion article: “The Times is committed to publishing a diversity of letters to the editor. We’d like to hear what you think about this or any of our articles. Here are some tips. And here’s our email: letters@nytimes.com.” This string is unique to the New York Times and it appears before the training cutoff dates in 2021.
To test for memorization I prompt the model with a prefix of the string of varying length (e.g. “The Times is committed to publishing”), and check whether the model outputs the rest of the string such that it matches the original. A match (or perhaps even a very similar completion) could be said to reflect memorization and demonstrate the likelihood that the training data included that text (probably many, many times). I prompted the model once with a temperature of zero to get the most deterministic output, and then prompted it 20 times with a temperature of 0.7 to understand variability. I tried both GPT-3 (model name: “text-davinci-003”) and the latest version of GPT-4 (model name: “gpt-4” as of Sept 5, 2023).
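To make the procedure concrete, here is a minimal sketch of what such a test might look like in Python, using the legacy openai library (v0.x) that was current when I ran these experiments; newer library versions use a different client interface. The variable and helper names here are my own illustration and stand in for the code in the Colab Notebook linked at the end of this post.

import openai

openai.api_key = "YOUR_OPENAI_KEY"

# The NYT Opinion boilerplate (straight quotes used here; a real comparison
# should normalize curly apostrophes and whitespace before matching).
ORIGINAL = (
    "The Times is committed to publishing a diversity of letters to the "
    "editor. We'd like to hear what you think about this or any of our "
    "articles. Here are some tips. And here's our email: letters@nytimes.com."
)

def query_gpt3(prompt, temperature):
    # text-davinci-003 uses the legacy Completions endpoint
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        temperature=temperature,
        max_tokens=100,
    )
    return resp["choices"][0]["text"]

def query_gpt4(prompt, temperature):
    # gpt-4 uses the ChatCompletions endpoint
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=100,
    )
    return resp["choices"][0]["message"]["content"]

prefix = "The Times is committed to publishing"
expected_suffix = ORIGINAL[len(prefix):].strip()

# One deterministic run at temperature 0, then 20 sampled runs at 0.7
completions = [query_gpt4(prefix, temperature=0)]
completions += [query_gpt4(prefix, temperature=0.7) for _ in range(20)]

for completion in completions:
    print("MATCH" if expected_suffix in completion else "no match")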
Here is an output completion from GPT-4 that demonstrates that the full string was reproduced by the model, indicating memorization of the content.
There were many variations of the input prompt that produced the complete memorized output, though the model appears more likely to return the memorization when the input prompt includes a complete sentence. Several different prefix lengths produced the memorized string, though it was important to systematically test each variation since some prefixes triggered the memorization and others didn’t. GPT-3 did not produce as many complete memorizations. It was often able to return the memorized text up to the end of “Here are some tips.” but in only one test did it then accurately complete the next sentence (“And here’s our email: letters@nytimes.com.”). These results suggest to me that GPT models from OpenAI have memorized text from the New York Times, and that GPT-4 demonstrates this memorization more readily than GPT-3. Furthermore, systematically testing different input prompt context lengths was important to finding instances of memorization.
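For reference, a sweep over prefix lengths can be sketched as follows (again an illustration rather than the notebook’s code, reusing the hypothetical query_gpt4 helper and ORIGINAL string from above): split the boilerplate on word boundaries, prompt with progressively longer prefixes, and count how often sampled completions contain the remaining text verbatim.

def sweep_prefixes(original, min_words=4, runs=5):
    # Try prefixes of increasing length and report how often the model
    # reproduces the remainder of the string exactly.
    words = original.split()
    for n in range(min_words, len(words)):
        prefix = " ".join(words[:n])
        expected = " ".join(words[n:])
        hits = sum(
            expected in query_gpt4(prefix, temperature=0.7)
            for _ in range(runs)
        )
        print(f"{n:2d}-word prefix: {hits}/{runs} completions contained the remainder")

sweep_prefixes(ORIGINAL)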
I tried this with some boilerplate from other publications. For instance, the Washington Post includes the following text at the bottom of the Miss Manners column: “New Miss Manners columns are posted Monday through Saturday on washingtonpost.com/advice. You can send questions to Miss Manners at her website, missmanners.com. You can also follow her @RealMissManners.” In this case I did not observe an exact complete match, but GPT-4 got very close. When prompted with “New Miss Manners columns are posted Monday through Saturday on washingtonpost.com/advice. You can send” it completed “questions to Miss Manners at her website, missmanners.com. You can also follow her on Twitter @RealMissM”, which truncates the Twitter handle and inserts the phrase “on Twitter”, which doesn’t appear in the original. GPT-3 didn’t do quite as well. When prompted with the same prefix it completed: “questions to Miss Manners at her website, missmanners.com, or to her email address, dearmissman” which is a partial match, but includes a hallucination about an email address that wasn’t in the original.
Finally, I looked at the tagline that The Economist includes at the bottom of stories in its Science & Technology section: “This article appeared in the Science & technology section of the print edition under the headline [HEADLINE]”. For the string tested I excluded the actual headline since that wouldn’t have been repeated in the training data. In this case I found evidence of memorization with GPT-3. When prompting GPT-3 with “This article appeared in the Science &” the model completed the rest of the string perfectly. This was also the case for longer prefixes. However, GPT-4 did not produce the original string when prompted with any prefix.
Overall, the results here are somewhat varied. In two of the cases GPT-4 produced more memorized strings than GPT-3, as expected from the previous research. But the fact that in one case GPT-3 produced a memorization and GPT-4 didn’t indicates that it still makes sense to check both models. Based on my observations, it also seems clear that you need to try many different prefix lengths in the prompt to see whether the model will complete a memorization. Extremely short prefixes are unlikely to result in a memorization, though in a couple of the cases 6 or 7 words was enough. Your test string will need to be long enough to explore a range of prefixes, as sometimes it can take a longer prefix to trigger a memorization. Since the strings that were observed as memorized are the kinds of boilerplate that would be repeated many times in the training data, this repetition is likely what’s driving the memorization, which would be consistent with prior research. These bits of boilerplate may be repeated not only because they appear at the bottom of different stories by the same publication, but also because they have been excerpted on other websites that appear in the training data.
If you’d like to test memorization in GPT-3 and 4 with your own string, you can use this Colab Notebook. Just input your OpenAI key and the string you’d like to test. Let me know what you find!