Towards Accurate Quote-Aware Summarization of News using Generative AI

Alessandro Alviani
Generative AI in the Newsroom
8 min read · Jun 2, 2023


“A typewriter, quotation marks flying out of it, watercolour painting”, Bing Image Creator

Attribution is a fundamental principle of journalism. Correctly quoting a news source without distorting the sense of what was stated, or worse, adding information the journalist inferred, is an essential skill for any reporter. Even just recognizing a quote can be a struggle for algorithms. And large language models (LLMs) introduce a new challenge: they can potentially make up quotes or misattribute accurate quotes to the wrong sources. This follows from how they work: they predict the next most likely word in a sequence based on the preceding text. Such errors could erode trust in the media and should be avoided.

Our Goal

At IPPEN.MEDIA, we’ve been experimenting with numerous use cases for large language models. These include suggesting headline and lead variations, as well as summarizing or rewriting an article to target different audiences. When it comes to handling quotes, things can easily go wrong in generating summaries or text variations.

During our first round of testing, we discovered that ChatGPT tends to rewrite quotations, even when explicitly instructed not to. When we tried to summarize an article while keeping all the quotes unchanged by adding specific constraints to the prompt, ChatGPT simply ignored those constraints and rewrote the quotes. Worse, the behavior was inconsistent: the prompts sometimes worked as expected and all the quotes were reproduced correctly, but most of the time they were not.

This inconsistency (partly inherent in LLMs, which are probabilistic rather than deterministic models) undermines our goal of building AI-powered tools that both our editors and readers can trust. Even though we have adopted a double-check policy for all texts edited with ChatGPT and other LLMs and follow a human-in-the-loop approach, made-up quotes could still make it through the editing process. This jeopardizes accuracy and trustworthiness.

Our Approach

When it comes to quotes, it turns out that one of the fundamental concepts of prompt engineering — building a prompt that is as specific and clear as possible to define the desired output — may not be enough. As we’ll see later, a multi-step approach combined with instruction redundancy is required.

The instruction we initially added to our ChatGPT prompts for summaries and article versions [1] failed in two ways: the original quotes were either rewritten and still placed in quotation marks, or paraphrased. Only in a few inconsistent cases were they correctly detected and left unchanged. After studying a sample set of 12 articles, we discovered that all quotations were detected and left unaltered in the new text variation in only a fraction of texts: zero out of twelve for GPT-3.5 and six out of twelve for GPT-4 (see the two right-hand columns of Table 1 below).

Next, we tried to break the initial prompt into two steps. We also provided more context by using the system prompt to assign the model the role of an experienced news editor [2]. Again, the model usually ended up paraphrasing the original quotes.

Prompt Iteration is Key

As part of the Generative AI in the Newsroom Challenge, and building on the feedback we received, we began refining the prompt and moved to OpenAI’s Playground to take advantage of the additional parameters available there. For instance, we set the temperature parameter to 0 to reduce the variation in the output.
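For readers who want to reproduce this outside the Playground, the same settings can be passed through the API. A minimal sketch, assuming the 2023-era openai Python library and a placeholder prompt:

```python
import openai

openai.api_key = "YOUR_API_KEY"

response = openai.ChatCompletion.create(
    model="gpt-4",  # or "gpt-3.5-turbo"
    messages=[{"role": "user", "content": "Summarize the following article: ..."}],
    temperature=0,  # reduce output variation; 0 is as close to deterministic as it gets
)
print(response["choices"][0]["message"]["content"])
```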

Again, we used a step-by-step approach, but this time we did it differently: we instructed the model to first extract all quotes using the format “” (i.e. look for anything in between quotation marks), and then generate a summary or new text version that included the previously extracted quotes. For clarity, we added the original article at the end of the prompt in step 2. [3]
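A sketch of how the two steps from [3] could be chained in code; the helper and function names are hypothetical, not IPPEN.MEDIA’s production setup:

```python
import openai

def chat(prompt: str) -> str:
    """One chat completion at temperature 0, returning only the text."""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]

def quote_aware_rewrite(article: str) -> str:
    # Step 1: extract everything between quotation marks (prompt [3], step 1)
    quotes = chat(
        "Extract all quotes between quotation marks such as “” "
        f"in the following text:\n###{article}###"
    )
    # Step 2: rewrite with the extracted quotes pinned and, redundantly,
    # the full original article appended at the end of the prompt
    return chat(
        "Rewrite the article and make sure the following quotes "
        f"remain unchanged:\n{quotes}\nArticle: ###{article}###"
    )
```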

The outcome was much better: no quotes were fabricated. However, even when it had effectively extracted all quotes in step 1, the model could still deviate from the prompt and rewrite quotes, for example by rendering a German quote in the present tense rather than the subjunctive.

Worse, for longer articles with multiple quotes, the model can make two general errors: either it fails to extract all quotes, or it mistakenly identifies non-quote sentences that appear next to or between actual quotes as quotes. In general, the longer the text, the fewer quotes are recognized. This is particularly true for the GPT-3.5 model.
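Since step 1 is defined purely in terms of quotation marks, both error types can in principle be flagged deterministically: compare the model’s extraction against a simple regular expression, and verify that every source quote reappears verbatim in the output. A minimal sketch (the regex and function names are illustrative, not part of our experiments):

```python
import re

# Matches German („…“), typographic (“…”) and straight ("…") double quotes
QUOTE_RE = re.compile(r'[„“"]([^„“”"]+)[”“"]')

def extract_quotes(text: str) -> list[str]:
    return QUOTE_RE.findall(text)

def lost_quotes(article: str, output: str) -> list[str]:
    """Quotes from the source article that do not reappear verbatim
    in the generated text."""
    return [q for q in extract_quotes(article) if q not in output]
```

A check like this cannot judge style or attribution, but it reliably catches quotes that were altered or dropped.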

The real game changers in this iteration process were the next two adjustments. The first was the addition of a simple system prompt [4], which can be provided to OpenAI’s chat-based models in the Playground interface. The results improved significantly and support the idea that providing LLMs with more context increases their performance.
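In API terms, the system prompt is just one extra message at the head of the conversation; the chat helper sketched above would change like this:

```python
def chat(prompt: str) -> str:
    """As above, but with the system prompt [4] prepended."""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a precise journalist and editor."},
            {"role": "user", "content": prompt},
        ],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]
```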

The second big improvement came from using GPT-4. Our tests show that OpenAI’s latest model outperforms GPT-3.5 at handling quotes. Using our two-step approach, almost all quotes were correctly recognized. In eleven out of twelve sample articles, all quotations were treated accurately — compared to seven out of twelve with GPT-3.5 (See Table 1). In total, 44 out of 45 single quotes were correctly inserted in the new output text — both for summaries and article variations — compared to up to 32 quotes using our original prompts (without the two-step approach).

Table 1: Original Sample of 12 articles. Task: Rewrite

We then performed testing on a fresh sample of 10 articles and 39 quotes and found that the detection and handling rates were remarkably similar (See Table 2). Using GPT-4, 37 out of 39 single quotes were properly integrated into the new output text in the case of writing article variations — compared to 21 quotes using our original prompts (without the two-step approach).

Table 2: New Testing Sample of 10 articles. Task: Rewrite

When it comes to summaries, GPT-4 is far superior to GPT-3.5. In 11 out of 12 articles, all quotes were correctly included in the AI-generated summary. With GPT-3.5, this rate dropped to 2 out of 12. On the fresh test set of 10 articles, however, GPT-4’s performance was lower, which means there is still iteration to do on the prompt.

Table 3: Original Sample of 12 articles. Task: Summarization
Table 4: New Testing Sample of 10 articles. Task: Summarization

The instruction redundancy worked quite well. In our two-step approach, we asked GPT-3.5 and GPT-4 in the second step to rewrite or summarize a text and pasted in not only all the quotes extracted in step 1 but, once again, the original article at the end of the prompt. Even when GPT-3.5 and GPT-4 failed to extract all quotes from the original articles or delivered false positives in the first step (for example, the name of a news outlet set in quotation marks, like “Handelsblatt”, was detected as a verbatim quote), they were often able to correctly reconstruct the original quotes in the second step. Out of 12 articles, GPT-3.5 accurately extracted all quotes in only five texts, but managed to include all quotes in a new output summary or article variant in seven texts. GPT-4 processed quotes correctly in 11 of 12 texts, while quotes were properly extracted in the first step in only 8 cases.

More importantly for us, no quotes were fabricated, as they were with the very first prompts we had tried in our process.

Limitations of GPT Models

Still, the results are not always perfect. Both GPT-3.5 and GPT-4 sometimes still treat single words between quotation marks, such as a newspaper’s name, a figure of speech, or the name of a TV show, as verbatim quotations. Worse for accuracy and source transparency, GPT-4 sometimes fails to provide the name of the original source, such as a newspaper, even when it has successfully extracted and included all quotations. This is particularly noticeable in summaries, and it is an important error: without proper attribution, a reader may not know where a quote comes from. In one case, GPT-4 even introduced a grammatical error to preserve all quotes, simply placing a comma between two of them. Stylistically, GPT-4 sometimes repeats the words of a quote in the summary, as in the following sentence:

Merz stressed the need for Berlin to have a stable government: “Berlin must have a stable, good government” (original article: Merz continued: “Berlin must have a stable, good government”).

Comparing to Another Model

We decided to re-run the tests with Claude, a ChatGPT competitor developed by Anthropic, and found performance similar to the GPT models (See Table 5). Almost all quotes appeared correctly in the new article variant (44 out of 45 quotes) or the new summary (41 out of 45). The main difference from GPT-4 is a stylistic one in the summary task: to preserve as many quotes as possible, Claude sometimes simply placed two quotes side by side, without any introductory or explanatory words (such as “Olaf Scholz said …”). This reduces the quality of the output texts and makes them less than ideal for use in a journalistic context.

Table 5: Original Sample of 12 articles. Tasks: Rewrite and Summary. Tool: Claude.

Main Takeaways and Next Steps

Our results emphasize two crucial learnings. First, having a human-in-the-loop is essential for accuracy. As with all use cases involving LLMs, triple-checking the output remains a cornerstone, especially for summaries. Second, there are no silver bullets or quick fixes; iteration is the best way to get as close as possible to the desired outputs when it comes to prompting.

Looking forward, we are eager to improve the quote process within the new Generative AI team that has recently been established at Ippen. For example, the summarization task on the new 10-article sample highlighted the gaps and inconsistencies described above, which we want to investigate further. As we iterate on the prompt to address these issues, we will then sample a fresh set of test articles to evaluate how well the approach generalizes.

We also plan to evaluate the quote process on additional Large Language Models, including Open Source LLMs.

[1] The text contains quotations; they are enclosed in quotation marks. Quotations must remain as in the original.

[2] You are an editor with 30 years of experience. You need to rewrite the following article into a new text. Think step by step.
Step 1: Rewrite the following original title using vivid but neutral language; Step 2: Make sure that all quotations within quotation marks are reproduced in the new content in the same way. Nothing in quotation marks may be rewritten.

[3] 1st step:
Extract all quotes between quotation marks such as “” in the following text:
###Text###

2nd step (after the model has extracted the quotes):
Rewrite the article and make sure the following quotes remain unchanged:
“quote”
“quote”
“quote”
Article: ###Text###

[4] You are a precise journalist and editor.


Product Lead AI at Ippen Media • JournalismAI Fellow 2022 • Previously: Editorial Director, Microsoft News Hub Berlin; Germany Correspondent, La Stampa