How Teams of AI Agents Could Provide Valuable Leads For Investigative Data Journalism
ChatGPT hasn’t quite hit the mark as an investigative reporting assistant — could an agentic AI workflow offer a better solution?
Note: This post was co-authored with Nick Diakopoulos.
In investigative data journalism, the final product is rarely the work of a lone reporter. Instead, it’s the outcome of a collaborative effort where data analysts, reporters, designers, developers, and editors combine their skills to turn data into a compelling story. Data journalism almost inherently requires teamwork — few individuals possess the range of skills needed to collect, interpret, and present data in a way that truly resonates with readers. It’s through this collective effort that the most impactful stories are brought to life.
Over the past year, there’s been growing interest in adding at least one more member to the team: generative AI. The Markup, for instance, experimented with using ChatGPT as a reporting assistant earlier this year. More recently, The Pudding went a step further and tasked Anthropic’s Claude with producing an entire story from start to finish, grading its performance along the way.
The results were underwhelming. In The Markup’s case, the AI provided inaccurate information, lacked transparency in sourcing, and required frequent corrections or highly specific instructions. The AI used by The Pudding seemed to fare slightly better but still earned only a C+ as a reporter.
Both experiments relied on interacting with generative AI through chat interfaces. When we think of generative AI today, it’s often tools like ChatGPT that come to mind — simple interfaces where you ask a question and get an immediate response. But this kind of back-and-forth exchange may be too simplistic for the demands of investigative data journalism, which requires more than just quick answers.
The future of generative AI might be headed toward something a little more ambitious. Just as human journalism benefits from collaboration, AI could also advance through teamwork. Increasingly, research in AI is shifting toward what’s being called “agentic AI” — teams of AI agents that don’t just answer questions but actively plan, strategize, and collaborate on complex tasks. Both OpenAI and Google are reportedly working on such systems.
We’ve already seen early signs of what’s possible in other areas, like literary translation. In one recent study, an agentic AI system included agents playing the roles of a senior editor, a proofreader, a localization specialist, and even a virtual CEO. The results were very promising, with the agents’ translations often preferred over those produced by human writers.
Inspired by these developments, we’ve been experimenting with a similar agentic workflow for investigative data journalism over the past few months. Not to replace the newsroom or substitute for the expert judgment of journalists, but to provide reporters with interesting leads from large datasets that they could assess and choose to pursue if they felt further investigation was warranted.
Specifically, we developed a prototype system that, when provided with a dataset and a description of its contents, generates a “tip sheet” — a list of newsworthy observations that may inspire further journalistic exploration of the data. Behind the scenes, the system employs three AI agents, emulating the roles of a data analyst, an investigative reporter, and a data editor. To carry out our agentic workflow, we used GPT-4-turbo via OpenAI’s Assistants API, which allows the model to iteratively execute code and interact with the results of data analyses. In this blog post, we’ll provide a general overview of the system; more detailed information can be found in the pre-print of the short paper that will be presented at Computation + Journalism 2024.
How it works
Our system consists of a series of prompts organized into an orchestrated workflow (see the diagram below). These prompts come in two main types: role-specific and task-specific. The role-specific prompts act as the job descriptions of our AI agents, designed to replicate the roles of key players in a data journalism team — the analyst, the reporter, and the editor. Task-specific prompts, on the other hand, guide each agent through the nitty-gritty steps required to get the job done.
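To give a feel for how the role-specific layer can be wired up, here is a minimal sketch using the OpenAI Python SDK’s beta Assistants endpoints: each role prompt becomes an assistant’s instructions, with the code interpreter enabled and the dataset attached. The role descriptions, file name, and configuration details are illustrative placeholders rather than our actual prompts, and the exact calls may differ between SDK versions.

```python
# Minimal sketch (not our production code): each role-specific prompt becomes
# the instructions of an OpenAI assistant with the code interpreter enabled.
# Role descriptions and the file name are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

# Upload the dataset so the agents' code interpreter can read it.
dataset = client.files.create(file=open("dataset.csv", "rb"), purpose="assistants")

ROLE_PROMPTS = {
    "analyst": "You are a data analyst on an investigative data journalism team. "
               "Translate questions into quantitative analyses, run them in code, "
               "and interpret the results.",
    "reporter": "You are an investigative reporter. Generate questions, push for "
                "newsworthy angles, and distill findings into clear summaries.",
    "editor": "You are a data editor. Check the integrity of each analysis, "
              "bulletproof the findings, and flag anything factually shaky.",
}

agents = {
    name: client.beta.assistants.create(
        name=name,
        model="gpt-4-turbo",
        instructions=prompt,  # the role-specific prompt
        tools=[{"type": "code_interpreter"}],
        tool_resources={"code_interpreter": {"file_ids": [dataset.id]}},
    )
    for name, prompt in ROLE_PROMPTS.items()
}
```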
In our setup, the analyst is responsible for turning journalistic questions into quantitative analyses. It conducts the analysis, interprets the results, and feeds these insights into the broader process. The reporter, meanwhile, generates the questions, prods the analyst with follow-ups to steer the process towards something newsworthy, and distills the key findings into something meaningful. The editor, finally, acts mainly as quality control, ensuring the integrity of the work, bulletproofing the analysis, and pushing the outputs towards factual accuracy.
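Each of those responsibilities is carried out through task-specific prompts, which in this setup are simply messages posted to a thread and run against the relevant agent. A hedged sketch of that step, reusing the `client` and `agents` from the snippet above (the helper and prompt text are illustrative, not the pipeline’s actual code):

```python
# Sketch of running a single task-specific prompt against one of the role agents.
# Reuses `client` and `agents` from the previous snippet; the prompt is an example.
def run_task(agent_name: str, task_prompt: str) -> str:
    thread = client.beta.threads.create()
    client.beta.threads.messages.create(
        thread_id=thread.id, role="user", content=task_prompt
    )
    run = client.beta.threads.runs.create_and_poll(
        thread_id=thread.id, assistant_id=agents[agent_name].id
    )
    if run.status != "completed":
        raise RuntimeError(f"Run ended with status: {run.status}")
    # Grab the text of the assistant's most recent reply.
    reply = client.beta.threads.messages.list(thread_id=thread.id, order="desc", limit=1)
    return reply.data[0].content[0].text.value


# For example, the first stage might start with something like:
questions = run_task(
    "reporter",
    "Brainstorm five newsworthy questions that could be answered with the attached dataset.",
)
```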
Overall, these three agents collaborate through four stages (a rough code skeleton follows the list):
- Question Generation: First, a dataset and its description are provided to the reporter agent, which is tasked with brainstorming a set of questions (with the number adjustable) that could be answered using the data.
- Analytical Planning: For each question, the analyst drafts an analytical plan detailing how the dataset can be used to answer the question. The editor provides feedback on the plan and the analyst redrafts.
- Execution and Interpretation: Each analytical plan is executed and interpreted by the analyst. The editor and reporter provide feedback, which the analyst incorporates, and the reporter then summarizes the final results in bullet points.
- Compilation and Presentation: All bullet points from the previous step are compiled, and a subset of the most significant findings is presented to the user in the tip sheet.
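Stripped of the actual prompt wording (the full prompts are listed in the project’s GitHub repository, as noted below) and of the parsing of the agents’ replies, the orchestration boils down to something like the skeleton here. Treat it as a rough illustration rather than the pipeline code itself; which agent handles the final selection, for instance, is a simplification.

```python
# Rough skeleton of the four-stage workflow, reusing run_task() from above.
# Prompt wording is shorthand and reply parsing is elided; not the actual pipeline code.
def generate_tip_sheet(n_questions: int = 5) -> str:
    # Stage 1: the reporter brainstorms questions about the dataset.
    questions = run_task(
        "reporter",
        f"Brainstorm {n_questions} questions that could be answered with the attached "
        "dataset. Return one question per line.",
    ).splitlines()

    bullets = []
    for question in questions:
        # Stage 2: the analyst drafts a plan, the editor critiques it, the analyst redrafts.
        plan = run_task("analyst", f"Draft an analytical plan to answer: {question}")
        critique = run_task("editor", f"Give feedback on this analytical plan:\n{plan}")
        plan = run_task("analyst", f"Revise the plan to address this feedback:\n{critique}")

        # Stage 3: the analyst executes and interprets the plan, feedback is folded in
        # (the reporter's three-way decision loop is sketched a little further below),
        # and the reporter summarizes the final results as bullet points.
        findings = run_task("analyst", f"Execute this plan and interpret the results:\n{plan}")
        bullets.append(run_task("reporter", f"Summarize the key findings as bullet points:\n{findings}"))

    # Stage 4: compile the bullet points and keep the most significant findings.
    return run_task("editor", "Compile a tip sheet of the most significant findings:\n" + "\n".join(bullets))
```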
Throughout these stages, the agents don’t just passively use each other’s outputs as inputs; they actively have to incorporate each other’s feedback, particularly during the analysis phase. After the analyst completes its work in the third step, for example, the reporter steps in to assess the findings. The reporter is then prompted to choose one of three options: 1) give a green light for “publication,” signaling that the insight should be bulletproofed and potentially shared with the journalist supervising the agents, 2) suggest further analysis to develop other angles, or 3) decide the findings aren’t newsworthy enough to pursue.
If the reporter agent thinks the analysis is ready, the process moves ahead. But if more work is needed, as is often the case, the reporter provides specific feedback — whether that’s a follow-up question or a new angle that needs exploring. The analyst then revisits the data, refines the analysis, and addresses the reporter’s concerns. This back-and-forth can repeat several times. A complete list of the prompts used in the pipeline is available in the project’s GitHub repository.
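In code, that three-way decision can be expressed as a small loop around the analyst and reporter. The decision labels, prompt wording, and iteration cap below are our illustrative shorthand, not the pipeline’s exact mechanics.

```python
# Illustrative version of the stage-three feedback loop: the reporter picks one of
# three options and the analyst revises until a green light (or the loop gives up).
# Labels, prompts, and the iteration cap are shorthand, not the exact pipeline mechanics.
def execute_with_feedback(plan: str, max_rounds: int = 3) -> str | None:
    findings = run_task("analyst", f"Execute this plan and interpret the results:\n{plan}")

    for _ in range(max_rounds):
        decision = run_task(
            "reporter",
            "Review these findings. Reply with PUBLISH, FOLLOW_UP, or DROP on the first line, "
            f"followed by your feedback if choosing FOLLOW_UP:\n{findings}",
        )
        if decision.startswith("PUBLISH"):
            # Green light: bulletproof with the editor before it reaches the tip sheet.
            return run_task("editor", f"Bulletproof these findings and flag any problems:\n{findings}")
        if decision.startswith("DROP"):
            return None  # Not newsworthy enough to pursue.
        # FOLLOW_UP: the analyst revisits the data with the reporter's feedback.
        findings = run_task("analyst", f"Refine the analysis to address this feedback:\n{decision}")

    return None  # No green light after max_rounds; drop this thread of inquiry.
```

In the skeleton above, a function like this would replace the single analyst call in stage three.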
How we tested it
Evaluation can be a difficult problem for generative AI — general benchmarks often don’t tell the full story, and merely interacting with the system doesn’t provide a structured evaluation of its capabilities. That’s why we approached the evaluation of our system with a focus on real-world applicability. We tested our generative agents on five actual investigative data journalism projects, detailed in the table below. All the projects were nominated for either the Sigma Awards or the Philip Meyer Journalism Award, and we prioritized diversity in publication location, methodologies, and types of insights. Although we selected relatively complex projects because of our focus on award-winning work, we also had to exclude projects that required extensive computational resources, large datasets, visual image analysis, or a focus on geographical data, due to the constraints of the OpenAI Assistants API.
To measure the benefits of the agentic approach, we further compared the tips the agents produced to those from a baseline model that lacked the collaborative setup of our agent system. Our evaluation focused on three metrics: validity, potential newsworthiness, and precision.
Validity is about ensuring that the insights generated by our system are sound and logically derived from the provided data. Newsworthiness assesses whether these insights could grab a journalist’s attention and warrant further investigation and development into a story. And finally, precision measures how closely our system’s tips match the key findings in the original published articles. In other words, did the system identify insights similar to those presented in the original article?
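The ratings behind these metrics come from manual review, but the arithmetic itself is simple. As a sketch, assuming each tip has been hand-labeled on the three dimensions (the field names below are placeholders, not our annotation schema):

```python
# Illustrative arithmetic behind the three metrics, assuming each tip has been
# hand-labeled. Field names are placeholders, not our actual annotation schema.
from dataclasses import dataclass


@dataclass
class RatedTip:
    valid: bool        # sound and logically derived from the data
    newsworthy: bool   # could grab a journalist's attention and warrant follow-up
    in_article: bool   # matches a key finding in the published story


def score(tips: list[RatedTip]) -> dict[str, float]:
    n = len(tips)
    return {
        "validity": sum(t.valid for t in tips) / n,
        "newsworthiness": sum(t.newsworthy for t in tips) / n,
        # Precision: the share of tips that also appeared in the original article.
        "precision": sum(t.in_article for t in tips) / n,
    }


# Three tips, one of which matches the article, gives a precision of about 0.33.
print(score([
    RatedTip(valid=True, newsworthy=True, in_article=True),
    RatedTip(valid=True, newsworthy=False, in_article=False),
    RatedTip(valid=False, newsworthy=True, in_article=False),
]))
```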
What we found was promising: overall, the results affirmed the benefits of our agent pipeline, with higher average scores across all three metrics — validity, newsworthiness, and precision. The improvement in newsworthiness was particularly notable: our agents consistently outperformed the baseline model across projects, with about two-thirds of the suggested leads being potentially newsworthy.
Validity was relatively strong, with overall scores roughly between 80% and 90%. However, the system encountered difficulties with the more complex tasks, such as analyzing datasets that required expertise in multilingual text analysis methods.
Finally, while the agents demonstrated some improvements in precision compared to the baseline, the average precision remained relatively low at 34%. This means that generally about one in three tips found by the agents also appeared in the final article. Considering the broader scope of these tips, this may also indicate the influence of editorial decisions, with certain leads aligning more closely with editorial priorities than others.
Even with somewhat lower precision, the precision and newsworthiness results taken together show that the process was surfacing leads with news potential that weren’t included in the original reporting. This means there is potential to inform avenues of investigation for new coverage. Moreover, the validity ratings, while reasonable, indicate that having a human in the loop to check the outputs is as necessary as ever.
To illustrate the findings further, let’s look at one of the stories in our sample, published by Readr, which utilized the Facebook Ad Library to analyze the ad spending and targeting strategies of political parties in Taiwan. In this case, we found that while the agents’ setup exhibited higher newsworthiness overall, it had lower precision, meaning the baseline tips were more closely aligned with the original article. Both models provided basic statistics from the dataset, such as which party spent the most on ads (e.g., “The Democratic Progressive Party was the top spender on Facebook ads during Taiwan’s 2020 elections”) and details on targeted demographics and regions.
However, in several iterations, the agents’ setup also shifted focus to explore the relationship between ad spending and impressions, placing more emphasis on the platform’s role in disseminating the content. For instance, one agent returned a tip highlighting that an independent candidate achieved the highest campaign efficiency. While such insights are potentially valuable, and similar analyses have been covered in other journalistic pieces, the original article centered primarily on the role of Taiwan’s political parties — a focus better reflected in the baseline tips.
What’s next
While the results of our evaluation were promising, there’s still a lot of ground left to cover. For one, the scope of our evaluation was limited by the constraints of OpenAI’s Assistants API, which doesn’t support executing complex code or using external packages. That meant we had to exclude certain types of projects from our evaluation. To really unlock the potential of these agents, we need to look at integrating more flexible, non-proprietary models that can handle a wider range of tasks.
There’s also a need to dig deeper into the various parts of the pipeline itself. How do the system prompts, knowledge bases, and feedback loops contribute to the final outcome? And how robust are they to small changes? Understanding these components more fully could lead to even more powerful and nuanced iterations of this tool for journalists. As we improve the system, we’re also keen to collaborate with newsrooms interested in trying it out on their data-driven investigations.
Our current setup offers a lot of automation, but it leaves limited room for human input during the early stages. In a real-world newsroom, it’s crucial that journalists have more control, perhaps by shaping the initial questions or guiding the analysis in real time. We’re interested in thinking through broader frameworks for integrating humans into the loop of agentic systems, including exploring interaction and interface paradigms that effectively support journalistic agency, supervision, and expert judgment in the process.
The system we’ve developed shows a lot of promise — it’s a tool that can help uncover valuable leads and provide new angles on complex stories. But it’s also just that: a tool. The insights generated by these agents are a starting point, but the real work of journalism, the craft of telling a story that matters, remains firmly in human hands.
—
Joris Veerbeek is a PhD candidate in the Department of Media and Culture Studies at Utrecht University. He focuses on the application of AI in investigative journalism and, as part of his PhD, works on investigative data journalism projects with the Dutch weekly De Groene Amsterdammer.
Nick Diakopoulos is a Professor in Communication Studies and Computer Science (by courtesy) at Northwestern University where he is Director of the Computational Journalism Lab (CJL).