Can LLMs Help Us Understand Data?

Brandon Roberts
Generative AI in the Newsroom
11 min read · Dec 4, 2023

Experiments in using local LLMs to answer questions using databases

In my work as a freelance data journalist and news apps developer, I spend a lot of time developing systems that allow less technical reporters, editors, and the general public to interact with databases. Yet, no matter how well-designed the system is, questions often arise that can only be answered by a database query, typically written in SQL (Structured Query Language).

With the rise of large language models (LLMs) that run on your local computer (rather than on a cloud service like OpenAI’s), I wondered if it might be possible to get an AI agent to answer questions written in natural language by translating them into SQL queries and executing those queries against custom databases. Could this be useful to me as a data journalist, for instance, by making it faster to mine datasets for useful information? Would it be possible to have a local AI system so I could pose these questions on the fly, reduce reliance on proprietary models, and protect any confidential data sources I’m investigating? Or perhaps reporters and editors could interact with it directly and find some utility?

To start exploring these possibilities, I coupled local large language models and Reason + Act prompting with a set of actions based on SQLite queries. In this post, I detail a series of experiments that tested how well this setup could answer questions about my data.

Reasoning and Acting

Pre-trained LLMs are static: they’re limited by the data they were trained on. The less data on a particular subject or viewpoint they’ve seen in training, the less likely they are to answer questions about it accurately. If you ask ChatGPT who your local mayor is, there’s a good chance it will give you outdated or even incorrect information. For this reason, using LLMs on their own for knowledge-based tasks is often a non-starter.

Reason + Act (a.k.a. ReAct, not to be confused with the JavaScript library) is a prompt format that gives an LLM the ability to interact with systems, retrieve results, and incorporate that information into its reasoning as it comes to a conclusion.

It works by interleaving “reasoning”, which helps models craft and follow plans, with “acting”, which grants access to external “tools” such as other software or data and knowledge retrieved via APIs. ReAct traces (a trace is the full log of all the steps the LLM took to answer a question) look like this (borrowed from Colin Eberhardt’s Re-implementing LangChain in 100 lines of code):

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of: [LIST OF ACTIONS HERE]
Action Input 1: the first input to the action.
Observation: the result of the action
… (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

The idea is to construct a prompt (a prompt is a piece of text we input to the LLM) which includes one or two example ReAct traces, demonstrating various solutions to problems. Then we pose the question to be answered and let the LLM generate text until it outputs an end-of-sequence token or “Observation:”. At that point we parse the output for any “Action” statements and their related “Action Inputs,” which become a call to a Python function with those arguments. Whatever the function returns is appended to the prompt as an “Observation,” along with the model’s prior response(s). This continues until we run out of context space or the LLM produces a “Final Answer” indicating that it has arrived at what it thinks is the answer to the initial question.
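
Strung together, the control loop looks roughly like the sketch below. This is my own simplified illustration rather than the project’s actual code; call_llm and the actions dictionary are hypothetical stand-ins for a local-model wrapper and the SQLite-backed actions described in the next section.

import re

# A minimal sketch of the ReAct control loop, not the exact code used here.
# call_llm(prompt, stop=...) sends text to a local model and returns its
# completion, stopping before the model invents its own "Observation:";
# actions maps action names (tables, schema, help, sql-query) to functions.
def react_loop(question, base_prompt, call_llm, actions, max_steps=10):
    prompt = base_prompt + "\nQuestion: " + question + "\n"
    for _ in range(max_steps):
        output = call_llm(prompt, stop=["Observation:"])
        prompt += output
        final = re.search(r"Final Answer:\s*(.*)", output, re.DOTALL)
        if final:
            return final.group(1).strip()
        action = re.search(r"Action:\s*(.+)", output)
        action_input = re.search(r"Action Input.*?:\s*(.+)", output)
        if not action:
            break  # the model produced neither an action nor a final answer
        result = actions[action.group(1).strip()](
            action_input.group(1).strip() if action_input else "")
        # Feed the tool's result back to the model as an Observation.
        prompt += "\nObservation: " + str(result) + "\n"
    return None  # ran out of steps (or context) without a final answer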

ReAct was developed specifically to address shortcomings in traditional LLM performance on tasks like fact checking and question answering. It also improves the human interpretability of an LLM’s answers.

A flowchart illustrating Reason+Act prompting. A full-sized PDF is available here.

LLMs plus SQLite

The data I had available was in the form of SQLite databases (SQLite databases are portable relational databases that fit in a single file and allow you to perform SQL queries on them). To allow an LLM to interact with my data via SQL queries, I provided it with four actions:

  1. tables: Lists the tables in the DB.
  2. schema: Returns the schema for a given table (i.e., all of its columns).
  3. help: Provides extra context for tables and columns (e.g., “this column may be used to join on the groups table” or “sometimes users mention their location in this field”). This avoids information overload in the prompt itself; the LLM can request help before using a table or column.
  4. sql-query: Executes a SQLite3 query and returns the first 3 results (to preserve context space, but allow “top few” type questions).

All of these actions return helpful error information in the event that the LLM uses an action incorrectly. This helps the LLM recover from common mistakes like asking for the schema of a non-existent table, naming the wrong column in a SQL query, or trying to use a SQLite extension that isn’t enabled.
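
As a rough sketch of what these actions can look like in Python (my own illustration using the standard sqlite3 module; the real implementation, file name, and help text differ):

import json
import sqlite3

conn = sqlite3.connect("data.db")  # hypothetical database file

# Illustrative help text; real entries are written per dataset.
HELP = {"team_name": "The name of the team that posted the job listing."}

def tables(_input=""):
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
    return json.dumps([r[0] for r in rows])

def schema(table):
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    if not rows:
        # A readable error lets the model correct itself on the next step.
        return f"Error: no such table: {table}"
    return json.dumps([{"name": r[1], "type": r[2]} for r in rows])

def help_action(name):
    return HELP.get(name, f"No extra help available for: {name}")

def sql_query(query):
    try:
        rows = conn.execute(query).fetchmany(3)  # first 3 rows only, to save context
    except sqlite3.Error as e:
        return f"Error: {e}"  # the SQLite error becomes the Observation
    return json.dumps(rows)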

With these four actions, I manually produced a handful of example traces. The goal of these examples is to show the model how to use each action, along with some basic SQL strategies. They were added to the prompt before posing the actual question to the LLM.

Model Selection & Choice of Tools

If you’ve looked into the world of local LLMs, you may have been overwhelmed. There are an incredible number of models, formats, tools, and prompt types. One user in particular, Tom “TheBloke” Jobbins, has over two thousand LLMs available for download on Hugging Face. To wade through all of this, I needed a way to test a lot of different models in a standardized way.

I wrote a script that asks each model a set of questions and records the result. I then scored each result for accuracy by checking for correct-answer keywords like dollar amounts and names.
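
The scoring itself can be as simple as a keyword check. A sketch of the idea, with made-up question IDs and keywords rather than the real answer key:

# Hypothetical answer key: keywords a correct final answer must contain.
ANSWER_KEYWORDS = {
    "q2_total_price": ["$1,234"],                    # made-up value
    "q3_top_games": ["Game A", "Game B", "Game C"],  # made-up values
}

def score(question_id, final_answer):
    """Fraction of expected keywords present in the model's final answer."""
    if not final_answer:
        return 0.0
    keywords = ANSWER_KEYWORDS[question_id]
    hits = sum(1 for k in keywords if k.lower() in final_answer.lower())
    return hits / len(keywords)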

I focused my search on top models on the Alpaca Leaderboard. Models on the leaderboard can successfully answer a wide variety of questions and complete tasks including programming problems and logic puzzles, so my assumption was that these candidates would be good at both SQL coding and reasoning. I also expanded my search to include some models recommended by colleagues, as well as GPT-4 Turbo for comparison. I focused primarily on smaller 7B-parameter models (e.g., Mistral, Llama 2), since these are the most approachable and can run on a laptop, but I also tested some 13B and 70B models.

Experiment Design

The test consisted of six questions that required the model to perform increasingly complex queries using all of the tables available. The data came from a scrape of Roblox game and job postings that I’ve been working with. The Roblox Corporation, which operates the Roblox platform, is a multi-billion-dollar company that caters primarily to children. I wanted to answer journalistic questions such as how many children under 18 were seeking employment on its forums, how much money some games have made, and whether any unethical practices were taking place.

Here are the questions I posed to the LLMs:

  1. “Which team was associated with the most job experiences?” Requires the model to group by the team column, count and then sort, taking the top few.
  2. “What is the combined total price of all Game Passes in the data?” Requires a sum over the price column on the entire table.
  3. “What are the top three most popular games by number of favorites?” Requires the model to numerically sort by the Favorites column, which consists of numeric strings with commas.
  4. “What is the lowest and highest age requirement for jobs?” This requires the model to figure out that every value in the minAgeRequirement column is zero.
  5. “What are some ages of users according to their profile descriptions?” Users sometimes mention their age in their descriptions (e.g., “I’m 15 years old”). This requires the model to do a string search and extract the numbers, in no particular order.
  6. “Of the people who mention their age in their profile description, what is the average age? You may be able to find this by looking for ‘ years old ‘ and then getting the characters before it.” Here I provide a hint to guide the model toward extracting a numeric age and averaging it.

One of the primary limitations of working with LLMs is their context size: the maximum number of tokens (words and symbols) the LLM can handle across its input and output. Context size ranges from 2k to 4k tokens for most local models. OpenAI models like GPT-4 can go as high as 128k tokens in their latest versions, though longer contexts also come with higher costs. Everything must fit inside this limit, including the prompt, the examples, the question, all the ReAct cycles, and the final answer. Going beyond the limit results in a failure.

The need to leave enough context room for the model to use actions, write SQL queries, and retrieve JSON results meant I could only use a few examples.

I devised a generic prompt with four examples. Two focused on showing the LLM how to use the tables and help actions; the other two showed SQL query solutions.

Initial Results

OpenAI’s GPT-4 Turbo model did consistently well when using the generic prompt. It’s a very flexible model and exhibited no significant issues regarding hallucination (with temperature=0). But smaller local models were also surprisingly competitive (note that in informal testing, instruction-tuned models tended to perform worse than completion models). Of the small models, the 7B Mistral-based models did the best.

When the examples provided in the trace matched closely with the question (e.g. the solution requires a SQL query where just a column or table name was different from an example), the models did consistently well. Problems requiring a simple sort based on a numeric field or taking the max/min of a column were all easily answered. But when the model needed to write a complex query from scratch and no similar examples existed, the LLM often failed to come to a solution at all. This was particularly true when the data required conversion or parsing.

For example, in the third question, the LLM needs to join two tables and then sort based on a “Favorites” column. This column contains numeric strings with commas, so the model needs to strip out the commas before the values can be cast to integers for numeric sorting. (Failing to strip the commas breaks the numeric conversion for any value that contains one.) Without an example showing this, all models failed to come to a correct answer.
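
The query the model needed to discover looks something like the sketch below (the games table and Name column are assumed names for illustration, not necessarily the real schema):

import sqlite3

conn = sqlite3.connect("data.db")  # hypothetical database file
# REPLACE strips the commas so the string can be CAST to an integer; without
# it, the cast stops at the first comma and the sort order comes out wrong.
rows = conn.execute("""
    SELECT Name, CAST(REPLACE(Favorites, ',', '') AS INTEGER) AS favorites_n
    FROM games
    ORDER BY favorites_n DESC
    LIMIT 3;
""").fetchall()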

Table 1. Generic prompt results. Number of questions answered correctly for each model.

To address this problem, I wrote several extra question-and-answer traces that covered a variety of columns, techniques, and SQL strategies across all tables. Instead of presenting the LLM with a universal prompt containing all the examples, I injected individual examples into the prompt dynamically, based on the input question’s similarity to the example questions. I initially did this selection manually, but can now pull the appropriate example traces from a small database based on semantic similarity to the input question. This significantly improved the models’ ability to come to solutions for questions where applicable examples were provided.
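
A sketch of that selection step using sentence embeddings (my own illustration; the sentence-transformers model, the example store, and its contents are assumptions rather than the exact setup used here):

from sentence_transformers import SentenceTransformer, util

# Hypothetical example store: each entry pairs a question with a full ReAct
# trace demonstrating how it was answered.
EXAMPLES = [
    {"question": "Which team posted the most job listings?", "trace": "..."},
    {"question": "What is the total price of all Game Passes?", "trace": "..."},
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
example_embeddings = model.encode(
    [e["question"] for e in EXAMPLES], convert_to_tensor=True)

def select_examples(question, k=2):
    """Return the k example traces most similar to the input question."""
    q_emb = model.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, example_embeddings)[0]
    top = scores.argsort(descending=True)[:k]
    return [EXAMPLES[int(i)]["trace"] for i in top]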

Table 2. Dynamically injected prompt results. Number of questions answered correctly for each model.

Mistral 7B did best with the dynamically injected example prompt strategy. The difference came down to question 6, which asked the model to get the average user age mentioned in user descriptions. Mistral 7B accidentally got this one correct. Both OpenAI’s GPT-4 Turbo and Mistral 7B erroneously cast non-numeric results to integers (resulting in zero) and included those zeros in the average, pushing the average down. Mistral rounded its result up to 16 (from 15.9), which was the closest any model got to the correct answer of 16.3 years old.

Schema Confusion

Using clear, descriptive column names is very important. To illustrate, one of my tables had a column named “creatorDescription” containing the user description. In questions where the LLM needed to search user descriptions, a huge amount of LLM time was spent figuring out that the description column wasn’t called “description”. This was also a common cause of hallucination errors. The LLM would sometimes get the schema, note in its thought that the column was called “creatorDescription” and then use “description” inside the SQL query anyway. Instead of forcing the LLM to learn an odd column name, it is easier to simply rename the column to “description”.

Likewise, a column named “team” was ambiguous to models. Extra work was required of LLMs to discover the field was a team name and not a team ID. Renaming the column to “team_name” helped disambiguate the meaning for tasks where the field needed to be used to come to a solution.
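
Both renames are one-line schema changes (a sketch with assumed table names; ALTER TABLE … RENAME COLUMN requires SQLite 3.25 or newer):

import sqlite3

conn = sqlite3.connect("data.db")  # hypothetical database file
# Plain, descriptive names mean the model doesn't have to guess (or hallucinate).
conn.execute("ALTER TABLE creators RENAME COLUMN creatorDescription TO description")
conn.execute("ALTER TABLE jobs RENAME COLUMN team TO team_name")
conn.commit()
conn.close()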

LLMs, across the board, struggled with complex transformations of data. The most challenging test question required the models to extract an age from a free-text description field (e.g., extract the integer 22 from a profile containing the text “I am 22 years old.”). SQLite has very few built-in scalar functions, and when given no examples of how to accomplish this, all models failed to figure it out. When I injected an example that showed a string search followed by a negative index, the models could mimic the strategy and apply it to new problems.
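
For reference, that strategy translates into a query along these lines (my own sketch, assuming a creators table and the renamed description column; SQLite’s substr() with a negative length returns the characters just before the given position):

import sqlite3

conn = sqlite3.connect("data.db")  # hypothetical database file
# instr() finds ' years old', substr() with a negative length grabs the two
# characters right before it, and CAST turns that fragment into an integer.
# Fragments that aren't numeric cast to 0 and drag the average down, which is
# the failure mode described in the results above.
row = conn.execute("""
    SELECT AVG(CAST(substr(description, instr(description, ' years old'), -2) AS INTEGER))
    FROM creators
    WHERE instr(description, ' years old') > 0;
""").fetchone()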

In all experiments, naming columns plainly, simply, and descriptively helped the models come to a solution more often. If you know you’re going to be answering questions about ages found inside free-text fields, it’s best to create a new column with that transformation already performed accurately. The fewer steps the model needs to take, the better it will perform.

Conclusion

In my experiments, I found that small local LLMs can compete with larger 70B and even OpenAI GPT-4 Turbo models. Mistral 7B models did particularly well given their small size. But hallucination, small context size, complex schemas, and data transformations still posed a challenge for all LLMs, particularly the small ones.

Dynamic selection and injection of examples into prompts based on question similarity proved to be a very successful strategy. It allowed some models to arrive at the correct answer even for very complicated questions. Adding a new example is also a much quicker fix for unforeseen questions or data issues than the retraining required by fine-tuning.

While both local and OpenAI models were successful at answering questions from SQL databases using Reason+Act prompting, whether or not these models would be suitable for a general audience is a different question. Local LLMs in particular have little to no guarantees around bias and alignment, making open-ended questions vulnerable to these flaws.

Reason+Act traces by themselves were also a powerful asset in building trust in the model outputs. It was easy to read the full trace, see which queries the model tried and why it ran them, check the query results, and identify where the model went wrong or hallucinated. That isn’t possible with traditional fine-tuning or retraining methods.

I plan to expand upon this and explore using more complicated datasets and RDBMS (like PostgreSQL). I’ve published a video demonstrating these techniques and I’ve also made code available if you’d like to run your own experiments on your own data.
