analyst-buddy: a small fine-tuned SQL agent for people who don't write SQL

A continuation of SQLEnv: Teaching Small Models to Explore Databases. That post trained a 0.6B model as a proof of concept. This one takes a step toward turning the idea into a useful product: a data analyst for small-business owners, at 1.7B, with notes on what we learned along the way.

The problem

Small-business owners have the data but no analyst. The numbers sit in a database, and answering questions like "which store takes the most orders?" or "what is the average value of a paid order?" means learning a database system and a query language, which most owners have no time for.

analyst-buddy answers those questions in plain English. It returns the answer, a result table, a chart, and the query that produced it. An owner who isn't a data expert can explore their operations and generate insights through natural language.

How an analyst operates

Consider answering a question against an unfamiliar database. You rarely write the final query in one go. You query the data to see what columns exist, scan a few sample rows, then build the query piece by piece, adjusting joins, fixing column names, and retrying after errors. The answer emerges from iteration.

That iterative loop is what we trained, not one-shot pattern matching. The agent starts with only the table names and discovers the rest through four actions: DESCRIBE → SAMPLE → QUERY → ANSWER.
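The four actions can be pictured as a tiny environment over SQLite. This is only a sketch under assumptions: the class name `SqlExplorationEnv`, the text format of the observations, and the three-row sample size are all hypothetical, and the 10-step budget is taken from the step count mentioned later in this post. It is not the actual SQLEnv implementation.

```python
import sqlite3


class SqlExplorationEnv:
    """Toy environment exposing DESCRIBE, SAMPLE, QUERY and ANSWER (hypothetical API)."""

    def __init__(self, conn, max_steps=10):
        self.conn = conn
        self.max_steps = max_steps
        self.steps = 0
        self.done = False

    def table_names(self):
        # The only information the agent gets up front.
        rows = self.conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        return [row[0] for row in rows]

    def _check_table(self, table):
        if table not in self.table_names():
            raise ValueError(f"no such table: {table}")

    def step(self, action, argument):
        """Apply one action and return a text observation."""
        if self.done:
            return "Episode finished"
        self.steps += 1
        try:
            if action == "DESCRIBE":
                self._check_table(argument)
                cols = self.conn.execute(
                    f'PRAGMA table_info("{argument}")'
                ).fetchall()
                # Each row is (cid, name, type, notnull, default, pk).
                obs = ", ".join(f"{col[1]} {col[2]}" for col in cols)
            elif action == "SAMPLE":
                self._check_table(argument)
                rows = self.conn.execute(
                    f'SELECT * FROM "{argument}" LIMIT 3'
                ).fetchall()
                obs = repr(rows)
            elif action == "QUERY":
                obs = repr(self.conn.execute(argument).fetchall())
            elif action == "ANSWER":
                self.done = True
                obs = f"Answer submitted: {argument}"
            else:
                obs = f"Error: unknown action {action}"
        except (sqlite3.Error, ValueError) as exc:
            # Errors come back as observations so the agent can recover.
            obs = f"Error: {exc}"
        if self.steps >= self.max_steps:
            self.done = True
        return obs
```

Returning errors as ordinary observations, rather than ending the episode, is what leaves room for the retry behaviour described above.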

Fine-tuning lifts results across every database

On the four sample business databases the product targets (249 scoreable questions, including a pet-shop database the model never trained on), an off-the-shelf Qwen3-1.7B gets 4.4% correct. After fine-tuning, the same-size model gets 49.0% correct, an improvement of about 11×, and it improves on every database.

| Database | Off-the-shelf | Fine-tuned |
| --- | --- | --- |
| overall | 4.4% | 49.0% |
| retail_smb (the pet shop) | 0.0% | 42.3% |
| dog_kennels | 2.4% | 54.9% |
| student_assessment | 9.4% | 60.4% |
| department_store | 3.7% | 38.6% |
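The "about 11×" figure and the per-database gains follow directly from the table. A short check, using only the accuracies reported above (the function name `summarize` is just for illustration):

```python
# Accuracy in percent: (off-the-shelf, fine-tuned), copied from the table.
results = {
    "overall": (4.4, 49.0),
    "retail_smb": (0.0, 42.3),
    "dog_kennels": (2.4, 54.9),
    "student_assessment": (9.4, 60.4),
    "department_store": (3.7, 38.6),
}


def summarize(results):
    """Return {database: (gain in percentage points, ratio as text)}."""
    summary = {}
    for name, (base, tuned) in results.items():
        gain = round(tuned - base, 1)
        # A ratio is undefined when the baseline is zero.
        ratio = f"{tuned / base:.1f}x" if base > 0 else "n/a"
        summary[name] = (gain, ratio)
    return summary


for name, (gain, ratio) in summarize(results).items():
    print(f"{name:20s} +{gain:.1f} points  {ratio}")
```

The ratio is undefined for retail_smb because the baseline is 0.0%, so the gain in percentage points is the more useful comparison across databases.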

The off-the-shelf model tends to give up, return an empty result, or answer with the column name (store_id) instead of the value, and it uses about 9 of its 10 steps in the process. The fine-tuned agent explores first, then answers.

Take the question "Which store has taken the most orders?" (answer: Maple Street Pets). In one run, the fine-tuned agent describes the orders and stores tables, writes a JOIN, hits Error: no such column: t1.name, re-describes, and corrects to t2.name. This is the kind of error recovery the RL training is designed to produce.
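The recovery step can be reproduced against a toy database. The schema, the second store ("Harbor Pets"), and the order counts below are invented for illustration. Only the question, the error text, and the t1.name to t2.name correction come from the run described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stores (store_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, store_id INTEGER);
    INSERT INTO stores VALUES (1, 'Maple Street Pets'), (2, 'Harbor Pets');
    INSERT INTO orders (store_id) VALUES (1), (1), (1), (2);
""")


def run_query(conn, sql):
    """Return rows on success, or the error text the agent would observe."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        return f"Error: {exc}"


# First attempt: the name column is taken from the wrong alias (orders).
first_try = run_query(conn, """
    SELECT t1.name, COUNT(*) AS n
    FROM orders AS t1 JOIN stores AS t2 ON t1.store_id = t2.store_id
    GROUP BY t1.name ORDER BY n DESC LIMIT 1
""")

# After re-describing the tables, the column is read from stores instead.
second_try = run_query(conn, """
    SELECT t2.name, COUNT(*) AS n
    FROM orders AS t1 JOIN stores AS t2 ON t1.store_id = t2.store_id
    GROUP BY t2.name ORDER BY n DESC LIMIT 1
""")

print(first_try)
print(second_try)
```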

How we trained it

We trained the model with a supervised warm-up followed by two-phase GRPO reinforcement learning, as a full fine-tune of Qwen3-1.7B. Full fine-tuning and RL training ran on Modal on a single A100-80GB. The app is served on Hugging Face ZeroGPU.
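For readers new to GRPO, the core idea is that each sampled episode is scored relative to the other episodes sampled for the same question. The sketch below shows that group-relative advantage in its commonly described form. The use of population standard deviation and the `eps` value are assumptions here, and this is not our training code.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four episodes for one question, with a binary correct/incorrect reward.
print(group_advantages([1.0, 0.0, 0.0, 0.0]))

# If every episode gets the same reward, there is no learning signal.
print(group_advantages([1.0, 1.0, 1.0, 1.0]))
```

One consequence worth noting: questions the model always solves, or never solves, contribute zero advantage, so the useful signal comes from questions it solves only some of the time.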

We score all checkpoints on a held-out set of 195 questions, on databases the model never trained on.

What we learned: verifiable rewards don't remove the need for a held-out set

Our reward is execution-verified: the query is run and its result checked against the gold answer, a binary correct/incorrect signal. This kind of verifiable-reward RL is what made recent reasoning models like DeepSeek-R1 work, and the hope is that it teaches a general skill rather than memorized answers. But reported gains can be inflated by confounds such as dataset contamination, so a contamination-free evaluation is a sensible floor (Wu et al., 2025).
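A minimal sketch of an execution-verified reward, assuming the gold answer is stored as a list of result rows and that row order does not matter. Both are assumptions made for this example. The function name `execution_reward` and the toy table are hypothetical.

```python
import sqlite3
from collections import Counter


def execution_reward(conn, predicted_sql, gold_rows):
    """Run the predicted query and return 1.0 if its rows match the gold rows."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A query that fails to execute earns no reward.
        return 0.0
    # Compare as multisets: duplicates count, row order does not.
    return 1.0 if Counter(predicted_rows) == Counter(gold_rows) else 0.0


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT, total REAL);
    INSERT INTO orders (status, total) VALUES ('paid', 10.0), ('paid', 30.0), ('open', 99.0);
""")

gold = [(20.0,)]  # average value of a paid order in this toy table
print(execution_reward(conn, "SELECT AVG(total) FROM orders WHERE status = 'paid'", gold))
print(execution_reward(conn, "SELECT AVG(total) FROM orders", gold))
print(execution_reward(conn, "SELECT AVG(amount) FROM orders", gold))
```

Because the check compares results rather than query text, two differently written queries that return the same rows earn the same reward.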

When we trained longer and on more data, we saw signs of over-memorization: training accuracy kept rising while held-out accuracy fell, and the largest in-distribution gain, three-table solving, did not transfer. We have not run a controlled ablation separating the effect of more epochs from more data, so we treat this as a strong signal, not a proven mechanism. The evidence in the literature is mixed: some work finds RL generalizes better than supervised fine-tuning (Chu et al., 2025), while other work finds RLVR can narrow what a model solves as training continues (Yue et al., 2025). The latter is closer to what we saw. Watching a held-out set during training and stopping when it stops improving (standard in supervised learning, and applied in RLVR by evaluating every N steps) is our most concrete next step. We currently evaluate only at the end.
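The "evaluate every N steps and stop when held-out accuracy stops improving" idea could look like the sketch below. It describes the planned next step, not something we currently run. `train_step` and `evaluate_heldout` are hypothetical callables standing in for one RL update and one held-out evaluation, and `patience` is an assumed knob.

```python
def train_with_early_stopping(train_step, evaluate_heldout,
                              max_steps, eval_every, patience):
    """Train until held-out accuracy fails to improve `patience` evals in a row.

    Returns (best_step, best_accuracy) so the matching checkpoint can be kept.
    """
    best_acc = float("-inf")
    best_step = 0
    evals_without_gain = 0
    for step in range(1, max_steps + 1):
        train_step(step)
        if step % eval_every != 0:
            continue
        acc = evaluate_heldout(step)
        if acc > best_acc:
            best_acc, best_step, evals_without_gain = acc, step, 0
        else:
            evals_without_gain += 1
            if evals_without_gain >= patience:
                break
    return best_step, best_acc


# Made-up held-out curve that peaks at step 30 and then declines.
fake_curve = {10: 0.30, 20: 0.40, 30: 0.45, 40: 0.44, 50: 0.40, 60: 0.35}
steps_run = []
result = train_with_early_stopping(
    train_step=steps_run.append,
    evaluate_heldout=fake_curve.__getitem__,
    max_steps=60, eval_every=10, patience=2,
)
print(result, len(steps_run))
```

With this made-up curve the loop stops at step 50 and reports step 30 as the checkpoint to keep.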

Room for improvement

One- and two-table questions are reliable. Three-table joins remain unreliable and are our next tuning target. The agent also over-explores easy single-table questions, sometimes talking itself out of a correct shorter answer. The most promising route to better results is mining multi-table recovery examples and adding a small step cost on already-solved simple questions, rather than simply training more.
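One possible shape for the step cost is sketched below. Everything in it is an assumption: the function name `shaped_reward`, the cost of 0.02 per step, and the rule that the cost applies only to correct answers on questions flagged as already solved. None of it has been tested in training.

```python
def shaped_reward(correct, steps_used, already_solved, step_cost=0.02):
    """Binary correctness reward with a small per-step penalty.

    The penalty applies only to correct answers on questions the model
    already solves reliably, so exploration on hard questions stays free.
    """
    if not correct:
        return 0.0
    if not already_solved:
        return 1.0
    return max(0.0, 1.0 - step_cost * steps_used)


# Easy question: a correct 3-step episode now beats a correct 9-step one.
print(shaped_reward(True, 3, already_solved=True))
print(shaped_reward(True, 9, already_solved=True))

# Hard multi-table question: no penalty for taking the steps it needs.
print(shaped_reward(True, 9, already_solved=False))
```

Keeping the penalty off unsolved questions is meant to avoid discouraging the multi-step exploration that three-table joins still need.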

Try it