fine-tuning a text2SQL model

July 6, 2026 · 4 min read

Goal - fine tune a model to be better at text2SQL

Platform used:

tinker labs for model fine-tuning (very easy to setup and run fine-tune jobs)
wandb (weights and biases) for MLOPs

Dataset #

first things ive done are piecing together infra and building skills (borrowed a few instructions from a claw, which has skills) for my cc to do better. then, fetch and transform all the sql queries which are generated and validated (success:true) against sql engine from our internal traces (past 90 days). pick the ones that can be solved via single sql query (multiple required agentic reasoning + execution)

Json:{“query”, “sql”}
Chat format: {messages: [{role, content}]} - where we only compute training loss on assistant message (which has the SQL based on system prompt and user query)

My initial v0 number for the dataset - 494 samples. For test samples, later in the experiment - ive pulled from the evals traces we have for in our chat, which helped validate the model performance better

Method used - LoRA (Low rank Adaptation) - freezes the base model and learns a small pair of low-rank matrices per layer. You train ~0.1–1% as many parameters as full fine-tuning. Cheap, fast - but its low “rank” limits how much new discrete information (like a column dictionary) it can absorb. Strong at nudging behavior, weak at rote memorization.

Rank - the size of the lora adapter, we used 32 rank at the start, then higher rank 64 = more capacity (more cost)

Validation - here, the validation for me is not the assets that it fetched, its the success metric along with the gold SQL, without any syntax issues, relationships, and programmatic errors.

started the training with Qwen3.5-9B (dense model)

The initial run, v0 (32 rank/ 3 epochs), to compare our fine tuned model with actual model scores
- The actual model scored - 17% and then our model scored 92%, but later realised, the test and train set has same examples, so that failed.
v1, added (268 real + 205 synth) data with rank 64/ 3 epochs - gave us 71%, but the issue is LoRA cannot learns facts but can learn patterns. facts map to high entropy lookup table, but adapter is low dimensional which holds the pattern (e.g., “join detail tables to gold.assets on guid, filter status=‘ACTIVE’”)

i learnt that, its not true - bigger model can perform better in domain specific SQL compared to smaller fine tuned models. The paper - Arctic-text2sql-r1 - 7B, which matches/beats deepseek-v3 and SLM-SQL hits 67% on BIRD, surpassing bigger models.

Another learning is - higher ratio of synthetic data doesnt help (atleast in my case of text2sql), it actually creates noise in weights - leading to regression of predicting tokens

Spent sometime on building bigger dataset as the leaning rate was flat with smaller set, scaled the dataset by 4x, with only 6% synthetic queries. The Qwen3.5-9B model performed the best with hybrid strategy - 91%

all the patterns in LoRA adapters, and the facts - schema and columns names (~150) are in system prompt along with a few attribution rules.
Tried with greedy decoding (temp=0), but it went into repetition loops generating noise, so was tweaking and 0.6 got me good results - 85%
When tried with best of 3 execution pair, the model generates 3 sql queries for the same query, it shoot up to - 91%

*fixed hyperparameters throughout to make it stable and consistent. Which took multiple iterations, and wandb helped me a lot into logs for every run to monitor training loss, etc..

Next phase - curios to try MOE (mixture of experts) models, so choose “Qwen3.5-35B-A3B” and “Nemotron-30B-A3B”, simply, its a 35/30B params model, where only some “experts” (A3B active) fire per token.

Even with the same scaled data, these models performed little lesser compared to results of 9B model - stuck at 71-78% of accuracy these MOE models had similar problems - “invented new table names”, “referenced columns that exist but within a different table”“what was the NRR for 2025?” → invents gold.kpi_actuals;

^^ ive to go little deeper into different arch and methods used for these models. But, had great learnings.