The Big Idea
Most teams building AI agents for data analysis start by adapting a coding agent - give it SQL tools, point it at a database, and expect it to answer business questions. Databricks' engineering team tried exactly this with Genie, their natural language data agent, and found that generic coding agents fail badly in enterprise data environments. The failure modes are not random - they stem from three structural differences between software tasks and data tasks that no amount of prompting or fine-tuning fixes. Their response was to rebuild Genie's core architecture around three targeted innovations: specialized knowledge search, parallel thinking, and multi-LLM routing. The result was an accuracy jump from 32% to over 90% on internal benchmarks. This post breaks down what those innovations actually are and what they mean for anyone building production data agents.
Before vs After
Generic coding agents are designed for environments with clear structure: a file system with known paths, a codebase with typed interfaces, unit tests that immediately tell you if something is wrong. Enterprise data environments have none of this. The gap between "it ran without errors" and "it returned the correct answer" is enormous when you are querying across thousands of tables built by different teams over years. Genie's architecture treats data analysis as a distinct problem category, not a sub-case of code generation.
Generic Coding Agent on Data
- Keyword search over table names - misses semantic relationships
- Single SQL attempt per query - no way to verify correctness
- One model handles planning, search, generation, and judging
- Assumes table metadata is complete and accurate
- 32% accuracy on enterprise data benchmarks
- No mechanism to distinguish a running query from a correct query
Genie - Purpose-Built Data Agent
- Semantic indices over data assets - up to 40% better table discovery
- Parallel solution sampling - multiple SQL candidates, best answer aggregated
- Specialized LLMs per subtask: planner, searcher, coder, judge
- Learns from existing queries and column usage patterns
- 90%+ accuracy on the same benchmarks
- Verification layer catches semantically wrong but syntactically valid SQL
How It Works
Genie's architecture layers three distinct innovations on top of each other.

The first is specialized knowledge search. Instead of relying on table name matching, Genie builds multiple semantic indices from existing data assets: column descriptions, prior successful queries, business glossary terms, and usage patterns. When a user asks a question, the retrieval system finds contextually relevant tables and columns even when the user's language does not match the schema naming conventions.

The second innovation is parallel thinking. Genie generates multiple candidate SQL queries for each question, executes them all, and uses an aggregation step to select or synthesize the final answer. This directly addresses the verification gap: without unit tests, sampling multiple trajectories and comparing results is the most reliable way to catch errors.

The third is multi-LLM routing. Different language models are deployed for different subtasks - planning the query strategy, searching knowledge indices, writing SQL, and judging answer quality - with each model selected for its strengths on that specific task rather than using one general-purpose model for everything.
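Of the three, parallel thinking is the easiest to show in code. Here is a minimal sketch of the pattern, assuming a hypothetical `generate_sql` callable wrapping your LLM endpoint and a SQLite connection standing in for the warehouse. The post describes sampling and aggregation but not Genie's exact aggregation rule, so majority vote over executed results is one reasonable choice here, not the confirmed implementation:

```python
import sqlite3
from collections import Counter
from typing import Callable

def parallel_sql_answer(
    question: str,
    generate_sql: Callable[[str], str],  # hypothetical wrapper around your LLM endpoint
    conn: sqlite3.Connection,            # stand-in for the real warehouse connection
    n_candidates: int = 5,
):
    """Sample several SQL candidates, execute each, and return the
    result that the most candidates agree on."""
    results = []
    for _ in range(n_candidates):
        sql = generate_sql(question)
        try:
            rows = conn.execute(sql).fetchall()
            results.append(tuple(rows))  # tuple of rows -> hashable, so we can vote
        except sqlite3.Error:
            continue  # candidates that fail to execute are discarded
    if not results:
        raise RuntimeError("no candidate query executed successfully")
    # Agreement across independent samples substitutes for the unit
    # tests that open-ended data analysis does not have.
    winner, votes = Counter(results).most_common(1)[0]
    return list(winner), votes / n_candidates  # answer plus a crude confidence signal
```

Sampling at high temperature matters here: diverse candidates that converge on the same result are stronger evidence of correctness than repeated near-identical generations.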
The three innovations solve separate problems, and they compound. Better table discovery gives the parallel thinking step higher-quality context to work with, which means the candidate SQL queries start from a stronger foundation. The judge LLM then evaluates candidates that are more likely to be semantically correct rather than just syntactically valid. GEPA, a cost-accuracy optimization technique, tunes how the multi-LLM routing makes trade-offs, allowing the system to hit accuracy targets at lower inference cost by routing simpler queries to cheaper models.
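To make the routing idea concrete, here is a toy router. `Route`, `toy_complexity`, and the threshold are all invented for illustration; a GEPA-style optimizer would learn these choices from labeled traffic rather than hand-coding them:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Route:
    name: str
    cost_per_call: float          # illustrative units only
    answer: Callable[[str], str]  # hypothetical model wrapper

# Crude complexity proxy, invented for illustration: questions hinting
# at joins, cohorts, or time windows tend to need the stronger model.
def toy_complexity(q: str) -> float:
    signals = ("join", "cohort", "trend", "by month", "compare")
    return min(1.0, sum(s in q.lower() for s in signals) / 2 + len(q) / 400)

def route_query(question: str, cheap: Route, strong: Route,
                threshold: float = 0.5) -> str:
    """Send low-complexity questions to the cheap model and escalate
    the rest. A tuned system would learn the threshold and the
    complexity features instead of hand-coding them as done here."""
    chosen = cheap if toy_complexity(question) < threshold else strong
    return chosen.answer(question)
```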
Key Findings
- Table discovery is the bottleneck, not SQL generation. Most NL2SQL failures happen before a single line of SQL is written - the wrong tables get selected. Specialized knowledge search using semantic indices over existing data assets improves this by up to 40% (a retrieval sketch follows this list).
- Accuracy went from 32% to over 90% on internal benchmarks. The full architecture combining all three innovations produces this result; each innovation contributes independently but they compound significantly when combined.
- Parallel thinking improves accuracy across every LLM tested. This is not a model-specific finding - sampling multiple trajectories and aggregating is a generalizable pattern that benefits weaker and stronger models alike.
- Different LLMs have complementary strengths. No single model wins on all subtasks. A model that writes clean SQL may not be the best judge of whether a result is semantically correct. Multi-LLM routing exploits these differences rather than ignoring them.
- Specialized domain knowledge beats raw model capability. Genie outperforms generic agents not because it uses a more powerful model, but because it embeds enterprise data context - schema history, business terminology, prior query patterns - directly into the retrieval and generation pipeline.
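To ground the first finding, here is semantic table discovery in miniature. `embed` stands in for whatever embedding endpoint you use, and the per-table catalog text (description, column comments, prior queries) plays the role of the indexed data assets the post describes; this is a sketch, not Genie's retrieval system:

```python
import math
from typing import Callable, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def discover_tables(
    question: str,
    catalog: dict[str, str],              # table name -> description, column comments, prior queries
    embed: Callable[[str], list[float]],  # hypothetical embedding endpoint
    k: int = 5,
) -> list[str]:
    """Rank tables by semantic similarity between the question and each
    table's indexed text. This is what lets "churned accounts" surface a
    table named fct_customer_attrition that keyword matching would miss.
    In production the catalog vectors would be precomputed and stored in
    a vector index, not embedded per query as done here."""
    q_vec = embed(question)
    scored = [(cosine(q_vec, embed(text)), name) for name, text in catalog.items()]
    return [name for _, name in sorted(scored, reverse=True)[:k]]
```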
Why This Matters for AI and Automation Practitioners
- The NL2SQL problem is not solved by better prompting. Genie's results make clear that accuracy gains at this scale require architectural changes - semantic retrieval, parallel sampling, and task-specialized routing - not prompt engineering. If your data agent is stuck below 60% accuracy, the bottleneck is likely the retrieval layer, not the generation layer.
- Parallel sampling is a transferable pattern. Any workflow without reliable ground truth verification benefits from this approach. Document processing, email classification, contract review - if you cannot write a unit test for the output, sampling multiple candidates and using an LLM judge to select the best answer is a systematic improvement strategy (see the sketch after this list).
- Multi-LLM routing is worth the added complexity. The operational overhead of maintaining multiple model endpoints is real, but the accuracy gains justify it for production data agents where a wrong answer has business consequences. Cost-accuracy optimization tools like GEPA make this more tractable at scale.
- Source-of-truth governance is still a human problem. Genie's knowledge search can retrieve contextually relevant tables, but it cannot resolve conflicting business definitions baked into different datasets. A "revenue" figure that means different things to finance and product will produce inconsistent answers regardless of how good the agent is. Data governance upstream determines the ceiling for any data agent's accuracy.
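Here is the transferable version of the sampling pattern for outputs you cannot execute. `generate` and `judge` are hypothetical model wrappers, and `consensus_judge` is a cheap non-LLM fallback based on the observation that independent samples agree more often on correct answers than on any particular error:

```python
from typing import Callable

def sample_and_judge(
    task: str,
    generate: Callable[[str], str],          # hypothetical generator model
    judge: Callable[[str, list[str]], int],  # hypothetical judge: returns index of best candidate
    n: int = 4,
) -> str:
    """Parallel thinking for outputs you cannot execute or unit-test
    (summaries, classifications, extractions): sample several candidates
    independently, then have a separate judge pick one."""
    candidates = [generate(task) for _ in range(n)]
    return candidates[judge(task, candidates)]

def consensus_judge(task: str, candidates: list[str]) -> int:
    """Cheap fallback judge: pick the candidate most similar to the
    others (token-overlap medoid)."""
    def overlap(a: str, b: str) -> float:
        ta, tb = set(a.split()), set(b.split())
        return len(ta & tb) / max(1, len(ta | tb))
    scores = [sum(overlap(c, o) for o in candidates if o is not c) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)
```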
Specialized Knowledge Search
Semantic indices over data assets - table descriptions, column metadata, prior query history. Solves the discovery problem before SQL generation starts.
Parallel Thinking
Multiple SQL trajectories sampled and aggregated. Substitutes for the unit tests that don't exist in open-ended data analysis. Works across all LLM sizes.
Multi-LLM Routing
Separate models for planning, search, code generation, and judging. Exploits complementary strengths; GEPA tunes cost-accuracy trade-offs per query type.
My Take
Two things in this post hold up under scrutiny, and one does not. The +40% table discovery improvement and the parallel thinking pattern are both grounded in a real insight - data agents fail at retrieval and have no verification mechanism, and these innovations directly address both problems. The 32% to 90%+ accuracy claim is harder to evaluate: Databricks controls the benchmark, controls the data, and controls how success is defined. Enterprise data environments are vastly more heterogeneous than any internal test set.

That said, the direction is right. Most practitioners building data agents are losing to retrieval failures they are misdiagnosing as generation failures. The practical takeaway is to audit where your agent actually breaks down - if wrong tables are being selected, fix the retrieval layer first. If SQL is generated correctly but returns wrong results, parallel sampling with a judge is a meaningful intervention. The multi-LLM routing adds operational complexity that is only worth it at production scale with real reliability requirements. For most builders at the prototype stage, a single strong model with better retrieval and parallel sampling will get you most of the way there.
Discussion Question
If you are building a data agent today, where does it actually fail - does it select the wrong tables, generate syntactically valid but semantically wrong SQL, or return correct SQL on the wrong data because the source of truth is inconsistent? Which of Genie's three innovations would have the most impact on your specific failure mode?