The Big Idea
Most teams building AI agents for data analysis start by adapting a coding agent - give it SQL tools, point it at a database, and expect it to answer business questions. Databricks' engineering team tried exactly this with Genie, their natural language data agent, and found that generic coding agents fail badly in enterprise data environments. The failure modes are not random - they stem from three structural differences between software tasks and data tasks that no amount of prompting or fine-tuning fixes. Their response was to rebuild Genie's core architecture around three targeted innovations: specialized knowledge search, parallel thinking, and multi-LLM routing. The result was an accuracy jump from 32% to over 90% on internal benchmarks. This post breaks down what those innovations actually are and what they mean for anyone building production data agents.
Before vs After
Generic coding agents are designed for environments with clear structure: a file system with known paths, a codebase with typed interfaces, unit tests that immediately tell you if something is wrong. Enterprise data environments have none of this. The gap between "it ran without errors" and "it returned the correct answer" is enormous when you are querying across thousands of tables built by different teams over years. Genie's architecture treats data analysis as a distinct problem category, not a sub-case of code generation.
Generic Coding Agent on Data
- Keyword search over table names - misses semantic relationships
- Single SQL attempt per query - no way to verify correctness
- One model handles planning, search, generation, and judging
- Assumes table metadata is complete and accurate
- 32% accuracy on enterprise data benchmarks
- No mechanism to distinguish a running query from a correct query
Genie - Purpose-Built Data Agent
- Semantic indices over data assets - up to 40% better table discovery
- Parallel solution sampling - multiple SQL candidates, best answer aggregated
- Specialized LLMs per subtask: planner, searcher, coder, judge
- Learns from existing queries and column usage patterns
- 90%+ accuracy on the same benchmarks
- Verification layer catches semantically wrong but syntactically valid SQL
How It Works
Genie's architecture layers three distinct innovations on top of each other.

The first is specialized knowledge search. Instead of relying on table name matching, Genie builds multiple semantic indices from existing data assets: column descriptions, prior successful queries, business glossary terms, and usage patterns. When a user asks a question, the retrieval system finds contextually relevant tables and columns even when the user's language does not match the schema naming conventions.

The second innovation is parallel thinking. Genie generates multiple candidate SQL queries for each question, executes them all, and uses an aggregation step to select or synthesize the final answer. This directly addresses the verification gap: without unit tests, sampling multiple trajectories and comparing results is the most reliable way to catch errors.

The third is multi-LLM routing. Different language models are deployed for different subtasks - planning the query strategy, searching knowledge indices, writing SQL, and judging answer quality - with each model selected for its strengths on that specific task rather than using one general-purpose model for everything.
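Of the three, parallel thinking is the easiest to show in code. Here is a minimal sketch of the pattern, assuming a hypothetical `generate_sql` callable wrapping your LLM endpoint and a SQLite connection standing in for the warehouse. The post describes sampling and aggregation but not Genie's exact aggregation rule, so majority vote over executed results is one reasonable choice here, not the confirmed implementation:

```python
import sqlite3
from collections import Counter
from typing import Callable

def parallel_sql_answer(
    question: str,
    generate_sql: Callable[[str], str],  # hypothetical wrapper around your LLM endpoint
    conn: sqlite3.Connection,            # stand-in for the real warehouse connection
    n_candidates: int = 5,
):
    """Sample several SQL candidates, execute each, and return the
    result that the most candidates agree on."""
    results = []
    for _ in range(n_candidates):
        sql = generate_sql(question)
        try:
            rows = conn.execute(sql).fetchall()
            results.append(tuple(rows))  # tuple of rows -> hashable, so we can vote
        except sqlite3.Error:
            continue  # candidates that fail to execute are discarded
    if not results:
        raise RuntimeError("no candidate query executed successfully")
    # Agreement across independent samples substitutes for the unit
    # tests that open-ended data analysis does not have.
    winner, votes = Counter(results).most_common(1)[0]
    return list(winner), votes / n_candidates  # answer plus a crude confidence signal
```

Sampling at high temperature matters here: diverse candidates that converge on the same result are stronger evidence of correctness than repeated near-identical generations.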
The three innovations solve separate problems, and they compound. Better table discovery gives the parallel thinking step higher-quality context to work with, which means the candidate SQL queries start from a stronger foundation. The judge LLM then evaluates candidates that are more likely to be semantically correct rather than just syntactically valid. GEPA, a cost-accuracy optimization technique, tunes how the multi-LLM routing makes trade-offs, allowing the system to hit accuracy targets at lower inference cost by routing simpler queries to cheaper models.
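To make the routing idea concrete, here is a toy router. `Route`, `toy_complexity`, and the threshold are all invented for illustration; a GEPA-style optimizer would learn these choices from labeled traffic rather than hand-coding them:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Route:
    name: str
    cost_per_call: float          # illustrative units only
    answer: Callable[[str], str]  # hypothetical model wrapper

# Crude complexity proxy, invented for illustration: questions hinting
# at joins, cohorts, or time windows tend to need the stronger model.
def toy_complexity(q: str) -> float:
    signals = ("join", "cohort", "trend", "by month", "compare")
    return min(1.0, sum(s in q.lower() for s in signals) / 2 + len(q) / 400)

def route_query(question: str, cheap: Route, strong: Route,
                threshold: float = 0.5) -> str:
    """Send low-complexity questions to the cheap model and escalate
    the rest. A tuned system would learn the threshold and the
    complexity features instead of hand-coding them as done here."""
    chosen = cheap if toy_complexity(question) < threshold else strong
    return chosen.answer(question)
```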
Key Findings
- Table discovery is the bottleneck, not SQL generation. Most NL2SQL failures happen before a single line of SQL is written - the wrong tables get selected. Specialized knowledge search using semantic indices over existing data assets improves this by up to 40% (a retrieval sketch follows this list).
- Accuracy went from 32% to over 90% on internal benchmarks. The full architecture combining all three innovations produces this result; each innovation contributes independently but they compound significantly when combined.
- Parallel thinking improves accuracy across every LLM tested. This is not a model-specific finding - sampling multiple trajectories and aggregating is a generalizable pattern that benefits weaker and stronger models alike.
- Different LLMs have complementary strengths. No single model wins on all subtasks. A model that writes clean SQL may not be the best judge of whether a result is semantically correct. Multi-LLM routing exploits these differences rather than ignoring them.
- Specialized domain knowledge beats raw model capability. Genie outperforms generic agents not because it uses a more powerful model, but because it embeds enterprise data context - schema history, business terminology, prior query patterns - directly into the retrieval and generation pipeline.
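To ground the first finding, here is semantic table discovery in miniature. `embed` stands in for whatever embedding endpoint you use, and the per-table catalog text (description, column comments, prior queries) plays the role of the indexed data assets the post describes; this is a sketch, not Genie's retrieval system:

```python
import math
from typing import Callable, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def discover_tables(
    question: str,
    catalog: dict[str, str],              # table name -> description, column comments, prior queries
    embed: Callable[[str], list[float]],  # hypothetical embedding endpoint
    k: int = 5,
) -> list[str]:
    """Rank tables by semantic similarity between the question and each
    table's indexed text. This is what lets "churned accounts" surface a
    table named fct_customer_attrition that keyword matching would miss.
    In production the catalog vectors would be precomputed and stored in
    a vector index, not embedded per query as done here."""
    q_vec = embed(question)
    scored = [(cosine(q_vec, embed(text)), name) for name, text in catalog.items()]
    return [name for _, name in sorted(scored, reverse=True)[:k]]
```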
Why This Matters for AI and Automation Practitioners
- The NL2SQL problem is not solved by better prompting. Genie's results make clear that accuracy gains at this scale require architectural changes - semantic retrieval, parallel sampling, and task-specialized routing - not prompt engineering. If your data agent is stuck below 60% accuracy, the bottleneck is likely the retrieval layer, not the generation layer.
- Parallel sampling is a transferable pattern. Any workflow without reliable ground truth verification benefits from this approach. Document processing, email classification, contract review - if you cannot write a unit test for the output, sampling multiple candidates and using an LLM judge to select the best answer is a systematic improvement strategy (see the sketch after this list).
- Multi-LLM routing is worth the added complexity. The operational overhead of maintaining multiple model endpoints is real, but the accuracy gains justify it for production data agents where a wrong answer has business consequences. Cost-accuracy optimization tools like GEPA make this more tractable at scale.
- Source-of-truth governance is still a human problem. Genie's knowledge search can retrieve contextually relevant tables, but it cannot resolve conflicting business definitions baked into different datasets. A "revenue" figure that means different things to finance and product will produce inconsistent answers regardless of how good the agent is. Data governance upstream determines the ceiling for any data agent's accuracy.
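Here is the transferable version of the sampling pattern for outputs you cannot execute. `generate` and `judge` are hypothetical model wrappers, and `consensus_judge` is a cheap non-LLM fallback based on the observation that independent samples agree more often on correct answers than on any particular error:

```python
from typing import Callable

def sample_and_judge(
    task: str,
    generate: Callable[[str], str],          # hypothetical generator model
    judge: Callable[[str, list[str]], int],  # hypothetical judge: returns index of best candidate
    n: int = 4,
) -> str:
    """Parallel thinking for outputs you cannot execute or unit-test
    (summaries, classifications, extractions): sample several candidates
    independently, then have a separate judge pick one."""
    candidates = [generate(task) for _ in range(n)]
    return candidates[judge(task, candidates)]

def consensus_judge(task: str, candidates: list[str]) -> int:
    """Cheap fallback judge: pick the candidate most similar to the
    others (token-overlap medoid)."""
    def overlap(a: str, b: str) -> float:
        ta, tb = set(a.split()), set(b.split())
        return len(ta & tb) / max(1, len(ta | tb))
    scores = [sum(overlap(c, o) for o in candidates if o is not c) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)
```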
Specialized Knowledge Search
Semantic indices over data assets - table descriptions, column metadata, prior query history. Solves the discovery problem before SQL generation starts.
Parallel Thinking
Multiple SQL trajectories sampled and aggregated. Substitutes for the unit tests that don't exist in open-ended data analysis. Works across all LLM sizes.
Multi-LLM Routing
Separate models for planning, search, code generation, and judging. Exploits complementary strengths; GEPA tunes cost-accuracy trade-offs per query type.
My Take
Two things in this post hold up under scrutiny, and one does not. The +40% table discovery improvement and the parallel thinking pattern are both grounded in a real insight - data agents fail at retrieval and have no verification mechanism, and these innovations directly address both problems. The 32% to 90%+ accuracy claim is harder to evaluate: Databricks controls the benchmark, controls the data, and controls how success is defined. Enterprise data environments are vastly more heterogeneous than any internal test set.

That said, the direction is right. Most practitioners building data agents are losing to retrieval failures they are misdiagnosing as generation failures. The practical takeaway is to audit where your agent actually breaks down - if wrong tables are being selected, fix the retrieval layer first. If SQL is generated correctly but returns wrong results, parallel sampling with a judge is a meaningful intervention. The multi-LLM routing adds operational complexity that is only worth it at production scale with real reliability requirements. For most builders at the prototype stage, a single strong model with better retrieval and parallel sampling will get you most of the way there.
Discussion Question
If you are building a data agent today, where does it actually fail - does it select the wrong tables, generate syntactically valid but semantically wrong SQL, or return correct SQL on the wrong data because the source of truth is inconsistent? Which of Genie's three innovations would have the most impact on your specific failure mode?