Article 11 · May 2026

How Databricks Built a Data Agent That Actually Works

May 16, 2026 · by Satish K C · 8 min read
Agents · RAG · LLMs · Data Engineering

The Big Idea

Most teams building AI agents for data analysis start by adapting a coding agent - give it SQL tools, point it at a database, and expect it to answer business questions. Databricks' engineering team tried exactly this with Genie, their natural language data agent, and found that generic coding agents fail badly in enterprise data environments. The failure modes are not random - they stem from three structural differences between software tasks and data tasks that no amount of prompting or fine-tuning fixes. Their response was to rebuild Genie's core architecture around three targeted innovations: specialized knowledge search, parallel thinking, and multi-LLM routing. The result was an accuracy jump from 32% to over 90% on internal benchmarks. This post breaks down what those innovations actually are and what they mean for anyone building production data agents.

Before vs After

Generic coding agents are designed for environments with clear structure: a file system with known paths, a codebase with typed interfaces, unit tests that immediately tell you if something is wrong. Enterprise data environments have none of this. The gap between "it ran without errors" and "it returned the correct answer" is enormous when you are querying across thousands of tables built by different teams over years. Genie's architecture treats data analysis as a distinct problem category, not a sub-case of code generation.

Generic Coding Agent on Data

  • Keyword search over table names - misses semantic relationships
  • Single SQL attempt per query - no way to verify correctness
  • One model handles planning, search, generation, and judging
  • Assumes table metadata is complete and accurate
  • 32% accuracy on enterprise data benchmarks
  • No mechanism to distinguish a running query from a correct query

Genie - Purpose-Built Data Agent

  • Semantic indices over data assets - 40% better table discovery
  • Parallel solution sampling - multiple SQL candidates, best answer aggregated
  • Specialized LLMs per subtask: planner, searcher, coder, judge
  • Learns from existing queries and column usage patterns
  • 90%+ accuracy on the same benchmarks
  • Verification layer catches semantically wrong but syntactically valid SQL
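The semantic-index idea can be illustrated with a minimal sketch. The catalog, table names, and descriptions below are hypothetical, and the bag-of-words cosine similarity stands in for the real embedding-based indices Genie builds from column metadata and query history; the point is only that ranking by description overlap finds tables a name-match search would miss.

```python
from collections import Counter
import math

# Hypothetical catalog: in Genie these descriptions would be assembled from
# column metadata, business glossary terms, and prior successful queries.
CATALOG = {
    "fct_orders": "customer purchase transactions revenue order date amount",
    "dim_customer": "customer account name region segment signup",
    "stg_web_logs": "raw clickstream page views session events",
}

def _vec(text: str) -> Counter:
    # Crude token-count vector; a real system would use learned embeddings.
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def discover_tables(question: str, top_k: int = 2) -> list[str]:
    """Rank tables by semantic overlap with the question, not by name match."""
    q = _vec(question)
    ranked = sorted(CATALOG, key=lambda t: _cosine(q, _vec(CATALOG[t])), reverse=True)
    return ranked[:top_k]

print(discover_tables("total revenue from customer purchases last quarter"))
```

Note that the question never mentions "fct_orders", yet that table ranks first because its description shares vocabulary with the question, which is the discovery behavior keyword matching cannot provide.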

How It Works

Genie's architecture has three distinct innovations layered on top of each other. The first is specialized knowledge search - instead of relying on table name matching, Genie builds multiple semantic indices from existing data assets: column descriptions, prior successful queries, business glossary terms, and usage patterns. When a user asks a question, the retrieval system finds contextually relevant tables and columns even when the user's language does not match the schema naming conventions.

The second innovation is parallel thinking - Genie generates multiple candidate SQL queries for each question, executes them all, and uses an aggregation step to select or synthesize the final answer. This directly addresses the verification gap: without unit tests, sampling multiple trajectories and comparing results is the most reliable way to catch errors.

The third is multi-LLM routing - different language models are deployed for different subtasks (planning the query strategy, searching knowledge indices, writing SQL, and judging answer quality), with each model selected for its strengths on that specific task rather than using one general-purpose model for everything.
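The parallel-thinking loop can be sketched in a few lines. Everything here is illustrative: `generate_sql_candidates` is a hypothetical stand-in for sampled LLM generations (different temperatures, prompts, or models), and the in-memory SQLite table stands in for the warehouse. The shape of the loop - generate several candidates, execute them all concurrently, keep the results for aggregation - is the technique the post describes.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for sampled LLM generations; in Genie each candidate
# would come from a separately sampled trajectory.
def generate_sql_candidates(question: str) -> list[str]:
    return [
        "SELECT SUM(amount) FROM orders",                   # correct
        "SELECT SUM(amount) FROM orders WHERE amount > 0",  # equivalent here
        "SELECT COUNT(*) FROM orders",                      # syntactically valid, semantically wrong
    ]

def execute(sql: str) -> tuple:
    # Toy in-memory warehouse so every candidate runs against the same data.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?)", [(10.0,), (25.0,)])
    try:
        return tuple(conn.execute(sql).fetchone())
    finally:
        conn.close()

def run_candidates(question: str) -> list[tuple[str, tuple]]:
    """Execute all candidate queries in parallel; return (sql, result) pairs."""
    candidates = generate_sql_candidates(question)
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(execute, candidates))
    return list(zip(candidates, results))
```

Running this, the first two candidates agree on `(35.0,)` while the third returns `(2,)` - all three "ran without errors", which is exactly why execution success alone cannot serve as verification.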

Genie Data Agent - Architecture Pipeline
[Figure: user query (natural language) → knowledge search (semantic table index, column + query history; +40% discovery accuracy) → parallel thinking (SQL candidates A/B/C; 32% → 90%+ accuracy) → multi-LLM routing (planner, coder, and judge LLMs; GEPA cost optimization) → verified answer]

The three innovations are designed to solve separate problems and they compound. Better table discovery gives the parallel thinking step higher-quality context to work with, which means the candidate SQL queries start from a stronger foundation. The judge LLM then evaluates candidates that are more likely to be semantically correct rather than just syntactically valid. GEPA - Genie's own cost-accuracy optimization technique - tunes how the multi-LLM routing makes trade-offs, allowing the system to hit accuracy targets at lower inference cost by routing simpler queries to cheaper models.
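The routing trade-off can be sketched as a toy policy. The model tiers, prices, and complexity heuristic below are all hypothetical; GEPA tunes these trade-offs with a learned policy against accuracy targets, not hand-written rules like these. The sketch only shows the shape of the decision: cheap model by default, escalate when the query carries analytical complexity signals.

```python
# Hypothetical model tiers with made-up per-request costs.
MODEL_TIERS = {"cheap": 0.001, "strong": 0.03}

def complexity(question: str) -> int:
    """Crude complexity score: count analytical signals plus a length term."""
    signals = ["join", "cohort", "year over year", "percentile", "window"]
    q = question.lower()
    return sum(s in q for s in signals) + len(question.split()) // 15

def route(question: str) -> str:
    # Simple lookups go to the cheap model; queries with analytical
    # signals escalate to the stronger, more expensive one.
    return "strong" if complexity(question) >= 1 else "cheap"

print(route("show me total sales"))
print(route("year over year cohort retention by region"))
```

Even this crude version captures why routing lowers average inference cost: most enterprise questions are simple aggregations, so only the hard tail pays the strong-model price.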

The core insight: The absence of unit tests is not a minor inconvenience - it is a structural property of data analysis that changes everything about how you build the agent. Parallel sampling is a direct architectural response to having no ground truth verification.
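The aggregation step that substitutes for unit tests can be sketched as a majority vote over executed results. This is a minimal illustration, not Genie's actual aggregator (the post says it may select or synthesize answers, and a judge LLM is involved); the escalation path on no-consensus is likewise an assumption.

```python
from collections import Counter

def aggregate(results: list[tuple]) -> tuple:
    """Pick the most common result across candidate executions.

    Candidates that disagree with the consensus are treated as suspect -
    agreement across independently sampled trajectories stands in for the
    ground-truth verification that data analysis lacks.
    """
    tally = Counter(results)
    winner, votes = tally.most_common(1)[0]
    if votes == 1:
        # No two candidates agree: escalate rather than guess.
        raise ValueError("no consensus across candidates; escalate to a judge")
    return winner

print(aggregate([(35.0,), (35.0,), (2,)]))
```

With the three candidate results from earlier, the two agreeing `(35.0,)` results outvote the semantically wrong `(2,)`, so the wrong-but-runnable query is rejected without any ground truth ever being consulted.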

Key Findings

  • +40% - Table Discovery Improvement
  • 90%+ - Accuracy (up from 32%)
  • 3 - Core Architectural Innovations

Why This Matters for AI and Automation Practitioners

01

Specialized Knowledge Search

Semantic indices over data assets - table descriptions, column metadata, prior query history. Solves the discovery problem before SQL generation starts.

02

Parallel Thinking

Multiple SQL trajectories sampled and aggregated. Substitutes for the unit tests that don't exist in open-ended data analysis. Works across all LLM sizes.

03

Multi-LLM Routing

Separate models for planning, search, code generation, and judging. Exploits complementary strengths; GEPA tunes cost-accuracy trade-offs per query type.

My Take

Two things in this post hold up under scrutiny and one does not. The +40% table discovery improvement and the parallel thinking pattern are both grounded in a real insight - data agents fail at retrieval and have no verification mechanism, and these innovations directly address both problems. The 32% to 90%+ accuracy claim is harder to evaluate: Databricks controls the benchmark, controls the data, and controls how success is defined. Enterprise data environments are vastly more heterogeneous than any internal test set.

That said, the direction is right. Most practitioners building data agents are losing to retrieval failures they are misdiagnosing as generation failures. The practical takeaway is to audit where your agent actually breaks down - if wrong tables are being selected, fix the retrieval layer first. If SQL is generated correctly but returns wrong results, parallel sampling with a judge is a meaningful intervention. The multi-LLM routing adds operational complexity that is only worth it at production scale with real reliability requirements. For most builders at the prototype stage, a single strong model with better retrieval and parallel sampling will get you most of the way there.

Discussion Question

If you are building a data agent today, where does it actually fail - does it select the wrong tables, generate syntactically valid but semantically wrong SQL, or return correct SQL on the wrong data because the source of truth is inconsistent? Which of Genie's three innovations would have the most impact on your specific failure mode?
