SQaLe: A Large Text-to-SQL Corpus Grounded in Real Schemas
Published in AI for Tabular Data Workshop @ EurIPS 2025, 2025
SQaLe is a large-scale, semi-synthetic text-to-SQL corpus containing 517,676 validated (schema, question, query) triples. As described in the paper (see the pipeline diagram on page 3 and the dataset statistics on page 4 of the PDF), it is built from 135,875 real-world database schemas derived from SchemaPile and expanded through a pipeline of schema extension, natural-language question generation, and SQL query synthesis with ReFoRCE. The dataset exhibits realistic schema scales (median 91 tables, 435 columns), diverse SQL operator usage, and controlled join complexity, making it one of the most structurally rich resources for training and evaluating text-to-SQL models.
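To make the notion of a (schema, question, query) triple concrete, the following is a minimal sketch in Python. The `Triple` class, the `executes` helper, and the toy schema are hypothetical and are not taken from the paper or the dataset; the check shown here (running the query against an empty in-memory SQLite database) is only an assumed stand-in for validation, since the paper's actual validation procedure is not described in this summary.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Triple:
    """Hypothetical container for one (schema, question, query) record."""
    schema: str    # DDL statements, e.g. CREATE TABLE ...
    question: str  # natural-language question
    query: str     # SQL query intended to answer the question


def executes(triple: Triple) -> bool:
    """Return True if the query runs without error against the schema.

    Assumption: executability on an empty SQLite database is used here
    as a simple proxy for validity; it says nothing about whether the
    query actually answers the question.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(triple.schema)          # build the empty tables
        conn.execute(triple.query).fetchall()      # run the candidate query
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


# Toy example (invented for illustration, not drawn from the corpus).
example = Triple(
    schema="""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        total REAL
    );
    """,
    question="What is the total order value per customer?",
    query=(
        "SELECT c.name, SUM(o.total) FROM customers c "
        "JOIN orders o ON o.customer_id = c.id GROUP BY c.name"
    ),
)

# A deliberately broken query that references a column not in the schema.
broken = Triple(example.schema, example.question, "SELECT missing_col FROM orders")

print(executes(example), executes(broken))
```

In this sketch the first triple passes and the second is rejected because `missing_col` does not exist in `orders`.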
Recommended citation: Wolff, C., Gomm, D., & Hulsebos, M. SQaLe: A large text-to-SQL corpus grounded in real schemas. In EurIPS 2025 Workshop: AI for Tabular Data.
