Tabby: Tabular Data Synthesis with Language Models
Abstract
While advances in large language models (LLMs) have greatly improved the quality of synthetic text data in recent years, synthesizing tabular data has received relatively less attention. We address this disparity with Tabby, a simple but powerful post-training modification to the standard Transformer language model architecture, enabling its use for tabular dataset synthesis. Tabby enables the representation of differences across columns using Gated Mixture-of-Experts, with column-specific sets of parameters. Empirically, Tabby results in data quality near or equal to that of real data. By pairing our novel LLM table training technique, Plain, with Tabby, we observe up to a 44% improvement in quality over previous methods. We also show that Tabby extends beyond tables to more general structured data, reaching parity with real data on a nested JSON dataset as well.
Summary
AI-Generated Summary
Paper Overview
Core Contribution
- Introduces Tabby, a post-training modification to transformer-based LLMs for tabular data synthesis.
- Uses Gated Mixture-of-Experts (MoE) layers to model column-specific parameters.
- Achieves synthetic data quality near or equal to that of real data.
- Extends beyond tables to general structured data (e.g., nested JSON).
Research Context
- Tabular data synthesis has received less attention than text and image synthesis.
- Challenges include complex column interdependencies, mixed modalities, and spurious correlations.
- Prior approaches include GANs, LLMs, and diffusion models, but these require significant preprocessing.
Keywords
- Tabular data synthesis
- Language models (LLMs)
- Mixture-of-Experts (MoE)
- Gated MoE
- Structured data
- Plain training technique
Background
Research Gap
- Lack of specialized architectures for tabular data synthesis.
- Limited focus on structured data modalities beyond tables.
- Need for models that handle mixed data types and complex dependencies.
Technical Challenges
- Modeling complex interdependencies between columns.
- Handling mixed modalities (text, numerical, nested data).
- Avoiding spurious correlations due to column order.
Prior Approaches
- GAN- and VAE-based models (e.g., CTGAN, TVAE) struggle with mode collapse and complex distributions.
- Diffusion models (e.g., Tab-DDPM) require strong assumptions and preprocessing.
- LLMs (e.g., GReaT, TapTap, Tabula) focus on training techniques but lack architectural modifications.
Methodology
Technical Architecture
- Tabby replaces select LLM blocks with MoE layers, allowing column-specific parameter sets.
- MoE layers increase model expressivity for tabular data.
- Plain training technique simplifies LLM fine-tuning for tabular data.
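The column-specific parameter sets described above can be illustrated with a minimal sketch. It assumes the gate is a hard routing by column index (one linear "expert" per column) and uses plain Python lists instead of a deep learning framework. The class name `ColumnRoutedLinear` and all dimensions are hypothetical and are not taken from the paper's implementation.

```python
import random


class ColumnRoutedLinear:
    """One linear 'expert' per table column; the column index selects the expert.

    Hypothetical illustration only, not the authors' implementation.
    """

    def __init__(self, num_columns, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        # experts[c] is an (out_dim x in_dim) weight matrix owned by column c.
        self.experts = [
            [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
            for _ in range(num_columns)
        ]

    def forward(self, hidden, column_idx):
        # Hard gate (assumption): only the expert for the current column is used.
        weights = self.experts[column_idx]
        return [sum(w * h for w, h in zip(row, hidden)) for row in weights]


layer = ColumnRoutedLinear(num_columns=3, in_dim=4, out_dim=5)
hidden = [0.1, -0.2, 0.3, 0.4]
logits_col0 = layer.forward(hidden, column_idx=0)
logits_col2 = layer.forward(hidden, column_idx=2)
```

The same hidden state yields different outputs depending on which column is being generated, which is the sense in which each column has its own parameters.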
Implementation Details
- Tabby modifies the language modeling head or transformer MLPs/attention blocks.
- Plain training encodes tabular data as text with specialized tokens (<EOC>, <EOS>).
- Training process calculates losses per column, enabling per-column performance tracking.
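As a rough sketch of the two ideas above, the code below serializes a row as text with `<EOC>` after each column and `<EOS>` at the end of the row, then averages token-level losses per column. The `name is value` phrasing, the exact placement of the special tokens, and the helper names `encode_row` and `per_column_mean_loss` are assumptions made for illustration; the summary does not specify them.

```python
from collections import defaultdict

EOC, EOS = "<EOC>", "<EOS>"


def encode_row(row):
    """Serialize a row dict as text, closing each column with <EOC> (assumed format)."""
    parts = [f"{name} is {value}{EOC}" for name, value in row.items()]
    return "".join(parts) + EOS


def per_column_mean_loss(token_losses, token_columns):
    """Average token-level losses separately for each column."""
    totals, counts = defaultdict(float), defaultdict(int)
    for loss, col in zip(token_losses, token_columns):
        totals[col] += loss
        counts[col] += 1
    return {col: totals[col] / counts[col] for col in totals}


text = encode_row({"age": 42, "income": "high"})
# text == "age is 42<EOC>income is high<EOC><EOS>"

# Hypothetical token losses, each tagged with the column the token belongs to.
losses = per_column_mean_loss([0.5, 1.5, 2.0], ["age", "age", "income"])
# losses == {"age": 1.0, "income": 2.0}
```

Tracking the loss per column in this way makes it possible to see which columns the model finds hardest to learn.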
Innovation Points
- First architecture modification to make LLMs better suited for table generation.
- Combines MoE layers with LLMs for higher-fidelity synthetic data.
- Extends to nested JSON and other structured data modalities.
Results
Experimental Setup
- Evaluated on six tabular datasets (Diabetes, Travel, Adult, Abalone, Rainfall, House) and one nested JSON dataset.
- Metrics: Machine Learning Efficacy (MLE), Discrimination, Distance to Closest Record (DCR).
- Compared with GANs, diffusion models, and prior LLM-based approaches.
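Of the metrics listed above, Distance to Closest Record is the simplest to illustrate: for each synthetic row, take the distance to its nearest real row. The sketch below assumes purely numeric rows and Euclidean distance; the distance function used in the paper's evaluation is not stated in this summary, and the function name is hypothetical.

```python
import math


def distance_to_closest_record(synthetic, real):
    """For each synthetic row, return the Euclidean distance to its nearest real row."""
    return [min(math.dist(s, r) for r in real) for s in synthetic]


real_rows = [(0.0, 0.0), (3.0, 4.0)]
synthetic_rows = [(0.0, 0.0), (3.0, 0.0)]
dcr = distance_to_closest_record(synthetic_rows, real_rows)
# dcr == [0.0, 3.0]
```

A distance of 0 means the synthetic row is an exact copy of a real row, so very small DCR values can signal memorization rather than genuine synthesis.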
Key Findings
- Tabby achieves MLE parity with real data on 4/6 tabular datasets.
- Plain-trained Tabby models outperform prior methods, including Tab-DDPM.
- Tabby extends to nested JSON data, achieving parity with real data.
- Smaller Tabby models outperform larger non-Tabby LLMs.
Limitations
- Tabby’s parameter count scales with the number of columns, though parameter sharing can mitigate this.
- Plain training, while effective, may not handle all dataset complexities.
- Limited evaluation on extremely high-dimensional datasets.
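The first limitation, parameter count growing with the number of columns, can be made concrete with some back-of-the-envelope arithmetic. The sketch assumes only the language modeling head is replicated per column, that the head is a single bias-free linear layer, and that the base model has a hidden size of 768 and a vocabulary of 50257 (GPT-2-sized values chosen for illustration, not figures reported in the paper).

```python
def head_params(hidden_size, vocab_size, num_columns=1):
    """Parameters in num_columns bias-free linear heads of shape hidden x vocab."""
    return hidden_size * vocab_size * num_columns


single_head = head_params(768, 50257)                    # 38,597,376
six_column_heads = head_params(768, 50257, num_columns=6)  # 231,584,256
extra_params = six_column_heads - single_head            # 192,986,880
```

Under these assumptions the head grows linearly with the column count, which is why sharing parameters across columns matters for wide tables.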