Tabular Data Generation using Binary Diffusion
September 20, 2024
Authors: Vitaliy Kinakh, Slava Voloshynovskiy
cs.AI
Abstract
Generating synthetic tabular data is critical in machine learning, especially
when real data is limited or sensitive. Traditional generative models often
face challenges due to the unique characteristics of tabular data, such as
mixed data types and varied distributions, and require complex preprocessing or
large pretrained models. In this paper, we introduce a novel, lossless binary
transformation method that converts any tabular data into fixed-size binary
representations, and a corresponding new generative model called Binary
Diffusion, specifically designed for binary data. Binary Diffusion leverages
the simplicity of XOR operations for noise addition and removal and employs
binary cross-entropy loss for training. Our approach eliminates the need for
extensive preprocessing, complex noise parameter tuning, and pretraining on
large datasets. We evaluate our model on several popular tabular benchmark
datasets, demonstrating that Binary Diffusion outperforms existing
state-of-the-art models on Travel, Adult Income, and Diabetes datasets while
being significantly smaller in size.
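
To make the two ideas in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The helper names (`encode_int`, `decode_int`, `xor_noise`, `bce`), the integer-only column encoding, the example column range, and the flip probability are all illustrative assumptions; the abstract does not specify how columns are encoded or how noise is scheduled.

```python
import math
import random


def encode_int(value, lo, n_bits):
    """Fixed-width binary code of an integer, offset by an assumed column minimum."""
    offset = value - lo
    if not 0 <= offset < 2 ** n_bits:
        raise ValueError("value does not fit in n_bits")
    # Most significant bit first.
    return [(offset >> i) & 1 for i in reversed(range(n_bits))]


def decode_int(bits, lo):
    """Inverse of encode_int, so the round trip is lossless."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value + lo


def xor_noise(bits, flip_prob, rng):
    """Sample a random bit mask and XOR it into the clean bits."""
    mask = [1 if rng.random() < flip_prob else 0 for _ in bits]
    noisy = [b ^ m for b, m in zip(bits, mask)]
    return noisy, mask


def bce(targets, probs, eps=1e-7):
    """Mean binary cross-entropy between binary targets and predicted probabilities."""
    total = 0.0
    for t, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(targets)


rng = random.Random(0)
AGE_LO, AGE_BITS = 17, 7  # hypothetical column minimum and bit width

clean = encode_int(42, AGE_LO, AGE_BITS)
noisy, mask = xor_noise(clean, flip_prob=0.3, rng=rng)

# XOR is its own inverse: applying the same mask again removes the noise.
recovered = [n ^ m for n, m in zip(noisy, mask)]
assert recovered == clean
assert decode_int(recovered, AGE_LO) == 42

# A stand-in "model" that outputs 0.5 for every bit; a real denoiser would be a network.
print(bce(clean, [0.5] * len(clean)))
```

Because XOR is self-inverse, a model that predicts the mask well enough also recovers the clean bits, and since every target is a bit, binary cross-entropy is a natural per-bit training signal.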