Tabular Data Generation using Binary Diffusion

September 20, 2024
Authors: Vitaliy Kinakh, Slava Voloshynovskiy
cs.AI

Abstract

Generating synthetic tabular data is critical in machine learning, especially when real data is limited or sensitive. Traditional generative models often face challenges due to the unique characteristics of tabular data, such as mixed data types and varied distributions, and require complex preprocessing or large pretrained models. In this paper, we introduce a novel, lossless binary transformation method that converts any tabular data into fixed-size binary representations, and a corresponding new generative model called Binary Diffusion, specifically designed for binary data. Binary Diffusion leverages the simplicity of XOR operations for noise addition and removal and employs binary cross-entropy loss for training. Our approach eliminates the need for extensive preprocessing, complex noise parameter tuning, and pretraining on large datasets. We evaluate our model on several popular tabular benchmark datasets, demonstrating that Binary Diffusion outperforms existing state-of-the-art models on Travel, Adult Income, and Diabetes datasets while being significantly smaller in size.
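
As a rough illustration of the two ideas in the abstract, the sketch below encodes one table row as a fixed-size bit vector and then applies XOR noise to it. The abstract does not spell out the encoding or the training target, so several choices here are assumptions: the 64-bit IEEE-754 float encoding, the index-based categorical encoding, the example column values, the flip probability, and the use of the noise mask as the binary cross-entropy target. All function names are hypothetical and are not taken from the authors' code.

```python
import math
import random
import struct


def float_to_bits(x: float) -> list:
    """Encode a Python float as its 64 IEEE-754 bits (exactly invertible)."""
    (as_int,) = struct.unpack(">Q", struct.pack(">d", x))
    return [(as_int >> i) & 1 for i in range(63, -1, -1)]


def bits_to_float(bits: list) -> float:
    """Invert float_to_bits."""
    as_int = 0
    for b in bits:
        as_int = (as_int << 1) | b
    (x,) = struct.unpack(">d", struct.pack(">Q", as_int))
    return x


def category_to_bits(value: str, categories: list) -> list:
    """Encode a categorical value as the binary index of its category."""
    width = max(1, (len(categories) - 1).bit_length())
    idx = categories.index(value)
    return [(idx >> i) & 1 for i in range(width - 1, -1, -1)]


def bits_to_category(bits: list, categories: list) -> str:
    """Invert category_to_bits."""
    idx = 0
    for b in bits:
        idx = (idx << 1) | b
    return categories[idx]


def xor_noise(bits: list, flip_prob: float, rng: random.Random):
    """Flip each bit independently with probability flip_prob via XOR.

    Returns (noisy_bits, mask), where noisy_bits[i] = bits[i] XOR mask[i].
    """
    mask = [1 if rng.random() < flip_prob else 0 for _ in bits]
    noisy = [b ^ m for b, m in zip(bits, mask)]
    return noisy, mask


def xor_denoise(noisy: list, predicted_mask: list) -> list:
    """Remove noise by XOR-ing the noisy bits with a predicted mask."""
    return [b ^ m for b, m in zip(noisy, predicted_mask)]


def bce(targets: list, probs: list, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between 0/1 targets and probabilities."""
    total = 0.0
    for t, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(targets)


# Hypothetical two-column row: one continuous value, one categorical value.
CATEGORIES = ["private", "public", "self-employed"]
row_bits = float_to_bits(39.5) + category_to_bits("public", CATEGORIES)

rng = random.Random(0)
noisy_bits, mask = xor_noise(row_bits, flip_prob=0.3, rng=rng)

# A perfect mask prediction undoes the noise exactly.
recovered = xor_denoise(noisy_bits, mask)
decoded = (bits_to_float(recovered[:64]), bits_to_category(recovered[64:], CATEGORIES))

# Loss for an uninformed model that outputs 0.5 for every bit of the mask.
uninformed_loss = bce(mask, [0.5] * len(mask))
```

Because XOR is its own inverse, applying the same mask a second time restores the original bits, so a model that predicts the mask well can undo the noise. In this sketch every row maps to the same number of bits regardless of column type, which is what lets a single binary model handle mixed data.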
