Generating Skyline Datasets for Data Science Models

February 16, 2025
Authors: Mengying Wang, Hanchao Ma, Yiyang Bian, Yangxin Fan, Yinghui Wu
cs.AI

Abstract

Preparing high-quality datasets for data-driven AI and machine learning models has become a cornerstone task in data-driven analysis. Conventional data discovery methods typically integrate datasets toward a single predefined quality measure, which may bias downstream tasks. This paper introduces MODis, a framework that discovers datasets by optimizing multiple user-defined model-performance measures. Given a set of data sources and a model, MODis selects and integrates data sources into a skyline dataset, over which the model is expected to achieve the desired performance on all of the performance measures. We formulate MODis as a multi-goal finite state transducer and derive three feasible algorithms to generate skyline datasets. Our first algorithm adopts a "reduce-from-universal" strategy that starts with a universal schema and iteratively prunes unpromising data. Our second algorithm further reduces the cost with a bi-directional strategy that interleaves data augmentation and reduction. We also introduce a diversification algorithm to mitigate bias in skyline datasets. We experimentally verify the efficiency and effectiveness of our skyline data discovery algorithms and showcase their applications in optimizing data science pipelines.
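To make the notion of a "skyline" concrete, the following is a minimal sketch of the generic Pareto-dominance test that skyline queries are built on. It is not the MODis algorithm itself, and the abstract does not specify how MODis evaluates or searches candidates. The candidate names (`D1` to `D4`), the two measures (accuracy and negated training cost), and the convention that higher scores are better are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if `a` is at least as good as `b` on every measure
    and strictly better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline(candidates: Dict[str, Sequence[float]]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, score in candidates.items()
        if not any(
            dominates(other, score)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]


# Hypothetical scores: (accuracy, negated training cost), higher is better.
candidates = {
    "D1": (0.90, -5.0),
    "D2": (0.85, -2.0),
    "D3": (0.80, -6.0),  # worse than D1 on both measures
    "D4": (0.90, -7.0),  # ties D1 on accuracy, worse on cost
}

print(skyline(candidates))  # ['D1', 'D2']
```

In this toy setting, `D1` and `D2` form the skyline because each is better than the other on one measure, so neither can be discarded without giving up something on some objective. This pairwise check is quadratic in the number of candidates, which is fine for illustration but would not scale to a large search space.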
