LML: Language Model Learning a Dataset for Data-Augmented Prediction
September 27, 2024
Author: Praneeth Vadlapati
cs.AI
Abstract
This paper introduces a new approach to using Large Language Models (LLMs)
for classification tasks, which are typically handled using Machine Learning
(ML) models. Unlike ML models that rely heavily on data cleaning and feature
engineering, this method streamlines the process using LLMs. This paper
proposes a new concept called "Language Model Learning (LML)" powered by a new
method called "Data-Augmented Prediction (DAP)". LLMs perform the
classification in a way that resembles how humans manually explore and
understand the data and then decide classifications using the data as a reference.
Training data is summarized and evaluated to determine which features
contribute most to the classification of each label. During DAP, the system
uses the data summary to automatically create a query, which is used to
retrieve relevant rows from the dataset. The LLM then generates a
classification from the data summary and the relevant rows, ensuring satisfactory accuracy even
with complex data. Using the data summary and similar data in DAP ensures
context-aware decision-making. The proposed method uses the words "Act as an
Explainable Machine Learning Model" in the prompt to enhance the
interpretability of the predictions by allowing users to review the logic
behind each prediction. In some test cases, the system scored an accuracy above
90%, proving the effectiveness of the system and its potential to outperform
conventional ML models in various scenarios. The code is available at
https://github.com/Pro-GenAI/LML-DAP
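
The flow the abstract describes (summarize the training data, retrieve relevant rows, then prompt the LLM with both) can be illustrated with a minimal sketch. This is not the code from the repository: the function names `summarize`, `retrieve`, `build_prompt`, and `predict` are hypothetical, the summary is a simple per-label mean rather than an LLM-written one, and retrieval uses a plain nearest-neighbor search as a stand-in for the query that the paper generates from the data summary. The LLM is passed in as any callable that maps a prompt string to a response string, so no particular API is assumed.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def summarize(rows: List[Row], label_key: str) -> str:
    """Plain-text summary: per-label row count and mean of each numeric feature.
    (Assumption: a stand-in for the LLM-generated summary in the paper.)"""
    by_label: Dict[object, List[Row]] = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    lines = []
    for label, group in sorted(by_label.items(), key=lambda kv: str(kv[0])):
        features = [k for k in group[0] if k != label_key]
        means = ", ".join(
            f"{k}={sum(float(r[k]) for r in group) / len(group):.2f}"
            for k in features
        )
        lines.append(f"label {label}: n={len(group)}, mean {means}")
    return "\n".join(lines)


def retrieve(rows: List[Row], query: Row, k: int = 3) -> List[Row]:
    """Return the k training rows closest to the query row.
    (Assumption: squared Euclidean distance replaces the generated query.)"""
    def distance(row: Row) -> float:
        return sum((float(row[f]) - float(query[f])) ** 2 for f in query)
    return sorted(rows, key=distance)[:k]


def build_prompt(summary: str, neighbors: List[Row], query: Row) -> str:
    """Combine the summary, the retrieved rows, and the row to classify."""
    return (
        "Act as an Explainable Machine Learning Model.\n"
        f"Data summary:\n{summary}\n"
        f"Relevant rows:\n{neighbors}\n"
        f"Classify this row and explain the reasoning:\n{query}"
    )


def predict(rows: List[Row], query: Row, label_key: str,
            llm: Callable[[str], str], k: int = 3) -> str:
    """Run the sketch end to end and return the raw LLM response."""
    summary = summarize(rows, label_key)
    neighbors = retrieve(rows, query, k)
    return llm(build_prompt(summary, neighbors, query))


if __name__ == "__main__":
    data = [
        {"x": 1.0, "y": 1.0, "label": "a"},
        {"x": 5.0, "y": 5.1, "label": "b"},
    ]
    # The lambda is a placeholder; replace it with a call to a real LLM client.
    print(predict(data, {"x": 1.1, "y": 1.0}, "label", lambda p: p, k=1))
```

Passing the model as a callable keeps the sketch independent of any provider and makes the prompt easy to inspect, which matters here because the explanation returned alongside the label is the interpretability mechanism the abstract refers to.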