UPCORE: Utility-Preserving Coreset Selection for Balanced Unlearning
Abstract
User specifications or legal frameworks often require information to be removed from pretrained models, including large language models (LLMs). This requires deleting or "forgetting" a set of data points from an already-trained model, which typically degrades its performance on other data points. Thus, a balance must be struck between removing information and keeping the model's other abilities intact; failing to balance this trade-off leads to poor deletion or an unusable model. To this end, we propose UPCORE (Utility-Preserving Coreset Selection), a method-agnostic data selection framework for mitigating collateral damage during unlearning. Finding that model damage is correlated with the variance of the model's representations on the forget set, we selectively prune the forget set to remove outliers, thereby minimizing model degradation after unlearning. We evaluate UPCORE across three standard unlearning methods, where it consistently achieves a superior balance between the competing objectives of deletion efficacy and model preservation. To better evaluate this trade-off, we introduce a new metric measuring the area under the curve (AUC) across standard metrics. We find that UPCORE improves both standard metrics and AUC, benefiting from positive transfer between the coreset and pruned points while reducing negative transfer from the forget set to points outside of it.
Summary
Paper Overview
Core Contribution
- UPCORE: A method-agnostic data selection framework for mitigating collateral damage during unlearning in large language models (LLMs).
- Key Insight: Model damage during unlearning is correlated with the variance of the model’s representations on the forget set. UPCORE selectively prunes outliers to minimize degradation.
- Innovation: Introduces an area-under-the-curve (AUC) metric over standard unlearning measures to evaluate the trade-off between deletion efficacy and model preservation.
Research Context
- Problem: Unlearning specific data points from pretrained models often degrades performance on unrelated data, leading to a trade-off between deletion and model utility.
- Motivation: Legal frameworks (e.g., GDPR, CCPA) require the removal of sensitive data, but current unlearning methods struggle to balance deletion and model preservation.
- Prior Work: Existing unlearning methods (e.g., Gradient Ascent, Refusal, NPO) often lead to unintended consequences, such as reduced model utility.
Keywords
- Machine Unlearning, Coreset Selection, Utility Preservation, Collateral Damage, Isolation Forest, AUC Metric, LLMs, Data Privacy
Background
Research Gap
- Challenge: Existing unlearning methods do not systematically control data attributes (e.g., variance) that drive collateral damage.
- Question: What measurable attributes of the forget set drive collateral effects, and can they be controlled to optimize the trade-off between deletion and utility?
Technical Challenges
- Overgeneralization: Unlearning updates often propagate beyond the forget set, affecting unrelated data.
- Model Degradation: Unlearning can lead to significant performance loss on retained data.
- Data Variance: High variance in the forget set correlates with increased collateral damage.
Prior Approaches
- Exact Unlearning: Ensures the model is indistinguishable from one retrained without the forget data (computationally expensive).
- Approximate Unlearning: Modifies model parameters to approximate exact unlearning (more efficient but less precise).
- Model Editing: Directly modifies model weights to forget specific facts, but often leads to unintended consequences.
Methodology
Technical Architecture
- Hidden State Extraction: Extract hidden-state representations from the model's penultimate layer for each data point in the forget set (sketched in the code after this list).
- Isolation Forest Training: Train an Isolation Forest model to identify outliers in the forget set based on hidden state variance.
- Coreset Selection: Prune outliers to form a core forget set with lower variance.
- Unlearning: Apply unlearning algorithms (e.g., Gradient Ascent, Refusal, NPO) to the core forget set.
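As a concrete illustration of the first step above, here is a minimal sketch of penultimate-layer hidden-state extraction with Hugging Face transformers; the gpt2 checkpoint and the mean-pooling over tokens are illustrative assumptions, not necessarily the paper's exact setup.

```python
# Minimal sketch: one penultimate-layer representation per forget-set prompt.
# Model choice and mean-pooling are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def penultimate_state(text: str) -> torch.Tensor:
    """Mean-pool the penultimate layer's hidden states over tokens."""
    inputs = tok(text, return_tensors="pt")
    out = model(**inputs, output_hidden_states=True)
    # out.hidden_states is a tuple (embeddings + one tensor per layer),
    # each of shape [batch, seq_len, dim]; index -2 is the penultimate layer.
    return out.hidden_states[-2].mean(dim=1).squeeze(0)

vec = penultimate_state("Who wrote Pride and Prejudice?")
print(vec.shape)  # torch.Size([768]) for gpt2
```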
Implementation Details
- Isolation Forest: An unsupervised learning technique that efficiently identifies anomalous data points by recursively partitioning the dataset with random splits; anomalies are isolated in fewer splits.
- Anomaly Score: Computed for each data point, with higher scores indicating greater contribution to variance.
- Pruning Threshold: Determined either by coreset size control or by proportional pruning (both modes appear in the sketch below).
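Here is a hedged sketch of outlier scoring and both pruning modes using scikit-learn's IsolationForest; the placeholder hidden states, variable names, and the 30% fraction are assumptions for illustration.

```python
# Sketch of coreset selection via Isolation Forest; hidden states are
# random placeholders standing in for the representations extracted above.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
hidden = rng.normal(size=(500, 768))  # stand-in forget-set representations

iso = IsolationForest(random_state=0).fit(hidden)
# score_samples returns higher values for more "normal" points;
# negate so that higher = more anomalous, matching the text above.
anomaly = -iso.score_samples(hidden)

# Mode 1: coreset size control -- keep exactly k least-anomalous points.
k = 350
core_by_size = np.argsort(anomaly)[:k]

# Mode 2: proportional pruning -- drop a fixed fraction of top outliers.
frac = 0.30
cutoff = np.quantile(anomaly, 1.0 - frac)
core_by_frac = np.flatnonzero(anomaly <= cutoff)

print(len(core_by_size), len(core_by_frac))
```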
Innovation Points
- Variance Minimization: Reduces collateral damage by pruning high-variance outliers.
- Positive Transfer: Leverages overgeneralization to extend unlearning to pruned points.
- Method-Agnostic: Can be applied on top of any data-driven unlearning framework; one such step is sketched after this list.
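As a concrete instance, below is a hedged sketch of the simplest data-driven method named above, gradient ascent, applied to the core forget set: the language-modeling loss on forget examples is maximized rather than minimized. The model, optimizer, and learning rate are illustrative assumptions, not the paper's configuration.

```python
# Sketch of one gradient-ascent unlearning pass over the core forget set.
# All configuration choices here are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

core_forget = ["Placeholder fact the model should forget."]

model.train()
for text in core_forget:
    batch = tok(text, return_tensors="pt")
    out = model(**batch, labels=batch["input_ids"])
    loss = -out.loss  # negate the LM loss: ascend instead of descend
    opt.zero_grad()
    loss.backward()
    opt.step()
```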
Results
Experimental Setup
- Unlearning Methods: Gradient Ascent, Refusal, Negative Preference Optimization (NPO).
- Datasets: Counterfact (prompt completion) and TriviaQA (question answering).
- Metrics: ROUGE scores, model utility, and AUC capturing both deletion effectiveness and utility retention (see the sketch following this list).
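To make the AUC metric concrete: at each unlearning checkpoint one records deletion efficacy and retained utility, then integrates utility over deletion with the trapezoidal rule. The sketch below uses invented numbers, not results from the paper.

```python
# Illustrative AUC over a deletion/utility trade-off curve; the sample
# values are invented, not results from the paper.
import numpy as np

# One (x, y) point per unlearning checkpoint:
deletion = np.array([0.00, 0.20, 0.45, 0.70, 0.90])  # e.g. 1 - ROUGE on forget set
utility = np.array([1.00, 0.95, 0.88, 0.80, 0.65])   # e.g. ROUGE on retained data

order = np.argsort(deletion)
x, y = deletion[order], utility[order]
auc = float(np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2))  # trapezoid rule
print(f"trade-off AUC: {auc:.3f}")  # higher = better balance
```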
Key Findings
- Superior Trade-off: UPCORE consistently achieves higher AUC across all unlearning methods, indicating a better balance between deletion and model utility.
- Positive Transfer: Unlearning on the core forget set extends to pruned points, reducing collateral damage.
- Robustness: UPCORE generalizes well to paraphrased and jailbreak prompts, maintaining high deletion efficacy.
Limitations
- Coreset Size: Gains plateau at around 30% pruning; pruning substantially more can begin to degrade performance on the full forget set.
- Scalability: The effectiveness of UPCORE may vary with the size and complexity of the forget set.
Conclusion
- UPCORE: A principled framework for utility-preserving unlearning in LLMs, leveraging variance minimization to reduce collateral damage.
- AUC Metric: Introduces a novel metric to quantify the trade-off between deletion and model utility across unlearning epochs.
- Future Work: Explore the application of UPCORE to other machine learning models and unlearning scenarios.
Acknowledgments
- Supported by NSF-CAREER Award 1846185, Microsoft AFMR, DARPA ECOLE, and NSF-AI Engage Institute.