Beyond No: Quantifying AI Over-Refusal and Emotional Attachment Boundaries
Abstract
We present an open-source benchmark and evaluation framework for assessing emotional boundary handling in Large Language Models (LLMs). Using a dataset of 1156 prompts across six languages, we evaluated three leading LLMs (GPT-4o, Claude-3.5 Sonnet, and Mistral-large) on their ability to maintain appropriate emotional boundaries through pattern-matched response analysis. Our framework quantifies responses across seven key patterns: direct refusal, apology, explanation, deflection, acknowledgment, boundary setting, and emotional awareness. Results demonstrate significant variation in boundary-handling approaches, with Claude-3.5 Sonnet achieving the highest overall score (8.69/10) and producing longer, more nuanced responses (86.51 words on average). We identified a substantial performance gap between English (average score 25.62) and non-English interactions (< 0.22), with English responses showing markedly higher refusal rates (43.20% vs. < 1% for non-English). Pattern analysis revealed model-specific strategies, such as Mistral-large's preference for deflection (4.2%) and consistently low empathy scores across all models (< 0.06). Limitations include potential oversimplification through pattern matching, lack of contextual understanding in response analysis, and binary classification of complex emotional responses. Future work should explore more nuanced scoring methods, expand language coverage, and investigate cultural variations in emotional boundary expectations. Our benchmark and methodology provide a foundation for the systematic evaluation of LLM emotional intelligence and boundary-setting capabilities.
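To make the pattern-matched analysis concrete, here is a minimal sketch of how such a classifier might look. The keyword patterns and the example response below are invented for illustration; they are not the benchmark's actual pattern lists:

```python
import re

# Illustrative keyword patterns for the seven response categories;
# the benchmark's real patterns are not reproduced here.
PATTERNS = {
    "refusal": r"\b(i can't|i cannot|i won't|unable to)\b",
    "apology": r"\b(i'm sorry|i apologize|apologies)\b",
    "explanation": r"\b(because|as an ai|my purpose is)\b",
    "deflection": r"\b(instead|perhaps you could|have you considered)\b",
    "acknowledgment": r"\b(i understand|i hear you|that sounds)\b",
    "boundary_setting": r"\b(i'm not able to form|appropriate boundaries|professional)\b",
    "emotional_awareness": r"\b(you may be feeling|it's okay to feel|your feelings)\b",
}

def analyze_response(text: str) -> dict[str, bool]:
    """Binary flags for which of the seven patterns a response matches."""
    lowered = text.lower()
    return {name: bool(re.search(pattern, lowered)) for name, pattern in PATTERNS.items()}

response = ("I'm sorry, but I can't be your romantic partner. "
            "I understand this may be hard to hear.")
print(analyze_response(response))
# {'refusal': True, 'apology': True, 'explanation': False, ...}
```

The limitation the abstract names applies directly here: binary regex flags carry no context, which is why pattern matching risks oversimplifying complex emotional responses.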
Summary
Paper Overview
Core Contribution
- Introduces MUTAGREP, a method for execution-free repository-grounded plan search for code-use.
- Demonstrates the utility of repo-grounded plans for code generation.
- Formulates execution-free repo-grounded planning as LLM-guided tree search.
- Studies the design space of repo-grounded plan search.
- Shows that plan search benefits from scaling test-time compute.
Research Context
- Addresses the challenge of providing context from large code repositories to LLMs for coding tasks.
- Compares with approaches such as adding the entire repo to the LLM’s context window or emulating human navigation of the codebase.
- Focuses on the LongCodeArena benchmark for evaluation.
Keywords
- MUTAGREP
- Repository-grounded plan search
- LLM-guided tree search
- Code generation
- LongCodeArena
Background
Research Gap
- Existing methods either inefficiently use context windows or fail to emulate human-like navigation of codebases.
- Need for a method that decomposes user requests into natural language steps grounded in the codebase.
Technical Challenges
- Long contexts are detrimental to LLM reasoning abilities.
- Context windows are not unlimited.
- Difficulty in identifying relevant symbols in large codebases.
Prior Approaches
- Adding the entire repo to the LLM’s context window.
- Emulating the human ability to navigate a codebase and pick out relevant functionality.
- ReAct-based planning and full-repo context approaches.
Methodology
Technical Architecture
- Neural tree search in plan space.
- Symbol retriever for grounding plan steps to repository symbols (see the retriever sketch after this list).
- LLM-guided tree search that expands plans by mutation.
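The summary does not say how the symbol retriever is implemented; below is a minimal embedding-based sketch assuming a sentence-transformers bi-encoder and a pre-extracted symbol inventory. The model choice and the `videoio` symbols are illustrative assumptions, not details from the paper:

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical pre-extracted symbol inventory: (qualified name, docstring).
SYMBOLS = [
    ("videoio.reader.VideoReader.read_frames", "Iterate decoded frames from a video file."),
    ("videoio.writer.save_clip", "Encode a list of frames to an mp4 file."),
    ("videoio.transforms.resize_batch", "Resize a batch of frames to a target shape."),
]

model = SentenceTransformer("all-MiniLM-L6-v2")
symbol_embeddings = model.encode([doc for _, doc in SYMBOLS], convert_to_tensor=True)

def ground_intent(intent: str, k: int = 2) -> list[str]:
    """Map a natural-language plan step to its k most similar repo symbols."""
    query = model.encode(intent, convert_to_tensor=True)
    scores = util.cos_sim(query, symbol_embeddings)[0]
    return [SYMBOLS[i][0] for i in scores.topk(k).indices.tolist()]

print(ground_intent("load all frames from the input video"))
```

Any dense or lexical retriever could fill this role; the essential property is mapping each natural-language intent to concrete symbols that actually exist in the repo.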
Implementation Details
- Successor function to mutate plans.
- Symbol retriever to ground intents to symbols.
- Tree-traversal algorithm for node expansion.
- Plan ranker to select the most promising node (these components are combined in the search sketch after this list).
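A minimal best-first skeleton showing how these pieces could fit together. Here `mutate_plan`, `ground_intent`, and `score_plan` are hypothetical stand-ins for the paper's successor function, symbol retriever, and plan ranker, and best-first expansion on ranker score is just one point in the design space the paper studies:

```python
import heapq
import itertools

def plan_search(user_request: str, budget: int = 32, branching: int = 3):
    """Best-first search over repo-grounded plans (illustrative sketch).

    Assumed external components, not implemented here:
      - mutate_plan(steps) -> list of mutated successor plans (an LLM call)
      - ground_intent(step) -> repo symbols implementing one plan step
      - score_plan(grounded_plan) -> float from the plan ranker
    """
    def ground(steps):
        # Attach retrieved repo symbols to every natural-language step.
        return [(step, ground_intent(step)) for step in steps]

    root = ground([user_request])   # degenerate one-step initial plan
    tie = itertools.count()         # tie-breaker so the heap never compares plans
    frontier = [(-score_plan(root), next(tie), root)]
    best, best_score = root, score_plan(root)

    for _ in range(budget):         # test-time compute budget
        if not frontier:
            break
        _, _, node = heapq.heappop(frontier)             # most promising node
        for succ in mutate_plan([s for s, _ in node])[:branching]:
            child = ground(succ)
            score = score_plan(child)
            if score > best_score:
                best, best_score = child, score
            heapq.heappush(frontier, (-score, next(tie), child))
    return best                     # highest-scoring grounded plan found
```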
Innovation Points
- Execution-free plan search.
- Grounding plans in the codebase.
- Scaling test-time compute for improved performance (illustrated after this list).
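Under this framing, scaling test-time compute is just a larger search budget. Using the hypothetical `plan_search` sketch above (an assumption of this summary, not the paper's code):

```python
# More budget = more nodes expanded; the returned plan can only improve,
# since plan_search keeps the highest-scoring grounded plan seen so far.
for budget in (8, 32, 128):
    plan = plan_search("convert a folder of videos to resized mp4 clips",
                       budget=budget)
    print(budget, [step for step, _ in plan])
```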
Results
Experimental Setup
- Evaluated on the LongCodeArena benchmark.
- Compared with instruction-only, ReAct, and full-repo context approaches.
Key Findings
- MUTAGREP plans use less than 5% of the context window but rival the performance of full-repo context.
- Plans enable weaker models to match stronger models’ performance.
- Significant improvement on hard LongCodeArena tasks.
Limitations
- Dependence on the quality of the symbol retriever.
- Computational cost of tree search.
- Potential for incomplete or inaccurate plans.