RF-DETR Object Detection vs YOLOv12: A Study of Transformer-based and CNN-based Architectures for Single-Class and Multi-Class Greenfruit Detection in Complex Orchard Environments Under Label Ambiguity
April 17, 2025
Authors: Ranjan Sapkota, Rahul Harsha Cheppally, Ajay Sharda, Manoj Karkee
cs.AI
Abstract
This study conducts a detailed comparison of RF-DETR object detection base
model and YOLOv12 object detection model configurations for detecting
greenfruits in a complex orchard environment marked by label ambiguity,
occlusions, and background blending. A custom dataset was developed featuring
both single-class (greenfruit) and multi-class (occluded and non-occluded
greenfruits) annotations to assess model performance under dynamic real-world
conditions. The RF-DETR model, utilizing a DINOv2 backbone and deformable
attention, excelled at global context modeling, effectively identifying
partially occluded or ambiguous greenfruits. In contrast, YOLOv12
leveraged CNN-based attention for enhanced local feature extraction, optimizing
it for computational efficiency and edge deployment. RF-DETR achieved the
highest mean Average Precision (mAP@50) of 0.9464 in single-class detection,
proving its superior ability to localize greenfruits in cluttered scenes.
Although YOLOv12N recorded the highest mAP@50:95 of 0.7620, RF-DETR
consistently outperformed in complex spatial scenarios. For multi-class
detection, RF-DETR led with an mAP@50 of 0.8298, showing its capability to
differentiate between occluded and non-occluded fruits, while YOLOv12L scored
highest in mAP@50:95 with 0.6622, indicating better classification in detailed
occlusion contexts. Training dynamics analysis highlighted RF-DETR's swift
convergence, particularly in single-class settings where it plateaued within 10
epochs, demonstrating the efficiency of transformer-based architectures in
adapting to dynamic visual data. These findings validate RF-DETR's
effectiveness for precision agricultural applications, with YOLOv12 suited for
fast-response scenarios.

Index Terms: RF-DETR object detection, YOLOv12, YOLOv13, YOLOv14, YOLOv15,
YOLOE, YOLO World, YOLO, You Only Look Once, Roboflow, Detection Transformers,
CNNs
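The abstract contrasts two evaluation metrics: mAP@50, which scores a detection as correct at a single IoU threshold of 0.5, and mAP@50:95, which averages precision over ten IoU thresholds from 0.50 to 0.95. The sketch below is a rough illustration of that difference, not the authors' evaluation code: it uses a simplified recall-style proxy with greedy box matching in place of a full precision-recall AP integration, so the numbers only show how the stricter averaged metric penalizes loosely localized boxes.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def recall_at_threshold(preds, gts, thr):
    """Fraction of ground-truth boxes matched by a prediction at IoU >= thr.
    A stand-in for per-threshold AP to keep the sketch short; real mAP
    integrates a precision-recall curve over confidence-ranked detections."""
    matched = set()
    for p in preds:
        for i, g in enumerate(gts):
            if i not in matched and iou(p, g) >= thr:
                matched.add(i)
                break
    return len(matched) / len(gts) if gts else 0.0

def averaged_metric(preds, gts):
    """Average the per-threshold score over IoU 0.50, 0.55, ..., 0.95,
    mirroring how mAP@50:95 averages AP over ten thresholds."""
    thresholds = [0.50 + 0.05 * k for k in range(10)]
    return sum(recall_at_threshold(preds, gts, t) for t in thresholds) / len(thresholds)

# Toy example: one ground-truth box and one slightly shifted prediction.
gts = [[0, 0, 10, 10]]
preds = [[1, 1, 10, 10]]          # IoU with the ground truth = 81/100 = 0.81
print(recall_at_threshold(preds, gts, 0.5))  # 1.0 - match survives IoU >= 0.5
print(averaged_metric(preds, gts))           # 0.7 - match fails beyond IoU 0.80
```

This is why the abstract can report different winners per metric: a model whose boxes are merely "good enough" scores well at IoU 0.5, while the averaged 0.50:0.95 metric rewards tighter localization.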