RF-DETR Object Detection vs YOLOv12: A Study of Transformer-based and CNN-based Architectures for Single-Class and Multi-Class Greenfruit Detection in Complex Orchard Environments Under Label Ambiguity
April 17, 2025
Authors: Ranjan Sapkota, Rahul Harsha Cheppally, Ajay Sharda, Manoj Karkee
cs.AI
Abstract
This study conducts a detailed comparison of the RF-DETR object detection base
model and YOLOv12 object detection model configurations for detecting
greenfruits in a complex orchard environment marked by label ambiguity,
occlusions, and background blending. A custom dataset was developed featuring
both single-class (greenfruit) and multi-class (occluded and non-occluded
greenfruits) annotations to assess model performance under dynamic real-world
conditions. The RF-DETR object detection model, utilizing a DINOv2 backbone and
deformable attention, excelled in global context modeling, effectively
identifying partially occluded or ambiguous greenfruits. In contrast, YOLOv12
leveraged CNN-based attention for enhanced local feature extraction, optimizing
it for computational efficiency and edge deployment. RF-DETR achieved the
highest mean Average Precision (mAP@50) of 0.9464 in single-class detection,
proving its superior ability to localize greenfruits in cluttered scenes.
Although YOLOv12N recorded the highest mAP@50:95 of 0.7620, RF-DETR
consistently outperformed it in complex spatial scenarios. For multi-class
detection, RF-DETR led with an mAP@50 of 0.8298, showing its capability to
differentiate between occluded and non-occluded fruits, while YOLOv12L scored
highest in mAP@50:95 with 0.6622, indicating better classification in detailed
occlusion contexts. Training dynamics analysis highlighted RF-DETR's swift
convergence, particularly in single-class settings where it plateaued within 10
epochs, demonstrating the efficiency of transformer-based architectures in
adapting to dynamic visual data. These findings validate RF-DETR's
effectiveness for precision agricultural applications, with YOLOv12 suited for
fast-response scenarios.
Index Terms: RF-DETR object detection, YOLOv12, YOLOv13, YOLOv14, YOLOv15, YOLOE, YOLO World, YOLO, You Only Look Once, Roboflow, Detection Transformers, CNNs
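
To make the annotation setup concrete, the following is a minimal sketch of how the single-class and multi-class labeling schemes described in the abstract could be expressed as YOLO-style dataset configuration files. The dataset paths and class names (greenfruit, occluded_greenfruit, non_occluded_greenfruit) are illustrative assumptions, not the authors' released files.

import yaml  # PyYAML

# Hypothetical dataset root and splits; adjust to the actual layout.
common = {
    "path": "datasets/greenfruit",
    "train": "images/train",
    "val": "images/val",
}

# Single-class scheme: every fruit instance carries the same label.
single_class = dict(common, names={0: "greenfruit"})

# Multi-class scheme: occlusion status is encoded in the class label
# (assumed names; the paper distinguishes occluded vs non-occluded fruits).
multi_class = dict(common, names={0: "occluded_greenfruit",
                                  1: "non_occluded_greenfruit"})

for filename, cfg in [("single_class.yaml", single_class),
                      ("multi_class.yaml", multi_class)]:
    with open(filename, "w") as f:
        yaml.safe_dump(cfg, f, sort_keys=False)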
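
The abstract reports both mAP@50 and mAP@50:95, which differ only in how strictly predicted boxes must overlap the ground truth. The sketch below is a simplified stand-in for the paper's actual evaluation pipeline: it computes per-class average precision at a single IoU threshold of 0.50 and averaged over thresholds 0.50 to 0.95 in steps of 0.05, then takes the mean over classes to obtain mAP.

import numpy as np

def iou(box_a, box_b):
    # Intersection-over-union of two [x1, y1, x2, y2] boxes.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def average_precision(preds, gts, iou_thr):
    # AP for one class: preds = [(score, box)], gts = [box].
    preds = sorted(preds, key=lambda p: -p[0])
    matched = set()
    tp = np.zeros(len(preds))
    for i, (_, pbox) in enumerate(preds):
        best, best_j = 0.0, -1
        for j, gbox in enumerate(gts):
            if j in matched:
                continue
            overlap = iou(pbox, gbox)
            if overlap > best:
                best, best_j = overlap, j
        if best >= iou_thr:           # greedy one-to-one matching
            tp[i] = 1.0
            matched.add(best_j)
    recall = np.cumsum(tp) / max(len(gts), 1)
    precision = np.cumsum(tp) / np.arange(1, len(preds) + 1)
    # All-point interpolation of the precision-recall curve.
    interp = np.maximum.accumulate(precision[::-1])[::-1]
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recall, interp):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

def map50_and_map50_95(preds_by_class, gts_by_class):
    # mAP@50 uses a single IoU threshold; mAP@50:95 averages AP over
    # thresholds 0.50, 0.55, ..., 0.95 before averaging over classes.
    thresholds = np.arange(0.50, 1.00, 0.05)
    ap50, ap5095 = [], []
    for cls in gts_by_class:
        aps = [average_precision(preds_by_class.get(cls, []),
                                 gts_by_class[cls], t) for t in thresholds]
        ap50.append(aps[0])
        ap5095.append(float(np.mean(aps)))
    return float(np.mean(ap50)), float(np.mean(ap5095))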