ILIAS: Instance-Level Image retrieval At Scale
February 17, 2025
Authors: Giorgos Kordopatis-Zilos, Vladan Stojnić, Anna Manko, Pavel Šuma, Nikolaos-Antonios Ypsilantis, Nikos Efthymiadis, Zakaria Laskar, Jiří Matas, Ondřej Chum, Giorgos Tolias
cs.AI
Abstract
This work introduces ILIAS, a new test dataset for Instance-Level Image retrieval At Scale. It is designed to evaluate the ability of current and future foundation models and retrieval techniques to recognize particular objects. The key benefits over existing datasets include large scale, domain diversity, accurate ground truth, and a performance that is far from saturated. ILIAS includes query and positive images for 1,000 object instances, manually collected to capture challenging conditions and diverse domains. Large-scale retrieval is conducted against 100 million distractor images from YFCC100M. To avoid false negatives without extra annotation effort, we include only query objects confirmed to have emerged after 2014, i.e. the compilation date of YFCC100M. An extensive benchmarking is performed with the following observations: i) models fine-tuned on specific domains, such as landmarks or products, excel in that domain but fail on ILIAS; ii) learning a linear adaptation layer using multi-domain class supervision results in performance improvements, especially for vision-language models; iii) local descriptors in retrieval re-ranking are still a key ingredient, especially in the presence of severe background clutter; iv) the text-to-image performance of the vision-language foundation models is surprisingly close to the corresponding image-to-image case. Website: https://vrg.fel.cvut.cz/ilias/
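
At this scale, retrieval against the 100 million YFCC100M distractors amounts to nearest-neighbor search over global image descriptors. The snippet below is a minimal sketch of that search step using FAISS with exact cosine similarity; the descriptor dimensionality, database size, and random vectors are placeholders, not the benchmark's actual setup (a real 100M-image index would typically use compressed or approximate structures).

```python
# Toy sketch of large-scale instance retrieval: exact cosine-similarity search
# over global descriptors with FAISS. All sizes and vectors are placeholders.
import faiss
import numpy as np

d = 768                                                # assumed descriptor dimensionality
db = np.random.randn(100_000, d).astype("float32")    # stand-in for the distractor set
queries = np.random.randn(10, d).astype("float32")    # stand-in for query descriptors

faiss.normalize_L2(db)       # L2-normalize so inner product equals cosine similarity
faiss.normalize_L2(queries)

index = faiss.IndexFlatIP(d)
index.add(db)
scores, ids = index.search(queries, 10)  # top-10 ranked list per query
print(ids[0], scores[0])
```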
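As a rough illustration of observation ii), a linear adaptation layer can be trained on top of frozen embeddings with class supervision. The sketch below is a hypothetical, minimal version: it assumes precomputed multi-domain embeddings from a frozen backbone and integer class labels, and is not the paper's exact training recipe.

```python
# Minimal sketch of a linear adaptation layer trained with multi-domain class
# supervision on frozen embeddings. Hypothetical setup: `embeddings` stand in
# for features from a frozen backbone (e.g. a vision-language model).
import torch
import torch.nn.functional as F

dim, num_classes = 768, 10_000                 # assumed embedding size / label space
embeddings = torch.randn(50_000, dim)          # placeholder for frozen features
labels = torch.randint(0, num_classes, (50_000,))

adapter = torch.nn.Linear(dim, dim, bias=False)  # the linear adaptation layer
classifier = torch.nn.Linear(dim, num_classes)   # discarded after training
opt = torch.optim.AdamW(
    list(adapter.parameters()) + list(classifier.parameters()), lr=1e-3
)

for step in range(1000):
    idx = torch.randint(0, embeddings.size(0), (256,))
    x = adapter(embeddings[idx])                        # adapt frozen descriptors
    loss = F.cross_entropy(classifier(x), labels[idx])  # class supervision
    opt.zero_grad()
    loss.backward()
    opt.step()

# At search time only the adapter is kept: descriptors become adapter(f(img)),
# L2-normalized, and compared by cosine similarity as usual.
```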
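Observation iii) refers to re-ranking the top of the global-search shortlist with local features. One common instance of this idea, sketched below under assumed inputs (grayscale images already loaded), is to match SIFT descriptors and score each candidate by the number of RANSAC-verified correspondences; this is a generic spatial-verification sketch, not the specific re-ranking method evaluated in the paper.

```python
# Sketch of local-descriptor re-ranking: rescore shortlist images by the number
# of geometrically verified SIFT matches (homography inliers under RANSAC).
import cv2
import numpy as np

sift = cv2.SIFT_create()
matcher = cv2.BFMatcher(cv2.NORM_L2)

def verified_matches(query_img, db_img):
    """Count geometrically verified local correspondences between two images."""
    kq, dq = sift.detectAndCompute(query_img, None)
    kd, dd = sift.detectAndCompute(db_img, None)
    if dq is None or dd is None:
        return 0
    good = []
    for pair in matcher.knnMatch(dq, dd, k=2):
        # Lowe's ratio test keeps only distinctive matches.
        if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
            good.append(pair[0])
    if len(good) < 4:  # a homography needs at least 4 correspondences
        return 0
    src = np.float32([kq[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kd[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    _, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return 0 if mask is None else int(mask.sum())

# Re-rank: sort the global-search shortlist by verified match count, descending.
# shortlist.sort(key=lambda img: verified_matches(query, img), reverse=True)
```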