LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation

Abstract

CLIP is one of the most important multimodal foundation models today. What powers CLIP's capabilities? The rich supervision signals provided by natural language, the carrier of human knowledge, shape a powerful cross-modal representation space. Meanwhile, rapid advances in large language models (LLMs) such as GPT-4 and LLaMA keep pushing the boundaries of language comprehension and generation. This raises an intriguing question: can the capabilities of LLMs be harnessed to further improve multimodal representation learning? The potential benefits of incorporating LLMs into CLIP are clear. The strong textual understanding of LLMs can fundamentally improve CLIP's handling of image captions, in particular long and complex texts, which are a well-known limitation of vanilla CLIP. Moreover, LLMs are trained on vast text corpora and therefore possess open-world knowledge. This allows them to expand on caption information during training, increasing the efficiency of the learning process. In this paper, we propose LLM2CLIP, a novel approach that embraces the power of LLMs to unlock CLIP's potential. By fine-tuning the LLM in the caption space with contrastive learning, we extract its textual capabilities into the output embeddings, significantly improving the textual discriminability of the output layer. We then design an efficient training process in which the fine-tuned LLM acts as a powerful teacher for CLIP's visual encoder. Thanks to the LLM, we can now incorporate longer and more complex captions without being restricted by the limited context window and capacity of vanilla CLIP's text encoder. Our experiments demonstrate that this approach brings substantial improvements in cross-modal tasks.
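The abstract names contrastive learning as the mechanism for fine-tuning the LLM in the caption space but does not state the loss. As a rough illustration only, the sketch below implements a generic symmetric InfoNCE objective of the kind used in CLIP-style training. The function names, the temperature of 0.07, and the pairing convention (row i of one batch matches row i of the other) are assumptions made for this example, not details taken from the paper.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]

def contrastive_loss(a_embs, b_embs, temperature=0.07):
    """Symmetric InfoNCE loss (hypothetical helper, not the paper's code).

    a_embs[i] and b_embs[i] are treated as a positive pair; every other
    combination in the batch is treated as a negative. The temperature
    value is an assumed default.
    """
    a = [l2_normalize(v) for v in a_embs]
    b = [l2_normalize(v) for v in b_embs]
    n = len(a)
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Cross-entropy with the diagonal as the target, in both directions.
    loss_a_to_b = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    logits_t = [list(col) for col in zip(*logits)]
    loss_b_to_a = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_a_to_b + loss_b_to_a)
```

In this sketch the two batches could be embeddings of two captions describing the same image, or image and caption embeddings. The loss approaches zero when each item is most similar to its own partner and grows when partners are confused with other items in the batch.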
