PreFLMR: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers

Abstract

Large Multimodal Models (LMMs) excel in natural language and visual understanding but are challenged by exacting tasks such as Knowledge-based Visual Question Answering (KB-VQA) which involve the retrieval of relevant information from document collections to use in shaping answers to questions. We present an extensive training and evaluation framework, M2KR, for KB-VQA. M2KR contains a collection of vision and language tasks which we have incorporated into a single suite of benchmark tasks for training and evaluating general-purpose multi-modal retrievers. We use M2KR to develop PreFLMR, a pre-trained version of the recently developed Fine-grained Late-interaction Multi-modal Retriever (FLMR) approach to KB-VQA, and we report new state-of-the-art results across a range of tasks. We also present investigations into the scaling behaviors of PreFLMR intended to be useful in future developments in general-purpose multi-modal retrievers.

Model

The PreFLMR architecture works as follows: (1) the text query consists of an instruction and a question, and is encoded by a text encoder; (2) at the output of the vision encoder, a mapping network consisting of Multi-Layer Perceptrons (MLPs) converts the [CLS] token representations into the same embedding space as the text encoder; (3) transformer blocks take in the patch image embeddings from the penultimate layer of the vision encoder and attend to the text features through cross-attention; (4) a text encoder encodes the documents in the knowledge base; (5) the scores between queries and documents are computed by late interaction, which allows each query token to interact with all document token embeddings.
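Step (5) can be illustrated with a minimal sketch, assuming the ColBERT-style "MaxSim" form of late interaction: each query token embedding is matched against its most similar document token embedding, and the per-token maxima are summed. The function names and the toy two-dimensional embeddings are hypothetical, chosen only for illustration; they are not taken from the PreFLMR codebase.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_tokens: List[Vector],
                           doc_tokens: List[Vector]) -> float:
    """Sum, over query tokens, of the best dot product with any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy query: text-token embeddings followed by visual-token embeddings.
# In the model these come from the text encoder and from the mapping
# network / transformer blocks; here they are made-up values.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
visual_tokens = [[0.5, 0.5]]
query = text_tokens + visual_tokens

doc_a = [[1.0, 0.0], [0.0, 1.0]]  # matches both text tokens
doc_b = [[0.0, 1.0]]              # matches only the second text token

score_a = late_interaction_score(query, doc_a)
score_b = late_interaction_score(query, doc_b)
print(score_a, score_b)
```

Because the score is a sum of token-level matches, a document is rewarded for covering each query token individually, which a single pooled query vector cannot express.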

Retrieval Performance

We show PreFLMR's performance when vision encoders of different scales are combined with a base-sized text encoder. All variants of PreFLMR show competitive performance on all nine benchmark retrieval tasks in M2KR.

For the evaluation metrics, WIT uses Recall@10, IGLUE uses Recall@1, and all other datasets use Recall@5. The scale of the plot is adjusted for better visualization. The best and worst numbers for each task are annotated.
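The Recall@K metrics above can be sketched as follows. This assumes the hit-style definition, in which a query counts as a success if at least one relevant document appears among its top K retrieved documents; the exact relevance criterion for each dataset is not stated here. The helper names, the `K_BY_DATASET` mapping, and the toy rankings are hypothetical.

```python
from typing import Dict, List, Set

# Cut-offs stated in the text: WIT uses K=10, IGLUE uses K=1, the rest K=5.
K_BY_DATASET = {"WIT": 10, "IGLUE": 1}
DEFAULT_K = 5


def k_for(dataset: str) -> int:
    """Return the Recall@K cut-off used for a dataset."""
    return K_BY_DATASET.get(dataset, DEFAULT_K)


def recall_at_k(rankings: Dict[str, List[str]],
                relevant: Dict[str, Set[str]],
                k: int) -> float:
    """Fraction of queries with at least one relevant document in the top k."""
    hits = sum(
        1 for query_id, ranked in rankings.items()
        if any(doc_id in relevant[query_id] for doc_id in ranked[:k])
    )
    return hits / len(rankings)


# Toy example: two queries, each with one relevant document.
rankings = {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d9", "d4"]}
relevant = {"q1": {"d1"}, "q2": {"d4"}}

r1 = recall_at_k(rankings, relevant, 1)
r2 = recall_at_k(rankings, relevant, 2)
r3 = recall_at_k(rankings, relevant, 3)
print(r1, r2, r3)
```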

KB-VQA Performance

We show downstream KB-VQA performance when RA-VQAv2 is equipped with PreFLMR and fine-tuned on the target VQA task. PaLM-E, PaLI-X, and PaLM-B are large multi-modal models with 562B, 55B, and 1T parameters, respectively. The E-VQA SOTA uses Lens, Google's API for image retrieval. After incorporating PreFLMR, RA-VQAv2 achieves competitive performance while using a much smaller language model.

Model	OKVQA	Infoseek	E-VQA
SOTA	66.10	21.80	48.80
SOTA model	PaLM-E	PaLI-X	PaLM-B + Lens
RA-VQAv2 w/ PreFLMR	61.88	30.65	54.45
w/o retrieval	55.44	21.78	19.80
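
To make the table easier to read, the short script below computes two differences for each dataset: the gain from adding PreFLMR retrieval (RA-VQAv2 w/ PreFLMR minus w/o retrieval) and the margin relative to the listed SOTA. The numbers are copied from the table; the dictionary layout and variable names are only illustrative.

```python
# Scores copied from the table above.
scores = {
    "OKVQA":    {"sota": 66.10, "with_preflmr": 61.88, "without_retrieval": 55.44},
    "Infoseek": {"sota": 21.80, "with_preflmr": 30.65, "without_retrieval": 21.78},
    "E-VQA":    {"sota": 48.80, "with_preflmr": 54.45, "without_retrieval": 19.80},
}

# Improvement from adding retrieval, and difference to the listed SOTA.
retrieval_gain = {
    name: round(s["with_preflmr"] - s["without_retrieval"], 2)
    for name, s in scores.items()
}
margin_vs_sota = {
    name: round(s["with_preflmr"] - s["sota"], 2)
    for name, s in scores.items()
}

for name in scores:
    print(f"{name}: retrieval gain {retrieval_gain[name]:+.2f}, "
          f"vs SOTA {margin_vs_sota[name]:+.2f}")
```

On these figures, retrieval helps on all three datasets, most of all on E-VQA, and RA-VQAv2 with PreFLMR exceeds the listed SOTA on Infoseek and E-VQA while remaining below it on OKVQA.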

Examples with PreFLMR as Knowledge Retriever

We demonstrate three types of usage of PreFLMR Knowledge Retriever: Image + Instruction + Question to retrieve Documents; Image + Instruction to retrieve Documents; Instruction + Question to retrieve Documents.

Image + Instruction + Question -> Documents

Image + Instruction -> Documents

Instruction + Question -> Documents

BibTeX

 
        @inproceedings{
          lin2023finegrained,
          title={Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering},
          author={Weizhe Lin and Jinghong Chen and Jingbiao Mei and Alexandru Coca and Bill Byrne},
          booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
          year={2023},
          url={https://openreview.net/forum?id=IWWWulAX7g}
        }
        
        @inproceedings{lin-etal-2024-preflmr,
          title = "{P}re{FLMR}: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers",
          author = "Lin, Weizhe  and
            Mei, Jingbiao  and
            Chen, Jinghong  and
            Byrne, Bill",
          editor = "Ku, Lun-Wei  and
            Martins, Andre  and
            Srikumar, Vivek",
          booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
          month = aug,
          year = "2024",
          address = "Bangkok, Thailand",
          publisher = "Association for Computational Linguistics",
          url = "https://aclanthology.org/2024.acl-long.289",
          pages = "5294--5316",
          abstract = "Large Multimodal Models (LMMs) excel in natural language and visual understanding but are challenged by exacting tasks such as Knowledge-based Visual Question Answering (KB-VQA) which involve the retrieval of relevant information from document collections to use in shaping answers to questions. We present an extensive training and evaluation framework, M2KR, for KB-VQA. M2KR contains a collection of vision and language tasks which we have incorporated into a single suite of benchmark tasks for training and evaluating general-purpose multi-modal retrievers. We use M2KR to develop PreFLMR, a pre-trained version of the recently developed Fine-grained Late-interaction Multi-modal Retriever (FLMR) approach to KB-VQA, and we report new state-of-the-art results across a range of tasks. We also present investigations into the scaling behaviors of PreFLMR intended to be useful in future developments in general-purpose multi-modal retrievers.",
      }

Acknowledgement

This work was supported in part by the AWS Cloud Credit for Research programme.

This page was adapted from the Nerfies project page, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Many thanks to the Academic Project Page Template.