Cross-modal retrieval
An exciting paper from Google Research and DeepMind: "Transforming LLMs into Cross-modal and Cross-lingual Retrieval Systems" (arxiv.org/abs/2404.01616)
The paper proposes a novel approach that leverages large language models (LLMs) to build a cross-modal retrieval system that matches speech and text across many languages, including languages unseen during retrieval training.
It uses a two-tower architecture: two uni-modal encoders sit on top of a shared LLM backbone with a projection layer. The uni-modal encoders produce representations for the text and speech inputs, the pre-trained LLM projects them into a shared embedding space, and the two projections are aligned with a contrastive loss, as sketched below.
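To make that setup concrete, here is a minimal PyTorch sketch of a two-tower retriever with a shared backbone, a shared projection layer, and a symmetric contrastive (InfoNCE) loss over in-batch negatives. All dimensions, the tiny stand-in Transformer backbone, and the dummy encoders are illustrative assumptions, not the paper's actual components (the paper uses PaLM 2 as the shared backbone).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative dimensions only; the real system uses PaLM 2 as the backbone.
D_FEAT, D_MODEL, D_EMBED = 80, 256, 128

class TwoTowerRetriever(nn.Module):
    """Two uni-modal encoders feed a shared backbone and projection head that
    map speech and text into one embedding space (a stand-in for the paper's
    LLM backbone)."""
    def __init__(self):
        super().__init__()
        self.speech_encoder = nn.Linear(D_FEAT, D_MODEL)   # stand-in speech tower
        self.text_encoder = nn.Embedding(32_000, D_MODEL)  # stand-in text tower
        self.backbone = nn.TransformerEncoder(             # stand-in for the shared LLM
            nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True),
            num_layers=2)
        self.proj = nn.Linear(D_MODEL, D_EMBED)            # shared projection layer

    def _embed(self, x):
        h = self.backbone(x)                               # (B, T, D_MODEL)
        return F.normalize(self.proj(h.mean(1)), dim=-1)   # pooled, unit-norm

    def forward(self, speech_feats, text_ids):
        z_speech = self._embed(self.speech_encoder(speech_feats))
        z_text = self._embed(self.text_encoder(text_ids))
        return z_speech, z_text

def contrastive_loss(z_speech, z_text, temperature=0.05):
    """Symmetric InfoNCE over in-batch negatives; matched speech-text pairs
    lie on the diagonal of the similarity matrix."""
    logits = z_speech @ z_text.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Tiny usage example with random tensors standing in for a real batch.
model = TwoTowerRetriever()
speech = torch.randn(8, 50, D_FEAT)          # 8 utterances, 50 frames of features
text = torch.randint(0, 32_000, (8, 20))     # 8 matching transcripts
loss = contrastive_loss(*model(speech, text))
loss.backward()
```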
The LLM is PaLM 2, which is pre-trained on hundreds of languages. The retrieval model is then trained on 900 hours of speech-text data covering 21 languages, yet it matches speech and text queries across 102 languages. Compared to previous systems explicitly trained on all 102 languages, this approach achieves a 10% absolute improvement in Recall@1. The model also exhibits zero-shot cross-lingual speech-to-text translation capabilities, which improve further when readily available machine translation data is added.
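For reference, speech-to-text Recall@1 in this kind of dual-encoder setup is simply nearest-neighbour accuracy in the shared embedding space: for each speech query, check whether the closest text embedding is its own transcript. A small sketch, with random unit vectors standing in for real model outputs (the paper's exact evaluation protocol is not reproduced here):

```python
import torch
import torch.nn.functional as F

def recall_at_1(speech_emb: torch.Tensor, text_emb: torch.Tensor) -> float:
    """Speech-to-text Recall@1 in the shared embedding space. Assumes row i of
    both tensors is a matched speech-text pair and embeddings are L2-normalized."""
    sims = speech_emb @ text_emb.t()              # cosine similarity matrix
    nearest = sims.argmax(dim=1)                  # top-1 text index per speech query
    gold = torch.arange(speech_emb.size(0))
    return (nearest == gold).float().mean().item()

# Example with random embeddings in place of real model outputs.
queries = F.normalize(torch.randn(100, 128), dim=-1)
docs = F.normalize(torch.randn(100, 128), dim=-1)
print(f"R@1 = {recall_at_1(queries, docs):.3f}")
```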
This paper shows a promising direction that could revolutionize how we handle information across languages. Imagine a customer service system that understands your spoken questions, no matter the language! The ability to generalize to unseen languages makes the approach attractive for areas like low-resource language processing and multilingual search engines.