Authors:
(1) Liang Wang, Microsoft Corporation, correspondence to wangliang@microsoft.com;
(2) Nan Yang, Microsoft Corporation, correspondence to nanya@microsoft.com;
(3) Xiaolong Huang, Microsoft Corporation;
(4) Linjun Yang, Microsoft Corporation;
(5) Rangan Majumder, Microsoft Corporation;
(6) Furu Wei, Microsoft Corporation, correspondence to fuwei@microsoft.com.
2 Related Work
Text Embeddings are continuous low-dimensional representations of text and have been extensively applied to various downstream tasks such as information retrieval, question answering, and retrieval-augmented generation (RAG). Early work on text embeddings includes latent semantic indexing [10] and weighted averages of word embeddings [25]. More recent methods exploit supervision from natural language inference [3] and labeled query-document pairs, such as the MS-MARCO passage ranking dataset [5], to train text embeddings [37, 6, 13]. However, labeled data are often limited in terms of task diversity and language coverage. To address this challenge, methods like Contriever [18], OpenAI Embeddings [30], E5 [46], and BGE [48] adopt a multi-stage training paradigm. They first pre-train on large-scale weakly-supervised text pairs using a contrastive loss and then fine-tune on small-scale but high-quality datasets. In this paper, we demonstrate that it is possible to obtain state-of-the-art text embeddings with single-stage training.
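To make the idea of a contrastive loss over text pairs concrete, the following is a minimal sketch, not the objective of any specific method cited above. It assumes an InfoNCE-style loss with in-batch negatives and cosine similarity; the function names, the temperature of 0.05, and the toy 2-dimensional vectors are all illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(queries, docs, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives (assumed formulation).

    queries[i] and docs[i] form a positive pair; every docs[j] with
    j != i acts as a negative for queries[i].
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, d) / temperature for d in docs]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the positive document.
        total += log_z - logits[i]
    return total / len(queries)

# Toy 2-d "embeddings" (hypothetical values, for illustration only).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each query is closest to its own doc
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each query is closest to the wrong doc

loss_aligned = info_nce_loss(queries, aligned)
loss_swapped = info_nce_loss(queries, swapped)
```

In this toy batch the loss is lower when each query sits closest to its paired document than when the pairs are swapped, which is the behavior the training signal relies on.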
Synthetic Data. Synthetic data generation is a widely studied topic in information retrieval research, with various methods proposed to enhance retrieval systems with artificially created data. For instance, Doc2query [33], InPars [2], and Promptagator [8] generate synthetic queries for unlabeled documents, which are then leveraged for document expansion or model training. GPL [45] employs a cross-encoder to produce pseudo-labels for query-document pairs. Similarly, Query2doc [47] generates pseudo-documents for query expansion by few-shot prompting LLMs. Unlike these methods, our approach does not rely on any unlabeled documents or queries and thus can generate more diverse synthetic data.
Another related line of work focuses on knowledge distillation from black-box LLMs by training on synthetic data generated from them. DINO [39] generates synthetic text pairs for semantic textual similarity. Unnatural Instructions [16] is a synthetic instruction-following dataset created by prompting existing LLMs. Orca [29] and Phi [15] propose to train better small language models by using high-quality synthetic data from GPT-3.5/4 [34].
Large Language Models. With the popularization of ChatGPT, large language models (LLMs) have demonstrated remarkable capabilities in instruction following and few-shot in-context learning [4]. However, the most advanced LLMs such as GPT-4 [34] are proprietary, and few technical details about them have been disclosed. To bridge the gap between proprietary and open-source LLMs, several notable efforts have been made, such as the LLaMA-2 [44] and Mistral [19] models. A major limitation of LLMs is that they lack awareness of recent events and private knowledge. This issue can be partly mitigated by augmenting LLMs with information retrieved from external sources, a technique known as retrieval-augmented generation (RAG). On the other hand, LLMs can also serve as foundation models to enhance text embeddings. RepLLaMA [24] proposes to fine-tune LLaMA-2 with a bi-encoder architecture for ad-hoc retrieval. SGPT [27], GTR [32], and Udever [51] demonstrate the scaling law of text embeddings empirically, but their performance still falls behind small bidirectional encoders such as E5 [46] and BGE [48]. In this paper, we present a novel approach to train state-of-the-art text embeddings by exploiting the latest advances of LLMs and synthetic data.
This paper is available on arXiv under the CC0 1.0 DEED license.