Improving Text Embeddings with Large Language Models: Prompts for Synthetic Data Generation

10 Oct 2024

Authors:

(1) Liang Wang, Microsoft Corporation, and correspondence to (wangliang@microsoft.com);

(2) Nan Yang, Microsoft Corporation, and correspondence to (nanya@microsoft.com);

(3) Xiaolong Huang, Microsoft Corporation;

(4) Linjun Yang, Microsoft Corporation;

(5) Rangan Majumder, Microsoft Corporation;

(6) Furu Wei, Microsoft Corporation, and correspondence to (fuwei@microsoft.com).

Abstract and 1 Introduction

2 Related Work

3 Method

3.1 Synthetic Data Generation

3.2 Training

4 Experiments

4.1 Statistics of the Synthetic Data

4.2 Model Fine-tuning and Evaluation

4.3 Main Results

4.4 Multilingual Retrieval

5 Analysis

5.1 Is Contrastive Pre-training Necessary?

5.2 Extending to Long Text Embeddings and 5.3 Analysis of Training Hyperparameters

6 Conclusion and References

A Implementation Details

B Test Set Contamination Analysis

C Prompts for Synthetic Data Generation

D Instructions for Training and Evaluation

C Prompts for Synthetic Data Generation

For asymmetric tasks, we list the four prompt templates in Tables 7, 8, 9, and 10. For symmetric tasks, the prompt templates are available in Tables 11 and 12. To generate multilingual data, we sample the value of “{language}” from the language list of XLM-R [7], with higher probability for high-resource languages. When prompting GPT-4/3.5, we set the temperature to 1.0 and the top-p to 1.0, which is higher than the default setting, to encourage more diversity.
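The language sampling step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the language subset, the weights, the placeholder template, and the helper names `sample_language` and `build_request` are all assumptions made for illustration. Only the "{language}" placeholder and the temperature and top-p values of 1.0 come from the text above.

```python
import random

# Hypothetical subset of languages with illustrative weights. The paper samples
# from the XLM-R language list and favors high-resource languages, but the
# actual list and probabilities are not given in this section.
LANGUAGE_WEIGHTS = {
    "English": 10.0,
    "Chinese": 5.0,
    "Spanish": 5.0,
    "Swahili": 1.0,
    "Icelandic": 1.0,
}

# Placeholder template, NOT one of the paper's prompts (those are in Tables 7 to 12).
TEMPLATE = "Write a short passage in {language}."

# Sampling settings stated in the text.
SAMPLING_PARAMS = {"temperature": 1.0, "top_p": 1.0}


def sample_language(rng):
    """Draw one language, with probability proportional to its weight."""
    names = list(LANGUAGE_WEIGHTS)
    weights = [LANGUAGE_WEIGHTS[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def build_request(rng):
    """Fill the template with a sampled language and attach the sampling settings."""
    language = sample_language(rng)
    return {"prompt": TEMPLATE.format(language=language), **SAMPLING_PARAMS}


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    for _ in range(3):
        print(build_request(rng))
```

The resulting dictionary is only a stand-in for whatever request format the chosen LLM client expects. Weighted sampling keeps low-resource languages in the mix while still drawing most examples from high-resource ones.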

This paper is available on arXiv under the CC0 1.0 DEED license.