Improving Text Embeddings with Large Language Models: Prompts for Synthetic Data Generation

10 Oct 2024

Authors:

(1) Liang Wang, Microsoft Corporation, and correspondence to ([email protected]);

(2) Nan Yang, Microsoft Corporation, and correspondence to ([email protected]);

(3) Xiaolong Huang, Microsoft Corporation;

(4) Linjun Yang, Microsoft Corporation;

(5) Rangan Majumder, Microsoft Corporation;

(6) Furu Wei, Microsoft Corporation, and correspondence to ([email protected]).

Table of Links

Abstract and 1 Introduction

2 Related Work

3 Method

3.1 Synthetic Data Generation

3.2 Training

4 Experiments

4.1 Statistics of the Synthetic Data

4.2 Model Fine-tuning and Evaluation

4.3 Main Results

4.4 Multilingual Retrieval

5 Analysis

5.1 Is Contrastive Pre-training Necessary?

5.2 Extending to Long Text Embeddings and 5.3 Analysis of Training Hyperparameters

6 Conclusion and References

A Implementation Details

B Test Set Contamination Analysis

C Prompts for Synthetic Data Generation

D Instructions for Training and Evaluation

C Prompts for Synthetic Data Generation

For asymmetric tasks, we list the four prompt templates in Tables 7, 8, 9, and 10. For symmetric tasks, the prompt templates are available in Tables 11 and 12. To generate multilingual data, we sample the value of “{language}” from the language list of XLM-R [7], with higher probability for high-resource languages. When prompting GPT-4/3.5, we set the temperature to 1.0 and the top-p to 1.0, which are higher than the default settings, to encourage more diversity.
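The language sampling and decoding settings described above can be illustrated with a minimal sketch. The language subset and the weights below are hypothetical placeholders (the full XLM-R language list and the actual sampling probabilities are not given in this section), and the function names are introduced purely for illustration. Only the "{language}" placeholder and the temperature and top-p values come from the text. No API call is made; the sketch just assembles the request parameters.

```python
import random

# Hypothetical subset of languages with illustrative weights. The paper samples
# from the XLM-R language list; the actual weights are not stated in this section.
LANGUAGE_WEIGHTS = {
    "English": 10.0,
    "Chinese": 5.0,
    "Spanish": 5.0,
    "Swahili": 1.0,
    "Icelandic": 1.0,
}

# Decoding settings stated in the text.
SAMPLING_PARAMS = {"temperature": 1.0, "top_p": 1.0}


def sample_language(rng: random.Random) -> str:
    """Draw one language, favoring entries with larger weights."""
    names = list(LANGUAGE_WEIGHTS)
    weights = [LANGUAGE_WEIGHTS[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def fill_prompt(template: str, rng: random.Random) -> str:
    """Substitute a sampled language for the "{language}" placeholder.

    str.replace is used instead of str.format so that other braces in a
    template (for example, JSON output instructions) are left untouched.
    """
    return template.replace("{language}", sample_language(rng))


def build_request(template: str, rng: random.Random) -> dict:
    """Bundle the filled prompt with the decoding settings."""
    return {"prompt": fill_prompt(template, rng), **SAMPLING_PARAMS}
```

A call such as `build_request("Write a search query in {language}.", random.Random(0))` returns a dictionary holding the filled prompt together with the two decoding settings, which could then be passed to whichever LLM client is in use.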

This paper is available on arXiv under the CC0 1.0 DEED license.
