This is the implementation of ANCE-Tele, introduced in the EMNLP 2022 Main Conference paper "Reduce Catastrophic Forgetting of Dense Retrieval Training with Teleportation Negatives". If you find this work useful, please cite our paper and give our repo a star. Thanks!
@inproceedings{sun2022ancetele,
title={Reduce Catastrophic Forgetting of Dense Retrieval Training with Teleportation Negatives},
author={Sun, Si and Xiong, Chenyan and Yu, Yue and Overwijk, Arnold and Liu, Zhiyuan and Bao, Jie},
booktitle={Proceedings of EMNLP 2022},
year={2022}
}
[2023/9/26] We release a new ANCE-Tele checkpoint and its Marco & BEIR evaluation results. Specific results for 18 BEIR datasets are shown on HuggingFace: OpenMatch/ance-tele_coco-base_msmarco_qry-psg-encoder.
Model | Pretrain Model | Train w/ Marco Title | Marco Dev MRR@10 | BEIR Avg NDCG@10 |
---|---|---|---|---|
ANCE-Tele | cocodr-base | w/o | 37.3 | 44.2 |
[2023/4/13] We update our ongoing work "Rethinking Few-shot Ability in Dense Retrieval" in this repository. Please switch to branch 'FewDR' and check the folder ANCE-Tele/plugins/FewDR.
ANCE-Tele is a simple and efficient DR training method that introduces teleportation (momentum and lookahead) negatives to smooth the learning process, improving training stability and convergence speed and reducing catastrophic forgetting.
On web search and OpenQA, ANCE-Tele is competitive with systems that use significantly more (50x) parameters, and it eliminates the dependency on additional negatives (e.g., BM25, other DR systems), filtering strategies, and distillation modules. You can easily reproduce ANCE-Tele in about one day with a single A100. (A 2080 Ti also works, just with more time.)
Let's begin!
ANCE-Tele is tested on Python 3.8+, PyTorch 1.8+, and CUDA Version 10.2/11.1.
(1) Create a new Anaconda environment:
conda create --name ancetele python=3.8
conda activate ancetele
(2) Install the following packages using Pip or Conda under this environment:
transformers==4.9.2
datasets==1.11.0
pytorch==1.8.0
faiss-gpu==1.7.2
## faiss-gpu depends on the CUDA version
## conda install faiss-gpu cudatoolkit=11.1
## conda install faiss-gpu cudatoolkit=10.2
openjdk==11
pyserini ## pyserini depends on openjdk
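After installation, a quick sanity check (a minimal sketch) confirms that PyTorch sees your GPUs and that Faiss and Transformers import correctly with the versions listed above:

```python
# sanity_check.py -- verify the environment described above (sketch).
import torch
import transformers
import faiss  # faiss-gpu build

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("faiss GPUs visible:", faiss.get_num_gpus())  # > 0 if the GPU build is working
```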
(1) Download Dataset
Download Link | Size |
---|---|
msmarco.tar.gz | ~1.0G |
(2) Uncompress Dataset
Run the command tar -zxvf msmarco.tar.gz. The uncompressed folder contains the following files:
msmarco
├── corpus.tsv          # <TSV> psg_id \t psg_title \t psg
├── train.query.txt     # <TSV> qry_id \t qry
├── qrels.train.tsv     # <TSV> qry_id \t 0 \t pos_psg_id \t 1
├── dev.query.txt       # <TSV> qry_id \t qry
└── qrels.dev.small.tsv # <TSV> qry_id \t 0 \t pos_psg_id \t 1
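If you want to peek at the data, here is a minimal sketch that reads the files according to the formats above (paths assume the uncompressed msmarco folder):

```python
# corpus.tsv: psg_id \t psg_title \t psg
with open("msmarco/corpus.tsv") as f:
    psg_id, title, passage = next(f).rstrip("\n").split("\t", 2)
    print(psg_id, title[:30], passage[:60])

# qrels.train.tsv: qry_id \t 0 \t pos_psg_id \t 1
positives = {}
with open("msmarco/qrels.train.tsv") as f:
    for line in f:
        qry_id, _, pos_psg_id, _ = line.rstrip("\n").split("\t")
        positives.setdefault(qry_id, set()).add(pos_psg_id)
print("training queries with at least one positive:", len(positives))
```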
(1) Tokenize Dataset
Enter the folder ANCE-Tele/shells and run the shell script:
bash tokenize_msmarco.sh
(1) Download our CheckP from HuggingFace:
Download Link | Size | Dev MRR@10 |
---|---|---|
ance-tele_msmarco_qry-psg-encoder | ~438M | 39.1 |
(2) Encode & search MS MARCO using our CheckP:
bash infer_msmarco.sh
P.S. We support multi-GPU encoding of the MARCO corpus, which is split into ten files (split00-split09). Multi-GPU encoding only supports using 1, 2, or 5 GPUs at the same time, e.g., setting ENCODE_CUDA="0,1".
ANCE-Tele supports Faiss-GPU search but requires sufficient CUDA memory. In our experience: MS MARCO needs >= 1x A100, 2x 3090/V100, or 4x 2080 Ti; NQ/TriviaQA needs >= 2x A100, 4x 3090/V100, or 8x 2080 Ti, e.g., setting SEARCH_CUDA="0,1,2,3".
If your CUDA memory is not enough, you can use split search: set --sub_split_num 5 (sub_split_num can be 1/2/5/10). You can also use CPU search: remove --use_gpu and set --batch_size -1.
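For reference, the sketch below shows the generic Faiss pattern behind these options: exact inner-product search, moved onto all visible GPUs, with queries processed in chunks so memory stays bounded. The random embeddings are stand-ins; the real pipeline builds them with the encoders above.

```python
import numpy as np
import faiss

dim, n_psg, n_qry, topk = 768, 100_000, 1_000, 1000
psg_emb = np.random.rand(n_psg, dim).astype("float32")   # stand-in passage embeddings
qry_emb = np.random.rand(n_qry, dim).astype("float32")   # stand-in query embeddings

index = faiss.IndexFlatIP(dim)                   # exact inner-product (dot-product) search
if faiss.get_num_gpus() > 0:
    index = faiss.index_cpu_to_all_gpus(index)   # GPU search, e.g. SEARCH_CUDA="0,1,2,3"
index.add(psg_emb)

# Search queries in chunks (analogous in spirit to --sub_split_num / --batch_size).
scores, ids = [], []
for chunk in np.array_split(qry_emb, 10):
    s, i = index.search(chunk, topk)
    scores.append(s); ids.append(i)
scores, ids = np.concatenate(scores), np.concatenate(ids)
print(ids.shape)  # (n_qry, topk)
```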
(1) Download vanilla pre-trained model & our Epi-3 training negatives:
Download Link | Size |
---|---|
co-condenser-marco | ~473M |
ance-tele_msmarco_tokenized-train-data.tar.gz | ~8.8G |
(2) Uncompress our Epi-3 training negatives:
Run the command tar -zxvf ance-tele_msmarco_tokenized-train-data.tar.gz. The uncompressed folder contains 12 sub-files {split00-11.hn.json}. The format of each file is as follows:
{
"query": [train-query tokenized ids],
"positives": [[positive-passage-1 tokenized ids], [positive-passage-2 tokenized ids], ...],
"negatives": [[negative-passage-1 tokenized ids], [negative-passage-2 tokenized ids], ...],
}
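To inspect what the model actually trains on, here is a minimal sketch that reads one example from a split and decodes the token ids back to text. It assumes one JSON object per line, and the tokenizer path is an assumption: point it at your local copy of co-condenser-marco (a BERT-style tokenizer).

```python
import json
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("co-condenser-marco")  # path is an assumption

with open("split00.hn.json") as f:
    example = json.loads(next(f))  # assumes one JSON object per line

print("query   :", tok.decode(example["query"]))
print("positive:", tok.decode(example["positives"][0])[:120])
print("negative:", tok.decode(example["negatives"][0])[:120])
print("#positives:", len(example["positives"]), "#negatives:", len(example["negatives"]))
```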
(3) Train ANCE-Tele using our Epi-3 training negatives
bash train_ance-tele_msmarco.sh
P.S. Multi-GPU training is supported. Please keep the following hyperparameters unchanged and set --negatives_x_device when using a multi-GPU setup.
Hyperparameters | Arguments | Single GPU | E.g., Two GPUs |
---|---|---|---|
Qry Batch Size | --per_device_train_batch_size | 8 | 4 |
(Positive + Negative) Passages per Qry | --train_n_passages | 32 | 32 |
Learning rate | --learning_rate | 5e-6 | 5e-6 |
Total training epochs | --num_train_epochs | 3 | 3 |
If your CUDA memory is limited, please use the Gradient Caching technique. Set the following arguments during training:
--grad_cache \
--gc_q_chunk_size 4 \
--gc_p_chunk_size 8 \
## Split a batch of queries into chunks of size gc_q_chunk_size
## Split a batch of passages into chunks of size gc_p_chunk_size
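For intuition, here is a minimal, self-contained sketch of the gradient-caching idea, with toy linear encoders and a plain InfoNCE-style loss. It mirrors what --grad_cache does conceptually; it is not the repo's implementation, and the chunk sizes play the role of --gc_q_chunk_size / --gc_p_chunk_size.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
q_enc = torch.nn.Linear(16, 8)   # toy query encoder
p_enc = torch.nn.Linear(16, 8)   # toy passage encoder

def info_nce(q_reps, p_reps, n_psg_per_qry):
    scores = q_reps @ p_reps.T                              # (n_qry, n_qry * n_psg)
    target = torch.arange(q_reps.size(0)) * n_psg_per_qry   # positive is first psg per query
    return F.cross_entropy(scores, target)

queries  = torch.randn(8, 16)       # batch of 8 queries
passages = torch.randn(8 * 4, 16)   # 4 passages per query (1 positive + 3 negatives)
q_chunk, p_chunk = 4, 8             # cf. --gc_q_chunk_size / --gc_p_chunk_size

# Stage 1: encode everything chunk-by-chunk WITHOUT building a graph.
with torch.no_grad():
    q_reps = torch.cat([q_enc(c) for c in queries.split(q_chunk)])
    p_reps = torch.cat([p_enc(c) for c in passages.split(p_chunk)])

# Stage 2: compute the full-batch loss on the cached reps and cache d(loss)/d(reps).
q_reps.requires_grad_(); p_reps.requires_grad_()
loss = info_nce(q_reps, p_reps, n_psg_per_qry=4)
loss.backward()
q_grads, p_grads = q_reps.grad, p_reps.grad

# Stage 3: re-encode each chunk WITH gradients and backprop the cached rep-gradients.
for c, g in zip(queries.split(q_chunk), q_grads.split(q_chunk)):
    (q_enc(c) * g).sum().backward()
for c, g in zip(passages.split(p_chunk), p_grads.split(p_chunk)):
    (p_enc(c) * g).sum().backward()

print("full-batch loss:", loss.item())   # encoder .grad now holds full-batch gradients
```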
(4) Evaluate your ANCE-Tele
After training for 3 epochs, you can follow the instructions in MARCO: Reproduce w/ Our CheckPs - (2) to evaluate. Remember to replace the CheckP with your trained model file.
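If you want to score a ranking yourself, MRR@10 on the MARCO dev set can be computed directly from the qrels format shown earlier. A minimal sketch (the run dict, mapping each query id to its ranked passage ids, is whatever your search step produced):

```python
def mrr_at_10(run, qrels):
    """run: {qry_id: ranked [psg_id, ...]}; qrels: {qry_id: set of positive psg_ids}."""
    total = 0.0
    for qry_id, ranked in run.items():
        for rank, psg_id in enumerate(ranked[:10], start=1):
            if psg_id in qrels.get(qry_id, set()):
                total += 1.0 / rank
                break
    return total / len(run)

# Toy usage: the positive "p7" is at rank 2, so MRR@10 = 0.5.
print(mrr_at_10({"q1": ["p3", "p7"]}, {"q1": {"p7"}}))
```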
To reproduce ANCE-Tele from scratch (Epi-1->2->3), you only need to prepare the vanilla pretrained model co-condenser-marco.
Quick refreshing strategy. ANCE-Tele adopts a quick refreshing strategy for hard negative mining: Epi-1 and Epi-2 train for only 1/10 of the total training steps (early stop), and only the last Epi-3 goes through the entire training epoch, which significantly reduces the training cost.
Train from scratch. Every training episode adopts train-from-scratch mode: each episode uses the vanilla pretrained model as the initial model, and the only difference is the input training negatives. This makes it convenient to reproduce the results without relying on intermediate episode CheckPs.
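As described in the Epi-1/2/3 steps below, each episode's Tele-negatives combine freshly mined ANN-negatives and Lookahead-negatives with the previous episode's negatives as momentum. A schematic, runnable sketch of that composition (the two mining functions are toy placeholders for the epi-*-mine-*.sh scripts):

```python
# Schematic sketch of how each episode's Tele-negatives are assembled.
def ann_negatives(model):        # placeholder: ANN negatives retrieved with `model`
    return {f"{model}:ann:{i}" for i in range(3)}

def lookahead_negatives(model):  # placeholder: LA-Neg (see the paper for how these are mined)
    return {f"{model}:la:{i}" for i in range(3)}

def tele_negatives(model, momentum_negatives=frozenset()):
    # ANN + lookahead negatives, plus the previous episode's negatives as momentum.
    return ann_negatives(model) | lookahead_negatives(model) | set(momentum_negatives)

epi1 = tele_negatives("co-condenser-marco")                     # Epi-1: ANN + LA-Neg
epi2 = tele_negatives("epi-1-model", momentum_negatives=epi1)   # Epi-2: + Epi-1 negatives
epi3 = tele_negatives("epi-2-model", momentum_negatives=epi2)   # Epi-3: + Epi-2 negatives
print(len(epi1), len(epi2), len(epi3))
```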
(1) Epi-1 Training
First, mine the Tele-negatives using the vanilla co-condenser-marco. In Epi-1, Tele-negatives contain ANN-negatives and Lookahead-negatives (LA-Neg) without Momentum.
bash epi-1-mine-msmarco.sh
Then train the vanilla co-condenser-marco with the Epi-1 Tele-negatives and early-stop at 20k steps in preparation for negative refreshing:
bash epi-1-train-msmarco.sh
(2) Epi-2 Training
For Epi-2, mine Tele-negatives using the Epi-1 trained model. Epi-2 Tele-negatives contain ANN-negatives, Lookahead-negatives (LA-Neg), and Momentum-negatives (Epi-1 training negatives).
bash epi-2-mine-msmarco.sh
Then train the vanilla co-condenser-marco with the Epi-2 Tele-negatives and early-stop at 20k steps in preparation for negative refreshing:
bash epi-2-train-msmarco.sh
(3) Epi-3 Training
For the last Epi-3, mine Tele-negatives using the Epi-2 trained model. Epi-3 Tele-negatives contain ANN-negatives, Lookahead-negatives (LA-Neg), and Momentum-negatives (Epi-2 training negatives).
bash epi-3-mine-msmarco.sh
Then train the vanilla co-condenser-marco with the Epi-3 Tele-negatives. This step is the same as introduced in MARCO: Reproduce w/ Our Episode-3 Training Negatives - (3):
bash epi-3-train-msmarco.sh
(4) Evaluate your ANCE-Tele
After three episodes, you can follow the instructions in MARCO: Reproduce w/ Our CheckPs - (2) to evaluate. Remember to replace the CheckP with your trained model file.
(1) Download Datasets
NQ and TriviaQA use the same Wikipedia-Corpus-Index.
Download Link | Size |
---|---|
nq.tar.gz | ~17.6M |
triviaqa.tar.gz | ~174.6M |
wikipedia-corpus-index.tar.gz | ~12.9G |
(2) Uncompress Datasets
Run the command tar -zxvf xxx.tar.gz. The uncompressed folders contain the following files:
nq
├── nq-train-qrels.jsonl
└── nq-test.jsonl # <DICT> {"qid":xxx, "question":xxx, "answers":[xxx, ...]}
triviaqa
├── triviaqa-train-qrels.jsonl
└── triviaqa-test.jsonl # <DICT> {"qid":xxx, "question":xxx, "answers":[xxx, ...]}
wikipedia-corpus-index
├── psgs_w100.tsv # <TSV> psg_id \t psg \t psg_title
└── index-wikipedia-dpr-20210120-d1b9e6 # Wikipedia index (for pyserini evaluation)
The format of nq/triviaqa-train-qrels.jsonl file is as follows:
{
"qid": xxx,
"question": xxx,
"answers": [xxx, ...]
"positive_ctxs": [xxx, ...],
}
P.S. "positive_ctxs" is a positive passage list. The empty list means that DPR did not provide the oracle-relevant passage for the training query. In such cases, we use the passages containing the answer mined by ANCE-Tele as the "positive" passages during training.
(1) Tokenize Datasets
Enter the folder ANCE-Tele/shells and run the shell scripts:
bash tokenize_nq.sh
bash tokenize_triviaqa.sh
bash tokenize_wikipedia_corpus.sh
(1) Download our CheckPs from HuggingFace:
For NQ and TriviaQA, ANCE-Tele adopts a bi-encoder architecture, the same as DPR, coCondenser, etc. Different GPUs may cause small differences (a few thousandths) in the search results.
Datasets | Qry-Encoder Download Link | Psg-Encoder Download Link | Size | R@5 | R@20 | R@100 |
---|---|---|---|---|---|---|
NQ | ance-tele_nq_qry-encoder | ance-tele_nq_psg-encoder | ~418M x 2 | 77.0 | 84.9 | 89.7 |
TriviaQA | ance-tele_triviaqa_qry-encoder | ance-tele_triviaqa_psg-encoder | ~418M x 2 | 76.9 | 83.4 | 87.3 |
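Before running the full search below, you can sanity-check a downloaded encoder pair by scoring one question against two passages with a dot product over the [CLS] representations. This is only a sketch: CLS pooling follows the usual DPR/coCondenser convention, and the HuggingFace paths with the OpenMatch prefix are an assumption, so point them at wherever you downloaded the encoders.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Paths are assumptions; use your local/downloaded checkpoint directories.
qry_name = "OpenMatch/ance-tele_nq_qry-encoder"
psg_name = "OpenMatch/ance-tele_nq_psg-encoder"
tok = AutoTokenizer.from_pretrained(qry_name)
qry_enc = AutoModel.from_pretrained(qry_name).eval()
psg_enc = AutoModel.from_pretrained(psg_name).eval()

def encode(model, texts):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return model(**batch).last_hidden_state[:, 0]   # [CLS] pooling (assumption)

q = encode(qry_enc, ["who wrote the declaration of independence"])
p = encode(psg_enc, ["Thomas Jefferson was the principal author of the Declaration of Independence.",
                     "The Eiffel Tower is located in Paris."])
print(q @ p.T)   # higher dot product = more relevant
```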
(2) Encode & search NQ/TriviaQA using our CheckPs:
bash infer_nq.sh # NQ
bash infer_triviaqa.sh # TriviaQA
P.S. We support multi-GPU encoding of the Wikipedia corpus, which is split into 20 files (split00-split19). Multi-GPU encoding only supports using 1, 2, or 5 GPUs simultaneously, e.g., setting ENCODE_CUDA="0,1". If your CUDA memory is limited, please see [Faiss Search Notice] for more GPU search details.
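The R@k numbers in the table above are computed DPR-style: a query counts as a hit if any of its top-k retrieved passages contains one of its answer strings. A simplified sketch of this metric (plain lowercase substring matching stands in for the normalization used by the repo's pyserini-based evaluation, which is an assumption on our part):

```python
def recall_at_k(run, answers, k):
    """run: {qid: ranked [passage_text, ...]}; answers: {qid: [answer, ...]}."""
    hits = 0
    for qid, ranked in run.items():
        ans = [a.lower() for a in answers[qid]]
        if any(any(a in psg.lower() for a in ans) for psg in ranked[:k]):
            hits += 1
    return hits / len(run)

# Toy usage: the answer appears in the top-1 passage, so R@1 = 1.0.
run = {"q1": ["The capital of France is Paris.", "Berlin is in Germany."]}
answers = {"q1": ["Paris"]}
print(recall_at_k(run, answers, k=1))
```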
(1) Download vanilla pre-trained model & our Epi-3 training negatives:
Datasets | Download Link | Size |
---|---|---|
Vanilla pre-trained model | co-condenser-wiki | ~419M |
NQ | ance-tele_nq_tokenized-train-data.tar.gz | ~8.8G |
TriviaQA | ance-tele_triviaqa_tokenized-train-data.tar.gz | ~6.8G |
(2) Uncompress our Epi-3 training negatives:
Run the command tar -zxvf xxx. Each uncompressed dataset contains two sub-files {split00-01.hn.json}. The format of each file is as follows:
{
"query": [train-query tokenized ids],
"positives": [[positive-passage-1 tokenized ids], [positive-passage-2 tokenized ids], ...],
"negatives": [[negative-passage-1 tokenized ids], [negative-passage-2 tokenized ids], ...],
}
(3) Train ANCE-Tele using our Epi-3 training negatives
bash train_ance-tele_nq.sh # NQ
bash train_ance-tele_triviaqa.sh # TriviaQA
P.S. Multi-GPU training is supported. Please keep the following hyperparameters unchanged and set --negatives_x_device when using a multi-GPU setup. If your CUDA memory is limited, please use [Gradient Caching].
Hyperparameters | Arguments | Single GPU | E.g., Four GPUs |
---|---|---|---|
Qry Batch Size | --per_device_train_batch_size | 128 | 32 |
(Positive + Negative) Passages per Qry | --train_n_passages | 12 | 12 |
Learning rate | --learning_rate | 5e-6 | 5e-6 |
Total training epochs | --num_train_epochs | 40 | 40 |
After training, the models are saved under the ${train_job_name} folder like:
${train_job_name}
├── query_model # Qry-Encoder
└── passage_model # Psg-Encoder
Before using your model, please copy the three files special_tokens_map.json, tokenizer_config.json, and vocab.txt into the Qry/Psg-Encoder folders:
cp special_tokens_map.json tokenizer_config.json vocab.txt ./query_model
cp special_tokens_map.json tokenizer_config.json vocab.txt ./passage_model
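After the copy step, each encoder folder should load as a standard HuggingFace model. A minimal check (a sketch, assuming the weights and config were written by the training script as shown above):

```python
from transformers import AutoModel, AutoTokenizer

for path in ["./query_model", "./passage_model"]:
    tok = AutoTokenizer.from_pretrained(path)   # needs the copied tokenizer files
    model = AutoModel.from_pretrained(path)
    print(path, "loaded:", model.config.model_type, "| vocab size:", tok.vocab_size)
```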
(4) Evaluate your ANCE-Tele
Then you can follow the instructions in NQ/TriviaQA: Reproduce w/ Our CheckPs - (2) to evaluate. Remember to replace the CheckPs with your trained model files:
export qry_encoder_name=${train_job_name}/query_model
export psg_encoder_name=${train_job_name}/passage_model
To reproduce ANCE-Tele from scratch (Epi-1->2->3), you only need to prepare the vanilla pretrained model co-condenser-wiki. Before starting, please read about ANCE-Tele's quick-refreshing strategy and train-from-scratch mode [Iterative Training Notice].
(1) Epi-1 Training
First, mine the Tele-negatives using the vanilla co-condenser-wiki. In Epi-1, Tele-negatives contain ANN-negatives and Lookahead-negatives (LA-Neg) without Momentum.
bash epi-1-mine-nq.sh # NQ
bash epi-1-mine-triviaqa.sh # TriviaQA
Then train the vanilla co-condenser-wiki with the Epi-1 Tele-negatives and early-stop at 2k steps in preparation for negative refreshing:
bash epi-1-train-nq.sh # NQ
bash epi-1-train-triviaqa.sh # TriviaQA
(2) Epi-2 Training
First, prepare the Epi-1 trained model introduced in [Prepare your ANCE-Tele for NQ and TriviaQA], and then use the prepared CheckPs to mine Epi-2 Tele-negatives, which contain ANN-negatives, Lookahead-negatives (LA-Neg), and Momentum-negatives (Epi-1 training negatives).
bash epi-2-mine-nq.sh # NQ
bash epi-2-mine-triviaqa.sh # TriviaQA
Then train the vanilla co-condenser-wiki with the Epi-2 Tele-negatives and early-stop at 2k steps in preparation for negative refreshing:
bash epi-2-train-nq.sh # NQ
bash epi-2-train-triviaqa.sh # TriviaQA
(3) Epi-3 Training
For the last Epi-3, prepare the Epi-2 trained model introduced in [Prepare your ANCE-Tele for NQ and TriviaQA], and then use the prepared CheckPs to mine Epi-3 Tele-negatives, which contain ANN-negatives, Lookahead-negatives (LA-Neg), and Momentum-negatives (Epi-2 training negatives).
bash epi-3-mine-nq.sh # NQ
bash epi-3-mine-triviaqa.sh # TriviaQA
Then train the vanilla co-condenser-wiki with the Epi-3 Tele-negatives. This step is the same as introduced in NQ/TriviaQA: Reproduce w/ Our Episode-3 Training Negatives - (3):
bash epi-3-train-nq.sh # NQ
bash epi-3-train-triviaqa.sh # TriviaQA
(4) Evaluate your ANCE-Tele
After three training episodes, first Prepare your ANCE-Tele for NQ and TriviaQA. Then you can follow the instructions in NQ/TriviaQA: Reproduce w/ Our CheckPs - (2) to evaluate. Remember to replace the CheckPs with your trained model files:
export qry_encoder_name=${train_job_name}/query_model
export psg_encoder_name=${train_job_name}/passage_model
- Faiss Search Notice: how to use multi-GPU/CPU search.
- Grad Cache Notice: how to save CUDA memory when training ANCE-Tele.
- Iterative Training Notice: ANCE-Tele adopts a quick-refreshing strategy and train-from-scratch mode.
For any questions, feel free to create an issue, and we will try our best to solve it. If the problem is urgent, you can also email me directly.
NAME: Si Sun
EMAIL: [email protected]
ANCE-Tele is implemented and modified based on Tevatron. We thank the authors for open-sourcing their excellent work. We have integrated ANCE-Tele into OpenMatch.