Welcome🙌! This is the repository for our paper "SpaDE: Improving Sparse Representations using a Dual Document Encoder for First-stage Retrieval", published at CIKM'22.
Set up your environment with the following commands before reproducing the results.
We have confirmed that the results are reproduced successfully in Python version 3.7.15 and PyTorch version 1.12.1.
git clone https://github.com/eunseongc/SpaDE
cd SpaDE
pip install -r requirements.txt
Please visit https://microsoft.github.io/msmarco/Datasets and https://github.com/DI4IR/SIGIR2021 (for expanded_collection.tsv) to download the data.
You can download training triples (qid, pos pid, neg pid) from here.
(Note that these training triples have the same negatives as the ones provided by MS MARCO, but we rearranged them and split off a validation set.)
Before running the script, please place 1) collection.tsv (or expanded_collection.tsv) and 2) marco_triples.pkl in data/marco-passage/.
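Since a misplaced file only shows up once training starts, it may help to check the layout first. The snippet below is a minimal sketch, not part of this repository: the helper name `find_missing_inputs` is hypothetical, and it assumes only the file names and the data/marco-passage/ directory mentioned above.

```python
from pathlib import Path
from typing import List

# Paths taken from the instructions above; the helper itself is an assumption.
DATA_DIR = Path("data/marco-passage")
TRIPLES_FILE = "marco_triples.pkl"
# Either the original or the expanded collection is accepted.
COLLECTION_FILES = ("collection.tsv", "expanded_collection.tsv")


def find_missing_inputs(data_dir: Path = DATA_DIR) -> List[str]:
    """Return a list of problems; an empty list means the expected files exist."""
    problems = []
    if not (data_dir / TRIPLES_FILE).is_file():
        problems.append("missing {}".format(data_dir / TRIPLES_FILE))
    if not any((data_dir / name).is_file() for name in COLLECTION_FILES):
        problems.append(
            "missing one of {} in {}".format(", ".join(COLLECTION_FILES), data_dir)
        )
    return problems


if __name__ == "__main__":
    issues = find_missing_inputs()
    for issue in issues:
        print(issue)
    if not issues:
        print("All expected input files were found.")
```

Run it from the repository root so that the relative path resolves. The check only confirms that the files exist; it does not validate their contents.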
Run this script to train SpaDE from scratch.
(It took us about 40 hours on a single 3090 Ti GPU when the top 2 tokens were expanded.)
source scripts/run_train.sh 2
To be updated
generate_and_eval.py generates sparse matrices and evaluates them.
Below is an example of usage.
python generate_and_eval.py --path {path_of_model_folder} --num_iter {iteration}
Please cite our paper:
@inproceedings{ChoiLCKSL22,
author = {Eunseong Choi and
Sunkyung Lee and
Minjin Choi and
Hyeseon Ko and
Young{-}In Song and
Jongwuk Lee},
title = {SpaDE: Improving Sparse Representations using a Dual Document Encoder
for First-stage Retrieval},
booktitle = {Proceedings of the 31st {ACM} International Conference on Information
{\&} Knowledge Management, Atlanta, GA, USA, October 17-21, 2022},
pages = {272--282},
publisher = {{ACM}},
year = {2022},
}