♠SpaDE (CIKM'22)

Welcome🙌! This is the repository for our paper "SpaDE: Improving Sparse Representations using a Dual Document Encoder for First-stage Retrieval", published at CIKM'22.

Set up your environment with the commands below before reproducing our results.
We have confirmed that the results can be reproduced with Python 3.7.15 and PyTorch 1.12.1.
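To compare an existing environment against the tested versions, a small check along these lines may help. This is a minimal sketch and not part of the repository: the helper names `same_release` and `installed_versions`, and the choice to compare only the major and minor version numbers, are assumptions made for illustration.

```python
import sys

# Versions this README reports as tested.
TESTED = {"python": "3.7.15", "torch": "1.12.1"}


def same_release(actual, tested, parts=2):
    """Compare the first `parts` version components, ignoring suffixes like '+cu113'."""
    def head(version):
        return version.split("+")[0].split(".")[:parts]
    return head(actual) == head(tested)


def installed_versions():
    """Return the running Python version and the torch version (None if torch is absent)."""
    found = {"python": ".".join(str(n) for n in sys.version_info[:3])}
    try:
        import torch
        found["torch"] = str(torch.__version__)
    except ImportError:
        found["torch"] = None
    return found


if __name__ == "__main__":
    for name, actual in installed_versions().items():
        ok = actual is not None and same_release(actual, TESTED[name])
        status = "ok" if ok else "differs"
        print(f"{name}: found {actual}, tested with {TESTED[name]} -> {status}")
```

A mismatch does not necessarily mean the code will fail; it only means the setup differs from the one the results were confirmed on.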

Preparing

git clone https://github.com/eunseongc/SpaDE
cd SpaDE
pip install -r requirements.txt

Please visit https://microsoft.github.io/msmarco/Datasets and https://github.com/DI4IR/SIGIR2021 (for expanded_collection.tsv) to download data.

You can download training triples (qid, pos pid, neg pid) from here.
(Note that these training triples use the same negatives as the ones provided by MS MARCO, but we rearranged them and split off a validation set.)
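Before training, it can be useful to look inside the downloaded triples file (marco_triples.pkl). The sketch below makes no claim about the file's internal layout: it only reports the top-level type, the size, and a few sample entries. The helper name `peek_pickle` and the default path are assumptions for illustration.

```python
import pickle
from pathlib import Path


def peek_pickle(path, n=3):
    """Load a pickle and return (type name, length or None, up to n sample items)."""
    with Path(path).open("rb") as f:
        obj = pickle.load(f)  # only unpickle files from a source you trust
    size = len(obj) if hasattr(obj, "__len__") else None
    if isinstance(obj, dict):
        sample = list(obj.items())[:n]
    elif isinstance(obj, (list, tuple)):
        sample = list(obj[:n])
    else:
        sample = []  # unknown container: report type and size only
    return type(obj).__name__, size, sample


if __name__ == "__main__":
    # Assumed location, following the data directory named in this README.
    path = Path("data/marco-passage/marco_triples.pkl")
    if path.exists():
        kind, size, sample = peek_pickle(path)
        print(kind, size)
        for item in sample:
            print(item)
    else:
        print(f"{path} not found")
```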

Before running the script, please place 1) collection.tsv (or expanded_collection.tsv) and 2) marco_triples.pkl in data/marco-passage/.
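A quick way to confirm the files are where the scripts expect them is a check like the following. It is a minimal sketch, not part of the repository; the function name `missing_inputs` is hypothetical, and the file names are taken from the sentence above.

```python
from pathlib import Path

DATA_DIR = Path("data/marco-passage")
COLLECTIONS = ("collection.tsv", "expanded_collection.tsv")  # either one is enough
TRIPLES = "marco_triples.pkl"


def missing_inputs(data_dir=DATA_DIR):
    """Return descriptions of the required input files that are absent from data_dir."""
    data_dir = Path(data_dir)
    missing = []
    if not any((data_dir / name).is_file() for name in COLLECTIONS):
        missing.append(" or ".join(COLLECTIONS))
    if not (data_dir / TRIPLES).is_file():
        missing.append(TRIPLES)
    return missing


if __name__ == "__main__":
    problems = missing_inputs()
    if problems:
        print(f"Missing in {DATA_DIR}: " + "; ".join(problems))
    else:
        print("All expected input files are in place.")
```

Run it from the repository root so that the relative path resolves correctly.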

Training

Run this script to train SpaDE from scratch.
(It took us about 40 hours on a single 3090 Ti GPU when the top 2 tokens were expanded.)

source scripts/run_train.sh 2

Indexing

To be updated

Evaluation

generate_and_eval.py generates sparse matrices and evaluates them.
Below is an example of usage.

python generate_and_eval.py --path {path_of_model_folder} --num_iter {iteration}
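For readers new to sparse retrieval, the toy sketch below illustrates the general idea behind scoring with sparse representations: each query and document is a sparse vector of term weights, and documents are ranked by inner product with the query. It is not the code in generate_and_eval.py; the function names, the dict-based vectors, and the weights are all assumptions made up for illustration.

```python
def sparse_dot(query, doc):
    """Inner product of two sparse vectors stored as {term: weight} dicts."""
    if len(query) > len(doc):
        query, doc = doc, query  # iterate over the shorter vector
    return sum(w * doc[t] for t, w in query.items() if t in doc)


def rank(query, docs, k=10):
    """Return the ids of the top-k documents by score, highest first (ties by id)."""
    scores = {pid: sparse_dot(query, vec) for pid, vec in docs.items()}
    return sorted(scores, key=lambda pid: (-scores[pid], pid))[:k]


# Toy data: the weights are invented for this example.
docs = {
    "d1": {"sparse": 1.2, "retrieval": 0.8},
    "d2": {"dense": 1.0, "retrieval": 0.5},
    "d3": {"cooking": 2.0},
}
query = {"sparse": 1.0, "retrieval": 1.0}

if __name__ == "__main__":
    print(rank(query, docs, k=2))
```

At collection scale the same computation is normally done with sparse matrix products or an inverted index rather than Python dicts.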

Citation

Please cite our paper:

@inproceedings{ChoiLCKSL22,
  author    = {Eunseong Choi and
               Sunkyung Lee and
               Minjin Choi and
               Hyeseon Ko and
               Young{-}In Song and
               Jongwuk Lee},
  title     = {SpaDE: Improving Sparse Representations using a Dual Document Encoder
               for First-stage Retrieval},
  booktitle = {Proceedings of the 31st {ACM} International Conference on Information
               {\&} Knowledge Management, Atlanta, GA, USA, October 17-21, 2022},
  pages     = {272--282},
  publisher = {{ACM}},
  year      = {2022},
}
