This is the Docker image for Indri v5.13, conforming to the OSIRRC jig for the Open-Source IR Replicability Challenge (OSIRRC) at SIGIR 2019. This image is available on Docker Hub. The OSIRRC 2019 image library contains a log of successful executions of this image.
- Supported test collections: `robust04`, `gov2`, `core18`
- Supported hooks: `index`, `search`
First, clone the jig and follow its setup instructions; in particular, do not forget to also clone and install `trec_eval` inside the `jig` directory!
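A minimal setup sketch is shown below; the repository URLs are assumptions, and the jig's own README remains the authoritative reference:

```
# Clone the jig and build trec_eval inside it (repository URLs assumed)
git clone https://github.com/osirrc/jig.git
cd jig
git clone https://github.com/usnistgov/trec_eval.git
make -C trec_eval
```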
Once that is done, the following jig command can be used to index TREC disks 4/5 for `robust04`:
```
python3 run.py prepare \
  --repo osirrc2019/indri \
  --tag v0.1.0 \
  --collections robust04=/path/to/disk45=trectext
```
The `core18` indexing command is very similar:
```
python3 run.py prepare \
  --repo osirrc2019/indri \
  --collections core18=/path/to/WashingtonPost.v2=trectext
```
For `gov2`, we have to change the corpus format to `trecweb` (note: format `trectext` will not throw an error, but it will lead to a poorly performing index, as HTML documents are not handled correctly):
```
python3 run.py prepare \
  --repo osirrc2019/indri \
  --collections gov2=/path/to/gov2=trecweb
```
The following jig command can be used to perform a retrieval run on the `robust04` test collection:
```
python3 run.py search \
  --repo osirrc2019/indri \
  --tag v0.1.0 \
  --output out/indri \
  --qrels qrels/qrels.robust04.txt \
  --topic topics/topics.robust04.txt \
  --collection robust04 \
  --top_k 1000 \
  --opts out_file_name="robust.dir1000" rule="method:dirichlet,mu:1000" topic_type="title"
```
For `core18` use:
```
python3 run.py search \
  --repo osirrc2019/indri \
  --output out/indri \
  --qrels qrels/qrels.core18.txt \
  --topic topics/topics.core18.txt \
  --collection core18 \
  --top_k 1000 \
  --opts out_file_name="core18.dir1000" rule="method:dirichlet,mu:1000" topic_type="title"
```
For `gov2` use:
```
python3 run.py search \
  --repo osirrc2019/indri \
  --output out/indri \
  --qrels qrels/qrels.701-850.txt \
  --topic topics/topics.701-850.txt \
  --collection gov2 \
  --top_k 1000 \
  --opts out_file_name="gov2.dir1000" rule="method:dirichlet,mu:1000" topic_type="title"
```
The option `--opts` has a `rule` entry to determine the retrieval method, a `topic_type` entry to determine the TREC topic type (either `title`, `desc`, `narr`, or a combination, e.g. `title+desc`), a `use_prf` entry to determine the use of pseudo-relevance feedback (`"1"` to use PRF, anything else to not use it), and an `sd` entry to switch the sequential dependence model on (`"1"`) or off (`sd` has no effect on `tfidf` or `okapi`). Both `sd` and `use_prf` are optional parameters; if they are not provided, they are switched off by default.
Valid retrieval `rule` entries can be found in the Indri documentation. Importantly, the `tfidf` and `okapi` methods also work in the `rule` entry (i.e. no separate baseline entry is necessary). Note that there are default parameter settings for all of the retrieval rules; if only the method name, but no further information, is provided, the defaults are used.
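For example, the two `--opts` fragments below (to be appended to the search commands above) both request Dirichlet smoothing; the first relies on Indri's defaults, while the second sets `mu` explicitly (the value 2500 is purely illustrative):

```
--opts out_file_name="robust.dir.default" rule="method:dirichlet" topic_type="title"
--opts out_file_name="robust.dir2500" rule="method:dirichlet,mu:2500" topic_type="title"
```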
While a retrieval rule can be specified, this option is not available for PRF at the moment (check searchRobust04.sh to make changes); instead, a set of default settings is used: 50 feedback documents, 25 feedback terms, and equal weighting between the original and the expanded query.
- The created index is stemmed (Krovetz), with stopwords removed. The Lemur project stopword list was used; it contains 418 stopwords.
- The pseudo-relevance feedback setting is hardcoded in search.sh: 50 `fbDocs`, 25 `fbTerms`, 0.5 `fbOrigWeight`.
- The sequential dependence model setting is hardcoded in topicFormatting.pl: 0.9 original query, 0.05 bigram, 0.05 unordered window (see the sketch after this list).
- Only standard stopword removal is applied to the topics; this means that phrases like *Find relevant documents about* or *Relevant are documents that ...* in the TREC description and TREC narrative remain in the topic after processing.
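To make the sequential dependence model setting concrete, the sketch below shows roughly what a title query such as *international organized crime* expands to in the Indri query language with the 0.9/0.05/0.05 weights above; the specific window operators (`#1`, `#uw8`) are an assumption, as topicFormatting.pl may emit slightly different operators:

```
# Hypothetical SDM expansion of a title query (operators #1/#uw8 assumed)
SDM_QUERY='#weight(
  0.9  #combine( international organized crime )
  0.05 #combine( #1(international organized) #1(organized crime) )
  0.05 #combine( #uw8(international organized) #uw8(organized crime) )
)'
```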
The following table contains examples of `--opts` and the expected retrieval effectiveness in MAP, P@30, P@10, and NDCG@20. These metrics may seem arbitrary; they were chosen to enable comparisons (at least on paper) with published works on the same corpus.
The results below are computed based on the 250 topics of `robust04`.
| Run | MAP | P@30 | P@10 | NDCG@20 |
|---|---|---|---|---|
| 1️⃣ `--opts out_file_name="robust.dir1000.title" rule="method:dirichlet,mu:1000" topic_type="title"` | 0.2499 | 0.3100 | 0.4253 | 0.4201 |
| 2️⃣ `--opts out_file_name="robust.dir1000.title.sd" rule="method:dirichlet,mu:1000" topic_type="title" sd="1"` | 0.2547 | 0.3146 | 0.4329 | 0.4232 |
| 3️⃣ `--opts out_file_name="robust.dir1000.title.prf" rule="method:dirichlet,mu:1000" topic_type="title" use_prf="1"` | 0.2812 | 0.3248 | 0.4386 | 0.4276 |
| 4️⃣ `--opts out_file_name="robust.dir1000.title.prf.sd" rule="method:dirichlet,mu:1000" topic_type="title" use_prf="1" sd="1"` | 0.2855 | 0.3295 | 0.4474 | 0.4298 |
| 5️⃣ `--opts out_file_name="robust.jm0.5.title" rule="method:linear,collectionLambda:0.5" topic_type="title"` | 0.2242 | 0.2839 | 0.3819 | 0.3689 |
| 6️⃣ `--opts out_file_name="robust.bm25.title" rule="okapi,k1:1.2,b:0.75" topic_type="title"` | 0.2338 | 0.2995 | 0.4181 | 0.4041 |
| 7️⃣ `--opts out_file_name="robust.bm25.title.prf" rule="okapi,k1:1.2,b:0.75" topic_type="title" use_prf="1"` | 0.2563 | 0.3041 | 0.4012 | 0.3995 |
| 8️⃣ `--opts out_file_name="robust.bm25.title+desc" rule="okapi,k1:1.2,b:0.75" topic_type="title+desc"` | 0.2702 | 0.3274 | 0.4618 | 0.4517 |
| 9️⃣ `--opts out_file_name="robust.dir1000.title+desc.prf.sd" rule="method:dirichlet,mu:1000" topic_type="title+desc" use_prf="1" sd="1"` | 0.2971 | 0.3562 | 0.4550 | 0.4448 |
| 🔟 `--opts out_file_name="robust.dir1000.desc" rule="method:dirichlet,mu:1000" topic_type="desc"` | 0.2023 | 0.2581 | 0.3703 | 0.3635 |
How do we fare compared to Indri results reported in papers? Here, we consider the Robust04 numbers reported by researchers working at the institute developing/maintaining Indri:
- *Query Reformulation Using Anchor Text* contains a language modeling experiment on Robust04 that reports a P@10 of 0.395 for an unstemmed index (and 246 topics instead of 249). Similar setup to run 1️⃣.
- *Learning to Reweight Terms with Distributed Representations* offers several baselines with a Krovetz-stemmed Robust04 index with stopwords removed. The `BOW` model (language modeling with Dirichlet smoothing, similar to 1️⃣) achieves a MAP of 0.2512; the `SD` model (language modeling with sequential dependence, similar to 2️⃣) achieves a MAP of 0.2643. For BM25 (similar to 6️⃣), the authors report a MAP of 0.2460.
- *Deeper Text Understanding for IR with Contextual Neural Language Modeling* does not contain many details but reports an NDCG@20 of 0.417 for title queries (language modeling, similar to 1️⃣) and 0.409 for description queries (language modeling, similar to 🔟). It is not entirely clear how the description topics were processed. Slightly higher values are reported for the `SD` model: 0.427 for title (similar to run 2️⃣) as well as description queries.
- *Effective Query Formulation with Multiple Information Sources* reports an NDCG@20 of 0.4178 (title queries, similar to 2️⃣) and 0.4085 (description queries) for the sequential dependence model.
Conclusion: the sequential dependence model outperforms the standard language modeling approach, though the absolute differences we observe are smaller than those reported above. BM25 performs worse than expected; this is likely a hyperparameter issue (we did not tune and simply used the default values). The biggest difference can be found in the results involving TREC topic descriptions, which points to a preprocessing issue (specific stopword phrases in description topics are not removed).
The results below are computed based on the 150 topics (`topics/topics.701-850.txt`) available for `gov2`.
| Run | MAP | P@30 | P@10 | NDCG@20 |
|---|---|---|---|---|
| 1️⃣ `--opts out_file_name="gov2.dir1000.title" rule="method:dirichlet,mu:1000" topic_type="title"` | 0.2800 | 0.4741 | 0.5174 | 0.4353 |
| 2️⃣ `--opts out_file_name="gov2.dir1000.title.sd" rule="method:dirichlet,mu:1000" topic_type="title" sd="1"` | 0.2904 | 0.4899 | 0.5463 | 0.4507 |
| 3️⃣ `--opts out_file_name="gov2.dir1000.title.prf" rule="method:dirichlet,mu:1000" topic_type="title" use_prf="1"` | 0.3033 | 0.4833 | 0.5248 | 0.4276 |
| 4️⃣ `--opts out_file_name="gov2.dir1000.title.prf.sd" rule="method:dirichlet,mu:1000" topic_type="title" use_prf="1" sd="1"` | 0.3104 | 0.4863 | 0.5228 | 0.4280 |
| 5️⃣ `--opts out_file_name="gov2.jm0.5.title" rule="method:linear,collectionLambda:0.5" topic_type="title"` | 0.1808 | 0.3086 | 0.3161 | 0.2604 |
| 6️⃣ `--opts out_file_name="gov2.bm25.title" rule="okapi,k1:1.2,b:0.75" topic_type="title"` | 0.2485 | 0.4486 | 0.4973 | 0.4099 |
| 7️⃣ `--opts out_file_name="gov2.bm25.title.prf" rule="okapi,k1:1.2,b:0.75" topic_type="title" use_prf="1"` | 0.2565 | 0.4554 | 0.5000 | 0.3987 |
| 8️⃣ `--opts out_file_name="gov2.bm25.title+desc" rule="okapi,k1:1.2,b:0.75" topic_type="title+desc"` | 0.2705 | 0.4749 | 0.5275 | 0.4441 |
| 9️⃣ `--opts out_file_name="gov2.bm25.title+desc.prf.sd" rule="method:dirichlet,mu:1000" topic_type="title+desc" use_prf="1" sd="1"` | 0.2621 | 0.4483 | 0.4852 | 0.4040 |
| 🔟 `--opts out_file_name="gov2.dir1000.desc" rule="method:dirichlet,mu:1000" topic_type="desc"` | 0.1336 | 0.3244 | 0.3658 | 0.3096 |
The results below are computed based on the 50 topics (`topics/topics.core18.txt`) available for `core18`.
| Run | MAP | P@30 | P@10 | NDCG@20 |
|---|---|---|---|---|
| 1️⃣ `--opts out_file_name="core18.dir1000.title" rule="method:dirichlet,mu:1000" topic_type="title"` | 0.2332 | 0.3260 | 0.4420 | 0.3825 |
| 2️⃣ `--opts out_file_name="core18.dir1000.title.sd" rule="method:dirichlet,mu:1000" topic_type="title" sd="1"` | 0.2428 | 0.3360 | 0.4660 | 0.3970 |
| 3️⃣ `--opts out_file_name="core18.dir1000.title.prf" rule="method:dirichlet,mu:1000" topic_type="title" use_prf="1"` | 0.2800 | 0.3793 | 0.4720 | 0.4219 |
| 4️⃣ `--opts out_file_name="core18.dir1000.title.prf.sd" rule="method:dirichlet,mu:1000" topic_type="title" use_prf="1" sd="1"` | 0.2816 | 0.3887 | 0.4720 | 0.4190 |
| 5️⃣ `--opts out_file_name="core18.jm0.5.title" rule="method:linear,collectionLambda:0.5" topic_type="title"` | 0.2055 | 0.2920 | 0.3780 | 0.3349 |
| 6️⃣ `--opts out_file_name="core18.bm25.title" rule="okapi,k1:1.2,b:0.75" topic_type="title"` | 0.2281 | 0.3280 | 0.4180 | 0.3718 |
| 7️⃣ `--opts out_file_name="core18.bm25.title.prf" rule="okapi,k1:1.2,b:0.75" topic_type="title" use_prf="1"` | 0.2699 | 0.3793 | 0.4580 | 0.3920 |
| 8️⃣ `--opts out_file_name="core18.bm25.title+desc" rule="okapi,k1:1.2,b:0.75" topic_type="title+desc"` | 0.2457 | 0.3393 | 0.4600 | 0.3937 |
| 9️⃣ `--opts out_file_name="core18.bm25.title+desc.prf.sd" rule="method:dirichlet,mu:1000" topic_type="title+desc" use_prf="1" sd="1"` | 0.2905 | 0.3967 | 0.4740 | 0.4129 |
| 🔟 `--opts out_file_name="core18.dir1000.desc" rule="method:dirichlet,mu:1000" topic_type="desc"` | 0.1674 | 0.2587 | 0.3400 | 0.2957 |
The following is a quick breakdown of what happens in each of the scripts in this repo.
The `Dockerfile` installs dependencies (`python3`, etc.), copies scripts to the root dir, and sets the working dir to `/work`. It also installs Indri.
The `robust04` corpus is difficult to handle for Indri: the original version is `.z` compressed, a compression format Indri cannot handle. Thus, indexRobust04.sh first uncompresses the collection, filters out undesired folders (such as `cr`), downloads Lemur's stopword list, creates a parameter file for indexing, and finally calls Indri's `IndriBuildIndex`.
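The parameter file handed to `IndriBuildIndex` follows Indri's standard XML parameter format; the sketch below uses placeholder paths and omits the stopword list (which the script adds as a `<stopper>` block), so the exact file written by indexRobust04.sh will differ:

```
# Hypothetical indexing step (paths are placeholders)
cat > build_index.param <<'EOF'
<parameters>
  <index>/work/index</index>
  <corpus>
    <path>/work/collection</path>
    <class>trectext</class>
  </corpus>
  <stemmer><name>krovetz</name></stemmer>
</parameters>
EOF
IndriBuildIndex build_index.param
```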
The `core18` corpus is provided in JSON format and first needs to be converted into a format Indri can ingest; the nodejs script parseCore18.js does the job.
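The conversion target is Indri's TRECTEXT document layout; roughly, each Washington Post JSON article becomes a document of the shape sketched below (the exact tag set emitted by parseCore18.js is an assumption, and the identifier and text are placeholders):

```
# Placeholder example of one converted document
cat >> core18.trectext <<'EOF'
<DOC>
<DOCNO>example-docid-0001</DOCNO>
<TEXT>
Title and body text of the article go here ...
</TEXT>
</DOC>
EOF
```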
In contrast, the format of `gov2` is well suited for Indri; apart from creating a parameter file and calling Indri's `IndriBuildIndex`, not much happens.
The `robust04` and `gov2` topics are in "classic" TREC format, a format Indri cannot handle. Thus, the TREC topic file first has to be converted into a format Indri can parse (done in topicFormatting.pl). Note: different retrieval methods require differently formatted Indri topic files. A parameter file is created and finally `IndriRunQuery` is executed.
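Roughly, the generated retrieval parameter file looks as follows; the paths, the topic number, and the exact set of parameters written by the script are assumptions:

```
# Hypothetical retrieval step (placeholders throughout)
cat > run_queries.param <<'EOF'
<parameters>
  <index>/work/index</index>
  <count>1000</count>
  <trecFormat>true</trecFormat>
  <rule>method:dirichlet,mu:1000</rule>
  <query>
    <number>301</number>
    <text>international organized crime</text>
  </query>
</parameters>
EOF
IndriRunQuery run_queries.param > run.txt
```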
The `core18` topics are very similar to `robust04`/`gov2` but required some additional processing (`core18` topics contain closing tags that are not present in the `robust04` and `gov2` topic files).
- Documentation reviewed at commit `b10184e` (2019-06-16) by Ryan Clancy.