# Data processing for NeuSum

This repo contains the code to generate the training data (CNN / Daily Mail) needed by NeuSum.
- Preprocess the CNN/DM dataset using abisee's scripts: https://github.com/abisee/cnn-dailymail
- Convert its output to the format shown in the `sample_data` folder (a small format sketch follows this list). The format of the files:
  - `train.txt.src` contains the input documents. Each line holds the tokenized sentences of one document, delimited by `##SENT##`.
  - `train.txt.tgt` contains the reference summaries. Each line holds the tokenized summary sentences of the corresponding document, delimited by `##SENT##`.
- Use `find_oracle.py` to search for the best sentences to be extracted. The arguments of the `main` function are `document_file`, `summary_file` and `output_path`.
- Next, build the ROUGE score gain file using `get_mmr_regression_gain.py`. Its usage is documented in the script's entry point. (A conceptual sketch of the ROUGE gain idea follows this list.)
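
For reference, here is a minimal sketch of how one `train.txt.src` / `train.txt.tgt` line pair can be produced. The helper name, the exact spacing around the `##SENT##` delimiter, and the toy sentences are illustrative assumptions, not code from this repo.

```python
# Minimal sketch of the expected line format: one document per line in
# train.txt.src and its summary on the same line of train.txt.tgt, with
# sentences joined by the ##SENT## delimiter. The spacing around the
# delimiter is an assumption; check sample_data for the exact format.
SENT_DELIM = " ##SENT## "

def write_pair(doc_sents, summary_sents, src_fh, tgt_fh):
    """Write one tokenized document / summary pair as a single line each."""
    src_fh.write(SENT_DELIM.join(doc_sents) + "\n")
    tgt_fh.write(SENT_DELIM.join(summary_sents) + "\n")

if __name__ == "__main__":
    doc = ["the cat sat on the mat .", "it was a sunny day ."]
    summary = ["a cat enjoyed a sunny day ."]
    with open("train.txt.src", "w") as src, open("train.txt.tgt", "w") as tgt:
        write_pair(doc, summary, src, tgt)
```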
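
For intuition only, the "ROUGE score gain" presumably measures how much adding one more sentence would improve the ROUGE score of the sentences selected so far. The sketch below illustrates that idea with a placeholder `rouge` scorer; it is an assumption about the concept, not the actual logic of `get_mmr_regression_gain.py`.

```python
# Conceptual sketch of per-sentence ROUGE gain (an assumption about the idea,
# not the contents of get_mmr_regression_gain.py).
def rouge(selected_sentences, reference_summary):
    """Placeholder: return a ROUGE score for the selected sentences."""
    raise NotImplementedError

def rouge_gains(doc_sents, selected, reference_summary):
    """ROUGE gain of adding each not-yet-selected sentence to `selected`."""
    base = rouge(selected, reference_summary) if selected else 0.0
    return [rouge(selected + [s], reference_summary) - base
            for s in doc_sents if s not in selected]
```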
The oracle search is brute-force and can be slow in some cases, so running it in parallel is recommended (this is what I did in my experiments). I have recently modified `find_oracle.py` to use multiprocessing so that it is easier to run in parallel; please check out `find_oracle_para.py`.
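
As an illustration of the parallelization idea only (not the actual contents of `find_oracle_para.py`), the document/summary pairs can be split across worker processes with `multiprocessing.Pool`; `oracle_for_pair` below is a hypothetical stand-in for the per-document brute-force search.

```python
# Illustrative sketch of parallelizing the oracle search, not find_oracle_para.py.
# oracle_for_pair is a hypothetical stand-in for the per-document search.
from multiprocessing import Pool

def oracle_for_pair(pair):
    doc_line, summary_line = pair
    doc_sents = doc_line.strip().split("##SENT##")
    summary_sents = summary_line.strip().split("##SENT##")
    # ... run the brute-force ROUGE-based search for this document ...
    return []  # placeholder for the selected sentence indices

if __name__ == "__main__":
    with open("sample_data/train.txt.src") as f_src, \
         open("sample_data/train.txt.tgt") as f_tgt:
        pairs = list(zip(f_src, f_tgt))
    with Pool(processes=8) as pool:
        oracles = pool.map(oracle_for_pair, pairs, chunksize=64)
    # Write `oracles` to an output file in the same format find_oracle.py uses.
```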