This folder contains example configuration files to easily and quickly reproduce the processing flow of Redpajama.
The raw data files can be downloaded from the same AWS link as in Redpajama/Arxiv.
Once downloaded, use `raw_arxiv_to_jsonl.py` to convert the original format to jsonl files that data-juicer can handle:
```shell
python tools/preprocess/raw_arxiv_to_jsonl.py \
    --arxiv_src_dir <arxiv_src_dir> \
    --target_dir <target_dir> \
    --temp_dir <temp_dir> \
    --num_proc <num_proc>
```
After conversion, modify the path configurations in redpajama-arxiv.yaml and execute the following command to reproduce the processing flow of redpajama:
```shell
python tools/process_data.py --config configs/redpajama/redpajama-arxiv.yaml
```
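The path configurations to modify are the input and output locations. A minimal sketch, assuming the file uses data-juicer's usual `dataset_path`/`export_path` keys (defer to the keys actually present in your copy of the yaml):

```yaml
# illustrative paths only -- point these at your converted jsonl and desired output
dataset_path: '<target_dir>/arxiv.jsonl'          # jsonl produced by raw_arxiv_to_jsonl.py
export_path: '<output_dir>/arxiv_processed.jsonl' # where the processed dataset is written
```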
| | num_samples | num_tokens | peak_memory | wall_time |
|---|---|---|---|---|
| redpajama | 1,724,497 | 30,667,506,934 | 35GB | total: 11h52min |
| data-juicer | 2,675,426 | 30,338,153,178 | 21GB | preprocess: 5h21min <br> read+unify: 25min <br> remove_header_mapper: 5min <br> remove_comments_mapper: 3min <br> remove_bibliography_mapper: 4min <br> expand_macro_mapper: 5min19s <br> text_length_filter: 4min <br> export: 43min <br> total: 6h53min |
The raw data files can be downloaded from the same HuggingFace datasets as in Redpajama/Books.
Once downloaded, modify the path configurations in redpajama-books.yaml and execute the following command to reproduce the processing flow of redpajama:

```shell
python tools/process_data.py --config configs/redpajama/redpajama-books.yaml
```
| | num_samples | num_tokens | peak_memory | wall_time |
|---|---|---|---|---|
| redpajama | 205,183 | 25,962,395,123 | 450GB | split_for_dedup: 5min <br> dedup: 117min <br> total: 122min |
| data-juicer | 207,902 | 26,108,635,683 | 96GB | read+unify: 20min <br> compute_hash: 78min <br> dedup: 3min <br> export: 3min <br> total: 114min |
The raw data files can be downloaded from Google BigQuery as in Redpajama/Code.
Once downloaded, unzip and delete files whose extensions are not in the following whitelist:
.asm, .bat, .cmd, .c, .h, .cs, .cpp, .hpp, .c++, .h++, .cc, .hh, .C, .H, .cmake, .css, .dockerfile, .f90, .f, .f03, .f08, .f77, .f95, .for, .fpp, .go, .hs, .html, .java, .js, .jl, .lua, .md, .markdown, .php, .php3, .php4, .php5, .phps, .phpt, .pl, .pm, .pod, .perl, .ps1, .psd1, .psm1, .py, .rb, .rs, .sql, .scala, .sh, .bash, .command, .zsh, .ts, .tsx, .tex, .vb, Dockerfile, Makefile, .xml, .rst, .m, .smali
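One way to prune the unzipped download to this whitelist is a small script along these lines (a sketch; the helpers `should_keep` and `clean_dir` are ours, not part of data-juicer):

```python
import os

# Extension whitelist from the preparation step above. Dockerfile and
# Makefile are matched by full filename rather than by extension.
WHITELIST_EXTS = {
    ".asm", ".bat", ".cmd", ".c", ".h", ".cs", ".cpp", ".hpp", ".c++", ".h++",
    ".cc", ".hh", ".C", ".H", ".cmake", ".css", ".dockerfile", ".f90", ".f",
    ".f03", ".f08", ".f77", ".f95", ".for", ".fpp", ".go", ".hs", ".html",
    ".java", ".js", ".jl", ".lua", ".md", ".markdown", ".php", ".php3",
    ".php4", ".php5", ".phps", ".phpt", ".pl", ".pm", ".pod", ".perl",
    ".ps1", ".psd1", ".psm1", ".py", ".rb", ".rs", ".sql", ".scala", ".sh",
    ".bash", ".command", ".zsh", ".ts", ".tsx", ".tex", ".vb", ".xml",
    ".rst", ".m", ".smali",
}
WHITELIST_NAMES = {"Dockerfile", "Makefile"}

def should_keep(filename: str) -> bool:
    """Return True if the file matches the whitelist by name or extension."""
    if os.path.basename(filename) in WHITELIST_NAMES:
        return True
    # splitext preserves case, so entries like ".C" and ".c" stay distinct.
    _, ext = os.path.splitext(filename)
    return ext in WHITELIST_EXTS

def clean_dir(root: str) -> None:
    """Delete every file under `root` that is not in the whitelist."""
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not should_keep(name):
                os.remove(os.path.join(dirpath, name))
```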
After preparation, modify the path configurations in redpajama-code.yaml and execute the following command to reproduce the processing flow of redpajama:
```shell
python tools/process_data.py --config configs/redpajama/redpajama-code.yaml
```
| | num_samples | num_tokens | peak_memory | wall_time |
|---|---|---|---|---|
| redpajama | 73,208,524 | 150,390,270,060 | 212GB | local-dedup: 37h <br> global-dedup: 1h <br> merge-dedup: 6h <br> filter: 17h <br> total: 61h |
| data-juicer | 73,169,889 | 150,310,903,230 | 370GB | preprocess: 5h21min <br> read+unify: 12h <br> document_deduplicator: 20h <br> clean_copyright_mapper: 3h <br> maximum_line_length_filter: 2.5h <br> average_line_length_filter: 2h <br> alphanumeric_filter: 13h <br> export: 2.5h <br> total: 59h |
The raw data files can be downloaded from the same Archive link as in Redpajama/Stack_exchange.
Once downloaded, use `raw_stackexchange_to_jsonl.py` to convert the original format to jsonl files that data-juicer can handle:
```shell
python tools/preprocess/raw_stackexchange_to_jsonl.py \
    --src_dir <src_dir> \
    --target_dir <target_dir> \
    --topk <topk> \
    --num_proc <num_proc>
```
After conversion, modify the path configurations in redpajama-stackexchange.yaml and execute the following command to reproduce the processing flow of redpajama:
```shell
python tools/process_data.py --config configs/redpajama/redpajama-stackexchange.yaml
```
| | num_samples | num_tokens | peak_memory | wall_time |
|---|---|---|---|---|
| redpajama | 29,825,086 | 20,502,757,123 | >500GB | filter: 170min <br> postprocess: 90min <br> total: 260min |
| data-juicer | 29,825,086 | 20,628,082,262 | 100GB | preprocess: 210min <br> read+unify: 86min <br> clean_html: 15min <br> language_id_score_filter: 18min <br> total: 391min |