We found that existing processed datasets (e.g. RedPajama, The Pile) still contain some "bad" samples, so we use Data-Juicer to refine them and feed the refined data to LLMs for better performance.
We use a simple 3-σ rule to set the hyperparameters for the ops in each recipe: per-sample stats that fall outside mean ± 3 × std are treated as outliers.
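The 3-σ rule above can be sketched as follows. This is a generic illustration of the idea, not Data-Juicer's actual implementation; the stat values and function name are hypothetical:

```python
import statistics

def three_sigma_bounds(values):
    """Return the [mean - 3*std, mean + 3*std] keep-range for a per-sample stat.

    Generic sketch of the 3-sigma rule: samples whose stat falls outside
    this range are treated as outliers and filtered out.
    """
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return mean - 3 * std, mean + 3 * std

# Hypothetical per-sample stats (e.g. average line length) from an analysis pass.
stats = [10.0] * 20 + [100.0]
lo, hi = three_sigma_bounds(stats)
kept = [v for v in stats if lo <= v <= hi]  # the 100.0 outlier is dropped
```

The bounds derived this way become the `min`/`max` hyperparameters of the corresponding filter op in a recipe.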
subset | #samples before | #samples after | keep ratio | config link | data link | source |
---|---|---|---|---|---|---|
Arxiv | 1,724,497 | 1,655,259 | 95.99% | redpajama-arxiv-refine.yaml | Aliyun ModelScope | Redpajama |
Books | 205,182 | 195,983 | 95.51% | redpajama-book-refine.yaml | Aliyun ModelScope | Redpajama |
Wikipedia | 29,834,171 | 26,990,659 | 90.47% | redpajama-wiki-refine.yaml | Aliyun ModelScope | Redpajama |
C4 | 364,868,892 | 346,217,856 | 94.89% | redpajama-c4-refine.yaml | Aliyun ModelScope | Redpajama |
Common Crawl 2019-30 | 81,085,420 | 36,557,283 | 45.08% | redpajama-cc-refine/ | Aliyun ModelScope | Redpajama |
Common Crawl 2020-05 | 90,850,492 | 42,612,596 | 46.90% | redpajama-cc-refine/ | Aliyun ModelScope | Redpajama |
Common Crawl 2021-04 | 98,878,523 | 44,724,752 | 45.23% | redpajama-cc-refine/ | Aliyun ModelScope | Redpajama |
Common Crawl 2022-05 | 94,058,868 | 42,648,496 | 45.34% | redpajama-cc-refine/ | Aliyun ModelScope | Redpajama |
Common Crawl 2023-06 | 111,402,716 | 50,643,699 | 45.46% | redpajama-cc-refine/ | Aliyun ModelScope | Redpajama |
Github Code | 73,208,524 + 21,387,703 | 49,279,344 | 52.09% | redpajama-code-refine.yaml stack-code-refine.yaml redpajama-stack-code-deduplicate.yaml | Aliyun ModelScope | Redpajama The Stack |
StackExchange | 45,447,328 | 26,309,203 | 57.89% | redpajama-pile-stackexchange-refine.yaml | Aliyun ModelScope | Redpajama The Pile |
EuroParl | 69,814 | 61,601 | 88.23% | pile-europarl-refine.yaml | Aliyun ModelScope | The Pile |
FreeLaw | 3,562,015 | 2,942,612 | 82.61% | pile-freelaw-refine.yaml | Aliyun ModelScope | The Pile |
HackerNews | 373,027 | 371,331 | 99.55% | pile-hackernews-refine.yaml | Aliyun ModelScope | The Pile |
NIH ExPorter | 939,661 | 858,492 | 91.36% | pile-nih-refine.yaml | Aliyun ModelScope | The Pile |
PhilPapers | 32,782 | 29,117 | 88.82% | pile-philpaper-refine.yaml | Aliyun ModelScope | The Pile |
PubMed Abstracts | 15,518,009 | 15,009,325 | 96.72% | pile-pubmed-abstract-refine.yaml | Aliyun ModelScope | The Pile |
PubMed Central | 3,098,930 | 2,694,860 | 86.96% | pile-pubmed-central-refine.yaml | Aliyun ModelScope | The Pile |
USPTO | 5,883,024 | 4,516,283 | 76.77% | pile-uspto-refine.yaml | Aliyun ModelScope | The Pile |
subset | #samples before | #samples after | keep ratio | config link | data link | source |
---|---|---|---|---|---|---|
Alpaca-Cot EN | 136,219,879 | Non-dedup: 104,573,711 Dedup: TBD | 76.77% | alpaca-cot-en-refine.yaml | Aliyun ModelScope | 39 Subsets of Alpaca-CoT |
Alpaca-Cot ZH | 21,197,246 | 9,873,214 | 46.58% | alpaca-cot-zh-refine.yaml | Aliyun ModelScope | 28 Subsets of Alpaca-CoT |