# Refined open-source datasets by Data-Juicer

We found that some "bad" samples still remain in existing processed datasets (e.g., RedPajama, The Pile). We therefore used our Data-Juicer to refine these datasets, aiming to feed the refined data to LLMs for better performance.

We use a simple 3-σ rule to set the hyperparameters for the ops in each recipe: for a given stat, samples falling outside the range of mean ± 3 standard deviations are filtered out.
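To illustrate the 3-σ rule, here is a minimal sketch (with hypothetical per-sample stats, not Data-Juicer's actual API) that derives filter bounds from one stat and drops samples outside them:

```python
import statistics

def three_sigma_bounds(values):
    """Return (mean - 3*sigma, mean + 3*sigma) for a per-sample stat,
    e.g. word count or perplexity."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return mu - 3 * sigma, mu + 3 * sigma

# Hypothetical per-sample word counts; the last one is an extreme outlier.
word_counts = [118, 132, 127, 95, 140, 121, 133, 99, 125, 130,
               122, 117, 138, 128, 104, 135, 126, 120, 129, 20000]
lo, hi = three_sigma_bounds(word_counts)
kept = [c for c in word_counts if lo <= c <= hi]  # outlier is dropped
```

Bounds derived this way can then be plugged in as the min/max hyperparameters of the corresponding filter ops in a recipe.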

## Before and after refining for Pretraining Dataset

| subset | #samples before | #samples after | keep ratio | config link | data link | source |
|---|---|---|---|---|---|---|
| Arxiv | 1,724,497 | 1,655,259 | 95.99% | redpajama-arxiv-refine.yaml | Aliyun / ModelScope | Redpajama |
| Books | 205,182 | 195,983 | 95.51% | redpajama-book-refine.yaml | Aliyun / ModelScope | Redpajama |
| Wikipedia | 29,834,171 | 26,990,659 | 90.47% | redpajama-wiki-refine.yaml | Aliyun / ModelScope | Redpajama |
| C4 | 364,868,892 | 346,217,856 | 94.89% | redpajama-c4-refine.yaml | Aliyun / ModelScope | Redpajama |
| Common Crawl 2019-30 | 81,085,420 | 36,557,283 | 45.08% | redpajama-cc-refine/ | Aliyun / ModelScope | Redpajama |
| Common Crawl 2020-05 | 90,850,492 | 42,612,596 | 46.90% | redpajama-cc-refine/ | Aliyun / ModelScope | Redpajama |
| Common Crawl 2021-04 | 98,878,523 | 44,724,752 | 45.23% | redpajama-cc-refine/ | Aliyun / ModelScope | Redpajama |
| Common Crawl 2022-05 | 94,058,868 | 42,648,496 | 45.34% | redpajama-cc-refine/ | Aliyun / ModelScope | Redpajama |
| Common Crawl 2023-06 | 111,402,716 | 50,643,699 | 45.46% | redpajama-cc-refine/ | Aliyun / ModelScope | Redpajama |
| Github Code | 73,208,524 + 21,387,703 | 49,279,344 | 52.09% | redpajama-code-refine.yaml, stack-code-refine.yaml, redpajama-stack-code-deduplicate.yaml | Aliyun / ModelScope | Redpajama, The Stack |
| StackExchange | 45,447,328 | 26,309,203 | 57.89% | redpajama-pile-stackexchange-refine.yaml | Aliyun / ModelScope | Redpajama, The Pile |
| EuroParl | 69,814 | 61,601 | 88.23% | pile-europarl-refine.yaml | Aliyun / ModelScope | The Pile |
| FreeLaw | 3,562,015 | 2,942,612 | 82.61% | pile-freelaw-refine.yaml | Aliyun / ModelScope | The Pile |
| HackerNews | 373,027 | 371,331 | 99.55% | pile-hackernews-refine.yaml | Aliyun / ModelScope | The Pile |
| NIH ExPorter | 939,661 | 858,492 | 91.36% | pile-nih-refine.yaml | Aliyun / ModelScope | The Pile |
| PhilPapers | 32,782 | 29,117 | 88.82% | pile-philpaper-refine.yaml | Aliyun / ModelScope | The Pile |
| PubMed Abstracts | 15,518,009 | 15,009,325 | 96.72% | pile-pubmed-abstract-refine.yaml | Aliyun / ModelScope | The Pile |
| PubMed Central | 3,098,930 | 2,694,860 | 86.96% | pile-pubmed-central-refine.yaml | Aliyun / ModelScope | The Pile |
| USPTO | 5,883,024 | 4,516,283 | 76.77% | pile-uspto-refine.yaml | Aliyun / ModelScope | The Pile |
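The keep ratio column is simply #samples after divided by #samples before, expressed as a percentage; for example, for the HackerNews subset:

```python
# Keep ratio = samples kept after refining / samples before refining,
# using the HackerNews row of the table above.
before, after = 373_027, 371_331
keep_ratio = after / before * 100
print(f"keep ratio = {keep_ratio:.2f}%")  # → keep ratio = 99.55%
```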

## Before and after refining for Alpaca-CoT Dataset

| subset | #samples before | #samples after | keep ratio | config link | data link | source |
|---|---|---|---|---|---|---|
| Alpaca-CoT EN | 136,219,879 | Non-dedup: 104,573,711; Dedup: TBD | 76.77% | alpaca-cot-en-refine.yaml | Aliyun / ModelScope | 39 Subsets of Alpaca-CoT |
| Alpaca-CoT ZH | 21,197,246 | 9,873,214 | 46.58% | alpaca-cot-zh-refine.yaml | Aliyun / ModelScope | 28 Subsets of Alpaca-CoT |