We found that existing processed datasets (e.g. RedPajama, The Pile) still contain some "bad" samples, so we use Data-Juicer to refine them and feed the refined data to LLMs for better performance.
We use a simple 3-σ rule to set the hyperparameters for the ops in each recipe: per-sample stats that fall outside mean ± 3 × std are treated as outliers.
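The 3-σ rule above can be sketched as follows. This is a generic illustration of the idea, not Data-Juicer's actual implementation; the stat values and function name are hypothetical:

```python
import statistics

def three_sigma_bounds(values):
    """Return the [mean - 3*std, mean + 3*std] keep-range for a per-sample stat.

    Generic sketch of the 3-sigma rule: samples whose stat falls outside
    this range are treated as outliers and filtered out.
    """
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return mean - 3 * std, mean + 3 * std

# Hypothetical per-sample stats (e.g. average line length) from an analysis pass.
stats = [10.0] * 20 + [100.0]
lo, hi = three_sigma_bounds(stats)
kept = [v for v in stats if lo <= v <= hi]  # the 100.0 outlier is dropped
```

The bounds derived this way become the `min`/`max` hyperparameters of the corresponding filter op in a recipe.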
subset | #samples before | #samples after | keep ratio | config link | data link | source |
---|---|---|---|---|---|---|
Arxiv | 1,724,497 | 1,655,259 | 95.99% | redpajama-arxiv-refine.yaml | Aliyun ModelScope | Redpajama |
Books | 205,182 | 195,983 | 95.51% | redpajama-book-refine.yaml | Aliyun ModelScope | Redpajama |
Wikipedia | 29,834,171 | 26,990,659 | 90.47% | redpajama-wiki-refine.yaml | Aliyun ModelScope | Redpajama |
C4 | 364,868,892 | 346,217,856 | 94.89% | redpajama-c4-refine.yaml | Aliyun ModelScope | Redpajama |
Common Crawl 2019-30 | 81,085,420 | 36,557,283 | 45.08% | redpajama-cc-refine/ | Aliyun ModelScope | Redpajama |
Common Crawl 2020-05 | 90,850,492 | 42,612,596 | 46.90% | redpajama-cc-refine/ | Aliyun ModelScope | Redpajama |
Common Crawl 2021-04 | 98,878,523 | 44,724,752 | 45.23% | redpajama-cc-refine/ | Aliyun ModelScope | Redpajama |
Common Crawl 2022-05 | 94,058,868 | 42,648,496 | 45.34% | redpajama-cc-refine/ | Aliyun ModelScope | Redpajama |
Common Crawl 2023-06 | 111,402,716 | 50,643,699 | 45.46% | redpajama-cc-refine/ | Aliyun ModelScope | Redpajama |
Github Code | 73,208,524 + 21,387,703 | 49,279,344 | 52.09% | redpajama-code-refine.yaml stack-code-refine.yaml redpajama-stack-code-deduplicate.yaml | Aliyun ModelScope | Redpajama The Stack |
StackExchange | 45,447,328 | 26,309,203 | 57.89% | redpajama-pile-stackexchange-refine.yaml | Aliyun ModelScope | Redpajama The Pile |
EuroParl | 69,814 | 61,601 | 88.23% | pile-europarl-refine.yaml | Aliyun ModelScope | The Pile |
FreeLaw | 3,562,015 | 2,942,612 | 82.61% | pile-freelaw-refine.yaml | Aliyun ModelScope | The Pile |
HackerNews | 373,027 | 371,331 | 99.55% | pile-hackernews-refine.yaml | Aliyun ModelScope | The Pile |
NIH ExPorter | 939,661 | 858,492 | 91.36% | pile-nih-refine.yaml | Aliyun ModelScope | The Pile |
PhilPapers | 32,782 | 29,117 | 88.82% | pile-philpaper-refine.yaml | Aliyun ModelScope | The Pile |
PubMed Abstracts | 15,518,009 | 15,009,325 | 96.72% | pile-pubmed-abstract-refine.yaml | Aliyun ModelScope | The Pile |
PubMed Central | 3,098,930 | 2,694,860 | 86.96% | pile-pubmed-central-refine.yaml | Aliyun ModelScope | The Pile |
USPTO | 5,883,024 | 4,516,283 | 76.77% | pile-uspto-refine.yaml | Aliyun ModelScope | The Pile |
subset | #samples before | #samples after | keep ratio | config link | data link | source |
---|---|---|---|---|---|---|
Alpaca-Cot EN | 136,219,879 | Non-dedup: 104,573,711 Dedup: TBD | 76.77% | alpaca-cot-en-refine.yaml | Aliyun ModelScope | 39 Subsets of Alpaca-CoT |
Alpaca-Cot ZH | 21,197,246 | 9,873,214 | 46.58% | alpaca-cot-zh-refine.yaml | Aliyun ModelScope | 28 Subsets of Alpaca-CoT |