
Add RedPajama-V2 to the Klone Data Commons #152

Open
wants to merge 1 commit into src

Conversation

kamahori

Creating this PR to add the RedPajama-V2 dataset to the Klone Data Commons.

@npho
Member

npho commented Mar 1, 2024

Hi @kamahori, do you know how much space this dataset would use (in TB)? Ryan on our team said he was working on deploying it and thought it might take 500TB, but that seems high; I think the URL you provided said it was only 11TB?

@kamahori
Author

kamahori commented Mar 3, 2024

Hi @npho, thanks for working on this. I don't believe it will take as much as 500TB. I think each file is provided in compressed form, and we don't need to keep them uncompressed.
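
As a rough sketch of what I mean (assuming the document shards are gzip-compressed JSONL as on the Hugging Face page; the shard path below is just an illustrative placeholder), the files can be read directly without ever writing them out uncompressed:

```python
import gzip
import json

# Illustrative placeholder path; the real shard names follow the RedPajama-V2 layout.
shard_path = "documents/2023-06/0000/en_head.json.gz"

# Stream records straight from the compressed shard; nothing is stored uncompressed on disk.
with gzip.open(shard_path, "rt", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        print(sorted(doc.keys()))  # inspect the document fields
        break
```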

@pnw-ryanmcgr
Contributor

pnw-ryanmcgr commented Mar 14, 2024

Good afternoon @kamahori,
Were you requesting that we download the full RedPajama-V2 dataset, or just the partial sample set (11Gb) provided on the Hugging Face website? If the parquet files located here https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2/tree/refs%2Fconvert%2Fparquet will work for what you need, please let us know, as we have already downloaded roughly 10% of the full dataset at almost 51TB.
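
If it helps, here is a minimal sketch of pulling the small sample through the `datasets` library (the `sample` config name is taken from the dataset card; treat it as an assumption rather than something we have verified on Klone):

```python
from datasets import load_dataset

# Sketch only: load the small "sample" configuration of RedPajama-V2.
# The config name follows the Hugging Face dataset card; adjust if it differs.
sample = load_dataset(
    "togethercomputer/RedPajama-Data-V2",
    name="sample",
    trust_remote_code=True,  # the dataset ships its own loading script
)

print(sample)  # inspect splits and row counts before committing to the full download
```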

Thank you,
Ryan McGregor

@pnw-ryanmcgr
Contributor

Just a reminder that we need more information regarding this request, or I will have to close it unmerged.

@cylinbao

Hi, thank you for your efforts in helping to set up this dataset. However, after discussion, we think that to make the best use of it we will need the whole dataset, not just the partial sample sets.

Based on their documentation on the HF page https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2, it seems they provide scripts to remove the duplicated parts of the dataset. Their table shows the token count dropping from 50T to 30T after deduplication and filtering. If the original data takes 500TB, we expect the cleaned version to be around 300TB. Would that be an acceptable size? If yes, would it be possible for you to download the whole dataset and run the data-cleaning process? Thanks!

[Screenshot taken 2024-03-29: token-count table referenced above]
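
For the size figure above, the 300TB number is just a proportional back-of-the-envelope estimate (assuming on-disk size scales roughly with token count; the 500TB raw figure is the one discussed earlier in this thread, not something we have measured):

```python
# Proportional estimate: assume on-disk size scales with token count.
raw_size_tb = 500      # rough size of the raw download, per the discussion above
raw_tokens_t = 50      # ~50T tokens before deduplication/filtering (HF table)
clean_tokens_t = 30    # ~30T tokens after deduplication/filtering (HF table)

clean_size_tb = raw_size_tb * clean_tokens_t / raw_tokens_t
print(f"Expected cleaned size: ~{clean_size_tb:.0f} TB")  # ~300 TB
```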
