
Add RedPajama-V2 to the Klone Data Commons #152

Open
wants to merge 1 commit into src

Conversation

kamahori

Creating this PR to add the RedPajama-V2 dataset to the Klone Data Commons.

@npho
Member

npho commented Mar 1, 2024

Hi @kamahori, do you know how much space this dataset would use (in TB)? Ryan on our team said he was working on deploying it and thought it might take 500TB, but that seems high; I think the URL you provided said it was only 11TB?

@kamahori
Author

kamahori commented Mar 3, 2024

Hi @npho, thanks for working on this. I don't believe it will take as much as 500TB. I think each file is provided in compressed form, and we don't need to keep them uncompressed.
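
As a rough sketch of what I mean (assuming the document shards are gzip-compressed JSONL as on the Hugging Face page; the shard path below is just an illustrative placeholder), the files can be read directly without ever writing them out uncompressed:

```python
import gzip
import json

# Illustrative placeholder path; the real shard names follow the RedPajama-V2 layout.
shard_path = "documents/2023-06/0000/en_head.json.gz"

# Stream records straight from the compressed shard; nothing is stored uncompressed on disk.
with gzip.open(shard_path, "rt", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        print(sorted(doc.keys()))  # inspect the document fields
        break
```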

@pnw-ryanmcgr
Contributor

pnw-ryanmcgr commented Mar 14, 2024

Good afternoon @kamahori,
Were you requesting that we download the full RedPajama-V2 dataset, or just the partial sample set (11Gb) provided on the Hugging Face website? If the parquet files located here https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2/tree/refs%2Fconvert%2Fparquet will work for what you need, please let us know, as we have already downloaded roughly 10% of the full dataset at almost 51TB.
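
If it helps, here is a minimal sketch of pulling the small sample through the `datasets` library (the `sample` config name is taken from the dataset card; treat it as an assumption rather than something we have verified on Klone):

```python
from datasets import load_dataset

# Sketch only: load the small "sample" configuration of RedPajama-V2.
# The config name follows the Hugging Face dataset card; adjust if it differs.
sample = load_dataset(
    "togethercomputer/RedPajama-Data-V2",
    name="sample",
    trust_remote_code=True,  # the dataset ships its own loading script
)

print(sample)  # inspect splits and row counts before committing to the full download
```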

Thank you,
Ryan McGregor

@pnw-ryanmcgr
Contributor

Just a reminder that we need more information regarding this request, or I will have to close it unmerged.

@cylinbao

Hi, thank you for your efforts in helping to set up this dataset. However, after discussion, we think that to make the best use of it we will need the whole dataset, not just the partial sample sets.

Based on their documentation on the HF page https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2, it seems they provide scripts to remove the duplicated parts of the dataset. Their table shows the token count dropping from 50T to 30T after deduplication and filtering. If the original data takes 500TB, we expect the cleaned version to be around 300TB. Would that be an acceptable size? If yes, would it be possible for you to download the whole dataset and run the data-cleaning process? Thanks!

[Screenshot taken 2024-03-29: token-count table referenced above]
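
For the size figure above, the 300TB number is just a proportional back-of-the-envelope estimate (assuming on-disk size scales roughly with token count; the 500TB raw figure is the one discussed earlier in this thread, not something we have measured):

```python
# Proportional estimate: assume on-disk size scales with token count.
raw_size_tb = 500      # rough size of the raw download, per the discussion above
raw_tokens_t = 50      # ~50T tokens before deduplication/filtering (HF table)
clean_tokens_t = 30    # ~30T tokens after deduplication/filtering (HF table)

clean_size_tb = raw_size_tb * clean_tokens_t / raw_tokens_t
print(f"Expected cleaned size: ~{clean_size_tb:.0f} TB")  # ~300 TB
```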
