Generalize encode/decode for datasets #415

GMNGeoffrey · 2024-01-05T22:34:21Z

This fixes a TODO to allow arbitrary encoding/decoding schemes for
different datasets. To do so, I switched from pickle to dill, which
extends pickle to enable things like pickling functions, including
their referenced globals. dill is already a dependency of datasets,
so this doesn't add any new dependencies.
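As a rough illustration of why the switch matters, here is a minimal sketch (not the code from this PR). It assumes a hypothetical `meta` dictionary that stores `encode` and `decode` callables next to a vocabulary size. The `make_codec` helper and the dictionary keys are made up for the example. The standard `pickle` module refuses to serialize the lambdas, while `dill` can round-trip them along with the lookup tables they close over.

```python
import pickle

try:
    import dill  # third-party; the PR notes it is already pulled in by `datasets`
except ImportError:  # keep the sketch importable even where dill is missing
    dill = None


def make_codec(chars):
    """Build a character-level encode/decode pair (hypothetical helper)."""
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for i, ch in enumerate(chars)}
    encode = lambda s: [stoi[c] for c in s]              # str -> list of ints
    decode = lambda ids: "".join(itos[i] for i in ids)   # list of ints -> str
    return encode, decode


encode, decode = make_codec(sorted(set("hello world")))

# Hypothetical metadata layout: the callables travel with the dataset.
meta = {"vocab_size": len(set("hello world")), "encode": encode, "decode": decode}


def pickle_fails(obj):
    """Return True if the standard pickle module cannot serialize obj."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return True
    return False


def dill_roundtrip(obj):
    """Serialize and restore obj with dill (requires dill to be installed)."""
    return dill.loads(dill.dumps(obj))


if __name__ == "__main__":
    print("pickle fails on meta:", pickle_fails(meta))
    if dill is not None:
        restored = dill_roundtrip(meta)
        print(restored["decode"](restored["encode"]("hello")))
```

Note that loading a dill file executes whatever code it contains, just as with pickle, so such files should only be loaded from trusted sources.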

This PR also includes some gitignore additions that I found
necessary for my usage. I can alter the entries, remove them from this
PR, or break them into a separate PR, as you prefer. Probably the most
controversial addition would be `data/*/samples/*`, since that's not
a format that is currently referenced in this codebase. I was using
directories like that to save sample prompts for datasets. Happy to
drop it if its inclusion is not desired.

The other entries are all common things you'd want to gitignore: venv directories, VS Code workspaces, output directories (using the directory names suggested by this codebase), and the default wandb output directory.


AutomaticHourglass · 2024-01-26T18:11:58Z

I did something similar in my own work. Recommended.

GMNGeoffrey added 2 commits January 5, 2024 22:27
