Creating, deleting and updating datasets #31

Open
c42f opened this issue Nov 16, 2021 · 4 comments

@c42f

c42f commented Nov 16, 2021

We need some programmatic way to create datasets, to update their metadata and to delete them. Currently people need to manage this manually by writing TOML, but clearly this isn't great.

API musings

One possibility is to overload the dataset() function itself with the ability to create a dataset. For example adding a create=true flag:

dataset("SomeData", create=true, tags=[...], description="some desc", other_args...)
dataset(project, "SomeData", create=true, tags=[...], description="some desc", other_args...)

Another idea would be to pass a verb along as a positional argument, such as

dataset("SomeData", :create; description="some desc", other_args...)
dataset("SomeData", :delete)
dataset("SomeData", :update, description="new desc")

With :read being the default verb. This allows us to reuse the exported dataset() function for all dataset-related CRUD operations.
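As a rough sketch of how the verb form could dispatch, here is a toy version that maps each verb to its own method via Val. Everything in it is an assumption made for illustration: dataset_sketch and REGISTRY are hypothetical names, and the in-memory Dict merely stands in for a data project. None of this is existing DataSets.jl API.

```julia
# Hypothetical in-memory registry standing in for a data project (not DataSets.jl API).
const REGISTRY = Dict{String,Dict{Symbol,Any}}()

# Entry point: the verb is a positional Symbol, with :read as the default.
dataset_sketch(name::AbstractString, verb::Symbol=:read; kws...) =
    dataset_sketch(Val(verb), name; kws...)

# One method per verb, selected by dispatch on Val.
dataset_sketch(::Val{:read}, name) = REGISTRY[name]

function dataset_sketch(::Val{:create}, name; kws...)
    haskey(REGISTRY, name) && error("dataset $name already exists")
    REGISTRY[name] = Dict{Symbol,Any}(kws)
end

function dataset_sketch(::Val{:update}, name; kws...)
    merge!(REGISTRY[name], Dict{Symbol,Any}(kws))
end

function dataset_sketch(::Val{:delete}, name)
    delete!(REGISTRY, name)
    nothing
end

# Usage, mirroring the calls proposed above.
dataset_sketch("SomeData", :create; description="some desc")
dataset_sketch("SomeData", :update, description="new desc")
dataset_sketch("SomeData")  # same as passing :read
```

An unknown verb would surface as a MethodError here, which is one way the verb form is less discoverable than separate functions.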

But let's be honest, this is a little weird other than being economical with exported names. Perhaps I've been doing too much REST recently :-) Probably a better alternative would be to just have a function per operation:

DataSets.create("SomeData"; description="some desc", other_args...)
DataSets.delete("SomeData")
DataSets.update("SomeData", description="new desc")

update() is a bit of an odd one out of these operations — what if you wanted to delete some metadata? I guess we could pass something like description=nothing for deleting metadata items.
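To make the description=nothing idea concrete, here is a minimal sketch of the update semantics over a plain metadata dictionary. The name update_metadata! and the use of a Dict with String keys are assumptions for illustration only, not how DataSets.jl stores metadata.

```julia
# Apply keyword updates to a metadata dictionary (hypothetical helper).
# A value of `nothing` removes the key; any other value sets it.
function update_metadata!(metadata::AbstractDict{String}; kws...)
    for (key, value) in kws
        if value === nothing
            delete!(metadata, String(key))
        else
            metadata[String(key)] = value
        end
    end
    return metadata
end

meta = Dict{String,Any}("description" => "some desc", "tags" => ["a", "b"])
update_metadata!(meta; description="new desc", tags=nothing)
```

One consequence of this convention is that nothing can no longer be stored as a metadata value, which may or may not matter for TOML-backed projects.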

Which data project?

When creating a dataset it needs to be created within "some" data project. Presumably this would be the topmost project in the data project stack, or within a provided project if the project is supplied as the first argument.
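The project-selection rule could be sketched as two methods: one taking an explicit project, and one falling back to the top of the stack. SketchProject, PROJECT_STACK and create_sketch are hypothetical names made up for this example; they are not the DataSets.jl types or functions.

```julia
# Hypothetical stand-in for a data project (not a DataSets.jl type).
struct SketchProject
    name::String
    datasets::Dict{String,Dict{String,Any}}
end
SketchProject(name) = SketchProject(name, Dict{String,Dict{String,Any}}())

# The project stack, with the topmost (highest precedence) project first.
const PROJECT_STACK = SketchProject[]

# Create within an explicitly provided project.
function create_sketch(project::SketchProject, name::AbstractString; metadata...)
    haskey(project.datasets, name) &&
        error("dataset $name already exists in $(project.name)")
    project.datasets[name] = Dict{String,Any}(String(k) => v for (k, v) in metadata)
    return project
end

# With no project given, fall back to the topmost project in the stack.
function create_sketch(name::AbstractString; metadata...)
    isempty(PROJECT_STACK) && error("no data project available")
    create_sketch(first(PROJECT_STACK), name; metadata...)
end

top = SketchProject("top")
other = SketchProject("other")
push!(PROJECT_STACK, top, other)

create_sketch("SomeData"; description="some desc")  # lands in `top`
create_sketch(other, "OtherData")                   # lands in `other`
```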

Data ownership

Creation — and especially deletion — brings up an additional problem: How do we distinguish between data which is "owned" by a data project (so that the data itself should be deleted when the dataset is removed from the project), vs data which is merely linked to?

For existing data referenced on the filesystem this is particularly relevant. We don't want DataSets.delete() to remove somebody's existing data that a dataset merely refers to. But neither do we want it to leave unwanted data lying around.

I think we should have an extra metadata key to distinguish between data which is managed-vs-linked-to by DataSets. Perhaps under the keys linked, or managed or some such. (Should this go within the storage section or not?)
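Here is a sketch of how deletion might consult such a key. The key name "managed", the flat entry layout with a "path" field, and the function delete_sketch! are all assumptions for illustration; in particular, whether the key belongs inside the storage section is left open, as above.

```julia
# Remove a dataset entry from a (hypothetical) project dictionary.
# The underlying data is only removed when the entry is marked as managed.
function delete_sketch!(datasets::AbstractDict, name::AbstractString)
    entry = pop!(datasets, name)
    # Assumption: a missing "managed" key means the data is merely linked to,
    # which errs on the side of never deleting somebody's existing data.
    if get(entry, "managed", false)
        rm(entry["path"]; recursive=true, force=true)
    end
    return entry
end

# Usage with throwaway files in a temporary directory.
dir = mktempdir()
owned = joinpath(dir, "owned.csv")
linked = joinpath(dir, "linked.csv")
write(owned, "a,b\n")
write(linked, "c,d\n")

datasets = Dict(
    "Owned"  => Dict{String,Any}("path" => owned, "managed" => true),
    "Linked" => Dict{String,Any}("path" => linked),
)
delete_sketch!(datasets, "Owned")   # entry and file are both removed
delete_sketch!(datasets, "Linked")  # entry is removed, file is left alone
```

Defaulting to "linked" when the key is absent seems like the safer choice, since the failure mode is leftover data rather than data loss.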

@tclements

This is something I would love to have! Manually writing and updating TOML feels hackish and unreproducible at the moment. The create, delete and update syntax seems the best to me - I'd rather these operations be explicit.

@jvaverka

Love the questions being asked here, but I would add another related to:

> When creating a dataset it needs to be created within "some" data project. Presumably this would be the topmost project in the data project stack, or within a provided project if the project is supplied as the first argument.

Should data projects be made more transparent as well?

While I know the functions DataSets.ActiveDataProject & DataSets.DataProject are provided, I honestly did not think about the concept of a data project when first using this package. Maybe something in the Data REPL to show the active project, such as a prompt like (My Data Project) data>, would make this more obvious. Maybe we could also provide a Data REPL command to list the ones DataSets.jl knows are available.

  Command        Alias       Description
  –––––––––––––– ––––––––––– ––––––––––––––––––––––––––––––––
  projects       proj        List all available data projects
  project $name  proj $name  Switch to $name data project

@c42f

c42f commented Mar 31, 2022

> Maybe something in the Data REPL to show the active project

We have this — I guess it's just badly named:

data> stack list
DataSets.StackedDataProject:
  DataSets.ActiveDataProject:
    (empty)
  DataSets.TomlFileDataProject [/home/chris/.julia/datasets/Data.toml]:
    📁 SomeDir    => 302a6dd6-d9e1-4487-8919-c520f08165be
    📄 SomeFile   => 97633d9c-afa8-4437-abd9-320cb4fdb270
    📁 TrueFX     => aa21c966-563e-42fb-ac3d-edaa3bdf3652
    📁 imagenet   => e73ae172-eeb0-4417-b3e1-007d42918752

Alternatively, we could make data> ls just show the full stack in this format by default? (The downside there is that duplicate names can occur, with the topmost data project taking precedence. That is why I used the current format for ls, where deduplication has already happened.)
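For reference, the deduplication rule (topmost project wins) can be sketched as follows. The stack representation, a vector of label => names pairs with the topmost project first, and the function visible_datasets are hypothetical and only meant to illustrate the precedence rule, not how the REPL is implemented.

```julia
# Each project is a (label => dataset names) pair; the stack lists the topmost project first.
# Return each dataset name once, mapped to the label of the project that provides it.
function visible_datasets(stack)
    seen = Dict{String,String}()
    for (label, names) in stack
        for name in names
            # `get!` only inserts when the name is new, so earlier (topmost) projects win.
            get!(seen, name, label)
        end
    end
    return seen
end

stack = [
    "active"    => ["SomeFile"],
    "Data.toml" => ["SomeDir", "SomeFile", "TrueFX"],
]
visible = visible_datasets(stack)
```

With this input, SomeFile resolves to the "active" project even though Data.toml also defines it, which is exactly the information a full-stack listing would expose and the deduplicated listing hides.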

Current data REPL docs do mention this, and the stack command is findable via tab completion. But clearly it should be more discoverable, somehow.

data> ?
  DataSets Data REPL
  ====================

  Press > to enter the data repl. Press TAB to complete commands.

  Command          Alias   Action                                             
  –––––––––––––––– ––––––– –––––––––––––––––––––––––––––––––––––––––––––––––––
  help             ?       Show this message                                  
  list             ls      List all datasets by name                          
  show $name               Preview the content of dataset $name               
  stack            st      Manipulate the global data search stack            
  stack list       st ls   List all projects in the global data search stack  
  stack push $path st push Add data project $path to front of the search stack
  stack pop        st pop  Remove data project from front of the search stack 

@jvaverka

This works perfectly. My tired eyes / brain just looked right over it. Thanks for clarifying!
