Craft TSV records for an SFB demo #1

Open
jsheunis opened this issue Jun 6, 2023 · 7 comments

@jsheunis
Collaborator

jsheunis commented Jun 6, 2023

Some context in: datalad/datalad-catalog#311

Example dataset (here in csv):

identifier,1234
version,latest
name,Demo
description,This is a dataset description
author,Heunis,Stephan,Dr,0000-1234-5678
author,...,...,...,...
keywords,minimal,example,catalog,from,metadata
property,Storage,7PB
property,Source,Open
sfb1451,<key>,<value>
project,<project-name>

Example file (here in csv):

path, contentbytesize, url
myfile.txt, 2292972, https://file1url
subdir/my2ndfile.txt, 82828282, https://file2url
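
A minimal sketch of how the file table above could be read with Python's standard `csv` module. The function name `read_file_records` is hypothetical, and `skipinitialspace=True` is an assumption based on the example having a space after each comma.

```python
import csv
import io

# The example file table from above, verbatim.
FILE_TABLE = """path, contentbytesize, url
myfile.txt, 2292972, https://file1url
subdir/my2ndfile.txt, 82828282, https://file2url
"""


def read_file_records(text):
    """Return one dict per file row, with the byte size converted to int."""
    # skipinitialspace drops the blank that follows each comma,
    # in the header row as well as in the data rows.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    records = []
    for row in reader:
        records.append({
            "path": row["path"],
            "contentbytesize": int(row["contentbytesize"]),
            "url": row["url"],
        })
    return records


records = read_file_records(FILE_TABLE)
```

For a real TSV file, passing `delimiter="\t"` to `csv.DictReader` should be the only change needed.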

Some notes from discussion:

  • need to be able to generate the dataset_id ourselves deterministically, from e.g. project and identifier and name?
  • super-/sub-dataset linkage to be done in separate workflow
  • where to encode the linkage between file metadata and specific dataset id/version?
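
One possible way to approach the first note, sketched under assumptions: a name-based UUID (version 5) derived from project, identifier, and name. Nothing in this thread settles on that scheme. The namespace URL and the function name `make_dataset_id` are placeholders.

```python
import json
import uuid

# Placeholder namespace. Any fixed UUID works, but it must never change,
# otherwise all previously generated IDs change with it.
SFB_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/sfb1451")


def make_dataset_id(project, identifier, name):
    """Derive a stable dataset ID from three text fields."""
    # JSON-encode the fields as a list so that ("a", "bc") and ("ab", "c")
    # cannot collapse into the same key.
    key = json.dumps([project, identifier, name])
    return str(uuid.uuid5(SFB_NAMESPACE, key))


dataset_id = make_dataset_id("Z03", "1234", "Demo")
```

The same inputs always give the same ID, so the ID can be regenerated from the table alone without storing it anywhere.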
@mih

mih commented Jun 6, 2023

Specification of dataset metadata in a table format, where each dataset property is described in a single row.

Below is one section on each recognized property. Each section contains a table snippet with details on the syntax and semantics for that property. The colX labels identify the respective column numbers. Optional cells are indicated by [*].

Dataset identifier [required]

Definition: https://schema.org/identifier

| col1 | col2 |
| --- | --- |
| identifier | https://schema.org/Text |

Dataset name [required]

Definition: https://schema.org/name

| col1 | col2 |
| --- | --- |
| name | https://schema.org/Text |

Dataset description [optional]

Definition: https://schema.org/description

| col1 | col2 |
| --- | --- |
| description | https://schema.org/Text |

Dataset version [optional]

Definition: https://schema.org/version

| col1 | col2 |
| --- | --- |
| version | https://schema.org/Text |

Author [optional]

Definition: https://schema.org/author

One or more rows are supported, each row for one author, order in table (top-to-bottom) defines author order (first is first, last is last).

| col1 | col2 | col3[*] | col4[*] | col5[*] |
| --- | --- | --- | --- | --- |
| author | full name | orcid | https://schema.org/email | affiliation(s) |

Publication [optional]

Definition: https://schema.org/publication

One or more rows are supported, each row for one publication related to the dataset, or possibly a publication of the dataset itself. The citation is free-format. Citation style should be consistent across publications.

| col1 | col2 | col3 |
| --- | --- | --- |
| publication | doi | https://schema.org/Text |

Dataset keywords [optional]

Definition: https://schema.org/keywords

One keyword per cell, no limit on number of keywords.

| col1 | col2 | col3 | ... |
| --- | --- | --- | --- |
| keywords | https://schema.org/Text | https://schema.org/Text | ... |

Custom dataset property [optional]

Definition: https://schema.org/Property

One special property category label (col1) is recognized: property, indicating "no particular category". Multi-value properties can use any number of cells starting col3.

Any row with a col1 cell value that does not match a recognized field is interpreted as a custom property. Col4 and col5 allow for amending the "display name" and "display value" of the property with identifiers in the form of URLs to the definitions of respective concepts and items.

| col1 | col2 | col3 | col4[*] | col5[*] |
| --- | --- | --- | --- | --- |
| https://schema.org/Text | https://schema.org/Text | https://schema.org/Text | property definition | value definition |
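
To make the row-per-property layout concrete, here is a rough sketch of a parser for such a table. It is not an implementation from this project: the function name `parse_dataset_table` and the output keys are made up for illustration. It also assumes the simplest reading of a custom property row (col2 is the name, col3 the value, col4 and col5 the optional definitions), so multi-value properties are not handled.

```python
import csv
import io

SINGLE_VALUE = {"identifier", "name", "description", "version"}
REQUIRED = {"identifier", "name"}
AUTHOR_FIELDS = ("name", "orcid", "email", "affiliation")


def parse_dataset_table(text):
    """Parse a tab-separated dataset table into a plain dict."""
    meta = {"authors": [], "publications": [], "keywords": [], "properties": []}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        cells = [cell.strip() for cell in row]
        if not cells or not cells[0]:
            continue  # skip blank rows
        key = cells[0]
        # Pad so optional trailing cells can be read without index errors.
        values = cells[1:] + [""] * 4
        if key in SINGLE_VALUE:
            meta[key] = values[0]
        elif key == "author":
            # Row order in the table is kept, so it defines author order.
            author = {f: v for f, v in zip(AUTHOR_FIELDS, values) if v}
            meta["authors"].append(author)
        elif key == "publication":
            meta["publications"].append({"doi": values[0], "citation": values[1]})
        elif key == "keywords":
            meta["keywords"].extend(v for v in values if v)
        else:
            # Any unrecognized col1 value is treated as a custom property,
            # with col1 kept as the category label.
            prop = {"category": key, "name": values[0], "value": values[1]}
            if values[2]:
                prop["name_definition"] = values[2]
            if values[3]:
                prop["value_definition"] = values[3]
            meta["properties"].append(prop)
    missing = sorted(k for k in REQUIRED if not meta.get(k))
    if missing:
        raise ValueError(f"missing required rows: {', '.join(missing)}")
    return meta


table = "\n".join([
    "identifier\t1234",
    "name\tDemo",
    "keywords\tminimal\texample\tcatalog",
    "property\tStorage\t7PB",
    "property\tSource\tOpen",
])
meta = parse_dataset_table(table)
```

A table without an `identifier` or `name` row raises `ValueError`, matching the [required] markers above.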

@mih

mih commented Jun 6, 2023

Given the specification above, for the SFB1451 we can additionally require the following (already covered by the spec above, just spelled out here).

Origin SFB project [required]

Columns 3 and 4 are implicit, 3 is constant, and 4 is the link to the project landing page on the SFB website (for example).

| col1 | col2 | col3 | col4[*] | col5[*] |
| --- | --- | --- | --- | --- |
| sfb1451 | project | https://schema.org/Text | https://schema.org/ResearchProject | https://schema.org/url |

To be extended with relevant metadata, such as species, limbs, recording method, ...

@mslw

mslw commented Jun 6, 2023

Looks good to me so far. Regarding "publication": if the current catalog schema & rendering is the target, then the catalog would show (in different fields): Title, Authors, Publication year, DOI & Journal. I would be cautious about allowing free-form citation, it can be a nightmare to parse. Requiring DOI is a great idea, maybe we add other fields instead of free-form citation?

@jsheunis
Collaborator Author

jsheunis commented Jun 6, 2023

I'm confused by the definitions of Custom dataset property [optional] and Origin SFB project [required]. This is my interpretation in the form of data:

| col1 | col2 | col3 |
| --- | --- | --- |
| property | Storage | 7PB |
| property | Source | Open |
| sfb1451 | Species | Human |
| sfb1451 | Limb | Leg |
| sfb1451 | project | Z03 |

For the first two rows, col2 and col3 represent the name and value of a custom property, respectively. Can you correct me where I'm misinterpreting?

@mih

mih commented Jun 6, 2023

Where/what is the confusion?

@jsheunis
Collaborator Author

jsheunis commented Jun 6, 2023

If the sample data I provided in the table looks correct, then I guess my interpretation is correct.

But for clarity:

Custom properties:

Any row with a col1 cell value that does not match a recognized field is interpreted as a custom property.

Does that mean it should not be indicated with property in col1 (as per my example)? If not, what would need to be supplied in col3, because my interpretation would be that col1 contains the property name, and col2 the property value.

Origin SFB project: I'm not confused about this anymore.

@mih

mih commented Jun 6, 2023

Wording could probably be improved, but you did it according to my intent for the spec, hence I cannot explain why it should be different. It should be exactly like you did.

But ideally also with the other two columns that bring the semantic clarity to name and value.
