Function to convert DataCite metadata to EML: good fit for this package? #341

Open
peterdesmet opened this issue May 3, 2022 · 7 comments

@peterdesmet
Member

👋 I have written a function that converts DataCite metadata to EML, with the DOI as the parameter. I'm planning to use this for at least 2 packages and was wondering if this function would be a good fit to be added to the EML package?

Context

I'm in the process of publishing bird tracking datasets that I already published on Zenodo to GBIF, to open them up to a wider audience. One of the steps in the process is converting the dataset metadata to EML, which can then be uploaded to a GBIF IPT for publication. I don't want to do this manually, which is why I wrote a function. To make it more generic than Zenodo, I'm pulling the metadata from the DataCite.org API (rather than the Zenodo API), which is where research repositories push their metadata when they mint a DataCite DOI.
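
As a rough sketch of the lookup described above, the base R helpers below normalise a DOI and build a request URL for the DataCite REST API. The helper names (normalize_doi(), datacite_url(), fetch_datacite()) are hypothetical and not taken from movepub; the https://api.datacite.org/dois/ endpoint and the use of jsonlite for parsing are assumptions about how one might fetch the record, not a description of the actual implementation.

```r
# Hypothetical helpers; names are illustrative, not part of movepub or EML.

# Strip an optional resolver prefix so both DOI forms are accepted.
normalize_doi <- function(doi) {
  sub("^https?://(dx\\.)?doi\\.org/", "", doi)
}

# Build the DataCite REST API URL for a DOI (endpoint is an assumption).
datacite_url <- function(doi) {
  paste0("https://api.datacite.org/dois/", normalize_doi(doi))
}

# Fetch the record as a nested list; needs network access and jsonlite.
# Assumes the JSON:API layout where metadata sits under data$attributes.
fetch_datacite <- function(doi) {
  response <- jsonlite::fromJSON(datacite_url(doi), simplifyVector = FALSE)
  response$data$attributes
}

datacite_url("https://doi.org/10.5281/zenodo.5879096")
```

Only the last line runs without network access; fetch_datacite() is defined but not called here.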

Functionality

library(movepub)
doi <- "https://doi.org/10.5281/zenodo.5879096" # Also works as "10.5281/zenodo.5879096"
datacite_to_eml(doi)
#> $dataset
#> $dataset$title
#> [1] "O_WESTERSCHELDE - Eurasian oystercatchers (Haematopus ostralegus, Haematopodidae) breeding in East Flanders (Belgium)"
#> 
#> $dataset$abstract
#> $dataset$abstract$para
#> [1] "<![CDATA[<em>O_WESTERSCHELDE - Eurasian oystercatchers (Haematopus ostralegus, Haematopodidae) breeding in East Flanders (Belgium)</em> is a bird tracking dataset published by the Research Institute for Nature and Forest (INBO). It contains animal tracking data collected by the LifeWatch GPS tracking network for large birds (http://lifewatch.be/en/gps-tracking-network-large-birds) for the project/study <strong>O_WESTERSCHELDE</strong>, using trackers developed by the University of Amsterdam Bird Tracking System (UvA-BiTS, http://www.uva-bits.nl). The study has been operational since 2018. In total 13 individuals of Eurasian oystercatchers (<em>Haematopus ostralegus</em>) have been tagged in their breeding area in East Flanders (Belgium), west of the river Scheldt, mainly to study their habitat use on mudflats of the Western Scheldt (the Netherlands). Data are uploaded from the UvA-BiTS database to Movebank and from there archived on Zenodo (see https://github.com/inbo/bird-tracking). No new data are expected. Data in this package are exported from Movebank study <em>O_WESTERSCHELDE - Eurasian oystercatchers (Haematopus ostralegus, Haematopodidae) breeding in East Flanders (Belgium)</em> (Movebank Study ID 1099562810), which can be viewed at https://www.movebank.org/cms/webapp?gwt_fragment=page=studies,path=study1099562810. Fields in the data follow the Movebank Attribute Dictionary (http://vocab.nerc.ac.uk/collection/MVB) and are described in <code>datapackage.json</code>. <strong>Files</strong> Files are structured as a Frictionless Data Package. You can access all data in R via <code>https://zenodo.org/record/5879096/files/datapackage.json</code> using frictionless. <strong>datapackage.json</strong>: technical description of the data files. <strong>O_WESTERSCHELDE-reference-data.csv</strong>: reference data about the animals, tags and deployments. <strong>O_WESTERSCHELDE-gps-yyyy.csv.gz</strong>: GPS data recorded by the tags, grouped by year. <strong>O_WESTERSCHELDE-acceleration-yyyy.csv.gz</strong>: acceleration data recorded by the tags, grouped by year.]]>"
#> [2] "This dataset was collected using infrastructure provided by INBO and funded by Research Foundation - Flanders (FWO) as part of the Belgian contribution to LifeWatch. Additional funding was provided by the Sovon Dutch Centre for Field Ornithology."
#> 
#> 
#> $dataset$contact
#> list()
#> 
#> $dataset$creator
#> $dataset$creator[[1]]
#> $dataset$creator[[1]]$individualName
#> $dataset$creator[[1]]$individualName$givenName
#> [1] "Geert"
#> 
#> $dataset$creator[[1]]$individualName$surName
#> [1] "Spanoghe"
#> 
#> 
#> 
#> $dataset$creator[[2]]
#> $dataset$creator[[2]]$individualName
#> $dataset$creator[[2]]$individualName$givenName
#> [1] "Peter"
#> 
#> $dataset$creator[[2]]$individualName$surName
#> [1] "Desmet"
#> 
#> 
#> $dataset$creator[[2]]$userId
#> $dataset$creator[[2]]$userId$directory
#> [1] "http://orcid.org/"
#> 
#> $dataset$creator[[2]]$userId[[2]]
#> [1] "0000-0002-8442-8025"
#> 
#> 
#> 
#> $dataset$creator[[3]]
#> $dataset$creator[[3]]$individualName
#> $dataset$creator[[3]]$individualName$givenName
#> [1] "Tanja"
#> 
#> $dataset$creator[[3]]$individualName$surName
#> [1] "Milotic"
#> 
#> 
#> $dataset$creator[[3]]$userId
#> $dataset$creator[[3]]$userId$directory
#> [1] "http://orcid.org/"
#> 
#> $dataset$creator[[3]]$userId[[2]]
#> [1] "0000-0002-3129-6196"
#> 
#> 
#> 
#> $dataset$creator[[4]]
#> $dataset$creator[[4]]$individualName
#> $dataset$creator[[4]]$individualName$givenName
#> [1] "Gunther"
#> 
#> $dataset$creator[[4]]$individualName$surName
#> [1] "Van Ryckegem"
#> 
#> 
#> $dataset$creator[[4]]$userId
#> $dataset$creator[[4]]$userId$directory
#> [1] "http://orcid.org/"
#> 
#> $dataset$creator[[4]]$userId[[2]]
#> [1] "0000-0001-8788-0001"
#> 
#> 
#> 
#> $dataset$creator[[5]]
#> $dataset$creator[[5]]$individualName
#> $dataset$creator[[5]]$individualName$givenName
#> [1] "Joost"
#> 
#> $dataset$creator[[5]]$individualName$surName
#> [1] "Vanoverbeke"
#> 
#> 
#> $dataset$creator[[5]]$userId
#> $dataset$creator[[5]]$userId$directory
#> [1] "http://orcid.org/"
#> 
#> $dataset$creator[[5]]$userId[[2]]
#> [1] "0000-0002-3893-9529"
#> 
#> 
#> 
#> $dataset$creator[[6]]
#> $dataset$creator[[6]]$individualName
#> $dataset$creator[[6]]$individualName$givenName
#> [1] "Bruno J."
#> 
#> $dataset$creator[[6]]$individualName$surName
#> [1] "Ens"
#> 
#> 
#> $dataset$creator[[6]]$userId
#> $dataset$creator[[6]]$userId$directory
#> [1] "http://orcid.org/"
#> 
#> $dataset$creator[[6]]$userId[[2]]
#> [1] "0000-0002-4659-4807"
#> 
#> 
#> 
#> $dataset$creator[[7]]
#> $dataset$creator[[7]]$individualName
#> $dataset$creator[[7]]$individualName$givenName
#> [1] "Willem"
#> 
#> $dataset$creator[[7]]$individualName$surName
#> [1] "Bouten"
#> 
#> 
#> $dataset$creator[[7]]$userId
#> $dataset$creator[[7]]$userId$directory
#> [1] "http://orcid.org/"
#> 
#> $dataset$creator[[7]]$userId[[2]]
#> [1] "0000-0002-5250-8872"
#> 
#> 
#> 
#> 
#> $dataset$metadataProvider
#> list()
#> 
#> $dataset$keywordSet
#> $dataset$keywordSet[[1]]
#> $dataset$keywordSet[[1]]$keywordThesaurus
#> [1] "n/a"
#> 
#> $dataset$keywordSet[[1]]$keyword
#>  [1] "animal movement"  "animal tracking"  "gps tracking"     "accelerometer"   
#>  [5] "altitude"         "temperature"      "biologging"       "birds"           
#>  [9] "LifeWatch"        "UvA-BiTS"         "Movebank"         "frictionlessdata"
#> 
#> 
#> 
#> $dataset$pubDate
#> [1] "2022-01-19"
#> 
#> $dataset$intellectualRights
#> [1] "cc0-1.0"
#> 
#> $dataset$alternateIdentifier
#> [1] "https://doi.org/10.5281/zenodo.5879096"

Created on 2022-05-03 by the reprex package (v2.0.1)
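
The creator entries in the output above could be built along the following lines. This is a minimal base R sketch, not the movepub code: creator_to_eml() is a hypothetical helper, the input list is illustrative data shaped like the creators array of a DataCite record (givenName, familyName, nameIdentifiers), and only ORCID identifiers are handled.

```r
# Illustrative input, shaped like the `creators` array of a DataCite record.
creators <- list(
  list(givenName = "Geert", familyName = "Spanoghe", nameIdentifiers = list()),
  list(
    givenName = "Peter", familyName = "Desmet",
    nameIdentifiers = list(list(
      nameIdentifier = "0000-0002-8442-8025",
      nameIdentifierScheme = "ORCID"
    ))
  )
)

# Hypothetical helper: convert one DataCite creator to an EML-style list.
creator_to_eml <- function(creator) {
  out <- list(individualName = list(
    givenName = creator$givenName,
    surName = creator$familyName
  ))
  ids <- creator$nameIdentifiers
  # Flag identifiers whose scheme is ORCID; empty input gives logical(0).
  is_orcid <- vapply(
    ids, function(id) identical(id$nameIdentifierScheme, "ORCID"), logical(1)
  )
  if (any(is_orcid)) {
    # Keep only the bare identifier, whether or not it arrives as a URL.
    orcid <- sub("^https?://orcid\\.org/", "", ids[is_orcid][[1]]$nameIdentifier)
    out$userId <- list(directory = "http://orcid.org/", orcid)
  }
  out
}

eml_creators <- lapply(creators, creator_to_eml)
str(eml_creators[[2]])
```

Creators without an ORCID simply get no userId element, which matches the first creator in the output above.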

@peterdesmet
Member Author

The suggested datacite_to_eml() function is now documented at https://inbo.github.io/movepub/reference/datacite_to_eml.html. I think the EML package would be a better home for it.

@mbjones
Member

mbjones commented Jun 7, 2022

As another convenience method to template an EML record from existing metadata, this seems useful to me and I would support its inclusion. @cboettig would you have any objections? If not, maybe @peterdesmet could submit a PR?

@cboettig
Member

cboettig commented Jun 7, 2022

👍 yeah seems like this would be helpful! PRs welcome!

@peterdesmet
Member Author

Cool, I'll see when I have some time for that. The function relies quite a lot on the purrr package. Is it fine if this is added as a dependency?

@mbjones
Member

mbjones commented Jun 7, 2022

For the packages I maintain, I try to keep dependencies to a minimum, especially large packages or packages that entrain a complex ecosystem, as they usually cause maintenance headaches down the road. We spend a fair number of cycles just treading water on package dependencies, trying to keep packages on CRAN. Backwards-incompatible changes, or a package being supplanted by a "newer" version (as is common for RStudio packages), have caused a lot of churn for us. That said, if you really need it, then that is what they are there for. But keep in mind that each dependency is a potential future upgrade problem.

@peterdesmet
Member Author

I share your sentiment. I'll see if I can replace my three uses of purrr::map_chr() and two uses of purrr::map() with a base R alternative, if it remains readable (code at https://github.com/inbo/movepub/blob/main/R/datacite_to_eml.R#L26-L49).

@cboettig
Member

cboettig commented Jun 7, 2022

map is essentially lapply, and map_chr is essentially vapply with a template type. e.g.

  keywords <- purrr::map_chr(metadata$subjects, "subject")
  ## is the same as
  keywords <- vapply(metadata$subjects, `[[`, character(1L), "subject")

(Yes, [[ is the familiar subsetting function; recall that in R everything is a function.) (Not tested)

That said, purrr is a light dependency compared to some things EML already pulls in....
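
To make the equivalence above concrete, here is a small self-contained check in base R. The metadata list is a hypothetical stand-in for the subjects part of a DataCite record, not real API output, and the variable names are illustrative only.

```r
# Hypothetical stand-in for the `subjects` part of a DataCite record.
metadata <- list(subjects = list(
  list(subject = "animal movement"),
  list(subject = "gps tracking"),
  list(subject = "birds")
))

# purrr::map_chr(x, "subject") becomes vapply() with a character(1) template.
keywords <- vapply(metadata$subjects, `[[`, character(1L), "subject")

# purrr::map(x, "subject") becomes lapply() with the same extractor.
keyword_list <- lapply(metadata$subjects, `[[`, "subject")

# vapply() fails loudly if an element lacks the field, because `[[` then
# returns NULL, which does not match the character(1) template.
incomplete <- list(list(subject = "birds"), list(other = "x"))
result <- tryCatch(
  vapply(incomplete, `[[`, character(1L), "subject"),
  error = function(e) "error"
)

print(keywords)
```

The type template is what keeps vapply() as strict as map_chr(): a missing or non-string field raises an error instead of silently producing a list.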
