Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add function to retrieve taxonomy version and tests to verify version #132

Merged
merged 33 commits into from
Aug 12, 2024

Conversation

jrdnbradford
Copy link
Contributor

Closes #131.

Description

This adds a GitHub Action Workflow to check whether the eBird taxonomy has updated and opens an issue if it has.

  • .github/workflows/taxonomy-check.yaml: a workflow file with two jobs. The first runs an R script that checks whether the tax dataset is the same as that returned from ebirdtaxonomy(). If they aren't identical the second job runs. This job checks whether there is already an issue with a particular title (env.ISSUE_TITLE) and opens one if it doesn't exist. This keeps the workflow from opening duplicate issues if no one can get to the first in time.

  • data-raw/tax_check.R: a script run by the taxonomy check that checks if the current dataset is identical with that retrieved from eBird

  • R/ebirdtaxonomy.R: I updated this to sort comNameCodes and sciNameCodes, which are both characters that are comma-separated and multi-valued. In testing I found the API often returns these codes in different orders within the column (how fun!). By sorting, we ensure easier comparability between the package dataset and the most recent from eBird

  • R/zzz.R: added a sort_comma_separated function to sort the above codes

  • data-raw/tax.R: updated the comments. Looks like use_data() used to be called with internal = TRUE and the old location was left over

  • data/tax.rda: the same ole taxonomy we know and love, but with comNameCodes and sciNameCodes sorted for easier comparison.

Let me know what I can do to make this better. 🐦

@slager
Copy link
Collaborator

slager commented Jul 12, 2024

Thanks! Most of my current comments are centered around simplifying this a bit.

I don't think we want to re-sort the taxonomy, either in ebirdtaxonomy() or in the rda file. It should currently be returned in taxonomic order, which is the desired order. If there are issues, I'd recommend reordering only at the point of comparison.

All functions/code for sorting/comparison would probably best just go in the R file in data-raw/, I don't see a need to create a new package function in zzz.R since it is unlikely to be used anywhere else.

In other words, I think this can/should be done without changing anything in R/, and without changing the current tax.rda.

I'm not clear on how the badge would get produced for this, and feel like it is probably not necessary.

The taxonomy update cycle is quite predictable and seasonal. This probably only needs to check between about the beginning of August and the end of November or so.

For naming the issue, I'd recommend appending a datestamp so that it doesn't get confused next year when there's already an issue by the same name.

@sebpardo might have other comments. I'd be happy to give this a test run next but I think it would be nice to boil it down a bit as suggested above.

@sebpardo
Copy link
Contributor

sebpardo commented Jul 12, 2024

I agree that we want to avoid sorting inside ebirdtaxonomy() and especially individual columns and reassigning them into the original data frame as this is ripe for causing issues down the line. If at all, we should sort the whole data frame at once.

A potentially simpler way of checking for version updates is to create a function that uses eBird's Taxonomic Version API. This would require for our tax data frame to have a version attribute set when it's created that could be checked against the TRUE/FALSE value in the "latest" field of that API request. Checking this way, we avoid having to do any taxonomic list comparisons and we'd also have a function for another API request for which there's none at the moment. Getting two birds stoned at once!

Thoughts?

@jrdnbradford
Copy link
Contributor Author

Awesome, thanks for this feedback @slager and @sebpardo! It looks like the eBird's Taxonomic Version API is definitely the way to go here. Should definitely simplify things and make it easier to maintain. I'll make some changes!

@jrdnbradford
Copy link
Contributor Author

Round 2. This is much better.

  • R/ebirdtaxonomy.R - now adds a version attribute to the data frame
  • data/tax.rda - now includes the version attribute
  • R/ebirdtaxonomyversion.R - new function that retrieves taxonomic version data
  • tests/ - expanded to test the new function, that the tax dataset version attribute can be set, and whether tax version is the latest
  • .github/workflows/taxonomy-version-check.yaml - now runs a single test file that verifies the version is the latest. If not it opens an issue. To prevent multiple, essentially identical issues from being opened, it checks if there is already an open issue with the right title and opens one if not.

@jrdnbradford jrdnbradford changed the title Add GitHub Action to check taxonomy Add function to retrieve taxonomy version and GitHub Action to verify version Jul 13, 2024
@slager
Copy link
Collaborator

slager commented Jul 15, 2024

I need a little more time to actually review this but on first glance it's looking great @jrdnbradford

@jrdnbradford
Copy link
Contributor Author

jrdnbradford commented Jul 18, 2024

No rush here! I found this too late to use it myself, but it looks like you can test GitHub actions locally: https://github.com/nektos/act. Seems much better than committing and pushing tests of the action to the repo, if you'd like to test out the action.

@jrdnbradford
Copy link
Contributor Author

It occurred to me that the GitHub Action here is actually redundant. I already added a test that checks whether the tax version is up to date. Instead of having this GitHub Action to maintain when it's only relevant for a small portion of the year, maybe it's better to add a schedule trigger to the R-CMD-check:

on:
  push:
    branches: [main, master]
  pull_request:
    branches: [main, master]
  schedule:
    - cron: '0 0 * 8-11 1' # Midnight on Mondays from August through November

When the version updates, the version check test will fail and the repo owners will get a workflow failure email to check out. Of course it won't be as explicit as the new GitHub Action I created, you'll need to open up the workflow logs to see it's actually an update issue. But maybe this explicitness is worth giving up so that you don't have the maintain this action.

Thoughts?

@slager slager self-requested a review July 27, 2024 16:34
Copy link
Collaborator

@slager slager left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good, had a few review comments.

.github/workflows/taxonomy-version-check.yaml Outdated Show resolved Hide resolved
R/ebirdtaxonomyversion.R Outdated Show resolved Hide resolved
tests/testthat/test-data.R Outdated Show resolved Hide resolved
@slager
Copy link
Collaborator

slager commented Aug 6, 2024

I will defer to @sebpardo here.

I don't have strong preferences on this. As an eBird reviewer I do receive multiple emails every year letting me know the status of each taxonomic update as they happen.

@sebpardo
Copy link
Contributor

sebpardo commented Aug 6, 2024

Looks great. Regarding your suggested options, I think either of these two would be good:

  • throw in the unique bits of the tax version check into the R-CMD-check and run those steps only if the test step fails
  • remove the tax version check and add a schedule to R-CMD-check, with no issue created

The first would remove redundancy and would make it so that the tax-version-check bits would only run when tests fail, which is all we need. I personally like having an issue created when the taxonomy is out of date, but I would also be fine with the last option given its simplicity and the tests failing is enough of an indicator to fix things.

@slager, what are your thoughts on having R-CMD-check run on a regular interval, perhaps every 2 weeks?

@jrdnbradford, I really have no preference so I'd be happy with whichever option you choose to pursue.

@slager
Copy link
Collaborator

slager commented Aug 9, 2024

It seems fine to schedule the modified R CMD check action to do the taxonomy check! The existing R CMD check GHA workflow was very boilerplate from usethis, so in the future if something breaks with this, it should be pretty easy to get the R CMD check action back to the basic functionality we have now.

As an aside, speaking of automated GitHub Actions, another thing we had talked about doing at one point is scheduling one to run at pretty long intervals that would do the R CMD check tests, but with @sebpardo 's API key and without using VCR. If we tightened up the tests a bit and implemented that, this could catch changes to the eBird API, changes to which don't seem very well documented these days.

@jrdnbradford
Copy link
Contributor Author

jrdnbradford commented Aug 12, 2024

@slager and @sebpardo awesome, thanks for your continued help and discussion on this. Based on the discussion here I will opt for adding a schedule to R-CMD-check, with no issue created. The main arguments here being:

  • simplicity
  • there are already taxonomy update notifications sent to you all as eBird reviewers
  • running R-CMD-check on a regular basis will catch other potential issues with {rebird} and the eBird API earlier
  • running on a schedule seems to fit better with future testing aspirations for the package

I'll set this up and comment when ready.

@jrdnbradford
Copy link
Contributor Author

This should be set now. If you have other preferences for R-CMD-check's cron schedule please edit.

@sebpardo sebpardo merged commit 2e74bdf into ropensci:master Aug 12, 2024
6 checks passed
@jrdnbradford jrdnbradford changed the title Add function to retrieve taxonomy version and GitHub Action to verify version Add function to retrieve taxonomy version and tests to verify version Aug 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Addition of GitHub Action to update rebird::tax dataset on a schedule?
3 participants