Add pb_read and pb_write functions (#115)
Add new functions `pb_read()` and `pb_write()` as convenience wrappers around
the pattern of downloading to `tempfile()` and then reading into memory - closes #97
tanho63 authored Dec 30, 2023
1 parent 077a649 commit 4589222
Showing 70 changed files with 897 additions and 5,447 deletions.
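For context, the new wrappers collapse a common two-step pattern into a single call. A minimal sketch of the before/after (assuming the `cboettig/piggyback-tests` repo and its `mtcars.tsv.gz` asset, which the examples in this commit also use):

```r
library(piggyback)

# The pattern pb_read() replaces: download the asset to a temp
# directory, then read it from disk into memory.
pb_download("mtcars.tsv.gz", dest = tempdir(),
            repo = "cboettig/piggyback-tests", overwrite = TRUE)
mtcars_df <- utils::read.delim(file.path(tempdir(), "mtcars.tsv.gz"))

# The new one-call equivalent:
mtcars_df <- pb_read("mtcars.tsv.gz", repo = "cboettig/piggyback-tests")
```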
1 change: 1 addition & 0 deletions .gitignore
@@ -15,3 +15,4 @@ data/*
.DS_Store
CRAN_SUBMISSION
CRAN_RELEASE
docs/*
17 changes: 5 additions & 12 deletions DESCRIPTION
@@ -1,16 +1,10 @@
Package: piggyback
Version: 0.1.5.9004
Title: Managing Larger Data on a GitHub Repository
Description: Because larger (> 50 MB) data files cannot easily be committed to git,
a different approach is required to manage data associated with an analysis in a
GitHub repository. This package provides a simple work-around by allowing larger
(up to 2 GB) data files to piggyback on a repository as assets attached to individual
GitHub releases. These files are not handled by git in any way, but instead are
uploaded, downloaded, or edited directly by calls through the GitHub API. These
data files can be versioned manually by creating different releases. This approach
works equally well with public or private repositories. Data can be uploaded
and downloaded programmatically from scripts. No authentication is required to
download data from public repositories.
Description: Helps store files as GitHub release assets, which is a convenient
way for large/binary data files to piggyback onto public and private GitHub
repositories. Includes functions for file downloads, uploads, and managing
releases via the GitHub API.
Authors@R: c(person("Carl", "Boettiger",
email = "[email protected]",
role = c("aut", "cre", "cph"),
@@ -43,9 +37,8 @@ Imports:
memoise,
rlang
Suggests:
arrow,
spelling,
duckdbfs,
duckdb,
readr,
covr,
testthat,
2 changes: 2 additions & 0 deletions NAMESPACE
@@ -6,7 +6,9 @@ export(pb_download)
export(pb_download_url)
export(pb_list)
export(pb_new_release)
export(pb_read)
export(pb_release_create)
export(pb_release_delete)
export(pb_releases)
export(pb_upload)
export(pb_write)
2 changes: 2 additions & 0 deletions NEWS.md
@@ -8,6 +8,8 @@ before trying API download URLs. This should reduce/eliminate effect of API rate
limits for pb_download. [#109]
* `"latest"` release now aligns with GitHub's "latest" release definition [#113]
* `pb_download_url()` now can return choice of "browser" or "api" download URLs [#116]
* Add new functions `pb_read()` and `pb_write()` as convenience wrappers around
the pattern of downloading to `tempfile()` and then reading into memory. [#97]

# piggyback 0.1.5

2 changes: 1 addition & 1 deletion R/pb_download_url.R
@@ -70,7 +70,7 @@ pb_download_url <- function(file = NULL,
file <- file[file %in% df$file_name]
}

if(length(file) == 0) return(cli::cli_abort("No download URLs to return."))
if(length(file) == 0) return(cli::cli_abort("No download URLs found"))

switch(
url_type,
90 changes: 90 additions & 0 deletions R/pb_read.R
@@ -0,0 +1,90 @@
#' Read one file into memory
#'
#' A convenience wrapper around downloading a file from a specified repo/release
#' to a temporary file and then reading it into memory. This convenience comes
#' at a cost to performance efficiency, since it first downloads the data to
#' disk and then reads the data from disk into memory. See
#' `vignette("cloud_native")` for alternative ways to bypass this flow and work
#' with the data directly.
#'
#' @param file string: file name
#' @param repo string: GH repository name in format "owner/repo". Default
#' `guess_repo()` tries to guess based on current working directory's git repo
#' @param tag string: tag for the GH release, defaults to "latest"
#' @param read_function function: used to read in the data, where the file is
#' passed as the first argument and any additional arguments are subsequently
#' passed in via `...`. Default `guess_read_function(file)` will check the file
#' extension and try to find an appropriate read function if the extension is one
#' of rds, csv, tsv, parquet, txt, or json, and will abort if not found.
#' @param ... additional arguments passed to `read_function` after file
#' @param .token GitHub authentication token, see [gh::gh_token()]
#'
#' @export
#' @family pb_rw
#'
#' @return Result of applying `read_function` to the downloaded file.
#' @examples \donttest{
#' try({ # try block is to avoid CRAN issues and is not required in ordinary usage
#' piggyback::pb_read("mtcars.tsv.gz", repo = "cboettig/piggyback-tests")
#' })
#' }
pb_read <- function(file,
...,
repo = guess_repo(),
tag = "latest",
read_function = guess_read_function(file),
.token = gh::gh_token()) {
stopifnot(
is.character(file) && length(file) == 1,
is.character(repo) && length(repo) == 1,
is.character(tag) && length(tag) == 1,
rlang::is_function(read_function)
)

on.exit(unlink(file.path(tempdir(), file)))

pb_download(
file = file,
dest = tempdir(check = TRUE),
repo = repo,
tag = tag,
overwrite = TRUE,
.token = .token
)

read_function(file.path(tempdir(), file), ...)
}

#' Guess read function from file extension
#'
#' This function accepts a filename and tries to return a valid function for
#' reading it.
#'
#' `guess_read_function` understands the following file extensions:
#' - rds with `readRDS`
#' - csv, csv.gz, csv.xz with `utils::read.csv`
#' - tsv, tsv.gz, tsv.xz with `utils::read.delim`
#' - parquet with `arrow::read_parquet`
#' - txt, txt.gz, txt.xz with `readLines`
#' - json, json.gz, json.xz with `jsonlite::fromJSON`
#'
#' @family pb_rw
#' @param file filename to parse
#' @return function for reading the file, if found
#' @keywords internal
guess_read_function <- function(file){
# strip compression suffixes so e.g. "data.csv.gz" resolves to "csv"
file_ext <- tools::file_ext(gsub(x = file, pattern = "\\.gz$|\\.xz$", replacement = ""))
if (file_ext == "parquet") rlang::check_installed("arrow")

read_fn <- switch(
file_ext,
"rds" = readRDS,
"csv" = utils::read.csv,
"tsv" = utils::read.delim,
"parquet" = arrow::read_parquet,
"txt" = readLines,
"json" = jsonlite::fromJSON,
cli::cli_abort("File type {.val {file_ext}} is not recognized, please provide a {.arg read_function}")
)

return(read_fn)
}
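A short usage sketch of the fallback path: for extensions outside the recognized set, `pb_read()` aborts unless `read_function` is supplied. The `data.feather` asset below is hypothetical, not something this commit ships:

```r
# Unrecognized extension: supply the reader explicitly
# (arrow::read_feather assumed installed; the asset name is illustrative)
dat <- pb_read("data.feather",
               repo = "cboettig/piggyback-tests",
               read_function = arrow::read_feather)

# Arguments after `file` are forwarded to the read function via `...`,
# here to utils::read.delim():
df <- pb_read("mtcars.tsv.gz", stringsAsFactors = TRUE,
              repo = "cboettig/piggyback-tests")
```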
84 changes: 84 additions & 0 deletions R/pb_write.R
@@ -0,0 +1,84 @@
#' Write one object to repo/release
#'
#' A convenience wrapper around writing an object to a temporary file and then
#' uploading to a specified repo/release.
#'
#' @param x object: memory object to save to piggyback
#' @param file string: file name
#' @param ... additional arguments passed to `write_function`
#' @param repo string: GH repository name in format "owner/repo". Default
#' `guess_repo()` tries to guess based on current working directory's git repo
#' @param tag string: tag for the GH release, defaults to "latest"
#' @param write_function function: used to write an R object to file, where the
#' object is passed as the first argument, the filename as the second argument,
#' and any additional arguments are subsequently passed in via `...`. Default
#' `guess_write_function(file)` will check the file extension and try to find an
#' appropriate write function if the extension is one of rds, csv, tsv, parquet,
#' txt, or json, and will abort if not found.
#' @param .token GitHub authentication token, see [gh::gh_token()]
#'
#' @export
#' @family pb_rw
#'
#' @return The GitHub API response from uploading the file to the release.
#' @examples \donttest{
#' \dontshow{if (interactive()) \{}
#' pb_write(mtcars, "mtcars.rds", repo = "tanho63/piggyback-tests")
#' #> ℹ Uploading to latest release: "v0.0.2".
#' #> ℹ Uploading mtcars.rds ...
#' #> |===============================================================| 100%
#' \dontshow{\}}
#'}
pb_write <- function(x,
file,
...,
repo = guess_repo(),
tag = "latest",
write_function = guess_write_function(file),
.token = gh::gh_token()) {
stopifnot(
is.character(file) && length(file) == 1,
is.character(repo) && length(repo) == 1,
is.character(tag) && length(tag) == 1,
rlang::is_function(write_function)
)
destfile <- file.path(tempdir(check = TRUE), file)
on.exit(try(unlink(destfile)))
write_function(x, destfile, ...)
pb_upload(destfile, repo = repo, tag = tag, .token = .token)
}

#' Guess write function from file extension
#'
#' This function accepts a filename and tries to return a valid function for
#' writing to it.
#'
#' `guess_write_function` understands the following file extensions:
#' - rds with `saveRDS`
#' - csv, csv.gz, csv.xz with `utils::write.csv`
#' - tsv, tsv.gz, tsv.xz with `utils::write.table` where sep is set to `"\t"`
#' - parquet with `arrow::write_parquet`
#' - txt, txt.gz, txt.xz with `writeLines`
#' - json, json.gz, json.xz with `jsonlite::write_json`
#'
#' @family pb_rw
#' @param file filename to parse
#' @return function for writing the file, if found
#' @keywords internal
guess_write_function <- function(file){
# strip compression suffixes so e.g. "data.csv.gz" resolves to "csv"
file_ext <- tools::file_ext(gsub(x = file, pattern = "\\.gz$|\\.xz$", replacement = ""))
if (file_ext == "parquet") rlang::check_installed("arrow")

write_fn <- switch(
file_ext,
"rds" = saveRDS,
"csv" = utils::write.csv,
"tsv" = function(x, file, ..., sep = "\t") utils::write.csv(x = x, file = file, sep = sep, ...),
"txt" = writeLines,
"parquet" = arrow::write_parquet,
"json" = jsonlite::write_json,
cli::cli_abort("File type {.val {file_ext}} is not recognized, please provide a {.arg write_function}")
)

return(write_fn)
}
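And the symmetric sketch for the write path; the `.qs` file and the `qs` package here are illustrative assumptions, not dependencies added by this commit:

```r
# Extension-based guess: saveRDS() is chosen for .rds
pb_write(mtcars, "mtcars.rds", repo = "tanho63/piggyback-tests")

# Unrecognized extension: pass write_function yourself
# (qs::qsave(x, file) serializes a single R object; assumed installed)
pb_write(mtcars, "mtcars.qs",
         repo = "tanho63/piggyback-tests",
         write_function = qs::qsave)
```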
17 changes: 8 additions & 9 deletions README.Rmd
@@ -7,7 +7,7 @@ output: github_document

```{r setup, include = FALSE}
knitr::opts_chunk$set(
eval = TRUE,
eval = FALSE,
collapse = TRUE,
message = FALSE,
comment = "#>",
@@ -29,7 +29,7 @@ knitr::opts_chunk$set(
[![DOI](http://joss.theoj.org/papers/10.21105/joss.00971/status.svg)](https://doi.org/10.21105/joss.00971)
<!-- badges: end -->

`{piggyback}` provides an R interface for storing files as GitHub release assets,
`piggyback` provides an R interface for storing files as GitHub release assets,
which is a convenient way for large/binary data files to _piggyback_ onto public
and private GitHub repositories. This package includes functions for file downloads,
uploads, and release management, all of which call the GitHub API.
@@ -38,12 +38,12 @@ No authentication is required to download data from public repositories.

## Installation
Install from CRAN via:
```r
```{r}
install.packages("piggyback")
```
You can install the development version from [GitHub](https://github.com/ropensci/piggyback)
with either r-universe or with remotes:
```r
```{r}
install.packages("piggyback", repos = c('https://ropensci.r-universe.dev', getOption("repos")))
# install.packages("remotes")
remotes::install_github("ropensci/piggyback")
@@ -53,7 +53,7 @@ See [getting started vignette](https://docs.ropensci.org/piggyback/articles/intr
for a more comprehensive introduction.

Download data attached to a GitHub release:
```r
```{r}
library(piggyback)
pb_download("iris2.tsv.gz",
repo = "cboettig/piggyback-tests",
@@ -72,12 +72,12 @@ or a GITHUB_PAT environment variable - for more information, see the vignette no
[authentication](https://docs.ropensci.org/piggyback/articles/piggyback.html#authentication).

We can also upload data to a release. Start by creating a release:
```r
```{r}
pb_release_create(repo = "cboettig/piggyback-tests", tag = "v0.0.2")
#> ✔ Created new release "v0.0.2".
```
then upload to it:
```r
```{r}
readr::write_tsv(mtcars, "mtcars.tsv.gz")
pb_upload("mtcars.tsv.gz", repo = "cboettig/piggyback-tests")
#> ℹ Uploading to latest release: "v0.0.2".
@@ -117,8 +117,7 @@ Please note that this project is released with a
[Contributor Code of Conduct](https://ropensci.org/code-of-conduct/).
By participating in this project you agree to abide by its terms.

```{r include=FALSE}
unlink("*.gz")
```{r eval=TRUE, include=FALSE}
codemeta::write_codemeta()
```

4 changes: 2 additions & 2 deletions README.md
@@ -17,8 +17,8 @@ Status](https://badges.ropensci.org/220_status.svg)](https://github.com/ropensci
[![DOI](http://joss.theoj.org/papers/10.21105/joss.00971/status.svg)](https://doi.org/10.21105/joss.00971)
<!-- badges: end -->

`{piggyback}` provides an R interface for storing files as GitHub
release assets, which is a convenient way for large/binary data files to
`piggyback` provides an R interface for storing files as GitHub release
assets, which is a convenient way for large/binary data files to
*piggyback* onto public and private GitHub repositories. This package
includes functions for file downloads, uploads, and release management, all
of which call the GitHub API.
18 changes: 15 additions & 3 deletions codemeta.json
@@ -2,12 +2,12 @@
"@context": "https://doi.org/10.5063/schema/codemeta-2.0",
"@type": "SoftwareSourceCode",
"identifier": "piggyback",
"description": "Because larger (> 50 MB) data files cannot easily be committed to git, a different approach is required to manage data associated with an analysis in a GitHub repository. This package provides a simple work-around by allowing larger (up to 2 GB) data files to piggyback on a repository as assets attached to individual GitHub releases. These files are not handled by git in any way, but instead are uploaded, downloaded, or edited directly by calls through the GitHub API. These data files can be versioned manually by creating different releases. This approach works equally well with public or private repositories. Data can be uploaded and downloaded programmatically from scripts. No authentication is required to download data from public repositories.",
"description": "Helps store files as GitHub release assets, which is a convenient way for large/binary data files to piggyback onto public and private GitHub repositories. Includes functions for file downloads, uploads, and managing releases via the GitHub API.",
"name": "piggyback: Managing Larger Data on a GitHub Repository",
"codeRepository": "https://github.com/ropensci/piggyback",
"issueTracker": "https://github.com/ropensci/piggyback/issues",
"license": "https://spdx.org/licenses/GPL-3.0",
"version": "0.1.5.9003",
"version": "0.1.5.9004",
"programmingLanguage": {
"@type": "ComputerLanguage",
"name": "R",
@@ -74,6 +74,18 @@
}
],
"softwareSuggestions": [
{
"@type": "SoftwareApplication",
"identifier": "arrow",
"name": "arrow",
"provider": {
"@id": "https://cran.r-project.org",
"@type": "Organization",
"name": "Comprehensive R Archive Network (CRAN)",
"url": "https://cran.r-project.org"
},
"sameAs": "https://CRAN.R-project.org/package=arrow"
},
{
"@type": "SoftwareApplication",
"identifier": "spelling",
@@ -282,5 +294,5 @@
},
"SystemRequirements": null
},
"fileSize": "380.757KB"
"fileSize": "397.705KB"
}