Add pb_read and pb_write functions (#115)
Add new functions `pb_read()` and `pb_write()` as convenience wrappers around
the pattern of downloading to `tempfile()` and then reading into memory - closes #97
tanho63 authored Dec 30, 2023
1 parent 077a649 commit 4589222
Showing 70 changed files with 897 additions and 5,447 deletions.
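For context, the new wrappers collapse a common two-step pattern into a single call. A minimal sketch of the before/after (assuming the `cboettig/piggyback-tests` repo and its `mtcars.tsv.gz` asset, which the examples in this commit also use):

```r
library(piggyback)

# The pattern pb_read() replaces: download the asset to a temp
# directory, then read it from disk into memory.
pb_download("mtcars.tsv.gz", dest = tempdir(),
            repo = "cboettig/piggyback-tests", overwrite = TRUE)
mtcars_df <- utils::read.delim(file.path(tempdir(), "mtcars.tsv.gz"))

# The new one-call equivalent:
mtcars_df <- pb_read("mtcars.tsv.gz", repo = "cboettig/piggyback-tests")
```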
1 change: 1 addition & 0 deletions .gitignore
@@ -15,3 +15,4 @@ data/*
.DS_Store
CRAN_SUBMISSION
CRAN_RELEASE
docs/*
17 changes: 5 additions & 12 deletions DESCRIPTION
@@ -1,16 +1,10 @@
Package: piggyback
Version: 0.1.5.9004
Title: Managing Larger Data on a GitHub Repository
Description: Because larger (> 50 MB) data files cannot easily be committed to git,
a different approach is required to manage data associated with an analysis in a
GitHub repository. This package provides a simple work-around by allowing larger
(up to 2 GB) data files to piggyback on a repository as assets attached to individual
GitHub releases. These files are not handled by git in any way, but instead are
uploaded, downloaded, or edited directly by calls through the GitHub API. These
data files can be versioned manually by creating different releases. This approach
works equally well with public or private repositories. Data can be uploaded
and downloaded programmatically from scripts. No authentication is required to
download data from public repositories.
Description: Helps store files as GitHub release assets, which is a convenient
way for large/binary data files to piggyback onto public and private GitHub
repositories. Includes functions for file downloads, uploads, and managing
releases via the GitHub API.
Authors@R: c(person("Carl", "Boettiger",
email = "[email protected]",
role = c("aut", "cre", "cph"),
@@ -43,9 +37,8 @@ Imports:
memoise,
rlang
Suggests:
arrow,
spelling,
duckdbfs,
duckdb,
readr,
covr,
testthat,
2 changes: 2 additions & 0 deletions NAMESPACE
@@ -6,7 +6,9 @@ export(pb_download)
export(pb_download_url)
export(pb_list)
export(pb_new_release)
export(pb_read)
export(pb_release_create)
export(pb_release_delete)
export(pb_releases)
export(pb_upload)
export(pb_write)
2 changes: 2 additions & 0 deletions NEWS.md
@@ -8,6 +8,8 @@ before trying API download URLs. This should reduce/eliminate effect of API rate
limits for pb_download. [#109]
* `"latest"` release now aligns with GitHub's "latest" release definition [#113]
* `pb_download_url()` now can return choice of "browser" or "api" download URLs [#116]
* Add new functions `pb_read()` and `pb_write()` as convenience wrappers around
the pattern of downloading to `tempfile()` and then reading into memory. [#97]

# piggyback 0.1.5

2 changes: 1 addition & 1 deletion R/pb_download_url.R
@@ -70,7 +70,7 @@ pb_download_url <- function(file = NULL,
file <- file[file %in% df$file_name]
}

if(length(file) == 0) return(cli::cli_abort("No download URLs to return."))
if(length(file) == 0) return(cli::cli_abort("No download URLs found"))

switch(
url_type,
90 changes: 90 additions & 0 deletions R/pb_read.R
@@ -0,0 +1,90 @@
#' Read one file into memory
#'
#' A convenience wrapper around downloading a file from a specified repo/release
#' to a temporary file and then reading it into memory. This convenience comes
#' at a cost to performance efficiency, since it first downloads the data to
#' disk and then reads the data from disk into memory. See
#' `vignette("cloud_native")` for alternative ways to bypass this flow and work
#' with the data directly.
#'
#' @param file string: file name
#' @param repo string: GH repository name in format "owner/repo". Default
#' `guess_repo()` tries to guess based on current working directory's git repo
#' @param tag string: tag for the GH release, defaults to "latest"
#' @param read_function function: used to read in the data, where the file is
#' passed as the first argument and any additional arguments are subsequently
#' passed in via `...`. Default `guess_read_function(file)` will check the file
#' extension and try to find an appropriate read function if the extension is one
#' of rds, csv, tsv, parquet, txt, or json, and will abort if not found.
#' @param ... additional arguments passed to `read_function` after file
#' @param .token GitHub authentication token, see [gh::gh_token()]
#'
#' @export
#' @family pb_rw
#'
#' @return Result of applying `read_function` to the downloaded file.
#' @examples \donttest{
#' try({ # try block is to avoid CRAN issues and is not required in ordinary usage
#' piggyback::pb_read("mtcars.tsv.gz", repo = "cboettig/piggyback-tests")
#' })
#' }
pb_read <- function(file,
...,
repo = guess_repo(),
tag = "latest",
read_function = guess_read_function(file),
.token = gh::gh_token()) {
stopifnot(
is.character(file) && length(file) == 1,
is.character(repo) && length(repo) == 1,
is.character(tag) && length(tag) == 1,
rlang::is_function(read_function)
)

on.exit(unlink(file.path(tempdir(), file)))

pb_download(
file = file,
dest = tempdir(check = TRUE),
repo = repo,
tag = tag,
overwrite = TRUE,
.token = .token
)

read_function(file.path(tempdir(), file), ...)
}

#' Guess read function from file extension
#'
#' This function accepts a filename and tries to return a valid function for
#' reading it.
#'
#' `guess_read_function` understands the following file extensions:
#' - rds with `readRDS`
#' - csv, csv.gz, csv.xz with `utils::read.csv`
#' - tsv, tsv.gz, tsv.xz with `utils::read.delim`
#' - parquet with `arrow::read_parquet`
#' - txt, txt.gz, txt.xz with `readLines`
#' - json, json.gz, json.xz with `jsonlite::fromJSON`
#'
#' @family pb_rw
#' @param file filename to parse
#' @return function for reading the file, if found
#' @keywords internal
guess_read_function <- function(file){
# strip compression suffixes so e.g. "data.csv.gz" resolves to "csv"
file_ext <- tools::file_ext(gsub(x = file, pattern = "\\.gz$|\\.xz$", replacement = ""))
if (file_ext == "parquet") rlang::check_installed("arrow")

read_fn <- switch(
file_ext,
"rds" = readRDS,
"csv" = utils::read.csv,
"tsv" = utils::read.delim,
"parquet" = arrow::read_parquet,
"txt" = readLines,
"json" = jsonlite::fromJSON,
cli::cli_abort("File type {.val {file_ext}} is not recognized, please provide a {.arg read_function}")
)

return(read_fn)
}
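A short usage sketch of the fallback path: for extensions outside the recognized set, `pb_read()` aborts unless `read_function` is supplied. The `data.feather` asset below is hypothetical, not something this commit ships:

```r
# Unrecognized extension: supply the reader explicitly
# (arrow::read_feather assumed installed; the asset name is illustrative)
dat <- pb_read("data.feather",
               repo = "cboettig/piggyback-tests",
               read_function = arrow::read_feather)

# Arguments after `file` are forwarded to the read function via `...`,
# here to utils::read.delim():
df <- pb_read("mtcars.tsv.gz", stringsAsFactors = TRUE,
              repo = "cboettig/piggyback-tests")
```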
84 changes: 84 additions & 0 deletions R/pb_write.R
@@ -0,0 +1,84 @@
#' Write one object to repo/release
#'
#' A convenience wrapper around writing an object to a temporary file and then
#' uploading to a specified repo/release.
#'
#' @param x object: memory object to save to piggyback
#' @param file string: file name
#' @param ... additional arguments passed to `write_function`
#' @param repo string: GH repository name in format "owner/repo". Default
#' `guess_repo()` tries to guess based on current working directory's git repo
#' @param tag string: tag for the GH release, defaults to "latest"
#' @param write_function function: used to write an R object to file, where the
#' object is passed as the first argument, the filename as the second argument,
#' and any additional arguments are subsequently passed in via `...`. Default
#' `guess_write_function(file)` will check the file extension and try to find an
#' appropriate write function if the extension is one of rds, csv, tsv, parquet,
#' txt, or json, and will abort if not found.
#' @param .token GitHub authentication token, see [gh::gh_token()]
#'
#' @export
#' @family pb_rw
#'
#' @return The GitHub API response from uploading the file to the release.
#' @examples \donttest{
#' \dontshow{if (interactive()) \{}
#' pb_write(mtcars, "mtcars.rds", repo = "tanho63/piggyback-tests")
#' #> ℹ Uploading to latest release: "v0.0.2".
#' #> ℹ Uploading mtcars.rds ...
#' #> |===============================================================| 100%
#' \dontshow{\}}
#'}
pb_write <- function(x,
file,
...,
repo = guess_repo(),
tag = "latest",
write_function = guess_write_function(file),
.token = gh::gh_token()) {
stopifnot(
is.character(file) && length(file) == 1,
is.character(repo) && length(repo) == 1,
is.character(tag) && length(tag) == 1,
rlang::is_function(write_function)
)
destfile <- file.path(tempdir(check = TRUE), file)
on.exit(try(unlink(destfile)))
write_function(x, destfile, ...)
pb_upload(destfile, repo = repo, tag = tag, .token = .token)
}

#' Guess write function from file extension
#'
#' This function accepts a filename and tries to return a valid function for
#' writing to it.
#'
#' `guess_write_function` understands the following file extensions:
#' - rds with `saveRDS`
#' - csv, csv.gz, csv.xz with `utils::write.csv`
#' - tsv, tsv.gz, tsv.xz with `utils::write.table` where sep is set to `"\t"`
#' - parquet with `arrow::write_parquet`
#' - txt, txt.gz, txt.xz with `writeLines`
#' - json, json.gz, json.xz with `jsonlite::write_json`
#'
#' @family pb_rw
#' @param file filename to parse
#' @return function for writing the file, if found
#' @keywords internal
guess_write_function <- function(file){
# strip compression suffixes so e.g. "data.csv.gz" resolves to "csv"
file_ext <- tools::file_ext(gsub(x = file, pattern = "\\.gz$|\\.xz$", replacement = ""))
if (file_ext == "parquet") rlang::check_installed("arrow")

write_fn <- switch(
file_ext,
"rds" = saveRDS,
"csv" = utils::write.csv,
"tsv" = function(x, file, ..., sep = "\t") utils::write.csv(x = x, file = file, sep = sep, ...),
"txt" = writeLines,
"parquet" = arrow::write_parquet,
"json" = jsonlite::write_json,
cli::cli_abort("File type {.val {file_ext}} is not recognized, please provide a {.arg write_function}")
)

return(write_fn)
}
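And the symmetric sketch for the write path; the `.qs` file and the `qs` package here are illustrative assumptions, not dependencies added by this commit:

```r
# Extension-based guess: saveRDS() is chosen for .rds
pb_write(mtcars, "mtcars.rds", repo = "tanho63/piggyback-tests")

# Unrecognized extension: pass write_function yourself
# (qs::qsave(x, file) serializes a single R object; assumed installed)
pb_write(mtcars, "mtcars.qs",
         repo = "tanho63/piggyback-tests",
         write_function = qs::qsave)
```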
17 changes: 8 additions & 9 deletions README.Rmd
@@ -7,7 +7,7 @@ output: github_document

```{r setup, include = FALSE}
knitr::opts_chunk$set(
eval = TRUE,
eval = FALSE,
collapse = TRUE,
message = FALSE,
comment = "#>",
@@ -29,7 +29,7 @@ knitr::opts_chunk$set(
[![DOI](http://joss.theoj.org/papers/10.21105/joss.00971/status.svg)](https://doi.org/10.21105/joss.00971)
<!-- badges: end -->

`{piggyback}` provides an R interface for storing files as GitHub release assets,
`piggyback` provides an R interface for storing files as GitHub release assets,
which is a convenient way for large/binary data files to _piggyback_ onto public
and private GitHub repositories. This package includes functions for file downloads,
uploads, and release management, all of which call the GitHub API.
@@ -38,12 +38,12 @@ No authentication is required to download data from public repositories.

## Installation
Install from CRAN via:
```r
```{r}
install.packages("piggyback")
```
You can install the development version from [GitHub](https://github.com/ropensci/piggyback)
with either r-universe or with remotes:
```r
```{r}
install.packages("piggyback", repos = c('https://ropensci.r-universe.dev', getOption("repos")))
# install.packages("remotes")
remotes::install_github("ropensci/piggyback")
@@ -53,7 +53,7 @@ See [getting started vignette](https://docs.ropensci.org/piggyback/articles/intr
for a more comprehensive introduction.

Download data attached to a GitHub release:
```r
```{r}
library(piggyback)
pb_download("iris2.tsv.gz",
repo = "cboettig/piggyback-tests",
@@ -72,12 +72,12 @@ or a GITHUB_PAT environment variable - for more information, see the vignette no
[authentication](https://docs.ropensci.org/piggyback/articles/piggyback.html#authentication).

We can also upload data to a release. Start by creating a release:
```r
```{r}
pb_release_create(repo = "cboettig/piggyback-tests", tag = "v0.0.2")
#> ✔ Created new release "v0.0.2".
```
then upload to it:
```r
```{r}
readr::write_tsv(mtcars, "mtcars.tsv.gz")
pb_upload("mtcars.tsv.gz", repo = "cboettig/piggyback-tests")
#> ℹ Uploading to latest release: "v0.0.2".
@@ -117,8 +117,7 @@ Please note that this project is released with a
[Contributor Code of Conduct](https://ropensci.org/code-of-conduct/).
By participating in this project you agree to abide by its terms.

```{r include=FALSE}
unlink("*.gz")
```{r eval=TRUE, include=FALSE}
codemeta::write_codemeta()
```

4 changes: 2 additions & 2 deletions README.md
@@ -17,8 +17,8 @@ Status](https://badges.ropensci.org/220_status.svg)](https://github.com/ropensci
[![DOI](http://joss.theoj.org/papers/10.21105/joss.00971/status.svg)](https://doi.org/10.21105/joss.00971)
<!-- badges: end -->

`{piggyback}` provides an R interface for storing files as GitHub
release assets, which is a convenient way for large/binary data files to
`piggyback` provides an R interface for storing files as GitHub release
assets, which is a convenient way for large/binary data files to
*piggyback* onto public and private GitHub repositories. This package
includes functions for file downloads, uploads, and release management, all
of which call the GitHub API.
18 changes: 15 additions & 3 deletions codemeta.json
@@ -2,12 +2,12 @@
"@context": "https://doi.org/10.5063/schema/codemeta-2.0",
"@type": "SoftwareSourceCode",
"identifier": "piggyback",
"description": "Because larger (> 50 MB) data files cannot easily be committed to git, a different approach is required to manage data associated with an analysis in a GitHub repository. This package provides a simple work-around by allowing larger (up to 2 GB) data files to piggyback on a repository as assets attached to individual GitHub releases. These files are not handled by git in any way, but instead are uploaded, downloaded, or edited directly by calls through the GitHub API. These data files can be versioned manually by creating different releases. This approach works equally well with public or private repositories. Data can be uploaded and downloaded programmatically from scripts. No authentication is required to download data from public repositories.",
"description": "Helps store files as GitHub release assets, which is a convenient way for large/binary data files to piggyback onto public and private GitHub repositories. Includes functions for file downloads, uploads, and managing releases via the GitHub API.",
"name": "piggyback: Managing Larger Data on a GitHub Repository",
"codeRepository": "https://github.com/ropensci/piggyback",
"issueTracker": "https://github.com/ropensci/piggyback/issues",
"license": "https://spdx.org/licenses/GPL-3.0",
"version": "0.1.5.9003",
"version": "0.1.5.9004",
"programmingLanguage": {
"@type": "ComputerLanguage",
"name": "R",
@@ -74,6 +74,18 @@
}
],
"softwareSuggestions": [
{
"@type": "SoftwareApplication",
"identifier": "arrow",
"name": "arrow",
"provider": {
"@id": "https://cran.r-project.org",
"@type": "Organization",
"name": "Comprehensive R Archive Network (CRAN)",
"url": "https://cran.r-project.org"
},
"sameAs": "https://CRAN.R-project.org/package=arrow"
},
{
"@type": "SoftwareApplication",
"identifier": "spelling",
@@ -282,5 +294,5 @@
},
"SystemRequirements": null
},
"fileSize": "380.757KB"
"fileSize": "397.705KB"
}