data_read() preserves class for rds files (#558)
* `data_read()` preserves class for rds files

* desc. news

* fix

* add test

* typo

* Update NEWS.md

---------

Co-authored-by: Etienne Bacher <[email protected]>
strengejacke and etiennebacher authored Oct 19, 2024
1 parent 3f46e31 commit 003e2b8
Showing 5 changed files with 67 additions and 38 deletions.
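
The behavior named in the commit message can be sketched with a small round trip through an `.rds` file. This is only an illustration under stated assumptions: the data frame and its column names are made up, and the `data_read()` call is guarded because it needs `datawizard` (in a version that includes this change) and `rio` to be installed. The base R part shows the column classes that the import is meant to keep.

```r
# Hypothetical example data; the column names are invented for illustration.
d <- data.frame(
  id = 1:3,
  score = c(1.5, 2.5, 3.5),
  group = factor(c("a", "b", "a"))
)

tmp <- tempfile(fileext = ".rds")
saveRDS(d, tmp)

# Base R round trip: saveRDS()/readRDS() keep the class of every column.
classes_base <- sapply(readRDS(tmp), class)
print(classes_base)

# Assumption: with this commit, data_read() should report the same classes.
# The comparison is printed rather than asserted, because the result depends
# on the installed datawizard version.
if (requireNamespace("datawizard", quietly = TRUE) &&
  requireNamespace("rio", quietly = TRUE)) {
  classes_dw <- sapply(datawizard::data_read(tmp, verbose = FALSE), class)
  print(identical(classes_base, classes_dw))
}

unlink(tmp)
```

The new test in `tests/testthat/test-data_read.R` makes the same kind of comparison with `sapply(d, class)` against a fixed set of expected classes.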
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -1,7 +1,7 @@
Type: Package
Package: datawizard
Title: Easy Data Wrangling and Statistical Transformations
-Version: 0.13.0.8
+Version: 0.13.0.9
Authors@R: c(
person("Indrajeet", "Patil", , "[email protected]", role = "aut",
comment = c(ORCID = "0000-0003-1995-6531")),
3 changes: 3 additions & 0 deletions NEWS.md
@@ -24,6 +24,9 @@ BUG FIXES
* `describe_distribution()` no longer errors if the sample was too sparse to compute
CIs. Instead, it warns the user and returns `NA` (#550).

+* `data_read()` preserves variable types when importing files from `rds` or
+`rdata` format (#558).
+
# datawizard 0.13.0

BREAKING CHANGES
41 changes: 21 additions & 20 deletions R/data_read.R
@@ -15,15 +15,16 @@
#' for SAS data files.
#' @param encoding The character encoding used for the file. Usually not needed.
#' @param convert_factors If `TRUE` (default), numeric variables, where all
-#' values have a value label, are assumed to be categorical and converted
-#' into factors. If `FALSE`, no variable types are guessed and no conversion
-#' of numeric variables into factors will be performed. See also section
-#' 'Differences to other packages'. For `data_write()`, this argument only
-#' applies to the text (e.g. `.txt` or `.csv`) or spreadsheet file formats (like
-#' `.xlsx`). Converting to factors might be useful for these formats because
-#' labelled numeric variables are then converted into factors and exported as
-#' character columns - else, value labels would be lost and only numeric values
-#' are written to the file.
+#' values have a value label, are assumed to be categorical and converted into
+#' factors. If `FALSE`, no variable types are guessed and no conversion of
+#' numeric variables into factors will be performed. For `data_read()`, this
+#' argument only applies to file types with *labelled data*, e.g. files from
+#' SPSS, SAS or Stata. See also section 'Differences to other packages'. For
+#' `data_write()`, this argument only applies to the text (e.g. `.txt` or
+#' `.csv`) or spreadsheet file formats (like `.xlsx`). Converting to factors
+#' might be useful for these formats because labelled numeric variables are then
+#' converted into factors and exported as character columns - else, value labels
+#' would be lost and only numeric values are written to the file.
#' @param verbose Toggle warnings and messages.
#' @param ... Arguments passed to the related `read_*()` or `write_*()` functions.
#'
@@ -65,12 +66,13 @@
#' @section Differences to other packages that read foreign data formats:
#' `data_read()` is most comparable to `rio::import()`. For data files from
#' SPSS, SAS or Stata, which support labelled data, variables are converted into
-#' their most appropriate type. The major difference to `rio::import()` is that
-#' `data_read()` automatically converts fully labelled numeric variables into
-#' factors, where imported value labels will be set as factor levels. If a
-#' numeric variable has _no_ value labels or less value labels than values, it
-#' is not converted to factor. In this case, value labels are preserved as
-#' `"labels"` attribute. Character vectors are preserved. Use
+#' their most appropriate type. The major difference to `rio::import()` is for
+#' data files from SPSS, SAS, or Stata, i.e. file types that support
+#' *labelled data*. `data_read()` automatically converts fully labelled numeric
+#' variables into factors, where imported value labels will be set as factor
+#' levels. If a numeric variable has _no_ value labels or less value labels than
+#' values, it is not converted to factor. In this case, value labels are
+#' preserved as `"labels"` attribute. Character vectors are preserved. Use
#' `convert_factors = FALSE` to remove the automatic conversion of numeric
#' variables to factors.
#'
@@ -105,7 +107,7 @@ data_read <- function(path,
por = .read_spss(path, encoding, convert_factors, verbose, ...),
dta = .read_stata(path, encoding, convert_factors, verbose, ...),
sas7bdat = .read_sas(path, path_catalog, encoding, convert_factors, verbose, ...),
-.read_unknown(path, file_type, convert_factors, verbose, ...)
+.read_unknown(path, file_type, verbose, ...)
)

# tell user about empty columns
@@ -188,7 +190,7 @@ data_read <- function(path,
value_labels <- NULL
attr(i, "converted_to_factor") <- TRUE
} else {
-# else, fall back to numeric
+# else, fall back to numeric or factor
i <- as.numeric(i)
}

@@ -288,7 +290,7 @@ data_read <- function(path,
}


-.read_unknown <- function(path, file_type, convert_factors, verbose, ...) {
+.read_unknown <- function(path, file_type, verbose, ...) {
insight::check_if_installed("rio", reason = paste0("to read files of type '", file_type, "'"))
if (verbose) {
insight::format_alert("Reading data...")
@@ -317,6 +319,5 @@ data_read <- function(path,
}
out <- tmp
}
-
-.post_process_imported_data(out, convert_factors, verbose)
+out
}
32 changes: 17 additions & 15 deletions man/data_read.Rd

Some generated files are not rendered by default.

27 changes: 25 additions & 2 deletions tests/testthat/test-data_read.R
@@ -141,12 +141,12 @@ test_that("data_read - RDS file, matrix, coercible", {
httr::stop_for_status(request)
writeBin(httr::content(request, type = "raw"), temp_file)

-expect_message(expect_message(expect_message({
+expect_message({
d <- data_read(
temp_file,
verbose = TRUE
)
-})), regex = "0 out of 5")
+})

expect_s3_class(d, "data.frame")
expect_identical(dim(d), c(2L, 5L))
@@ -155,6 +155,29 @@ test_that("data_read - RDS file, matrix, coercible", {



+# RDS file, preserve class /types -----------------------------------
+
+test_that("data_read - RDS file, preserve class", {
+withr::with_tempfile("temp_file", fileext = ".rds", code = {
+request <- httr::GET("https://raw.github.com/easystats/circus/main/data/hiv.rds")
+httr::stop_for_status(request)
+writeBin(httr::content(request, type = "raw"), temp_file)
+
+d <- data_read(temp_file)
+expect_s3_class(d, "data.frame")
+expect_identical(
+sapply(d, class),
+c(
+village = "integer", outcome = "integer", distance = "numeric",
+amount = "numeric", incentive = "integer", age = "integer",
+hiv2004 = "integer", agecat = "factor"
+)
+)
+})
+})
+
+
+
# RData -----------------------------------

test_that("data_read - no warning for RData", {