degroup() for nested models #533

Merged: 13 commits, Aug 27, 2024
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -1,7 +1,7 @@
Type: Package
Package: datawizard
Title: Easy Data Wrangling and Statistical Transformations
Version: 0.12.2
Version: 0.12.2.1
Authors@R: c(
person("Indrajeet", "Patil", , "[email protected]", role = "aut",
comment = c(ORCID = "0000-0003-1995-6531", Twitter = "@patilindrajeets")),
50 changes: 28 additions & 22 deletions NEWS.md
@@ -1,3 +1,10 @@
# datawizard 0.12.3

CHANGES

* `demean()` (and `degroup()`) now also work for nested designs, if argument
`nested = TRUE` and `by` specifies more than one variable.
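
A minimal usage sketch of the new behaviour (the data frame and column names below are made up for illustration):

```r
library(datawizard)
set.seed(123)

# pupils (level 1) nested in classes (level 2) nested in schools (level 3)
dat <- data.frame(
  score  = rnorm(60),
  school = rep(1:3, each = 20),
  class  = rep(1:6, each = 10)
)

# `by` is ordered from the highest to the lowest grouping level
demean(dat, select = "score", by = c("school", "class"), nested = TRUE)

# equivalent, using the lme4-style notation for `by`
demean(dat, select = "score", by = "school/class", nested = TRUE)
```

Following the documented naming pattern, this should return columns such as
`score_school_between`, `score_class_between`, and `score_within`.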

# datawizard 0.12.2

* Remove `htmltools` from `Suggests` in an attempt of fixing an error in CRAN
@@ -73,8 +80,8 @@ BREAKING CHANGES

* The following arguments were deprecated in 0.5.0 and are now removed:

* in `data_to_wide()`: `colnames_from`, `rows_from`, `sep`
* in `data_to_long()`: `colnames_to`
* in `data_to_wide()`: `colnames_from`, `rows_from`, `sep`
* in `data_to_long()`: `colnames_to`
* in `data_partition()`: `training_proportion`

NEW FUNCTIONS
@@ -93,7 +100,7 @@ CHANGES
argument, to compute weighted frequency tables. `include_na` allows to include
or omit missing values from the table. Furthermore, a `by` argument was added,
to compute crosstables (#479, #481).

# datawizard 0.9.1

CHANGES
@@ -144,7 +151,7 @@ CHANGES

* `unnormalize()` and `unstandardize()` now work with grouped data (#415).

* `unnormalize()` now errors instead of emitting a warning if it doesn't have the
* `unnormalize()` now errors instead of emitting a warning if it doesn't have the
necessary info (#415).

BUG FIXES
@@ -167,7 +174,7 @@ BUG FIXES

* Fixed issue in `data_filter()` where functions containing a `=` (e.g. when
naming arguments, like `grepl(pattern, x = a)`) were mistakenly seen as
faulty syntax.
faulty syntax.

* Fixed issue in `empty_column()` for strings with invalid multibyte strings.
For such data frames or files, `empty_column()` or `data_read()` no longer
@@ -204,14 +211,14 @@ CHANGES

NEW FUNCTIONS

* `rowid_as_column()` to complement `rownames_as_column()` (and to mimic
`tibble::rowid_to_column()`). Note that its behavior is different from
* `rowid_as_column()` to complement `rownames_as_column()` (and to mimic
`tibble::rowid_to_column()`). Note that its behavior is different from
`tibble::rowid_to_column()` for grouped data. See the Details section in the
docs.

* `data_unite()`, to merge values of multiple variables into one new variable.

* `data_separate()`, as counterpart to `data_unite()`, to separate a single
* `data_separate()`, as counterpart to `data_unite()`, to separate a single
variable into multiple new variables.

* `data_modify()`, to create new variables, or modify or remove existing
@@ -234,7 +241,7 @@ BUG FIXES

* `center()` and `standardize()` did not work for grouped data frames (of class
`grouped_df`) when `force = TRUE`.

* The `data.frame` method of `describe_distribution()` returns `NULL` instead of
an error if no valid variable were passed (for example a factor variable with
`include_factors = FALSE`) (#421).
@@ -262,12 +269,12 @@ BUG FIXES

# datawizard 0.7.0

BREAKING CHANGES
BREAKING CHANGES

* In selection patterns, expressions like `-var1:var3` to exclude all variables
between `var1` and `var3` are no longer accepted. The correct expression is
`-(var1:var3)`. This is for 2 reasons:

* to be consistent with the behavior for numerics (`-1:2` is not accepted but
`-(1:2)` is);
* to be consistent with `dplyr::select()`, which throws a warning and only
@@ -279,8 +286,8 @@ NEW FUNCTIONS
or more variables into a new variable.

* `mean_sd()` and `median_mad()` for summarizing vectors to their mean (or
median) and a range of one SD (or MAD) above and below.
median) and a range of one SD (or MAD) above and below.

* `data_write()` as counterpart to `data_read()`, to write data frames into
CSV, SPSS, SAS, Stata files and many other file types. One advantage over
existing functions to write data in other packages is that labelled (numeric)
@@ -296,8 +303,8 @@ MINOR CHANGES

* `data_rename()` gets a `verbose` argument.
* `winsorize()` now errors if the threshold is incorrect (previously, it provided
a warning and returned the unchanged data). The argument `verbose` is now
useless but is kept for backward compatibility. The documentation now contains
a warning and returned the unchanged data). The argument `verbose` is now
useless but is kept for backward compatibility. The documentation now contains
details about the valid values for `threshold` (#357).
* In all functions that have arguments `select` and/or `exclude`, there is now
one warning per misspelled variable. The previous behavior was to have only one
@@ -318,7 +325,7 @@ BUG FIXES
* Fix unexpected warning in `convert_na_to()` when `select` is a list (#352).
* Fixed issue with correct labelling of numeric variables with more than nine
unique values and associated value labels.


# datawizard 0.6.5

@@ -350,7 +357,7 @@ NEW FUNCTIONS
* `data_codebook()`: to generate codebooks of data frames.

* New functions to deal with duplicates: `data_duplicated()` (keep all duplicates,
including the first occurrence) and `data_unique()` (returns the data, excluding
including the first occurrence) and `data_unique()` (returns the data, excluding
all duplicates except one instance of each, based on the selected method).

MINOR CHANGES
@@ -360,15 +367,15 @@ MINOR CHANGES
* The `include_bounds` argument in `normalize()` can now also be a numeric
value, defining the limit to the upper and lower bound (i.e. the distance
to 1 and 0).
* `data_filter()` now works with grouped data.

* `data_filter()` now works with grouped data.

BUG FIXES

* `data_read()` no longer prints message for empty columns when the data
actually had no empty columns.
* `data_to_wide()` now drops columns that are not in `id_cols` (if specified),

* `data_to_wide()` now drops columns that are not in `id_cols` (if specified),
`names_from`, or `values_from`. This is the behaviour observed in `tidyr::pivot_wider()`.

# datawizard 0.6.3
@@ -800,4 +807,3 @@ NEW FUNCTIONS
# datawizard 0.1.0

* First release.

133 changes: 104 additions & 29 deletions R/demean.R
@@ -11,10 +11,26 @@
#' @param x A data frame.
#' @param select Character vector (or formula) with names of variables to select
#' that should be group- and de-meaned.
#' @param by Character vector (or formula) with the name of the variable(s) that
#' indicates the group- or cluster-ID. For cross-classified designs, `by` can
#' also identify two or more variables as group- or cluster-IDs. See also
#' section _De-meaning for cross-classified designs_ below.
#' @param by Character vector (or formula) with the name of the variable that
#' indicates the group- or cluster-ID. For cross-classified or nested designs,
#' `by` can also identify two or more variables as group- or cluster-IDs. If
#' the data is nested and should be treated as such, set `nested = TRUE`. Otherwise,
#' if `by` defines two or more variables and `nested = FALSE`, a cross-classified
#' design is assumed. Note that `demean()` and `degroup()` can't handle a mix
#' of nested and cross-classified designs in one model.
#'
#' For nested designs, `by` can be:
#' - a character vector with the names of the variables that indicate the
#' levels, ordered from *highest* level to *lowest* (e.g.
#' `by = c("L4", "L3", "L2")`).
#' - a single string with the variable names in the format `by = "L4/L3/L2"`,
#' where the levels are separated by `/`.
#'
#' See also section _De-meaning for cross-classified designs_ and
#' _De-meaning for nested designs_ below.
#' @param nested Logical, if `TRUE`, the data is treated as nested. If `FALSE`,
#' the data is treated as cross-classified. Only applies if `by` contains more
#' than one variable.
#' @param center Method for centering. `demean()` always performs
#' mean-centering, while `degroup()` can use `center = "median"` or
#' `center = "mode"` for median- or mode-centering, and also `"min"`
@@ -33,7 +49,10 @@
#' @return
#' A data frame with the group-/de-meaned variables, which get the suffix
#' `"_between"` (for the group-meaned variable) and `"_within"` (for the
#' de-meaned variable) by default.
#' de-meaned variable) by default. For cross-classified or nested designs,
#' the name pattern of the group-meaned variables is the name of the centered
#' variable followed by the name of the variable that indicates the related
#' grouping level, e.g. `predictor_L3_between` and `predictor_L2_between`.
#'
#' @seealso If grand-mean centering (instead of centering within-clusters)
#' is required, see [`center()`]. See [`performance::check_heterogeneity_bias()`]
@@ -164,17 +183,30 @@
#'
#' @section De-meaning for cross-classified designs:
#'
#' `demean()` can also handle cross-classified designs, where the data has two
#' or more groups at the higher (i.e. second) level. In such cases, the
#' `demean()` can handle cross-classified designs, where the data has two or
#' more groups at the higher (i.e. second) level. In such cases, the
#' `by`-argument can identify two or more variables that represent the
#' cross-classified group- or cluster-IDs. The de-meaned variables for
#' cross-classified designs are computed by simply subtracting all group means
#' from each individual value, i.e. by _fully cluster-mean-centering_ (see
#' _Guo et al. 2024_ for details). Note that de-meaning for cross-classified
#' designs is *not*
#' equivalent to de-meaning of nested data structures from models with three or
#' more levels, i.e. de-meaning is supposed to work for models like
#' `y ~ x + (1|group1) + (1|group2)`, but *not* for models like
#' `y ~ x + (1|group1/group2)`.
#' more levels. Set `nested = TRUE` to explicitly assume a nested design. For
#' cross-classified designs, de-meaning is supposed to work for models like
#' `y ~ x + (1|level3) + (1|level2)`, but *not* for models like
#' `y ~ x + (1|level3/level2)`. Note that `demean()` and `degroup()` can't
#' handle a mix of nested and cross-classified designs in one model.
#'
#' @section De-meaning for nested designs:
#'
#' _Brincks et al. (2017)_ have suggested an algorithm to center variables for
#' nested designs, which is implemented in `demean()`. For nested designs, set
#' `nested = TRUE` *and* specify the variables that indicate the different
#' levels in descending order in the `by` argument. E.g.,
#' `by = c("level4", "level3", "level2")` assumes a model like
#' `y ~ x + (1|level4/level3/level2)`. An alternative notation for the
#' `by`-argument would be `by = "level4/level3/level2"`, similar to the
#' formula notation.
#'
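
A rough standalone sketch of this centering scheme in base R (hypothetical data; `L3`/`L2` are illustrative column names, and this is not code from the package):

```r
# observations (level 1) within L2 groups within L3 groups
d <- data.frame(
  x  = c(1, 2, 3, 4, 5, 6, 7, 8),
  L3 = c("a", "a", "a", "a", "b", "b", "b", "b"),
  L2 = c("a1", "a1", "a2", "a2", "b1", "b1", "b2", "b2")
)

xbar_k  <- ave(d$x, d$L3, FUN = mean)       # group mean at the highest level
xbar_jk <- ave(d$x, d$L3, d$L2, FUN = mean) # group mean at the second level

L3_between <- xbar_k           # corresponds to x_L3_between
L2_between <- xbar_jk - xbar_k # corresponds to x_L2_between
L1_within  <- d$x - xbar_jk    # corresponds to x_within

# the three components add up to the original variable again
all.equal(L3_between + L2_between + L1_within, d$x)
#> TRUE
```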
#' @section Analysing panel data with mixed models using lme4:
#'
@@ -185,35 +217,40 @@
#' @references
#'
#' - Bafumi J, Gelman A. 2006. Fitting Multilevel Models When Predictors
#' and Group Effects Correlate. In. Philadelphia, PA: Annual meeting of the
#' American Political Science Association.
#' and Group Effects Correlate. In. Philadelphia, PA: Annual meeting of the
#' American Political Science Association.
#'
#' - Bell A, Fairbrother M, Jones K. 2019. Fixed and Random Effects
#' Models: Making an Informed Choice. Quality & Quantity (53); 1051-1074
#' Models: Making an Informed Choice. Quality & Quantity (53); 1051-1074
#'
#' - Bell A, Jones K. 2015. Explaining Fixed Effects: Random Effects
#' Modeling of Time-Series Cross-Sectional and Panel Data. Political Science
#' Research and Methods, 3(1), 133–153.
#' Modeling of Time-Series Cross-Sectional and Panel Data. Political Science
#' Research and Methods, 3(1), 133–153.
#'
#' - Brincks, A. M., Enders, C. K., Llabre, M. M., Bulotsky-Shearer, R. J.,
#' Prado, G., and Feaster, D. J. (2017). Centering Predictor Variables in
#' Three-Level Contextual Models. Multivariate Behavioral Research, 52(2),
#' 149–163. https://doi.org/10.1080/00273171.2016.1256753
#'
#' - Gelman A, Hill J. 2007. Data Analysis Using Regression and
#' Multilevel/Hierarchical Models. Analytical Methods for Social Research.
#' Cambridge, New York: Cambridge University Press
#' Multilevel/Hierarchical Models. Analytical Methods for Social Research.
#' Cambridge, New York: Cambridge University Press
#'
#' - Giesselmann M, Schmidt-Catran, AW. 2020. Interactions in fixed
#' effects regression models. Sociological Methods & Research, 1–28.
#' https://doi.org/10.1177/0049124120914934
#' effects regression models. Sociological Methods & Research, 1–28.
#' https://doi.org/10.1177/0049124120914934
#'
#' - Guo Y, Dhaliwal J, Rights JD. 2024. Disaggregating level-specific effects
#' in cross-classified multilevel models. Behavior Research Methods, 56(4),
#' 3023–3057.
#' in cross-classified multilevel models. Behavior Research Methods, 56(4),
#' 3023–3057.
#'
#' - Heisig JP, Schaeffer M, Giesecke J. 2017. The Costs of Simplicity:
#' Why Multilevel Models May Benefit from Accounting for Cross-Cluster
#' Differences in the Effects of Controls. American Sociological Review 82
#' (4): 796–827.
#' Why Multilevel Models May Benefit from Accounting for Cross-Cluster
#' Differences in the Effects of Controls. American Sociological Review 82
#' (4): 796–827.
#'
#' - Hoffman L. 2015. Longitudinal analysis: modeling within-person
#' fluctuation and change. New York: Routledge
#' fluctuation and change. New York: Routledge
#'
#' @examples
#'
@@ -244,6 +281,7 @@
demean <- function(x,
select,
by,
nested = FALSE,
suffix_demean = "_within",
suffix_groupmean = "_between",
add_attributes = TRUE,
@@ -259,6 +297,7 @@ demean <- function(x,
x = x,
select = select,
by = by,
nested = nested,
center = "mean",
suffix_demean = suffix_demean,
suffix_groupmean = suffix_groupmean,
@@ -268,15 +307,12 @@
}






#' @rdname demean
#' @export
degroup <- function(x,
select,
by,
nested = FALSE,
center = "mean",
suffix_demean = "_within",
suffix_groupmean = "_between",
@@ -303,10 +339,17 @@ degroup <- function(x,
))
}

# handle different "by" options
if (inherits(by, "formula")) {
by <- all.vars(by)
}

# we also allow lme4-syntax here: if by = "L4/L3/L2", we assume a nested design
if (length(by) == 1 && grepl("/", by, fixed = TRUE)) {
by <- insight::trim_ws(unlist(strsplit(by, "/", fixed = TRUE), use.names = FALSE))
nested <- TRUE
}
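# e.g. by = "L4/L3/L2" is split into by = c("L4", "L3", "L2") with nested = TRUE,
# which matches the order expected for nested designs (highest level first)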

# identify interaction terms
interactions_no <- select[!grepl("(\\*|\\:)", select)]
interactions_yes <- select[grepl("(\\*|\\:)", select)]
@@ -407,6 +450,38 @@ degroup <- function(x,
names(group_means_list) <- select
# create de-meaned variables by subtracting the group mean from each individual value
person_means_list <- lapply(select, function(i) dat[[i]] - group_means_list[[i]])
} else if (nested) {
# nested design: by > 1, nested is explicitly set to TRUE
# We want:
# L3_between = xbar(k)
# L2_between = xbar(jk) - xbar(k)
# L1_within = x(ijk) - xbar(jk)
# , where
# x(ijk) is the individual value / variable that is measured on level 1
# xbar(k) <- ave(x_ijk, L3, FUN = mean), the group mean of the variable at highest level
# xbar(jk) <- ave(x_ijk, L3, L2, FUN = mean), the group mean of the variable at second level
group_means_list <- lapply(select, function(i) {
out <- lapply(seq_along(by), function(k) {
dat$higher_levels <- do.call(paste, c(dat[by[1:k]], list(sep = "_")))
stats::ave(dat[[i]], dat$higher_levels, FUN = gm_fun)
})
# subtract mean of higher level from lower level
for (j in 2:length(by)) {
out[[j]] <- out[[j]] - out[[j - 1]]
}
names(out) <- paste0(i, "_", by)
out
})
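# note (three-level example with mean-centering): the between-components computed
# here and the within-component computed below telescope back to the raw value,
# x_ijk = xbar(k) + (xbar(jk) - xbar(k)) + (x_ijk - xbar(jk))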
# create de-meaned variables by subtracting the group mean from each individual value
person_means_list <- lapply(
# seq_along(select),
# function(i) dat[[select[i]]] - group_means_list[[i]][[length(by)]]
select,
function(i) {
dat$higher_levels <- do.call(paste, c(dat[by], list(sep = "_")))
dat[[i]] - stats::ave(dat[[i]], dat$higher_levels, FUN = gm_fun)
}
)
} else {
# cross-classified design: by > 1
group_means_list <- lapply(by, function(j) {