degroup() for nested models (#533)

* degroup() for nested models
  Fixes #530
* version
* docs
* ...
* fix
* fix?
* this one?
* this should be
* comments
* finalize, add tests
* Typos, dict
* docs
easystats · Aug 27, 2024 · bc52b99
1 parent 4f0b5b7
commit bc52b99

Showing 6 changed files with 234 additions and 62 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,7 +1,7 @@
 Type: Package
 Package: datawizard
 Title: Easy Data Wrangling and Statistical Transformations
-Version: 0.12.2
+Version: 0.12.2.1
 Authors@R: c(
     person("Indrajeet", "Patil", , "[email protected]", role = "aut",
            comment = c(ORCID = "0000-0003-1995-6531", Twitter = "@patilindrajeets")),

diff --git a/NEWS.md b/NEWS.md
@@ -1,3 +1,10 @@
+# datawizard 0.12.3
+
+CHANGES
+
+* `demean()` (and `degroup()`) now also work for nested designs, if argument
+  `nested = TRUE` and `by` specifies more than one variable.
+
 # datawizard 0.12.2
 
 * Remove `htmltools` from `Suggests` in an attempt of fixing an error in CRAN
@@ -73,8 +80,8 @@ BREAKING CHANGES
 
 * The following arguments were deprecated in 0.5.0 and are now removed:
 
-  * in `data_to_wide()`: `colnames_from`, `rows_from`, `sep` 
-  * in `data_to_long()`: `colnames_to` 
+  * in `data_to_wide()`: `colnames_from`, `rows_from`, `sep`
+  * in `data_to_long()`: `colnames_to`
   * in `data_partition()`: `training_proportion`
 
 NEW FUNCTIONS
@@ -93,7 +100,7 @@ CHANGES
   argument, to compute weighted frequency tables. `include_na` allows to include
   or omit missing values from the table. Furthermore, a `by` argument was added,
   to compute crosstables (#479, #481).
-  
+
 # datawizard 0.9.1
 
 CHANGES
@@ -144,7 +151,7 @@ CHANGES
 
 * `unnormalize()` and `unstandardize()` now work with grouped data (#415).
 
-* `unnormalize()` now errors instead of emitting a warning if it doesn't have the 
+* `unnormalize()` now errors instead of emitting a warning if it doesn't have the
   necessary info (#415).
 
 BUG FIXES
@@ -167,7 +174,7 @@ BUG FIXES
 
 * Fixed issue in `data_filter()` where functions containing a `=` (e.g. when
   naming arguments, like `grepl(pattern, x = a)`) were mistakenly seen as
-  faulty syntax. 
+  faulty syntax.
 
 * Fixed issue in `empty_column()` for strings with invalid multibyte strings.
   For such data frames or files, `empty_column()` or `data_read()` no longer
@@ -204,14 +211,14 @@ CHANGES
 
 NEW FUNCTIONS
 
-* `rowid_as_column()` to complement `rownames_as_column()` (and to mimic 
-  `tibble::rowid_to_column()`). Note that its behavior is different from 
+* `rowid_as_column()` to complement `rownames_as_column()` (and to mimic
+  `tibble::rowid_to_column()`). Note that its behavior is different from
   `tibble::rowid_to_column()` for grouped data. See the Details section in the
   docs.
 
 * `data_unite()`, to merge values of multiple variables into one new variable.
 
-* `data_separate()`, as counterpart to `data_unite()`, to separate a single 
+* `data_separate()`, as counterpart to `data_unite()`, to separate a single
   variable into multiple new variables.
 
 * `data_modify()`, to create new variables, or modify or remove existing
@@ -234,7 +241,7 @@ BUG FIXES
 
 * `center()` and `standardize()` did not work for grouped data frames (of class
   `grouped_df`) when `force = TRUE`.
-  
+
 * The `data.frame` method of `describe_distribution()` returns `NULL` instead of
   an error if no valid variable were passed (for example a factor variable with
   `include_factors = FALSE`) (#421).
@@ -262,12 +269,12 @@ BUG FIXES
 
 # datawizard 0.7.0
 
-BREAKING CHANGES 
+BREAKING CHANGES
 
 * In selection patterns, expressions like `-var1:var3` to exclude all variables
   between `var1` and `var3` are no longer accepted. The correct expression is
   `-(var1:var3)`. This is for 2 reasons:
-  
+
   * to be consistent with the behavior for numerics (`-1:2` is not accepted but
     `-(1:2)` is);
   * to be consistent with `dplyr::select()`, which throws a warning and only
@@ -279,8 +286,8 @@ NEW FUNCTIONS
   or more variables into a new variable.
 
 * `mean_sd()` and `median_mad()` for summarizing vectors to their mean (or
-  median) and a range of one SD (or MAD) above and below.  
-  
+  median) and a range of one SD (or MAD) above and below.
+
 * `data_write()` as counterpart to `data_read()`, to write data frames into
   CSV, SPSS, SAS, Stata files and many other file types. One advantage over
   existing functions to write data in other packages is that labelled (numeric)
@@ -296,8 +303,8 @@ MINOR CHANGES
 
 * `data_rename()` gets a `verbose` argument.
 * `winsorize()` now errors if the threshold is incorrect (previously, it provided
-  a warning and returned the unchanged data). The argument `verbose` is now 
-  useless but is kept for backward compatibility. The documentation now contains   
+  a warning and returned the unchanged data). The argument `verbose` is now
+  useless but is kept for backward compatibility. The documentation now contains
   details about the valid values for `threshold` (#357).
 * In all functions that have arguments `select` and/or `exclude`, there is now
   one warning per misspelled variable. The previous behavior was to have only one
@@ -318,7 +325,7 @@ BUG FIXES
 * Fix unexpected warning in `convert_na_to()` when `select` is a list (#352).
 * Fixed issue with correct labelling of numeric variables with more than nine
   unique values and associated value labels.
-  
+
 
 # datawizard 0.6.5
 
@@ -350,7 +357,7 @@ NEW FUNCTIONS
 * `data_codebook()`: to generate codebooks of data frames.
 
 * New functions to deal with duplicates: `data_duplicated()` (keep all duplicates,
-  including the first occurrence) and `data_unique()` (returns the data, excluding 
+  including the first occurrence) and `data_unique()` (returns the data, excluding
   all duplicates except one instance of each, based on the selected method).
 
 MINOR CHANGES
@@ -360,15 +367,15 @@ MINOR CHANGES
 * The `include_bounds` argument in `normalize()` can now also be a numeric
   value, defining the limit to the upper and lower bound (i.e. the distance
   to 1 and 0).
-  
-* `data_filter()` now works with grouped data. 
+
+* `data_filter()` now works with grouped data.
 
 BUG FIXES
 
 * `data_read()` no longer prints message for empty columns when the data
   actually had no empty columns.
-  
- * `data_to_wide()` now drops columns that are not in `id_cols` (if specified), 
+
+ * `data_to_wide()` now drops columns that are not in `id_cols` (if specified),
   `names_from`, or `values_from`. This is the behaviour observed in `tidyr::pivot_wider()`.
 
 # datawizard 0.6.3
@@ -800,4 +807,3 @@ NEW FUNCTIONS
 # datawizard 0.1.0
 
 * First release.
-
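
A rough usage sketch for the new NEWS entry, assuming datawizard 0.12.3 or later is installed (the call is skipped otherwise). The toy data frame and its column names (`school`, `class`, `x`) are made up for illustration; only the `select`, `by`, and `nested` arguments come from this commit.

```r
# Toy data: pupils (rows) nested in classes, nested in schools.
# Column names are hypothetical and not part of datawizard.
dat <- data.frame(
  school = rep(c("s1", "s2"), each = 4),
  class = rep(c("a", "b", "a", "b"), each = 2),
  x = c(1, 3, 5, 7, 10, 12, 14, 18)
)

res <- NULL
# Only run if a datawizard version with the `nested` argument is available.
if (requireNamespace("datawizard", quietly = TRUE) &&
  utils::packageVersion("datawizard") >= "0.12.3") {
  res <- datawizard::demean(
    dat,
    select = "x",
    by = c("school", "class"), # highest level first
    nested = TRUE
  )
  # Inspect which between/within columns were created.
  print(names(res))
}
```

Class labels are deliberately reused across schools here, which is the situation where treating the data as nested rather than cross-classified matters.
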
diff --git a/R/demean.R b/R/demean.R
@@ -11,10 +11,26 @@
 #' @param x A data frame.
 #' @param select Character vector (or formula) with names of variables to select
 #'   that should be group- and de-meaned.
-#' @param by Character vector (or formula) with the name of the variable(s) that
-#'   indicates the group- or cluster-ID. For cross-classified designs, `by` can
-#'   also identify two or more variables as group- or cluster-IDs. See also
-#'   section _De-meaning for cross-classified designs_ below.
+#' @param by Character vector (or formula) with the name of the variable that
+#'   indicates the group- or cluster-ID. For cross-classified or nested designs,
+#'   `by` can also identify two or more variables as group- or cluster-IDs. If
+#'   the data is nested and should be treated as such, set `nested = TRUE`. Else,
+#'   if `by` defines two or more variables and `nested = FALSE`, a cross-classified
+#'   design is assumed. Note that `demean()` and `degroup()` can't handle a mix
+#'   of nested and cross-classified designs in one model.
+#'
+#'   For nested designs, `by` can be:
+#'   - a character vector with the names of the variables that indicate the
+#'     levels, ordered from *highest* level to *lowest* (e.g.
+#'     `by = c("L4", "L3", "L2")`).
+#'   - a character vector with variable names in the format `by = "L4/L3/L2"`,
+#'     where the levels are separated by `/`.
+#'
+#'   See also section _De-meaning for cross-classified designs_ and
+#'   _De-meaning for nested designs_ below.
+#' @param nested Logical, if `TRUE`, the data is treated as nested. If `FALSE`,
+#'   the data is treated as cross-classified. Only applies if `by` contains more
+#'   than one variable.
 #' @param center Method for centering. `demean()` always performs
 #'   mean-centering, while `degroup()` can use `center = "median"` or
 #'   `center = "mode"` for median- or mode-centering, and also `"min"`
@@ -33,7 +49,10 @@
 #' @return
 #' A data frame with the group-/de-meaned variables, which get the suffix
 #' `"_between"` (for the group-meaned variable) and `"_within"` (for the
-#' de-meaned variable) by default.
+#' de-meaned variable) by default. For cross-classified or nested designs,
+#' the name pattern of the group-meaned variables is the name of the centered
+#' variable followed by the name of the variable that indicates the related
+#' grouping level, e.g. `predictor_L3_between` and `predictor_L2_between`.
 #'
 #' @seealso If grand-mean centering (instead of centering within-clusters)
 #'   is required, see [`center()`]. See [`performance::check_heterogeneity_bias()`]
@@ -164,17 +183,30 @@
 #'
 #' @section De-meaning for cross-classified designs:
 #'
-#' `demean()` can also handle cross-classified designs, where the data has two
-#' or more groups at the higher (i.e. second) level. In such cases, the
+#' `demean()` can handle cross-classified designs, where the data has two or
+#' more groups at the higher (i.e. second) level. In such cases, the
 #' `by`-argument can identify two or more variables that represent the
 #'  cross-classified group- or cluster-IDs. The de-meaned variables for
 #' cross-classified designs are simply subtracting all group means from each
 #' individual value, i.e. _fully cluster-mean-centering_ (see _Guo et al. 2024_
 #' for details). Note that de-meaning for cross-classified designs is *not*
 #' equivalent to de-meaning of nested data structures from models with three or
-#' more levels, i.e. de-meaning is supposed to work for models like
-#' `y ~ x + (1|group1) + (1|group2)`, but *not* for models like
-#' `y ~ x + (1|group1/group2)`.
+#' more levels. Set `nested = TRUE` to explicitly assume a nested design. For
+#' cross-classified designs, de-meaning is supposed to work for models like
+#' `y ~ x + (1|level3) + (1|level2)`, but *not* for models like
+#' `y ~ x + (1|level3/level2)`. Note that `demean()` and `degroup()` can't
+#' handle a mix of nested and cross-classified designs in one model.
+#'
+#' @section De-meaning for nested designs:
+#'
+#' _Brincks et al. (2017)_ have suggested an algorithm to center variables for
+#' nested designs, which is implemented in `demean()`. For nested designs, set
+#' `nested = TRUE` *and* specify the variables that indicate the different
+#' levels in descending order in the `by` argument. E.g.,
+#' `by = c("level4", "level3", "level2")` assumes a model like
+#' `y ~ x + (1|level4/level3/level2)`. An alternative notation for the
+#' `by`-argument would be `by = c("level4/level3/level2")`, similar to the
+#' formula notation.
 #'
 #' @section Analysing panel data with mixed models using lme4:
 #'
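
The centering described in the nested-designs section can be sketched in base R for a three-level case. This is a minimal illustration under assumed toy data (the names `dat`, `L3`, `L2`, and `x` are invented here), not the package's implementation; it computes the three components and checks that they add back up to the original values.

```r
# Toy three-level data: observations (level 1) in L2 units in L3 units.
# L2 labels are reused across L3 units, so L2 is only unique within L3.
dat <- data.frame(
  L3 = rep(c("s1", "s2"), each = 4),
  L2 = rep(c("a", "b", "a", "b"), each = 2),
  x = c(1, 3, 5, 7, 10, 12, 14, 18)
)

# Mean of x within each highest-level unit.
xbar_k <- stats::ave(dat$x, dat$L3, FUN = mean)
# Mean of x within each L2 unit, identified by the L3/L2 combination.
xbar_jk <- stats::ave(dat$x, paste(dat$L3, dat$L2, sep = "_"), FUN = mean)

dat$x_L3_between <- xbar_k
dat$x_L2_between <- xbar_jk - xbar_k
dat$x_within <- dat$x - xbar_jk

# The three components should reproduce the original variable.
stopifnot(isTRUE(all.equal(
  dat$x,
  dat$x_L3_between + dat$x_L2_between + dat$x_within
)))
```

Pasting `L3` and `L2` together before averaging is what keeps class "a" in school "s1" separate from class "a" in school "s2".
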
@@ -185,35 +217,40 @@
 #' @references
 #'
 #'   - Bafumi J, Gelman A. 2006. Fitting Multilevel Models When Predictors
-#'   and Group Effects Correlate. In. Philadelphia, PA: Annual meeting of the
-#'   American Political Science Association.
+#'     and Group Effects Correlate. In. Philadelphia, PA: Annual meeting of the
+#'     American Political Science Association.
 #'
 #'   - Bell A, Fairbrother M, Jones K. 2019. Fixed and Random Effects
-#'   Models: Making an Informed Choice. Quality & Quantity (53); 1051-1074
+#'     Models: Making an Informed Choice. Quality & Quantity (53); 1051-1074
 #'
 #'   - Bell A, Jones K. 2015. Explaining Fixed Effects: Random Effects
-#'   Modeling of Time-Series Cross-Sectional and Panel Data. Political Science
-#'   Research and Methods, 3(1), 133–153.
+#'     Modeling of Time-Series Cross-Sectional and Panel Data. Political Science
+#'     Research and Methods, 3(1), 133–153.
+#'
+#'   - Brincks, A. M., Enders, C. K., Llabre, M. M., Bulotsky-Shearer, R. J.,
+#'     Prado, G., and Feaster, D. J. (2017). Centering Predictor Variables in
+#'     Three-Level Contextual Models. Multivariate Behavioral Research, 52(2),
+#'     149–163. https://doi.org/10.1080/00273171.2016.1256753
 #'
 #'   - Gelman A, Hill J. 2007. Data Analysis Using Regression and
-#'   Multilevel/Hierarchical Models. Analytical Methods for Social Research.
-#'   Cambridge, New York: Cambridge University Press
+#'     Multilevel/Hierarchical Models. Analytical Methods for Social Research.
+#'     Cambridge, New York: Cambridge University Press
 #'
 #'   - Giesselmann M, Schmidt-Catran, AW. 2020. Interactions in fixed
-#'   effects regression models. Sociological Methods & Research, 1–28.
-#'   https://doi.org/10.1177/0049124120914934
+#'     effects regression models. Sociological Methods & Research, 1–28.
+#'     https://doi.org/10.1177/0049124120914934
 #'
 #'   - Guo Y, Dhaliwal J, Rights JD. 2024. Disaggregating level-specific effects
-#'   in cross-classified multilevel models. Behavior Research Methods, 56(4),
-#'   3023–3057.
+#'     in cross-classified multilevel models. Behavior Research Methods, 56(4),
+#'     3023–3057.
 #'
 #'   - Heisig JP, Schaeffer M, Giesecke J. 2017. The Costs of Simplicity:
-#'   Why Multilevel Models May Benefit from Accounting for Cross-Cluster
-#'   Differences in the Effects of Controls. American Sociological Review 82
-#'   (4): 796–827.
+#'     Why Multilevel Models May Benefit from Accounting for Cross-Cluster
+#'     Differences in the Effects of Controls. American Sociological Review 82
+#'     (4): 796–827.
 #'
 #'   - Hoffman L. 2015. Longitudinal analysis: modeling within-person
-#'   fluctuation and change. New York: Routledge
+#'     fluctuation and change. New York: Routledge
 #'
 #' @examples
 #'
@@ -244,6 +281,7 @@
 demean <- function(x,
                    select,
                    by,
+                   nested = FALSE,
                    suffix_demean = "_within",
                    suffix_groupmean = "_between",
                    add_attributes = TRUE,
@@ -259,6 +297,7 @@ demean <- function(x,
     x = x,
     select = select,
     by = by,
+    nested = nested,
     center = "mean",
     suffix_demean = suffix_demean,
     suffix_groupmean = suffix_groupmean,
@@ -268,15 +307,12 @@ demean <- function(x,
 }
 
 
-
-
-
-
 #' @rdname demean
 #' @export
 degroup <- function(x,
                     select,
                     by,
+                    nested = FALSE,
                     center = "mean",
                     suffix_demean = "_within",
                     suffix_groupmean = "_between",
@@ -303,10 +339,17 @@ degroup <- function(x,
     ))
   }
 
+  # handle different "by" options
   if (inherits(by, "formula")) {
     by <- all.vars(by)
   }
 
+  # we also allow lme4-syntax here: if by = "L4/L3/L2", we assume a nested design
+  if (length(by) == 1 && grepl("/", by, fixed = TRUE)) {
+    by <- insight::trim_ws(unlist(strsplit(by, "/", fixed = TRUE), use.names = FALSE))
+    nested <- TRUE
+  }
+
   # identify interaction terms
   interactions_no <- select[!grepl("(\\*|\\:)", select)]
   interactions_yes <- select[grepl("(\\*|\\:)", select)]
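
The handling of `by` shown in this hunk can be tried in isolation. The helper below, `parse_by()`, is a hypothetical name used only for this sketch, and it swaps `insight::trim_ws()` for base R's `trimws()` so that it runs without extra packages.

```r
# Hypothetical helper that mirrors the `by` handling in this hunk.
parse_by <- function(by, nested = FALSE) {
  # A formula such as ~ L3 + L2 is reduced to its variable names.
  if (inherits(by, "formula")) {
    by <- all.vars(by)
  }
  # A single string containing "/" is read as lme4-style nesting.
  if (length(by) == 1 && grepl("/", by, fixed = TRUE)) {
    by <- trimws(unlist(strsplit(by, "/", fixed = TRUE), use.names = FALSE))
    nested <- TRUE
  }
  list(by = by, nested = nested)
}

str(parse_by("L4/L3/L2"))
str(parse_by(c("L3", "L2")))
str(parse_by(~ L3 + L2))
```

In this sketch, only the slash notation switches `nested` on by itself; a character vector or formula keeps whatever value of `nested` was passed in.
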
@@ -407,6 +450,38 @@ degroup <- function(x,
     names(group_means_list) <- select
     # create de-meaned variables by subtracting the group mean from each individual value
     person_means_list <- lapply(select, function(i) dat[[i]] - group_means_list[[i]])
+  } else if (nested) {
+    # nested design: by > 1, nested is explicitly set to TRUE
+    # We want:
+    #   L3_between = xbar(k)
+    #   L2_between = xbar(jk) - xbar(k)
+    #   L1_within  = x(ijk) - xbar(jk)
+    # where
+    #   x(ijk)   is the individual value, i.e. the variable measured at level 1
+    #   xbar(k)  = ave(x_ijk, L3, FUN = mean), the group mean at the highest level
+    #   xbar(jk) = ave(x_ijk, L3, L2, FUN = mean), the group mean at the second level
+    group_means_list <- lapply(select, function(i) {
+      out <- lapply(seq_along(by), function(k) {
+        dat$higher_levels <- do.call(paste, c(dat[by[1:k]], list(sep = "_")))
+        stats::ave(dat[[i]], dat$higher_levels, FUN = gm_fun)
+      })
+      # subtract the mean of the next-higher level from each lower level; start
+      # at the lowest level so that out[[j - 1]] is still the unmodified mean
+      for (j in length(by):2) {
+        out[[j]] <- out[[j]] - out[[j - 1]]
+      }
+      names(out) <- paste0(i, "_", by)
+      out
+    })
+    # create de-meaned variables by subtracting the group mean from each individual value
+    person_means_list <- lapply(
+      # seq_along(select),
+      # function(i) dat[[select[i]]] - group_means_list[[i]][[length(by)]]
+      select,
+      function(i) {
+        dat$higher_levels <- do.call(paste, c(dat[by], list(sep = "_")))
+        dat[[i]] - stats::ave(dat[[i]], dat$higher_levels, FUN = gm_fun)
+      }
+    )
   } else {
     # cross-classified design: by > 1
     group_means_list <- lapply(by, function(j) {