# Tidy Data

Tidying data is often where most of the work in a data project happens. It goes by many names (cleaning, wrangling, tidying), but the goal is always the same: getting the data into a format that we can analyze.

## Define Tidy Data

[Tidy data](https://vita.had.co.nz/papers/tidy-data.pdf) are discussed in many resources (e.g., [R for Data Science](https://r4ds.had.co.nz)). As the original paper puts it: "Tidy datasets are all alike but every messy dataset is messy in its own way. Tidy datasets provide a standardized way to link the structure of a dataset (its physical layout) with its semantics (its meaning). In this section, I’ll provide some standard vocabulary for describing the structure and semantics of a dataset, and then use those definitions to define tidy data" ([Wickham, 2014](https://vita.had.co.nz/papers/tidy-data.pdf)).
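
In that paper, a dataset is tidy when each variable forms a column and each observation forms a row. As a rough illustration (the data frame and its column names below are made up for this sketch and are not part of the course data), a "wide" table that spreads one variable across several columns can be reshaped with `tidyr::pivot_longer()`:

```{r, message=FALSE, warning=FALSE}
library(tidyr)

# Hypothetical "messy" data: one row per person, one column per wave
toy_wide <- data.frame(
  id     = c(1, 2),
  anx_w1 = c(10, 14),
  anx_w2 = c(12, 9)
)

# Tidy version: one row per person per wave
toy_long <- pivot_longer(
  toy_wide,
  cols      = c(anx_w1, anx_w2),
  names_to  = "wave",
  values_to = "anx"
)

toy_long
```

Whether the long or the wide layout counts as "tidy" depends on what you treat as an observation, so this is only one possible reading.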


--------------------------------------

## Tidying (Cleaning) Data

For the examples, we are going to import some data from a `.csv` file.

```{r, message=FALSE, warning=FALSE}
data <- rio::import("data_csv.csv")
```


## Changing Variable Type

Importing data into R can cause your variables to change types. For example, let's see what the class of the Race variable is now that the file has been imported into R.

```{r}
class(data$Race)
```

As you can see, it's a character variable. We want it to be a factor, which we can do with the as.factor() function in base R. Make sure you assign the result to a variable in your dataset. If you don't, R will only print the factor version; the variable itself stays unchanged for later analyses.

```{r}
data$RaceF <- as.factor(data$Race)
data$SmokingF <- as.factor(data$smoking)

```

Now let's see what the class of our RaceF variable is.

```{r}
class(data$RaceF)
```
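
The difference between printing a converted variable and saving it can be seen on a small made-up data frame (`toy_race` is invented for this sketch):

```{r}
# Made-up data for illustration
toy_race <- data.frame(race = c("white", "black", "white"),
                       stringsAsFactors = FALSE)

as.factor(toy_race$race)   # prints a factor, but nothing is saved
class(toy_race$race)       # still "character"

toy_race$race_f <- as.factor(toy_race$race)  # assignment stores the result
class(toy_race$race_f)     # "factor"
levels(toy_race$race_f)    # levels are sorted alphabetically by default
```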

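Note: when importing from an SPSS `.sav` file, some variables may come in as "haven-labelled." The same technique converts them to factors, although `haven::as_factor()` is usually the better choice there, because base `as.factor()` typically builds the levels from the underlying values rather than from the value labels.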

## Changing a whole dataset to a certain class

If you want every variable in your dataset to be a certain class, you can do that with map_df() from the purrr package (loaded as part of tidyverse). Let's say we want the whole dataset to be numeric.

```{r, message=FALSE, warning=FALSE}
library(tidyverse)

# Apply as.numeric() to every column and collect the results in a data frame
data_num <- map_df(data, as.numeric)
tibble::glimpse(data_num)
```

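As you can see, character variables that do not contain numbers cannot be converted and become `NA`. Factors are a different case: as.numeric() returns their underlying integer codes (1, 2, 3, ...), not anything meaningful about the labels. We will go over how to handle that later. You can also change the whole dataset to be factors using the same code, but changing as.numeric to as.factor.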
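
To see how `as.numeric()` treats each kind of column, here is a small sketch on made-up data (`toy_types` and its columns are invented for this example). It also shows one possible way to convert only some columns, using `across()` and `where()` from dplyr, rather than the whole dataset:

```{r, message=FALSE, warning=FALSE}
library(dplyr)

# Made-up data for illustration
toy_types <- data.frame(
  score = c("1", "2", "3"),   # numbers stored as text
  group = c("a", "b", "a"),   # text that is not numeric
  stringsAsFactors = FALSE
)
toy_types$group_f <- factor(toy_types$group)

as.numeric(toy_types$score)    # 1 2 3
as.numeric(toy_types$group)    # NA NA NA (with a coercion warning)
as.numeric(toy_types$group_f)  # 1 2 1, the integer codes of the levels

# Convert only the character columns to factors, leaving the rest alone
toy_types2 <- toy_types %>%
  mutate(across(where(is.character), as.factor))

sapply(toy_types2, class)
```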

## Janitor Package

The janitor package has useful functions that make it easy to clean your dataset. Use the clean_names() function to make variable names consistent.

```{r, message=FALSE, warning=FALSE}
library(janitor)

clean_df <- clean_names(data)

tibble::glimpse(clean_df)
```

Now all of our variable names are lowercase and follow a consistent format.
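
A quick way to see what clean_names() changes is to try it on a few deliberately messy names. The data frame below is made up for this sketch:

```{r, message=FALSE, warning=FALSE}
library(janitor)

# Made-up data frame with inconsistent column names
toy_names <- data.frame(
  `First Name` = c("Ann", "Bo"),
  AGE          = c(31, 45),
  RaceF        = factor(c("white", "black")),
  check.names  = FALSE
)

names(toy_names)               # names as entered
names(clean_names(toy_names))  # names after cleaning
```

Spaces become underscores, capitals become lowercase, and a name like RaceF is split into race_f.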

You can also remove columns and/or rows that are completely missing. We do not have any missing rows or columns in our real dataset, so we will make one up to show you how it works.

```{r}
df <- data.frame(var1 = c(1, NA, 6, 8, 3),
                 var2 = c("a", NA, "c", "s", "y"),
                 var3 = c(NA, NA, NA, NA, NA))

df
```

```{r}
remove_empty(df, "rows")
```

```{r}
remove_empty(df, "cols")
```

```{r}
remove_empty(df, c("rows", "cols"))
```

## Washer function in the Furniture package

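The `washer()` function from the furniture package replaces the values you list (here `99`) with another value (here `NA`), which is handy when a number such as 99 was used as a missing-data code. Note the use of `head()` to print only the first 6 rows (instead of all of them).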

```{r, message=FALSE, warning=FALSE}
library(furniture)

washer(clean_df, 99, value = NA) %>% 
  head()
```
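
For comparison, a similar recode can be sketched with `dplyr::na_if()`. This is an alternative to washer(), not what the chunk above does, and the data frame and the choice of columns are made up for the example:

```{r, message=FALSE, warning=FALSE}
library(dplyr)

# Made-up data where 99 stands for "missing"
toy_miss <- data.frame(
  id   = c(1, 2, 3),
  anx  = c(12, 99, 20),
  educ = c(99, 16, 12)
)

# Replace 99 with NA in the chosen columns only
toy_washed <- toy_miss %>%
  mutate(across(c(anx, educ), ~ na_if(.x, 99)))

toy_washed
```

Naming the columns explicitly avoids wiping out a real value of 99 in a variable such as id.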


--------------------------------------

## Tidyverse

### Pipes

```
me %>% 
  wake_up("8:00am") %>% 
  exercise(30, units = "minutes") %>% 
  shower(15, units = "minutes") %>% 
  eat_breakfast("toast") %>% 
  go_to_work("basement")
```

Piping takes what is on the left-hand side and passes it as the first argument to the function on the right-hand side. The code below shows what a call looks like without a pipe.

```{r}
summary(clean_df)
```

Now, here is what the code looks like using a pipe.

```{r}
clean_df %>% 
  summary()
```

It may not seem to make the code more readable here, but the more complex your code is, the easier it is to read with pipes. Pipes also let you chain several steps together without having to overwrite your dataset at every step. As we go through more data cleaning and wrangling, you'll begin to see how piping makes a difference when coding.
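
A slightly longer chain makes the difference easier to see. The vector below is made up for this sketch:

```{r, message=FALSE, warning=FALSE}
library(dplyr)  # provides %>%

toy_x <- c(4, 9, 16, 25)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(toy_x)), 1)

# The piped version reads top to bottom, in the order the steps happen
piped <- toy_x %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)
```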

### Subset Data

#### Select function

The select function allows you to pull specific variables (columns) out of your dataset. You can do this using base R commands, or with the select function from tidyverse.

```{r}
head(clean_df[, c("id", "educ", "age")])
head(clean_df[, c(1:3)])
```


```{r}
clean_df %>% 
  select(id, educ, age) %>% 
  head()

clean_df %>% 
  select(1:3) %>% 
  head()
```

Both code chunks do the same thing. The first two commands, using [, are the "base R" way of selecting variables. The last two, using the pipe, are the tidyverse way. Both work great, so the choice is yours.


#### starts_with(), ends_with(), contains()

When you use tidyverse, you can pull out variables that have common features in their names. Here are a few examples. Note that each one uses a helper function inside the select function.

This line of code pulls out the variables that start with the word "income".

```{r}
clean_df %>% 
  select(starts_with("income")) %>% 
  head()
```

This code selects all of the variables in your dataset that end with _1 or _2.

```{r}
clean_df %>% 
  select(ends_with(c("_1", "_2"))) %>% 
  head()
```

With this code you can pull out any variables that contain "anx" anywhere in the variable name.

```{r}
clean_df %>% 
  select(contains("anx")) %>% 
  head()
```
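
These helpers can also be combined with named variables or used to drop variables. The sketch below uses a made-up data frame (`toy_sel`) whose column names only imitate the ones in our dataset:

```{r, message=FALSE, warning=FALSE}
library(dplyr)

# Made-up data for illustration
toy_sel <- data.frame(
  id       = 1:3,
  anx_1    = c(10, 12, 9),
  anx_2    = c(11, 14, 8),
  income_1 = c(30, 45, 52)
)

# Keep id plus every anxiety variable
toy_sel %>% select(id, starts_with("anx"))

# Drop a variable with a minus sign
toy_sel %>% select(-income_1)

# Helpers can be negated with !
toy_sel %>% select(!ends_with("_2"))
```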

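### Filter, equality/inequality, and/or, within

The filter function keeps only the rows that meet a condition. Conditions can use equality and inequality operators (`==`, `!=`, `>`, `<`), can be combined with and (`&`) or or (`|`), and can test membership with `%in%`. The first chunk shows the base R and the tidyverse versions of the same filter.
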
```{r}
# Base R version; head() keeps the printout short
head(clean_df[clean_df$anx_1 > 20, ])

clean_df %>% 
  filter(anx_1 > 20) %>% 
  head()
```


```{r}
clean_df %>% 
  filter(sex == "female") %>% 
  head()
```

```{r}
clean_df %>% 
  filter(sex == "female" & race == "black") %>% 
  head()
```

```{r}
clean_df %>% 
  filter(sex == "female" | educ == 20) %>% 
  head()
```

You can also match against several values at once with `%in%`.

```{r}
clean_df %>% 
  filter(race %in% c("black", "latinx")) %>% 
  head()
```
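
One difference between the base R and tidyverse approaches shows up when the filtered variable has missing values. The sketch below uses a made-up data frame (`toy_filt`) to show it, along with two other common patterns:

```{r, message=FALSE, warning=FALSE}
library(dplyr)

# Made-up data with one missing anxiety score
toy_filt <- data.frame(
  id    = 1:4,
  anx_1 = c(25, NA, 18, 30),
  sex   = c("female", "male", "female", "male")
)

# Base R returns a row of NAs where the condition is NA
toy_filt[toy_filt$anx_1 > 20, ]

# filter() drops rows where the condition is NA
toy_filt %>% filter(anx_1 > 20)

# Comma-separated conditions are combined with "and"
toy_filt %>% filter(anx_1 > 20, sex == "male")

# != means "not equal to"
toy_filt %>% filter(sex != "male")
```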


### Mutate

Anytime you see mutate() it means you are adding a new variable or modifying an existing one. For example, let's take a look at what class our sex variable is.


```{r}
class(clean_df$sex)
```

When we imported the data, R did not know sex was a factor, so we need to change it to be a factor. Using the mutate function allows us to modify existing variables or create new ones. In this case, I make a new variable called "sex_f" that is a factor version of sex.

```{r}
clean_df <- clean_df %>% 
  mutate(sex_f = factor(sex))
class(clean_df$sex_f)
```
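
A single mutate() call can create and modify several variables at once. Here is a minimal sketch on made-up data (`toy_mut` and its values are invented for this example):

```{r, message=FALSE, warning=FALSE}
library(dplyr)

# Made-up data for illustration
toy_mut <- data.frame(
  income_1 = c(30000, 45000),
  income_2 = c(32000, 44000)
)

toy_mut <- toy_mut %>%
  mutate(
    income_change = income_2 - income_1,  # new variable
    income_1      = income_1 / 1000,      # overwrites an existing variable
    income_up     = income_change > 0     # can use a variable created above
  )

toy_mut
```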

There are loads of useful functions for mutating a variable! We will present a few here.

#### Case_when

When you want to modify variables based on various possible conditions, you can use the case_when() function. For example, when a categorical variable is dummy-coded as 0, 1, 2, and so on, you can change it to have actual name labels. In this example, our married variable is supposed to be a factor where 0 = not married and 1 = married. Let's look at our married variable before we use mutate and case_when.

```{r}
summary(clean_df$married)
```

The next code chunk creates a new variable called "married_f" that is a factor with levels "married" and "not married". To do this, we use mutate to create married_f and tell R to use the label "married" when the married variable equals 1 and "not married" when it equals 0. Note that we need to use == to test whether something is equal to a value. We also need to put our labels in quotation marks when they are words. The last line of the call, as.factor(), turns married_f into a factor rather than a character variable.


```{r}
clean_df <- clean_df %>% 
  mutate(married_f = case_when(
    married == 1 ~ "married",
    married == 0 ~ "not married"
  ) %>% as.factor())

glimpse(clean_df)
```

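Here is another example, which sorts income_1 into four categories based on its mean and standard deviation. case_when() checks the conditions in order and uses the first one that is true, so the more extreme cutoffs are listed before the broader ones.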

```{r}
clean_df <- clean_df %>% 
  mutate(income_cat = case_when(
    income_1 > mean(income_1, na.rm=TRUE) + sd(income_1, na.rm=TRUE) ~ "High Earner",
    income_1 > mean(income_1, na.rm=TRUE) ~ "Mid-High Earner",
    income_1 <= mean(income_1, na.rm=TRUE) - sd(income_1, na.rm=TRUE) ~ "Low Earner",
    income_1 <= mean(income_1, na.rm=TRUE) ~ "Mid-Low Earner",
  ) %>% as.factor())

glimpse(clean_df)
```
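
Because the order of the conditions matters, it can help to try case_when() on a few made-up values first. The sketch below uses an invented `age` variable and invented cutoffs. It handles missing values explicitly and uses `TRUE ~` as a catch-all for everything left over:

```{r, message=FALSE, warning=FALSE}
library(dplyr)

# Made-up data for illustration
toy_age <- data.frame(age = c(15, 30, 70, NA))

toy_age <- toy_age %>%
  mutate(age_cat = case_when(
    is.na(age) ~ NA_character_,  # keep missing values missing
    age < 18   ~ "minor",        # checked next
    age < 65   ~ "adult",        # only reached if age is 18 or older
    TRUE       ~ "older adult"   # everything that is left
  ))

toy_age
```

Without the first line, the row with a missing age would fall through to the `TRUE ~` line and be labelled "older adult".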

This last one is a complex example but it shows the utility of `case_when()`. Any guesses what this is doing?

```{r, eval = FALSE}
data %>% 
  rowwise() %>% 
  mutate(
    var1 = 
      case_when(
        sum(
          item1 > 2,
          item2 > 2,
          item3 > 2,
          item4 > 2,
          item5 > 2,
          na.rm=TRUE
        ) >= 3 &  # threshold cannot exceed the 5 items counted
        sum(
          perf1 > 3,
          perf2 > 3,
          perf3 > 3,
          na.rm=TRUE
        ) > 0 ~ 1,
        TRUE ~ 0
      )
  ) %>% 
  ungroup() %>% 
  mutate(var1 = factor(var1))
```
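
One way to check a guess is to run the same pattern on a few made-up rows. The sketch below uses fewer items, an invented threshold, and a hypothetical output variable called `flag`. Note that the count threshold has to be reachable given the number of items being counted:

```{r, message=FALSE, warning=FALSE}
library(dplyr)

# Made-up data for illustration
toy_items <- data.frame(
  item1 = c(3, 1, 3),
  item2 = c(4, 1, 3),
  item3 = c(1, NA, 3),
  perf1 = c(4, 4, 1),
  perf2 = c(1, 4, 2)
)

toy_items <- toy_items %>%
  rowwise() %>%
  mutate(
    flag = case_when(
      # at least 2 of the 3 items are above 2 ...
      sum(item1 > 2, item2 > 2, item3 > 2, na.rm = TRUE) >= 2 &
        # ... and at least 1 of the performance scores is above 3
        sum(perf1 > 3, perf2 > 3, na.rm = TRUE) > 0 ~ 1,
      TRUE ~ 0
    )
  ) %>%
  ungroup()

toy_items
```

Inside rowwise(), each comparison returns a single TRUE or FALSE, so sum() counts how many conditions a row meets.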


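#### Rowsums and Rowmeans

The chunk below computes the mean of the two depression items for each row in two ways: first with `furniture::rowmeans()`, then with `rowwise()` and `mean()`.
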
```{r}
clean_df <- clean_df %>% 
  mutate(dep_v1 = furniture::rowmeans(depression_1, depression_2))

clean_df <- clean_df %>% 
  rowwise() %>% 
  mutate(dep_v2 = mean(c(depression_1, depression_2))) %>% 
  ungroup()

glimpse(clean_df)
```
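
Base R's `rowMeans()` and `rowSums()` can do the same job inside mutate(). The sketch below uses a made-up data frame (`toy_dep`) and is an alternative to the approaches above. It also shows what `na.rm = TRUE` does when some or all of the items are missing:

```{r, message=FALSE, warning=FALSE}
library(dplyr)

# Made-up data with missing item scores
toy_dep <- data.frame(
  depression_1 = c(10, 12, NA),
  depression_2 = c(14, NA, NA)
)

toy_dep <- toy_dep %>%
  mutate(
    dep_mean = rowMeans(cbind(depression_1, depression_2), na.rm = TRUE),
    dep_sum  = rowSums(cbind(depression_1, depression_2), na.rm = TRUE)
  )

toy_dep
```

When every item in a row is missing, the mean is `NaN` and the sum is 0, so a sum of 0 does not always mean a true score of 0.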

### Forcats Package

```{r}
library(forcats) # forcats is loaded with tidyverse, so this line is optional if tidyverse is already loaded
```

#### Relevel

You can manually reorder factor levels.

```{r}
clean_df %>% 
  mutate(race_f = fct_relevel(race_f, "white", "black", "latinx", "other"))
```

#### Reorder

Reorder a factor by the frequency of its levels, most frequent first.

```{r}
clean_df %>% 
  mutate(race_f = fct_infreq(race_f))
```

Reorder a factor by a numeric variable. By default, fct_reorder() orders the levels by the median of that variable within each level.

```{r}
clean_df %>% 
  mutate(race_f = fct_reorder(race_f, income_1))
```

#### Lump

When a factor has too many levels, you can lump the infrequent ones into a single level, "Other". The argument "n" is the number of levels you want to keep, not counting the "Other" level that the lump function creates. You can also include the argument other_level = "" if you would like to change the name from "Other" to something more specific.

```{r}
clean_df %>% 
  mutate(race_f = fct_lump(race_f, n = 2))
```

In this example, we previously had four levels of the race variable: white, black, latinx, and other. When we specified n = 2, we lumped the other and black categories (the two least frequent) into one "Other" category and kept our two most frequent categories: white and latinx.
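
Looking at levels() before and after each function is a quick way to see what changed. The factor below is made up for this sketch, and its counts are not the ones in our dataset:

```{r, message=FALSE, warning=FALSE}
library(forcats)

# Made-up factor: white x3, latinx x2, black x1, other x1
toy_f <- factor(c("white", "latinx", "white", "black",
                  "latinx", "white", "other"))

levels(toy_f)                        # alphabetical by default
levels(fct_relevel(toy_f, "white"))  # move "white" to the front
levels(fct_infreq(toy_f))            # most frequent level first
levels(fct_lump(toy_f, n = 2))       # keep the 2 most frequent levels
levels(fct_lump(toy_f, n = 2, other_level = "another group"))
```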


#### Other Useful Functions for Factors

![Forcats cheat sheet](forcatscheat.png)


### Group_by

The group_by() function allows us to group our data by a specific variable and compare the groups with descriptive statistics.

```{r}
clean_df %>% 
  group_by(race_f) %>% 
  summarize(N = n(), 
            mean = mean(depression_1),
            sd = sd(depression_1))
```

This example groups the dataset by race_f, counts the number of rows in each group (the n() function), and provides the mean (mean()) and standard deviation (sd()) of the depression scores for each group.
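
If the variable being summarized has missing values, mean() and sd() return `NA` unless you add `na.rm = TRUE`. The sketch below shows this on a made-up data frame (`toy_grp`). The `n_valid` column is a hypothetical addition that counts the non-missing scores:

```{r, message=FALSE, warning=FALSE}
library(dplyr)

# Made-up data with one missing depression score
toy_grp <- data.frame(
  race_f       = factor(c("white", "white", "black", "black")),
  depression_1 = c(10, NA, 14, 16)
)

toy_summary <- toy_grp %>%
  group_by(race_f) %>%
  summarize(
    N       = n(),                        # rows per group, including missing
    n_valid = sum(!is.na(depression_1)),  # non-missing scores per group
    mean    = mean(depression_1, na.rm = TRUE),
    sd      = sd(depression_1, na.rm = TRUE)
  )

toy_summary
```

A group with only one valid score still gets a mean, but its standard deviation is `NA`.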