-
Notifications
You must be signed in to change notification settings - Fork 1
/
02-tidy.Rmd
386 lines (263 loc) · 10.9 KB
/
02-tidy.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
# Tidy Data
Tidying up data is often where most of the data work happens. It is known by many names--cleaning, wrangling, tidying--but it all ultimately is to make the data in a format that we can analyze it.
## Define Tidy Data
[Tidy data](https://vita.had.co.nz/papers/tidy-data.pdf) are discussed in many resources (e.g., [R for Data Science](https://r4ds.had.co.nz)). "Tidy datasets are all alike but every messy dataset is messy in its own way. Tidy datasets provide a standardized way to link the structure of a dataset (its physical layout) with its semantics (its meaning). In this section, I’ll provide some standard vocabulary for describing the structure and semantics of a dataset, and then use those definitions to define tidy data," (https://vita.had.co.nz/papers/tidy-data.pdf).
--------------------------------------
## Tidying (Cleaning) Data
For examples, we are going to import some data from a `.csv` file.
```{r, message=FALSE, warning=FALSE}
data <- rio::import("data_csv.csv")
```
## Changing Variable Type
Importing data into R can cause your variables to change types. For example, lets see what the race variable class is now that the file has been imported into R.
```{r}
class(data$Race)
```
As you see, its a character. We want it to be labeled as a factor. We can do this by using the as.factor() function in baseR. Make sure you save it as another variable in your dataset. If you don't assign it to another variable, it will only temporarily show your variable as a factor but not actually change it to a factor for later analyses.
```{r}
data$RaceF <- as.factor(data$Race)
data$SmokingF <- as.factor(data$smoking)
```
Now lets see what the class of our RaceF variable is.
```{r}
class(data$RaceF)
```
Please note- when importing from a SAV file, some factors may be "haven-labelled." Use this technique to change haven-labelled variables into factors as well.
## Changing a whole dataset to a certain class
If you want your whole dataset to be a certain class, you can easily do that in R using map_df(). Let's say we want the whole dataset to be numeric.
```{r, message=FALSE, warning=FALSE}
library(tidyverse)
data_num <- map_df(data, as.numeric)
tibble::glimpse(data)
```
As you can see, the character and factor variables that do not contain numbers are unable to be numeric. We will go over how to change that later. You can also change the whole dataset to be factors using the same code, but changing as.numeric to as.factor.
## Janitor Package
The janitor package has useful functions that make it easy to clean your dataset. Use the clean_names() function to make variable names consistent.
```{r, message=FALSE, warning=FALSE}
library(janitor)
clean_df <- clean_names(data)
tibble::glimpse(clean_df)
```
Now, all of our variables are lowercase and consistent.
You can also remove columns and/or rows that are completely missing. We do not have any missing rows or columns in our real dataset, so we will make one up to show you how it works.
```{r}
df <- data.frame(var1 = c(1, NA, 6, 8, 3),
var2 = c("a", NA, "c", "s", "y"),
var3 = c(NA, NA, NA, NA, NA))
df
```
```{r}
remove_empty(df, "rows")
```
```{r}
remove_empty(df, "cols")
```
```{r}
remove_empty(df, c("rows", "cols"))
```
## Washer function in the Furniture package
Note the use of `head()` to only print 6 rows (instead of all of them).
```{r, message=FALSE, warning=FALSE}
library(furniture)
washer(clean_df, 99, value = NA) %>%
head()
```
--------------------------------------
## Tidyverse
### Pipes
```
me %>%
wake_up("8:00am") %>%
exercise(30, units = "minutes") %>%
shower(15, units = "minutes") %>%
eat_breakfast("toast") %>%
go_to_work("basement")
```
Piping takes what is on the left-hand side and puts it in the right hand side's function. If you look at the code below, you'll see how code looks without a pipe.
```{r}
summary(clean_df)
```
Now, here is what the code looks like using a pipe.
```{r}
clean_df %>%
summary()
```
It may not seem to make the code more readable, but the more complex your code is the easier it will be to read it using pipes. It also allows you to connect multiple lines of code without having to overwrite your dataset multiple times. As we go through more data cleaning and wrangling, you'll begin to see how piping makes a difference when coding.
### Subset Data
#### Select function
The select function allows you to pull specific information out of our dataset. You can do this using baseR commands, or with the select function from tidyverse.
```{r}
head(clean_df[, c("id", "educ", "age")])
head(clean_df[, c(1:3)])
```
```{r}
clean_df %>%
select(id, educ, age) %>%
head()
clean_df %>%
select(1:3) %>%
head()
```
Each code chunk does the same thing. The first two, using [, is the "base R" way of selecting variables. The last two, using the pipe, is the tidyverse way. Both work great so the choice is yours.
#### starts_with(), ends_with(), contains()
When you use tidyverse, you can pull variables out that have common features. Here are a few examples. Note that we use the select function as well as another function to pull out information.
This line of code pulls out the variables that start with the word "income".
```{r}
clean_df %>%
select(starts_with("income")) %>%
head()
```
This code allows you to select all of the variables in your dataset that end with _1 and _2.
```{r}
clean_df %>%
select(ends_with(c("_1", "_2"))) %>%
head()
```
With this code you can pull out any variables that contain "anx" anywhere in the variable name.
```{r}
clean_df %>%
select(contains("anx")) %>%
head()
```
### Filter, equality/inequality, and/or, within
```{r}
clean_df[clean_df$anx_1 > 20,]
clean_df %>%
filter(anx_1 > 20) %>%
head()
```
```{r}
clean_df %>%
filter(sex == "female") %>%
head()
```
```{r}
clean_df %>%
filter(sex == "female" & race == "black") %>%
head()
```
```{r}
clean_df %>%
filter(sex == "female" | educ == 20) %>%
head()
```
Can also have it match multiple things
```{r}
clean_df %>%
filter(race %in% c("black", "latinx")) %>%
head()
```
### Mutate
Anytime you see mutate() it means you are adding a new variable or modifying an existing one. For example, lets take a look at what class our sex variable is.
```{r}
class(clean_df$sex)
```
When we imported the data, R did not know sex was a factor. So, we need to change it to be factor. Using the mutate function allows us to modify existing variables, or create new variables. In this case, I make a new variable called "sex_f" and change sex to be a factor.
```{r}
clean_df <- clean_df %>%
mutate(sex_f = factor(sex))
class(clean_df$sex_f)
```
There are loads of useful functions for mutating a variable! We will present a few here.
#### Case_when
When you want to modify variables based on various possible conditions, you can use the case_when() function. For example, when a categorical variable is dummy-coded to be 0,1,2.. you can change it to have actual name labels. In this example, our married variable i supposed to be factor of 0 = not married and 1 = married. Lets look at our married variable before we use mutate and case_when.
```{r}
summary(clean_df$married)
```
The next code chunk will show a new variable, called "married_f" that has levels "married" and "not married" and is a factor. To do this, we need to use mutate to create the new married_f variable, and then tell r that when the married variable = 1, then label married, and if married = 0, label it not married. Note that we need to use == to specify that it is equal to something. We also need to put our labels in "" when they are words. The last line of code turns married_f into a factor, rather than a character variable.
```{r}
clean_df <- clean_df %>%
mutate(married_f = case_when(
married == 1 ~ "married",
married == 0 ~ "not married"
) %>% as.factor())
glimpse(clean_df)
```
Here is another example.
```{r}
clean_df <- clean_df %>%
mutate(income_cat = case_when(
income_1 > mean(income_1, na.rm=TRUE) + sd(income_1, na.rm=TRUE) ~ "High Earner",
income_1 > mean(income_1, na.rm=TRUE) ~ "Mid-High Earner",
income_1 <= mean(income_1, na.rm=TRUE) - sd(income_1, na.rm=TRUE) ~ "Low Earner",
income_1 <= mean(income_1, na.rm=TRUE) ~ "Mid-Low Earner",
) %>% as.factor())
glimpse(clean_df)
```
This last one is a complex example but it shows the utility of `case_when()`. Any guesses what this is doing?
```{r, eval = FALSE}
data %>%
rowwise() %>%
mutate(
var1 =
case_when(
sum(
item1 > 2,
item2 > 2,
item3 > 2,
item4 > 2,
item5 > 2,
na.rm=TRUE
) >= 6 &
sum(
perf1 > 3,
perf2 > 3,
perf3 > 3,
na.rm=TRUE
) > 0 ~ 1,
TRUE ~ 0
)
) %>%
ungroup() %>%
mutate(var1 = factor(var1))
```
#### Rowsums and Rowmeans
```{r}
clean_df <- clean_df %>%
mutate(dep_v1 = furniture::rowmeans(depression_1, depression_2))
clean_df <- clean_df %>%
rowwise() %>%
mutate(dep_v2 = mean(c(depression_1, depression_2))) %>%
ungroup()
glimpse(clean_df)
```
### Forcats Package
```{r}
library(forcats) # forcats is also in tidyverse so if you downloaded tidyverse you don't need to download this as well
```
#### Relevel
You can manually reorder factor levels
```{r}
clean_df %>%
mutate(race_f = fct_relevel(race_f, "white", "black", "latinx", "other"))
```
#### Reorder
Reorder a factor by its levels' frequency
```{r}
clean_df %>%
mutate(race_f = fct_infreq(race_f))
```
Reorder a factor by a numeric variable
```{r}
clean_df %>%
mutate(race_f = fct_reorder(race_f, income_1))
```
#### Lump
When you have a factor with too many levels, you can lump all of the infrequent ones into one factor, "other". You can specify how many levels you want to include. the argument "n" is the number of levels you want to keep, besides the "other" category that will be made from the lump function. You can also include the argument other_level = "" if you would like to change the name from other to something more specific.
```{r}
clean_df %>%
mutate(race_f = fct_lump(race_f, n = 2))
```
In this example, we previously had four levels of the race variable: white, black, latinx, and other. When we specified n = 2, we lumped the other and black categories (the most infrequent two) into one "Other" category. It kept our two most frequent categories: white and latinx.
#### Other Useful Functions for Factors
![Forcats cheat sheet](forcatscheat.png)
### Group_by
The group_by() function allows for us to group our data by a specific variable and compare groups with descriptive statistics.
```{r}
clean_df %>%
group_by(race_f) %>%
summarize(N = n(),
mean = mean(depression_1),
sd = sd(depression_1))
```
This example groups the dataset by race_f and summarizes it by the number per race (the n() argument), and provides the mean (mean()) and standard deviation (sd()) for depression scores by race.