-
Notifications
You must be signed in to change notification settings - Fork 175
/
ch03.Rmd
769 lines (509 loc) · 36.7 KB
/
ch03.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
---
output:
bookdown::html_document2:
fig_caption: yes
editor_options:
chunk_output_type: console
---
```{r echo = FALSE, cache = FALSE}
source("utils.R", local = TRUE)
```
Bar Graphs {#CHAPTER-BAR-GRAPH}
==========
Bar graphs are perhaps the most commonly used kind of data visualization. They're typically used to display numeric values (on the y-axis), for different categories (on the x-axis). For example, a bar graph would be good for showing the prices of four different kinds of items. A bar graph generally wouldn't be as good for showing prices over time, where time is a continuous variable -- though it can be done, as we'll see in this chapter.
There's an important distinction you should be aware of when making bar graphs: sometimes the bar heights represent *counts* of cases in the data set, and sometimes they represent *values* in the data set. Keep this distinction in mind -- it can be a source of confusion since they have very different relationships to the data, but the same term is used for both of them. In this chapter I'll discuss this more, and present recipes for both types of bar graphs.
From this chapter on, this book will focus on using ggplot2 instead of base R graphics. Using ggplot2 will both keep things simpler and make for more sophisticated graphics.
Making a Basic Bar Graph {#RECIPE-BAR-GRAPH-BASIC-BAR}
------------------------
### Problem
You have a data frame where one column represents the *x* position of each bar, and another column represents the vertical (y) height of each bar.
### Solution
Use `ggplot()` with `geom_col()` and specify what variables you want on the x- and y-axes (Figure \@ref(fig:FIG-BAR-GRAPH-BASIC-BAR)):
```{r FIG-BAR-GRAPH-BASIC-BAR, fig.cap='Bar graph of values with a discrete x-axis'}
library(gcookbook) # Load gcookbook for the pg_mean data set
ggplot(pg_mean, aes(x = group, y = weight)) +
geom_col()
```
> **Note**
>
> In previous versions of ggplot2, the recommended way to create a bar graph of values was to use `geom_bar(stat = "identity")`. As of ggplot2 2.2.0, there is a `geom_col()` function which does the same thing.
### Discussion
When x is a continuous (or numeric) variable, the bars behave a little differently. Instead of having one bar at each actual x value, there is one bar at each possible x value between the minimum and the maximum, as in Figure \@ref(fig:FIG-BAR-GRAPH-BASIC-BAR-CONT). You can convert the continuous variable to a discrete variable by using `factor()`.
```{r FIG-BAR-GRAPH-BASIC-BAR-CONT, fig.show="hold", fig.cap='Bar graph of values with a continuous x-axis (left); With x variable converted to a factor (notice that the space for 6 is gone; right)', fig.width=3.5, fig.height=3.5}
# There's no entry for Time == 6
BOD
# Time is numeric (continuous)
str(BOD)
ggplot(BOD, aes(x = Time, y = demand)) +
geom_col()
# Convert Time to a discrete (categorical) variable with factor()
ggplot(BOD, aes(x = factor(Time), y = demand)) +
geom_col()
```
Notice that there was no row in `BOD` for `Time` = 6. When the x variable is continuous, ggplot2 will use a numeric axis which will have space for all numeric values within the range -- hence the empty space for 6 in the plot. When `Time` is converted to a factor, ggplot2 uses it as a discrete variable, where the values are treated as arbitrary labels instead of numeric values, and so it won't allocate space on the x axis for all possible numeric values between the minimum and maximum.
In these examples, the data has a column for x values and another for y values. If you instead want the height of the bars to represent the *count* of cases in each group, see Recipe \@ref(RECIPE-BAR-GRAPH-COUNTS).
By default, bar graphs use a dark grey for the bars. To use a color fill, use `fill`. Also, by default, there is no outline around the fill. To add an outline, use `colour`. For Figure \@ref(fig:FIG-BAR-GRAPH-BASIC-BAR-SINGLE-FILL), we use a light blue fill and a black outline:
```{r FIG-BAR-GRAPH-BASIC-BAR-SINGLE-FILL, fig.cap="A single fill and outline color for all bars"}
ggplot(pg_mean, aes(x = group, y = weight)) +
geom_col(fill = "lightblue", colour = "black")
```
> **Note**
>
> In ggplot2, the default is to use the British spelling, colour, instead of the American spelling, color. Internally, American spellings are remapped to the British ones, so if you use the American spelling it will still work.
### See Also
If you want the height of the bars to represent the count of cases in each group, see Recipe \@ref(RECIPE-BAR-GRAPH-COUNTS).
To reorder the levels of a factor based on the values of another variable, see Recipe \@ref(RECIPE-DATAPREP-FACTOR-REORDER-VALUE). To manually change the order of factor levels, see Recipe \@ref(RECIPE-DATAPREP-FACTOR-REORDER).
For more information about using colors, see Chapter \@ref(CHAPTER-COLORS).
Grouping Bars Together {#RECIPE-BAR-GRAPH-GROUPED-BAR}
----------------------
### Problem
You want to group bars together by a second variable.
### Solution
Map a variable to fill, and use `geom_col(position = "dodge")`.
In this example we'll use the `cabbage_exp` data set, which has two categorical variables, `Cultivar` and `Date`, and one continuous variable, `Weight`:
```{r}
library(gcookbook) # Load gcookbook for the cabbage_exp data set
cabbage_exp
```
We'll map `Date` to the *x* position and map `Cultivar` to the fill color (Figure \@ref(fig:FIG-BAR-GRAPH-GROUPED-BAR)):
```{r FIG-BAR-GRAPH-GROUPED-BAR, fig.cap="Graph with grouped bars"}
ggplot(cabbage_exp, aes(x = Date, y = Weight, fill = Cultivar)) +
geom_col(position = "dodge")
```
### Discussion
The most basic bar graphs have one categorical variable on the x-axis and one continuous variable on the y-axis. Sometimes you'll want to use another categorical variable to divide up the data, in addition to the variable on the x-axis. You can produce a grouped bar plot by mapping that variable to fill, which represents the fill color of the bars. You must also use `position = "dodge"`, which tells the bars to "dodge" each other horizontally; if you don't, you'll end up with a stacked bar plot (Recipe \@ref(RECIPE-BAR-GRAPH-STACKED-BAR)).
As with variables mapped to the x-axis of a bar graph, variables that are mapped to the fill color of bars must be categorical rather than continuous variables.
To add a black outline, use `colour = "black"` inside `geom_col()`. To set the colors, you can use `scale_fill_brewer()` or `scale_fill_manual()`. In Figure \@ref(fig:FIG-BAR-GRAPH-GROUPED-BAR-OUTLINE) we'll use the `Pastel1` palette from `RColorBrewer`:
```{r FIG-BAR-GRAPH-GROUPED-BAR-OUTLINE, fig.cap="Grouped bars with black outline and a different color palette"}
ggplot(cabbage_exp, aes(x = Date, y = Weight, fill = Cultivar)) +
geom_col(position = "dodge", colour = "black") +
scale_fill_brewer(palette = "Pastel1")
```
Other aesthetics, such as `colour` (the color of the outlines of the bars) or `linestyle`, can also be used for grouping variables, but `fill` is probably what you'll want to use.
Note that if there are any missing combinations of the categorical variables, that bar will be missing, and the neighboring bars will expand to fill that space. If we remove the last row from our example data frame, we get Figure \@ref(fig:FIG-BAR-GRAPH-GROUPED-BAR-MISSING):
```{r FIG-BAR-GRAPH-GROUPED-BAR-MISSING, fig.cap="Graph with a missing bar-the other bar fills the space"}
ce <- cabbage_exp[1:5, ]
ce
ggplot(ce, aes(x = Date, y = Weight, fill = Cultivar)) +
geom_col(position = "dodge", colour = "black") +
scale_fill_brewer(palette = "Pastel1")
```
If your data has this issue, you can manually make an entry for the missing factor level combination with an `NA` for the *y* variable.
### See Also
For more on using colors in bar graphs, see Recipe \@ref(RECIPE-BAR-GRAPH-COLORS).
To reorder the levels of a factor based on the values of another variable, see Recipe \@ref(RECIPE-DATAPREP-FACTOR-REORDER-VALUE).
Making a Bar Graph of Counts {#RECIPE-BAR-GRAPH-COUNTS}
----------------------------
### Problem
Your data has one row representing each case, and you want plot counts of the cases.
### Solution
Use `geom_bar()` without mapping anything to `y` (Figure \@ref(fig:FIG-BAR-GRAPH-COUNT)):
```{r FIG-BAR-GRAPH-COUNT, fig.cap="Bar graph of counts"}
# Equivalent to using geom_bar(stat = "bin")
ggplot(diamonds, aes(x = cut)) +
geom_bar()
```
### Discussion
The `diamonds` data set has 53,940 rows, each of which represents information about a single diamond:
```{r}
diamonds
```
With `geom_bar()`, the default behavior is to use `stat = "bin"`, which counts up the number of cases for each group (each *x* position, in this example). In the graph we can see that there are about 23,000 cases with an `ideal` cut.
In this example, the variable on the x-axis is discrete. If we use a continuous variable on the x-axis, we'll get a bar at each unique *x* value in the data, as shown in Figure \@ref(fig:FIG-BAR-GRAPH-COUNT-CONTINUOUS), left:
```{r FIG-BAR-GRAPH-COUNT-CONTINUOUS, echo=FALSE, message=FALSE, fig.show="hold", fig.cap='Bar graph of counts on a continuous axis (left); A histogram (right)', fig.width=6, fig.height=3}
# Bar graph of counts
ggplot(diamonds, aes(x = carat)) +
geom_bar()
# Histogram
ggplot(diamonds, aes(x = carat)) +
geom_histogram()
```
The bar graph with a continuous x-axis is similar to a histogram, but not the same. A histogram is shown on the right of Figure \@ref(fig:FIG-BAR-GRAPH-COUNT-CONTINUOUS). In this kind of bar graph, each bar represents a unique *x* value, whereas in a histogram, each bar represents a *range* of *x* values.
### See Also
If, instead of having `ggplot()` count up the number of rows in each group, you have a column in your data frame representing the y values, use `geom_col()`. See Recipe \@ref(RECIPE-BAR-GRAPH-BASIC-BAR).
You could also get the same graphical output by calculating the counts before sending the data to `ggplot()`. See Recipe \@ref(RECIPE-DATAPREP-SUMMARIZE) for more on summarizing data.
For more about histograms, see Recipe \@ref(RECIPE-DISTRIBUTION-BASIC-HIST).
Using Colors in a Bar Graph {#RECIPE-BAR-GRAPH-COLORS}
---------------------------
### Problem
You want to use different colors for the bars in your graph.
### Solution
Map the appropriate variable to the fill aesthetic.
We'll use the `uspopchange` data set for this example. It contains the percentage change in population for the US states from 2000 to 2010. We'll take the top 10 fastest-growing states and graph their percentage change. We'll also color the bars by region (Northeast, South, North Central, or West).
First, take the top 10 states:
```{r}
library(gcookbook) # Load gcookbook for the uspopchange data set
library(dplyr)
upc <- uspopchange %>%
arrange(desc(Change)) %>%
slice(1:10)
upc
```
Now we can make the graph, mapping Region to fill
(Figure \@ref(fig:FIG-BAR-GRAPH-FILL)):
```{r FIG-BAR-GRAPH-FILL, fig.cap="A variable mapped to fill", fig.width=6, fig.height=3.5}
ggplot(upc, aes(x = Abb, y = Change, fill = Region)) +
geom_col()
```
### Discussion
The default colors aren't the most appealing, so you may want to set them using `scale_fill_brewer()` or `scale_fill_manual()`. With this example, we'll use the latter, and we'll set the outline color of the bars to black, with `colour="black"` (Figure \@ref(fig:FIG-BAR-GRAPH-FILL-MANUAL)). Note that *setting* occurs outside of `aes()`, while *mapping* occurs within `aes()`:
```{r FIG-BAR-GRAPH-FILL-MANUAL, fig.cap="Graph with different colors, black outlines, and sorted by percentage change", fig.width=6, fig.height=3.5}
ggplot(upc, aes(x = reorder(Abb, Change), y = Change, fill = Region)) +
geom_col(colour = "black") +
scale_fill_manual(values = c("#669933", "#FFCC66")) +
xlab("State")
```
This example also uses the `reorder()` function to reorder the levels of the factor `Abb` based on the values of `Change`. In this particular case it makes sense to sort the bars by their height, instead of in alphabetical order.
### See Also
For more about using `reorder()`, see Recipe \@ref(RECIPE-DATAPREP-FACTOR-REORDER-VALUE).
For more information about using colors, see Chapter \@ref(CHAPTER-COLORS).
Coloring Negative and Positive Bars Differently {#RECIPE-BAR-GRAPH-COLOR-NEG}
-----------------------------------------------
### Problem
You want to use different colors for negative and positive-valued bars.
### Solution
We'll use a subset of the climate data and create a new column called pos, which indicates whether the value is positive or negative:
```{r}
library(gcookbook) # Load gcookbook for the climate data set
library(dplyr)
climate_sub <- climate %>%
filter(Source == "Berkeley" & Year >= 1900) %>%
mutate(pos = Anomaly10y >= 0)
climate_sub
```
Once we have the data, we can make the graph and map pos to the fill color, as in Figure \@ref(fig:FIG-BAR-GRAPH-COLOR-NEG). Notice that we use position="identity" with the bars. This will prevent a warning message about stacking not being well defined for negative numbers:
```{r FIG-BAR-GRAPH-COLOR-NEG, fig.cap="Different colors for positive and negative values", fig.width=10, fig.height=2.5, out.width="100%"}
ggplot(climate_sub, aes(x = Year, y = Anomaly10y, fill = pos)) +
geom_col(position = "identity")
```
### Discussion
There are a few problems with the first attempt. First, the colors are probably the reverse of what we want: usually, blue means cold and red means hot. Second, the legend is redundant and distracting.
We can change the colors with `scale_fill_manual()` and remove the legend with `guide = FALSE`, as shown in Figure \@ref(fig:FIG-BAR-GRAPH-COLOR-NEG2). We'll also add a thin black outline around each of the bars by setting `colour` and specifying `size`, which is the thickness of the outline (in millimeters):
```{r FIG-BAR-GRAPH-COLOR-NEG2, fig.cap="Graph with customized colors and no legend", fig.width=10, fig.height=2.5, out.width="100%"}
ggplot(climate_sub, aes(x = Year, y = Anomaly10y, fill = pos)) +
geom_col(position = "identity", colour = "black", size = 0.25) +
scale_fill_manual(values = c("#CCEEFF", "#FFDDDD"), guide = FALSE)
```
### See Also
To change the colors used, see Recipes Recipe \@ref(RECIPE-COLORS-PALETTE-DISCRETE) and Recipe \@ref(RECIPE-COLORS-PALETTE-DISCRETE-MANUAL).
To hide the legend, see Recipe \@ref(RECIPE-LEGEND-REMOVE).
Adjusting Bar Width and Spacing {#RECIPE-BAR-GRAPH-ADJUST-WIDTH}
-------------------------------
### Problem
You want to adjust the width of bars and the spacing between them.
### Solution
To make the bars narrower or wider, set `width` in `geom_col()`. The default value is 0.9; larger values make the bars wider, and smaller values make the bars narrower (Figure \@ref(fig:FIG-BAR-WIDTH)).
For example, for standard-width bars:
```{r eval=FALSE}
library(gcookbook) # Load gcookbook for the pg_mean data set
ggplot(pg_mean, aes(x = group, y = weight)) +
geom_col()
```
For narrower bars:
```{r eval=FALSE}
ggplot(pg_mean, aes(x = group, y = weight)) +
geom_col(width = 0.5)
```
And for wider bars (these have the maximum width of 1):
```{r eval=FALSE}
ggplot(pg_mean, aes(x = group, y = weight)) +
geom_col(width = 1)
```
```{r FIG-BAR-WIDTH, echo=FALSE, fig.show="hold", fig.cap="Different bar widths", fig.width=4, fig.height=3}
ggplot(pg_mean, aes(x = group, y = weight)) +
geom_col()
ggplot(pg_mean, aes(x = group, y = weight)) +
geom_col(width = 0.5)
ggplot(pg_mean, aes(x = group, y = weight)) +
geom_col(width = 1)
```
For grouped bars, the default is to have no space between bars within each group. To add space between bars within a group, make width smaller and set the value for `position_dodge` to be larger than `width` (Figure \@ref(fig:FIG-BAR-WIDTH-DODGE)).
For a grouped bar graph with narrow bars:
```{r eval=FALSE}
ggplot(cabbage_exp, aes(x = Date, y = Weight, fill = Cultivar)) +
geom_col(width = 0.5, position = "dodge")
```
And with some space between the bars:
```{r eval=FALSE}
ggplot(cabbage_exp, aes(x = Date, y = Weight, fill = Cultivar)) +
geom_col(width = 0.5, position = position_dodge(0.7))
```
```{r FIG-BAR-WIDTH-DODGE, echo=FALSE, fig.show="hold", fig.cap="Bar graph with narrow grouped bars (left); With space between the bars (right)", fig.width=4.5, fig.height=3}
ggplot(cabbage_exp, aes(x = Date, y = Weight, fill = Cultivar)) +
geom_col(width = 0.5, position = "dodge")
ggplot(cabbage_exp, aes(x = Date, y = Weight, fill = Cultivar)) +
geom_col(width = 0.5, position = position_dodge(0.7))
```
The first graph used `position = "dodge"`, and the second graph used `position = position_dodge()`. This is because `position = "dodge"` is simply shorthand for `position = position_dodge()` with the default value of 0.9, but when we want to set a specific value, we need to use the more verbose form.
### Discussion
The default `width` for bars is 0.9, and the default value used for `position_dodge()` is the same. To be more precise, the value of `width` in `position_dodge()` is `NULL`, which tells ggplot2 to use the same value as the width from `geom_bar()`.
All of these will have the same result:
```{r eval=FALSE}
geom_bar(position = "dodge")
geom_bar(width = 0.9, position = position_dodge())
geom_bar(position = position_dodge(0.9))
geom_bar(width = 0.9, position = position_dodge(width=0.9))
```
The items on the x-axis have x values of 1, 2, 3, and so on, though you typically don't refer to them by these numerical values. When you use `geom_bar(width = 0.9)`, it makes each group take up a total width of 0.9 on the x-axis. When you use `position_dodge(width = 0.9)`, it spaces the bars so that the *middle* of each bar is right where it would be if the bar width were 0.9 and the bars were touching. This is illustrated in Figure \@ref(fig:FIG-BAR-WIDTH-DODGE-EXPLANATION). The two graphs both have the same dodge width of 0.9, but while the top has a bar width of 0.9, the bottom has a bar width of 0.2. Despite the different bar widths, the middles of the bars stay aligned.
```{r FIG-BAR-WIDTH-DODGE-EXPLANATION, echo=FALSE, fig.show="hold", fig.cap="Same dodge width of 0.9, but different bar widths of 0.9 (left) and 0.2 (right)", fig.width=4.5, fig.height=3}
ggplot(cabbage_exp, aes(x = Date, y = Weight, fill = Cultivar)) +
geom_col(position = "dodge")
ggplot(cabbage_exp, aes(x = Date, y = Weight, fill = Cultivar)) +
geom_col(width = 0.2, position = position_dodge(width=0.9))
```
If you make the entire graph wider or narrower, the bar dimensions will scale proportionally. To see how this works, you can just resize the window in which the graphs appear. For information about controlling this when writing to a file, see Chapter \@ref(CHAPTER-OUTPUT).
Making a Stacked Bar Graph {#RECIPE-BAR-GRAPH-STACKED-BAR}
--------------------------
### Problem
You want to make a stacked bar graph.
### Solution
Use `geom_bar()` and map a variable `fill`. This will put `Date` on the x-axis and use `Cultivar` for the fill color, as shown in Figure \@ref(fig:FIG-BAR-GRAPH-STACKED-BAR):
```{r FIG-BAR-GRAPH-STACKED-BAR, fig.cap="Stacked bar graph", fig.height=3.5}
library(gcookbook) # Load gcookbook for the cabbage_exp data set
ggplot(cabbage_exp, aes(x = Date, y = Weight, fill = Cultivar)) +
geom_col()
```
### Discussion
To understand how the graph is made, it's useful to see how the data is structured. There are three levels of `Date` and two levels of `Cultivar`, and for each combination there is a value for `Weight`:
```{r}
cabbage_exp
```
By default, the stacking order of the bars is the same as the order of items in the legend. For some data sets it might make sense to reverse the order of the legend. To do this, you can use the `guides` function and specify which aesthetic for which the legend should be reversed. In this case, it's `fill`:
```{r FIG-BAR-GRAPH-STACKED-BAR-REVLEVELS, fig.cap="Stacked bar graph with reversed legend order", fig.height=3.5}
ggplot(cabbage_exp, aes(x = Date, y = Weight, fill = Cultivar)) +
geom_col() +
guides(fill = guide_legend(reverse = TRUE))
```
If you'd like to reverse the stacking order of the bars, as in Figure \@ref(fig:FIG-BAR-GRAPH-STACKED-BAR-REVSTACK), use `position_stack(reverse = TRUE)`. You'll also need to reverse the order of the legend for it to match the order of the bars:
```{r FIG-BAR-GRAPH-STACKED-BAR-REVSTACK, fig.cap="Stacked bar graph with reversed stacking order", fig.height=3.5}
ggplot(cabbage_exp, aes(x = Date, y = Weight, fill = Cultivar)) +
geom_col(position = position_stack(reverse = TRUE)) +
guides(fill = guide_legend(reverse = TRUE))
```
It's also possible to modify the column of the data frame so that the factor levels are in a different order (see Recipe \@ref(RECIPE-DATAPREP-FACTOR-REORDER)). Do this with care, since the modified data could change the results of other analyses.
For a more polished graph, we'll use `scale_fill_brewer()` to get a different color palette, and use `colour="black"` to get a black outline (Figure \@ref(fig:FIG-BAR-GRAPH-STACKED-BAR-COLORS)):
```{r FIG-BAR-GRAPH-STACKED-BAR-COLORS, fig.cap="Stacked bar graph with reversed legend, new palette, and black outline", fig.height=3.5}
ggplot(cabbage_exp, aes(x = Date, y = Weight, fill = Cultivar)) +
geom_col(colour = "black") +
scale_fill_brewer(palette = "Pastel1")
```
### See Also
For more on using colors in bar graphs, see Recipe \@ref(RECIPE-BAR-GRAPH-COLORS).
To reorder the levels of a factor based on the values of another variable, see Recipe \@ref(RECIPE-DATAPREP-FACTOR-REORDER-VALUE). To manually change the order of factor levels, see Recipe \@ref(RECIPE-DATAPREP-FACTOR-REORDER).
Making a Proportional Stacked Bar Graph {#RECIPE-BAR-GRAPH-PROPORTIONAL-STACKED-BAR}
---------------------------------------
### Problem
You want to make a stacked bar graph that shows proportions (also called a 100% stacked bar graph).
### Solution
Use `geom_col(position = "fill")` (Figure \@ref(fig:FIG-BAR-GRAPH-PROP-STACKED-BAR)):
```{r FIG-BAR-GRAPH-PROP-STACKED-BAR, fig.cap="Proportional stacked bar graph", fig.height=3.5}
library(gcookbook) # Load gcookbook for the cabbage_exp data set
ggplot(cabbage_exp, aes(x = Date, y = Weight, fill = Cultivar)) +
geom_col(position = "fill")
```
### Discussion
With `position = "fill"`, the y values will be scaled to go from 0 to 1. To print the labels as percentages, use `scale_y_continuous(labels = scales::percent)`.
```{r eval=FALSE}
ggplot(cabbage_exp, aes(x = Date, y = Weight, fill = Cultivar)) +
geom_col(position = "fill") +
scale_y_continuous(labels = scales::percent)
```
> **Note**
>
> Using `scales::percent` is a way of using the `percent` function from the scales package. You could instead do `library(scales)` and then just use `scale_y_continuous(labels = percent)`. This would also make all of the functions from scales available in the current R session.
To make the output look a little nicer, you can change the color palette and add an outline. This is shown in (Figure \@ref(fig:FIG-BAR-GRAPH-PROP-STACKED-BAR-FINAL)):
```{r FIG-BAR-GRAPH-PROP-STACKED-BAR-FINAL, fig.cap="Proportional stacked bar graph with reversed legend, new palette, and black outline", fig.height=3.5}
ggplot(cabbage_exp, aes(x = Date, y = Weight, fill = Cultivar)) +
geom_col(colour = "black", position = "fill") +
scale_y_continuous(labels = scales::percent) +
scale_fill_brewer(palette = "Pastel1")
```
Instead of having ggplot2 compute the proportions automatically, you may want to compute the proportional values yourself. This can be useful if you want to use those values in other computations.
To do this, first scale the data to 100% within each stack. This can be done by using `group_by()` together with `mutate()` from the dplyr package.
```{r}
library(gcookbook)
library(dplyr)
cabbage_exp
# Do a group-wise transform(), splitting on "Date"
ce <- cabbage_exp %>%
group_by(Date) %>%
mutate(percent_weight = Weight / sum(Weight) * 100)
ce
```
To calculate the percentages within each `Weight` group, we used dplyr's `group_by()` and `mutate()` functions. In the example here, the `group_by()` function tells dplyr that future operations should operate on the data frame as though it were split up into groups, on the `Date` column. The `mutate()` function tells it to calculate a new column, dividing each row's `Weight` value by the sum of the `Weight` column *within each group*.
> **Note**
>
> You may have noticed that `cabbage_exp` and `ce` print out differently. This is because `cabbage_exp` is a regular data frame, while `ce` is a *tibble*, which is a data frame with some extra properties. The dplyr package creates tibbles. For more information, see Chapter \@ref(CHAPTER-DATAPREP).
After computing the new column, making the graph is the same as with a regular stacked bar graph.
```{r eval=FALSE}
ggplot(ce, aes(x = Date, y = percent_weight, fill = Cultivar)) +
geom_col()
```
### See Also
For more on transforming data by groups, see Recipe \@ref(RECIPE-DATAPREP-CALCULATE-GROUP).
Adding Labels to a Bar Graph {#RECIPE-BAR-GRAPH-LABELS}
----------------------------
### Problem
You want to add labels to the bars in a bar graph.
### Solution
Add `geom_text()` to your graph. It requires a mapping for x, y, and the text itself. By setting `vjust` (the vertical justification), it is possible to move the text above or below the tops of the bars, as shown in Figure \@ref(fig:FIG-BAR-GRAPH-LABEL):
```{r FIG-BAR-GRAPH-LABEL, fig.show="hold", fig.cap="Labels under the tops of bars (left); Labels above bars (right)", fig.height=3.5}
library(gcookbook) # Load gcookbook for the cabbage_exp data set
# Below the top
ggplot(cabbage_exp, aes(x = interaction(Date, Cultivar), y = Weight)) +
geom_col() +
geom_text(aes(label = Weight), vjust = 1.5, colour = "white")
# Above the top
ggplot(cabbage_exp, aes(x = interaction(Date, Cultivar), y = Weight)) +
geom_col() +
geom_text(aes(label = Weight), vjust = -0.2)
```
Notice that when the labels are placed atop the bars, they may be clipped. To remedy this, see Recipe \@ref(RECIPE-AXES-RANGE).
Another common scenario is to add labels for a bar graph of *counts* instead of values. To do this, use `geom_bar()`, which adds bars whose height is proportional to the number of rows, and then use `geom_text()` with counts:
```{r FIG-BAR-GRAPH-COUNT-LABEL, fig.cap="Bar graph of counts with labels under the tops of bars", fig.width=4, fig.height=3.5}
ggplot(mtcars, aes(x = factor(cyl))) +
geom_bar() +
geom_text(aes(label = ..count..), stat = "count", vjust = 1.5, colour = "white")
```
We needed to tell `geom_text()` to use the `"count"` statistic to compute the number of rows for each x value, and then, to use those computed counts as the labels, we told it to use the aesthetic mapping `aes(label = ..count..)`.
### Discussion
In Figure \@ref(fig:FIG-BAR-GRAPH-LABEL), the *y* coordinates of the labels are centered at the top of each bar; by setting the vertical justification (`vjust`), they appear below or above the bar tops. One drawback of this is that when the label is above the top of the bar, it can go off the top of the plotting area. To fix this, you can manually set the *y* limits, or you can set the *y* positions of the text *above* the bars and not change the vertical justification. One drawback to changing the text's *y* position is that if you want to place the text fully above or below the bar top, the value to add will depend on the *y* range of the data; in contrast, changing `vjust` to a different value will always move the text the same distance relative to the height of the bar:
```{r, eval=FALSE}
# Adjust y limits to be a little higher
ggplot(cabbage_exp, aes(x = interaction(Date, Cultivar), y = Weight)) +
geom_col() +
geom_text(aes(label = Weight), vjust = -0.2) +
ylim(0, max(cabbage_exp$Weight) * 1.05)
# Map y positions slightly above bar top - y range of plot will auto-adjust
ggplot(cabbage_exp, aes(x = interaction(Date, Cultivar), y = Weight)) +
geom_col() +
geom_text(aes(y = Weight + 0.1, label = Weight))
```
For grouped bar graphs, you also need to specify position=position_dodge() and give it a value for the dodging width. The default dodge width is 0.9. Because the bars are narrower, you might need to use size to specify a smaller font to make the labels fit. The default value of size is 5, so we'll make it smaller by using 3 (Figure \@ref(fig:FIG-BAR-LABEL-GROUPED)):
```{r FIG-BAR-LABEL-GROUPED, fig.cap="Labels on grouped bars", fig.height=3.5}
ggplot(cabbage_exp, aes(x = Date, y = Weight, fill = Cultivar)) +
geom_col(position = "dodge") +
geom_text(
aes(label = Weight),
colour = "white", size = 3,
vjust = 1.5, position = position_dodge(.9)
)
```
Putting labels on stacked bar graphs requires finding the cumulative sum for each stack. To do this, first make sure the data is sorted properly -- if it isn't, the cumulative sum might be calculated in the wrong order. We'll use the `arrange()` function from the dplyr package. Note that we have to use the `rev()` function to reverse the order of `Cultivar`:
```{r}
library(dplyr)
# Sort by the Date and Cultivar columns
ce <- cabbage_exp %>%
arrange(Date, rev(Cultivar))
```
Once we make sure the data is sorted properly, we'll use `group_by()` to chunk it into groups by `Date`, then calculate a cumulative sum of `Weight` within each chunk:
```{r FIG-BAR-LABEL-STACKED, fig.cap="Labels on stacked bars", fig.height=3.5}
# Get the cumulative sum
ce <- ce %>%
group_by(Date) %>%
mutate(label_y = cumsum(Weight))
ce
ggplot(ce, aes(x = Date, y = Weight, fill = Cultivar)) +
geom_col() +
geom_text(aes(y = label_y, label = Weight), vjust = 1.5, colour = "white")
```
The result is shown in Figure \@ref(fig:FIG-BAR-LABEL-STACKED).
When using labels, changes to the stacking order are best done by modifying the order of levels in the factor (see Recipe \@ref(RECIPE-DATAPREP-FACTOR-REORDER)) before taking the cumulative sum. The other method of changing stacking order, by specifying breaks in a scale, won't work properly, because the order of the cumulative sum won't be the same as the stacking order.
To put the labels in the middle of each bar (Figure \@ref(fig:FIG-BAR-LABEL-STACKED-MIDDLE)), there must be an adjustment to the cumulative sum, and the *y* offset in `geom_bar()` can be removed:
```{r FIG-BAR-LABEL-STACKED-MIDDLE, fig.cap="Labels in the middle of stacked bars", fig.height=3.5}
ce <- cabbage_exp %>%
arrange(Date, rev(Cultivar))
# Calculate y position, placing it in the middle
ce <- ce %>%
group_by(Date) %>%
mutate(label_y = cumsum(Weight) - 0.5 * Weight)
ggplot(ce, aes(x = Date, y = Weight, fill = Cultivar)) +
geom_col() +
geom_text(aes(y = label_y, label = Weight), colour = "white")
```
For a more polished graph (Figure \@ref(fig:FIG-BAR-LABEL-STACKED-FINAL)), we'll change the colors, add labels in the middle with a smaller font using `size`, add a "kg" suffix using `paste`, and make sure there are always two digits after the decimal point by using `format()`:
```{r FIG-BAR-LABEL-STACKED-FINAL, fig.cap="Customized stacked bar graph with labels", fig.height=3.5}
ggplot(ce, aes(x = Date, y = Weight, fill = Cultivar)) +
geom_col(colour = "black") +
geom_text(aes(y = label_y, label = paste(format(Weight, nsmall = 2), "kg")), size = 4) +
scale_fill_brewer(palette = "Pastel1")
```
### See Also
To control the appearance of the text, see Recipe \@ref(RECIPE-APPEARANCE-TEXT-APPEARANCE).
For more on transforming data by groups, see Recipe \@ref(RECIPE-DATAPREP-CALCULATE-GROUP).
Making a Cleveland Dot Plot {#RECIPE-BAR-GRAPH-DOT-PLOT}
---------------------------
### Problem
You want to make a Cleveland dot plot.
### Solution
Cleveland dot plots are an alternative to bar graphs that reduce visual clutter and can be easier to read.
The simplest way to create a dot plot (as shown in Figure \@ref(fig:FIG-BAR-GRAPH-DOTPLOT)) is to use `geom_point()`:
```{r FIG-BAR-GRAPH-DOTPLOT, fig.cap="Basic dot plot", fig.width = 4, fig.height=5, out.width="50%"}
library(gcookbook) # Load gcookbook for the tophitters2001 data set
tophit <- tophitters2001[1:25, ] # Take the top 25 from the tophitters data set
ggplot(tophit, aes(x = avg, y = name)) +
geom_point()
```
### Discussion
The `tophitters2001` data set contains many columns, but we'll focus on just three of them for this example:
```{r}
tophit[, c("name", "lg", "avg")]
```
In Figure \@ref(fig:FIG-BAR-GRAPH-DOTPLOT) the names are sorted alphabetically, which isn't very useful in this graph. Dot plots are often sorted by the value of the continuous variable on the horizontal axis.
Although the rows of `tophit` happen to be sorted by `avg`, that doesn't mean that the items will be ordered that way in the graph. By default, the items on the given axis will be ordered however is appropriate for the data type. `name` is a character vector, so it's ordered alphabetically. If it were a factor, it would use the order defined in the factor levels. In this case, we want `name` to be sorted by a different variable, `avg`.
To do this, we can use `reorder(name, avg)`, which takes the name column, turns it into a factor, and sorts the factor levels by `avg`. To further improve the appearance, we'll make the vertical grid lines go away by using the theming system, and turn the horizontal grid lines into dashed lines (Figure \@ref(fig:FIG-BAR-GRAPH-DOTPLOT-ORDERED)):
```{r FIG-BAR-GRAPH-DOTPLOT-ORDERED, fig.cap="Dot plot, ordered by batting average", fig.width=4, fig.height=5, out.width="50%"}
ggplot(tophit, aes(x = avg, y = reorder(name, avg))) +
geom_point(size = 3) + # Use a larger dot
theme_bw() +
theme(
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.y = element_line(colour = "grey60", linetype = "dashed")
)
```
It's also possible to swap the axes so that the names go along the x-axis and the values go along the y-axis, as shown in Figure \@ref(fig:FIG-BAR-GRAPH-DOTPLOT-ORDERED-SWAP). We'll also rotate the text labels by 60 degrees:
```{r FIG-BAR-GRAPH-DOTPLOT-ORDERED-SWAP, fig.cap="Dot plot with names on x-axis and values on y-axis", fig.width=6, fig.height=4, out.width="80%"}
ggplot(tophit, aes(x = reorder(name, avg), y = avg)) +
geom_point(size = 3) + # Use a larger dot
theme_bw() +
theme(
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.major.x = element_line(colour = "grey60", linetype = "dashed"),
axis.text.x = element_text(angle = 60, hjust = 1)
)
```
It's also sometimes desirable to group the items by another variable. In this case we'll use the factor `lg`, which has the levels `NL` and `AL`, representing the National League and the American League. This time we want to sort first by `lg` and then by `avg`. Unfortunately, the `reorder()` function will only order factor levels by one other variable; to order the factor levels by two variables, we must do it manually:
```{r}
# Get the names, sorted first by lg, then by avg
nameorder <- tophit$name[order(tophit$lg, tophit$avg)]
# Turn name into a factor, with levels in the order of nameorder
tophit$name <- factor(tophit$name, levels = nameorder)
```
To make the graph (Figure \@ref(fig:FIG-BAR-GRAPH-DOTPLOT-ORDERED2)), we'll also add a mapping of `lg` to the color of the points. Instead of using grid lines that run all the way across, this time we'll make the lines go only up to the points, by using `geom_segment()`. Note that `geom_segment()` needs values for `x`, `y`, `xend`, and `yend`:
```{r FIG-BAR-GRAPH-DOTPLOT-ORDERED2, fig.cap="Grouped by league, with lines that stop at the point", fig.width=4, fig.height=6, out.width="50%"}
ggplot(tophit, aes(x = avg, y = name)) +
geom_segment(aes(yend = name), xend = 0, colour = "grey50") +
geom_point(size = 3, aes(colour = lg)) +
scale_colour_brewer(palette = "Set1", limits = c("NL", "AL")) +
theme_bw() +
theme(
panel.grid.major.y = element_blank(), # No horizontal grid lines
legend.position = c(1, 0.55), # Put legend inside plot area
legend.justification = c(1, 0.5)
)
```
Another way to separate the two groups is to use facets, as shown in Figure \@ref(fig:FIG-BAR-GRAPH-DOTPLOT-ORDERED2-FACET). The order in which the facets are displayed is different from the sorting order in Figure \@ref(fig:FIG-BAR-GRAPH-DOTPLOT-ORDERED2); to change the display order, you must change the order of factor levels in the `lg` variable:
```{r FIG-BAR-GRAPH-DOTPLOT-ORDERED2-FACET, fig.cap="Faceted by league", fig.width=4, fig.height=6, out.width="50%"}
ggplot(tophit, aes(x = avg, y = name)) +
geom_segment(aes(yend = name), xend = 0, colour = "grey50") +
geom_point(size = 3, aes(colour = lg)) +
scale_colour_brewer(palette = "Set1", limits = c("NL", "AL"), guide = FALSE) +
theme_bw() +
theme(panel.grid.major.y = element_blank()) +
facet_grid(lg ~ ., scales = "free_y", space = "free_y")
```
### See Also
For more on changing the order of factor levels, see Recipe \@ref(RECIPE-DATAPREP-FACTOR-REORDER). Also see Recipe \@ref(RECIPE-DATAPREP-FACTOR-REORDER-VALUE) for details on changing the order of factor levels based on some other values.
For more on moving the legend, see Recipe \@ref(RECIPE-LEGEND-POSITION). To hide grid lines, see Recipe \@ref(RECIPE-APPEARANCE-HIDE-GRIDLINES).