-
Notifications
You must be signed in to change notification settings - Fork 0
/
diamonds.Rmd
155 lines (129 loc) · 4.74 KB
/
diamonds.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
---
title: "Plotting diamonds Dataset"
output: html_notebook
---
Explore plotting the diamonds dataset that comes with the ggplot2 package.
The dataset includes data like the quality, clarity, and cut for over 50,000 diamonds.
Checking counts for each type of Cut.
geom_bar automatically counts rows as the y value.
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
```
The color aesthetic will add color to the outline of each bar.
Now, fill is used to show the split of various diamond.color values for each diamond.cut.
This is a *stacked bar chart*
```{r}
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = color))
```
```{r}
diamonds_summ <- diamonds %>%
group_by(cut) %>%
summarise(average_price = mean(price),
median = median(price),
sd = sd(price),
se = stderr())
limits = aes(x = cut, y = average_price,ymax= diamonds_summ$average_price+diamonds_summ$se, ymin=diamonds_summ$average_price-diamonds_summ$se )
```
```{r}
plot_cut1 = ggplot() +
geom_point(data=diamonds,aes(x=cut,y=price, color = cut),
position="jitter" , alpha=.05)+ #dodge width around the centre line
geom_boxplot(data=diamonds,aes(x=cut,y=price), fill = "transparent")
limits = aes(x = cut, y = average_price,ymax= average_price+sd, ymin= average_price-sd )
plot_cut2 = plot_cut1+
geom_errorbar(data = diamonds_summ,limits,
width=0.2 )
plot_cut2
```
Across the various Cuts, the mean price is almost the same. Cut doesn't predict the price.
The other factors must be explored.
Scatter Plot of Price vs Carat
```{r}
ggplot(data = diamonds, aes(x = carat, y = price)) +
geom_point()
```
We can observe that there aren't many data points at 1.9 or 1.4 carats. This is probably indicative of market demand for significant threshold in carats.
Changing color of each point so that it represented another variable, such as the cut of the diamond.
Also adding a title.
```{r}
ggplot(data = diamonds, aes(x = carat, y = price, color = cut)) +
geom_point()+
ggtitle("Price vs Carat for each Cut")
```
Create a different plot for each type of cut. ggplot2 does this with the facet_wrap() function:
```{r}
ggplot(data = diamonds, aes(x = carat, y = price, color = cut)) +
geom_point(alpha=.1) + #Transparent to overcome overplotting
facet_wrap(~cut)+
theme(legend.position = "none") +
ggtitle("Price vs Carat for each Cut")
```
Exploring Price vs Carat for Color of diamonds
```{r}
ggplot(data = diamonds, aes(x = carat, y = price, color = color)) +
geom_point(alpha=.1)+ #Transparent to overcome overplotting
ggtitle("Price vs Carat for each Color")
```
Exploring Price vs Carat for Clarity of diamonds
```{r}
ggplot(data = diamonds, aes(x = carat, y = price, color = clarity)) +
geom_point(alpha=.1) + #Transparent to overcome overplotting
facet_wrap(~clarity)+
theme(legend.position = "none") +
ggtitle("Price vs Carat for each Clarity")
```
Depth is a continuous variable that can't be used to facet_wrap. In this case you can discretise it frst.
```{r}
diamonds$depth_n <- cut_number(diamonds$depth, 6)
```
This new depth can be used as the color in aes, and for facet_wrap
```{r}
ggplot(data = diamonds, aes(x = carat, y = price, color = depth_n)) +
geom_point(alpha=.1) + #Transparent to overcome overplotting
facet_wrap(~depth_n)+
theme(legend.position = "none") +
ggtitle("Price vs Carat for each Depth")
```
```{r}
diamonds_corr <- diamonds %>%
select(price,carat,depth) %>%
drop_na() %>%
cor() %>%
round(2)
diamonds_corr
```
Carat is the strongest predictor of price.
Depth has no correlation to price.
Keeping carat constant, we can check the influence of Color, Cut, Clarity
```{r}
diamonds %>%
filter(carat == 0.9) %>%
ggplot(aes(x=color,y=price))+
geom_point()+
facet_grid(cut~clarity)
```
-Better the color(D), higher the price.
-The relationship is more pronounced in VVS2 to IF Clarity. But this could be due to fewer data points.
-For poorer Clarity & Cut, the price is almost the same across Colors.
-Visualise again, with a trendline for every observation.
```{r message=FALSE}
diamonds %>%
filter(carat==0.9) %>%
ggplot(aes(x=color,y=price,group=carat))+
geom_smooth(method="loess",se=FALSE)+
facet_grid(cut~clarity)
```
Consider only the 'nicer' clarity values:
Clarity: VVS2, VVS1, IF
Cut: All but Fair
Color: All
```{r}
diamonds %>%
filter(carat>=0.9 & carat<=1.5, clarity =="VVS2"|clarity=="VVS1"|clarity=="IF",cut!="Fair") %>%
ggplot(aes(x=color,y=price,group=carat,color=carat))+
geom_line()+
facet_grid(cut~clarity)
```
In the better clarity range, very few observations exist with Good or worse cuts. This makes sense as the resulting diamond (bad cut, but excellent clarity) wouldn't be a desirable product.