forked from rdpeng/RepData_PeerAssessment1
---
title: "Reproducible Research: Peer Assessment 1"
output:
  html_document:
    keep_md: true
---
Load the required packages.
```{r}
library(dplyr)
library(ggplot2)
```
## Loading and preprocessing the data
Read the data.
```{r}
activitydata <- read.csv("data/activity.csv", header = TRUE, na.strings = "NA")
summary(activitydata)
str(activitydata)
```
The interval column encodes the time of day as HHMM (for example, 835 means 8:35). Transform it to represent the hour of the day in decimal format.
```{r}
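# The hundreds part holds the hours; the remainder holds the minutes,
# which are converted to a fraction of an hour.
activitydata <- mutate(activitydata,
                       interval = interval %/% 100 + (interval %% 100) / 60)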
```
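As a quick check of the conversion, the following sketch applies the same arithmetic to a few interval codes. The helper name `interval_to_hours` and the sample codes are hypothetical and are not part of the original analysis.

```{r}
# Hypothetical helper (not in the original analysis): convert an HHMM-style
# interval code to decimal hours.
interval_to_hours <- function(interval) {
  (interval %/% 100) + (interval %% 100) / 60
}

# 0 is midnight, 55 is 00:55, 100 is 01:00, 835 is 08:35, 2355 is 23:55
example_hours <- interval_to_hours(c(0, 55, 100, 835, 2355))
example_hours
```

Without the conversion, the codes jump from 55 to 100 at each hour boundary, which would leave uneven gaps on a numeric x-axis.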
## What is mean total number of steps taken per day?
#### Calculate the total number of steps taken per day
Group the data by date with dplyr, sum the steps for each day, and
finally calculate the mean and median of the daily totals.
```{r}
activitybydate <- group_by(activitydata, date)
activitybydate <- summarise(activitybydate,
                            totalsteps = sum(steps, na.rm = TRUE),
                            count = n())
summary(activitybydate)
origmean <- mean(activitybydate$totalsteps)
origmedian <- median(activitybydate$totalsteps)
```
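The mean number of steps taken per day is 9354.23. Because `sum(steps, na.rm = TRUE)` returns 0 when every value is NA, days with no valid measurements count as zero-step days in this mean.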
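The effect of `na.rm = TRUE` on days without any measurement can be seen in a small sketch. The data frame `toy_days` and its values are made up for illustration; only the behaviour of `sum()` and `mean()` is being shown.

```{r}
# Toy data (assumed, not from activity.csv): day "d2" has no valid measurement.
toy_days <- data.frame(
  date  = rep(c("d1", "d2", "d3"), each = 2),
  steps = c(10, 20, NA, NA, 30, 40)
)

# Same approach as above: an all-NA day sums to 0.
totals_zero <- tapply(toy_days$steps, toy_days$date, sum, na.rm = TRUE)

# Alternative: keep an all-NA day as NA so it can be excluded from the mean.
totals_na <- tapply(toy_days$steps, toy_days$date, function(s) {
  if (all(is.na(s))) NA_real_ else sum(s, na.rm = TRUE)
})

mean_with_zero_days <- mean(totals_zero)
mean_without_na_days <- mean(totals_na, na.rm = TRUE)
c(mean_with_zero_days, mean_without_na_days)
```

Treating the empty day as zero pulls the mean down, so the two conventions answer slightly different questions.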
#### Make a histogram of the total number of steps taken each day
```{r}
hist(activitybydate$totalsteps,
main = "Total steps per day",
xlab = "Total Steps")
```
## What is the average daily activity pattern?
First, group the data by interval and average.
```{r}
activitybyinterval <- group_by(activitydata, interval)
activitybyinterval <- summarise(activitybyinterval,
mean = mean(steps, na.rm=TRUE))
```
#### Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?
The interval that contains the maximum average number of steps is 8:35, which the code below reports in decimal hours (about 8.58).
```{r}
activitybyinterval$interval[which.max(activitybyinterval$mean)]
```
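Since the interval is now stored in decimal hours, the value returned above can be turned back into a clock label for reporting. This is a minimal sketch; `hours_to_label` is a hypothetical helper, not something the original analysis defines.

```{r}
# Hypothetical helper: format decimal hours as an "HH:MM" label.
hours_to_label <- function(h) {
  total_minutes <- round(h * 60)  # rounding absorbs floating-point error
  sprintf("%02d:%02d", total_minutes %/% 60, total_minutes %% 60)
}

peak_label <- hours_to_label(8 + 35 / 60)
peak_label
```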
#### Make a time series plot (i.e. type = "l") of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis)
```{r}
plot(x = activitybyinterval$interval, y = activitybyinterval$mean,
     type = "l",
     xlab = "Interval",
     ylab = "steps",
     main = "Steps across the day")
```
## Imputing missing values
#### Calculate and report the total number of missing values in the dataset (i.e. the total number of rows with _NAs_ )
The total number of NAs in the data set is:
```{r}
sum(is.na(activitydata))
```
We can observe their distribution per day:
```{r}
table( sapply(
lapply(split(activitydata$steps, activitydata$date ), is.na),
sum))
```
There are 8 days with no valid measurements and 53 days with no NA values, so missing values always cover entire days.
#### Devise a strategy for filling in all of the missing values in the dataset.
I would drop those days from the data set, but the assignment asks us to fill in their values. We fill them with **the average value per interval calculated previously**.
#### Create a new dataset that is equal to the original dataset but with the missing data filled in.
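We generate a vector that repeats the interval averages 61 times (once per day) and assign its values to every NA in the _activitydata_ data set. This relies on the rows being sorted by date and then by interval, so that the repeated vector lines up with the data.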
```{r}
means <- rep(activitybyinterval$mean, 61)
activityfilled <- activitydata
activityfilled$steps[is.na(activityfilled$steps)] <-
  means[is.na(activityfilled$steps)]
```
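The `rep()` approach depends on row order. As an alternative sketch, the interval means can be looked up by value with `match()`, which works regardless of how the rows are sorted. The data frame `toy_activity` and its values are invented for illustration, and base R's `aggregate()` stands in for the dplyr summary used above.

```{r}
# Toy data (assumed): 3 days x 3 intervals, one whole day and one cell missing.
toy_activity <- data.frame(
  date     = rep(c("d1", "d2", "d3"), each = 3),
  interval = rep(c(0, 5, 10), times = 3),
  steps    = c(1, 2, 3, NA, NA, NA, 3, 4, NA)
)

# Mean steps per interval; the formula interface drops NA rows by default.
toy_means <- aggregate(steps ~ interval, data = toy_activity, FUN = mean)

# Look up each missing row's interval mean by interval value, not by position.
toy_filled <- toy_activity
missing_rows <- is.na(toy_filled$steps)
toy_filled$steps[missing_rows] <-
  toy_means$steps[match(toy_filled$interval[missing_rows], toy_means$interval)]

toy_filled
```

Matching on the interval value would also handle a data set where only part of a day is missing, which the positional approach handles only if the rows are complete and ordered.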
Now there are no days with NAs:
```{r}
table( sapply(
lapply(split(activityfilled$steps, activityfilled$date ), is.na),
sum))
```
#### Make a histogram of the total number of steps taken each day and calculate and report the mean and median total number of steps taken per day.
Recalculate the total number of steps per day and compare the new mean and median (shown by `summary`) with the original values.
```{r}
activitybydate <- group_by(activityfilled, date)
activitybydate <- summarise(activitybydate,
                            totalsteps = sum(steps, na.rm = TRUE),
                            count = n())
summary(activitybydate)
origmean
origmedian
```
Both the mean and the median have changed.
```{r}
hist(activitybydate$totalsteps,
main = "Total steps per day",
xlab = "Total Steps")
```
The histogram has also changed: there are now many more elements in the center of the distribution. This happened because we created 8 new _average days_.
However, if we plot the averaged daily pattern, there is no impact. Filling missing values with the interval mean leaves that interval's mean unchanged, so the interval with the maximum averaged number of steps is the same, and the plot is also the same.
```{r}
activitybyinterval <- group_by(activityfilled, interval)
activitybyinterval <- summarise(activitybyinterval,
mean = mean(steps, na.rm=TRUE))
activitybyinterval$interval[which.max(activitybyinterval$mean)]
plot(x = activitybyinterval$interval, y = activitybyinterval$mean,
     type = "l",
     xlab = "Interval",
     ylab = "steps")
```
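The reason the interval averages do not move can be checked on a single made-up vector: replacing the missing entries with the mean of the observed ones leaves the mean as it was. The names `obs`, `obs_mean` and `obs_filled` are illustrative only.

```{r}
# Toy vector (assumed): steps for one interval across five days, two missing.
obs <- c(4, NA, 10, NA, 1)

obs_mean <- mean(obs, na.rm = TRUE)             # mean of the observed values
obs_filled <- ifelse(is.na(obs), obs_mean, obs) # fill the gaps with that mean

# The filled vector has the same mean as the observed values.
all.equal(mean(obs_filled), obs_mean)
```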
## Are there differences in activity patterns between weekdays and weekends?
#### Create a new factor variable in the dataset with two levels – “weekday” and “weekend” indicating whether a given date is a weekday or weekend day.
First, convert the date column to the Date class:
```{r}
activityfilled$date <- as.Date(as.character(activityfilled$date))
```
Create a new column with the weekday:
```{r}
activityfilled <- mutate(activityfilled, weekday = weekdays(date))
```
Create a new column with the required factor:
```{r}
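# weekdays() returns locale-dependent names, so this comparison assumes an
# English locale.
activityfilled <- mutate(activityfilled,
                         day = weekday %in% c("Saturday", "Sunday"))
# as.factor() on a logical yields the levels FALSE, TRUE, which are then
# renamed to weekday, weekend in that order.
activityfilled$day <- as.factor(activityfilled$day)
levels(activityfilled$day) <- c("weekday", "weekend")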
summary(activityfilled)
```
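Because `weekdays()` returns names in the session's language, the comparison with "Saturday" and "Sunday" only works in an English locale. A locale-independent sketch can use the numeric day of the week from `as.POSIXlt()`, where 0 is Sunday and 6 is Saturday. The sample dates and the names `sample_dates`, `is_weekend` and `day_type` are assumptions for illustration.

```{r}
# Sample dates (assumed): a Friday, Saturday, Sunday and Monday in October 2012.
sample_dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# $wday is numeric (0 = Sunday, ..., 6 = Saturday), so it ignores the locale.
is_weekend <- as.POSIXlt(sample_dates)$wday %in% c(0, 6)

# Build the two-level factor with explicit labels and level order.
day_type <- factor(ifelse(is_weekend, "weekend", "weekday"),
                   levels = c("weekday", "weekend"))
day_type
```

Setting the levels explicitly also avoids relying on the FALSE/TRUE ordering of a logical converted with `as.factor()`.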
#### Make a plot:
We first group the data by interval and by the _day_ column, then average:
```{r}
activitybyintbyday <- group_by(activityfilled, interval, day)
activitybyintbyday <- summarise(activitybyintbyday, mean = mean(steps))
```
We can also calculate the cumulative sum of the interval means within each _day_ level:
```{r}
activitybyintbyday <- group_by(ungroup(activitybyintbyday), day)
activitybyintbyday <- mutate(activitybyintbyday, cumsum = cumsum(mean))
```
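For readers less familiar with grouped `mutate()`, the same per-group cumulative sum can be sketched in base R with `ave()`. The data frame `toy_pattern` and its numbers are invented for illustration.

```{r}
# Toy data (assumed): three interval means for each day type.
toy_pattern <- data.frame(
  day  = rep(c("weekday", "weekend"), each = 3),
  mean = c(1, 2, 3, 10, 20, 30)
)

# ave() applies cumsum() separately within each level of day,
# so the running total restarts for the weekend rows.
toy_pattern$cumsum <- ave(toy_pattern$mean, toy_pattern$day, FUN = cumsum)
toy_pattern
```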
Finally, plot the result, assuming our data cover the whole day:
```{r}
ggplot(activitybyintbyday, aes(x = interval)) +
  scale_x_continuous(limits = c(0, 24), breaks = seq(0, 24, 2)) +
  geom_line(aes(y = mean)) +
  geom_line(aes(y = cumsum / 60), col = "red") +
  facet_grid(day ~ .) +
  xlab("hour") +
  ylab("mean number of steps") +
  ggtitle("Average daily patterns for Weekdays and Weekends")
```
We have divided the _cumsum_ by $60$ so that both lines fit on the same scale in one chart. We can see that on weekends the activity starts a little later, and there seems to be more activity during the afternoon and the evening. On weekdays, on the other hand, there is much more activity in the morning.
```{r}
temp <- group_by(activitybyintbyday, day)
summarise(temp, sum = sum(mean))
```
The sum of the interval means is the average number of steps per day for each day type, so we can also say that, on average, there are more steps on weekends.
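Summing the interval means gives a daily figure because, when every day has a value for every interval, the sum of the per-interval means equals the mean of the per-day totals. The matrix `toy_steps` below is made up to illustrate that identity.

```{r}
# Toy data (assumed): rows are 2 days, columns are 3 intervals.
toy_steps <- matrix(c(1, 2, 3,
                      4, 5, 6), nrow = 2, byrow = TRUE)

sum_of_interval_means <- sum(colMeans(toy_steps))  # average each interval, then add
mean_of_daily_totals <- mean(rowSums(toy_steps))   # add each day, then average

all.equal(sum_of_interval_means, mean_of_daily_totals)
```

The identity holds here because the imputed data set has no missing values; with NAs left in place, the two quantities could differ.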