PA1_template.Rmd

---
title: "Reproducible Research: Peer Assessment 1"
output: 
  html_document:
    keep_md: true
---

Load packages
```{r}
library(dplyr)
library(ggplot2)

```

## Loading and preprocessing the data

Read the data.

```{r}
activitydata <-read.csv("data/activity.csv", header = TRUE, na.strings = "NA")
summary(activitydata)
str(activitydata)
```

Transform the interval column to represent the hours of the day in decimal format.

```{r}
activitydata <- mutate(activitydata, interval = (
                         as.integer(interval / 100) + 
                         (interval  %% 100)/60) 
                       )
```

## What is mean total number of steps taken per day?

##### Calculate the total number of steps taken per day

Load dpylr package. Then group by date, sum all the steps per each day and
finally calculate the mean.

```{r}
activitybydate <- group_by(activitydata, date)
activitybydate <- summarise(activitybydate, 
                            totalsteps = sum(steps, na.rm = TRUE),
                            count = n());
summary(activitybydate)
origmean <- mean(activitybydate$totalsteps)
origmedian <- median(activitybydate$totalsteps)
```

The mean number of steps taken per day is 9354.23

#### Make a histogram of the total number of steps taken each day
```{r}
hist(activitybydate$totalsteps, 
     main = "Total steps per day",
     xlab = "Total Steps")
```

## What is the average daily activity pattern?

First, group the data by interval and average.

```{r}
activitybyinterval <- group_by(activitydata, interval)
activitybyinterval <- summarise(activitybyinterval, 
                                mean = mean(steps, na.rm=TRUE))
```

#### Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?

The interval that contains the maximum average number of steps is 8:35h.

```{r}
activitybyinterval$interval[
  which(activitybyinterval$mean == max(activitybyinterval$mean))]
```


#### Make a time series plot (i.e. type = "l") of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis)

```{r}
plot(x = activitybyinterval$interval, activitybyinterval$mean, 
     type = "l",
     xlab = "Interval",
     ylab = "steps",
     main = "Steps across the day")
```

## Imputing missing values

#### Calculate and report the total number of missing values in the dataset (i.e. the total number of rows with _NAs_ )
The total number of NAs into the data sets is:

```{r}
sum(is.na(activitydata))
```

We can observe their distribution per day:

```{r}
table( sapply(
  lapply(split(activitydata$steps, activitydata$date ), is.na),
  sum))
```

There are 8 days that have no valid measurement and 53 days that have no NA value.

#### Devise a strategy for filling in all of the missing values in the dataset.

I would drop those days from the data set, but the assignment ask to fill their value. We fill them with **the average value per interval calculated previously**.

#### Create a new dataset that is equal to the original dataset but with the missing data filled in.

We are going to generate a vector that repeats the averaged values 61 times, and then assign its value to every NA value in the _activitydata_ data set.

```{r}
means <- rep(activitybyinterval$mean, 61)
activityfilled <- activitydata
activityfilled$steps[is.na(activityfilled$steps)] <-
  means[is.na(activityfilled$steps)];
```

Now there are no days with NAs

```{r}
table( sapply(
  lapply(split(activityfilled$steps, activityfilled$date ), is.na),
  sum))
```

#### Make a histogram of the total number of steps taken each day and calculate and report the mean and median total number of steps taken per day.

Recalculate the mean number of steps per day.

```{r}
activitybydate <- group_by(activityfilled, date)
activitybydate <- summarise(activitybydate, 
                            totalsteps = sum(steps, na.rm = TRUE),
                            count = n());
summary(activitybydate)
origmean
origmedian
```

Both, the mean and the median have changed.

```{r}
hist(activitybydate$totalsteps, 
     main = "Total steps per day",
     xlab = "Total Steps")
```

The histogram has also changed, now there are many more elements in the center of the distribution. This happened because we created 8 new _average days_

However, if we plot the averaged daily distribution, there is no impact. The interval with the maximum averaged number of steps is the same, and the plot is also the same.

```{r}
activitybyinterval <- group_by(activityfilled, interval)
activitybyinterval <- summarise(activitybyinterval, 
                                mean = mean(steps, na.rm=TRUE))
activitybyinterval$interval[
  which(activitybyinterval$mean == max(activitybyinterval$mean))]
plot(x = activitybyinterval$interval, activitybyinterval$mean, 
     type = "l",
     xlab = "Interval",
     ylab = "steps")
```


## Are there differences in activity patterns between weekdays and weekends?

#### Create a new factor variable in the dataset with two levels – “weekday” and “weekend” indicating whether a given date is a weekday or weekend day.

First, convert the date column to Date, then 
```{r}
activityfilled$date <- as.Date(as.character(activityfilled$date))
```

Create a new column with the weekday:
```{r}
activityfilled <- mutate(activityfilled, weekday = weekdays(date))
```

Create a new column with the required factor:

```{r}
activityfilled <- mutate(activityfilled, day = 
                                ( weekday %in% c("Saturday", "Sunday") ) )
activityfilled$day <- as.factor(activityfilled$day)
levels(activityfilled$day) <- c("weekday", "weekend")
summary(activityfilled)
```


#### Make a plot:

We have to group the data by interval and by the column day before, then average:

```{r}
activitybyintbyday <- group_by(activityfilled, interval, day)
activitybyintbyday <- summarise(activitybyintbyday, mean = mean(steps))
```
We can also calculate the cumsum by _day_
```{r}
activitybyintbyday <- group_by(ungroup(activitybyintbyday), day)
activitybyintbyday <- mutate(activitybyintbyday, cumsum = cumsum(mean))
```

and finally plot, assuming our data cover the whole day:

```{r}
ggplot(activitybyintbyday, aes(x=interval))+
  scale_x_continuous(limits = c(0,24), breaks = seq(0,24,2))+
  geom_line(aes( y = mean)) + 
  geom_line(aes(y = cumsum/60), col = "red")+
  facet_grid(day ~ .) +
  xlab("hour")+
  ylab("mean number of steps")+
  ggtitle("Average daily patterns for Weekdays and Weekends")
```

We have divided the _cumsum_ by $60$ to be able to represent both lines in the same chart. We can see that on weekends the activity starts a little bit later, and there seems to be more activity during the afternoon and the evening. On the other side, on weekdays, there is much more activity in the morning.

```{r}
temp <- group_by(activitybyintbyday, day)
summarise(temp, sum = sum(mean))
```

We can also say that, on average, there are more steps on weekends.