
# Linear Regression

**Chapter Links**

* [Chapter 10 Slide Show](http://tysonbarrett.com/EDUC-6600/Slides/u03_Ch10_LinReg.html#1)

* [Interactive Online App - Correlation and Regression](http://digitalfirst.bfwpub.com/stats_applet/stats_applet_5_correg.html)

* [Cancer Dataset - SPSS format](https://usu.box.com/s/9c92zof5whb76bphmzxn3vqx5702qgq6)


**Unit Assignment Links**

* Unit 3 Written Part: [Skeleton - pdf](https://usu.box.com/s/vjcsotiqwu1mwnwgzbfyig6k451ymgow)

* Unit 3 R Part: [Directions - pdf](https://usu.box.com/s/ectr9zx8qfbbm59h0qcexjreje5r9aio) and [Skeleton - Rmd](https://usu.box.com/s/k3vzw6nuq5tw66bxeptcyzth38pj69f9)

* Unit 3 Reading to Summarize: [Article - pdf](https://usu.box.com/s/qmo57s03tbq02ks75p7eb5gad0ap05kg)

* Inho's Dataset: [Excel](https://usu.box.com/s/hyky7eb24l6vvzj2xboedhcx1xolrpw1)

**Related Readings**

* [Linear Regression - Walkthrough in R](https://uc-r.github.io/linear_regression)

* [Standard Error of the Regression vs. R-squared](http://statisticsbyjim.com/regression/standard-error-regression-vs-r-squared/)


```{r global_options, include=FALSE}
# set global chunk options...  
#  this changes the defaults so you don't have to repeat yourself
knitr::opts_chunk$set(comment     = NA,
                      cache       = TRUE,
                      echo        = TRUE, 
                      warning     = FALSE, 
                      message     = FALSE)
```


Required Packages 

```{r load_libraries}
library(tidyverse)    # Loads several very helpful 'tidy' packages
library(haven)        # Read in SPSS datasets
library(car)          # Companion for Applied Regression (and ANOVA)
library(broom)        # Convert Statistical Analysis Objects into Tidy Dataframes
library(magrittr)     # A Forward-Pipe Operator for R
```


Example: Cancer Experiment 

The `Cancer` dataset was introduced in [chapter 3][Example: Cancer Experiment].


```{r, include=FALSE}
cancer_raw <- haven::read_spss("data/cancer.sav")

cancer_clean <- cancer_raw %>% 
  dplyr::rename_all(tolower) %>% 
  dplyr::mutate(id = factor(id)) %>% 
  dplyr::mutate(trt = factor(trt,
                             labels = c("Placebo", 
                                        "Aloe Juice"))) %>% 
  dplyr::mutate(stage = factor(stage))
```


-------------------------------------------------------


## Visualize the Raw Data

Always plot your data first!

```{r}
cancer_clean %>% 
  ggplot(aes(x = age,
             y = weighin)) +
  geom_point() +
  geom_smooth(method = "lm",    se = TRUE, color = "blue") +  # straight line (linear model)
  geom_smooth(method = "loess", se = FALSE, color = "red")   # loess line (moving window)
```

-------------------------------------------------------


## Fitting a Simple Regression Model


The `lm()` function needs at least TWO arguments:

* **formula** - The name of the *outcome* or dependent variable (DV) goes on the left of the tilde symbol (`~`) and the name of the *predictor* or independent variable (IV) goes on the right: `continuous_y ~ continuous_x`

* **data** - Since the dataset is not the first argument in the function, you must use the period to signify that the dataset is being piped from above: `data = .`


```{r}
cancer_clean %>% 
  lm(weighin ~ age,    # formula: order DOES matter
     data = .)         # data piped from above
```
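
To see what the period is doing, here is a minimal sketch that fits the same model with and without the pipe. It uses R's built-in `mtcars` data as a stand-in (an assumption for illustration, not the cancer data), and the object names `fit_piped` and `fit_direct` are made up for this example.

```{r}
library(magrittr)   # provides the %>% pipe

# Stand-in data (assumption): the built-in mtcars dataset
fit_piped  <- mtcars %>% lm(mpg ~ wt, data = .)   # the period stands for the piped dataset
fit_direct <- lm(mpg ~ wt, data = mtcars)         # the same model, naming the dataset directly

# Both calls should give the same intercept and slope
coef(fit_piped)
coef(fit_direct)
```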

-------------------------------------------------------


## Extracting Information From the Model

### Model Overview


To view more complete information, add a `summary()` step using a pipe AFTER the `lm()` step.

```{r}
cancer_clean %>% 
  lm(weighin ~ age,    # formula: order DOES matter
     data = .) %>%     # data piped from above
  summary()
```

> **NOTE - Variable Designation Matters!**

In simple linear regression (with only one predictor, or IV), the slope estimate ($\hat{\beta}_1$) depends on which variable is designated as $x$ and which as $y$, but the $p$-value for the slope is the same under either ordering.


```{r}
cancer_clean %>% 
  lm(age ~ weighin,    # formula: order DOES matter
     data = .) %>%     # data piped from above
  summary()
```
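
The sketch below checks this claim on a stand-in dataset (the built-in `mtcars`, an assumption for illustration). It pulls the slope and its $p$-value out of both orderings, and also shows that the product of the two slopes equals the squared correlation, which is one way to see that both models describe the same linear association.

```{r}
# Stand-in data (assumption): the built-in mtcars dataset
fit_yx <- lm(mpg ~ wt, data = mtcars)   # mpg is the outcome
fit_xy <- lm(wt ~ mpg, data = mtcars)   # roles swapped: wt is the outcome

# Slope estimate from each ordering
slope_yx <- coef(fit_yx)[["wt"]]
slope_xy <- coef(fit_xy)[["mpg"]]

# p-value for the slope from each ordering
p_yx <- summary(fit_yx)$coefficients["wt",  "Pr(>|t|)"]
p_xy <- summary(fit_xy)$coefficients["mpg", "Pr(>|t|)"]

c(slope_yx = slope_yx, slope_xy = slope_xy)   # the slopes differ
c(p_yx = p_yx, p_xy = p_xy)                   # the p-values match

# The two slopes multiply to the squared Pearson correlation
r <- cor(mtcars$mpg, mtcars$wt)
all.equal(slope_yx * slope_xy, r^2)
```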

----------------------------

### Model Fit or Accuracy

`broom::glance()` returns a single row of fit statistics for the entire model.

```{r}
cancer_clean %>% 
  lm(weighin ~ age,    
     data = .) %>%     
  broom::glance()
```
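
Two of the columns in that single row can be reproduced by hand, which may help in reading them. The sketch below uses the built-in `mtcars` data as a stand-in (an assumption for illustration): with one predictor, `r.squared` is the squared correlation between the two variables, and `sigma` is the residual standard error.

```{r}
library(broom)

# Stand-in data (assumption): the built-in mtcars dataset
fit       <- lm(mpg ~ wt, data = mtcars)
fit_stats <- broom::glance(fit)

# R-squared vs. the squared Pearson correlation
fit_stats$r.squared
cor(mtcars$mpg, mtcars$wt)^2

# sigma vs. sqrt(residual sum of squares / residual degrees of freedom)
fit_stats$sigma
sqrt(sum(residuals(fit)^2) / df.residual(fit))
```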


-------------------------------------------------------

### Beta Estimates

`broom::tidy()` returns one row for each parameter: the intercept and a slope for each predictor.

```{r}
cancer_clean %>% 
  lm(weighin ~ age,    
     data = .) %>%     
  broom::tidy()
```

----------------------------

### Confidence Intervals

```{r}
cancer_clean %>% 
  lm(weighin ~ age,    
     data = .) %>%     
  confint()
```
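
By default `confint()` gives 95% intervals. As a rough check of where those numbers come from, the sketch below rebuilds them as estimate plus or minus a critical $t$ value times the standard error. It uses the built-in `mtcars` data as a stand-in (an assumption for illustration), and `by_hand` is a name made up for this example.

```{r}
library(broom)

# Stand-in data (assumption): the built-in mtcars dataset
fit <- lm(mpg ~ wt, data = mtcars)
est <- broom::tidy(fit)                       # estimates and standard errors

# Critical t value for a 95% interval on the residual degrees of freedom
t_crit <- qt(0.975, df = df.residual(fit))

by_hand <- cbind(lower = est$estimate - t_crit * est$std.error,
                 upper = est$estimate + t_crit * est$std.error)

by_hand        # intervals computed by hand
confint(fit)   # intervals from confint()
```

If you prefer the intervals in the same table as the estimates, `broom::tidy(conf.int = TRUE)` appends `conf.low` and `conf.high` columns.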


----------------------------

### Predictions, Residuals, etc.

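`broom::augment()` returns one row for each subject in the original dataset, with added columns such as the fitted value (`.fitted`) and the residual (`.resid`).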

```{r}
cancer_clean %>% 
  lm(weighin ~ age,    
     data = .) %>%     
  broom::augment()
```
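
The `.fitted` and `.resid` columns follow directly from the estimated intercept and slope. The sketch below recomputes them by hand on a stand-in dataset (the built-in `mtcars`, an assumption for illustration); `fitted_by_hand` and `resid_by_hand` are names made up for this example.

```{r}
library(broom)

# Stand-in data (assumption): the built-in mtcars dataset
fit <- lm(mpg ~ wt, data = mtcars)
aug <- broom::augment(fit)
b   <- coef(fit)

# Prediction = intercept + slope * x
fitted_by_hand <- b[["(Intercept)"]] + b[["wt"]] * aug$wt

# Residual = observed y - predicted y
resid_by_hand <- aug$mpg - fitted_by_hand

all.equal(unname(aug$.fitted), fitted_by_hand)
all.equal(unname(aug$.resid),  resid_by_hand)
```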


-------------------------------------------------------


## Model Diagnostics


### Base R Graphics

```{r}
par(mfrow = c(2, 2))

cancer_clean %>% 
  lm(weighin ~ age,    
     data = .) %>%     
  plot()

par(mfrow = c(1, 1))
```