forked from dereksonderegger/444
-
Notifications
You must be signed in to change notification settings - Fork 0
/
11_StatisticalTables.Rmd
305 lines (253 loc) · 14.3 KB
/
11_StatisticalTables.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
# (PART\*) Miscellaneous {-}
# Statistical Tables{-}
```{r, echo=FALSE}
# Unattach any packages that happen to already be loaded. In general this is unecessary
# but is important for the creation of the book to not have package namespaces
# fighting unexpectedly.
pkgs = names(sessionInfo()$otherPkgs)
if( length(pkgs > 0)){
pkgs = paste('package:', pkgs, sep = "")
for( i in 1:length(pkgs)){
detach(pkgs[i], character.only = TRUE, force=TRUE)
}
}
knitr::opts_chunk$set(cache=FALSE)
```
```{r, message=FALSE, warning=FALSE, results='hide'}
suppressPackageStartupMessages({
library(tidyverse) # loading ggplot2 and dplyr
})
```
As usual, I have a [video](https://www.youtube.com/rIGd59qi4UA)
for the chapter.
Statistics makes use of a wide variety of distributions and before the days of
personal computers, every statistician had books with hundreds and hundreds of
pages of tables allowing them to look up particular values. Fortunately in the
modern age, we don't need those books and tables, but we do still need to access
those values. To make life easier and consistent for R users, every distribution
is accessed in the same manner.
## Example Distributions{-}
+--------------+-------------+---------------+------------------------------------+
| Distribution | R Stem | Parameters | Parameter Interpretation |
+==============+=============+===============+====================================+
| Binomial | `binom` | `size` <br> | Number of Trials |
| | | `prob` | Probability of Success (per Trial) |
+--------------+-------------+---------------+------------------------------------+
| Exponential | `exp` | `rate` | Mean of the distribution |
+--------------+-------------+---------------+------------------------------------+
| Normal | `norm` | `mean=0` | Center of the distribution |
| | | `sd=1` | Standard deviation |
+--------------+-------------+---------------+------------------------------------+
| Uniform | `unif` | `min=0` | Minimum of the distribution |
| | | `max=1` | Maximum of the distribution |
+--------------+-------------+---------------+------------------------------------+
| t | `t` | `df` | Degrees of freedom |
+--------------+-------------+---------------+------------------------------------+
| F | `f` | `df1` <br> | Numerator Degrees of freedom |
| | | `df2` | Denominator Degrees of freedom |
+--------------+-------------+---------------+------------------------------------+
All of the functions in R to manipulate distributions have a similar naming scheme.
They begin with a `d`,`p`,`q`, or `r` and then the stem of the distribution. For
example, I might use a function called `pnorm()`. The above table lists the most
common distributions you might see in an introductory statistics course.
## `mosaic::plotDist()` function{-}
The `mosaic` package provides a very useful routine for understanding a distribution.
The `plotDist()` function takes the R name of the distribution along with whatever
parameters are necessary for that function and show the distribution. For reference
below is a list of common distributions and their R name and a list of necessary
parameters.
For example, to see the normal distribution with mean $\mu=10$ and standard deviation
$\sigma=2$, we use
```{r, fig.height=3, message=FALSE, warning=FALSE}
mosaic::plotDist('norm', mean=10, sd=2)
```
This function works for discrete distributions as well.
```{r, fig.height=3}
mosaic::plotDist('binom', size=10, prob=.2)
```
## Base R functions{-}
All the probability distributions available in R are accessed in exactly the same way,
using a `d`-function, `p`-function, `q`-function, and `r`-function. For the rest of
this section that $X$ is a random variable from the distribution of interest and $x$
is some possible value that $X$ could take on. Notice that the `p`-function is the
inverse of the `q`-function.
+--------------------+-------------------------------------------------------------------+
| Function | Result |
+====================+===================================================================+
| `d`-function(x) | The distance from the x-axis to the curve at a given $x$. |
+--------------------+-------------------------------------------------------------------+
| `p`-function(x) | Probability of an outcome less than or equal to `x` |
+--------------------+-------------------------------------------------------------------+
| `q`-function(q) | The x-value with the input probability to the left. This is the inverse question as the `p`-function. |
+--------------------+-------------------------------------------------------------------+
| `r`-function(n) | Generate $n$ random observations from the distribution |
+--------------------+-------------------------------------------------------------------+
For each distribution in R, there will be this set of functions but we replace the
“-function” with the distribution name or a shortened version. `norm`, `exp`,
`binom`, `t`, `f` are the names for the normal, exponential, binomial, T and F
distributions. Furthermore, most distributions have additional parameters that
define the distribution and will also be passed as arguments to these functions,
although, if a reasonable default value for the parameter exists, there will be a default.
### d-function{-}
The purpose of the d-function is to calculate the distance from the x-axis to the
density curve or height of a probability mass function. The “d” actually stands for
density. Notice that for discrete distributions, this is the probability of observing
that particular value, while for continuous distributions, the height doesn't have a
nice physical interpretation.
We start with an example of the Binomial distribution. For $X\sim Binomial\left(n=10,\pi=.2\right)$, suppose we wanted to know $P(X=0)$. We know
the probability mass function is
$$P\left(X=x\right)={n \choose x}\pi^{x}\left(1-\pi\right)^{n-x}$$
thus
$$P\left(X=0\right) = {10 \choose 0}\,0.2^{0}\left(0.8\right)^{10} = 1\cdot1\cdot0.8^{10} \approx 0.107$$
but that calculation is fairly tedious. To get R to do the same calculation,
we just need the height of the probability mass function at $0$. To do this
calculation, we need to know the x value we are interested in along with the
distribution parameters $n$ and $\pi$.
The first thing we should do is check the help file for the binomial distribution
functions to see what parameters are needed and what they are named.
```{r, eval=FALSE}
?dbinom
```
The help file shows us the parameters $n$ and $\pi$ are called size and prob
respectively. So to calculate the probability that $X=0$ we would use the
following command:
```{r}
dbinom(0, size=10, prob=.2)
```
### p-function{-}
Often we are interested in the probability of observing some value or anything
less (In probability theory, we call this the cumulative density function or CDF).
P-values will be calculated this way, so we want a nice easy way to do this.
To start our example with the binomial distribution, again let
$X\sim Binomial\left(n=10,\pi=0.2\right)$. Suppose I want to know the probability
of observing a 0, 1, or 2. That is, what is $P\left(X\le2\right)$? I could just
find the probability of each and add them up.
```{r}
dbinom(0, size=10, prob=.2) + # P(X==0) +
dbinom(1, size=10, prob=.2) + # P(X==1) +
dbinom(2, size=10, prob=.2) # P(X==2)
```
but this would get tedious for binomial distributions with a large number
of trials. The shortcut is to use the `pbinom()` function.
```{r}
pbinom(2, size=10, prob=.2)
```
For discrete distributions, you must be careful because R will give you the
probability of less than or equal to 2. If you wanted less than two, you should
use `dbinom(1,10,.2)`.
The normal distribution works similarly. Suppose for $Z\sim N\left(0,1\right)$
and we wanted to know $P\left(Z\le-1\right)$?
```{r, echo=FALSE, fig.height=3}
z <- seq(-4,4,by=.01)
heights <- dnorm(z)
plot(z, heights, type='l', ylab='density', xlab='Z')
z2 <- seq(-4,-1, by=.01)
y2 <- dnorm(z2)
polygon(c(-4,z2,-1),c(0,y2,0), col='grey')
text(-2.5, .135, 'P(Z < -1)')
axis(1, -1, labels=-1)
```
The answer is easily found via `pnorm()`.
```{r}
pnorm(-1)
```
Notice for continuous random variables, the probability $P\left(Z=-1\right)=0$
so we can ignore the issue of “less than” vs “less than or equal to”.
Often times we will want to know the probability of greater than some value.
That is, we might want to find $P\left(Z \ge -1\right)$. For the normal
distribution, there are a number of tricks we could use. Notably
$$P\left(Z\ge-1\right) = P\left(Z\le1\right)=1-P\left(Z<-1\right)$$
but sometimes I'm lazy and would like to tell R to give me the area to the right
instead of area to the left (which is the default). This can be done by setting
the argument `lower.tail=FALSE`.
The `mosaic` package includes an augmented version of the `pnorm()` function
called `xpnorm()` that calculates the same number but includes some *extra*
information and produces a pretty graph to help us understand what we just
calculated and do the tedious “1 minus” calculation to find the upper area.
Fortunately this x-variant exists for the Normal, Chi-squared, F, Gamma
continuous distributions and the discrete Poisson, Geometric, and Binomial
distributions.
```{r, fig.height=3}
mosaic::xpnorm(-1)
```
### `q`-function{-}
We will also sometimes find ourselves asking for the quantiles of a distribution.
Percentiles are by definition 1/100, 2/100, etc but if I am interested in something
that isn't an even division of 100, we get fancy and call them quantiles. This is a
small semantic quibble, but we ought to be precise. That being said, I won't correct
somebody if they call these percentiles. For example, I might want to find the 0.30
quantile, which is the value such that 30\% of the distribution is less than it, and
70\% is greater. Mathematically, I wish to find the value $z$ such that $P(Z<z)=0.30$.
To find this value in the tables in a book, we use the table in reverse. R gives
us a handy way to do this with the `qnorm()` function and the mosaic package
provides a nice visualization using the augmented `xqnorm()`.
Below, I specify that I'm using a function in the `mosaic` package by specifying
it via `PackageName::FunctionName()` format because I haven't loaded the `mosaic`
package because some of its functions conflict with `dplyr`.
```{r}
mosaic::xqnorm(0.30) # Give me the value along with a pretty picture
qnorm(.30) # No pretty picture, just the value
```
### r-function{-}
Finally, I often want to be able to generate random data from a particular
distribution. R does this with the r-function. The first argument to this function
the number of random variables to draw and any remaining arguments are the
parameters of the distribution.
```{r}
rnorm(5, mean=20, sd=2)
rbinom(4, size=10, prob=.8)
```
## Exercises {-#Exercises_Statistical_Tables}
1. We will examine how to use the probability mass function (a.k.a. d-function)
and cumulative probability function (a.k.a. p-function) for the Poisson
distribution.
a) Create a graph of the distribution of a Poisson random variable with
rate parameter $\lambda=2$ using the mosaic function `plotDist()`.
b) Calculate the probability that a Poisson random variable (with rate
parameter $\lambda=2$) is exactly equal to 3 using the `dpois()` function.
Be sure that this value matches the graphed distribution in part (a).
c) For a Poisson random variable with rate parameter $\lambda=2$, calculate
the probability it is less than or equal to 3, by summing the four values
returned by the Poisson `d`-function.
d) Perform the same calculation as the previous question but using the
cumulative probability function `ppois()`.
2. We will examine how to use the cumulative probability functions (a.k.a.
p-functions) for the normal and exponential distributions.
a) Use the mosaic function `plotDist()` to produce a graph of the standard
normal distribution (that is a normal distribution with mean $\mu=0$ and
standard deviation $\sigma=1$.
b) For a standard normal, use the `pnorm()` function or its `mosaic` augmented
version `xpnorm()` to calculate
i. $P\left(Z<-1\right)$
ii. $P\left(Z\ge1.5\right)$
c) Use the mosaic function `plotDist()` to produce a graph of an exponential
distribution with rate parameter 2.
d) Suppose that $Y\sim Exp\left(2\right)$, as above, use the `pexp()`
function to calculate $P\left(Y \le 1 \right)$. (Unfortunately there
isn't a mosaic augmented `xpexp()` function.)
3. We next examine how to calculate quantile values for the normal and
exponential distributions using R's q-functions.
a) Find the value of a standard normal distribution ($\mu=0$, $\sigma=1$)
such that 5% of the distribution is to the left of the value using the
`qnorm()` function or the mosaic augmented version `xqnorm()`.
b) Find the value of an exponential distribution with rate 2 such that
60% of the distribution is less than it using the `qexp()` function.
4. Finally we will look at generating random deviates from a distribution.
a) Generate a single value from a uniform distribution with minimum 0,
and maximum 1 using the `runif()` function. Repeat this step several
times and confirm you are getting different values each time.
b) Generate a sample of size 20 from the same uniform distribution and
save it as the vector `x` using the following:
```{r}
x <- runif(20, min=0, max=1)
```
Then produce a histogram of the sample using the function `hist()` or
`geom_histogram`
```{r, eval=FALSE}
data.frame(x=x) %>%
ggplot(aes(x=x)) +
geom_histogram(bins=10)
```
c) Generate a sample of 2000 from a normal distribution with `mean=10` and
standard deviation `sd=2` using the `rnorm()` function. Create a histogram
of the resulting sample.