Statistical reporting

Objectives

:::objectives Questions

How do I consistently report on statistical results?

Objectives

Be able to follow a consistent analysis methodology in R
Report the statistical results in a clear and structured way :::

Purpose and aim

To practise the statistical methods we have covered in the previous sessions and to produce consistent statistical reports.

Data and hypotheses

The first section uses data in the file data/raw/CS6-NPYield.csv. This is a dataset comprising 24 observations of three variables (one dependent and two predictor). This records the yield of peas (in pounds/plot) for different plots. For each plot a record of whether nitrogen and/or phosphate fertiliser has been added.

Analysis methodology

Carry out the analysis suggested in the following steps and confirm that your analysis matches the text in this section.

Step 1 - Identify the variables and the question to be explored

Load in the dataset and look at the raw data
- Confirm that the NPYield dataset contains 24 observations of 3 variables
Check the names and types of the variables
- yield is a continuous variable
- Nit and Pho are categorical (factor) predictor variables
The obvious question to ask is how do the variables Nit and Pho affect the variable yield?

Step 2 – Describe the data

Plot the data
- Plot the effects of the individual predictor variables first. Here box plots of yield against Nit and yield against Pho would be appropriate.
- Plot the effects of the interactions between pairs of predictor variables. Here an interaction plot would be appropriate since the two predictor variables are both categorical
Carry out any descriptive statistics that you may find useful
- Calculate means, variances, ranges etc. Here we’re not going to bother with this given the simplicity of the dataset.
Consider whether there appear to be any significant effects of any of the variables
- There appears to be an effect of nitrogen
- There does not appear to be an effect of phosphate
- There may be an interaction effect but it is not clear

Step 3 - Perform tests and or fit models

Select a test appropriate to the data that you have
- Here, with a single continuous response variable and two categorical variables we have two options: a two-way ANOVA test or a linear model (I’d always go for a linear model framework). We’ll first fit a model with all interactions lm(yield ~ Nit * Pho) and then look at model reduction to see if all of the terms are valid.
Assess results of model fit
- Identify any significant effects. Here the full model is not significant and after performing backwards stepwise elimination we find that the minimal model is yield ~ Nit.
- Determine the coefficients of the best fitting model. Here we find that the formula is given by:
\begin{equation} yield = 52.07 + \binom{0}{5.62} \binom{Nit:N}{Nit:Y} \end{equation}

Step 4 - Check assumptions of tests/models

Plot Diagnostics plots and/or carry out appropriate tests
- Here we will just plot the diagnostic plots. We should look at the diagnostics for both the full model and the minimal model.

Writing up the analysis

There are four broad sections to any statistical report:

Introduction
Methods & Results
Discussion
Appendix

Introduction

Here we aim to:

Describe what the data represent and any information on how it was collected
State the question to be investigated.

:::note Always try to keep the language non-technical :::

For example:

Introduction

The aim of this analysis is to investigate the relationship between the yield of peas and the addition of fertilisers. The yield of peas in pounds/plot is recorded for 24 plots and used as the response variable. Two predictor variables are thought to potentially have an effect: The addition of a fixed amount of a nitrogen based fertiliser and the addition of a fixed amount of a phosphate based fertiliser.

Methods and results

In this section we aim to:

Describe data using descriptive statistics and plots
Describe/state procedures undertaken
- e.g. "Fit linear model with all interactions between variables"
Present figures and state key results
- e.g. Show lines of best fit on a scatter graph or state whether F-statistics are significant alongside the relevant p-values.
Produce diagnostic plots and/or results of assumption tests

This section should contain enough detail to enable someone else to replicate the results. No R output needs to be included here.

In the example below I am using ggplot2 and patchwork, because it makes it a lot easier to produce nice graphs. It is possible to do similar things in using the base R syntax, but you'll probably find that composing your plot into panels is best done with an external programme.

For example:

Methods and results

The response variable yield was plotted against the two categorical variables Pho and Nit independently, and the means of the different categorical combinations were calculated and plotted. These are shown in Figure 1:

Figure 1a appears to suggest that there isn’t an effect of phosphate on yield. Figure 1b indicates that there might be an effect of nitrogen on yield. Figure 1c suggests that there might be an interaction effect of nitrogen and phosphate on yield, although this might be due to the presence of an outlier in the (No Phosphate & Nitrogen) group.

A full linear model containing both variables and the interaction between them was fitted to the data (yield ~ Nit + Pho + Nit:Pho) and the model assumptions were checked using a full residual analysis (see Figure 2). The assumptions of equal variance and normality appear to be met, suggesting that a linear model analysis may be adequate for the data.

The ANOVA analysis for the full model compared with the null model results gives a non-significant result (F_3,20 = 2.21, p = 0.12) suggesting that there is insufficient evidence that yield is affected by all of the variables.

Backwards stepwise elimination was then used to find the minimal model. Both the interaction between Nitrogen and Phosphate and Phosphate terms on yield were not found to be significant and the minimal model was found to be:

\begin{equation} yield ~ Nit \end{equation}

with the variable Nit being a significant predictor of yield (F_1,22 = 6.06, p = 0.02).

\begin{equation} yield = 52.07 + \binom{0}{5.62} \binom{Nit:N}{Nit:Y} \end{equation}

A box plot of the final model result (a) alongside the diagnostic plots for this minimal model (b, c) are shown in Figure 3. The assumptions of equal variance and normality still appear to be met, again suggesting that a linear model analysis is appropriate for the data.

Discussion

In this section we aim to:

Summarise the results in the context of the question, i.e. What did you find?
Discuss results of model assumption tests. Were these met? Is anything slightly dodgy?
Discuss data limitations, e.g. number of data points, presence of outliers etc.

For example:

Discussion

The analysis shows that a linear model may be adequate for the data and that the addition of a Nitrogen based fertiliser is a significant predictor of pea yield, whereas a Phosphate based fertiliser does not have a statistically significant effect. It can be seen that the addition of Nitrogen appears to increase the yield. It is interesting to note that whilst a reduced model has been shown to be statistically significant, the full model was not significant when compared with a null model. The number of data points is relatively small for a full interaction analysis (only 6 points per 2-way classification), and so repeating the analysis with a larger number of observations might be beneficial.

Appendix

This section is not always necessary / included but here we have the option to:

Add all R output here, i.e. Full ANOVA tables and printed output
Could include a copy of any R script here if you’ve done something clever in terms of data manipulation. Again this would be so that you can aid reproducibility of any results you’ve obtained

Exercise: cars 2 variables

:::exercise

Investigate the relationship between the continuous response variable mpg and the categorical predictor variables cyl and gear.

The data can be found in data/raw/CS6-cars2var.csv. In this dataset:

mpg is miles per gallon
cyl in number of cylinders (a categorical variable)
gear is number of gears (a categorical variable)
wt is weight of car (a continuous variable)

Hint

When you load the datasets into R, check that each variable has been interpreted correctly. Specifically check that categorical variables have been loaded as factors and not as numbers. Specifically, you may have to force a variable to be interpreted as a factor using a line of code such as: `cars3$gear <- factor(cars3$gear)`

:::

Exercise: cars 3 variables

:::exercise

This dataset now also includes the continuous predictor variable wt. Investigate the relationships between all four variables.

The data can be found in data/raw/CS6-cars3var.csv. In this dataset:

mpg is miles per gallon
cyl in number of cylinders (a categorical variable)
gear is number of gears (a categorical variable)
wt is weight of car (a continuous variable)

Hint

For a dataset with three predictor variables (`A`, `B`, `C`) and one response variables (`Y`) there will be six initial plots:

3 individual plots: Y vs. A, Y vs. B, Y vs. C
3 two-variable plots: Y vs. A&B, Y vs. A&C, Y vs. B&C

We know how to look at two types of two-variable plots:

categorical x categorical (interaction.plot)
categorical x continuous (see linear model practical)

But we won’t usually plot continuous x continuous two-variable plots (since this would require 3D plotting), but if you’re interested then have a look at the plot3d() command in the rgl library.

:::

Key points

:::keypoints

A useful analysis methodology would be:
1. identify the variables and the question to be explored
2. describe the data (plots, summaries)
3. perform tests or fit models
4. check assumptions of the tests/models
Writing up the statistical analysis often follows this structure:
1. Introduction (description of data in non-technical way; collection methods; question to answer)
2. Methods & Results (description of data; procedures; present figures and key results; results of assumptions checks)
3. Discussion (summarise results; discuss results of assumptions; limitations of data)
4. Appendix (optional, can include R output) :::

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cs6-practical-statistical_reporting 2.md

cs6-practical-statistical_reporting 2.md

Statistical reporting

Objectives

Purpose and aim

Data and hypotheses

Analysis methodology

Step 1 - Identify the variables and the question to be explored

Step 2 – Describe the data

Step 3 - Perform tests and or fit models

Step 4 - Check assumptions of tests/models

Writing up the analysis

Introduction

Methods and results

Discussion

Appendix

Exercise: cars 2 variables

Exercise: cars 3 variables

Key points

Files

cs6-practical-statistical_reporting 2.md

Latest commit

History

cs6-practical-statistical_reporting 2.md

File metadata and controls

Statistical reporting

Objectives

Purpose and aim

Data and hypotheses

Analysis methodology

Step 1 - Identify the variables and the question to be explored

Step 2 – Describe the data

Step 3 - Perform tests and or fit models

Step 4 - Check assumptions of tests/models

Writing up the analysis

Introduction

Methods and results

Discussion

Appendix

Exercise: cars 2 variables

Exercise: cars 3 variables

Key points