| layout | root | title | minutes | keep_md |
|---|---|---|---|---|
| lesson | . | Getting Data into R | 25 | true |
- Introducing the data pipeline
- Doing a quick nose around dataframes in R
- Getting data out of your spreadsheet
- Importing: Point and click in R studio
- Importing: Scripting your import (and file paths)
- Importing: Using Google Sheets
This lesson will teach you to move your data from an Excel sheet or similar into R.
We strongly believe that the greatest strength you will gain from the work today is reproducibility. Learning 'coding' is a pain, and for a 'one off' process there will nearly always be some tempting 'point-and-click' alternative. However, learning to code means being forced to write down what you did to get from your data to your result. You'll only have to reproduce something once to appreciate the value of having everything documented. Whether it is an error in your data, a new observation, or your supervisor or collaborator asking you to repeat an analysis with a twist, you will always be revisiting and revising your work. And the initial investment in writing (i.e. coding) your analysis means that these adjustments become simple.
- Collect data on paper
- Transcribe paper forms into Excel or similar
- Import Excel into SPSS or another stats package
- Perform analyses and make graphs
- Copy/paste these into your final document (Word, Powerpoint) etc
- One early mistake means repeating all the steps by hand
- You can't remember what you did because you haven't looked at this for 5 days, 5 months or 5 years
- Other people can't collaborate or help because all the files are local, and even if they have them you weren't obsessive enough to write notes about each step
Learn just enough code to write down your 'workflow'. Ensure the workflow starts with the raw data, and then any changes in the workflow or the data will automatically propagate.
- Collect data electronically
- Write a script that converts your data into figures and tables
- Knit everything into a single report?
OK. Step 3 is showing off, and probably too much for today but as an FYI everything you are reading now was written in R including the text, the code, the tables and the figures. When we want to update something, we just edit and 'knit'. Look at the course by Roger Peng at Johns Hopkins on Coursera if you want to learn more.
We need a 'common data language' that all spreadsheet languages can write to, and all statistical and programming languages can read. Arguably one of the oldest, and most ubiquitous is the 'comma separated values' (CSV) file. This is easily exportable from Microsoft Excel, Apple Numbers, Open Office, Google Sheets...etc.
It's a simple format. The top line contains the column names, each separated by a comma. The following lines are the observations in those columns, again separated by commas.
Its strength is its simplicity. It only has data: no formulas, no tricks. It is widely recognised by software packages because it is so easy to support, and R has excellent support for CSV.
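For example, the first few rows of a patient table might look like this as a CSV (made-up values):

```
Pt_No,Age,Gender
1,80,F
2,72,F
3,55,F
```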
Exercise: Export an .xlsx file to .csv
- Find the cleaned .xlsx file that was produced in the Excel Hell lecture and export it as a .csv file. If all else fails then we have a copy here.
Once a sheet has been exported, it can be imported into R:
There are 3 main ways to do this:

- Point-and-click within RStudio, which we covered in Lesson 1 (R for Newbies).
- Use the `read.csv` function.
- Make a connection to a database or an online Google Sheets spreadsheet.
As a quick reminder from Lesson 1. Click the Import Dataset button, and select the From local file ... option.
Then accept the defaults in the dialog box that appears.
Point and click is lovely and easy but sadly not reproducible. Much better is to write down where your data comes from. We are going to do this in 3 steps here.
- Find the path to your file. We covered file paths in Lesson 1; this is just the formal address of your file on your computer. A quick way to do this in R is to use the `file.choose` function.
- We will take the path you have generated and name it `myfile`.
- We will use a function called `read_csv` from the `readr` library to import the data into a data frame (which we in turn name `df` for convenience).
Try right clicking a file in Finder (on a Mac) or Windows Explorer (on a PC). You'll normally see an option for 'info' or 'properties' that will show you the path to your file.
install.packages("readr") # install only needed the first time
library(readr) # load the readr family functions including read_csv
myfile <- file.choose()
df <- read_csv(myfile)
You could have done this in one step, but it would have made things harder to read. Hard to read, means difficult to remember, and we are doing our best to avoid that!
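For the record, the one-step version (a sketch, exactly equivalent to the two lines above, assuming `readr` is loaded) would be:

```r
# Choose the file and import it in a single step
df <- read_csv(file.choose())
```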
There is a function `read.csv` provided by base R, but `read_csv` is better. The built-in function has a couple of annoying 'habits'. If you wish to use it then don't forget to specify: (1) `header=TRUE`, which tells the function to expect column names in row 1 instead of data, and (2) `stringsAsFactors=FALSE`, which is a necessary but annoying reminder to R that you want it to leave 'strings' alone and not convert them to 'labelled' factors.
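If you do use the built-in function, a minimal sketch (assuming the same `myfile` path as above) looks like this:

```r
# Base-R import: both arguments below are easy to forget
df <- read.csv(myfile,
               header = TRUE,             # row 1 holds column names, not data
               stringsAsFactors = FALSE)  # keep strings as strings, not factors
```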
- Import the .csv file from the Excel Hell lesson as a dataframe and call it `RCT`.
- Name all the columns in the dataframe.
Step 1 in your pipeline has to be getting your (now beautifully tidy) spreadsheet into R.
Spreadsheets in R are called 'dataframes'.
A dataframe has columns, each identified by a name, and rows for observations. This means that however you import your data into R within your pipeline, those data will end up as a dataframe. Let's do a quick tour of dataframes in R.
# Have a look at the first few rows of data you've imported
head(RCT)
# A tibble: 6 x 21
Pt_No Age Gender Radnomisation PS_Prior_op PS_3hrs PS_12_16hrs
<int> <int> <int> <int> <int> <int> <int>
1 1 80 1 1 0 0 0
2 2 72 1 1 1 2 2
3 3 72 1 1 0 1 0
4 4 55 1 1 0 0 0
5 5 84 1 2 0 0 2
6 6 72 1 2 1 2 2
# ... with 14 more variables: PS_24_28hrs <int>, PS_2wks <int>,
# Mov_12_16hrs <int>, Mov_24_28hrs <int>, Mov_2wks <int>,
# Paracetamol_g <int>, Ibuprofen_mg <int>, Diclofenac_mg <int>,
# Oromorph_mg <dbl>, Amitryptaline_mg <dbl>, Pregabalin_mg <int>,
# Days_in_hosp <int>, Reason_hosp_stay <int>, Overall_satisfaction <int>
# Have a look at the last few rows
tail(RCT)
# A tibble: 6 x 21
Pt_No Age Gender Radnomisation PS_Prior_op PS_3hrs PS_12_16hrs
<int> <int> <int> <int> <int> <int> <int>
1 59 77 1 2 0 0 1
2 60 85 1 1 0 3 3
3 61 55 1 1 0 0 3
4 62 79 1 1 0 1 1
5 63 NA 1 2 0 5 0
6 64 80 1 1 0 0 0
# ... with 14 more variables: PS_24_28hrs <int>, PS_2wks <int>,
# Mov_12_16hrs <int>, Mov_24_28hrs <int>, Mov_2wks <int>,
# Paracetamol_g <int>, Ibuprofen_mg <int>, Diclofenac_mg <int>,
# Oromorph_mg <dbl>, Amitryptaline_mg <dbl>, Pregabalin_mg <int>,
# Days_in_hosp <int>, Reason_hosp_stay <int>, Overall_satisfaction <int>
# Have a look at the entire dataframe
View(RCT)
The rows contain observations from different individuals, and the columns contain the measurements. Just like in Excel, any item (cell) can be accessed by its coordinates in the table, which are always written as `[row, column]` (i.e. which row, then which column).
RCT[1,1] # top left cell
# A tibble: 1 x 1
Pt_No
<int>
1 1
RCT[64,21] # bottom right cell
# A tibble: 1 x 1
Overall_satisfaction
<int>
1 3
RCT[1, ] # first row (all columns)
# A tibble: 1 x 21
Pt_No Age Gender Radnomisation PS_Prior_op PS_3hrs PS_12_16hrs
<int> <int> <int> <int> <int> <int> <int>
1 1 80 1 1 0 0 0
# ... with 14 more variables: PS_24_28hrs <int>, PS_2wks <int>,
# Mov_12_16hrs <int>, Mov_24_28hrs <int>, Mov_2wks <int>,
# Paracetamol_g <int>, Ibuprofen_mg <int>, Diclofenac_mg <int>,
# Oromorph_mg <dbl>, Amitryptaline_mg <dbl>, Pregabalin_mg <int>,
# Days_in_hosp <int>, Reason_hosp_stay <int>, Overall_satisfaction <int>
RCT[ ,1] # first column (all rows)
# A tibble: 64 x 1
Pt_No
<int>
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
# ... with 54 more rows
Most of the time the rows are 'observations' and we want to pick out 'characteristics' of those observations (i.e. the columns). Rather than having to remember the column number, we can just ask for a column by name using the `$` operator: e.g. `dataframename$some_column`
# First we need to know what column names exist
names(RCT)
RCT$Age
RCT$Gender
- What are the names of the columns? Hint: try `names()`
- Display in the console the items in the column `PS_Prior_op`
- Can you do the same for all the other columns in `RCT`?
- Can you find out how many patients were randomised to each arm? Hint: you can use the function `table()` or `xtabs()`
- Can you find out how many subjects there are?
Can you find out how many subjects there are?
-
What are the names of the columns?
names(RCT)
-
Display in the console the items in the column
RCT$PS_Prior_op
-
Can you do the same for all the other columns in
RCT
?RCT$Gender
,RCT$Age
, etc. -
Can you find out how many patients were randomised to each arm?
table(RCT$Randomisation)
-
Can you find out how many subjects there are?
nrow(RCT)
We can also load data through a connection with Google Sheets. What's good about this? Collaboration and live-updating. Let's use this system to import a cleaned version of the breast RCT data we have been working on.
We're going to use data in a shared sheet we've called `rct-clean`.
The sheet is available here.
First we need the functions someone else has kindly written that allow R to talk to Google Sheets. You will need to install the library first.
install.packages("googlesheets")
Now let's have a look at the sheets in your account. We'll use the `gs_ls()` function.
The first time you run this you will be asked to authenticate (a browser window will open and ask you to sign in to your Google account). Behind the scenes R saves a hidden file into your working directory. The next time you run your script, as long as you haven't moved your code to a new directory, it won't need to ask.
library(googlesheets)
gs_ls()
# A tibble: 3 × 10
sheet_title author perm version updated
<chr> <chr> <chr> <chr> <dttm>
1 rct-clean datascibc rw new 2016-11-15 15:16:00
2 rct-dirty datascibc rw new 2016-11-15 13:54:25
3 rct-dirty-20161203 datascibc rw new 2016-11-15 13:00:33
# ... with 5 more variables: sheet_key <chr>, ws_feed <chr>,
# alternate <chr>, self <chr>, alt_key <chr>
You can see a list of Google Sheets. It is now straightforward to read these into R. Let's import the shared sheet `rct-clean`, and name it `gsheet`.
gsheet <- gs_title("rct-clean")
Sheet successfully identified: "rct-clean"
So far so good, but `gsheet` isn't a data frame, it's just a connection to that sheet. There's one more step.
df <- gs_read(gsheet)
Accessing worksheet titled 'breast-clean.csv'.
No encoding supplied: defaulting to UTF-8.
df
# A tibble: 64 × 20
id recruit age gender random ps0h ps3h ps12 ps24h ps168h
<int> <chr> <int> <chr> <chr> <int> <int> <int> <int> <int>
1 1 1/1/2014 80 F drain 0 0 0 2 2
2 2 1/2/2014 72 f drain 1 2 2 1 2
3 3 1/3/2014 72 F drain 0 1 0 0 0
4 4 1/4/2014 55 F drain 0 0 0 0 1
5 5 1/5/2014 84 f skin 0 0 2 3 NA
6 6 1/6/2014 72 F skin 1 2 2 2 NA
7 7 1/7/2014 65 F skin 0 1 0 0 0
8 8 1/8/2014 75 F drain 0 0 0 0 0
9 9 1/9/2014 64 F skin 2 2 1 1 2
10 10 1/10/2014 53 F drain 0 1 0 0 0
# ... with 54 more rows, and 10 more variables: move12h <int>,
# move24h <int>, move168h <int>, tylenol <chr>, codeine <chr>,
# oramorph <chr>, other <chr>, los <int>, los_reason <chr>,
# satisfaction <chr>
Bonus: The `googlesheets` library actually does one better than making a dataframe; instead it makes a tibble. Say it quickly and it sounds like 'table'. It's just a fancy, modern version of R's basic dataframe. You don't need to worry about the difference.
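You can check this for yourself (assuming `df` holds the sheet imported above): a tibble still reports itself as a data frame.

```r
class(df)
# [1] "tbl_df"     "tbl"        "data.frame"
```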
a. What are the names of the columns?
b. Can you have a look at the column containing the age of the patients?
c. Can you find out how many patients were randomised to each arm? Hint: you can use the function `table()` or `xtabs()`
d. Can you find out how many subjects there are?
a. What are the names of the columns?
names(df)
b. Can you have a look at the column containing the age of the patients?
df$age
c. Can you find out how many patients were randomised to each arm? Hint: you can use the function `table()` or `xtabs()`
table(df$random)
d. Can you find out how many subjects there are?
nrow(df)
- Can you import and look at the `outreach.csv` dataset? What is the average heart rate? How many patients were accepted to ICU (the column is called `icu_accept`), and what was the mortality?
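One possible sketch of a solution (the heart-rate and mortality column names below are guesses; run `names()` on the imported dataframe to find the real ones):

```r
library(readr)
outreach <- read_csv("outreach.csv")  # assumes the file sits in your working directory
names(outreach)                       # check the real column names first
mean(outreach$hrate, na.rm = TRUE)    # hypothetical heart-rate column name
table(outreach$icu_accept)            # how many were accepted to ICU
table(outreach$mortality)             # hypothetical mortality column name
```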