-
Notifications
You must be signed in to change notification settings - Fork 0
/
98-RNAseq_DE_analysis_with_R.Solutions.html
333 lines (321 loc) · 20.9 KB
/
98-RNAseq_DE_analysis_with_R.Solutions.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="pandoc">
<title>RNAseq Using R </title>
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap.css" />
<link rel="stylesheet" type="text/css" href="css/bootstrap/bootstrap-theme.css" />
<link rel="stylesheet" type="text/css" href="css/swc.css" />
<meta charset="UTF-8" />
<!-- HTML5 shim, for IE6-8 support of HTML5 elements -->
<!--[if lt IE 9]>
<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
</head>
<body class="lesson">
<div class="container card">
<div class="banner" align="right" style="margin-top: 5px">
<a href="http://monash.edu/bioinformatics" title="Monash Bioinformatics Platform">
<div style="margin-right: 5px;" class="label swc-blue-bg">Monash Bioinformatics Platform</div>
</a>
</div>
<article>
<div class="row">
<div class="col-md-10 col-md-offset-1">
<h1 class="title"></h1>
<h1 id="rnaseq-de-analysis-with-r---solutions">RNAseq DE analysis with R - Solutions</h1>
<p>This document provides solutions to the exercises given in the ‘RNAseq DE analysis with R’ course.</p>
<h2 id="clustering">Clustering</h2>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pencil"></span>Exercise: Heatmap</h2>
</div>
<div class="panel-body">
<p>Produce a heatmap for the 50 most highly expressed genes and annotate the samples with with their age.</p>
<ul>
<li>Subset the read counts object for the 50 most highly expressed genes</li>
<li>Annotate the samples in the subset with their age (check order with design!)</li>
<li>Plot a heatmap with this subset of data, scaling genes and ordering both genes and samples</li>
</ul>
</div>
</section>
<p>Solution:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Subset the read counts object for the 30 most highly expressed genes</span>
select =<span class="st"> </span><span class="kw">order</span>(<span class="kw">rowMeans</span>(counts$counts), <span class="dt">decreasing=</span><span class="ot">TRUE</span>)[<span class="dv">1</span>:<span class="dv">50</span>]
highexprgenes_counts <-<span class="st"> </span>counts$counts[select,]
<span class="co"># Annotate the samples in the subset with their age (check order with design!)</span>
<span class="kw">colnames</span>(highexprgenes_counts)<-<span class="st"> </span>experiment_design.ord$age
<span class="kw">head</span>(highexprgenes_counts)</code></pre>
<pre><code>## 26 26 30 30 30 26 26 30
## 4550 12559197 7221252 3612091 3696466 5332585 7045283 12201302 5190887
## 4549 2635341 1542503 880852 900136 1256473 1506197 2564452 1226882
## 213 711 633 1554756 1591787 1992909 541 560 1939249
## 378706 87658 2085194 1200456 1212964 62432 2063356 86078 61130
## 6029 44288 1461296 723247 726382 18142 1449540 43866 17747
## 125050 80094 740715 588378 595398 54657 726225 78199 53896</code></pre>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Plot a heatmap with this subset of data, scaling genes and ordering both genes and samples</span>
<span class="kw">heatmap</span>(highexprgenes_counts, <span class="dt">col=</span><span class="kw">topo.colors</span>(<span class="dv">50</span>), <span class="dt">margin=</span><span class="kw">c</span>(<span class="dv">10</span>,<span class="dv">6</span>))</code></pre>
<div class="figure">
<img src="figure/Heatmap-1.png" alt="plot of chunk Heatmap" /><p class="caption">plot of chunk Heatmap</p>
</div>
<h2 id="principal-component-analysis">Principal Component Analysis</h2>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pencil"></span>Exercise: PCA</h2>
</div>
<div class="panel-body">
<p>Produce a PCA plot from the read counts of the 500 most highly expressed genes and change the labels until you can identify the reason for the split between samples from the same tissue.</p>
<ul>
<li>Get the read counts for the 500 most highly expressed genes</li>
<li>Transpose this matrix of read counts</li>
<li>Check the number of dimensions explaining the variability in the dataset</li>
<li>Run the PCA with an appropriate number of components</li>
<li>Annotate the samples with their age & re-run the PCA & plot the main components</li>
<li>Annotate the samples with other clinical data & re-run the PCA & plot the main components until you can separate the samples within each tissue group</li>
</ul>
</div>
</section>
<p>Solution:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># select data for the 1000 most highly expressed genes</span>
select =<span class="st"> </span><span class="kw">order</span>(<span class="kw">rowMeans</span>(counts$counts), <span class="dt">decreasing=</span><span class="ot">TRUE</span>)[<span class="dv">1</span>:<span class="dv">500</span>]
highexprgenes_counts <-<span class="st"> </span>counts$counts[select,]
<span class="co"># transpose the data to have variables (genes) as columns</span>
data_for_PCA <-<span class="st"> </span><span class="kw">t</span>(highexprgenes_counts)
<span class="co"># Run the PCA with an appropriate number of components</span>
mds <-<span class="st"> </span><span class="kw">cmdscale</span>(<span class="kw">dist</span>(data_for_PCA))
<span class="co"># Plot the PCA</span>
<span class="kw">plot</span>(mds[,<span class="dv">1</span>], -mds[,<span class="dv">2</span>], <span class="dt">type=</span><span class="st">"n"</span>, <span class="dt">xlab=</span><span class="st">"Dimension 1"</span>, <span class="dt">ylab=</span><span class="st">"Dimension 2"</span>, <span class="dt">main=</span><span class="st">""</span>)
<span class="kw">text</span>(mds[,<span class="dv">1</span>], -mds[,<span class="dv">2</span>], <span class="kw">rownames</span>(mds), <span class="dt">cex=</span><span class="fl">0.8</span>) </code></pre>
<div class="figure">
<img src="figure/PCA-1.png" alt="plot of chunk PCA" /><p class="caption">plot of chunk PCA</p>
</div>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Annotate the samples with their age & re-run the PCA & plot the main components</span>
<span class="kw">rownames</span>(mds) <-<span class="st"> </span>experiment_design.ord$age
<span class="kw">plot</span>(mds[,<span class="dv">1</span>], -mds[,<span class="dv">2</span>], <span class="dt">type=</span><span class="st">"n"</span>, <span class="dt">xlab=</span><span class="st">"Dimension 1"</span>, <span class="dt">ylab=</span><span class="st">"Dimension 2"</span>, <span class="dt">main=</span><span class="st">""</span>)
<span class="kw">text</span>(mds[,<span class="dv">1</span>], -mds[,<span class="dv">2</span>], <span class="kw">rownames</span>(mds), <span class="dt">cex=</span><span class="fl">0.8</span>) </code></pre>
<div class="figure">
<img src="figure/PCA-2.png" alt="plot of chunk PCA" /><p class="caption">plot of chunk PCA</p>
</div>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Annotate the samples with other clinical data & re-run the PCA & plot the main components until you can separate the samples within each tissue group</span>
<span class="kw">rownames</span>(mds)<-<span class="st"> </span>experiment_design.ord$technical_replicate_group
<span class="kw">plot</span>(mds[,<span class="dv">1</span>], -mds[,<span class="dv">2</span>], <span class="dt">type=</span><span class="st">"n"</span>, <span class="dt">xlab=</span><span class="st">"Dimension 1"</span>, <span class="dt">ylab=</span><span class="st">"Dimension 2"</span>, <span class="dt">main=</span><span class="st">""</span>)
<span class="kw">text</span>(mds[,<span class="dv">1</span>], -mds[,<span class="dv">2</span>], <span class="kw">rownames</span>(mds), <span class="dt">cex=</span><span class="fl">0.8</span>) </code></pre>
<div class="figure">
<img src="figure/PCA-3.png" alt="plot of chunk PCA" /><p class="caption">plot of chunk PCA</p>
</div>
<h2 id="differential-expression">Differential Expression</h2>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pencil"></span>Exercise: Limma</h2>
</div>
<div class="panel-body">
<p>Get the number of DE genes between technical group 1 and technical group 2 (all Brain samples) with adj pvalue<0.01.</p>
<ul>
<li>Create a new design matrix for limma with the technical replicate groups</li>
<li>Re-normalise the read counts with ‘voom’ function with new design matrix</li>
<li>Fit a linear model on these normalised data</li>
<li>Make the contrast matrix corresponding to the new set of parameters</li>
<li>Fit the contrast matrix to the linear model</li>
<li>Compute moderated t-statistics of differential expression</li>
<li>Get the output table for the 10 most significant DE genes for this comparison</li>
</ul>
</div>
</section>
<p>Solution:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(limma)
<span class="co"># Create a new design matrix for limma with the technical replicate groups</span>
techgroup<-<span class="kw">factor</span>(experiment_design.ord$technical_replicate_group)
design <-<span class="st"> </span><span class="kw">model.matrix</span>(~<span class="dv">0</span>+techgroup)
<span class="kw">colnames</span>(design)<-<span class="st"> </span><span class="kw">gsub</span>(<span class="st">"techgroup"</span>,<span class="st">""</span>,<span class="kw">colnames</span>(design))
design</code></pre>
<pre><code>## group_1 group_2 group_3 group_4
## 1 0 1 0 0
## 2 1 0 0 0
## 3 0 0 0 1
## 4 0 0 0 1
## 5 0 0 1 0
## 6 1 0 0 0
## 7 0 1 0 0
## 8 0 0 1 0
## attr(,"assign")
## [1] 1 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$techgroup
## [1] "contr.treatment"</code></pre>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Re-normalise the read counts with 'voom' function with new design matrix</span>
y <-<span class="st"> </span><span class="kw">voom</span>(mycounts,design,<span class="dt">lib.size=</span><span class="kw">colSums</span>(mycounts)*nf)
counts.voom <-<span class="st"> </span>y$E
<span class="co"># Fit a linear model on these normalised data</span>
fit <-<span class="st"> </span><span class="kw">lmFit</span>(y,design)
<span class="co"># Make the contrast matrix corresponding to the new set of parameters</span>
cont.matrix <-<span class="st"> </span><span class="kw">makeContrasts</span>(group_2-group_1,<span class="dt">levels=</span>design)
cont.matrix </code></pre>
<pre><code>## Contrasts
## Levels group_2 - group_1
## group_1 -1
## group_2 1
## group_3 0
## group_4 0</code></pre>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Fit the contrast matrix to the linear model</span>
fit <-<span class="st"> </span><span class="kw">contrasts.fit</span>(fit, cont.matrix)
<span class="co"># Compute moderated t-statistics of differential expression </span>
fit <-<span class="st"> </span><span class="kw">eBayes</span>(fit)
<span class="kw">options</span>(<span class="dt">digits=</span><span class="dv">3</span>)
<span class="co"># Get the output table for the 10 most significant DE genes for this comparison</span>
<span class="kw">dim</span>(<span class="kw">topTable</span>(fit,<span class="dt">coef=</span><span class="st">"group_2 - group_1"</span>,<span class="dt">p.val=</span><span class="fl">0.01</span>,<span class="dt">n=</span><span class="ot">Inf</span>))</code></pre>
<pre><code>## [1] 9652 6</code></pre>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">topTable</span>(fit,<span class="dt">coef=</span><span class="st">"group_2 - group_1"</span>,<span class="dt">p.val=</span><span class="fl">0.01</span>)</code></pre>
<pre><code>## logFC AveExpr t P.Value adj.P.Val B
## 125050 -3.598 13.36 -1219 6.95e-58 1.04e-53 123.0
## 378706 -4.964 14.07 -842 4.15e-54 3.12e-50 114.4
## 6029 -5.432 13.07 -500 8.61e-49 4.31e-45 102.0
## 4514 0.761 12.20 271 1.44e-42 5.42e-39 86.5
## 26871 -2.795 10.87 -187 8.96e-39 2.69e-35 78.7
## 6023 -4.752 10.00 -168 1.16e-37 2.90e-34 76.5
## 6043 -2.923 8.73 -164 2.02e-37 4.34e-34 75.9
## 5354 0.576 5.73 158 4.82e-37 9.05e-34 74.3
## 692148 0.733 10.39 146 3.15e-36 5.26e-33 72.6
## 652965 -2.864 7.83 -137 1.36e-35 2.04e-32 71.6</code></pre>
<h2 id="gene-ontology">Gene Ontology</h2>
<section class="challenge panel panel-success">
<div class="panel-heading">
<h2><span class="glyphicon glyphicon-pencil"></span>Exercise: GOstats</h2>
</div>
<div class="panel-body">
<p>Identify the GO terms in the Molecular Function domain that are over-represented (pvalue<0.01) in your list of DE genes.</p>
<ul>
<li>Get your list of DE genes (Entrez Gene IDs)</li>
<li>Set the new parameters for the hypergeometric test</li>
<li>Run the test and adjust the pvalues in the output object</li>
<li>Identify the significant GO terms at pvalue 0.01</li>
</ul>
</div>
</section>
<p>Solution:</p>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(GOstats)</code></pre>
<pre><code>## Loading required package: Biobase</code></pre>
<pre><code>## Loading required package: BiocGenerics</code></pre>
<pre><code>## Loading required package: methods</code></pre>
<pre><code>## Loading required package: parallel</code></pre>
<pre><code>##
## Attaching package: 'BiocGenerics'</code></pre>
<pre><code>## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB</code></pre>
<pre><code>## The following object is masked from 'package:limma':
##
## plotMA</code></pre>
<pre><code>## The following objects are masked from 'package:stats':
##
## IQR, mad, xtabs</code></pre>
<pre><code>## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, cbind, colnames,
## do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, lengths, Map, mapply,
## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## Position, rank, rbind, Reduce, rownames, sapply, setdiff,
## sort, table, tapply, union, unique, unsplit</code></pre>
<pre><code>## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.</code></pre>
<pre><code>## Loading required package: Category</code></pre>
<pre><code>## Loading required package: stats4</code></pre>
<pre><code>## Loading required package: AnnotationDbi</code></pre>
<pre><code>## Loading required package: IRanges</code></pre>
<pre><code>## Loading required package: S4Vectors</code></pre>
<pre><code>##
## Attaching package: 'S4Vectors'</code></pre>
<pre><code>## The following objects are masked from 'package:base':
##
## colMeans, colSums, expand.grid, rowMeans, rowSums</code></pre>
<pre><code>## Loading required package: Matrix</code></pre>
<pre><code>##
## Attaching package: 'Matrix'</code></pre>
<pre><code>## The following object is masked from 'package:S4Vectors':
##
## expand</code></pre>
<pre><code>## Loading required package: graph</code></pre>
<pre><code>## </code></pre>
<pre><code>##
## Attaching package: 'GOstats'</code></pre>
<pre><code>## The following object is masked from 'package:AnnotationDbi':
##
## makeGOGraph</code></pre>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Get your list of DE genes (Entrez Gene IDs)</span>
entrezgeneids <-<span class="st"> </span><span class="kw">as.character</span>(<span class="kw">rownames</span>(limma.res.pval.FC))
universeids <-<span class="st"> </span><span class="kw">rownames</span>(mycounts)
<span class="co"># Set the new parameters for the hypergeometric test</span>
params <-<span class="st"> </span><span class="kw">new</span>(<span class="st">"GOHyperGParams"</span>,<span class="dt">annotation=</span><span class="st">"org.Hs.eg"</span>,<span class="dt">geneIds=</span>entrezgeneids,<span class="dt">universeGeneIds=</span>universeids,<span class="dt">ontology=</span><span class="st">"MF"</span>,<span class="dt">pvalueCutoff=</span><span class="fl">0.01</span>,<span class="dt">testDirection=</span><span class="st">"over"</span>)</code></pre>
<pre><code>## Loading required package: org.Hs.eg.db</code></pre>
<pre><code>## </code></pre>
<pre><code>## Warning in makeValidParams(.Object): removing geneIds not in
## universeGeneIds</code></pre>
<pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Run the test and adjust the pvalues in the output object</span>
hg <-<span class="st"> </span><span class="kw">hyperGTest</span>(params)
hg.pv <-<span class="st"> </span><span class="kw">pvalues</span>(hg)
hg.pv.fdr <-<span class="st"> </span><span class="kw">p.adjust</span>(hg.pv,<span class="st">'fdr'</span>)
<span class="co"># Identify the significant GO terms at pvalue 0.01</span>
sigGO.ID <-<span class="st"> </span><span class="kw">names</span>(hg.pv.fdr[hg.pv.fdr <<span class="st"> </span>hgCutoff])
df <-<span class="st"> </span><span class="kw">summary</span>(hg)
GOterms.sig <-<span class="st"> </span>df[df[,<span class="dv">1</span>] %in%<span class="st"> </span>sigGO.ID,<span class="st">"Term"</span>]
<span class="kw">length</span>(GOterms.sig )</code></pre>
<pre><code>## [1] 433</code></pre>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">head</span>(GOterms.sig)</code></pre>
<pre><code>## [1] "transmembrane receptor activity"
## [2] "signal transducer activity"
## [3] "signaling receptor activity"
## [4] "transmembrane signaling receptor activity"
## [5] "receptor activity"
## [6] "molecular transducer activity"</code></pre>
<h2 id="r-environment">R environment</h2>
<pre class="sourceCode r"><code class="sourceCode r"><span class="kw">sessionInfo</span>()</code></pre>
<pre><code>## R version 3.3.1 (2016-06-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
##
## locale:
## [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
## [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
## [7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel methods stats graphics grDevices utils
## [8] datasets base
##
## other attached packages:
## [1] GO.db_3.3.0 org.Hs.eg.db_3.3.0 GOstats_2.38.1
## [4] graph_1.50.0 Category_2.38.0 Matrix_1.2-6
## [7] AnnotationDbi_1.34.4 IRanges_2.6.1 S4Vectors_0.10.2
## [10] Biobase_2.32.0 BiocGenerics_0.18.0 limma_3.28.17
##
## loaded via a namespace (and not attached):
## [1] knitr_1.13 magrittr_1.5 splines_3.3.1
## [4] xtable_1.8-2 lattice_0.20-33 stringr_1.0.0
## [7] tools_3.3.1 grid_3.3.1 AnnotationForge_1.14.2
## [10] DBI_0.4-1 genefilter_1.54.2 survival_2.39-5
## [13] RBGL_1.48.1 GSEABase_1.34.0 formatR_1.4
## [16] RSQLite_1.0.0 evaluate_0.9 stringi_1.1.1
## [19] XML_3.98-1.4 annotate_1.50.0</code></pre>
</div>
</div>
</article>
<div class="footer">
<a class="label swc-blue-bg" href="https://github.com/MonashBioinformaticsPlatform/RNAseq-DE-analysis-with-R">Source</a>
</div>
</div>
<!-- Javascript placed at the end of the document so the pages load faster -->
<script src="http://code.jquery.com/jquery-1.9.1.js"></script>
<script src="css/bootstrap/bootstrap-js/bootstrap.js"></script>
</body>
</html>