-
Notifications
You must be signed in to change notification settings - Fork 6
/
README.Rmd
executable file
·396 lines (304 loc) · 12.7 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
---
title: "Bratwurst"
author: "Daniel Huebschmann & Sebastian Steinhauser"
date: "20.07.2016, major update 09.10.2017"
output:
BiocStyle::html_document:
toc: yes
keep_md: yes
---
`Bratwurst` is a software package providing functions for preprocessing,
wrappers for non-negative matrix factorization and postprocessing in `R`. This
repo hosts the code of the Bratwurst software package.
A detailed description
of the software and an application to cells of the human hematopoietic system
are available as a preprint: https://doi.org/10.1101/199547.
Intermediate results for the analysis on hematopoietic cells are available on
zenodo: https://doi.org/10.5281/zenodo.800049.
In this README document, we present major parts of the vignette of the package.
First some packages have to be loaded.
```{r load_style, warning=FALSE, message=FALSE, results="hide"}
library(knitr)
library(ComplexHeatmap)
```
# Introduction
**NMF** (**nonnegative matrix factorization**) is a matrix decomposition
method. It was originally described by Lee & Seung in 1999. In 2003, Brunet et
al. applied NMF to gene expression data. In 2010, `r CRANpkg("NMF")`, an R
package implementing several NMF solvers was published by Gaujoux et al.
NMF basically solves the problem as illustrated in the following figure
(Image taken from
<https://en.wikipedia.org/wiki/Non-negative_matrix_factorization>):
![NMF](vignettes/NMF.png)
Here, `V` is an input matrix with dimensions `n x m`. It is decomposed
into two matrices `W` of dimension `n x l` and `H` of dimension
`l x m`, which when multiplied approximate the original matrix `V`. `l` is
a free parameter in NMF, it is called the factorization rank. If we call the
columns of `W` signatures, then `l` corresponds to the number of
signatures. The decomposition thus leads to a reduction in complexity if
`l < n`, i.e. if the number of signatures is smaller than the number of
features, as indicated in the above figure.
In 2015, Mejia-Roa et al. introduced an implementation of an NMF-solver in
CUDA, which lead to significant reduction of computation times by making use of
massive parallelisation on GPUs. Other implementations of
NMF-solvers on GPUs exist.
It is the pupose of the package `Bratwurst` described here to provide wrapper
functions in R to these NMF-solvers in CUDA. Massive parallelisation not only
leads to faster algorithms, but also makes the benefits of NMF accessible to
much bigger matrices. Furthermore, functions for preprocessing, estimation of
the optimal factorization rank and post-hoc feature selection are provided.
# The Bratwurst package
The main feature of the package `Bratwurst` is an S4 object called `nmf.exp`.
It is derived from `SummarizedExperiment`, has containers for a data matrix,
column annotation data and row annotation data and inherits
`SummarizedExperiment`'s accessor functions `colData` and `rowData`. The matrix
to be stored in this data structure is the matrix V as described above,
corresponding to the input matrix for the NMF-solver. `nmf.exp` furthermore has
containers for the matrices W and H which are results of the decomposition.
As NMF algorithms have to be run iteratively, an instance of the class
`nmf.exp` can store large lists of matrices, corresponding to the results of
different iteration steps. Accessor functions to all different containers are
provided.
A crucial step in data analysis with NMF is the determination of the optimal
factorization rank, i.e. the number of columns of the matrix W or
equivalently the number of rows of the matrix H. No consensus method for an
automatic evaluation of the optimal factorization rank has been found to date.
Instead, the decomposition is usually performed iteratively over a range of
possible factorization ranks and different quality measures are computed for
every tested factorization ranks. Many quality measures have been proposed:
* The `Frobenius reconstruction error`, i.e. the Frobenius norm of the
residuals of the decomposition: ||W x H - V||
* Criteria to assess the stability of the decomposition:
+ The `cophenetic correlation coefficient`
+ An `Amari type distance`
+ `Silhouette values` over clusters of patterns extracted iteratively at the
same factorization rank
The package `Bratwurst` provides functions to compute all
# Example: leukemia data
In this README, as in the vignette for the package, we use a dataset containing
Affymetrix Hgu6800 microarray expression data of B-ALL, T-ALL and AML samples.
This dataset had also been used by Brunet et al. (PNAS, 2004) and Gaujoux et
al. (BMC Bioinformatics, 2010). In Brunet et al., this dataset is called the
__Golub__ dataset.
Preparations: load the necessary software packages:
```{r, warning=FALSE, message=FALSE}
library(Bratwurst)
library(NMF)
```
Load the example data (the __Golub__ dataset is stored in the R package
`Bratwurst`)
```{r, eval = T}
data(leukemia)
samples <- "leukemia"
```
This data was initially generated by the following commands:
```{r, eval = F}
data.path <- file.path(getwd(), "data")
matrix.file <- list.files(data.path, "*data.txt", full.names = T)
rowAnno.file <- list.files(data.path, "micro.*anno.*txt", full.names = T)
rowAnno.bed <- list.files(data.path, ".bed", full.names = T)
colAnno.file <- list.files(data.path, "sample.*anno.*txt", full.names = T)
# Read files to summarizedExperiment
leukemia.nmf.exp <- nmfExperimentFromFile(matrix.file = matrix.file,
rowAnno.file = rowAnno.file,
colData.file = colAnno.file)
save(leukemia.nmf.exp, file = file.path(data.path, "leukemia.rda"))
```
Now we are ready to start an NMF analysis.
## NMF analysis
### Call wrapper function
The wrapper function for the NMF solvers in the Bratwurst package is
`runNmfGpu`. It is called as follows:
```{r, warning=FALSE}
k.max <- 4
outer.iter <- 10
inner.iter <- 10^4
leukemia.nmf.exp<- runNmfGpuPyCuda(nmf.exp = leukemia.nmf.exp,
k.max = k.max,
outer.iter = outer.iter,
inner.iter = inner.iter,
tmp.path = "/tmp/tmp_leukemia",
cpu = TRUE)
```
Depending on the choice of parameters (dimensions of the input matrix, number
of iterations), this step may take some time. Note that the algorithm updates
the user about the progress in the iterations.
Several getter functions are available to access the data in the generated
`nmf.exp` object:
### `HMatrixList`
Returns a list of matrices `H` for a specific factorization
rank `k`. There are as many entries in this list as there were iterations in
the outer iteration. Of course the number of rows of the matrix `H` corresponds
to the chosen factorization rank.
```{r}
tmp.object <- HMatrixList(leukemia.nmf.exp, k = 2)
class(tmp.object)
length(tmp.object)
class(tmp.object[[1]])
dim(tmp.object[[1]])
```
If no value for `k` is supplied, the function returns a list of lists, one for
every iterated factorization rank.
```{r}
tmp.object <- HMatrixList(leukemia.nmf.exp)
class(tmp.object)
length(tmp.object)
class(tmp.object[[1]])
length(tmp.object[[1]])
```
### `HMatrix`
Returns the matrix `H` for the optimal decomposition (i.e. the one
with the minimal residual) for a specific factorization rank `k`. As in the
previous paragraph, the number of rows of the matrix `H` corresponds to the
chosen factorization rank.
```{r}
tmp.object <- HMatrix(leukemia.nmf.exp, k = 2)
class(tmp.object)
dim(tmp.object)
```
If no value for `k` is supplied, the function returns a list of optimal
matrices, one for every iterated factorization rank.
```{r}
H.list <- HMatrix(leukemia.nmf.exp)
class(H.list)
length(H.list)
```
### `WMatrixList`
Returns a list of matrices `W` for a specific factorization
rank `k`. There are as many entries in this list as there were iterations in
the outer iteration. Of course the number of columns of the matrix `W`
corresponds to the chosen factorization rank.
```{r}
tmp.object <- WMatrixList(leukemia.nmf.exp, k = 2)
class(tmp.object)
length(tmp.object)
class(tmp.object[[1]])
dim(tmp.object[[1]])
```
If no value for `k` is supplied, the function returns a list of lists, one for
every iterated factorization rank.
### `WMatrix`
Returns the matrix `W` for the optimal decomposition (i.e. the one
with the minimal residual) for a specific factorization rank `k`. As in the
previous paragraph, the number of columns of the matrix `W` corresponds to the
chosen factorization rank.
```{r}
tmp.object <- WMatrix(leukemia.nmf.exp, k = 2)
class(tmp.object)
dim(tmp.object)
```
If no value for `k` is supplied, the function returns a list of optimal
matrices, one for every iterated factorization rank.
```{r}
W.list <- WMatrix(leukemia.nmf.exp)
class(W.list)
length(W.list)
```
### `FrobError`
Returns a data frame with as many columns as there are iterated factorization
ranks and as many rows as there are iterations per factorization rank.
```{r}
FrobError(leukemia.nmf.exp)
```
## Determine the optimal factorization rank
In NMF, Several methods have been described to assess the optimal factorization
rank. The Bratwurst packages implements some of them. They are computed by
applying custom functions which subsequently update the data structure of type
`nmf.exp`.
### Get Frobenius error.
The most important information about the many iterated decompositions is the
norm of the residual. In NMF this is often called the Frobenius error, as the
Frobenius norm may be used.
```{r}
leukemia.nmf.exp <- computeFrobErrorStats(leukemia.nmf.exp)
```
### Evaluate silhouette values
In 2013, Alexandrov et al. published an NMF analysis on mutational signatures.
They used an approach which a modified silhouette criterion is used to estimate
the stability across iteration steps for one fixed factorization rank `k`.
```{r}
leukemia.nmf.exp <- computeSilhoutteWidth(leukemia.nmf.exp)
```
### Cophenetic correlation coefficient plot
```{r}
leukemia.nmf.exp <- computeCopheneticCoeff(leukemia.nmf.exp)
```
### Compute amari type distance
```{r}
leukemia.nmf.exp <- computeAmariDistances(leukemia.nmf.exp)
```
After having executed all these functions, the values of the computed measures
can be accessed with `OptKStats`:
```{r}
OptKStats(leukemia.nmf.exp)
```
These quality measures can be displayed together:
### Generate plots to estimate optimal k
```{r, warning = FALSE, message = FALSE}
gg.optK <- plotKStats(leukemia.nmf.exp)
gg.optK
```
### Generate ranked error plot.
It may also be useful to inspect the Frobenius error after ranking. This may
give an estimation of the convergence in the parameter space of initial
conditions.
```{r}
gg.rankedFrobError <- plotRankedFrobErrors(leukemia.nmf.exp)
gg.rankedFrobError
```
## Visualize the matrix H (exposures)
The matrices `H` may be visualized as heatmaps. We can define a meta
information object and annotate meta data:
```{r}
entity.colVector <- c("red", "blue")
names(entity.colVector) <- c("ALL", "AML")
subtype.colVector <- c("orange", "darkgreen", "blue")
names(subtype.colVector) <- c("B-cell", "T-cell", "-")
anno_col <- list(V2 = entity.colVector,
V3 = subtype.colVector)
heat.anno <- HeatmapAnnotation(df = colData(leukemia.nmf.exp)[, c(2:3)],
col = anno_col)
```
And now display the matrices `H` with meta data annotation. Bratwurst provides
a plotting function to display the matrices `H` with meta data annotation:
```{r}
# Plot Heatmaps for H over all k
lapply(seq(2, k.max), function(k) {
plotHMatrix(leukemia.nmf.exp, k)
})
```
## Feature selection
### Row K-means to determine signature specific features
```{r}
### Find representative regions.
# Get W for best K
leukemia.nmf.exp <- setOptK(leukemia.nmf.exp, 4)
OptK(leukemia.nmf.exp)
signature.names <- getSignatureNames(leukemia.nmf.exp, OptK(leukemia.nmf.exp))
signature.names
FeatureStats(leukemia.nmf.exp)
leukemia.nmf.exp <- computeFeatureStats(leukemia.nmf.exp)
FeatureStats(leukemia.nmf.exp)
# You might want to add additional selection features
# such as entropy or absolute delta
# Entropy
leukemia.nmf.exp <- computeEntropy4OptK(leukemia.nmf.exp)
FeatureStats(leukemia.nmf.exp)
leukemia.nmf.exp <- computeAbsDelta4OptK(leukemia.nmf.exp)
FeatureStats(leukemia.nmf.exp)
```
### Feature visualization
```{r}
# Plot all possible signature combinations
plotSignatureFeatures(leukemia.nmf.exp)
# Plot only signature combinations
plotSignatureFeatures(leukemia.nmf.exp, sig.combs = F)
# Try to display selected features on W matrix
sig.id <- "1000"
m <- WMatrix(leukemia.nmf.exp, k = OptK(leukemia.nmf.exp))[
FeatureStats(leukemia.nmf.exp)[, 1] == sig.id, ]
m <- m[order(m[, 1]), ]
c <- getColorMap(m)
#m <- t(apply(m, 1, function(r) (r - mean(r))/sd(r)))
Heatmap(m, col = c, cluster_rows = F, cluster_columns = F)
```