finish case study rough draft

KaiAragaki · Jul 16, 2024 · fc9c509 · fc9c509
1 parent d296904
commit fc9c509
Show file tree

Hide file tree

Showing 5 changed files with 92 additions and 39 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -30,19 +30,24 @@ Language: en
 Roxygen: list(markdown = TRUE)
 RoxygenNote: 7.3.1
 Suggests:
+    aws.s3,
     Biobase,
+    cellebrate,
     knitr,
-    recipes,
     rmarkdown,
     SummarizedExperiment,
-    testthat (>= 3.0.0)
+    testthat (>= 3.0.0),
+    tidymodels
+Remotes:
+  McConkeyLab/cellebrate
 Config/testthat/edition: 3
 Imports:
     cli,
     collapse,
     dplyr,
     hardhat,
     parsnip,
+    recipes,
     rlang,
     stats,
     tibble

diff --git a/R/clanc-predict.R b/R/clanc-predict.R
@@ -11,6 +11,9 @@
 #'
 #' @param method If `type` is `numeric`, the method of correlation
 #'
+#' @param assay If `object` inherits `SummarizedExperiment`, the index of the
+#'   assay.
+#'
 #' @param ... Not used, but required for extensibility.
 #'
 #' @return
@@ -19,8 +22,8 @@
 #' to be the same as the number of rows in `new_data`.
 #'
 #' @export
-predict.clanc <- function(object, new_data, type, ...) {
-  new_data <- wrangle_data(new_data)
+predict.clanc <- function(object, new_data, type, assay = NULL, ...) {
+  new_data <- wrangle_data(new_data, assay)
   forged <- custom_forge(new_data, object$blueprint)
   rlang::arg_match(type, valid_clanc_predict_types())
   predict_clanc_bridge(type, object, forged$predictors, ...)

diff --git a/R/predict-functions.R b/R/predict-functions.R
@@ -52,17 +52,17 @@ predict_dist_multi <- function(dists) {
 
 # dists: col = sample, row = class, value = dist
 
-wrangle_data <- function(data) {
-  if (inherits(data, "SummarizedExperiment")) return(wrangle_data_se(data))
+wrangle_data <- function(data, assay) {
+  if (inherits(data, "SummarizedExperiment")) return(wrangle_data_se(data, assay))
   if (inherits(data, "data.frame")) return(wrangle_data_df(data))
   if (inherits(data, "matrix")) return(wrangle_data_matrix(data))
   if (inherits(data, "ExpressionSet")) return(wrangle_data_es(data))
   cli::cli_abort("Unable to predict object of type {class(data)}")
 }
 
-wrangle_data_se <- function(data) {
-  # Should assay number be an optional arg?
-  data <- SummarizedExperiment::assay(data)
+wrangle_data_se <- function(data, assay) {
+  assay <- ifelse(is.null(assay), 1, assay)
+  data <- SummarizedExperiment::assay(data, assay)
   t(data)
 }
 

diff --git a/man/predict.clanc.Rd b/man/predict.clanc.Rd
diff --git a/vignettes/case-study.Rmd b/vignettes/case-study.Rmd
@@ -14,7 +14,7 @@ knitr::opts_chunk$set(
 )
 ```
 
-```{r setup}
+```{r setup, message = FALSE}
 library(reclanc)
 library(aws.s3)
 library(Biobase)
@@ -30,14 +30,14 @@ Let's start with the fitting procedure. We first need gene expression data.
 
 The data I'm using is from [Sjödahl et al. (2012)](https://pubmed.ncbi.nlm.nih.gov/22553347/). It contains RNA expression from 308 bladder cancer tumors. 
 
-```r
+```{r}
 lund <- s3readRDS("lund.rds", "reclanc-lund", region = "us-east-2")
 lund
 ```
 
 In their paper, they used the transcriptional data to classify the tumors into several molecular subtypes (MS). 
 
-```r
+```{r}
 table(lund$molecular_subtype)
 ```
 
@@ -49,13 +49,13 @@ We'd like to apply this subtype framework to other datasets. Our first goal is t
 
 Before we can begin, we need to convert our outcomes to factors. In this case, our outcomes are the molecular subtypes called by Sjödahl et al.
 
-```r
+```{r}
 lund$molecular_subtype <- factor(lund$molecular_subtype)
 ```
 
 In its simplest form, since `clanc` accepts `ExpressionSet` objects, we could do the following and be done with it:
 
-```r
+```{r}
 simple_centroids <- clanc(lund, classes = "molecular_subtype", active = 5)
 head(simple_centroids$centroids)
 ```
@@ -64,15 +64,15 @@ head(simple_centroids$centroids)
 
 For the next few sections, we're going to leverage the `tidymodels` collection of packages. In order to ease into that, I'll show you how to do the same analysis, but with packages from `tidymodels`.
 
-```r
+```{r, message = FALSE}
 library(tidymodels)
 ```
 
 Many `tidymodels` workflows begin with a *model specification*. The rationale behind this is to separate the *model specification* step from the *model fitting* step (whereas in base R, they generally all happen at once).
 
 We specify our model like this:
 
-```r
+```{r}
 mod <- discrim_linear() |>
   set_engine(
     engine = "clanc",
@@ -84,20 +84,20 @@ This `mod` doesn't do anything - and that's kind of the point: it only specifies
 
 We also need to wrangle our data a bit to be in a 'wide' format - fortunately that's quite simple:
 
-```r
+```{r}
 wrangled <- data.frame(class = lund$molecular_subtype, t(exprs(lund)))
 head(wrangled[1:5])
 ```
 
 Finally, we specify a formula for fitting the model. This uses the `recipes` package from tidymodels. While this is a delightful package that can help you preprocess your data, it's out of the scope of this vignette. Instead, just think of it as a way to specify a formula.
 
-```r
+```{r}
 recipe <- recipe(class ~ ., wrangled) # Note that the recipe requires 'template data'
 ```
 
 We can bundle our model specification (`mod`) and our preprocessing steps (`recipe`, which is just a formula) into a `workflow`:
 
-```r
+```{r}
 wf <- workflow() |>
   add_recipe(recipe) |>
   add_model(mod)
@@ -106,7 +106,7 @@ wf
 
 Now we can fit our model:
 
-```r
+```{r}
 tidymodels_fit <- fit(wf, data = wrangled)
 head(tidymodels_fit$fit$centroids)
 ```
@@ -115,7 +115,7 @@ head(tidymodels_fit$fit$centroids)
 
 Now that we've dialed in to the `tidymodels` framework, we can do a lot of elaborate things with ease. One of our concerns is whether 5 active genes was a good choice (`active = 5`). A somewhat simple way to determine how good our choice of 5 genes is to use [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)). Cross-validation allows us to test how good our fit, without having to spend our test data. Speaking of which, let's split our data into training and testing data right now. To do this, we're going to use the lovely `rsample` package.
 
-```r
+```{r}
 set.seed(123)
 splits <- initial_split(wrangled, prop = 0.8, strata = class)
 train <- training(splits)
@@ -124,23 +124,23 @@ test <- testing(splits)
 
 `train` and `test` are just subsets of the original data, containing 80% and 20% of the original data (respectively). It also tries to maintain the relative proportions of each of the classes within each of the datasets (because we set `strata = class`):
 
-```r
+```{r}
 round(prop.table(table(train$class)), 2)
 ```
 
-```r
+```{r}
 round(prop.table(table(test$class)), 2)
 ```
 
 Creating folds for crossfold validation is nearly the same:
 
-```r
+```{r}
 folds <- vfold_cv(train, v = 5, strata = class)
 ```
 
 We can reuse our workflow `wf`:
 
-```r
+```{r}
 fits <- fit_resamples(
   wf,
   folds,
@@ -150,14 +150,14 @@ fits <- fit_resamples(
 
 We can then extract our accuracy metrics by using `collect_metrics`, which roots around in each of our fits and helpfully extracts the metrics, aggregates them, and calculated the standard error:
 
-```r
+```{r}
 metrics <- collect_metrics(fits)
 metrics
 ```
 
 Our model has an accuracy of about `r round(metrics$mean*100)`%. Applying this model to our testing data:
 
-```r
+```{r}
 final_fit <- clanc(class ~ ., train, active = 5) # Fit a model using *all* of our training data
 preds <- predict(final_fit, new_data = test, type = "class") # Use it to predict the (known) classes of our test data
 w_preds <- cbind(preds, test)
@@ -173,7 +173,7 @@ So now we at least have *some* measure of how good our model fits, but could it
 
 To use `tune`, we need to respecify our model to let `tune` know what parameters we want to tune:
 
-```r
+```{r}
 tune_mod <- discrim_linear() |>
   set_engine(
     engine = "clanc",
@@ -183,21 +183,21 @@ tune_mod <- discrim_linear() |>
 
 We could update our previous workflow using `update_model`, but let's just declare a new one:
 
-```r
+```{r}
 tune_wf <- workflow() |>
   add_recipe(recipe) |>
   add_model(tune_mod)
 ```
 
 We then have to specify a range of values of `active` to try:
 
-```r
+```{r}
 values <- data.frame(active = seq(from = 1, to = 50, by = 4))
 ```
 
 We can then fit our `folds` using the spread of values we chose:
 
-```r
+```{r, message = FALSE}
 # This is going to take some time, since we're fitting 5 folds 13 times each.
 tuned <- tune_grid(
   tune_wf,
@@ -209,14 +209,14 @@ tuned <- tune_grid(
 
 As before, we can collect our metrics - this time, however, we have a summary of metrics for each of values for `active`:
 
-```r
+```{r}
 tuned_metrics <- collect_metrics(tuned)
 tuned_metrics
 ```
 
 Or graphically:
 
-```r
+```{r}
 ggplot(tuned_metrics, aes(active, mean)) +
   geom_line() +
   coord_cartesian(ylim = c(0, 1)) +
@@ -225,7 +225,7 @@ ggplot(tuned_metrics, aes(active, mean)) +
 
 It looks like we read maximal accuracy at around 21 genes - let's choose 20 genes for a nice round number:
 
-```r
+```{r}
 final_fit_tuned <- clanc(class ~ ., data = train, active = 20)
 preds <- predict(final_fit_tuned, new_data = test, type = "class") # Use it to predict the (known) classes of our test data
 w_preds <- cbind(preds, test)
@@ -238,13 +238,55 @@ It looks like our accuracy is a little better now that we've chosen an optimal n
 
 # Predicting
 
-Now we want to apply our classifier to new data. Our second dataset is RNAseq data from 30 bladder cancer cell lines:
+Now we want to apply our classifier to new data. Our second dataset is [RNAseq data from 30 bladder cancer cell lines](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE97768):
 
-```r
+```{r, message = FALSE}
 library(cellebrate)
 cell_rna
 ```
 
-```r
-predict(final_fit_tuned, t(assay(cell_rna, 2)), type = "numeric", method = "spearman")
+Predicting is incredibly simple. Since we're using a different sequencing method (RNAseq vs array-based sequencing), it probably makes sense to use a correlation based classification rather than the original distance-based metric used in the original ClaNC package. We can do that by specifying `type = "numeric"` and then whatever correlation method we prefer. 
+
+```{r}
+cell_preds <- predict(
+  final_fit_tuned,
+  cell_rna,
+  assay = 2,
+  type = "numeric",
+  method = "spearman"
+)
+
+out <- cbind(colData(cell_rna), cell_preds) |>
+  as_tibble()
+
+out
+```
+
+```{r, fig.width = 10, fig.height = 7}
+plotting_data <- out |>
+  pivot_longer(cols = starts_with(".pred"))
+
+plotting_data |>
+  ggplot(aes(cell, value, color = name)) +
+  geom_point() +
+  facet_grid(~clade, scales = "free_x", space = "free_x")
+```
+
+In the Sjödahl paper, the seven subtypes were simplified into five subtypes by merging some of the two that had similar biological pathways activated. To ease interpretation, we can do that too:
+
+```{r}
+table <- plotting_data |>
+  summarize(winner = name[which.max(value)], .by = c(cell, clade)) |>
+  mutate(
+    five = case_when(
+      winner %in% c(".pred_MS1a", ".pred_MS1b") ~ "Urobasal A",
+      winner %in% c(".pred_MS2a.1", ".pred_MS2a.2") ~ "Genomically unstable",
+      winner == ".pred_MS2b.1" ~ "Infiltrated",
+      winner == ".pred_MS2b2.1" ~ "Uro-B",
+      winner == ".pred_MS2b2.2" ~ "SCC-like"
+    )
+  ) |>
+  relocate(cell, five, clade)
+
+print(table, n = 30)
 ```