-
Notifications
You must be signed in to change notification settings - Fork 0
/
lme4_iDCT.Rmd
177 lines (102 loc) · 4.9 KB
/
lme4_iDCT.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
---
title: "lme4_iDCT"
author: "Luiz"
date: "2024-06-08"
output: html_document
---
# Loading libraries
```{r}
library(lme4)
library(dplyr)
library(ggplot2)
library(tidyverse)
```
# Data preprocessing
```{r}
#
data <- read.csv('results/summary_DMS_results_regression_1e-3_v02.csv')
data <- select(data, c('Dataset', 'Compression_methd', 'Fold', 'R2_score_test', 'rho_score_test'))
data <- filter(data, Dataset != "HIS7_YEAST_Kondrashov2017")
data
# Convert categorical variables to factors and releveling the mean to be the reference group
data$Compression_methd <- factor(data$Compression_methd)
data$Compression_methd <- relevel(data$Compression_methd, ref = "mean")
data$Dataset <- factor(data$Dataset)
data$Fold <- factor(data$Fold)
data
```
# Fiting the lme4 model
Dependent Variable: The dependent variable (response). Here is R2_score_test.
Fixed Effect: The predictor variables (fixed effects) can be numeric or categorical. Categorical variables should be factors.
'Compression_methd' is included as a fixed effect to see how different compression methods influence the R2_score_test.
Random Effects: The grouping variables (random effects) should be factors. In here are Dataset and Fold.
Both Dataset and Fold are included as random effects to account for variability across datasets and cross-validation folds.
```{r}
#model <- lmer(R2_score_test ~ Compression_methd + (1 | Dataset) + (1|Fold), data=data)
#model <- lmer(R2_score_test ~ Compression_methd + (1 | Dataset), data=data)
model <- lmer(rho_score_test ~ Compression_methd + (1 | Dataset) + (1|Fold), data=data)
#model <- lmer(rho_score_test ~ Compression_methd + (1 | Dataset), data=data)
summary(model)
```
```{r}
library(multcomp)
glht_results <- summary(glht(model, linfct=c('Compression_methdbos=0',
'Compression_methdiDCT=0',
'Compression_methdiDCT2=0',
'Compression_methdmaxPool=0',
'Compression_methdpca1=0',
'`Compression_methdpca1-2`=0',
'Compression_methdpca2=0',
'Compression_methdrbf1=0',
'Compression_methdrbf2=0',
'Compression_methdsigmoid1=0',
'Compression_methdsigmoid2=0')))
glht_results
```
```{r}
#summary(glht(model, linfct=mcp(Compression_methd="Tukey")))
```
## Interpretation
The intercept in a regression model represents the expected value of the dependent variable (in this case, R2_score_test)
when all predictor variables are set to zero. It's the baseline value from which other effects are measured.
Fixed effects are the estimated coefficients for the predictor variables in the model. These effects are assumed to be the
same across all groups in the dataset. In your model, Compression_methd is treated as a fixed effect, meaning its impact on
R2_score_test is consistent across all datasets.
Random effects account for variability within the data that is not explained by the fixed effects. They allow the model to
consider differences across groups (e.g., different datasets or cross-validation folds). For example, if some datasets inherently
have higher or lower R2 scores, this variability is captured by including Dataset as a random effect.
Residuals are the differences between the observed values and the values predicted by the model. They represent the "error"
or "noise" in the model. Residual analysis helps to check the assumptions of the model (e.g., normality and homoscedasticity
of residuals).
# Evaluating individual datasets
```{r}
library(broom)
# Tidy the results
tidy_glht <- tidy(glht_results)
tidy_glht<- tidy_glht %>% dplyr::filter(contrast != 'Compression_methdiDCT')
ci <- tidy(confint(glht_results)) %>% dplyr::select("contrast", "conf.high", "conf.low")
res_glht = merge(tidy_glht, ci, by = 'contrast')
res_glht$contrast <- gsub("Compression_methd", "", res_glht$contrast)
res_glht$contrast <- gsub("iDCT2", "iDCT", res_glht$contrast)
# Determine significance levels
res_glht$signif <- ifelse(res_glht$adj.p.value < 0.001, "***",
ifelse(res_glht$adj.p.value < 0.01, "**",
ifelse(res_glht$adj.p.value < 0.05, "*", "")))
res_glht
```
```{r}
# Adding Confidence Intervals
#tidy_glht$Lower_CI <- tidy_glht$estimate - 1.96 * tidy_glht$std.error
#tidy_glht$Upper_CI <- tidy_glht$estimate + 1.96 * tidy_glht$std.error
# Plotting
res_glht %>%
ggplot(aes(x = fct_reorder(contrast, -estimate), y = estimate)) +
geom_point() +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2) +
theme_minimal() +
labs(title = "95% family-wise confidence level",
x = "Compression Method",
y = "Estimate") +
geom_hline(yintercept = 0, linetype = "dotted", color = "black", size = 1) +
geom_text(aes(label = signif), vjust = -2.5)
```