Two new y-transformation approaches #611
base: development
mlindauer commented on Mar 4, 2020
- bilog (log transformations above 0 and below 0)
- Gaussian Copula (ECDF -> quantiles -> Inverse Gaussian CDF)
If everyone is happy with the implementation, I will merge this branch.
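For readers unfamiliar with the first item, here is a minimal sketch of a bilog transform. It assumes the common definition sign(y) * log(1 + |y|); the PR's actual implementation is not shown in this thread and may differ. The names `bilog` and `inverse_bilog` are hypothetical.

```python
import numpy as np


def bilog(y):
    """Log-compress magnitudes on both sides of zero, keeping the sign.

    Assumed definition: sign(y) * log(1 + |y|).
    """
    y = np.asarray(y, dtype=float)
    return np.sign(y) * np.log1p(np.abs(y))


def inverse_bilog(z):
    """Invert bilog: sign(z) * (exp(|z|) - 1)."""
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.expm1(np.abs(z))


y = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
z = bilog(y)
print(z)
print(inverse_bilog(z))
```

Under this definition the transform is odd (bilog(-y) == -bilog(y)), maps 0 to 0, and is invertible, so predictions can be mapped back to the original cost scale.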
I'm not sure if we want to merge this PR at the moment:
- We don't have a method that uses quantile transformations
- I think the quantile transformation should be improved
- We don't have a method that uses bilog transformations at the moment
np.ndarray
"""
# ECDF
quants = [sp.stats.percentileofscore(values, v)/100 - VERY_SMALL_NUMBER for v in values]
I believe this is incorrect. I reimplemented this according to Salinas et al., which appears to give better, and most importantly, symmetric outputs:
import numpy as np
import scipy.stats
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
VERY_SMALL_NUMBER = 1e-10
# This PR
quants = [scipy.stats.percentileofscore(values, v)/100 - VERY_SMALL_NUMBER for v in values]
output = np.array([scipy.stats.norm.ppf(q) for q in quants]).reshape((-1, 1))
print(output)
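# Correct (following Salinas et al.)
# ECDF from ranks, scaled to [0, 1]
quants = (scipy.stats.rankdata(values.flatten()) - 1) / (len(values) - 1)
# Clip the quantiles away from 0 and 1 so the inverse CDF stays finite
cutoff = 1 / (4 * np.power(len(values), 0.25) * np.sqrt(np.pi * np.log(len(values))))
quants = np.clip(quants, a_min=cutoff, a_max=1 - cutoff)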
# Inverse Gaussian CDF
rval = np.array([scipy.stats.norm.ppf(q) for q in quants]).reshape((-1, 1))
print(rval)
Output (this PR first, then the reimplementation):
[-1.28155157e+00 -8.41621234e-01 -5.24400513e-01 -2.53347103e-01
-2.50662848e-10 2.53347103e-01 5.24400512e-01 8.41621233e-01
1.28155156e+00 6.36134089e+00]
[-1.62322583 -1.22064035 -0.76470967 -0.4307273 -0.1397103 0.1397103
0.4307273 0.76470967 1.22064035 1.62322583]
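The reimplementation above can be wrapped in a function and checked for symmetry directly. This is only a sketch: the function name `gaussian_copula_transform` and the guard for fewer than two values are my assumptions, not part of the PR.

```python
import numpy as np
import scipy.stats


def gaussian_copula_transform(values):
    """Rank-based ECDF -> clipped quantiles -> inverse Gaussian CDF."""
    values = np.asarray(values, dtype=float).flatten()
    n = len(values)
    if n < 2:
        # Assumption: with a single value, (n - 1) and log(n) are both 0.
        raise ValueError("need at least two values")
    # ECDF from ranks, scaled to [0, 1]
    quants = (scipy.stats.rankdata(values) - 1) / (n - 1)
    # Same cutoff as in the snippet above; keeps norm.ppf finite
    cutoff = 1 / (4 * np.power(n, 0.25) * np.sqrt(np.pi * np.log(n)))
    quants = np.clip(quants, a_min=cutoff, a_max=1 - cutoff)
    # norm.ppf is vectorized, so no list comprehension is needed
    return scipy.stats.norm.ppf(quants).reshape((-1, 1))


z = gaussian_copula_transform(np.arange(1, 11))
# For evenly spaced inputs the output should mirror around zero
print(np.allclose(z, -z[::-1]))
```

The symmetry follows from the quantiles being symmetric around 0.5 after clipping, which the `percentileofscore`-based version does not guarantee because its largest value maps to a quantile just below 1.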
We will have a look at how these methods perform once we have the new benchmarking fully in place.
The recent HEBO work suggests using the PowerTransformer from scikit-learn. If you plan to benchmark these two, could you also throw this one into the mix?
Thanks for the pointer. Sure!
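For reference when benchmarking, a minimal sketch of applying scikit-learn's `PowerTransformer` to a vector of costs. The log-normal sample data and the choice of `method="yeo-johnson"` are assumptions for illustration; how HEBO configures the transform is not stated in this thread.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Hypothetical skewed cost values; scikit-learn expects a 2-D array
rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=200).reshape(-1, 1)

# Yeo-Johnson also handles zero and negative values; Box-Cox does not
pt = PowerTransformer(method="yeo-johnson", standardize=True)
z = pt.fit_transform(y)

# The fitted transformer can map values back to the original scale
y_back = pt.inverse_transform(z)
print(z.mean(), z.std())
```

Unlike the rank-based copula transform, this one is parametric: it fits a single power parameter per column, so it preserves relative distances between values more closely but is less robust to extreme outliers.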