Two new y-transformation approaches #611

Open
wants to merge 2 commits into base: development

mlindauer
Contributor

  • bilog (log transformation applied separately to values above 0 and below 0)
  • Gaussian Copula (ECDF -> quantiles -> inverse of the Gaussian CDF)
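
For readers unfamiliar with bilog, here is a minimal sketch, assuming the common definition sign(y) * log(1 + |y|). The function names `bilog` and `inverse_bilog` are illustrative and are not taken from this PR's code, which may differ in detail.

```python
import numpy as np


def bilog(y):
    """Sign-preserving log transform (assumed form): sign(y) * log(1 + |y|)."""
    y = np.asarray(y, dtype=float)
    # log1p keeps precision near 0; the sign factor mirrors the curve for y < 0.
    return np.sign(y) * np.log1p(np.abs(y))


def inverse_bilog(z):
    """Inverse of the assumed bilog form: sign(z) * (exp(|z|) - 1)."""
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.expm1(np.abs(z))


y = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
z = bilog(y)
print(z)
print(inverse_bilog(z))
```

Under this assumption the transform is odd (symmetric around 0), maps 0 to 0, and compresses large magnitudes on both sides.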

@dengdifan
Contributor

If everyone is happy with the implementation, I will merge this branch

Contributor

@mfeurer mfeurer left a comment

I'm not sure if we want to merge this PR at the moment:

  1. We don't have a method that uses quantile transformations
  2. I think the quantile transformation should be improved
  3. We don't have a method that uses bilog transformations at the moment

np.ndarray
"""
# ECDF
quants = [sp.stats.percentileofscore(values, v)/100 - VERY_SMALL_NUMBER for v in values]
Contributor

I believe this is incorrect. I reimplemented this according to Salinas et al., which appears to give better, and most importantly, symmetric outputs:

import numpy as np
import scipy.stats

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
VERY_SMALL_NUMBER = 1e-10

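# This PR: ECDF via percentileofscore, shifted down by a tiny epsilon.
# The largest value gets a quantile of almost 1, so its z-score becomes an outlier.
quants = [scipy.stats.percentileofscore(values, v)/100 - VERY_SMALL_NUMBER for v in values]
output = np.array([scipy.stats.norm.ppf(q) for q in quants]).reshape((-1, 1))
print(output)

# Correct (following Salinas et al.): rank-based quantiles in [0, 1],
# clipped symmetrically away from 0 and 1 so that norm.ppf stays finite.
quants = (scipy.stats.rankdata(values.flatten()) - 1) / (len(values) - 1)
cutoff = 1 / (4 * np.power(len(values), 0.25) * np.sqrt(np.pi * np.log(len(values))))
quants = np.clip(quants, a_min=cutoff, a_max=1 - cutoff)
# Inverse of the Gaussian CDF (norm.ppf)
rval = np.array([scipy.stats.norm.ppf(q) for q in quants]).reshape((-1, 1))
print(rval)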

output:

[-1.28155157e+00 -8.41621234e-01 -5.24400513e-01 -2.53347103e-01
 -2.50662848e-10  2.53347103e-01  5.24400512e-01  8.41621233e-01
  1.28155156e+00  6.36134089e+00]
[-1.62322583 -1.22064035 -0.76470967 -0.4307273  -0.1397103   0.1397103
  0.4307273   0.76470967  1.22064035  1.62322583]
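
The rank-based variant from this comment could be wrapped in a reusable function roughly as follows. This is only a sketch: the function name `gaussian_copula_transform` and the `ValueError` for fewer than two values are assumptions for illustration, not part of the PR.

```python
import numpy as np
import scipy.stats


def gaussian_copula_transform(values):
    """Map values to Gaussian quantiles via clipped, rank-based empirical quantiles."""
    values = np.asarray(values, dtype=float).flatten()
    n = len(values)
    if n < 2:
        # Assumption: with fewer than two values the rank-based quantiles are undefined.
        raise ValueError("need at least two values")
    # Ranks 1..n are rescaled to quantiles in [0, 1].
    quants = (scipy.stats.rankdata(values) - 1) / (n - 1)
    # Same cutoff as in the comment above; it keeps norm.ppf finite at both ends.
    cutoff = 1 / (4 * np.power(n, 0.25) * np.sqrt(np.pi * np.log(n)))
    quants = np.clip(quants, a_min=cutoff, a_max=1 - cutoff)
    # Inverse of the Gaussian CDF, applied elementwise.
    return scipy.stats.norm.ppf(quants).reshape((-1, 1))


z = gaussian_copula_transform([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(z.ravel())
```

Because both the quantiles and the clipping are symmetric around 0.5, the transformed values are symmetric around 0, which is the property the first variant lacks.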

@stale stale bot added the stale label Jun 17, 2022
@renesass renesass added feature and removed stale labels Jun 23, 2022
@automl automl deleted a comment from stale bot Jun 23, 2022
@alexandertornede
Contributor

We will have a look at how these methods perform once we have the new benchmarking fully in place.

@alexandertornede alexandertornede self-assigned this Jan 26, 2023
@mfeurer
Contributor

mfeurer commented Jan 27, 2023

The recent HEBO work suggests using a power transform from scikit-learn (PowerTransformer). If you plan to benchmark these two, could you also throw this one into the mix?
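
For the benchmark, such a y-transformation might look like the sketch below. It uses scikit-learn's `PowerTransformer` with its default `yeo-johnson` method. Which method and settings HEBO actually uses is not stated in this thread, so treat those choices, and the synthetic log-normal costs, as assumptions.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Synthetic, right-skewed "cost" values, for illustration only.
rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=200)

# yeo-johnson also accepts non-positive values; box-cox would require y > 0.
pt = PowerTransformer(method="yeo-johnson", standardize=True)
y_t = pt.fit_transform(y.reshape(-1, 1))  # expects a 2D array (n_samples, n_features)

# The fitted transformer can map values back to the original scale.
y_back = pt.inverse_transform(y_t)

print(pt.lambdas_)
print(y_t.mean(), y_t.std())
```

With `standardize=True` the output has zero mean and unit variance, so it can be compared directly with the other two transformations.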

@alexandertornede
Contributor

Thanks for the pointer. Sure!

Projects
Status: In Progress
7 participants