[WIP] Add easy model fitting and comparison #77

Open · wants to merge 90 commits into base: main

Commits (90)
64ed36d
adding test scripts
Sep 6, 2024
851bfb0
matching tabrepo and fit df, using zeroshot_context
Sep 6, 2024
b0b2552
plotting functionality
Sep 6, 2024
ef2aa9a
Update
Innixma Sep 7, 2024
275df36
WIP exec.py
Sep 18, 2024
2ec9b6e
Add updates
Innixma Sep 18, 2024
66ea368
Add v2 scripts
Innixma Sep 18, 2024
2c5b3c3
Remove y_uncleaned
Innixma Sep 19, 2024
3b3f791
resolve merge conflicts
Innixma Sep 20, 2024
a1df0a4
resolve merge conflicts
Innixma Sep 20, 2024
022fc3f
resolve merge conflicts
Innixma Sep 20, 2024
6ab5304
adding test scripts
Sep 6, 2024
fd1d0a9
plotting functionality
Sep 6, 2024
a411f5e
Initial Class implementation
Sep 23, 2024
7227ab2
typo
Sep 24, 2024
08b266c
minor updates
Innixma Sep 23, 2024
095ceed
add run_scripts_v4
Innixma Sep 24, 2024
41b098e
making run_experiment a staticmethod
prateekdesai04 Sep 24, 2024
1ef8070
Updated run_experiments
prateekdesai04 Sep 27, 2024
8b25bac
Cleanup, add TabPFNv2 prototype
Innixma Sep 27, 2024
a69596d
Cleanup
Innixma Sep 27, 2024
f5fe3c7
Cleanup
Innixma Sep 27, 2024
b95a76e
Cleanup
Innixma Sep 27, 2024
f8b8da4
Cleanup
Innixma Sep 27, 2024
74df85f
Cleanup
Innixma Sep 27, 2024
8f62e02
bug fix
Innixma Sep 30, 2024
6c3833f
Add run_tabpfn_v2_benchmark.py + additional bugfixes
Innixma Sep 30, 2024
175a38c
Add TabForestPFN_class.py
Innixma Oct 2, 2024
3401d4e
Add TabForestPFN_class.py
Innixma Oct 2, 2024
e75dbd1
Delete old files
Innixma Oct 2, 2024
5e6afab
Update file locations
Innixma Oct 3, 2024
c4eb4e1
Add AutoGluon_class.py, tabforestpfn_model.py
Innixma Oct 3, 2024
4623a37
add hyperparameter/init_args support
prateekdesai04 Oct 3, 2024
90a7bad
Add run_tabforestpfn_benchmark.py
Innixma Oct 3, 2024
9f9a269
removing unused files
prateekdesai04 Oct 3, 2024
5d044f0
Update add simulation_artifacts support
Innixma Oct 4, 2024
3070f31
Add simulation ensemble comparison support via `evaluate_ensemble_wit…
Innixma Oct 5, 2024
e801a17
update
Innixma Oct 8, 2024
d3b4cfe
update
Innixma Oct 8, 2024
7b2195d
minor cleanup
prateekdesai04 Oct 10, 2024
53bf01b
minor cleanup
prateekdesai04 Oct 10, 2024
36e0715
Update evaluate_ensemble_with_time
Innixma Oct 11, 2024
1048d34
Fix bug in zeroshot_configs
Innixma Oct 11, 2024
5e30d89
Refactor baselines.py
Innixma Oct 11, 2024
e405606
Add repo.evaluate_ensemble_with_time_multi
Innixma Oct 11, 2024
0ceb91b
Update repo.evaluate_ensemble to return DataFrame
Innixma Oct 11, 2024
0097ef2
Add logger module, and adding wrapper logs to run scripts, will add d…
prateekdesai04 Oct 11, 2024
17344dd
minor update
Innixma Oct 15, 2024
8570fca
Refactor evaluate_ensemble
Innixma Oct 15, 2024
0339c91
Refactor evaluate_ensemble
Innixma Oct 16, 2024
1d2ec71
Refactor evaluate_ensemble
Innixma Oct 16, 2024
426d041
Cleanup
Innixma Oct 16, 2024
a12dd5c
Cleanup
Innixma Oct 16, 2024
1546337
Cleanup
Innixma Oct 17, 2024
5508542
Add logic to context.py
Innixma Oct 17, 2024
e429dd6
minor update
Innixma Oct 17, 2024
3ed7678
Add save/load logic to ZeroshotSimulatorContext
Innixma Oct 17, 2024
84d4d81
Add save/load logic to EvaluationRepository
Innixma Oct 17, 2024
697b477
Align column names in model fits
Innixma Oct 17, 2024
8b8e06c
Add unit tests for repo save/load
Innixma Oct 17, 2024
92c0cf7
Add extra unit tests for repo save/load
Innixma Oct 17, 2024
9b65c66
Fix Self import
Innixma Oct 17, 2024
bc7e3d7
Fix imports
Innixma Oct 17, 2024
4099d06
fix tests
Innixma Oct 17, 2024
248e9cf
simplify run_quickstart_from_scratch.py
Innixma Oct 18, 2024
d4e8b59
minor update
Innixma Oct 18, 2024
1f827e1
update `repo.from_raw`
Innixma Oct 18, 2024
56016bf
Add root, app and console loggers
prateekdesai04 Oct 18, 2024
9003e1a
addition to logging module
prateekdesai04 Oct 18, 2024
f1abdb1
add context save/load with json + relative path support
Innixma Oct 18, 2024
0102ac4
add ebm and tabpfnv2 models
Innixma Oct 30, 2024
16c0329
add ebm and tabpfnv2 models
Innixma Oct 30, 2024
1474e65
update
Innixma Oct 30, 2024
315aa99
update
Innixma Oct 30, 2024
b5d2838
update
Innixma Oct 30, 2024
2b962fb
update
Innixma Oct 30, 2024
b9380dc
update
Innixma Nov 1, 2024
3b0a932
Support loading repo artifact from cloned directory
Innixma Nov 1, 2024
c5c69a6
minor fix
Innixma Nov 1, 2024
2f8df7e
cleanup
Innixma Nov 1, 2024
afdb8b9
update
Innixma Nov 5, 2024
6d22d4d
Update
Innixma Nov 6, 2024
e7390a9
cleanup
Innixma Nov 6, 2024
d0484d8
Add simple benchmark runner
Innixma Nov 6, 2024
798219a
cleanup
Innixma Nov 6, 2024
7440051
Update for ag12
Innixma Nov 18, 2024
1393238
Update for ag12
Innixma Nov 19, 2024
8c78273
Update for ag12
Innixma Nov 19, 2024
6223f8c
TabPFN support stopped at best epoch
Innixma Nov 19, 2024
f65c4ef
update
Innixma Nov 19, 2024
2 changes: 1 addition & 1 deletion README.md
@@ -92,7 +92,7 @@ To evaluate an ensemble of any list of configurations, you can run the following:
```python
from tabrepo import load_repository
repo = load_repository("D244_F3_C1530_30")
print(repo.evaluate_ensemble(datasets=["Australian"], configs=["CatBoost_r22_BAG_L1", "RandomForest_r12_BAG_L1"]))
print(repo.evaluate_ensemble(dataset="Australian", fold=0, configs=["CatBoost_r22_BAG_L1", "RandomForest_r12_BAG_L1"]))
```

this code will return the error of an ensemble whose weights are computed with the Caruana procedure after loading model
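For reference, a minimal sketch (not part of this diff) of evaluating the same ensemble across several datasets and folds with the new per-(dataset, fold) signature. The two-value unpacking mirrors `examples/run_quickstart.py` in this PR, and `repo.folds` is assumed to list the folds available in the loaded context:

```python
import pandas as pd

from tabrepo import load_repository

repo = load_repository("D244_F3_C1530_30")
configs = ["CatBoost_r22_BAG_L1", "RandomForest_r12_BAG_L1"]

results = []
for dataset in ["Australian", "blood-transfusion-service-center"]:
    for fold in repo.folds:
        # evaluate_ensemble now scores a single (dataset, fold) pair at a time
        df_result, df_ensemble_weights = repo.evaluate_ensemble(dataset=dataset, fold=fold, configs=configs)
        results.append(df_result)

print(pd.concat(results))
```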
24 changes: 24 additions & 0 deletions examples/context_dl.py
@@ -0,0 +1,24 @@
from __future__ import annotations

from typing import Tuple, List, Union
from autogluon_benchmark import OpenMLTaskWrapper
from tabrepo import EvaluationRepository

class ContextDataLoader(OpenMLTaskWrapper):
"""
Class to Fetch Train Test Splits of context dataset
"""
[Review comment from a collaborator on this method: "Nice to get those!"]
def get_context_train_test_split(self, repo: EvaluationRepository, task_id: Union[int, List[int]], repeat: int = 0,
                                 fold: int = 0, sample: int = 0):
if repo.tid_to_dataset(task_id) in repo.datasets():
train_indices, test_indices = self.task.get_train_test_split_indices(repeat=repeat, fold=fold,
sample=sample)
X_train = self.X.loc[train_indices]
y_train = self.y[train_indices]
X_test = self.X.loc[test_indices]
y_test = self.y[test_indices]
return X_train, y_train, X_test, y_test
else:
raise KeyError(f"Dataset for task_id {task_id} not found.")

# Add Another function to just get the X and y for random state
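A hypothetical usage sketch for `ContextDataLoader` (not part of the diff). The `from_task_id` constructor is an assumption about `OpenMLTaskWrapper` in autogluon_benchmark; `load_repository` and `dataset_to_tid` are used the same way in the other scripts in this PR:

```python
from tabrepo import load_repository

repo = load_repository("D244_F3_C1530_30", cache=True)
task_id = repo.dataset_to_tid("Australian")

# Assumption: the wrapper (and therefore this subclass) can be built from an OpenML task id.
loader = ContextDataLoader.from_task_id(task_id)
X_train, y_train, X_test, y_test = loader.get_context_train_test_split(repo=repo, task_id=task_id, fold=0)
print(X_train.shape, X_test.shape)
```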
4 changes: 2 additions & 2 deletions examples/run_quickstart.py
@@ -61,8 +61,8 @@
y_val = repo.labels_val(dataset=dataset, fold=0)
print(f"Ground Truth Val (dataset={dataset}, fold=0):\n{y_val[:10]}")

df_ranks, df_ensemble_weights = repo.evaluate_ensemble(datasets=[dataset], configs=configs, ensemble_size=100)
print(f"Ensemble rank per task:\n{df_ranks}")
df_result, df_ensemble_weights = repo.evaluate_ensemble(dataset=dataset, fold=0, configs=configs, ensemble_size=100)
print(f"Ensemble result:\n{df_result}")

df_ensemble_weights_mean_sorted = df_ensemble_weights.mean(axis=0).sort_values(ascending=False)
print(f"Top 10 highest mean ensemble weight configs:\n{df_ensemble_weights_mean_sorted.head(10)}")
51 changes: 8 additions & 43 deletions examples/run_quickstart_from_scratch.py
@@ -1,16 +1,9 @@
import pandas as pd

from autogluon.common.savers import save_pd
from autogluon.common.utils.simulation_utils import convert_simulation_artifacts_to_tabular_predictions_dict
from autogluon.tabular import TabularPredictor
from autogluon_benchmark import OpenMLTaskWrapper

from tabrepo import EvaluationRepository
from tabrepo.repository import EvaluationRepositoryZeroshot
from tabrepo.predictions import TabularPredictionsInMemory
from tabrepo.contexts.context import BenchmarkContext, construct_context
from tabrepo.contexts.subcontext import BenchmarkSubcontext
from tabrepo.simulation.ground_truth import GroundTruth


def get_artifacts(task: OpenMLTaskWrapper, fold: int, hyperparameters: dict, dataset: str = None, time_limit=60):
@@ -125,51 +118,23 @@ def convert_leaderboard_to_configs(leaderboard: pd.DataFrame, minimal: bool = Tr
)
)

# TODO: Move into AutoGluonTaskWrapper
simulation_artifacts_full = dict()
leaderboards = []
for simulation_artifacts, leaderboard in artifacts:
leaderboards.append(leaderboard)
results_lst_simulation_artifacts = [simulation_artifacts for simulation_artifacts, leaderboard in artifacts]

leaderboards = [leaderboard for simulation_artifacts, leaderboard in artifacts]
leaderboard_full = pd.concat(leaderboards)
print(leaderboard_full)
for simulation_artifacts, leaderboard in artifacts:
for k in simulation_artifacts.keys():
if k not in simulation_artifacts_full:
simulation_artifacts_full[k] = {}
for f in simulation_artifacts[k]:
if f in simulation_artifacts_full:
raise AssertionError(f"Two results exist for tid {k}, fold {f}!")
else:
simulation_artifacts_full[k][f] = simulation_artifacts[k][f]

zeroshot_pp, zeroshot_gt = convert_simulation_artifacts_to_tabular_predictions_dict(simulation_artifacts=simulation_artifacts_full)

save_loc = "./quickstart/"
save_loc_data_dir = save_loc + "model_predictions/"

predictions = TabularPredictionsInMemory.from_dict(zeroshot_pp)
ground_truth = GroundTruth.from_dict(zeroshot_gt)
predictions.to_data_dir(data_dir=save_loc_data_dir)
ground_truth.to_data_dir(data_dir=save_loc_data_dir)

df_configs = convert_leaderboard_to_configs(leaderboard=leaderboard_full)
save_pd.save(path=f"{save_loc}configs.parquet", df=df_configs)
print(df_configs)

context: BenchmarkContext = construct_context(
name="quickstart",
datasets=datasets,
folds=folds,
local_prefix=save_loc,
local_prefix_is_relative=False,
has_baselines=False)
subcontext = BenchmarkSubcontext(parent=context)
repo = EvaluationRepository.from_raw(df_configs=df_configs, results_lst_simulation_artifacts=results_lst_simulation_artifacts)

# Note: Can also skip all the above code if you want to use a readily available context rather than generating from scratch:
# from tabrepo.contexts import get_subcontext
# subcontext = get_subcontext(name="D244_F3_C1530_30")

repo: EvaluationRepository = subcontext.load_from_parent()
repo: EvaluationRepositoryZeroshot = repo.to_zeroshot()
repo.print_info()

repo = repo.to_zeroshot()

results_cv = repo.simulate_zeroshot(num_zeroshot=3, n_splits=2, backend="seq")
df_results = repo.generate_output_from_portfolio_cv(portfolio_cv=results_cv, name="quickstart")
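For orientation, a condensed sketch (not part of the diff) of the simplified flow this file now follows, composed only of calls that appear above. `artifacts` (one `(simulation_artifacts, leaderboard)` pair per task/fold) and `convert_leaderboard_to_configs` come from the script itself and are assumed to be in scope:

```python
import pandas as pd

from tabrepo import EvaluationRepository

results_lst_simulation_artifacts = [sim for sim, _ in artifacts]
leaderboard_full = pd.concat([lb for _, lb in artifacts])
df_configs = convert_leaderboard_to_configs(leaderboard=leaderboard_full)

# Build the repository directly from in-memory artifacts instead of writing a context to disk.
repo = EvaluationRepository.from_raw(
    df_configs=df_configs,
    results_lst_simulation_artifacts=results_lst_simulation_artifacts,
)
repo.print_info()

repo = repo.to_zeroshot()
results_cv = repo.simulate_zeroshot(num_zeroshot=3, n_splits=2, backend="seq")
df_results = repo.generate_output_from_portfolio_cv(portfolio_cv=results_cv, name="quickstart")
```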
159 changes: 159 additions & 0 deletions examples/run_scripts_v2/exec_v2.py
@@ -0,0 +1,159 @@
from tabpfn import TabPFNClassifier
from autogluon.features import AutoMLPipelineFeatureGenerator
from autogluon.core.metrics import get_metric, Scorer
import pandas as pd
from autogluon_benchmark.utils.time_utils import Timer
from autogluon_benchmark.frameworks.autogluon.run import ag_eval_metric_map
from autogluon.core.data import LabelCleaner
from autogluon.core.utils import generate_train_test_split


def fit_outer(task, fold: int, task_name: str, method: str, init_args: dict = None, **kwargs):
if init_args is None:
init_args = {}
if 'eval_metric' not in init_args:
init_args['eval_metric'] = ag_eval_metric_map[task.problem_type]

X_train, y_train, X_test, y_test = task.get_train_test_split(fold=fold)

out = fit_custom_clean(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test,
problem_type=task.problem_type, eval_metric=init_args['eval_metric'], label=task.label)

out["framework"] = method
out["dataset"] = task_name
out["tid"] = task.task_id
out["fold"] = fold
out["problem_type"] = task.problem_type
print(f"Task Name: {out['dataset']}")
print(f"Task ID: {out['tid']}")
print(f"Metric : {out['eval_metric']}")
print(f"Test Error: {out['test_error']:.4f}")
print(f"Fit Time: {out['time_fit']:.3f}s")
print(f"Infer Time: {out['time_predict']:.3f}s")

out.pop("predictions")
out.pop("probabilities")
out.pop("truth")

df_results = pd.DataFrame([out])
ordered_columns = ["dataset", "fold", "framework", "test_error", "eval_metric", "time_fit"]
columns_reorder = ordered_columns + [c for c in df_results.columns if c not in ordered_columns]
df_results = df_results[columns_reorder]
return df_results


# TODO: Nick: This works for 99.99% of cases, but to handle all possible edge-cases,
# we probably want to use Tabular's LabelCleaner during metric calculation to avoid any oddities.
# This can be done as a follow-up
# We also need to track positive_class for binary classification
def calc_error(
y_true: pd.Series,
y_pred: pd.Series,
y_pred_proba: pd.DataFrame,
problem_type: str,
scorer: Scorer,
) -> float:
if scorer.needs_pred: # use y_pred
error = scorer.error(y_true=y_true, y_pred=y_pred)
elif problem_type == "binary": # use y_pred_proba
error = scorer.error(y_true=y_true, y_pred=y_pred_proba.iloc[:, 1])
else:
error = scorer.error(y_true=y_true, y_pred=y_pred_proba)
return error


def fit_custom_clean(X_train, y_train, X_test, y_test, problem_type: str = None, eval_metric: str = None, **kwargs):
label_cleaner = LabelCleaner.construct(problem_type=problem_type, y=y_train)
y_train_clean = label_cleaner.transform(y_train)
y_test_clean = label_cleaner.transform(y_test)

# TODO: Nick: For now, I'm preprocessing via AutoGluon's feature generator because otherwise TabPFN crashes on some datasets.
feature_generator = AutoMLPipelineFeatureGenerator()
X_train_clean = feature_generator.fit_transform(X=X_train, y=y_train)
X_test_clean = feature_generator.transform(X=X_test)

out = fit_custom(
X_train=X_train_clean,
y_train=y_train_clean,
X_test=X_test_clean,
y_test=y_test_clean,
problem_type=problem_type,
eval_metric=eval_metric,
**kwargs,
)

y_pred_test_clean = out["predictions"]
y_pred_proba_test_clean = out["probabilities"]

scorer: Scorer = get_metric(metric=eval_metric, problem_type=problem_type)

test_error = calc_error(
y_true=y_test_clean,
y_pred=y_pred_test_clean,
y_pred_proba=y_pred_proba_test_clean,
problem_type=problem_type,
scorer=scorer,
)

y_pred_test = label_cleaner.inverse_transform(y_pred_test_clean)
out["predictions"] = y_pred_test

if y_pred_proba_test_clean is not None:
y_pred_proba_test = label_cleaner.inverse_transform_proba(y_pred_proba_test_clean, as_pandas=True)
out["probabilities"] = y_pred_proba_test

out["test_error"] = test_error
out["eval_metric"] = eval_metric
out["truth"] = y_test

return out


def fit_custom(
X_train: pd.DataFrame,
y_train: pd.Series,
X_test: pd.DataFrame,
y_test: pd.Series,
eval_metric: str,
problem_type: str = None,
label: str = None,
) -> dict:

# FIXME: Nick: This is a hack specific to TabPFN, since it doesn't handle large data, parameterize later
sample_limit = 4096
if len(X_train) > sample_limit:
X_train, _, y_train, _ = generate_train_test_split(
X=X_train,
y=y_train,
problem_type=problem_type,
train_size=sample_limit,
random_state=0,
min_cls_count_train=1,
)

# with Timer() as timer_fit:
# model = TabPFNClassifier(device='cpu', N_ensemble_configurations=32).fit(X_train, y_train, overwrite_warning=True)

from tabpfn_client.estimator import TabPFNClassifier as TabPFNClassifierV2, TabPFNRegressor
model = TabPFNClassifierV2(model="latest_tabpfn_hosted", n_estimators=32)
with Timer() as timer_fit:
model = model.fit(X_train, y_train)

is_classification = problem_type in ['binary', 'multiclass']
if is_classification:
with Timer() as timer_predict:
y_pred_proba = model.predict_proba(X_test)
y_pred_proba = pd.DataFrame(y_pred_proba, columns=model.classes_, index=X_test.index)
y_pred = y_pred_proba.idxmax(axis=1)
else:
with Timer() as timer_predict:
y_pred = model.predict(X_test)
y_pred = pd.Series(y_pred, name=label, index=X_test.index)
y_pred_proba = None

return {
'predictions': y_pred,
'probabilities': y_pred_proba,
'time_fit': timer_fit.duration,
'time_predict': timer_predict.duration,
}
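A hypothetical driver for `fit_outer` above, for orientation only. Constructing the task via `OpenMLTaskWrapper.from_task_id` is an assumption about autogluon_benchmark's API, and the task id and method name are illustrative; any object exposing `get_train_test_split`, `problem_type`, `task_id`, and `label` would work:

```python
from autogluon_benchmark import OpenMLTaskWrapper

# Assumption: OpenMLTaskWrapper can be constructed from an OpenML task id like this.
task = OpenMLTaskWrapper.from_task_id(146818)  # illustrative task id

# fit_outer is defined above in this file.
df_results = fit_outer(task=task, fold=0, task_name="Australian", method="TabPFNv2_client")
print(df_results)
```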
105 changes: 105 additions & 0 deletions examples/run_scripts_v2/temp_script_ag_models.py
@@ -0,0 +1,105 @@
from __future__ import annotations

import pandas as pd

from autogluon_benchmark.tasks.experiment_utils import run_experiments
from tabrepo import load_repository, EvaluationRepository


def convert_leaderboard_to_configs(leaderboard: pd.DataFrame, minimal: bool = True) -> pd.DataFrame:
df_configs = leaderboard.rename(columns=dict(
time_fit="time_train_s",
time_predict="time_infer_s",
test_error="metric_error",
eval_metric="metric",
val_error="metric_error_val",
))
if minimal:
df_configs = df_configs[[
"dataset",
"fold",
"framework",
"metric_error",
"metric",
"problem_type",
"time_train_s",
"time_infer_s",
"tid",
]]
return df_configs


if __name__ == '__main__':
# Load Context
context_name = "D244_F3_C1530_30"
repo: EvaluationRepository = load_repository(context_name, cache=True)

expname = "./initial_experiment_ag_models" # folder location of all experiment artifacts
ignore_cache = False # set to True to overwrite existing caches and re-run experiments from scratch

datasets = [
"blood-transfusion-service-center",
"Australian",
]
tids = [repo.dataset_to_tid(dataset) for dataset in datasets]
folds = repo.folds

# all_configs = repo.configs()
# import random
# reproduce_configs = random.sample(all_configs, k=10)

reproduce_configs = [
"RandomForest_c1_BAG_L1",
"ExtraTrees_c1_BAG_L1",
"LightGBM_c1_BAG_L1",
"XGBoost_c1_BAG_L1",
"CatBoost_c1_BAG_L1",
"TabPFN_c1_BAG_L1",
"NeuralNetTorch_c1_BAG_L1",
"NeuralNetFastAI_c1_BAG_L1",
]

methods_dict = {}
for c in reproduce_configs:
ag_hyperparameters = repo.autogluon_hyperparameters_dict(configs=[c])
methods_dict[c + "_V2"] = {"hyperparameters": ag_hyperparameters}

extra_kwargs = {
"fit_weighted_ensemble": False,
"num_bag_folds": 8,
"num_bag_sets": 1,
}

for k, v in methods_dict.items():
v.update(extra_kwargs)

methods = list(methods_dict.keys())

results_lst = run_experiments(
expname=expname,
tids=tids,
folds=folds,
methods=methods,
methods_dict=methods_dict,
task_metadata=repo.task_metadata,
ignore_cache=ignore_cache,
)
results_df = pd.concat(results_lst, ignore_index=True)
results_df = convert_leaderboard_to_configs(results_df)

with pd.option_context('display.max_rows', None, 'display.max_columns', None, 'display.width', 1000):
print(results_df)

metrics = repo.compare_metrics(
results_df,
datasets=datasets,
folds=folds,
baselines=["AutoGluon_bq_4h8c_2023_11_14"],
configs=reproduce_configs,
)
with pd.option_context("display.max_rows", None, "display.max_columns", None, "display.width", 1000):
print(f"Config Metrics Example:\n{metrics}")
evaluator_output = repo.plot_overall_rank_comparison(
results_df=metrics,
save_dir=expname,
)