Support for one-hot encoded features in minimization #87
Signed-off-by: abigailt <[email protected]>
…ons for 1-hot encoded features are consistent. Signed-off-by: abigailt <[email protected]>
apt/minimization/minimizer.py
self.categorical_features = []
if categorical_features:
    self.categorical_features = categorical_features
self.features_to_minimize = features_to_minimize
self.feature_slices = feature_slices
if self.feature_slices:
    self.all_one_hot_features = set([str(feature) for encoded in self.feature_slices for feature in encoded])
You can use a set comprehension directly instead of building a list comprehension and converting it to a set.
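A minimal sketch of the suggestion, not the code that was merged. The sample `feature_slices` value is hypothetical; the two expressions mirror the line under review and its set-comprehension equivalent.

```python
# Hypothetical slices: each inner list holds the columns of one one-hot encoded feature.
feature_slices = [[0, 1, 2], [5, 6]]

# Form used in the diff: a list comprehension wrapped in set().
via_list = set([str(feature) for encoded in feature_slices for feature in encoded])

# Suggested form: a set comprehension, which skips the intermediate list.
via_set = {str(feature) for encoded in feature_slices for feature in encoded}

# Both forms produce the same set of feature names.
print(via_list == via_set)
```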
@@ -375,6 +396,8 @@ def fit(self, X: Optional[DATA_PANDAS_NUMPY_TYPE] = None, y: Optional[DATA_PANDA
        x_test_dataset = ArrayDataset(x_test, features_names=self._features)
        self._ncp_scores.fit_score = self.calculate_ncp(x_test_dataset)
        self._ncp_scores.generalizations_score = self.calculate_ncp(x_test_dataset)
    else:
        print('No fitting was performed as some information was missing')
Can the message be a bit more helpful?
apt/minimization/minimizer.py
elif range['end'] is None and range['start'] > 0:
    feature_value = 1
elif range['start'] is not None and range['end'] is not None:
    print(range)
What's the feature_value in this case? Seems like an unassigned feature_value will be appended in the next line.
Is the print the only thing that happens here? And even if so, shouldn't some text explain the meaning of this print?
apt/minimization/minimizer.py
    feature_value = 1
elif range['start'] is not None and range['end'] is not None:
    print(range)
new_cell['categories'][feature].append(feature_value)
And what's the feature_value if none of the ifs match?
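One way to address the unassigned `feature_value` concern is to make every branch either return a value or fail loudly. The sketch below is only an illustration: `one_hot_value_from_range` is a hypothetical helper, and the branch that yields 0 is an assumed counterpart of the branch shown in the diff, since the full original condition chain is not visible here.

```python
def one_hot_value_from_range(value_range):
    """Map a split range on a 0/1 column to the single value it allows.

    Hypothetical helper; the 'start'/'end' keys follow the diff under review.
    """
    start = value_range.get('start')
    end = value_range.get('end')
    if end is None and start is not None and start > 0:
        # Only a lower bound above 0: the column must be 1 (branch from the diff).
        return 1
    if start is None and end is not None and end < 1:
        # Only an upper bound below 1: the column must be 0 (assumed counterpart).
        return 0
    # Any other combination is reported instead of silently reusing a stale value.
    raise ValueError(f'Cannot derive a one-hot value from range {value_range!r}')
```

With this shape, a range that has both bounds set raises an error with the offending range in the message, rather than printing it and appending whatever `feature_value` held before.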
apt/minimization/minimizer.py
def _get_other_features_in_encoding(feature, feature_slices):
    for encoded in feature_slices:
        if feature in encoded:
            return (list(set(encoded) - set([feature]))), encoded
nit: set([feature]) can be replaced with {feature}
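For illustration, the helper with the set literal applied might look like the sketch below. The public name `get_other_features_in_encoding` and the `([], [])` fallback for a feature that appears in no slice are assumptions made here; the function in the diff has no explicit fallback.

```python
def get_other_features_in_encoding(feature, feature_slices):
    """Return the sibling columns of `feature` and the full slice it belongs to."""
    for encoded in feature_slices:
        if feature in encoded:
            # {feature} is a set literal, equivalent to set([feature]).
            return list(set(encoded) - {feature}), encoded
    # Assumed fallback: the feature is not part of any one-hot encoding.
    return [], []


# Hypothetical slices; the order of `others` is not guaranteed because it comes from a set.
others, encoded = get_other_features_in_encoding(1, [[0, 1, 2], [5, 6]])
print(sorted(others), encoded)
```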
apt/minimization/minimizer.py
    new_cell['categories'][other_feature].append(1)
else:
    new_cell['categories'][other_feature].append(0)
    new_cell['categories'][other_feature].append(1)
I'm not sure I understand where this is narrowed down to the single correct value.
…nding so that options are narrowed down Signed-off-by: abigailt <[email protected]>
Any columns in the input data that represent one-hot encoded features can be minimized together so that the encoding stays correct. This holds both for the output of the transform method and for the representative values within the generalizations structure.
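To make "correctness of the encoding" concrete, here is a small stand-alone check, written as a sketch rather than as part of the library. `rows_respect_one_hot` is a hypothetical helper, and treating `feature_slices` as lists of column indices is an assumption for this example.

```python
def rows_respect_one_hot(rows, feature_slices):
    """Return True if, in every row, each slice has exactly one 1 and otherwise 0s."""
    for row in rows:
        for encoded in feature_slices:
            values = [row[col] for col in encoded]
            # A valid one-hot group sorts to all zeros followed by a single one.
            if sorted(values) != [0] * (len(values) - 1) + [1]:
                return False
    return True


# Hypothetical data: column 0 is numeric, columns 1-3 encode one categorical feature.
feature_slices = [[1, 2, 3]]
consistent_rows = [[30, 0, 1, 0], [45, 1, 0, 0]]
inconsistent_rows = [[30, 1, 1, 0]]  # two columns of the same group are set

print(rows_respect_one_hot(consistent_rows, feature_slices))
print(rows_respect_one_hot(inconsistent_rows, feature_slices))
```

A check of this kind could be run on transformed data or on representative values to confirm that generalizing one column of a group did not leave the group in an impossible state.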