Support for one-hot encoded features in minimization #87

abigailgold · 2023-11-19T18:33:58Z

Columns in the input data that represent one-hot encoded features can now be minimized together, so that the encoding stays valid. This holds both for the output of the transform method and for the representative values in the generalizations structure.
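To make "correctness of the encoding" concrete, here is a minimal sketch of the invariant involved: within each one-hot group, exactly one column is 1. It assumes `feature_slices` is a list of column-index groups, which matches how the diff iterates over it. The helper `is_valid_one_hot` is hypothetical and not part of the library.

```python
from typing import List, Sequence


def is_valid_one_hot(row: Sequence[int], feature_slices: List[List[int]]) -> bool:
    """Return True if every one-hot group in `row` has exactly one column set to 1."""
    for encoded in feature_slices:
        values = [row[i] for i in encoded]
        if any(v not in (0, 1) for v in values) or sum(values) != 1:
            return False
    return True


# Columns 0-2 encode one categorical feature; column 3 is an ordinary numeric feature.
slices = [[0, 1, 2]]
consistent = is_valid_one_hot([0, 1, 0, 37], slices)
# Generalizing column 0 on its own could leave two active columns in the same group.
inconsistent = is_valid_one_hot([1, 1, 0, 37], slices)
```

Minimizing the columns of a group independently can produce rows like the second one, which is why the group needs to be handled as a unit.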

Signed-off-by: abigailt <[email protected]>

andersonm-ibm · 2023-12-20T16:54:37Z

apt/minimization/minimizer.py

        self.categorical_features = []
        if categorical_features:
            self.categorical_features = categorical_features
        self.features_to_minimize = features_to_minimize
+        self.feature_slices = feature_slices
+        if self.feature_slices:
+            self.all_one_hot_features = set([str(feature) for encoded in self.feature_slices for feature in encoded])


You can use a set comprehension directly instead of converting a list comprehension to a set.
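For reference, a small standalone comparison of the two forms. The `feature_slices` value here is made up for illustration.

```python
feature_slices = [[0, 1, 2], [5, 6]]

# Original form: build a list, then convert it to a set.
via_list = set([str(feature) for encoded in feature_slices for feature in encoded])

# Suggested form: a set comprehension builds the set directly.
via_set_comprehension = {str(feature) for encoded in feature_slices for feature in encoded}
```

Both expressions produce the same set; the second skips the intermediate list.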

andersonm-ibm · 2023-12-20T19:34:20Z

apt/minimization/minimizer.py

@@ -375,6 +396,8 @@ def fit(self, X: Optional[DATA_PANDAS_NUMPY_TYPE] = None, y: Optional[DATA_PANDA
            x_test_dataset = ArrayDataset(x_test, features_names=self._features)
            self._ncp_scores.fit_score = self.calculate_ncp(x_test_dataset)
            self._ncp_scores.generalizations_score = self.calculate_ncp(x_test_dataset)
+        else:
+            print('No fitting was performed as some information was missing')


Can the message be a bit more helpful?
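One possible direction, sketched under assumptions: the message could name the inputs that were missing. The diff does not show which conditions lead to this branch, so the input names below are purely hypothetical, as is the helper `missing_inputs_message`.

```python
from typing import Any, Dict


def missing_inputs_message(inputs: Dict[str, Any]) -> str:
    """Build a message that names every input whose value is None."""
    missing = sorted(name for name, value in inputs.items() if value is None)
    return ("No fitting was performed because the following inputs were missing: "
            + ", ".join(missing))


# Hypothetical inputs; the real conditions checked by fit() are not shown in this diff.
message = missing_inputs_message({"X": None, "y": [0, 1], "estimator": None})
```

The resulting string could then be passed to `warnings.warn` or a logger instead of `print`, so that callers can filter or capture it.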

andersonm-ibm · 2023-12-20T19:38:40Z

apt/minimization/minimizer.py

+                        elif range['end'] is None and range['start'] > 0:
+                            feature_value = 1
+                        elif range['start'] is not None and range['end'] is not None:
+                            print(range)


What's the feature_value in this case? Seems like an unassigned feature_value will be appended in the next line.
Is the print the only thing that happens here? And even if so, shouldn't some text explain the meaning of this print?

andersonm-ibm · 2023-12-20T19:42:03Z

apt/minimization/minimizer.py

+                            feature_value = 1
+                        elif range['start'] is not None and range['end'] is not None:
+                            print(range)
+                        new_cell['categories'][feature].append(feature_value)


And what's the feature_value if none of the ifs match?
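A sketch of one way to address both comments: compute the value in a helper that either returns a definite value or raises, so nothing unassigned can be appended. Only the `start > 0` branch is taken from the diff; the mirrored `end < 1` branch and the helper name `one_hot_value_from_range` are assumptions.

```python
from typing import Optional, TypedDict


class Range(TypedDict):
    start: Optional[float]
    end: Optional[float]


def one_hot_value_from_range(rng: Range) -> int:
    """Map a generalized range over a 0/1 column to a single value, or fail loudly."""
    start, end = rng["start"], rng["end"]
    if start is None and end is not None and end < 1:
        # Assumed mirror of the branch in the diff: only 0 lies below the upper bound.
        return 0
    if end is None and start is not None and start > 0:
        # Branch visible in the diff: only 1 lies above the lower bound.
        return 1
    # Both bounds set, or both missing: no single value is implied, so do not guess.
    raise ValueError(f"Cannot derive a one-hot value from range {rng!r}")
```

Raising in the fallthrough case replaces the bare `print(range)` with an error that explains itself, and the explicit `is not None` checks avoid comparing `None` with a number.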

andersonm-ibm · 2023-12-21T17:43:51Z

apt/minimization/minimizer.py

+    def _get_other_features_in_encoding(feature, feature_slices):
+        for encoded in feature_slices:
+            if feature in encoded:
+                return (list(set(encoded) - set([feature]))), encoded


nit: set([feature]) can be replaced with {feature}
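A standalone version of the helper with the set literal applied, for illustration. The explicit fallback return for a feature that belongs to no group is an assumption; the function in the diff returns `None` implicitly in that case.

```python
def get_other_features_in_encoding(feature, feature_slices):
    """Return (other members of the one-hot group, whole group) for `feature`."""
    for encoded in feature_slices:
        if feature in encoded:
            # {feature} is a set literal, equivalent to set([feature]).
            return list(set(encoded) - {feature}), encoded
    # Assumption: a feature outside every group has no companions.
    return [], []


others, group = get_other_features_in_encoding(1, [[0, 1, 2], [5, 6]])
```

Note that the order of `others` is not guaranteed, since it comes from a set difference.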

andersonm-ibm · 2023-12-21T20:37:43Z

apt/minimization/minimizer.py

+                                new_cell['categories'][other_feature].append(1)
+                            else:
+                                new_cell['categories'][other_feature].append(0)
+                                new_cell['categories'][other_feature].append(1)


I'm not sure I understand where this is narrowed down to the single correct value.
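To illustrate the concern: if allowed values are only ever appended, a column that was first recorded as `[0, 1]` can never shrink back to a single value. A minimal sketch of the alternative, replacing instead of appending, follows. The `categories` layout, the column names, and the helper `fix_group_member` are hypothetical and are not the code from this PR.

```python
from typing import Dict, List


def fix_group_member(categories: Dict[str, List[int]], group: List[str], active: str) -> None:
    """Record that `active` is the column set to 1 within its one-hot group.

    Allowed values are replaced rather than appended, so a later, more specific
    constraint narrows the options instead of accumulating alongside older ones.
    """
    for feature in group:
        categories[feature] = [1] if feature == active else [0]


# Hypothetical cell state in which nothing has been decided yet.
categories = {"color_red": [0, 1], "color_green": [0, 1], "color_blue": [0, 1]}
fix_group_member(categories, ["color_red", "color_green", "color_blue"], "color_green")
```

After the call, every column in the group holds exactly one allowed value, and exactly one of them is 1.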

abigailgold added 4 commits November 15, 2023 08:21

Initial version with first working test

2a65738

Signed-off-by: abigailt <[email protected]>

Second test (pandas)

e7ee42f

Signed-off-by: abigailt <[email protected]>

One more test + fixes

904462a

Signed-off-by: abigailt <[email protected]>

More tests and fixes. Make sure representative values in generalizati…

c122fc7

…ons for 1-hot encoded features are consistent. Signed-off-by: abigailt <[email protected]>

abigailgold requested a review from andersonm-ibm November 19, 2023 18:34

abigailgold added 3 commits November 19, 2023 14:18

Formatting

8c1a186

Signed-off-by: abigailt <[email protected]>

Indication when fitting failed

f602806

Signed-off-by: abigailt <[email protected]>

Updated notebooks for one-hot encoded data

0e01e19

Signed-off-by: abigailt <[email protected]>

andersonm-ibm reviewed Dec 21, 2023

abigailgold added 4 commits December 24, 2023 13:13

Replace values in multi-column 1-hot encoded features instead of appe…

f646109

…nding so that options are narrowed down Signed-off-by: abigailt <[email protected]>

Fix test

a3d294a

Signed-off-by: abigailt <[email protected]>

Review comments

686969e

Signed-off-by: abigailt <[email protected]>

Formatting

e7a0a6a

Signed-off-by: abigailt <[email protected]>

abigailgold merged commit 6d81cd8 into main Dec 24, 2023
4 checks passed

abigailgold deleted the one_hot_minimization branch December 24, 2023 23:18
