Learning when drop_zeros=False #1595

ColdTeapot273K · 2024-08-23T15:29:26Z

Versions

river version: recent main (d606d7b)
Python version: Python 3.11.8
Operating system: Fedora Linux

Describe the bug

These 4 interesting lines effectively stop OneHotEncoder from learning when drop_zeros=False:

Steps/code to reproduce

Setup:

# Sample code to reproduce the problem
>>> from pprint import pprint
>>> import random
>>> import string

>>> random.seed(42)
>>> alphabet = list(string.ascii_lowercase)
>>> X = [
...     {
...         'c1': random.choice(alphabet),
...         'c2': random.choice(alphabet),
...     }
...     for _ in range(4)
... ]

>>> from river import preprocessing
>>> oh = preprocessing.OneHotEncoder(drop_zeros=True)
>>> for x in X:
...:     oh.learn_one(x)
...:     pprint(oh.transform_one(x))
...: 
{'c1_u': 1, 'c2_d': 1}
{'c1_a': 1, 'c2_x': 1}
{'c1_i': 1, 'c2_h': 1}
{'c1_h': 1, 'c2_e': 1}

Actual result:

>>> oh.values
defaultdict(set, {})

Expected result:

>>> oh.values
defaultdict(set, {'c1': {'a', 'h', 'i', 'u'}, 'c2': {'d', 'e', 'h', 'x'}})

The text was updated successfully, but these errors were encountered:

MaxHalford · 2024-08-25T20:44:56Z

Hey there @ColdTeapot273K!

Would you say this is a bug? The idea here is that there is no need to keep track of values when drop_zeros=True. Indeed, if drop_zeros=False, then we need to keep track of values in order to add zeros. But if not, there is no need to keep track, which is why we return directly in learn_one and learn_many.

What am I missing? Maybe it would be preferable if oh.values was a private property?

ColdTeapot273K · 2024-09-27T15:25:28Z

@MaxHalford sorry for the delay

I got the idea, but let me argue that it's a hack.

Let me argue that learning implies some "learnt state", as per principle of least astonishment. Workflows for debug, integration with other libs, serialization, etc. of learnable transformers/estimators rely on this behavior.

Personally I had quite some cases where it was at least handy (sometimes - necessary) to inspect such a state to verify the behaviour, especially during productionisation of pipelines, converting to other frameworks, developing extensions, etc.

The current implementation is a logic shortcut which saves on some space (& maybe time) complexity. At the cost of "correctness", as in, making learning logic inconsistent and disabling workflows that depend on inspecting a learnt state.
E.g. now I have to make custom patches for river fork or give up on performance gains from sparsity or stay on old river releases. I think this trade was not worth it.

Proposal: the previous implementation (which had the learnt state) was just fine and should be the default. The current shortcut implementation can be re-implemented by advanced users who are onto optimizing library code for some specific use case.

ColdTeapot273K · 2024-11-05T16:01:28Z

@MaxHalford I can make a PR with fix if it makes your life easier 😅

MaxHalford added the Bug label Aug 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Learning when drop_zeros=False #1595

Learning when drop_zeros=False #1595

ColdTeapot273K commented Aug 23, 2024

MaxHalford commented Aug 25, 2024

ColdTeapot273K commented Sep 27, 2024

ColdTeapot273K commented Nov 5, 2024

Learning when drop_zeros=False #1595

Learning when drop_zeros=False #1595

Comments

ColdTeapot273K commented Aug 23, 2024

Versions

Describe the bug

Steps/code to reproduce

MaxHalford commented Aug 25, 2024

ColdTeapot273K commented Sep 27, 2024

ColdTeapot273K commented Nov 5, 2024