TF-IDF uses the wrong transform_many() #1629

e10e3 · 2024-11-13T17:28:49Z

Versions

River version: 0.21.1
Python version: 3.12.7
Operating system: macOS 14.7

Describe the bug

The output of TFIDF.transform_one() and TFIDF.transform_many() are inconsistent.

transform_one() gives a dictionnary mapping a word to its importance, while transform_many() gives a dataframe of the word counts.

This is because TFIDF inherits from BagOfWords but does not reimplement the *_many() methods, leading Python to use the ones from BagOfWords in their absence.

Code to reproduce

import pandas as pd
from river import feature_extraction

tfidf = feature_extraction.TFIDF()

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

new_df = pd.DataFrame()
for sentence in corpus:
    tfidf.learn_one(sentence)
    print(tfidf.transform_one(sentence))

batch_corpus = pd.Series(corpus)
print(tfidf.transform_many(batch_corpus))

Output

# from transform_one
{'this': 0.4472135954999579, 'is': 0.4472135954999579, 'the': 0.4472135954999579, 'first': 0.4472135954999579, 'document': 0.4472135954999579}
{'this': 0.3337910861940011, 'document': 0.6675821723880022, 'is': 0.3337910861940011, 'the': 0.3337910861940011, 'second': 0.4691317250431934}
{'and': 0.49711994139048155, 'this': 0.2936070455647439, 'is': 0.2936070455647439, 'the': 0.2936070455647439, 'third': 0.49711994139048155, 'one': 0.49711994139048155}
{'is': 0.3840852409148149, 'this': 0.3840852409148149, 'the': 0.3840852409148149, 'first': 0.580285823684436, 'document': 0.4697913855799205}
# from transform_many
   this  is  the  first  document  second  and  third  one
0     1   1    1      1         1       0    0      0    0
1     1   1    1      0         2       1    0      0    0
2     1   1    1      0         0       0    1      1    1
3     1   1    1      1         1       0    0      0    0

Expected behaviour

transform_many should give a dataframe of floats. I don't know if this is exactly what the values should be, but this is how it should look:

       this        is       the     first  document    second      and    third      one
0  0.447214  0.447214  0.447214  0.447214  0.447214  0.000000  0.00000  0.00000  0.00000
1  0.333791  0.333791  0.333791  0.000000  0.667582  0.469132  0.00000  0.00000  0.00000
2  0.293607  0.293607  0.293607  0.000000  0.000000  0.000000  0.49712  0.49712  0.49712
3  0.384085  0.384085  0.384085  0.580286  0.469791  0.000000  0.00000  0.00000  0.00000

The text was updated successfully, but these errors were encountered:

e10e3 · 2024-11-13T18:01:39Z

Maybe linked to #1576

raphaelsty · 2024-11-13T18:35:07Z

We did not wanted to hide a for loop in the learn_many and transform_many but rather propose a method which is faster on batch. Right now learn_many and transform_many should not be available for tfidf.

for document in documents:
    tfidf.learn_one(document)

@smastelini Do you think we should raise an error here or add a simple for loop with learn_one and transform_one.

The best would be to have a dedicated and optimized batch tfidf. Don't have much time yet to work on it but at some point I could do it. :)

smastelini · 2024-11-13T20:07:38Z

Hi @e10e3 and @raphaelsty. I believe it should raise a NotImplementedError at the moment. I agree with your point @raphaelsty. The phylosophy for mini batches is "do it right or don't do it". In case there are no real benefits in speed, we should not hide a loop from the user, as you mention.

e10e3 mentioned this issue Nov 14, 2024

Make mini-batch TF-IDF raise an exception #1631

Merged

MaxHalford closed this as completed in #1631 Nov 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TF-IDF uses the wrong transform_many() #1629

TF-IDF uses the wrong transform_many() #1629

e10e3 commented Nov 13, 2024

e10e3 commented Nov 13, 2024

raphaelsty commented Nov 13, 2024 •

edited

Loading

smastelini commented Nov 13, 2024

TF-IDF uses the wrong transform_many() #1629

TF-IDF uses the wrong transform_many() #1629

Comments

e10e3 commented Nov 13, 2024

Versions

Describe the bug

Code to reproduce

Output

Expected behaviour

e10e3 commented Nov 13, 2024

raphaelsty commented Nov 13, 2024 • edited Loading

smastelini commented Nov 13, 2024

raphaelsty commented Nov 13, 2024 •

edited

Loading