Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TF-IDF uses the wrong transform_many() #1629

Closed
e10e3 opened this issue Nov 13, 2024 · 3 comments · Fixed by #1631
Closed

TF-IDF uses the wrong transform_many() #1629

e10e3 opened this issue Nov 13, 2024 · 3 comments · Fixed by #1631

Comments

@e10e3
Copy link
Contributor

e10e3 commented Nov 13, 2024

Versions

River version: 0.21.1
Python version: 3.12.7
Operating system: macOS 14.7

Describe the bug

The output of TFIDF.transform_one() and TFIDF.transform_many() are inconsistent.

transform_one() gives a dictionnary mapping a word to its importance, while transform_many() gives a dataframe of the word counts.

This is because TFIDF inherits from BagOfWords but does not reimplement the *_many() methods, leading Python to use the ones from BagOfWords in their absence.

Code to reproduce

import pandas as pd
from river import feature_extraction

tfidf = feature_extraction.TFIDF()

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

new_df = pd.DataFrame()
for sentence in corpus:
    tfidf.learn_one(sentence)
    print(tfidf.transform_one(sentence))

batch_corpus = pd.Series(corpus)
print(tfidf.transform_many(batch_corpus))

Output

# from transform_one
{'this': 0.4472135954999579, 'is': 0.4472135954999579, 'the': 0.4472135954999579, 'first': 0.4472135954999579, 'document': 0.4472135954999579}
{'this': 0.3337910861940011, 'document': 0.6675821723880022, 'is': 0.3337910861940011, 'the': 0.3337910861940011, 'second': 0.4691317250431934}
{'and': 0.49711994139048155, 'this': 0.2936070455647439, 'is': 0.2936070455647439, 'the': 0.2936070455647439, 'third': 0.49711994139048155, 'one': 0.49711994139048155}
{'is': 0.3840852409148149, 'this': 0.3840852409148149, 'the': 0.3840852409148149, 'first': 0.580285823684436, 'document': 0.4697913855799205}
# from transform_many
   this  is  the  first  document  second  and  third  one
0     1   1    1      1         1       0    0      0    0
1     1   1    1      0         2       1    0      0    0
2     1   1    1      0         0       0    1      1    1
3     1   1    1      1         1       0    0      0    0

Expected behaviour

transform_many should give a dataframe of floats. I don't know if this is exactly what the values should be, but this is how it should look:

       this        is       the     first  document    second      and    third      one
0  0.447214  0.447214  0.447214  0.447214  0.447214  0.000000  0.00000  0.00000  0.00000
1  0.333791  0.333791  0.333791  0.000000  0.667582  0.469132  0.00000  0.00000  0.00000
2  0.293607  0.293607  0.293607  0.000000  0.000000  0.000000  0.49712  0.49712  0.49712
3  0.384085  0.384085  0.384085  0.580286  0.469791  0.000000  0.00000  0.00000  0.00000
@e10e3
Copy link
Contributor Author

e10e3 commented Nov 13, 2024

Maybe linked to #1576

@raphaelsty
Copy link
Member

raphaelsty commented Nov 13, 2024

We did not wanted to hide a for loop in the learn_many and transform_many but rather propose a method which is faster on batch. Right now learn_many and transform_many should not be available for tfidf.

for document in documents:
    tfidf.learn_one(document)

@smastelini Do you think we should raise an error here or add a simple for loop with learn_one and transform_one.

The best would be to have a dedicated and optimized batch tfidf. Don't have much time yet to work on it but at some point I could do it. :)

@smastelini
Copy link
Member

Hi @e10e3 and @raphaelsty. I believe it should raise a NotImplementedError at the moment. I agree with your point @raphaelsty. The phylosophy for mini batches is "do it right or don't do it". In case there are no real benefits in speed, we should not hide a loop from the user, as you mention.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants