`TFIDF.transform_many()` fails on `DataFrame` input #1576
Comments
Well, I looked at the source code. Unfortunately, the problems go deeper than inconsistent type annotations: While I'm going to side-step all of this for now by just iterating over the records one at a time, since the mini-batch operations are both invalid for
I took a shot at implementing learn/transform many in a subclass:

```python
import pandas as pd
import river.feature_extraction
import scipy.sparse


class MyTFIDF(river.feature_extraction.TFIDF):
    def learn_many(self, X: pd.Series) -> None:
        # increment global document counter
        self.n += X.shape[0]
        # update document counts: set() makes each term count once per document
        doc_counts = (
            X.map(lambda x: set(self.process_text(x)))
            .explode()
            .value_counts()
            .to_dict()
        )
        # assumes self.dfs is a Counter, so update() adds to the existing counts
        self.dfs.update(doc_counts)

    def transform_many(self, X: pd.Series) -> pd.DataFrame:
        """Transform pandas series of string into tf-idf pandas sparse dataframe."""
        indptr, indices, data = [0], [], []
        index: dict[str, int] = {}  # term -> column position
        for doc in X:
            term_weights: dict[str, float] = self.transform_one(doc)
            for term, weight in term_weights.items():
                indices.append(index.setdefault(term, len(index)))
                data.append(weight)
            indptr.append(len(data))
        return pd.DataFrame.sparse.from_spmatrix(
            # explicit shape keeps empty input and all-empty rows well-defined
            scipy.sparse.csr_matrix((data, indices, indptr), shape=(len(X), len(index))),
            index=X.index,
            columns=list(index),
        )
```
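For anyone following along, here is a rough, dependency-free sketch of the two ideas in that snippet: accumulating per-document term counts across batches, and building the CSR arrays (`indptr`, `indices`, `data`) row by row. The helper names `update_doc_freqs`, `build_csr`, and `to_dense` are made up for illustration and are not part of river, pandas, or scipy. The input documents are assumed to be already tokenized.

```python
from collections import Counter


def update_doc_freqs(dfs: Counter, docs: list[list[str]]) -> None:
    """Add one count per document for every distinct term it contains."""
    for tokens in docs:
        # set() ensures a term repeated inside one document counts once
        dfs.update(set(tokens))


def build_csr(rows: list[dict[str, float]]):
    """Assemble CSR arrays from per-document {term: weight} mappings."""
    indptr, indices, data = [0], [], []
    vocab: dict[str, int] = {}  # term -> column position, in first-seen order
    for row in rows:
        for term, weight in row.items():
            indices.append(vocab.setdefault(term, len(vocab)))
            data.append(weight)
        # each row ends where the running entry count currently stands
        indptr.append(len(data))
    return indptr, indices, data, vocab


def to_dense(indptr, indices, data, n_cols):
    """Expand CSR arrays into a list of dense rows (for inspection only)."""
    dense = []
    for start, end in zip(indptr, indptr[1:]):
        row = [0.0] * n_cols
        for col, value in zip(indices[start:end], data[start:end]):
            row[col] = value
        dense.append(row)
    return dense


dfs = Counter()
update_doc_freqs(dfs, [["a", "b", "a"], ["b", "c"]])
update_doc_freqs(dfs, [["a"]])  # a second batch accumulates on top of the first

rows = [{"a": 0.5, "b": 0.5}, {}, {"c": 1.0, "a": 2.0}]
indptr, indices, data, vocab = build_csr(rows)
dense = to_dense(indptr, indices, data, len(vocab))
```

Note that an empty document simply repeats the previous `indptr` value, which is why the row count can always be recovered from `len(indptr) - 1` while the column count cannot be recovered from the arrays if the trailing vocabulary entries never appear.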
Versions
river version: 0.21.2
Python version: 3.11.7
Operating system: macOS 14.4
Describe the bug
The `TFIDF` feature extractor claims to support both online and mini-batch transformations, but the latter case only works when the transformer doesn't specify the `on` parameter. In other words, batch mode works for `pd.Series` input, but not `pd.DataFrame`.
Steps/code to reproduce
That last call produces the following traceback: