Draft: Adding CountVectorizer & TfidfVectorizer #164

acalejos · 2023-08-27T02:23:28Z

Adding a CountVectorizer and a TfidfVectorizer, very similar to scikit-learn's implementations. These would definitely benefit from sparse tensors, and @polvalente is currently doing some work on Nx that could add sparse tensor support (see https://github.com/elixir-nx/nx/tree/pv-proof-of-concept-sparse-torchx), but for now I plan to proceed with dense tensors.

Since these vectorizers operate on text, they might fit best in a standalone NLP library (I have begun scaffolding one out), but they could also fit here in Scholar.Preprocessing. I'm open to suggestions as to where they belong.

As of now, I have only finished setting up the options and building a vocab from them. I will try to wrap up at least the CountVectorizer by tomorrow night.
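For context, here is a rough sketch of what "build a vocab from the options, then count into a dense matrix" can look like. It is plain Python rather than the code in this PR, and the names `tokenize`, `build_vocab`, `count_matrix`, and the options `lowercase` and `max_features` are illustrative assumptions, loosely modeled on scikit-learn's parameter names. The token pattern mirrors scikit-learn's documented default (words of two or more characters).

```python
import re
from collections import Counter

# Tokens of 2+ word characters, mirroring scikit-learn's documented default pattern.
TOKEN_RE = re.compile(r"\b\w\w+\b")


def tokenize(doc, lowercase=True):
    """Split a document into tokens (hypothetical helper)."""
    if lowercase:
        doc = doc.lower()
    return TOKEN_RE.findall(doc)


def build_vocab(docs, lowercase=True, max_features=None):
    """Map each kept term to a column index, with terms sorted alphabetically.

    If max_features is given, only the most frequent terms across the
    corpus are kept (an assumption about how the option should behave).
    """
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc, lowercase))
    kept = sorted(term for term, _ in counts.most_common(max_features))
    return {term: idx for idx, term in enumerate(kept)}


def count_matrix(docs, vocab, lowercase=True):
    """Return a dense documents-by-terms matrix of raw counts (list of lists)."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok in tokenize(doc, lowercase):
            idx = vocab.get(tok)
            if idx is not None:  # out-of-vocabulary tokens are ignored
                row[idx] += 1
        rows.append(row)
    return rows


docs = ["the cat sat", "the cat and the hat"]
vocab = build_vocab(docs)
matrix = count_matrix(docs, vocab)
```

Every document gets a full row of `len(vocab)` entries here, mostly zeros on real corpora, which is why sparse tensors would help.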

josevalim · 2023-08-27T07:17:21Z

@acalejos I would propose to keep this in the NLP library for now. For Scholar, we have been focusing on keeping everything implemented with defn and this is plain not possible here. Also, I can see future interest in implementing this in native code for performance, so I am not even sure if this will survive after some iterations :)

acalejos · 2023-08-28T03:01:09Z

> @acalejos I would propose to keep this in the NLP library for now. For Scholar, we have been focusing on keeping everything implemented with defn and this is plain not possible here. Also, I can see future interest in implementing this in native code for performance, so I am not even sure if this will survive after some iterations :)

Makes sense. Appreciate the feedback!

Adding CountVectorizer -- build vocab done

29ac81e

acalejos closed this Aug 28, 2023
