
Draft: Adding CountVectorizer & TfidfVectorizer #164

Closed
wants to merge 1 commit

Conversation

@acalejos (Contributor)

This adds CountVectorizer and TfidfVectorizer, closely modeled on scikit-learn's implementations. Given the work @polvalente is currently doing on Nx to potentially add sparse tensor support (see https://github.com/elixir-nx/nx/tree/pv-proof-of-concept-sparse-torchx), these vectorizers would definitely benefit from sparse tensors, but for now I plan to proceed with dense tensors.

Considering these vectorizers operate on text, they might be a better fit for a standalone NLP library (I have begun scaffolding one out), but they could also fit well in Scholar.Preprocessing. I'm open to suggestions as to where they best belong.

As of now, I have only finished setting up the options and building a vocabulary based on them. I will try to wrap up at least the CountVectorizer by tomorrow night.
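For illustration only (this is not code from the PR, and the names `docs`, `tokens`, `vocab`, and `counts` are hypothetical), a minimal sketch of the vocabulary-building and counting steps using plain Elixir plus a dense Nx tensor might look like this:

```elixir
# Hypothetical sketch: tokenize, build a sorted vocabulary, and
# produce a dense document-term count matrix as an Nx tensor.
docs = ["the quick brown fox", "the lazy brown dog"]

tokens =
  Enum.map(docs, fn doc ->
    doc |> String.downcase() |> String.split(~r/\W+/, trim: true)
  end)

# Map each unique term to a column index.
vocab =
  tokens
  |> List.flatten()
  |> Enum.uniq()
  |> Enum.sort()
  |> Enum.with_index()
  |> Map.new()

# One row per document, one column per vocabulary term.
counts =
  for doc_tokens <- tokens do
    freqs = Enum.frequencies(doc_tokens)

    vocab
    |> Enum.sort_by(fn {_term, index} -> index end)
    |> Enum.map(fn {term, _index} -> Map.get(freqs, term, 0) end)
  end

x = Nx.tensor(counts)
```

For realistic corpora, `counts` is mostly zeros, which is why sparse tensor support in Nx would make such a matrix far cheaper to store.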

@josevalim (Contributor)

@acalejos I would propose keeping this in the NLP library for now. For Scholar, we have been focusing on keeping everything implemented with defn, and that is simply not possible here. Also, I can see future interest in implementing this in native code for performance, so I am not even sure this would survive after a few iterations :)
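As context for the defn constraint (a hypothetical sketch, not code from Scholar or the PR): defn functions operate on tensors and numbers, so tokenization and vocabulary construction must happen in regular Elixir. Only a purely numerical step, such as applying the smoothed idf weighting that scikit-learn uses (shown here without the final L2 normalization), would fit inside defn:

```elixir
defmodule TfidfSketch do
  import Nx.Defn

  # Hypothetical numerical step: given an already-built dense count
  # matrix, apply smoothed inverse-document-frequency weighting.
  defn tfidf(counts) do
    n_docs = Nx.axis_size(counts, 0)
    # Number of documents in which each term appears.
    df = Nx.sum(counts > 0, axes: [0])
    idf = Nx.log((n_docs + 1) / (df + 1)) + 1
    counts * idf
  end
end
```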

@acalejos (Contributor, Author)

> @acalejos I would propose keeping this in the NLP library for now. For Scholar, we have been focusing on keeping everything implemented with defn, and that is simply not possible here. Also, I can see future interest in implementing this in native code for performance, so I am not even sure this would survive after a few iterations :)

Makes sense. Appreciate the feedback!

@acalejos closed this on Aug 28, 2023