Draft: Adding CountVectorizer & TfidfVectorizer #164
Closed
This PR adds a CountVectorizer and a TfidfVectorizer modeled closely on scikit-learn's implementations. @polvalente is currently working on potentially adding sparse tensor support to Nx (see https://github.com/elixir-nx/nx/tree/pv-proof-of-concept-sparse-torchx). These vectorizers would definitely benefit from sparse tensors, but for now I plan to proceed with dense tensors.
Since these vectorizers operate on text, they might fit best in a standalone NLP library (I have begun scaffolding one), but they could also fit well here in Scholar.Preprocessing. I'm open to suggestions as to where they best belong.
So far, I have only finished setting up the options and building a vocabulary from them. I will try to wrap up at least the CountVectorizer by tomorrow night.
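
For reviewers unfamiliar with the technique, here is a rough sketch of what building a vocabulary and producing dense counts looks like. It is plain Python for illustration only and is not the code in this PR: the helper names (`tokenize`, `build_vocabulary`, `count_matrix`) and the single `min_df` option are assumptions, and the token pattern simply mirrors scikit-learn's default of keeping words with two or more characters.

```python
import re
from collections import Counter

# Assumed tokenizer: lowercase, keep runs of 2+ word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())


def build_vocabulary(docs, min_df=1):
    """Map each kept term to a column index (terms sorted alphabetically)."""
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document to get its document frequency.
        doc_freq.update(set(tokenize(doc)))
    kept = sorted(term for term, df in doc_freq.items() if df >= min_df)
    return {term: idx for idx, term in enumerate(kept)}


def count_matrix(docs, vocab):
    """Return a dense documents-by-terms matrix of raw term counts."""
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for token in tokenize(doc):
            idx = vocab.get(token)
            if idx is not None:  # out-of-vocabulary tokens are ignored
                row[idx] += 1
        rows.append(row)
    return rows


docs = ["the cat sat", "the cat saw the dog"]
vocab = build_vocabulary(docs)
counts = count_matrix(docs, vocab)
```

With a realistic vocabulary most entries of each row are zero, which is why sparse tensors would be a better fit than the dense representation used for now.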