WIP: Add benchmark for tfidf #239
base: main
Conversation
Add a benchmark for calculating TF-IDF scores and finding the top 10 most similar documents. We use cosine similarity to calculate the relevancy of documents for a given query.
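As a rough illustration of the scoring step described above (not the PR's actual code; the function and type names are placeholders), cosine similarity over sparse term-weight maps might look like this in Rust:

```rust
use std::collections::HashMap;

/// Cosine similarity between two sparse TF-IDF vectors, represented
/// as term -> weight maps. Only terms present in both maps contribute
/// to the dot product.
fn cosine_similarity(a: &HashMap<String, f64>, b: &HashMap<String, f64>) -> f64 {
    let dot: f64 = a
        .iter()
        .filter_map(|(term, wa)| b.get(term).map(|wb| wa * wb))
        .sum();
    let norm = |v: &HashMap<String, f64>| v.values().map(|w| w * w).sum::<f64>().sqrt();
    let denom = norm(a) * norm(b);
    if denom == 0.0 { 0.0 } else { dot / denom }
}

fn main() {
    let a = HashMap::from([("cat".to_string(), 1.0), ("dog".to_string(), 2.0)]);
    let b = HashMap::from([("cat".to_string(), 2.0), ("dog".to_string(), 4.0)]);
    // Parallel vectors, so the similarity should be close to 1.0.
    println!("{}", cosine_similarity(&a, &b));
}
```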
Hi @GoWind! The biggest SimSIMD improvement should come from the sparse kernels. Have you managed to use them in your TF-IDF implementation?
@ashvardanian, not yet. I am trying to figure out what the ideal benchmark could be (and I am also learning how to use TF-IDF). In scikit-learn, the TfidfVectorizer creates a 2D array for the documents, where each row is a document and there is a column for each "token" in the corpus. Similarly, we can tokenize the query and run
@GoWind, the
Will do, thanks for the pointers! Also reading through the scikit-learn implementation to see how I can possibly do this :)
Hi @ashvardanian, making progress on the TF-IDF based similarity calculator. Here is how I prepared the script (added a few hacks to test quickly; will update the scripts to be
I took the first 10k lines and batched them into 10 documents of 1k lines each.
Not sure if I am doing something wrong, but could there be some sort of discrepancy here?
Seems like one is
Ah, I see. Now it makes sense.
Add a vanilla Rust program for calculating TF-IDF scores and finding the top 10 most similar documents.
We use cosine similarity to calculate the relevancy of documents for a given query.
@ashvardanian, thanks for your time!
I wanted to check with you on the approach, to see if it makes sense to you as a valid benchmark.
Use the `leipzig1m` and the XL-Sum datasets as corpora. I assume that each 10,000 lines in the `leipzig1m` dataset is a single document, and then calculate the TF-IDF scores for each (term, document) pair in the corpus. Given a query, calculate the TF-IDF score of the query against the same corpus (treating the query as a separate document).
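For the (term, document) scoring step, a minimal dependency-free sketch could look like the following. The helper names are assumptions, and it uses the unsmoothed idf = ln(N / df) variant, which differs slightly from scikit-learn's default smoothing:

```rust
use std::collections::HashMap;

/// Compute per-document TF-IDF weights for a pre-tokenized corpus.
/// tf = term count / document length; idf = ln(N / document frequency).
fn tfidf(documents: &[Vec<&str>]) -> Vec<HashMap<String, f64>> {
    let n = documents.len() as f64;
    // Document frequency: in how many documents each term appears.
    let mut df: HashMap<&str, f64> = HashMap::new();
    for doc in documents {
        let mut unique: Vec<&str> = doc.clone();
        unique.sort();
        unique.dedup();
        for term in unique {
            *df.entry(term).or_insert(0.0) += 1.0;
        }
    }
    documents
        .iter()
        .map(|doc| {
            let len = doc.len() as f64;
            // Raw term counts within this document.
            let mut tf: HashMap<String, f64> = HashMap::new();
            for term in doc {
                *tf.entry(term.to_string()).or_insert(0.0) += 1.0;
            }
            tf.into_iter()
                .map(|(term, count)| {
                    let idf = (n / df[term.as_str()]).ln();
                    (term, (count / len) * idf)
                })
                .collect()
        })
        .collect()
}

fn main() {
    let docs = vec![vec!["a", "b"], vec!["a", "c"]];
    let scores = tfidf(&docs);
    // "a" appears in every document, so its idf (and weight) is 0.
    println!("{:?}", scores[0]);
}
```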
The next step is to calculate the cosine score. I couldn't find a good source for how to compute the score as a vector for a query or a document (Claude gave me a function that I could possibly use).
Once we have cosine similarity scores, sort and fetch the top 10.
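The sort-and-fetch step above could be sketched like this (illustrative names, not the PR's code); `f64` is not `Ord`, so the sort goes through `partial_cmp`:

```rust
/// Given (document id, cosine score) pairs, sort by score in
/// descending order and keep only the top k results.
fn top_k(mut scored: Vec<(usize, f64)>, k: usize) -> Vec<(usize, f64)> {
    // NaN scores would panic here; real code should filter them first.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.truncate(k);
    scored
}

fn main() {
    let scored = vec![(0, 0.1), (1, 0.9), (2, 0.5)];
    // Keeps the two highest-scoring documents, best first.
    println!("{:?}", top_k(scored, 2));
}
```

For large corpora, a binary heap of size k avoids sorting the whole list, but a full sort is the simplest baseline for a benchmark.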
For benchmarking SimSIMD, I assume the cosine part is where I can use methods from SimSIMD and benchmark them against the vanilla implementation?
And for the query: in memchr vs stringzilla you basically pick a random set of tokens and then benchmark searching from the left and right using memchr and stringzilla. I was thinking of doing something similar: picking terms at random from the corpus and constructing random queries to benchmark.
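The random-query idea above could be sketched as follows. This uses a tiny xorshift PRNG to stay dependency-free; a real benchmark would more likely use the `rand` crate, and the function names are placeholders:

```rust
/// Minimal xorshift64 PRNG; the seed must be non-zero.
fn xorshift(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

/// Build a query of `len` terms sampled (with replacement) from the
/// corpus vocabulary, mirroring the memchr-vs-stringzilla setup of
/// benchmarking against randomly chosen tokens.
fn random_query<'a>(vocab: &[&'a str], len: usize, seed: &mut u64) -> Vec<&'a str> {
    (0..len)
        .map(|_| vocab[(xorshift(seed) as usize) % vocab.len()])
        .collect()
}

fn main() {
    let vocab = ["police", "said", "government", "year"];
    let mut seed = 42u64;
    println!("{:?}", random_query(&vocab, 3, &mut seed));
}
```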