
WIP: Add benchmark for tfidf #239

Open
wants to merge 5 commits into main
Conversation

GoWind
Contributor

@GoWind GoWind commented Nov 23, 2024

Add a vanilla Rust program that calculates tf-idf scores and returns the top 10 most similar documents.

We use cosine similarity to measure a document's relevance to a given query.

@ashvardanian, thanks for your time!
I wanted to check the approach with you, to see whether it makes sense as a valid benchmark.

Use the leipzig1m and the XL-Sum datasets as corpora. I assume that every 10,000 lines in the `leipzig1m` dataset form a single document, and then calculate the tf-idf score for each (term, document) pair in the corpus.

  1. Given a query, calculate its tf-idf scores over the same corpus (treating the query as a separate document).

  2. The next step is to calculate the cosine score. I couldn't find a good source describing how to build the score vector for a query or a document (Claude gave me a fn that I could possibly use).

  3. Once we have the cosine similarity scores, sort them and fetch the top 10.

For benchmarking SimSIMD, I assume the cosine part is where I can use methods from SimSIMD and benchmark them against the vanilla implementation?
As for the queries: in memchr vs stringzilla, you basically pick a random set of tokens and then benchmark searching from the left and from the right using memchr and stringzilla. I was thinking of doing something similar: picking terms at random from the corpus and constructing random queries to benchmark.
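The three steps above can be sketched in plain Rust. This is only an illustrative baseline, not the PR's code: the names `term_frequencies`, `cosine_similarity`, and `top_k` are mine, and the sketch computes raw term frequencies without the idf weighting for brevity.

```rust
use std::collections::HashMap;

/// Normalized term-frequency map for one document (or query).
/// Real tf-idf would multiply each entry by the term's idf weight.
fn term_frequencies(tokens: &[&str]) -> HashMap<String, f64> {
    let mut tf = HashMap::new();
    for t in tokens {
        *tf.entry(t.to_string()).or_insert(0.0) += 1.0;
    }
    let n = tokens.len() as f64;
    tf.values_mut().for_each(|v| *v /= n);
    tf
}

/// Dense cosine similarity between two equal-length vectors:
/// dot(a, b) / (|a| * |b|). Returns None for an all-zero vector.
fn cosine_similarity(a: &[f64], b: &[f64]) -> Option<f64> {
    let (mut ab, mut a2, mut b2) = (0.0, 0.0, 0.0);
    for (x, y) in a.iter().zip(b) {
        ab += x * y;
        a2 += x * x;
        b2 += y * y;
    }
    if a2 == 0.0 || b2 == 0.0 {
        return None;
    }
    Some(ab / (a2.sqrt() * b2.sqrt()))
}

/// Score every document against the query, sort descending, keep the top k.
fn top_k(query: &[f64], docs: &[Vec<f64>], k: usize) -> Vec<(usize, f64)> {
    let mut scored: Vec<(usize, f64)> = docs
        .iter()
        .enumerate()
        .filter_map(|(i, d)| cosine_similarity(query, d).map(|s| (i, s)))
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.truncate(k);
    scored
}
```

The dense `cosine_similarity` here is the part a SimSIMD kernel would replace in the benchmarked variant.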

Add a benchmark that calculates tf-idf scores and fetches the top 10 similar documents.

We use cosine similarity to measure a document's relevance to a given query.
@ashvardanian
Owner

Hi @GoWind! The biggest SimSIMD improvement should come from the Sparse kernels. Have you managed to use them in your TFIDF implementation?

@GoWind
Contributor Author

GoWind commented Nov 23, 2024

@ashvardanian, not yet. I am still trying to figure out what the ideal benchmark could be (and also learning how to use TF-IDF).

In scikit-learn, the TfidfVectorizer creates a 2D array for the documents, where each row is a document and there is a column for each "token" in the corpus, with
array[row][column] = frequency of the token in the document

Similarly, we can tokenize the query and intersect the query (as a row vector) with each document in our corpus to get the intersection size.
A match in the intersection would be when the frequency of a unique token in the query equals the frequency of the same term in the document (something like using simsimd_intersect). Is that what you had in mind for the sparse kernels?
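For intuition, the intersection step can be sketched as a scalar two-pointer walk over two sorted token-id arrays; this is only the operation such a sparse kernel accelerates, and the actual `simsimd_intersect` signature may differ from this sketch.

```rust
use std::cmp::Ordering;

/// Count the ids present in both sorted slices (each id unique per slice).
/// This scalar two-pointer loop is the baseline that a SIMD sparse
/// intersection kernel speeds up.
fn intersection_size(a: &[u32], b: &[u32]) -> usize {
    let (mut i, mut j, mut count) = (0, 0, 0);
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                count += 1;
                i += 1;
                j += 1;
            }
        }
    }
    count
}
```

In the sparse representation, a query and a document would each be a sorted array of token ids (plus a parallel array of weights for the weighted-dot-product variants).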

@ashvardanian
Owner

@GoWind, the intersect function may not be the only one you need. Also look into: spdot_counts and spdot_weights 🤗

@GoWind
Contributor Author

GoWind commented Nov 23, 2024

Will do, thanks for the pointers! I'm also reading through the scikit-learn implementation to see how I can possibly do this :)

@GovindarajanNagarajan-TomTom
Contributor

Hi @ashvardanian, making progress on the tf-idf-based similarity calculator.
I noticed a discrepancy when calculating the cosine similarity for a Vec of f64s via the Rust bindings:
the plain-Rust cosine values match what I get from both numpy and the SimSIMD bindings via Python.
For the implementation, I compute a vector of tf-idf values per query and then compute the cosine similarity between the query and each document, based on the answers to the SO question.

Here is how I prepared the script (I added a few hacks to test quickly and will turn the scripts into a [[bench]] in subsequent commits):

head -n 10000 leipzig1m.txt > leipzig10000.txt
cargo run --bin tfidf leipzig10000.txt

I took the first 10k lines and batched them into 10 documents of 1k lines each.
The (hardcoded) query in the script is transformed into a vector representation, and I compute the cosine similarities:

Similarity for document via simsimd 0: Some(0.5928754241421308)
Similarity for document via plain cosine similarity 0: Some(0.40712457585786876)
Similarity for document via simsimd 1: Some(0.5993249839897541)
Similarity for document via plain cosine similarity 1: Some(0.4006750160102468)
Similarity for document via simsimd 2: Some(0.5914559242162761)
Similarity for document via plain cosine similarity 2: Some(0.408544075783724)
Similarity for document via simsimd 3: Some(0.5998267820476098)
Similarity for document via plain cosine similarity 3: Some(0.40017321795239075)
Similarity for document via simsimd 4: Some(0.5906444006555799)
Similarity for document via plain cosine similarity 4: Some(0.40935559934442023)
Similarity for document via simsimd 5: Some(0.5902192553116478)
Similarity for document via plain cosine similarity 5: Some(0.4097807446883521)
Similarity for document via simsimd 6: Some(0.5943923707602529)
Similarity for document via plain cosine similarity 6: Some(0.4056076292397477)
Similarity for document via simsimd 7: Some(0.6028015678055032)
Similarity for document via plain cosine similarity 7: Some(0.3971984321944968)
Similarity for document via simsimd 8: Some(0.5957380843868555)
Similarity for document via plain cosine similarity 8: Some(0.4042619156131435)
Similarity for document via simsimd 9: Some(0.5913356879962984)
Similarity for document via plain cosine similarity 9: Some(0.4086643120037022)

Not sure if I am doing something wrong, but could there be some sort of discrepancy here?

@ashvardanian
Owner

Seems like one is x and the other is 1 − x: one is a similarity score and the other is a distance.
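The numbers above are consistent with that: each pair of values sums to ≈ 1.0 (e.g. 0.5928… + 0.4071…). A minimal sketch of the two quantities, with illustrative function names, assuming the SimSIMD binding reports a cosine *distance*:

```rust
/// Plain cosine *similarity*: dot(a, b) / (|a| * |b|), in [-1, 1].
fn cosine_similarity(a: &[f64], b: &[f64]) -> f64 {
    let ab: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let a2: f64 = a.iter().map(|x| x * x).sum();
    let b2: f64 = b.iter().map(|x| x * x).sum();
    ab / (a2.sqrt() * b2.sqrt())
}

/// Cosine *distance*: 1 − similarity. Comparing this against a plain
/// similarity explains the "x vs 1 − x" mismatch in the output above.
fn cosine_distance(a: &[f64], b: &[f64]) -> f64 {
    1.0 - cosine_similarity(a, b)
}
```

So to compare like with like, the benchmark should either convert the distance back to a similarity (1 − d) or compute a plain distance on the vanilla side.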

@GoWind
Contributor Author

GoWind commented Nov 27, 2024

Ah, I see. Now it makes sense:


SIMSIMD_INTERNAL simsimd_distance_t _simsimd_cos_normalize_f64_neon(simsimd_f64_t ab, simsimd_f64_t a2,
                                                                    simsimd_f64_t b2) {
    if (a2 == 0 && b2 == 0) return 0;
    if (ab == 0) return 1;
    simsimd_f64_t squares_arr[2] = {a2, b2};
    float64x2_t squares = vld1q_f64(squares_arr);
    ......
    rsqrts = vmulq_f64(rsqrts, vrsqrtsq_f64(vmulq_f64(squares, rsqrts), rsqrts));
    rsqrts = vmulq_f64(rsqrts, vrsqrtsq_f64(vmulq_f64(squares, rsqrts), rsqrts));
    vst1q_f64(squares_arr, rsqrts);
    simsimd_distance_t result = 1 - ab * squares_arr[0] * squares_arr[1];
    return result > 0 ? result : 0;
}
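A side note on the two refinement lines in that kernel: `vrsqrtsq_f64(x, r)` computes (3 − x·r)/2, so each `rsqrts = vmulq_f64(...)` line is one Newton–Raphson step r ← r·(3 − s·r²)/2 toward 1/√s, refining the initial reciprocal-square-root estimate. A scalar model of that iteration (the function name is mine, not SimSIMD's):

```rust
/// Scalar model of the NEON refinement above: each step computes
/// r ← r * (3 − s * r²) / 2, converging quadratically to 1/sqrt(s)
/// given a reasonable initial estimate r.
fn rsqrt_newton(s: f64, mut r: f64, steps: usize) -> f64 {
    for _ in 0..steps {
        r = r * (3.0 - s * r * r) / 2.0;
    }
    r
}
```

The kernel then forms 1 − ab·(1/√a2)·(1/√b2), i.e. one minus the cosine similarity, which is exactly the distance-vs-similarity offset observed above.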
