
WIP: Add benchmark for tfidf #239

Open
wants to merge 5 commits into main
Conversation

GoWind
Contributor

@GoWind GoWind commented Nov 23, 2024

Add a vanilla Rust program that calculates tf-idf scores and returns the top 10 most similar documents.

We use cosine similarity to measure a document's relevance to a given query.

@ashvardanian, thanks for your time!
I wanted to check the approach with you, to see whether it makes sense as a valid benchmark.

Use the leipzig1m and the XL-Sum datasets as corpora. I assume that every 10,000 lines in the `leipzig1m` dataset form a single document, and then calculate the tf-idf score for each (term, document) pair in the corpus.

  1. Given a query, calculate its tf-idf scores over the same corpus (treating the query as a separate document).

  2. The next step is to calculate the cosine score. I couldn't find a good source describing how to build the score vector for a query or a document (Claude gave me a fn that I could possibly use).

  3. Once we have the cosine similarity scores, sort them and fetch the top 10.

For benchmarking SimSIMD, I assume the cosine part is where I can use methods from SimSIMD and benchmark them against the vanilla implementation?
As for the queries: in memchr vs stringzilla, you basically pick a random set of tokens and then benchmark searching from the left and from the right using memchr and stringzilla. I was thinking of doing something similar: picking terms at random from the corpus and constructing random queries to benchmark.
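The three steps above can be sketched in plain Rust. This is only an illustrative baseline, not the PR's code: the names `term_frequencies`, `cosine_similarity`, and `top_k` are mine, and the sketch computes raw term frequencies without the idf weighting for brevity.

```rust
use std::collections::HashMap;

/// Normalized term-frequency map for one document (or query).
/// Real tf-idf would multiply each entry by the term's idf weight.
fn term_frequencies(tokens: &[&str]) -> HashMap<String, f64> {
    let mut tf = HashMap::new();
    for t in tokens {
        *tf.entry(t.to_string()).or_insert(0.0) += 1.0;
    }
    let n = tokens.len() as f64;
    tf.values_mut().for_each(|v| *v /= n);
    tf
}

/// Dense cosine similarity between two equal-length vectors:
/// dot(a, b) / (|a| * |b|). Returns None for an all-zero vector.
fn cosine_similarity(a: &[f64], b: &[f64]) -> Option<f64> {
    let (mut ab, mut a2, mut b2) = (0.0, 0.0, 0.0);
    for (x, y) in a.iter().zip(b) {
        ab += x * y;
        a2 += x * x;
        b2 += y * y;
    }
    if a2 == 0.0 || b2 == 0.0 {
        return None;
    }
    Some(ab / (a2.sqrt() * b2.sqrt()))
}

/// Score every document against the query, sort descending, keep the top k.
fn top_k(query: &[f64], docs: &[Vec<f64>], k: usize) -> Vec<(usize, f64)> {
    let mut scored: Vec<(usize, f64)> = docs
        .iter()
        .enumerate()
        .filter_map(|(i, d)| cosine_similarity(query, d).map(|s| (i, s)))
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.truncate(k);
    scored
}
```

The dense `cosine_similarity` here is the part a SimSIMD kernel would replace in the benchmarked variant.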

Add a benchmark that calculates tf-idf scores and fetches the top 10 similar documents.

We use cosine similarity to measure a document's relevance to a given query.
@ashvardanian
Owner

Hi @GoWind! The biggest SimSIMD improvement should come from the Sparse kernels. Have you managed to use them in your TFIDF implementation?

@GoWind
Contributor Author

GoWind commented Nov 23, 2024

@ashvardanian, not yet. I am still trying to figure out what the ideal benchmark could be (and also learning how to use TF-IDF).

In scikit-learn, the TfidfVectorizer creates a 2D array for the documents, where each row is a document and there is a column for each "token" in the corpus, with
array[row][column] = frequency of the token in the document

Similarly, we can tokenize the query and intersect the query (as a row vector) with each document in our corpus to get the intersection size.
A match in the intersection would be when the frequency of a unique token in the query equals the frequency of the same term in the document (something like using simsimd_intersect). Is that what you had in mind for the sparse kernels?
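For intuition, the intersection step can be sketched as a scalar two-pointer walk over two sorted token-id arrays; this is only the operation such a sparse kernel accelerates, and the actual `simsimd_intersect` signature may differ from this sketch.

```rust
use std::cmp::Ordering;

/// Count the ids present in both sorted slices (each id unique per slice).
/// This scalar two-pointer loop is the baseline that a SIMD sparse
/// intersection kernel speeds up.
fn intersection_size(a: &[u32], b: &[u32]) -> usize {
    let (mut i, mut j, mut count) = (0, 0, 0);
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                count += 1;
                i += 1;
                j += 1;
            }
        }
    }
    count
}
```

In the sparse representation, a query and a document would each be a sorted array of token ids (plus a parallel array of weights for the weighted-dot-product variants).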

@ashvardanian
Owner

@GoWind, the intersect function may not be the only one you need. Also look into: spdot_counts and spdot_weights 🤗

@GoWind
Contributor Author

GoWind commented Nov 23, 2024

Will do, thanks for the pointers! I'm also reading through the scikit-learn implementation to see how I can possibly do this :)

@GovindarajanNagarajan-TomTom
Contributor

Hi @ashvardanian, making progress on the tf-idf-based similarity calculator.
I noticed a discrepancy when calculating the cosine similarity for a Vec of f64s via the Rust bindings:
the plain-Rust cosine values match what I get from both numpy and the SimSIMD bindings via Python.
For the implementation, I compute a vector of tf-idf values per query and then compute the cosine similarity between the query and each document, based on the answers to the SO question.

Here is how I prepared the script (I added a few hacks to test quickly and will turn the scripts into a [[bench]] in subsequent commits):

head -n 10000 leipzig1m.txt > leipzig10000.txt
cargo run --bin tfidf leipzig10000.txt

I took the first 10k lines and batched them into 10 documents of 1k lines each.
The (hardcoded) query in the script is transformed into a vector representation, and I compute the cosine similarities:

Similarity for document via simsimd 0: Some(0.5928754241421308)
Similarity for document via plain cosine similarity 0: Some(0.40712457585786876)
Similarity for document via simsimd 1: Some(0.5993249839897541)
Similarity for document via plain cosine similarity 1: Some(0.4006750160102468)
Similarity for document via simsimd 2: Some(0.5914559242162761)
Similarity for document via plain cosine similarity 2: Some(0.408544075783724)
Similarity for document via simsimd 3: Some(0.5998267820476098)
Similarity for document via plain cosine similarity 3: Some(0.40017321795239075)
Similarity for document via simsimd 4: Some(0.5906444006555799)
Similarity for document via plain cosine similarity 4: Some(0.40935559934442023)
Similarity for document via simsimd 5: Some(0.5902192553116478)
Similarity for document via plain cosine similarity 5: Some(0.4097807446883521)
Similarity for document via simsimd 6: Some(0.5943923707602529)
Similarity for document via plain cosine similarity 6: Some(0.4056076292397477)
Similarity for document via simsimd 7: Some(0.6028015678055032)
Similarity for document via plain cosine similarity 7: Some(0.3971984321944968)
Similarity for document via simsimd 8: Some(0.5957380843868555)
Similarity for document via plain cosine similarity 8: Some(0.4042619156131435)
Similarity for document via simsimd 9: Some(0.5913356879962984)
Similarity for document via plain cosine similarity 9: Some(0.4086643120037022)

Not sure if I am doing something wrong, but could there be some sort of discrepancy here?

@ashvardanian
Owner

Seems like one is x and the other is 1 − x: one is a similarity score and the other is a distance.
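The numbers above are consistent with that: each pair of values sums to ≈ 1.0 (e.g. 0.5928… + 0.4071…). A minimal sketch of the two quantities, with illustrative function names, assuming the SimSIMD binding reports a cosine *distance*:

```rust
/// Plain cosine *similarity*: dot(a, b) / (|a| * |b|), in [-1, 1].
fn cosine_similarity(a: &[f64], b: &[f64]) -> f64 {
    let ab: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let a2: f64 = a.iter().map(|x| x * x).sum();
    let b2: f64 = b.iter().map(|x| x * x).sum();
    ab / (a2.sqrt() * b2.sqrt())
}

/// Cosine *distance*: 1 − similarity. Comparing this against a plain
/// similarity explains the "x vs 1 − x" mismatch in the output above.
fn cosine_distance(a: &[f64], b: &[f64]) -> f64 {
    1.0 - cosine_similarity(a, b)
}
```

So to compare like with like, the benchmark should either convert the distance back to a similarity (1 − d) or compute a plain distance on the vanilla side.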

@GoWind
Contributor Author

GoWind commented Nov 27, 2024

Ah, I see. Now it makes sense:


SIMSIMD_INTERNAL simsimd_distance_t _simsimd_cos_normalize_f64_neon(simsimd_f64_t ab, simsimd_f64_t a2,
                                                                    simsimd_f64_t b2) {
    if (a2 == 0 && b2 == 0) return 0;
    if (ab == 0) return 1;
    simsimd_f64_t squares_arr[2] = {a2, b2};
    float64x2_t squares = vld1q_f64(squares_arr);
    ......
    rsqrts = vmulq_f64(rsqrts, vrsqrtsq_f64(vmulq_f64(squares, rsqrts), rsqrts));
    rsqrts = vmulq_f64(rsqrts, vrsqrtsq_f64(vmulq_f64(squares, rsqrts), rsqrts));
    vst1q_f64(squares_arr, rsqrts);
    simsimd_distance_t result = 1 - ab * squares_arr[0] * squares_arr[1];
    return result > 0 ? result : 0;
}
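A side note on the two refinement lines in that kernel: `vrsqrtsq_f64(x, r)` computes (3 − x·r)/2, so each `rsqrts = vmulq_f64(...)` line is one Newton–Raphson step r ← r·(3 − s·r²)/2 toward 1/√s, refining the initial reciprocal-square-root estimate. A scalar model of that iteration (the function name is mine, not SimSIMD's):

```rust
/// Scalar model of the NEON refinement above: each step computes
/// r ← r * (3 − s * r²) / 2, converging quadratically to 1/sqrt(s)
/// given a reasonable initial estimate r.
fn rsqrt_newton(s: f64, mut r: f64, steps: usize) -> f64 {
    for _ in 0..steps {
        r = r * (3.0 - s * r * r) / 2.0;
    }
    r
}
```

The kernel then forms 1 − ab·(1/√a2)·(1/√b2), i.e. one minus the cosine similarity, which is exactly the distance-vs-similarity offset observed above.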
