By Matt Anderson. 2016-2017
Levenshtein_search is a Python module that stores any number of documents as ternary search trees. It performs fuzzy searches for words in a document that are d distance away from a query word. Searches can also be used in conjunction with TF-IDF calculations. The term frequency (TF) is computed for each approximately matching word in the document, as well as the Levenshtein distance from your query word. The module was written in C and increases search speed by using graph search algorithms and minimizing the number of redundant comparisons.
import Levenshtein_search
excerpt1 = ["We","went","to","the","fire","Mother","said","Is","he","cold","Versh","Nome","Versh","said","Take","his","overcoat","and","overshoes","off","Mother","said","How","many","times","do","I","have","to","tell","you","not","to","bring","him","into","the","house","with","his","overshoes","on"]
excerpt2 = ["Yessum","Versh","said","Hold","still","now","He","took","my","overshoes","off","and","unbuttoned","my","coat","Caddy","said","Wait","Versh","Cant","he","go","out","again","Mother","I","want","him","to","go","with","me","Youd","better","leave","him","here","Uncle","Maury","said","Hes","been","out","enough","today"]
first_wordset = Levenshtein_search.populate_wordset(-1,excerpt1)
# first_wordset = 0
last_wordset = Levenshtein_search.populate_wordset(-1,excerpt2)
# last_wordset = 1
The module accepts documents as Python lists of strings. To create a new document and give it a set of words, use the populate_wordset(x,excerpt1) function where x is an integer representing the document's index. If you would like the new document's index to be assigned, x should equal -1 and the function will return the new document's index, starting with 0. If you would like to add words to a preexisting document, x should equal that document's index. In the example above, excerpt1 is the Python list of strings.
q = "overcoat"
maxdist = 4
results1 = Levenshtein_search.lookup(first_wordset,q,maxdist)
results2 = Levenshtein_search.lookup(last_wordset,q,maxdist)
To search a document for your query word, use the lookup(x,q,maxdist) function where x is a non-negative integer representing the document's index, q is a string representing your query word, and maxdist is a non-negative integer representing the maximum allowable Levenshtein distance from your query word.
idx = Levenshtein_search.add_string(first_wordset,"coat")
rm_idx = Levenshtein_search.remove_string(first_wordset,"coat")
# (idx==rm_idx) is true
Levenshtein_search.remove_string(first_wordset,q)
results3 = Levenshtein_search.lookup(first_wordset,q,maxdist)
You can add or remove words from the document on an individual basis, using add_string and remove_string functions. The first argument is the document's index and the second is the word to add or remove.
Levenshtein_search.clear_wordset(last_wordset)
To clear a document from memory, use the clear_wordset(x) function where x is a non-negative integer representing the document's index. After clearing a document, x, documents retain their previous index (this is different from v1.3).
print("Query word: %s" % q)
for i in range(0,len(results1)):
print("%s" % results1[i][0]+": "+ str(results1[i][1])+", "+str(results1[i][2]))
print("Query word: %s" % q)
for i in range(0,len(results2)):
print("%s" % results2[i][0]+": "+ str(results2[i][1])+", "+str(results2[i][2]))
print("Query word: %s" % q)
for i in range(0,len(results3)):
print("%s" % results3[i][0]+": "+ str(results3[i][1])+", "+str(results3[i][2]))
The output of the lookup() function is a list of lists equal in length to the number of unique words returned by the search. For each unique word, the function retrieves the Levenshtein distance from your query word, and the term frequency of that word in the document.
Query word: overcoat
overcoat: 0, 0.0238095238095
overshoes: 4, 0.047619047619
Query word: overcoat
coat: 4, 0.0222222222222
overshoes: 4, 0.0222222222222
Query word: overcoat
overshoes: 4, 0.0487804878049
If there is any word in the results that is the same as your query word, it is guaranteed to be the first item of the result list. All other words are in no particular order.
pip install Levenshtein-search
In a timed test of this module with Levenshtein edit distance of 2, it was approx. 10x faster than using an equivalent search in PostgreSQL.
select name from restaurant_nophone_training where levenshtein_less_equal(name, '"philippe the original"', 2) <= 2;
0.00198695853199 sec
[('"philippe\'s the original"',), ('"philippe the original"',)]
Levenshtein_search algorithm:
0.00019705373871 sec
[['"philippe the original"', 0, 0.0011574074074074073], ['"philippe\'s the original"', 2, 0.0011574074074074073]]
Specialized O(number of nodes) algorithm for large edit distances