
I'd like to implement the same paper. Perhaps I'm missing something, but I'm not sure how the residual strings are created. Do you have a link to an implementation or a description of the residual strings?

I get that a residual string is the original string with a deletion, incrementing the deletions until you hit edit distance d. What I'm not sure about is if it's all permutations of possible deletions.



The residual strings are all subwords where exactly d letters were deleted. For d=1 and the word "Levenshtein", that would be {"evenshtein", "Lvenshtein", "Leenshtein", "Levnshtein", "Leveshtein", "Levenhtein", "Levenstein", "Levenshein", "Levenshtin", "Levenshten", "Levenshtei"}.

The paper does not specify how to generate those efficiently, and I haven't given it any thought yet. I don't know of any implementations of the paper, but generating deletion neighbourhoods is a common enough subproblem.

EDIT: sorry, didn't read your comment fully. I'm not sure what you mean by "all permutations of possible deletions". The d-deletion neighbourhood of w contains all subwords of w that you obtain by deleting any d letters from w. For d=2, take any two letters and remove them: N₂(jamra) = {jam, jar, jaa, jmr, jma, jra, amr, ama, ara, mra} (all C(5,2) = 10 of them).

Does that make it clearer?
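Since neither the paper nor the thread gives code, here's a minimal sketch of generating the d-deletion neighbourhood in Python (the function name is my own; it just enumerates all ways to drop d index positions and deduplicates with a set):

```python
from itertools import combinations

def deletion_neighbourhood(word, d):
    """All distinct subwords of `word` obtained by deleting exactly d letters."""
    n = len(word)
    return {"".join(word[i] for i in range(n) if i not in dropped)
            for dropped in combinations(range(n), d)}
```

For example, deletion_neighbourhood("jamra", 2) yields the ten 3-letter subwords above, and deletion_neighbourhood("Levenshtein", 1) yields the eleven strings from the earlier comment. Note the set comprehension matters: repeated letters (like the two a's in "jamra") can produce the same subword via different deletions.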


Yes, that makes it much clearer. I also found a FastSS implementation, which uses the same d-deletion neighborhood. Here it is: http://fastss.csg.uzh.ch

I'm looking at a Python implementation for examples.


Nice, that seems to be based on a similar idea to the paper I mentioned (though earlier and less refined).


The paper you mentioned reduces memory consumption substantially and evens out query and insertion times. It's a good improvement.



