This is sweet. Image matching is one of those things that external services such as Google Image Search and TinEye do extremely well for the consumer, but not for those who need such a service in a private domain, such as an in-house photo library. I've used pHash for comparisons and have a decent idea of how to build my own classifiers, but pretty much no idea how to do reverse-image matching efficiently and in a structured way.
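For context, the pairwise comparison itself is the easy part. A minimal sketch, assuming the perceptual hashes have already been computed and are stored as 64-bit integers (the function names and the threshold of 10 are illustrative assumptions, not part of pHash):

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the bit positions in which two integer hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


def is_near_duplicate(hash_a: int, hash_b: int, max_distance: int = 10) -> bool:
    # max_distance is an illustrative threshold (assumption); tune it on your own data.
    return hamming_distance(hash_a, hash_b) <= max_distance
```

Comparing one query hash against every stored hash this way is a linear scan, which is exactly the part that stops scaling for a large library.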
You might want to look at MoreLikeThis queries to boost performance. I worked on a proprietary version of this, and at the time Lucene performance dropped off nearly linearly with the number of query terms.
We used MoreLikeThis to cut each query down to the 30-40 most statistically interesting terms. The one hiccup was an issue in Lucene [1] where the term cache wasn't operating properly. We just added our own image query term cache and a custom MLT query to leverage it, which gave us a 10x speed bump over any other method we tried.
The interestingness of the terms is assessed on a per-term basis though, so you might see a relevance drop for some types of images if you set MoreLikeThis to use too few terms.
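In Elasticsearch terms, capping the term count might look roughly like the sketch below. It only builds the request body for a `more_like_this` query; the field name `simple_words` and the helper `build_mlt_query` are assumptions for illustration, and it presumes the image signature words are indexed as whitespace-separated tokens in one text field.

```python
import json


def build_mlt_query(signature_words, field="simple_words", max_query_terms=35):
    """Build a more_like_this request body that caps the number of query terms.

    `field` is a hypothetical field name; adjust it to your own mapping.
    """
    return {
        "query": {
            "more_like_this": {
                "fields": [field],
                # Free text to match against; here the query image's words.
                "like": " ".join(signature_words),
                # Keep only the N most "interesting" terms.
                "max_query_terms": max_query_terms,
                # Lower the frequency cutoffs so rare image words are not dropped.
                "min_term_freq": 1,
                "min_doc_freq": 1,
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_mlt_query(["w1_1024", "w2_77", "w3_5"]), indent=2))
```

Setting `max_query_terms` too low is where the relevance drop mentioned above would show up, so it is worth measuring recall at a few values before settling on one.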
Thank you for the suggestion. I actually did try restricting the terms by measuring correlation between columns -- the idea being that more discriminating terms should be searched first. This did result in modest speedups.
Fortunately or unfortunately, we were already achieving pretty good speed with Elasticsearch so we didn't implement it. However, it didn't occur to me to try a MoreLikeThis query, which should be even simpler -- I will look into it!
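One way to read "measuring correlation between columns", as a rough sketch rather than the code that was actually used: treat each signature word position as a column, and rank columns so that the ones least correlated with the rest are searched first. The function names and the ranking rule (mean absolute Pearson correlation) are assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_columns(rows):
    """Return column indices ordered from least to most correlated with the others.

    Caveat: a constant column scores 0.0 and ranks first even though it
    discriminates nothing, so filter such columns out beforehand.
    """
    columns = list(zip(*rows))
    scores = []
    for i, col in enumerate(columns):
        others = [abs(pearson(col, other))
                  for j, other in enumerate(columns) if j != i]
        mean_corr = sum(others) / len(others) if others else 0.0
        scores.append((mean_corr, i))
    return [i for _, i in sorted(scores)]
```

For example, `rank_columns([(1, 2, 5), (2, 4, 1), (3, 6, 4), (4, 8, 2)])` puts column 2 first, because columns 0 and 1 move in lockstep and so carry largely redundant information.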
Cool. Impressive project by the way; I forgot to say that before.
I tried something similar, but with a different approach: creating compound words, a bit like n-grams. I didn't get it working, as it was a side project and I couldn't commit enough time.
Last time I looked into content-based image matching in Node.js, the best I could find was a node-phash fork that was difficult to get working on OS X.
It appears that since your comment was posted, the original repo's README has been updated with a notice stating that it is no longer maintained and that OP's linked repo is the correct one.
Hey, author here. We were using the code internally and it was kind of a mess, so I fixed it up under my own repository before forking it into ascribe's repository.
FWIW, John Resig uses pastec for his work:
http://ejohn.org/blog/image-similarity-search-wanted/
http://ryanfb.github.io/etc/2015/11/03/finding_near-matches_...