Amazing post! I hadn't thought this through much, but since you are normalizing the vectors and calculating the Euclidean distance, you will get the same results using a simple matmul: over normalized vectors, squared Euclidean distance is a linear transform of cosine distance, so both produce the same ranking.
Since you are just interested in the ranking, not the actual distance, you could also consider skipping the sqrt. Because sqrt is monotonic, this gives the same ranking, but will be a little faster.
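To make the equivalence concrete, here's a minimal NumPy sketch (the array names and sizes are made up for illustration): for unit vectors, ‖d − q‖² = 2 − 2·(d·q), so ranking by a single matmul (cosine similarity, descending) matches ranking by Euclidean distance (ascending), with or without the sqrt.

    import numpy as np

    rng = np.random.default_rng(0)
    docs = rng.normal(size=(1000, 64))                    # hypothetical doc embeddings
    docs /= np.linalg.norm(docs, axis=1, keepdims=True)   # L2-normalize rows
    q = rng.normal(size=64)
    q /= np.linalg.norm(q)                                # L2-normalize query

    cos_sim = docs @ q                                    # one matmul
    sq_dist = ((docs - q) ** 2).sum(axis=1)               # squared Euclidean, no sqrt
    dist = np.sqrt(sq_dist)                               # full Euclidean

    # For unit vectors: ||d - q||^2 = 2 - 2 * (d . q), a linear transform.
    assert np.allclose(sq_dist, 2 - 2 * cos_sim)
    # Descending cosine similarity == ascending Euclidean distance.
    assert np.array_equal(np.argsort(-cos_sim), np.argsort(dist))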
Using the phrase "without the benefit of hindsight" is interesting. The hardest thing with any technology is knowing when to spend the effort/money on applying it. The real question is: do you want to spend your innovation tokens on things like this? If so, how many? And where?
Not knocking this, just saying that it is easy to claim improvements if you know there are improvements to be had.
I recognize it as a quote from Brian Eno's A Year With Swollen Appendices, which is a great read even if you aren't an Eno fan (although I am, which admittedly makes me biased :P).
I don't really believe this is a paradigm shift with regard to train/test splits.
Before LLMs you would do a lot of these things; it's just become much easier to get started without training a model yourself. What the author describes is very similar to the standard ML product loop in companies, including how difficult it is to "beat" the incumbent model, because the incumbent has effectively been overfit to the very test set used to compare it against your own model.
“Normal search” is generally called BM25 in retrieval papers. Many, if not all, retrieval papers about modeling will use or list BM25 as a baseline. Hope this helps!
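For readers who haven't seen it, here is a rough sketch of the classic BM25 scoring function in Python (k1 and b are set to common default values; this is a toy version for illustration, real systems like Lucene/Elasticsearch, or a library such as rank_bm25, add proper tokenization, stemming, and indexing):

    import math
    from collections import Counter

    def bm25_scores(query, docs, k1=1.5, b=0.75):
        """Score whitespace-tokenized docs against a query with classic BM25."""
        tokenized = [doc.lower().split() for doc in docs]
        N = len(tokenized)
        avgdl = sum(len(d) for d in tokenized) / N
        # Document frequency: in how many docs each term appears.
        df = Counter(term for d in tokenized for term in set(d))

        scores = []
        for d in tokenized:
            tf = Counter(d)
            score = 0.0
            for term in query.lower().split():
                if term not in tf:
                    continue
                idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
                score += idf * tf[term] * (k1 + 1) / (
                    tf[term] + k1 * (1 - b + b * len(d) / avgdl)
                )
            scores.append(score)
        return scores

    docs = ["the cat sat on the mat", "dogs and cats", "a treatise on information retrieval"]
    print(bm25_scores("cat retrieval", docs))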
I fully agree, except that I think this will still be a very “power user” thing. Perhaps that's also what you mean, since you reference Linux. But traditional search will be very important for a long while yet, imo.