They built a specialized model which, after a bunch of trickery, still has 99% accuracy (a naive model had very low accuracy) on a very simple deterministic algorithm. I also think most of the accuracy came from memorization of the training set (the model didn't provide intermediate results, and started failing significantly at slightly larger inputs). In my book that is a fundamental inability to learn and reproduce an algorithm.
They also demonstrated that transformers can't learn sorting.
Their method does not require building specialized models from scratch (you can, but you don't have to), and they did not prove that transformers can't learn sorting. If you think they did, then you don't understand what it means to prove something.
In my book, what they built (specialized training data + a specialized embeddings format) is exactly a specialized model. You can of course disagree and say again that I don't understand something, but then the discussion will be over.
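To be concrete, this is roughly what I mean by a specialized embeddings format: the model is handed the digit structure of the input instead of having to discover it. A minimal sketch, assuming an extra per-digit position embedding of the kind the paper describes (all names and shapes here are mine, not theirs):

```python
import torch
import torch.nn as nn

# Sketch of a "specialized embeddings format" for digit-level arithmetic:
# besides the usual token embedding, each digit token gets an embedding
# keyed to its position *within its number* (least significant digit = 0).
# Illustrative only, not the paper's actual code.
class DigitTokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, max_digits: int, d_model: int):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.digit_pos_emb = nn.Embedding(max_digits, d_model)

    def forward(self, tokens: torch.Tensor, digit_pos: torch.Tensor) -> torch.Tensor:
        # tokens:    (batch, seq) token ids
        # digit_pos: (batch, seq) position of each digit inside its number
        return self.token_emb(tokens) + self.digit_pos_emb(digit_pos)
```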
A poor result when testing one model is not proof that the architecture behind the model is incapable of getting good results. It's just that simple. In the same way, seeing the OG GPT-3 fail at chess was not proof that LLMs can't play chess.
This:

> They also demonstrated that transformers can't learn sorting.

> No it didn't. They tested up to 100 digits with very high accuracy. I don't think you even read the abstract of this, never mind the actual paper.
The paper reports two OOD (out of distribution) accuracies: OOD (up to 100 digits) and 100+ OOD (100-160 digits). The 100+ OOD accuracy is significantly worse: around 30%.
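For clarity on what those two numbers measure, here is a rough sketch of the evaluation split. Only the 100 and 160 digit bucket boundaries come from the paper; the example generator, the ~20-digit train cutoff, and `dummy_is_correct` are placeholders of mine:

```python
import random

def random_addition_prompt(n_digits: int) -> tuple[str, str]:
    # Uniform n-digit operands; purely illustrative of the eval setup.
    a = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
    b = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
    return f"{a}+{b}=", str(a + b)

def bucket_accuracy(is_correct, digit_lengths, trials: int = 1000) -> float:
    # Sample problems from one length bucket and measure accuracy.
    hits = 0
    for _ in range(trials):
        prompt, answer = random_addition_prompt(random.choice(digit_lengths))
        hits += int(is_correct(prompt, answer))
    return hits / trials

# Stand-in for a real model call, just so the sketch runs end to end.
def dummy_is_correct(prompt: str, answer: str) -> bool:
    return False

ood_lengths = list(range(21, 101))           # "OOD": up to 100 digits (assumes ~20-digit train cutoff)
ood_100plus_lengths = list(range(100, 161))  # "100+ OOD": 100-160 digits, where accuracy reportedly drops to ~30%

print(bucket_accuracy(dummy_is_correct, ood_lengths))
print(bucket_accuracy(dummy_is_correct, ood_100plus_lengths))
```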