They built a specialized model which, after a bunch of trickery, still has 99% accuracy (a naive model had very low accuracy) on a very simple deterministic algorithm. I also think most of the accuracy came from memorization of the training set (the model didn't provide intermediate results, and started failing significantly at slightly larger inputs). In my book that is a fundamental inability to learn and reproduce an algorithm.
They also demonstrated that transformers can't learn sorting.
Their method does not require building specialized models from scratch (you can, but you don't have to), and they did not prove that transformers can't learn sorting. If you think they did, then you don't understand what it means to prove something.
In my book, what they built (specialized training data + a specialized embeddings format) is exactly a specialized model. You can of course disagree and say again that I don't understand something, but then the discussion will be over.
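To be concrete, this is roughly what I mean by a specialized embeddings format: the model is handed the digit structure of the input instead of having to discover it. A minimal sketch, assuming an extra per-digit position embedding of the kind the paper describes (all names and shapes here are mine, not theirs):

```python
import torch
import torch.nn as nn

# Sketch of a "specialized embeddings format" for digit-level arithmetic:
# besides the usual token embedding, each digit token gets an embedding
# keyed to its position *within its number* (least significant digit = 0).
# Illustrative only, not the paper's actual code.
class DigitTokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, max_digits: int, d_model: int):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.digit_pos_emb = nn.Embedding(max_digits, d_model)

    def forward(self, tokens: torch.Tensor, digit_pos: torch.Tensor) -> torch.Tensor:
        # tokens:    (batch, seq) token ids
        # digit_pos: (batch, seq) position of each digit inside its number
        return self.token_emb(tokens) + self.digit_pos_emb(digit_pos)
```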
A poor result when testing one model is not proof that the architecture behind the model is incapable of getting good results. It's just that simple. In the same way, seeing the OG GPT-3 fail at chess was not proof that LLMs can't play chess.
This:

> They also demonstrated that transformers can't learn sorting.

> No it didn't. They tested up to 100 digits with very high accuracy. I don't think you even read the abstract of this, never mind the actual paper.
The paper reports two OOD (out of distribution) accuracies: OOD (up to 100 digits) and 100+ OOD (100-160 digits). The 100+ OOD accuracy is significantly worse: around 30%.
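For clarity on what those two numbers measure, here is a rough sketch of the evaluation split. Only the 100 and 160 digit bucket boundaries come from the paper; the example generator, the ~20-digit train cutoff, and `dummy_is_correct` are placeholders of mine:

```python
import random

def random_addition_prompt(n_digits: int) -> tuple[str, str]:
    # Uniform n-digit operands; purely illustrative of the eval setup.
    a = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
    b = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
    return f"{a}+{b}=", str(a + b)

def bucket_accuracy(is_correct, digit_lengths, trials: int = 1000) -> float:
    # Sample problems from one length bucket and measure accuracy.
    hits = 0
    for _ in range(trials):
        prompt, answer = random_addition_prompt(random.choice(digit_lengths))
        hits += int(is_correct(prompt, answer))
    return hits / trials

# Stand-in for a real model call, just so the sketch runs end to end.
def dummy_is_correct(prompt: str, answer: str) -> bool:
    return False

ood_lengths = list(range(21, 101))           # "OOD": up to 100 digits (assumes ~20-digit train cutoff)
ood_100plus_lengths = list(range(100, 161))  # "100+ OOD": 100-160 digits, where accuracy reportedly drops to ~30%

print(bucket_accuracy(dummy_is_correct, ood_lengths))
print(bucket_accuracy(dummy_is_correct, ood_100plus_lengths))
```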