For what it's worth, I wonder the same thing, and I think it's not as obvious as others suggest. E.g., if you have an autoencoder for a one-hot encoding, you're essentially learning a pair of nonlinear maps that approximately invert each other and that map a high-dimensional space to a low-dimensional one. You could imagine that it could instead learn something like a dense bit packing with a QAM Gray code[0]. In a one-hot encoding the dot product between any two distinct tokens is zero, so your transformations can't be learning to preserve it.
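To make that concrete, here's a minimal sketch (with a made-up toy vocabulary) showing that one-hot dot products carry no similarity structure at all: every pair of distinct tokens is orthogonal, no matter how related the words are.

```python
# Hypothetical toy vocabulary; with one-hot encodings, "cat" is exactly as
# far from "kitten" as it is from "carburetor".
vocab = ["cat", "kitten", "carburetor"]

def one_hot(token):
    # Standard one-hot: a vector with a single 1 at the token's index.
    return [1.0 if t == token else 0.0 for t in vocab]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

print(dot(one_hot("cat"), one_hot("kitten")))      # 0.0
print(dot(one_hot("cat"), one_hot("carburetor")))  # 0.0
print(dot(one_hot("cat"), one_hot("cat")))         # 1.0
```

So there is literally nothing for the encoder/decoder pair to "preserve" in the input geometry; any similarity structure has to come from somewhere else.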
Somewhat naively, I might speculate that for, e.g., sequence prediction, even if you had some efficient packing of the space like that to maximally separate individual tokens, it's still advantageous to learn an encoding in which synonyms are clustered, so that a single error doesn't cause mispredictions for the rest of the sequence.
I suppose the point, then, is that the structure exists in the latent space of language itself, and your coordinate maps pull it back to your naive encoding rather than preserving a structure that exists a priori on the naive encoding. I.e., you can't take dot products in the two spaces and expect them to be related; you need to map forward into latent space and take the dot product there, and that defines a (nonlinear) measure of similarity on your original space. Then the question is why the latent space has geometry, and there I guess the point is that it's not maximally information dense, so the geometry exists in the redundancy. So perhaps it is obvious after all!
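The "pull back through the encoder" idea can be sketched in a few lines. This is purely illustrative: the encoder `f` here is just a fixed random nonlinear map standing in for a trained network, not anything learned.

```python
import math
import random

# Stand-in "encoder": a random linear map followed by tanh. A real encoder
# would be trained; this only illustrates the shape of the construction.
random.seed(0)
DIM_IN, DIM_LATENT = 8, 3
W = [[random.gauss(0, 1) for _ in range(DIM_IN)] for _ in range(DIM_LATENT)]

def f(x):
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]

def latent_similarity(x, y):
    # Pulled-back similarity: map forward into latent space, then take the
    # dot product there. This is a nonlinear similarity on the input space.
    fx, fy = f(x), f(y)
    return sum(a * b for a, b in zip(fx, fy))

e0 = [1.0 if i == 0 else 0.0 for i in range(DIM_IN)]
e1 = [1.0 if i == 1 else 0.0 for i in range(DIM_IN)]
print(sum(a * b for a, b in zip(e0, e1)))  # 0.0 in the naive space
print(latent_similarity(e0, e1))           # generally nonzero in latent space
```

The point is just that the two similarity measures are different objects: the naive dot product is identically zero across distinct one-hot vectors, while the pulled-back one can be anything the map makes it.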
I think my comment was not worded properly. I was thinking "geometric properties = linear properties"; what I really should have asked is:
Why does the latent space have geometric properties such that we can use functions like cosine similarity to compare points?
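For concreteness, this is the cosine similarity in question: it compares two latent vectors by the angle between them, ignoring magnitude.

```python
import math

def cosine_similarity(u, v):
    # cos(theta) = <u, v> / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

The question, then, is why a trained network's latent vectors end up arranged so that this angle is semantically meaningful at all.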
So during training, the signal is mapped to a latent space that minimizes the error of the objective function as much as possible.
Many applications already use a cosine similarity function at the end of the network, so it's obvious why those work. I reviewed other cost functions such as triplet loss; they use Euclidean distances, so I guess it makes sense why the geometric properties exist there too.
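A quick sketch of the standard triplet loss makes this point visible: the loss is defined directly in terms of Euclidean distances in the embedding space, so minimizing it forces the geometry to be meaningful (anchors end up near positives and far from negatives). The embeddings below are toy values, not outputs of any real model.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Standard hinge form: push the anchor at least `margin` closer to the
    # positive than to the negative; zero loss once that constraint holds.
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Toy embeddings: positive near the anchor, negative far away.
print(triplet_loss([0.0, 0.0], [0.1, 0.0], [3.0, 0.0]))  # 0.0 (constraint satisfied)
print(triplet_loss([0.0, 0.0], [2.0, 0.0], [2.1, 0.0]))  # positive loss
```

Because the objective only ever "sees" the embeddings through distances, distance is exactly the thing training shapes.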
Regarding "and there I guess the point is it's not maximally information dense, so the geometry exists in the redundancy": what does "maximally information dense" mean? I still don't quite get it.
[0] https://en.wikipedia.org/wiki/File:16QAM_Gray_Coded.svg