I would say one weakness of the paper is that they primarily compare performance against an LSTM (a simpler recurrent model), rather than against comparable attention / diffusion models. I would be curious how well a model that simply stacks N attention layers between input and output would perform on these tasks, using the same recursive time-stepped rollout. My guess is that performance would be very similar, and the network architecture would also be quite similar (although a true transformer is somewhat different from the input attention + U-Net that they employ).
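
For concreteness, here is a minimal sketch of the kind of baseline I have in mind: the same stack of self-attention layers applied once per time step, with each prediction fed back in as the next input. This is my own illustration, not the paper's architecture; the class name, dimensions, and residual-update parameterization are all assumptions.

```python
# Hypothetical sketch (not the paper's model): N stacked self-attention layers
# applied recursively, one forward pass per simulation time step.
import torch
import torch.nn as nn

class RecursiveAttentionStepper(nn.Module):
    """Roll a state forward in time by repeatedly applying the same
    stack of self-attention layers (one call per step)."""

    def __init__(self, state_dim: int, model_dim: int = 128,
                 num_layers: int = 4, num_heads: int = 4):
        super().__init__()
        self.encode = nn.Linear(state_dim, model_dim)   # per-node state -> features
        layer = nn.TransformerEncoderLayer(
            d_model=model_dim, nhead=num_heads,
            dim_feedforward=4 * model_dim, batch_first=True)
        self.attn_stack = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.decode = nn.Linear(model_dim, state_dim)   # features -> state update

    def step(self, state: torch.Tensor) -> torch.Tensor:
        # state: (batch, num_nodes, state_dim); predict a residual update
        h = self.attn_stack(self.encode(state))
        return state + self.decode(h)

    def forward(self, state: torch.Tensor, num_steps: int) -> torch.Tensor:
        # Recursive time-stepping: feed each prediction back in as the next input.
        trajectory = []
        for _ in range(num_steps):
            state = self.step(state)
            trajectory.append(state)
        return torch.stack(trajectory, dim=1)  # (batch, num_steps, num_nodes, state_dim)

# Example rollout on random data (shapes are placeholders)
model = RecursiveAttentionStepper(state_dim=3)
x0 = torch.randn(2, 64, 3)            # 2 samples, 64 nodes, 3 state channels
rollout = model(x0, num_steps=10)
print(rollout.shape)                  # torch.Size([2, 10, 64, 3])
```

Comparing something like this head-to-head with the proposed model would make it clearer how much of the gain comes from the specific input attention + U-Net design versus attention-based mixing in general.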