
Even if one buys the (demonstrably false) claim that all real-world problems are sparse, most sparse techniques (especially sparse-direct methods and, to a lesser degree, Krylov subspace methods) boil down to dense linear algebra on smaller matrices. When executing dense linear algebra on accelerators, it can be surprising just how carefully one must organize the computation to make the best use of the memory hierarchy. When I was writing these kinds of routines years ago, it was often beneficial to pre-/post-process certain operations, such as C = A^T B^T, by explicitly transposing either an input or the output matrix (in my case, to ensure that reads from global memory could be coalesced in one of the inner loops).
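To spell out why the explicit-transpose trick works for that operation: transposition reverses products, so C = A^T B^T = (B A)^T. One can therefore run a plain GEMM to form the product B A and finish with a separate transpose pass to obtain C, or instead transpose an input up front; which side gets the explicit transpose depends on which inner loop needs the coalesced access.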

With that said, an efficient dense-matrix transpose was already one of the CUDA SDK examples five years ago; a sketch of that shared-memory approach follows.
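A minimal sketch of that tiled transpose, in the spirit of the NVIDIA sample rather than its verbatim code (TILE_DIM, BLOCK_ROWS, and the kernel name are illustrative, and dimensions are assumed to be multiples of the tile size):

    #define TILE_DIM   32   // tile edge length
    #define BLOCK_ROWS 8    // rows handled per thread per iteration

    // Transpose `in` (height x width, row-major) into `out` (width x height).
    // Staging a tile through shared memory lets both the global read and the
    // global write touch consecutive addresses, i.e. both are coalesced.
    __global__ void transpose(float *out, const float *in,
                              int width, int height)
    {
        // +1 column of padding avoids shared-memory bank conflicts
        // on the strided read-back below.
        __shared__ float tile[TILE_DIM][TILE_DIM + 1];

        int x = blockIdx.x * TILE_DIM + threadIdx.x;
        int y = blockIdx.y * TILE_DIM + threadIdx.y;

        // Coalesced read: consecutive threads read consecutive columns of `in`.
        for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
            tile[threadIdx.y + j][threadIdx.x] = in[(y + j) * width + x];

        __syncthreads();

        // Swap the block coordinates so the write is coalesced as well.
        x = blockIdx.y * TILE_DIM + threadIdx.x;
        y = blockIdx.x * TILE_DIM + threadIdx.y;

        for (int j = 0; j < TILE_DIM; j += BLOCK_ROWS)
            out[(y + j) * height + x] = tile[threadIdx.x][threadIdx.y + j];
    }

    // Launch configuration:
    //   dim3 grid(width / TILE_DIM, height / TILE_DIM);
    //   dim3 block(TILE_DIM, BLOCK_ROWS);
    //   transpose<<<grid, block>>>(d_out, d_in, width, height);

Without the shared-memory staging, one of the two global-memory accesses is necessarily strided by the matrix width, which is exactly the kind of uncoalesced pattern the pre-/post-transpose trick above is meant to avoid.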


