In my experience, matrix operations where performance matters are usually done on sparse matrices, and there a transposition is done differently anyway. If a matrix package supports both column-major and row-major sparse matrices, transposing is just a matter of swapping to (!current-major) indexing.
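To illustrate the point with a minimal sketch (the `Sparse` struct and field names here are hypothetical, not from any particular package): the CSC arrays of A are bit-for-bit the CSR arrays of A^T, so a "transpose" can be just flipping an interpretation flag and swapping the dimensions, with no data movement.

```cpp
#include <utility>
#include <vector>

// Hypothetical minimal compressed sparse matrix. Whether the compressed
// axis is rows (CSR) or columns (CSC) is just a flag: the CSC arrays of
// A are exactly the CSR arrays of A^T.
struct Sparse {
    bool row_major;          // true: CSR, false: CSC
    int rows, cols;
    std::vector<int> ptr;    // size (compressed dim + 1)
    std::vector<int> idx;    // index along the other dimension
    std::vector<double> val;
};

Sparse transpose(Sparse a) {
    std::swap(a.rows, a.cols);
    a.row_major = !a.row_major;  // the arrays themselves are untouched
    return a;
}
```

The catch, of course, is that every routine downstream has to accept both majorities, which is exactly the "if a package supports both" caveat above.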
This paper is useful for transpose-once, use-many scenarios, but for real-time transposes inside a main loop, it may be easier to write a fake-transpose wrapper that computes flipped indices on access.
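A minimal sketch of such a wrapper (names are illustrative, not from the paper): the "transpose" only flips a flag, and each access pays the index swap instead. Zero up-front cost, but note that in a hot loop the accesses through the transposed view become strided, which is precisely the memory pattern an explicit transpose tries to avoid.

```cpp
#include <utility>
#include <vector>

// Hypothetical dense row-major matrix with a lazy "transposed view":
// at() swaps the indices instead of moving any data. rows/cols here
// always describe the underlying storage, not the logical view.
struct Matrix {
    int rows, cols;
    std::vector<double> data;  // row-major storage
    bool transposed = false;

    double& at(int i, int j) {
        if (transposed) std::swap(i, j);  // flipped indices on access
        return data[i * cols + j];
    }
};

Matrix fake_transpose(Matrix m) {
    m.transposed = !m.transposed;  // O(1), no data movement
    return m;
}
```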
Even if one buys the (demonstrably false) claim that all real-world problems are sparse, most sparse techniques (especially sparse-direct, and, to a limited degree, Krylov subspace methods) boil down to dense linear algebra on smaller matrices. When executing dense linear algebra on accelerators, it can be surprising just how carefully one must organize the computation in order to make the best use of the memory hierarchy. When I was writing these types of routines years ago, it was often beneficial to pre-/post-process certain operations, such as A^T B^T = C, by explicitly transposing either the input or output matrix (in my case, to ensure that reads from global memory could be coalesced in one of the inner loops).
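For readers who haven't run into this: the same tiling idea the GPU versions use (staging TILE x TILE blocks through shared memory so both reads and writes coalesce) applies on the CPU side too. Here is a plain-C++ sketch of a cache-blocked out-of-place transpose; the TILE size and function name are my own, purely for illustration.

```cpp
#include <algorithm>
#include <vector>

// Cache-blocked out-of-place transpose: process the matrix in
// TILE x TILE blocks so that both the row-major reads from `in` and
// the column-order writes to `out` stay within a few cache lines at
// a time, rather than striding across the whole matrix.
constexpr int TILE = 32;  // illustrative; tune to the cache line / L1 size

void transpose_blocked(const std::vector<double>& in,
                       std::vector<double>& out, int rows, int cols) {
    for (int ii = 0; ii < rows; ii += TILE)
        for (int jj = 0; jj < cols; jj += TILE)
            for (int i = ii; i < std::min(ii + TILE, rows); ++i)
                for (int j = jj; j < std::min(jj + TILE, cols); ++j)
                    out[j * rows + i] = in[i * cols + j];
}
```

The CUDA version adds one more wrinkle the CPU sketch doesn't need: padding the shared-memory tile by one column to avoid bank conflicts on the transposed reads.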
With that said, an example of efficiently transposing dense matrices was one of the CUDA examples five years ago...