Coalesced transpose via shared memory (NVIDIA Parallel Forall)