+ // For very small types, all the individual reads in the normal
+ // path perform poorly. We can do better, given efficient unaligned
+ // load/store, by loading a larger chunk and reversing a register.
+
+ // Ideally LLVM would do this for us, as it knows better than we do
+ // whether unaligned reads are efficient (since that changes between
+ // different ARM versions, for example) and what the best chunk size
+ // would be. Unfortunately, as of LLVM 4.0 (2017-05) it only unrolls
+ // the loop, so we need to do this ourselves. (Hypothesis: reverse
+ // is troublesome because the sides can be aligned differently --
+ // will be, when the length is odd -- so there's no way of emitting
+ // pre- and postludes to use fully-aligned SIMD in the middle.)
+
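+ The comment above describes the technique only in prose, so here is a minimal sketch of what it means for byte slices. This is not the actual libcore implementation: `reverse_bytes` is a hypothetical name, the chunk size `W` is an assumed choice, and `read_unaligned`/`write_unaligned` stand in for the "efficient unaligned load/store" assumption. It loads one word from each end, byte-swaps each in a register with `swap_bytes`, stores the two words exchanged, and falls back to element-wise reversal for the short middle remainder.

```rust
use std::mem::size_of;

/// Sketch of chunked reversal for `&mut [u8]` (illustrative, not the
/// real libcore code). Assumes unaligned loads/stores are cheap.
fn reverse_bytes(slice: &mut [u8]) {
    const W: usize = size_of::<u64>(); // assumed chunk size
    let len = slice.len();
    let ptr = slice.as_mut_ptr();
    let mut i = 0;
    // While two whole words remain, exchange a byte-reversed word
    // from each end. The ends may be differently aligned (they will
    // be, when the length is odd), hence the unaligned accesses.
    while len >= 2 * (i + W) {
        // SAFETY: the loop condition gives i + W <= len - i - W, so
        // both W-byte ranges are in bounds and do not overlap.
        unsafe {
            let j = len - i - W;
            let a = (ptr.add(i) as *const u64).read_unaligned().swap_bytes();
            let b = (ptr.add(j) as *const u64).read_unaligned().swap_bytes();
            (ptr.add(i) as *mut u64).write_unaligned(b);
            (ptr.add(j) as *mut u64).write_unaligned(a);
        }
        i += W;
    }
    // Fewer than 2*W bytes remain in the middle; std's `reverse`
    // stands in for the plain element-wise swap loop here.
    slice[i..len - i].reverse();
}

fn main() {
    let mut v: Vec<u8> = (0..=20).collect();
    reverse_bytes(&mut v);
    let mut expected: Vec<u8> = (0..=20).collect();
    expected.reverse();
    assert_eq!(v, expected);
}
```

+ The real code generalizes this to other small element types (e.g. `u16` via `rotate_left`), but the shape is the same: one wide load, one register-level reverse, one wide store per chunk, instead of `W` individual byte accesses.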