This improves the spectralnorm shootout benchmark in a few ways, based on a
look at the leading C implementation:
* The SIMD-based `f64x2` type is now used to parallelize a few computations
* RWLock usage has been removed. A custom `parallel` function was added as a
form of stack-based fork-join parallelism. I found that contention on the
locks was high and that the locks were hindering other optimizations.
This does, however, introduce one `unsafe` block into the benchmarks, which
previously had none.
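
For readers unfamiliar with the pattern, here is a minimal sketch of what stack-based fork-join parallelism can look like. It is not the code from this change: it uses `std::thread::scope` from current stable Rust, which needs no `unsafe`, and the signature of `parallel`, the thread count of 4, and the `mult_av` helper are all assumptions made for illustration. Each worker gets a disjoint mutable chunk of the output, so no lock is needed.

```rust
use std::thread;

/// Entry A(i, j) of the infinite matrix used by spectralnorm.
fn a(i: usize, j: usize) -> f64 {
    1.0 / ((i + j) * (i + j + 1) / 2 + i + 1) as f64
}

/// Hypothetical fork-join helper (illustrative, not the PR's `parallel`):
/// splits `out` into disjoint chunks, runs `f(start_index, chunk)` on each
/// chunk in its own scoped thread, and joins them all before returning.
fn parallel<F>(out: &mut [f64], threads: usize, f: F)
where
    F: Fn(usize, &mut [f64]) + Sync,
{
    let threads = threads.max(1);
    // Ceiling division, clamped so an empty slice never yields a zero chunk size.
    let chunk = ((out.len() + threads - 1) / threads).max(1);
    thread::scope(|s| {
        for (n, part) in out.chunks_mut(chunk).enumerate() {
            let f = &f;
            s.spawn(move || f(n * chunk, part));
        }
        // Leaving the scope joins every spawned thread.
    });
}

/// Computes out = A * v, with rows divided among worker threads.
fn mult_av(v: &[f64], out: &mut [f64]) {
    // The thread count is an arbitrary choice for this sketch.
    parallel(out, 4, |start, part| {
        for (k, slot) in part.iter_mut().enumerate() {
            let i = start + k;
            *slot = v.iter().enumerate().map(|(j, x)| a(i, j) * x).sum();
        }
    });
}

fn main() {
    let v = vec![1.0; 8];
    let mut out = vec![0.0; 8];
    mult_av(&v, &mut out);
    println!("{:?}", out);
}
```

Because the workers borrow `v` and their chunk of `out` directly from the caller's stack frame, nothing has to be wrapped in a lock or reference-counted, which is the property the bullet above relies on.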
In terms of timings, the before and after numbers are:
```
$ time ./shootout-spectralnorm-before
./shootout-spectralnorm-before 2.07s user 0.71s system 324% cpu 0.857 total
$ time ./shootout-spectralnorm-before 5500
./shootout-spectralnorm-before 5500 11.88s user 1.13s system 459% cpu 2.830 total
$ time ./shootout-spectralnorm-after
./shootout-spectralnorm-after 0.58s user 0.01s system 280% cpu 0.210 total
$ time ./shootout-spectralnorm-after 5500
./shootout-spectralnorm-after 5500 3.55s user 0.01s system 455% cpu 0.783 total
```