600ms seems too long for that set size. I will take a look at the performance when I have my Rust harness working. Let's make sure it's working the same way as the C++/JS impls first, and then we can see if there any speed-ups.
One optimisation I did in the C++ was to ensure that the XORing is always done word-wise, instead of byte-for-byte. In fact, with attention to the alignment, I got the compiler to emit PXOR SIMD instructions to do multiple words in a single instruction.
That's very cool about the UniFFI bindings, I'm looking forward to it!