Bitwise XOR of two 256-bit integers

Kan Li

I have an AVX CPU (which doesn't support AVX2), and I want to compute the bitwise XOR of two 256-bit integers.

Since _mm256_xor_si256 is only available with AVX2, can I load these 256 bits as a __m256 using _mm256_load_ps and then do a _mm256_xor_ps? Will this generate the expected result?

My major concern is: if the memory content is not a valid floating-point number, will _mm256_load_ps load the bits into the register exactly as they are in memory?
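
In code, I mean something like this (a sketch; assuming a, b, and out are 32-byte aligned):

#include <immintrin.h>
#include <stdint.h>

/* XOR two 256-bit integers using only AVX1 intrinsics. */
void xor256(const uint64_t *a, const uint64_t *b, uint64_t *out)
{
    __m256 va = _mm256_load_ps((const float *)a);
    __m256 vb = _mm256_load_ps((const float *)b);
    _mm256_store_ps((float *)out, _mm256_xor_ps(va, vb));
}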

Thanks.

Peter Cordes

First of all, if you're doing other things with your 256b integers (like adding/subtracting/multiplying), getting them into vector registers just for the occasional XOR may not be worth the overhead of transferring them. If you already have the two numbers in registers (using up 8 total registers), it's only four xor instructions to get the result (and 4 mov instructions if you need to avoid overwriting the destination). The destructive version can run at one per 1.33 clock cycles on SnB, or one per clock on Haswell and later (xor can run on any of the 4 ALU ports). So if you're just doing a single XOR in between some add/adc or whatever, stick with integers.
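
For reference, the scalar version is a trivial loop (a sketch, assuming the 256-bit numbers are stored as four uint64_t limbs); with both inputs already in registers, the body is just those four xor instructions:

#include <stdint.h>

/* Four 64-bit XORs; compilers unroll this into 4x xor
   (plus 4x mov if the destination must not clobber an input). */
void xor256_scalar(const uint64_t a[4], const uint64_t b[4], uint64_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = a[i] ^ b[i];
}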

Storing to memory in 64b chunks and then doing a 128b or 256b load would cause a store-forwarding failure, adding several extra cycles of latency. Using movq / pinsrq would cost more execution resources than the xor itself. Going the other way isn't as bad: a 256b store -> 64b loads is fine for store forwarding. movq / pextrq still suck, but would have lower latency (at the cost of more uops).
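
If you do have to move halves between GP and vector registers without bouncing through memory, the movq / pinsrq / pextrq sequences look like this in intrinsics (a sketch; the function names are mine, and pinsrq / pextrq need SSE4.1):

#include <immintrin.h>
#include <stdint.h>

/* GP -> vector for one 128b half: movq + pinsrq (SSE4.1). */
static inline __m128i gp_to_vec(uint64_t lo, uint64_t hi)
{
    return _mm_insert_epi64(_mm_cvtsi64_si128((int64_t)lo), (int64_t)hi, 1);
}

/* Vector -> GP for one 128b half: movq + pextrq (SSE4.1). */
static inline void vec_to_gp(__m128i v, uint64_t *lo, uint64_t *hi)
{
    *lo = (uint64_t)_mm_cvtsi128_si64(v);
    *hi = (uint64_t)_mm_extract_epi64(v, 1);
}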


FP load/store/bitwise operations are architecturally guaranteed not to generate FP exceptions, even when used on bit patterns that represent a signalling NaN. Only actual FP math instructions list math exceptions:

VADDPS

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

VMOVAPS

SIMD Floating-Point Exceptions
None.

(From Intel's insn ref manual. See the wiki for links to that and other stuff.)

On Intel hardware, either flavour of load/store can feed the FP or the integer domain without extra delay. AMD also behaves the same whichever flavour of load/store is used, regardless of where the data is going to / coming from.

Different flavours of vector move instruction actually matter for register<-register moves. On Intel Nehalem, using the wrong mov instruction can cause a bypass delay. On AMD Bulldozer-family, where moves are handled by register renaming rather than actually copying the data (like Intel IvB and later), the dest register inherits the domain of whatever wrote the src register.

No existing design I've read about has handled movapd any differently from movaps. Presumably Intel created movapd as much for decode simplicity as for future planning (e.g. to allow for the possibility of a design where there's a double domain and a single domain, with different forwarding networks). (movapd is movaps with a 66h prefix, just like the double version of every other SSE instruction just has the 66h prefix byte tacked on. Or F2 instead of F3 for scalar instructions.)

Apparently AMD designs tag FP vectors with auxiliary info, because Agner Fog found a large delay when using the output of addps as the input for addpd, for example. I don't think movaps between two addpd instructions, or even xorps would cause that problem, though: only actual FP math. (FP bitwise boolean ops are integer-domain on Bulldozer-family.)


Theoretical throughput on Intel SnB/IvB (the only Intel CPUs with AVX but not AVX2):

256b operations with AVX xorps

VMOVDQU   ymm0, [A]
VXORPS    ymm0, ymm0, [B]
VMOVDQU   [result], ymm0
  • 3 fused-domain uops can issue at one per 0.75 cycles since the pipeline width is 4 fused-domain uops. (Assuming the addressing modes you use for B and result can micro-fuse, otherwise it's 5 fused-domain uops.)

  • load port: 256b loads / stores on SnB take 2 cycles (split into 128b halves), but this frees up the AGU on port 2/3 to be used by the store. There's a dedicated store-data port, but store-address calculation needs the AGU from a load port.

    So with only 128b or smaller loads/stores, SnB/IvB can sustain two memory ops per cycle (with at most one of them being a store). With 256b ops, SnB/IvB can theoretically sustain two 256b loads and one 256b store per two cycles. Cache-bank conflicts usually make this impossible, though.

    Haswell has a dedicated store-address port, and can sustain two 256b loads and one 256b store per one cycle, and doesn't have cache bank conflicts. So Haswell is much faster when everything's in L1 cache.

Bottom line: In theory (no cache-bank conflicts) this should saturate SnB's load and store ports, processing 128b per cycle. Port5 (the only port xorps can run on) is needed once every two clocks.


128b ops

VMOVDQU   xmm0, [A]
VMOVDQU   xmm1, [A+16]
VPXOR     xmm0, xmm0, [B]
VPXOR     xmm1, xmm1, [B+16]
VMOVDQU   [result],    xmm0
VMOVDQU   [result+16], xmm1

This will bottleneck on address generation, since SnB can only sustain two 128b memory ops per cycle. It will also use 2x as much space in the uop cache, and more x86 machine code size. Barring cache-bank conflicts, this should run with a throughput of one 256b-xor per 3 clocks.
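
In intrinsics, the 128b version is something like this sketch (the function name is mine; with -mavx the compiler emits roughly the six instructions above):

#include <immintrin.h>

/* XOR 256 bits as two 128b halves: load, xor, store. */
void xor256_sse(const __m128i *a, const __m128i *b, __m128i *out)
{
    __m128i lo = _mm_xor_si128(_mm_loadu_si128(&a[0]), _mm_loadu_si128(&b[0]));
    __m128i hi = _mm_xor_si128(_mm_loadu_si128(&a[1]), _mm_loadu_si128(&b[1]));
    _mm_storeu_si128(&out[0], lo);
    _mm_storeu_si128(&out[1], hi);
}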


In registers

Between registers, one 256b VXORPS and two 128b VPXOR per clock would saturate SnB. On Haswell, three AVX2 256b VPXOR per clock would give the most XOR-ing per cycle. (XORPS and PXOR do the same thing, but XORPS's output can forward to the FP execution units without an extra cycle of forwarding delay. I guess only one execution unit has the wiring to produce an XOR result in the FP domain, so Intel CPUs post-Nehalem only run XORPS on one port.)


Z Boson's hybrid idea:

VMOVDQU      ymm0, [A]
VMOVDQU      ymm4, [B]
VEXTRACTF128 xmm1, ymm0, 1      ; high half of A
VEXTRACTF128 xmm5, ymm4, 1      ; high half of B
VPXOR        xmm0, xmm0, xmm4   ; low halves
VPXOR        xmm1, xmm1, xmm5   ; high halves
VMOVDQU      [res],    xmm0
VMOVDQU      [res+16], xmm1

Even more fused-domain uops (8) than just doing 128b-everything (6, assuming the memory-source VPXORs and the stores micro-fuse).

Load/store: two 256b loads leave two spare cycles for two store addresses to be generated, so this can still run at two loads/one store of 128b per cycle.

ALU: two port-5 uops (vextractf128), two port0/1/5 uops (vpxor).

So this still has a throughput of one 256b result per 2 clocks, but it's saturating more resources and has no advantage (on Intel) over the 3-instruction 256b version.
