Friday 17 August 2012

parallel 64 bit tests using NEON

So the next bit of code i've been working on requires a lot of 64-bit bit tests. Essentially using the tests as a 64-element 1-bit lookup table.

My first approach was to take an obvious C implementation, and convert it to NEON and perform the bit tests in parallel.

The C is basically like this:

  int i, count=0;
  for (i=0;i<N;i++) {
    uint64_t test = tests[i];
    uint64_t bit = (uint64_t)1 << bits[i];

    count += (test & bit) ? 1 : -1;
  }

It works, and one can perform the tests - but the tests need to be performed using 64-bit arithmetic - which takes a lot of register space, and a large amount of fuffing to get the data in the right format ('fuffing' is my technical term for data conversion/re-arrangement), and then the results must be converted back to 32-bits to form the selection masks.

So it's a great deal of code, much of which isn't doing anything but fuffing.

But the idea I had last night was to perform the testing using 8 bit values, and then scale these up to 32-bits to form the masks.

If one were writing the above code to avoid the 64-bit accesses in C, you could just use addressing to convert the 64-bit test to an 8 bit test - but on the correct byte instead.

  int i, count=0;
  for (i=0;i<N;i++) {
    int bitno = bits[i];
    uint8_t test = testsbyte[i * 8 + (bitno >> 3)];
    uint8_t bit = bitno & 7;

    count += (test & bit) ? 1 : -1;
  }

This requires per-byte addressing ... which you don't get for neon memory accesses per element ... except with vtbl.

So the following performs 8x 64-bit bit tests in only 5 instructions, and the actual tests are only using 8-bit arithmetic.

        @ d0 has 8 indices to test
        @ d1 has the 64-bit mask to test against
        @ d2 is set to 7 in 8 bytes
        @ d3 is set to 1 in 8 bytes
        
        vand.u8         d4,d2,d0        @ test & 7
        vshl.u8         d4,d3,d4        @ (1 << (test & 7))

        vshr.u8         d5,d0,#3        @ test >> 3
        vtbl.u8         d5,{d16},d5     @ data[test >> 3]
        
        vtst.u8         d4,d5           @ data[test >> 3] & (1 << (test & 7))

Changing the classifier to use code similar to this (but unrolled/scheduled better) gave me a 100% speed-up on the beagleboard (still unverified code ...).

Unfortunately, I still need nearly as many instructions just to convert the 8 bit mask into 32-bits for a selection mask (i need 32 bits for what i'm doing). But some mathematics might let me change that.

No comments: