Friday 27 December 2013

not extending bitonic sort to non-powers of 2

Update: Turns out I was a bit over-optimistic here - this stuff below doesn't work, it just happened to work for a couple of cases with the data I had. Bummer. Maybe it can be made to work but maybe not, but I didn't have much luck so I gave up for now.

Original post ...

This morning I started having a look at combining the OpenCL code I have into a GA solver. Which mostly involved lots of bug fixing in the classifier code and extending it to process all classifiers and images in a single run.

All went fine till I hit the sort routine for the population fitness when I found out that the bitonic sort I wrote doesn't handle totals which are non-powers of two.

e.g. with 11 elements:

input : b a 9 8 7 6 5 4 3 2 1
output: 1 2 3 7 8 9 a b 4 5 6 
        --------------- +++++
Each sub-list is sorted, there is just a problem with the largest-stride merge. I noticed that this merge processes indices:
  0 -> 8
  1 -> 9
  2 -> 10

data  : 1 2 3 7 8 9 a b 4 5 6 
        a a a           b b b

Which misses out on comparing 3-7, which are already going to be greater than 0-2.

So a simple solution is just to offset the first comparison index so that it compares indices 5-7 to indices 8-10 instead. The calculation is simply:

  off = max(0, (1 << (b + 1)) - N); // (here, b=3, N=11, poff = 5)

Which results in the correct merge indices for the largest offset:

data  : 1 2 3 7 8 9 a b 4 5 6 
                  a a a b b b

So an updated version of the sort becomes:

  lid = local work item id;
  array = base address of array;
  for (bit=0; (1<<bit) < N; bit++) {
    for (b=bit; b >= 0; b--) {
      // insertion masks for bits
      int upper = ~0 << (b+1);
      int lower = (1 << b)-1;
      // extract 'flip' indicator - 1 means use opposite sense
      int flip = (lid >> bit) & 1;
      // insert a 0 bit at bit position 'b' into the index
      int p0 = ((lid << 1) & upper) | (lid & lower);
      // insert a 1 bit at the same place
      int p1 = p0 | (1 << b);

      p0 += max(0, (1 << (b+1)) - N);

      swap_if(array, p0, p1, compare(p0, p1) ^ flip);
    }
  }

I haven't extensively tested it but I think it should work. It doesn't!

I'm probably not going to investigate it but I also realised that a serial-cpu version of a bitonic sort wouldn't need to frob around with bit insertions in the inner loop; it could just nest a couple of loops instead. Actually I may well end up investigating it because a similar approach would work on epiphany.

No comments: