Saturday 9 May 2015

Faster loops

I've been playing with streams and iterators and stuff again a bit. I've found that having a custom calculation loop is pretty good for performance, but trying to call a (lambda) function for each item can have some fairly large overheads once the jvm decides it wants to de-optimise the calling loop. Still, the latter is just so convenient to use that it's worth a little more effort if it will make a performance difference.

de-optimisation

This stuff is just based on observation and not on any internal knowledge of hotspot.

So I had another look at trying to optimise the case of processing per-item in a loop with minimal overheads, in a way that is practical and efficient. So far I've found the jvm will inline a call inside a loop (depending on size?) if the loop only ever calls up to two different class types. Unfortunately each method-reference expression is a separate invokedynamic call site and so produces a new class, which means the first of the following will optimise fine, but the second won't.

  // these will remain optimised
  IntUnaryOperator op = Integer::reverse;
  A.forEach(op);
  A.forEach(op);
  A.forEach(op);
  // the third invocation will cause a de-optimisation,
  //  subsequently all will run as de-optimised
  A.forEach(Integer::reverse);
  A.forEach(Integer::reverse);
  A.forEach(Integer::reverse);

And this applies globally, so using a singleton won't get you very far.

So how to address this?

forEacherator

I found that if I used bytecode manipulation and created a new copy-class, the optimisation stayed around - because each copy of the loop only ever calls one function. So the goal then was to create the simplest possible class, so that the overhead of doing this (and of the subsequent code) remained small.

After a couple of iterations I settled on the following pair of interface+implementation.

import java.util.function.IntUnaryOperator;

public interface ByteRowFunction {
        public void forEach(byte[] data, int offset, int length);
}

public class ByteRow implements ByteRowFunction {

        private final IntUnaryOperator op;

        public ByteRow(IntUnaryOperator op) {
                this.op = op;
        }

        @Override
        public void forEach(byte[] data, int offset, int length) {
                // apply op to each byte, treating the values as unsigned
                for (; length > 0; length--, offset++)
                        data[offset] = (byte) op.applyAsInt(data[offset] & 0xff);
        }
}

This form of for loop is also the fastest I could find, at least with this hotspot on this platform. (I suppose it's really a map() or apply() call; just read it as whichever you prefer.)

It still has the same issue as the examples above in that using it with 3 or more different 'op' values will de-optimise it, even though it could otherwise be used as a singleton (since the forEach call is pure).

Class specialisation

So this is where the bytecode manipulation comes in. I use ASM to simply create a new class for each class of op, and otherwise leave the jvm to worry about in-lining the call if it makes sense.

public static ByteRowFunction of(IntUnaryOperator op) {
    try {
        // create (or re-use) a renamed copy of ByteRow specialised for op's class
        Class<?> cp = ASMTools.specialisedClass(ByteRow.class.getName(), op);

        return (ByteRowFunction) cp.getConstructor(IntUnaryOperator.class).newInstance(op);
    } catch (Exception ex) {
        throw new RuntimeException(ex);
    }
}

The specialisedClass() function simply takes the original class and creates an exact copy but renames it to a unique value tied to op.getClass(). There is an out-of-date example in the ASM FAQ on how to do this but it's pretty easy using ASM. And that's more or less all it takes.
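For illustration, the following is a minimal sketch of what a specialisedClass() along these lines might look like. It assumes a reasonably recent ASM with ClassRemapper and SimpleRemapper (older releases used RemappingClassAdapter for the same job), and the cache and SpecialLoader names are just hypothetical details of the sketch, not the actual implementation.

import java.util.HashMap;
import java.util.Map;
import org.objectweb.asm.ClassReader;
import org.objectweb.asm.ClassWriter;
import org.objectweb.asm.commons.ClassRemapper;
import org.objectweb.asm.commons.SimpleRemapper;

public class ASMTools {

        // one copy per op class is enough, so cache the generated classes
        private static final Map<String, Class<?>> cache = new HashMap<>();

        // trivial loader that exposes defineClass() for the renamed copy
        static class SpecialLoader extends ClassLoader {
                Class<?> define(String name, byte[] b) {
                        return defineClass(name, b, 0, b.length);
                }
        }

        public static synchronized Class<?> specialisedClass(String name, Object op) throws Exception {
                // derive a unique (and legal) class name tied to op's class
                String suffix = op.getClass().getName().replace('.', '$').replace('/', '$');
                String newName = name + "$" + suffix;
                Class<?> cp = cache.get(newName);

                if (cp == null) {
                        // read the original class and emit an exact copy under the new name
                        ClassReader cr = new ClassReader(name);
                        ClassWriter cw = new ClassWriter(0);

                        cr.accept(new ClassRemapper(cw, new SimpleRemapper(
                                name.replace('.', '/'), newName.replace('.', '/'))), 0);

                        cp = new SpecialLoader().define(newName, cw.toByteArray());
                        cache.put(newName, cp);
                }
                return cp;
        }
}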

Actually, further: in this case the forEach() call is pure (re-entrant and side-effect free) so the of() function could return a singleton instance as well. But that adds some other mess to gc and so on, so it probably isn't worth it and may even be detrimental in the long run; if necessary a caller could keep track of their own.
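As a usage sketch (the data array and lambdas here are hypothetical, just to show the shape of it), each distinct lambda gets its own renamed copy of the loop class, so the call inside each copy stays monomorphic and can remain inlined:

  byte[] data = new byte[1 << 25];

  // each op class gets its own specialised ByteRow copy, so none of these
  // calls de-optimises the others
  ByteRow.of(Integer::reverse).forEach(data, 0, data.length);
  ByteRow.of(Integer::hashCode).forEach(data, 0, data.length);
  ByteRow.of((v) -> ~v).forEach(data, 0, data.length);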

Results

I did two tests on a (2^25) byte array. The first tries to isolate just the overheads and invokes Integer.hashCode() on each item. The second calls Integer.reverse(), which is somewhat slow on x86 without a bit-reverse instruction (ignoring that this will always result in zero when used byte-integer-byte).

In each case I'm calling the same thing in 3 different places with 3 different lambdas (each `Integer::xx' expression creates a new invokedynamic call site and class), which should trigger a de-optimisation if anything is going to.

                                 hashCode        reverse
  new ByteRow().forEach          0.1800000       0.29
  new of().forEach               0.0000700       0.146

  singleton hash of().forEach    0.0000020       0.146
  singleton saved of().forEach   0.0000013       0.146
  for loop                       0.0000010       0.158

This is the best of 5 runs after a couple of warm-up laps. Because it's so short the first column's results are a bit noisy, but the general trend is clear enough and I took some representative values from a few runs.

The first two include (one) object instantiation. The third uses a (non-synchronised) hash table lookup, the fourth just creates an instance and re-uses it. The last is a simple for-loop over the byte array.
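For what it's worth, the 'singleton hash' variant is roughly the following shape; this is a guess at the test harness rather than the actual code, and it relies on java.util.HashMap/Map plus the of() function above.

  // cache the specialised instance per op class (non-synchronised lookup)
  static final Map<Class<?>, ByteRowFunction> rows = new HashMap<>();

  static ByteRowFunction cachedOf(IntUnaryOperator op) {
          return rows.computeIfAbsent(op.getClass(), (k) -> ByteRow.of(op));
  }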

It would be handy to see the generated object code but one can guess that the first vs the rest is the difference between an in-line `foo[x] = foo[x]' and a function call `foo[x] = identity(foo[x])'.

Of course a `memcpy' operation isn't much of a test, so with something a little more substantial like Integer.reverse() the overheads are only about 100% - which is still pretty much at the "doesn't matter" point in most cases, but it's there anyway. Oddly enough the for loop loses out here a little bit, but that probably just comes down to different micro-optimisations.

The point of this is to save having to type out yet! another! for! loop! and use lambdas but still retain the performance of using a specialised for-loop. The grand prize would be to re-compile the code for SIMD instructions or HSA or OpenCL - i.e. aparapi. But that requires more than just a bit more effort ;-).

I was hoping that the same technique would be applicable to creating optimised spliterators for use with streams, but with the first approach I tried, by the time the spliterator gets to see the operator it has been wrapped in so many anonymous inner classes and/or invokedynamic handles that the compiler doesn't try, or can't, inline everything. Bummer I guess.

I guess if I exposed the spliterator boundaries to the pipeline it could work. Instead of creating a stream of integers it could create a stream of batches or rows, and then some helper 'of()' functions could wrap simple per-item calculations into optimised loop-running instances whilst still retaining most of the simplicity of use.

  thing.rows().map(ByteRow.ofMap((v) -> itemcalc)). ...;
  thing.rows().flatMap(ByteRow.ofFlatMap((v) -> itemcalc)). ...;
  etc
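A rough sketch of what such an ofMap() helper could look like, assuming the of() specialisation above, a stream of byte[] rows, and java.util.function.Function; rows() and ofMap() are just the hypothetical names from the example, and this version modifies each row in place.

  public static Function<byte[], byte[]> ofMap(IntUnaryOperator op) {
          // specialise the per-item loop once, then apply it to whole rows
          ByteRowFunction f = of(op);

          return (row) -> {
                  f.forEach(row, 0, row.length);
                  return row;
          };
  }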

But I've had enough of this for today. I dunno why I was even on this thing - I had an overly long work week and have spent too much time in front of screens as it is. But with a crappy cold day and that sore foot, options are limited. Might see if the footy is on, but that channel 7 commentary makes it unbearable.

C?

But just for a ground-check I thought I'd see how C compares. Unfortunately the compiler detects that a bit-reverse of a byte will end up as zero and optimises it away to a byte store of 0. Oops. Well, I mean it's great that it does that, but it's not really useful here. I'm using the same code as in Integer.reverse().

So I changed it to the byte-correct reverse(i)>>24.
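In java terms the operator for that column is presumably just the following (my reading of it, anyway):

  // byte-correct reverse: the reversed bits land in the top byte, so shift
  // them back down before forEach() truncates the result to a byte
  IntUnaryOperator op = (i) -> Integer.reverse(i) >> 24;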

                          reverse(i)>>24    ~i (x5 loops)
  new ByteRow().forEach   0.151             0.037
  for loop                0.147             0.035

  C call or forEach       0.118             0.233
  C inline                0.08              0.096

So yeah it's slower, but only by about 2x in the worst case, and in the more realistic case where you're not going to have inline implementations of everything it's only ~30% slower.

I also tried a 'not' function and here java pounces on gcc: even the in-line C case is 3x slower, and via a function call it is 6x slower. This is just with -O2, where gcc is not doing any loop unrolling or simdisation, but -O3 doesn't make much difference, and -O3 with -mtune=native results in no real difference either although it then generates a bunch of SIMD code and unrolls the loop a few times.

The gcc generated code looks ok at a glance - not that I care enough about x86 machine code to be able to determine the finer details. Maybe it's an alignment thing or something?

It is still a bit surprising, if not very important, but it is enough to demonstrate that C doesn't automatically mean best-of-speed.
