Friday 6 September 2013

Scheduling in detail

Just some notes on optimising the assembly language version of the viola-jones cascade walker I've been working on for the eiphany chip.

I'm still working toward a tech demo but I got bogged down with the details of the resampling code - i'm using the opportunity to finally grok how upfirdn works.

Excuse the typos, i did this in a bit of a rush and don't feel like fully proof-reading it.

The algorithm

First the data structure. This allows the cascade to be encoded using dword (64-bit) alignment. It's broken into 64-bit elements for C compatability.

union drecord {
        unsigned long long v;
        struct {
                unsigned int flags;
                float sthreshold;
        } head0;
        struct {
                unsigned int count2;
                unsigned int count3;
        } head1;
        struct {
                unsigned short a,b,c,d;
        } rect;
        struct {
                float weight1;
                float fthreshold;
        } f0;
        struct {
                float fsucc,ffail;
        } f1;
};

And then a C implementation using this data structure. The summed-area-table (sat) sum calculates the average of all pixels within that rectangle. The sat table size is hard-coded to a specific width and encoded into the compiled cascade. Because it is only ever processed as part of a window of a known size this doesn't limit it's generality.

It performs a feature test on a 2-region feature which equates to either a "this half is brighter than the other half" test in all 4 directions, or a "the middle half is brighter than the two quarter sides" in both directions and senses.

// Copyright (c) 2013 Michael Zucchi
// Licensed under GNU GPLv3
int test_cascade(float *sat, float var, const union drecord *p, float *ssump) {
        union drecord h0;
        union drecord h1;
        float ssum = *ssump;


        do {
                h0 = (*p++);
                h1 = (*p++);

                while (h1.head1.count2) {
                        union drecord r0, r1, f0, f1;
                        float rsum;

                        r0 = (*p++);
                        r1 = (*p++);
                        f0 = (*p++);
                        f1 = (*p++);

                        rsum = (sat[r0.rect.a] + sat[r0.rect.d]
                                - sat[r0.rect.b] - sat[r0.rect.c]) * -0.0025f;
                        rsum += (sat[r1.rect.a] + sat[r1.rect.d]
                                 - sat[r1.rect.b] - sat[r1.rect.c]) * f0.f0.weight1;

                        ssum += rsum < f0.f0.fthreshold * var ? f1.f1.fsucc : f1.f1.ffail;
                        h1.head1.count2--;
                }

                ... 3-feature test is much the same ...

                if (h0.head0.flags & 1) {
                        if (ssum < h0.head0.sthreshold) {
                                return 0;
                        }
                        ssum = 0;
                }
        } while ((h0.head0.flags & 2) == 0);

        *ssump = ssum;

        // keep on going
        return 1;
}

As one can see the actual algorithm is really very simple. The problem with making it run fast is dealing with the amount of data that it can chew through as i've mentioned and detailed in previous posts.

I don't have any timings but this should be a particularly fast implementation on an desktop cpu too - most of the heavy lifting fits in the L1 cache for example, and it's pre-compling as much as possible.

Hardware specific optimisations

This covers a couple of optimisations made to take advantage of the instruction set.

First issue is that there is no comparison operation - all one can do is subtract and compare flags. Furthermore there are only limited comparison operators available - equal, less-than and less-than-or-equal. So in general a compare is at least 2 instructions (and more if you want to be ieee compliant but that isn't needed here).

On the other hand there are fmad and fmsub instructions - AND these set the flags. So it is possible to perform all three operations in one instruction given that we don't need to know the precise value.

Another feature of the epu is that the floating point and integer flags are separate so this can be utilised to fill instruction slots and also perform control flow without affecting the flags.

The epu is most efficient when performing dword loads. It's the same speed as a word load, and faster than a short or byte load. So the format is designed to support all dword loads.

Another general optimisation is in pre-compiling the cascade for the problem. So far i'm only using it to pre-calculate the array offsets but it could also be used to alter the sign of calculations to suit the available fpu flags.

Update: Because the eiphipany LDS is so small another optimisation was to make the cascade streamable. Although the single biggest stage with the test cascade fits in 8k it is pretty tight and limits the code flexbility and tuning options (e.g. trade-off space and time). It also limits generality - other cascades may not have the same topology. So the cascade format is designed so it can be broken at completely arbitrary boundary points with very little overhead - this is probably the single most important bit of engineering in the whole exercise and determines everything else. The difficulty isn't so much in designing the format as in recognising the need for it and it's requirements in the first place. Having a streamable cascade adds a great deal of flexibility for dealing with large structures - they can be cached easily and implementing read-ahead is trivial.

There were some other basic optimisation techniques which became available after studying the actual data:

  • 2-region features use only two variations of weights, therefore it can be encoded in 1 bit or in a single float (the first one is always the same).
  • 3-region features all use the same weights, therefore all 3 floats can be thrown away.
  • The original cascade format had 2 or 3 region features scattered amongst the cascade randomly which means any inner loop has to deal with the different number of elements (and branch!). Once on realises the only result is the sum then they can be processed in any order (summation algebra ftw ... again), meaning i could group them and optimise each loop separately.

Some of these seem to lose the generality of the routine - but actually the weights are always the same relationship they are just scaled to the size of the native cascade window. So making the algorithm general would not take much effort.

These are things I missed when I worked on my OpenCL version so I think I could improve that further too. But trying to utilise the concurrency and dealing with the cascade size is what kills the GPU performance so it might not help much as it isn't ALU constrained at all. If I ever get a GCN APU I will definitely revisit it though.

Unscheduled ASM

After a (good) few days worth hacking blind and lots of swearing I finally came up with the basic code below. I was dreaming in register loads ...

Actually this was de-scheduled in order to try to follow it and re-schedule it more efficiently. This is the top part of the C code and the entire 2-region loop.

// Copyright (c) 2013 Michael Zucchi
// Licensed under GNU GPLv3

0:      ldrd    r18,[r7,#1]     ; count2, count3
        ldr     r16,[r7],#4     ; flags

        and     r0,r18,r5       ; check zer0 count
        beq     1f

2:      ldrd    r0,[r7],#4      ; 0: load a,b,c,d
        ldrd    r2,[r7,#-3]     ; 1: load a,b,c,d

        lsr     r4,r0,#16       ; 0:b index
        ldr     r21,[r6,r4]     ; 0:load b
        and     r0,r0,r5        ; 0:a index
        ldr     r20,[r6,r0]     ; 0:load a

        lsr     r4,r1,#16       ; 0: d index    
        ldr     r23,[r6,r4]     ; 0: load d
        and     r1,r1,r5        ; 0: c indec
        ldr     r22,[r6,r1]     ; 0: load c

        lsr     r4,r2,#16       ; 1: b index
        ldr     r25,[r6,r4]     ; 1: load b
        and     r2,r2,r5        ; 1: a index
        ldr     r24,[r6,r2]     ; 1: load a
        
        lsr     r4,r3,#16       ; 1: d iindex
        ldr     r27,[r6,r4]     ; 1: load d
        and     r3,r3,r5        ; 1: c index
        ldr     r26,[r6,r3]     ; 1: load c

        ldrd    r50,[r7,#-2]    ; load w1, rthreshold
        
        fsub    r44,r20,r21     ; 0: a-b
        fsub    r45,r23,r22     ; 0: d-c
        fsub    r46,r24,r25     ; 1: a-b 
        fsub    r47,r27,r26     ; 1: d-c
        
        fmul    r48,r51,r60     ; rthreshold *= var
        
        fadd    r44,r44,r45     ; 0[-1]: a+d-b-c
        fadd    r45,r46,r47     ; 1[-1]: a+d-b-c
        
        fmsub   r48,r44,r63     ; [-1]: var * thr -= (a+d-b-c) * w0
        ldrd    r52,[r7,#-1]    ; [-1] load fsucc, ffail
        fmsub   r48,r45,r50     ; [-1] var * thr -= (a+d-b-c) * w1
        movblte r52,r53
        fsub    r17,r17,r52     ; [-2]: ssum -= var * thr > (rsum) ? fsucc: ffail

        sub     r18,r18,#1
        bne     2b
1:

Apart from the trick with the implicit 'free' comparison operations for all that it pretty much ended up in a direct translation of the C code (much of the effort was in the format design and getting the code to run). But even in this state it will execute much faster than what the compiler generates for the very simple loop above. Things the C compiler is missing:

  • It doesn't use dword loads - more instructions are needed
  • It does use hword loads - causes fixed stalls
  • It is using an ieee comparison function (compiler flags may change this)
  • It doesn't use fmsub as much, certainly not for comparison
  • It needs to multiply the array references by 4

Because there are no datatypes in asm, this can take advantage of the fact that the array lookups are by the byte and pre-calculate the shift (multiply by sizeof(float)) in the cascade. In the C version I do not as it adds a shift for a float array reference - I do have a way to remove that in C but it's a big ugly.

Otherwise - it's all very straightforward in the inner loop.

First it loads all the rect definitions and then looks them up in the sat table (r6).

Then it starts the calculations, first calculating the average and then using fmsub to perform the multiply by the weight and comparison operation in one.

At the very end of the loop the last flop is to perform a subtraction on the ssum - this sets the status flags to the final comparison (if (ssum < h0.head0.sthreshold) in c). This actually requires some negation in code that uses it which could be improved - the threshold could be negated in the cascade for example.

If one looks closely one will see that the registers keep going up even though many are out of scope and can be re-used. This is done on purpose and allows for the next trick ...

I don't have the full profiling info for this version, but I have a note that it includes 15 RA stalls, and I think from memory only dual-issues 2 of the 10 flops.

Scheduling

A typical optimisation technique is to unroll a loop, either manually or by letting the compiler do it. Apart from reducing the relative overhead of any loop support constructs it provides modern processors with more flexibility to schedule instructions.

The code already has some loop unrolling anyway - the two regions are tested using in-line code rather than in a loop.

But unrolling gets messy when you don't know the the loop bounds or don't have some other hard detail such as that there is always an even number of loops. I didn't really want to try to look at pages of code and try to schedule by hand either ...

So instead I interleaved the same loop - as one progresses through the loop calculating the addresses needed for "this" result, the fpu is performing the calculations for the "last" result. You still need a prologue which sets up the first loop for whatever the result+1 code is expecting, and also an epilogue for the final result - and if only 1 value is processed the guts is completely bypassed. I'll only show the guts here ...

// Copyright (c) 2013 Michael Zucchi
// Licensed under GNU GPLv3

        .balign 8
2:
[  0]   fsub    r46,r24,r25     ; [-1] 1: a-b 
[  0]   ldrd    r0,[r7],#4      ; [ 0] 0: load a,b,c,d
[  1]   fsub    r47,r27,r26     ; [-1] 1: d-c
[  1]   ldrd    r2,[r7,#-3]     ; [ 0] 1: load a,b,c,d
[  2]   fmul    r48,r51,r60     ; [-1] rthreshold *= var
        
[  2]   lsr     r4,r0,#16       ; [ 0] 0:b index
[  3]   fadd    r44,r44,r45     ; [-1] 0: a+d-b-c
[  3]   ldr     r21,[r6,r4]     ; [ 0] 0:load b
[  4]   and     r0,r0,r5        ; [ 0] 0:a index
[  5]   ldr     r20,[r6,r0]     ; [ 0] 0:load a

[  6]   lsr     r4,r1,#16       ; [ 0] 0: d index    
[  6]   fadd    r45,r46,r47     ; [-1] 1: a+d-b-c
[  7]   ldr     r23,[r6,r4]     ; [ 0] 0: load d
[  8]   and     r1,r1,r5        ; [ 0] 0: c indec
[  8]   fmsub   r48,r44,r63     ; [-1] var * thr -= (a+d-b-c) * w0
[  9]   ldr     r22,[r6,r1]     ; [ 0] 0: load c
        
[ 10]   lsr     r4,r2,#16       ; [ 0] 1: b index
[ 11]   ldr     r25,[r6,r4]     ; [ 0] 1: load b
[ 12]   and     r2,r2,r5        ; [ 0] 1: a index
[ 13]   ldr     r24,[r6,r2]     ; [ 0] 1: load a

[ 13]   fmsub   r48,r45,r50     ; [-1] var * thr -= (a+d-b-c) * w1

[ 14]   ldrd    r52,[r7,#-5]    ; [-1] load fsucc, ffail
[ 15]   lsr     r4,r3,#16       ; [ 0] 1: d iindex
[ 16]   and     r3,r3,r5        ; [ 0] 1: c index
[ 17]   ldr     r27,[r6,r4]     ; [ 0] 1: load d
[ 18]   movblte r52,r53         ; [-1] val = var * thr < rsum ? fsucc : ffail
[ 19]   fsub    r44,r20,r21     ; [ 0] 0: a-b
[ 19]   ldr     r26,[r6,r3]     ; [ 0] 1: load c
[ 20]   fsub    r45,r23,r22     ; [ 0] 0: d-c

[ 20]   sub     r18,r18,#1
[ 21]   ldrd    r50,[r7,#-2]    ; [-1] load w1, rthreshold

[ 21]   fsub    r17,r17,r52     ; [-1] ssum -= var * thr > (rsum) ? fsucc: ffail

[ 22]   bne     2b
[ 26] ; if back to the start of the loop

Update: I tried to improve and fix the annotations in the comments. The [xx] value is the index of the result this instruction is working on, the next x: value is the index of the region being worked on (where it is needed).

I've attempted to show the clock cycles the instructions start on (+ 4 for the branch), but it's only rough. I know from the hardware profiling that every flop dual-issues and there are no register stalls. The loop start alignment is also critical to the lack of stalls. And it took a lot of guess-work to remove the final stall which lingered in the last 5 instructions (someone'll probably tell me now that the sdk has a cycle timer, but that would be no matter if they did).

It almost fell out almost completely symmetrically - that is having all ialu ops in loop 0 and having all flops in loop 1, but by rotating the flops around a bit I managed to get the final flop being the ssum "subtraction + comparison" operation and with no stalls ...

The movblte instruction which performs the ternary is the one that uses the implicit comparison result from the fmsub earlier. Not only does this save one instruction, it also saves the 5 clock cycle latency it would add - and this loop has no cycles to spare that I could find.

There is some more timing info for this one on the previous post. The version that this is 30% faster is not the unscheduled one above but an earlier scheduling attempt.

Oh I should probably mention that i found the bugs and the timings in the previous post did change a bit for the worse, but not significantly.

No comments: