Monday 31 March 2014

a (slightly) better barrier, signals

So I kept thinking about the last point on the last post. That is, having a job-based runtime similar to hsa queuing.

One feature that employs is a signalling mechanism that I think lets waiters wait on multiple events before continuing. I came up with a possible epiphany-friendly primitive to support that.

First there is the sending object - this is what each sender gets to signal against. It requires a static allocation of slots by the runtime so that they can signal with a single asynchronous write without needing an atomic index update that a semaphore like object would require.

struct ez_signal_t {
 uint slot;
 ez_signal_bank_t *bank;
};

void ez_signal(ez_signal_t s) {
 s.bank->s.signals[s.slot] = 0xff;
}

The signal bank is basically just an array of bytes per sender. But also allows for word access for a more efficient wait.

struct ez_signal_bank_t {
 uint size;
 uint reserved;
 union {
  ubyte signals[8];
  uint block[2];
 } s;
 // others follow
 // must be rounded up to 4 bytes
};
(the structure is aligned to 8 because of the union, it's probably not worth it and i will use an assembly version anyway)

The bank is initialised by setting every byte within size to 0, but those outside of that range which might fall into a word slot are pre-filled with the signal value of 0xff. This allows the waiter to execute a simple uint based loop without having to worry about edge conditions.

void ez_signal_wait(ez_signal_bank_t *bank) {
        uint size = bank->size;
        volatile uint *si = bank->s.block;

        size = ((size+3) >> 2);
        while (size > 0) {
                while (si[size-1] != 0xffffffff)
                        ;
                size -= 1;
        }
}

The same technique could be applied to barrier() as well - a tiny improvement but one nonetheless. It does need some edge case handling for the reset and init though which will take more code.

(Hmm, on closer inspection it isn't clear whether the hsa signalling objects actually support multiple writers or not - it seems perhaps not. However they do operate at a higher level of abstraction and may need multiple writers internally in order to implement the public interface when multiple cores are involved. Update: On further inspection it looks like they are basically signal semaphores but with a flexible condition check on the counter and the ability to use opencl atomics on the index.)

Actually another thought I had was that since the barrier is tied directly to the workgroup why is the user code even calling init on it's own structure anyway? It may as well be a parameter-less function that just uses a pre-defined area as the barrier memory which is setup by the loader and/or the init routine. I think the current barrier implementation mechanism may have an issue because you can't initialise the barrier data structures in parallel without it being a race condition and so even the setup needs another synchronisation mechanism to implement it properly. By placing it in an area that can be initialised by the loader/runtime that race goes away quietly to die. And if the barrier is chip-wide or the WAND routing ever gets fixed it could just use a hardware barrier implementation instead.

This thought may be extended to other data structures and facilities that are needed at runtime. Why have the error prone and bulky code to setup a dma structure when the logic could be encoded in a routine with simple arguments that covers the most common cases with just two functions: 1D and 2D? And if you're doing that why not add an async copy capability with interrupt driven queuing as well? Well it may not be worth it in the end but its worth exploring to find out.

gdb sim

So I actually used the gdb simulator for the first time yesterday to step through some assembly language to verify some code. So far I've just debugged using the run-till-it-stops-crashing method. Bit of a pain it doesn't seem to think assembly language is actually a thing though:

$ e-gdb
...

(gdb) target sim
Connected to the simulator.
(gdb) file a.out
Reading symbols from /home/notzed/src/elf-loader/a.out...done.
(gdb) load
Loading section ivt_reset, size 0x4 lma 0x0
Loading section .reserved_crt0, size 0xc lma 0x58
Loading section NEW_LIB_RO, size 0x170 lma 0x64
Loading section NEW_LIB_WR, size 0x450 lma 0x1d8
Loading section GNU_C_BUILTIN_LIB_RO, size 0x6 lma 0x628
Loading section .init, size 0x24 lma 0x62e
Loading section .text, size 0x268 lma 0x660
Loading section .fini, size 0x1a lma 0x8c8
Loading section .ctors, size 0x8 lma 0x8e4
Loading section .dtors, size 0x8 lma 0x8ec
Loading section .jcr, size 0x4 lma 0x8f4
Loading section .data, size 0x4 lma 0x8f8
Loading section .rodata, size 0x8 lma 0x900
Start address 0x0
Transfer rate: 17632 bits in <1 sec.
(gdb) b ez_signal_init
Breakpoint 1 at 0x85a
(gdb) r
Starting program: /home/foo/src/elf-loader/a.out 

Breakpoint 1, 0x0000085a in ez_signal_init ()
(gdb) stepi
0x0000085e in ez_signal_init ()
(gdb) 
0x00000860 in ez_signal_init ()
(gdb) 
0x00000862 in ez_signal_init ()
(gdb) 
0x00000864 in ez_signal_init ()

Yay? Had better tools on the C=64.

But I came across this article which provides something of a workaround. Not the best but it will suffice.

(gdb) display /3i $pc
1: x/3i $pc
=> 0x864 <ez_signal_init+10>:   lsl r1,r1,0x2
   0x866 <ez_signal_init+12>:   lsl r3,r3,0x3
   0x868 <ez_signal_init+14>:   beq 0x880 <ez_signal_init+38>
(gdb) stepi
0x00000866 in ez_signal_init ()
1: x/3i $pc
=> 0x866 <ez_signal_init+12>:   lsl r3,r3,0x3
   0x868 <ez_signal_init+14>:   beq 0x880 <ez_signal_init+38>
   0x86a <ez_signal_init+16>:   mov r2,0xffff
(gdb) 
0x00000868 in ez_signal_init ()
1: x/3i $pc
=> 0x868 <ez_signal_init+14>:   beq 0x880 <ez_signal_init+38>
   0x86a <ez_signal_init+16>:   mov r2,0xffff
   0x86e <ez_signal_init+20>:   movt r2,0xffff
(gdb) 
0x0000086a in ez_signal_init ()
1: x/3i $pc
=> 0x86a <ez_signal_init+16>:   mov r2,0xffff
   0x86e <ez_signal_init+20>:   movt r2,0xffff
   0x872 <ez_signal_init+24>:   lsr r2,r2,r3
(gdb) 

set confirm off may also reduce offence, but otherwise it was all a bit easier than I remember it seemed when I read about it as the few commands above show.

I tend to only use debuggers these days as a tool of absolutely last resort. Most 'bugs' I work on are 'known bugs' like missing or incomplete features where such a tool isn't any help. And debugger features which should be useful like variable watches are often just too much of a pain to setup and work with or trying to step to the 5000th iteration of some loop which also tends to be painful with debuggers even if they have support for it. Obviously i'm not having to debug production code in-place these days, thankfully.

But I remember using it heavily in the bad old days when working on Evolution which uses threads a fair bit - whether each updated revision of gdb would even be able to produce usable stack traces of a threaded application was a bit of a lucky dip. I remember not upgrading for years at a time once I got a combination of kernel and gdb that actually worked. It's probably the single most influential experience that lead me to become very conservative with my tool updates. There's nothing more frustrating than having to fix a breakage for something that wasn't broken in the first place.

No comments: