Friday 25 April 2014

zZed is listening to ...

Turrican_2/mdat.intro_and_title on (manual) loop.

I had to compile UADE from source - the audacious plugin needs to be disabled since it only supports version 2.0. I couldn't get the TMFX plugin for xmms to work (although i was impressed that slackware still includes xmms).

But yeah ... just sitting back with a couple of G&T's and reminiscing about 'good times' ...

Computers were so much more fun back in the early 90s. Weekend hacking was trying to see how few scan-lines you could get a MOD player to execute in or see how many bobs you could get on the screen within a frame - not writing a fucking ELF loader. Actually those 'good times' were pretty lonely and miserable too; maybe that's why I return whenever I'm again feeling shit.

I plugged one of the small amps I got into my workstation and hooked it up to some desktop speaker modules I acquired at some point to try them out (some sony vaio ones, they're decent enough). I thought the remote on my amp wasn't working so I took it apart and had a look. Oh dear the soldering and component mounting was surprisingly shithouse - some of the pins were nearly touching and the circuit board was covered in splattered solder. I reflowed the main amp IC and a few other pins and added some extra jumpers for the main output signals, did a bit of a clean-up, (the tracks get very thin in places) and made sure the power socket was properly soldered in (it was well short of enough solder). In the end I realised the remote only controls the MP3 module anyway so it was pretty much a pointless exercise apart from a bit of soldering practice.

Last night I finally started playing Dragons Dogma after downloading it from PS+ a few months ago. After Oblivion and Dragon Age it's character creation system is actually pretty decent - people look like people; and the cut-scenes almost look pre-rendered even with your custom character in them. Well pretty people anyway, and if i'm going to look at them for a few days solid that what i'd rather be looking at even if you mostly just see their backs and awkward gait. If I keep on playing it anyway; just can't seem to get much into games for the last couple of years but occasionally something keeps me busy for a while.

Had a weird and disturbing dream though. One of the "trying to run but feet can't get any purchase" variety which is pretty much a recurring nightmare of the last 20 years from my body trying to wake me up either from sleep apnoea or overheating. At least this one was dodging meteorites which was something new and novel, although by the end my mobility was so impaired I was reduced to crawling and still not being able to make forward progress (an alarmingly unpleasant experience, real or not). The Armageddon themed ones are always the most interesting although waking up from them is still horrible.

Ran out of Tonic, trying to find a quaffer in the cellar which isn't off ... they're all off so far. Damn.

Thursday 24 April 2014

A kernel runtime

Yesterday I had to get out of the house but in the morning I had a little bit of a play trying to work out how to implement a kernel runtime. That is one that loads small self-contained routines in sequence which may work on parts or a whole problem at a time. One feature I want to support is the ability to have persistent local storage so that multiple waves of code might work on the same buffers before exporting the results off-core - it may not end up being practical but I'm hoping it will be useful.

At first I thought of trying to embed multiple kernels in the same binary but that doesn't really work - things like the library functions are coalesced across all kernels at best and everything needs a unique name which is just a bit of a pain to work with. And it basically just devolved into using overlays which requires a linker script and are just generally a bit involved.

Whilst I was out 'doing stuff' I came up with another solution which I think solves the usability problem and at this stage I think should support everything I want.

  • A runtime executive (the exec) provides some global functions which are shared amongst all cores, possibly all of libezecore (e-lib);
  • Each kernel is compiled into a standalone binary which includes any ezecore functions it needs that aren't in the exec, it has no startup routine;
  • Persistent data structures are placed in special sections such as .data.kernel. The normal mechanism of using weak symbols is to to reference them from other kernels/cores;
  • At runtime, the full set of interacting kernels must be supplied for a given workgroup.

This last point breaks the queuing abstraction somewhat but I don't see any practical alternative because the total persistent data space needs to be known in advance. Well unless an alternative mechanism is used such as supplying a separate data object as a base elf file.

At load-link time there should be enough information to assign unique locations for any shared buffers and then relocate the kernel code to a common location (or multiple) which the exec can DMA in as required.

Hmm, well I guess I see if it will work when I get around to coding something up.

Saturday 19 April 2014

a better (debugging) printf

As a bit of a side-mission I thought i'd have a poke at the printf problem on epiphany. Using it drags in a whole pile of floating point snot and stdio and it completely blows out the text space so it wont fit on an epu.

#include <stdio.h>

int main(int argc, char **ragv) {
 printf("test! %f\n", 1.0);
}

$ e-gcc e-test.c
$ e-size a.out
   text    data     bss     dec     hex filename
  42282    2304      88   44674    ae82 a.out

The only way to use it is to drop it into the external memory which has some performance issues. The performance issues aren't critical for a debugging function but even then e-hal doesn't install a listener so it doesn't actually work.

So the solution i'm going to use ... just use a stub and dump the printf data to a queue which can then be processed by eze-host. The stub can hopefully be small enough to fit into LDS and the queuing provides implicit buffering that should let the code run fast enough to debug most problems.

The problem here comes in that printf is a varargs call and varargs by definition doesn't know how many arguments it has ... so the stub needs to parse the format string and marshall the data out to another structure and then the host has to interpret this.

Fortunately for me the varargs format on both epiphany and arm appear to be the same, and whats more, 'va_list' is just a simple pointer which simplifies the host processing considerably. Or appears to be from some investigations.

So the approach is basically:

ezecore:

void
ez_printf(const char *__restrict fmt, ...) {
   va_list ap;

   va_start(fmt, ap);

   ... scan fmt and work out how big the argument list is
   ...  copy any strings referenced and change the pointer

   ... allocate a queue slot and copy all data there
  
   va_end(ap);
}

Some of the more esoteric features of printf just don't need to be supported like output parameters and so the work is just some marshalling based on the fmt string. This still takes a bit of code so i have to try to shrink it as much as possible whilst not losing any important functionality. I think long double can go for instance.

The c compiler has already promoted every argument to 4 or 8 (or 16?) bytes long, and if it wasn't for strings it could just memcpy the va_list once it knew how long it was.

ezehost:

... process syscall queue, find printf:

int (*aprintf)(const char *fmt, uint32_t *a) = (int(*)(const char *, uint32_t *))vprintf;

void do_printf(const char *fmt, uint32_t *args) {
 aprintf(fmt, args);
}

This last 'hack' of just rewriting the argument types is wildly unportable but it works on arm with gcc. Without it you're basically forced to write your own printf or do some deeper (also non-portable) poking since there's no way to create a va_list portably in code.

One last pain is that that varags abi promotes all floats to doubles. So just using varargs with floats drags in 800 bytes or so of code to perform float to double conversion. I wrote a much smaller one that wont be fully standards compliant but should suffice for debugging purposes (well, probably).

Update: Hmm, I played with it and I dunno, parsing the format and using the va_arg stuff is still a bit of code. Probably acceptable; ... but

I guess two other alternatives are available:

  1. Use a trap and move all the processing to the host.

    Blocks but the code-size is absolutely minimal, just a stub which calls a stub which is a trap;

  2. Just copy a fixed-sized block of the ap across and have that as a known limitation. Along with demanding that any %s argument strings have to be in .shared sections.

    Better performance but the limitations are error prone.

There's always something ...

Update 2: So I had a look at this today ... and basically I decided to just go with the trap version because it's the smallest bit of code both on-core and on-host (and TBH, it's the first thing I got working and it's not interesting enough to keep investigating further.)

Because of a couple of decisions it turned out to be pretty easy actually. All I do is proxy it straight to the host and because the varargs abi is identical between the two cpus I don't even have to do any argument rewriting.

Rather than fudge around with tricking C to do what I want the epu stub is much easier just implemented in assembly language. The varargs call is like any other - it just gets the first 4 arguments in registers and the rest are stored on the stack. I could just handle this on the host but it's easier if I just convert it to an array on-epu first and then trap directly to the host.

// LGPL3
        ;; c-prototype
        ;; void ez_dprintf(const char *fmt, ...);

        ;; stores r1-r3 on the stack and changes r1 to point to it

        .balign 4
_ez_dprintf:
        strd    r2,[sp],#-2
        str     r1,[sp,#3]
        add     r1,sp,#12

        ;; r0 = fmt
        ;; r1 = args
        trap    #16
        
        add     sp,sp,#16
        rts
        export  _ez_dprintf

On the host code I launch a monitor thread which polls the DEBUGSTATUS register (unfortunately polling is the only possibility right now). This turns to 1 when the core is halted from a trap instruction (and a couple of others). At this point the host is free to peek and poke pretty much anything on the core including all registers so it just looks up r0 and r1 and then invokes vprintf directly using the type-rewriting trick mentioned above.

// LGPL3
// note this is not using the epiphany sdk
// runs in a polling loop:
        uint status = ee_read_reg(dev, r, c, E_REG_DEBUGSTATUS);

        if (status & 1) {
                e_core_t *ecore = ee_get_core(dev, r, c);
                int pc = ee_read_reg(dev, r, c, E_REG_PC);
                unsigned short insn = ((unsigned short *)(ecore->mems.base + pc))[-1];

                // check for trap instruction
                if ((insn & 0x3ff) == 0x3e2) {
                        // check trap code, there are 6 bits for the code
                        switch (insn>>10) {
                        case 16: { // dprintf
                                char *efmt = ez_host_addr(wg, ecore->mems.base,
                                                          ee_read_reg(dev, r, c, E_REG_R0));
                                uint *eargs = ez_host_addr(wg, ecore->mems.base,
                                                           ee_read_reg(dev, r, c, E_REG_R1));

                                aprintf(efmt, eargs);
                                break;
                        }
                        }
                }
                e_resume(dev, r, c);
        }

Now ... the next trick is to make sure string arguments are in the right location. I modified the eze-loader to just put all constant strings (SHF_STRINGS in the section header flags) into the shared external memory block and ezehost puts this as the same virtual address location for both the host and the epiphany so at least in most cases it "just works". This may have side-effects although i can't see why an epu should be working with constant strings locally in the general case and particularly for debugging statements it's nice that it takes as little memory as possible on-core (added bonus: the host-accesses don't need to go across the mesh either).

To use a %s specifier on non-constant strings the caller currently has to just make sure the buffer is in the shared memory block. If necessary the host code can be modified to fix any string addresses by parsing the format.

The buffering version I started with is probably still worth looking into but this is a start. Problems with this include the latency of the call and the fact that it behaves almost like a barrier in practice and thus interferes with the running state.

If I re-try the size test:

float f = 12.0f;
int main(void) {
        ez_dprintf("test! %f\n", f);
}

$ e-size e-test-printf.elf
   text    data     bss     dec     hex filename
    208       4       0     212      d4 e-test-printf.elf

I moved the float to a variable to force the double conversion at runtime, which uses my lightweight (non-compliant?) implementation. The whole on-core overhead including the double converter is only 90 bytes.

Friday 18 April 2014

More thoughts on the software code cache

So I veered off on another tangent ... just how to implement the software instruction cache from the previous post.

Came up with a few 'interesting' preliminary ideas although it definitely needs more work.

function linkage table

Each function call is mapped to a function linkage table. This is a 16-byte record which contains some code and data:

 ;; assembly format, some d* macros define the offset of the given type

        dstruct
        dint32  flt_move        ; the move instruction
        dint32  flt_branch      ; the branch instruction
        dint16  flt_section     ; section address (or index?)
        dint16  flt_offset      ; instruction offset
        dint16  flt_reserved0   ; padding
        dint16  flt_next        ; list of records in this section
        dend    flt_sizeof

The move instruction loads the index of the function itself into scratch register r12 and then either branches to a function which loads the section that contains the code, or a function which handles the function call itself. The runtime manages this. Unfortunately due to nesting it can't just invoke the target function directly. The data is needed to implement the functionality.

section table

The section table is also 16 bytes long and keeps track of the current base address of the section

        dstruct
        dint16  sc_base         ; loaded base address or ~0 if not loaded
        dint16  sc_flt          ; function list for this section
        dint16  sc_size         ; code size
        dint16  sc_reserved0
        dint16  sc_next         ; LRU list
        dint16  sc_prev
        dint32  sc_code         ; global code location
        dend    sc_sizeof

section header

To avoid the need of an auxiliary sort required at garbage collection time, the sections loaded into memory also have a small header of 8 bytes (for alignment).

        dstruct
        dint32  sh_section      ; the section record this belongs to
        dint32  sh_size         ; the size of the whole section + data
        dend    sh_sizeof

Function calls, hit

If the section is loaded ("cache hit") flt_branch points to a function which bounces to the actual function call and more importantly makes sure the calling function is loaded in memory before returning to it, which is the tricky bit.

Approximate algorithm:

rsp = return stack pointer
ssp = section pointer

docall(function, lr)
   ; save return address and current section
   rsp.push((lr-ssp.sc_base));
   rsp.push(ssp);

   ; get section and calc entry point   
   ssp = function.flt_section
   entry = ssp.sc_base + function.flt_offset

   ; set rebound handler
   lr = docall_return

   ; call function
   goto entry

docall_return()
   ; restore ssp
   ssp = rsp.pull();

   ; if still here, return to it (possibly different location)
   if (ssp.sc_base) {
      lr = rsp.pull() + ssp.sc_base;
      goto lr;
   }

   ; must load in section
   load_section(ssp)

   ; return to it
   lr = rsp.pull() + ssp.sc_base;
   goto lr;

I think I have everything there. It's fairly straightforward if a little involved.

If the section is the same it could avoid most of the work but the linker wont generate such code unless the code uses function pointers. The function loaded by the stub (flt record) might just be an id (support 65K functions) or an address (i.e. all on-core functions).

I have a preliminary shot at it which adds about 22 cycles to each cross-section function call in the case the section is present.

Function calls, miss

If the section for the function is not loaded, then the branch will instead point to a stub which loads the section first before basically invoking docall() itself.

doload(function, lr)
   ; save return address and current section
   rsp.push((lr-ssp));
   rsp.push(ssp);

   ; load in section
   ssp = function.flt_section
   load_section(ssp);

   ; calculate function entry
   entry = ssp.sc_base + function.flt_offset

   ; set rebound handler (same)
   lr = docall_return

   ; call function
   goto entry  

load_section() is where the fun starts.

Cache management

So I thought of a couple of ways to manage the cache but settled on a solution which uses garbage collection and movable objects. This ensures every byte possible is available for function use and i'm pretty certain will take less code to implement.

This is where the sh struct comes in to play - the cache needs both an LRU ordered list and a memory-location ordered list and this is the easiest way to implement it.

Anyway i've written up a bit of C to test the idea and i'm pretty sure it's sound. It's fairly long but each step is simple. I'm using AmigOS/exec linked lists as they('re cool and) fit this type of problem well.

loader_ctx {
  list loaded;
  list unloaded;
  int alloc;
  int limit;
  void *code;
} ctx;

load_section(ssp) {
   needed = ctx.limit - ssp.sc_size - sizeof(sh);

   if (ctx.alloc > needed) {
      ; not enough room - garbage collect based on LRU order

      ; scan LRU ordered list for sections which still fit
      used = 0;
      wnode = ctx.loaded.head;
      nnode = wnode.next;
      while (nnode) {
         nused = wnode.sc_size + used + sizeof(sh);

         if (nused > needed)
            break;

         used = nused;

         wnode = nnode;
         nnode = nnode.next;
      }

      ; mark the rest as removed
      while (nnode) {
         wnode.sc_base = -1;

         ;; fix all entry points to "doload"
         unload_functions(wnode.sc_flt);

         wnode.remove();
         ctx.unloaded.addhead(wnode);

         wnode = nnode;
         nnode = nnode.next;
      }

      ; compact anything left, in address order
      src = 0;
      dst = 0;
      while (dst < used) {
         sh = ctx.code + src;
         wnode = sh.sh_section;
         size = sh.sh_size;

         ; has been expunged, skip it
         if (wnode.sc_base == -1) {
            src += size;
            continue;
         }

         ; move if necessary
         if (src != dst) {
            memmove(ctx.code + dst, ctx.code + src, size);
            wnode.sc_base = dst + sizeof(sh);
         }

         src += size;
         dst += size;
      }
   }

   ; load in new section
   ;; create section header
   sh = ctx.code + ctx.alloc;
   sh.sh_section = ssp;
   sh.sh_size = ssp.size + sizeof(sh);

   ;; allocate section memory
   ssp.sc_base = ctx.alloc + sizeof(sh);
   ctx.alloc += ssp.size + sizeof(sh);

   ;; copy in code from global shared memory
   memcpy(ssp.sc_base, ssp.sc_code, ssp.sc_size);

   ;; fix all entry points to "docall"
   load_functions(ssp.sc_flt);

   ;; move to loaded list
   ssp.remove();
   ctx.loaded.addhead(ssp);   
}

The last couple of lines could also be used at each function call to ensure the section LRU list is correct, which is probably worth the extra overhead. Because the LRU order is only used to decide what to expunge and the memory order is used for packing it doesn't seem to need to move functions very often - which is obviously desirable. It might look like a lot of code but considering this is all that is required in totality it isn't that much.

The functions load_functions() and unload_functions() just set a synthesised branch instruction in the function stubs as appropriate.

Worth it?

Dunno and It's quite a lot of work to try it out - all the code above basically needs to be in assembly language for various reasons. And the loader needs to create all the data-structures needed as well, which is the flt table, the section table, and the section blocks themselves. And ... there needs to be some more relocation stuff done if the sections use relative relocs (i.e. -mshort-calls) when they are loaded or moved - not that this is particularly onerous mind you.

AmigaOS exec Lists

The basic list operations for an exec list are always efficient but turn out to be particularly compact in epiphany code, if the object is guaranteed to be 8-byte aligned which it should be due to the abi.

For example, node.remove() is only 3 instructions:

; r0 = node pointer
        ldrd    r2,[r0]         ; n.next, n.prev
        str     r2,[r3]         ; n.prev.next = n.next
        str     r3,[r2,#1]      ; n.next.prev = n.prev

The good thing about exec lists is that they don't need the list header to remove a node due to some (possibly) modern-c-breaking address-aliasing tricks, but asm has no such problems.

list.addTail() is only 4 instructions if you choose your registers wisely (5 otherwise).

; r3 = list pointer
; r0 = node pointer
        ldr     r2,[r3]         ; l.head
        strd    r2,[r0]         ; n.next = l.head, n.prev = &l.head
        str     r0,[r2,#1]      ; l.head.prev = n
        str     r0,[r3]         ; l.head = n

By design, &l.head == l.

Unfortunately the list iteration trick of going through d0 loads (which set the cc codes directly) on m68K doesn't translate quite as nicely, but it's still better than using a container list and still only uses 2 registers for the loop:

; iterate whole list
; r2 = list header
        ldr     r0,[r2]         ; wnhead  (work node)
        ldr     r1,[r0]         ; nn = wn.next (next node)
        sub     r1,r1,0         ; while (nn) {
        beq     2f
1:
        ; r0 is node pointer and free to be removed from list/reused
        ; node pointer is also ones 'data' so doesn't need an additional de-reference

        mov     r0,r1           ; wn = nn
        ldr     r1,[r1]         ; nn = nn.next
        sub     r1,r1,0
        bne     1b
2:

Again the implementation has a hidden bonus in that the working node can be removed and moved to another list or freed without changing the loop construct or requiring additional registers.

For comparison, the same operation using m68K (devpac syntax might be out, been 20 years):

; a0 = list pointer
        mov.l   (a0),a1         ; wn = l.head  (work node)
        mov.l   (a1),d0         ; nn = wn.next (next node)
        beq.s    2f             ; while (nn) {
.loop:
        ; a1 is node pointer, free to be removed from list/reused

        mov.l   d0,a1           ; wn = nn
        mov.l   (a1),d0         ; nn = wn.next
        bne.s   .loop
2:        

See lists.c from puppybits for a C implementation.

Wednesday 16 April 2014

software instruction cache

So I was posting on the parallella forums and had an idea for something to investigate in the future.

Basically have the ezesdk loader create an automatic software code cache based on sections.

The thinking is this:

  • The relocatable elf files contain RELOC hunks for all function calls outside of the current section (and inside if they are not compiled with -mshort-calls);
  • These need to be resolved by the loader anyway;
  • The loader could point these anywhere - including to a global relocation table which included loader-generated entries;
  • The global relocation table could call a stub which loads the code anywhere in memory because the code is now relocatable (since every function call goes via the stubs).

Well, it's a little more complex than that because return addresses from the stack also need to make sure they track the current location of the caller and if it needs to be re-loaded into the cache. Still off the top of my head, ... this may be possible. For example each stub could track the current callee module in a separate return stack which can be updated should any section be unloaded or relocated. Interrupts would need special handling.

Idea needs to stew.

I haven't had the energy to do much but a 0.0 release of 'ezesdk' isn't too far off. I did the license headers, updated the readme from elf-loader, and tweaked the makefiles a few times. Still playing with some of the apis too.

Sunday 13 April 2014

On liver n stuff n shit.

So one of those things you never really liked as a kid. Once in a while I think 'man i could go some liver', even mum's cooked-to-hell-and-back version that just turns each sliver into a piece of copper-tainted rubber. I have some lamb's fry in the fridge so that might be (part of) dinner tonight. It's usually kind of 'yeah did that, not sure why' but you know, it has to be done sometimes.

Made some lime cordial today, had a couple of g&t's while i was bringing it to the boil and continued thereafter. Went with the straight ABC recipe (hmm, seems the original page is gone) this time even with it's 1.5Kg of sugar although I added a lot more citric acid - love me a good bit of tart and it's better to be safe than sorry on the preservation side of things. The last few batches didn't have enough punch and i've been thinking it might be due to a lack of sugar so this is part of the experiment to confirm that hypothesis. I had to buy the limes this time :( but they were cheap because they were old, and they've been in the fridge for weeks - but they still have a nice sweet ripe taste so i'm hopeful of a good batch. I usually make my own 'soda' with it: 500ml glass of really cold water + ice, some cordial 'to taste', 3/4 teaspoon of citric acid and 3/4 teaspoon of bi-carb, stir till it fizzes then slam it down; a really fast (and tasty) way to re-hydrate after an afternoon in the sun. Doesn't work if the water isn't super-cold - it just fizzes over and goes flat.

Trying not to think of the week ahead. The 200km of cycling I did last week at least shaved a couple of kilo's off my spare tyre and turned my legs into iron but i'm still recovering after 3 days rest - wildly out of practice and alas not 20 any-more. At least it's a short week coming up with Easter next weekend. The biltong I made last week is just about ready - taking a while to dry in this weather and I didn't find the 40W incandescent 'heater' till a couple of days ago (normal globes around here don't produce enough heat anymore, this is a narrow candle flame shaped one which still has a tungsten filament). Forgot to dip it in vinegar and the spice mix might have a touch too much fennel but it's still pretty tasty, i'll jot down the recipe on here in another post in the not too distant future. Made it straight from some corned silverside which was on special, so it both cheap and easy to make.

End of intermission ... back to the g&t's ... and maybe some liver if i can be bothered ...

Saturday 12 April 2014

the eze-thing

After the last week I felt the need for a bit of coding ... i worked a bit on the eze-library, deciding on a directory structure and using it, filling out some missing bits and working on the build system. I learnt a few make and gcc things along the way - which is always nice.

I included and/or implemented the various bits and pieces of mentioned on the last few parallella related posts - things like the global loader-defined barrier memory, async dma (via a queue and interrupts), and the startup routine that can pass arguments to kernels, track the current running state and set a return code.

I decided to just allocate the barrier memory for the whole workgroup before the code address every time even if barriers aren't being used. It can always be changed. Still yet to test the implementation.

The start of the memory-map (excluding the isv entries) now looks like this:

 +----------+------
 |     0028 | extmem
 |     002c | group_id
 |     0030 | group_rows
 |     0034 | group_cols
 |     0038 | core_row
 |     003c | core_col
 |     0040 | group_size
 |     0044 | core_index
 +----------+------
 |     0048 | imask  (short, but here so it can be loaded as int)
 |     004a | status (short)
 |     004c | exit code
 |     0050 | entry
 |     0054 | sp
 |     0058 | arg0
 |     005c | arg1
 |     0060 | arg2
 |     0064 | arg3
 +----------+------
 |     0068 | barrier, group_size bytes
 |          |
 +----------+------
 |    ≥006c | .text .data .bss
 |          | .text.bank0 .data.bank0 .bss.bank0

Not sure if it'll work but i experimented with a @workgroup "tag" on section names. If present the allocation is multiplied by the workgroup size - this was whilst working on the barrier stuff before I realised that wont work because the barrier location has to be the same across all work-items in the work-group even if they're running separately linked code. Something I can play with later anyway.

After getting the most basic test running i'm to the point of being able to debug the new features. And I just got the async dma interfaces to function (yay?) before writing this up. Actually it works out pretty nicely. I define the async dma queue inside the isr handler code so that by using the c functions which reference the queue it drags in the isr and isr vector automatically, which the loader tracks so that the new sync isr automagically sets the correct imask too.

Once i've got everything going i've got a bit of housekeeping stuff to deal with before it can go further. But for now I've settled on two libraries:

libezehost.so

The host-side library (surprise). This includes a fork of the adapteva esdk 'e-hal' as well as the elf-loader stuff. The on-core runtime interface has been changed to accommodate the new features so it wont work with pre-linked binaries.

libezecore.a

This is the on-core support library and equivalent to e-lib. Most of the functions are inline calls which generates smaller code-size and more efficient compilation of leaf functions.

I have assembly versions of almost every non-inline routine too; they save some code-space but maybe not enough to be worth it. I might include it as another library option. Perhaps.

I'm probably also going to look at different runtime mechanisms such as a "job queue" mode rather than the current "one shot" mode. This will be changed by specifying a different crt0.o file. Already the crt0 implementation I have allows one to restart the core by just using e_start() without requiring a reset first because the exit routine just idles rather than trapping.

Friday 11 April 2014

developer tools and tasks vs non-developer tools and tasks

So I started on a project that uses maven for it's build / test and deploy 'script'. Or some part of it; it's complicated.

Like ant (or maybe worse), maven doesn't seem to do dependency checking on targets. All it does is build a dependency graph of the tasks and executes them in an order which satisfies that graph. Or if it fails mid-build you can add some command line arguments to continue from a specific place. Or you can manually cd to a specific part of the system and build that in isolation - but that's one-way ticket to fragility and is error prone even if you've got a really good idea of how the project fits together. So basically you're forced to recompile the whole shooting match every time even if nothing changed in most of the application. This is probably a direct result of both ant and maven's design as a task-based dependency tree rather than a goal based one.

Whilst talking to someone about how useless maven is they mentioned it sounded more like tool more suited to qa or production builds.

Which makes a lot of sense to me.

A developer needs:

  • Fast;
  • Simple;
  • Reliable;
  • Repeatable;
  • Robust.

But for configuration management for QA builds or production, the operator only needs:

  • Reliable;
  • Repeatable;
  • Robust.

Whilst there is considerable overlap there doesn't seem to be enough to me to justify the overhead in developer time - unless you're paying them peanuts having them wait Dx10xN minutes for D developers to re-build stuff that has already been built every time becomes an incredibly costly exercise.

I think this is similar to my main problem with git. Git was written by a configuration/qa manager (Linus), not a typical developer. His needs are that of a configuration/qa manager and not a typical developer. Apart from having an offensive name, it forces every developer to become their own configuration manager as well and whilst that might be a lofty goal it's taking away time from actually getting work done too. This costs money.

What's remotely agile about Agile?

They also use some sort of 'agile' team management process. I really can't say anything good about that at all.

Apart from some pretty shitty choice of words ('sprint' being the most egregious, but also 'backlog item' for all tasks, not just those beyond due-date; all designed to subtly force people to work harder for the same pay) the whole point of 'agile' seems to be to force all of the team to manage the team together. Which is ... a bit backwards. Firstly there's the problem of specialised skills and experience. It also exposes everyone to the responsibilities that managers "get paid the big bucks" for in compensation.

But ultimately it's just another overhead which interferes with actually getting work done. The point of a manager/team leader is to grease the wheels to let the other developers get work done as efficiently as possible and in a cohesive way. And people are different so more than one approach may be necessary. They usually never get to do any 'real work' themselves but without them (as in agile) nobody else does either. I sat through a 2 hour 'planning' meeting in abject horror as they seemed to mostly just group-operate a bug tracker. Something which could have taken the team leader 15 minutes of his own time (and the bigger items could have been done properly rather than just a hand-wavy guesstimate). I'm sure my disgust was visible because i don't even try to hide it and i was too tired to care.

Anyway I did a little reading about it and it just seems like someone went and interviewed (or imagined) a bunch of 'highly productive' teams about their "process" and wrote it down in a book to make money - creating a whole new language in the process as a way to start a new religion (and more profits).

But they forgot the key element: teams are made of people. And people are ultimately what make the team work regardless of (and often in spite of) any processes in place. But this has probably been forgotten on purpose, books like this are for management who wants to treat individual employees as identical resource units to be reallocated and 'consumed' at will.

I can see agile working ... but only in environments and on projects where it was never needed to start with and in those it would only add unnecessary and considerable overheads. This kind of magic cannot be bottled into self-help books which is all these types of books are and why these things are so faddish. Like self-help books if they actually worked the whole self-help publishing industry would collapse.

The whole two-week lock-step milestone thing is just nonsense too. There's never time to do any big architectural changes should they be required so they are never even suggested. And even little tasks just keep bouncing along across 'sprints' if they need to anyway it just means more wasted group-management time rather than letting them sit in the bug tracker till they're ready to be done.

I think like maven/ant they also forgot that a product which keeps the customer happy is the ultimate goal, not adherence to processes and racking up the count of passing short and arbitrary milestones.

But it's my day off ... but i'm too tired to do anything so I might just try to catch up on sleep and let my legs recover from the cycling. I haven't done this much for years and I was pretty much at my limit getting home yesterday; at least that's something that should improve quickly.

Sunday 6 April 2014

linaro arm gnueabihf cross compilers on Slackware64

I've been trying to get a full cross compilation setup going for the parallella but had way more trouble than I should have getting the cross compiler for arm going. I'm using the linaro sources but can't use the binaries I saw because Slackware64 isn't setup for multilib by default and I didn't want to turn it on. It turned out to be easy but I kept getting wrong arguments from old web pages or got the wrong info from others that weren't doing quite the same thing.

It's very well possible there are binaries elsewhere but I didn't see them; and in any case getting this to work is a lot more useful for me. The linaro build tool 'crosstool-ng' seems to expect debian and/or ubuntu and simply didn't work.

I started from linaro-toolchain-binaries page, and downloaded the full sources sources part 1 and part 2 (although part2 is just the linux kernel source which isn't necessary).

First, some variables/setup which makes it work:

top=`pwd`
prefix=/home/notzed/cross
mkdir -p ${prefix}
export PATH=${PATH}:${prefix}/bin

prefix is the install location.
tar xvjf ~/Downloads/gcc-linaro-arm-linux-gnueabihf-4.8-2013.10_src-part1.tar.bz2 
gcc-linaro-arm-linux-gnueabihf-4.8-2013.10_src/
gcc-linaro-arm-linux-gnueabihf-4.8-2013.10_src/gmp-5.0.2.tar.bz2
gcc-linaro-arm-linux-gnueabihf-4.8-2013.10_src/libiconv-1.14.tar.gz
gcc-linaro-arm-linux-gnueabihf-4.8-2013.10_src/expat-2.1.0.tar.gz
gcc-linaro-arm-linux-gnueabihf-4.8-2013.10_src/gdb-linaro-7.6.1-2013.10.tar.bz2
gcc-linaro-arm-linux-gnueabihf-4.8-2013.10_src/zlib-1.2.5.tar.gz
gcc-linaro-arm-linux-gnueabihf-4.8-2013.10_src/isl-0.11.1.tar.bz2
gcc-linaro-arm-linux-gnueabihf-4.8-2013.10_src/gcc-linaro-4.8-2013.10.tar.xz
gcc-linaro-arm-linux-gnueabihf-4.8-2013.10_src/linaro-prebuilt-sysroot-2013.10.tar.bz2
gcc-linaro-arm-linux-gnueabihf-4.8-2013.10_src/binutils-linaro-2.23.2-2013.10-4.tar.bz2
gcc-linaro-arm-linux-gnueabihf-4.8-2013.10_src/mpfr-3.1.0.tar.xz
gcc-linaro-arm-linux-gnueabihf-4.8-2013.10_src/pkg-config-0.25.tar.gz
gcc-linaro-arm-linux-gnueabihf-4.8-2013.10_src/cloog-0.18.0.tar.gz
gcc-linaro-arm-linux-gnueabihf-4.8-2013.10_src/md5sum
gcc-linaro-arm-linux-gnueabihf-4.8-2013.10_src/mpc-0.9.tar.gz
Then untar the interesting ones in a way that the configure scripts expect. I use a script for this but this is what it executes:
mkdir -p build
tar xJf gcc-linaro-arm-linux-gnueabihf-4.8-2013.10_src/gcc-linaro-4.8-2013.10.tar.xz -C build
mv build/gcc-linaro-4.8-2013.10 build/gcc-linaro
tar xjf gcc-linaro-arm-linux-gnueabihf-4.8-2013.10_src/binutils-linaro-2.23.2-2013.10-4.tar.bz2 -C build
mv build/binutils-linaro-2.23.2-2013.10-4 build/binutils-linaro
tar xjf gcc-linaro-arm-linux-gnueabihf-4.8-2013.10_src/linaro-prebuilt-sysroot-2013.10.tar.bz2 -C ${prefix}
tar xjf gcc-linaro-arm-linux-gnueabihf-4.8-2013.10_src/gmp-5.0.2.tar.bz2 -C build
ln -s ../gmp-5.0.2 build/gcc-linaro/gmp
tar xzf gcc-linaro-arm-linux-gnueabihf-4.8-2013.10_src/mpc-0.9.tar.gz -C build
ln -s ../mpc-0.9 build/gcc-linaro/mpc
tar xJf gcc-linaro-arm-linux-gnueabihf-4.8-2013.10_src/mpfr-3.1.0.tar.xz -C build
ln -s ../mpfr-3.1.0 build/gcc-linaro/mpfr
tar xzf gcc-linaro-arm-linux-gnueabihf-4.8-2013.10_src/zlib-1.2.5.tar.gz -C build
ln -s ../zlib-1.2.5 build/gcc-linaro/zlib

The

Then the building, just two stages: binutils then gcc. The sysroot from linaro is used and contains libc and so on.

Binutils is straightforward and only takes a minute or so to build.

top=`pwd`
mkdir -p ${top}/build/binutils-build
cd ${top}/build/binutils-build
../binutils-linaro/configure \
 --prefix=${prefix} \
 --target=arm-linux-gnueabihf \
 --disable-nls --disable-werror \
 --with-sysroot=${prefix}/linaro-prebuilt-sysroot-2013.10
make -j8
make install

Then gcc. This was the painful one ... but to cut the long story short this is what I did to get it to work. It takes a few minutes to build.

mkdir -p ${top}/build/gcc-build
cd ${top}/build/gcc-build
../gcc-linaro/configure \
 --prefix=${prefix} \
 --host=x86_64-linux-gnu \
 --build=x86_64-linux-gnu \
 --target=arm-linux-gnueabihf \
 --with-float=hard --with-arch=armv7-a -with-fpu=vfpv3-d16 \
 --disable-nls --disable-werror \
 --enable-languages=c,c++ \
 --with-sysroot=${prefix}/linaro-prebuilt-sysroot-2013.10
make -j8
make install

*shrug* I can't really say whether it's right or wrong but - but it worked well enough to cross compile a working binary.

The ezesdk

So mostly why I did this was so that I could do some re-arranging of the elf-loader code into a more wide-reaching library and keep working on it locally on my workstation rather than having to do it on the parallella. Not that it isn't fast enough to do the little bit of compiling I need - it certainly is - it was just all done on a whim really (one that drew out to a good few hours although not particularly intense ones, in between making another batch of biltong, cooking some corned beef, drinking, and surfing the web).

Another whimsical decision was to fork the epiphany e-hal library. This ... isn't really something I originally intended to do because it just means more to maintain and also there may be future functionality that would be best kept there - such as multi-process arbitration and so forth. But I guess one small but real justification is that right now I need to access some private (-local) data structures which are presumably 'private' for a reason; and if they ever really went private things would break.

Anyway ... I've just been extremely tired the last few days and a bit distracted by some stuff (to somewhat understate it) so it may well just be a bad idea but I suppose i'll see how far I get. It's taken long enough just to decide on a directory layout. At this point i'm going to have libezehost.so which will be the host driver (e-hal + elf-loader, etc), and libezecore.a which will be the on-core api (i.e. e-lib equivalent) although much of it is inline.

I unfortunately have to start working on that other project tomorrow which means a long commute on cyclist hostile roads (more so than average) so tonight better be an early one. Just as well i'm totally knackered and might not even make 10pm - although daylight saving just ended today so it's "really" later than that.

Friday 4 April 2014

Quick swing by JavaFX

Had a chance to poke around a small JavaFX application for the customer for a couple of days. Unfortunately i've been 'booked' for other work for the next few months so I'm still not getting as much of a chance to play with JavaFX as i'd hoped. Might be the last too (on that s/w) as this new work thrust on me is basically an unrelated graduate-level task in a graduate-level work environment, and that's the least of the issues with it (not too happy right now but i'll have to see how it transpires).

Anyway ... I'm still having issues with layouts although in my particular problem I think it came down to not realising that the scale pivot point is the centre of objects. I had inconsistent/jumping ui's as things changed so it just looked like a layout bug.

I had a few hours left after adding the required functionality so I thought i'd tart it up a bit with clipboard and then drag and drop support. Nice ... it's nice. Just simple and hides all the data-type negotiation and whatnot; something that made sense in 90s era computers without the memory (and also high hopes of extensive functionality) but now there's no real need for it (and dnd is mostly used for stupid things like editing text where it doesn't make sense because of it's clumsiness, and not for file opening where it does). One of those things were a really good style guide might have helped rather than a free-for-all.

As an example this function pointer (or method handle, or whatever it is) can just be added to any ImageView to make it draggable as a picture - e.g. to drop into a picture editor.

    static EventHandler<MouseEvent> dragImageView = (MouseEvent event) -> {
        ImageView iv = (ImageView) event.getSource();
        Image im = iv.getImage();
                
        if (im != null && im.getProgress() == 1) {
            Dragboard db = iv.startDragAndDrop(TransferMode.COPY);
            ClipboardContent cc = new ClipboardContent();

            cc.putImage(im);
            db.setContent(cc);
            event.consume();
        }
    };

    // any image view not in a list:
    iv.setOnDragDetected(dragImageView);

Dragging from lists has to go on the list not the imageview, though isn't much more work.

The clipboard is even a bit easier although I did some strange behaviour with it losing track of the content of the clipboard from another application if it the copy was done when the JavaFX application wasn't already running.

It doesn't seem to support the primary selection though which is a bit of a bummer. Not that most applications that use it do it properly any more :( You do have to stop highlighting the selection when you lose it!

I realised the last time I properly looked at cut and paste was about 15 years ago with gnome-terminal ... hmm.

Compositing

I'm still needing to use BufferedImage for I/O but one of the things I was worried about was whether I would still need to use it for composition. Looks like I don't and depending on how one feels about the javafx layout and blending stuff it offers "all the power" of that - as well as CSS. You just set up an off-screen scene and either snapshot that or one of it's components.

    Scene s = new Scene(root);
    WritableImage wi = s.snapshot(null);

If you want alpha to make it through to the output:

    s.setFill(Color.TRANSPARENT);
    root.setBackground(Background.EMPTY);

(actually that is pretty nice, and about time you're able to unify off-screen and on-screen layout and composition. And the api doesn't prevent the implementation using hardware to speed it up either).

One thing I found a little frustrating is the fact that none of the layout objects clip their content. This makes sense but it ended up not quite working properly when I added it to what I thought was the right place - suddenly all my alpha compositing simply stopped working. I had to wrap the container inside another one and just set the clip on the outer container that then when into a BorderPane or whatever it was. Around this time was when I was also having the strange jumping layout issues so that artificially inflated the frustration level.

So just as i'm starting to get the feel of it after a few short hours ... i'm pulled away again.

(Not that I can't work on something at home ... which i probably will once this parallella has run it's course).

Update: So the "real" customer had a look-see this week and apparently were pretty impressed. Not that the JavaFX had much to do with it but it didn't hurt i'm sure. As I said to the PM afterwards and I quote "they're probably used to seeing some matlab piece of shit that takes days to run" ... anyway I guess we'll see if any more money comes along. If so it could be quite interesting indeed - although again, not particularly for JavaFX related reasons.

Thursday 3 April 2014

async copies

I had a play with this idea far too late into the night last night. A dma memcpy is a fine idea and has some benefits but isn't really taking advantage of the hardware features nor providing much opportunity to deal with some of it's problems.

You can resort to manual dma but that is pretty fiddly and bulky code-wise and it's such a common operation it makes sense to have it available in the runtime. And once it's in the runtime the runtime itself can use it too.

My current thoughts on the api and implementation follow.

 // incase an int isn't enough at some point
 typedef unsigned int ez_async_t;

 // Enqueue 1d copy
 ez_async_t ez_async_memcpy(void *dst, void *src, size_t size);

 // Enqueue 2d copy
 ez_async_t ez_async_memcpy2d(void *dst, size_t dstride,
                              void *src, size_t sstride,
                              size_t width, size_t height);

 // Wait till dma is done
 void ez_async_wait(ez_async_t aid);

 // Query completion status
 int ez_async_complete(ez_async_t aid);

To drive it I have a short cyclic queue of dma header blocks which are written to by the memcpy functions. If the queue is empty when the memcpy is invoked it fires off the dma.

The interesting stuff happens when the queue is not empty - i.e. a dma is still outstanding. All it does in that case is write the queue block and increment the head index and return. Ok that's not really the interesting bit but how the work is picked up. A dma complete interrupt routine checks to see if the queue is not empty and if not then starts the next one.

If the queue is full then the memcpy calls just wait until a slot is free.

I haven't implemented chaining but it should be possible to implement - but it might not make a practical difference and adds quite a bit of complication. Message mode is probably something more important to consider though. This api would use channel 1, leaving 0 for application use (or other runtime functions with higher priority) although if it was flexible enough there shouldn't be a need for manual dma and perhaps the api could support both. It would then need two separate queues and isr's to handle them and would force all code to use these interfaces (it could always take an externally supplied dma request too).

This is the current (untested again) interrupt handler. I managed to get all calculations into only 3 registers to reduce the interrupt overheads.

_ez_async_isr:
        ;; make some scratch
        ;; this can cheat the ABI a bit since this isr has full control
        ;; over the machine while it is executing
        strd    r0,[sp,#-2]
        strd    r2,[sp,#-1]

        ;; isr must save status if it does any alu ops
        movfs   r3,status

        ;; Advance tail
        mov     r2,%low(_dma_queue + 8)
        ldrd    r0,[r2,#-1]             ; load head, tail
        add     r1,r1,#1                ; update tail
        str     r1,[r2,#-1]

        ;; Check for empty queue
        sub     r0,r0,r1
        beq     4f

        ;; Calc record address
        mov     r0,#dma_queue_size-1
        and     r0,r0,r1        ; tail & size-1
        lsl     r0,r0,#5        ; << record size
        add     r0,r0,r2        ; &dma_queue.queue[(tail & (size - 1))]

        ;; Form DMA control word
        lsl     r0,r0,#16       ; dmacon = (addr << 16) | startup
        add     r0,r0,#(1<<3)
        
        ;; Start DMA
        movts   dma1config,r0
4:
        ;; restore state
        movts   status,r3
        
        ldrd    r2,[sp,#-1]
        ldrd    r0,[sp,#-2]
        rti

When I wrote this I thought using 32-bytes for each queue record would be a smart idea because it simplifies the addressing, but multiply by 24 is only 2 more instructions and a scratch register over multiply by 32 so might a liveable expense. The address of the queue is loaded directly so that saves having to add the offset to .queue when calculating the dma request location.

The enqueue is simple enough.

// Pads the dma struct to 32-bytes for easier arithmetic
struct ez_dma_entry_t {
        ez_dma_desc_t dma;
        int reserved0;
        int reserved1;
};

// a power of 2, of course
#define dma_queue_size 4

struct ez_dma_queue_t {
        volatile unsigned int head;
        volatile unsigned int tail;
        struct ez_dma_entry_t queue[dma_queue_size];
};

struct ez_dma_queue_t dma_queue;

...

        uint head = dma_queue.head;
        int empty;

        // Wait until there's room
        while (head - dma_queue_size >= dma_queue.tail)
                ;

 ... dma_queue[head & (dma_queue_size-1)] is setup here

        ez_irq_disable();

        // Enqueue job
        dma_queue.head = head + 1;

        // Check if start needed
        empty = head == dma_queue.tail;

        ez_irq_enable();

        if (empty) {
                // DMA is idle, start it
                ez_dma_start(E_REG_DMA1CONFIG, dma);
        }

        return head;

The code tries to keep the irq disable block as short as possible since that's just generally a good idea. With interrupts this is basically equivalent to a mutex to protect a critical section.

Wednesday 2 April 2014

The journey. It continues.

So I spent the last couple of nights staying up way too late poking at my epiphany runtime library. Last night I basically finished the first cut of the feature complete version with a C implementation backend where i've finally settled on specific implementations of each function tuned mostly for size.

Most of the functions are intended to be inlined and only those more than about 10 instructions are worth putting into their own separate function - this in-lining is something the C compiler can do that isn't possible with assembly language. In every case these functions produce measurably smaller overall code by being inlined either because the function invocation itself takes more to form than the total work done and/or because it allows the callee to become a leaf and thus completely changes the way registers and stack can be assigned and used.

I spent a lot of time swearing at the compiler. Probably if i hadn't been up so late the night before I would merely have been puzzled by it. I played quite a bit with barrier() but ended up with something that is basically identical to the assembly version I came up with, but a good bit bulkier because the compiler doesn't try to use the small registers (0-7) for the whole function. It also does some redundant testing, but I think this is about as good as i'm going to get and it does pick up a couple of optimisations. I go into detail below.

I tried implementing the barrier of the the previous post whereby it tests 4 states per inner iteration but I decided to stay with the byte version due to the code-size increase. I also decided that because a workgroup barrier has to be workgroup wide, there's no need to ever allow the developer to assign memory for it and i've included the barrier memory as part of the runtime setup code (or it can be created by a linker script) which guarantees it will always be in the same location across all cores in the workgroup even if they are running different binaries. This can go somewhere fixed - either before or after the executive (aka kernel aka bios) or after the stack at the end of memory.

One of the 'interesting' loops was where the remote cores are notified of the barrier continuation. In e-lib it uses an array of pointers to implement this which leads to a trivial inner loop at the cost of extra memory reads. I decided to use the workgroup information to update the barrier directly which means it needs a bit more calculation which increases code-size but the loops are still quite small. This is the final implementation which I created by converting the assembly version literally into C after giving up trying to fight with the compiler for a better one.

        volatile unsigned char *root = ez_global_core(barrier, 0, 0);
        // Notify from farthest away
        int r = ez_config->group_rows-1;
        int ce = ez_config->group_cols-1;
        do {
                int c = ce;
                
                do {
                        root[((r<<6) | c) << 20] = 0;
                } while (--c >= 0);
        } while (--r >= 0);

Looking at the listing output (oh boy, finding that option to as really brought back memories, it's -Wa,-a to gcc) one observes that the loop setup takes about half the code with the loop implementation taking up the other. The inner loop is only 6 instructions long.

// build with -O2 -msmall16
   6 0000 0305                  mov r0,#40
   7 0002 C420                  ldr r1,[r0,#1]
   8 0004 4C010040              ldr r16,[r0,#2]
   9 0008 CC210040              ldr r17,[r0,#3]

  10 000c 0B800220              mov ip,_barrier
  11 0010 9606                  lsl r0,r1,#20
  12 0012 7F900A24              orr ip,ip,r0

  13 0016 9B03FF48              add r16,r16,#-1
  14 001a 9B27FF48              add r17,r17,#-1

  15 001e 0360                  mov r3,#0
  16                    .L5:
  17 0020 DF400608              lsl r2,r16,#6
  18 0024 EF040208              mov r0,r17
  19                    .L3:
  20 0028 7A21                  orr r1,r0,r2
  21 002a 9626                  lsl r1,r1,#20
  22 002c 9303                  add r0,r0,#-1
  23 002e 99700004              strb r3,[ip,r1]
  24 0032 3320                  sub r1,r0,#0
  25 0034 70FA                  bgte .L3

  26 0036 9B03FF48              add r16,r16,#-1
  27 003a 3B000008              sub r0,r16,#0
  28 003e 70F1                  bgte .L5

This is almost identical to the assembly I came up with beforehand as the emboldened sections which indicate the non-redundant instructions show.

(untested code follows)

   6 0000 0360                  mov     r3,#0           ; for fragment

   7 0002 0325                  mov     r1,#ez_config
   8 0004 C444                  ldr     r2,[r1,#ezc_group_id]
   9 0006 E484                  ldrd    r4,[r1,#ezc_group_rows/2]
  10 0008 0B000200              mov     r0,%low(_barrier)
  11 000c 964A                  lsl     r2,r2,#20       ; groupid << 20
  12 000e 7A48                  orr     r2,r2,r0        ; root = get_global_address(barrier, 0, 0)
  13 0010 B390                  sub     r4,r4,#1        ; r = rows - 1
  14                    1:
  15 0012 3B360000              sub     r1,r5,#1        ; c = cols - 1
  16 0016 D6D0                  lsl     r6,r4,#6        ; rshift = r << 6
  17                    2:
  18 0018 FAF8                  orr     r7,r6,r1        ; o = (r << 6) | c
  19 001a 96FE                  lsl     r7,r7,#20       ; o <<= 20
  20 001c 916B                  strb    r3,[r2,r7]      ; root[o] = 0
  21 001e B324                  sub     r1,r1,#1        ; while (--c >= 0)
  22 0020 70FC                  bgte    2b

  23 0022 B390                  sub     r4,r4,#1        ; while (--r >= 0)
  24 0024 70F7                  bgte    1b

The register usage is because I extracted it from a bigger fragment - I may have broken it too but since it's only 4 accountable instructions different to the C compiler it should be ok apart from typos.

Being able to use the lower 8 registers really makes a big difference to code size as you'd imagine. Despite only being 4 instructions shorter this is 38 bytes compared to 64. Due to the ABI it does come at the cost of having to save/restore 4 registers which is 16 bytes of instructions and some time but it doesn't take much code before that is made up.

The other point of note is the direct use of the condition codes after the loop index adjustment. This saves one instruction per loop here plus the scratch register which would otherwise be required (line 24 and 27 of the first listing). This is really something I would expect the compiler to pick up. Perhaps less expected is the double-word load which saves one instruction, and the other instruction is saved by changing the redundant register move on line 18 in listing one into a substraction on line 15 in listing two which means the one on line 14 in listing one can be discarded.

The full barrier implementation needs to use a volatile read of a byte array and I found the compiler (and oddly, the arm compiler as well although only in the signed char case) does some weird and unnecessary data size extension when you read from a volatile char array which adds two pointless instructions for every byte load.

So with that the total implementation in C hits 144 bytes vs 88 for the assembly - which is fair enough I guess but there is room for improvement and only small things can make quite a difference when you're dealing with such small functions.

The e_barrier() code in e-lib compiled with the same options (-O2 -msmall16) hits 236, but much of that is because of the need to perform some multiplies. It's actually much worse than that because e_barrier_init() which is also a necessary part of the implementation compiles to 346 bytes ... :-/. I'm pretty sure the e_barrier_init()/e_barrier() pair has a race condition as well which can't be solved without another ... barrier.

If e_group_config is modified to include group_size and core_index values this drops them down to 132 and 224 bytes respectively which in terms of on-core code is still 2.5x the C one above (and 4.0x the asm). Strangely enough the notify loop uses implicit condition code tests on it's loop counter! But it added a separate counter register to implement this so that is probably why it isn't available on 'normal' calculations. Actually an externally initialised version of this implementation could be made very small; but then the data-size will dominate.

If only that hardware barrier worked properly in this case.

I did some further experiments while writing this and although it's kind of academic because of it needing 5x the runtime memory I fiddled with an implementation that uses an array of pointers assuming they have been initialised by the runtime / loader. I can get it down to 70 bytes for the barrier() call.