Wednesday 26 March 2014

inlining register reads

42.

I've been looking at the way the epiphany on-core library implements a couple of functions in order to improve them. One of the simplest are the special register read/write functions used for dma and other purposes.

 unsigned e_reg_read(e_core_reg_id_t reg_id);

 unsigned e_reg_read(e_core_reg_id_t reg_id)
 {
        volatile register unsigned reg_val;
        unsigned *addr;

        // TODO: function affects integer flags. Add special API for STATUS
        switch (reg_id)
        {
        case E_REG_CONFIG:
                __asm__ __volatile__ ("MOVFS %0, CONFIG" : "=r" (reg_val) : );
                return reg_val;
        case E_REG_STATUS:
                __asm__ __volatile__ ("MOVFS %0, STATUS" : "=r" (reg_val) : );
                return reg_val;
        default:
                addr = (unsigned *) e_get_global_address(e_group_config.core_row,
                                     e_group_config.core_col, (void *) reg_id);
                return *addr;
        }
 }

As alluded to in the comments this actually breaks reading the status register anyway ... and it is incomplete. There are 42 special registers but because they are not contiguous and the actual memory address is passed to the function the compiler generates either a giant jump table or a long sequence of nested branches searching for the switch target.

And apart from this any calling code needs to go via the a function invocation which may be as simple as a branch but is more likely to be a 32-bit load followed by a jsr, and then the function itself needs to implement a switch.

Update: I added some markup to the output. Bold is the code that is required to do the actual job, for the first examples it includes the e_reg_read function implementation but in the ideal case it is a single instruction. Italic is bad code either due to incorrect implementation or the compiler going whacko the didlio for whatever reason.

00000000 _e_reg_read:
   0:   40e2            mov r2,r0
   2:   000b 0042       mov r0,0x400
   6:   01eb 1002       movt r0,0xf
   a:   d65c 2700       str lr,[sp],-0x4
   e:   283a            sub r1,r2,r0
  10:   2800            beq 60 <_e_reg_read+0x60>
  12:   008b 0042       mov r0,0x404
  16:   01eb 1002       movt r0,0xf
  1a:   283a            sub r1,r2,r0
  1c:   1700            beq 4a <_e_reg_read+0x4a>
  1e:   000b 0002       mov r0,0x0
  22:   200b 0002       mov r1,0x0
  26:   000b 1002       movt r0,0x0
  2a:   200b 1002       movt r1,0x0
  2e:   600b 0002       mov r3,0x0
  32:   0044            ldr r0,[r0]
  34:   2444            ldr r1,[r1]
  36:   600b 1002       movt r3,0x0
  3a:   0d52            jalr r3
  3c:   d64c 2400       ldr lr,[sp,+0x4]
  40:   0044            ldr r0,[r0]
  42:   b41b 2402       add sp,sp,16
  46:   194f 0402       rts
  4a:   0512            movfs r0,status
  4c:   15dc 0400       str r0,[sp,+0x3]
  50:   15cc 0400       ldr r0,[sp,+0x3]
  54:   d64c 2400       ldr lr,[sp,+0x4]
  58:   b41b 2402       add sp,sp,16
  5c:   194f 0402       rts
  60:   0112            movfs r0,config
  62:   15dc 0400       str r0,[sp,+0x3]
  66:   15cc 0400       ldr r0,[sp,+0x3]
  6a:   d64c 2400       ldr lr,[sp,+0x4]
  6e:   b41b 2402       add sp,sp,16
  72:   194f 0402       rts
  76:   01a2    nop
And an example call:
unsigned a;
void foo(void) {
 a = e_reg_read(E_REG_STATUS);
}
 --> e-gcc -std=gnu99 -O2  -c -o e-foo.o e-foo.c

00000000 _foo:
   0:   008b 0042       mov r0,0x404
   4:   200b 0002       mov r1,0x0
   8:   d55c 2700       str lr,[sp],-0x2
   c:   200b 1002       movt r1,0x0
  10:   01eb 1002       movt r0,0xf
  14:   0552            jalr r1
  16:   400b 0002       mov r2,0x0
  1a:   400b 1002       movt r2,0x0
  1e:   0854            str r0,[r2]
  20:   d54c 2400       ldr lr,[sp,+0x2]
  24:   04e2            mov r0,r1
  26:   b41b 2401       add sp,sp,8
  2a:   194f 0402       rts
  2e:   01a2    nop

Can we do better?

inline

So the solution is to inline it. Simply moving e_reg_read to an inline function in a header helps. Well sort of helps.

 static inline unsigned ex_reg_read(e_core_reg_id_t reg_id)
 .. exactly the same

int foo_inline(void) {
        a = ex_reg_read(E_REG_STATUS);
}

 --> e-gcc -std=gnu99 -O2  -c -o e-foo.o e-foo.c

00000000 _foo_inline:
   0:   b41b 24ff       add sp,sp,-8
   4:   0512            movfs r0,status
   6:   15dc 0400       str r0,[sp,+0x3]
   a:   35cc 0400       ldr r1,[sp,+0x3]
   e:   000b 0002       mov r0,0x0
  12:   000b 1002       movt r0,0x0
  16:   2054            str r1,[r0]
  18:   810b 2002       mov r12,0x8
  1c:   b61f 248a       add sp,sp,r12
  20:   194f 0402       rts

Yeah not sure what's going on there to make it go through the stack. Maybe the type or something. What the hell? (oh I later worked it out: the unnecessary volatile on the reg_val value is forcing a store to and read from the stack which is not desirable at all, i filed a bug in prickhub).

Actually I'm going backwards here, I actually already wrote this and wanted to compare to what currently happens, so lets just forget all that and see what I came up with.

static inline uint32_t
ez_reg_read(e_core_reg_id_t id) {
        register uint32_t v;

        switch (id) {
        case E_REG_CONFIG:
                asm volatile ("movfs %0,config" : "=r" (v));
                break;
        case E_REG_STATUS:
                asm volatile ("movfs %0,status" : "=r" (v));
                break;
        default:
                v = *((volatile uint32_t *) e_get_global_address(e_group_config.core_row,
                     e_group_config.core_col, (void *) id));
                break;
        }
        return v;
}
And an example.
void fooz_inline(void) {
    a = ez_reg_read(E_REG_STATUS);
}

 --> e-gcc -std=gnu99 -O2  -c -o e-foo.o e-foo.c

00000024 _fooz_inline:
   0:   2512            movfs r1,status
   2:   000b 0002       mov r0,0x0
   6:   000b 1002       movt r0,0x0
   a:   2054            str r1,[r0]
   c:   194f 0402       rts

Ok that's what I wanted. This helps the compiler generate much better code too since the result can go in any register, no scratch registers need to be saved, etc.

So I took this and proceeded to add all 42 of the special registers ... and then compiled an example that had more than one register read and ... damn. Suddenly it decides that it doesn't want to inline it anymore and turns every get into a function call to a 912 byte function. Oops. __attribute__((always_inline)) fixed that at least. Although it's different depending on the compiler - on the device I don't need to add the always_inline thing (or maybe some tiny detail was different).

However, things get a bit nasty when a non-constant expression is passed. Suddenly it inlines as much of that gigantic switch statement as it needs - potentially the whole lot if it doesn't know the range. Obviously not much use.

gcc has one last trick ... __builtin_constant_p(x). This returns true if the parameter is probably a constant. Since for this particular case on the epiphany any special register can just be read by a global memory access (as it uses already for a fallback) this can be used to decide the path to use.

#define ez_reg_read(x) (__builtin_constant_p(x) \
    ? _ez_reg_read(x) \
    : (*((volatile uint32_t *) e_get_global_address( \
       e_group_config.core_row, e_group_config.core_col, (void *) x))))

The macro decides whether to call _ez_reg_read() which will compile into a single movfs instruction, or fall-back to the memory load path (this may have some nasty unrolled-loop cases, hopefully unlikely). Although ... It's probably not terribly important to support non-constant parameters because any tools that need it can do it themselves and the api could just be for known registers (the enum implies that already).

Given how much code the current implementation compiles into it doesn't seem worth special casing any registers at all and it could just fall back to a global memory load every time. I'm lead to believe that movfs/movts goes through the same physical path (unfortunately) and since the required address is the key to the function there's little more to do.

I think elib can be shrunk significantly by changing some of the parameterisation and judicious use of inlining.

So ... whilst playing with a version of e_dma_wait() using this ... I think I found another gcc bug when __builtin_constant_p() is used, so yeah, non-constant args are in the bin.

Update: So it seems I missed the brackets on the macro which someone kindly pointed out on the forums ... oops. Actually before that I had written up the implementation and realised that the macro wasn't doing anything magical and __builtin_constant_p() can just go into the inline function itself.

Below is the code so far which results in identical output.

I thought it wouldn't work with no optimisation turned on but it seems to do the right thing. It still in-lines the ez_reg_read() but drops to the memory access path. This wont work for the config and status registers because they need special handling according to Andreas but I just realised I had separate entry points for those anyway so it should be ok. I'm not sure why C would ever be reading status anyway.

ez_regval_t 
EZ_ALWAYS_INLINE
ez_reg_read(ez_regid_t id) {
 register uint32_t v;

 if (__builtin_constant_p(id)) {
  switch (id) {
   // bank 0
  case E_REG_CONFIG:
   asm volatile ("movfs %0,config" : "=r" (v));
   break;
  case E_REG_STATUS:
   asm volatile ("movfs %0,status" : "=r" (v));
   break;
  case E_REG_PC:
   asm volatile ("movfs %0,pc" : "=r" (v));
   break;
  case E_REG_DEBUGSTATUS:
   asm volatile ("movfs %0,debug" : "=r" (v));
   break;
  ... every other named register ...
  default:
   // unknown register, who cares
   v = 0;
   break;
  }
 } else {
  v = *(volatile uint32_t *)ez_global_core_self((void *)id);
 }

 return v;
}

No comments: