Saturday 19 April 2014

a better (debugging) printf

As a bit of a side-mission I thought I'd have a poke at the printf problem on epiphany. Using it drags in a whole pile of floating point snot and stdio, and it completely blows out the text space so it won't fit on an epu.

#include <stdio.h>

int main(int argc, char **argv) {
 printf("test! %f\n", 1.0);
}

$ e-gcc e-test.c
$ e-size a.out
   text    data     bss     dec     hex filename
  42282    2304      88   44674    ae82 a.out

The only way to use it is to drop it into the external memory which has some performance issues. The performance issues aren't critical for a debugging function but even then e-hal doesn't install a listener so it doesn't actually work.

So the solution I'm going to use ... just use a stub and dump the printf data to a queue which can then be processed by eze-host. The stub can hopefully be small enough to fit into LDS and the queuing provides implicit buffering that should let the code run fast enough to debug most problems.

The problem here is that printf is a varargs call and varargs by definition doesn't know how many arguments it has ... so the stub needs to parse the format string and marshal the data out to another structure, and then the host has to interpret this.

Fortunately for me the varargs format on both epiphany and arm appears to be the same, and what's more, 'va_list' is just a simple pointer, which simplifies the host processing considerably. Or at least it appears to be, from some investigation.

So the approach is basically:

ezecore:

void
ez_printf(const char *__restrict fmt, ...) {
   va_list ap;

   va_start(ap, fmt);

   ... scan fmt and work out how big the argument list is
   ...  copy any strings referenced and change the pointer

   ... allocate a queue slot and copy all data there
  
   va_end(ap);
}

Some of the more esoteric features of printf, like output parameters (%n), just don't need to be supported, and so the work is just some marshalling based on the fmt string. This still takes a bit of code so I have to try to shrink it as much as possible whilst not losing any important functionality. I think long double can go for instance.

The c compiler has already promoted every argument to 4 or 8 (or 16?) bytes long, and if it wasn't for strings it could just memcpy the va_list once it knew how long it was.
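As a rough illustration of the "work out how big the argument list is" step, here is a minimal sketch of a format scanner. It is not the stub from this post: `ez_fmt_args_size` and its `start` parameter are made-up names, and it assumes a 32-bit ABI where int, char and pointer arguments take 4 bytes and double, long long and intmax_t take 8 bytes aligned to 8 by address. Long double and %n are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: number of bytes of promoted varargs that fmt consumes.
 * 'start' is the offset of the first variadic argument from an 8-byte
 * boundary (assumption: 8-byte values are aligned by address). */
size_t ez_fmt_args_size(const char *fmt, size_t start)
{
    size_t off = start;

    assert(fmt != NULL);

    while (*fmt) {
        int wide = 0;

        if (*fmt++ != '%')
            continue;
        if (*fmt == '%') {              /* literal "%%" takes no argument */
            fmt++;
            continue;
        }
        while (*fmt && strchr("-+ #0", *fmt))           /* flags */
            fmt++;
        while (*fmt && strchr("0123456789.*", *fmt)) {  /* width, precision */
            if (*fmt == '*')
                off += 4;               /* '*' consumes an int argument */
            fmt++;
        }
        while (*fmt && strchr("hlzjt", *fmt)) {         /* length modifiers */
            if (*fmt == 'j' || (fmt[0] == 'l' && fmt[1] == 'l'))
                wide = 1;
            fmt++;
        }
        if (*fmt == '\0')
            break;                      /* truncated specifier */
        if (strchr("fFeEgGaA", *fmt))
            wide = 1;                   /* float was promoted to double */

        if (wide)
            off = ((off + 7) & ~(size_t)7) + 8;
        else
            off += 4;
        fmt++;
    }
    return off - start;
}
```

With the size known, the stub could memcpy that many bytes out of the va_list in one go, provided no %s arguments need their strings copied as well.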

ezehost:

... process syscall queue, find printf:

int (*aprintf)(const char *fmt, uint32_t *a) = (int(*)(const char *, uint32_t *))vprintf;

void do_printf(const char *fmt, uint32_t *args) {
 aprintf(fmt, args);
}

This last 'hack' of just rewriting the argument types is wildly unportable but it works on arm with gcc. Without it you're basically forced to write your own printf or do some deeper (also non-portable) poking since there's no way to create a va_list portably in code.
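For comparison, the "write your own printf" route on the host might look something like the sketch below, which walks an array of 32-bit words and hands each conversion to snprintf one at a time. This is not what ezehost does. `ez_format_words`, `str_base` and `str_addr` are hypothetical names, the string mapping is assumed to be a simple linear offset, doubles are assumed to be 8-byte aligned by address in host byte order, and only d i c u x X s f e g and %% are handled.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical: format fmt into out[len] from 32-bit argument words.
 * A %s argument is a device address; the string is assumed to live at
 * str_base + (address - str_addr) on the host.
 * Returns the number of characters written, or -1 on error. */
int ez_format_words(char *out, size_t len, const char *fmt,
                    const uint32_t *args,
                    const char *str_base, uint32_t str_addr)
{
    size_t used = 0;

    assert(out != NULL && len > 0 && fmt != NULL);
    out[0] = '\0';

    while (*fmt) {
        char spec[32];
        size_t n;
        double d;
        int w;

        if (used + 1 >= len)
            return -1;                      /* output buffer full */
        if (*fmt != '%') {
            out[used++] = *fmt++;
            out[used] = '\0';
            continue;
        }
        /* spec = '%' + flags/width/precision + conversion character */
        n = 1 + strspn(fmt + 1, "-+ #0123456789.");
        if (fmt[n] == '\0' || n + 2 > sizeof spec)
            return -1;
        memcpy(spec, fmt, n + 1);
        spec[n + 1] = '\0';
        fmt += n + 1;

        switch (spec[n]) {
        case '%':
            w = snprintf(out + used, len - used, "%%");
            break;
        case 'd': case 'i': case 'c':
            w = snprintf(out + used, len - used, spec, (int)(int32_t)*args++);
            break;
        case 'u': case 'x': case 'X':
            w = snprintf(out + used, len - used, spec, (unsigned)*args++);
            break;
        case 's':
            w = snprintf(out + used, len - used, spec,
                         str_base + (*args++ - str_addr));
            break;
        case 'f': case 'e': case 'g':
            if ((uintptr_t)args & 4)        /* skip padding before a double */
                args++;
            memcpy(&d, args, sizeof d);
            args += 2;
            w = snprintf(out + used, len - used, spec, d);
            break;
        default:
            return -1;                      /* unsupported conversion */
        }
        if (w < 0 || (size_t)w >= len - used)
            return -1;
        used += (size_t)w;
    }
    return (int)used;
}
```

It's a lot more code than the one-line cast, which is rather the point of the hack, but it doesn't rely on how the compiler represents va_list.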

One last pain is that the varargs abi promotes all floats to doubles. So just using varargs with floats drags in 800 bytes or so of code to perform float to double conversion. I wrote a much smaller one that won't be fully standards compliant but should suffice for debugging purposes (well, probably).
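To give an idea of how small such a conversion can be, here is a sketch of one way to widen the bits by hand. It is not the converter used here, just an illustration under stated assumptions: IEEE-754 single and double formats, denormal inputs flushed to zero (which is where it stops being standards compliant), and the made-up names `ez_f2d_bits` and `ez_f2d`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Widen IEEE-754 single-precision bits to double-precision bits.
 * Denormal inputs are flushed to (signed) zero. */
uint64_t ez_f2d_bits(uint32_t f)
{
    uint64_t sign = (uint64_t)(f >> 31) << 63;
    uint32_t exp  = (f >> 23) & 0xff;
    uint64_t man  = (uint64_t)(f & 0x7fffff) << 29;  /* 23 -> 52 bit mantissa */

    if (exp == 0)
        return sign;                            /* zero or flushed denormal */
    if (exp == 0xff)
        return sign | (0x7ffULL << 52) | man;   /* infinity or NaN */
    return sign | ((uint64_t)(exp - 127 + 1023) << 52) | man;  /* rebias */
}

/* Convenience wrapper for trying it out on a host. */
double ez_f2d(float f)
{
    uint32_t in;
    uint64_t out;
    double d;

    assert(sizeof(float) == 4 && sizeof(double) == 8);
    memcpy(&in, &f, sizeof in);
    out = ez_f2d_bits(in);
    memcpy(&d, &out, sizeof d);
    return d;
}
```

Every normal float is exactly representable as a double, so there's no rounding to worry about, only the exponent rebias and the special cases.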

Update: Hmm, I played with it and I dunno, parsing the format and using the va_arg stuff is still a bit of code. Probably acceptable; ... but

I guess two other alternatives are available:

  1. Use a trap and move all the processing to the host.

    Blocks but the code-size is absolutely minimal, just a stub which calls a stub which is a trap;

  2. Just copy a fixed-sized block of the ap across and have that as a known limitation. Along with demanding that any %s argument strings have to be in .shared sections.

    Better performance but the limitations are error prone.
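For the second alternative, the queue could be as simple as a single-producer, single-consumer ring of fixed-size records. The following is only a sketch: the record layout, the sizes and all the names (`ez_log_rec`, `ez_log_queue`, `ez_log_push`, `ez_log_pop`) are hypothetical, and real code shared between a core and the host would also have to worry about write ordering across the mesh, which this ignores.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EZ_LOG_WORDS 7   /* hypothetical: argument words copied per call */
#define EZ_LOG_SLOTS 8   /* hypothetical: must be a power of two */

struct ez_log_rec {
    uint32_t fmt;                 /* address of the format string */
    uint32_t args[EZ_LOG_WORDS];  /* fixed-size copy of the argument block */
};

struct ez_log_queue {
    volatile uint32_t head;       /* written only by the producer (core) */
    volatile uint32_t tail;       /* written only by the consumer (host) */
    struct ez_log_rec slot[EZ_LOG_SLOTS];
};

/* Producer side: returns 0 on success, -1 if the queue is full.
 * Always copies EZ_LOG_WORDS words, whether or not fmt uses them all. */
int ez_log_push(struct ez_log_queue *q, uint32_t fmt, const uint32_t *args)
{
    uint32_t head = q->head;
    struct ez_log_rec *r;

    assert(q != NULL && args != NULL);
    if (head - q->tail == EZ_LOG_SLOTS)
        return -1;
    r = &q->slot[head & (EZ_LOG_SLOTS - 1)];
    r->fmt = fmt;
    memcpy(r->args, args, sizeof r->args);
    q->head = head + 1;           /* publish only after the record is filled */
    return 0;
}

/* Consumer side: returns 0 and fills *out, or -1 if the queue is empty. */
int ez_log_pop(struct ez_log_queue *q, struct ez_log_rec *out)
{
    uint32_t tail = q->tail;

    assert(q != NULL && out != NULL);
    if (tail == q->head)
        return -1;
    *out = q->slot[tail & (EZ_LOG_SLOTS - 1)];
    q->tail = tail + 1;
    return 0;
}
```

The known limitation shows up directly in the record: anything past EZ_LOG_WORDS words of arguments is silently lost, and a %s argument is only an address, so the string itself has to be somewhere the host can read.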

There's always something ...

Update 2: So I had a look at this today ... and basically I decided to just go with the trap version because it's the smallest bit of code both on-core and on-host (and TBH, it's the first thing I got working and it's not interesting enough to keep investigating further.)

Because of a couple of decisions it turned out to be pretty easy actually. All I do is proxy it straight to the host and because the varargs abi is identical between the two cpus I don't even have to do any argument rewriting.

Rather than fudge around with tricking C to do what I want the epu stub is much easier just implemented in assembly language. The varargs call is like any other - it just gets the first 4 arguments in registers and the rest are stored on the stack. I could just handle this on the host but it's easier if I just convert it to an array on-epu first and then trap directly to the host.

// LGPL3
        ;; c-prototype
        ;; void ez_dprintf(const char *fmt, ...);

        ;; stores r1-r3 on the stack and changes r1 to point to it

        .balign 4
_ez_dprintf:
        strd    r2,[sp],#-2
        str     r1,[sp,#3]
        add     r1,sp,#12

        ;; r0 = fmt
        ;; r1 = args
        trap    #16
        
        add     sp,sp,#16
        rts
        export  _ez_dprintf

On the host code I launch a monitor thread which polls the DEBUGSTATUS register (unfortunately polling is the only possibility right now). This turns to 1 when the core is halted from a trap instruction (and a couple of others). At this point the host is free to peek and poke pretty much anything on the core including all registers so it just looks up r0 and r1 and then invokes vprintf directly using the type-rewriting trick mentioned above.

// LGPL3
// note this is not using the epiphany sdk
// runs in a polling loop:
        uint status = ee_read_reg(dev, r, c, E_REG_DEBUGSTATUS);

        if (status & 1) {
                e_core_t *ecore = ee_get_core(dev, r, c);
                int pc = ee_read_reg(dev, r, c, E_REG_PC);
                unsigned short insn = ((unsigned short *)(ecore->mems.base + pc))[-1];

                // check for trap instruction
                if ((insn & 0x3ff) == 0x3e2) {
                        // check trap code, there are 6 bits for the code
                        switch (insn>>10) {
                        case 16: { // dprintf
                                char *efmt = ez_host_addr(wg, ecore->mems.base,
                                                          ee_read_reg(dev, r, c, E_REG_R0));
                                uint *eargs = ez_host_addr(wg, ecore->mems.base,
                                                           ee_read_reg(dev, r, c, E_REG_R1));

                                aprintf(efmt, eargs);
                                break;
                        }
                        }
                }
                e_resume(dev, r, c);
        }
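The instruction check in that loop can be pulled out into a tiny helper, which makes it easy to test away from the hardware. This is just a restatement of the masks used above (low 10 bits 0x3e2, top 6 bits the trap code). The names `ez_trap_code` and `ez_trap_insn` are made up for the sketch.

```c
#include <assert.h>
#include <stdint.h>

#define EZ_TRAP_OPCODE  0x3e2u  /* low 10 bits of a trap instruction */
#define EZ_TRAP_DPRINTF 16      /* trap code used by ez_dprintf */

/* Returns the 6-bit trap code, or -1 if insn is not a trap instruction. */
int ez_trap_code(uint16_t insn)
{
    if ((insn & 0x3ffu) != EZ_TRAP_OPCODE)
        return -1;
    return insn >> 10;
}

/* Builds the 16-bit encoding for a trap code; handy for testing the decoder. */
uint16_t ez_trap_insn(unsigned code)
{
    assert(code < 64);
    return (uint16_t)((code << 10) | EZ_TRAP_OPCODE);
}
```

Keeping the decode separate from the register reads also leaves an obvious place to hang further trap codes later.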

Now ... the next trick is to make sure string arguments are in the right location. I modified the eze-loader to just put all constant strings (SHF_STRINGS in the section header flags) into the shared external memory block, and ezehost maps this at the same virtual address for both the host and the epiphany, so at least in most cases it "just works". This may have side-effects, although I can't see why an epu should be working with constant strings locally in the general case. For debugging statements in particular it's nice that they take as little memory as possible on-core (added bonus: the host accesses don't need to go across the mesh either).

To use a %s specifier on non-constant strings the caller currently has to just make sure the buffer is in the shared memory block. If necessary the host code can be modified to fix any string addresses by parsing the format.

The buffering version I started with is probably still worth looking into but this is a start. Problems with this include the latency of the call and the fact that it behaves almost like a barrier in practice and thus interferes with the running state.

If I re-try the size test:

float f = 12.0f;
int main(void) {
        ez_dprintf("test! %f\n", f);
}

$ e-size e-test-printf.elf
   text    data     bss     dec     hex filename
    208       4       0     212      d4 e-test-printf.elf

I moved the float to a variable to force the double conversion at runtime, which uses my lightweight (non-compliant?) implementation. The whole on-core overhead including the double converter is only 90 bytes.
