Thursday 17 July 2014

more tooling around

So i've done a bit more work on the tools and am getting pretty close to the first cut of the timing tool. I did some code tidying and created a separate disassembler tool.

I improved the disassembler output to include the machine code and symbols and handle all addressing modes properly. Yet to do is adding symbols for branch targets (or displaying branch target addresses) and showing aliases for the special registers.

 01:  1 SHT_PROGBITS .text   00000000 00000038 000001a0       0       0       0 0000008 SHF_ALLOC SHF_EXECINSTR

_e_build_wtable:
       0: 070094fc      strd.l  r4,[r13],#-1    
       4: 0400d47c      strd.l  r6,[r13,#+0]    
       8: 2002800b      mov.l   r12,#0x0000     
       c: a40090ec      ldrd.l  r44,[r12,#+1]   
      10: 8023     mov.s   r4,#0x0001      
      12: a400d16c      ldrd.l  r46,[r12,#+2]   
      16: 400a312f      lsl.l   r17,r4,r2       
      1a: c40011ec      ldrd.l  r48,[r12,#+3]   
      1e: a5ba     sub.s   r5,r1,r3        
      20: c400526c      ldrd.l  r50,[r12,#+4]   

Pipeline simulator

I did get the timing tool pretty close to functional but then the edge cases started getting messy and I tried reworking it a couple of times.

My current pipeline is something like this:

   -> fetch [1] [0]          instructions decoded into a 2-element queue
             |   |
        decode and issue     detect dual issue, assign pipeline
              \ /
               X             cross-bar switch/router
              / \
            DE   DE          implement RA stall logic.  pair stalls together.
             |   |
            RA   RA          implement E1 stall logic
             |   |
            E1   E1          execute stages
             |   |
            E2   E2
                 |
                 E3
                 |
                 E4

           alu   fpu          alu = ialu, load/store, control

Note: This is a software-based model of the hardware so doesn't necessarily reflect the physical hardware; it just has to behave the same.

The pipeline is executed from the end backwards - serialising the inherently parallel process of a hardware pipeline but in a way which reaches the same result after each cycle. At each point an instruction only advances if the next pipeline slot is empty. Stalls can happen before RA or E1. I don't know if this is hardware-correct but the timings don't come out right otherwise.

At the fetch stage the next two instructions are read and decoded. Mainly this determines if each a 16-bit or 32-bit instruction and does some setup for the scheduling task. This approximates the way the hardware loads 8-byte instruction blocks and is required to implement the dual-issue logic. There are some fiddly details with the physical hardware that I haven't fully discovered yet (to do with instruction size, alignment, etc) but Andreas says it will probably change in future hardware and is not that important just yet.

The dual issue decision simply decides if both of the next instructions can be executed together. If they can they are then assigned to their respective pipelines and the fetch code is told to grab 2 new instructions. Dual issue rules are quite simple: no written register dependencies and one instruction must be an alu or load/store instruction and the other a floating point op. If they can't be dual-issued then only one instruction advances and the fetch code is told to get only one new one.

The next pipeline stage checks to see if the instructions have the registers they need. This is presumably meant to be in the 'RA' stage but if I put it there instructions are retired too early. If both pipelines contain an instruction (i.e. dual-issue) then they both must pass their register checks before either advance. To calculate if the registers are busy all the instructions present in all the other stages are scanned to determine what registers are still busy. There are some complications here because the answer depends on who's asking.

The next stage then checks for any registers read in E1. AFAIK this is only needed for the store instruction. Each pipeline is tested separately.

The instructions then pass through the pipeline stages until they retire.

My initial cut at the code had each instruction decide when it should reserve a register for itself and when it should mark the register as available (i.e. instruction completed). The decision was mostly based on addressing mode and the scheduling class (fpu/alu/load/control) with some hard-coded 'hacks' to handle specific variations like fmadd. This still wasn't enough as different instructions have different latencies depending upon which instruction is accessing it, and it also varies because some instructions can update two values. The code also tried to incrementally update the 'register busy' set but this was messy and error prone and multiple values needed to be maintained separately.

It got messy.

So i've been working on trying to convert it to a table-driven algorithm with fewer special cases.

After a few iterations this is some output from the current code:

        code                      0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef

        mov.l   r1,#0x0000       |   1                                                            |
        nop.s                    |    1                                                           |
        ldr.s   r0,[r1,#-0]      |     123                                                        |
        fadd.s  r0,r0,r0         |        1234                                                    |
        fadd.s  r0,r0,r0         |             1234                                               |
        str.s   r0,[r1,#-0]      |                 1                                              |
The display needs a bit of tweaking but the execution start times and total running time of each instruction match what i'm getting out of the hardware counters (17 cycles total time).

The other output format I have shows the stalls are happening in the right place. There are 3 E1 stalls and 3+6 total register stalls which matches the hardware counters too. (It displays the alu ops continuing through the pipeline although most are retired after E1.)

          DE       RA       E1       E2             DE       RA       E1       E2       E3       E4
  alu:  0 mov      -        -        -      fpu:    -        -        -        -        -        -    
  alu:  1 nop    0 mov      -        -      fpu:    -        -        -        -        -        -    
  alu:  2 ldr    1 nop    0 mov      -      fpu:    -        -        -        -        -        -    
  alu:    -      2 ldr    1 nop    0 mov    fpu:  3 fadd     -        -        -        -        -    
  alu:    -        -      2 ldr    1 nop    fpu:  3 fadd     -        -        -        -        -    
  alu:    -        -        -      2 ldr    fpu:  3 fadd     -        -        -        -        -    
  alu:    -        -        -        -      fpu:  6 fadd   3 fadd     -        -        -        -    
  alu:    -        -        -        -      fpu:  6 fadd     -      3 fadd     -        -        -    
  alu:    -        -        -        -      fpu:  6 fadd     -        -      3 fadd     -        -    
  alu:    -        -        -        -      fpu:  6 fadd     -        -        -      3 fadd     -    
  alu:    -        -        -        -      fpu:  6 fadd     -        -        -        -      3 fadd 
  alu: 11 str      -        -        -      fpu:    -      6 fadd     -        -        -        -    
  alu: 12 jr    11 str      -        -      fpu:    -        -      6 fadd     -        -        -    
  alu: 12 jr    11 str      -        -      fpu:    -        -        -      6 fadd     -        -    
  alu: 12 jr    11 str      -        -      fpu:    -        -        -        -      6 fadd     -    
  alu: 12 jr    11 str      -        -      fpu:    -        -        -        -        -      6 fadd 
  alu: 16 nop   12 jr    11 str      -      fpu:    -        -        -        -        -        -    
  alu: 17 b     16 nop   12 jr    11 str    fpu:    -        -        -        -        -        -    
  alu: 18 b     17 b     16 nop   12 jr     fpu:    -        -        -        -        -        -    

I've annotated each instruction+addressing mode pair with a few pieces of information:

srcRA
Set of which registers rd,rn,rm are read by the instruction in RA
srcE1
Set of which registers rd,rn,rm are read by the instruction in E1, used by str instruction. I had to add a stall test in E1 to implement this.
dALU, nALU
How many cycles the d or n register is reserved in terms of alu instructions reading them. I've implemented this as a bit-mask but since every bit is set from 0-x it could just be a number.
dFPU, nFPU
How many cycles the d or nregister is reserved in terms of fpu instructions reading them.

Still a bit left like write-after-write stalls but I think that can use a similar mechanism. Things like double-loads or stores can be handled by seeing if the instruction has the appropriate size bits and using rd and rd+1. Because it's only a static analysis tool branches penalties and external loads are not required (although something for the latter might be nice, at least for special register access).

Time passes ... I added write-after-write stalls now (and double load/stores).

So, I think I now have enough guts; but I have go through a few tables and define correct values and then probably convert them into a resource file.

No comments: