Tuesday 3 June 2014

atomic counters

Yesterday I tried a mutex-based implementation of an atomic counter to see how it compares with the interrupt-driven one.
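
Roughly speaking the mutex version boils down to something like the following - this is only a minimal sketch using the stock e-lib mutex calls rather than the code as-written, with the counter and its lock held in core (0,0)'s local memory:

  #include <e_lib.h>

  /* the copies on core (0,0) are the 'real' shared ones; the lock is
     assumed zeroed at load time (or set up with e_mutex_init()) */
  e_mutex_t count_lock;
  volatile unsigned count;

  unsigned counter_next(void) {
      volatile unsigned *c = e_get_global_address(0, 0, (void *)&count);
      unsigned v;

      e_mutex_lock(0, 0, &count_lock);    /* spins remotely at (0,0) */
      v = (*c)++;
      e_mutex_unlock(0, 0, &count_lock);
      return v;
  }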

My first test was to read the atomic counter 2^20 (1024x1024) times from each core in a tight loop. Times are wall-clock on the host, and the first column is the workgroup size as rows x columns.

  RxC   Interrupt    Mutex
  ----  ----------   ------
  1x1   0.193488s    0.114833s
  1x2   0.317739s    0.122575s
  2x1   0.317739s    0.121737s
  4x1   0.393244s    0.298871s
  1x4   0.393244s    0.361574s
  4x2   0.542462s    1.122173s
  2x4   0.543283s    0.903163s
  4x4   0.849627s    3.493985s
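
In outline the per-core loop for this first test is nothing more than the following, with counter_next() being the mutex version sketched above or its interrupt-served equivalent:

  /* first test: hit the shared counter 2^20 times back to back */
  unsigned i;
  for (i = 0; i < (1u << 20); i++)
      (void)counter_next();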

Interesting to note that the orientation of a single line of cores makes a difference. It may have something to do with using core 0,0 as the location of the mutex. Also of interest is that the 4x4 case accesses the atomic counter 16x as many times as the 1x1 case - here the low bandwidth requirements of the interrupt implementation let it scale better than linearly (roughly 4.4x the time for 16x the accesses, versus about 30x for the mutex), because the requesting core is effectively batching up multiple requests if they come too fast, rather than having to serialise everything remotely.

But as can be seen - once more than 4 cores are in play the interrupt-driven routine starts to win out comfortably. This is despite it effectively blocking the host core while the others are running.

But this isn't a very useful test because no practical software simply increments a remote counter. So for something more realistic I added a delay to the loop using one of the ctimers.
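
The delay just burns a fixed number of clock cycles between counter accesses to stand in for real work - again only a sketch using the plain e-lib ctimer calls rather than the actual code:

  /* busy-wait for roughly 'cycles' clock cycles on ctimer 0 */
  static void spin_cycles(unsigned cycles) {
      e_ctimer_set(E_CTIMER_0, cycles);
      e_ctimer_start(E_CTIMER_0, E_CTIMER_CLK);  /* count down at the core clock */
      while (e_ctimer_get(E_CTIMER_0))
          ;                                      /* the ctimer halts at zero */
      e_ctimer_stop(E_CTIMER_0);
  }

  /* second test: the same 2^20 iterations, now with 'delay' cycles of
     fake work per access (the left-hand column of the table below) */
  static void run_test(unsigned delay) {
      unsigned i;
      for (i = 0; i < (1u << 20); i++) {
          (void)counter_next();
          spin_cycles(delay);
      }
  }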

So looking at the most congested case of all cores busy, with the delay given in clock cycles:

4x4

 Delay     Interrupt    Mutex
 ------    ----------   ------
     10    0.965649s    3.138225s
    100    1.083733s    3.919083s
    200    1.630165s    3.693539s
    300    1.780689s    3.792168s
    400    2.297966s    3.666745s
    500    2.448892s    3.563474s
   1000    3.840059s    1.851269s
   2000    4.923238s    3.402963s

So the cross-over point is around 1000 cycles' worth of work, at least on a 16-core machine. 1000 cycles isn't much.

Given this data, it's probably better to go with a mutex implementation after all. It requires about 1/3 of the code and doesn't load the servicing core nearly as much. Oh well, worth a try. (Hang on, this doesn't let the host participate ... argh).

I have no direct data on the mesh load or the fairness of the count distribution. Intuitively the mutex-based implementation won't be as fair due to the way the routing works, but once you have enough work to do the network shouldn't be particularly busy.

I'm hoping a future hardware revision might include an option on the ctimer to count reads of the ctimer register - which would effectively give two atomic counters per core in hardware. In the meantime there's always that FPGA (maybe that's something I will look at, just no experience there).
