Ed's Diary 2016 05 25

Another writeup of another inconclusive adventure - are we perhaps nearing the source of the Nile?

2016-05-25 another pidirect afternoon with Ed at Dave's

Dave now has a pi3 and a couple of zeros

We tried to run the software on the Pi 3. We have to write kernels to the SD card because the bootloader doesn't (yet) work on this platform.

But the interrupt service time variability and latency jitter are terrible. In fact we're missing the trailing edge of ntube and getting LATE messages

We've got the primary core running the ISR, the emulator on the second and two cores spinning. The ISR can no longer just bump the instruction table base address using a shared register, so now the emulator needs to poll the tube mailbox value. That will slow down each instruction - we can probably find ways to improve this but we try to move forward.
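As a rough illustration of why polling costs something on every instruction, here is a toy Python model of a dispatch loop that checks a mailbox before each instruction. It is only a sketch of the idea, not the emulator's actual code: the names Mailbox, run and op are made up for this example, and a deque stands in for whatever the two cores really share.

```python
from collections import deque


class Mailbox:
    """Hypothetical stand-in for the tube mailbox shared between cores."""

    def __init__(self):
        self._events = deque()

    def post(self, event):
        """Called by the 'ISR' side to signal the emulator."""
        self._events.append(event)

    def poll(self):
        """Called by the emulator; returns None when nothing is pending."""
        return self._events.popleft() if self._events else None


def run(program, mailbox, handle_event):
    """Dispatch each instruction, polling the mailbox first.

    The poll happens on every iteration whether or not an event is
    pending; that is the per-instruction overhead. Returns the poll count.
    """
    polls = 0
    for instruction in program:
        polls += 1
        event = mailbox.poll()
        if event is not None:
            handle_event(event)
        instruction()
    return polls


# Demo: the "ISR" posts an event while the second instruction is running.
log = []
mailbox = Mailbox()


def op(name, post=None):
    """Build a fake instruction that logs its name and optionally posts an event."""
    def step():
        log.append(name)
        if post is not None:
            mailbox.post(post)
    return step


polls = run([op("op1"), op("op2", post="tube_irq"), op("op3")],
            mailbox, lambda event: log.append("handled " + event))
print(log)
print(polls)
```

In this model the event is only noticed at the next instruction boundary, and the loop pays for one poll per instruction even though only one event ever arrives.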

We diverted to have a look at Dave's design for a level shifter board - in Pi HAT format. There's a chance that the shifter and the Pi will stack up and fit under the Beeb, directly plugged into the tube socket, but in any case they can be cable connected. Not an IDE cable but a tube extender, like the one used for the matchbox.

Looks like 10 parts would cost £25 or so with a couple of weeks turnaround from China.

Back to the software...

Possibly we don't have the caches set up right. It turns out the documentation is not so good for the Pi 3: the usual doc is much shorter and refers us to a very much longer doc for details - so long we can't find them.

Fallback: try a pi 2, borrowed from the home entertainment system

It does look better - interrupt service is quicker - but still not fast enough.

We can tell that the ping pong between ISR and main code via the mailbox is happening.

Previously the latency from ntube going low to ISR raising the telltale was described as:

"This shows the ISR latency ranges between 80 and 300ns."

and with the Pi2, it's pretty much the same.

Tried removing two I/O writes from the beginning of the ISR

the telltale LED
clearing the event

and this brought in the read data by 152ns

So 76ns per I/O write - that's quite a cost (and is suspiciously similar to the 80ns min ISR latency).

And yet, we do several I/O accesses:

read all our pins
write data values set
write data values clear
write direction register 1/3
write direction register 2/3
write direction register 3/3

but in fact all this - and the latency - is over in 268ns
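The puzzle in these numbers can be put side by side with a little arithmetic. The sketch below uses only the figures quoted above; the variable names are ours, and it assumes nothing about why the numbers disagree.

```python
# Figures from the measurements above, in nanoseconds.
saved_ns = 152        # read data arrived this much earlier...
writes_removed = 2    # ...after removing two I/O writes from the ISR start
isr_total_ns = 268    # latency plus all the remaining I/O accesses
accesses = 6          # 1 read, 2 data writes, 3 direction-register writes

# Apparent cost of one I/O write, taken from the before/after difference.
per_write_ns = saved_ns / writes_removed

# What six accesses would take if each one really cost that much.
naive_total_ns = accesses * per_write_ns

# Upper bound on the average cost per access, even if latency were zero.
max_average_ns = isr_total_ns / accesses

print(per_write_ns)              # 76.0
print(naive_total_ns)            # 456.0
print(round(max_average_ns, 1))  # 44.7
```

Six accesses at 76ns each would need 456ns, yet the whole thing including latency fits in 268ns, so the six accesses average under 45ns each. Whatever the two removed writes were costing, it doesn't look like a simple fixed price per I/O access.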

Another possible source of slowness is cache misses

We could now see a mix of successful and failing reads and we can measure that the 6502 needs 58ns setup time relative to the clock we have - the clock on the tube.

And yet we know that the Pi 1 and 0 both do better than this

Dave reports he ran elite for 6 hours with only graphical glitches no cache.

Perhaps having the ISR and emulator on different cores is causing a memory coherency delay? We can try running both on the same core. We continue to run our own spin code just in case the constant mailbox polling which is the default idle is actually costly.

We note there is a WFI instruction for spinning without spinning.

We note that there is a fast interrupt mode whereby a restartable instruction will be aborted instead of completed. But we're not at all sure that the cost of interrupting an instruction is what's slowing us down.

We noticed there's a movt instruction which, paired with movw, means 32-bit constants can be assembled in two instructions with no data access - unlike the =CONSTANT idiom.

We can slow down the SDRAM clock - that should have no effect if we are running almost entirely from cache. But if the cache is off, we'll notice. And it turns out that does hurt quite a lot. It hurts the best case time for data ready too - which means even the best case is somehow stalling for SDRAM. Could it be that we have no data cache?

Let's go back to the Pi Zero, which has better docs and very nearly works. We have the improved code already, with the two fewer I/O writes at the front of the ISR. We can investigate the effect of faster or slower SDRAM and/or core clock and/or ARM clock.

Also we can use the bootloader!

(Our emulated performance will suffer because instruction dispatch now checks the mailbox on each instruction)

But our latest multicore code just doesn't work on the zero - will need attention - so it's stashed and we revert to last known good.

Baseline on the Zero: ARM 1000, core 400, SDRAM 450 (MHz)

We note that the timing of read data on the zero is very stable - very little variation - when no emulator is running. Is the emulator causing cache pressure, or are some instructions taking ages to be interrupted? Or both?

(We recall that locking more waysets to make a smaller effective cache did hurt the data transfer reliability)

Leaving elite running for quite a bit to see the range of ISR latency and ISR durations.

Removing/moving the first two I/O writes has only budged the timing a little bit. Why? With the pi 2 or 3 it was a major difference. But it might just have improved the worst case a little bit.

Let's see how much difference changing the SDRAM clock speed makes... not a lot. With SDRAM at 250 there was a very slight increase in interrupt latency.
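For repeating this kind of clock experiment, a small Python helper can generate the settings for each run from the baseline quoted earlier (arm 1000, core 400, sdram 450). This is a sketch under an assumption: that the clocks are set through the standard config.txt keys arm_freq, core_freq and sdram_freq. The helper names render_config and variant are made up for this example.

```python
# Baseline clocks (MHz), assumed to map onto these config.txt keys.
BASELINE = {"arm_freq": 1000, "core_freq": 400, "sdram_freq": 450}


def render_config(settings):
    """Render settings as config.txt-style key=value lines."""
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


def variant(base, **overrides):
    """Return a copy of base with some settings changed; reject unknown keys."""
    unknown = set(overrides) - set(base)
    if unknown:
        raise KeyError(f"unknown setting(s): {sorted(unknown)}")
    return {**base, **overrides}


# The slow-SDRAM run: only sdram_freq differs from the baseline.
slow_sdram = variant(BASELINE, sdram_freq=250)
print(render_config(slow_sdram), end="")
```

Changing one setting per variant keeps each run comparable with the baseline, which matters when the effect being hunted is as small as the one seen here.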

Dave wrote a short machine code program THRASH to read and rewrite bytes every 4k. That caused just a couple of late ISRs over the course of a few minutes.
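The access pattern is simple enough to model in a few lines of Python. This is not Dave's program (which was 6502 machine code), just a sketch of "read and rewrite a byte every 4k". It assumes a 64K address space walked from address 0, which the notes above don't actually state.

```python
STRIDE = 4096  # "every 4k"


def thrash(memory, stride=STRIDE):
    """Read and rewrite one byte every `stride` bytes.

    The byte is written back unchanged, so memory contents are preserved;
    the point is only to touch widely spaced addresses. Returns the list
    of addresses touched.
    """
    touched = []
    for address in range(0, len(memory), stride):
        value = memory[address]   # read
        memory[address] = value   # rewrite the same value
        touched.append(address)
    return touched


# Assumed 64K memory, filled with a recognisable pattern.
memory = bytearray(i & 0xFF for i in range(64 * 1024))
before = bytes(memory)

touched = thrash(memory)
print(len(touched))   # 16
print(touched[:3])    # [0, 4096, 8192]
```

With a 4K stride, 64K is covered by just 16 touches per pass, each landing on a different 4K page.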

We've inherited a config with PVT calibration for SDRAM refresh. Let's disable that: disable_pvt = 1 and again run THRASH a few minutes... ... and indeed there were no stray late ISRs.

And yet we find it hard to see how the emulator isn't entirely in L1 Icache and the 64k Beeb memory isn't entirely in the L2 cache.

And also, if THRASH is fine, why is elite still seeing some late interrupts?

Might interrupt masking, which we thought we now understood, be causing a long interrupt latency? We thought we were only running this code in the event handler - when we don't expect to see an interrupt! But we think we know we can remove it - with an RC filter on tube reset we shouldn't have back to back interrupts so we shouldn't need it.

Doing that, we see elite has one late ISR - 120ns or so late.

other tools we can yet apply: