I think I've got the final code configuration settled for the dirty-row scan stage now, and I'm just putting it through its paces to make sure it does exactly what I expect. There's a little bit of tuning to do, as I'm pretty sure I can optimise register usage in a couple of places, and I believe there'll be a nice little payoff where the routine to set dirty-row bits is essentially a freebie - the last thing the new scan logic does is clear the bit it found (if any), and that's pretty much exactly the same code as is needed to set a bit when a line needs updating. So if I can arrange things really neatly, I might be able to make the 'set bit' routine nothing more than a call into the end of the scan logic with a '1' instead of a '0'. Sweet.
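To make that idea concrete, here's a minimal Python model of the shared tail I have in mind - the names and table layout here are my own illustrative assumptions, not the actual 6502 code. The point is that 'clear the bit the scan found' and 'set the bit for a newly-dirty row' are the same store with a different value:

```python
# 8-entry bitmask table, one bit per position within a byte
# (the bit ordering is an assumption for illustration)
MASKS = [0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01]

def write_row_bit(table, row, bit):
    """Shared tail: the scan calls this with bit=0 to clear the
    flag it just found; the 'mark row dirty' entry point would
    run the very same code with bit=1."""
    byte_index, bit_index = divmod(row, 8)
    if bit:
        table[byte_index] |= MASKS[bit_index]
    else:
        table[byte_index] &= ~MASKS[bit_index] & 0xFF

table = [0, 0, 0, 0]         # 4 bytes cover 25 rows
write_row_bit(table, 10, 1)  # mark row 10 dirty
write_row_bit(table, 10, 0)  # ...and the scan's tail clears it again
```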
So what we have now is a chunk of code that's bigger than it was before, but notably faster. The obvious speed increase comes simply from the fact that the main loop now only reads and tests 4 bytes instead of 25, so the 'best-case' scan where no rows are marked as dirty completes in 40 cycles instead of 290 (a substantial gain). The code size increase comes from now having to scan bits within a byte - having detected a byte with one or more bits set, there's a subsequent loop that iterates through each bit to establish which is/are set and thereby derive the row number. This is also pretty quick because I use a small 8-byte table of preset bitmask patterns that make the comparison a simple AND operation - much like the mask table we use to strobe the VIA when doing keyboard scanning. In this case, we're just looking to see when a bit is set, and then (if one is) doing the little calculation to compute the absolute row number to pass to the redraw routine.
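For illustration, the shape of that scan can be modelled in Python like this - a sketch with names of my own choosing, and the bit-to-row mapping is just one plausible ordering (bit 7 of byte 0 as row 0), not necessarily what the real code uses:

```python
# Preset bitmask table, as described above - one AND per bit test
MASKS = [0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01]

def find_dirty_row(table):
    """table is 4 bytes holding one dirty flag per screen row."""
    for byte_index, value in enumerate(table):
        if value == 0:
            continue            # fast path: whole byte is clean
        for bit_index, mask in enumerate(MASKS):
            if value & mask:    # simple AND against the mask table
                return byte_index * 8 + bit_index
    return None                 # no dirty rows at all

print(find_dirty_row([0x00, 0x20, 0x00, 0x00]))  # byte 1, bit 2 -> row 10
```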
On that note, by the way, I got to use another new undocumented instruction to make the calculation quite neat - the SLO mnemonic. This is a Shift Left and Or instruction (also known as ASO by some assemblers) which has a very slick function whereby it does an ASL on a memory location, and then performs a logical OR with the result against the Accumulator. So my calculation routine (multiply the dirty-row byte index by 8, then add the bit index) becomes two ASLs (to multiply by 4) and then an SLO of that shifted value against the bit index in .A (thus multiplying by 2 again, for a total of 8, and ORing the bit counter into the result to yield the final dirty-row number). There are still a good few undocumented instructions I haven't used yet, but I'm slowly working my way through them... :)
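Just to spell the arithmetic out (in Python rather than 6502, with names of my own invention), here's the shift-and-OR done step by step the way the ASL/ASL/SLO sequence performs it - the OR at the end works as an addition because the three low bits are guaranteed zero after the shifts and the bit index is 0-7:

```python
def row_number(byte_index, bit_index):
    """byte_index * 8 + bit_index, as the ASL/ASL/SLO sequence does it."""
    m = (byte_index << 1) & 0xFF   # ASL: multiply by 2
    m = (m << 1) & 0xFF            # ASL: multiply by 4
    # SLO: shift memory left once more (total x8), then OR the
    # result with .A, which holds the bit index
    m = (m << 1) & 0xFF
    return m | bit_index

print(row_number(1, 2))   # byte 1, bit 2 -> row 10
```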
One interesting effect of writing this new routine to use the smaller table is that the scan process is notably faster. Of course I expected it to be quicker for frames where there are no dirty-rows - anywhere you're reading and testing 4 bytes instead of 25 is obviously going to be faster. What's really fascinating, though, is that my perception of the speed of the 25-byte table scan was distorted by the smallness of the code - in the worst-case scenario, where the scan finds a dirty-row marker in the last line (actually line #1, because the loop runs backwards*), it took 286 cycles to identify it. The new version, although larger in size, will find that same marker in 212 cycles - not quite as amazing a reduction as for the 'no row found' scenario, but still a bit of a surprise because I genuinely thought the old code was much quicker. The new VICE stopwatch and some painstaking manual double-checking say otherwise!
A little more testing to do, and then I'll alter the IRQ handler to call the new logic and rip out the old. Then it'll be on to those 8 ZP bytes the row-drawing routine uses to do the glyph rendering, where I think a bit of cunning refactoring can reduce that to just one byte.
Until next time!
* This is actually a good general 6502 tip: try to arrange for your index-register loops to count down towards zero wherever possible, because then you can avoid an explicit compare against a limit and instead use the inherent flag mechanics of the CPU. Observe that in this first example we do the usual, human-nature thing and count upwards until hitting the limit at 10 - it's perfectly good code:
LDX #$00 ; [2] Set loop counter start value
loop:
NOP ; [2] Do whatever the loop does
INX ; [2] Increment the loop counter
CPX #$0A ; [2] Have we got to 10 yet?
BNE loop ; [3/2] If not, branch back and iterate again
However, in this alternate version we count 10 iterations again but backwards - the code is 2 bytes smaller because we've lost the CPX, and 2 cycles faster per iteration (we take advantage of the fact that the Zero flag is set when DEX drops the .X register to zero):
LDX #$0A ; [2] Set loop counter start value
loop:
NOP ; [2] Do whatever the loop does
DEX ; [2] Decrement the loop counter
BNE loop ; [3/2] Branch back and iterate again until .X = 0
Obviously you can't always arrange for things to work out this way, but you'd be surprised how often you can juggle things a little so that loops can take this form and save those couple of bytes and, more importantly, 2 cycles per iteration. Doesn't sound like much? That's 512 cycles over a 256-iteration loop, and might make the difference between a bit of code fitting inside, say, an IRQ time-limit or not...