Tuesday, 27 March 2012

In Which The Code Sucks, But Gets Better


The logic for displaying text on the 40-column screen proved to be surprisingly complicated - the principle is straight-forward enough (copy glyph data from the Character Generator ROM to the screen bitmap) but making it efficient and fast proved to be an interesting challenge. It turned into a sort of Russian Doll problem, where each section of the code needed something else to be available before it - so I'd turn to that, and then discover that it too needed something else before it. Eventually I got to the bottom of the stack - the smallest Doll - and wrote a tiny bit of code, which then let me track back up with higher layers of logic until I got to where I started and had a routine I could call with parameters describing the start address of a string of ASCII characters, and a row/column pair for where on the screen I wanted the string displayed.

Along the way, I had to do some serious thinking about how the screen matrix is laid-out because double-height characters occupy 16 consecutive bytes in the bitmap, but are displayed as two 8-bit cells one above the other; the visible image is an 8x16 cell, but it's held in memory as 16x8. This implies that the VIC is doing some fairly funky alternating-offset indexing when drawing the screen in this mode, which is devastatingly clever of it but it makes it a real pain when you're trying to turn a row/column pair into a linear bitmap address - if you take the simple approach to filling the matrix with 240 consecutive bytes, then each 160-byte 'row' in the bitmap actually corresponds to a pair of 80-byte half-width lines and, as I discovered, there's just no way to write neat, fast code that can map over this without lots of conditional tests or a huge table of addresses to help.

So I had a good ponder about it, and tried a few ideas out before settling on a vertical-stripe approach where the 240 bytes in the matrix are arranged from left to right in top-to-bottom strips of 12 bytes each. This means that each successive pair of bytes in the first 24 bytes of the bitmap are the first column on the screen, rather than a chunk of the first two rows - bitmap byte zero is row 0, column 0; byte one is row 1, column 0, and so on. I've effectively turned the bitmap on its' side so that consecutive bytes run down the screen instead of across, and this now matches how the VIC actually draws the 16x8 cells. It means that the calculations to find the next row or column address from a given cell are now linear - the next row is just +1, the next column simply +192 - instead of needing either a complex (and slow) calculation or a table to map the discontiguous address space.

You might have noticed I said each matrix column is 12 bytes, which gives us 24 rows in double-height mode - remember that the 25th line is generated with a nifty little raster routine I've already described. Since that last line is a 'flat' single-height row, the code to figure-out where columns are can be much simpler (just six instructions!) but I've actually integrated the mechanics into a single piece of logic that manages both the primary and secondary bitmaps the same way, albeit with different column increments. This means the code to write to row 25 is slower than it could be (by up to around 100 cycles) but no slower than any other line, and crucially it follows the same code path. A rare occasion where I deliberately choose a slower solution than is actually possible, trading speed (where it's not critical) for simplicity.

UPDATE:

I first published this post in a hurry, so it was both brief and uninteresting - sorry about that. To make up for it, I've now extended it and added some code which you might like to look at. To begin with, let's fast-forward a couple of hours past the point where I'd written some code that integrated the 'special' line-25 case into the main logic, and accept the fact that this turned-out to be a Bad Idea. I thought I was being cleverly frugal with the space the routine occupied by doing this, but it transpired that there were still some glitches in the combined code that proved to be fixable only by adding inelegant tests to check for edge-conditions. In the end, I gave up and re-wrote everything as a two-path solution, branching to handle line 25 separately - here it is:

setbitdr SUBROUTINE
LDX #$10 ; [2] bitmap hi-byte for lines 1-24
STX _DRAWADDR+1 ; [3] ZP initialise bitmap draw address hi-byte
LDX #$F0 ; [2] mask for even column numbers (0,2,4...38)
LDA _COLNUM ; [3] ZP get text column number
LSR ; [2] divide by 2 for column counter
BCC .skipmask ; [2/3] flag is clear if column number was even
LDX #$0F ; [2] mask for odd column numbers (1,3,5...39)
.skipmask
STX _COLMASK ; [3] ZP set bitmap draw mask
TAX ; [2] stash column counter in .X
LDA _ROWNUM ; [3] ZP get text row number
CMP #$18 ; [2] is this line 25?
BEQ .line25 ; [2/3] yep, skip row calculation for lines 1-24
ASL ; [2] multiply row by 8
ASL ; [2]
ASL ; [2]
CPX #$00 ; [2] check column byte value
BEQ .done ; [2/3] column byte of 0 means nothing else to do
.nextcol
CLC ; [2] clear Carry
ADC #$C0 ; [2] add 192 for next column byte
BCC .skipinc ; [3/2] skip increment if no Carry
INC _DRAWADDR+1 ; [5] ZP increment draw address hi-byte
.skipinc
DEX ; [2] decrement column multiplier countdown
BNE .nextcol ; [3/2] loop until all columns counted
BEQ .done ; [3/3] skip line 25 section
.line25
LDA #$00 ; [2]
STA _DRAWADDR+1 ; [3] ZP initialise bitmap draw address hi-byte
TXA ; [2] get column byte
BEQ .done ; [2/3] column byte of 0 means nothing else to do
ASL ; [2] multiply column by 8 for line 25 offset
ASL ; [2]
ASL ; [2]
.done
STA _DRAWADDR ; [3] ZP set bitmap draw address lo-byte
RTS ; [6]

So what do those 54 bytes of setbitdr do? It's purpose is to take a given row/column co-ordinate and derive the address in the screen bitmap which corresponds to it, so that a higher-level routine can then plot text into the bitmap; it also sets-up the character mask, since each 8x8 bitmap cell actually holds two 4x8 glyphs - so the calling routine needs to know which half of the bitmap cell referenced by the address is the actual target for the glyph to be plotted. This is a simple bitmask, which tells the caller how to mask the addressed cell so that the new glyph goes in the correct half. The code shown above works properly in all valid cases, achieves the objectives, and sucks. Badly.

Well, OK, it's not as bad as all that - it takes about 330 cycles to do its' thing in the worst-case scenario, which is computing the bitmap address for row 24, column 40. But after I'd written this and tested it with a simple 'fill the screen' demo, I was quite disappointed with the draw performance - admittedly the demo code wasn't brilliant, but the lions' share of the screen-fill time was being spent in here. I went for a smoke, and as is often the case I found that by reflecting on working but non-performant code, I could come up with a couple of things to do which might improve matters.

My first thought was that there wasn't actually any point in this routine being responsible for determining the mask to use for plotting glyphs in the bitmap byte - that's something that would be better-placed higher up the logic chain, where data is actually being copied to the screen and where decisions about nybble-masking can be made in a more strategic way (for example, there's no use for the mask at all if the byte to be written-to is empty and we're just plotting one glyph). So we can shave around 10 cycles (give or take a few after restructuring) off the routines' CPU consumption just by not doing that. Which is nice, but hardly significant as a percentage of the whole.

I knew where the bulk of the time was being spent though - in that little loop at .nextcol, which takes the base row address and accumulates column additions to it with the ADC/INC combination. ADC is a cheap instruction at 2 cycles but INC is a hog at 5, and the obvious solution is to reduce the amount of calculation the routine has to do - pre-calculating stuff and storing it in tables so we can do a fast lookup instead of a slow computation is a trade-off between speed and space, and for routines which will be called A LOT this makes sense. What we're computing here is an address which radically changes as we advance by column, so the table should hold the address of the first row in each column - we can then do a lookup based on the column number to get the column start address, and then the only computed part is the addition for the row offset - an ADC - plus another ADC to increment the address hi-byte if there's a Carry.

We'll have to do a little manipulation of the column number to turn it into a table index, and the row number still has to be multiplied by 8 for the cell size, but the core calculation ceases to be an expensive accumulator loop and becomes a simple one-or-two-step increment. Here it is:

setbitdr SUBROUTINE
LDA _COLNUM ; [3] ZP get text column number
AND #%11111110 ; [2] mask LSB off for table index
TAX ; [2] stash column index in .X
LDA _ROWNUM ; [3] ZP get text row number
CMP #$18 ; [2] is this line 25?
BEQ .line25 ; [2/3] yep, skip stuff for lines 1-24
ASL ; [2] multiply row by 8
ASL ; [2]
ASL ; [2]
ADC _COLADDRS,X ; [4] add bitmap column address lo-byte
STA _DRAWADDR ; [3] ZP set bitmap draw address lo-byte
LDA _COLADDRS+1,X ; [4] get bitmap column address hi-byte
ADC #$00 ; [4] add Carry
BNE .done ; [3/3] can never be zero, always branch
.line25
TXA ; [2] get column index
ASL ; [2] multiply column index for line 25 offset
ASL ; [2]
STA _DRAWADDR ; [3] ZP set bitmap draw address lo-byte
LDA #$00 ; [2]
.done
STA _DRAWADDR+1 ; [3] ZP set bitmap draw address hi-byte
RTS ; [6]

Now isn't that much nicer? The routine has shrunk to 36 bytes, although we do now have a 40-byte table at _COLADDRS which has the column start addresses pre-calculated in it - but here's the kicker; for rows 1-24, the CPU time is now just 47 cycles! And row 25 is even quicker, at 35 cycles. I have to admit this has turned-out considerably neater than I envisaged and I'm really pleased with it.

I ran another quick test by plugging it into the screen-fill demo code, and it zips through much more quickly than the first version - despite still doing a primitive column-by-row fill without any intelligence applied for mask optimisation or column-pairing. Of which, incidentally, more next time; I'll leave you with a giganto-screenshot of the demo showing what we've got so far...

No comments:

Post a Comment