?Device Not Present: In Which We Squeeze The Code Until The Pips Squeak

One of my favourite quotes, and the one I find most particularly suited to software engineering, is this one by Antoine de Saint-Exupery:

“Perfection is achieved not when there is nothing left to add, but when there is nothing left to take away”.

For me, part of the joy of writing code is to find a way to achieve the objective in as concise and elegant a way as possible. That doesn't always equate to 'smallest' or even 'fastest' (though I confess I'll often choose a fast solution over a small one) but is frequently an aesthetic decision which, like all art, can be divisive - one mans' elegant is another mans' clunky. The thing about writing Assembler, though, is that small and fast are pretty-much a given whatever you do - and the challenge is to find the very smallest and/or fastest way of doing something, where reductions in size or cycles can be measured in single digits but still make a big difference.

I've spent the last three or four days finding things to take away from the pair of routines at the heart of the VIC++ text-display engine - dispstr and drawdrty. I haven't shown you dispstr yet, although I did mention it last time - it's the 'wrapper' routine which is called when something needs to be displayed. Have a look:

dispstr SUBROUTINE
    LDX _CURSORR                ; [3]   ZP get cursor row
    DEX                         ; [2]   decrement for zero-based addressing
    TXA                         ; [2]   move row to .A
    ASL                         ; [2]   multiply by 2 for table index
    TAX                         ; [2]   set index in .X

; get text buffer address
    LDY _CURSORC                ; [3]   ZP get cursor column
    DEY                         ; [2]   decrement for zero-based addressing
    TYA                         ; [2]   move column to .A
    ADC _TBUFTAB,X              ; [4]   add text buffer row address lo-byte
    STA _TBUFADDR               ; [3]   ZP save for text buffer address
    LDA _TBUFTAB+1,X            ; [4]   get text buffer row address hi-byte
    ADC #$00                    ; [2]   add Carry if set
    STA _TBUFADDR+1             ; [3]   ZP save for text buffer address

; get attribute buffer address
    TYA                         ; [2]   move column to .A
    LSR                         ; [2]   divide by 2
    CLC                         ; [2]   clear Carry
    ADC _ABUFTAB,X              ; [4]   add attribute buffer row address lo-byte
    STA _ABUFADDR               ; [3]   ZP save for attribute buffer address
    LDA _ABUFTAB+1,X            ; [4]   get attribute buffer row address hi-byte
    ADC #$00                    ; [2]   add Carry if set
    STA _ABUFADDR+1             ; [3]   ZP save for attribute buffer address

; copy string to text buffer
    LDY #$00                    ; [2]   string length
.copystr
    LDA (_STRADDR),Y            ; [5]   get byte from source
    BEQ .checklen               ; [2/3] check length when we hit the zero-terminator
    STA (_TBUFADDR),Y           ; [6]   store byte at destination
    INY                         ; [2]   increment length
    BNE .copystr                ; [3/2] loop until we hit the maximum string length
    DEY                         ; [2]   decrement for maximum length

.checklen
    CPY #$00                    ; [2]   empty string?
    BEQ .done                   ; [2/3] exit if empty

; set attribute buffer bytes
    DEY                         ; [2]   decrement length
    TYA                         ; [2]   move to .A
    LSR                         ; [2]   divide by 2
    TAY                         ; [2]   move back to .Y
    LDA _STRATTR                ; [3]   get attribute byte
.setbyte
    STA (_ABUFADDR),Y           ; [6]   store byte at destination
    DEY                         ; [2]   decrement length
    CPY #$FF                    ; [2]   looped below zero?
    BNE .setbyte                ; [3/2] nope, loop for next byte

; update dirty-row marker
    LAX _CURSORR                ; [3]   ZP get cursor row
    STA _DRTYROWS-1,X           ; [4]   set dirty-row marker
.done
    RTS                         ; [6]

Its' purpose is simple, which is to take a string from somewhere, stuff it into the Text Buffer, write an appropriate quantity of the specified attribute byte into the corresponding area of the Attribute Buffer, and then flag the newly-written line as 'dirty' for drawdrty to detect and plot into the 40-column bitmap on the next IRQ. I've sweated over this code for several hours, refactoring and adjusting across a number of iterations until the final version is as you see it here. I'm not declaring it to be perfect, but I think it's pretty good - it does what its' designed to do, and nothing more. Did you notice LAX in there? Such a useful little opcode... :)

The rest of my time on the project over the last few days has been spent in drawdrty itself, because I knew there was scope for 'things to be taken away' and I wanted to see just how far I could squeeze it - I'd decided that if I could get it below 10,000 cycles per row then I could afford to call it twice during the IRQ and double the update rate (permitting a full screen refresh in a quarter of a second rather than half). But there was a hill to climb, because I thought it was already pretty close to optimal already and it ran in 11,720 cycles - shaving around 2000 cycles off that was going to be tough.

So here's a revision which, as far as it goes, is as close to 'nothing left to take away' as I think it'll get. There are quite a lot of changes, both to the overall structure and to various internal elements, but it very definitely runs a lot faster. In fact, this version of the routine profiles at 9508 cycles - a reduction of 2242 cycles. I admit that although my manual cycle-counting hinted it was going to be faster, I was actually a bit flabbergasted when it blew past my sub-10,000 target with ease! Here it is:

resdirty SUBROUTINE
    LDA #$19                    ; [2] reset dirty row counter (#25)
    STA _DRAWROW                ; [3] ZP save it
    RTS                         ; [6]

drawdrty SUBROUTINE
    LDX _DRAWROW                ; [3]   ZP get screen row
.nextrow
    DEX                         ; [2]   decrement row
    BMI resdirty                ; [2/3] reset counter if we dropped below zero
    LDA _DRTYROWS,X             ; [4]   ZP get dirty-row marker
    BEQ .nextrow                ; [3/2] zero means not dirty, loop for next

; set dirty-row counter and clear the dirty-row marker
    STX _DRAWROW                ; [3]   ZP set screen row
    LDA #$00                    ; [2]   clear dirty-row marker
    STA _DRTYROWS,X             ; [4]
 
; get attribute and text buffer row addresses
    TXA                         ; [2]   get screen row from .X
    ASL                         ; [2]   multiply by 2 for table index
    TAX                         ; [2]   set index
    LDA _ABUFTAB,X              ; [4]   get attribute buffer row address lo-byte
    STA _ATTRADDR               ; [3]   ZP save for attribute buffer address
    LDA _ABUFTAB+1,X            ; [4]   get attribute buffer row address hi-byte
    STA _ATTRADDR+1             ; [3]   ZP save for attribute buffer address
    LDA _TBUFTAB,X              ; [4]   get text buffer row address lo-byte
    STA _TEXTADDR               ; [3]   ZP save for text buffer address
    LDA _TBUFTAB+1,X            ; [4]   get text buffer row address hi-byte
    STA _TEXTADDR+1             ; [3]   ZP save for text buffer address

; determine bitmap draw address
    LDY #$26                    ; [2]   initial column index (#38)
.drawloop
    LDA _DRAWROW                ; [3]   ZP get screen row
    CMP #$18                    ; [2]   is this line 25?
    BEQ .line25                 ; [2/3] yep, skip stuff for lines 1-24
    ASL                         ; [2]   multiply row by 8
    ASL                         ; [2]
    ASL                         ; [2]
    ADC _COLADDRS,Y             ; [4]   add bitmap column address lo-byte
    STA _DRAWADDR               ; [3]   ZP set bitmap draw address lo-byte
    LDA _COLADDRS+1,Y           ; [4]   get bitmap column address hi-byte
    ADC #$00                    ; [2]   add Carry
    BNE .sethi                  ; [3/3] can never be zero, always branch
.line25
    TYA                         ; [2]   move column for multiply
    ASL                         ; [2]   multiply by 4 for line 25 offset
    ASL                         ; [2]
    STA _DRAWADDR               ; [3]   ZP set bitmap draw address lo-byte
    LDA #$00                    ; [2]
.sethi
    STA _DRAWADDR+1             ; [3]   ZP set bitmap draw address hi-byte
 
; get glyph data addresses (quicker unrolled than in a loop, and 33% faster than the earlier ASL/ROL method)
    STY _DRAWCOL                ; [3]   ZP stash .Y for later
    LAX (_TEXTADDR),Y           ; [5]   get ASCII char from text buffer
    LSR                         ; [2]   shift bits 7-5 down to 2-0
    LSR                         ; [2] 
    LSR                         ; [2] 
    LSR                         ; [2] 
    SEC                         ; [2]   set Carry
    ROR                         ; [2]   last bit shift also rotates Carry into bit 7 for character ROM
    STA _GLYPADD1+1             ; [3]   ZP set first glyph address hi-byte
    TXA                         ; [2]   get char back from .X
    ASL                         ; [2]   multiply by 8
    ASL                         ; [2]
    ASL                         ; [2]
    STA _GLYPADD1               ; [3]   ZP set first glyph address lo-byte
    INY                         ; [2]   increment column index for next char
    LAX (_TEXTADDR),Y           ; [5]   get ASCII char from text buffer
    LSR                         ; [2]   shift bits 7-5 down to 2-0
    LSR                         ; [2] 
    LSR                         ; [2] 
    LSR                         ; [2] 
    SEC                         ; [2]   set Carry
    ROR                         ; [2]   last bit shift also rotates Carry into bit 7 for character ROM
    STA _GLYPADD2+1             ; [3]   ZP set second glyph address hi-byte
    TXA                         ; [2]   get char back from .X
    ASL                         ; [2]   multiply by 8
    ASL                         ; [2]
    ASL                         ; [2]
    STA _GLYPADD2               ; [3]   ZP set second glyph address lo-byte

; copy, mask and merge glyph data to Zero Page
    LDY #$07                    ; [2]   glyph bytes to copy
    LDX #$F0                    ; [2]   set mask for first glyph
.nextbyte
    LDA (_GLYPADD1),Y           ; [5]   get glyph data byte
    SAX _GLYPHDAT,Y             ; [4]   ZP mask and store
    LDA (_GLYPADD2),Y           ; [5]   get glyph data byte
    AND #$0F                    ; [2]   mask upper nybble
    ORA _GLYPHDAT,Y             ; [4]   merge with first glyph
    STA _GLYPHDAT,Y             ; [5]   store in ZP
    DEY                         ; [2]   decrement byte counter
    BPL .nextbyte               ; [3/2] loop for next byte

; check attribute byte
    LDA _DRAWCOL                ; [3]   ZP get column
    LSR                         ; [2]   divide by 2
    TAY                         ; [2]   move to .Y
    LDA (_ATTRADDR),Y           ; [5]   get attribute byte for the column pair
;   BNE .getattr                ; [2/3] set attribute(s) if byte is non-zero

; copy glyph data to bitmap
.drawbmp
    LDY #$07                    ; [2]   glyph data byte counter
.drawbyte
    LAX _GLYPHDAT,Y             ; [3]   get glyph data byte
    STA (_DRAWADDR),Y           ; [6]   write it to bitmap
    DEY                         ; [2]
    BPL .drawbyte               ; [3/2] loop until all glyph bytes drawn

; prepare for next iteration
    LDY _DRAWCOL                ; [3]   ZP get column
    DEY                         ; [2]   decrement column twice for next char pair
    DEY                         ; [2]
    BPL .drawloop               ; [3/2] loop for next character pair
    RTS                         ; [6]

Now you sharp-eyed attentive types will spot the obvious caveat here - this version does not handle attribute rendering. So of course it's faster - there's a chunk missing, right? Well yes, but actually I'm currently wrestling with the attribute renderer right now, and I think I can get it in for about 250 cycles - so even though the profiler results will go up, we'll still be below the 10,000 target (not by as much, but still - spectacularly - over 2000 cycles faster than the original). Interestingly, I have three different versions of the attribute render code which I'm playing with - they all work in different ways, and each has its' pros and cons. The one I liked best from an 'elegance' point of view actually turned-out to suck very badly in terms of cycle-consumption, and I'm trying to reduce it - but I suspect it's a lemon and will never make it into the final version of the code because there are two other algorithms which perform faster.

My objective is to find the very fastest way to re-integrate attribute rendering without climbing above a total of 10,000 cycles again. I think I can, but this is a cerebral process - it's not so much about finding a 'killer' instruction trick or sequence, but more about finding a method that will lend itself to a fast solution.

Oh, but speaking of 'killer' instructions - did you spot SAX in there? This is an absolutely gorgeous little undocumented opcode which does a logical AND between .A and .X, but saves the result in memory instead of either register - extremely useful, if you arrange things appropriately, for doing an otherwise-impossible boolean operation between two registers without altering the contents of either (or any flag in .SR).

I anticipate that my next post will reveal the missing .getattr logic, and we can move on to talk about the next task - implementing the cursor.

Friday, 27 April 2012

In Which We Squeeze The Code Until The Pips Squeak

No comments:

Post a Comment