One of my favourite quotes, and the one I find most particularly suited to software engineering, is this one by Antoine de Saint-Exupery:
“Perfection is achieved not when there is nothing left to add, but when there is nothing left to take away”.
For me, part of the joy of writing code is to find a way to achieve the objective in as concise and elegant a way as possible. That doesn't always equate to 'smallest' or even 'fastest' (though I confess I'll often choose a fast solution over a small one) but is frequently an aesthetic decision which, like all art, can be divisive - one mans' elegant is another mans' clunky. The thing about writing Assembler, though, is that small and fast are pretty-much a given whatever you do - and the challenge is to find the very smallest and/or fastest way of doing something, where reductions in size or cycles can be measured in single digits but still make a big difference.
I've spent the last three or four days finding things to take away from the pair of routines at the heart of the VIC++ text-display engine - dispstr and drawdrty. I haven't shown you dispstr yet, although I did mention it last time - it's the 'wrapper' routine which is called when something needs to be displayed. Have a look:
Its' purpose is simple, which is to take a string from somewhere, stuff it into the Text Buffer, write an appropriate quantity of the specified attribute byte into the corresponding area of the Attribute Buffer, and then flag the newly-written line as 'dirty' for drawdrty to detect and plot into the 40-column bitmap on the next IRQ. I've sweated over this code for several hours, refactoring and adjusting across a number of iterations until the final version is as you see it here. I'm not declaring it to be perfect, but I think it's pretty good - it does what its' designed to do, and nothing more. Did you notice LAX in there? Such a useful little opcode... :)
The rest of my time on the project over the last few days has been spent in drawdrty itself, because I knew there was scope for 'things to be taken away' and I wanted to see just how far I could squeeze it - I'd decided that if I could get it below 10,000 cycles per row then I could afford to call it twice during the IRQ and double the update rate (permitting a full screen refresh in a quarter of a second rather than half). But there was a hill to climb, because I thought it was already pretty close to optimal already and it ran in 11,720 cycles - shaving around 2000 cycles off that was going to be tough.
So here's a revision which, as far as it goes, is as close to 'nothing left to take away' as I think it'll get. There are quite a lot of changes, both to the overall structure and to various internal elements, but it very definitely runs a lot faster. In fact, this version of the routine profiles at 9508 cycles - a reduction of 2242 cycles. I admit that although my manual cycle-counting hinted it was going to be faster, I was actually a bit flabbergasted when it blew past my sub-10,000 target with ease! Here it is:
Now you sharp-eyed attentive types will spot the obvious caveat here - this version does not handle attribute rendering. So of course it's faster - there's a chunk missing, right? Well yes, but actually I'm currently wrestling with the attribute renderer right now, and I think I can get it in for about 250 cycles - so even though the profiler results will go up, we'll still be below the 10,000 target (not by as much, but still - spectacularly - over 2000 cycles faster than the original). Interestingly, I have three different versions of the attribute render code which I'm playing with - they all work in different ways, and each has its' pros and cons. The one I liked best from an 'elegance' point of view actually turned-out to suck very badly in terms of cycle-consumption, and I'm trying to reduce it - but I suspect it's a lemon and will never make it into the final version of the code because there are two other algorithms which perform faster.
My objective is to find the very fastest way to re-integrate attribute rendering without climbing above a total of 10,000 cycles again. I think I can, but this is a cerebral process - it's not so much about finding a 'killer' instruction trick or sequence, but more about finding a method that will lend itself to a fast solution.
Oh, but speaking of 'killer' instructions - did you spot SAX in there? This is an absolutely gorgeous little undocumented opcode which does a logical AND between .A and .X, but saves the result in memory instead of either register - extremely useful, if you arrange things appropriately, for doing an otherwise-impossible boolean operation between two registers without altering the contents of either (or any flag in .SR).
I anticipate that my next post will reveal the missing .getattr logic, and we can move on to talk about the next task - implementing the cursor.
dispstr SUBROUTINE
LDX _CURSORR ; [3] ZP get cursor row
DEX ; [2] decrement for zero-based addressing
TXA ; [2] move row to .A
ASL ; [2] multiply by 2 for table index
TAX ; [2] set index in .X
; get text buffer address
LDY _CURSORC ; [3] ZP get cursor column
DEY ; [2] decrement for zero-based addressing
TYA ; [2] move column to .A
ADC _TBUFTAB,X ; [4] add text buffer row address lo-byte
STA _TBUFADDR ; [3] ZP save for text buffer address
LDA _TBUFTAB+1,X ; [4] get text buffer row address hi-byte
ADC #$00 ; [2] add Carry if set
STA _TBUFADDR+1 ; [3] ZP save for text buffer address
; get attribute buffer address
TYA ; [2] move column to .A
LSR ; [2] divide by 2
CLC ; [2] clear Carry
ADC _ABUFTAB,X ; [4] add attribute buffer row address lo-byte
STA _ABUFADDR ; [3] ZP save for attribute buffer address
LDA _ABUFTAB+1,X ; [4] get attribute buffer row address hi-byte
ADC #$00 ; [2] add Carry if set
STA _ABUFADDR+1 ; [3] ZP save for attribute buffer address
; copy string to text buffer
LDY #$00 ; [2] string length
.copystr
LDA (_STRADDR),Y ; [5] get byte from source
BEQ .checklen ; [2/3] check length when we hit the zero-terminator
STA (_TBUFADDR),Y ; [6] store byte at destination
INY ; [2] increment length
BNE .copystr ; [3/2] loop until we hit the maximum string length
DEY ; [2] decrement for maximum length
.checklen
CPY #$00 ; [2] empty string?
BEQ .done ; [2/3] exit if empty
; set attribute buffer bytes
DEY ; [2] decrement length
TYA ; [2] move to .A
LSR ; [2] divide by 2
TAY ; [2] move back to .Y
LDA _STRATTR ; [3] get attribute byte
.setbyte
STA (_ABUFADDR),Y ; [6] store byte at destination
DEY ; [2] decrement length
CPY #$FF ; [2] looped below zero?
BNE .setbyte ; [3/2] nope, loop for next byte
; update dirty-row marker
LAX _CURSORR ; [3] ZP get cursor row
STA _DRTYROWS-1,X ; [4] set dirty-row marker
.done
RTS ; [6]
Its' purpose is simple, which is to take a string from somewhere, stuff it into the Text Buffer, write an appropriate quantity of the specified attribute byte into the corresponding area of the Attribute Buffer, and then flag the newly-written line as 'dirty' for drawdrty to detect and plot into the 40-column bitmap on the next IRQ. I've sweated over this code for several hours, refactoring and adjusting across a number of iterations until the final version is as you see it here. I'm not declaring it to be perfect, but I think it's pretty good - it does what its' designed to do, and nothing more. Did you notice LAX in there? Such a useful little opcode... :)
The rest of my time on the project over the last few days has been spent in drawdrty itself, because I knew there was scope for 'things to be taken away' and I wanted to see just how far I could squeeze it - I'd decided that if I could get it below 10,000 cycles per row then I could afford to call it twice during the IRQ and double the update rate (permitting a full screen refresh in a quarter of a second rather than half). But there was a hill to climb, because I thought it was already pretty close to optimal already and it ran in 11,720 cycles - shaving around 2000 cycles off that was going to be tough.
So here's a revision which, as far as it goes, is as close to 'nothing left to take away' as I think it'll get. There are quite a lot of changes, both to the overall structure and to various internal elements, but it very definitely runs a lot faster. In fact, this version of the routine profiles at 9508 cycles - a reduction of 2242 cycles. I admit that although my manual cycle-counting hinted it was going to be faster, I was actually a bit flabbergasted when it blew past my sub-10,000 target with ease! Here it is:
resdirty SUBROUTINE
LDA #$19 ; [2] reset dirty row counter (#25)
STA _DRAWROW ; [3] ZP save it
RTS ; [6]
drawdrty SUBROUTINE
LDX _DRAWROW ; [3] ZP get screen row
.nextrow
DEX ; [2] decrement row
BMI resdirty ; [2/3] reset counter if we dropped below zero
LDA _DRTYROWS,X ; [4] ZP get dirty-row marker
BEQ .nextrow ; [3/2] zero means not dirty, loop for next
; set dirty-row counter and clear the dirty-row marker
STX _DRAWROW ; [3] ZP set screen row
LDA #$00 ; [2] clear dirty-row marker
STA _DRTYROWS,X ; [4]
; get attribute and text buffer row addresses
TXA ; [2] get screen row from .X
ASL ; [2] multiply by 2 for table index
TAX ; [2] set index
LDA _ABUFTAB,X ; [4] get attribute buffer row address lo-byte
STA _ATTRADDR ; [3] ZP save for attribute buffer address
LDA _ABUFTAB+1,X ; [4] get attribute buffer row address hi-byte
STA _ATTRADDR+1 ; [3] ZP save for attribute buffer address
LDA _TBUFTAB,X ; [4] get text buffer row address lo-byte
STA _TEXTADDR ; [3] ZP save for text buffer address
LDA _TBUFTAB+1,X ; [4] get text buffer row address hi-byte
STA _TEXTADDR+1 ; [3] ZP save for text buffer address
; determine bitmap draw address
LDY #$26 ; [2] initial column index (#38)
.drawloop
LDA _DRAWROW ; [3] ZP get screen row
CMP #$18 ; [2] is this line 25?
BEQ .line25 ; [2/3] yep, skip stuff for lines 1-24
ASL ; [2] multiply row by 8
ASL ; [2]
ASL ; [2]
ADC _COLADDRS,Y ; [4] add bitmap column address lo-byte
STA _DRAWADDR ; [3] ZP set bitmap draw address lo-byte
LDA _COLADDRS+1,Y ; [4] get bitmap column address hi-byte
ADC #$00 ; [2] add Carry
BNE .sethi ; [3/3] can never be zero, always branch
.line25
TYA ; [2] move column for multiply
ASL ; [2] multiply by 4 for line 25 offset
ASL ; [2]
STA _DRAWADDR ; [3] ZP set bitmap draw address lo-byte
LDA #$00 ; [2]
.sethi
STA _DRAWADDR+1 ; [3] ZP set bitmap draw address hi-byte
; get glyph data addresses (quicker unrolled than in a loop, and 33% faster than the earlier ASL/ROL method)
STY _DRAWCOL ; [3] ZP stash .Y for later
LAX (_TEXTADDR),Y ; [5] get ASCII char from text buffer
LSR ; [2] shift bits 7-5 down to 2-0
LSR ; [2]
LSR ; [2]
LSR ; [2]
SEC ; [2] set Carry
ROR ; [2] last bit shift also rotates Carry into bit 7 for character ROM
STA _GLYPADD1+1 ; [3] ZP set first glyph address hi-byte
TXA ; [2] get char back from .X
ASL ; [2] multiply by 8
ASL ; [2]
ASL ; [2]
STA _GLYPADD1 ; [3] ZP set first glyph address lo-byte
INY ; [2] increment column index for next char
LAX (_TEXTADDR),Y ; [5] get ASCII char from text buffer
LSR ; [2] shift bits 7-5 down to 2-0
LSR ; [2]
LSR ; [2]
LSR ; [2]
SEC ; [2] set Carry
ROR ; [2] last bit shift also rotates Carry into bit 7 for character ROM
STA _GLYPADD2+1 ; [3] ZP set second glyph address hi-byte
TXA ; [2] get char back from .X
ASL ; [2] multiply by 8
ASL ; [2]
ASL ; [2]
STA _GLYPADD2 ; [3] ZP set second glyph address lo-byte
; copy, mask and merge glyph data to Zero Page
LDY #$07 ; [2] glyph bytes to copy
LDX #$F0 ; [2] set mask for first glyph
.nextbyte
LDA (_GLYPADD1),Y ; [5] get glyph data byte
SAX _GLYPHDAT,Y ; [4] ZP mask and store
LDA (_GLYPADD2),Y ; [5] get glyph data byte
AND #$0F ; [2] mask upper nybble
ORA _GLYPHDAT,Y ; [4] merge with first glyph
STA _GLYPHDAT,Y ; [5] store in ZP
DEY ; [2] decrement byte counter
BPL .nextbyte ; [3/2] loop for next byte
; check attribute byte
LDA _DRAWCOL ; [3] ZP get column
LSR ; [2] divide by 2
TAY ; [2] move to .Y
LDA (_ATTRADDR),Y ; [5] get attribute byte for the column pair
; BNE .getattr ; [2/3] set attribute(s) if byte is non-zero
; copy glyph data to bitmap
.drawbmp
LDY #$07 ; [2] glyph data byte counter
.drawbyte
LAX _GLYPHDAT,Y ; [3] get glyph data byte
STA (_DRAWADDR),Y ; [6] write it to bitmap
DEY ; [2]
BPL .drawbyte ; [3/2] loop until all glyph bytes drawn
; prepare for next iteration
LDY _DRAWCOL ; [3] ZP get column
DEY ; [2] decrement column twice for next char pair
DEY ; [2]
BPL .drawloop ; [3/2] loop for next character pair
RTS ; [6]
Now you sharp-eyed attentive types will spot the obvious caveat here - this version does not handle attribute rendering. So of course it's faster - there's a chunk missing, right? Well yes, but actually I'm currently wrestling with the attribute renderer right now, and I think I can get it in for about 250 cycles - so even though the profiler results will go up, we'll still be below the 10,000 target (not by as much, but still - spectacularly - over 2000 cycles faster than the original). Interestingly, I have three different versions of the attribute render code which I'm playing with - they all work in different ways, and each has its' pros and cons. The one I liked best from an 'elegance' point of view actually turned-out to suck very badly in terms of cycle-consumption, and I'm trying to reduce it - but I suspect it's a lemon and will never make it into the final version of the code because there are two other algorithms which perform faster.
My objective is to find the very fastest way to re-integrate attribute rendering without climbing above a total of 10,000 cycles again. I think I can, but this is a cerebral process - it's not so much about finding a 'killer' instruction trick or sequence, but more about finding a method that will lend itself to a fast solution.
Oh, but speaking of 'killer' instructions - did you spot SAX in there? This is an absolutely gorgeous little undocumented opcode which does a logical AND between .A and .X, but saves the result in memory instead of either register - extremely useful, if you arrange things appropriately, for doing an otherwise-impossible boolean operation between two registers without altering the contents of either (or any flag in .SR).
I anticipate that my next post will reveal the missing .getattr logic, and we can move on to talk about the next task - implementing the cursor.
No comments:
Post a Comment