Wednesday, 6 February 2013

In Which We Are Eaten By A Grue


This is, I think, the fifth time I've edited this post in the last ten days. Each time, I've added or adjusted something as a prelude to publication with the intent to check 'one more thing' before hitting the button, and each time I've found myself chasing shadows and wandering further away from a final, conclusive statement on what I've been up to. Truly a maze of twisty passages all alike, and the various lights at the end of tunnels have typically been signs saying things like 'You are lost', 'Abandon all hope', and 'Grue feeding area'.

Allow me to regale you with the tale...

Day 1-2
  • After my last post, I successfully ticked-off that first 'to-do' item and fully integrated the new dirty-row logic into the existing codebase. I then turned to the second item and dived-in to the glyph rendering routine to see if I could grab those 7 bytes of Zero Page back that I thought might be achievable. After an hour or so of hard thinking and tinkering, I had a reworked version of the routine that did indeed use just 1 byte instead of 8 as working-storage for the bitmap operations. The refactoring shaved a few bytes off the length of the routine, and in the end really only left the very beginning and end of the code untouched - I reckon maybe 85% of the routine changed, mostly in terms of position as I combined loops and resequenced various events to accommodate the fact that we now process the glyph render over a single 8-iteration loop, one pixel row at a time. Together with a couple of other minor adjustments, and of course the reduction in the size of the dirty-row table, I now have 53 free ZP bytes. Win!
  • Whilst I was finishing that job, and tweaking the code to optimise register usage, I was conscious of the fact that there was going to be an inevitable impact on cycle-time; I wasn't sure by how much, but I guessed that the elapsed render time for a row might go up by a hundred or so cycles at a minimum. So as I wrapped everything up, I ran a profile using our new friend the XVIC Stopwatch against the previous version, and got the expected time of 9,136 cycles (for a row with no attributes applied). Then I ran a profile over the new version, held my breath for a second, and then saw the new time: 8,743 cycles!
  • Well, I was just completely blown away. I double-checked with a series of breakpoints to make sure nothing was being accidentally missed as the code executed, but no, this was the actual proper time and it was down by almost 400 cycles. That improvement is caused primarily by the fact that with the new dirty-row routines plugged-in properly and the rendering routine largely revamped - plus having then optimised a variety of register usage and storage accesses accordingly - the whole thing just moves a fair whack faster.
  • This was simultaneously amazing, because things were happening quicker than before, and somewhat disappointing, because when I wrote this code and tuned it over several iterations last year I finished with a post saying something like "I think this is as good as it gets". The evidence now says otherwise - and if the routine is susceptible to a 5% speed increase by accident, then perhaps there are still further improvements to be made if I concentrate hard enough. And so my geeky brain immediately went into 'hungry' mode - hungry for more speed, better code, higher efficiency - and started sparking with several ideas right away for how to chase this particular dragon. 
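The row-at-a-time rework above can be sketched in modern pseudocode. This is my illustration, not the actual 6502 routine - the function name, the 4-bit-wide glyph assumption, and the `stride` parameter are all invented - but it shows why processing a glyph over a single 8-iteration loop, one pixel row at a time, needs only one byte of working storage instead of eight:

```python
# Hypothetical sketch of the row-at-a-time glyph merge. Two 4-bit-wide
# glyphs share one display byte; handling one pixel row per iteration
# means a single working byte (one ZP location) suffices.

def render_cell(left_glyph, right_glyph, bitmap, offset, stride):
    """left_glyph/right_glyph: 8-row lists of 4-bit patterns (0-15).
    bitmap: bytearray; offset: first byte of the cell; stride: bytes per row."""
    for row in range(8):                                  # single 8-iteration loop
        work = (left_glyph[row] << 4) | right_glyph[row]  # the one working byte
        bitmap[offset + row * stride] = work
```

The design point is that the working state lives for one loop iteration only, so it can be a single temporary rather than an eight-byte buffer held across the whole render.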
That was jolly exciting, wasn't it? Let's see what happened next.

Day 3-5
  • First up, there was something I'd been meaning to do for a while but hadn't had the will to undertake the logic surgery necessary - remove the strikethrough attribute code. Nice though it was to have it in there, and thoroughly cool as well, I'd come to realise that being able to represent text on the video display that has such an attribute was a bit pointless - why bother displaying strikethrough on a video display unless you're writing a WYSIWYG document processor (which this isn't)? So, deep in the code as I now was, it came out - comparatively painlessly as it transpired, and it also streamlined the way the remaining attributes (inverse and underline) get processed too, which was nice.
  • Second, as a result of this, I now only hold 4 bits of attribute information per attribute-table-byte; we hold two bits per attribute for each bitmap byte, as they in turn hold two glyphs each, and there were three attributes to represent, so that was 6 bits. Losing strikethrough dropped that to 4 bits ... also known as a nybble. So, full marks and a shiny Prefect's badge to anyone who can connect my thoughts before I tell you what I did next...
  • Yeah, attribute nybbles, then. Which are, in fact, easily slotted into some of that lovely spare Colour RAM I talked about a few posts back. Into the memory layout file I went, and one fast cut 'n' paste later relocated the Attribute Buffer up into spare colour memory. An extra instruction in the glyph renderer code was needed to avoid bus noise by masking the upper 4 bits off when we read the byte, and with the table no longer in Page $04 I could also drop the start of free memory (for user code) down as well, giving a tidy little boost to the 'Bytes Free' startup message.
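The nybble scheme can be sketched like so - note that the bit layout here is my own guess for illustration, not the actual assignment in the OS, and the extra masking step stands in for the AND instruction mentioned above:

```python
# Illustrative attribute nybble: one bit per attribute per glyph
# (assumed layout, not the blog's actual bit order).
INV_L, INV_R = 0x01, 0x02   # inverse: left glyph, right glyph
UL_L,  UL_R  = 0x04, 0x08   # underline: left glyph, right glyph

def read_attr(colour_ram, index):
    # Colour RAM is only 4 bits wide; the upper bits are bus noise,
    # so mask them off - the extra instruction the renderer now needs.
    return colour_ram[index] & 0x0F

def has_attr(nybble, flag):
    return bool(nybble & flag)
```

Because only the low four bits are real, the one-instruction mask is the entire cost of moving the table into colour memory.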
All good so far - some stripping-out of code I didn't want any more, and a chance to move the attribute storage into a chunk of otherwise-unused memory which is both more efficient for the OS and also frees-up a chunk of RAM for the user. What fun!

Day 6
  • The third big change was in the way the glyph-renderer handles inverse-mode attributes. Way back near the beginning of the project, I made a conscious decision to handle inverse-mode as an 'on-the-fly' effect, with the renderer doing a neat EOR bit-twiddle for each line of pixels in a glyph as it was drawn - the alternative being to store both 'normal' and 'inverse' copies of each glyph bitmap in the Character Generator ROM, the same approach that Commodore used with the standard VIC-20 setup and which I thought was a bit wasteful of space. I've since given the matter much thought, and come-around to the same point of view that the Commodore engineers did all those years ago - yes, it's a bit wasteful (especially when you have a slick chunk of code to do inverse-mode in-situ as you draw to the screen) but it's also noticeably quicker to just pull the data out of the ROM already in an inverted state.
  • So I've cloned and inverted a copy of the glyph pixel data, and adjusted the renderer so that it can do a very quick lookup to get that inverted data instead of applying the effect at the point of drawing the glyphs. The penalty is that I can only map ASCII codes 0-127 into the available space, as I'm using the ROM area for OS data storage too - for now I'll live with it, but if in the future it becomes an issue and I have to have the other 128 characters mapped as well then I'll either move the OS storage out, or if there is no room elsewhere then I'll rework the glyph data to represent two different characters per byte instead of one (they're stored doubled-up side by side to make merging faster during rendering).
  • But I wasn't done yet, because as I was making all these changes, I spotted a way to avoid repeatedly deriving the source and target addresses for glyph rendering; as each character is plotted, the routine figures-out where in the text and attribute buffers we've got to for the source of the string data we're plotting, where in the Character Generator ROM the glyph pixel data is coming from for each of the two glyphs that get merged into a single display-byte, and where in the bitmap we're going to draw that pixel data. That code is quite efficient, but bits of it get executed for each character to be drawn, and I suddenly saw a way to avoid much of that work and turn the calculation into a simpler (and faster) table lookup.
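The inverse-mode trade described above - a pre-inverted copy of the glyph data instead of an EOR at draw time - can be sketched as follows. The names here are invented for illustration; the point is that building the inverted table once up front reduces the per-row work to a table select:

```python
# Pre-inverting the glyph data trades ROM space for render speed:
# the per-row EOR #$FF disappears, replaced by choosing a second table.

def build_inverted(glyph_rom):
    # One-time cost: clone the glyph data with every bit flipped.
    return bytes(b ^ 0xFF for b in glyph_rom)

def fetch_row(glyph_rom, inverted_rom, glyph_index, row, inverse):
    base = glyph_index * 8 + row            # 8 pixel rows per glyph
    table = inverted_rom if inverse else glyph_rom   # cheap lookup
    return table[base]                      # no per-row bit-twiddling
```

This is the same space-for-time call the Commodore engineers made: the inverted copy is redundant data, but fetching it costs no more than fetching the normal glyph.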
And this, dear reader, is where things started getting decidedly uncool. Because I then ran the Stopwatch over the resulting code, which you might rightly assume is now super-fast as it's quite a bit shorter and does things in more efficient ways. And the total cycles consumed by this new improved routine was... 13,577.

What. The. F***?

All my careful optimisation, code removal, storage reorganisation and restructuring has resulted in a routine which is not only not faster than 8,743 cycles, but actually slower than it was to start with, and in fact is even slower than the original routine was at rendering a complete line with all attributes on. <sigh>

Day 7
  • I can't believe it. A routine which is now getting-on for 60 bytes smaller, with an entire attribute-type (strikethrough) no longer present, and another (inverse) now handled by a fast lookup instead of an on-the-fly bit of pixel-twiddling - as well as an overall tune-up and register optimisation - is just shy of 5,000 cycles slower than the version I accidentally created during the Zero Page refinement, and over 4,000 cycles slower than it was before I did anything at all to it. How can this be? Some possibilities spring to mind, and have been carefully considered:
  • - 1. The XVIC Stopwatch has a bug, perhaps related to undocumented instructions, and is mis-reporting the elapsed cycles in this new configuration. The problem with that is that I've single-stepped through the code, and it reports all instructions correctly, both documented and undocumented. Verdict: Demonstrably false.
  • - 2. Having moved the attribute table up into Colour RAM, something in the XVIC emulator causes accesses to this memory to be slower. However, the original VIC-20 colour memory wasn't slower, only 4-bits wide instead of 8, so there's no reason for the emulator to make this an issue. Furthermore, we're measuring performance in machine cycles, not wall-clock time, so any possible memory-access delay wouldn't affect the Stopwatch. Verdict: Implausibly irrelevant.
  • - 3. Somehow I'm executing the routine twice, so in fact it's actually running really fast (as intended) but I'm inadvertently double-counting. However, since I have breakpoints at the beginning and end of the routine, I'd soon know if it was being called twice. Plus, temporarily, I've tweaked the IRQ routine to avoid the double-call opportunity (if time allows within the frame) whilst I'm testing the new logic; and there are no calls (deliberate or accidental) to entry-points other than the start of the routine. Verdict: Appealing but impossible.
  • - 4. Despite the code being smaller and having had a goodly amount of effort expended to make it super-efficient, I've either missed something, or just basically cocked-up somewhere. I don't see how at this point, but the facts speak for themselves. Verdict: Highly probable.
  • It's time to down tools and completely re-appraise the entire routine, because I refuse to accept that a revision which should be - must be - so improved over the earlier version can be such a dog.
Much deep analysis ensued - there was an hour when I looked at two instructions almost continuously, searching for a way to do an indirect memory lookup a different way without resorting to copying the routine to RAM and making it self-modifying. The next day was a struggle.

Day 8
  • I've changed the way the attribute table works again. It's out of the Colour RAM area and back in low memory, not because I think colour memory is in some absurd way poorly-performing but because I need to double it in size and it doesn't fit there any more. The idea is that by making it the same size as the text buffer, I can move through both in lockstep and only have to add the page offset difference from one to the other in order to calculate the index address. This means it's twice the size it needs to be, which I don't like, but I reconcile this by avoiding the need to spend cycles dividing the text-buffer index by 2 each iteration to get the attribute-buffer index - although that's just a single LSR instruction, there's some associated register-management that goes with it, so the net saving is enough to justify the change.
  • I've also very carefully counted where the time is being spent. The routine is essentially two nested loops - a 20-iteration outer loop which controls which byte of the screen row bitmap we're doing stuff to, and an inner 8-iteration loop that handles the copying and merging of glyph pixel data into that byte. The obvious challenge is to reduce the number and cost of instructions in the inner loop, since they are executed 160 times over the course of the routine's runtime. So I have a particular interest in finding ways to optimise in that area, because that's where the biggest savings will be made - the downside is that this bit of logic hasn't really changed much since the last version, so I really can't quite figure-out why it might now be causing a problem.
  • Anyway, the reworking of the attribute-buffer lookup seems to be paying a dividend - total CPU time for the routine is now down to 11,911 cycles. That's still far from great, but at least it's moving in the right direction.
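The lockstep-buffer idea can be sketched like this - the names and the page-offset value are invented for illustration, but it shows the trade: double the attribute table's size so the text-buffer index serves both tables directly, with no per-character halving (the LSR and its register shuffling):

```python
# Sketch of the doubled attribute table: text and attribute buffers
# advance in lockstep, so one index plus a fixed offset serves both.

PAGE_OFFSET = 0x0400   # illustrative distance between the two buffers

def render_row_lockstep(memory, text_base, row_start, width):
    out = []
    for i in range(row_start, row_start + width):
        ch   = memory[text_base + i]                 # text buffer
        attr = memory[text_base + PAGE_OFFSET + i]   # same index, fixed offset
        out.append((ch, attr))                       # no divide-by-2 per char
    return out
```

The cost is the wasted half of the attribute table; the gain is one fewer calculation in code that runs once per character.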
Suffice it to say that these were dark days - I just couldn't see why this intentionally-improved version of the code was sucking so badly compared to the old version. Many minutes during smoke-breaks at work were spent pondering the various aspects of how the routine worked, trying to fit the pieces together in different ways to - at worst! - get back to a runtime in the 9,000-cycle region where I started.

Day 9
  • Well, that's it. I can't do any more to this - I've tuned, tweaked, counted, tested, experimented and rewritten just about every line of this routine, and I simply can't see why it's such a disaster. I've managed to get the time down to 10,431 cycles but I have nowhere else to go, so I'm just going to revert to the earlier, accidentally-improved 8,743-cycle version. Annoyed? You bet.
Which brings us up-to-date. Here then is today's update:

Day 10

I seem to do my best brain-work when walking the dog. Tonight, as I strolled through the cold night with my Labradoodle sniffing stuff at the end of her lead, I smoked a cigarette and turned the routine over in my head one last time. It's got to the point where I can see the code as clear as day in my mind's eye, and I was examining it looking for anywhere there might be an option to alter a runtime calculation to a pre-calculated lookup table. As it happens, there are only two places now where we do anything that even looks like a calculation, and even then they're just ADC instructions to compute addresses, so the pickings are slim. Nevertheless, one of those places is in that inner loop, so any chance to speed things up there was worth another look.

And suddenly, there it was - the answer, right in front of me, and only so blindingly obvious when I compared the new version with the old. That calculation (which incidentally cannot be converted to a predefined table because there are two variables, which means a table would be huge) is done inside the inner loop - but it wasn't before. I had the start-of-loop label in the wrong place, so a calculation that happens 20 times in the old version is now executing 160 times!
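A quick back-of-envelope check (my arithmetic, using the cycle counts quoted in this post) shows the misplaced label plausibly accounts for the whole regression. With a 20-iteration outer loop and an 8-iteration inner loop, the stray calculation ran 140 more times per row than intended:

```python
# Reconciling the measured numbers with the misplaced-label theory.
# The cycle counts are the ones measured with the XVIC Stopwatch;
# the per-execution figure is a rough derived estimate, nothing more.

OUTER, INNER = 20, 8
extra_runs = OUTER * INNER - OUTER        # 140 unintended executions per row

broken, fixed = 13_577, 7_011             # measured: before/after moving the label
per_run = (broken - fixed) / extra_runs   # rough cycles per stray execution
```

That works out to roughly 47 cycles per stray execution - a believable figure for an ADC-based address calculation plus its surrounding housekeeping, though the incidental register-switch saving muddies the estimate a little.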

Well, when I got home I fired the Development rig up and moved that label. Doing so also presented an opportunity to save a register switch, but that was an incidental. I saved the change, ran 'make', and then dropped into the XVIC monitor to place my breakpoints, set the Stopwatch, and see what I got. At which point, all the other changes I'd sweated-over suddenly had the chance to show me just what a difference they'd been trying to make, had they not been stifled by a misplaced label.

Ready?

Here we go then - total cycles needed to render a complete row: 7,011.

Oh man. They must have heard my 'YESSSSS!' on Mars. :)
