Friday, 20 January 2017

FAST-40


Here's a running commentary on a little side-project I'm fiddling with at the moment which you might find interesting. I've temporarily stopped active development on VIC++ because I'm just not getting enough continuous time to spend on it, but I'd like to release something and I decided that I'd steal some code from the main project and turn it into FAST-40. Over the summer of 2013 I wrote a blisteringly-fast renderer for VIC++ which allows me to display a 40x25 text screen and update up to three complete lines per frame (there's enough CPU time during the IRQ to render three lines on a PAL machine and two on NTSC) which I'm jolly pleased with. It occurred to me that I might be able to turn this into a stand-alone product and release it for 'normal' VIC-20 machines, i.e. those running the stock Kernal OS rather than my custom VIC++ ROM image.

I was always in complete awe of the various 40-column programs for the VIC-20 that did the rounds back in the early 80's - I would just look at them and know that their programmers were as gods to me, somehow able to conjure miraculous things from the depths of this machine. Turning a standard 22-column display into a 40-column one seemed akin to magic to my inexperienced eyes, and I guess it set my personal bar for technical achievement - I'd know I was 'good' if I ever managed to do something like that. So now I'm going to. ;)

The Competition


If you have a scout-around somewhere like the Zimmers archive you'll find four 40-column programs for the VIC - there might be more out there somewhere - and I distinctly recall owning Fat 40 back in the day. They all do 40x24 displays, but doing a quick comparison in VICE I was intrigued by the variations that exist between them:

1. Fat 40 - weighs-in at 7311 bytes, making it the chunkiest of the set, and appears to be quite as accomplished as I remember it. It has the full 256 Commodore characters defined for both upper- and lower-case charactersets (which presumably accounts for 4K of the payload), works in both PAL and NTSC modes (although it skews the viewport way off-centre in the latter mode) and has a reasonable update rate. On an 8K-expanded VIC the FRE(0) function reports 4500 bytes free.

2. Screen-40 - the smallest of the bunch at 2158 bytes, this is designed for NTSC machines; and unsurprisingly, in PAL mode the viewport is off-centre. The biggest issue, betrayed by the tiny file size, is that it only implements the upper-case character set - and it has an irritating habit of doubling-up carriage-returns after printing to the screen sometimes. But aside from that, it's possibly the fastest at refreshing the screen, and FRE(0) reports 5117 bytes free after installation on a machine with an 8K expansion.

3. Vic 40 Scherm - evidently of Dutch origin (judging by the name; 'scherm' is Dutch for 'screen') and being a moderate 6828 bytes in size, this should easily be the best of the bunch. It does a good job of implementing the whole 4K characterset, but strangely it takes a little while to 'warm up' - to begin with, the renderer doesn't quite know where to draw stuff on the bitmap and leaves odd gaps all over it. After a while it gets the hang of it, and thereafter behaves very well, but it's a vexing flaw that detracts from an otherwise impressive product. Refresh speed is a bit quicker than Fat 40, and it similarly works in both PAL and NTSC mode (but again makes a mess of viewport centring in NTSC). FRE(0) reports 4942 bytes free on an 8K VIC.

4. Mighty Term - technically not a 40-column display utility, this is actually a dial-up terminal program which implements a 40-column view. The display is quite good and has upper- and lower-case characters showing in the top line (the only visible element) but of course I can't gauge its refresh speed or how much memory it consumes.

Now I haven't disassembled any of these, for two reasons; firstly, I don't want to get distracted by spending hours rummaging around in the inner workings of these programs, and secondly I'm content (for the moment) to compare my solution purely at the 'user' level - that is, will FAST-40 measure-up when compared side-by-side. Will it look as good (or better), be as quick to refresh the screen (or quicker, even though I render an extra line), handle edge conditions like Run/Stop-Restore with equal grace...? I guess I'm making a fairly bold claim in my choice of name, as I do genuinely expect it to trump all of these in terms of responsiveness, but other than that I don't want to allow any of these to colour my approach. Afterwards, when I'm done, we'll perhaps have a little look inside Fat 40 and Vic 40 Scherm to see what makes them tick.

The Plan


In order to get a 22-column VIC-20 to display 40 columns, we have to do some fairly sophisticated programming to switch the 6560/6561 (Video Interface Chip, NTSC or PAL model) into high-res mode and then do a bunch of address-tweaking and pixel-drawing operations at just the right points during the screen update. We have to synchronise this logic with the actual raster beam via careful IRQ manipulation, and we have to link the new screen layout into the OS so that the Kernal, BASIC and other programs continue to do screen I/O without needing to change.

So, goals for FAST-40 then; obviously render a 40x25 screen as quickly as possible, ideally quicker than any of the above; integrate nicely with the OS, and handle stuff like NMI gracefully; implement the full upper- and lower-case charactersets; and leave as much free memory for the user as possible. I'm also going to have the program make use of the 3K expansion RAM area in Block 0 if it's available, and if I'm feeling flash I might tuck a little BASIC hook in there to add a 'REFRESH' command so that the user can choose the balance between refresh speed and CPU availability.

Execute


10th November - I've got the BASIC stub working (the bit that loads and runs as a friendly BASIC program to invoke the assembler code with a SYS command), a simple makefile that builds the code and invokes XVIC to run it, and a few initialiser routines in. The code detects whether it's running on a PAL or NTSC machine, checks to see if there's RAM in BLK0 (the 3K expansion area), sets the IRQ vector to a placeholder routine that just does a quick screen colour change before routing to the standard IRQ logic at $EABF, and synchronises the IRQ timer on VIA#2 with raster line 0.

12th November - The two-phase IRQ logic is in. After a little VIA timer initialisation, two IRQ routines take turns - one firing at the middle of the screen and the other just before the bottom. They're merely drawing a little colour at the moment, but their primary role will be to twiddle the VIC settings in exactly the right places as the screen is drawn so that the appropriate sections of bitmap get switched-in at the right time. These two bits of code could actually be a single routine with a flag to indicate which phase is active, but in fact it's much quicker to have two separate routines each doing their own thing, and just tickle the IRQ vector at $0314/5 to select the right one. By page-aligning them, this 'tickle' becomes nothing more than an INC/DEC of the vector hi-byte - so the entire decision process involved in alternating them is reduced to six cycles.
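To make the vector 'tickle' concrete, here's a tiny Python model of it. The two handler addresses are made up purely for illustration (the real ones aren't stated anywhere above); the only thing that matters is that both are page-aligned and exactly one page apart, so the lo-byte at $0314 never has to change.

```python
memory = bytearray(0x10000)            # toy 64K address space
IRQ_VECTOR = 0x0314                    # lo byte at $0314, hi byte at $0315

MID_SCREEN_HANDLER = 0x1E00            # hypothetical, page-aligned
BOTTOM_HANDLER = 0x1F00                # hypothetical, exactly one page higher


def current_handler():
    """Return the 16-bit address currently held in the IRQ vector."""
    return memory[IRQ_VECTOR] | (memory[IRQ_VECTOR + 1] << 8)


# Point the vector at the first handler.
memory[IRQ_VECTOR] = MID_SCREEN_HANDLER & 0xFF
memory[IRQ_VECTOR + 1] = MID_SCREEN_HANDLER >> 8

memory[IRQ_VECTOR + 1] += 1            # what an INC $0315 would do
after_inc = current_handler()

memory[IRQ_VECTOR + 1] -= 1            # what a DEC $0315 would do
after_dec = current_handler()

print(hex(after_inc), hex(after_dec))
```

Each handler only ever needs the one instruction to hand over to the other, with no flag test or branch involved.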

17th November - Having got the two-phase IRQ working, I've decided not to go with a 25-line display after all; it would work, just as it does in VIC++, but would mean I'd have to do a bunch of faffing-around copying Pages 2 and 3 out and back in order to preserve regular VIC-20 functionality. Dropping to a 24-line display means I can squeeze the 40-column bitmap and matrix both into the main 4K area, and not have to spend (way too much) precious CPU time keeping memory arranged the way the stock Kernal likes it. That consequently means no raster-split requirement, and thus no two-phase IRQ. The downside is that I have to use double-height characters to fit into the available RAM, which means I lose some colour granularity - however this is the way those older 40-column programs work so I'm not particularly distressed. If you want full 40-column, 25-line, per-character-pair colouring then wait for VIC++ ;)

22nd November - The screen and bitmap memory areas are now initialised, and the VIC configuration settings are tweaked so that it points at the new display matrix. At the moment it's just displaying garbage, as I haven't yet plugged-in the text renderer logic - and also because the display memory overlaps BASIC memory on an 8K-expanded VIC-20. My next task is to write a bit of code that relocates FAST-40 to somewhere else, and pushes the start of BASIC memory up so that it's out of range of the display area. That code begins by looking to see whether there is RAM in the 3K expansion area (BLK0 starting at $0400) or in the so-called 8K Cartridge ROM area (BLK5 at $A000) which can also contain RAM. If either has RAM present, a choice of relocation options will be given and FAST-40 moved appropriately; the default option will be to simply move it to the top of the highest 8K RAM block present (BLK1, 2 or 3). The obvious advantage to moving to BLK0 or BLK5 is that less BASIC memory will have to be reserved for it, leaving more available for user code.

26th November - The relocator selection menu is now being displayed according to what expansion memory is in the system; if none is found in either BLK0 or BLK5 (using a non-destructive read-increment-write-compare-revert test) then no menu is displayed and FAST-40 relocates to the top of memory by default. Otherwise, a simple menu displays whichever or both of the two expansion areas have RAM in them, and offers the choice of either (or the default top of memory choice).
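For anyone wondering what a 'read-increment-write-compare-revert' test looks like, here's a rough sketch in Python rather than 6502. The Memory class is a toy stand-in I've invented for the illustration: it simply ignores writes to addresses where no RAM is fitted, which is a simplification of what real hardware does. The example memory map is also hypothetical.

```python
class Memory:
    """Toy 64K address space: writes only stick where RAM is fitted."""

    def __init__(self, ram_ranges):
        self.ram_ranges = ram_ranges          # list of (start, end), inclusive
        self.cells = [0xFF] * 0x10000         # arbitrary initial contents

    def is_ram(self, addr):
        return any(lo <= addr <= hi for lo, hi in self.ram_ranges)

    def read(self, addr):
        return self.cells[addr]

    def write(self, addr, value):
        if self.is_ram(addr):                 # no RAM here: write is ignored
            self.cells[addr] = value & 0xFF


def has_ram(mem, addr):
    """Read, increment, write, compare, then restore the original byte."""
    original = mem.read(addr)
    probe = (original + 1) & 0xFF
    mem.write(addr, probe)
    found = mem.read(addr) == probe
    mem.write(addr, original)                 # revert so nothing is disturbed
    return found


# Hypothetical machine: stock RAM plus 8K in BLK1 and 8K in BLK5, no 3K pack.
machine = Memory([(0x0000, 0x03FF), (0x1000, 0x1FFF),
                  (0x2000, 0x3FFF), (0xA000, 0xBFFF)])

blk0_present = has_ram(machine, 0x0400)
blk5_present = has_ram(machine, 0xA000)
print(blk0_present, blk5_present)
```

The revert step is what makes the test non-destructive: whatever was sitting in the probed byte is put back before the routine returns.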

30th November - Of all the possible hyper-complex things I could get stuck on, the thing that's giving me aggravation right now is, bizarrely, the incredibly-not-complex keypress handler in the relocation selector. Yes, that's right - this dozen or so really simple bytes of assembler related to reading the keyboard are misbehaving in a very peculiar way. The code calls SCNKEY, the ROM routine to scan the keyboard, then picks-up the resultant scancode and makes a decision on where to push the FAST-40 payload depending on which of three keys the user presses. This is actually working fine, except that certain other keys are also appearing as scancodes I'm looking for - so, for example, sometimes if I press 'H' or 'B' repeatedly, they show up as the scancode for '8', which is one I'm looking for. I have a feeling this is somehow related to the interplay between VICE and my little laptop that I'm working on this project on, in that there might be a keymapping config anomaly somewhere. I'm going to push the project over to my big development rig (which is a quad-core monster PC that I use for heavy projects in C# and suchlike) and see if it does the same thing there; I know the VICE keyscan config definitely works perfectly on that box, because it's something I spent a lot of time getting right when I was writing the keyscan logic for VIC++ (the equivalent of the Commodore ROM routine I'm calling here).

2nd December - Problem solved, thanks to a clue from a fellow member over at Denial; after running the code on the big rig and getting the same result, it turns out that the fault lay with STROUT, the ROM routine that pushes text strings to the screen. I call this to display the menu options, and unbelievably its last act before returning is to enable interrupts! WTF? Why it does this is a mystery, and my personal theory is that it's a typo bug in the original ROM source because right before the CLI instruction is a CLC; I bet that CLI was spotted, and someone said "Doh! That should be CLC - fix it quick!", and then the CLC was punched-in but the CLI wasn't removed. Now, I had disabled interrupts way before so that I could do the VIA IRQ twiddly things I needed to do, and was then calling SCNKEY to read the keyboard - but because STROUT had sneakily re-enabled interrupts, SCNKEY itself was being interrupted, and it's not thread-safe (on these 8-bit machines thread-safety is virtually non-existent). Hence all sorts of Weird Stuff was happening, including corrupted values returned for key scancodes. A swift re-sequence of some of my code has fixed the problem - I now don't disable interrupts until after the menu stage, and it also means I don't have to call SCNKEY myself as the stock IRQ handler does it for me.

8th December - Spent an hour or so tidying-up the code after the re-sequencing I did last week to fix the key-scanning issue, and shrank the initialisation logic by a few bytes. I also devoted some time to working-out precisely where the screen and bitmap areas will sit, and deriving the appropriate VIC settings so that it knows where stuff is. I actually had a bad half-hour where I thought I'd screwed-up rather spectacularly and wasn't going to be able to fit a 40x24 display into the stock VIC-20 RAM without doing some raster-split stuff like I do in VIC++ but then I realised I wasn't accounting for the fact that double-height characters occupy 16 bytes rather than 8. So it does all fit - just! What you have to do is put the 240-byte screen matrix at $1000 (20 'real' columns times 24 rows = 480 cells, divided by 2 for double-height chars = 240 bytes) but with characters 16-255 as the content and 0-15 not there. Then the bitmap also sits at $1000, but because the first 16 characters aren't present in the matrix, the VIC never looks at $1000-$10FF for pixel data. That means the bitmap is nominally 4096 bytes, but its first 256 bytes overlap the matrix and are never used as pixel data, leaving 3840 bytes (240 cells of 16 bytes each) for the display itself. I'm pretty sure Fat 40 works this way too, because after I'd figured this out and calculated the VIC register settings for it, I fired Fat 40 up and PEEKed the VIC - and it's using the same values. Nice. :)
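The sums behind that layout are easy to check. This little Python sketch just re-does the arithmetic from the paragraph above; the variable names are mine, and the only inputs are the figures already quoted (matrix and bitmap both based at $1000, 16-byte double-height cells, character codes starting at 16).

```python
CELL_BYTES = 16                 # one double-height (8x16 pixel) character cell
MATRIX_BASE = 0x1000
CHARSET_BASE = 0x1000           # the bitmap shares the same base address

columns, rows = 20, 12          # 8-pixel-wide cells, double-height rows
matrix_size = columns * rows    # bytes in the screen matrix

first_code = 16                 # codes 0-15 would point back into the matrix
codes = list(range(first_code, first_code + matrix_size))

matrix_end = MATRIX_BASE + matrix_size - 1
bitmap_start = CHARSET_BASE + first_code * CELL_BYTES
bitmap_end = CHARSET_BASE + (codes[-1] + 1) * CELL_BYTES - 1

# 4-pixel-wide glyphs, two per cell across and two per cell down.
text_columns = columns * 2
text_rows = rows * 2

print(matrix_size, codes[0], codes[-1])
print(hex(matrix_end), hex(bitmap_start), hex(bitmap_end))
print(text_columns, text_rows)
```

The matrix finishes at $10EF, the first byte of pixel data the VIC can ever fetch is at $1100, and the last is at $1FFF, so the two areas never collide.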

12th December - I'm in The Zone now, having got to the stage where there was enough working setup code that I could copy a string into the new screen text buffer and do a test-fire of the renderer. There was a bunch of logic in that code (copied from VIC++) that I didn't need, mostly associated with rendering attributes (inverse and underline) and there was also a bunch of Zero Page usage I had to remove, since FAST-40 only has 8 bytes of ZP available in a stock VIC-20 running BASIC. VIC++ runs in ROM and so uses ZP for speed when indirect memory accesses are required, but as FAST-40 is a normal RAM-based program I can make use of the utter absence of memory protection on the 6502 and have the routine modify itself on-the-fly - so indirect address accesses via ZP become absolute accesses because the address gets stashed into the code itself as it's calculated. The end result is a smaller, faster routine (it runs just short of 6000 cycles per line) and, to my delight, rendered my test string first time through.
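As a sanity check on that cycle count, here's a back-of-envelope budget in Python. The raster geometry (71 cycles by 312 lines for the PAL 6561, 65 cycles by 261 lines for the NTSC 6560) is the commonly-quoted figure rather than anything I've measured here, so treat it as an assumption; and of course the renderer doesn't get the whole frame to itself, so these are upper bounds.

```python
RENDER_CYCLES_PER_LINE = 6000   # "just short of 6000", rounded up

# Assumed raster geometry: (cycles per raster line, raster lines per frame).
VIDEO = {
    "PAL (6561)": (71, 312),
    "NTSC (6560)": (65, 261),
}

budget = {}
for name, (cycles_per_line, lines) in VIDEO.items():
    frame_cycles = cycles_per_line * lines
    # Whole text lines that fit in one frame if nothing else ran at all.
    budget[name] = (frame_cycles, frame_cycles // RENDER_CYCLES_PER_LINE)
    print(name, frame_cycles, budget[name][1])
```

Under those assumptions the ceiling comes out at three text lines per frame on PAL and two on NTSC.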

14th December - Right-brain, meet Left-brain; I got stuck on a silly bug last night after tweaking register usage in the renderer - having ripped-out a bunch of unneeded code, I saw a way to tune things a little by optimising register allocation, but ended-up in a maze of twisty passages. Everything still worked, apart from the minor inconvenience that some characters were being rendered as spaces or garbage. I eventually got sick of looking at the code, and went to bed. This morning, as I walked along the embankment, the winter sun and clear air combined with a phrase of music I was humming, and my right-brain delivered the solution without my even really trying to think seriously about it. "You've forgotten to take out that second DEY at the bottom of the loop, Lefty", it said. And it was right.

23rd December - The little bit of code is in to reset the BASIC pointers prior to handing-over to the IRQ, so that the user can write code without splatting anything over the space FAST-40 needs to use. I've also plugged-in a prototype 'inflate' implementation, which (unsurprisingly) inflates a chunk of data that's been compressed by the 'deflate' algorithm; the reason for this little bit of extra baggage is that the characterset bitmap data and the renderer code payload actually deflate by about 65% so the total size of the final build is MUCH smaller than it would be without compression. Adding a couple of hundred bytes for a Huffman decoder is more than compensated-for by the reduction in size of the overall binary, and that decoder logic is part of the initialisation phase so it gets thrown away once it's done its job. Right now I'm debugging a little quirk where the IRQ routine runs off into the weeds instead of doing what I expect, and I think it's a Stack corruption problem so the RTI is pulling junk by mistake. Once I fix that, we'll be into the fiddly bit of tweaking stuff so that the KERNAL screen I/O routines understand the new layout, which means re-writing bits of code it calls through indirection vectors for things like line length breaks, cursor positioning, and suchlike.

27th December - I found another delightful gem in the BASIC ROM which was causing (as I suspected) a nasty Stack corruption fault. Tucked away at the end of the 'NEW' routine (_SCRTCH at $C642) that I call to reset all the BASIC pointers after relocating FAST-40 is a little bit of code that resets the BASIC pseudo-stack, and it does this with a blatantly unsafe adjustment of the CPU Stack Pointer. This of course assumes that BASIC is in charge at this point, and takes no consideration of the possibility that NEW might be being called in some other way (i.e. from a bit of 6502 assembler that wants to reset BASIC). So I just need to figure-out how to code around that, either by fixing the Stack afterwards or by pre-rigging the Stack beforehand so the call to NEW doesn't break my return address - which is what's happening now, and causing the IRQ routine to return somewhere unexpected.

2nd January 2014 - Well, I didn't make the end-of-year deadline, but then the festive season has a habit of distracting even the most hardened coder from their pet projects. I did find time to eliminate that Stack problem, though - I noticed that a higher-level call to the BASIC warm-start entrypoint would have the desired effect and avoid the problem entirely, so that's what I do now. So the initialisation code is now essentially done, and it's time to look at the IRQ logic that'll manage invocation of the renderer as and when necessary. This is all basically logic I've already written for VIC++ but I have to also patch-in the flag-bit updates for screen writes and make that work with the existing VIC-20 Kernal - new territory for me, as I've never tried to rework bits of the Commodore screen editor before.

9th January - I'm doing some fiddly stuff at the moment, hooking-up FAST-40 to the screen editor subsystem. The program has to integrate with the normal VIC-20 screen and keyboard functionality so that it behaves just as it does in normal 22-column mode, and that means tying it into the Kernal so that everything works in the custom 40-column mode. The trick is to hook into the screen and keyboard I/O vectors which get called whenever something is written to the screen or a key is pressed, and override the standard logic with custom code that understands where things are and how they work when FAST-40 is in charge. There's not a lot of difference, but there are places where the Kernal expects line lengths to be 22 characters (for example) that should now act on 40-character lines. It's not particularly exciting code, but it's important for the functionality of the program.

13th January - Well, this isn't working as well as I envisaged. My idea was to clone the vectored I/O routines, tweak them where they needed to know about the 40-column stuff, and have them defer to the ROM wherever possible. But the problem is that there's a lot of gnarly, intertwined, inter-dependent spaghetti code in there, and I'm up to almost 500 bytes of ROM code cloned already and the end is not yet in sight. This is, to use a technical term, a bit of a bugger. So I'm going to put that code to one side and try a different approach, which will be to basically re-write the I/O logic entirely, and only call back to the Kernal when it's appropriate. Let's see how it pans out...

18th January - Yep this is working; after a little bit of research (i.e. reading the ROM disassembly for a couple of hours) to determine what the guts of _CHROUT ($F27A) does, I have a nice bit of code written which is already pushing bytes to the text buffer and setting the dirty-row bits so that the renderer can draw stuff. This is vectored through _OUTVEC2 ($0326) so calls to the main I/O routine in the Kernal at $FFD2 route through my code and defer to the ROM for output to devices other than the screen. What this means is that everything that normally writes to the screen will continue to do so unchanged, but my code is now handling it - so for example, when I call the BASIC tepid-start routine at $E37E (between the cold-start and warm-start entry points so that vectors and RAM don't get re-initialised) the screen clears and the standard 'COMMODORE BASIC V2' messages appear on the 40-column bitmap. This is seriously cool, and now I'm writing the control-character handlers for things like Carriage Return, Cursor Down, etc.

26th January - One of the key features that distinguish the Commodore Kernal from a lot of other poorly-coded OS stuff of the time is the use of the jump-table at the top of ROM which has a bunch of core system routines punched through indirection vectors held low down in RAM. The idea is that if you want to do something standard, like write a character to the screen, you call that code through the jump-table entrypoint, which then does an indirect jump through the vector back into the ROM. This means that by altering the RAM vector, you can add, change or remove functionality in a way which requires no changes to any/every program that calls the routine - and this is precisely how FAST-40 works with the _CHROUT routine. I tweak the appropriate RAM vector to point to my code, and everything making calls to _CHROUT continues to do so in exactly the same way - anything using the Kernal to write to the screen now writes through my 40-column logic and gets drawn on a hi-res bitmap instead. This is very cool, and the Kernal itself does this the same way - any time something needs to write to the screen, there's a call to the jump-table address. Except in one place - the _CHRIN code, which reads input, and (in the case of keyboard input) echoes it to the screen. Using an absolute call. Which is a bugger, because that means I have to replicate _CHRIN within FAST-40 now so that it'll call my code instead of the Kernal. Thanks for that little bit of carelessness, Commodore. Grr.
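If the jump-table idea is new to you, this toy Python model might help. Nothing here is real Kernal code: the function names are invented stand-ins, and the dictionary plays the part of the RAM vector table. It just shows why a caller that goes through the jump table picks up the replacement routine, while one that calls the ROM routine directly does not.

```python
ram_vectors = {}                       # stands in for the RAM vector table
screen_log = []                        # records which routine handled each char


def rom_chrout(ch):
    screen_log.append(("rom", ch))     # stock 22-column output


def fast40_chrout(ch):
    screen_log.append(("fast40", ch))  # replacement 40-column output


def jump_table_chrout(ch):
    ram_vectors["CHROUT"](ch)          # indirect jump through the RAM vector


def well_behaved_caller(ch):
    jump_table_chrout(ch)              # goes via the jump table: hookable


def chrin_echo(ch):
    rom_chrout(ch)                     # absolute call into ROM: bypasses hook


ram_vectors["CHROUT"] = rom_chrout     # default, pointing back into ROM
ram_vectors["CHROUT"] = fast40_chrout  # the hook: repoint the vector

well_behaved_caller("A")
chrin_echo("B")
print(screen_log)
```

The 'A' lands in the 40-column code and the 'B' goes straight to the stock routine, which is exactly the problem with the keyboard echo described above.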

8th February - Significant milestone today, as enough of my re-workings of _CHRIN and _CHROUT are operational that the program is starting to behave as an interactive system again instead of a passive collection of disconnected bits of code. We're still some distance from a finished product, but I think we're close to an Alpha release which will let people play with it if they want to. As a small celebration, I'm going to post a couple of screenies. First up, the initial selector screen that runs when you load FAST-40 and gives you the choice of where to locate the runtime, based on what memory configuration you've got in your VIC-20 - here I have 8K RAM in Block 1 and 8K RAM in Block 5, so the selector has found those and is offering them:


And here, having selected Block 5, the runtime has relocated up into the RAM at $A000 (the cartridge area) and switched to 40-column mode, leaving the whole of Block 1 free and having enough functionality to let me type into it:


10th February - Doing the nitty-gritty chore of making sure _CHROUT responds properly to all input values, i.e. my version yields all the same results as the KERNAL version. It's a moderately boring job except for when my code either does something odd, or nothing at all, at which point it all gets quite exciting. Well, sort of. ;) Anyway, there are things like handling quote-mode, control characters, and a few other edge-cases that have to be implemented - the core of the routine works, which really just plops the designated character into the text buffer and sets the appropriate dirty-row bit to tell the renderer it's time to draw something. Things like control-codes for colour changes are quite simple, but then things get interesting when you do something like hit [Return] because that has to trigger the internal BASIC line rebuild logic, and other such fun stuff. Just plugging away towards Alpha now...
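Here's roughly how the text buffer and dirty-row bits fit together, sketched in Python. It's a simplified model and not the actual routine: the names are mine, the dirty flags are held in a single integer for brevity, and the limit of three rows per frame is the PAL figure mentioned earlier.

```python
COLS, ROWS = 40, 24
text = [[" "] * COLS for _ in range(ROWS)]   # the 40x24 text buffer
dirty = 0                                    # one bit per row, 24 bits in all


def put_char(row, col, ch):
    """Store a character and flag its row for redrawing."""
    global dirty
    text[row][col] = ch
    dirty |= 1 << row


def render_frame(max_rows=3):
    """Redraw up to max_rows dirty rows, clearing each flag as we go."""
    global dirty
    drawn = []
    for row in range(ROWS):
        if len(drawn) == max_rows:
            break
        if dirty & (1 << row):
            drawn.append(row)                # a real renderer would plot pixels
            dirty &= ~(1 << row)
    return drawn


for r in (0, 1, 5, 9):
    put_char(r, 0, "X")

first = render_frame()
second = render_frame()
print(first, second, dirty)
```

Writes are cheap because they only touch the buffer and one flag; the expensive pixel-plotting is deferred to the IRQ and rationed per frame.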

13th February - Still working through the _CHROUT routines to handle character outputs, and starting to think about the Line Link table. This is a 24-byte chunk of Zero Page that keeps track of when a logical line on the screen overruns on to subsequent lines, and is used by a variety of bits of the screen editor to make sure that multi-line lines are bundled together. I've got to do some work on it anyway, since it uses the 24th entry as an end-of-table byte, and now we're running a 24-line display that'll have to be changed. However, it occurred to me as I deciphered how it worked that a simple bitmap could be used instead - in fact, almost exactly the same technique as I use for the Dirty Row table that tells the renderer what rows to redraw. If it works as expected, the Line Link table will drop from 24 bytes to 3 - and I'll be able to consolidate the other ZP locations I use along with some other flags that wouldn't fit into ZP before, and thereby free-up those other locations and still have some of those 24 bytes free.
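A quick Python sketch of the bitmap idea, to show where the saving comes from. The interpretation of each bit (set means 'this row continues the logical line above it') is my assumption for the illustration, as are the function names; the point is only that 24 yes/no flags pack into 3 bytes.

```python
ROWS = 24
link_bits = bytearray(3)               # 24 flags packed into 3 bytes


def set_link(row, linked=True):
    """Mark (or unmark) a row as a continuation of the row above."""
    byte, bit = divmod(row, 8)
    if linked:
        link_bits[byte] |= 1 << bit
    else:
        link_bits[byte] &= ~(1 << bit) & 0xFF


def is_linked(row):
    byte, bit = divmod(row, 8)
    return bool(link_bits[byte] & (1 << bit))


def logical_line_start(row):
    """Walk upwards until we reach a row that starts a logical line."""
    while row > 0 and is_linked(row):
        row -= 1
    return row


# A three-row logical line occupying rows 4, 5 and 6.
set_link(5)
set_link(6)
print(logical_line_start(6), logical_line_start(7), len(link_bits))
```

On the 6502 the divmod turns into a couple of shifts and a mask lookup, so it costs a little more per access than a byte table but frees 21 bytes.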

9th March - I've been quiet for the last month, I've just noticed, but work continues. I'm in the middle of a bug-hunt at the moment, fixing niggly little glitches like 'dead' cursor trails, so there's not much in the way of tangible progress to report. I have however come up with a beautiful way to scroll the screen - rather than doing a mechanical byte-copy to shift lines upward, I simply adjust the layout of the underlying character matrix and shuffle pointers around. This means I can scroll the whole screen up two lines in around a third of a frame, which is nice and quick and avoids 'screen tearing' where a scroll event clashes with a refresh. Why two lines and not one? Well, the screen matrix is configured as a 20x12 grid of double-height characters, so one byte in the matrix represents 4 characters - 2 columns, 2 rows - thus moving the matrix back a row means two lines scroll out of view. It's a trade-off because the native KERNAL scrolls one line at a time, so I'm not exactly mimicking standard behaviour, but then again scroll events only occur when the cursor moves off line 24 so having the display move two lines up at that point isn't particularly worrisome.
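One way to picture the scroll, as a Python sketch. This is my simplified reading of 'adjust the matrix and shuffle pointers', not a transcription of the real routine: the matrix holds character codes 16-255, each code owns one 16-byte bitmap cell, and scrolling rotates the codes by one matrix row so the pixel data itself never moves. The bitmap is modelled as a dictionary of labels rather than real pixels.

```python
COLUMNS, ROWS = 20, 12
FIRST_CODE = 16

matrix = list(range(FIRST_CODE, FIRST_CODE + COLUMNS * ROWS))
bitmap = {code: "old" for code in matrix}    # cell contents keyed by code


def scroll_up(matrix, bitmap):
    """Rotate the matrix by one row; recycle the old top row's cells."""
    recycled = matrix[:COLUMNS]
    for code in recycled:
        bitmap[code] = "blank"               # only these cells get cleared
    return matrix[COLUMNS:] + recycled


bitmap[matrix[COLUMNS]] = "second row, first cell"
code_before = matrix[COLUMNS]

matrix = scroll_up(matrix, bitmap)

moved_bytes = COLUMNS * ROWS                 # matrix bytes rewritten
avoided_bytes = COLUMNS * ROWS * 16          # bitmap bytes left in place
print(matrix[0] == code_before, bitmap[matrix[0]], moved_bytes, avoided_bytes)
```

In this model a scroll rewrites 240 matrix bytes and blanks 20 cells, instead of copying the best part of 3840 bytes of bitmap; the renderer then needs to know which cell each text position now maps to, which is presumably where the pointer-shuffling comes in.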

19th April - Yep, you're right, updates have been non-existent over the past month or so; real-life interjected itself in a fairly major way just after my last post, and it's really only been in the last couple of days that I've had a chance to get back to the project. So where are we? Well, the basic functionality is mostly there now, and I'm pretty happy with how it's hanging together. The focus of attention at the moment is a difficult little bug concerning dead-cursor trails being left on the screen after non-printing character outputs - I think it might be a timing issue or race condition between _CHROUT and the IRQ handler, with the cursor blink phase at the root. The logic as written is supposed to undraw the cursor when necessary if the blink phase is 'on', but for some reason that isn't always the case - the phase is occasionally 'off' when I expected it to be 'on', so the cursor doesn't undraw before moving, and so dead cursor blocks get left behind. Investigation is ongoing...
EDIT: nailed it - timing issue resulting in loss-of-sync between cursor visibility and the associated flag.

5th May - I'm at the point now where I'm patching HLTs. There were a good number of places where I dropped this (undocumented) opcode into the code as I found things that were going to need attention later, when the program was in better, more complete, shape. Stuff like the bit of logic to tell the renderer which characterset to look at (upper- or lower-case), or the altogether more complicated subroutine that accepts an entered line from the screen and invokes the KERNAL code to tokenise it and stash it as a BASIC line. Back in February I wrote that we were close to an Alpha release, and here we are three months later still awaiting that release - but I want it to be at least capable of running a BASIC program, even if there are a bunch of caveats like specific memory configurations, known bugs, etc. Why is this such a key objective? Simple - alongside the release I want to include a BASIC measurement program which will run on any VIC with any 40-column program and describe the character throughput rate, and thereby demonstrate why FAST-40 has its name. ;)

6th May - It occurred to me that there might be some questions over my use of HLT (opcode $02) instead of the more likely BRK (opcode $00) to trigger breakpoints in the code, as I mentioned previously. The reason is quite simple and is tied to how the 6502, and the Commodore KERNAL, handle these two instructions. If we consider BRK first, it's a standard opcode ("break") and is used, unsurprisingly, to trigger breakpoints - the 6502 has a special piece of logic that watches for executions of BRK and triggers the IRQ vector handler with the 'B' flag set in the Status Register when one is encountered. The Commodore KERNAL code that handles IRQ looks to see if the 'B' flag is set when it's called, and if it is then control bounces out of IRQ and through the BRK vector to manage what action to take in response - it is therefore possible to alter what happens (normally it's a BASIC reset) and do something else instead. The key point being that everything is under software control, and therefore if you want to actually physically stop execution dead in VICE, you still have to apply a monitor breakpoint to the BRK handler. The HLT ("halt") instruction, on the other hand, is an undocumented opcode that physically jams the 6502 and renders it incapable of further activity until reset (it actually messes-up the internal T-State register so no instruction can execute). On real hardware HLT is game over, but VICE simply stops emulation and pops an alert - crucially the PC and other registers retain their emulated state at the point the HLT instruction was executed, and in fact you can enter the monitor and resume execution after the HLT if you wish. And that's why I use HLT as a debugging aid under VICE instead of BRK.

Tuesday, 21 May 2013

In Which We Consider How To Establish Stability


Let's consider a hypothetical, simple, video generator chip which produces a display of 100 raster lines (0-99) each of which equates to 50 cycles to draw. The entire screen is therefore 100 * 50 (5000) cycles long. The chip has no hardware raster interrupt, but does have a counter tracking which line it is currently drawing. We'll call this counter RASLINE. There is also a timer in the system which has a simple countdown mode that runs at the CPU clock frequency and triggers an IRQ to the CPU when it hits zero. So if we load the timer with 5000 and set it running, it'll trigger an IRQ after 5000 cycles, or once per video frame.

Imagine we wanted something to happen on line 21 of the screen on every frame; we could run a little loop watching for RASLINE to hit 21, then start the timer with a value of 5000. On every frame afterwards we'd get an IRQ on line 21 where we could run a little routine to do something exciting like a colour switch over the line. Pretty simple, and quite cool.

Now in this example we'd expect to see a nice thin bar of some other colour on line 21 (we'd have the code switch back to the previous colour at the beginning of the next line) but in fact there's a problem - for some reason the bar seems to start some distance in from the edge of the screen, and consequently then spills-over into line 22. Not by much, and not always by the same amount, but it's producing an annoying gap on the left and a jittery overrun on to the next line. Why is this?

Well, the gap on the left of line 21 is due to the fact that in our initialisation code we wait with a loop that watches RASLINE, and as soon as it hits 21 we start the timer with a value of 5000 so that we'll get an IRQ at this same place on every frame. But in fact it takes a few cycles for the wait-loop to spot that RASLINE has changed after it has actually changed - and if we run the program a few times, we might see that the point on line 21 where our timer starts is always somewhere between 2 and 9 cycles after the actual start of the line. It varies depending on where in the loop the CPU was at the point RASLINE ticked-over, and there's no magic number we can add or subtract to counter it.

The jittery overrun to line 22 is because we draw 50 cycles' worth of alternate colour, which should exactly fill line 21 - but as there's a variable offset to the start of the line our coloured line runs on to line 22. Part of the delay comes from the CPU IRQ mechanism itself, which is not instantaneous but needs 7 cycles to take effect - although this is a fixed amount and we can compensate for it. The jitter comes from something worse: depending on what instruction is executing at the point the IRQ is triggered, there's up to an additional 7 cycles of variability before our actual IRQ colour-changing code starts whilst the CPU finishes the current instruction.

So we have an initialisation variance of up to 7 cycles for the loop to discover that we're on the right line, and then on every IRQ afterwards we have up to another 7 cycles where the CPU has to finish what it's doing before the IRQ gets processed. Assuming we compensate for the fixed 7-cycle IRQ delay, then on any given frame our actual colour might start anywhere between 2 and 16 cycles after the video generator begins drawing line 21. How do we counter these two issues and get a stable raster interrupt that always starts exactly where we want it to, and always occurs in the same place thereafter?
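As a quick sanity check on that arithmetic, here is a minimal Python sketch that simply adds the two ranges quoted above. The names are made up for illustration, and it assumes the fixed 7-cycle IRQ entry delay has already been compensated for.

```python
# Cycle figures quoted in the text (the fixed 7-cycle IRQ entry delay is
# assumed to be compensated for already, so it does not appear here).
INIT_LATENCY = (2, 9)   # wait-loop: cycles between RASLINE changing and the timer starting
IRQ_LATENCY = (0, 7)    # cycles spent finishing the current instruction before the IRQ runs


def jitter_window(init=INIT_LATENCY, irq=IRQ_LATENCY):
    """Return (earliest, latest) start of the colour change, in cycles into the line."""
    return init[0] + irq[0], init[1] + irq[1]


earliest, latest = jitter_window()
print(f"colour change starts {earliest}-{latest} cycles into line 21")
```

The first range is fixed once at initialisation, while the second is re-rolled on every frame, which is why the two problems are tackled separately.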

Let's think about the initialisation start-point problem first, and consider the IRQ instruction-delay issue later.

We could try waiting for line 20 in the initialiser code and then having a small delay there to guess when RASLINE will hit 21 and start the timer at exactly the beginning of the line, but the problem is essentially the same because we will start line 20 at a variable point and then have no way of knowing how long to delay before the video generator is drawing line 21. Or we could use a second timer and some clever but complicated code to track an offset from the main IRQ timer, using that as a differentiator to compensate for the variability of the RASLINE wait-test loop - but then we've committed that second timer to the display, and can't use it for anything else. Plus, the code is gnarly, dude.

Or we could do something else:

First, establish execution on line zero; so wait for RASLINE to be zero, and we then know we're definitely running somewhere 2-9 cycles after the video generator begins drawing the very first line. What we don't know yet is where exactly on the line we are - so start a loop, arranging things such that the net cycle-length of the loop is 4999 cycles (one cycle less than the total needed for a whole frame); then check RASLINE. If it's still zero, we're tracking-back along that line; rerun the loop and then check RASLINE again. Eventually it will not be zero, because we'll have tracked-back to a point where the video generator is still drawing line 99. At this point we know exactly where we are - one cycle before the beginning of line zero.

In and amongst this we have to have some compensation applied to the loop length to counter the 7-cycle IRQ delay, and whatever cycles we consume actually doing the work of testing RASLINE - but that's trivial to calculate, because they're both fixed values.
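The track-back idea can be sketched as a small Python simulation of the hypothetical chip. This is only a model under stated assumptions: `rasline` and `track_back` are illustrative names, and the 4999-cycle figure is taken to be the net length of one pass with the RASLINE test overhead already folded in.

```python
FRAME_CYCLES = 5000              # 100 lines x 50 cycles, from the example chip
LINE_CYCLES = 50
LOOP_CYCLES = FRAME_CYCLES - 1   # net length of one pass of the track-back loop


def rasline(cycle):
    """Line the hypothetical chip is drawing at an absolute cycle count."""
    return (cycle % FRAME_CYCLES) // LINE_CYCLES


def track_back(start_offset):
    """Start `start_offset` cycles into line 0 and loop until RASLINE is no longer 0.

    Returns (position within the frame, number of loop passes).
    """
    cycle = start_offset
    passes = 0
    while rasline(cycle) == 0:
        cycle += LOOP_CYCLES     # one pass: a whole frame minus one cycle
        passes += 1
    return cycle % FRAME_CYCLES, passes


for offset in range(2, 10):      # the 2-9 cycle uncertainty left by the wait-loop
    position, passes = track_back(offset)
    print(f"start offset {offset}: ends at cycle {position} after {passes} passes")
```

Whatever the starting offset, the loop exits at the same position in the frame; only the number of passes differs.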

Now the hard part - compensating for the 7-cycle variability in the IRQ where the CPU has to finish the current instruction. This might be REALLY gnarly, dude.

Thursday, 18 April 2013

In Which We Do Something New (Possibly)


I've got an idea for a new way to do stable raster synchronisation. I've looked around, and I don't think it's been done before - I'm working on the code now, and if it does what I think it will, I'll post it and an explanation of how it works next time.

Here's how I got on to this: as you know from my last post, I'm reworking the display logic to draw the 25-row screen bitmap with the extra line in the middle of the screen instead of at the bottom - this makes things much nicer in terms of memory allocation, speed, colour resolution, and algorithm programming to do the refresh. It does, however, require two IRQ interrupts per frame in order to effect three raster splits; one right at the top of the screen to set the VIC up for the first 12 rows, and one at the end of row 12 to set things up for row 13. The third raster split occurs at the end of row 13 to set the VIC for row 14 onwards, but I don't need an IRQ for that as I'll just wait for that raster line after doing the housekeeping stuff (timers, cursor blink, keyboard scan, etc.) in the 8 lines of row 13 itself.

Now the IRQ at the top could theoretically occur largely anywhere on whichever raster line we choose, because it's in the top border and therefore invisible - if it happens to start halfway along a raster line and cross into the next, it wouldn't matter so long as the VIC is ready to go by the time we get to the first raster line of the top display row. But the IRQ at the end of row 12 has to be spot on because we have a very limited amount of time before row 13 starts drawing. We've also got to give the VIA#2 timer a precise interval value, which means the top-of-screen IRQ ends up having to happen at a specific spot after all: the row 12 IRQ has to fire a precise number of cycles after it, on every frame.

What this all boils down to is that the IRQ trigger points have to be absolutely cycle-accurate - and that's traditionally a bit of a bugger to do on a machine with no raster interrupts (on the C64, for example, you can tell the VIC-II to trigger an IRQ at the beginning of a specific raster line). Even though we can watch the raster counter in a tight loop and wait for a target line, there's still up to 7 cycles of inaccuracy between the counter ticking-over to the target value and our code actually being in a position to act on it:

            lda #$_TARGET_LINE      ; [2]   set the target raster line number
.waitline   cmp $_RASTER_COUNT      ; [4]   see if the VIC raster counter matches the target
            bne .waitline           ; [3/2] loop back and check again if not

You see? If the counter hits our target just before the cmp, it's 4 cycles for cmp + 2 cycles for bne, making 6 cycles after the counter actually was the value we're looking for before we're ready to do something about it. If the counter hits the target just after cmp, it's 3 cycles for bne + 4 for cmp + 2 for bne, making 9 cycles. The best-case scenario is where the counter hits the target just before the end of cmp, in which case we just clock 2 further cycles for bne - but that still means we could be anywhere from 2 to 9 cycles on from the counter hitting the target, with no way to know how much of that 7-cycle variability is in force. And those 7 cycles make a lot of difference when you're trying to get a precise raster effect - it's what gives rise to the 'jitter' you sometimes see at the edges of the screen when playing a game with a complex visual display.
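The three cases can be tallied with a few lines of Python. This is just the arithmetic from the paragraph above written out, using the cycle counts annotated on the loop; the function and key names are illustrative.

```python
CMP_ABS = 4          # cmp $_RASTER_COUNT
BNE_TAKEN = 3        # bne .waitline when it loops back
BNE_NOT_TAKEN = 2    # bne .waitline when it falls through


def wait_loop_latency():
    """Cycles between the counter changing and the code after the loop running."""
    best = BNE_NOT_TAKEN                          # counter changes just before cmp finishes
    before_cmp = CMP_ABS + BNE_NOT_TAKEN          # counter changes just before cmp starts
    worst = BNE_TAKEN + CMP_ABS + BNE_NOT_TAKEN   # counter changes just after cmp
    return {"best": best, "before_cmp": before_cmp, "worst": worst,
            "spread": worst - best}


print(wait_loop_latency())
```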

So what to do? Doing a simple wait-loop such as the code above gets you to a specific target line, but at best you've got a 7-cycle unpredictability to deal with somehow. Plus of course that wait-loop sits there eating 100% CPU, and it's hardly appropriate for an OS to spend huge chunks of processor time doing nothing but watch a raster counter. We need a way to eliminate that jitter, and so far I've only found one person who's managed to do it reliably - a hot coder named Marko Makela who some years back wrote a routine to use two timers offset from each other to measure the inaccuracy and compensate for it. His original article is here, and it took me about four passes over it before I grokked what he was doing. :)

All well and good, you might say - so just use his code, and be done with it.

Well, there are two things wrong with that: first, I'm not using anyone else's code in VIC++ (although it is entirely possible that I'll write something at some point that has been done in a similar way before); and secondly, although Marko's code is undoubtedly very cool and gets the job done, I read it and thought it seemed a tiny bit awkward to have to tie-up two VIA timers to make it work. Yeah yeah, like I know any better - but sometimes ignorance can be an asset, and in this case I think I've hit on something that'll guarantee a reliable, stable, cycle-perfect raster sync without needing a second timer.

You may point and laugh next time, when I show you what I've done. ;)

Monday, 15 April 2013

In Which We Turn The Display Up To 11


Been a bit quiet the last couple of weeks, with not a lot of time to work on the project mainly due to Real Life making me do other stuff. But a lack of posts here - or indeed tangible progress on the code - does not equate to zero progress on the project; in fact, I have been devoting considerable grey matter runtime to ... the renderer.

Yes, once again I've returned to this element of the OS. It strikes even myself as odd that such a small component of the system should demand so much attention, and to be frank I had thought that after the last pass (where I streamlined and improved the logic immensely) I was finally done with it. But then a fellow-forum-member at Denial posted a comment on this post, and suddenly I knew I had to return to it. Here's what Tokra wrote:
Reading through your post I noticed that you use 8x16 characters which will result in your 4-bit-wide characters only being able to be color-changed not in pairs but in chunks of four since the character a line directly below will use the same color-RAM position. This could be avoided by setting up the screen in 8x8 mode. However to make this fit, you would need to place the 25th line in the middle of the screen instead of at the bottom. The layout would then look like this:
  • Primary Screen Matrix @ $0200-$02ef - char 0-239 charram pointing to $1000 
  • Secondary Screen Matrix @ $02f0-$0303 - char 112-131 charram pointing to $1400 
  • Tertiary Screen Matrix @ $0304-$03f3 - char 4-243 charram pointing to $1800 
This way the whole bitmap of 4000 byte would reside from $1000-$1f9f and could be addressed in a straight way. However you would need TWO raster-splits right before and after the 13th line to make this work. You keep the zeropage free however and have doubled the color-resolution vertically.
Well, there was a challenge-and-a-half: basically, just rework the entire display-logic, IRQ synchronisation code, and memory layout! I wrote a reply enthusing about the idea, but wasn't sure I really wanted to go all the way back to almost the very beginning of the system design. But the more I thought about it, the more I knew that I could do it, and that when I did it would be awesome - eliminate the 160-byte ZP usage, make the bitmap contiguous, improve colour resolution, and make the renderer code even simpler and faster. Yep, it had to be done.
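Tokra's layout can be checked with a little arithmetic. The Python sketch below takes the three screen-matrix ranges, character-code ranges and character RAM bases from the quote, assumes 8 bytes per 8x8 character cell, and confirms that the three sections map onto one contiguous bitmap. The function names are illustrative only.

```python
CHAR_BYTES = 8   # one 8x8 character cell is 8 bytes of bitmap

# (name, matrix start, matrix end, first char, last char, character RAM base)
SECTIONS = [
    ("primary",   0x0200, 0x02EF,   0, 239, 0x1000),
    ("secondary", 0x02F0, 0x0303, 112, 131, 0x1400),
    ("tertiary",  0x0304, 0x03F3,   4, 243, 0x1800),
]


def bitmap_span(first_char, last_char, charram):
    """First and last bitmap address used by a run of character codes."""
    start = charram + first_char * CHAR_BYTES
    end = charram + (last_char + 1) * CHAR_BYTES - 1
    return start, end


def check_layout(sections=SECTIONS):
    """Return the (start, end) bitmap spans, checking that they are contiguous."""
    spans = []
    for name, m_start, m_end, first, last, charram in sections:
        # each screen-matrix byte holds exactly one character code
        assert m_end - m_start == last - first, name
        span = bitmap_span(first, last, charram)
        if spans:
            assert span[0] == spans[-1][1] + 1, name
        spans.append(span)
    return spans


spans = check_layout()
for (name, *_), (start, end) in zip(SECTIONS, spans):
    print(f"{name:9s} ${start:04X}-${end:04X}")
total = spans[-1][1] - spans[0][0] + 1
print("bitmap bytes:", total)
```

The odd-looking character ranges (112-131 and 4-243) are what make each section pick up exactly where the previous one stopped.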

So I fairly rapidly adjusted the memory map to accommodate this new model, and tweaked the screen-setup and clearing logic to reflect the fact that the bitmap is now a single block of memory from $1000 - $1F9F, with the screen matrices at $0200 - $03F3. That's been tested and proven to work (and SO much quicker) and I'm now working on the IRQ multiplexer which has to split the screen into three sections - on that point Tokra was slightly mistaken, since I actually need THREE raster splits; one at the end of row 12, one at the end of row 13, and one somewhere after the end of row 25 but before the start of row 1 on the next frame. This does actually only equate to two IRQ interrupts per frame, since the second split is 8 raster lines after the first and I can squeeze most of the system housekeeping in there and then execute a wait-loop for the next split.

But we don't want to just have the CPU executing a series of wait-state loops until the VIC raster line counter gets to the right place so we can adjust things to correctly draw row 1 and row 13 - that would consume a huge percentage of the processor time, and leave little or nothing for anything else. Instead, we need a bit of code that keeps track of what section of the screen is being drawn by the VIC and sets the IRQ timer in VIA#2 to an appropriate interval that will trigger the next IRQ at the right place, allowing us to reset the VIC for the next section of screen 'just in time' for it to be drawn.

In the previous version of the code, we synchronised the IRQ with the raster at the end of row 24 - this allowed us to do the VIC-twiddle to display row 25, do the IRQ housekeeping whilst row 25 was being drawn, and then wait a couple of raster lines before switching the VIC back for the next frame. The VIA#2 timer value was a constant, as we wanted the IRQ to occur once per frame at exactly the same place - but now we want two IRQ interrupts per frame, and that means the multiplexer code has to know which bit of the screen is being drawn in order to reset the VIA#2 timer to the right value.

So that's what I'm working on right now - getting the IRQ initialisation working and ensuring that it'll trigger twice per frame in just the right spots, taking account of NTSC/PAL timing differences too, of course. I'll post an updated memory map and some code next time.

Sunday, 24 March 2013

In Which We Pull Away From The Stack


I know you're on tenterhooks, itching to know what I did with that code for manipulating the Stack after my first attempt to get a working register-save routine up-and-running last time. Well, here's the result:

Nothing.

Actually, that's not entirely true - I did write two other variations of saveregs, but in both cases the cycle-time went up; I did get a version working which did a degree of faffing-around with the Stacked data and eliminated the two dead-weight return address bytes, but that tallied to over 70 cycles, and was quite ugly. So in the end, I settled on the version I wrote first, and decided that I would just have to accept the rigour needed to ensure I called loadregs nicely and let it tidy-up those two bytes.

After that, I found myself with very little time suddenly, so the project has lain largely neglected for the last week or so. I found a half-hour yesterday to squeeze some logic in to administer a keyboard buffer which the keyscan routine drops characters into as you type them, and then I tweaked the IRQ handler to spit the buffer contents to the screen if appropriate - but it'll be next week before I get a chance to do much more. The next thing on my list is to finalise the way the keyboard logic handles 'control-key' actions like Shift, Control, Tab, etc. so I'm hoping to have that wrapped-up in a few days.

There's also a little bug in the IRQ routine which has become apparent after my new Stack logic went in; I think I've got a sneaky 'pull' happening somewhere which is messing with the Stack Pointer, so I'll be doing a spot of tracing there to identify and fix that glitch.

Onward!

Wednesday, 13 March 2013

In Which We Push The Stack Around


I'm still hammering-away at the keyboard handler, getting things organised to my liking in the way it does stuff like handle control keys, feed printable characters to the screen, and so on, and in the course of these fun but fairly unexciting endeavours I've had occasion to do a bit of messing-about with the Stack. I haven't really needed to do much with it so far, since a round-trip needs 7 cycles (3 to push with PHA, 4 to pull with PLA) and as almost everything I've written so far has been speed-critical it's been more efficient to use Zero Page for temporary storage. I think there might be one place I use the Stack as a temporary store, and that's only because it's a single byte I need to keep handy and it's not needed anywhere else - so committing a ZP location to it is a bit wasteful.

But I'm writing a bit of code right now that needs to splat something to the screen and issue a 'change position' command to the cursor, and it passes three parameters to a subroutine in the .A, .X and .Y registers. However, before that subroutine makes use of those register values, it has to call a subroutine of its own first - which will use at least two registers during execution and therefore nuke the original parameter values before they can be used. The obvious thing to do of course is stash the values before making the subroutine call, but again I don't fancy chewing into precious ZP storage just to hold them - and since this particular code isn't speed-critical and the Stack is designed for just this sort of thing, it's a no-brainer.

Bearing in mind the fact that the requirement to stash register-value parameters on the Stack is going to become more important as I climb higher up the OS structure towards userland, the logical thing to do at this point is to write a nice standardised pair of routines to handle the pushing of registers on to the Stack and the reciprocal action of pulling them off again. Then I can call these whenever anything needs to push or pull stuff, rather than having replicated chunks of code to do it scattered across the codebase - seems sensible, right? Sounds simple too, no doubt. Hah.

Let's take the simplest requirement first - we want to stash .A, .X and .Y to the Stack so they're safe and can be retrieved later; the code is short and sweet (ooh look, new formatting template!):
    PHA             ; [3] stash .A on the Stack
    TXA             ; [2] move .X to .A
    PHA             ; [3] stash .X on the Stack
    TYA             ; [2] move .Y to .A
    PHA             ; [3] stash .Y on the Stack

    ...             ; [x] some code that uses registers executes here

    PLA             ; [4] get .Y from the Stack
    TAY             ; [2] move to .Y
    PLA             ; [4] get .X from the Stack
    TAX             ; [2] move to .X
    PLA             ; [4] get .A from the Stack
Dead simple - push each register to the Stack, .A first, then .X and .Y (which have to go through .A because there are no direct Push .X or Push .Y instructions on the original 6502) and then pull them back in reverse order to recover their original values. Hardly rocket science, and you'll see code very like this in a good percentage of any moderately complex software. But there's a little snag with it, because although .A is saved to the Stack, its value is trashed since we had to put .X and .Y into .A to push them. So what if we want to be able to push all three registers but leave their values intact? Now things get a little more complicated:
    PHA             ; [3] stash .A on the Stack
    TXA             ; [2] move .X to .A
    TSX             ; [2] move .SP to .X
    PHA             ; [3] stash original .X on the Stack
    TYA             ; [2] move .Y to .A
    PHA             ; [3] stash .Y on the Stack
    LDA _STACK+1,X  ; [4] peek original .A from Stack (.X contains .SP after first PHA)
    PHA             ; [3] stash .A on the Stack
    LDA _STACK,X    ; [4] peek original .X from Stack
    TAX             ; [2] restore original .X
    PLA             ; [4] restore original .A
That looks a bit gnarly, but it's still reasonably straightforward - we push .A as before, and shift .X ready to push it too, but grab the Stack Pointer (.SP) first. This tells us where we are about to push .X into the Stack, or to put it another way, it indirectly tells us where we just pushed .A. So we then push .X and .Y as before, but we can get the original value of .A back by 'sniffing' the Stack directly using .SP which is in .X - and having got it, we then push it onto the Stack a second time before sniffing once more to get the original value of .X back, and finally pulling .A. Hey presto, all three registers on the Stack, and their original values still intact. We use the same 'pull' code as before to get them all back in reverse order when we need them.

But that's not the end of the story, because this code then fails abysmally if you make a subroutine out of it - we don't want it repeated everywhere we need to use it, so it makes sense to make it a subroutine and just call it whenever we want, but it doesn't work:
saveregs SUBROUTINE ; stash registers to Stack preserving contents
    PHA             ; [3] stash .A on the Stack
    TXA             ; [2] move .X to .A
    TSX             ; [2] move .SP to .X
    PHA             ; [3] stash original .X on the Stack
    TYA             ; [2] move .Y to .A
    PHA             ; [3] stash .Y on the Stack
    LDA _STACK+1,X  ; [4] peek original .A from Stack (.X contains .SP after first PHA)
    PHA             ; [3] stash .A on the Stack
    LDA _STACK,X    ; [4] peek original .X from Stack
    TAX             ; [2] restore original .X
    PLA             ; [4] restore original .A
    RTS             ; [6] pull 2-byte return address from Stack
The reason is that the act of calling a subroutine with JSR causes the 6502 to push the two-byte return-address-minus-one to the Stack before jumping, which it then pops-off to reset the Program Counter and come back when it hits RTS - but our subroutine has now added three items to the Stack (our register values) and so the return instruction pops the values of .Y and .X instead of the address it expects. It doesn't know that the values it's pulled aren't the return address, so happily loads .PC with them and suddenly we're running Xod-knows-where through memory, executing all sorts of excitingly-fatal bits of whatever data happens to be there.
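To make the failure concrete, here is a tiny Python model of what RTS finds on the Stack. It is a sketch rather than an emulator: the Stack is a plain list with its top at the end, only the net effect of the routine body (three register bytes left behind) is modelled, and the function name and addresses are made up for illustration.

```python
def naive_return_address(a, x, y, return_addr):
    """Where RTS lands if the routine leaves .A, .X and .Y on the Stack."""
    stack = []                                # top of the Stack is the end of the list
    pushed = (return_addr - 1) & 0xFFFF       # JSR pushes the return address minus one
    stack += [pushed >> 8, pushed & 0xFF]     # high byte first, then low byte
    stack += [a, x, y]                        # net effect of the routine body
    lo = stack.pop()                          # RTS pulls the low byte first: gets .Y
    hi = stack.pop()                          # then the high byte: gets .X
    return (((hi << 8) | lo) + 1) & 0xFFFF    # RTS adds one before resuming


landed = naive_return_address(a=0x11, x=0x22, y=0x33, return_addr=0x1234)
print(f"expected $1234, actually returned to ${landed:04X}")
```

With these example values the routine "returns" to an address built from .X and .Y, which has nothing to do with where it was called from.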

So what we have to do in the new subroutine is somehow contrive to push the registers as expected, leave their values intact, and simultaneously adjust the Stack so that the first two items to be pulled are actually the proper return address for the RTS instruction. Take a look:
saveregs SUBROUTINE ; stash registers to Stack preserving contents
    PHA             ; [3] stash .A on the Stack
    TXA             ; [2] move .X to .A
    TSX             ; [2] move .SP to .X
    PHA             ; [3] stash original .X on the Stack
    TYA             ; [2] move .Y to .A
    PHA             ; [3] stash .Y on the Stack
    LDA _STACK+3,X  ; [4] sniff return .PCH from Stack (.X contains .SP after first PHA)
    PHA             ; [3] stash .PCH on the Stack again
    LDA _STACK+2,X  ; [4] sniff return .PCL from Stack
    PHA             ; [3] stash .PCL on the Stack again
    LDA _STACK+1,X  ; [4] peek original .A from Stack
    PHA             ; [3] stash .A on the Stack
    LDA _STACK,X    ; [4] peek original .X from Stack
    TAX             ; [2] restore original .X
    PLA             ; [4] restore original .A
    RTS             ; [6] return via the copied address, leaving the original two bytes orphaned
Oh-kaaaay. Deep breath. This is just a slightly more complicated variant of the previous version, with the addition of two extra LDA / PHA steps which sniff the .PC (high and low bytes) from the Stack and push them back on again so that they're the first thing to be pulled when the RTS executes - thus curing the 'random return' effect of the earlier version.

The snag with this is that we now have two orphaned bytes on the Stack, those being the original return address that the JSR pushed - so we have to tweak the 'pull' routine a bit so that it tidies-up and dumps those bytes as well as retrieving the register values, and of course handling the same return-address issue itself. Remember, because JSR pushes the return address (minus one) then if we just pull the first three values and assign them to the registers, we've actually pulled the two-byte return address for the subroutine into .Y and .X, and .A contains what should be in .Y - the registers are all wrong and the subroutine RTS will return to an incorrect location:
loadregs SUBROUTINE ; retrieve registers from Stack
    TSX             ; [2] move .SP to .X
    LDA _STACK+2,X  ; [4] sniff this call's return .PCH from Stack
    STA _STACK+7,X  ; [5] sneak .PCH over the orphaned .PCH
    PLA             ; [4] pull this call's return .PCL from Stack
    STA _STACK+6,X  ; [5] sneak .PCL over the orphaned .PCL
    PLA             ; [4] pull and discard the .PCH we have already copied
    PLA             ; [4] pull .Y from Stack
    TAY             ; [2] move to .Y
    PLA             ; [4] pull .X from Stack
    TAX             ; [2] move to .X
    PLA             ; [4] pull .A from the Stack
    RTS             ; [6] return via the address we copied over the orphaned bytes
Since .A, .X and .Y are in the right order on the Stack, and it's just that we have a return address in the way and an orphan return address above them, we can simply grab .SP and copy the new return address over the orphaned one, pop the registers off as usual, and leave the Stack properly set-up to return from this routine normally. But it's a much more complex arrangement than we began with - it works, but pushing the three registers without disrupting their contents takes 58 cycles (including the 12 for the JSR / RTS) and pulling them back takes 52. In addition we're carrying two orphaned return-address bytes around on the Stack during the process, which is inefficient in itself and means we have to be extremely vigilant that every time we call saveregs we subsequently retrieve the registers through loadregs (and not with 'unsupervised' PLA instructions) because we've got to stop those orphaned bytes gradually filling-up the Stack.
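The stack juggling in saveregs and loadregs is easier to trust once it has been stepped through, so here is a minimal Python model that mirrors the two listings instruction by instruction. It is a sketch under assumptions: `Stack6502` is a made-up helper covering only the stack page and stack pointer, indexed reads and writes do not wrap, and the register values and return addresses are arbitrary.

```python
class Stack6502:
    """Tiny model of the 6502 stack page and stack pointer (no wrap handling)."""

    def __init__(self):
        self.mem = [0] * 256
        self.sp = 0xFF

    def push(self, value):                    # PHA
        self.mem[self.sp] = value & 0xFF
        self.sp -= 1

    def pull(self):                           # PLA
        self.sp += 1
        return self.mem[self.sp]

    def jsr(self, return_addr):               # JSR pushes return address minus one
        pushed = (return_addr - 1) & 0xFFFF
        self.push(pushed >> 8)
        self.push(pushed & 0xFF)

    def rts(self):                            # RTS pulls low then high, adds one
        lo = self.pull()
        hi = self.pull()
        return (((hi << 8) | lo) + 1) & 0xFFFF


def saveregs(s, a, x, y):
    s.push(a)                     # PHA
    a = x                         # TXA
    x = s.sp                      # TSX
    s.push(a)                     # PHA (original .X)
    a = y                         # TYA
    s.push(a)                     # PHA (.Y)
    a = s.mem[x + 3]; s.push(a)   # LDA _STACK+3,X / PHA (.PCH)
    a = s.mem[x + 2]; s.push(a)   # LDA _STACK+2,X / PHA (.PCL)
    a = s.mem[x + 1]; s.push(a)   # LDA _STACK+1,X / PHA (original .A)
    x = s.mem[x]                  # LDA _STACK,X / TAX
    a = s.pull()                  # PLA
    return a, x, y, s.rts()       # RTS


def loadregs(s):
    x = s.sp                      # TSX
    s.mem[x + 7] = s.mem[x + 2]   # LDA _STACK+2,X / STA _STACK+7,X
    s.mem[x + 6] = s.pull()       # PLA / STA _STACK+6,X
    s.pull()                      # PLA (discard copied .PCH)
    y = s.pull()                  # PLA / TAY
    x = s.pull()                  # PLA / TAX
    a = s.pull()                  # PLA
    return a, x, y, s.rts()       # RTS


s = Stack6502()
s.jsr(0x1234)
saved = saveregs(s, 0x11, 0x22, 0x33)
sp_after_save = s.sp              # three registers plus two orphaned bytes
s.jsr(0x5678)
loaded = loadregs(s)
print(saved, sp_after_save, loaded, s.sp)
```

In this model the stack pointer sits five bytes lower after saveregs returns (three register values plus the two orphans) and is back where it started once loadregs has run.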

I wasn't satisfied with this - it felt wrong, and clunky - and I wondered if there might be a smarter way. I'll show you what I came-up with next time...

Wednesday, 6 March 2013

In Which The Keyboard Reawakens


I had considerable success re-integrating and refining the cursor logic, stripping six or seven fairly ugly routines out and streamlining everything down to just four tidy bits of code to set/read the cursor screen co-ordinates, calculate the bitmap draw address based on those co-ordinates (re-using the new fast address-computation routine I wrote for the revamped glyph renderer), turn the cursor on/off, and do the actual draw/undraw. I was so pleased, in fact, that I decided to press ahead and tackle the unruly mess that the keyboard handler had evolved into right away.

The basic mechanics are of course still the same - set-up the VIA ports to strobe the keyboard lines and detect any activity, then decode whatever we found into meaningful data (either a 'displayable' character code, or a control code indicating keys like return, shift, control, and so on). The problem I'd run into last time was that I kept finding edge-conditions where the general rules didn't apply, or at least didn't apply in quite the usual way. For example, I needed to debounce the keypresses as they came in to prevent keys repeating at the IRQ frequency, but then I also needed to be able to enable key-repeat at a sensible rate - so my general rule, which eliminated accidental key-repeat due to the frequency of the keyboard scan, needed a bit of extra logic to handle the fact that we actually wanted to time the keypress duration and allow repeats after a given interval, and then only at a much slower rate than the IRQ could generate them. This turned into a series of unpalatable kludges around the basic functionality, and - in combination with some other similar issues - was the key contributor to my unhappiness with the project, and its suspension.

The root cause was the design of the code - I'd never written a bare-metal keyboard handler before and there was a considerable amount to learn about the way the 6522 worked, how Commodore had wired things up, how XVIC catered for that hardware configuration inside its emulation environment, and then just the general software-layer intricacies of managing a keypress-management state-machine that operates in a linear fashion but spread across successive interrupt cycles. Although things started-out quite cleanly, the evolution of the logic through the learning process meant that the routine became more and more unmanageable as the number of exceptions-to-the-rule grew. In the end I was sick of the sight of it - which is never a good frame of mind to be in about your own project!

But the payoff to a difficult learning process is that once you come out the other end of it, however nasty the result you've got, you've also got the fundamental understanding of how to do it better the second time around - and that's where I am today, slowly rebuilding the keyboard handler using the basic technique I mentioned above but now aware of, and avoiding, the glitches that cropped-up last time. So I have the VIA set-up, doing key-detection and debouncing, and an initial stub routine just spitting characters to the screen as they're registered. I changed the way keycodes are mapped to be a lot more efficient in terms of table storage, so I have to revisit the lookup table to re-arrange things a little (at the moment the character that appears on-screen bears no relation to the key you press) and then I can plug-in a neat little bit of code I've got an idea for that'll handle selective key-repeat.

I think, even though I haven't completed the to-do list for the release, I might make a demo available at the point the keyboard handler is at a stable position; better late than never. ;)