Here's a running commentary on a little side-project I'm fiddling with at the moment which you might find interesting. I've temporarily stopped active development on VIC++ because I'm just not getting enough continuous time to spend on it, but I'd like to release something and I decided that I'd steal some code from the main project and turn it into FAST-40. Over the summer of 2013 I wrote a blisteringly-fast renderer for VIC++ which allows me to display a 40x25 text screen and update up to three complete lines per frame (there's enough CPU time during the IRQ to render three lines on a PAL machine and two on NTSC) which I'm jolly pleased with. It occurred to me that I might be able to turn this into a stand-alone product and release it for 'normal' VIC-20 machines, i.e. those running the stock Kernal OS rather than my custom VIC++ ROM image.
I was always in complete awe of the various 40-column programs for the VIC-20 that did the rounds back in the early 80's - I would just look at them and know that their programmers were as gods to me, somehow able to conjure miraculous things from the depths of this machine. Turning a standard 22-column display into a 40-column one seemed akin to magic to my inexperienced eyes, and I guess it set my personal bar for technical achievement - I'd know I was 'good' if I ever managed to do something like that. So now I'm going to. ;)
The Competition
If you have a scout-around somewhere like the Zimmers archive you'll find four 40-column programs for the VIC - there might be more out there somewhere - and I distinctly recall owning Fat 40 back in the day. They all do 40x24 displays, but doing a quick comparison in VICE I was intrigued by the variations that exist between them:
1. Fat 40 - weighs-in at 7311 bytes, making it the chunkiest of the set, and appears to be quite as accomplished as I remember it. It has the full 256 Commodore characters defined for both upper- and lower-case charactersets (which presumably accounts for 4K of the payload), works in both PAL and NTSC modes (although it skews the viewport way off-centre in the latter mode) and has a reasonable update rate. On an 8K-expanded VIC the FRE(0) function reports 4500 bytes free.
2. Screen-40 - the smallest of the bunch at 2158 bytes, this is designed for NTSC machines; and unsurprisingly, in PAL mode the viewport is off-centre. The biggest issue, betrayed by the tiny file size, is that it only implements the upper-case character set - and it has an irritating habit of doubling-up carriage-returns after printing to the screen sometimes. But aside from that, it's possibly the fastest at refreshing the screen, and FRE(0) reports 5117 bytes free after installation on a machine with an 8K expansion.
3. Vic 40 Scherm - evidently of German origin (judging by the name) and being a moderate 6828 bytes in size, this should easily be the best of the bunch. It does a good job of implementing the whole 4K characterset, but strangely it takes a little while to 'warm up' - to begin with, the renderer doesn't quite know where to draw stuff on the bitmap and leaves odd gaps all over it. After a while it gets the hang of it, and thereafter behaves very well, but it's a vexing flaw that detracts from an otherwise impressive product. Refresh speed is a bit quicker than Fat 40, and similarly works in both PAL and NTSC mode (but again makes a mess of viewport centring in NTSC). FRE(0) reports 4942 bytes free on an 8K VIC.
4. Mighty Term - technically not a 40-column display utility, this is actually a dial-up terminal program which implements a 40-column view. The display is quite good and has upper- and lower-case characters showing in the top line (the only visible element) but of course I can't gauge its refresh speed or how much memory it consumes.
Now I haven't disassembled any of these, for two reasons; firstly, I don't want to get distracted by spending hours rummaging around in the inner workings of these programs, and secondly I'm content (for the moment) to compare my solution purely at the 'user' level - that is, will FAST-40 measure-up when compared side-by-side? Will it look as good (or better), be as quick to refresh the screen (or quicker, even though I render an extra line), handle edge conditions like Run/Stop-Restore with equal grace...? I guess I'm making a fairly bold claim in my choice of name, as I do genuinely expect it to trump all of these in terms of responsiveness, but other than that I don't want to allow any of these to colour my approach. Afterwards, when I'm done, we'll perhaps have a little look inside Fat 40 and Vic 40 Scherm to see what makes them tick.
The Plan
In order to get a 22-column VIC-20 to display 40 columns, we have to do some fairly sophisticated programming to switch the 6560/6561 (Video Interface Chip, NTSC or PAL model) into high-res mode and then do a bunch of address-tweaking and pixel-drawing operations at just the right points during the screen update. We have to synchronise this logic with the actual raster beam via careful IRQ manipulation, and we have to link the new screen layout into the OS so that the Kernal, BASIC and other programs continue to do screen I/O without needing to change.
So, goals for FAST-40 then; obviously render a 40x25 screen as quickly as possible, ideally quicker than any of the above; integrate nicely with the OS, and handle stuff like NMI gracefully; implement the full upper- and lower-case charactersets; and leave as much free memory for the user as possible. I'm also going to have the program make use of the 3K expansion RAM area in Block 0 if it's available, and if I'm feeling flash I might tuck a little BASIC hook in there to add a 'REFRESH' command so that the user can choose the balance between refresh speed and CPU availability.
Execute
10th November - I've got the BASIC stub working (the bit that loads and runs as a friendly BASIC program to invoke the assembler code with a SYS command), a simple makefile that builds the code and invokes XVIC to run it, and a few initialiser routines in. The code detects whether it's running on a PAL or NTSC machine, checks to see if there's RAM in BLK0 (the 3K expansion area), sets the IRQ vector to a placeholder routine that just does a quick screen colour change before routing to the standard IRQ logic at $EABF, and synchronises the IRQ timer on VIA#2 with raster line 0.
12th November - The two-phase IRQ logic is in. After a little VIA timer initialisation, two IRQ routines take turns - one firing at the middle of the screen and the other just before the bottom. They're merely drawing a little colour at the moment, but their primary role will be to twiddle the VIC settings in exactly the right places as the screen is drawn so that the appropriate sections of bitmap get switched-in at the right time. These two bits of code could actually be a single routine with a flag to indicate which phase is active, but in fact it's much quicker to have two separate routines each doing their own thing, and just tickle the IRQ vector at $0314/5 to select the right one. By page-aligning them, this 'tickle' becomes nothing more than an INC/DEC of the vector hi-byte - so the entire decision process involved in alternating them is reduced to six cycles.
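To make the page-alignment trick a bit more concrete, here's a small Python model of it. This is only a sketch: the two handler addresses ($2000 and $2100) are made up for illustration, and only the vector location at $0314/5 comes from the entry above.

```python
# Model of alternating two page-aligned IRQ handlers by nudging the
# high byte of the IRQ vector at $0314/$0315.
# HANDLER_A and HANDLER_B are hypothetical addresses chosen for illustration.

IRQ_VECTOR = 0x0314   # low byte at $0314, high byte at $0315
HANDLER_A = 0x2000    # hypothetical: the mid-screen handler
HANDLER_B = 0x2100    # hypothetical: the bottom-of-screen handler, one page up

memory = bytearray(65536)


def set_vector(addr):
    memory[IRQ_VECTOR] = addr & 0xFF
    memory[IRQ_VECTOR + 1] = addr >> 8


def get_vector():
    return memory[IRQ_VECTOR] | (memory[IRQ_VECTOR + 1] << 8)


def handler_a_exit():
    # Stand-in for INC $0315: select the handler one page higher.
    memory[IRQ_VECTOR + 1] = (memory[IRQ_VECTOR + 1] + 1) & 0xFF


def handler_b_exit():
    # Stand-in for DEC $0315: select the handler one page lower.
    memory[IRQ_VECTOR + 1] = (memory[IRQ_VECTOR + 1] - 1) & 0xFF


set_vector(HANDLER_A)
sequence = []
for _ in range(4):
    current = get_vector()
    sequence.append(current)
    if current == HANDLER_A:
        handler_a_exit()
    else:
        handler_b_exit()

print([hex(addr) for addr in sequence])
```

Because both handlers share the same low byte, only the high byte ever changes, and on a 6502 an INC or DEC of an absolute address takes six cycles, which is where the six-cycle figure comes from.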
17th November - Having got the two-phase IRQ working, I've decided not to go with a 25-line display after all; it would work, just as it does in VIC++, but would mean I'd have to do a bunch of faffing-around copying Pages 2 and 3 out and back in order to preserve regular VIC-20 functionality. Dropping to a 24-line display means I can squeeze the 40-column bitmap and matrix both into the main 4K area, and not have to spend (way too much) precious CPU time keeping memory arranged the way the stock Kernal likes it. That consequently means no raster-split requirement, and thus no two-phase IRQ. The downside is that I have to use double-height characters to fit into the available RAM, which means I lose some colour granularity - however this is the way those older 40-column programs work so I'm not particularly distressed. If you want full 40-column, 25-line, per-character-pair colouring then wait for VIC++ ;)
22nd November - The screen and bitmap memory areas are now initialised, and the VIC configuration settings are tweaked so that it points at the new display matrix. At the moment it's just displaying garbage, as I haven't yet plugged-in the text renderer logic - and also because the display memory overlaps BASIC memory on an 8K-expanded VIC-20. My next task is to write a bit of code that relocates FAST-40 to somewhere else, and pushes the start of BASIC memory up so that it's out of range of the display area. That code begins by looking to see whether there is RAM in the 3K expansion area (BLK0 starting at $0400) or in the so-called 8K Cartridge ROM area (BLK5 at $A000) which can also contain RAM. If either has RAM present, a choice of relocation options will be given and FAST-40 moved appropriately; the default option will be to simply move it to the top of the highest 8K RAM block present (BLK1, 2 or 3). The obvious advantage to moving to BLK0 or BLK5 is that less BASIC memory will have to be reserved for it, leaving more available for user code.
26th November - The relocator selection menu is now being displayed according to what expansion memory is in the system; if none is found in either BLK0 or BLK5 (using a non-destructive read-increment-write-compare-revert test) then no menu is displayed and FAST-40 relocates to the top of memory by default. Otherwise, a simple menu displays whichever or both of the two expansion areas have RAM in them, and offers the choice of either (or the default top of memory choice).
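The read-increment-write-compare-revert test can be sketched in Python like this. The Bus class is purely a stand-in for the VIC-20 address space (it isn't how the real code touches memory), and the assumption that unmapped addresses read back a fixed value is mine, made only so the model has something to compare against.

```python
# Model of a non-destructive RAM probe: read, increment, write, compare, revert.
# Bus is a hypothetical stand-in for the machine's address space.

class Bus:
    def __init__(self, ram_ranges):
        self.ram_ranges = ram_ranges   # list of (start, end) pairs, inclusive
        self.cells = {}                # backing store for RAM addresses only

    def is_ram(self, addr):
        return any(lo <= addr <= hi for lo, hi in self.ram_ranges)

    def read(self, addr):
        if self.is_ram(addr):
            return self.cells.get(addr, 0)
        return 0xFF   # assumption: non-RAM reads back a fixed value

    def write(self, addr, value):
        if self.is_ram(addr):
            self.cells[addr] = value & 0xFF
        # writes to non-RAM addresses are silently ignored


def has_ram(bus, addr):
    original = bus.read(addr)
    probe = (original + 1) & 0xFF
    bus.write(addr, probe)
    found = bus.read(addr) == probe
    bus.write(addr, original)   # revert, so existing contents survive the test
    return found


# Example configuration: 8K in BLK1 and 8K in BLK5, nothing in BLK0.
bus = Bus([(0x2000, 0x3FFF), (0xA000, 0xBFFF)])
bus.write(0xA000, 0x42)         # pretend something useful already lives here

blk0_present = has_ram(bus, 0x0400)
blk5_present = has_ram(bus, 0xA000)
print(blk0_present, blk5_present, hex(bus.read(0xA000)))
```

The revert step is what makes the probe safe to run against memory that might already hold something the user cares about.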
30th November - Of all the possible hyper-complex things I could get stuck on, the thing that's giving me aggravation right now is, bizarrely, the incredibly-not-complex keypress handler in the relocation selector. Yes, that's right - this dozen or so really simple bytes of assembler related to reading the keyboard are misbehaving in a very peculiar way. The code calls SCNKEY, the ROM routine to scan the keyboard, then picks-up the resultant scancode and makes a decision on where to push the FAST-40 payload depending on which of three keys the user presses. This is actually working fine, except that certain other keys are also appearing as scancodes I'm looking for - so, for example, sometimes if I press 'H' or 'B' repeatedly, they show up as the scancode for '8', which is one I'm looking for. I have a feeling this is somehow related to the interplay between VICE and my little laptop that I'm working on this project on, in that there might be a keymapping config anomaly somewhere. I'm going to push the project over to my big development rig (which is a quad-core monster PC that I use for heavy projects in C# and suchlike) and see if it does the same thing there; I know the VICE keyscan config definitely works perfectly on that box, because it's something I spent a lot of time getting right when I was writing the keyscan logic for VIC++ (the equivalent of the Commodore ROM routine I'm calling here).
2nd December - Problem solved, thanks to a clue from a fellow member over at Denial; after running the code on the big rig and getting the same result, it turns out that the fault lay with STROUT, the ROM routine that pushes text strings to the screen. I call this to display the menu options, and unbelievably its last act before returning is to enable interrupts! WTF? Why it does this is a mystery, and my personal theory is that it's a typo bug in the original ROM source because right before the CLI instruction is a CLC; I bet that CLI was spotted, and someone said "Doh! That should be CLC - fix it quick!", and then the CLC was punched-in but the CLI wasn't removed. Now, I had disabled interrupts way before so that I could do the VIA IRQ twiddly things I needed to do, and was then calling SCNKEY to read the keyboard - but because STROUT had sneakily re-enabled interrupts, SCNKEY itself was being interrupted, and it's not thread-safe (on these 8-bit machines thread-safety is virtually non-existent). Hence all sorts of Weird Stuff was happening, including corrupted values returned for key scancodes. A swift re-sequence of some of my code has fixed the problem - I now don't disable interrupts until after the menu stage, and it also means I don't have to call SCNKEY myself as the stock IRQ handler does it for me.
8th December - Spent an hour or so tidying-up the code after the re-sequencing I did last week to fix the key-scanning issue, and shrank the initialisation logic by a few bytes. I also devoted some time to working-out precisely where the screen and bitmap areas will sit, and deriving the appropriate VIC settings so that it knows where stuff is. I actually had a bad half-hour where I thought I'd screwed-up rather spectacularly and wasn't going to be able to fit a 40x24 display into the stock VIC-20 RAM without doing some raster-split stuff like I do in VIC++ but then I realised I wasn't accounting for the fact that double-height characters occupy 16 bytes rather than 8. So it does all fit - just! What you have to do is put the 240-byte screen matrix at $1000 (20 'real' columns times 24 rows = 480 cells, divided by 2 for double-height chars = 240 bytes) but with characters 16-255 as the content and 0-15 not there. Then the bitmap also sits at $1000, but because the first 16 characters aren't present in the matrix, the VIC never looks at $1000-$10FF for pixel data. That means the bitmap is actually 4096 bytes, but the first 256 bytes are overlaid on to the matrix and not used because the first sixteen characters are not in the matrix. I'm pretty sure Fat 40 works this way too, because after I'd figured this out and calculated the VIC register settings for it, I fired Fat 40 up and PEEKed the VIC - and it's using the same values. Nice. :)
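The arithmetic behind that layout is easy to check. Here's a minimal sketch that just re-derives the numbers quoted above; the constant and function names are mine, and it says nothing about how the character codes are ordered within the matrix, since the entry doesn't specify that.

```python
# Arithmetic for the shared 4K matrix/bitmap layout described above.
BASE = 0x1000                 # the matrix and the bitmap both start here
COLUMNS = 20                  # 'real' VIC columns
TEXT_ROWS = 24
CELL_ROWS = TEXT_ROWS // 2    # each double-height cell spans two text rows
BYTES_PER_CHAR = 16           # 8 pixels wide by 16 pixels tall

cells = COLUMNS * CELL_ROWS                          # matrix size in bytes
first_code = 256 - cells                             # lowest character code used
matrix_end = BASE + cells - 1
bitmap_start = BASE + first_code * BYTES_PER_CHAR    # first pixel byte in use
bitmap_end = BASE + 256 * BYTES_PER_CHAR - 1


def char_data_address(code):
    """Address of the 16 bytes of pixel data for a given character code."""
    if not first_code <= code <= 255:
        raise ValueError("character code is not used by the matrix")
    return BASE + code * BYTES_PER_CHAR


print(cells, first_code)
print(hex(matrix_end), hex(bitmap_start), hex(bitmap_end))
```

The matrix ends at $10EF and the first pixel data actually fetched starts at $1100, so the two never collide even though both nominally begin at $1000.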
12th December - I'm in The Zone now, having got to the stage where there was enough working setup code that I could copy a string into the new screen text buffer and do a test-fire of the renderer. There was a bunch of logic in that code (copied from VIC++) that I didn't need, mostly associated with rendering attributes (inverse and underline) and there was also a bunch of Zero Page usage I had to remove, since FAST-40 only has 8 bytes of ZP available in a stock VIC-20 running BASIC. VIC++ runs in ROM and so uses ZP for speed when indirect memory accesses are required, but as FAST-40 is a normal RAM-based program I can make use of the utter absence of memory protection on the 6502 and have the routine modify itself on-the-fly - so indirect address accesses via ZP become absolute accesses because the address gets stashed into the code itself as it's calculated. The end result is a smaller, faster routine (it runs just short of 6000 cycles per line) and, to my delight, rendered my test string first time through.
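As a rough sanity check on that cycle count, here's a back-of-envelope frame budget. The frame timings are assumptions not taken from this diary (312 raster lines of 71 cycles for a PAL VIC-20, 261 raster lines of 65 cycles for NTSC), and the result is only an upper bound, because it pretends the renderer gets every cycle in the frame.

```python
# Back-of-envelope frame budget for the text renderer.
# Assumed timings (not from the diary): a PAL frame is 312 raster lines of
# 71 CPU cycles each, an NTSC frame is 261 raster lines of 65 cycles each.
FRAME_CYCLES = {
    "PAL": 312 * 71,
    "NTSC": 261 * 65,
}
RENDER_CYCLES_PER_LINE = 6000   # "just short of 6000", rounded up


def max_lines_per_frame(system):
    """Upper bound on whole text lines renderable in one frame."""
    return FRAME_CYCLES[system] // RENDER_CYCLES_PER_LINE


budget = {name: max_lines_per_frame(name) for name in FRAME_CYCLES}
print(FRAME_CYCLES, budget)
```

Under those assumptions the ceiling works out the same as the three-lines-on-PAL, two-on-NTSC figures mentioned at the top for the VIC++ renderer, although in practice the rest of the system needs a share of each frame too.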
14th December - Right-brain, meet Left-brain; I got stuck on a silly bug last night after tweaking register usage in the renderer - having ripped-out a bunch of unneeded code, I saw a way to tune things a little by optimising register allocation, but ended-up in a maze of twisty passages. Everything still worked, apart from the minor inconvenience that some characters were being rendered as spaces or garbage. I eventually got sick of looking at the code, and went to bed. This morning, as I walked along the embankment, the winter sun and clear air combined with a phrase of music I was humming, and my right-brain delivered the solution without my even really trying to think seriously about it. "You've forgotten to take out that second DEY at the bottom of the loop, Lefty", it said. And it was right.
23rd December - The little bit of code is in to reset the BASIC pointers prior to handing-over to the IRQ, so that the user can write code without splatting anything over the space FAST-40 needs to use. I've also plugged-in a prototype 'inflate' implementation, which (unsurprisingly) inflates a chunk of data that's been compressed by the 'deflate' algorithm; the reason for this little bit of extra baggage is that the characterset bitmap data and the renderer code payload actually deflate by about 65% so the total size of the final build is MUCH smaller than it would be without compression. Adding a couple of hundred bytes for a Huffman decoder is more than compensated-for by the reduction in size of the overall binary, and that decoder logic is part of the initialisation phase so it gets thrown away once it's done its job. Right now I'm debugging a little quirk where the IRQ routine runs off into the weeds instead of doing what I expect, and I think it's a Stack corruption problem so the RTI is pulling junk by mistake. Once I fix that, we'll be into the fiddly bit of tweaking stuff so that the Kernal screen I/O routines understand the new layout, which means re-writing bits of code it calls through indirection vectors for things like line length breaks, cursor positioning, and suchlike.
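For anyone who wants to see the deflate/inflate round trip in action, here's a small sketch using Python's zlib module in raw-deflate mode. The glyph data is synthetic and invented purely for the demonstration (it is not the real characterset), so the ratio it prints says nothing about the 65% figure above.

```python
import zlib

# A small palette of row patterns, used to build repetitive fake glyph data.
PATTERNS = [0x00, 0x18, 0x3C, 0x66, 0x7E, 0x60, 0x06, 0xFF]


def synthetic_charset():
    """Build 256 fake 8-byte glyphs. This is NOT the real characterset."""
    data = bytearray()
    for code in range(256):
        for y in range(8):
            if 0 < y < 7:
                data.append(PATTERNS[(code >> (y % 3)) & 7])
            else:
                data.append(0x00)   # blank top and bottom rows
    return bytes(data)


def raw_deflate(data):
    # wbits=-15 selects a raw deflate stream with no zlib header or checksum.
    packer = zlib.compressobj(level=9, wbits=-15)
    return packer.compress(data) + packer.flush()


def raw_inflate(blob):
    return zlib.decompress(blob, wbits=-15)


payload = synthetic_charset()
packed = raw_deflate(payload)
restored = raw_inflate(packed)
print(len(payload), len(packed), restored == payload)
```

The trade-off is the one described above: the decoder costs some bytes up front, but it only has to exist during initialisation.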
27th December - I found another delightful gem in the BASIC ROM which was causing (as I suspected) a nasty Stack corruption fault. Tucked away at the end of the 'NEW' routine (_SCRTCH at $C642) that I call to reset all the BASIC pointers after relocating FAST-40 is a little bit of code that resets the BASIC pseudo-stack, and it does this with a blatantly unsafe adjustment of the CPU Stack Pointer. This of course assumes that BASIC is in charge at this point, and takes no consideration of the possibility that NEW might be being called in some other way (i.e. from a bit of 6502 assembler that wants to reset BASIC). So I just need to figure-out how to code around that, either by fixing the Stack afterwards or by pre-rigging the Stack beforehand so the call to NEW doesn't break my return address - which is what's happening now, and causing the IRQ routine to return somewhere unexpected.
2nd January 2014 - Well, I didn't make the end-of-year deadline, but then the festive season has a habit of distracting even the most hardened coder from their pet projects. I did find time to eliminate that Stack problem, though - I noticed that a higher-level call to the BASIC warm-start entrypoint would have the desired effect and avoid the problem entirely, so that's what I do now. So the initialisation code is now essentially done, and it's time to look at the IRQ logic that'll manage invocation of the renderer as and when necessary. This is all basically logic I've already written for VIC++ but I have to also patch-in the flag-bit updates for screen writes and make that work with the existing VIC-20 Kernal - new territory for me, as I've never tried to rework bits of the Commodore screen editor before.
9th January - I'm doing some fiddly stuff at the moment, hooking-up FAST-40 to the screen editor subsystem. The program has to integrate with the normal VIC-20 screen and keyboard functionality so that it behaves just as it does in normal 22-column mode, and that means tying it into the Kernal so that everything works in the custom 40-column mode. The trick is to hook into the screen and keyboard I/O vectors which get called whenever something is written to the screen or a key is pressed, and override the standard logic with custom code that understands where things are and how they work when FAST-40 is in charge. There's not a lot of difference, but there are places where the Kernal expects line lengths to be 22 characters (for example) that should now act on 40-character lines. It's not particularly exciting code, but it's important for the functionality of the program.
13th January - Well, this isn't working as well as I envisaged. My idea was to clone the vectored I/O routines, tweak them where they needed to know about the 40-column stuff, and have them defer to the ROM wherever possible. But the problem is that there's a lot of gnarly, intertwined, inter-dependent spaghetti code in there, and I'm up to almost 500 bytes of ROM code cloned already and the end is not yet in sight. This is, to use a technical term, a bit of a bugger. So I'm going to put that code to one side and try a different approach, which will be to basically re-write the I/O logic entirely, and only call back to the Kernal when it's appropriate. Let's see how it pans out...
18th January - Yep, this is working; after a little bit of research (i.e. reading the ROM disassembly for a couple of hours) to determine what the guts of _CHROUT ($F27A) does, I have a nice bit of code written which is already pushing bytes to the text buffer and setting the dirty-row bits so that the renderer can draw stuff. This is vectored through _OUTVEC2 ($0326) so calls to the main I/O routine in the Kernal at $FFD2 route through my code and defer to the ROM for output to devices other than the screen. What this means is that everything that normally writes to the screen will continue to do so unchanged, but my code is now handling it - so for example, when I call the BASIC tepid-start routine at $E37E (between the cold-start and warm-start entry points so that vectors and RAM don't get re-initialised) the screen clears and the standard 'COMMODORE BASIC V2' messages appear on the 40-column bitmap. This is seriously cool, and now I'm writing the control-character handlers for things like Carriage Return, Cursor Down, etc.
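The text-buffer-plus-dirty-bits idea looks roughly like this when modelled in Python. It's a sketch under my own assumptions rather than the actual FAST-40 code: the names are invented, there's no scrolling or control-character handling, and the renderer is reduced to reporting which rows it would draw.

```python
# Model of a CHROUT replacement that writes to a text buffer and flags rows
# as dirty, plus a renderer that services a limited number of rows per frame.
# All names are hypothetical; scrolling and control characters are left out.

COLUMNS, ROWS = 40, 24
SCREEN_DEVICE = 3    # the Kernal's device number for the screen

text_buffer = bytearray(b" " * (COLUMNS * ROWS))
dirty_rows = 0       # bit n set means row n needs re-rendering
cursor = [0, 0]      # column, row
rom_output = []      # stand-in for deferring to the ROM for other devices


def chrout(byte, device=SCREEN_DEVICE):
    global dirty_rows
    if device != SCREEN_DEVICE:
        rom_output.append((device, byte))
        return
    col, row = cursor
    text_buffer[row * COLUMNS + col] = byte
    dirty_rows |= 1 << row
    col += 1
    if col == COLUMNS:
        col, row = 0, min(row + 1, ROWS - 1)   # wrap; no scrolling here
    cursor[0], cursor[1] = col, row


def render_pending(max_rows):
    """Clear up to max_rows dirty bits, lowest row first; return rows drawn."""
    global dirty_rows
    drawn = []
    for row in range(ROWS):
        if len(drawn) == max_rows:
            break
        if dirty_rows & (1 << row):
            dirty_rows &= ~(1 << row)
            drawn.append(row)
    return drawn


for ch in b"X" * 45:          # 45 characters spill on to a second row
    chrout(ch)
chrout(65, device=4)          # non-screen output is passed along untouched

first = render_pending(1)
second = render_pending(3)
print(first, second, dirty_rows, rom_output)
```

Keeping the writer and the renderer decoupled like this means a burst of output only marks rows, and the drawing cost is spread over however many frames it takes to catch up.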
26th January - One of the key features that distinguish the Commodore Kernal from a lot of other poorly-coded OS stuff of the time is the use of the jump-table at the top of ROM which has a bunch of core system routines punched through indirection vectors held low down in RAM. The idea is that if you want to do something standard, like write a character to the screen, you call that code through the jump-table entrypoint, which then does an indirect jump through the vector back into the ROM. This means that by altering the RAM vector, you can add, change or remove functionality in a way which requires no changes to any/every program that calls the routine - and this is precisely how FAST-40 works with the _CHROUT routine. I tweak the appropriate RAM vector to point to my code, and everything making calls to _CHROUT continues to do so in exactly the same way - anything using the Kernal to write to the screen now writes through my 40-column logic and gets drawn on a hi-res bitmap instead. This is very cool, and the Kernal itself does this the same way - any time something needs to write to the screen, there's a call to the jump-table address. Except in one place - the _CHRIN code, which reads input, and (in the case of keyboard input) echoes it to the screen. Using an absolute call. Which is a bugger, because that means I have to replicate _CHRIN within FAST-40 now so that it'll call my code instead of the Kernal. Thanks for that little bit of carelessness, Commodore. Grr.
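Here's a little Python model of why the vectored call can be hooked and the absolute one can't. The function names are invented for the illustration; the only details taken from the diary are that the jump-table entry goes through a RAM vector and that _CHRIN echoes using an absolute call.

```python
# Model of vectored versus absolute calls into the character output routine.
# Function names are hypothetical; 'vectors' plays the role of the RAM vector.

log = []


def rom_chrout(ch):
    log.append(("rom", ch))


def fast40_chrout(ch):
    log.append(("fast40", ch))


vectors = {"CHROUT": rom_chrout}   # power-on default points back into the ROM


def jump_table_chrout(ch):
    # The jump-table entry jumps indirectly through the RAM vector.
    vectors["CHROUT"](ch)


def rom_chrin_echo(ch):
    # The keyboard echo calls the ROM routine directly, ignoring the vector.
    rom_chrout(ch)


vectors["CHROUT"] = fast40_chrout  # install the hook

jump_table_chrout("A")             # picked up by the replacement routine
rom_chrin_echo("B")                # still lands in the stock ROM code
print(log)
```

Anything that goes through the jump table is redirected for free, while the direct call has to be dealt with by replacing its caller, which is why _CHRIN ends up being replicated.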
8th February - Significant milestone today, as enough of my re-workings of _CHRIN and _CHROUT are operational that the program is starting to behave as an interactive system again instead of a passive collection of disconnected bits of code. We're still some distance from a finished product, but I think we're close to an Alpha release which will let people play with it if they want to. As a small celebration, I'm going to post a couple of screenies. First up, the initial selector screen that runs when you load FAST-40 and gives you the choice of where to locate the runtime, based on what memory configuration you've got in your VIC-20 - here I have 8K RAM in Block 1 and 8K RAM in Block 5, so the selector has found those and is offering them:
And here, having selected Block 5, the runtime has relocated up into the RAM at $A000 (the cartridge area) and switched to 40-column mode, leaving the whole of Block 1 free and having enough functionality to let me type into it:
So, goals for FAST-40 then; obviously render a 40x25 screen as quickly as possible, ideally quicker than any of the above; integrate nicely with the OS, and handle stuff like NMI gracefully; implement the full upper- and lower-case charactersets; and leave as much free memory for the user as possible. I'm also going to have the program make use of the 3K expansion RAM area in Block 0 if it's available, and if I'm feeling flash I might tuck a little BASIC hook in there to add a 'REFRESH' command so that the user can choose the balance between refresh speed and CPU availability.
Execute
10th November - I've got the BASIC stub working (the bit that loads and runs as a friendly BASIC program to invoke the assembler code with a SYS command), a simple makefile that builds the code and invokes XVIC to run it, and a few initialiser routines in. The code detects whether it's running on a PAL or NTSC machine, checks to see if there's RAM in BLK0 (the 3K expansion area), sets the IRQ vector to a placeholder routine that just does a quick screen colour change before routing to the standard IRQ logic at $EABF, and synchronises the IRQ timer on VIA#2 with raster line 0.
12th November - The two-phase IRQ logic is in. After a little VIA timer initialisation, two IRQ routines take turn - one firing at the middle of the screen and the other just before the bottom. They're merely drawing a little colour at the moment, but their primary role will be to twiddle the VIC settings in exactly the right places as the screen is drawn so that the appropriate sections of bitmap get switched-in at the right time. These two bits of code could actually be a single routine with a flag to indicate which phase is active, but in fact it's much quicker to have two separate routines each doing their own thing, and just tickle the IRQ vector at $0314/5 to select the right one. By page-aligning them, this 'tickle' becomes nothing more than an INC/DEC of the vector hi-byte - so the entire decision process involved in alternating them is reduced to six cycles.
17th November - Having got the two-phase IRQ working, I've decided not to go with a 25-line display after all; it would work, just as it does in VIC++, but would mean I'd have to do a bunch of faffing-around copying Pages 2 and 3 out and back in order to preserve regular VIC-20 functionality. Dropping to a 24-line display means I can squeeze the 40-column bitmap and matrix both into the main 4K area, and not have to spend (way too much) precious CPU time keeping memory arranged the way the stock Kernal likes it. That consequently means no raster-split requirement, and thus no two-phase IRQ. The downside is that I have to use double-height characters to fit into the available RAM, which means I lose some colour granularity - however this is the way those older 40-column programs work so I'm not particularly distressed. If you want full 40-column, 25-line, per-character-pair colouring then wait for VIC++ ;)
22nd November - The screen and bitmap memory areas are now initialised, and the VIC configuration settings are tweaked so that it points at the new display matrix. At the moment it's just displaying garbage, as I haven't yet plugged-in the text renderer logic - and also because the display memory overlaps BASIC memory on an 8K-expanded VIC-20. My next task is to write a bit of code that relocates FAST-40 to somewhere else, and pushes the start of BASIC memory up so that it's out of range of the display area. That code begins by looking to see whether there is RAM in the 3K expansion area (BLK0 starting at $0400) or in the so-called 8K Cartridge ROM area (BLK5 at $A000) which can also contain RAM. If either has RAM present, a choice of relocation options will be given and FAST-40 moved appropriately; the default option will be to simply move it to the top of the highest 8K RAM block present (BLK1, 2 or 3). The obvious advantage to moving to BLK0 or BLK5 is that less BASIC memory will have to be reserved for it, leaving more available for user code.
26th November - The relocator selection menu is now being displayed according to what expansion memory is in the system; if none is found in either BLK0 or BLK5 (using a non-destructive read-increment-write-compare-revert test) then no menu is displayed and FAST-40 relocates to the top of memory by default. Otherwise, a simple menu displays whichever or both of the two expansion areas have RAM in them, and offers the choice of either (or the default top of memory choice).
30th November - Of all the possible hyper-complex things I could get stuck on, the thing that's giving me aggravation right now is, bizarrely, the incredibly-not-complex keypress handler in the relocation selector. Yes, that's right - this dozen or so really simple bytes of assembler related to reading the keyboard are misbehaving in a very peculiar way. The code calls SCNKEY, the ROM routine to scan the keyboard, then picks-up the resultant scancode and makes a decision on where to push the FAST-40 payload depending on which of three keys the user presses. This is actually working fine, except that certain other keys are also appearing as scancodes I'm looking for - so, for example, sometimes if I press 'H' or 'B' repeatedly, they show up as the scancode for '8', which is one I'm looking for. I have a feeling this is somehow related to the interplay between VICE and my little laptop that I'm working on this project on, in that there might be a keymapping config anomaly somewhere. I'm going to push the project over to my big development rig (which is a quad-core monster PC that I use for heavy projects in C# and suchlike) and see if it does the same thing there; I know the VICE keyscan config definitely works perfectly on that box, because it's something I spent a lot of time getting right when I was writing the keyscan logic for VIC++ (the equivalent of the Commodore ROM routine I'm calling here).
2nd December - Problem solved, thanks to a clue from a fellow member over at Denial; after running the code on the big rig and getting the same result, it turns out that the fault lay with STROUT, the ROM routine that pushes text strings to the screen. I call this to display the menu options, and unbelievably its last act before returning is to enable interrupts! WTF? Why it does this is a mystery, and my personal theory is that it's a typo bug in the original ROM source because right before the CLI instruction is a CLC; I bet that CLI was spotted, and someone said "Doh! That should be CLC - fix it quick!", and then the CLC was punched-in but the CLI wasn't removed. Now, I had disabled interrupts way before so that I could do the VIA IRQ twiddly things I needed to do, and was then calling SCNKEY to read the keyboard - but because STROUT had sneakily re-enabled interrupts, SCNKEY itself was being interrupted, and it's not thread-safe (on these 8-bit machines thread-safety is virtually non-existent). Hence all sorts of Weird Stuff was happening, including corrupted values returned for key scancodes. A swift re-sequence of some of my code has fixed the problem - I now don't disable interrupts until after the menu stage, and it also means I don't have to call SCNKEY myself as the stock IRQ handler does it for me.
8th December - Spent an hour or so tidying-up the code after the re-sequencing I did last week to fix the key-scanning issue, and shrank the initialisation logic by a few bytes. I also devoted some time to working-out precisely where the screen and bitmap areas will sit, and deriving the appropriate VIC settings so that it knows where stuff is. I actually had a bad half-hour where I thought I'd screwed-up rather spectacularly and wasn't going to be able to fit a 40x24 display into the stock VIC-20 RAM without doing some raster-split stuff like I do in VIC++ but then I realised I wasn't accounting for the fact that double-height characters occupy 16 bytes rather than 8. So it does all fit - just! What you have to do is put the 240-byte screen matrix at $1000 (20 'real' columns times 24 rows = 480 cells, divided by 2 for double-height chars = 240 bytes) but with characters 16-255 as the content and 0-15 not there. Then the bitmap also sits at $1000, but because the first 16 characters aren't present in the matrix, the VIC never looks at $1000-$10FF for pixel data. That means the bitmap is actually 4096 bytes, but the first 256 bytes are overlaid on to the matrix and not used because the first sixteen characters are not in the matrix. I'm pretty sure FAT-40 works this way too, because after I'd figured this out and calculated the VIC register settings for it, I fired FAT-40 up and PEEKed the VIC - and it's using the same values. Nice. :)
12th December - I'm in The Zone now, having got to the stage where there was enough working setup code that I could copy a string into the new screen text buffer and do a test-fire of the renderer. There was a bunch of logic in that code (copied from VIC++) that I didn't need, mostly associated with rendering attributes (inverse and underline) and there was also a bunch of Zero Page usage I had to remove, since FAST-40 only has 8 bytes of ZP available in a stock VIC-20 running BASIC. VIC++ runs in ROM and so uses ZP for speed when indirect memory accesses are required, but as FAST-40 is a normal RAM-based program I can make use of the utter absence of memory protection on the 6502 and have the routine modify itself on-the-fly - so indirect address accesses via ZP become absolute accesses because the address gets stashed into the code itself as it's calculated. The end result is a smaller, faster routine (it runs just short of 6000 cycles per line) and, to my delight, rendered my test string first time through.
14th December - Right-brain, meet Left-brain; I got stuck on a silly bug last night after tweaking register usage in the renderer - having ripped-out a bunch of unneeded code, I saw a way to tune things a little by optimising register allocation, but ended-up in a maze of twisty passages. Everything still worked, apart from the minor inconvenience that some characters were being rendered as spaces or garbage. I eventually got sick of looking at the code, and went to bed. This morning, as I walked along the embankment, the winter sun and clear air combined with a phrase of music I was humming, and my right-brain delivered the solution without my even really trying to think seriously about it. "You've forgotten to take out that second DEY at the bottom of the loop, Lefty", it said. And it was right.
23rd December - The little bit of code is in to reset the BASIC pointers prior to handing-over to the IRQ, so that the user can write code without splatting anything over the space FAST-40 needs to use. I've also plugged-in a prototype 'inflate' implementation, which (unsurprisingly) inflates a chunk of data that's been compressed by the 'deflate' algorithm; the reason for this little bit of extra baggage is that the characterset bitmap data and the renderer code payload actually deflate by about 65% so the total size of the final build is MUCH smaller than it would be without compression. Adding a couple of hundred bytes for a Huffman decoder is more than compensated-for by the reduction in size of the overall binary, and that decoder logic is part of the initialisation phase so it gets thrown away once it's done its job. Right now I'm debugging a little quirk where the IRQ routine runs off into the weeds instead of doing what I expect, and I think it's a Stack corruption problem so the RTI is pulling junk by mistake. Once I fix that, we'll be into the fiddly bit of tweaking stuff so that the KERNAL screen I/O routines understand the new layout, which means re-writing bits of code it calls through indirection vectors for things like line length breaks, cursor positioning, and suchlike.
27th December - I found another delightful gem in the BASIC ROM which was causing (as I suspected) a nasty Stack corruption fault. Tucked away at the end of the 'NEW' routine (_SCRTCH at $C642) that I call to reset all the BASIC pointers after relocating FAST-40 is a little bit of code that resets the BASIC pseudo-stack, and it does this with a blatantly unsafe adjustment of the CPU Stack Pointer. This of course assumes that BASIC is in charge at this point, and takes no consideration of the possibility that NEW might be being called in some other way (i.e. from a bit of 6502 assembler that wants to reset BASIC). So I just need to figure-out how to code around that, either by fixing the Stack afterwards or by pre-rigging the Stack beforehand so the call to NEW doesn't break my return address - which is what's happening now, and causing the IRQ routine to return somewhere unexpected.
2nd January 2014 - Well, I didn't make the end-of-year deadline, but then the festive season has a habit of distracting even the most hardened coder from their pet projects. I did find time to eliminate that Stack problem, though - I noticed that a higher-level call to the BASIC warm-start entrypoint would have the desired effect and avoid the problem entirely, so that's what I do now. So the initialisation code is now essentially done, and it's time to look at the IRQ logic that'll manage invocation of the renderer as and when necessary. This is all basically logic I've already written for VIC++ but I have to also patch-in the flag-bit updates for screen writes and make that work with the existing VIC-20 Kernal - new territory for me, as I've never tried to rework bits of the Commodore screen editor before.
9th January - I'm doing some fiddly stuff at the moment, hooking-up FAST-40 to the screen editor subsystem. The program has to integrate with the normal VIC-20 screen and keyboard functionality so that it behaves just as it does in normal 22-column mode, and that means tying it into the Kernal so that everything works in the custom 40-column mode. The trick is to hook into the screen and keyboard I/O vectors which get called whenever something is written to the screen or a key is pressed, and override the standard logic with custom code that understands where things are and how they work when FAST-40 is in charge. There's not a lot of difference, but there are places where the Kernal expects line lengths to be 22 characters (for example) that should now act on 40-character lines. It's not particularly exciting code, but it's important for the functionality of the program.
13th January - Well, this isn't working as well as I envisaged. My idea was to clone the vectored I/O routines, tweak them where they needed to know about the 40-column stuff, and have them defer to the ROM wherever possible. But the problem is that there's a lot of gnarly, intertwined, inter-dependent spaghetti code in there, and I'm up to almost 500 bytes of ROM code cloned already and the end is not yet in sight. This is, to use a technical term, a bit of a bugger. So I'm going to put that code to one side and try a different approach, which will be to basically re-write the I/O logic entirely, and only call back to the Kernal when it's appropriate. Let's see how it pans out...
18th January - Yep this is working; after a little bit of research (i.e. reading the ROM disassembly for a couple of hours) to determine what the guts of _CHROUT ($F27A) does, I have a nice bit of code written which is already pushing bytes to the text buffer and setting the dirty-row bits so that the renderer can draw stuff. This is vectored through _OUTVEC2 ($0326) so calls to the main I/O routine in the Kernal at $FFD2 route through my code and defer to the ROM for output to devices other than the screen. What this means is that everything that normally writes to the screen will continue to do so unchanged, but my code is now handling it - so for example, when I call the BASIC tepid-start routine at $E37E (between the cold-start and warm-start entry points so that vectors and RAM don't get re-initialised) the screen clears and the standard 'COMMODORE BASIC V2' messages appear on the 40-column bitmap. This is seriously cool, and now I'm writing the control-character handlers for things like Carriage Return, Cursor Down, etc.
26th January - One of the key features that distinguish the Commodore Kernal from a lot of other poorly-coded OS stuff of the time is the use of the jump-table at the top of ROM which has a bunch of core system routines punched through indirection vectors held low down in RAM. The idea is that if you want to do something standard, like write a character to the screen, you call that code through the jump-table entrypoint, which then does an indirect jump through the vector back into the ROM. This means that by altering the RAM vector, you can add, change or remove functionality in a way which requires no changes to any/every program that calls the routine - and this is precisely how FAST-40 works with the _CHROUT routine. I tweak the appropriate RAM vector to point to my code, and everything making calls to _CHROUT continues to do so in exactly the same way - anything using the Kernal to write to the screen now writes through my 40-column logic and gets drawn on a hi-res bitmap instead. This is very cool, and the Kernal itself does this the same way - any time something needs to write to the screen, there's a call to the jump-table address. Except in one place - the _CHRIN code, which reads input, and (in the case of keyboard input) echoes it to the screen. Using an absolute call. Which is a bugger, because that means I have to replicate _CHRIN within FAST-40 now so that it'll call my code instead of the Kernal. Thanks for that little bit of carelessness, Commodore. Grr.
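As a rough illustration of the vectoring idea, here's a small Python model. It's a sketch under assumptions: the function names and the dictionary standing in for the RAM vectors are hypothetical, and nothing here reflects the actual ROM code beyond the shape of the mechanism described above.

```python
# Toy model of a fixed jump-table entry that indirects through a RAM vector.

def rom_chrout(ch, log):
    """Stands in for the stock ROM character-output routine."""
    log.append(("rom", ch))

def fast40_chrout(ch, log):
    """Stands in for a replacement 40-column output routine."""
    log.append(("fast40", ch))

# The RAM vector: writable, initially pointing back into the ROM.
ram_vectors = {"CHROUT": rom_chrout}

def jump_table_chrout(ch, log):
    """Fixed entry point. Callers never change; the vector decides who runs."""
    ram_vectors["CHROUT"](ch, log)

def rom_chrin_echo(ch, log):
    """The awkward case: a ROM routine that calls the ROM code directly,
    bypassing the vector, so a replacement never sees the character."""
    rom_chrout(ch, log)

log = []
jump_table_chrout("A", log)               # handled by the ROM routine
ram_vectors["CHROUT"] = fast40_chrout     # repoint the vector
jump_table_chrout("B", log)               # now handled by the replacement
rom_chrin_echo("C", log)                  # direct call still lands in the ROM
print(log)
```

The last call is the whole problem in miniature: anything that skips the jump-table can't be redirected just by changing the vector, so the caller itself has to be replaced.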
8th February - Significant milestone today, as enough of my re-workings of _CHRIN and _CHROUT are operational that the program is starting to behave as an interactive system again instead of a passive collection of disconnected bits of code. We're still some distance from a finished product, but I think we're close to an Alpha release which will let people play with it if they want to. As a small celebration, I'm going to post a couple of screenies. First up, the initial selector screen that runs when you load FAST-40 and gives you the choice of where to locate the runtime, based on what memory configuration you've got in your VIC-20 - here I have 8K RAM in Block 1 and 8K RAM in Block 5, so the selector has found those and is offering them:
And here, having selected Block 5, the runtime has relocated up into the RAM at $A000 (the cartridge area) and switched to 40-column mode, leaving the whole of Block 1 free and having enough functionality to let me type into it:
10th February - Doing the nitty-gritty chore of making sure _CHROUT responds properly to all input values, i.e. my version yields all the same results as the KERNAL version. It's a moderately boring job except for when my code either does something odd, or nothing at all, at which point it all gets quite exciting. Well, sort of. ;) Anyway, there are things like handling quote-mode, control characters, and a few other edge-cases that have to be implemented - the core of the routine works, which really just plops the designated character into the text buffer and sets the appropriate dirty-row bit to tell the renderer it's time to draw something. Things like control-codes for colour changes are quite simple, but then things get interesting when you do something like hit [Return] because that has to trigger the internal BASIC line rebuild logic, and other such fun stuff. Just plugging away towards Alpha now...
13th February - Still working through the _CHROUT routines to handle character outputs, and starting to think about the Line Link table. This is a 24-byte chunk of Zero Page that keeps track of when a logical line on the screen overruns on to subsequent lines, and is used by a variety of bits of the screen editor to make sure that multi-line lines are bundled together. I've got to do some work on it anyway, since it uses the 24th entry as an end-of-table byte, and now we're running a 24-line display that'll have to be changed. However, it occurred to me as I deciphered how it worked that a simple bitmap could be used instead - in fact, almost exactly the same technique as I use for the Dirty Row table that tells the renderer what rows to redraw. If it works as expected, the Line Link table will drop from 24 bytes to 3 - and I'll be able to consolidate the other ZP locations I use along with some other flags that wouldn't fit into ZP before, and thereby free-up those other locations and still have some of those 24 bytes free.
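Here's a minimal Python sketch of the one-bit-per-row idea, which is how 24 rows fit into 3 bytes. The meaning given to a set bit ("this row continues the line above") and all the names are assumptions for illustration; the real table encoding may well differ.

```python
ROWS = 24

class RowFlags:
    """One bit per screen row, packed into ROWS // 8 bytes."""

    def __init__(self):
        self.bits = bytearray(ROWS // 8)   # 3 bytes for 24 rows

    def set(self, row):
        self.bits[row >> 3] |= 1 << (row & 7)

    def clear(self, row):
        self.bits[row >> 3] &= ~(1 << (row & 7)) & 0xFF

    def test(self, row):
        return bool(self.bits[row >> 3] & (1 << (row & 7)))

def logical_line_start(links, row):
    """Walk upward while the row is flagged as a continuation (assumed encoding)."""
    while row > 0 and links.test(row):
        row -= 1
    return row

links = RowFlags()
links.set(5)     # row 5 continues row 4
links.set(6)     # row 6 continues row 5
print(len(links.bits), logical_line_start(links, 6))
```

The same class would serve for a dirty-row table: set the bit when a row's text changes, test and clear it when the row is redrawn.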
9th March - I've been quiet for the last month, I've just noticed, but work continues. I'm in the middle of a bug-hunt at the moment, fixing niggly little glitches like 'dead' cursor trails, so there's not much in the way of tangible progress to report. I have however come up with a beautiful way to scroll the screen - rather than doing a mechanical byte-copy to shift lines upward, I simply adjust the layout of the underlying character matrix and shuffle pointers around. This means I can scroll the whole screen up two lines in around a third of a frame, which is nice and quick and avoids 'screen tearing' where a scroll event clashes with a refresh. Why two lines and not one? Well, the screen matrix is configured as a 20x12 grid of double-height characters, so one byte in the matrix represents 4 characters - 2 columns, 2 rows - thus moving the matrix back a row means two lines scroll out of view. It's a trade-off because the native KERNAL scrolls one line at a time, so I'm not exactly mimicking standard behaviour, but then again scroll events only occur when the cursor moves off line 24 so having the display move two lines up at that point isn't particularly worrisome.
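One way to picture the scroll is as a rotation of the character matrix rather than a copy of the bitmap. The Python sketch below assumes a row-major 20x12 matrix holding codes 16-255 in order, and that the recycled cells get blanked afterwards; those details and the function names are guesses for illustration, not the actual FAST-40 routine.

```python
MATRIX_COLS, MATRIX_ROWS = 20, 12   # double-height cells
FIRST_CODE = 16                     # codes 0-15 are not used in the matrix

def initial_matrix():
    """Matrix with every cell pointing at its own 16-byte bitmap cell."""
    return [[FIRST_CODE + r * MATRIX_COLS + c for c in range(MATRIX_COLS)]
            for r in range(MATRIX_ROWS)]

def scroll_up(matrix):
    """Scroll by one matrix row (two text lines) by re-ordering the matrix.

    The top row's codes are recycled as the new bottom row. Returns those
    codes so the caller can blank their bitmap cells.
    """
    recycled = matrix.pop(0)
    matrix.append(recycled)
    return recycled

matrix = initial_matrix()
recycled = scroll_up(matrix)
print(matrix[0][0], recycled[0], recycled[-1])
```

Under these assumptions a scroll touches the 240 matrix bytes plus the pixel cells of one recycled row, instead of moving the best part of 4K of bitmap.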
19th April - Yep, you're right, updates have been non-existent over the past month or so; real life interjected itself in a fairly major way just after my last post, and it's really only been in the last couple of days that I've had a chance to get back to the project. So where are we? Well, the basic functionality is mostly there now, and I'm pretty happy with how it's hanging together. The focus of attention at the moment is a difficult little bug concerning dead-cursor trails being left on the screen after non-printing character outputs - I think it might be a timing issue or race condition between _CHROUT and the IRQ handler, with the cursor blink phase at the root. The logic as written is supposed to undraw the cursor when necessary if the blink phase is 'on', but for some reason that isn't always the case - the phase is occasionally 'off' when I expected it to be 'on', so the cursor doesn't undraw before moving, and so dead cursor blocks get left behind. Investigation is ongoing...
EDIT: nailed it - timing issue resulting in loss-of-sync between cursor visibility and the associated flag.
5th May - I'm at the point now where I'm patching HLTs. There were a good number of places where I dropped this (undocumented) opcode into the code as I found things that were going to need attention later, when the program was in better, more complete, shape. Stuff like the bit of logic to tell the renderer which characterset to look at (upper- or lower-case), or the altogether more complicated subroutine that accepts an entered line from the screen and invokes the KERNAL code to tokenise it and stash it as a BASIC line. Back in February I wrote that we were close to an Alpha release, and here we are three months later still awaiting that release - but I want it to be at least capable of running a BASIC program, even if there are a bunch of caveats like specific memory configurations, known bugs, etc. Why is this such a key objective? Simple - alongside the release I want to include a BASIC measurement program which will run on any VIC with any 40-column program and describe the character throughput rate, and thereby demonstrate why FAST-40 has its name. ;)
6th May - It occurred to me that there might be some questions over my use of HLT (opcode $02) instead of the more likely BRK (opcode $00) to trigger breakpoints in the code, as I mentioned previously. The reason is quite simple and is tied to how the 6502, and the Commodore KERNAL, handle these two instructions. If we consider BRK first, it's a standard opcode ("break") and is used, unsurprisingly, to trigger breakpoints - the 6502 has a special piece of logic that watches for executions of BRK and triggers the IRQ vector handler with the 'B' flag set in the Status Register when one is encountered. The Commodore KERNAL code that handles IRQ looks to see if the 'B' flag is set when it's called, and if it is then control bounces out of IRQ and through the BRK vector to manage what action to take in response - it is therefore possible to alter what happens (normally it's a BASIC reset) and do something else instead. The key point being that everything is under software control, and therefore if you want to actually physically stop execution dead in VICE, you still have to apply a monitor breakpoint to the BRK handler. The HLT ("halt") instruction, on the other hand, is an undocumented opcode that physically jams the 6502 and renders it incapable of further activity until reset (it actually messes-up the internal T-State register so no instruction can execute). On real hardware HLT is game over, but VICE simply stops emulation and pops an alert - crucially the PC and other registers retain their emulated state at the point the HLT instruction was executed, and in fact you can enter the monitor and resume execution after the HLT if you wish. And that's why I use HLT as a debugging aid under VICE instead of BRK.

