Five hours. Five hours spent laboriously tracing a code path back and back and back, seeing a system crash that makes no sense have its' roots not in code I've recently added, or even not-so-recently added, but actually in a tiny piece of logic written 10 months ago (see this post) that featured a bug so subtle that time itself was a contributory factor to it. And I'd just like to quote myself from last time before kicking myself firmly in the behind for being so glib: "I've also taken half-an-hour to re-organise the sourcecode and rationalise the cosmetic style; formatting makes absolutely no difference to the assembled binary...". Yeah, right. In C# maybe - in Assembly, as if I didn't know after 30 years, time and position is everything.
It all started when I ran a full-screen test of the renderer to populate the entire display area and throw attributes into various places on certain lines. I wrote a little test harness to generate a few lines and attributes so I could make sure there weren't any edge-conditions lurking in the new code. Everything was looking great, and I gradually increased the line count until I got to 25. And then this happened:
Ouch. Right, so there's either a bug in the harness code that sets everything up, or in the renderer itself when it draws line 25; since this is exactly the kind of boundary-fault I was looking for, I can't really complain when I find one - better now while the routine is fresh in my mind than later when I'm concentrating on something else. Let's start isolating where this glitch is happening...
Fast-forward four-and-a-half hours, and I'm thoroughly flummoxed. The test harness checks out. The renderer checks out. The bug either manifests on line 25, and only then if there are attributes on it, or if some other lines have more than a few attributes set; but the tables holding the raw text and attribute data are populated properly, the final bitmap after rendering has the right pixel-data in it, and yet something somewhere is accidentally stomping over my VIC register settings and causing this horrible screen mess. My working theory has been that somewhere I have a memory-indirection write that's being corrupted and thus points at an unintended chunk of memory, but after exhaustive tracing I'm not spotting it. Weird.
The key thing to remember here is that this bug didn't occur in the old version of the renderer - I checked, just in case, using both the earlier version of the code and the version of XVIC I was using at the time. The old code works on VICE 2.2 upwards (that's the first version I was testing against last year) and the new code doesn't, so this is definitely something to do with what I've been working-on recently. But tracing the code shows that it's doing exactly what it should be, and going nowhere near the VIC registers at any point. So what else could be causing this? The only bit of code I have that deliberately messes with the VIC is the IRQ routine, and that hasn't changed since May 2012.
But the test harness I'm using to exercise the renderer is doing quite a lot of work, so that's a chunk of code running at startup which is comparatively slow - and I've just re-organised the code sequencing for the way the OS gets things ready for use. So here's how all these things tie together with the IRQ routine to produce a spectacular crash: as part of my 'no difference to the binary' (haha) shuffling, the CLI instruction to enable interrupts and set IRQ running has changed position and now occurs as the last action rather than being earlier in the startup path.
So why should that make a difference? Well, it's all centered on this little loop in the IRQ logic that waits for the raster counter to hit a certain value - at this point we've just tweaked the VIC settings to start drawing our special last line from a different memory area, and this loop just hangs around waiting until the VIC has had enough time to draw it. When we hit a certain raster line number, we can then switch back to 'normal' mode and resume drawing the rest of the screen (until the next frame gets to where line 25 starts and we switch all over again):
DEC _ROWCNT ; [5] turn double-height chars off
LDA _VICDATA+16 ; [4] get alternate VIC register value
STA _CHARMEM ; [4] switch screen to $0000 for line 25
LDA _NTSCPAL+2 ; [2] get raster sync line number
.raswait CMP _RASTER ; [4] wait for end of Line25
BNE .raswait ; [3/2] loop until line 25 finished drawing
INC _ROWCNT ; [5] turn double-height chars back on
LDA _VICDATA+5 ; [4] get original VIC register value
STA _CHARMEM ; [4] switch screen back to $1000
Simple, right? A two-line wait state in the middle there - not much room for error, you'd suppose. Unless (and bear with me here) the IRQ routine itself now kicks-in later than it used to during startup, and that startup is itself significantly different in timing - due to a slow test harness doing a bunch of stuff - which means that the first IRQ pass gets stuck in this little loop waiting for the raster counter to hit a specific point that we've already passed. Ergo, the screen gets switched into line 25 mode and never gets out of it, and then the next IRQ starts and gets stuck too, and the next, and the next...
Solution: change the loop exit condition so that instead of waiting for the raster counter to hit a specific line to denote the end of the 'special' line 25 mode, we'll wait instead until it has passed that line and we should therefore switch the VIC settings back for the rest of the screen. That little adjustment means we'll avoid racing the beam and getting stuck waiting for the raster counter to get to a specific line we might never actually hit precisely, especially when the IRQ routine is newly-enabled and we don't know how far down the screen the raster has got to. This might happen at any point when IRQ has been disabled and re-enabled, not just at startup but any time something else wants to suspend interrupts for any reason:
DEC _ROWCNT ; [5] turn double-height chars off
LDA _VICDATA+16 ; [4] get alternate VIC register value
STA _CHARMEM ; [4] switch screen to $0000 for line 25
LDA _NTSCPAL+2 ; [2] get raster sync line number
.raswait CMP _RASTER ; [4] wait for end of Line25
BCS .raswait ; [3/2] loop until line 25 finished drawing
INC _ROWCNT ; [5] turn double-height chars back on
LDA _VICDATA+5 ; [4] get original VIC register value
STA _CHARMEM ; [4] switch screen back to $1000
There is, however, one curious oddity outstanding which I have been unable to put to rest - and it concerns the actual value that the raster-compare loop uses to determine when the VIC will have finished drawing line 25. We now know that the loop condition itself was a bit dodgy, as we were testing for equality with a specific raster count value - so altering that from a BNE to a BCS 'greater than or equal to' test means we never get stuck if we miss the point at which the count matches. But here's the thing: right from version one of this routine, the count value we've been looking for (for PAL) is $88, or raster line 272 - but the code now never seems to hit it. Disregarding the first IRQ pass where a lot of work has been done and the raster has moved way past that line, on all subsequent passes the loop now stops waiting when the count hits $89. I can't see any reason why this little spinlock now inexplicably misses this line, but somehow it does.
Anyone got any ideas?

No comments:
Post a Comment