zxian - a ZX Spectrum emulator (for Windows, written in C)
The computer above helped me learn programming, and I played many video games on it throughout my childhood. It is a Romanian ZX Spectrum clone called Cobra, and was built by my uncle. It features a keyboard superior to both the Spectrum's and the Spectrum+'s. A small Romanian-made TV ("Sport" model) served as the monitor. Programs were loaded from cassette tape through a Russian-made player.
I decided to emulate it and thus zxian was born.
zxian is a ZX Spectrum emulator written in C, using SDL2 for graphics, input/output, and audio. As with my other projects, the source is available for download. I built it with Visual Studio Community 2019 (with C++ core features installed).
I hadn't coded in C in a while and it was very enjoyable. After several years of projects mostly in assembly language, C feels like a smarter assembly language, with really useful macros. I considered C++, but found no benefit from object orientation for this project, so I wrote in pure C.
v17 - added support for taking screenshots
v16 - added support for CRT scanlines effect. Fixed a slowdown issue when using accelerated renderer. Removed an overly eager optimization which impacted CPU-screen sync - this fixes games where graphics are updated multiple times per frame
v15 - added fullscreen support
v14 - sound improvements via variable sync. This fixes sustained tones (such as the BEEP command in BASIC)
v13 - optimizations: reduced host CPU usage by 85%
v12 - CPU microcode fix: R register behaviour; this unfreezes some games which rely on R for timing, like Defender of the Crown. CPU microcode fix: DD/FD prefix opcodes now fall through to their unprefixed equivalents
v11 - added support for saving and loading state; added a UI which allows memory modification (pokes)
v10 - fixed an interrupt bug which allowed reentrancy; this fixes games such as Zynaps
v9 - fixed an overflow bug which deteriorated sound after 20 minutes
v8 - tape UI improvements: current block size and progress; sound improvements: configurability and parameter tweaks
v7 - added support for frame skipping. Improved sound quality and configurability
v6 - significantly improved audio quality. Added support for TAP tape images
v5 - video frame duration can now be specified in milliseconds. Rewrote the "read key status" code to fix a bug, which fixes games such as Manic Miner
v4 - support for "floating bus", whereby data read by ULA can "leak" into hardware ports that are not wired, such as 0xFF. Some games rely on this for timing, instead of an interrupt handler. This fixes games such as Cobra and Arkanoid
v3 - improved game compatibility by supplying a well-known value for the LSB during IM2 handler lookup; previous behaviour can be attained through a switch. This fixes games such as Dizzy 7
v2 - fixed an SNA loading bug caused by incorrect IFF2 initialization; it was causing some games to soft reset, and some to have corrupted graphics
zxian supports the popular Kempston joystick, which is mapped to the arrow keys, with the left control key acting as fire.
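To illustrate the mapping, here is a minimal sketch of how host key states could be folded into the byte a Kempston interface returns (conventionally read from port 0x1F, active-high bits 000FUDLR). The host_keys_t structure and the kempston_read name are hypothetical and are not taken from the zxian source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Kempston bit layout, active high: 0 0 0 F U D L R */
enum {
    KEMPSTON_RIGHT = 0x01,
    KEMPSTON_LEFT  = 0x02,
    KEMPSTON_DOWN  = 0x04,
    KEMPSTON_UP    = 0x08,
    KEMPSTON_FIRE  = 0x10
};

/* Hypothetical snapshot of the host keys relevant to the joystick. */
typedef struct {
    bool up, down, left, right, left_ctrl;
} host_keys_t;

/* Build the value an IN from the Kempston port would return. */
uint8_t kempston_read(const host_keys_t *k)
{
    assert(k != NULL);
    uint8_t v = 0;
    if (k->right)     v |= KEMPSTON_RIGHT;
    if (k->left)      v |= KEMPSTON_LEFT;
    if (k->down)      v |= KEMPSTON_DOWN;
    if (k->up)        v |= KEMPSTON_UP;
    if (k->left_ctrl) v |= KEMPSTON_FIRE;
    return v;
}
```

With this layout, holding the up arrow and left control together yields 0x18.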
As of the current version, only SNA snapshots and TAP tape images are supported. zxian is completely command-line driven. If started with no arguments, it will simply boot into Sinclair BASIC (ZX Spectrum's 48k ROM).
Other utilities include support for saving and loading state, and support for modifying memory (pokes).
Development began with a 50Hz (the Spectrum was made in Britain) timer-invoked routine which read memory and drew pixels following Spectrum's questionable video memory layout. This was followed by writing the functionality for reading Z80 instruction opcodes, with support for all of Z80's opcode prefixes.
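The layout is unusual because the bits of the Y coordinate are interleaved within the address: the screen is split into three 64-line thirds, and within each third consecutive addresses step through character rows before pixel rows. A minimal sketch of the address calculation follows; the function names are my own for illustration and are not zxian's.

```c
#include <assert.h>
#include <stdint.h>

/* Address of the bitmap byte holding pixel (x, y); x in 0..255, y in 0..191.
 * Address bits: 010 y7 y6 y2 y1 y0 y5 y4 y3 x7 x6 x5 x4 x3 */
uint16_t zx_pixel_addr(int x, int y)
{
    assert(x >= 0 && x < 256 && y >= 0 && y < 192);
    return (uint16_t)(0x4000
        | ((y & 0xC0) << 5)   /* screen third (y7..y6)          */
        | ((y & 0x07) << 8)   /* pixel row within cell (y2..y0) */
        | ((y & 0x38) << 2)   /* character row in third (y5..y3) */
        | (x >> 3));          /* byte column                    */
}

/* Bit mask of pixel x within its byte; the leftmost pixel is bit 7. */
uint8_t zx_pixel_mask(int x)
{
    return (uint8_t)(0x80 >> (x & 7));
}

/* Address of the colour attribute byte covering pixel (x, y).
 * Attributes are stored linearly, one per 8x8 cell, from 0x5800. */
uint16_t zx_attr_addr(int x, int y)
{
    assert(x >= 0 && x < 256 && y >= 0 && y < 192);
    return (uint16_t)(0x5800 + (y >> 3) * 32 + (x >> 3));
}
```

Note how moving one pixel row down (y from 0 to 1) jumps 256 bytes ahead, while moving one character row down (y from 0 to 8) jumps only 32 bytes.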
Then came several weeks of microcode development, where each Z80 instruction was implemented and tested. The Z80 CPU manual was a good resource for finding details on how each instruction behaved, what flags it affected, etc.
There is a large number of undocumented Z80 instructions, which I also implemented.
This was followed by support for interrupts and ZX Spectrum-specific areas such as hardware ports.
Seeing the image above was a great milestone.
I think that one difficulty with emulator development is that you have access to low-level tests (you can manually test each instruction individually) and to high-level tests (the Sinclair ROM, or a game) - but not much in between.
This means that you keep testing at a very low level, as you progress, but can only hope that when everything has been written, the ROM (or game) boots and works.
I enjoyed developing the sound capability of zxian, because I hadn't done anything like that before. While the SDL2 library abstracts the audio hardware of the host computer, it still requires a constant stream of data (audio samples) to function. The difficulty is that these samples have to be provided in real time.
The challenge I faced was that the emulated Z80 finishes a video frame's worth (20ms) of computation in much less than 20ms of real time. Additionally, how much real time the Z80 actually needs varies from host computer to host computer, and is therefore unknown and unreliable.
However, the number of CPU clock cycles (or tstates) that the Z80 is allowed to perform during each 20ms interval is constant, irrespective of the host machine. That specific number of clock cycles might be performed in 9ms on one host computer, and in 5ms on a much faster host computer.
My solution was to sample the state of the speaker at fixed clock cycle intervals during the Z80's active time and write them to a buffer, such that 20ms's worth of Z80 CPU time yielded 20ms worth of real-time audio data.
Meanwhile, the SDL audio layer read from a second buffer, which was full of audio samples accumulated during the previous video frame (20ms).
At the end of each video frame, the two buffers were swapped - the read buffer became the write buffer and vice-versa.
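The original two-buffer scheme can be sketched roughly as follows. This is an illustration under stated assumptions, not zxian's code: it assumes a 48K Spectrum frame of 69888 tstates (312 scanlines of 224 tstates) and a 44100 Hz mono output, which gives 882 samples per 20ms frame. All type and function names are hypothetical. Because 69888 is not a multiple of 882, an integer accumulator spreads the samples evenly instead of using a fractional interval.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TSTATES_PER_FRAME 69888 /* 48K Spectrum: 312 lines x 224 tstates */
#define SAMPLES_PER_FRAME 882   /* assumed: 44100 Hz x 20 ms             */

typedef struct {
    int16_t  bufs[2][SAMPLES_PER_FRAME];
    int      write_idx; /* buffer currently filled by the emulation */
    int      write_pos; /* next free sample slot in that buffer     */
    uint32_t acc;       /* progress towards the next sample         */
} audio_t;

void audio_init(audio_t *a)
{
    assert(a != NULL);
    a->write_idx = 0;
    a->write_pos = 0;
    a->acc = 0;
}

/* Call after each emulated instruction with the tstates it consumed.
 * Emits one sample each time 69888/882 tstates of emulated time elapse,
 * so the sample count depends on emulated time, not on host speed. */
void audio_step(audio_t *a, int tstates, int speaker_on)
{
    a->acc += (uint32_t)tstates * SAMPLES_PER_FRAME;
    while (a->acc >= TSTATES_PER_FRAME) {
        a->acc -= TSTATES_PER_FRAME;
        if (a->write_pos < SAMPLES_PER_FRAME)
            a->bufs[a->write_idx][a->write_pos++] =
                (int16_t)(speaker_on ? 8000 : -8000);
    }
}

/* Call at the end of each video frame: swaps the buffers and returns
 * the one just filled, which the audio callback should now play. */
const int16_t *audio_end_frame(audio_t *a)
{
    const int16_t *ready = a->bufs[a->write_idx];
    a->write_idx ^= 1;
    a->write_pos = 0;
    return ready;
}
```

In a real SDL2 program the audio callback runs on its own thread, so the swap would need to be protected, for example with SDL_LockAudioDevice and SDL_UnlockAudioDevice. This sketch also samples the speaker only at instruction boundaries, which is a simplification.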
NOTE: As of version 6, the above has been replaced by a different approach, based on a circular buffer and automatic resynchronization between read and write "heads".
Here is a mini-log of the changes I've made to the sound module from version 6 onwards, which ultimately led to significant improvements in sound quality:
two buffers (read and write), swapped, rudimentary, poorly-working synchronization
same as above, but oversample and then average, no improvement
switched to stereo, with improvement, but still very annoying stutters
circular buffer, reset buffer read and write "heads" to sync, significant improvement
circular buffer, don't feed SDL audio buffer to sync, regression - it sounds worse
circular buffer, single-way desync detection, rewind read buffer "head" to sync, small improvement
circular buffer, two-way desync detection, rewind or fast-forward read buffer "head" to sync, good in most games, but still noticeable in continuous-music games like Manic Miner
circular buffer, two-way desync detection, rewind or fast-forward read buffer "head" to sync, with video frame tstate adjustment (that is, CPU gets fewer tstates during video frames that went over their tstate allocation) - much better
same as above, but with more frequent (but smaller) resynchronizations, which are much less noticeable than rarer, larger resynchronizations
per-scanline tstate compensation, so that CPU speed stays closer to 100%, which further reduces the sound desynchronization rate
on computers where the video frames last slightly longer than the target, frameskip becomes enabled; in these cases, automatically switching between a static and a dynamic sampling interval yields fewer resyncs
further advances were made by allowing configuration of many different parameters, which led to changes which decreased the number of resyncs
(in v14) variable sync, to address crackling sustained tones (such as the BEEP command in BASIC); see more detail in the section below
I've settled on a resynchronization strategy whereby the read head:
Is moved forward by 1.66ms when it falls more than 50ms behind the write head
Is moved backward by 1.66ms when it comes closer than 3.33ms behind the write head
This strategy minimizes the total resync amount (occurrences*length) per second
It also keeps the resyncs small in length (resyncs become noticeable if longer than 2.5ms forward or backward)
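The two thresholds above can be sketched as a small function over circular buffer indices. This is only an illustration: the buffer size is hypothetical, and the millisecond values are converted to samples assuming 44100 Hz mono output (1.66ms is about 73 samples, 3.33ms about 147, and 50ms is 2205). None of the names come from the zxian source.

```c
#include <assert.h>

#define RING_SIZE 8192u /* hypothetical ring size; must be a power of two */
#define STEP      73u   /* ~1.66 ms at an assumed 44100 Hz                 */
#define MIN_LAG   147u  /* ~3.33 ms                                        */
#define MAX_LAG   2205u /* 50 ms                                           */

/* How many samples the read head trails the write head, with wrap-around. */
unsigned ring_lag(unsigned read, unsigned write)
{
    return (write - read) & (RING_SIZE - 1u);
}

/* Returns the (possibly adjusted) read head position. */
unsigned resync_read_head(unsigned read, unsigned write)
{
    assert(read < RING_SIZE && write < RING_SIZE);
    unsigned lag = ring_lag(read, write);
    if (lag > MAX_LAG)
        read = (read + STEP) & (RING_SIZE - 1u); /* too far behind: skip ahead */
    else if (lag < MIN_LAG)
        read = (read - STEP) & (RING_SIZE - 1u); /* too close: rewind          */
    return read;
}
```

Calling this once per audio callback keeps each correction at a single small step, so a large drift is absorbed over several callbacks rather than in one audible jump.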
Here is a demonstration of how the sound improved from version 5 to version 6:
Sound improvements in version 14
In version 14, I solved an issue which had existed from the beginning: crackling sustained tones. Due to the above-described resynchronization strategy, sustained tones (such as the lead tone when loading from tape, or the BEEP command in BASIC) crackled, because of the frequent (albeit tiny) resynchronizations.
If I changed the resynchronization strategy to allow more latency (lag) offset by larger resynchronizations, sustained tones sounded good, but resynchronizations were very noticeable when they did happen.
In version 14, the solution I implemented varies between an eager strategy (low lag, tiny resynchronizations) and a lazy strategy (high lag, large resynchronizations). The discriminant is the shape of the sound.
Upon collection, sound samples are analyzed to see if they represent a sustained tone. In a sustained tone, rising and falling edges are equally spaced, so they exhibit a fixed period and thus a fixed frequency. When this occurs, zxian chooses the lazy strategy.
During intervals of varying frequencies and silence, zxian chooses the eager strategy.
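One way to test for equally-spaced edges is sketched below. It is an assumption about how such a check might look, not zxian's actual analysis: the function name, the tolerance and min_edges parameters, and the use of the sample sign to detect edges are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns true if the samples look like a sustained tone: at least
 * min_edges level changes, and every gap between consecutive edges is
 * within `tolerance` samples of the first gap. Silence has no edges,
 * so it is reported as "not a tone". */
bool is_sustained_tone(const int16_t *s, size_t n, int tolerance, int min_edges)
{
    assert(s != NULL || n == 0);
    long prev_edge = -1; /* index of the previous edge, -1 if none yet */
    long ref_gap = -1;   /* first observed gap, used as the reference  */
    int edges = 0;

    for (size_t i = 1; i < n; i++) {
        if ((s[i] >= 0) == (s[i - 1] >= 0))
            continue; /* same level: no edge here */
        edges++;
        if (prev_edge >= 0) {
            long gap = (long)i - prev_edge;
            if (ref_gap < 0)
                ref_gap = gap;
            else if (labs(gap - ref_gap) > tolerance)
                return false; /* spacing changed: frequency is not fixed */
        }
        prev_edge = (long)i;
    }
    return edges >= min_edges;
}
```

A caller could then select the lazy strategy when this returns true for the most recent block of samples, and the eager strategy otherwise.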
In practice during gameplay, I've observed a selection which mixes eager and lazy strategies. Fortunately, the extra lag allowed by the lazy strategy is offset by eager resyncs during periods of silence, which are not perceived by the user.
Thus, version 14 loses almost no sound quality during mixed-frequency scenarios (such as regular gameplay) and gains significant sound quality during sustained tones.
Optimizations - reducing host CPU usage by 85% in version 13
Graphics rendering code had remained unchanged since v1, with pixel-by-pixel rendering - which I knew was much slower than it could be.
Likewise, the microcode (the module which executes Z80 instructions) was designed to be clean from an "object oriented" perspective, at the expense of speed.
Here are some of the changes which took place, to obtain an 85% reduction in host CPU usage (that is, v13 uses roughly 7 times less CPU than v12):
microcode relying on static memory - Each CPU instruction was previously stored on the heap, via malloc/free. While clean from a design standpoint, this was much slower than simply using a single-instruction "storage area" in static memory, and reusing it for each successive instruction
removal of ROM fetch/decode stage caching - Previously, I thought I'd speed up CPU code by caching the result of the fetch/decode phase, based on the assumption that ROM does not change. After some experiments, I found that games make so few ROM calls that the cache check overhead was not worth it, so I removed it. I did experiment with applying the same caching to RAM, but the speed increase was not enough to warrant the significant increase in complexity (since RAM can change, clever programs modify themselves, and instructions can be "jumped into" partway)
render only necessary video frames - I noticed CPU usage was the same when staring at the "copyright screen" versus when playing a game. Thus, I came up with a way to determine whether the video frame had changed at all (border, video memory, flash), skipping rendering altogether if nothing had changed. During gameplay, I've observed between 8% and 15% of video frames didn't change
when solid, block-drawn border - Rely on four SDL rectangles to draw the entire screen's border when the border colour remains unchanged throughout an entire video frame. Previously, the border was drawn pixel-by-pixel
when not solid, border via memset - When the border DOES change throughout a video frame, draw it via memset, which is very fast
scanline duplication for zoom - Instead of drawing a little square for each pixel to satisfy zoom (e.g. a 3x3 pixel square when zoom is 3), I changed it to draw a single horizontal line whose pixels were zoomed horizontally - followed by a copy down as many lines as needed (e.g. 2 copied lines for zoom 3)
faster pixel drawing - I sped this up by removing all multiplications used in offset and colour calculations by pre-computing them in lookup tables
pixel caching - I realized that the basic unit of graphics is the horizontal 8-pixel wide line represented by a single byte in ZX Spectrum's video memory. Thus, I cached all 256 possible renderings of 8-pixel wide lines, for all possible foreground/background colour combinations, for all inverse/flash combinations, for the current zoom. This converted a costly bit-by-bit loop into a single memcpy block operation. I greatly reduced memory usage here by only caching what was needed, since games typically don't use more than 20% of all possible 8-pixel lines.
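The pixel caching idea from the last item can be sketched as a lazily-filled table keyed by the bitmap byte and its attribute byte. This is a minimal illustration under my own assumptions, not zxian's implementation: the names, the 32-bit host pixel format, and the trick of treating an active flash as an inverted bitmap (so that flash needs no extra cache dimension) are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    int       zoom;
    uint32_t  palette[16];     /* host colours: 8 normal + 8 bright     */
    uint32_t *lines[256][128]; /* [bitmap byte][attr without flash bit] */
} line_cache_t;

line_cache_t *line_cache_create(int zoom, const uint32_t palette[16])
{
    assert(zoom >= 1);
    /* calloc leaves every entry NULL (on common platforms) = "not rendered" */
    line_cache_t *c = calloc(1, sizeof *c);
    if (c == NULL) return NULL;
    c->zoom = zoom;
    memcpy(c->palette, palette, sizeof c->palette);
    return c;
}

void line_cache_destroy(line_cache_t *c)
{
    if (c == NULL) return;
    for (int b = 0; b < 256; b++)
        for (int a = 0; a < 128; a++)
            free(c->lines[b][a]);
    free(c);
}

/* Returns 8*zoom pre-rendered pixels, rendering them on first use only. */
const uint32_t *line_cache_get(line_cache_t *c, uint8_t bits, uint8_t attr,
                               int flash_on)
{
    if ((attr & 0x80) && flash_on)
        bits = (uint8_t)~bits; /* swapping ink and paper = inverting bits */
    attr &= 0x7F;
    uint32_t *line = c->lines[bits][attr];
    if (line == NULL) {
        int bright = (attr & 0x40) ? 8 : 0;
        uint32_t ink   = c->palette[(attr & 0x07) + bright];
        uint32_t paper = c->palette[((attr >> 3) & 0x07) + bright];
        line = malloc(sizeof *line * 8 * (size_t)c->zoom);
        if (line == NULL) return NULL;
        for (int px = 0; px < 8; px++) {
            uint32_t colour = (bits & (0x80 >> px)) ? ink : paper;
            for (int z = 0; z < c->zoom; z++)
                line[px * c->zoom + z] = colour;
        }
        c->lines[bits][attr] = line;
    }
    return line;
}

/* Draw one 8-pixel line into dst with a single block copy. */
void draw_cell_row(line_cache_t *c, uint32_t *dst, uint8_t bits, uint8_t attr,
                   int flash_on)
{
    const uint32_t *src = line_cache_get(c, bits, attr, flash_on);
    if (src != NULL)
        memcpy(dst, src, sizeof *src * 8 * (size_t)c->zoom);
}
```

Only combinations that actually appear on screen are ever allocated, which matches the observation that games use a small fraction of all possible 8-pixel lines. For zoom factors above 1, the finished scanline can then be duplicated downwards with further memcpy calls, as described in the scanline duplication item.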