zxian - a ZX Spectrum emulator (for Windows, written in C)

The computer above has helped me learn programming and play many video games throughout my childhood. It is a Romanian ZX Spectrum clone called Cobra, and was built by my uncle. It features a keyboard superior to both Spectrum and Spectrum+'s. A small Romanian-made TV ("Sport" model) served as the monitor. Programs were loaded from cassette tape through a Russian-made tape player.

I decided to emulate it and thus zxian was born.

zxian is a ZX Spectrum emulator written in C, using SDL2 for graphics, input/output, and audio. As with my other projects, the source is available for download. I built it with Visual Studio Community 2019 (with C++ core features installed).

I hadn't coded in C in a while and it was very enjoyable. After several years of projects mostly in assembly language, C feels like a smarter assembly language, with really useful macros. I considered C++, but found no benefit from object orientation for this project, so I wrote in pure C.

Optimization efforts over two separate versions reduced host CPU usage by 97%.

Version history (64bit)

v28 - added support for sound volume control
v27 - added support for emulation rewinding
v26 - added a button which plays a single tape block, for multi-load games. Removed clicking noises heard during tape loading
v25 - fixed a bug which sometimes deteriorated sound quality at high sample rates. Improved sound quality via oversampling. Screenshots no longer include UI elements (e.g. poke UI)
v24 - corrected R register behaviour (7bit counter instead of 8bit). This fixes games such as Robin of the Wood, Ping Pong, Sanxion
v23 - fixed a border rendering bug after fast tape loading finishes. Fixed a TAP loading bug concerning custom-length header blocks. This fixes games such as Lemmings
v22 - added support for saving SNA snapshots
v21 - added support for more options to the UI. Fixed a border rendering bug in fullscreen display modes. Fixed minor scanlines effect bugs
v20 - first 64bit version

Version history (32bit)

v19 - optimizations: reduced host CPU usage by 75%-90% relative to the previous version
v18 - added zxianui, a friendlier, UI-based zxian starter
v17 - added support for taking screenshots
v16 - added support for CRT scanlines effect. Fixed a slowdown issue when using accelerated renderer. Removed an overly eager optimization which impacted CPU-screen sync - this fixes games where graphics are updated multiple times per frame
v15 - added fullscreen support
v14 - sound improvements via variable sync. This fixes sustained tones (such as the BEEP command in BASIC)
v13 - optimizations: reduced host CPU usage by 85%
v12 - CPU microcode fix: R register behaviour; this unfreezes some games which rely on R for timing, like Defender of the Crown. CPU microcode fix: DD/FD prefix opcodes fall-through to unprefixed
v11 - added support for saving and loading state; added a UI which allows memory modification (pokes)
v10 - fixed an interrupt bug which allowed reentrancy; this fixes games such as Zynaps
v9 - fixed an overflow bug which deteriorated sound after 20 minutes
v8 - tape UI improvements: current block size and progress; sound improvements: configurability and parameter tweaks
v7 - added support for frame skipping. Improved sound quality and configurability
v6 - significantly improved audio quality. Added support for TAP tape images
v5 - video frame duration can now be specified in milliseconds. Rewrote the "read key status" code to fix a bug, which fixes games such as Manic Miner
v4 - support for "floating bus", whereby data read by ULA can "leak" into hardware ports that are not wired, such as 0xFF. Some games rely on this for timing, instead of an interrupt handler. This fixes games such as Cobra and Arkanoid
v3 - improved game compatibility by supplying a well-known value for the LSB during IM2 handler lookup; previous behaviour can be attained through a switch. This fixes games such as Dizzy 7
v2 - fixed an SNA loading bug caused by incorrect IFF2 initialization; it was causing some games to soft reset, and some to have corrupted graphics
v1 - initial release

Downloads (current version)

zxian v28 emulator (for Windows) - unzip anywhere and run zxian.exe. Read the provided text files for more information.
zxian v28 source - load the solution file in Visual Studio to build zxian yourself.

Downloads (older versions)

Development

Development began with a timer-invoked routine running at 50Hz (the Spectrum was made in Britain, where TVs refresh at 50Hz), which read memory and drew pixels following the Spectrum's questionable video memory layout. This was followed by writing the functionality for reading Z80 instruction opcodes, with support for all of the Z80's opcode prefixes.
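To illustrate why the layout is "questionable", here is a minimal sketch of the address calculation for the standard 48K Spectrum screen. The function names are hypothetical and this is not taken from zxian's source; it only shows how the Y coordinate bits are interleaved in the bitmap address, while attributes are stored linearly.

```c
#include <assert.h>
#include <stdint.h>

/* Address of the bitmap byte holding pixel (x, y), with x in 0..255 and
   y in 0..191. The Y bits are shuffled in the address:
   010 Y7 Y6 | Y2 Y1 Y0 | Y5 Y4 Y3 | X7 X6 X5 X4 X3 */
uint16_t zx_pixel_address(unsigned x, unsigned y)
{
    assert(x < 256 && y < 192);
    return (uint16_t)(0x4000
        | ((y & 0xC0) << 5)   /* screen third            -> address bits 12..11 */
        | ((y & 0x07) << 8)   /* pixel row inside a cell -> address bits 10..8  */
        | ((y & 0x38) << 2)   /* character row in third  -> address bits 7..5   */
        | (x >> 3));          /* character column        -> address bits 4..0   */
}

/* Address of the attribute (colour) byte for the 8x8 cell containing (x, y).
   Unlike the bitmap, attributes are laid out row by row. */
uint16_t zx_attribute_address(unsigned x, unsigned y)
{
    assert(x < 256 && y < 192);
    return (uint16_t)(0x5800 + (y >> 3) * 32 + (x >> 3));
}
```

Moving one pixel row down inside a character cell adds 0x100 to the address, while moving to the next character row adds only 0x20, so consecutive scanlines are not consecutive in memory.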

Then came several weeks of microcode development, where each Z80 instruction was implemented and tested. The Z80 CPU manual was a good resource for finding details on how each instruction behaved, what flags it affected, etc.

There is a large number of undocumented Z80 instructions, which I also implemented.

This was followed by support for interrupts and ZX Spectrum-specific areas such as hardware ports.

Seeing the image above was a great milestone.

I think that one difficulty with emulator development is that you have access to low-level tests (you can manually test each instruction individually) and to high-level tests (the Sinclair ROM, or a game) - but not much in between.

This means that you keep testing at a very low level, as you progress, but can only hope that when everything has been written, the ROM (or game) boots and works.

Sound development

I enjoyed developing the sound capability of zxian, because I hadn't done anything like that before. While the SDL2 library abstracts the audio hardware of the host computer, it still requires a constant stream of data (audio samples) to function. The difficulty is that these samples have to be provided in real time.

The challenge I faced was that the emulated Z80 CPU finishes a video frame's worth (20ms) of computation in much less than 20ms of real time. Additionally, how much real time it actually needs varies from host computer to host computer, so it cannot be relied upon.

However, the number of CPU clock cycles (or tstates) that the Z80 is allowed to perform during each 20ms interval is constant, irrespective of the host machine. That specific number of clock cycles might be performed in 9ms on one host computer, and in 5ms on a much faster host computer.

My solution was to sample the state of the speaker at fixed clock cycle intervals during the Z80's active time and write the samples to a buffer, such that 20ms worth of Z80 CPU time yielded 20ms worth of real-time audio data.

Meanwhile, the SDL audio layer read from a second buffer, which held the audio samples accumulated during the previous video frame (20ms).

At the end of each video frame, the two buffers are swapped - the read buffer becomes the write buffer and vice-versa.
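The two-buffer scheme can be sketched roughly as follows. This is an illustration, not zxian's code: the names are hypothetical, and it assumes 69888 tstates per frame (the usual 48K Spectrum figure) and a 44100Hz output, which gives 882 samples per 20ms frame.

```c
#include <assert.h>
#include <stdint.h>

enum {
    TSTATES_PER_FRAME = 69888, /* assumption: standard 48K Spectrum frame */
    SAMPLES_PER_FRAME = 882    /* assumption: 44100 Hz * 20 ms */
};

typedef struct {
    int16_t buffers[2][SAMPLES_PER_FRAME];
    int write_index;      /* which buffer the emulated CPU is filling */
    int samples_written;  /* samples stored in the write buffer so far */
} FrameAudio;

/* Called as emulation advances. frame_tstate is the tstate count within the
   current frame and speaker_on is the current speaker state. Writes every
   sample whose sampling instant (n * TSTATES / SAMPLES) has been reached. */
void frame_audio_advance(FrameAudio *a, long frame_tstate, int speaker_on)
{
    assert(frame_tstate >= 0 && frame_tstate <= TSTATES_PER_FRAME);
    while (a->samples_written < SAMPLES_PER_FRAME &&
           (long)a->samples_written * TSTATES_PER_FRAME
               <= frame_tstate * SAMPLES_PER_FRAME) {
        a->buffers[a->write_index][a->samples_written++] =
            speaker_on ? 8000 : -8000;
    }
}

/* Called at the end of a video frame. Returns the buffer that was just
   filled (now the read buffer) and makes the other one the write buffer. */
const int16_t *frame_audio_swap(FrameAudio *a)
{
    const int16_t *filled = a->buffers[a->write_index];
    a->write_index ^= 1;
    a->samples_written = 0;
    return filled;
}
```

Because sampling is driven by emulated tstates rather than by wall-clock time, a frame always yields the same number of samples, no matter how quickly the host executes it.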

NOTE: As of version 6, the above has been replaced by a different approach, based on a circular buffer and automatic resynchronization between read and write "heads".

From version 6, here is a mini-log of changes I've made to the sound module, to ultimately make significant improvements to sound quality:

two buffers (read and write), swapped, rudimentary, poorly-working synchronization
same as above, but oversample and then average, no improvement
switched to stereo, with improvement, but still very annoying stutters
circular buffer, reset buffer read and write "heads" to sync, significant improvement
circular buffer, don't feed SDL audio buffer to sync, regression - it sounds worse
circular buffer, single-way desync detection, rewind read buffer "head" to sync, small improvement
circular buffer, two-way desync detection, rewind or fast-forward read buffer "head" to sync, good in most games, but still noticeable in continuous-music games like Manic Miner
circular buffer, two-way desync detection, rewind or fast-forward read buffer "head" to sync, with video frame tstate adjustment (that is, CPU gets fewer tstates during video frames that went over their tstate allocation) - much better
same as above, but with more frequent (and smaller) resynchronizations, which seem much less noticeable than rarer, larger ones
per-scanline tstate compensation, so that CPU speed stays closer to 100%; this further reduces the sound desynchronization rate
on computers where the video frames last slightly longer than the target, frameskip becomes enabled; in these cases, automatically switching between a static and a dynamic sampling interval yields fewer resyncs
further advances were made by making many parameters configurable, which led to tweaks that decreased the number of resyncs
(in v14) variable sync, to address crackling sustained tones (such as the BEEP command in BASIC); see more detail in the section below
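The "video frame tstate adjustment" entry in the log above can be sketched as follows. Z80 instructions cannot be split, so a frame typically overruns its allocation by a few tstates; the idea is to hand that debt to the next frame. The function name and the 69888 tstates-per-frame figure are assumptions for illustration, not zxian's actual values.

```c
#include <assert.h>

enum { TSTATES_PER_FRAME = 69888 }; /* assumption: standard 48K Spectrum frame */

/* Given the budget the previous frame had and the tstates it actually
   executed, return the budget for the next frame: the nominal frame length
   minus whatever the previous frame overran by. */
long next_frame_budget(long previous_budget, long previous_executed)
{
    assert(previous_executed >= 0);
    long overshoot = previous_executed - previous_budget;
    if (overshoot < 0)
        overshoot = 0; /* an underrun is not credited in this sketch */
    return TSTATES_PER_FRAME - overshoot;
}
```

With this carry-over, the average number of tstates per frame stays at the nominal value, so the audio write head does not slowly drift ahead of real time.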

I've settled on a resynchronization strategy whereby the read head:

Is moved forward by 1.66ms when falling further than 50ms behind the write head
Is moved backward by 1.66ms when it gets closer than 3.33ms behind the write head

This strategy:

Minimizes the total resync amount (occurrences*length) per second
Keeps the resyncs small in length (resyncs become noticeable if longer than 2.5ms forward or backward)
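The rules above can be sketched as a small decision function. This is only an illustration: the names are hypothetical, a 44100Hz sample rate is assumed, and head positions are kept as running sample counts (a real circular buffer would take them modulo the buffer size).

```c
#include <assert.h>
#include <stdint.h>

enum { SAMPLE_RATE = 44100 }; /* assumption: output sample rate */

static int64_t us_to_samples(int64_t microseconds)
{
    return microseconds * SAMPLE_RATE / 1000000;
}

typedef struct {
    int64_t read_pos;  /* samples consumed so far by the audio output */
    int64_t write_pos; /* samples produced so far by the emulation */
} AudioHeads;

/* Applies the strategy to the read head and returns how far it was moved:
   positive = forward (skipping samples), negative = backward (replaying
   samples), 0 = heads are within the accepted lag window. */
int64_t resync_read_head(AudioHeads *h)
{
    assert(h->write_pos >= h->read_pos);
    const int64_t step    = us_to_samples(1660);  /* 1.66 ms */
    const int64_t max_lag = us_to_samples(50000); /* 50 ms   */
    const int64_t min_lag = us_to_samples(3330);  /* 3.33 ms */
    int64_t lag = h->write_pos - h->read_pos;

    if (lag > max_lag) { h->read_pos += step; return step; }
    if (lag < min_lag) { h->read_pos -= step; return -step; }
    return 0;
}
```

At 44100Hz each correction is 73 samples, which stays below the 2.5ms threshold at which resyncs become noticeable.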

Here is a demonstration of how the sound improved from version 5 to version 6:

Sound improvements in version 14

In version 14, I solved an issue which existed from the beginning: crackling sustained tones. Due to the above-described resynchronization strategy, sustained tones (such as the lead tone when loading from tape, or the BEEP command in BASIC) crackle, because of the frequent (albeit tiny) resynchronizations.

If I changed the resynchronization strategy to allow more latency (lag) offset by larger resynchronizations, sustained tones sounded good, but resynchronizations were very noticeable when they did happen.

In version 14, the solution I implemented varies between an eager strategy (low lag, tiny resynchronizations) and a lazy strategy (high lag, large resynchronizations). The discriminant is the shape of the sound.

Upon collection, sound samples are analyzed to see if they represent a sustained tone. In this case, rising or falling edges are expected to exhibit a fixed period (equally-spaced), and thus, frequency. When this occurs, zxian chooses the lazy strategy.

During intervals of varying frequencies and silence, zxian chooses the eager strategy.
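A minimal sketch of such a detector is shown below. It is an assumption about how the analysis could be done, not zxian's implementation: it looks only at rising edges and treats the sound as a sustained tone when enough of them occur with (nearly) equal spacing. The function name and the min_edges and tolerance parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true when the samples contain at least min_edges rising edges and
   the spacing between consecutive rising edges never differs from the first
   measured spacing by more than tolerance samples. Silence has no edges and
   therefore never counts as a sustained tone. */
bool is_sustained_tone(const int16_t *samples, size_t count,
                       size_t min_edges, size_t tolerance)
{
    assert(samples != NULL || count == 0);
    size_t edges = 0, last_edge = 0, first_period = 0;

    for (size_t i = 1; i < count; i++) {
        bool rising = samples[i - 1] <= 0 && samples[i] > 0;
        if (!rising)
            continue;
        if (edges == 1) {
            first_period = i - last_edge;      /* reference period */
        } else if (edges > 1) {
            size_t period = i - last_edge;
            size_t diff = period > first_period ? period - first_period
                                                : first_period - period;
            if (diff > tolerance)
                return false;                  /* frequency is changing */
        }
        last_edge = i;
        edges++;
    }
    return edges >= min_edges;
}
```

A caller would pick the lazy strategy when this returns true for the most recently collected samples, and the eager strategy otherwise.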

In practice during gameplay, I've observed a selection which mixes eager and lazy strategies. Fortunately, the extra lag allowed by the lazy strategy is offset by eager resyncs during periods of silence, which are not perceived by the user.

Thus, version 14 loses almost no sound quality during mixed-frequency scenarios (such as regular gameplay) and gains significant sound quality during sustained tones.

Optimizations - reducing host CPU usage by 85% in version 13

Graphics rendering code had remained unchanged since v1, with pixel-by-pixel rendering - which I knew was much slower than it could be.

Likewise, the microcode (the module which executes Z80 instructions) was designed to be clean from an "object oriented" perspective, at the expense of speed.

Here are some of the changes which took place to obtain an 85% reduction in host CPU usage (that is, v13 uses roughly 7 times less CPU than v12):

microcode relying on static memory - Each CPU instruction was previously stored on the heap, via malloc/free. While clean from a design standpoint, this was much slower than simply using a single-instruction "storage area" in static memory, and reusing it for each successive instruction
removal of ROM fetch/decode stage caching - Previously, I thought I'd speed up CPU code by caching the result of the fetch/decode phase, based on the assumption that ROM does not change. Experiments showed that games make so few ROM calls that the cache check overhead was not worth it, so I removed it. I did experiment with applying the same caching to RAM, but the speed increase was not enough to warrant the significant increase in complexity (since RAM can change, clever programs modify themselves, and instructions can be "jumped into" partway)
render only necessary video frames - I noticed CPU usage was the same when staring at the "copyright screen" versus when playing a game. Thus, I came up with a way to determine whether the video frame had changed at all (border, video memory, flash), skipping rendering altogether if nothing had changed. During gameplay, I've observed between 8% and 15% of video frames didn't change
when solid, block-drawn border - Rely on four SDL rectangles to draw the entire screen's border when the border colour remains unchanged throughout an entire video frame. Previously, the border was drawn pixel-by-pixel
when not solid, border via memset - When the border DOES change throughout a video frame, draw it via memset, which is very fast
scanline duplication for zoom - Instead of drawing a little square for each pixel to satisfy zoom (e.g. a 3x3 pixel square when zoom is 3), I changed it to draw a single horizontal line whose pixels were zoomed horizontally - followed by a copy down as many lines as needed (e.g. 2 copied lines for zoom 3)
faster pixel drawing - I sped this up by removing all multiplications used in offset and colour calculations by pre-computing them in lookup tables
pixel caching - I realized that the basic unit of graphics is the horizontal 8-pixel wide line represented by a single byte in the ZX Spectrum's video memory. Thus, I cached all 256 possible renderings of 8-pixel wide lines, for all possible foreground/background colour combinations, for all inverse/flash combinations, for the current zoom. This converted a costly bit-by-bit loop into a single memcpy block operation. I greatly reduced memory usage here by only caching what was needed, since games typically don't use more than 20% of all possible 8-pixel lines.
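The pixel caching idea from the list above can be sketched as follows. This is a simplified illustration and not zxian's code: the names, the fixed zoom of 3, the ARGB pixel format and the palette intensity (0xD7) are assumptions, and bright/flash handling is left out (flash could be handled by swapping ink and paper before the lookup). Entries are filled lazily, so only the 8-pixel lines a game actually uses are ever rendered.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { ZOOM = 3, CELL_WIDTH = 8 * ZOOM }; /* assumption: fixed zoom */

/* Assumed ARGB values for the 8 non-bright Spectrum colours. */
static const uint32_t PALETTE[8] = {
    0xFF000000u, 0xFF0000D7u, 0xFFD70000u, 0xFFD700D7u,
    0xFF00D700u, 0xFF00D7D7u, 0xFFD7D700u, 0xFFD7D7D7u
};

typedef struct {
    uint8_t  valid;              /* has this entry been rendered yet? */
    uint32_t pixels[CELL_WIDTH]; /* one zoomed scanline of an 8-pixel cell */
} CachedCell;

/* One entry per (bitmap byte, ink, paper) combination, filled on demand. */
static CachedCell cache[256][8][8];

static const CachedCell *get_cell(uint8_t bits, unsigned ink, unsigned paper)
{
    assert(ink < 8 && paper < 8);
    CachedCell *c = &cache[bits][ink][paper];
    if (!c->valid) {
        /* Slow path, taken once per combination: expand bit by bit.
           Bit 7 of the byte is the leftmost pixel. */
        for (int px = 0; px < 8; px++) {
            uint32_t colour = (bits & (0x80 >> px)) ? PALETTE[ink]
                                                    : PALETTE[paper];
            for (int z = 0; z < ZOOM; z++)
                c->pixels[px * ZOOM + z] = colour;
        }
        c->valid = 1;
    }
    return c;
}

/* Fast path: drawing an 8-pixel line is a single block copy. */
void draw_cell(uint32_t *dest, uint8_t bits, unsigned ink, unsigned paper)
{
    memcpy(dest, get_cell(bits, ink, paper)->pixels,
           sizeof(uint32_t) * CELL_WIDTH);
}
```

The same zoomed line can then be copied down ZOOM - 1 more times to fill the pixel's full height, as described in the scanline duplication item.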

UI-based zxianui in version 18

To simplify starting zxian and loading a program, I developed zxianui in version 18.

This tiny executable resides in the same directory as zxian and lets the user manipulate zxian's simplest configuration parameters (e.g. tape/snapshot to load, zoom, display mode, etc.) via a UI.

I wrote zxianui in C, relying on no UI libraries. All calls are WIN32 API calls. I chose this approach for 2 reasons: first, I wanted no further dependencies (e.g. .NET Framework, GTK, etc.). Second, I wanted to learn the basics of pure WIN32 programming (windows, events, controls, etc.)

Optimizations - reducing host CPU usage by a further 75% to 90% in version 19

Version 19 contains performance optimizations around video rendering and microcode execution. The purpose was to further reduce the load on the host CPU.

Here are summaries of the changes:

selective rendering - Individual 8-pixel wide segments are now tracked individually and only rendered if they change. Similarly, parts of the border are tracked for changes. This represents a change from the previous way of fully rendering each scanline every frame.
drawing via inline ASM - Low-level rendering routines changed to rely on inline, hand-written assembly language portions. Some functions were re-written to "convince" the compiler to inline them. Certain variable and function argument usage was changed to allow the compiler to rely more on registers and less on memory.
microcode hot path reduction - I've combed through the fetch-execute code path and changed or removed many things, such as inefficient loops, redundant operations, unnecessary data copying, and short circuits which were lazier than they could be.
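The selective rendering item above can be illustrated with a shadow copy of the bitmap area. This is a sketch under assumptions, not zxian's implementation: the names are hypothetical, only the 6144-byte bitmap is tracked, and a complete version would also mark all 8 bitmap bytes of a cell as changed when its attribute byte changes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { BITMAP_SIZE = 6144 }; /* 256 x 192 pixels, 8 pixels per byte */

/* What was on screen after the last render, one byte per 8-pixel segment. */
static uint8_t shadow[BITMAP_SIZE];

/* Hypothetical callback which draws one 8-pixel segment. */
typedef void (*render_fn)(size_t offset, uint8_t value);

/* Compares the emulated bitmap with the shadow copy, renders only the
   segments that differ, and returns how many segments were rendered. */
size_t render_changed_segments(const uint8_t *bitmap, render_fn render)
{
    assert(bitmap != NULL);
    size_t rendered = 0;
    for (size_t i = 0; i < BITMAP_SIZE; i++) {
        if (bitmap[i] != shadow[i]) {
            shadow[i] = bitmap[i];
            if (render)
                render(i, bitmap[i]);
            rendered++;
        }
    }
    return rendered;
}
```

On a mostly static screen the per-frame cost drops to a comparison pass over 6144 bytes plus a handful of segment redraws.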

Above are histograms of zxian's performance over time. The same test was performed multiple times on each version of zxian, along with several unreleased (incremental) builds of v19.

One image shows performance at zoom 3 and the other at zoom 4. That is, the window's width and height were 3 times and 4 times, respectively, those of the ZX Spectrum's screen.

Here are the descriptions:

Dizzy 1 - Idling on the first screen of the game, immediately after the game begins. There are few moving sprites, few changing attributes.
Copyright Screen - Idling on the (c)1982 Sinclair Research Ltd; BASIC bootup message. No screen activity.
1943 - Leave the airplane flying through 2 games without any input in this shoot'em up. There is a high amount of screen activity: scrolling, enemies, bullets, etc.
Skool Daze - Idling through the demo for just over a minute. There are many sprites, occasional scrolling.
Nipper 2 - Idling on Jack the Nipper 2's title screen. There are few, large sprites, music, attribute cycling on text.

Version 20 - leap to 64bit

In version 20, zxian leaves the 32bit world behind. Version 19 will be the last 32bit version.

On this occasion, I was curious to see how efficient zxian was, compared to other ZX Spectrum emulators.

Above are histograms of the host CPU usage of zxian and other emulators. The same test was performed multiple times on each emulator under test.

One image shows performance at zoom 3 and the other at zoom 4. That is, the window's width and height were 3 times and 4 times, respectively, those of the ZX Spectrum's screen.

Here are the descriptions:

Dizzy 1 - Idling on the first screen of the game, immediately after the game begins. There are few moving sprites, few changing attributes.
Copyright Screen - Idling on the (c)1982 Sinclair Research Ltd; BASIC bootup message. No screen activity.
1943 - Leave the airplane flying through 2 games without any input in this shoot'em up. There is a high amount of screen activity: scrolling, enemies, bullets, etc.
Skool Daze - Idling through the demo for just over a minute. There are many sprites, occasional scrolling.
Nipper 2 - Idling on Jack the Nipper 2's title screen. There are few, large sprites, music, attribute cycling on text.
Attributes - A stress test whereby an infinite loop repeatedly writes pseudorandom bytes to the screen attributes area

Version 24 - R register fix

In version 24, I fixed a long-standing issue with how the R register of the Z80 CPU was incrementing. This register is used to refresh DRAM, which would otherwise lose data. The R register increments as the CPU fetches instructions, and its value appears on the CPU's address lines when they're not needed for actual reads or writes.

Many versions ago, I recognized that some games rely on this register for timing (since its increments are deterministic). My previous implementation was partial. All games tested were playable, but some exhibited unexpected behaviour.

A game named Sanxion uses a text fade effect whereby the text disappears pixel-by-pixel, randomly. However, when run in zxian, the fade effect occurred in strips, rather than randomly. I investigated this by analyzing CPU instruction execution counts. This led me to a LD A, R instruction, which was executed just enough times (coinciding with the fade animation) to be interesting.

Following up on the LD A, R finding, I improved the implementation in version 24. Specifically, R is now correctly incremented as a 7bit counter, and not as an 8bit counter. That is, it now goes from x1111111 to x0000000, keeping bit 7 unchanged.
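The 7bit behaviour can be expressed in a couple of lines. The sketch below uses hypothetical function names and is not copied from zxian; the second function assumes R is incremented once per opcode fetch (M1 cycle), so a prefixed instruction advances it more than once.

```c
#include <assert.h>
#include <stdint.h>

/* Increment R as a 7bit counter: bits 0..6 count and wrap around,
   bit 7 keeps whatever value it had. */
uint8_t z80_increment_r(uint8_t r)
{
    return (uint8_t)((r & 0x80) | ((r + 1) & 0x7F));
}

/* Advance R for one instruction, given its number of opcode fetches. */
uint8_t z80_advance_r(uint8_t r, unsigned m1_cycles)
{
    assert(m1_cycles >= 1); /* every instruction has at least one fetch */
    while (m1_cycles--)
        r = z80_increment_r(r);
    return r;
}
```

For example, z80_increment_r(0x7F) yields 0x00 and z80_increment_r(0xFF) yields 0x80, whereas an 8bit counter would produce 0x80 and 0x00 respectively.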

The error modes of games were noteworthy. They exhibited strange and interesting effects when R was not incremented correctly. Here's a summary, categorized by R incrementing strategy used.

R not incremented - Games like Defender of the Crown no longer animate fighting soldiers.
R incremented as 8bit counter, normal frequency - Ping Pong's paddle graphics are corrupted, and gameplay timing is off.
R incremented as 8bit counter, lower frequency - Robin of the Wood's soldiers can disappear or be corrupted; Robin's hits make no sounds; menu sounds are bad.
R incremented as 7bit counter, normal frequency - No side effects observed.