Posted on Jun 2

Creating a Sega Genesis emulator in C++

This article covers the development of a Sega Genesis 16-bit console emulator in C++. A lot of exciting stuff awaits you ahead: emulating the Motorola 68000 CPU, reverse engineering games, OpenGL graphics, shaders, and much more—all using modern C++. The article is packed with images, so even just browsing through them should be fun.

The design of Sega Genesis

The architecture of Sega Genesis (source)

Here's a description of each component in the diagram, in no particular order:

ROM is the cartridge data; its maximum size is 4MB.
VDP (Video Display Processor) is a video controller chip, an ASIC developed by Sega. It has 64KB of VRAM.
The YM2612 is a six-channel FM synthesizer from Yamaha.
PSG Sound is an ASIC from Texas Instruments (SN76489) that synthesizes sound using three square-wave channels. It's required for compatibility with the 8-bit Sega Master System.
The Motorola 68000 processor is the CPU that handles most of the work. It has 64KB of RAM.
Zilog Z80 is an audio co-processor. Its job is to write commands to the YM2612 registers at the right time. It has 8KB of RAM.
Input/Output are controllers. First, there was a "three-button gamepad", then a "six-button" one was added, followed by a dozen rarer devices.

The core component is the Motorola 68000 (m68k), which uses 24-bit addressing in the range 0x000000–0xFFFFFF. This processor handles every memory access via a bus (labeled 68000 BUS in the diagram) that routes the address to the appropriate device. You can see the address mapping here.

This article covers the emulation of all components except Z80 and the sound component.

Motorola 68000 emulation

Some facts about m68k

The m68k processor used to be popular: for decades, Macintosh, Amiga, and Atari computers leveraged it, as well as the Sega Genesis console and other devices.

The processor architecture already has some 32-bit elements, but with limitations.

There are 16 32-bit registers in total (and one 16-bit register). Although the "address" registers (A0–A7) are 32-bit, only the 24 low-order bits are used for addressing. In other words, the addressable memory space is 16 megabytes.
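As a minimal sketch of what the 24-bit limit means in practice, the helper below masks a 32-bit register value down to the bits that reach the bus. The name `bus_address` is hypothetical and isn't taken from the emulator's sources:

```cpp
#include <cstdint>

// Hypothetical helper: keep only the 24 low-order bits that the m68k
// actually puts on the address bus.
constexpr std::uint32_t bus_address(std::uint32_t reg_value) {
  return reg_value & 0x00FFFFFFu;
}

// Register values that differ only in the top byte hit the same address.
static_assert(bus_address(0xFFFF0000u) == 0x00FF0000u);
static_assert(bus_address(0x00FF0000u) == bus_address(0xABFF0000u));

// The addressable space is 2^24 bytes, i.e. 16 megabytes.
static_assert(0x00FFFFFFu + 1u == 16u * 1024u * 1024u);
```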

The processor supports basic virtualization features for multitasking systems. The access to the A7 register is actually an access to either the user stack pointer (USP) or the supervisor stack pointer (SSP), depending on the status register flag.

Unlike (almost all) modern architectures, m68k adheres to the big-endian byte order. The address and size of an instruction are always divisible by two. With a few exceptions, we can read memory only from an address that is divisible by two as well. Floating-point arithmetic is not supported.

The m68k instruction table (source)

Registers of m68k

Let's create basic types:

    using Byte = uint8_t;
    using Word = uint16_t;
    using Long = uint32_t;
    using LongLong = uint64_t;
    using AddressType = Long;

A class for working with big-endian:

The BigEndian class

Since the m68k architecture adheres to big-endian order, changing the byte order is often necessary (assuming our computer uses x86_64 or ARM, which use little-endian by default). To do this, let's create a type:

    template<typename T>
    class BigEndian {
    public:
      T get() const {
        return std::byteswap(value_);
      }

    private:
      T value_;
    };

Then, if we need to retrieve a Word value from an array, we can do this:

    const auto* array_ptr = reinterpret_cast<const BigEndian<Word>*>(data_ptr);
    // ...
    x -= array_ptr[index].get();

Since the processor is constantly writing something to or reading from memory, appropriate entities are required (i.e., what to write and where to write it).

The DataView and MutableDataView classes

The best way to do this is to use std::span, which is a pointer to data plus its size. For the immutable version, it's still a good idea to add a helper so that we can call .as<Word>() and so on:

    using MutableDataView = std::span<Byte>;

    class DataView : public std::span<const Byte> {
    public:
      using Base = std::span<const Byte>;
      using Base::Base;

      template<std::integral T>
      T as() const {
        return std::byteswap(*reinterpret_cast<const T*>(data()));
      }
    };

Let's create a type for the m68k registers. An object of this type fully describes the state of the CPU, independent of memory:

The structure of Registers

    struct Registers {
      /**
       * Data registers D0 - D7
       */
      std::array<Long, 8> d;

      /**
       * Address registers A0 - A6
       */
      std::array<Long, 7> a;

      /**
       * User stack pointer
       */
      Long usp;

      /**
       * Supervisor stack pointer
       */
      Long ssp;

      /**
       * Program counter
       */
      Long pc;

      /**
       * Status register
       */
      struct {
        // lower byte
        bool carry : 1;
        bool overflow : 1;
        bool zero : 1;
        bool negative : 1;
        bool extend : 1;
        bool : 3;

        // upper byte
        uint8_t interrupt_mask : 3;
        bool : 1;
        bool master_switch : 1;
        bool supervisor : 1;
        uint8_t trace : 2;

        decltype(auto) operator=(const Word& word) {
          *reinterpret_cast<Word*>(this) = word;
          return *this;
        }

        operator Word() const {
          return *reinterpret_cast<const Word*>(this);
        }
      } sr;
      static_assert(sizeof(sr) == sizeof(Word));

      /**
       * The stack pointer register depends on the supervisor flag
       */
      Long& stack_ptr() {
        return sr.supervisor ? ssp : usp;
      }
    };
    static_assert(sizeof(Registers) == 76);

This 76-byte structure fully describes the CPU state.

Error handling

Various errors may happen: an unaligned (non-divisible by 2) program counter address, read address, or write address; an unknown instruction; or an attempt to write to a protected address space.

I decided to handle errors without using exceptions (try/throw/catch). Usually, I don't mind standard exceptions, but this approach makes debugging a bit more convenient.

So, let's create a class for errors:

The Error class

    class Error {
    public:
      enum Kind {
        // no error
        Ok,

        UnalignedMemoryRead,
        UnalignedMemoryWrite,
        UnalignedProgramCounter,
        UnknownAddressingMode,
        UnknownOpcode,

        // permission error
        ProtectedRead,
        ProtectedWrite,

        // bus error
        UnmappedRead,
        UnmappedWrite,

        // invalid action
        InvalidRead,
        InvalidWrite,
      };

      Error() = default;
      Error(Kind kind, std::string what) : kind_{kind}, what_{std::move(what)} {}

      Kind kind() const { return kind_; }
      const std::string& what() const { return what_; }

    private:
      Kind kind_{Ok};
      std::string what_;
    };

A member function that may fail must now have a return type of std::optional<Error>.

If the member function can either fail or return an object of type T, its return type must be std::expected<T, Error>. This class template appeared in C++23 and suits this approach well.
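To illustrate the std::optional<Error> half of this convention, here is a small sketch of how an error is created and propagated to the caller without exceptions. The Error struct is a simplified stand-in for the class above, and `check_aligned_read` and `read_two_words` are hypothetical names used only for this example:

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Simplified stand-in for the article's Error class (two kinds only).
struct Error {
  enum Kind { UnalignedMemoryRead, UnalignedMemoryWrite };
  Kind kind;
  std::string what;
};

// Hypothetical check: Word and Long accesses must use an even address.
std::optional<Error> check_aligned_read(std::uint32_t addr) {
  if (addr % 2 != 0) {
    return Error{Error::UnalignedMemoryRead,
                 "unaligned read at address " + std::to_string(addr)};
  }
  return std::nullopt;  // success: there is no error object
}

// The caller propagates the first error upwards instead of throwing.
std::optional<Error> read_two_words(std::uint32_t addr) {
  if (auto err = check_aligned_read(addr)) {
    return err;
  }
  if (auto err = check_aligned_read(addr + 2)) {
    return err;
  }
  return std::nullopt;
}
```

An empty optional means success here, so every call site reads as "if there is an error, return it", which is the shape the rest of the emulator code follows.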

Memory read/write interface

As mentioned in the section on Sega Genesis architecture, the semantics of reading from or writing to addresses can differ depending on the address. To abstract the behavior in terms of m68k, we'll create the Device class:

    class Device {
    public:
      // reads `data.size()` bytes from address `addr`
      [[nodiscard]] virtual std::optional<Error> read(AddressType addr, MutableDataView data) = 0;

      // writes `data.size()` bytes to address `addr`
      [[nodiscard]] virtual std::optional<Error> write(AddressType addr, DataView data) = 0;

      // ....
    };

The expected behavior is clear from the comments. We'll add the Byte, Word, and Long read/write helpers to this class.

    template<std::integral T>
    std::expected<T, Error> read(AddressType addr) {
      T data;
      auto err = read(addr, MutableDataView{reinterpret_cast<Byte*>(&data), sizeof(T)});
      if (err) {
        return std::unexpected{std::move(*err)};
      }
      // swap bytes after reading to make it little-endian
      return std::byteswap(data);
    }

    template<std::integral T>
    [[nodiscard]] std::optional<Error> write(AddressType addr, T value) {
      // swap bytes before writing to make it big-endian
      const auto swapped = std::byteswap(value);
      return write(addr, DataView{reinterpret_cast<const Byte*>(&swapped), sizeof(T)});
    }

The m68k execution context

The execution context of m68k is registers plus memory:

    struct Context {
      Registers& registers;
      Device& device;
    };

The m68k operand representation

Each instruction has 0 to 2 operands, aka targets. There are a lot of ways they can point to an address in memory or a register. The operand class has variables like these:

    Kind kind_;       // one of 12 addressing types (the addressing mode)
    uint8_t index_;   // the "index" value for index addressing types
    Word ext_word0_;  // the first extension word
    Word ext_word1_;  // the second extension word
    Long address_;    // the "address" value for addressable addressing types

There are also two or three other variables. I stayed within 24 bytes.

This class has read/write member functions:

    [[nodiscard]] std::optional<Error> read(Context ctx, MutableDataView data);
    [[nodiscard]] std::optional<Error> write(Context ctx, DataView data);

You can see the implementation here: lib/m68k/target/target.h.

The most complex addressing types were Address with Index and Program Counter with Index. This is how their address is evaluated:

Target::indexed_address

    Long Target::indexed_address(Context ctx, Long baseAddress) const {
      const uint8_t xregNum = bits_range(ext_word0_, 12, 3);
      const Long xreg = bit_at(ext_word0_, 15) ? a_reg(ctx.registers, xregNum) : ctx.registers.d[xregNum];
      const Long size = bit_at(ext_word0_, 11) ? /*Long*/ 4 : /*Word*/ 2;
      const Long scale = scale_value(bits_range(ext_word0_, 9, 2));
      const SignedByte disp = static_cast<SignedByte>(bits_range(ext_word0_, 0, 8));

      SignedLong clarifiedXreg = static_cast<SignedLong>(xreg);
      if (size == 2) {
        clarifiedXreg = static_cast<SignedWord>(clarifiedXreg);
      }
      return baseAddress + disp + clarifiedXreg * scale;
    }

The m68k instruction representation

The instruction class includes the following variables:

    Kind kind_;       // one of 82 opcodes
    Size size_;       // Byte, Word, or Long
    Condition cond_;  // one of 16 conditions for branch instructions
    Target src_;      // a source operand
    Target dst_;      // a destination operand

There are also two or three other variables. I got the class down to a total of 64 bytes.

Parsing the m68k instructions

The instruction class has a static member function that parses the current instruction.

    static std::expected<Instruction, Error> decode(Context ctx);

You can see its implementation here: lib/m68k/instruction/decode.cpp

To avoid copy-pasting a bunch of "error" checks, I used the following macros:

    #define READ_WORD_SAFE                    \
      const auto word = read_word();          \
      if (!word) {                            \
        return std::unexpected{word.error()}; \
      }

I also check the opcode against bit patterns written in an easy-to-read format:

The HAS_PATTERN macro

Functions for calculating a mask:

    consteval Word calculate_mask(std::string_view pattern) {
      Word mask{};
      for (const char c : pattern) {
        if (c != ' ') {
          mask = (mask << 1) | ((c == '0' || c == '1') ? 1 : 0);
        }
      }
      return mask;
    }

    consteval Word calculate_value(std::string_view pattern) {
      Word mask{};
      for (const char c : pattern) {
        if (c != ' ') {
          mask = (mask << 1) | ((c == '1') ? 1 : 0);
        }
      }
      return mask;
    }

The HAS_PATTERN macro:

    #define HAS_PATTERN(pattern) \
      ((*word & calculate_mask(pattern)) == calculate_value(pattern))

And then we have this, for example:

    if (HAS_PATTERN("0000 ...1 ..00 1...")) {
      // this is MOVEP
      // ...
    }

The code above checks whether the bits in the opcode satisfy the pattern. In other words, it checks whether the bits at the positions without a dot have the required value of 0 or 1. In our case, this is the pattern for the MOVEP opcode.

This is as fast as writing the mask and value by hand: consteval guarantees that both calls are evaluated at compile time.
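To make the arithmetic concrete, the sketch below recomputes the mask and value for the MOVEP pattern and checks two sample opcodes. It restates the helpers as constexpr functions under hypothetical names (`pattern_mask`, `pattern_value`, `matches`) so that the block stands alone; the sample opcodes 0x0308 and 0x0300 were picked only for illustration:

```cpp
#include <cstdint>
#include <string_view>

// A '0' or '1' in the pattern marks a bit that must be compared.
constexpr std::uint16_t pattern_mask(std::string_view pattern) {
  std::uint16_t mask = 0;
  for (const char c : pattern) {
    if (c != ' ') {
      mask = static_cast<std::uint16_t>((mask << 1) | ((c == '0' || c == '1') ? 1 : 0));
    }
  }
  return mask;
}

// A '1' in the pattern marks a bit that must be set.
constexpr std::uint16_t pattern_value(std::string_view pattern) {
  std::uint16_t value = 0;
  for (const char c : pattern) {
    if (c != ' ') {
      value = static_cast<std::uint16_t>((value << 1) | ((c == '1') ? 1 : 0));
    }
  }
  return value;
}

constexpr bool matches(std::uint16_t opcode, std::string_view pattern) {
  return (opcode & pattern_mask(pattern)) == pattern_value(pattern);
}

// "0000 ...1 ..00 1..." -> compare bits 1111 0001 0011 1000,
//                          expect bits  0000 0001 0000 1000.
static_assert(pattern_mask("0000 ...1 ..00 1...") == 0xF138);
static_assert(pattern_value("0000 ...1 ..00 1...") == 0x0108);

static_assert(matches(0x0308, "0000 ...1 ..00 1..."));   // fits the MOVEP pattern
static_assert(!matches(0x0300, "0000 ...1 ..00 1..."));  // bit 3 is clear
```

Because everything is checked with static_assert, a mistyped pattern in this style fails at compile time rather than at run time.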

Executing the m68k instructions

The instruction class has a member function that executes the instruction. During execution, registers change, and memory may be accessed:

    [[nodiscard]] std::optional<Error> execute(Context ctx);

You can see its implementation here: lib/m68k/instruction/execute.cpp. This is the most complex code in the emulator.

You can find a description of the instructions in this markdown documentation. If that isn't enough, you can read the extensive description in this book.

Writing instruction emulation is an iterative process. Creating every instruction is difficult at first, but as more patterns and common code accumulate, it becomes easier.

There are some obnoxious instructions, such as MOVEP, and also BCD arithmetic instructions, such as ABCD. In BCD arithmetic, hexadecimal numbers are treated as decimal numbers. For example, the BCD addition looks like this: 0x678 + 0x535 = 0x1213. I spent over four hours working on these BCD instructions because their logic is extremely complex and not explained properly anywhere.
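The example above can be reproduced with a byte-wise sketch. It assumes both inputs contain only valid decimal digits (0–9 in each nibble) and ignores the status flags and the behavior for invalid digits, which are exactly the parts that make the real ABCD instruction hard. The names `BcdSum` and `bcd_add_byte` are hypothetical:

```cpp
#include <cstdint>

struct BcdSum {
  std::uint8_t value;  // two packed decimal digits
  bool carry;          // decimal carry out of the high digit
};

// Adds two packed-BCD bytes plus an incoming carry.
// Only meaningful when every nibble of `a` and `b` is in the range 0..9.
constexpr BcdSum bcd_add_byte(std::uint8_t a, std::uint8_t b, bool carry_in) {
  int lo = (a & 0x0F) + (b & 0x0F) + (carry_in ? 1 : 0);
  int hi = (a >> 4) + (b >> 4);
  if (lo > 9) {  // decimal carry from the low digit into the high digit
    lo -= 10;
    hi += 1;
  }
  bool carry = false;
  if (hi > 9) {  // decimal carry out of the byte
    hi -= 10;
    carry = true;
  }
  return BcdSum{static_cast<std::uint8_t>((hi << 4) | lo), carry};
}

// 0x0678 + 0x0535, processed from the low byte to the high byte:
static_assert(bcd_add_byte(0x78, 0x35, false).value == 0x13);  // 78 + 35 = 113
static_assert(bcd_add_byte(0x78, 0x35, false).carry);
static_assert(bcd_add_byte(0x06, 0x05, true).value == 0x12);   // 6 + 5 + 1 = 12
static_assert(!bcd_add_byte(0x06, 0x05, true).carry);
```

Chaining the two bytes gives 0x1213, matching the decimal sum 678 + 535 = 1213.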

Testing the m68k emulator

Testing is the most important part. Even a small error in a status flag can lead to disasters during emulation. Large applications are prone to unexpected breakdowns, so developers need to test all instructions.

The tests in this repository have been very helpful. There are over 8,000 tests for each instruction, covering every possible case. The total number of tests is just over a million.

They can detect even the slightest error. Often, about 20 out of the 8,000 tests for an instruction fail.

For example, the MOVE (A6)+, (A6)+ instruction (the A6 register is accessed with a post-increment in both operands) turned out not to work the way I had implemented it, so I added a workaround to make it behave properly.

The emulator operates correctly most of the time now. No more than ten tests fail in isolated cases when there's a bug in the tests or another issue.

Emulating C++ programs

You can emulate your own programs. Let's write a simple program that reads two numbers and writes all the values within that range in a loop:

    void work() {
      int begin = *(int*)0xFF0000;
      int end = *(int*)0xFF0004;
      for (int i = begin; i <= end; ++i) {
        // if we don't write "volatile",
        // the compiler optimizes the loop into a single write!
        *(volatile int*)0xFF0008 = i;
      }
    }

Both GCC and Clang can cross-compile code for the m68k architecture. Let's do it with Clang (the a.cpp file is compiled into a.o):

    clang++ a.cpp -c --target=m68k -O3

You can view the object file assembly code using the following command. Note that you will most likely need to install the binutils-m68k-linux-gnu package first:

    m68k-linux-gnu-objdump -d a.o

This assembly code will be displayed.

This object file is packaged in ELF format, so we need to unpack it. Let's extract the machine code (the .text section) to the a.bin file:

    m68k-linux-gnu-objcopy -O binary --only-section=.text a.o a.bin

The hd a.bin command lets us verify that the correct bytes were extracted.

We can now emulate this code. The emulator code is here, and the emulation logs are here. In this example, the numbers from 1307 to 1320 are written at the 0xFF0008 address.

More emulation: The Sieve of Eratosthenes

In the next program, I had to tinker with compilers. Using the sieve of Eratosthenes, I calculated prime numbers up to 1,000.

This required an array filled with zeros. For the regular bool notPrime[N+1] = {0} declaration, the compilers tried to call the memset function from the standard library. This should be avoided since no libraries are linked. As a result, the code looked like this:

    void work() {
      constexpr int N = 1000;

      // avoiding calling "memset" -_-
      volatile bool notPrime[N + 1];
      for (int i = 0; i <= N; ++i) {
        notPrime[i] = 0;
      }

      for (int i = 2; i <= N; ++i) {
        if (notPrime[i]) {
          continue;
        }
        *(volatile int*)0xFF0008 = i;
        for (int j = 2 * i; j <= N; j += i) {
          notPrime[j] = true;
        }
      }
    }

And it is built using GCC (with the g++-m68k-linux-gnu package):

    m68k-linux-gnu-g++ a.cpp -c -O3

This is what the assembly code looks like, and this is what the emulator output looks like.

Non-trivial programs are difficult to emulate because the environment is too synthetic. For example, there are two issues with writing a string in a program like this:

    void work() {
      strcpy((char*)0xFF0008, "Der beste Seemann war doch ich");
    }

The first issue is calling a function that isn't linked into the object file. The second issue is the string, the location of which in memory is still unknown.

With enough effort, you can emulate Linux for m68k. QEMU can do it!

The ROM-file format

I use ImHex to analyze unknown formats and protocols so that I can better understand their content.

Imagine that you have downloaded the ROM file of your favorite childhood game. A Google search of the ROM format reveals that the first 256 bytes are occupied by the m68k vector table. It contains addresses for various cases, such as division by zero. The next 256 bytes contain the ROM header with information about the game.
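As a quick illustration of that layout, the sketch below reads the first two entries of the vector table, the initial stack pointer and the initial program counter, as big-endian 32-bit values. The names `read_be32`, `EntryVectors`, and `read_entry_vectors` are hypothetical, and the code assumes the ROM has already been loaded into a byte vector that is at least 8 bytes long:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Reads a big-endian 32-bit value at `offset`.
// The caller guarantees that offset + 4 <= rom.size().
std::uint32_t read_be32(const std::vector<std::uint8_t>& rom, std::size_t offset) {
  return (std::uint32_t{rom[offset]} << 24) |
         (std::uint32_t{rom[offset + 1]} << 16) |
         (std::uint32_t{rom[offset + 2]} << 8) |
         std::uint32_t{rom[offset + 3]};
}

struct EntryVectors {
  std::uint32_t initial_sp;  // vector 0, file offset 0x00
  std::uint32_t initial_pc;  // vector 1, file offset 0x04
};

EntryVectors read_entry_vectors(const std::vector<std::uint8_t>& rom) {
  return EntryVectors{read_be32(rom, 0x00), read_be32(rom, 0x04)};
}
```

Assembling the value byte by byte avoids both the byte swap and any alignment concerns, at the cost of a few more shifts.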

Let's draft a hex pattern using the internal ImHex language for parsing binary files and look at the contents:

The sega.hexpat pattern

The be part before the type means big-endian:

    struct AddressRange {
        be u32 begin;
        be u32 end;
    };

    struct VectorTable {
        be u32 initial_sp;
        be u32 initial_pc;
        be u32 bus_error;
        be u32 address_error;
        be u32 illegal_instruction;
        be u32 zero_divide;
        be u32 chk;
        be u32 trapv;
        be u32 privilege_violation;
        be u32 trace;
        be u32 line_1010_emulator;
        be u32 line_1111_emulator;
        be u32 hardware_breakpoint;
        be u32 coprocessor_violation;
        be u32 format_error;
        be u32 uninitialized_interrupt;
        be u32 reserved_16_23[8];
        be u32 spurious_interrupt;
        be u32 autovector_level_1;
        be u32 autovector_level_2;
        be u32 autovector_level_3;
        be u32 hblank;
        be u32 autovector_level_5;
        be u32 vblank;
        be u32 autovector_level_7;
        be u32 trap[16];
        be u32 reserved_48_63[16];
    };

    struct RomHeader {
        char system_type[16];
        char copyright[16];
        char title_domestic[48];
        char title_overseas[48];
        char serial_number[14];
        be u16 checksum;
        char device_support[16];
        AddressRange rom_address_range;
        AddressRange ram_address_range;
        char extra_memory[12];
        char modem_support[12];
        char reserved1[40];
        char region[3];
        char reserved2[13];
    };

    struct Rom {
        VectorTable vector_table;
        RomHeader rom_header;
    };

    Rom rom @ 0x00;

Picture N4 – ImHex "parsed" the beginning of the file

We can also disassemble any number of instructions starting with initial_pc (the entry point) to see what happens in the first instructions:

Picture N5 – The disassembler in ImHex

Once everything is clear, we can convert the structures from the hex pattern to C++. The example is here (I've removed unnecessary data members): lib/sega/rom_loader/rom_loader.h.

Unlike many other formats where headers aren't an integral part of the content, the 512-byte header in ROM files is essential. This means that the ROM file needs to be loaded into memory as a whole. According to the address mapping, the 0x000000–0x3FFFFF area is assigned to it.

A bus device

To improve address mapping, we can implement BusDevice as a child class of Device and have it redirect write and read commands to a more accurate device:

    class BusDevice : public Device {
    public:
      struct Range {
        AddressType begin;
        AddressType end;
      };
      void add_device(Range range, Device* device);

      /* ... more `read` and `write` override methods */

    private:
      struct MappedDevice {
        const Range range;
        Device* device;
      };
      std::vector<MappedDevice> mapped_devices_;
    };

An object of this class is fed to the m68k emulator. The full implementation is here: lib/sega/memory/bus_device.h.
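The dispatch step itself can be sketched as a linear search over the registered ranges. This is a simplified illustration rather than the code from bus_device.h: it uses an integer `device_id` instead of a `Device*`, assumes both ends of a range are inclusive, and the names `Mapping` and `find_mapping` are hypothetical:

```cpp
#include <cstdint>
#include <vector>

using AddressType = std::uint32_t;

struct Range {
  AddressType begin;  // first address covered (inclusive)
  AddressType end;    // last address covered (inclusive, an assumption)
};

struct Mapping {
  Range range;
  int device_id;  // stand-in for the Device* pointer
};

// Returns the mapping that covers `addr`, or nullptr when nothing is mapped
// there (the caller would then report an unmapped read or write).
const Mapping* find_mapping(const std::vector<Mapping>& mappings, AddressType addr) {
  for (const Mapping& m : mappings) {
    if (m.range.begin <= addr && addr <= m.range.end) {
      return &m;
    }
  }
  return nullptr;
}
```

With only a handful of devices on the bus, a linear scan is usually fast enough; a sorted vector with binary search would be the next step if profiling showed otherwise.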

GUI

Initially, the emulation output was displayed only in the terminal, and control was also performed through the terminal. However, this is inconvenient for the emulator, so moving everything to the GUI is necessary.

I used the mega cool ImGui library for the GUI. It's feature-rich, allowing developers to create any interface they want.

Picture N6 – The window example: the m68k emulator status

This makes it possible to display the whole state of the emulator in separate windows, which makes debugging much easier.

Working in Docker

To avoid issues with outdated operating systems (when all packages are obsolete, and even modern C++ can't compile) and to prevent cluttering a PC with third-party packages, it's better to develop under Docker.

First, we create a Dockerfile, and then we rebuild the image whenever it changes.

    sudo docker build -t segacxx .

Then, we enter the container with the directory mounts (-v) and other necessary parameters:

    sudo docker run --privileged \
      -v /home/eshulgin:/usr/src \
      -v /home/eshulgin/.config/nvim:/root/.config/nvim \
      -v /home/eshulgin/.local/share/nvim:/root/.local/share/nvim \
      -v /tmp/.X11-unix:/tmp/.X11-unix \
      -e DISPLAY=unix${DISPLAY} \
      -it \
      segacxx

Pitfalls:

By default, the container has no access to the host's GUI. After some research, I modified the command to include the -v mount for X11 and the -e DISPLAY parameter.
Also, the GUI won't work unless the xhost + command is run on the host PC to disable access control.
To access controllers (see the section below for details), I added --privileged to the command.

Picture N7 – NeoVim running under the docker container

Reverse engineering games in Ghidra

Let's say we configured the m68k emulation via ROM. We read some documentation and connected some basic devices to the bus, such as ROM, RAM, the trademark register, etc. Then, we emulated one instruction at a time while looking at the disassembler.

It's a painful endeavor, and we want to get a higher-level picture. We can reverse engineer a game to do this. I use Ghidra for that:

Picture N8 – Reverse engineering a game for Sega Genesis

A plugin created by @DrMefistO helps get started. It marks well-known addresses and creates segments.

As you can see, since games were originally written in assembly language, they have a specific look.

Code and data are mixed: there's a code snippet, then there are byte fragments, e.g. for color, then more code, and so on. It's all the von Neumann architecture.

In m68k assembly, a function's stack frame is conventionally created with LINK and released with UNLK. In reality, this is almost never the case: in most functions, arguments are passed via semi-random registers. Some functions place the result in a status register flag (e.g., in ZF). Fortunately, in Ghidra, one can manually specify what the function does in such cases, enabling the decompiler to display more accurate output. There's also a "switch" of functions, where several functions share the same body and only the first few instructions differ. An example is in the screenshot:

Picture N9 – A "switch" of functions

To get a general idea of what's going on and create a more accurate Sega emulator, we don't need to reverse engineer the entire game—5-10% is enough. It's better to reverse engineer a game that you remember well from your childhood so that it's not a "black box."

This skill will come in handy in the future when it comes to quickly debugging emulation failures in other games.

Emulating interrupts

Let's say we have some basic functional emulation configured. We run the emulator, and, as expected, it goes into an endless loop. After reverse engineering a code fragment, we discover that a flag in RAM is zeroed, and then the loop spins for as long as the flag remains zero:

Picture N10 – The reverse-engineered WaitVBLANK function

We check the other places where this flag is accessed and see that one of them is in the VBLANK interrupt handler. Let's reverse engineer VBLANK:

Picture N11 – The reverse-engineered VBLANK function

Have you heard of the legendary VBLANK and its popular grandson, HBLANK?

Depending on whether the standard is NTSC or PAL/SECAM, the video controller renders a frame on an old TV pixel by pixel, 60 or 50 times per second.

Frame rendering (source)

The HBLANK interrupt triggers when the current line is drawn and the beam moves to the next line (the green lines in the picture above). On a real console, only 18 bytes can physically be sent to the video memory during this time, though I don't set such a limit in the emulator, and not all games use this interrupt.

The VBLANK interrupt triggers when the entire frame is rendered, and the beam returns to the beginning of the screen (the blue line). A maximum of 7 kilobytes of data can be sent to the video memory during this time.

Let's say we hardcoded the use of NTSC (60 FPS). To trigger the interrupt, we need to embed a check into the instruction execution loop that verifies whether these conditions are met:

The VBLANK interrupt is enabled by the video processor.
The Interrupt Mask value in the status register is less than six, the priority level of the VBLANK interrupt.
At least 1/60 of a second has passed since the previous interrupt.

If so, we jump to the function. It looks like this:

    std::optional<Error> InterruptHandler::call_vblank() {
      // push PC (4 bytes)
      auto& sp = registers_.stack_ptr();
      sp -= 4;
      if (auto err = bus_device_.write(sp, registers_.pc)) {
        return err;
      }

      // push SR (2 bytes)
      sp -= 2;
      if (auto err = bus_device_.write(sp, Word{registers_.sr})) {
        return err;
      }

      // make supervisor, set priority mask, jump to VBLANK
      registers_.sr.supervisor = 1;
      registers_.sr.interrupt_mask = VBLANK_INTERRUPT_LEVEL;
      registers_.pc = vblank_pc_;

      return std::nullopt;
    }

The full code is here: lib/sega/executor/interrupt_handler.cpp.
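The three conditions that gate this call can be sketched as a single predicate. This is an illustration under stated assumptions, not the code from interrupt_handler.cpp: the name `should_fire_vblank`, its parameters, and the use of std::chrono::steady_clock are all hypothetical, and the level of 6 is taken from the "less than six" rule above:

```cpp
#include <chrono>
#include <cstdint>

using Clock = std::chrono::steady_clock;

constexpr std::uint8_t kVblankLevel = 6;  // priority level assumed for VBLANK
constexpr auto kFramePeriod = std::chrono::microseconds{1'000'000 / 60};  // NTSC

// Returns true when all three conditions hold:
//  1. the VDP has the VBLANK interrupt enabled,
//  2. the interrupt mask in SR is below the VBLANK level,
//  3. at least one frame period has passed since the previous interrupt.
bool should_fire_vblank(bool vdp_vblank_enabled, std::uint8_t interrupt_mask,
                        Clock::time_point previous, Clock::time_point now) {
  return vdp_vblank_enabled && interrupt_mask < kVblankLevel &&
         (now - previous) >= kFramePeriod;
}
```

Passing the timestamps in as arguments keeps the predicate free of hidden state, which makes it easy to test without waiting for real time to pass.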

The way games run revolves around this interrupt; it's a sort of game engine.

We also need to configure the GUI to re-render the screen when the VBLANK interrupt is received.

Video Display Processor

Video Display Processor (aka VDP) is the second most complex emulator component after m68k. To better understand how it works, I recommend checking out these websites:

Plutiedev is not just about VDP but about programming for Sega Genesis in general. There are many insights into how pseudo-float and other math are implemented in games.
Raster Scroll is an awesome description of VDP with lots of pictures. I suggest reading them just for the fun of it.

This processor has 24 registers responsible for various tasks and 64 kilobytes of VRAM for storing graphics information.

The m68k processor stores data in VRAM and can also change registers. This process mostly occurs during VBLANK. The VDP then renders an image on the TV based on the sent data. That's it—it doesn't do anything else.

VDP has a pretty complicated color system. Four palettes are active at any given time, each containing 16 colors. Each color occupies nine bits (three bits per R/G/B, for a total of 512 unique colors).

The first color in the palette is always transparent, so there are actually 15 colors available in the palette plus transparency.
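As a rough sketch, a color entry can be expanded to 8 bits per channel as shown below. Two things here are assumptions rather than statements from this article: the bit layout of the 16-bit color word (blue in bits 11–9, green in bits 7–5, red in bits 3–1) and the simple linear scaling, which only approximates what a real console outputs. The names `Rgb`, `expand3`, and `decode_color` are hypothetical:

```cpp
#include <cstdint>

struct Rgb {
  std::uint8_t r, g, b;
};

// Expands a 3-bit channel (0..7) to 8 bits (0..255) linearly.
constexpr std::uint8_t expand3(unsigned channel) {
  return static_cast<std::uint8_t>(channel * 255u / 7u);
}

// Assumed color word layout: ---- BBB- GGG- RRR-
constexpr Rgb decode_color(std::uint16_t color_word) {
  return Rgb{expand3((color_word >> 1) & 7u),
             expand3((color_word >> 5) & 7u),
             expand3((color_word >> 9) & 7u)};
}

// White: all three channels at their maximum.
static_assert(decode_color(0x0EEE).r == 255);
static_assert(decode_color(0x0EEE).g == 255);
static_assert(decode_color(0x0EEE).b == 255);

// Pure red under the assumed layout.
static_assert(decode_color(0x000E).r == 255);
static_assert(decode_color(0x000E).g == 0);

// 8 levels per channel give 8 * 8 * 8 = 512 colors.
static_assert(8 * 8 * 8 == 512);
```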

In VDP, the basic unit is a tile, which is an 8x8 pixel square. The trick is that each pixel doesn't specify a color, but the color's index in the palette. So, it takes four bits per pixel (a value ranging from 0 to 15), for a total of 32 bytes per tile. You may ask, "Where's the palette number specified?" Well, it isn't specified in a tile, but in a higher-level plane (or sprite) entity.
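The 32-byte figure and the per-pixel lookup can be sketched as follows. The sketch assumes that rows are stored top to bottom, four bytes per row, with the left pixel of each pair in the high nibble; the names `Tile` and `palette_index` are hypothetical:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kTileSide = 8;
constexpr std::size_t kBytesPerTile = kTileSide * kTileSide / 2;  // 4 bits per pixel
static_assert(kBytesPerTile == 32);

using Tile = std::array<std::uint8_t, kBytesPerTile>;

// Returns the palette index (0..15) of pixel (x, y) inside a tile.
// Assumed layout: 4 bytes per row, left pixel of each pair in the high nibble.
std::uint8_t palette_index(const Tile& tile, std::size_t x, std::size_t y) {
  const std::uint8_t byte = tile[y * (kTileSide / 2) + x / 2];
  return static_cast<std::uint8_t>((x % 2 == 0) ? (byte >> 4) : (byte & 0x0F));
}
```

The final color then comes from combining this index with the palette number stored in the plane or sprite entry that references the tile.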

The screen can be 28 or 30 tiles high and 32 or 40 tiles wide.

VDP has two entities called Plane A and Plane B (there's also a Window Plane), which are the foreground and background layers, sized no larger than 64x32 tiles.

The planes can scroll relative to the camera at different rates (e.g., +2 pixels for the foreground and +1 for the background) to create a 3D effect in the game.

For a plane, it's possible to set the shift separately for each row of eight pixel lines or for each individual line to achieve different effects.

The plane defines a list of tiles and specifies a palette for each one. Overall, data for the plane can consume a significant amount of VRAM.

The VDP has the sprite entity, which is a composite chunk of tiles ranging in size from 1x1 to 4x4. For example, there can be 2x4 or 3x2 sprites. It has a position on the screen and a palette that determine how the tiles are rendered. We can mirror the sprite vertically and/or horizontally to avoid duplicating tiles. Many objects are rendered as multiple sprites if one sprite isn't enough.

A VDP can contain a maximum of 80 sprites. Each sprite has the link data member, which is the index of the next sprite to be rendered, so it's like a linked list. The VDP first renders sprite zero, then the sprite to which the link of sprite zero points, and so on until the next link is zero. This ensures the correct sprite depth.

Depending on the circumstances, there's enough memory in the VRAM for 1,400–1,700 tiles. This seems like a decent number, but it's not that much. For example, filling the background with unique tiles would require about 1,100 tiles, leaving no space for anything else. So, the level designers had to reuse tiles heavily.
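Walking that list can be sketched like this. The `Sprite` struct is reduced to the link field, the name `sprite_order` is hypothetical, and the cap on the number of visited entries is a defensive assumption so that a cyclic list in corrupted memory can't hang the renderer:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kMaxSprites = 80;

struct Sprite {
  std::uint8_t link;  // index of the next sprite; 0 terminates the list
};

// Returns sprite indices in the order the list is walked, starting from
// sprite 0. At most kMaxSprites entries are visited to guard against cycles.
std::vector<std::size_t> sprite_order(const std::vector<Sprite>& table) {
  std::vector<std::size_t> order;
  std::size_t current = 0;
  while (order.size() < kMaxSprites && current < table.size()) {
    order.push_back(current);
    const std::size_t next = table[current].link;
    if (next == 0) {
      break;  // end of the list
    }
    current = next;
  }
  return order;
}
```

For a table whose links are 2, 0, 1 (for sprites 0, 1, 2), the walk visits sprites 0, 2, 1 and then stops because sprite 1 links back to zero.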

The VDP has many rules, including two levels of layer prioritization:

Picture N13 – The VDP graphics layer prioritization

It's better to implement VDP rendering iteratively. First, we can render the palettes and check that they change plausibly over time, meaning that the colors are roughly the same as those of the splash screen or main menu:

Picture N14 – A window in GUI, color palettes

Then, we can render all the tiles:

Picture N15 – All tiles in the zero palette and a fully rendered frame

The same tiles in other palettes

Picture N16 – First palette

Picture N17 – Second palette

Picture N18 – Third palette

We can then render planes in individual windows:

Picture N19 – Two separate planes (below) and a fully rendered frame (above)

There's also the window plane, which is rendered a little differently:

Picture N20 – The window plane (right) and a fully rendered frame (left)

Then it's the sprites' turn:

Picture N21 – The beginning of the sprite list (right) and a fully rendered frame (left)

The full implementation of the renderer is here: lib/sega/video/video.cpp.

A frame must be computed pixel by pixel. To make the pixels visible in ImGui, we need to create a 2D OpenGL texture and put every frame in there:

    ImTextureID Video::draw() {
      glBindTexture(GL_TEXTURE_2D, texture_);
      glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, width_ * kTileDimension, height_ * kTileDimension, 0, GL_RGBA,
                   GL_UNSIGNED_BYTE, canvas_.data());
      return texture_;
    }

Testing the VDP renderer

Although we can run the game to see what's rendered, doing so can be inconvenient. It's better to start with interesting cases, collect many dumps, and create a test that uses a single command to generate pictures from the dumps. The git status command then shows which images have changed. This is convenient because we can fix VDP bugs without having to run the emulator.

For this purpose, I added a Save Dump button to the GUI that saves the state of the video memory (VDP registers, VRAM, CRAM, and VSRAM). I saved these dumps in the bin/sega_video_test/dumps directory and wrote a README explaining how to regenerate the images using a single command.

Of course, this works only if the data has been correctly transferred to the video memory (this isn't the case for a couple of the dumps at the link).

The stb_image library is useful for saving images as PNG files.

Retro controller support

Since we aren't taking the easy route, we can support retro controllers that are identical to the Sega ones.

I googled what I could buy nearby and bought a controller for $25:

Picture N22 – The controller

The vendor claimed support for Windows but didn't mention Linux. ImGui, on the other hand, claimed support for Xbox, PlayStation, and Nintendo Switch controllers, so I was ready to reverse engineer the controller as well.

Fortunately, everything worked out. I managed to support the three-button Sega controller by pressing the buttons and seeing what code each one corresponded to:

Keyboard and retro controller mapping

    void Gui::update_controller() {
      static constexpr std::array kMap = {
          // keyboard keys
          std::make_pair(ImGuiKey_Enter, ControllerDevice::Button::Start),
          std::make_pair(ImGuiKey_LeftArrow, ControllerDevice::Button::Left),
          std::make_pair(ImGuiKey_RightArrow, ControllerDevice::Button::Right),
          std::make_pair(ImGuiKey_UpArrow, ControllerDevice::Button::Up),
          std::make_pair(ImGuiKey_DownArrow, ControllerDevice::Button::Down),
          std::make_pair(ImGuiKey_A, ControllerDevice::Button::A),
          std::make_pair(ImGuiKey_S, ControllerDevice::Button::B),
          std::make_pair(ImGuiKey_D, ControllerDevice::Button::C),

          // Retroflag joystick buttons
          std::make_pair(ImGuiKey_GamepadStart, ControllerDevice::Button::Start),
          std::make_pair(ImGuiKey_GamepadDpadLeft, ControllerDevice::Button::Left),
          std::make_pair(ImGuiKey_GamepadDpadRight, ControllerDevice::Button::Right),
          std::make_pair(ImGuiKey_GamepadDpadUp, ControllerDevice::Button::Up),
          std::make_pair(ImGuiKey_GamepadDpadDown, ControllerDevice::Button::Down),
          std::make_pair(ImGuiKey_GamepadFaceDown, ControllerDevice::Button::A),
          std::make_pair(ImGuiKey_GamepadFaceRight, ControllerDevice::Button::B),
          std::make_pair(ImGuiKey_GamepadR2, ControllerDevice::Button::C),
      };

      auto& controller = executor_.controller_device();
      for (const auto& [key, button] : kMap) {
        if (ImGui::IsKeyPressed(key, /*repeat=*/false)) {
          controller.set_button(button, true);
        } else if (ImGui::IsKeyReleased(key)) {
          controller.set_button(button, false);
        }
      }
    }

A little side story about a case of bad luck

I have a HyperX Alloy Origins Core keyboard (this also isn't an ad). It allows for the customization of the RGB lighting with complex patterns, such as animations or click responses, and the addition of macros. However, the customization software is available only on Windows, and I'd like to change the lighting on Linux based on certain events as well.

So, I took USB dumps in Wireshark and reverse engineered the behavior.

For example, we can assign a static red color to one button, get what is written, and see which bytes relate to that button, and so on.

Unless we reverse engineer the .exe file, there's nowhere to look: the protocol seems to have been invented in the AliExpressTech basement, so there's no documentation. There's an incomplete reverse-engineered implementation for this keyboard in OpenRGB (it turns out there's a project for reverse engineering all sorts of colorful stuff).

Pixel shaders

We could create all kinds of pixel shaders to make it look cool.

This was a real pain: shaders are poorly supported in ImGui, and changing that requires a terrible workaround. Additionally, I had to install the GLAD library to call the function that compiles the pixel shader. Also, the shader code must target GLSL version 130, and the only external variable is uniform sampler2D Texture; the rest are constants.

The goal was to create a CRT shader that would simulate an old TV and to add some other shaders if possible.

Since I am a total noob at shaders, I used ChatGPT to create them, considering the limitations described above. The sources are here: lib/sega/shader/shader.cpp. I didn't even dig into the shader code, just read the comments.

The CRT shader features generated by the AI:

Barrel Distortion is a bulge effect;
Scanline Darkness makes every second line darker;
Chromatic Aberration is an RGB layer distortion;
Vignette darkens the color around the edges.
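To make two of these effects more concrete, here is a minimal CPU-side sketch of scanline darkness and a vignette. It is not the GLSL code from the repository: the function name `crt_brightness` and the constants 0.75 and 0.25 are made up for illustration.

```cpp
#include <algorithm>

// Hypothetical illustration, not the shader from the article's repository.
// Returns a brightness multiplier in [0, 1] for pixel (x, y) of a w x h image.
inline float crt_brightness(int x, int y, int w, int h) {
    // Scanline darkness: every second line is dimmed (0.75 is an arbitrary value).
    const float scanline = (y % 2 == 1) ? 0.75f : 1.0f;

    // Vignette: darken proportionally to the squared distance from the center.
    const float nx = (x + 0.5f) / w * 2.0f - 1.0f;  // -1..1, left to right
    const float ny = (y + 0.5f) / h * 2.0f - 1.0f;  // -1..1, top to bottom
    const float vignette =
        std::clamp(1.0f - 0.25f * (nx * nx + ny * ny), 0.0f, 1.0f);

    return scanline * vignette;
}
```

Multiplying each color channel of a pixel by this factor darkens odd lines and the corners. A real fragment shader would do the same math per fragment on the GPU.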

The shader result:

Picture N23 – The shader result

Fred Flintstone before and after adding the shader (enhanced):

Picture N24 – Fred Flintstone

I asked ChatGPT to create other shaders, but they're not as interesting:

Shaders

Picture N25 – No shaders

Picture N26 – The Desaturate shader

Picture N27 – The Glitch shader

Picture N28 – The Night Vision shader

I mostly played the emulator without shaders, but sometimes I used CRT.

Optimizations for the release build

It may not seem obvious, but rendering a frame is quite a resource-intensive task if done suboptimally. Let's say the screen size is 320x240 pixels. We're iterating pixel by pixel. There are always up to 80 sprites, plus three planes, on the screen. They have priority, which means each of them must be traversed twice. First, we need to find the corresponding pixel in each sprite or plane and check whether it is within the bounding box. Then, we need to take the tile out of the tileset and check whether the pixel is opaque. All of this must be calculated 60 times per second, fast enough to still have time for ImGui and the m68k emulator.
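The last step, taking a pixel out of a tile and checking its opacity, can be sketched as follows. This assumes the usual Genesis tile layout (8x8 pixels, 4 bits per pixel, 32 bytes per tile, palette index 0 meaning transparent). The names `tile_pixel` and `is_opaque` are hypothetical and not taken from the emulator's sources.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kTileBytes = 32;   // 8 rows of 4 bytes
constexpr std::size_t kBytesPerRow = 4;  // 8 pixels at 4 bits each

// Returns the palette index (0..15) of pixel (x, y) inside tile `tile_index`.
// `vram` stands in for the video memory; x and y must be in the range 0..7.
inline std::uint8_t tile_pixel(const std::vector<std::uint8_t>& vram,
                               std::size_t tile_index,
                               std::size_t x, std::size_t y) {
    const std::uint8_t byte =
        vram[tile_index * kTileBytes + y * kBytesPerRow + x / 2];
    // The left pixel of each pair is stored in the high nibble.
    return (x % 2 == 0) ? static_cast<std::uint8_t>(byte >> 4)
                        : static_cast<std::uint8_t>(byte & 0x0F);
}

// Palette index 0 is transparent, so the pixel falls through to the next layer.
inline bool is_opaque(std::uint8_t palette_index) { return palette_index != 0; }
```

The lookup is only a few integer operations, but it runs for every layer of every pixel, which is why the optimization level matters so much here.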

So, the computations must contain no redundant code, memory allocations, and so on.

In reality, having the release build with the optimization settings enabled is enough.

set(CMAKE_BUILD_TYPE Release)

First, let's disable unused features and unnecessary warnings:

add_compile_options(-Wno-format)
add_compile_options(-Wno-nan-infinity-disabled)
add_compile_options(-fno-exceptions)
add_compile_options(-fno-rtti)

We'll switch to the Ofast build mode and build the code for the native architecture, sacrificing binary portability, with link-time optimization, loop unrolling, and "fast" math.

set(CMAKE_CXX_FLAGS_RELEASE "${CMAKE_CXX_FLAGS_RELEASE} \
    -Ofast \
    -march=native \
    -flto \
    -funroll-loops \
    -ffast-math")

This is enough to achieve stable 60 FPS, and even 120 FPS if you play at double speed (when the interval for VBLANK interrupts is halved).

The only process that can be parallelized is the evaluation of pixels on one line. Evaluating different lines at the same time is impossible because HBLANK fires between lines, where colors can be swapped. Even so, I wouldn't recommend it: parallelizing with good resource utilization would require a lock-free algorithm, and we don't want to do that unless it's absolutely necessary.

Testing the emulator with games

Almost every game introduced something new to the emulator: one game used a rare VDP feature that I implemented incorrectly, another one was doing something strange, and so on. In this section, I've described some of the quirks I've encountered while running a few dozen games.

Those that worked right away

I've basically built the emulator around the Cool Spot (1993) game: I reverse engineered it, debugged VDP gimmicks, and so on. The Cool Spot character is the 7 Up lemonade mascot (he's known only in the US, the mascot is different in other regions). It's a beautiful platformer that I played through many times as a kid.

Picture N29 – Cool Spot (1993)

Earthworm Jim (1994). The worm is scavenging through the dumpsters—wow, looks cool!

Picture N30 – Earthworm Jim (1994)

Aladdin (1993). I didn't really get into it: the graphics and gameplay weren't the best.

Picture N31 – Aladdin (1993)

Reading the VDP status register

Some games read the VDP status register: if we add an incorrect bit, the game either hangs or malfunctions.

This was the case in Battle Toads (1992). The game was doing this:

do {
  wVar2 = VDP_CTRL;
} while ((wVar2 & 2) != 0);
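The loop above spins until bit 1 of the status word is cleared. In the commonly circulated VDP documentation, that bit is the DMA-busy flag, so an emulator that leaves it set hangs the game. Below is a minimal sketch of assembling the status word. The `VdpFlags` struct and `make_status` function are hypothetical, and only a handful of bits are modeled.

```cpp
#include <cstdint>

// Status bits as listed in commonly circulated VDP documentation
// (only the ones this sketch uses).
enum VdpStatusBit : std::uint16_t {
    kStatusPal       = 1 << 0,
    kStatusDmaBusy   = 1 << 1,  // the bit tested by `wVar2 & 2`
    kStatusHBlank    = 1 << 2,
    kStatusVBlank    = 1 << 3,
    kStatusFifoEmpty = 1 << 9,
};

// Hypothetical snapshot of the emulator state that feeds the status word.
struct VdpFlags {
    bool pal = false;
    bool dma_busy = false;
    bool hblank = false;
    bool vblank = false;
};

inline std::uint16_t make_status(const VdpFlags& flags) {
    // Assumption of this sketch: writes complete instantly, so the FIFO
    // is always reported as empty.
    std::uint16_t status = kStatusFifoEmpty;
    if (flags.pal)      status |= kStatusPal;
    if (flags.dma_busy) status |= kStatusDmaBusy;
    if (flags.hblank)   status |= kStatusHBlank;
    if (flags.vblank)   status |= kStatusVBlank;
    return status;
}
```

If DMA is emulated as an instantaneous copy, `dma_busy` can simply stay `false`, and the polling loop exits on its first iteration.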

Picture N32 – Battle Toads (1992)

The Window Plane looks different when its width is set to 32 tiles

One of the most poorly documented things is the window plane behavior. It appears that, if the window width is 32 tiles, and the width of all planes is 64 tiles, a tile for the window plane should be searched for, considering its width is still 32 tiles. I couldn't find this documented anywhere, so I left the workaround there.

It appears, for example, in Goofy's Hysterical History Tour (1993). The gameplay of this game is pretty mediocre.

Picture N33 – Goofy's Hysterical History Tour (1993), the yellow line at the bottom came from Window Plane

Auto increment errors in DMA

The most annoying thing about VDP is DMA (Direct Memory Access), which is designed to move memory blocks from the m68k RAM to the VRAM. It has a few modes and settings, so it's easy to make a mistake. The most common error type is auto increment. There are non-obvious conditions regarding when a memory pointer should be incremented by this number.

In Tom and Jerry: Frantic Antics (1993), when a character moves on the map, new layers are added to the plane via a rare auto increment (128 instead of the usual 1). My code treated the auto increment as if it were always 1, so the plane barely changed except for its top line. I debugged it by examining the plane window closely and determining that the layer was supposed to be added vertically.
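Why an increment of 128 produces a vertical strip can be shown with a small sketch. It assumes a 64-tile-wide plane with 2-byte entries, so one row of the plane occupies 128 bytes. The `write_words` helper is hypothetical and ignores every other VDP detail.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Writes 16-bit words into a stand-in VRAM buffer, advancing the address by
// `auto_increment` bytes after every word and wrapping at the end of VRAM.
inline void write_words(std::vector<std::uint8_t>& vram, std::size_t address,
                        std::size_t auto_increment,
                        const std::vector<std::uint16_t>& words) {
    for (const std::uint16_t word : words) {
        // Big-endian layout: high byte first.
        vram[address % vram.size()] = static_cast<std::uint8_t>(word >> 8);
        vram[(address + 1) % vram.size()] = static_cast<std::uint8_t>(word & 0xFF);
        address += auto_increment;
    }
}
```

With `auto_increment == 128`, consecutive words land at offsets 0, 128, 256, and so on, which is the same column in consecutive rows of a 64-tile-wide plane. Hard-coding a smaller step instead packs them all into the first row.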

Picture N34 – Tom and Jerry - Frantic Antics (1993)

Out of all the games I've run, this one is probably the worst. Seems like its developers didn't try at all, making it for an older generation of consoles.

Oversized write to the VSRAM memory

This isn't shown on the top-level scheme of the Sega Genesis architecture, but the VRAM (main video memory, 64 KB), CRAM (128 bytes, 4 color palettes), and VSRAM (80 bytes, vertical shift) are separate for some reason. These independent blocks of memory look even funnier when we consider that the horizontal shift lies entirely in VRAM, but that's not the point.

Tiny Toon Adventures (1993) uses the same algorithm to zero CRAM and VSRAM. So, 128 bytes are written to the 80-byte VSRAM... If we don't handle it somehow, a segfault error will occur. The console offers a great deal of freedom, and that's just the tip of the iceberg.
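One way to survive such a write is to bounds-check it and drop whatever does not fit. The sketch below assumes that silently ignoring the extra bytes is acceptable, which is a guess and not verified hardware behavior. `write_vsram` is a hypothetical name.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kVsramSize = 80;  // bytes, as stated in the article

// Writes one byte to VSRAM. Returns false and writes nothing when the address
// is out of range. Dropping the byte is this sketch's assumption, not a
// documented hardware rule.
inline bool write_vsram(std::array<std::uint8_t, kVsramSize>& vsram,
                        std::size_t address, std::uint8_t value) {
    if (address >= vsram.size()) {
        return false;
    }
    vsram[address] = value;
    return true;
}
```

A 128-byte clear loop then succeeds for the first 80 bytes and is a no-op for the remaining 48, instead of writing past the end of the buffer.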

Picture N35 – Tiny Toon Adventures (1993)

The game has nice graphics, the gameplay is average, and it has a hardcore Sonic-esque feel to it.

Calling DMA when it is disabled

The Flintstones (1993) had some strange behavior: the plane moved up just as much as it moved to the right. In other words, there were strange entries in the VSRAM. The cause was simple: DMA works only when a certain bit is set in a VDP register, and the game tried to start DMA write operations while DMA was disabled. Once I took that bit into account, the issue went away. The authors somehow wrote the logic incorrectly.

Picture N36 – The Flintstones (1993)
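The check itself is tiny. In the VDP register references I'm aware of, the DMA enable flag is bit 4 of mode register 2 (register 0x01). Treat that bit position as an assumption here, since the text above doesn't name it. `dma_allowed` is a hypothetical helper.

```cpp
#include <cstdint>

// Assumed position of the DMA enable flag: bit 4 of VDP register 0x01.
constexpr std::uint8_t kDmaEnableBit = 1 << 4;

// Returns true when a DMA request should actually be executed.
inline bool dma_allowed(std::uint8_t mode_register_2) {
    return (mode_register_2 & kDmaEnableBit) != 0;
}
```

A DMA request that arrives while `dma_allowed` returns `false` is skipped, so it no longer leaves stray entries in VSRAM.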

Single-byte register reads

Most guides say that registers are usually read as two bytes, but in Jurassic Park (1993), the VDP register is read as a single byte. I had to support that.
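Supporting a one-byte read mostly comes down to picking the correct half of the 16-bit value. Since the m68k is big-endian, the even address holds the high byte. A minimal sketch, with `read_byte_of_word` as a hypothetical helper:

```cpp
#include <cstdint>

// Returns the byte of a 16-bit register value that a single-byte read at
// `address` should see: the high byte at the even address, the low byte at
// the odd one.
inline std::uint8_t read_byte_of_word(std::uint16_t word, std::uint32_t address) {
    return (address % 2 == 0) ? static_cast<std::uint8_t>(word >> 8)
                              : static_cast<std::uint8_t>(word & 0xFF);
}
```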

Picture N37 – Jurassic Park (1993)

Attempting to write to read-only memory

If you decompile one fragment of Spot Goes to Hollywood (1995), this happens:

if (psVar4 != (short *)0x0) {
  do {
    sVar1 = psVar4[1];
    *(short *)(sVar1 + 0x36) = *(short *)(sVar1 + 0x36) + -2;
    *psVar4 = sVar1;
    psVar4 = psVar4 + 1;
  } while (sVar1 != 0);
  DAT_fffff8a0._2_2_ = DAT_fffff8a0._2_2_ + -2;
}

So, there's an off-by-one error here, and a write is made to the 0x000036 address. Sega just doesn't do anything about it: there's no segfault analog. Wait, we could do that all along? As it turns out, we can. Such quirks happen quite often, so instead of returning Error, the emulator has to write to a log and do nothing.
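A "log and do nothing" write handler might look like the sketch below. The `RomDevice` class is hypothetical and much simpler than a real memory-mapped device. It only shows that a write leaves the ROM contents untouched.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

// Hypothetical read-only device: reads return cartridge data, writes are
// logged and dropped instead of being reported as an error.
class RomDevice {
public:
    explicit RomDevice(std::vector<std::uint8_t> data) : data_(std::move(data)) {}

    std::uint8_t read(std::uint32_t address) const {
        return address < data_.size() ? data_[address] : 0;
    }

    void write(std::uint32_t address, std::uint8_t value) {
        std::fprintf(stderr, "ignored write of 0x%02X to ROM address 0x%06X\n",
                     static_cast<unsigned>(value), static_cast<unsigned>(address));
        ++ignored_writes_;
    }

    std::size_t ignored_writes() const { return ignored_writes_; }

private:
    std::vector<std::uint8_t> data_;
    std::size_t ignored_writes_ = 0;
};
```

Counting the dropped writes is optional, but it makes it easy to spot a game that hits this path thousands of times per frame.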

Picture N38 – Spot Goes to Hollywood (1995)

Changing endianness at DMA in the VRAM fill mode

In the Contra: Hard Corps (1994) game, I saw the broken plane shifts. I added logs and saw that it uses a rare VRAM fill mode to fill the horizontal shift table. After taking several closer looks, I confirmed that the written bytes somehow change the endianness... I had to create a cringey workaround:

// change endianness in this case (example game: "Contra Hard Corps")
if (auto_increment_ > 1) {
  if (ram_address_ % 2 == 0) {
    ++ram_address_;
  } else {
    --ram_address_;
  }
}
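The if/else in the workaround flips the lowest bit of the address, so the same effect can be expressed as a single XOR. A minimal sketch, where `swap_byte_lane` is a hypothetical helper and not part of the emulator:

```cpp
#include <cstdint>

// Moves an address to the other byte of the same 16-bit word:
// even becomes odd, odd becomes even.
constexpr std::uint32_t swap_byte_lane(std::uint32_t address) {
    return address ^ 1u;
}

// Compile-time checks that this matches the ++/-- logic of the workaround.
static_assert(swap_byte_lane(0x1000) == 0x1001);
static_assert(swap_byte_lane(0x1001) == 0x1000);
```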

Picture N39 – Contra: Hard Corps (1994)

The Z80 RAM dependency and other dependencies

The emulator doesn't support Z80 yet, but some games require it. For example, Mickey Mania (1994) freezes after starting. Opening the decompiler reveals that it reads the 0xA01000 address in a loop until a non-zero byte appears. This is the Z80 RAM area, so the game creates an implicit link between the m68k and the Z80.

Let's implement a new cringey workaround and return a random byte if it's a Z80 RAM read.
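Such a stub could look like the sketch below. It assumes the 8 KB of Z80 RAM is visible to the m68k at 0xA00000–0xA01FFF, which is consistent with the 0xA01000 address above. The function names are hypothetical.

```cpp
#include <cstdint>
#include <random>

// Assumed m68k-side window of the 8 KB Z80 RAM (inclusive bounds).
constexpr std::uint32_t kZ80RamBegin = 0xA00000;
constexpr std::uint32_t kZ80RamEnd = 0xA01FFF;

inline bool is_z80_ram(std::uint32_t address) {
    return address >= kZ80RamBegin && address <= kZ80RamEnd;
}

// Workaround, not emulation: returns a random byte so that a loop waiting
// for a non-zero value eventually terminates.
inline std::uint8_t read_z80_ram_stub(std::mt19937& rng) {
    std::uniform_int_distribution<int> dist(0, 255);
    return static_cast<std::uint8_t>(dist(rng));
}
```

Passing the generator in from the caller keeps the stub deterministic under a fixed seed, which helps when reproducing a hang in the debugger.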

Unfortunately, there's another issue: the game now reads the VDP H/V Counter at the 0xC00008 address.

Well, we'll create another workaround. Now, the game shows the splash screen and crashes again when it reads another unmapped address. Let's put the game away for a while before we reach a critical number of workarounds.

Picture N40 – The Mickey Mania (1994) splash screen

Another example is the Sonic the Hedgehog (1991) game, where I seem to end up in some sort of debug mode, judging by the weird numbers in the upper left corner.

Picture N41 – Sonic the Hedgehog (1991) with two planes

Fortunately, the first Sonic game has long been reverse-engineered (GitHub). So, if you want to have fun, there's a way to fully support it.

Supporting Z80

What does Z80 do?

As previously mentioned, Zilog Z80 is a coprocessor designed for music playback. It has its own 8 KB of RAM and is connected to the YM2612 sound synthesizer.

Z80 is an ordinary processor that was used in previous generations of consoles.

How was the music for Genesis games created? Sega distributed a tool called GEMS under MS-DOS among developers. With GEMS, devs could create all kinds of sounds and use the development board to check what they would sound like on Genesis (what you hear is what you get).

However, many developers didn't bother to compose their own music but used default samples. This resulted in many unrelated games having the same sounds.

The sound was translated into a program called Sound Driver in Z80 assembly language and packed into the ROM cartridge with other data. While the game was running, m68k would read the sound driver from the cartridge ROM and load it into the Z80 RAM. Then, the Z80 processor would start producing sound via the program, which ran independently of m68k. So much for concurrency... You can watch this video to learn more about the music in Genesis.

How to support Z80

First, one must study the 332-page manual and create a Z80 emulator similar to the m68k one, flood it with tests, and run some programs on Z80. Then, one must learn sound theory and the YM2612 registers, and write a sound generator for Linux.

In terms of scope, it encompasses everything that I've previously described (m68k + VDP), or at least half of it—that's a lot to do.

What else can we do?

The article describes a setup that can run many games. However, besides sound, there are all sorts of smaller things left to do.

Supporting the two-player mode

Currently, the emulator supports only one player, but support for a second gamepad is possible.

Supporting HBLANK

Currently, only VBLANK is called, but HBLANK must also be called after each line. In fact, only a few games use it. The most common use case is changing the palette in the middle of the image.
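Structurally, this means rendering line by line and giving the game a chance to react between lines. Below is a minimal sketch of such a frame loop. The callbacks are hypothetical stand-ins for the renderer and the interrupt handlers. It also ignores the fact that real hardware has a line counter deciding on which lines the interrupt fires.

```cpp
#include <functional>

// Hypothetical frame loop: render one scanline, then signal HBLANK so the
// game can change the palette before the next line is drawn.
inline void render_frame(int height,
                         const std::function<void(int)>& render_line,
                         const std::function<void(int)>& on_hblank,
                         const std::function<void()>& on_vblank) {
    for (int line = 0; line < height; ++line) {
        render_line(line);  // uses the palette as it is right now
        on_hblank(line);    // the game may swap colors here
    }
    on_vblank();            // once per frame, after the last line
}
```

Because the palette can change after any line, the frame can no longer be rendered in a single pass with a fixed palette. This is the same reason different lines can't be evaluated in parallel.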

The Ristar (1994) example

The Ristar (1994) game leverages this feature. Note the waves on the water surface and the wobbly columns below.

Picture N42 – Ristar (1994) running on my emulator (no HBLANK)

And here's what it should look like, as shown in a YouTube walkthrough:

Picture N43 – Ristar (1994) running on a proper emulator

This is particularly evident when Ristar is submerged in water, and the palette is always aquatic there:

Picture N44 – On the left is almost under the water level, on the right is completely under the water level

Supporting other controllers

Currently, only the three-button gamepad is supported. However, a six-button controller can be supported, as well as rarer devices like the Sega Mouse, Sega Multitap, Saturn Keyboard, Ten Key Pad, and even a printer.

The cooler debugger

The built-in debugger could be improved to allow users to view memory, set read/write breaks, and unwind the stack trace. This would ultimately allow for much faster debugging.