Systems Engineering
You Are The BIOS Now: Building A Hypervisor In Python With KVM
"A beginner-friendly rewrite of the original Rust post β same cursed hypervisor, but in Python, explained from scratch."
This code is open source
All the code from this blog post is available on GitHub. Clone it, run it, break it, make it yours.
View on GitHub

It seems like the whole AI infrastructure community has been getting into sandboxes lately. And for good reason! AI Agents are a fundamentally different workload with different requirements around security.
Docker containers were built, roughly speaking, for trusted workloads that required process isolation. Agents, on the other hand, should not be trusted, because Agents, for better or for worse,
can write code. It's that Agent-generated code which should be treated as highly untrusted arbitrary code. Ladies and gentlemen, we're going to need a bigger boat.
KVM (the Kernel-based Virtual Machine) is a great fit for this. We can take advantage of hardware-level isolation on the CPU to get much stronger safety properties for running Agent code. Let alone multi-tenant Agent code! Oh my!
For us to truly appreciate what a project like Firecracker does, let's go a little low level, put our systems engineer hats on, and build a cursed hello world program.
Follow the Rabbit
When your CPU boots, it wakes up thinking it's 1978. No operating system. No drivers. No Python. Just a confused chip in 16-bit "real mode" waiting for someone to tell it what to do.
That someone is usually the BIOS, a tiny program baked into your motherboard's firmware. But today, you, my friend, are the BIOS. We're going to write a program that creates a tiny virtual machine, boots a CPU inside it, and makes it print "Hello, World!", all from a Python script running on Linux.
This is called a Type-II hypervisor: a program that runs inside an existing operating system (like VirtualBox or QEMU) as opposed to a Type-I hypervisor that runs directly on bare metal (like VMware ESXi). Ours will be about 250 lines of Python.
If you've never touched virtualization, systems programming, or x86 architecture before...perfect. That's exactly who this is for.
What We're Actually Building
Here's the plan in plain English:
- Ask the Linux kernel for permission to create a virtual CPU.
- Give that virtual CPU some memory to work with.
- Convince the CPU it's running on a modern 64-bit machine (it wakes up thinking it's a 1978 Intel 8086).
- Write a tiny guest program, using a Python mini-assembler that prints "Hello, World!" one character at a time.
- Catch each character from Python and print it to your terminal.
No operating system runs inside our VM. No bootloader. Just machine code executed by a virtual CPU, with Python glue.
Prerequisites
You need three things:
- Linux
- Python 3.8+
- KVM access; your CPU must support hardware virtualization
To check if KVM is available:
ls -la /dev/kvm
If that file exists, you're good. If not, you may need to enable virtualization in your BIOS/UEFI settings (usually called "VT-x" for Intel or "AMD-V" for AMD) and load the kernel module:
sudo modprobe kvm-intel # Intel CPUs
sudo modprobe kvm-amd # AMD CPUs
You may also need to add yourself to the kvm group:
sudo usermod -aG kvm $USER
Then log out and back in. No pip packages needed; we're using only the Python standard library.
Part 1: What Is KVM?
KVM stands for Kernel-based Virtual Machine. It's a piece of the Linux kernel that exposes your CPU's hardware virtualization features (Intel VT-x or AMD-V) through a simple file: /dev/kvm.
You talk to KVM by opening this file and sending it ioctl commands. This is a Unix mechanism for sending control messages to device drivers. Think of it like an API, but instead of HTTP requests, you're sending binary-encoded structs through file descriptors.
The workflow looks like this:
Open /dev/kvm
└── ioctl: "Create a VM" → gives you a VM file descriptor
    └── ioctl: "Create a VCPU" → gives you a VCPU file descriptor
        └── ioctl: "Run" → CPU executes guest code
            └── returns when something interesting happens (an "exit")
Each arrow is a separate ioctl call. KVM manages all the nasty hardware details; we just have to set up the initial state correctly.
Easy enough right? RIGHT?
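Before any of the plumbing below, you can sanity-check that KVM will talk to you at all. A minimal sketch (the ioctl number encoding is explained in Part 2; the `kvm_api_version` helper name is ours, not KVM's):

```python
import fcntl
import os

KVMIO = 0xAE  # KVM's ioctl "type" byte

def _IO(type_, nr):
    # Mirrors the C _IO macro: an ioctl with no data payload
    return (type_ << 8) | nr

KVM_GET_API_VERSION = _IO(KVMIO, 0x00)

def kvm_api_version(path="/dev/kvm"):
    """Return KVM's API version, or None if KVM isn't available."""
    if not os.path.exists(path):
        return None
    fd = os.open(path, os.O_RDWR)
    try:
        return fcntl.ioctl(fd, KVM_GET_API_VERSION, 0)
    finally:
        os.close(fd)

print(hex(KVM_GET_API_VERSION))  # 0xae00
print(kvm_api_version())         # 12 on any modern kernel, None without KVM
```

The stable KVM API has reported version 12 for many years; if you see anything else, stop before going further.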
Part 2: The ioctl Plumbing
Before we build the VM, we need to define the ioctl numbers. We can just compute them manually in Python.
Every Linux ioctl has a numeric code that encodes four things: the direction of data transfer (read, write, both, or none),
a "type" byte (KVM uses 0xAE), a command number, and the size of the data struct.
# ioctl encoding helpers: these mirror the C macros _IO, _IOR, _IOW
# Don't worry if you don't understand these; ask Claude to explain them to you
def _IO(type_, nr): return (type_ << 8) | nr
def _IOR(type_, nr, size): return (2 << 30) | (type_ << 8) | nr | (size << 16)
def _IOW(type_, nr, size): return (1 << 30) | (type_ << 8) | nr | (size << 16)
KVMIO = 0xAE # KVM's ioctl "type" byte
KVM_CREATE_VM = _IO(KVMIO, 0x01)
KVM_CREATE_VCPU = _IO(KVMIO, 0x41)
KVM_GET_VCPU_MMAP_SIZE = _IO(KVMIO, 0x04)
KVM_SET_USER_MEMORY_REGION = _IOW(KVMIO, 0x46, 32)
KVM_RUN = _IO(KVMIO, 0x80)
KVM_GET_SREGS = _IOR(KVMIO, 0x83, 312)
KVM_SET_SREGS = _IOW(KVMIO, 0x84, 312)
KVM_SET_REGS = _IOW(KVMIO, 0x82, 144)
The numbers like 312 and 144 are the sizeof values of the kernel's C structs for the special registers and general-purpose registers, respectively. We'll define matching Python structs shortly.
Why not just use a library? Libraries obviously exist for this, but they hide all the interesting parts. We're here to learn how the sausage is made. We're systems engineers!! Put that hat back on.
Part 3: Describing Hardware With ctypes
KVM expects us to pass C structs back and forth through ioctl. Python's ctypes module lets us define binary-compatible structs that
the kernel can read directly from memory.
An x86 CPU has segment registers, relics of the 8086 era, when memory was divided into overlapping 64 KB "segments." Even in modern 64-bit mode, the CPU still checks them. Each segment register is described by this struct:
import ctypes
class KvmSegment(ctypes.Structure):
    _fields_ = [
        ("base", ctypes.c_uint64),      # Base address of the segment
        ("limit", ctypes.c_uint32),     # Size limit
        ("selector", ctypes.c_uint16),  # Index into the GDT (more on this later)
        ("type_", ctypes.c_uint8),      # Segment type (code, data, etc.)
        ("present", ctypes.c_uint8),    # Is this segment valid?
        ("dpl", ctypes.c_uint8),        # Privilege level (0 = kernel, 3 = user)
        ("db", ctypes.c_uint8),         # Default operation size
        ("s", ctypes.c_uint8),          # Descriptor type
        ("l", ctypes.c_uint8),          # Long mode (64-bit)
        ("g", ctypes.c_uint8),          # Granularity
        ("avl", ctypes.c_uint8),        # Available for OS use
        ("unusable", ctypes.c_uint8),
        ("padding", ctypes.c_uint8),
    ]
The descriptor table register tells the CPU where to find the GDT (Global Descriptor Table) and IDT (Interrupt Descriptor Table) in memory:
class KvmDtable(ctypes.Structure):
    _fields_ = [
        ("base", ctypes.c_uint64),
        ("limit", ctypes.c_uint16),
        ("padding", ctypes.c_uint16 * 3),
    ]
The special registers struct bundles all of these together, plus the control registers (cr0 through cr8) and the efer register. These are the knobs we'll turn to activate 64-bit mode:
class KvmSregs(ctypes.Structure):
    _fields_ = [
        ("cs", KvmSegment),   # Code segment
        ("ds", KvmSegment),   # Data segment
        ("es", KvmSegment),   # Extra segment
        ("fs", KvmSegment),   # General-purpose segment
        ("gs", KvmSegment),   # General-purpose segment
        ("ss", KvmSegment),   # Stack segment
        ("tr", KvmSegment),   # Task register
        ("ldt", KvmSegment),  # Local descriptor table
        ("gdt", KvmDtable),   # Global descriptor table
        ("idt", KvmDtable),   # Interrupt descriptor table
        ("cr0", ctypes.c_uint64),   # Control register 0
        ("cr2", ctypes.c_uint64),
        ("cr3", ctypes.c_uint64),   # Page table base
        ("cr4", ctypes.c_uint64),
        ("cr8", ctypes.c_uint64),
        ("efer", ctypes.c_uint64),  # Extended Feature Enable Register
        ("apic_base", ctypes.c_uint64),
        ("interrupt_bitmap", ctypes.c_uint64 * 4),
    ]
And finally, the general-purpose registers, the ones you'd recognize from any x86 assembly tutorial:
class KvmRegs(ctypes.Structure):
    _fields_ = [
        ("rax", ctypes.c_uint64), ("rbx", ctypes.c_uint64),
        ("rcx", ctypes.c_uint64), ("rdx", ctypes.c_uint64),
        ("rsi", ctypes.c_uint64), ("rdi", ctypes.c_uint64),
        ("rsp", ctypes.c_uint64), ("rbp", ctypes.c_uint64),
        ("r8", ctypes.c_uint64),  ("r9", ctypes.c_uint64),
        ("r10", ctypes.c_uint64), ("r11", ctypes.c_uint64),
        ("r12", ctypes.c_uint64), ("r13", ctypes.c_uint64),
        ("r14", ctypes.c_uint64), ("r15", ctypes.c_uint64),
        ("rip", ctypes.c_uint64),    # Instruction pointer
        ("rflags", ctypes.c_uint64), # CPU flags
    ]
Key insight: These aren't abstractions. Each struct maps byte-for-byte to a kernel data structure. When we call
fcntl.ioctl(fd, KVM_SET_REGS, regs), Python hands the raw bytes of our KvmRegs object directly to the kernel.
Part 4: Allocating Guest Memory
Our virtual CPU needs memory. But we can't just use a Python bytearray: KVM needs the real virtual address of a memory-mapped region
(the pointer that the kernel's MMU knows about) so it can wire it into the guest's address space.
Python's built-in mmap module is great for file-backed mappings, but it doesn't give us the raw pointer address we need.
So we call libc's mmap directly through ctypes:
# Don't be afraid of this code. Look at it, ingest it, ask Claude about it, embrace it.
import ctypes.util
_libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
_libc.mmap.restype = ctypes.c_void_p
_libc.mmap.argtypes = [
    ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
    ctypes.c_int, ctypes.c_int, ctypes.c_long,
]

def mmap_anon(size):
    """Allocate a page-aligned block of anonymous memory. Returns the address as an int."""
    addr = _libc.mmap(None, size, 0x3, 0x22, -1, 0)  # RW, PRIVATE|ANON
    if addr == ctypes.c_void_p(-1).value:
        raise OSError("mmap failed")
    return addr
What's happening here:
- 0x22 combines MAP_PRIVATE (0x02) and MAP_ANONYMOUS (0x20): "give me zeroed memory that's my own copy, don't back it with a file."
- 0x3 combines PROT_READ (0x1) and PROT_WRITE (0x2).
- The returned addr is an integer, a raw pointer, which is exactly what KVM's kvm_userspace_memory_region struct expects.
We then tell KVM: "Hey, this block of host memory? Pretend it's the guest's physical RAM starting at address 0x1000."
class KvmUserspaceMemoryRegion(ctypes.Structure):
    _fields_ = [
        ("slot", ctypes.c_uint32),
        ("flags", ctypes.c_uint32),
        ("guest_phys_addr", ctypes.c_uint64),
        ("memory_size", ctypes.c_uint64),
        ("userspace_addr", ctypes.c_uint64),
    ]
From the guest's perspective, it sees plain physical RAM. From our perspective, it's a chunk of mmap'd memory
we can read and write with ctypes.memmove.
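To see the "chunk of mmap'd memory" side of this in action, here's a sketch that allocates one page, pokes bytes in with ctypes.memmove, and reads them back with ctypes.string_at. The flag values are the x86-64 Linux ones from above:

```python
import ctypes
import ctypes.util

_libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
_libc.mmap.restype = ctypes.c_void_p
_libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                       ctypes.c_int, ctypes.c_int, ctypes.c_long]

# PROT_READ|PROT_WRITE = 0x3, MAP_PRIVATE|MAP_ANONYMOUS = 0x22 (x86-64 Linux)
addr = _libc.mmap(None, 4096, 0x3, 0x22, -1, 0)
assert addr != ctypes.c_void_p(-1).value, "mmap failed"

ctypes.memmove(addr + 16, b"Hello, guest RAM", 16)  # "poke" at offset 16
print(ctypes.string_at(addr + 16, 16))              # b'Hello, guest RAM'
```

This is exactly what _poke and _zero will do later, just against the VM's 256 MiB region instead of a throwaway page.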
Part 5: The CursedVm Class
OK, so we're writing Python, not the Rust the original blog post was written in. So maybe instead of CursedVm, ours is really the LessCursedVm. Still cursed, but less.
Now we can put it all together. Here's the skeleton of our hypervisor:
import fcntl
import mmap
import os
import struct
GUEST_PHYS_ADDR = 0x1000 # Where guest "physical" memory starts
GUEST_SIZE = 256 << 20 # 256 MiB of guest RAM
GDT_OFFSET = 0x0 # GDT sits at the start of guest memory
PML4_OFFSET = 0x1000 # Page tables start here
PAGE_TABLE_SIZE = 0x1000 # 4 KiB per table
PAGE_SIZE = 1 << 21 # 2 MiB large pages
CODE_OFFSET = PML4_OFFSET + 3 * PAGE_TABLE_SIZE # Code goes after page tables
class CursedVm:
    def __init__(self):
        # Open /dev/kvm
        self._kvm_fd = os.open("/dev/kvm", os.O_RDWR)

        # Create a VM (returns a new file descriptor)
        self._vm_fd = fcntl.ioctl(self._kvm_fd, KVM_CREATE_VM, 0)

        # Create a virtual CPU
        self._vcpu_fd = fcntl.ioctl(self._vm_fd, KVM_CREATE_VCPU, 0)

        # mmap the VCPU's "run" area: this is where KVM writes exit info
        run_size = fcntl.ioctl(self._kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0)
        self._run = mmap.mmap(
            self._vcpu_fd, run_size,
            mmap.MAP_SHARED, mmap.PROT_READ | mmap.PROT_WRITE,
        )

        # Allocate guest memory
        self._mem = mmap_anon(GUEST_SIZE)

        # Register it with KVM
        region = KvmUserspaceMemoryRegion(
            slot=0, flags=0,
            guest_phys_addr=GUEST_PHYS_ADDR,
            memory_size=GUEST_SIZE,
            userspace_addr=self._mem,
        )
        fcntl.ioctl(self._vm_fd, KVM_SET_USER_MEMORY_REGION, region)

        # Bootstrap the CPU into 64-bit mode
        self._setup_long_mode()
        self._setup_page_tables()
        self._setup_registers()
Each ioctl call is like sending a command to KVM through a walkie-talkie. First we get a VM, then a VCPU inside that VM, then we wire up memory, and finally configure the CPU state.
The _run mmap is particularly interesting: it's a shared memory region between our process and the kernel.
When the VCPU exits (because the guest did something that needs our attention),
KVM writes the reason and details into this buffer, and we read it back with struct.unpack_from.
This data structure is how our Python program observes and handles the guest's VM exits on the host side.
Part 6: Escaping 1978 and Setting Up Long Mode
Here's where things get a little gnarly. When an x86 CPU powers on, it starts in real mode, a 16-bit environment from the Intel 8086 era. It can only address 1 MiB of memory, uses segmented addressing, and has no concept of virtual memory or protection rings as we know them today.
A real machine goes through several stages to reach 64-bit "long mode":
Real Mode (16-bit) → Protected Mode (32-bit) → Long Mode (64-bit)
Each transition requires flipping specific bits in control registers and setting up data structures that the CPU checks. Normally a BIOS and bootloader handle this, but our VM has neither. We have to do it ourselves. We need to be the change we want to see.
The Global Descriptor Table (GDT)
The GDT is a table of "segment descriptors" that the CPU consults to understand memory layout. Even in 64-bit mode where segmentation is mostly ignored, the CPU still requires a valid GDT to be present. It's one of those "legacy tax" things in x86.
We need three entries:
GDT_NULL = 0x0000000000000000 # Entry 0: always null (CPU requires it)
GDT_CODE64 = 0x00209A0000000000 # Entry 1: 64-bit code segment
GDT_DATA64 = 0x0000920000000000 # Entry 2: 64-bit data segment
These 64-bit values are packed bitfields. The code segment says "this is executable, readable, present, runs in 64-bit mode, and operates at privilege level 0 (kernel)." The data segment says "this is writable, present, and used for data." The exact bit layout is documented in the Intel SDM (Software Developer's Manual).
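You don't need the SDM open to verify those claims. A sketch that picks apart the access byte (bits 40-47) and the flags nibble (bits 52-55) of our two descriptors; the decode_gdt helper and its field names are ours, chosen for readability:

```python
def decode_gdt(desc):
    """Decode the interesting bits of a 64-bit GDT descriptor."""
    access = (desc >> 40) & 0xFF  # P, DPL, S, type
    flags = (desc >> 52) & 0xF    # G, D/B, L, AVL
    return {
        "present":    bool(access & 0x80),  # P bit
        "dpl":        (access >> 5) & 0x3,  # privilege level
        "executable": bool(access & 0x08),  # code vs data
        "long_mode":  bool(flags & 0x2),    # the L bit
    }

print(decode_gdt(0x00209A0000000000))  # code: present, ring 0, executable, 64-bit
print(decode_gdt(0x0000920000000000))  # data: present, ring 0, not executable
```

The L bit showing up only on the code descriptor is the whole point: in long mode, that one bit on CS is what actually selects 64-bit execution.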
Flipping The Switches
Setting up long mode requires configuring three groups of things:
1. The segment registers tell the CPU which GDT entries to use:
def _setup_long_mode(self):
    sregs = KvmSregs()
    fcntl.ioctl(self._vcpu_fd, KVM_GET_SREGS, sregs)

    # Point the GDT register at our table in guest memory
    sregs.gdt.base = GUEST_PHYS_ADDR + GDT_OFFSET
    sregs.gdt.limit = 23  # 3 entries × 8 bytes − 1

    # Write the GDT entries into guest memory
    self._poke(GDT_OFFSET + 0, "<Q", 0x0000000000000000)   # null
    self._poke(GDT_OFFSET + 8, "<Q", 0x00209A0000000000)   # 64-bit code
    self._poke(GDT_OFFSET + 16, "<Q", 0x0000920000000000)  # 64-bit data
2. The code segment register (CS) configured for 64-bit execution:
    # Code segment: 64-bit long mode, ring 0
    for attr, val in [("base", 0), ("limit", 0xFFFFFFFF), ("selector", 8),
                      ("present", 1), ("type_", 11), ("dpl", 0),
                      ("db", 0), ("s", 1), ("l", 1), ("g", 1)]:
        setattr(sregs.cs, attr, val)
The ("l", 1) line is the magic bit. It tells the CPU "this code segment runs in 64-bit long mode." Without it, the CPU stays in compatibility mode. The ("selector", 8) points at GDT entry 1 (each entry is 8 bytes, so entry 1 is at byte offset 8).
3. The control registers are the actual mode switches:
sregs.efer |= 0x500 # Set LME (Long Mode Enable) and LMA (Long Mode Active)
sregs.cr0 |= 0x80000001 # Enable paging (PG) and protection (PE)
sregs.cr4 |= 0x20 # Enable Physical Address Extension (PAE)
fcntl.ioctl(self._vcpu_fd, KVM_SET_SREGS, sregs)
Here's what each bit does:
| Register | Bit(s) | Name | What It Does |
|---|---|---|---|
| EFER | 8 (0x100) | LME | "I want to use long mode" |
| EFER | 10 (0x400) | LMA | "Long mode is now active" |
| CR0 | 0 (0x1) | PE | Enable protected mode (exit real mode) |
| CR0 | 31 (0x80000000) | PG | Enable paging (virtual memory) |
| CR4 | 5 (0x20) | PAE | Enable 36-bit+ physical addressing |
Setting 0x500 in EFER sets both bits 8 and 10 (0x100 | 0x400). These three registers form a chain:
PAE must be on for paging to work, paging must be on for long mode to activate, and long mode must be enabled for 64-bit
code to run.
Why So Much Ceremony? I hate it
You might wonder why we can't just set a single "64-bit mode" flag. Wouldn't that be nice? The answer is backward compatibility. x86 has been accumulating features for over 45 years, and each new mode was bolted on top of the previous one. Long mode was added by AMD in 2003, but it still requires the infrastructure from protected mode (1985) and PAE (1995) to be in place. Every modern x86 CPU still wakes up as an 8086 and goes through this same ritual; we're just doing it explicitly instead of letting firmware handle it.
Part 7: Mapping Virtual Memory With Page Tables
With paging enabled (that CR0.PG bit we just set), the CPU no longer accesses physical memory directly. Every memory access goes through a page table: a tree structure that translates
virtual addresses to physical addresses. Our guest needs one, or the CPU will fault the moment it runs. We're going to be lazy and build the simplest one possible.
In 64-bit mode with 2 MiB large pages, the translation works like this:
Virtual Address (64 bits):
┌──────────┬──────────┬──────────┬─────────────────────────────┐
│  9 bits  │  9 bits  │  9 bits  │           21 bits           │
│   PML4   │   PDPT   │    PD    │  Offset within 2 MiB page   │
└──────────┴──────────┴──────────┴─────────────────────────────┘
     │          │          │
     │          │          └── Page Directory entry → 2 MiB physical page
     │          └───────────── PDPT entry → Page Directory
     └──────────────────────── PML4 entry → PDPT
The CPU walks this tree on every memory access (with heavy caching via the TLB). We need to build this tree in guest memory.
We'll use an identity mapping, virtual address X maps to physical address X. This is the simplest possible setup, and it means our guest code doesn't have to worry about address translation at all.
def _setup_page_tables(self):
    # Zero out the page table area (3 tables × 4096 bytes each)
    self._zero(PML4_OFFSET, 3 * PAGE_TABLE_SIZE)

    pml4 = GUEST_PHYS_ADDR + PML4_OFFSET
    pdpt = pml4 + PAGE_TABLE_SIZE
    pd = pdpt + PAGE_TABLE_SIZE

    # PML4[0] → PDPT (present + writable)
    self._poke(PML4_OFFSET, "<Q", pdpt | 0x3)

    # PDPT[0] → Page Directory (present + writable)
    self._poke(PML4_OFFSET + PAGE_TABLE_SIZE, "<Q", pd | 0x3)

    # PD[0..511] → 512 identity-mapped 2 MiB pages (present + writable + large)
    for i in range(512):
        self._poke(PML4_OFFSET + 2 * PAGE_TABLE_SIZE + i * 8, "<Q", (i << 21) | 0x83)

    # Tell the CPU where the PML4 lives
    sregs = KvmSregs()
    fcntl.ioctl(self._vcpu_fd, KVM_GET_SREGS, sregs)
    sregs.cr3 = pml4
    fcntl.ioctl(self._vcpu_fd, KVM_SET_SREGS, sregs)
The 0x83 in each page directory entry encodes three flags: present (bit 0), writable (bit 1), and large page (bit 7). The large page bit tells the CPU "don't look for another level of tables, this entry directly maps 2 MiB of memory." 0x83 = 0b10000011.
The (i << 21) part sets the physical address: entry 0 maps to physical address 0, entry 1 maps to 0x200000 (2 MiB), entry 2 to 0x400000 (4 MiB), and so on up to 1 GiB.
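The 9/9/9/21 split is easy to reproduce yourself: the indices the CPU would use are plain bit-field extractions. A sketch (the walk_2mib helper and the example address are made up for illustration) showing that under our identity mapping, the PD entry's (i << 21) base plus the 21-bit offset reconstructs the original address:

```python
def walk_2mib(vaddr):
    """Split a virtual address the way the CPU does with 2 MiB pages."""
    pml4_idx = (vaddr >> 39) & 0x1FF    # top 9 bits
    pdpt_idx = (vaddr >> 30) & 0x1FF    # next 9 bits
    pd_idx = (vaddr >> 21) & 0x1FF      # selects one 2 MiB page
    offset = vaddr & ((1 << 21) - 1)    # 21-bit offset inside that page
    return pml4_idx, pdpt_idx, pd_idx, offset

vaddr = 0x0040_3ABC  # some address inside our first GiB
pml4_idx, pdpt_idx, pd_idx, offset = walk_2mib(vaddr)
print(pml4_idx, pdpt_idx, pd_idx, hex(offset))  # 0 0 2 0x3abc

# Identity mapping: PD entry i maps base (i << 21), so base + offset == vaddr
assert (pd_idx << 21) | offset == vaddr
```

Since PML4 and PDPT indices are both 0 for anything under 1 GiB, a single PML4 entry and a single PDPT entry really are enough for our whole guest.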
The helper methods for writing to guest memory look like this:
def _poke(self, offset, fmt, value):
    """Write a packed value into guest memory at the given offset."""
    ctypes.memmove(self._mem + offset, struct.pack(fmt, value), struct.calcsize(fmt))

def _zero(self, offset, n):
    """Zero out a region of guest memory."""
    ctypes.memset(self._mem + offset, 0, n)
If you're still here at this point... we're almost there, and it's going to be worth it.
Part 8: Setting Up The Registers
The last setup step is initializing the general-purpose registers. The three critical ones are:
- RIP (Instruction Pointer): where the CPU starts executing code
- RSP (Stack Pointer): where the stack starts (grows downward)
- RFLAGS: CPU flags (bit 1 must always be set per the x86 spec)
def _setup_registers(self):
    regs = KvmRegs()
    regs.rip = GUEST_PHYS_ADDR + CODE_OFFSET  # Start executing here
    regs.rsp = (GUEST_PHYS_ADDR + GUEST_SIZE) & ~(PAGE_SIZE - 1)  # Stack at top
    regs.rbp = regs.rsp  # Frame pointer = stack pointer
    regs.rflags = 0x2    # Reserved bit (must be 1)
    fcntl.ioctl(self._vcpu_fd, KVM_SET_REGS, regs)
The & ~(PAGE_SIZE - 1) is a bitmask trick that rounds down to the nearest 2 MiB page boundary.
This keeps the stack properly aligned.
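A quick sketch of the rounding trick with our actual constants:

```python
GUEST_PHYS_ADDR = 0x1000
GUEST_SIZE = 256 << 20  # 256 MiB
PAGE_SIZE = 1 << 21     # 2 MiB

top = GUEST_PHYS_ADDR + GUEST_SIZE  # 0x10001000 -- not 2 MiB aligned
rsp = top & ~(PAGE_SIZE - 1)        # clear the low 21 bits
print(hex(top), hex(rsp))           # 0x10001000 0x10000000
```

Clearing the low 21 bits always rounds down to a multiple of 2 MiB, which conveniently also lands the stack just inside the last mapped page.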
Here's the resulting memory layout inside the guest:
0x1000 ┌─────────────────────────┐
       │ GDT (24 bytes)          │
0x2000 ├─────────────────────────┤
       │ PML4 (4 KiB)            │
0x3000 ├─────────────────────────┤
       │ PDPT (4 KiB)            │
0x4000 ├─────────────────────────┤
       │ Page Directory (4 KiB)  │
0x5000 ├─────────────────────────┤ ← RIP starts here
       │ Guest Code              │
       │                         │
       │ (free space)            │
       │                         │
       ├─────────────────────────┤ ← RSP (stack grows down)
       │ Stack                   │
       └─────────────────────────┘
Part 9: The Run Loop and Catching I/O Exits
Our guest has no operating system, no syscalls, no stdout. The only way it can communicate with the outside world is through I/O port instructions (out in x86 assembly).
I/O ports are an ancient x86 mechanism originally designed for talking to hardware like keyboards and disk controllers.
The CPU has a separate 64K address space just for I/O, and the out instruction writes a byte to a numbered port.
When our guest executes out, the CPU can't actually talk to hardware (it's virtualized),
so KVM traps the instruction and returns control to our Python code.
This is called a VM exit. We can then read which port was written to and what data was sent.
def run(self, on_io):
    """Run the guest. Calls on_io(port, data) for each I/O-out exit."""
    while True:
        fcntl.ioctl(self._vcpu_fd, KVM_RUN, 0)
        reason = struct.unpack_from("<I", self._run, 8)[0]
        if reason == 2:  # KVM_EXIT_IO
            if self._run[32] == 1:  # direction == IO_OUT
                size = self._run[33]
                port = struct.unpack_from("<H", self._run, 34)[0]
                count = struct.unpack_from("<I", self._run, 36)[0]
                off = struct.unpack_from("<Q", self._run, 40)[0]
                on_io(port, bytes(self._run[off : off + size * count]))
        elif reason == 5:  # KVM_EXIT_HLT
            return
        else:
            raise RuntimeError(f"VM exit: {reason}")
The magic numbers (8, 32, 33, 34, 36, 40) are byte offsets into the kvm_run struct. This is the least Pythonic part of the whole project:
we're essentially doing manual struct parsing, because the kernel's kvm_run struct is very large and complex
and we only need a few fields from it.
Here's what each offset contains:
| Offset | Size | Field | Meaning |
|---|---|---|---|
| 8 | 4 bytes | exit_reason | Why the VCPU stopped |
| 32 | 1 byte | io.direction | 0 = in, 1 = out |
| 33 | 1 byte | io.size | Bytes per I/O operation |
| 34 | 2 bytes | io.port | Which I/O port |
| 36 | 4 bytes | io.count | Number of operations |
| 40 | 8 bytes | io.data_offset | Where in kvm_run the data lives |
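To make those offsets concrete, here's a sketch that builds a fake kvm_run buffer for a one-byte OUT of 'H' to port 0x10, then parses it with the same unpacks the run loop uses. (A real buffer comes from mmap'ing the VCPU fd; the data_offset of 256 here is an arbitrary choice for the demo.)

```python
import struct

buf = bytearray(4096)
struct.pack_into("<I", buf, 8, 2)       # exit_reason = 2 (KVM_EXIT_IO)
buf[32] = 1                             # io.direction = 1 (out)
buf[33] = 1                             # io.size = 1 byte
struct.pack_into("<H", buf, 34, 0x10)   # io.port
struct.pack_into("<I", buf, 36, 1)      # io.count
struct.pack_into("<Q", buf, 40, 256)    # io.data_offset
buf[256:257] = b"H"                     # the byte the guest sent

# Same parsing logic as the run loop above
reason = struct.unpack_from("<I", buf, 8)[0]
size = buf[33]
port = struct.unpack_from("<H", buf, 34)[0]
count = struct.unpack_from("<I", buf, 36)[0]
off = struct.unpack_from("<Q", buf, 40)[0]
data = bytes(buf[off : off + size * count])
print(reason, hex(port), data)  # 2 0x10 b'H'
```

Note the indirection: the I/O payload isn't at a fixed offset; the kernel tells you where it lives via data_offset.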
The flow for each character looks like this:
Guest:  out 'H' to port 0x10
          │
          ▼
KVM:    *VM exit* (I/O out)
          │
          ▼
Python: reads kvm_run → port=0x10, data=b'H'
          │
          ▼
        on_io(0x10, b'H') → print('H')
          │
          ▼
Python: ioctl(KVM_RUN) → resumes guest
          │
          ▼
Guest:  out 'e' to port 0x10
        ... and so on
Part 10: The Mini Assembler
In the original Rust blog post, the guest program was defined as a wall of raw hex bytes:
# 😱
bytes([0x48, 0x31, 0xC0, 0x50, 0x48, 0xB8, 0x6F, 0x72, ...])
This is truly cursed...
Instead, we'll build a tiny Asm class so those of us who don't speak raw x86 machine code can still write guest programs.
Each method emits the correct bytes behind the scenes:
class Asm:
    """A tiny x86-64 assembler. Turns method calls into machine code."""

    def __init__(self):
        self._buf = bytearray()
        self._labels = {}

    def _emit(self, *bs):
        self._buf.extend(bs)

    def code(self) -> bytes:
        return bytes(self._buf)
Low-Level Instructions
Each CPU instruction becomes a method that emits the right bytes:
def mov_rax(self, imm64):
    """rax = <64-bit value>"""
    self._emit(0x48, 0xB8, *imm64.to_bytes(8, "little"))

def mov_rsi_rsp(self):
    """rsi = rsp"""
    self._emit(0x48, 0x89, 0xE6)

def load_byte_from_rsi(self):
    """al = byte pointed to by rsi"""
    self._emit(0x8A, 0x06)

def inc_rsi(self):
    """rsi += 1"""
    self._emit(0x48, 0xFF, 0xC6)

def push_rax(self):
    self._emit(0x50)

def out_al(self, port):
    """Send the byte in AL to an I/O port. Triggers a VM exit."""
    self._emit(0xE6, port & 0xFF)

def hlt(self):
    """Halt the CPU."""
    self._emit(0xF4)
Labels and jumps work by recording byte positions and patching relative offsets. Backward jumps can be patched immediately; forward jumps (like the one that exits the print loop) are recorded and patched once the label is later defined:

def label(self, name):
    """Mark current position with a name for jumps."""
    self._labels[name] = len(self._buf)
    # Backpatch any forward jumps that were waiting on this label
    for pos in getattr(self, "_fixups", {}).pop(name, []):
        self._buf[pos + 1] = (self._labels[name] - (pos + 2)) & 0xFF

def test_al_zero(self):
    """Set CPU flags: is AL == 0?"""
    self._emit(0x84, 0xC0)

def _jump8(self, opcode, target):
    """Emit a short (rel8) jump to a label, patching now or later."""
    pos = len(self._buf)
    self._emit(opcode, 0x00)
    if target in self._labels:  # backward jump: patch immediately
        self._buf[pos + 1] = (self._labels[target] - (pos + 2)) & 0xFF
    else:                       # forward jump: patched in label()
        self._fixups = getattr(self, "_fixups", {})
        self._fixups.setdefault(target, []).append(pos)

def jump_if_zero(self, target):
    """Jump to label if AL was zero."""
    self._jump8(0x74, target)

def jump(self, target):
    """Unconditional jump to label."""
    self._jump8(0xEB, target)
High-Level Helpers
The real magic is in two convenience methods that compose the low-level instructions for you:
def push_string(self, s):
    """
    Push a string onto the stack. After this, RSP points at the
    first character. Null-terminated and padded automatically.
    """
    raw = s.encode("ascii") + b"\x00"
    while len(raw) % 8 != 0:
        raw += b"\x00"
    # Push in reverse (stack grows down)
    for i in range(len(raw) - 8, -1, -8):
        chunk = int.from_bytes(raw[i:i+8], "little")
        self.mov_rax(chunk)
        self.push_rax()

def print_loop(self, port=0x10):
    """
    Emit a loop that reads bytes from RSI one at a time,
    sends each to an I/O port, and repeats forever.
    """
    self.label("loop")
    self.load_byte_from_rsi()
    self.test_al_zero()
    # ... jump logic, out, inc, reset ...
push_string handles all the tricky parts automatically: ASCII encoding, null termination, padding to 8-byte alignment, splitting into 64-bit chunks, and pushing them in reverse order so the string reads forward in memory.
print_loop emits the entire read → test → out → increment → jump cycle. It also has a sibling, print_once,
that prints the string once and then halts the CPU instead of looping.
Part 11: Writing The Guest Program
With the Asm class, writing the guest program is now three lines:
asm = Asm()
asm.push_string("Hello, World!\n") # put the string on the stack
asm.mov_rsi_rsp() # point RSI at the first character
asm.print_loop(port=0x10) # loop: send each char to port 0x10
That's it. We now have our guest program.
And if you're curious, you can peek at the raw bytes:
asm = Asm()
asm.push_string("Hi!\n")
asm.mov_rsi_rsp()
asm.print_loop(port=0x10)
print(asm.code().hex(" "))
# 48 b8 48 69 21 0a 00 00 00 00 50 48 89 e6 8a 06 84 c0 74 07 e6 10 48 ff c6 eb f3 48 89 e6 eb ee
Part 12: Putting It All Together
WE MADE IT.
Here's the complete main script:
import signal
import sys
import threading

from cursed_vm import Asm, CursedVm

def main():
    # Build the guest program
    asm = Asm()
    asm.push_string("Hello, World!\n")
    asm.mov_rsi_rsp()
    asm.print_loop(port=0x10)

    # Handle Ctrl+C
    stop = threading.Event()
    signal.signal(signal.SIGINT, lambda *_: stop.set())

    # Run the VM on a background thread
    def vm_thread():
        with CursedVm() as vm:
            vm.load(asm.code())
            vm.run(lambda port, data: (
                sys.stdout.write(chr(data[0])) or sys.stdout.flush()
                if port == 0x10 and len(data) == 1
                else None
            ))

    t = threading.Thread(target=vm_thread, daemon=True)
    t.start()
    stop.wait()
    print("\nBye.")

if __name__ == "__main__":
    main()
We run the VM on a background thread because the run loop blocks (it only returns when the guest halts or crashes). The main thread waits for Ctrl+C, which sets the stop event and lets us shut down cleanly.
The with CursedVm() as vm: context manager ensures we clean up file descriptors and munmap the guest memory when we're done.
Run it:
python main.py
And you should see an infinite loop of Hello, World's just like Turing intended:
Hello, World!
Hello, World!
Hello, World!
Hello, World!
^C
Bye.
Each "Hello, World!" is being printed one character at a time by a virtual CPU executing machine code inside a KVM virtual machine, with each character triggering a VM exit that Python catches and writes to your terminal. How freaking cool is that?!
What Just Happened?
Let's zoom out. Here's the full chain of events:
- Python opens /dev/kvm and asks the kernel for a virtual machine.
- Python allocates 256 MiB of memory with mmap and tells KVM "pretend this is the guest's physical RAM."
- Python writes a GDT, page tables, and segment register config into that memory and pokes the VCPU's registers to activate 64-bit mode.
- The Asm class generates machine code from readable method calls, and Python copies it into the guest's code region.
- Python calls ioctl(KVM_RUN), and the physical CPU enters a special hardware mode (VMX root/non-root on Intel) where it executes the guest code at native speed.
- The guest pushes "Hello, World!\n" onto its stack and enters a loop that sends each character to I/O port 0x10 via the out instruction.
- Each out causes a VM exit: the CPU switches back to the host, KVM writes the exit details into the shared kvm_run buffer, and control returns to Python.
- Python reads the port and data, prints the character, and calls KVM_RUN again.
- This repeats, 14 VM exits per "Hello, World!\n" (one per character, including the newline), until you hit Ctrl+C.
The guest code runs at essentially native CPU speed. The only overhead is the VM exits for I/O, which take a few microseconds each. For our 14-character string printed in a tight loop, we're looking at tens of thousands of "Hello, World!"s per second.
Is our VM good enough for super intelligence?
Obviously this isn't a production-grade system, but there are great projects out there that are. Nightshift uses Firecracker under the hood to manage its virtualization stack.
Nightshift then wraps this virtualization with an easy to use and deploy SDK, purpose built for Agents. We give AI agents their own Firecracker microVMs so they can write and execute code in a fully isolated sandbox, with the same hardware-backed security guarantees you just saw, but without any of the ceremony.
Join the Nightshift community
Read the source, open an issue, send a PR. Nightshift is built for the community that runs agents in production.
Star on GitHub