Systems Engineering

Containers, But Without The Magic Part 1: Networking

Gianni Crivello · 9 min read

If you spend enough time around container tooling, you’ll eventually hear phrases like:

“CNI plugin chains”

“overlay networking”

“service mesh sidecars”

Which all sound very impressive.

But underneath all of that, container networking is built on a handful of Linux primitives.

We're on a journey to get Nightshift delegating sandbox runtime to containerd. This will give the project a ton of flexibility and make it easy for users to swap in the type of sandbox they want to run (kata containers, firecracker, runc, etc.).

Check out the Nightshift project!

We're trying to standardize how AI Agents run on-prem or in the cloud.

View on GitHub

Containerd's implementation will also be better than our bespoke runtime. We want to stand on the shoulders of giants and focus on what makes Nightshift a unique project.

So before we start wiring containerd into Nightshift, we should understand the networking stack we have at our disposal. That's when I came across CNI (the Container Network Interface), which streamlines container networking and sits between the Linux networking stack and the container runtime.

"What does that even mean?" I asked myself on a sunny South Florida afternoon.

I don't like magic so I did what any curious engineer would do:

I built container networking from scratch using nothing but ip (plus a pinch of iptables and sysctl at the end).

This is my journey toward understanding why CNI exists and what it offers the ecosystem.

Start with Nothing

First we create two network namespaces. A network namespace is basically a separate networking stack with its own interfaces, its own routing table, and its own ARP cache.

In other words, it behaves like a small machine.

If you're a noob to networking like me, I highly suggest firing up a Linux machine and following along!

Let's create two namespaces.

sudo ip netns add ns1
sudo ip netns add ns2

Let’s look inside one of them.

sudo ip netns exec ns1 ip link

You’ll see something like:

1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

1:: This is the interface index. Every network interface on a system gets a unique numeric ID inside that namespace.

lo:: This is the loopback interface. Loopback is a virtual interface that sends packets back to the same machine. This is your localhost!

<LOOPBACK>: This is an interface flag. There are several of these including UP, BROADCAST, MULTICAST, LOOPBACK, and LOWER_UP. Notice, this doesn't say UP yet.

mtu 65536: MTU is the maximum transmission unit. This is the maximum packet size the interface will send.

qdisc: This is the packet scheduler attached to the interface. Linux allows traffic shaping and queuing using qdiscs. noop means do nothing.

state: This tells us whether the interface is active. It's currently DOWN.

mode: This relates to special interface modes used by certain drivers. It's currently DEFAULT.

group: Interfaces can be grouped for administrative purposes. This one is in the default group.

qlen 1000: This is the maximum number of packets queued for transmission. If packets are generated faster than they can be transmitted, they wait in this queue.

link/loopback: This tells us the Layer 2 link type. Loopback interfaces show link/loopback; the veth interfaces we create later will show link/ether.

00:00:00:00:00:00: This is the MAC address. Loopback interfaces don’t use real MAC addresses, so Linux assigns all zeros.

brd 00:00:00:00:00:00: The broadcast address is the address used to send packets to all devices on the network. Loopback doesn’t use broadcast, so it’s all zeros.

I wasn't lying when I said we're spelling things out.

That’s the entire network stack for this namespace. Not exactly production ready.

Bring the loopback interface up:

sudo ip netns exec ns1 ip link set lo up
sudo ip netns exec ns2 ip link set lo up

So if we look again with sudo ip netns exec ns1 ip link we'll see:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

Now we have two isolated “containers”. They just can’t talk to anything yet. Let's fix that.

Give the Container a Network Cable

To connect a namespace to the host we use a veth pair.

Think of a veth pair as a virtual ethernet cable:

vethA <----> vethB

Packets entering one side immediately appear on the other.

Let’s create one.

sudo ip link add veth1 type veth peer name veth1-host

Right now both ends exist on the host.

We need to move one end into the namespace:

sudo ip link set veth1 netns ns1

Check the host side:

ip link show veth1-host

Check inside the namespace:

sudo ip netns exec ns1 ip link

You should now see an Ethernet interface inside ns1. I'll leave repeating the same process for ns2 (with veth2 and veth2-host) as an exercise for the reader.

At this point each container has a NIC, but they’re still not connected to anything useful.

Add a Virtual Switch

To connect containers together we introduce a Linux bridge. A bridge behaves like an Ethernet switch: it forwards packets based on MAC addresses.

Create a bridge:

sudo ip link add br0 type bridge
sudo ip link set br0 up

Now attach the host ends of the veth pairs to the bridge.

sudo ip link set veth1-host master br0
sudo ip link set veth2-host master br0

Bring the host ends up:

sudo ip link set veth1-host up
sudo ip link set veth2-host up

The topology now looks like this:

        br0
       /   \
 veth1-host veth2-host
     |          |
    ns1        ns2

In other words: a tiny virtual network. Now we need IP addresses so the containers know how to talk to each other, and eventually the internet.

Give the Containers IP Addresses

Right now the containers have interfaces but no addresses.

Let’s fix that.

sudo ip netns exec ns1 ip addr add 10.0.0.2/24 dev veth1
sudo ip netns exec ns1 ip link set veth1 up
sudo ip netns exec ns2 ip addr add 10.0.0.3/24 dev veth2
sudo ip netns exec ns2 ip link set veth2 up

Now try a ping from one container to another.

sudo ip netns exec ns1 ping 10.0.0.3

holy crap it works.

PING 10.0.0.3 (10.0.0.3) 56(84) bytes of data.
64 bytes from 10.0.0.3: icmp_seq=1 ttl=64 time=0.045 ms
64 bytes from 10.0.0.3: icmp_seq=2 ttl=64 time=0.034 ms

The bridge automatically learns where each MAC address lives.

You can actually inspect this table:

bridge fdb show

Which is exactly how a physical switch behaves. Now that the containers can talk to each other, let's get them talking to the internet.
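The same learning is visible from inside a namespace: after the ping, ns1's ARP cache knows which MAC address answers for 10.0.0.3. The br filter below is optional and just limits the FDB dump to our bridge:

```shell
# The forwarding database, limited to entries on br0.
bridge fdb show br br0

# ns1's ARP cache: 10.0.0.3 should be resolved to veth2's MAC.
sudo ip netns exec ns1 ip neigh
```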

Add a Gateway

Eventually containers need to talk to the outside world.

For that we assign an IP to the bridge and treat it as a gateway.

sudo ip addr add 10.0.0.1/24 dev br0

Test from the container:

sudo ip netns exec ns1 ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.
64 bytes from 10.0.0.1: icmp_seq=1 ttl=64 time=0.039 ms
64 bytes from 10.0.0.1: icmp_seq=2 ttl=64 time=0.030 ms

ok so we're talking to the gateway. Now we add a default route: when the container has no more specific route for a destination, it sends the packet to the gateway.

sudo ip netns exec ns1 ip route add default via 10.0.0.1
sudo ip netns exec ns2 ip route add default via 10.0.0.1

Check the routing table:

sudo ip netns exec ns1 ip route

You should see:

10.0.0.0/24 dev veth1
default via 10.0.0.1

Local traffic stays on the bridge. Everything else goes through the gateway. But, we're not done yet.

NAT

No, not like the bug. We're talking networks here, guys.

Containers use private IP addresses, which means they can’t reach the internet directly.

The host fixes that with NAT (Network Address Translation).

First enable packet forwarding:

sudo sysctl -w net.ipv4.ip_forward=1

Then add a NAT rule:

sudo iptables -t nat -A POSTROUTING -s 10.0.0.0/24 ! -o br0 -j MASQUERADE

Now the host rewrites outgoing packets so they appear to originate from the host’s IP.

And suddenly the containers can reach the internet. Let's ping Google!

sudo ip netns exec ns1 ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=116 time=1.18 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=116 time=1.19 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=116 time=1.16 ms

What We Just Built

Phew! Let’s step back for a moment.

Using only Linux tools we:

  • created network namespaces
  • created veth pairs
  • built a bridge
  • assigned IP addresses
  • configured routing
  • added NAT

That’s the entire foundation of container networking...mostly.

Enter CNI

CNI (Container Network Interface) exists to automate the exact process we just walked through.

Instead of executing twenty shell commands, a container runtime runs a plugin with a JSON configuration.

The bridge plugin, for example, will:

  • create a bridge
  • create the veth pair
  • move one side into the container namespace
  • assign an IP address
  • configure routes

Which should look very familiar to us now!
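For a concrete picture, here's roughly what a network config for the bridge plugin looks like (field names come from the CNI spec; the subnet and bridge name mirror our manual setup, and "nightshift-net" is just a name I made up):

```json
{
  "cniVersion": "1.0.0",
  "name": "nightshift-net",
  "type": "bridge",
  "bridge": "br0",
  "isGateway": true,
  "ipMasq": true,
  "ipam": {
    "type": "host-local",
    "subnet": "10.0.0.0/24",
    "gateway": "10.0.0.1"
  }
}
```

isGateway puts 10.0.0.1 on the bridge (our gateway step), ipMasq installs the MASQUERADE rule (our NAT step), and host-local hands out addresses from the subnet (our manual ip addr add).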

Why This Matters for Nightshift

Nightshift runs isolated compute environments for executing user workloads.

As we started integrating with containerd, one detail became important very quickly:

containerd doesn’t actually implement networking.

Instead, it delegates networking to CNI.

Which means every container ultimately gets wired together using the same primitives we just explored:

network namespace --> veth pair --> bridge --> routing --> iptables

Once you understand that stack, container networking starts feeling a little less mysterious.

CNI isn’t magic.

It’s just a well-structured specification to automate a bunch of ip commands.
