The compute stack: Fedora CoreOS, Podman quadlets, and systemd-native container orchestration

The previous article covered what happens when an instance is replaced — the volume takeover, the concurrent Terraform and systemd execution, the 2-5 minutes of downtime. This article covers the instance itself: what's on it, how it gets configured, and how multiple containers run on one EC2 without Kubernetes, docker-compose, or a configuration management tool.

The EC2 instances run Fedora CoreOS: a container-optimized Linux with a read-only root filesystem, no package manager, and no SSH by default. Every operational task you'd normally handle with apt-get, a config management run, or a quick SSH session has to work differently. The configuration runs once. The root is sealed. Any tool that isn't in the base image runs in a container or doesn't run at all.

Configuration drift and the immutable alternative

Config management tools solve the "keep it running" problem through convergence. cloud-init bootstraps the instance, then Ansible, Chef, or Puppet converge it toward a desired state. The host is mutable. Drift is prevented by the tool re-running on a schedule, not by the OS. Stop running Ansible and the host starts diverging the moment someone SSH's in and makes a "quick fix." The enforcement is the tool, not the platform.

Container orchestrators shift the problem to the application layer. Kubernetes manifests declare what should be running; the control plane continuously reconciles toward it. But the node itself is still mutable (unless you also run Bottlerocket or Talos). The orchestrator ensures the right containers are running. It says nothing about whether the host underneath has drifted.

Immutable OS provisioning removes the problem at the root. The configuration runs once at first boot, and the root filesystem is sealed read-only afterward. No drift because the filesystem can't be written to, and no reconciliation loop because there's nothing to reconverge toward. systemd is the only process supervisor needed on one machine. FCOS supports atomic OS updates via rpm-ostree, but this architecture replaces the instance instead.

Butane, Ignition, and the pointer pattern

Fedora CoreOS uses Ignition — a provisioning system that runs in the initramfs, before the root filesystem is mounted read-write for the only time. It creates files, writes systemd units, configures users, and sets up storage. Butane is the human-readable YAML source that transpiles to Ignition JSON. The Terraform configuration uses the Poseidon ct provider for the conversion:

data "ct_config" "workspace" {
  content      = local.butane_config
  strict       = true
  pretty_print = false
}

strict = true rejects any field the transpiler doesn't recognize — a typo in the Butane config is a Terraform error, not a silently ignored directive.

The entire OS configuration — every systemd unit, every container definition, every config file, every shell script — is declared in a single Butane file. The resulting Ignition JSON is uploaded to S3. The EC2 instance's user_data doesn't contain the full configuration. It's a ~200-byte pointer that tells Ignition where to fetch the real config:

locals {
  ignition_content_hash = sha512(data.ct_config.workspace.rendered)
  ignition_pointer = jsonencode({
    ignition = {
      version = "3.4.0"
      config = {
        replace = {
          source       = "s3://${var.ignition_bucket_name}/${local.ignition_s3_key}"
          verification = {
            hash = "sha512-${local.ignition_content_hash}"
          }
        }
      }
    }
  })
}

EC2 user_data has a 16KB limit. A full Ignition config with seven container quadlets, a dozen systemd services, shell scripts, an nginx configuration, and a Dockerfile exceeds that comfortably. The instance fetches the full config from S3 at boot using its instance profile credentials.

The SHA-512 hash serves double duty. Integrity: if the S3 object is corrupted or tampered with, the instance enters emergency mode rather than running with a bad config. Change detection: the hash is embedded in user_data, so when the Ignition config changes, user_data changes, and Terraform replaces the instance:

resource "aws_instance" "workspace" {
  ami       = local.ami_id
  user_data = local.ignition_pointer

  root_block_device {
    volume_size           = 10
    volume_type           = "gp3"
    encrypted             = true
    delete_on_termination = true
  }

  lifecycle {
    replace_triggered_by = [aws_s3_object.ignition_config]
  }
}

The root volume is 10GB, encrypted, and destroyed with the instance. Data lives on a separate gp3 EBS volume mounted at /var/data — databases, credentials, certificates, backups, and customer graph files.

Partitioning a volume that doesn't exist at boot

Ignition has built-in storage.disks and storage.filesystems declarations for partitioning and formatting volumes. On this architecture, they create a boot deadlock. The data volume isn't attached when the instance first boots. The volume takeover service claims it from the previous instance after the OS is already running. Ignition's storage declarations would block in the initramfs, waiting for a device that won't appear until well after boot.

The Butane config avoids this entirely — no storage.disks, no storage.filesystems. Instead, a systemd-repart configuration file defines the partition layout, and a chain of systemd services handles the rest after the volume arrives:

volume-takeover.service — claims the EBS volume (covered in Article 4)
data-partitioning.service — waits for the NVMe device node, runs systemd-repart (no-op if already partitioned, formats as XFS on first deployment)
var-data.mount — mounts /dev/disk/by-partlabel/data at /var/data
data-directories.service — creates the directory structure via systemd-tmpfiles
platform-init.service — first-boot initialization
Container services start

Each step declares After= and Requires= on the previous one. If any step fails, the dependent services don't start.

Secrets without state files

platform-init.service runs before any container starts. It checks for a marker file on the persistent data volume.

On a genuinely new data volume, the marker is absent. The script generates everything the system needs: database passwords, SSH keys, connection URIs. It writes the marker when done.

On subsequent boots, including after instance replacement when the volume takeover reattaches the data from the previous instance, the marker exists. The script sets SELinux contexts on data directories and exits. No credential generation, no key creation.

The decision: no secrets pass through Terraform. Database passwords and connection URIs are generated on the instance and written to the persistent data volume. They never appear in Terraform state files, CI logs, or the Ignition configuration.

generate_password() {
    openssl rand -base64 24 | tr -d '/+=' | head -c 32
}

if [ ! -f "${CREDENTIALS_FILE}" ]; then
    # Generate random passwords for each database
    # Write passwords and pre-built connection URIs to a shell-sourceable file
    # (bridge network URIs, host network URIs, async driver variants)
    ...

    chmod 600 "${CREDENTIALS_FILE}"
    chown <app-uid>:<app-uid> "${CREDENTIALS_FILE}"
fi

The credentials file is a shell-sourceable format that systemd's EnvironmentFile= directive reads directly. Each container's quadlet references it, so the same passwords propagate to every service that needs them.

The init script handles more than credentials. It generates an Ed25519 SSH keypair for the management service (covered below) and sets directory ownership so each database's data directory is owned by the UID that database runs as inside its container. It also creates a self-signed TLS certificate so nginx can start immediately while Let's Encrypt issuance runs in the background via the DNS-01 flow described in Article 4.

Because the credentials live on the persistent data volume, they survive instance replacement. A new instance boots, the volume takeover reattaches the data, the init script finds the marker and the existing credentials, and the containers start with the same database passwords. No re-initialization.

Multiple containers, no orchestrator

The opening section covered why Kubernetes doesn't solve OS-level drift. It also doesn't fit the resource constraints: 1-2GB of control plane overhead on a 4GB instance. docker-compose is lighter, but it still runs a daemon between systemd and the container runtime. docker-compose up starts a process supervisor that owns the container lifecycle; systemd doesn't know the individual containers exist, can't express dependencies between them and non-container services, and can't restart a single container on failure. Logging goes through the Docker daemon rather than journald. On an immutable host where systemd already manages every other process, that's a second process supervisor with no added value.

Podman quadlets are .container, .build, and .network files in /etc/containers/systemd/. At boot, systemd's generator converts each one into a regular service unit. No daemon, no orchestration layer. systemctl manages containers the same way it manages every other service. Container stdout and stderr map directly to the systemd journal, so journalctl -u backend.service shows the backend's application logs alongside its systemd lifecycle events, with no separate log aggregator. Dependency ordering, health checks, restart policies, and resource limits use the same [Unit] and [Service] directives as any systemd unit. The container runtime is Podman, but the lifecycle manager is systemd.

Most containers communicate over a Podman bridge network by container name. One — the management service — runs on the host network for loopback SSH access to systemd (covered in "Controlling systemd from a container" below).

systemd unit dependency graph showing boot sequence from volume-takeover through container services

systemd orders the full startup through After= and Requires= edges between units. The backend's quadlet declares:

[Unit]
After=platform-network.service platform-init.service graph-db.service
      relational-db.service policy-engine.service ecr-credential-helper.service
      ai-engine.service
Requires=platform-network.service platform-init.service
         ecr-credential-helper.service graph-db.service
         relational-db.service policy-engine.service
Wants=ai-engine.service

Requires= means the backend won't start without the listed services. Wants= is softer — the AI engine is preferred but not mandatory. If the AI engine is still building its custom image when the backend starts, I features degrade gracefully. nginx depends on the backend; the management service runs independently on its own branch of the graph.

Container images come from a private ECR registry. Fedora CoreOS doesn't ship a credential helper for ECR. A oneshot service downloads the docker-credential-ecr-login binary on first boot (ConditionPathExists=!/usr/local/bin/docker-credential-ecr-login skips subsequent runs), and every container that pulls from ECR declares After=ecr-credential-helper.service. Image pulls authenticate through the instance profile — no registry passwords stored anywhere.

Health checks use whatever tool the base image ships: the graph database's native client (validates the protocol, not just the process), pg_isready for the relational database, wget or curl depending on what the base image includes. Services built from distroless or scratch images have no health checks — there's no binary that could execute one. systemd restarts them on crash (Restart=always).

HealthStartPeriod gives slow starters time before the first check, with longer grace periods for services that run database migrations at startup.

Each quadlet sets a hard memory ceiling (--memory) and a soft reservation (--memory-reservation). The ceilings across all containers deliberately overcommit available RAM. The reservations fit within it, with headroom for the OS and page cache. In practice, not every container hits its ceiling simultaneously — the graph database and AI engine spike during queries; the rest idle.

Building a custom image at boot

A customer installs a graph module that ships its own Python dependencies, and the AI engine needs those libraries at runtime. But the base image from ECR is generic. It doesn't know what modules any particular customer has installed, and the host is immutable.

A .build quadlet solves this — a systemd unit that builds a container image at boot:

[Build]
ImageTag=localhost/ai-engine-custom:latest
File=/etc/platform/Dockerfile.ai-engine
SetWorkingDirectory=/var/data/platform/ai-engine
Pull=always
PodmanArgs=--security-opt label=disable

The Dockerfile scans the customer's graph directory for requirements.txt files, deduplicates them, and pip installs the combined set:

FROM ${base_image}

COPY graphs /tmp/graphs
RUN find /tmp/graphs -name "requirements.txt" -exec sh -c 'cat "$1"; echo' _ {} \; \
    | sort -u | grep -v '^$' > /tmp/all-requirements.txt || true; \
    if [ -s /tmp/all-requirements.txt ]; then \
        pip install --no-cache-dir -r /tmp/all-requirements.txt; \
    fi; \
    rm -rf /tmp/graphs /tmp/all-requirements.txt

The image exists only locally. The AI engine's .container quadlet references it by tag. The build runs before the AI engine starts (declared via Requires=), so the custom image is ready when the container that uses it comes up. Pull=always ensures the base image from ECR is fresh on every instance replacement.

One subtlety: Podman builds can change SELinux labels on the build context directory. The build quadlet's ExecStartPost re-applies the appropriate SELinux labels after the build completes, so other containers can still read the graph files.

Hardening a rootful stack

A container escape on this system lands on a read-only host filesystem (Fedora CoreOS), SELinux enforcing mode, an instance profile scoped to one customer's AWS resources, and a /28 subnet with no routes to other customers. Each layer assumes the layer above it has been breached.

All containers run under rootful Podman (system-wide quadlets in /etc/containers/systemd/). Rootless Podman would avoid the root daemon entirely, but it introduces complications that aren't justified on a single-tenant instance already isolated at the network, IAM, and identity layers. nginx needs port 443 — rootless requires net.ipv4.ip_unprivileged_port_start=0 or a port remap. All containers share a consistent SELinux MCS level so their volume mounts are cross-readable; rootless Podman uses a different labeling model that makes this harder to coordinate.

Rootful doesn't mean the containers run as root. Capabilities are dropped wholesale, then selectively re-added. nginx is the most privileged container, and even it gets only what it needs:

ReadOnly=true
DropCapability=ALL
AddCapability=NET_BIND_SERVICE
AddCapability=CHOWN
AddCapability=SETUID
AddCapability=SETGID
NoNewPrivileges=true

NET_BIND_SERVICE for port 443, CHOWN/SETUID/SETGID for worker process management. Application containers get DropCapability=ALL with no additions.

Application services run as a dedicated non-root user created by the Butane config. Each database runs as its upstream-default UID. These UIDs own their respective data directories on the persistent volume, so file permissions enforce separation even after a container escape. NoNewPrivileges is set on every container — no setuid escalation, no capability transitions, kernel-enforced for the process and all its children.

Where possible, container root filesystems are read-only. nginx uses ReadOnly=true with Tmpfs for /var/cache/nginx and /var/run. Configuration files are mounted :ro. SELinux runs in enforcing mode with a consistent MCS level on every container. Volume mounts use :z for automatic relabeling. The init script explicitly sets the correct SELinux type and MCS level on all data directories, because race conditions during boot can cause the automatic relabeling to miss files created between services.

Controlling systemd from a container

The management service needs to trigger systemd operations: restart a service after a module update, rebuild the AI engine's custom image, check service status. It runs as a non-root user inside a FROM scratch container on the host network. Giving it a Podman socket would defeat the privilege separation — socket access can run arbitrary containers with arbitrary privileges.

Instead: a restricted SSH connection over loopback. sshd listens on a non-standard port bound to 127.0.0.1 only, unreachable from outside the instance. The management service connects as a dedicated restricted user with an Ed25519 key generated at first boot (never in Terraform state). The authorized_keys file restricts the key:

no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-ed25519 AAAA...

The sudoers file limits the user to an explicit allowlist of systemctl commands — specific service restarts and status checks, nothing else:

<user> ALL=(root) NOPASSWD: /usr/bin/systemctl restart <service>.service
<user> ALL=(root) NOPASSWD: /usr/bin/systemctl status <service>.service
# ... one line per permitted operation, no wildcards

The management service can restart application services, check their status, and run isolated module lifecycle scripts (via podman run --rm --network=none --read-only --cap-drop=ALL). It cannot restart nginx, stop databases, modify files, or execute arbitrary commands.

Everything is a container

Fedora CoreOS ships systemd, Podman, and not much else. The read-only root means no dnf install. Every tool that isn't in the base image runs in a container, whether it's an application service or operational tooling.

The loopback SSH service needs SELinux to allow sshd on a non-standard port. FCOS doesn't ship semanage (the tool for managing SELinux port labels). A oneshot systemd service handles it: run an ephemeral Fedora container, install the SELinux utilities, apply the label, exit. The container is destroyed; the SELinux policy persists on the host:

ExecStart=/usr/bin/podman run --rm --privileged \
  -v /sys/fs/selinux:/sys/fs/selinux \
  -v /etc/selinux:/etc/selinux \
  -v /var/lib/selinux:/var/lib/selinux \
  fedora:latest \
  sh -c "dnf install -y policycoreutils-python-utils && \
         semanage port -a -t ssh_port_t -p tcp <port> || \
         semanage port -m -t ssh_port_t -p tcp <port>"

The same pattern applies to the volume takeover (AWS CLI in a container, covered in Article 4), certificate renewal (certbot in a container), and backups (covered below). There is no special category of "host-level tools" that get installed differently from application services.

Backups and timers

Two systemd timers handle recurring maintenance. Both use Persistent=true, so if the instance is off when a timer fires (during a replacement, for example), the missed run executes immediately after boot.

The backup timer fires daily with a random delay (spreading load if multiple customers share the same cell). The service dumps both databases through podman exec inside the running containers — the graph database's native export and pg_dump for the relational database — compresses the output, uploads to S3, and cleans up older local backups:

[Timer]
OnCalendar=daily
Persistent=true
RandomizedDelaySec=1800

# Dump each database through podman exec inside the running container
podman exec <graph-db> <native-export-command> > "${BACKUP_DIR}/graph.dump"
podman exec <relational-db> pg_dump ...  > "${BACKUP_DIR}/relational.sql"

# Compress and upload (FCOS has no aws CLI — use the container image)
tar -czf "${DATE}.tar.gz" "${DATE}"
podman run --rm --net=host \
    -v /var/data/backups:/backups \
    -e AWS_REGION="${AWS_REGION}" \
    public.ecr.aws/aws-cli/aws-cli:latest \
    s3 cp "/backups/${DATE}.tar.gz" "s3://${BACKUPS_BUCKET}/backups/${DATE}.tar.gz"

# Retain 7 days locally
find /var/data/backups -name "*.tar.gz" -mtime +7 -delete

The certificate renewal timer fires periodically with a random delay. The script checks the current certificate's expiry, skips if more than 30 days remain, and runs certbot DNS-01 in a container otherwise. After renewal, it copies the certificates to nginx's mount path and reloads the service. Same timer pattern — different OnCalendar, longer random delay.

Both timers are enabled by the Butane config at boot. The backup service declares After= dependencies on the database services — it won't run if the databases aren't up. The certificate service runs independently since it only needs network access and Route53 permissions.

Trade-offs

Immutable infrastructure eliminates configuration drift at the cost of update speed. There are two update paths, and they have different costs.

Container image changes — a new backend version, an updated AI model — go through the management service's SSH interface. It triggers a systemctl restart, the quadlet pulls the new image, and the container comes back up. Seconds of downtime.

OS-level changes — systemd unit modifications, nginx config, a new SELinux policy — require full instance replacement. New Ignition configuration, new S3 upload, new instance. 2-5 minutes. There is no "just edit one file."

Generating credentials on the instance means no secrets in Terraform state, but also no centralized secret management. The credentials exist only on the EBS volume. If the volume is lost, the databases need to be restored from backup with new passwords. For a single-tenant deployment where the customer's data is on the same volume, losing it means restoring from backup regardless.

Quadlets are single-node only. No multi-node scheduling, no rolling updates across instances, no service mesh. For the Consultant tier — one instance per customer, replacement-based updates, brief downtime acceptable — that's the right scope. The Team and Enterprise tiers use K3s clusters where the multi-node features justify their complexity.

Series

This architecture is implemented in dether.net, a graph-native threat modeling platform. If you're interested in seeing these patterns applied to security architecture analysis, that's where they run in production.

The compute stack: Fedora CoreOS, Podman quadlets, and systemd-native container orchestration

The compute stack: Fedora CoreOS, Podman quadlets, and systemd-native container orchestration

Configuration drift and the immutable alternative

Butane, Ignition, and the pointer pattern

Partitioning a volume that doesn't exist at boot

Secrets without state files

Multiple containers, no orchestrator

Building a custom image at boot

Hardening a rootful stack

Controlling systemd from a container

Everything is a container

Backups and timers

Trade-offs

Series

Related Insights

Designing for compromise: a multi-tenant SaaS architecture on AWS

Customer isolation from the infrastructure up