CloudFront VPC Origins: what breaks and how to fix it

Before VPC Origins, connecting CloudFront to a backend in a private subnet required either a public IP on the instance (defeating the purpose of private subnets) or an intermediary. The standard pattern is CloudFront → Application Load Balancer → EC2 in a private subnet. The ALB sits in a public subnet, terminates the connection from CloudFront, and forwards to the instance. It works. It also costs ~$16/month per ALB in hourly charges alone, before any traffic-based costs. For an architecture where each customer gets their own CloudFront distribution pointing at their own EC2 instance, that intermediary cost adds up.

CloudFront VPC Origins remove the need for both. CloudFront creates an Elastic Network Interface directly in the customer's private subnet and routes traffic to the EC2 instance over the AWS backbone. No public IP on the instance, no load balancer, no internet-facing endpoint. The instance is genuinely private — not private-with-an-ALB-in-front, but private with no public-subnet component at all. The security model is straightforward: the instance's security group allows ingress only from the CloudFront VPC Origin's security group, and a custom header injected by CloudFront proves that requests arrived through the distribution.

The earlier articles in this series described the architecture (overview), the per-customer isolation model (blast radius), and the deployment pipeline (self-service). VPC Origins appeared in all three as the mechanism connecting users to their workspaces. This article covers how they work, and the three engineering problems they introduced that traditional CloudFront-to-ALB architectures don't have.

The model

A CloudFront distribution in this architecture has two origins. Static frontend assets (HTML, CSS, JavaScript) are served from an S3 bucket through Origin Access Control. API and GraphQL requests go to a VPC Origin that routes over the AWS backbone to an ENI in the customer's /28 subnet.

diagram dual origin architecture

The VPC Origin is an AWS-managed resource that references the EC2 instance by ARN. CloudFront creates the ENI, manages its lifecycle, and routes traffic over the internal AWS network. The instance never sees a request from the public internet directly. The security group on the customer's instance allows HTTPS ingress only from the CloudFront VPC Origin's security group — not from CIDR ranges, not from prefix lists, not from the internet.

CloudFront injects a customer-specific secret header into every forwarded request. nginx validates it before processing anything:

set $origin_valid 0;
if ($http_x_custom_secret = "${secret_value}") {
    set $origin_valid 1;
}

location /graphql {
    if ($origin_valid = 0) {
        return 403 "Origin verification failed\n";
    }
    # ...
}

The header value is a random secret generated per customer at deployment time. A request that arrives without the correct header — whether from a misconfigured distribution, a direct connection attempt, or another customer's CloudFront — is rejected. The security group prevents network-level access from anything other than CloudFront; the header prevents application-level access from any distribution other than this customer's.

Cache behaviors route requests to the right origin. Static assets go to S3 with aggressive caching (one-day default TTL, gzip compression enabled). The /graphql path goes to the VPC backend with no caching and compression disabled, both necessary for streaming (covered below). The /config endpoint (OIDC settings, feature flags) gets a five-minute cache because it rarely changes but should eventually reflect updates. Module bundles get a similar short cache with authorization headers forwarded.

Custom error responses handle SPA routing: both 403 and 404 from S3 return index.html with a 200 status, so client-side routing works for deep links.

Edge locations are restricted to PriceClass_100 (North America and Europe). The platform's target market doesn't require global edge presence, and the lower price class reduces costs significantly for a per-customer distribution model.

TLS without port 80

Traffic between CloudFront and the EC2 instance travels over the AWS backbone, not the public internet. It would be tempting to leave it as plain HTTP — the network is private, AWS manages it, and the instance has no public endpoint. But the design principle behind this architecture is that no component trusts the network it runs on. A private link is still a link. If an attacker gains access to the VPC (a misconfigured peering route, a compromised Lambda in the same subnet, an AWS-side vulnerability), unencrypted traffic between CloudFront and the backend is readable. TLS on the origin side eliminates that class of exposure regardless of what happens at the network layer.

AWS Certificate Manager provides free certificates, but ACM certificates can only be used with AWS services (CloudFront, ALB, API Gateway). They can't be exported to an EC2 instance. The public-facing TLS termination uses ACM on CloudFront. The origin-side certificate needs to come from somewhere else.

Let's Encrypt is the obvious choice: free, automated, widely supported. The standard ACME flow uses HTTP-01 challenge validation: the ACME server makes an HTTP request to /.well-known/acme-challenge/ on port 80 of the domain being validated. You could route that path through CloudFront to the instance, but that means opening port 80, adding a cache behavior for the challenge path, and keeping an HTTP listener running. That's a wider attack surface for something that runs once every 60 days (30 days before the 90-day expiry). The whole point of VPC Origins is that the instance has no public-facing endpoints.

DNS-01 challenge validation avoids all of that. Instead of proving domain control through an HTTP endpoint, the instance proves it by writing a TXT record to Route53. Let's Encrypt checks the DNS record instead of making an HTTP request. No HTTP endpoint needed.

The certificate renewal script runs certbot in a container (Fedora CoreOS supports rpm-ostree overlays for host packages, but layering packages onto an immutable base defeats the point; containers keep the host clean). It uses the dns-route53 plugin, which writes the challenge TXT record, waits for Let's Encrypt to validate it, and cleans up:

podman run --rm --net=host \
    -v "${LETSENCRYPT_DIR}:/etc/letsencrypt:Z" \
    -e AWS_REGION="${aws_region}" \
    docker.io/certbot/dns-route53:latest \
    certonly --dns-route53 \
    --non-interactive \
    --agree-tos \
    --email "admin@example.com" \
    -d "${DOMAIN}"

The certificate is for the internal origin domain, not the public-facing customer domain. CloudFront terminates the public TLS connection using an ACM certificate. The Let's Encrypt certificate handles the CloudFront-to-EC2 leg. Traffic is encrypted end-to-end, through two separate certificates: ACM on the edge, Let's Encrypt on the origin.

The IAM policy on the instance profile restricts Route53 writes to TXT records only in the customer's public zone:

{
  Effect   = "Allow"
  Action   = ["route53:ChangeResourceRecordSets"]
  Resource = "arn:aws:route53:::hostedzone/${var.public_zone_id}"
  Condition = {
    "ForAllValues:StringEquals" = {
      "route53:ChangeResourceRecordSetsRecordTypes" = ["TXT"]
    }
  }
}

A compromised instance can renew its own TLS certificate. It cannot modify A records, CNAME records, or any routing record. It can only write TXT records in the customer's public zone. That's sufficient for the DNS-01 challenge but useless for traffic redirection: CloudFront routes to the VPC Origin by ARN, not by resolving the origin domain via DNS. The origin domain configured in CloudFront is used for SNI during the TLS handshake, not looked up in Route53.

A systemd timer triggers renewal. The script checks the current certificate's expiry, skips renewal if more than 30 days remain, and reloads nginx after replacing the certificate files. Each customer gets their own certificate for their own internal origin domain, no wildcards, no shared secrets across tenants. The entire flow runs unattended, with no port 80 and no public IP.

No WebSocket support

CloudFront VPC Origins do not support WebSocket connections. CloudFront with public-facing origins does — the Upgrade header can be forwarded, and CloudFront will proxy WebSocket frames. But VPC Origins specifically don't. Traffic between CloudFront and a VPC Origin flows over the AWS backbone via the ENI, and that path doesn't handle the HTTP-to-WebSocket upgrade.

This matters because GraphQL subscriptions (the mechanism for real-time updates like AI chat streaming) traditionally run over WebSockets. The client opens a WebSocket connection, subscribes to a topic, and the server pushes events as they occur. With VPC Origins, the WebSocket never establishes.

The solution is Server-Sent Events. SSE is a standard HTTP mechanism: the client makes a regular HTTP request, and the server holds the connection open, streaming events as text/event-stream responses. It's unidirectional (server to client), but GraphQL subscriptions are inherently server-push. The client initiates the subscription with a standard HTTP POST, and the server streams results back. No bidirectional channel is needed.

The nginx configuration for the /graphql endpoint is tuned for SSE:

location /graphql {
    # ...origin verification...

    proxy_pass http://backend_upstream;
    proxy_http_version 1.1;

    # SSE-specific settings
    proxy_set_header Connection '';
    proxy_buffering off;
    proxy_cache off;
    chunked_transfer_encoding on;
    proxy_read_timeout <long_timeout>;
}

proxy_buffering off ensures nginx doesn't buffer the response, so each event is forwarded to the client immediately as the backend emits it. proxy_cache off prevents nginx from trying to store the streaming response in its cache zone. proxy_set_header Connection '' prevents nginx from injecting a Connection: close header that would terminate the stream. chunked_transfer_encoding on allows the response body to stream incrementally.

proxy_read_timeout sets the maximum connection lifetime. In practice this should be tuned down to match the longest expected streaming session, since long-lived idle connections consume file descriptors and backend connection pool slots. CloudFront's own origin read timeout provides a shorter ceiling for connections that stop sending data, but the nginx value should still reflect the actual use case rather than a permissive default.

On the CloudFront side, the /graphql* cache behavior disables caching (default_ttl = 0, max_ttl = 0) and compression (compress = false). Caching a streaming response would mean the second client receives the first client's events. Gzip compression would buffer the stream, adding latency and potentially breaking the event boundaries that SSE relies on.

The trade-off compared to WebSockets is minor for this use case. WebSockets provide bidirectional communication; SSE provides server-to-client streaming. GraphQL subscriptions only push from server to client. The client sends new subscription requests as separate HTTP calls. SSE also reconnects automatically on disconnection (the transport libraries handle this as part of the protocol), which is an advantage over WebSockets where reconnection logic is application-level. The application does need to be configured for SSE transport instead of WebSocket, but this isn't custom work. Apollo and the other major GraphQL libraries support SSE subscriptions as a standard transport option.

What started as a constraint turns out to be a security improvement. SSE operates over standard HTTP, which means the entire existing security stack applies without adaptation. Authentication uses standard Authorization: Bearer headers and HTTP-only secure cookies. The same middleware that validates regular API requests validates subscription requests. WAFs, proxies, and API gateways can inspect SSE traffic, apply rate limiting, and block malicious payloads using their standard rulesets, because it's just HTTP. With WebSockets, most of that infrastructure either can't inspect the upgraded connection or requires specialized configuration to do so.

The directionality constraint is itself a security property. Because the client cannot send data back over the SSE channel, an entire class of client-side injection attacks over the open connection is eliminated. A WebSocket is bidirectional: once upgraded, the server must validate every frame the client sends for the lifetime of the connection. SSE has no such surface. The client subscribes, the server pushes. Data flows one way. Cross-origin protections are standard CORS, enforced by the browser automatically.

Hostile volume takeover

The architecture uses a single EC2 instance per customer with a persistent EBS data volume. Databases, vector stores, and state live on the data volume. The stack is immutable: there are no in-place updates. Every change, whether an OS patch, a configuration change, or an application upgrade, replaces the entire instance. The root volume is ephemeral. The data volume carries over.

With an ALB-based architecture, instance replacement is straightforward. The CloudFront origin points at the ALB, not the EC2 instance. Terraform can terminate the old instance, launch a new one, reattach the EBS volume, and register it with the target group. CloudFront never notices — the origin ARN is the load balancer, which doesn't change.

VPC Origins point directly at the EC2 instance ARN. This creates a chicken-and-egg problem: the old instance can't be terminated while the VPC Origin still references it, but the VPC Origin can't be updated to reference the new instance until that instance exists. Terraform can't destroy-then-create because the distribution would have no valid origin during the gap. It must use create_before_destroy — the new instance has to exist before the old one goes away:

resource "aws_cloudfront_vpc_origin" "backend" {
  vpc_origin_endpoint_config {
    name                   = "${var.prefix}-origin-${var.customer_id}-${random_id.vpc_origin_suffix.hex}"  # random suffix forces new resource on replacement
    arn                    = module.compute.instance_arn
    https_port             = 443
    origin_protocol_policy = "https-only"

    origin_ssl_protocols {
      items    = ["TLSv1.2"]  # minimum floor — CloudFront auto-negotiates TLS 1.3 when the origin supports it
      quantity = 1
    }
  }

  lifecycle {
    create_before_destroy = true
  }
}

create_before_destroy means the new EC2 instance must exist before the old VPC Origin is destroyed. Terraform creates the new instance, creates a new VPC Origin pointing to it, updates the CloudFront distribution, then destroys the old VPC Origin and the old instance. This keeps the distribution continuously associated with a valid origin.

But EBS volumes can only attach to one instance at a time. The new instance has booted. The old instance still holds the data volume. The new instance needs the databases on that volume to serve traffic, but it can't attach the volume until the old instance releases it. Terraform can't manage this: an aws_volume_attachment resource would create a dependency cycle between the old and new instances.

diagram hostile volume takeover

The solution moves volume attachment logic out of Terraform and into the instance itself. A systemd service on the new instance handles the takeover.

When the new EC2 boots, it has no data volume. A oneshot systemd service (volume-takeover.service) runs before any application container starts. It pulls the AWS CLI container image (Fedora CoreOS has no AWS CLI on the host filesystem), then runs a script that:

Fetches its own instance ID from the EC2 metadata service (IMDSv2, token-based)
Queries the EBS volume's attachment status
If another instance holds the volume, issues stop-instances — an ACPI shutdown that triggers a clean OS shutdown on the old instance. Databases flush to disk. Systemd stops services in dependency order
Waits for the old instance to fully stop
Attaches the volume to itself

# Query who currently holds the volume
# If it's already attached to this instance, exit early

# If another instance holds it:
#   1. Issue stop-instances (ACPI shutdown — databases flush cleanly)
#   2. Wait for the old instance to fully stop
#   3. Attach the volume to this instance
#   4. Wait for attachment to complete

The systemd service runs with RemainAfterExit=yes and a 10-minute timeout, which is more than enough for the stop-wait-attach sequence:

[Unit]
Description=Volume Takeover
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
RemainAfterExit=yes
TimeoutStartSec=600

# Pulls AWS CLI container image, then runs the takeover script
# with region and volume ID passed as environment variables
ExecStartPre=<pull AWS CLI container>
ExecStart=<run takeover script in container with host networking>

After the takeover completes, var-data.mount waits for the block device to appear and mounts it. data-directories.service verifies the directory structure exists on the volume. Only then do the application containers start — systemd enforces the ordering: takeover → mount → directories → databases → backend → nginx.

The IAM permissions required are narrow. ec2:AttachVolume, ec2:DetachVolume, and ec2:StopInstances are scoped by aws:ResourceTag/Customer, so the instance can only act on resources tagged with its own customer ID. ec2:DescribeVolumes and ec2:DescribeInstances can't be resource-scoped in IAM (they're list operations), so they're granted on *, but describe calls only return metadata and can't modify anything. A compromised instance can stop another instance with the same customer tag (which is only ever its own predecessor) and attach its own data volume. It cannot stop or attach resources belonging to other customers. Permission boundaries on the instance role enforce this ceiling even if the inline policy is misconfigured.

The downtime is 2-5 minutes: ACPI shutdown of the old instance (~30-60 seconds), volume detach and reattach (~30 seconds), mount and directory check (~5 seconds), container startup (~60-90 seconds). For the Consultant tier (threat modeling sessions, not real-time transactions) this is acceptable. The Team and Enterprise tiers use K3s clusters with rolling updates and avoid this pattern entirely.

Trade-offs

VPC Origins remove the ALB and its per-hour cost. For per-customer distributions, this is significant: ~$16/month per ALB avoided, times the number of customers. The trade-off is the three constraints covered above, each requiring its own engineering solution.

DNS-01 challenges for TLS are slower than HTTP-01. DNS propagation takes 30-60 seconds; HTTP-01 validation is near-instant. The difference only matters during certificate renewal, which happens every 60 days and runs in the background. There's no user-visible impact.

SSE instead of WebSockets removes bidirectional communication. For GraphQL subscriptions (server pushes events to client), this doesn't matter — the data flows one way.

Instance replacement has the downtime cost described above. The alternative — managing the volume attachment in Terraform — would either require destroy_before_create (leaving the distribution with no valid origin during the gap, and terminating the old instance without a clean shutdown) or a circular dependency that Terraform can't resolve. Moving the logic in-band to the instance is the only option that preserves both create_before_destroy semantics and clean data handoff.

The combined effect: no ALB cost, no public IP, traffic over the AWS backbone, and three solved problems that each have clear boundaries and known trade-offs.

Series

This architecture is implemented in dether.net, a graph-native threat modeling platform. If you're interested in seeing these patterns applied to security architecture analysis, that's where they run in production.

CloudFront VPC Origins: what breaks and how to fix it

CloudFront VPC Origins: what breaks and how to fix it

The model

TLS without port 80

No WebSocket support

Hostile volume takeover

Trade-offs

Series

Related Insights

Designing for compromise: a multi-tenant SaaS architecture on AWS

Customer isolation from the infrastructure up