Automating isolation: the self-service deployment pipeline

Per-customer isolation creates a problem. Each customer gets their own subnet, Cognito pool, IAM roles, S3 buckets, and Terraform state — the first and second articles established why and traced the isolation through every layer. But those resources have to be created. The provisioning system (whatever builds and manages customer environments) is a component with access to multiple customers. If it's a single deployer with broad IAM permissions, or a shared pipeline that holds credentials for every customer's infrastructure, the provisioning system becomes the cross-tenant path the isolation model was designed to eliminate.

This article covers the management plane: the infrastructure that takes a customer from sign-up to a running workspace through self-service. It runs in a separate VPC from customer workloads. Its deployers are ephemeral Fargate tasks that assume a customer-scoped role, run Terraform, and exit — no persistent access, no stored credentials. Every stage of the pipeline creates resources through roles scoped to one customer, with state scoped to one customer.

The pipeline

A customer moves through four stages. Each transition requires a different verification, and each creates resources proportional to the customer's current commitment. A single DynamoDB record per customer implements the state machine. Every transition is an atomic conditional update: the record can only move from one expected status to the next, so concurrent or replayed requests that attempt the same transition are rejected.

diagram customer lifecycle

All four stages run through a single frontend and API. The customer interacts with a Vue.js single-page application served from S3 through CloudFront. Static files: HTML, CSS, JavaScript. The SPA calls an HTTP API Gateway, which routes every request to a Lambda function. The frontend handles registration, workspace-based login, and the management dashboard (billing, deployment, settings). It polls the API for status updates during provisioning and deployment. The frontend has no server-side state and no direct access to AWS resources — it can only call the same API endpoints available to any HTTPS client.

Every Lambda behind that API runs a single Go binary in a FROM scratch container: no shell, no OS libraries, no runtime beyond the binary itself. Go compiles to a statically linked binary, so the container image is the binary and nothing else. A compromised Lambda has no tooling to pivot with. Each Lambda has the minimum IAM permissions for its function. Public-facing Lambdas can read and write DynamoDB and send email through SES, but they cannot create customer resources. Resource creation is decoupled through SQS or Step Functions: a breached public endpoint can produce queue messages or start an execution, but has no direct path to Cognito, IAM, or Terraform.

Registration and identity verification

The customer fills a sign-up form with their email and workspace name. They submit and see a "check your email" prompt.

diagram registration and verification flow

The magic link in the verification email doesn't verify directly. It opens an HTML confirmation page. The customer clicks "Confirm Account" to proceed. This two-step design protects against corporate email scanners that pre-fetch links to check for malware. The initial request returns a read-only page; confirmation, triggered by the button, performs the actual verification: constant-time token hash comparison, expiry check, and an atomic DynamoDB status transition that prevents double-submission.

Behind the form, sign-up creates a DynamoDB record and sends the verification email. Nothing else. The registration Lambda can write to DynamoDB and send email through SES. It cannot create Cognito pools, allocate subnets, or touch any other AWS service. Tokens in the magic link are stored as SHA-256 hashes, not in plaintext.

After verification, SQS triggers a separate Lambda that creates the customer's Cognito pool. The registration endpoint's reach ends at the queue — it has no Cognito permissions and no path to invoke the provisioner directly. If the provisioner crashes mid-way and SQS retries, it finds the existing pool, skips creation, and completes the DynamoDB write. No orphan resources.

An unverified registration is a database record. A bot submitting 10,000 fake sign-ups creates 10,000 DynamoDB records (negligible cost) that auto-delete within a short TTL window. AWS resources are created only after a human verifies their email.

Once the pool is ready, Cognito sends a temporary password email. The customer logs in through the workspace-based login discovery flow, sets their password, and optionally enables MFA. Then they see the management dashboard.

Per-customer Cognito pools create a discovery problem: the console needs to know which pool to authenticate against before the user has authenticated. The user enters their workspace name, the frontend calls a public endpoint that returns that workspace's Cognito configuration, and redirects to the pool's hosted UI. The console never validates credentials or stores sessions — it resolves a workspace name to a Cognito endpoint and delegates authentication entirely to the per-customer pool.

diagram workspace login discovery

Payment

After logging in, the customer clicks "Subscribe" on the dashboard, selects a plan, and completes payment through Stripe's hosted checkout. Stripe redirects them back to the dashboard. Within a few seconds, the dashboard updates to "Payment confirmed" and they receive a provisioning email with a link to set up their workspace.

diagram payment flow

Payment has two separate trust models. The checkout is JWT-authenticated: The checkout endpoint verifies the customer's JWT, checks that the workspace is eligible (status = confirmed), creates a Stripe Checkout Session, and stores the session_id in DynamoDB. The customer is redirected to Stripe's hosted checkout. Payment information never touches the platform.

After payment, Stripe fires a webhook. The webhook Lambda verifies the Stripe signature and pushes to SQS. It has zero DynamoDB access — its reach ends at the queue, same as the registration endpoint. The Stripe API key and webhook signing secret live in Secrets Manager, not in environment variables or Terraform state, so they're rotatable without redeployment.

The payment processor reads from SQS and cross-references the Stripe session_id against what was stored in DynamoDB at checkout time. This prevents a forged webhook from activating an arbitrary workspace. The processor updates the customer status, generates a provisioning token (stored as a hash with a time-limited expiry), and sends the provisioning email. It cannot allocate subnets, create IAM roles, or start Terraform.

Until payment is confirmed, the customer has a DynamoDB record and a Cognito pool. No infrastructure beyond identity.

Account provisioning

The customer clicks the link in the provisioning email. The management console shows a progress screen while the platform creates the foundational resources (subnet, DNS, IAM roles, S3 bucket). When it finishes, the dashboard updates to show a "Deploy Infrastructure" button. If something fails, the dashboard shows an error with a retry option.

For the customer, this is one click and a wait. Behind that click, the platform requires three independent verifications before creating any infrastructure.

The first gate checks identity: a valid JWT from the customer's Cognito pool. The second checks payment: an active status in DynamoDB, confirming Stripe payment through the server-to-server path. The third re-verifies email access: a valid provisioning token from the magic link email, proving the customer still controls the address at provisioning time. The token comparison is constant-time — an attacker measuring response time differences shouldn't be able to infer whether a partial hash matched, which would narrow the search space for forging a valid token. The token is time-bound and single-use.

All three must pass. A valid JWT with an inactive payment status is denied. Confirmed payment with an expired token is denied. Each check is independent: JWT signature against Cognito's JWKS endpoint, payment status from DynamoDB, token hash from a separate DynamoDB attribute.

Registration and provisioning can be days or weeks apart. The third gate re-verifies email access at the moment infrastructure is about to be created, not at the moment the user first signed up. The provisioning email doubles as a payment confirmation: the customer receives an email confirming their payment with a link to proceed. If the token expires or the email is lost, the customer can request a resend from the dashboard. Resending generates a new token and invalidates the old one.

diagram account provisioning

When all three gates pass, the account provisioning Lambda atomically transitions the status and starts a Step Functions execution. Unlike registration and payment, there is no SQS intermediary — Step Functions already provides retry and error handling. The Lambda cannot run Terraform itself. Terraform runs in a Fargate task. It allocates a /28 subnet (through an atomic DynamoDB counter, never reused even after deprovisioning), creates split-horizon DNS zones, per-customer IAM roles, and an internal S3 bucket.

Each Fargate deployer has a per-customer task role with exactly one permission: sts:AssumeRole on this customer's provisioner role. The trust policy requires the customer's workspace ID as an external ID — a task targeting the wrong role ARN fails the external ID check before gaining any permissions. The assumed role is capped by a permission boundary that limits the maximum allowed actions to this customer's resources, regardless of inline policies.

The entrypoint assumes the provisioner role at the shell level before Terraform runs. Every AWS API call executes under customer-scoped credentials. The only role reference in the task role is this customer's provisioner — no path to other customers' roles exists.

A new customer is a new set of module invocations with a different customer ID. Same modules, different parameters. The account module takes the customer's workspace ID as a variable, and that variable propagates into every resource name and every IAM policy:

# The workspace variable flows into resource names...
resource "aws_s3_bucket" "internal" {
  bucket = "${var.prefix}-${var.customer_id}-${var.bucket_suffix}-internal"
}

# ...and into every IAM policy as an explicit ARN
statement {
  effect    = "Allow"
  actions   = ["s3:GetObject", "s3:PutObject"]
  resources = ["arn:aws:s3:::${var.prefix}-${var.customer_id}-${var.bucket_suffix}-internal/*"]
}

No wildcards. The variable substitution is what turns a generic module into a customer-scoped set of policies. The bucket suffix is a random string generated at provisioning time, because predictable bucket names let an attacker pre-create a bucket and intercept state. Running the module with different customer IDs and suffixes produces policies that are structurally identical but point at completely different resources. They cannot address each other.

Each customer's Terraform state lives in their own internal S3 bucket. The account module's outputs (subnet ID, zone IDs, role ARNs, bucket name) are written to S3 in the customer's own bucket rather than passed through Terraform remote state. The infrastructure module reads them from S3. This means the infrastructure deployer doesn't need read access to the account module's state file, another boundary that limits what a compromised deployer can reach.

Failures set the status to provisioning_failed and store the error. The customer retries from the dashboard. The triple gate accepts both active and provisioning_failed status, so retries don't require re-verification. Terraform resumes from the existing state: resources already created are kept, the remaining ones are added.

Infrastructure deployment

The customer clicks "Deploy Infrastructure" on the management console. The dashboard shows deployment progress: Terraform running, frontend syncing, DNS propagating. When it finishes, the customer gets their workspace URL and a confirmation email. If the deployment fails, the dashboard shows the error and a retry button.

The deploy endpoint requires a valid JWT and a status of provisioned (or deployed/deployment_failed for re-deploy and retry). No triple gate — the customer proved identity, payment, and email access during account provisioning. Before starting, the endpoint validates that all account provisioning outputs exist: subnet ID, DNS zone IDs, Cognito pool and client IDs, internal bucket name. If account provisioning left incomplete references, deployment is rejected rather than failing mid-Terraform.

diagram deployment pipeline

Step Functions orchestrates the deployment. Terraform runs in a Fargate task through the same role chain: per-customer deployer task role → customer provisioner role → customer resources. Unlike the Lambda functions, the deployer container is not FROM scratch — it includes AWS CLI and jq. Before Terraform runs, the entrypoint assumes the customer's provisioner role at the shell level by parsing STS JSON credentials, and reads the image versions file from S3 (described under Version management).

Terraform initializes against the customer's state file in their internal S3 bucket and applies the infrastructure module. The module builds on the account module's foundation: EC2 (Fedora CoreOS, configured through Ignition), EBS, CloudFront (VPC Origin to the private EC2 instance), frontend bucket, certificates, DNS records, management API. It takes the same workspace ID variable and reads the account module's outputs from S3.

The infrastructure module also updates the customer's Cognito app client (created during registration) to add workspace-specific callback URLs, so the customer can authenticate directly at their workspace URL in addition to the management console.

Where resource ARNs aren't known at creation time (EC2 instance IDs, for example), the modules use tag-based IAM conditions (aws:RequestTag and aws:ResourceTag) to provide equivalent scoping. Terraform tags every resource with the customer's workspace ID, and the IAM policy requires the tag to match before allowing any action.

After Terraform finishes, a separate Lambda syncs frontend static files to the customer's bucket. This doesn't run in the Fargate container. The separation is deliberate: the infra_provisioner role can create and configure the frontend bucket but has no PutObject permission. Frontend content can only reach the bucket through the SyncFrontend Lambda, which assumes a per-customer sync role scoped to one S3 bucket and one CloudFront distribution. The Lambda's own execution role has no direct write permissions — all writes go through the assumed role.

Before sending the "your workspace is ready" email, a health check pings the workspace's health endpoint at regular intervals with retries. If the endpoint is still unreachable after the retry window (EC2 still booting, CloudFront still propagating), the deployment completes anyway. The check delays the success email until the workspace is verified accessible, not gates the deployment itself.

The final step reads Terraform outputs from S3 and writes infrastructure references to DynamoDB (CloudFront distribution ID, customer URL, instance ID), sets the status to deployed, and sends a notification email. Each step updates a deployment phase in DynamoDB (terraform_running → syncing_frontend → health_check → finalizing → complete), and the dashboard polls this field to show real-time progress. Failures set deployment_failed, fire an SNS alert, and email the customer. Retries resume from saved Terraform state.

Operational details

Version management

Container image versions are managed through S3 files rather than DynamoDB, specifically to avoid granting infrastructure roles database access. Two files control what gets deployed:

Platform default: s3://{prefix}-provisioning-outputs/defaults/image-versions.json
Customer override: s3://{prefix}-{workspace}-{suffix}-internal/config/image-versions.json

The deployer checks for a customer-specific override first, then falls back to the platform default. This enables staged rollouts: update one customer's override file to test a new version, verify it works in production, then update the platform default to roll it out to everyone.

Re-deployment uses the same endpoint and the same pipeline. Terraform detects changes and applies them. If nothing has changed, it produces a no-op. There are two update speeds: container image pulls for application changes, full instance replacement for OS-level changes. Both go through this pipeline with the same role assumption, state isolation, and per-customer scope.

Lifecycle management

Abandoned registrations clean themselves up through DynamoDB TTL: short windows for unverified records and stuck provisioning, longer for verified but unpaid accounts. When DynamoDB deletes a record, a stream event triggers a cleanup Lambda that reads the old image and deletes associated resources (a Cognito pool for a verified-but-unpaid customer, for example).

Cleanup follows the same permission scoping as the rest of the pipeline. The TTL cleanup Lambda can delete Cognito pools and nothing else. For provisioned or deployed customers, teardown runs terraform destroy through a Fargate task with the same per-customer role chain: deployer task role → customer provisioner role → customer resources. The destroy runs against one customer's state file with one customer's credentials. It cannot affect another customer's infrastructure.

A scheduled CloudWatch rule detects stuck deployments (workspaces in the deploying state longer than a timeout threshold) and transitions them to deployment_failed.

Subscription cancellation arrives through Stripe webhooks. If the customer is mid-deployment when the cancellation arrives, the deployment completes before the status transitions to suspended. Infrastructure is destroyed after a configurable grace period.

Trade-offs

The management plane runs on serverless and ephemeral infrastructure. Lambda functions, SQS queues, Step Functions, DynamoDB, SES, and Fargate tasks add minimal monthly platform overhead, with negligible per-run costs for provisioning and deployment. The entire self-service pipeline — multi-gate verification, per-customer role isolation, automated cleanup — costs less than a single small reserved instance running a traditional CI/CD server.

A simpler pipeline is possible. A single CI/CD system triggered by webhooks, with a deployer role that can assume any customer's provisioner role, would be less infrastructure to build. The trade-off is that a compromised pipeline, or a leaked credential, has access to every customer's infrastructure at once. The multi-gate, per-customer-role design accepts more pipeline complexity for the guarantee that no single component in the provisioning chain can touch more than one customer.

The queue-based registration design trades latency for resilience. Creating the Cognito pool synchronously would be faster for the user. The asynchronous design means a few seconds of polling after email verification. But the registration endpoint has negligible cost under abuse, and the system self-heals through TTL-based cleanup without operator intervention.

The customer goes through three separate interactions before reaching a running workspace: email verification, payment, and provisioning with a second email confirmation. Most platforms collapse this into sign-up-and-deploy. The extra steps exist because each one gates a different level of resource creation, and the target audience — security teams building threat models — is more likely to trust a platform that verifies before provisioning than one that hands out infrastructure on a credit card swipe.

Series

This architecture is implemented in dether.net, a graph-native threat modeling platform. If you're interested in seeing these patterns applied to security architecture analysis, that's where they run in production.

Automating isolation: the self-service deployment pipeline

Automating isolation: the self-service deployment pipeline

The pipeline

Registration and identity verification

Payment

Account provisioning

Infrastructure deployment

Operational details

Version management

Lifecycle management

Trade-offs

Series

Related Insights

Designing for compromise: a multi-tenant SaaS architecture on AWS

Customer isolation from the infrastructure up