Compare commits

...

32 Commits

Author SHA1 Message Date
Snake Game Developer
90e32ffd60 Support image-overrides in spec for testing
Spec can override container images:
  image-overrides:
    dumpster-kubo: ghcr.io/.../dumpster-kubo:test-tag

Merged with CLI overrides (CLI wins). Enables testing with
GHCR-pushed test tags without modifying compose files.

Also reverts the image-pull-policy spec key (not needed —
the fix is to use proper GHCR tags, not IfNotPresent).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 01:02:23 +00:00
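The spec/CLI merge described above reduces to a plain dict merge where CLI entries win. A minimal sketch (the helper name is illustrative, not the actual laconic-so function):

```python
def merge_image_overrides(spec_overrides, cli_overrides):
    """Merge image overrides from the spec with --image CLI flags.

    Hypothetical helper: spec-level entries form the base and
    CLI-provided entries win on conflict, per the commit message.
    """
    merged = dict(spec_overrides or {})
    merged.update(cli_overrides or {})
    return merged
```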
Snake Game Developer
1052a1d4e7 Support image-pull-policy in spec (default: Always)
Testing specs can set image-pull-policy: IfNotPresent so kind-loaded
local images are used instead of pulling from the registry. Production
specs omit the key and get the default Always behavior.

Root cause: with Always, k8s pulled the GHCR kubo image (with baked
R2 endpoint) instead of the locally-built image (with https://s3:443),
causing kubo to connect to R2 directly and get Unauthorized.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 20:17:06 +00:00
Snake Game Developer
f93541f7db Fix CA cert mounting: subPath for Go, expanduser for configmaps
- CA certs mounted via subPath into /etc/ssl/certs/ so Go's x509
  picks them up (directory mount replaces the entire dir)
- get_configmaps() now expands ~ in paths via os.path.expanduser()
- Both changes discovered during testing with mkcert + MinIO

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 19:27:14 +00:00
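The subPath detail matters because a plain directory mount at /etc/ssl/certs/ would hide the distro CA bundle. A sketch of building the per-file mounts (dict shapes mirror the k8s volumeMount fields; names are illustrative):

```python
import os

def ca_cert_volume_mounts(cert_paths):
    """Mount each CA file individually via subPath into /etc/ssl/certs/.

    A whole-directory mount would replace the distro bundle, so each
    cert gets its own subPath entry; ~ is expanded as in get_configmaps().
    """
    mounts = []
    for path in cert_paths:
        name = os.path.basename(os.path.expanduser(path))
        mounts.append({
            "name": "ca-certs",
            "mountPath": f"/etc/ssl/certs/{name}",
            "subPath": name,
        })
    return mounts
```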
Snake Game Developer
713a81c245 Add external-services and ca-certificates spec keys
New spec.yml features for routing external service dependencies:

external-services:
  s3:
    host: example.com      # ExternalName Service (production)
    port: 443
  # ...or, in testing mode, the same key takes the selector form:
  s3:
    selector: {app: mock}  # headless Service + Endpoints (testing)
    namespace: mock-ns
    port: 443

ca-certificates:
  - ~/.local/share/mkcert/rootCA.pem  # testing only

laconic-so creates the appropriate k8s Service type per mode:
- host mode: ExternalName (DNS CNAME to external provider)
- selector mode: headless Service + Endpoints with pod IPs
  discovered from the target namespace at deploy time

ca-certificates mounts CA files into all containers at
/etc/ssl/certs/ and sets NODE_EXTRA_CA_CERTS for Node/Bun.

Also includes the previously committed PV Released state fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 15:25:47 +00:00
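The per-mode Service selection above can be sketched as a small dispatch on which key the entry carries (function name and return values are illustrative):

```python
def external_service_mode(svc):
    """Pick the k8s Service type for an external-services entry.

    Sketch of the mode selection described above; key names follow
    the spec example, everything else is an assumption.
    """
    if "host" in svc:
        return "ExternalName"        # DNS CNAME to the external provider
    if "selector" in svc:
        return "headless+endpoints"  # pod IPs discovered at deploy time
    raise ValueError("external-service entry needs 'host' or 'selector'")
```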
Snake Game Developer
98ff221a21 Fix PV rebinding after deployment stop/start cycle
deployment stop deletes the namespace (and PVCs) but preserves PVs
by default. On the next deployment start, PVs are in Released state
with a stale claimRef pointing at the deleted PVC. New PVCs cannot
bind to Released PVs, so pods get stuck in Pending.

Clear the claimRef on any Released PV during _create_volume_data()
so the PV returns to Available and can accept new PVC bindings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 07:47:23 +00:00
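The claimRef fix is a simple state transition on each Released PV. A dict-based sketch (the real code does this via the kubernetes client inside _create_volume_data()):

```python
def clear_released_claimrefs(pvs):
    """Reset claimRef on Released PVs so they return to Available.

    Operates on plain dicts shaped like V1PersistentVolume objects;
    a cleared claimRef lets new PVCs bind to the volume again.
    """
    for pv in pvs:
        if pv.get("status", {}).get("phase") == "Released":
            pv["spec"]["claimRef"] = None
    return pvs
```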
A. F. Dudley
7141dc7637 file so-p3p: laconic-so should manage Caddy ingress image lifecycle
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-21 00:30:46 +00:00
A. F. Dudley
2555df06b5 fix: use patched Caddy ingress image with ACME storage fix
Switch from caddy/ingress:latest to ghcr.io/laconicnetwork/caddy-ingress:latest
which has the List()/Stat() fix for secret_store. This fixes multi-domain
ACME provisioning deadlock where the second domain's cert request fails
because List() returns mangled keys and Stat() returns wrong IsTerminal.

Source: LaconicNetwork/ingress@109d69a (fix/acme-account-reuse branch)

Fixes: so-o2o (partially — etcd backup investigation still needed)
Closes: ds-v22v (Caddy sequential provisioning no longer needed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 23:31:39 +00:00
A. F. Dudley
24cf22fea5 File pebbles: mount propagation merge + etcd cert backup broken
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 23:01:20 +00:00
A. F. Dudley
8d03083d0d feat: add kind-mount-root for unified Kind extraMount
When kind-mount-root is set in spec.yml, emit a single extraMount
mapping the root to /mnt instead of per-volume mounts. This allows
adding new volumes without recreating the Kind cluster.

Volumes whose host path is under the root skip individual extraMounts
and their PV paths resolve to /mnt/{relative_path}. Volumes outside
the root keep individual extraMounts as before.

Cherry-picked from branch enya-ac868cc4-kind-mount-propagation-fix
(commits b6d6ad81, 929bdab8) and adapted for current main.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 21:28:40 +00:00
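The path rule above (under the root: resolve to /mnt/<relative>, no individual extraMount; outside: keep the old behavior) can be sketched as:

```python
from pathlib import Path

def resolve_volume_mount(host_path, mount_root):
    """Decide how a volume maps into the Kind node under kind-mount-root.

    Returns (node_path, needs_extra_mount). Names are illustrative,
    not the actual laconic-so function.
    """
    host = Path(host_path)
    root = Path(mount_root)
    try:
        rel = host.relative_to(root)
        return str(Path("/mnt") / rel), False  # covered by the single root extraMount
    except ValueError:
        return str(host), True                 # individual extraMount as before
```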
A. F. Dudley
9109cfb7a1 feat: add token-file option for image-pull-secret registry auth
Adds token-file key to image-pull-secret spec config. Reads the
registry token from a file on disk instead of requiring an environment
variable. File path supports ~ expansion. Falls back to token-env
if token-file is not set or file doesn't exist.

This lets operators store the GHCR token in ~/.credentials/ alongside
other secrets, removing the need for ansible to pass REGISTRY_TOKEN
as an env var.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 19:30:44 +00:00
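The file-then-env fallback described above, as a minimal sketch (function name is illustrative):

```python
import os

def read_registry_token(token_file=None, token_env=None):
    """Resolve the image-pull-secret token: file first, env fallback.

    token-file supports ~ expansion; if it is unset or the file does
    not exist, fall back to the token-env environment variable.
    """
    if token_file:
        path = os.path.expanduser(token_file)
        if os.path.exists(path):
            with open(path) as f:
                return f.read().strip()
    if token_env:
        return os.environ.get(token_env)
    return None
```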
A. F. Dudley
61afeb255c fix: keep cwd at repo root through entire restart, revert try/except
The stack path in spec.yml is relative — both create_operation and
up_operation need cwd at the repo root for stack_is_external() to
resolve it. Move os.chdir(prev_cwd) to after up_operation completes
instead of between the two operations.

Reverts the SystemExit catch in call_stack_deploy_start — the root
cause was cwd, not the hook.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 15:54:46 +00:00
A. F. Dudley
32f6e57b70 fix: ConfigMap volumes don't force Recreate strategy + resilient hooks
Two fixes for multi-deployment:

1. _pod_has_pvcs now excludes ConfigMap volumes from PVC detection.
   Pods with only ConfigMap volumes (like maintenance) correctly get
   RollingUpdate strategy instead of Recreate.

2. call_stack_deploy_start catches SystemExit when stack path doesn't
   resolve from cwd (common during restart). Most stacks don't have
   deploy hooks, so this is non-fatal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 15:51:58 +00:00
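Fix 1 boils down to: a pod counts as "has PVCs" only if it uses at least one volume that is not a ConfigMap. A sketch of the strategy selection (names illustrative):

```python
def update_strategy(pod_volume_names, configmap_volume_names):
    """Choose the Deployment update strategy for a pod.

    ConfigMap volumes no longer count as PVCs, so ConfigMap-only pods
    (like the maintenance pod) get RollingUpdate; pods with real PVCs
    keep Recreate.
    """
    has_pvc = any(v not in configmap_volume_names for v in pod_volume_names)
    return "Recreate" if has_pvc else "RollingUpdate"
```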
A. F. Dudley
6923e1c23b refactor: extract methods from K8sDeployer.up to fix C901 complexity
Split up() into _setup_cluster(), _create_ingress(), _create_nodeports().
Reduces cyclomatic complexity below the flake8 threshold.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 15:20:50 +00:00
A. F. Dudley
5b8303f8f9 fix: resolve stack path from repo root + update deploy test
- chdir to git repo root before create_operation so relative stack
  paths in spec.yml resolve correctly via stack_is_external()
- Update deploy test: config.env is now regenerated from spec on
  --update (matching 72aabe7d behavior), verify backup exists

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 15:14:47 +00:00
A. F. Dudley
0ac886bf95 fix: chdir to repo root before create_operation in restart
The spec's "stack:" value is a relative path that must resolve from
the repo root. stack_is_external() checks Path(stack).exists() from
cwd, which fails when cwd isn't the repo root.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 15:06:38 +00:00
A. F. Dudley
2484abfcce fix: use git rev-parse for repo root in restart command
The repo_root calculation assumed stack paths are always 4 levels deep
(stack_orchestrator/data/stacks/name). External stacks with different
nesting (e.g. stack-orchestrator/stacks/name = 3 levels) got the wrong
root, causing --spec-file resolution to fail.

Use git rev-parse --show-toplevel instead.

Fixes: so-k1k

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 15:03:24 +00:00
A. F. Dudley
967936e524 Multi-deployment: one k8s Deployment per pod in stack.yml
Each pod entry in stack.yml now creates its own k8s Deployment with
independent lifecycle and update strategy. Pods with PVCs get Recreate,
pods without get RollingUpdate. This enables maintenance services that
survive main pod restarts.

- cluster_info: get_deployments() builds per-pod Deployments, Services
- cluster_info: Ingress routes to correct per-pod Service
- deploy_k8s: _create_deployment() iterates all Deployments/Services
- deployment: restart swaps Ingress to maintenance service during Recreate
- spec: add maintenance-service key

Single-pod stacks are backward compatible (same resource names).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 01:40:45 +00:00
A. F. Dudley
6ace024cd3 fix: use replace instead of patch for k8s resource updates
Strategic merge patch preserves fields not present in the patch body.
This means removed volumes, ports, and env vars persist in the running
Deployment after a restart. Replace sends the complete spec built from
the current compose files — removed fields are actually deleted.

Affects Deployment, Service, Ingress, and NodePort updates. Service
replace preserves clusterIP (immutable field) by reading it from the
existing resource before replacing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-19 03:44:57 +00:00
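The clusterIP carry-over described above can be sketched with plain dicts (the real code reads the live Service via the kubernetes client before calling replace):

```python
def prepare_service_replace(existing, desired):
    """Carry the immutable clusterIP into the replacement Service spec.

    replace would otherwise be rejected by the API server, since
    clusterIP cannot change on an existing Service. Returns a copy;
    the desired spec is not mutated.
    """
    out = dict(desired)
    out["spec"] = dict(desired.get("spec", {}))
    out["spec"]["clusterIP"] = existing["spec"]["clusterIP"]
    return out
```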
A. F. Dudley
ea610bb8d6 Merge branch 'cv-c3c-image-flag-for-restart'
# Conflicts:
#	stack_orchestrator/deploy/k8s/deploy_k8s.py
2026-03-18 23:04:55 +00:00
A. F. Dudley
4b1fc27a1e cv-c3c: add --image flag to deployment restart command
Allows callers to override container images during restart, e.g.:
  laconic-so deployment restart --image backend=ghcr.io/org/app:sha123

The override is applied to the k8s Deployment spec before
create-or-patch. Docker/compose deployers accept the parameter
but ignore it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 22:42:56 +00:00
A. F. Dudley
25e5ff09d9 so-m3m: add credentials-files spec key for on-disk credential injection
_write_config_file() now reads each file listed under the credentials-files
top-level spec key and appends its contents to config.env after config vars.
Paths support ~ expansion. Missing files fail hard with sys.exit(1).

Also adds get_credentials_files() to Spec class following the same pattern
as get_image_registry_config().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 21:55:28 +00:00
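The append behavior above, as a minimal sketch of what _write_config_file() does with the credentials-files list (function name is illustrative; the real code writes to config.env on disk):

```python
import os

def append_credentials_files(config_text, credentials_files):
    """Append each credentials file's contents after the config vars.

    Paths support ~ expansion; a missing file is a hard error, modeled
    here as SystemExit(1) to match the described sys.exit(1).
    """
    parts = [config_text]
    for path in credentials_files:
        expanded = os.path.expanduser(path)
        if not os.path.exists(expanded):
            raise SystemExit(1)  # missing files fail hard
        with open(expanded) as f:
            parts.append(f.read())
    return "".join(parts)
```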
A. F. Dudley
0e4ecc3602 refactor: rename registry-credentials to image-pull-secret in spec
The spec key `registry-credentials` was ambiguous — could mean container
registry auth or Laconic registry config. Rename to `image-pull-secret`
which matches the k8s secret name it creates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 21:38:31 +00:00
A. F. Dudley
dc15c0f4a5 feat: auto-generate readiness probes from http-proxy routes
Containers referenced in spec.yml http-proxy routes now get TCP
readiness probes on the proxied port. This tells k8s when a container
is actually ready to serve traffic.

Without readiness probes, k8s considers pods ready immediately after
start, which means:
- Rolling updates cut over before the app is listening
- Broken containers look "ready" and receive traffic (502s)
- kubectl rollout undo has nothing to roll back to

The probes use TCP socket checks (not HTTP) to work with any protocol.
Initial delay 5s, check every 10s, fail after 3 consecutive failures.

Closes so-l2l part C.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 19:43:09 +00:00
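The probe settings named in the message (TCP socket check, 5s initial delay, 10s period, 3 failures) map directly onto the k8s probe fields. A dict-shaped sketch (the real code presumably builds a kubernetes client V1Probe object):

```python
def tcp_readiness_probe(port):
    """TCP readiness probe for a proxied container port.

    Values come from the commit message; the dict mirrors the
    Kubernetes probe schema.
    """
    return {
        "tcpSocket": {"port": port},
        "initialDelaySeconds": 5,
        "periodSeconds": 10,
        "failureThreshold": 3,
    }
```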
A. F. Dudley
2d11ca7bb0 feat: update-in-place deployments with rolling updates
Replace the destroy-and-recreate deployment model with in-place updates.

deploy_k8s.py: All resource creation (Deployment, Service, Ingress,
NodePort, ConfigMap) now uses create-or-update semantics. If a resource
already exists (409 Conflict), it patches instead of failing. For
Deployments, this triggers a k8s rolling update — old pods serve traffic
until new pods pass readiness checks.

deployment.py: restart() no longer calls down(). It just calls up()
which patches existing resources. No namespace deletion, no downtime
gap, no race conditions. k8s handles the rollout.

This gives:
- Zero-downtime deploys (old pods serve during rollout)
- Automatic rollback (if new pods fail readiness, rollout stalls)
- Manual rollback via kubectl rollout undo

Closes so-l2l (parts A and B).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 19:40:20 +00:00
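The create-or-update semantics above follow a standard try-create/catch-409/patch pattern. A self-contained sketch, with a stand-in exception class in place of kubernetes.client.exceptions.ApiException:

```python
class ApiException(Exception):
    """Stand-in for kubernetes.client.exceptions.ApiException."""
    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status

def create_or_update(create_fn, patch_fn, body):
    """Try to create a resource; on 409 Conflict patch it instead.

    For Deployments the patch path triggers a k8s rolling update;
    any other API error is re-raised.
    """
    try:
        return create_fn(body)
    except ApiException as e:
        if e.status == 409:
            return patch_fn(body)
        raise
```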
A. F. Dudley
ba39c991f1 fix: create imagePullSecret in deployment namespace, not default
create_registry_secret() hardcoded namespace="default" but deployments
now run in dedicated laconic-* namespaces. The secret was invisible
to pods in the deployment namespace, causing 401 on GHCR pulls.

Accept namespace as parameter, passed from deploy_k8s.py which knows
the correct namespace.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 19:08:52 +00:00
A. F. Dudley
0b3e5559d0 fix: wait for namespace termination in down() before returning
Reverts the label-based deletion approach — resources created by older
laconic-so lack labels, so label queries return empty results. Namespace
deletion is the only reliable cleanup.

Adds _wait_for_namespace_gone() so down() blocks until the namespace
is fully terminated. This prevents the race condition where up() tries
to create resources in a still-terminating namespace (403 Forbidden).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 18:49:38 +00:00
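The blocking wait described above is a poll loop with a deadline. A sketch of _wait_for_namespace_gone(), where namespace_exists stands in for a kubernetes read_namespace call and the timeout/interval defaults are assumptions:

```python
import time

def wait_for_namespace_gone(namespace_exists, timeout=120, interval=2):
    """Block until the namespace is fully terminated, or time out.

    Returns True once the namespace is gone, False if it is still
    terminating when the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not namespace_exists():
            return True
        time.sleep(interval)
    return False
```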
A. F. Dudley
ae2cea3410 fix: never delete namespace on deployment down
down() deleted the entire namespace when it wasn't explicitly set in
the spec. This causes a race condition on restart: up() tries to create
resources in a namespace that's still terminating, getting 403 Forbidden.

Always use _delete_resources_by_label() instead. The namespace is cheap
to keep and required for immediate up() after down(). This also matches
the shared-namespace behavior, making down() consistent regardless of
namespace configuration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 18:47:05 +00:00
A. F. Dudley
e298e7444f fix: add auto-generated header to config.env
config.env is regenerated from spec.yml on every deploy create and
restart, silently overwriting manual edits. Add a header comment
explaining this so operators know to edit spec.yml instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 18:24:27 +00:00
A. F. Dudley
e5a8ec5f06 fix: rename registry secret to image-pull-secret
The secret name `{app}-registry` is ambiguous — it could be a container
registry credential or a Laconic registry config. Rename to
`{app}-image-pull-secret` which clearly describes its purpose as a
Kubernetes imagePullSecret for private container registries.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 15:33:11 +00:00
A. F. Dudley
0bbb51067c fix: set imagePullPolicy=Always for kind deployments
Kind deployments used imagePullPolicy=None (defaults to IfNotPresent),
which means the kind node caches images by tag and never re-pulls from
the local registry. After a container rebuild + registry push, the pod
keeps using the stale cached image.

Set Always for all deployment types so k8s re-pulls on every pod
restart. With a local registry this adds negligible overhead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 17:44:35 +00:00
A. F. Dudley
72aabe7d9a fix: deploy create --update now syncs config.env from spec
The --update path excluded config.env from the safe_copy_tree, which
meant new config vars added to spec.yml were never written to
config.env. The XXX comment already flagged this as broken.

Remove config.env from exclude_patterns so --update regenerates it
from spec.yml like the non-update path does.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 08:20:45 +00:00
afd
8a7491d3e0 Support multiple http-proxy entries in a single deployment
Previously get_ingress() only used the first http-proxy entry,
silently ignoring additional hostnames. Now iterates over all
entries, creating an Ingress rule and TLS config per hostname.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 06:16:28 +00:00
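The fix above replaces "take the first entry" with a loop that emits one Ingress rule and one TLS host per entry. A sketch assuming the spec's `host-name` key (other shapes here are illustrative):

```python
def build_ingress_rules(http_proxy_entries):
    """One Ingress rule and one TLS host per http-proxy entry.

    Previously only the first entry was used; now every hostname
    gets a rule and TLS config.
    """
    rules, tls_hosts = [], []
    for entry in http_proxy_entries:
        host = entry["host-name"]
        rules.append({"host": host, "routes": entry.get("routes", [])})
        tls_hosts.append(host)
    return rules, tls_hosts
```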
17 changed files with 3551 additions and 369 deletions

.gitignore (vendored)

@@ -8,3 +8,4 @@ __pycache__
package
stack_orchestrator/data/build_tag.txt
/build
.worktrees

.pebbles/.gitignore (new file, vendored)

@@ -0,0 +1 @@
pebbles.db

.pebbles/config.json (new file)

@@ -0,0 +1 @@
{"project": "stack-orchestrator", "prefix": "so"}

.pebbles/events.jsonl (new file)

@@ -0,0 +1,10 @@
{"type": "create", "timestamp": "2026-03-18T14:45:07.038870Z", "issue_id": "so-a1a", "payload": {"title": "deploy create should support external credential injection", "type": "feature", "priority": "2", "description": "deploy create generates config.env but provides no mechanism to inject external credentials (API keys, tokens, etc.) at creation time. Operators must append to config.env after the fact, which mutates a build artifact. deploy create should accept --credentials-file or similar to include secrets in the generated config.env."}}
{"type": "create", "timestamp": "2026-03-18T14:45:07.038942Z", "issue_id": "so-b2b", "payload": {"title": "REGISTRY_TOKEN / imagePullSecret flow undocumented", "type": "bug", "priority": "2", "description": "create_registry_secret() exists in deployment_create.py and is called during up(), but REGISTRY_TOKEN is not documented in spec.yml or any user-facing docs. The restart command warns \"Registry token env var REGISTRY_TOKEN not set, skipping registry secret\" but doesn't explain how to set it. For GHCR private images, this is required and the flow from spec.yml -> config.env -> imagePullSecret needs documentation."}}
{"type": "create", "timestamp": "2026-03-18T19:10:00.000000Z", "issue_id": "so-k1k", "payload": {"title": "Stack path resolution differs between deploy create and deployment restart", "type": "bug", "priority": "2", "description": "deploy create resolves --stack as a relative path from cwd. deployment restart resolves --stack-path as absolute, then computes repo_root as 4 parents up (assuming stack_orchestrator/data/stacks/name structure). External stacks with different nesting depths (e.g. stack-orchestrator/stacks/name = 3 levels) get wrong repo_root, causing --spec-file resolution to fail. The two commands should use the same path resolution logic."}}
{"type": "create", "timestamp": "2026-03-18T19:25:00.000000Z", "issue_id": "so-l2l", "payload": {"title": "deployment restart should update in place, not delete/recreate", "type": "bug", "priority": "1", "description": "deployment restart deletes the entire namespace then recreates everything from scratch. This causes:\n\n1. **Downtime** — nothing serves traffic between delete and successful recreate\n2. **No rollback** — deleting the namespace destroys ReplicaSet revision history\n3. **Race conditions** — namespace may still be terminating when up() tries to create\n4. **Cascading failures** — if ANY container fails to start, the entire site is down with no fallback\n\nFix: three changes needed.\n\n**A. up() should create-or-update, not just create.** Use patch/apply semantics for Deployments, Services, Ingresses. When the pod spec changes (new env vars, new image), k8s creates a new ReplicaSet, scales it up, waits for readiness probes, then scales the old one down. Old pods serve traffic until new pods are healthy.\n\n**B. down() should never delete the namespace on restart.** Only on explicit teardown. The namespace owns the revision history. Current code: _delete_namespace() on every down(). Should: delete individual resources by label for teardown, do nothing for restart (let update-in-place handle it).\n\n**C. All containers need readiness probes.** Without them k8s considers pods ready immediately, defeating rolling update safety. laconic-so should generate readiness probes from the http-proxy routes in spec.yml (if a container has an http route, probe that port).\n\nWith these changes, k8s native rolling updates provide zero-downtime deploys and automatic rollback (if new pods fail readiness, rollout stalls, old pods keep serving).\n\nSource files:\n- deploy_k8s.py: up(), down(), _create_deployment(), _delete_namespace()\n- cluster_info.py: pod spec generation (needs readiness probes)\n- deployment.py: restart() orchestration"}}
{"type": "create", "timestamp": "2026-03-18T20:15:03.000000Z", "issue_id": "so-m3m", "payload": {"title": "Add credentials-files spec key for on-disk credential injection", "type": "feature", "priority": "1", "description": "deployment restart regenerates config.env from spec.yml, wiping credentials that were appended from on-disk files (e.g. ~/.credentials/*.env). Operators must append credentials after deploy create, which is fragile and breaks on restart.\n\nFix: New top-level spec key credentials-files. _write_config_file() reads each file and appends its contents to config.env after writing config vars. Files are read at deploy time from the deployment host.\n\nSpec syntax:\n credentials-files:\n - ~/.credentials/dumpster-secrets.env\n - ~/.credentials/dumpster-r2.env\n\nFiles:\n- deploy/spec.py: add get_credentials_files() returning list of paths\n- deploy/deployment_create.py: in _write_config_file(), after writing config vars, read and append each credentials file (expand ~ to home dir)\n\nAlso update dumpster-stack spec.yml to use the new key and remove the ansible credential append workaround from woodburn_deployer (group_vars/all.yml credentials_env_files, stack_deploy role append tasks, restart_dumpster.yml credential steps). Those cleanups are in the woodburn_deployer repo."}}
{"type":"status_update","timestamp":"2026-03-18T21:54:12.59148256Z","issue_id":"so-m3m","payload":{"status":"in_progress"}}
{"type":"close","timestamp":"2026-03-18T21:55:31.6035544Z","issue_id":"so-m3m","payload":{}}
{"type": "create", "timestamp": "2026-03-20T23:05:00.000000Z", "issue_id": "so-n1n", "payload": {"title": "Merge kind-mount-propagation branch — HostToContainer propagation for extraMounts", "type": "feature", "priority": "2", "description": "The kind-mount-root feature was cherry-picked to main (commit 8d03083d) but the mount propagation fix (commit 929bdab8 on branch enya-ac868cc4-kind-mount-propagation-fix) adds HostToContainer propagation so host submounts propagate into the Kind node. This is needed for ZFS child datasets and tmpfs mounts under the root. Cherry-pick 929bdab8 to main."}}
{"type": "create", "timestamp": "2026-03-20T23:05:00.000000Z", "issue_id": "so-o2o", "payload": {"title": "etcd cert backup not persisting across cluster deletion", "type": "bug", "priority": "1", "description": "The extraMount for etcd at data/cluster-backups/<id>/etcd is configured but after cluster deletion the directory is empty. Caddy TLS certificates stored in etcd are lost. Either etcd isn't writing to the host mount, or the cleanup code is deleting the backup. Investigate _clean_etcd_keeping_certs in helpers.py."}}
{"type": "create", "timestamp": "2026-03-21T00:20:00.000000Z", "issue_id": "so-p3p", "payload": {"title": "laconic-so should manage Caddy ingress image lifecycle", "type": "feature", "priority": "2", "description": "The Caddy ingress controller image is hardcoded in ingress-caddy-kind-deploy.yaml. There's no mechanism to update it without manual kubectl commands or cluster recreation. laconic-so should: 1) Allow spec.yml to specify a custom Caddy image, 2) Support updating the Caddy image as part of deployment restart, 3) Set strategy: Recreate on the Caddy Deployment (hostPort pods can't do RollingUpdate). This would let cryovial or similar tooling trigger Caddy updates through the normal deployment pipeline."}}

@@ -46,3 +46,6 @@ runtime_class_key = "runtime-class"
high_memlock_runtime = "high-memlock"
high_memlock_spec_filename = "high-memlock-spec.json"
acme_email_key = "acme-email"
kind_mount_root_key = "kind-mount-root"
external_services_key = "external-services"
ca_certificates_key = "ca-certificates"

@@ -186,8 +186,8 @@ spec:
         operator: Equal
       containers:
       - name: caddy-ingress-controller
-        image: caddy/ingress:latest
-        imagePullPolicy: IfNotPresent
+        image: ghcr.io/laconicnetwork/caddy-ingress:latest
+        imagePullPolicy: Always
         ports:
         - name: http
           containerPort: 80

@@ -48,7 +48,7 @@ class DockerDeployer(Deployer):
         self.compose_project_name = compose_project_name
         self.compose_env_file = compose_env_file

-    def up(self, detach, skip_cluster_management, services):
+    def up(self, detach, skip_cluster_management, services, image_overrides=None):
         if not opts.o.dry_run:
             try:
                 return self.docker.compose.up(detach=detach, services=services)

@@ -137,7 +137,11 @@ def create_deploy_context(
 def up_operation(
-    ctx, services_list, stay_attached=False, skip_cluster_management=False
+    ctx,
+    services_list,
+    stay_attached=False,
+    skip_cluster_management=False,
+    image_overrides=None,
 ):
     global_context = ctx.parent.parent.obj
     deploy_context = ctx.obj
@@ -156,6 +160,7 @@ def up_operation(
         detach=not stay_attached,
         skip_cluster_management=skip_cluster_management,
         services=services_list,
+        image_overrides=image_overrides,
     )
     for post_start_command in cluster_context.post_start_commands:
         _run_command(global_context, cluster_context.cluster, post_start_command)

@@ -20,7 +20,7 @@ from typing import Optional
 class Deployer(ABC):
     @abstractmethod
-    def up(self, detach, skip_cluster_management, services):
+    def up(self, detach, skip_cluster_management, services, image_overrides=None):
         pass

     @abstractmethod

@@ -17,7 +17,7 @@ import click
from pathlib import Path
import subprocess
import sys
import time
from stack_orchestrator import constants
from stack_orchestrator.deploy.images import push_images_operation
from stack_orchestrator.deploy.deploy import (
@@ -248,8 +248,13 @@ def run_job(ctx, job_name, helm_release):
"--expected-ip",
help="Expected IP for DNS verification (if different from egress)",
)
@click.option(
"--image",
multiple=True,
help="Override container image: container=image",
)
@click.pass_context
def restart(ctx, stack_path, spec_file, config_file, force, expected_ip):
def restart(ctx, stack_path, spec_file, config_file, force, expected_ip, image):
"""Pull latest code and restart deployment using git-tracked spec.
GitOps workflow:
@@ -276,6 +281,17 @@ def restart(ctx, stack_path, spec_file, config_file, force, expected_ip):
deployment_context: DeploymentContext = ctx.obj
# Parse --image flags into a dict of container_name -> image
image_overrides = {}
for entry in image:
if "=" not in entry:
raise click.BadParameter(
f"Invalid --image format '{entry}', expected container=image",
param_hint="'--image'",
)
container_name, image_ref = entry.split("=", 1)
image_overrides[container_name] = image_ref
# Get current spec info (before git pull)
current_spec = deployment_context.spec
current_http_proxy = current_spec.get_http_proxy()
@@ -322,9 +338,22 @@ def restart(ctx, stack_path, spec_file, config_file, force, expected_ip):
# Determine spec file location
# Priority: --spec-file argument > repo's deployment/spec.yml > deployment dir
# Stack path is like: repo/stack_orchestrator/data/stacks/stack-name
# So repo root is 4 parents up
repo_root = stack_source.parent.parent.parent.parent
# Find repo root via git rather than assuming a fixed directory depth.
git_root_result = subprocess.run(
["git", "rev-parse", "--show-toplevel"],
cwd=stack_source,
capture_output=True,
text=True,
)
if git_root_result.returncode == 0:
repo_root = Path(git_root_result.stdout.strip())
else:
# Fallback: walk up from stack_source looking for .git
repo_root = stack_source
while repo_root != repo_root.parent:
if (repo_root / ".git").exists():
break
repo_root = repo_root.parent
if spec_file:
# Spec file relative to repo root
spec_file_path = repo_root / spec_file
@@ -368,7 +397,14 @@ def restart(ctx, stack_path, spec_file, config_file, force, expected_ip):
print("\n[2/4] Hostname unchanged, skipping DNS verification")
# Step 3: Sync deployment directory with spec
# The spec's "stack:" value is often a relative path (e.g.
# "stack-orchestrator/stacks/dumpster") that must resolve from the
# repo root. Change cwd so stack_is_external() sees it correctly.
print("\n[3/4] Syncing deployment directory...")
import os
prev_cwd = os.getcwd()
os.chdir(repo_root)
deploy_ctx = make_deploy_context(ctx)
create_operation(
deployment_command_context=deploy_ctx,
@@ -378,28 +414,216 @@ def restart(ctx, stack_path, spec_file, config_file, force, expected_ip):
network_dir=None,
initial_peers=None,
)
# Reload deployment context with updated spec
deployment_context.init(deployment_context.deployment_dir)
ctx.obj = deployment_context
# Stop deployment
print("\n[4/4] Restarting deployment...")
# Apply updated deployment.
# If maintenance-service is configured, swap Ingress to maintenance
# backend during the Recreate window so users see a branded page
# instead of bare 502s.
print("\n[4/4] Applying deployment update...")
ctx.obj = make_deploy_context(ctx)
down_operation(
ctx, delete_volumes=False, extra_args_list=[], skip_cluster_management=True
)
# Brief pause to ensure clean shutdown
time.sleep(5)
# Check for maintenance service in the (reloaded) spec
maintenance_svc = deployment_context.spec.get_maintenance_service()
if maintenance_svc:
print(f"Maintenance service configured: {maintenance_svc}")
_restart_with_maintenance(
ctx, deployment_context, maintenance_svc, image_overrides
)
else:
up_operation(
ctx,
services_list=None,
stay_attached=False,
skip_cluster_management=True,
image_overrides=image_overrides or None,
)
# Start deployment
up_operation(
ctx, services_list=None, stay_attached=False, skip_cluster_management=True
)
# Restore cwd after both create_operation and up_operation have run.
# Both need the relative stack path to resolve from repo_root.
os.chdir(prev_cwd)
print("\n=== Restart Complete ===")
print("Deployment restarted with git-tracked configuration.")
print("Deployment updated via rolling update.")
if new_hostname and new_hostname != current_hostname:
print(f"\nNew hostname: {new_hostname}")
print("Caddy will automatically provision TLS certificate.")
def _restart_with_maintenance(
ctx, deployment_context, maintenance_svc, image_overrides
):
"""Restart with Ingress swap to maintenance service during Recreate.
Flow:
1. Deploy all pods (including maintenance pod) with up_operation
2. Patch Ingress: swap all route backends to maintenance service
3. Scale main (non-maintenance) Deployments to 0
4. Scale main Deployments back up (triggers Recreate with new spec)
5. Wait for readiness
6. Patch Ingress: restore original backends
This ensures the maintenance pod is already running before we touch
the Ingress, and the main pods get a clean Recreate.
"""
import time
from kubernetes.client.exceptions import ApiException
from stack_orchestrator.deploy.deploy import up_operation
# Step 1: Apply the full deployment (creates/updates all pods + services)
# This ensures maintenance pod exists before we swap Ingress to it.
up_operation(
ctx,
services_list=None,
stay_attached=False,
skip_cluster_management=True,
image_overrides=image_overrides or None,
)
# Parse maintenance service spec: "container-name:port"
maint_container = maintenance_svc.split(":")[0]
maint_port = int(maintenance_svc.split(":")[1])
# Connect to k8s API
deploy_ctx = ctx.obj
deployer = deploy_ctx.deployer
deployer.connect_api()
namespace = deployer.k8s_namespace
app_name = deployer.cluster_info.app_name
networking_api = deployer.networking_api
apps_api = deployer.apps_api
ingress_name = f"{app_name}-ingress"
# Step 2: Read current Ingress and save original backends
try:
ingress = networking_api.read_namespaced_ingress(
name=ingress_name, namespace=namespace
)
except ApiException:
print("Warning: No Ingress found, skipping maintenance swap")
return
# Resolve which service the maintenance container belongs to
maint_service_name = deployer.cluster_info._resolve_service_name_for_container(
maint_container
)
# Save original backends for restoration
original_backends = []
for rule in ingress.spec.rules:
rule_backends = []
for path in rule.http.paths:
rule_backends.append(
{
"name": path.backend.service.name,
"port": path.backend.service.port.number,
}
)
original_backends.append(rule_backends)
# Patch all Ingress backends to point to maintenance service
print("Swapping Ingress to maintenance service...")
for rule in ingress.spec.rules:
for path in rule.http.paths:
path.backend.service.name = maint_service_name
path.backend.service.port.number = maint_port
networking_api.replace_namespaced_ingress(
name=ingress_name, namespace=namespace, body=ingress
)
print("Ingress now points to maintenance service")
# Step 3: Find main (non-maintenance) Deployments and scale to 0
# then back up to trigger a clean Recreate
deployments_resp = apps_api.list_namespaced_deployment(
namespace=namespace, label_selector=f"app={app_name}"
)
main_deployments = []
for dep in deployments_resp.items:
dep_name = dep.metadata.name
# Skip maintenance deployments
component = (dep.metadata.labels or {}).get("app.kubernetes.io/component", "")
is_maintenance = maint_container in component
if not is_maintenance:
main_deployments.append(dep_name)
if main_deployments:
# Scale down main deployments
for dep_name in main_deployments:
print(f"Scaling down {dep_name}...")
apps_api.patch_namespaced_deployment_scale(
name=dep_name,
namespace=namespace,
body={"spec": {"replicas": 0}},
)
# Wait for pods to terminate
print("Waiting for main pods to terminate...")
deadline = time.monotonic() + 120
while time.monotonic() < deadline:
pods = deployer.core_api.list_namespaced_pod(
namespace=namespace,
label_selector=f"app={app_name}",
)
# Count non-maintenance pods
active = sum(
1
for p in pods.items
if p.metadata
and p.metadata.deletion_timestamp is None
and not any(
maint_container in (c.name or "") for c in (p.spec.containers or [])
)
)
if active == 0:
break
time.sleep(2)
# Scale back up
replicas = deployment_context.spec.get_replicas()
for dep_name in main_deployments:
print(f"Scaling up {dep_name} to {replicas} replicas...")
apps_api.patch_namespaced_deployment_scale(
name=dep_name,
namespace=namespace,
body={"spec": {"replicas": replicas}},
)
# Step 5: Wait for readiness
print("Waiting for main pods to become ready...")
deadline = time.monotonic() + 300
while time.monotonic() < deadline:
all_ready = True
for dep_name in main_deployments:
dep = apps_api.read_namespaced_deployment(
name=dep_name, namespace=namespace
)
ready = dep.status.ready_replicas or 0
desired = dep.spec.replicas or 1
if ready < desired:
all_ready = False
break
if all_ready:
break
time.sleep(5)
# Step 6: Restore original Ingress backends
print("Restoring original Ingress backends...")
ingress = networking_api.read_namespaced_ingress(
name=ingress_name, namespace=namespace
)
for i, rule in enumerate(ingress.spec.rules):
for j, path in enumerate(rule.http.paths):
if i < len(original_backends) and j < len(original_backends[i]):
path.backend.service.name = original_backends[i][j]["name"]
path.backend.service.port.number = original_backends[i][j]["port"]
networking_api.replace_namespaced_ingress(
name=ingress_name, namespace=namespace, body=ingress
)
print("Ingress restored to original backends")
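The heart of the maintenance swap above is saving the Ingress backends as a nested list (one inner list per rule, preserving path order) and restoring them by index afterwards. A minimal sketch of that round-trip, using plain dicts as stand-ins for the kubernetes client objects (the dict shapes here are hypothetical, not the real API types):

```python
# Save/patch/restore of ingress backends, sketched with plain dicts.

def save_backends(rules):
    # One inner list per rule, preserving path order.
    return [
        [{"name": p["name"], "port": p["port"]} for p in rule["paths"]]
        for rule in rules
    ]

def point_all_at(rules, name, port):
    # Swap every route backend to the maintenance service.
    for rule in rules:
        for p in rule["paths"]:
            p["name"], p["port"] = name, port

def restore_backends(rules, saved):
    # Index-based restore; bounds-checked like the code above.
    for i, rule in enumerate(rules):
        for j, p in enumerate(rule["paths"]):
            if i < len(saved) and j < len(saved[i]):
                p["name"] = saved[i][j]["name"]
                p["port"] = saved[i][j]["port"]

rules = [{"paths": [{"name": "app-svc", "port": 8080}]}]
saved = save_backends(rules)
point_all_at(rules, "maint-svc", 8081)
restore_backends(rules, saved)
# rules[0]["paths"][0] is back to {"name": "app-svc", "port": 8080}
```

Because `save_backends` copies the name/port values rather than keeping references, mutating the live rules during the swap cannot corrupt the saved state.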



@@ -577,7 +577,9 @@ def _generate_and_store_secrets(config_vars: dict, deployment_name: str):
return secrets
def create_registry_secret(spec: Spec, deployment_name: str) -> Optional[str]:
def create_registry_secret(
spec: Spec, deployment_name: str, namespace: str = "default"
) -> Optional[str]:
"""Create K8s docker-registry secret from spec + environment.
Reads registry configuration from spec.yml and creates a Kubernetes
@@ -586,6 +588,7 @@ def create_registry_secret(spec: Spec, deployment_name: str) -> Optional[str]:
Args:
spec: The deployment spec containing image-registry config
deployment_name: Name of the deployment (used for secret naming)
namespace: K8s namespace to create the secret in
Returns:
The secret name if created, None if no registry config
@@ -599,16 +602,29 @@ def create_registry_secret(spec: Spec, deployment_name: str) -> Optional[str]:
server = registry_config.get("server")
username = registry_config.get("username")
token_env = registry_config.get("token-env")
token_file = registry_config.get("token-file")
if not all([server, username, token_env]):
if not server or not username:
return None
if not token_env and not token_file:
return None
# Type narrowing for pyright - we've validated these aren't None above
assert token_env is not None
token = os.environ.get(token_env)
# Resolve token: file takes precedence over env var
token = None
if token_file:
token_path = os.path.expanduser(token_file)
if os.path.exists(token_path):
with open(token_path) as f:
token = f.read().strip()
else:
print(f"Warning: Registry token file '{token_path}' not found")
if not token and token_env:
token = os.environ.get(token_env)
if not token:
source = token_file or token_env
print(
f"Warning: Registry token env var '{token_env}' not set, "
f"Warning: Registry token not available from '{source}', "
"skipping registry secret"
)
return None
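The new token resolution order (file first, then environment variable) can be restated as a small standalone function. This is an illustrative sketch, not the shipped helper; the function name is made up:

```python
import os

def resolve_token(token_file=None, token_env=None):
    """Hypothetical restatement of the resolution order above:
    a readable, non-empty token file wins over the env var."""
    token = None
    if token_file:
        path = os.path.expanduser(token_file)
        if os.path.exists(path):
            with open(path) as f:
                token = f.read().strip()
    if not token and token_env:
        token = os.environ.get(token_env)
    return token
```

Note the `if not token` fall-through: an empty or missing token file still lets the env var supply the credential, matching the code above.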
@@ -620,7 +636,7 @@ def create_registry_secret(spec: Spec, deployment_name: str) -> Optional[str]:
}
# Secret name derived from deployment name
secret_name = f"{deployment_name}-registry"
secret_name = f"{deployment_name}-image-pull-secret"
# Load kube config
try:
@@ -633,7 +649,6 @@ def create_registry_secret(spec: Spec, deployment_name: str) -> Optional[str]:
return None
v1 = client.CoreV1Api()
namespace = "default"
k8s_secret = client.V1Secret(
metadata=client.V1ObjectMeta(name=secret_name),
@@ -675,6 +690,15 @@ def _write_config_file(
# Write non-secret config to config.env (exclude $generate:...$ tokens)
with open(config_env_file, "w") as output_file:
output_file.write(
"# AUTO-GENERATED by laconic-so from spec.yml config section.\n"
"# Source: stack_orchestrator/deploy/deployment_create.py"
" _write_config_file()\n"
"# Do not edit — changes will be overwritten on deploy create"
" or restart.\n"
"# To change config, edit the config section in your spec.yml"
" and redeploy.\n"
)
if config_vars:
for variable_name, variable_value in config_vars.items():
# Skip variables with generate tokens - they go to K8s Secret
@@ -684,6 +708,19 @@
continue
output_file.write(f"{variable_name}={variable_value}\n")
# Append contents of credentials files listed in spec
credentials_files = spec_content.get("credentials-files", []) or []
for cred_path_str in credentials_files:
cred_path = Path(cred_path_str).expanduser()
if not cred_path.exists():
print(f"Error: credentials file does not exist: {cred_path}")
sys.exit(1)
output_file.write(f"# From credentials file: {cred_path_str}\n")
contents = cred_path.read_text()
output_file.write(contents)
if not contents.endswith("\n"):
output_file.write("\n")
def _write_kube_config_file(external_path: Path, internal_path: Path):
if not external_path.exists():
@@ -835,9 +872,7 @@ def create_operation(
# Copy from temp to deployment dir, excluding data volumes
# and backing up changed files.
# Exclude data/* to avoid touching user data volumes.
# Exclude config file to preserve deployment settings
# (XXX breaks passing config vars from spec)
exclude_patterns = ["data", "data/*", constants.config_file_name]
exclude_patterns = ["data", "data/*"]
_safe_copy_tree(
temp_dir, deployment_dir_path, exclude_patterns=exclude_patterns
)
@@ -1032,12 +1067,8 @@ def _write_deployment_files(
for configmap in parsed_spec.get_configmaps():
source_config_dir = resolve_config_dir(stack_name, configmap)
if os.path.exists(source_config_dir):
destination_config_dir = target_dir.joinpath(
"configmaps", configmap
)
copytree(
source_config_dir, destination_config_dir, dirs_exist_ok=True
)
destination_config_dir = target_dir.joinpath("configmaps", configmap)
copytree(source_config_dir, destination_config_dir, dirs_exist_ok=True)
# Copy the job files into the target dir
jobs = get_job_list(parsed_stack)


@@ -82,7 +82,14 @@ class ClusterInfo:
def __init__(self) -> None:
self.parsed_job_yaml_map = {}
def int(self, pod_files: List[str], compose_env_file, deployment_name, spec: Spec, stack_name=""):
def int(
self,
pod_files: List[str],
compose_env_file,
deployment_name,
spec: Spec,
stack_name="",
):
self.parsed_pod_yaml_map = parsed_pod_files_map_from_file_names(pod_files)
# Find the set of images in the pods
self.image_set = images_for_deployment(pod_files)
@@ -160,67 +167,99 @@ class ClusterInfo:
nodeports.append(service)
return nodeports
def _resolve_service_name_for_container(self, container_name: str) -> str:
"""Resolve the k8s Service name that routes to a given container.
For multi-pod stacks, each pod has its own Service. We find which
pod file contains this container and return the corresponding
service name. For single-pod stacks, returns the legacy service name.
"""
pod_files = list(self.parsed_pod_yaml_map.keys())
multi_pod = len(pod_files) > 1
if not multi_pod:
return f"{self.app_name}-service"
for pod_file in pod_files:
pod = self.parsed_pod_yaml_map[pod_file]
if container_name in pod.get("services", {}):
pod_name = self._pod_name_from_file(pod_file)
return f"{self.app_name}-{pod_name}-service"
# Fallback: container not found in any pod file
return f"{self.app_name}-service"
def get_ingress(
self, use_tls=False, certificate=None, cluster_issuer="letsencrypt-prod"
self, use_tls=False, certificates=None, cluster_issuer="letsencrypt-prod"
):
# No ingress for a deployment that has no http-proxy defined, for now
http_proxy_info_list = self.spec.get_http_proxy()
ingress = None
if http_proxy_info_list:
# TODO: handle multiple definitions
http_proxy_info = http_proxy_info_list[0]
if opts.o.debug:
print(f"http-proxy: {http_proxy_info}")
# TODO: good enough parsing for webapp deployment for now
host_name = http_proxy_info["host-name"]
rules = []
tls = (
[
client.V1IngressTLS(
hosts=certificate["spec"]["dnsNames"]
if certificate
else [host_name],
secret_name=certificate["spec"]["secretName"]
if certificate
else f"{self.app_name}-tls",
)
]
if use_tls
else None
)
paths = []
for route in http_proxy_info["routes"]:
path = route["path"]
proxy_to = route["proxy-to"]
tls = [] if use_tls else None
for http_proxy_info in http_proxy_info_list:
if opts.o.debug:
print(f"proxy config: {path} -> {proxy_to}")
# proxy_to has the form <service>:<port>
proxy_to_port = int(proxy_to.split(":")[1])
paths.append(
client.V1HTTPIngressPath(
path_type="Prefix",
path=path,
backend=client.V1IngressBackend(
service=client.V1IngressServiceBackend(
# TODO: this looks wrong
name=f"{self.app_name}-service",
# TODO: pull port number from the service
port=client.V1ServiceBackendPort(number=proxy_to_port),
)
),
print(f"http-proxy: {http_proxy_info}")
host_name = http_proxy_info["host-name"]
certificate = (certificates or {}).get(host_name)
if use_tls:
tls.append(
client.V1IngressTLS(
hosts=(
certificate["spec"]["dnsNames"]
if certificate
else [host_name]
),
secret_name=(
certificate["spec"]["secretName"]
if certificate
else f"{self.app_name}-{host_name}-tls"
),
)
)
paths = []
for route in http_proxy_info["routes"]:
path = route["path"]
proxy_to = route["proxy-to"]
if opts.o.debug:
print(f"proxy config: {path} -> {proxy_to}")
# proxy_to has the form <service>:<port>
container_name = proxy_to.split(":")[0]
proxy_to_port = int(proxy_to.split(":")[1])
service_name = self._resolve_service_name_for_container(
container_name
)
paths.append(
client.V1HTTPIngressPath(
path_type="Prefix",
path=path,
backend=client.V1IngressBackend(
service=client.V1IngressServiceBackend(
name=service_name,
port=client.V1ServiceBackendPort(
number=proxy_to_port
),
)
),
)
)
rules.append(
client.V1IngressRule(
host=host_name,
http=client.V1HTTPIngressRuleValue(paths=paths),
)
)
rules.append(
client.V1IngressRule(
host=host_name, http=client.V1HTTPIngressRuleValue(paths=paths)
)
)
spec = client.V1IngressSpec(tls=tls, rules=rules)
ingress_annotations = {
"kubernetes.io/ingress.class": "caddy",
}
if not certificate:
if not certificates:
ingress_annotations["cert-manager.io/cluster-issuer"] = cluster_issuer
ingress = client.V1Ingress(
@@ -233,6 +272,28 @@ class ClusterInfo:
)
return ingress
def _get_readiness_probe_ports(self) -> dict:
"""Map container names to TCP readiness probe ports.
Derives probe ports from http-proxy routes in the spec. If a container
has an http-proxy route (proxy-to: container:port), we probe that port.
This tells k8s when the container is ready to serve traffic, which is
required for safe rolling updates.
"""
probe_ports: dict = {}
http_proxy_list = self.spec.get_http_proxy()
if http_proxy_list:
for http_proxy in http_proxy_list:
for route in http_proxy.get("routes", []):
proxy_to = route.get("proxy-to", "")
if ":" in proxy_to:
container, port_str = proxy_to.rsplit(":", 1)
port = int(port_str)
# Use the first route's port for each container
if container not in probe_ports:
probe_ports[container] = port
return probe_ports
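The `_get_readiness_probe_ports` parsing above is first-route-wins per container. A short sketch with hypothetical route data shows the resulting map:

```python
# Hypothetical http-proxy routes; the first route seen for a container wins.
routes = [
    {"proxy-to": "dumpster:8080"},
    {"proxy-to": "dumpster:9090"},  # ignored: dumpster already mapped
    {"proxy-to": "web:3000"},
]
probe_ports = {}
for route in routes:
    proxy_to = route.get("proxy-to", "")
    if ":" in proxy_to:
        container, port_str = proxy_to.rsplit(":", 1)
        # setdefault keeps the first port seen for each container
        probe_ports.setdefault(container, int(port_str))
print(probe_ports)  # {'dumpster': 8080, 'web': 3000}
```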
# TODO: support multiple services
def get_service(self):
# Collect all ports from http-proxy routes
@@ -288,8 +349,7 @@ class ClusterInfo:
# Per-volume resources override global, which overrides default.
vol_resources = (
self.spec.get_volume_resources_for(volume_name)
or global_resources
self.spec.get_volume_resources_for(volume_name) or global_resources
)
labels = {
@@ -329,6 +389,7 @@ class ClusterInfo:
print(f"{cfg_map_name} not in pod files")
continue
cfg_map_path = os.path.expanduser(cfg_map_path)
if not cfg_map_path.startswith("/") and self.spec.file_path is not None:
cfg_map_path = os.path.join(
os.path.dirname(str(self.spec.file_path)), cfg_map_path
@@ -391,12 +452,15 @@ class ClusterInfo:
continue
vol_resources = (
self.spec.get_volume_resources_for(volume_name)
or global_resources
self.spec.get_volume_resources_for(volume_name) or global_resources
)
if self.spec.is_kind_deployment():
host_path = client.V1HostPathVolumeSource(
path=get_kind_pv_bind_mount_path(volume_name)
path=get_kind_pv_bind_mount_path(
volume_name,
kind_mount_root=self.spec.get_kind_mount_root(),
host_path=volume_path,
)
)
else:
host_path = client.V1HostPathVolumeSource(path=volume_path)
@@ -467,6 +531,7 @@ class ClusterInfo:
containers = []
init_containers = []
services = {}
readiness_probe_ports = self._get_readiness_probe_ports()
global_resources = self.spec.get_container_resources()
if not global_resources:
global_resources = DEFAULT_CONTAINER_RESOURCES
@@ -527,9 +592,7 @@ class ClusterInfo:
if self.spec.get_image_registry() is not None
else image
)
volume_mounts = volume_mounts_for_service(
parsed_yaml_map, service_name
)
volume_mounts = volume_mounts_for_service(parsed_yaml_map, service_name)
# Handle command/entrypoint from compose file
# In docker-compose: entrypoint -> k8s command, command -> k8s args
container_command = None
@@ -565,6 +628,16 @@ class ClusterInfo:
container_resources = self._resolve_container_resources(
container_name, service_info, global_resources
)
# Readiness probe from http-proxy routes
readiness_probe = None
probe_port = readiness_probe_ports.get(container_name)
if probe_port:
readiness_probe = client.V1Probe(
tcp_socket=client.V1TCPSocketAction(port=probe_port),
initial_delay_seconds=5,
period_seconds=10,
failure_threshold=3,
)
container = client.V1Container(
name=container_name,
image=image_to_use,
@@ -575,14 +648,19 @@ class ClusterInfo:
env_from=env_from,
ports=container_ports if container_ports else None,
volume_mounts=volume_mounts,
readiness_probe=readiness_probe,
security_context=client.V1SecurityContext(
privileged=self.spec.get_privileged(),
run_as_user=int(service_info["user"]) if "user" in service_info else None,
capabilities=client.V1Capabilities(
add=self.spec.get_capabilities()
)
if self.spec.get_capabilities()
else None,
run_as_user=(
int(service_info["user"])
if "user" in service_info
else None
),
capabilities=(
client.V1Capabilities(add=self.spec.get_capabilities())
if self.spec.get_capabilities()
else None
),
),
resources=to_k8s_resource_requirements(container_resources),
)
@@ -591,33 +669,53 @@ class ClusterInfo:
svc_labels = service_info.get("labels", {})
if isinstance(svc_labels, list):
# docker-compose labels can be a list of "key=value"
svc_labels = dict(
item.split("=", 1) for item in svc_labels
)
is_init = str(
svc_labels.get("laconic.init-container", "")
).lower() in ("true", "1", "yes")
svc_labels = dict(item.split("=", 1) for item in svc_labels)
is_init = str(svc_labels.get("laconic.init-container", "")).lower() in (
"true",
"1",
"yes",
)
if is_init:
init_containers.append(container)
else:
containers.append(container)
volumes = volumes_for_pod_files(
parsed_yaml_map, self.spec, self.app_name
)
volumes = volumes_for_pod_files(parsed_yaml_map, self.spec, self.app_name)
return containers, init_containers, services, volumes
# TODO: put things like image pull policy into an object-scope struct
def get_deployment(self, image_pull_policy: Optional[str] = None):
containers, init_containers, services, volumes = self._build_containers(
self.parsed_pod_yaml_map, image_pull_policy
)
registry_config = self.spec.get_image_registry_config()
if registry_config:
secret_name = f"{self.app_name}-registry"
image_pull_secrets = [client.V1LocalObjectReference(name=secret_name)]
else:
image_pull_secrets = []
def _pod_name_from_file(self, pod_file: str) -> str:
"""Extract pod name from compose file path.
docker-compose-dumpster.yml -> dumpster
docker-compose-dumpster-maintenance.yml -> dumpster-maintenance
"""
import os
base = os.path.basename(pod_file)
name = base
if name.startswith("docker-compose-"):
name = name[len("docker-compose-") :]
if name.endswith(".yml"):
name = name[: -len(".yml")]
elif name.endswith(".yaml"):
name = name[: -len(".yaml")]
return name
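The filename-to-pod-name mapping can be exercised standalone. This is a restatement of the helper above for illustration (the free function name is made up):

```python
import os

def pod_name_from_file(pod_file: str) -> str:
    # Same stripping rules as _pod_name_from_file above.
    name = os.path.basename(pod_file)
    if name.startswith("docker-compose-"):
        name = name[len("docker-compose-"):]
    if name.endswith(".yml"):
        name = name[: -len(".yml")]
    elif name.endswith(".yaml"):
        name = name[: -len(".yaml")]
    return name

assert pod_name_from_file("stacks/x/docker-compose-dumpster.yml") == "dumpster"
assert pod_name_from_file("docker-compose-dumpster-maintenance.yaml") == "dumpster-maintenance"
```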
def _pod_has_pvcs(self, parsed_pod_file: Any) -> bool:
"""Check if a parsed compose file declares volumes that become PVCs.
Excludes volumes that are ConfigMaps (declared in spec.configmaps),
since those don't require Recreate strategy.
"""
volumes = parsed_pod_file.get("volumes", {})
configmaps = set(self.spec.get_configmaps().keys())
pvc_volumes = [v for v in volumes if v not in configmaps]
return len(pvc_volumes) > 0
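The PVC check above drives the strategy choice in `get_deployments`: compose volumes declared as configmaps in the spec do not force Recreate. A sketch with hypothetical volume names:

```python
# Hypothetical compose volumes and spec configmaps.
volumes = {"dumpster-data": {}, "app-config": {}}
configmaps = {"app-config"}  # declared in spec.configmaps

# Volumes not backed by ConfigMaps become PVCs.
pvc_volumes = [v for v in volumes if v not in configmaps]

# PVC-backed pods cannot roll (ReadWriteOnce), so they get Recreate.
strategy = "Recreate" if pvc_volumes else "RollingUpdate"
print(strategy)  # Recreate: dumpster-data is a real PVC
```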
def _build_common_pod_metadata(self, services: dict) -> tuple:
"""Build shared annotations, labels, affinity, tolerations for pods.
Returns (annotations, labels, affinity, tolerations).
"""
annotations = None
labels = {"app": self.app_name}
if self.stack_name:
@@ -639,7 +737,6 @@ class ClusterInfo:
if self.spec.get_node_affinities():
affinities = []
for rule in self.spec.get_node_affinities():
# TODO add some input validation here
label_name = rule["label"]
label_value = rule["value"]
affinities.append(
@@ -662,7 +759,6 @@ class ClusterInfo:
if self.spec.get_node_tolerations():
tolerations = []
for toleration in self.spec.get_node_tolerations():
# TODO add some input validation here
toleration_key = toleration["key"]
toleration_value = toleration["value"]
tolerations.append(
@@ -674,37 +770,224 @@ class ClusterInfo:
)
)
use_host_network = self._any_service_has_host_network()
template = client.V1PodTemplateSpec(
metadata=client.V1ObjectMeta(annotations=annotations, labels=labels),
spec=client.V1PodSpec(
containers=containers,
init_containers=init_containers or None,
image_pull_secrets=image_pull_secrets,
volumes=volumes,
affinity=affinity,
tolerations=tolerations,
runtime_class_name=self.spec.get_runtime_class(),
host_network=use_host_network or None,
dns_policy=("ClusterFirstWithHostNet" if use_host_network else None),
),
)
spec = client.V1DeploymentSpec(
replicas=self.spec.get_replicas(),
template=template,
selector={"matchLabels": {"app": self.app_name}},
)
return annotations, labels, affinity, tolerations
deployment = client.V1Deployment(
api_version="apps/v1",
kind="Deployment",
metadata=client.V1ObjectMeta(
name=f"{self.app_name}-deployment",
labels={"app": self.app_name, **({"app.kubernetes.io/stack": self.stack_name} if self.stack_name else {})},
),
spec=spec,
)
return deployment
# TODO: put things like image pull policy into an object-scope struct
def get_deployment(self, image_pull_policy: Optional[str] = None):
"""Build a single k8s Deployment from all pod files (legacy behavior).
When only one pod is defined in the stack, this is equivalent to
get_deployments()[0]. Kept for backward compatibility.
"""
deployments = self.get_deployments(image_pull_policy)
if not deployments:
return None
# Legacy: return the first (and usually only) deployment
return deployments[0]
def get_deployments(
self, image_pull_policy: Optional[str] = None
) -> List[client.V1Deployment]:
"""Build one k8s Deployment per pod file.
Each pod file (docker-compose-<name>.yml) becomes its own Deployment
with independent lifecycle and update strategy:
- Pods with PVCs get strategy=Recreate (can't do rolling updates
with ReadWriteOnce volumes)
- Pods without PVCs get strategy=RollingUpdate
This enables maintenance services to survive main pod restarts.
"""
if not self.parsed_pod_yaml_map:
return []
registry_config = self.spec.get_image_registry_config()
if registry_config:
secret_name = f"{self.app_name}-image-pull-secret"
image_pull_secrets = [client.V1LocalObjectReference(name=secret_name)]
else:
image_pull_secrets = []
use_host_network = self._any_service_has_host_network()
pod_files = list(self.parsed_pod_yaml_map.keys())
# Single pod file: preserve legacy naming ({app_name}-deployment)
# Multiple pod files: use {app_name}-{pod_name}-deployment
multi_pod = len(pod_files) > 1
deployments = []
for pod_file in pod_files:
pod_name = self._pod_name_from_file(pod_file)
single_pod_map = {pod_file: self.parsed_pod_yaml_map[pod_file]}
containers, init_containers, services, volumes = self._build_containers(
single_pod_map, image_pull_policy
)
annotations, labels, affinity, tolerations = (
self._build_common_pod_metadata(services)
)
# Add pod-name label so Services can target specific pods
if multi_pod:
labels["app.kubernetes.io/component"] = pod_name
has_pvcs = self._pod_has_pvcs(self.parsed_pod_yaml_map[pod_file])
if has_pvcs:
strategy = client.V1DeploymentStrategy(type="Recreate")
else:
strategy = client.V1DeploymentStrategy(
type="RollingUpdate",
rolling_update=client.V1RollingUpdateDeployment(
max_unavailable=0, max_surge=1
),
)
# Pod selector: for multi-pod, select by both app and component
selector_labels = {"app": self.app_name}
if multi_pod:
selector_labels["app.kubernetes.io/component"] = pod_name
# Add CA certificate volume and env vars if configured
_ca_secret, ca_volume, ca_mounts, ca_envs = (
self.get_ca_certificate_resources()
)
if ca_volume:
volumes.append(ca_volume)
for container in containers:
if container.volume_mounts is None:
container.volume_mounts = []
container.volume_mounts.extend(ca_mounts)
if container.env is None:
container.env = []
container.env.extend(ca_envs)
template = client.V1PodTemplateSpec(
metadata=client.V1ObjectMeta(annotations=annotations, labels=labels),
spec=client.V1PodSpec(
containers=containers,
init_containers=init_containers or None,
image_pull_secrets=image_pull_secrets,
volumes=volumes,
affinity=affinity,
tolerations=tolerations,
runtime_class_name=self.spec.get_runtime_class(),
host_network=use_host_network or None,
dns_policy=(
"ClusterFirstWithHostNet" if use_host_network else None
),
),
)
if multi_pod:
deployment_name = f"{self.app_name}-{pod_name}-deployment"
else:
deployment_name = f"{self.app_name}-deployment"
spec = client.V1DeploymentSpec(
replicas=self.spec.get_replicas(),
template=template,
selector={"matchLabels": selector_labels},
strategy=strategy,
)
deployment = client.V1Deployment(
api_version="apps/v1",
kind="Deployment",
metadata=client.V1ObjectMeta(
name=deployment_name,
labels={
"app": self.app_name,
**(
{
"app.kubernetes.io/stack": self.stack_name,
}
if self.stack_name
else {}
),
**(
{"app.kubernetes.io/component": pod_name}
if multi_pod
else {}
),
},
),
spec=spec,
)
deployments.append(deployment)
return deployments
def get_services(self) -> List[client.V1Service]:
"""Build per-pod ClusterIP Services for multi-pod stacks.
Each pod's containers get their own Service so Ingress can route
to specific pods. For single-pod stacks, returns a list with one
service matching the legacy get_service() behavior.
"""
pod_files = list(self.parsed_pod_yaml_map.keys())
multi_pod = len(pod_files) > 1
if not multi_pod:
# Legacy: single service for all pods
svc = self.get_service()
return [svc] if svc else []
# Multi-pod: one service per pod, only for pods that have
# ports referenced by http-proxy routes
http_proxy_list = self.spec.get_http_proxy()
if not http_proxy_list:
return []
# Build map: container_name -> port from http-proxy routes
container_ports: dict = {}
for http_proxy in http_proxy_list:
for route in http_proxy.get("routes", []):
proxy_to = route.get("proxy-to", "")
if ":" in proxy_to:
container, port_str = proxy_to.rsplit(":", 1)
port = int(port_str)
if container not in container_ports:
container_ports[container] = set()
container_ports[container].add(port)
# Build map: pod_file -> set of service names in that pod
pod_services_map: dict = {}
for pod_file in pod_files:
pod = self.parsed_pod_yaml_map[pod_file]
pod_services_map[pod_file] = set(pod.get("services", {}).keys())
services = []
for pod_file in pod_files:
pod_name = self._pod_name_from_file(pod_file)
svc_names = pod_services_map[pod_file]
# Collect ports from http-proxy that belong to this pod's containers
ports_set: Set[int] = set()
for svc_name in svc_names:
if svc_name in container_ports:
ports_set.update(container_ports[svc_name])
if not ports_set:
continue
service_ports = [
client.V1ServicePort(port=p, target_port=p, name=f"port-{p}")
for p in sorted(ports_set)
]
service = client.V1Service(
metadata=client.V1ObjectMeta(
name=f"{self.app_name}-{pod_name}-service",
labels={"app": self.app_name},
),
spec=client.V1ServiceSpec(
type="ClusterIP",
ports=service_ports,
selector={
"app": self.app_name,
"app.kubernetes.io/component": pod_name,
},
),
)
services.append(service)
return services
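The port grouping in `get_services` can be sketched with hypothetical pod and route data: only pods whose containers appear in an http-proxy route produce a Service, and each Service carries the union of its containers' routed ports.

```python
# Hypothetical inputs: routed ports per container, containers per pod.
container_ports = {"dumpster": {8080, 9090}, "ui": {3000}}
pod_services = {
    "dumpster": {"dumpster"},
    "ui": {"ui", "helper"},  # helper has no route: contributes no ports
    "worker": {"worker"},    # no routed ports at all: no Service emitted
}

services = {}
for pod_name, svc_names in pod_services.items():
    ports = set()
    for svc in svc_names:
        ports.update(container_ports.get(svc, set()))
    if ports:
        services[f"app-{pod_name}-service"] = sorted(ports)
print(services)
# {'app-dumpster-service': [8080, 9090], 'app-ui-service': [3000]}
```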
def get_jobs(self, image_pull_policy: Optional[str] = None) -> List[client.V1Job]:
"""Build k8s Job objects from parsed job compose files.
@@ -720,7 +1003,7 @@ class ClusterInfo:
jobs = []
registry_config = self.spec.get_image_registry_config()
if registry_config:
secret_name = f"{self.app_name}-registry"
secret_name = f"{self.app_name}-image-pull-secret"
image_pull_secrets = [client.V1LocalObjectReference(name=secret_name)]
else:
image_pull_secrets = []
@@ -728,8 +1011,8 @@ class ClusterInfo:
for job_file in self.parsed_job_yaml_map:
# Build containers for this single job file
single_job_map = {job_file: self.parsed_job_yaml_map[job_file]}
containers, init_containers, _services, volumes = (
self._build_containers(single_job_map, image_pull_policy)
containers, init_containers, _services, volumes = self._build_containers(
single_job_map, image_pull_policy
)
# Derive job name from file path: docker-compose-<name>.yml -> <name>
@@ -737,7 +1020,7 @@ class ClusterInfo:
# Strip docker-compose- prefix and .yml suffix
job_name = base
if job_name.startswith("docker-compose-"):
job_name = job_name[len("docker-compose-"):]
job_name = job_name[len("docker-compose-") :]
if job_name.endswith(".yml"):
job_name = job_name[: -len(".yml")]
elif job_name.endswith(".yaml"):
@@ -747,12 +1030,14 @@ class ClusterInfo:
# picked up by pods_in_deployment() which queries app={app_name}.
pod_labels = {
"app": f"{self.app_name}-job",
**({"app.kubernetes.io/stack": self.stack_name} if self.stack_name else {}),
**(
{"app.kubernetes.io/stack": self.stack_name}
if self.stack_name
else {}
),
}
template = client.V1PodTemplateSpec(
metadata=client.V1ObjectMeta(
labels=pod_labels
),
metadata=client.V1ObjectMeta(labels=pod_labels),
spec=client.V1PodSpec(
containers=containers,
init_containers=init_containers or None,
@@ -765,7 +1050,14 @@ class ClusterInfo:
template=template,
backoff_limit=0,
)
job_labels = {"app": self.app_name, **({"app.kubernetes.io/stack": self.stack_name} if self.stack_name else {})}
job_labels = {
"app": self.app_name,
**(
{"app.kubernetes.io/stack": self.stack_name}
if self.stack_name
else {}
),
}
job = client.V1Job(
api_version="batch/v1",
kind="Job",
@@ -778,3 +1070,130 @@ class ClusterInfo:
jobs.append(job)
return jobs
def get_external_service_resources(self) -> List:
"""Build k8s Services (and Endpoints) for external-services in spec.
Two modes:
- host mode: ExternalName Service (DNS CNAME to external host)
- selector mode: headless Service + Endpoints (cross-namespace
routing to a mock pod, IP discovered at deploy time)
Returns a flat list of k8s resource objects (Services + Endpoints).
"""
ext_services = self.spec.get_external_services()
if not ext_services:
return []
resources = []
for name, config in ext_services.items():
port = config.get("port", 443)
if "host" in config:
# ExternalName: DNS CNAME to external host
svc = client.V1Service(
metadata=client.V1ObjectMeta(
name=name,
labels={"app": self.app_name},
),
spec=client.V1ServiceSpec(
type="ExternalName",
external_name=config["host"],
ports=[
client.V1ServicePort(port=port, name=f"port-{port}")
],
),
)
resources.append(svc)
elif "selector" in config and "namespace" in config:
# Cross-namespace headless Service + Endpoints.
# The Endpoints IP is populated in deploy_k8s.py at deploy
# time by querying the target namespace for matching pods.
svc = client.V1Service(
metadata=client.V1ObjectMeta(
name=name,
labels={"app": self.app_name},
),
spec=client.V1ServiceSpec(
cluster_ip="None",
ports=[
client.V1ServicePort(port=port, name=f"port-{port}")
],
),
)
resources.append(svc)
# Endpoints object is created in deploy_k8s.py after pod
# IP discovery — we just return the Service here.
return resources
def get_ca_certificate_resources(self) -> tuple:
"""Build k8s Secret and volume mount config for CA certificates.
Returns (secret, volume, volume_mount, env_vars) or (None, ...) if
no CA certificates are configured. The caller must add the volume
and mount to all containers, and the env vars to all containers.
"""
ca_files = self.spec.get_ca_certificates()
if not ca_files:
return None, None, None, []
# Concatenate all CA files into one Secret
secret_data = {}
for i, ca_path in enumerate(ca_files):
expanded = os.path.expanduser(ca_path)
if not os.path.exists(expanded):
print(f"Warning: CA certificate file not found: {expanded}")
continue
with open(expanded, "rb") as f:
ca_bytes = f.read()
key = f"laconic-extra-ca-{i}.pem"
secret_data[key] = base64.b64encode(ca_bytes).decode()
if not secret_data:
return None, None, None, []
secret_name = f"{self.app_name}-ca-certificates"
secret = client.V1Secret(
metadata=client.V1ObjectMeta(
name=secret_name,
labels={"app": self.app_name},
),
data=secret_data,
)
volume = client.V1Volume(
name="laconic-ca-certs",
secret=client.V1SecretVolumeSource(
secret_name=secret_name,
),
)
# Mount each CA file into /etc/ssl/certs/ (Go reads this dir)
# Mount each CA file directly into /etc/ssl/certs/ using subPath
# so Go's x509 package picks them up (it reads *.pem from that dir).
# Also return env vars for Node/Bun containers.
volume_mounts = []
first_mount_path = None
for key in secret_data.keys():
mount_path = f"/etc/ssl/certs/{key}"
if first_mount_path is None:
first_mount_path = mount_path
volume_mounts.append(
client.V1VolumeMount(
name="laconic-ca-certs",
mount_path=mount_path,
sub_path=key,
read_only=True,
)
)
env_vars = [
client.V1EnvVar(
name="NODE_EXTRA_CA_CERTS",
value=first_mount_path,
),
]
return secret, volume, volume_mounts, env_vars
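As a sanity check of the subPath layout, here is the mount list for two CA files, with plain dicts standing in for the kubernetes client objects (the key names mirror the code above; this is an illustration, not the real `V1VolumeMount` objects):

```python
# Plain-dict sketch of the per-file subPath mounts built above.
# Each Secret key becomes a single file under /etc/ssl/certs/,
# leaving the distro's existing CA bundle in place.
secret_data = {f"laconic-extra-ca-{i}.pem": "<base64-cert>" for i in range(2)}

volume_mounts = [
    {
        "name": "laconic-ca-certs",
        "mountPath": f"/etc/ssl/certs/{key}",  # file-level target path
        "subPath": key,  # mount only this key, not the whole Secret dir
        "readOnly": True,
    }
    for key in secret_data
]

print([m["mountPath"] for m in volume_mounts])
```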


@ -115,6 +115,7 @@ class K8sDeployer(Deployer):
) -> None:
self.type = type
self.skip_cluster_management = False
self.image_overrides = None
self.k8s_namespace = "default" # Will be overridden below if context exists
# TODO: workaround pending refactoring above to cope with being
# created with a null deployment_context
@ -122,9 +123,13 @@ class K8sDeployer(Deployer):
return
self.deployment_dir = deployment_context.deployment_dir
self.deployment_context = deployment_context
self.kind_cluster_name = (
deployment_context.spec.get_kind_cluster_name() or compose_project_name
)
# Use spec namespace if provided, otherwise derive from cluster-id
self.k8s_namespace = (
deployment_context.spec.get_namespace() or f"laconic-{compose_project_name}"
)
self.cluster_info = ClusterInfo()
# stack.name may be an absolute path (from spec "stack:" key after
# path resolution). Extract just the directory basename for labels.
@ -204,6 +209,43 @@ class K8sDeployer(Deployer):
else:
raise
def _wait_for_namespace_gone(self, timeout_seconds: int = 120):
"""Wait for namespace to finish terminating."""
if opts.o.dry_run:
return
import time
deadline = time.monotonic() + timeout_seconds
while time.monotonic() < deadline:
try:
ns = self.core_api.read_namespace(name=self.k8s_namespace)
if ns.status and ns.status.phase == "Terminating":
if opts.o.debug:
print(
f"Waiting for namespace {self.k8s_namespace}"
" to finish terminating..."
)
time.sleep(2)
continue
# Namespace exists and is Active — shouldn't happen after delete
break
except ApiException as e:
if e.status == 404:
# Gone — success
return
raise
# If we get here, namespace still exists after timeout
try:
self.core_api.read_namespace(name=self.k8s_namespace)
print(
f"Warning: namespace {self.k8s_namespace} still exists"
f" after {timeout_seconds}s"
)
except ApiException as e:
if e.status == 404:
return
raise
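The deadline loop in `_wait_for_namespace_gone` follows a generic poll-until pattern. A minimal standalone version (the `wait_until` helper and the simulated 404 check are hypothetical, not part of the codebase):

```python
import time

def wait_until(predicate, timeout_seconds=5.0, interval=0.01):
    # Same shape as _wait_for_namespace_gone: poll until predicate()
    # succeeds or the monotonic deadline passes, then report the result.
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return bool(predicate())

calls = {"n": 0}

def namespace_gone():
    # Stand-in for the read_namespace 404 check: "gone" on the third poll.
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_until(namespace_gone))
```

Using `time.monotonic()` rather than `time.time()` keeps the deadline immune to wall-clock adjustments.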
def _delete_resources_by_label(self, label_selector: str, delete_volumes: bool):
"""Delete only this stack's resources from a shared namespace."""
ns = self.k8s_namespace
@ -232,7 +274,8 @@ class K8sDeployer(Deployer):
for job in jobs.items:
print(f"Deleting Job {job.metadata.name}")
self.batch_api.delete_namespaced_job(
name=job.metadata.name,
namespace=ns,
body=client.V1DeleteOptions(propagation_policy="Background"),
)
except ApiException as e:
@ -303,7 +346,22 @@ class K8sDeployer(Deployer):
name=pv.metadata.name
)
if pv_resp:
# If PV is in Released state (stale claimRef from a
# previous deployment), clear the claimRef so a new
# PVC can bind to it. This happens after stop+start
# because stop deletes the namespace (and PVCs) but
# preserves PVs by default.
if pv_resp.status and pv_resp.status.phase == "Released":
print(
f"PV {pv.metadata.name} is Released, "
"clearing claimRef for rebinding"
)
pv_resp.spec.claim_ref = None
self.core_api.patch_persistent_volume(
name=pv.metadata.name,
body={"spec": {"claimRef": None}},
)
elif opts.o.debug:
print("PVs already present:")
print(f"{pv_resp}")
continue
@ -347,12 +405,148 @@ class K8sDeployer(Deployer):
if opts.o.debug:
print(f"Sending this ConfigMap: {cfg_map}")
if not opts.o.dry_run:
cm_name = cfg_map.metadata.name
try:
self.core_api.create_namespaced_config_map(
body=cfg_map, namespace=self.k8s_namespace
)
except ApiException as e:
if e.status == 409:
self.core_api.patch_namespaced_config_map(
name=cm_name,
namespace=self.k8s_namespace,
body=cfg_map,
)
else:
raise
def _create_external_services(self):
"""Create k8s Services for external-services declared in the spec.
For host mode: ExternalName Service (DNS CNAME).
For selector mode: headless Service + Endpoints with pod IPs
discovered from the target namespace.
"""
resources = self.cluster_info.get_external_service_resources()
ext_services = self.cluster_info.spec.get_external_services()
for resource in resources:
if opts.o.dry_run:
print(f"Dry run: would create external service: {resource.metadata.name}")
continue
svc_name = resource.metadata.name
try:
self.core_api.create_namespaced_service(
body=resource, namespace=self.k8s_namespace
)
print(f"Created external service '{svc_name}'")
except ApiException as e:
if e.status == 409:
self.core_api.replace_namespaced_service(
name=svc_name,
namespace=self.k8s_namespace,
body=resource,
)
print(f"Updated external service '{svc_name}'")
else:
raise
# Create Endpoints for selector-mode services
for name, config in ext_services.items():
if "selector" not in config or "namespace" not in config:
continue
if opts.o.dry_run:
continue
target_ns = config["namespace"]
selector = config["selector"]
port = config.get("port", 443)
# Build label selector string from dict
label_selector = ",".join(f"{k}={v}" for k, v in selector.items())
# Discover pod IPs in target namespace
pods = self.core_api.list_namespaced_pod(
namespace=target_ns, label_selector=label_selector
)
pod_ips = [
p.status.pod_ip
for p in pods.items
if p.status and p.status.pod_ip
]
if not pod_ips:
print(
f"Warning: no pods found in {target_ns} matching "
f"{label_selector} for external service '{name}'"
)
continue
endpoints = client.V1Endpoints(
metadata=client.V1ObjectMeta(
name=name,
labels={"app": self.cluster_info.app_name},
),
subsets=[
client.V1EndpointSubset(
addresses=[
client.V1EndpointAddress(ip=ip) for ip in pod_ips
],
ports=[
client.CoreV1EndpointPort(
port=port, name=f"port-{port}"
)
],
)
],
)
try:
self.core_api.create_namespaced_endpoints(
body=endpoints, namespace=self.k8s_namespace
)
print(f"Created endpoints for '{name}': {pod_ips}")
except ApiException as e:
if e.status == 409:
self.core_api.replace_namespaced_endpoints(
name=name,
namespace=self.k8s_namespace,
body=endpoints,
)
print(f"Updated endpoints for '{name}': {pod_ips}")
else:
raise
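For selector-mode entries, the selector dict is flattened into the comma-separated string form that `list_namespaced_pod` expects. A quick illustration (the selector values are hypothetical):

```python
# Hypothetical selector taken from an external-services spec entry.
selector = {"app": "mock", "tier": "s3"}

# Same join as in the endpoint-discovery code above; Python dicts preserve
# insertion order, so the resulting selector string is deterministic.
label_selector = ",".join(f"{k}={v}" for k, v in selector.items())
print(label_selector)  # app=mock,tier=s3
```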
def _create_ca_certificates(self):
"""Create k8s Secret for CA certificates declared in the spec.
The Secret is mounted into containers by get_deployments() in
cluster_info.py. This method just ensures the Secret exists.
"""
ca_secret, _, _, _ = self.cluster_info.get_ca_certificate_resources()
if not ca_secret:
return
if opts.o.dry_run:
print(f"Dry run: would create CA certificate secret")
return
secret_name = ca_secret.metadata.name
try:
self.core_api.create_namespaced_secret(
body=ca_secret, namespace=self.k8s_namespace
)
print(f"Created CA certificate secret '{secret_name}'")
except ApiException as e:
if e.status == 409:
self.core_api.replace_namespaced_secret(
name=secret_name,
namespace=self.k8s_namespace,
body=ca_secret,
)
print(f"Updated CA certificate secret '{secret_name}'")
else:
raise
def _create_deployment(self):
# Skip if there are no pods to deploy (e.g. jobs-only stacks)
@ -360,48 +554,109 @@ class K8sDeployer(Deployer):
if opts.o.debug:
print("No pods defined, skipping Deployment creation")
return
# Process compose files into Deployments (one per pod file)
# image-pull-policy from spec, default Always (production).
# Testing specs use IfNotPresent so kind-loaded local images are used.
pull_policy = self.cluster_info.spec.get("image-pull-policy", "Always")
deployments = self.cluster_info.get_deployments(image_pull_policy=pull_policy)
for deployment in deployments:
# Apply image overrides if provided
if self.image_overrides:
for container in deployment.spec.template.spec.containers:
if container.name in self.image_overrides:
container.image = self.image_overrides[container.name]
if opts.o.debug:
print(
f"Overriding image for {container.name}:"
f" {container.image}"
)
# Create or update the k8s Deployment
if opts.o.debug:
print("Deployment created:")
meta = deployment_resp.metadata
spec = deployment_resp.spec
if meta and spec and spec.template.spec:
ns = meta.namespace
name = meta.name
gen = meta.generation
containers = spec.template.spec.containers
img = containers[0].image if containers else None
print(f"{ns} {name} {gen} {img}")
print(f"Sending this deployment: {deployment}")
if not opts.o.dry_run:
name = deployment.metadata.name
try:
deployment_resp = cast(
client.V1Deployment,
self.apps_api.create_namespaced_deployment(
body=deployment, namespace=self.k8s_namespace
),
)
strategy = (
deployment.spec.strategy.type
if deployment.spec.strategy
else "default"
)
print(f"Created Deployment {name} (strategy: {strategy})")
except ApiException as e:
if e.status == 409:
# Already exists — replace to ensure removed fields
# (volumes, mounts, env vars) are actually deleted.
existing = self.apps_api.read_namespaced_deployment(
name=name, namespace=self.k8s_namespace
)
deployment.metadata.resource_version = (
existing.metadata.resource_version
)
deployment_resp = cast(
client.V1Deployment,
self.apps_api.replace_namespaced_deployment(
name=name,
namespace=self.k8s_namespace,
body=deployment,
),
)
print(f"Updated Deployment {name} (rolling update)")
else:
raise
if opts.o.debug:
meta = deployment_resp.metadata
spec = deployment_resp.spec
if meta and spec and spec.template.spec:
containers = spec.template.spec.containers
img = containers[0].image if containers else None
print(
f" {meta.namespace} {meta.name}"
f" gen={meta.generation} {img}"
)
# Create Services (one per pod for multi-pod, or one for single-pod)
services = self.cluster_info.get_services()
for service in services:
if opts.o.debug:
print("Service created:")
print(f"{service_resp}")
print(f"Sending this service: {service}")
if service and not opts.o.dry_run:
svc_name = service.metadata.name
try:
service_resp = self.core_api.create_namespaced_service(
namespace=self.k8s_namespace, body=service
)
print(f"Created Service {svc_name}")
except ApiException as e:
if e.status == 409:
# Replace to ensure removed ports are deleted.
# Must preserve clusterIP (immutable) and resourceVersion.
existing = self.core_api.read_namespaced_service(
name=svc_name, namespace=self.k8s_namespace
)
service.metadata.resource_version = (
existing.metadata.resource_version
)
service.spec.cluster_ip = existing.spec.cluster_ip
service_resp = self.core_api.replace_namespaced_service(
name=svc_name,
namespace=self.k8s_namespace,
body=service,
)
print(f"Updated Service {svc_name}")
else:
raise
if opts.o.debug:
print(f" {service_resp}")
def _create_jobs(self):
# Process job compose files into k8s Jobs
jobs = self.cluster_info.get_jobs(image_pull_policy="Always")
for job in jobs:
if opts.o.debug:
print(f"Sending this job: {job}")
@ -453,107 +708,149 @@ class K8sDeployer(Deployer):
return cert
return None
def _setup_cluster(self):
"""Create/reuse kind cluster, load images, ensure namespace."""
if self.is_kind() and not self.skip_cluster_management:
kind_config = str(
self.deployment_dir.joinpath(constants.kind_config_filename)
)
actual_cluster = create_cluster(self.kind_cluster_name, kind_config)
if actual_cluster != self.kind_cluster_name:
self.kind_cluster_name = actual_cluster
# Only load locally-built images into kind
local_containers = self.deployment_context.stack.obj.get("containers", [])
if local_containers:
local_images = {
img
for img in self.cluster_info.image_set
if any(c in img for c in local_containers)
}
if local_images:
load_images_into_kind(self.kind_cluster_name, local_images)
self.connect_api()
self._ensure_namespace()
if self.is_kind() and not self.skip_cluster_management:
if not is_ingress_running():
install_ingress_for_kind(self.cluster_info.spec.get_acme_email())
wait_for_ingress_in_kind()
if self.cluster_info.spec.get_unlimited_memlock():
_create_runtime_class(
constants.high_memlock_runtime,
constants.high_memlock_runtime,
)
def _create_ingress(self):
"""Create or update Ingress with TLS certificate lookup."""
http_proxy_info = self.cluster_info.spec.get_http_proxy()
use_tls = http_proxy_info and not self.is_kind()
certificates = None
if use_tls:
certificates = {}
for proxy in http_proxy_info:
host_name = proxy["host-name"]
cert = self._find_certificate_for_host_name(host_name)
if cert:
certificates[host_name] = cert
if opts.o.debug:
print(f"Using existing certificate for {host_name}: {cert}")
ingress = self.cluster_info.get_ingress(
use_tls=use_tls, certificates=certificates
)
if ingress:
if opts.o.debug:
print(f"Sending this ingress: {ingress}")
if not opts.o.dry_run:
ing_name = ingress.metadata.name
try:
self.networking_api.create_namespaced_ingress(
namespace=self.k8s_namespace, body=ingress
)
print(f"Created Ingress {ing_name}")
except ApiException as e:
if e.status == 409:
existing = self.networking_api.read_namespaced_ingress(
name=ing_name, namespace=self.k8s_namespace
)
ingress.metadata.resource_version = (
existing.metadata.resource_version
)
self.networking_api.replace_namespaced_ingress(
name=ing_name,
namespace=self.k8s_namespace,
body=ingress,
)
print(f"Updated Ingress {ing_name}")
else:
raise
else:
if opts.o.debug:
print("No ingress configured")
def _create_nodeports(self):
"""Create or update NodePort services."""
nodeports: List[client.V1Service] = self.cluster_info.get_nodeports()
for nodeport in nodeports:
if opts.o.debug:
print(f"Sending this nodeport: {nodeport}")
if not opts.o.dry_run:
np_name = nodeport.metadata.name
try:
self.core_api.create_namespaced_service(
namespace=self.k8s_namespace, body=nodeport
)
except ApiException as e:
if e.status == 409:
existing = self.core_api.read_namespaced_service(
name=np_name, namespace=self.k8s_namespace
)
nodeport.metadata.resource_version = (
existing.metadata.resource_version
)
nodeport.spec.cluster_ip = existing.spec.cluster_ip
self.core_api.replace_namespaced_service(
name=np_name,
namespace=self.k8s_namespace,
body=nodeport,
)
else:
raise
def up(self, detach, skip_cluster_management, services, image_overrides=None):
# Merge spec-level image overrides with CLI overrides
spec_overrides = self.cluster_info.spec.get("image-overrides", {})
if spec_overrides:
if image_overrides:
spec_overrides.update(image_overrides) # CLI wins
image_overrides = spec_overrides
self.image_overrides = image_overrides
self.skip_cluster_management = skip_cluster_management
if not opts.o.dry_run:
self._setup_cluster()
else:
print("Dry run mode enabled, skipping k8s API connect")
# Create registry secret if configured
from stack_orchestrator.deploy.deployment_create import create_registry_secret
create_registry_secret(
self.cluster_info.spec, self.cluster_info.app_name, self.k8s_namespace
)
self._create_volume_data()
self._create_external_services()
self._create_ca_certificates()
self._create_deployment()
self._create_jobs()
self._create_ingress()
self._create_nodeports()
# Call start() hooks — stacks can create additional k8s resources
if self.deployment_context:
from stack_orchestrator.deploy.deployment_create import (
call_stack_deploy_start,
)
call_stack_deploy_start(self.deployment_context)
def down(self, timeout, volumes, skip_cluster_management):
@ -565,9 +862,7 @@ class K8sDeployer(Deployer):
# PersistentVolumes are cluster-scoped (not namespaced), so delete by label
if volumes:
try:
pvs = self.core_api.list_persistent_volume(label_selector=app_label)
for pv in pvs.items:
if opts.o.debug:
print(f"Deleting PV: {pv.metadata.name}")
@ -579,14 +874,14 @@ class K8sDeployer(Deployer):
if opts.o.debug:
print(f"Error listing PVs: {e}")
# Delete the namespace to ensure clean slate.
# Resources created by older laconic-so versions lack labels, so
# label-based deletion can't find them. Namespace deletion is the
# only reliable cleanup.
self._delete_namespace()
# Wait for namespace to finish terminating before returning,
# so that up() can recreate it immediately.
self._wait_for_namespace_gone()
if self.is_kind() and not self.skip_cluster_management:
# Destroy the kind cluster
@ -711,14 +1006,18 @@ class K8sDeployer(Deployer):
def logs(self, services, tail, follow, stream):
self.connect_api()
pods = pods_in_deployment(
self.core_api, self.cluster_info.app_name, namespace=self.k8s_namespace
)
if len(pods) > 1:
print("Warning: more than one pod in the deployment")
if len(pods) == 0:
log_data = "******* Pods not running ********\n"
else:
k8s_pod_name = pods[0]
containers = containers_in_pod(
self.core_api, k8s_pod_name, namespace=self.k8s_namespace
)
# If pod not started, logs request below will throw an exception
try:
log_data = ""
@ -741,48 +1040,49 @@ class K8sDeployer(Deployer):
print("No pods defined, skipping update")
return
self.connect_api()
ref_deployments = self.cluster_info.get_deployments()
for ref_deployment in ref_deployments:
if not ref_deployment or not ref_deployment.metadata:
continue
ref_name = ref_deployment.metadata.name
if not ref_name:
continue
deployment = cast(
client.V1Deployment,
self.apps_api.read_namespaced_deployment(
name=ref_name, namespace=self.k8s_namespace
),
)
if not deployment.spec or not deployment.spec.template:
continue
template_spec = deployment.spec.template.spec
if not template_spec or not template_spec.containers:
continue
ref_spec = ref_deployment.spec
if ref_spec and ref_spec.template and ref_spec.template.spec:
ref_containers = ref_spec.template.spec.containers
if ref_containers:
new_env = ref_containers[0].env
for container in template_spec.containers:
old_env = container.env
if old_env != new_env:
container.env = new_env
template_meta = deployment.spec.template.metadata
if template_meta:
template_meta.annotations = {
"kubectl.kubernetes.io/restartedAt": datetime.utcnow()
.replace(tzinfo=timezone.utc)
.isoformat()
}
self.apps_api.patch_namespaced_deployment(
name=ref_name,
namespace=self.k8s_namespace,
body=deployment,
)
def run(
self,
@ -817,9 +1117,7 @@ class K8sDeployer(Deployer):
else:
# Non-Helm path: create job from ClusterInfo
self.connect_api()
jobs = self.cluster_info.get_jobs(image_pull_policy="Always")
# Find the matching job by name
target_name = f"{self.cluster_info.app_name}-job-{job_name}"
matched_job = None


@ -393,7 +393,9 @@ def load_images_into_kind(kind_cluster_name: str, image_set: Set[str]):
raise DeployerException(f"kind load docker-image failed: {result}")
def pods_in_deployment(
core_api: client.CoreV1Api, deployment_name: str, namespace: str = "default"
):
pods = []
pod_response = core_api.list_namespaced_pod(
namespace=namespace, label_selector=f"app={deployment_name}"
@ -406,7 +408,9 @@ def pods_in_deployment(core_api: client.CoreV1Api, deployment_name: str, namespa
return pods
def containers_in_pod(
core_api: client.CoreV1Api, pod_name: str, namespace: str = "default"
) -> List[str]:
containers: List[str] = []
pod_response = cast(
client.V1Pod, core_api.read_namespaced_pod(pod_name, namespace=namespace)
@ -440,7 +444,20 @@ def named_volumes_from_pod_files(parsed_pod_files):
return named_volumes
def get_kind_pv_bind_mount_path(
volume_name: str,
kind_mount_root: Optional[str] = None,
host_path: Optional[str] = None,
):
"""Get the path inside the Kind node for a PV.
When kind-mount-root is set and the volume's host path is under
that root, return /mnt/{relative_path} so it resolves through the
single root extraMount. Otherwise fall back to /mnt/{volume_name}.
"""
if kind_mount_root and host_path and host_path.startswith(kind_mount_root):
rel = os.path.relpath(host_path, kind_mount_root)
return f"/mnt/{rel}"
return f"/mnt/{volume_name}"
@ -563,6 +580,7 @@ def _generate_kind_mounts(parsed_pod_files, deployment_dir, deployment_context):
volume_definitions = []
volume_host_path_map = _get_host_paths_for_volumes(deployment_context)
seen_host_path_mounts = set() # Track to avoid duplicate mounts
kind_mount_root = deployment_context.spec.get_kind_mount_root()
# Cluster state backup for offline data recovery (unique per deployment)
# etcd contains all k8s state; PKI certs needed to decrypt etcd offline
@ -583,6 +601,16 @@ def _generate_kind_mounts(parsed_pod_files, deployment_dir, deployment_context):
f" - hostPath: {pki_host_path}\n" f" containerPath: /etc/kubernetes/pki\n"
)
# When kind-mount-root is set, emit a single extraMount for the root.
# Individual volumes whose host path starts with the root are covered
# by this single mount and don't need their own extraMount entries.
mount_root_emitted = False
if kind_mount_root:
volume_definitions.append(
f" - hostPath: {kind_mount_root}\n" f" containerPath: /mnt\n"
)
mount_root_emitted = True
# Note these paths are relative to the location of the pod files (at present)
# So we need to fix up to make them correct and absolute because kind assumes
# relative to the cwd.
@ -642,6 +670,12 @@ def _generate_kind_mounts(parsed_pod_files, deployment_dir, deployment_context):
volume_host_path_map[volume_name],
deployment_dir,
)
# Skip individual extraMount if covered
# by the kind-mount-root single mount
if mount_root_emitted and str(host_path).startswith(
kind_mount_root
):
continue
container_path = get_kind_pv_bind_mount_path(
volume_name
)
@ -978,7 +1012,7 @@ def translate_sidecar_service_names(
def envs_from_environment_variables_map(
map: Mapping[str, str],
) -> List[client.V1EnvVar]:
result = []
for env_var, env_val in map.items():


@ -98,16 +98,17 @@ class Spec:
def get_image_registry(self):
return self.obj.get(constants.image_registry_key)
def get_credentials_files(self) -> typing.List[str]:
"""Returns list of credential file paths to append to config.env."""
return self.obj.get("credentials-files", [])
def get_image_registry_config(self) -> typing.Optional[typing.Dict]:
"""Returns registry auth config: {server, username, token-env}.
Used for private container registries like GHCR. The token-env field
specifies an environment variable containing the API token/PAT.
Note: Uses the 'image-pull-secret' key to avoid collision with the
'image-registry' key, which is for pushing images.
"""
return self.obj.get("image-pull-secret")
def get_volumes(self):
return self.obj.get(constants.volumes_key, {})
@ -170,15 +171,13 @@ class Spec:
Returns the per-volume Resources if found, otherwise None.
The caller should fall back to get_volume_resources() then the default.
"""
vol_section = self.obj.get(constants.resources_key, {}).get(
constants.volumes_key, {}
)
if volume_name not in vol_section:
return None
entry = vol_section[volume_name]
if isinstance(entry, dict) and ("reservations" in entry or "limits" in entry):
return Resources(entry)
return None
@ -265,5 +264,46 @@ class Spec:
def is_kind_deployment(self):
return self.get_deployment_type() in [constants.k8s_kind_deploy_type]
def get_kind_mount_root(self) -> typing.Optional[str]:
"""Return kind-mount-root path or None.
When set, laconic-so emits a single Kind extraMount mapping this
host path to /mnt inside the Kind node. Volumes with host paths
under this root resolve to /mnt/{relative_path} and don't need
individual extraMounts. This allows adding new volumes without
recreating the Kind cluster.
"""
return self.obj.get(constants.kind_mount_root_key)
def get_maintenance_service(self) -> typing.Optional[str]:
"""Return maintenance-service value (e.g. 'dumpster-maintenance:8000') or None.
When set, the restart command swaps Ingress backends to this service
during the main pod Recreate, so users see a branded maintenance page
instead of a bare 502.
"""
return self.obj.get("maintenance-service")
def get_external_services(self) -> typing.Dict[str, typing.Dict]:
"""Return external-services config from spec.
Each entry maps a service name to its routing config:
- host mode: {host: "example.com", port: 443}
ExternalName k8s Service (DNS CNAME)
- selector mode: {selector: {app: "foo"}, namespace: "ns", port: 443}
Headless Service + Endpoints (cross-namespace routing to mock pod)
"""
return self.obj.get(constants.external_services_key, {})
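The two modes side by side in spec.yml form (service, host, and namespace names here are hypothetical):

```yaml
external-services:
  s3:
    host: s3.example.com   # ExternalName Service (production)
    port: 443
  vendor-api:
    selector:              # headless Service + Endpoints (testing)
      app: mock-vendor
    namespace: mock-ns
    port: 443
```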
def get_ca_certificates(self) -> typing.List[str]:
"""Return list of CA certificate file paths to trust.
Used in testing specs to inject mkcert root CAs so containers
trust TLS certs on mock services. Files are mounted into all
containers at /etc/ssl/certs/ and NODE_EXTRA_CA_CERTS is set.
Production specs omit this key entirely.
"""
return self.obj.get(constants.ca_certificates_key, [])
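A testing-spec fragment for this key (the file path is hypothetical; `~` is expanded by get_configmaps()/expanduser):

```yaml
ca-certificates:
  - ~/.local/share/mkcert/rootCA.pem
```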
def is_docker_deployment(self):
return self.get_deployment_type() in [constants.compose_deploy_type]


@ -141,28 +141,35 @@ echo "$test_config_file_changed_content" > "$test_config_file"
test_unchanged_config="$test_deployment_dir/config/test/script.sh"
# Modify spec file to simulate an update
sed -i.bak 's/CERC_TEST_PARAM_3: FAST/CERC_TEST_PARAM_3: FASTER/' $test_deployment_spec
# Save config.env before update (to verify it gets backed up)
config_env_file="$test_deployment_dir/config.env"
config_env_persistent_content="PERSISTENT_VALUE=should-not-be-overwritten-$(date +%s)"
echo "$config_env_persistent_content" >> "$config_env_file"
original_config_env_content=$(<$config_env_file)
# Run sync to update deployment files without destroying data
$TEST_TARGET_SO --stack test deploy create --spec-file $test_deployment_spec --deployment-dir $test_deployment_dir --update
# Verify config.env was regenerated from spec (reflects the FASTER change)
synced_config_env_content=$(<$config_env_file)
if [ "$synced_config_env_content" == "$original_config_env_content" ]; then
echo "deployment update test: config.env preserved - passed"
if [[ "$synced_config_env_content" == *"CERC_TEST_PARAM_3=FASTER"* ]]; then
echo "deployment update test: config.env regenerated from spec - passed"
else
echo "deployment update test: config.env was overwritten - FAILED"
echo "Expected: $original_config_env_content"
echo "deployment update test: config.env not regenerated - FAILED"
echo "Expected CERC_TEST_PARAM_3=FASTER in config.env"
echo "Got: $synced_config_env_content"
exit 1
fi
# Verify old config.env was backed up
config_env_backup="${config_env_file}.bak"
if [ -f "$config_env_backup" ]; then
echo "deployment update test: config.env backed up - passed"
else
echo "deployment update test: config.env backup not created - FAILED"
exit 1
fi
# Verify the spec file was updated in deployment dir
updated_deployed_spec=$(<$test_deployment_dir/spec.yml)
if [[ "$updated_deployed_spec" == *"FASTER"* ]]; then

uv.lock generated (2108 lines; diff suppressed because it is too large)