Merge pull request 'feat(k8s): ACME email fix, etcd persistence, volume paths' (#986) from fix-caddy-acme-email-rbac into main

Reviewed-on: cerc-io/stack-orchestrator#986
This commit is contained in:
AFDudley 2026-02-03 22:31:47 +00:00
commit 21d47908cc
12 changed files with 1214 additions and 44 deletions

View File

@ -8,6 +8,7 @@ NEVER assume your hypotheses are true without evidence
ALWAYS clearly state when something is a hypothesis
ALWAYS use evidence from the systems you're interacting with to support your claims and hypotheses
ALWAYS run `pre-commit run --all-files` before committing changes
## Key Principles
@ -43,6 +44,76 @@ This project follows principles inspired by literate programming, where developm
This approach treats the human-AI collaboration as a form of **conversational literate programming** where understanding emerges through dialogue before code implementation.
## External Stacks Preferred
When creating new stacks for any reason, **use the external stack pattern** rather than adding stacks directly to this repository.
External stacks follow this structure:
```
my-stack/
└── stack-orchestrator/
├── stacks/
│ └── my-stack/
│ ├── stack.yml
│ └── README.md
├── compose/
│ └── docker-compose-my-stack.yml
└── config/
└── my-stack/
└── (config files)
```
### Usage
```bash
# Fetch external stack
laconic-so fetch-stack github.com/org/my-stack
# Use external stack
STACK_PATH=~/cerc/my-stack/stack-orchestrator/stacks/my-stack
laconic-so --stack $STACK_PATH deploy init --output spec.yml
laconic-so --stack $STACK_PATH deploy create --spec-file spec.yml --deployment-dir deployment
laconic-so deployment --dir deployment start
```
### Examples
- `zenith-karma-stack` - Karma watcher deployment
- `urbit-stack` - Fake Urbit ship for testing
- `zenith-desk-stack` - Desk deployment stack
## Architecture: k8s-kind Deployments
### One Cluster Per Host
One Kind cluster per host by design. Never request or expect separate clusters.
- `create_cluster()` in `helpers.py` reuses any existing cluster
- `cluster-id` in deployment.yml is an identifier, not a cluster request
- All deployments share: ingress controller, etcd, certificates
### Stack Resolution
- External stacks detected via `Path(stack).exists()` in `util.py`
- Config/compose resolution: external path first, then internal fallback (see the sketch after this list)
- External path structure: `stack_orchestrator/data/stacks/<name>/stack.yml`
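A minimal sketch of that resolution order (illustrative only; the helper name and `INTERNAL_STACKS_DIR` constant below are not the actual names in `util.py`):
```python
from pathlib import Path

# Illustrative sketch of external-first stack resolution.
INTERNAL_STACKS_DIR = Path(__file__).parent / "data" / "stacks"

def resolve_stack_dir(stack: str) -> Path:
    external = Path(stack)
    if external.exists():
        # External stack: --stack was given as a filesystem path
        return external
    # Internal stack: fall back to the stacks bundled with stack-orchestrator
    return INTERNAL_STACKS_DIR / stack
```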
### Secret Generation Implementation
- `GENERATE_TOKEN_PATTERN` in `deployment_create.py` matches `$generate:type:length$` (see the sketch after this list)
- `_generate_and_store_secrets()` creates K8s Secret
- `cluster_info.py` adds `envFrom` with `secretRef` to containers
- Non-secret config written to `config.env`
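A minimal sketch of the token handling (the regex mirrors the one added in `deployment_create.py`; the wrapper function here is illustrative):
```python
import base64
import os
import re
from secrets import token_hex
from typing import Optional

GENERATE_TOKEN_PATTERN = re.compile(r"\$generate:(\w+):(\d+)\$")

def generate_secret(value: str) -> Optional[str]:
    """Return a generated secret if value is a $generate:type:length$ token."""
    match = GENERATE_TOKEN_PATTERN.search(value)
    if not match:
        return None  # Plain value: written to config.env, not a K8s Secret
    secret_type, length = match.group(1), int(match.group(2))
    if secret_type == "base64":
        return base64.b64encode(os.urandom(length)).decode()
    return token_hex(length)  # "hex" (and unknown types) fall back to hex

print(generate_secret("$generate:hex:32$"))  # e.g. a 64-character hex string
```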
### Repository Cloning
`setup-repositories --git-ssh` clones repos defined in stack.yml's `repos:` field. Requires SSH agent.
### Key Files (for codebase navigation)
- `repos/setup_repositories.py`: `setup-repositories` command (git clone)
- `deployment_create.py`: `deploy create` command, secret generation
- `deployment.py`: `deployment start/stop/restart` commands
- `deploy_k8s.py`: K8s deployer, cluster management calls
- `helpers.py`: `create_cluster()`, etcd cleanup, kind operations
- `cluster_info.py`: K8s resource generation (Deployment, Service, Ingress)
## Insights and Observations
### Design Principles

View File

@ -71,6 +71,59 @@ The various [stacks](/stack_orchestrator/data/stacks) each contain instructions
- [laconicd with console and CLI](stack_orchestrator/data/stacks/fixturenet-laconic-loaded)
- [kubo (IPFS)](stack_orchestrator/data/stacks/kubo)
## Deployment Types
- **compose**: Docker Compose on local machine
- **k8s**: External Kubernetes cluster (requires kubeconfig)
- **k8s-kind**: Local Kubernetes via Kind - one cluster per host, shared by all deployments
## External Stacks
Stacks can live in external git repositories. Required structure:
```
<repo>/
stack_orchestrator/data/
stacks/<stack-name>/stack.yml
compose/docker-compose-<pod-name>.yml
deployment/spec.yml
```
## Deployment Commands
```bash
# Create deployment from spec
laconic-so --stack <path> deploy create --spec-file <spec.yml> --deployment-dir <dir>
# Start (creates cluster on first run)
laconic-so deployment --dir <dir> start
# GitOps restart (git pull + redeploy, preserves data)
laconic-so deployment --dir <dir> restart
# Stop
laconic-so deployment --dir <dir> stop
```
## spec.yml Reference
```yaml
stack: stack-name-or-path
deploy-to: k8s-kind
network:
http-proxy:
- host-name: app.example.com
routes:
- path: /
proxy-to: service-name:port
acme-email: admin@example.com
config:
ENV_VAR: value
SECRET_VAR: $generate:hex:32$ # Auto-generated, stored in K8s Secret
volumes:
volume-name:
```
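For reference, a minimal sketch of how these fields are read back (plain PyYAML here rather than stack-orchestrator's own `Spec` helpers; field names match the example above):
```python
import yaml

with open("spec.yml") as f:
    spec = yaml.safe_load(f)

network = spec.get("network", {})
http_proxy = network.get("http-proxy", [])
hostname = http_proxy[0]["host-name"] if http_proxy else None
acme_email = network.get("acme-email", "")
config_vars = spec.get("config", {}) or {}

print(hostname, acme_email, list(config_vars))
```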
## Contributing
See the [CONTRIBUTING.md](/docs/CONTRIBUTING.md) for developer mode install.

docs/deployment_patterns.md Normal file
View File

@ -0,0 +1,202 @@
# Deployment Patterns
## GitOps Pattern
For production deployments, we recommend a GitOps approach where your deployment configuration is tracked in version control.
### Overview
- **spec.yml is your source of truth**: Maintain it in your operator repository
- **Don't regenerate on every restart**: Run `deploy init` once, then customize and commit
- **Use restart for updates**: The restart command respects your git-tracked spec.yml
### Workflow
1. **Initial setup**: Run `deploy init` once to generate a spec.yml template
2. **Customize and commit**: Edit spec.yml with your configuration (hostnames, resources, etc.) and commit to your operator repo
3. **Deploy from git**: Use the committed spec.yml for deployments
4. **Update via git**: Make changes in git, then restart to apply
```bash
# Initial setup (run once)
laconic-so --stack my-stack deploy init --output spec.yml
# Customize for your environment
vim spec.yml # Set hostname, resources, etc.
# Commit to your operator repository
git add spec.yml
git commit -m "Add my-stack deployment configuration"
git push
# On deployment server: deploy from git-tracked spec
laconic-so deploy create \
--spec-file /path/to/operator-repo/spec.yml \
--deployment-dir my-deployment
laconic-so deployment --dir my-deployment start
```
### Updating Deployments
When you need to update a deployment:
```bash
# 1. Make changes in your operator repo
vim /path/to/operator-repo/spec.yml
git commit -am "Update configuration"
git push
# 2. On deployment server: pull and restart
cd /path/to/operator-repo && git pull
laconic-so deployment --dir my-deployment restart
```
The `restart` command:
- Pulls latest code from the stack repository
- Uses your git-tracked spec.yml (does NOT regenerate from defaults)
- Syncs the deployment directory
- Restarts services
### Anti-patterns
**Don't do this:**
```bash
# BAD: Regenerating spec on every deployment
laconic-so --stack my-stack deploy init --output spec.yml
laconic-so deploy create --spec-file spec.yml ...
```
This overwrites your customizations with defaults from the stack's `commands.py`.
**Do this instead:**
```bash
# GOOD: Use your git-tracked spec
git pull # Get latest spec.yml from your operator repo
laconic-so deployment --dir my-deployment restart
```
## Private Registry Authentication
For deployments using images from private container registries (e.g., GitHub Container Registry), configure authentication in your spec.yml:
### Configuration
Add a `registry-credentials` section to your spec.yml:
```yaml
registry-credentials:
server: ghcr.io
username: your-org-or-username
token-env: REGISTRY_TOKEN
```
**Fields:**
- `server`: The registry hostname (e.g., `ghcr.io`, `docker.io`, `gcr.io`)
- `username`: Registry username (for GHCR, use your GitHub username or org name)
- `token-env`: Name of the environment variable containing your API token/PAT
### Token Environment Variable
The `token-env` pattern keeps credentials out of version control. Set the environment variable when running `deployment start`:
```bash
export REGISTRY_TOKEN="your-personal-access-token"
laconic-so deployment --dir my-deployment start
```
For GHCR, create a Personal Access Token (PAT) with `read:packages` scope.
### Ansible Integration
When using Ansible for deployments, pass the token from a credentials file:
```yaml
- name: Start deployment
ansible.builtin.command:
cmd: laconic-so deployment --dir {{ deployment_dir }} start
environment:
REGISTRY_TOKEN: "{{ lookup('file', '~/.credentials/ghcr_token') }}"
```
### How It Works
1. laconic-so reads the `registry-credentials` config from spec.yml
2. Creates a Kubernetes `docker-registry` secret named `{deployment}-registry`
3. The deployment's pods reference this secret for image pulls (see the sketch below)
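A minimal sketch of the secret payload (the standard `kubernetes.io/dockerconfigjson` format; the server, username, and token values here are placeholders):
```python
import base64
import json
import os

server = "ghcr.io"
username = "your-org-or-username"
token = os.environ["REGISTRY_TOKEN"]  # named by token-env in spec.yml

auth = base64.b64encode(f"{username}:{token}".encode()).decode()
docker_config = {
    "auths": {server: {"username": username, "password": token, "auth": auth}}
}
# This JSON becomes the ".dockerconfigjson" key of a Secret of type
# kubernetes.io/dockerconfigjson, named "<deployment>-registry".
print(json.dumps(docker_config, indent=2))
```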
## Cluster and Volume Management
### Stopping Deployments
The `deployment stop` command has two important flags:
```bash
# Default: stops deployment, deletes cluster, PRESERVES volumes
laconic-so deployment --dir my-deployment stop
# Explicitly delete volumes (USE WITH CAUTION)
laconic-so deployment --dir my-deployment stop --delete-volumes
```
### Volume Persistence
Volumes persist across cluster deletion by design. This is important because:
- **Data survives cluster recreation**: Ledger data, databases, and other state are preserved
- **Faster recovery**: No need to re-sync or rebuild data after cluster issues
- **Safe cluster upgrades**: Delete and recreate cluster without data loss
**Only use `--delete-volumes` when:**
- You explicitly want to start fresh with no data
- The user specifically requests volume deletion
- You're cleaning up a test/dev environment completely
### Shared Cluster Architecture
In kind deployments, multiple stacks share a single cluster:
- First `deployment start` creates the cluster
- Subsequent deployments reuse the existing cluster
- `deployment stop` on ANY deployment deletes the shared cluster
- Other deployments will fail until the cluster is recreated
To stop a single deployment without affecting the cluster:
```bash
laconic-so deployment --dir my-deployment stop --skip-cluster-management
```
## Volume Persistence in k8s-kind
k8s-kind has 3 storage layers:
- **Docker Host**: The physical server running Docker
- **Kind Node**: A Docker container simulating a k8s node
- **Pod Container**: Your workload
For k8s-kind, volumes with paths are mounted from Docker Host → Kind Node → Pod via extraMounts.
| spec.yml volume | Storage Location | Survives Pod Restart | Survives Cluster Restart |
|-----------------|------------------|---------------------|-------------------------|
| `vol:` (empty) | Kind Node PVC | ✅ | ❌ |
| `vol: ./data/x` | Docker Host | ✅ | ✅ |
| `vol: /abs/path`| Docker Host | ✅ | ✅ |
**Recommendation**: Always use paths for data you want to keep. Relative paths
(e.g., `./data/rpc-config`) resolve to `$DEPLOYMENT_DIR/data/rpc-config` on the
Docker Host.
### Example
```yaml
# In spec.yml
volumes:
rpc-config: ./data/rpc-config # Persists to $DEPLOYMENT_DIR/data/rpc-config
chain-data: ./data/chain # Persists to $DEPLOYMENT_DIR/data/chain
temp-cache: # Empty = Kind Node PVC (lost on cluster delete)
```
### The Antipattern
Empty-path volumes appear persistent because they survive pod restarts (data lives
in Kind Node container). However, this data is lost when the kind cluster is
recreated. This "false persistence" has caused data loss when operators assumed
their data was safe.
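To see where a volume's data actually lives, you can list the kind node container's bind mounts from the Docker Host (a rough sketch; kind names the node container `<cluster>-control-plane`):
```python
import json
import subprocess

# Find kind node containers (named <cluster>-control-plane)
nodes = subprocess.run(
    ["docker", "ps", "--filter", "name=control-plane", "--format", "{{.Names}}"],
    capture_output=True, text=True, check=True,
).stdout.split()

for node in nodes:
    mounts = json.loads(
        subprocess.run(
            ["docker", "inspect", "-f", "{{json .Mounts}}", node],
            capture_output=True, text=True, check=True,
        ).stdout
    )
    for m in mounts:
        # Bind mounts whose Source is on the Docker Host survive cluster
        # recreation; data that exists only inside the node container does not.
        print(f"{m.get('Type')}: {m.get('Source')} -> {m.get('Destination')}")
```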

View File

@ -44,3 +44,4 @@ unlimited_memlock_key = "unlimited-memlock"
runtime_class_key = "runtime-class"
high_memlock_runtime = "high-memlock"
high_memlock_spec_filename = "high-memlock-spec.json"
acme_email_key = "acme-email"

View File

@ -93,6 +93,7 @@ rules:
- get
- create
- update
- delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding

View File

@ -15,7 +15,9 @@
import click
from pathlib import Path
import subprocess
import sys
import time
from stack_orchestrator import constants
from stack_orchestrator.deploy.images import push_images_operation
from stack_orchestrator.deploy.deploy import (
@ -228,3 +230,176 @@ def run_job(ctx, job_name, helm_release):
ctx.obj = make_deploy_context(ctx)
run_job_operation(ctx, job_name, helm_release)
@command.command()
@click.option("--stack-path", help="Path to stack git repo (overrides stored path)")
@click.option(
"--spec-file", help="Path to GitOps spec.yml in repo (e.g., deployment/spec.yml)"
)
@click.option("--config-file", help="Config file to pass to deploy init")
@click.option(
"--force",
is_flag=True,
default=False,
help="Skip DNS verification",
)
@click.option(
"--expected-ip",
help="Expected IP for DNS verification (if different from egress)",
)
@click.pass_context
def restart(ctx, stack_path, spec_file, config_file, force, expected_ip):
"""Pull latest code and restart deployment using git-tracked spec.
GitOps workflow:
1. Operator maintains spec.yml in their git repository
2. This command pulls latest code (including updated spec.yml)
3. If hostname changed, verifies DNS routes to this server
4. Syncs deployment directory with the git-tracked spec
5. Stops and restarts the deployment
Data volumes are always preserved. The cluster is never destroyed.
Stack source resolution (in order):
1. --stack-path argument (if provided)
2. stack-source field in deployment.yml (if stored)
3. Error if neither available
Note: spec.yml should be maintained in git, not regenerated from
commands.py on each restart. Use 'deploy init' only for initial
spec generation, then customize and commit to your operator repo.
"""
from stack_orchestrator.util import get_yaml, get_parsed_deployment_spec
from stack_orchestrator.deploy.deployment_create import create_operation
from stack_orchestrator.deploy.dns_probe import verify_dns_via_probe
deployment_context: DeploymentContext = ctx.obj
# Get current spec info (before git pull)
current_spec = deployment_context.spec
current_http_proxy = current_spec.get_http_proxy()
current_hostname = (
current_http_proxy[0]["host-name"] if current_http_proxy else None
)
# Resolve stack source path
if stack_path:
stack_source = Path(stack_path).resolve()
else:
# Try to get from deployment.yml
deployment_file = (
deployment_context.deployment_dir / constants.deployment_file_name
)
deployment_data = get_yaml().load(open(deployment_file))
stack_source_str = deployment_data.get("stack-source")
if not stack_source_str:
print(
"Error: No stack-source in deployment.yml and --stack-path not provided"
)
print("Use --stack-path to specify the stack git repository location")
sys.exit(1)
stack_source = Path(stack_source_str)
if not stack_source.exists():
print(f"Error: Stack source path does not exist: {stack_source}")
sys.exit(1)
print("=== Deployment Restart ===")
print(f"Deployment dir: {deployment_context.deployment_dir}")
print(f"Stack source: {stack_source}")
print(f"Current hostname: {current_hostname}")
# Step 1: Git pull (brings in updated spec.yml from operator's repo)
print("\n[1/4] Pulling latest code from stack repository...")
git_result = subprocess.run(
["git", "pull"], cwd=stack_source, capture_output=True, text=True
)
if git_result.returncode != 0:
print(f"Git pull failed: {git_result.stderr}")
sys.exit(1)
print(f"Git pull: {git_result.stdout.strip()}")
# Determine spec file location
# Priority: --spec-file argument > repo's deployment/spec.yml > deployment dir
# Stack path is like: repo/stack_orchestrator/data/stacks/stack-name
# So repo root is 4 parents up
repo_root = stack_source.parent.parent.parent.parent
if spec_file:
# Spec file relative to repo root
spec_file_path = repo_root / spec_file
else:
# Try standard GitOps location in repo
gitops_spec = repo_root / "deployment" / "spec.yml"
if gitops_spec.exists():
spec_file_path = gitops_spec
else:
# Fall back to deployment directory
spec_file_path = deployment_context.deployment_dir / "spec.yml"
if not spec_file_path.exists():
print(f"Error: spec.yml not found at {spec_file_path}")
print("For GitOps, add spec.yml to your repo at deployment/spec.yml")
print("Or specify --spec-file with path relative to repo root")
sys.exit(1)
print(f"Using spec: {spec_file_path}")
# Parse spec to check for hostname changes
new_spec_obj = get_parsed_deployment_spec(str(spec_file_path))
new_http_proxy = new_spec_obj.get("network", {}).get("http-proxy", [])
new_hostname = new_http_proxy[0]["host-name"] if new_http_proxy else None
print(f"Spec hostname: {new_hostname}")
# Step 2: DNS verification (only if hostname changed)
if new_hostname and new_hostname != current_hostname:
print(f"\n[2/4] Hostname changed: {current_hostname} -> {new_hostname}")
if force:
print("DNS verification skipped (--force)")
else:
print("Verifying DNS via probe...")
if not verify_dns_via_probe(new_hostname):
print(f"\nDNS verification failed for {new_hostname}")
print("Ensure DNS is configured before restarting.")
print("Use --force to skip this check.")
sys.exit(1)
else:
print("\n[2/4] Hostname unchanged, skipping DNS verification")
# Step 3: Sync deployment directory with spec
print("\n[3/4] Syncing deployment directory...")
deploy_ctx = make_deploy_context(ctx)
create_operation(
deployment_command_context=deploy_ctx,
spec_file=str(spec_file_path),
deployment_dir=str(deployment_context.deployment_dir),
update=True,
network_dir=None,
initial_peers=None,
)
# Reload deployment context with updated spec
deployment_context.init(deployment_context.deployment_dir)
ctx.obj = deployment_context
# Stop deployment
print("\n[4/4] Restarting deployment...")
ctx.obj = make_deploy_context(ctx)
down_operation(
ctx, delete_volumes=False, extra_args_list=[], skip_cluster_management=True
)
# Brief pause to ensure clean shutdown
time.sleep(5)
# Start deployment
up_operation(
ctx, services_list=None, stay_attached=False, skip_cluster_management=True
)
print("\n=== Restart Complete ===")
print("Deployment restarted with git-tracked configuration.")
if new_hostname and new_hostname != current_hostname:
print(f"\nNew hostname: {new_hostname}")
print("Caddy will automatically provision TLS certificate.")

View File

@ -15,9 +15,12 @@
import click
from importlib import util
import json
import os
import re
import base64
from pathlib import Path
from typing import List, Optional
import random
from shutil import copy, copyfile, copytree, rmtree
from secrets import token_hex
@ -484,14 +487,179 @@ def init_operation(
get_yaml().dump(spec_file_content, output_file)
# Token pattern: $generate:hex:32$ or $generate:base64:16$
GENERATE_TOKEN_PATTERN = re.compile(r"\$generate:(\w+):(\d+)\$")
def _generate_and_store_secrets(config_vars: dict, deployment_name: str):
"""Generate secrets for $generate:...$ tokens and store in K8s Secret.
Called by `deploy create` - generates fresh secrets and stores them.
Returns the generated secrets dict for reference.
"""
from kubernetes import client, config as k8s_config
secrets = {}
for name, value in config_vars.items():
if not isinstance(value, str):
continue
match = GENERATE_TOKEN_PATTERN.search(value)
if not match:
continue
secret_type, length = match.group(1), int(match.group(2))
if secret_type == "hex":
secrets[name] = token_hex(length)
elif secret_type == "base64":
secrets[name] = base64.b64encode(os.urandom(length)).decode()
else:
secrets[name] = token_hex(length)
if not secrets:
return secrets
# Store in K8s Secret
try:
k8s_config.load_kube_config()
except Exception:
# Fall back to in-cluster config if available
try:
k8s_config.load_incluster_config()
except Exception:
print(
"Warning: Could not load kube config, secrets will not be stored in K8s"
)
return secrets
v1 = client.CoreV1Api()
secret_name = f"{deployment_name}-generated-secrets"
namespace = "default"
secret_data = {k: base64.b64encode(v.encode()).decode() for k, v in secrets.items()}
k8s_secret = client.V1Secret(
metadata=client.V1ObjectMeta(name=secret_name), data=secret_data, type="Opaque"
)
try:
v1.create_namespaced_secret(namespace, k8s_secret)
num_secrets = len(secrets)
print(f"Created K8s Secret '{secret_name}' with {num_secrets} secret(s)")
except client.exceptions.ApiException as e:
if e.status == 409: # Already exists
v1.replace_namespaced_secret(secret_name, namespace, k8s_secret)
num_secrets = len(secrets)
print(f"Updated K8s Secret '{secret_name}' with {num_secrets} secret(s)")
else:
raise
return secrets
def create_registry_secret(spec: Spec, deployment_name: str) -> Optional[str]:
"""Create K8s docker-registry secret from spec + environment.
Reads registry configuration from spec.yml and creates a Kubernetes
secret of type kubernetes.io/dockerconfigjson for image pulls.
Args:
spec: The deployment spec containing image-registry config
deployment_name: Name of the deployment (used for secret naming)
Returns:
The secret name if created, None if no registry config
"""
from kubernetes import client, config as k8s_config
registry_config = spec.get_image_registry_config()
if not registry_config:
return None
server = registry_config.get("server")
username = registry_config.get("username")
token_env = registry_config.get("token-env")
if not all([server, username, token_env]):
return None
# Type narrowing for pyright - we've validated these aren't None above
assert token_env is not None
token = os.environ.get(token_env)
if not token:
print(
f"Warning: Registry token env var '{token_env}' not set, "
"skipping registry secret"
)
return None
# Create dockerconfigjson format (Docker API uses "password" field for tokens)
auth = base64.b64encode(f"{username}:{token}".encode()).decode()
docker_config = {
"auths": {server: {"username": username, "password": token, "auth": auth}}
}
# Secret name derived from deployment name
secret_name = f"{deployment_name}-registry"
# Load kube config
try:
k8s_config.load_kube_config()
except Exception:
try:
k8s_config.load_incluster_config()
except Exception:
print("Warning: Could not load kube config, registry secret not created")
return None
v1 = client.CoreV1Api()
namespace = "default"
k8s_secret = client.V1Secret(
metadata=client.V1ObjectMeta(name=secret_name),
data={
".dockerconfigjson": base64.b64encode(
json.dumps(docker_config).encode()
).decode()
},
type="kubernetes.io/dockerconfigjson",
)
try:
v1.create_namespaced_secret(namespace, k8s_secret)
print(f"Created registry secret '{secret_name}' for {server}")
except client.exceptions.ApiException as e:
if e.status == 409: # Already exists
v1.replace_namespaced_secret(secret_name, namespace, k8s_secret)
print(f"Updated registry secret '{secret_name}' for {server}")
else:
raise
return secret_name
def _write_config_file(
spec_file: Path, config_env_file: Path, deployment_name: Optional[str] = None
):
spec_content = get_parsed_deployment_spec(spec_file)
config_vars = spec_content.get("config", {}) or {}
# Generate and store secrets in K8s if deployment_name provided and tokens exist
if deployment_name and config_vars:
has_generate_tokens = any(
isinstance(v, str) and GENERATE_TOKEN_PATTERN.search(v)
for v in config_vars.values()
)
if has_generate_tokens:
_generate_and_store_secrets(config_vars, deployment_name)
# Write non-secret config to config.env (exclude $generate:...$ tokens)
with open(config_env_file, "w") as output_file:
if config_vars:
for variable_name, variable_value in config_vars.items():
# Skip variables with generate tokens - they go to K8s Secret
if isinstance(variable_value, str) and GENERATE_TOKEN_PATTERN.search(
variable_value
):
continue
output_file.write(f"{variable_name}={variable_value}\n")
@ -507,11 +675,14 @@ def _copy_files_to_directory(file_paths: List[Path], directory: Path):
copy(path, os.path.join(directory, os.path.basename(path)))
def _create_deployment_file(deployment_dir: Path, stack_source: Optional[Path] = None):
deployment_file_path = deployment_dir.joinpath(constants.deployment_file_name)
cluster = f"{constants.cluster_name_prefix}{token_hex(8)}"
deployment_content = {constants.cluster_id_key: cluster}
if stack_source:
deployment_content["stack-source"] = str(stack_source)
with open(deployment_file_path, "w") as output_file:
get_yaml().dump(deployment_content, output_file)
def _check_volume_definitions(spec):
@ -519,9 +690,13 @@ def _check_volume_definitions(spec):
for volume_name, volume_path in spec.get_volumes().items():
if volume_path:
if not os.path.isabs(volume_path):
# For k8s-kind: allow relative paths, they'll be resolved
# by _make_absolute_host_path() during kind config generation
if not spec.is_kind_deployment():
deploy_type = spec.get_deployment_type()
raise Exception(
f"Relative path {volume_path} for volume "
f"{volume_name} not supported for {deploy_type}"
)
@ -616,11 +791,15 @@ def create_operation(
generate_helm_chart(stack_name, spec_file, deployment_dir_path)
return # Exit early for helm chart generation
# Resolve stack source path for restart capability
stack_source = get_stack_path(stack_name)
if update:
# Sync mode: write to temp dir, then copy to deployment dir with backups
temp_dir = Path(tempfile.mkdtemp(prefix="deployment-sync-"))
try:
# Write deployment files to temp dir
# (skip deployment.yml to preserve cluster ID)
_write_deployment_files(
temp_dir,
Path(spec_file),
@ -628,12 +807,14 @@ def create_operation(
stack_name,
deployment_type,
include_deployment_file=False,
stack_source=stack_source,
)
# Copy from temp to deployment dir, excluding data volumes
# and backing up changed files.
# Exclude data/* to avoid touching user data volumes.
# Exclude config file to preserve deployment settings
# (XXX breaks passing config vars from spec)
exclude_patterns = ["data", "data/*", constants.config_file_name]
_safe_copy_tree(
temp_dir, deployment_dir_path, exclude_patterns=exclude_patterns
@ -650,6 +831,7 @@ def create_operation(
stack_name,
deployment_type,
include_deployment_file=True,
stack_source=stack_source,
)
# Delegate to the stack's Python code
@ -670,7 +852,7 @@ def create_operation(
)
def _safe_copy_tree(src: Path, dst: Path, exclude_patterns: Optional[List[str]] = None):
"""
Recursively copy a directory tree, backing up changed files with .bak suffix.
@ -721,6 +903,7 @@ def _write_deployment_files(
stack_name: str,
deployment_type: str,
include_deployment_file: bool = True,
stack_source: Optional[Path] = None,
):
"""
Write deployment files to target directory.
@ -730,7 +913,8 @@ def _write_deployment_files(
:param parsed_spec: Parsed spec object
:param stack_name: Name of stack
:param deployment_type: Type of deployment
:param include_deployment_file: Whether to create deployment.yml (skip for update)
:param stack_source: Path to stack source (git repo) for restart capability
"""
stack_file = get_stack_path(stack_name).joinpath(constants.stack_file_name)
parsed_stack = get_parsed_stack_config(stack_name)
@ -741,10 +925,15 @@ def _write_deployment_files(
# Create deployment file if requested
if include_deployment_file:
_create_deployment_file(target_dir, stack_source=stack_source)
# Copy any config variables from the spec file into an env file suitable for compose
# Use stack_name as deployment_name for K8s secret naming
# Extract just the name part if stack_name is a path ("path/to/stack" -> "stack")
deployment_name = Path(stack_name).name.replace("_", "-")
_write_config_file(
spec_file, target_dir.joinpath(constants.config_file_name), deployment_name
)
# Copy any k8s config file into the target dir
if deployment_type == "k8s":
@ -805,8 +994,9 @@ def _write_deployment_files(
)
else:
# TODO:
# This is odd - looks up config dir that matches a volume name,
# then copies as a mount dir?
# AFAICT not used by or relevant to any existing stack - roy
# TODO: We should probably only do this if the volume is marked :ro.
for volume_name, volume_path in parsed_spec.get_volumes().items():

View File

@ -0,0 +1,159 @@
# Copyright © 2024 Vulcanize
# SPDX-License-Identifier: AGPL-3.0
"""DNS verification via temporary ingress probe."""
import secrets
import socket
import time
from typing import Optional
import requests
from kubernetes import client
def get_server_egress_ip() -> str:
"""Get this server's public egress IP via ipify."""
response = requests.get("https://api.ipify.org", timeout=10)
response.raise_for_status()
return response.text.strip()
def resolve_hostname(hostname: str) -> list[str]:
"""Resolve hostname to list of IP addresses."""
try:
_, _, ips = socket.gethostbyname_ex(hostname)
return ips
except socket.gaierror:
return []
def verify_dns_simple(hostname: str, expected_ip: Optional[str] = None) -> bool:
"""Simple DNS verification - check hostname resolves to expected IP.
If expected_ip not provided, uses server's egress IP.
Returns True if hostname resolves to expected IP.
"""
resolved_ips = resolve_hostname(hostname)
if not resolved_ips:
print(f"DNS FAIL: {hostname} does not resolve")
return False
if expected_ip is None:
expected_ip = get_server_egress_ip()
if expected_ip in resolved_ips:
print(f"DNS OK: {hostname} -> {resolved_ips} (includes {expected_ip})")
return True
else:
print(f"DNS WARN: {hostname} -> {resolved_ips} (expected {expected_ip})")
return False
def create_probe_ingress(hostname: str, namespace: str = "default") -> str:
"""Create a temporary ingress for DNS probing.
Returns the probe token that the ingress will respond with.
"""
token = secrets.token_hex(16)
networking_api = client.NetworkingV1Api()
# Create a simple ingress that Caddy will pick up
ingress = client.V1Ingress(
metadata=client.V1ObjectMeta(
name="laconic-dns-probe",
annotations={
"kubernetes.io/ingress.class": "caddy",
"laconic.com/probe-token": token,
},
),
spec=client.V1IngressSpec(
rules=[
client.V1IngressRule(
host=hostname,
http=client.V1HTTPIngressRuleValue(
paths=[
client.V1HTTPIngressPath(
path="/.well-known/laconic-probe",
path_type="Exact",
backend=client.V1IngressBackend(
service=client.V1IngressServiceBackend(
name="caddy-ingress-controller",
port=client.V1ServiceBackendPort(number=80),
)
),
)
]
),
)
]
),
)
networking_api.create_namespaced_ingress(namespace=namespace, body=ingress)
return token
def delete_probe_ingress(namespace: str = "default"):
"""Delete the temporary probe ingress."""
networking_api = client.NetworkingV1Api()
try:
networking_api.delete_namespaced_ingress(
name="laconic-dns-probe", namespace=namespace
)
except client.exceptions.ApiException:
pass # Ignore if already deleted
def verify_dns_via_probe(
hostname: str, namespace: str = "default", timeout: int = 30, poll_interval: int = 2
) -> bool:
"""Verify DNS by creating temp ingress and probing it.
This definitively proves that traffic to the hostname reaches this cluster.
Args:
hostname: The hostname to verify
namespace: Kubernetes namespace for probe ingress
timeout: Total seconds to wait for probe to succeed
poll_interval: Seconds between probe attempts
Returns:
True if probe succeeds, False otherwise
"""
# First check DNS resolves at all
if not resolve_hostname(hostname):
print(f"DNS FAIL: {hostname} does not resolve")
return False
print(f"Creating probe ingress for {hostname}...")
create_probe_ingress(hostname, namespace)
try:
# Wait for Caddy to pick up the ingress
time.sleep(3)
# Poll until success or timeout
probe_url = f"http://{hostname}/.well-known/laconic-probe"
start_time = time.time()
last_error = None
while time.time() - start_time < timeout:
try:
response = requests.get(probe_url, timeout=5)
# For now, just verify we get a response from this cluster
# A more robust check would verify a unique token
if response.status_code < 500:
print(f"DNS PROBE OK: {hostname} routes to this cluster")
return True
except requests.RequestException as e:
last_error = e
time.sleep(poll_interval)
print(f"DNS PROBE FAIL: {hostname} - {last_error}")
return False
finally:
print("Cleaning up probe ingress...")
delete_probe_ingress(namespace)

View File

@ -352,6 +352,10 @@ class ClusterInfo:
continue
if not os.path.isabs(volume_path):
# For k8s-kind, allow relative paths:
# - PV uses /mnt/{volume_name} (path inside kind node)
# - extraMounts resolve the relative path to Docker Host
if not self.spec.is_kind_deployment():
print(
f"WARNING: {volume_name}:{volume_path} is not absolute, "
"cannot bind volume."
@ -453,6 +457,16 @@ class ClusterInfo:
if "command" in service_info: if "command" in service_info:
cmd = service_info["command"] cmd = service_info["command"]
container_args = cmd if isinstance(cmd, list) else cmd.split() container_args = cmd if isinstance(cmd, list) else cmd.split()
# Add env_from to pull secrets from K8s Secret
secret_name = f"{self.app_name}-generated-secrets"
env_from = [
client.V1EnvFromSource(
secret_ref=client.V1SecretEnvSource(
name=secret_name,
optional=True, # Don't fail if no secrets
)
)
]
container = client.V1Container(
name=container_name,
image=image_to_use,
@ -460,6 +474,7 @@ class ClusterInfo:
command=container_command,
args=container_args,
env=envs,
env_from=env_from,
ports=container_ports if container_ports else None,
volume_mounts=volume_mounts,
security_context=client.V1SecurityContext(
@ -476,7 +491,12 @@ class ClusterInfo:
volumes = volumes_for_pod_files(
self.parsed_pod_yaml_map, self.spec, self.app_name
)
registry_config = self.spec.get_image_registry_config()
if registry_config:
secret_name = f"{self.app_name}-registry"
image_pull_secrets = [client.V1LocalObjectReference(name=secret_name)]
else:
image_pull_secrets = []
annotations = None
labels = {"app": self.app_name}

View File

@ -29,6 +29,7 @@ from stack_orchestrator.deploy.k8s.helpers import (
from stack_orchestrator.deploy.k8s.helpers import (
install_ingress_for_kind,
wait_for_ingress_in_kind,
is_ingress_running,
)
from stack_orchestrator.deploy.k8s.helpers import (
pods_in_deployment,
@ -289,19 +290,35 @@ class K8sDeployer(Deployer):
self.skip_cluster_management = skip_cluster_management
if not opts.o.dry_run:
if self.is_kind() and not self.skip_cluster_management:
# Create the kind cluster (or reuse existing one)
kind_config = str(
self.deployment_dir.joinpath(constants.kind_config_filename)
)
actual_cluster = create_cluster(self.kind_cluster_name, kind_config)
if actual_cluster != self.kind_cluster_name:
# An existing cluster was found, use it instead
self.kind_cluster_name = actual_cluster
# Only load locally-built images into kind
# Registry images (docker.io, ghcr.io, etc.) will be pulled by k8s
local_containers = self.deployment_context.stack.obj.get(
"containers", []
)
if local_containers:
# Filter image_set to only images matching local containers
local_images = {
img
for img in self.cluster_info.image_set
if any(c in img for c in local_containers)
}
if local_images:
load_images_into_kind(self.kind_cluster_name, local_images)
# Note: if no local containers defined, all images come from registries
self.connect_api()
if self.is_kind() and not self.skip_cluster_management:
# Configure ingress controller (not installed by default in kind)
# Skip if already running (idempotent for shared cluster)
if not is_ingress_running():
install_ingress_for_kind(self.cluster_info.spec.get_acme_email())
# Wait for ingress to start
# (deployment provisioning will fail unless this is done)
wait_for_ingress_in_kind()
@ -315,6 +332,11 @@ class K8sDeployer(Deployer):
else:
print("Dry run mode enabled, skipping k8s API connect")
# Create registry secret if configured
from stack_orchestrator.deploy.deployment_create import create_registry_secret
create_registry_secret(self.cluster_info.spec, self.cluster_info.app_name)
self._create_volume_data()
self._create_deployment()

View File

@ -14,11 +14,13 @@
# along with this program. If not, see <http:#www.gnu.org/licenses/>.
from kubernetes import client, utils, watch
from kubernetes.client.exceptions import ApiException
import os
from pathlib import Path
import subprocess
import re
from typing import Set, Mapping, List, Optional, cast
import yaml
from stack_orchestrator.util import get_k8s_dir, error_exit
from stack_orchestrator.opts import opts
@ -96,16 +98,227 @@ def _run_command(command: str):
return result
def _get_etcd_host_path_from_kind_config(config_file: str) -> Optional[str]:
"""Extract etcd host path from kind config extraMounts."""
try:
with open(config_file, "r") as f:
config = yaml.safe_load(f)
except Exception:
return None
nodes = config.get("nodes", [])
for node in nodes:
extra_mounts = node.get("extraMounts", [])
for mount in extra_mounts:
if mount.get("containerPath") == "/var/lib/etcd":
return mount.get("hostPath")
return None
def _clean_etcd_keeping_certs(etcd_path: str) -> bool:
"""Clean persisted etcd, keeping only TLS certificates.
When etcd is persisted and a cluster is recreated, kind tries to install
resources fresh but they already exist. Instead of trying to delete
specific stale resources (blacklist), we keep only the valuable data
(caddy TLS certs) and delete everything else (whitelist approach).
The etcd image is distroless (no shell), so we extract the statically-linked
etcdctl binary and run it from alpine which has shell support.
Returns True if cleanup succeeded, False if no action needed or failed.
"""
db_path = Path(etcd_path) / "member" / "snap" / "db"
# Check existence using docker since etcd dir is root-owned
check_cmd = (
f"docker run --rm -v {etcd_path}:/etcd:ro alpine:3.19 "
"test -f /etcd/member/snap/db"
)
check_result = subprocess.run(check_cmd, shell=True, capture_output=True)
if check_result.returncode != 0:
if opts.o.debug:
print(f"No etcd snapshot at {db_path}, skipping cleanup")
return False
if opts.o.debug:
print(f"Cleaning persisted etcd at {etcd_path}, keeping only TLS certs")
etcd_image = "gcr.io/etcd-development/etcd:v3.5.9"
temp_dir = "/tmp/laconic-etcd-cleanup"
# Whitelist: prefixes to KEEP - everything else gets deleted
keep_prefixes = "/registry/secrets/caddy-system"
# The etcd image is distroless (no shell). We extract the statically-linked
# etcdctl binary and run it from alpine which has shell + jq support.
cleanup_script = f"""
set -e
ALPINE_IMAGE="alpine:3.19"
# Cleanup previous runs
docker rm -f laconic-etcd-cleanup 2>/dev/null || true
docker rm -f etcd-extract 2>/dev/null || true
docker run --rm -v /tmp:/tmp $ALPINE_IMAGE rm -rf {temp_dir}
# Create temp dir
docker run --rm -v /tmp:/tmp $ALPINE_IMAGE mkdir -p {temp_dir}
# Extract etcdctl binary (it's statically linked)
docker create --name etcd-extract {etcd_image}
docker cp etcd-extract:/usr/local/bin/etcdctl /tmp/etcdctl-bin
docker rm etcd-extract
docker run --rm -v /tmp/etcdctl-bin:/src:ro -v {temp_dir}:/dst $ALPINE_IMAGE \
sh -c "cp /src /dst/etcdctl && chmod +x /dst/etcdctl"
# Copy db to temp location
docker run --rm \
-v {etcd_path}:/etcd:ro \
-v {temp_dir}:/tmp-work \
$ALPINE_IMAGE cp /etcd/member/snap/db /tmp-work/etcd-snapshot.db
# Restore snapshot
docker run --rm -v {temp_dir}:/work {etcd_image} \
etcdutl snapshot restore /work/etcd-snapshot.db \
--data-dir=/work/etcd-data --skip-hash-check 2>/dev/null
# Start temp etcd (runs the etcd binary, no shell needed)
docker run -d --name laconic-etcd-cleanup \
-v {temp_dir}/etcd-data:/etcd-data \
-v {temp_dir}:/backup \
{etcd_image} etcd \
--data-dir=/etcd-data \
--listen-client-urls=http://0.0.0.0:2379 \
--advertise-client-urls=http://localhost:2379
sleep 3
# Use alpine with extracted etcdctl to run commands (alpine has shell + jq)
# Export caddy secrets
docker run --rm \
-v {temp_dir}:/backup \
--network container:laconic-etcd-cleanup \
$ALPINE_IMAGE sh -c \
'/backup/etcdctl get --prefix "{keep_prefixes}" -w json \
> /backup/kept.json 2>/dev/null || echo "{{}}" > /backup/kept.json'
# Delete ALL registry keys
docker run --rm \
-v {temp_dir}:/backup \
--network container:laconic-etcd-cleanup \
$ALPINE_IMAGE /backup/etcdctl del --prefix /registry
# Restore kept keys using jq
docker run --rm \
-v {temp_dir}:/backup \
--network container:laconic-etcd-cleanup \
$ALPINE_IMAGE sh -c '
apk add --no-cache jq >/dev/null 2>&1
jq -r ".kvs[] | @base64" /backup/kept.json 2>/dev/null | \
while read encoded; do
key=$(echo $encoded | base64 -d | jq -r ".key" | base64 -d)
val=$(echo $encoded | base64 -d | jq -r ".value" | base64 -d)
echo "$val" | /backup/etcdctl put "$key"
done
' || true
# Save cleaned snapshot
docker exec laconic-etcd-cleanup \
etcdctl snapshot save /etcd-data/cleaned-snapshot.db
docker stop laconic-etcd-cleanup
docker rm laconic-etcd-cleanup
# Restore to temp location first to verify it works
docker run --rm \
-v {temp_dir}/etcd-data/cleaned-snapshot.db:/data/db:ro \
-v {temp_dir}:/restore \
{etcd_image} \
etcdutl snapshot restore /data/db --data-dir=/restore/new-etcd \
--skip-hash-check 2>/dev/null
# Create timestamped backup of original (kept forever)
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
docker run --rm -v {etcd_path}:/etcd $ALPINE_IMAGE \
cp -a /etcd/member /etcd/member.backup-$TIMESTAMP
# Replace original with cleaned version
docker run --rm -v {etcd_path}:/etcd -v {temp_dir}:/tmp-work $ALPINE_IMAGE \
sh -c "rm -rf /etcd/member && mv /tmp-work/new-etcd/member /etcd/member"
# Cleanup temp files (but NOT the timestamped backup in etcd_path)
docker run --rm -v /tmp:/tmp $ALPINE_IMAGE rm -rf {temp_dir}
rm -f /tmp/etcdctl-bin
"""
result = subprocess.run(cleanup_script, shell=True, capture_output=True, text=True)
if result.returncode != 0:
if opts.o.debug:
print(f"Warning: etcd cleanup failed: {result.stderr}")
return False
if opts.o.debug:
print("Cleaned etcd, kept only TLS certificates")
return True
def create_cluster(name: str, config_file: str):
"""Create or reuse the single kind cluster for this host.
There is only one kind cluster per host by design. Multiple deployments
share this cluster. If a cluster already exists, it is reused.
Args:
name: Cluster name (used only when creating the first cluster)
config_file: Path to kind config file (used only when creating)
Returns:
The name of the cluster being used
"""
existing = get_kind_cluster()
if existing:
print(f"Using existing cluster: {existing}")
return existing
# Clean persisted etcd, keeping only TLS certificates
etcd_path = _get_etcd_host_path_from_kind_config(config_file)
if etcd_path:
_clean_etcd_keeping_certs(etcd_path)
print(f"Creating new cluster: {name}")
result = _run_command(f"kind create cluster --name {name} --config {config_file}") result = _run_command(f"kind create cluster --name {name} --config {config_file}")
if result.returncode != 0: if result.returncode != 0:
raise DeployerException(f"kind create cluster failed: {result}") raise DeployerException(f"kind create cluster failed: {result}")
return name
def destroy_cluster(name: str):
_run_command(f"kind delete cluster --name {name}")
def is_ingress_running() -> bool:
"""Check if the Caddy ingress controller is already running in the cluster."""
try:
core_v1 = client.CoreV1Api()
pods = core_v1.list_namespaced_pod(
namespace="caddy-system",
label_selector=(
"app.kubernetes.io/name=caddy-ingress-controller,"
"app.kubernetes.io/component=controller"
),
)
for pod in pods.items:
if pod.status and pod.status.container_statuses:
if pod.status.container_statuses[0].ready is True:
if opts.o.debug:
print("Caddy ingress controller already running")
return True
return False
except ApiException:
return False
def wait_for_ingress_in_kind():
core_v1 = client.CoreV1Api()
for i in range(20):
@ -132,7 +345,7 @@ def wait_for_ingress_in_kind():
error_exit("ERROR: Timed out waiting for Caddy ingress to become ready") error_exit("ERROR: Timed out waiting for Caddy ingress to become ready")
def install_ingress_for_kind(): def install_ingress_for_kind(acme_email: str = ""):
api_client = client.ApiClient() api_client = client.ApiClient()
ingress_install = os.path.abspath( ingress_install = os.path.abspath(
get_k8s_dir().joinpath( get_k8s_dir().joinpath(
@ -141,7 +354,34 @@ def install_ingress_for_kind():
)
if opts.o.debug:
print("Installing Caddy ingress controller in kind cluster")
# Template the YAML with email before applying
with open(ingress_install) as f:
yaml_content = f.read()
if acme_email:
yaml_content = yaml_content.replace('email: ""', f'email: "{acme_email}"')
if opts.o.debug:
print(f"Configured Caddy with ACME email: {acme_email}")
# Apply templated YAML
yaml_objects = list(yaml.safe_load_all(yaml_content))
utils.create_from_yaml(api_client, yaml_objects=yaml_objects)
# Patch ConfigMap with ACME email if provided
if acme_email:
if opts.o.debug:
print(f"Configuring ACME email: {acme_email}")
core_api = client.CoreV1Api()
configmap = core_api.read_namespaced_config_map(
name="caddy-ingress-controller-configmap", namespace="caddy-system"
)
configmap.data["email"] = acme_email
core_api.patch_namespaced_config_map(
name="caddy-ingress-controller-configmap",
namespace="caddy-system",
body=configmap,
)
def load_images_into_kind(kind_cluster_name: str, image_set: Set[str]):
@ -324,6 +564,25 @@ def _generate_kind_mounts(parsed_pod_files, deployment_dir, deployment_context):
volume_host_path_map = _get_host_paths_for_volumes(deployment_context)
seen_host_path_mounts = set() # Track to avoid duplicate mounts
# Cluster state backup for offline data recovery (unique per deployment)
# etcd contains all k8s state; PKI certs needed to decrypt etcd offline
deployment_id = deployment_context.id
backup_subdir = f"cluster-backups/{deployment_id}"
etcd_host_path = _make_absolute_host_path(
Path(f"./data/{backup_subdir}/etcd"), deployment_dir
)
volume_definitions.append(
f" - hostPath: {etcd_host_path}\n" f" containerPath: /var/lib/etcd\n"
)
pki_host_path = _make_absolute_host_path(
Path(f"./data/{backup_subdir}/pki"), deployment_dir
)
volume_definitions.append(
f" - hostPath: {pki_host_path}\n" f" containerPath: /etc/kubernetes/pki\n"
)
# Note these paths are relative to the location of the pod files (at present)
# So we need to fix up to make them correct and absolute because kind assumes
# relative to the cwd.

View File

@ -98,6 +98,17 @@ class Spec:
def get_image_registry(self):
return self.obj.get(constants.image_registry_key)
def get_image_registry_config(self) -> typing.Optional[typing.Dict]:
"""Returns registry auth config: {server, username, token-env}.
Used for private container registries like GHCR. The token-env field
specifies an environment variable containing the API token/PAT.
Note: Uses 'registry-credentials' key to avoid collision with
'image-registry' key which is for pushing images.
"""
return self.obj.get("registry-credentials")
def get_volumes(self):
return self.obj.get(constants.volumes_key, {})
@ -117,6 +128,9 @@ class Spec:
def get_http_proxy(self):
return self.obj.get(constants.network_key, {}).get(constants.http_proxy_key, [])
def get_annotations(self):
return self.obj.get(constants.annotations_key, {})
@ -179,6 +193,9 @@ class Spec:
def get_deployment_type(self):
return self.obj.get(constants.deploy_to_key)
def get_acme_email(self):
return self.obj.get(constants.network_key, {}).get(constants.acme_email_key, "")
def is_kubernetes_deployment(self):
return self.get_deployment_type() in [
constants.k8s_kind_deploy_type,