forked from cerc-io/stack-orchestrator

Merge pull request 'feat(k8s): ACME email fix, etcd persistence, volume paths' (#986) from fix-caddy-acme-email-rbac into main
Reviewed-on: cerc-io/stack-orchestrator#986
Commit: 21d47908cc

CLAUDE.md (71 lines changed)
@ -8,6 +8,7 @@ NEVER assume your hypotheses are true without evidence

ALWAYS clearly state when something is a hypothesis
ALWAYS use evidence from the systems you're interacting with to support your claims and hypotheses
ALWAYS run `pre-commit run --all-files` before committing changes

## Key Principles
@ -43,6 +44,76 @@ This project follows principles inspired by literate programming, where developm

This approach treats the human-AI collaboration as a form of **conversational literate programming** where understanding emerges through dialogue before code implementation.

## External Stacks Preferred

When creating new stacks for any reason, **use the external stack pattern** rather than adding stacks directly to this repository.

External stacks follow this structure:

```
my-stack/
└── stack-orchestrator/
    ├── stacks/
    │   └── my-stack/
    │       ├── stack.yml
    │       └── README.md
    ├── compose/
    │   └── docker-compose-my-stack.yml
    └── config/
        └── my-stack/
            └── (config files)
```
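
A minimal `stack.yml` for such a stack might look like the following sketch (the field values here are illustrative, not taken from a real stack):

```yaml
version: "1.0"
name: my-stack
description: "Example external stack"
repos:
  - github.com/org/my-app
containers:
  - cerc/my-app
pods:
  - my-stack
```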

### Usage

```bash
# Fetch external stack
laconic-so fetch-stack github.com/org/my-stack

# Use external stack
STACK_PATH=~/cerc/my-stack/stack-orchestrator/stacks/my-stack
laconic-so --stack $STACK_PATH deploy init --output spec.yml
laconic-so --stack $STACK_PATH deploy create --spec-file spec.yml --deployment-dir deployment
laconic-so deployment --dir deployment start
```

### Examples

- `zenith-karma-stack` - Karma watcher deployment
- `urbit-stack` - Fake Urbit ship for testing
- `zenith-desk-stack` - Desk deployment stack

## Architecture: k8s-kind Deployments

### One Cluster Per Host

One Kind cluster per host by design. Never request or expect separate clusters.

- `create_cluster()` in `helpers.py` reuses any existing cluster
- `cluster-id` in deployment.yml is an identifier, not a cluster request (see the sketch below)
- All deployments share: ingress controller, etcd, certificates
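
For example, a generated deployment.yml might look like this (an illustrative sketch; the `cluster-id` value is random, and `stack-source` is only present when the deployment was created from a known stack path):

```yaml
# deployment.yml (illustrative)
cluster-id: laconic-1a2b3c4d5e6f7a8b   # names the shared kind cluster, does not request a new one
stack-source: /home/user/cerc/my-stack/stack-orchestrator/stacks/my-stack
```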

### Stack Resolution

- External stacks detected via `Path(stack).exists()` in `util.py`
- Config/compose resolution: external path first, then internal fallback
- External path structure: `stack_orchestrator/data/stacks/<name>/stack.yml`

### Secret Generation Implementation

- `GENERATE_TOKEN_PATTERN` in `deployment_create.py` matches `$generate:type:length$`
- `_generate_and_store_secrets()` creates K8s Secret
- `cluster_info.py` adds `envFrom` with `secretRef` to containers (see the sketch below)
- Non-secret config written to `config.env`
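
As an illustration, the generated Secret is wired into each container roughly like this (a sketch of the rendered pod spec, assuming a deployment named `my-stack`; the container name and image are placeholders):

```yaml
spec:
  containers:
    - name: my-service
      image: cerc/my-app:local
      envFrom:
        - secretRef:
            name: my-stack-generated-secrets
            optional: true   # pods still start when no secrets were generated
```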

### Repository Cloning

`setup-repositories --git-ssh` clones repos defined in stack.yml's `repos:` field. Requires SSH agent.

### Key Files (for codebase navigation)

- `repos/setup_repositories.py`: `setup-repositories` command (git clone)
- `deployment_create.py`: `deploy create` command, secret generation
- `deployment.py`: `deployment start/stop/restart` commands
- `deploy_k8s.py`: K8s deployer, cluster management calls
- `helpers.py`: `create_cluster()`, etcd cleanup, kind operations
- `cluster_info.py`: K8s resource generation (Deployment, Service, Ingress)

## Insights and Observations

### Design Principles

README.md (53 lines changed)

@ -71,6 +71,59 @@ The various [stacks](/stack_orchestrator/data/stacks) each contain instructions

- [laconicd with console and CLI](stack_orchestrator/data/stacks/fixturenet-laconic-loaded)
- [kubo (IPFS)](stack_orchestrator/data/stacks/kubo)

## Deployment Types

- **compose**: Docker Compose on local machine
- **k8s**: External Kubernetes cluster (requires kubeconfig)
- **k8s-kind**: Local Kubernetes via Kind - one cluster per host, shared by all deployments

## External Stacks

Stacks can live in external git repositories. Required structure:

```
<repo>/
  stack_orchestrator/data/
    stacks/<stack-name>/stack.yml
    compose/docker-compose-<pod-name>.yml
  deployment/spec.yml
```

## Deployment Commands

```bash
# Create deployment from spec
laconic-so --stack <path> deploy create --spec-file <spec.yml> --deployment-dir <dir>

# Start (creates cluster on first run)
laconic-so deployment --dir <dir> start

# GitOps restart (git pull + redeploy, preserves data)
laconic-so deployment --dir <dir> restart

# Stop
laconic-so deployment --dir <dir> stop
```

## spec.yml Reference

```yaml
stack: stack-name-or-path
deploy-to: k8s-kind
network:
  http-proxy:
    - host-name: app.example.com
      routes:
        - path: /
          proxy-to: service-name:port
  acme-email: admin@example.com
config:
  ENV_VAR: value
  SECRET_VAR: $generate:hex:32$ # Auto-generated, stored in K8s Secret
volumes:
  volume-name:
```

## Contributing

See the [CONTRIBUTING.md](/docs/CONTRIBUTING.md) for developer mode install.

docs/deployment_patterns.md (new file, 202 lines)

@ -0,0 +1,202 @@

# Deployment Patterns

## GitOps Pattern

For production deployments, we recommend a GitOps approach where your deployment configuration is tracked in version control.

### Overview

- **spec.yml is your source of truth**: Maintain it in your operator repository
- **Don't regenerate on every restart**: Run `deploy init` once, then customize and commit
- **Use restart for updates**: The restart command respects your git-tracked spec.yml

### Workflow

1. **Initial setup**: Run `deploy init` once to generate a spec.yml template
2. **Customize and commit**: Edit spec.yml with your configuration (hostnames, resources, etc.) and commit to your operator repo
3. **Deploy from git**: Use the committed spec.yml for deployments
4. **Update via git**: Make changes in git, then restart to apply

```bash
# Initial setup (run once)
laconic-so --stack my-stack deploy init --output spec.yml

# Customize for your environment
vim spec.yml  # Set hostname, resources, etc.

# Commit to your operator repository
git add spec.yml
git commit -m "Add my-stack deployment configuration"
git push

# On deployment server: deploy from git-tracked spec
laconic-so deploy create \
  --spec-file /path/to/operator-repo/spec.yml \
  --deployment-dir my-deployment

laconic-so deployment --dir my-deployment start
```

### Updating Deployments

When you need to update a deployment:

```bash
# 1. Make changes in your operator repo
vim /path/to/operator-repo/spec.yml
git commit -am "Update configuration"
git push

# 2. On deployment server: pull and restart
cd /path/to/operator-repo && git pull
laconic-so deployment --dir my-deployment restart
```

The `restart` command:
- Pulls latest code from the stack repository
- Uses your git-tracked spec.yml (does NOT regenerate from defaults)
- Syncs the deployment directory
- Restarts services

### Anti-patterns

**Don't do this:**
```bash
# BAD: Regenerating spec on every deployment
laconic-so --stack my-stack deploy init --output spec.yml
laconic-so deploy create --spec-file spec.yml ...
```

This overwrites your customizations with defaults from the stack's `commands.py`.

**Do this instead:**
```bash
# GOOD: Use your git-tracked spec
git pull  # Get latest spec.yml from your operator repo
laconic-so deployment --dir my-deployment restart
```

## Private Registry Authentication

For deployments using images from private container registries (e.g., GitHub Container Registry), configure authentication in your spec.yml:

### Configuration

Add a `registry-credentials` section to your spec.yml:

```yaml
registry-credentials:
  server: ghcr.io
  username: your-org-or-username
  token-env: REGISTRY_TOKEN
```

**Fields:**
- `server`: The registry hostname (e.g., `ghcr.io`, `docker.io`, `gcr.io`)
- `username`: Registry username (for GHCR, use your GitHub username or org name)
- `token-env`: Name of the environment variable containing your API token/PAT

### Token Environment Variable

The `token-env` pattern keeps credentials out of version control. Set the environment variable when running `deployment start`:

```bash
export REGISTRY_TOKEN="your-personal-access-token"
laconic-so deployment --dir my-deployment start
```

For GHCR, create a Personal Access Token (PAT) with `read:packages` scope.

### Ansible Integration

When using Ansible for deployments, pass the token from a credentials file:

```yaml
- name: Start deployment
  ansible.builtin.command:
    cmd: laconic-so deployment --dir {{ deployment_dir }} start
  environment:
    REGISTRY_TOKEN: "{{ lookup('file', '~/.credentials/ghcr_token') }}"
```

### How It Works

1. laconic-so reads the `registry-credentials` config from spec.yml
2. Creates a Kubernetes `docker-registry` secret named `{deployment}-registry`
3. The deployment's pods reference this secret for image pulls (see the sketch below)
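
For reference, the generated pull secret and its use in the pod spec look roughly like this (a hedged sketch for a deployment named `my-stack`; the actual objects are created through the Kubernetes API rather than applied from a file):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: my-stack-registry
  namespace: default
type: kubernetes.io/dockerconfigjson
data:
  # base64 of {"auths": {"ghcr.io": {"username": "...", "password": "<token>", "auth": "..."}}}
  .dockerconfigjson: <base64-encoded docker config>
---
# Referenced from the pod spec:
spec:
  imagePullSecrets:
    - name: my-stack-registry
```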

## Cluster and Volume Management

### Stopping Deployments

The `deployment stop` command has two important flags: `--delete-volumes` (shown here) and `--skip-cluster-management` (see Shared Cluster Architecture below):

```bash
# Default: stops deployment, deletes cluster, PRESERVES volumes
laconic-so deployment --dir my-deployment stop

# Explicitly delete volumes (USE WITH CAUTION)
laconic-so deployment --dir my-deployment stop --delete-volumes
```

### Volume Persistence

Volumes persist across cluster deletion by design. This is important because:
- **Data survives cluster recreation**: Ledger data, databases, and other state are preserved
- **Faster recovery**: No need to re-sync or rebuild data after cluster issues
- **Safe cluster upgrades**: Delete and recreate the cluster without data loss

**Only use `--delete-volumes` when:**
- You explicitly want to start fresh with no data
- The user specifically requests volume deletion
- You're cleaning up a test/dev environment completely

### Shared Cluster Architecture

In kind deployments, multiple stacks share a single cluster:
- First `deployment start` creates the cluster
- Subsequent deployments reuse the existing cluster
- `deployment stop` on ANY deployment deletes the shared cluster
- Other deployments will fail until the cluster is recreated

To stop a single deployment without affecting the cluster:
```bash
laconic-so deployment --dir my-deployment stop --skip-cluster-management
```

## Volume Persistence in k8s-kind

k8s-kind has 3 storage layers:

- **Docker Host**: The physical server running Docker
- **Kind Node**: A Docker container simulating a k8s node
- **Pod Container**: Your workload

For k8s-kind, volumes with paths are mounted from Docker Host → Kind Node → Pod via extraMounts.

| spec.yml volume  | Storage Location | Survives Pod Restart | Survives Cluster Restart |
|------------------|------------------|----------------------|--------------------------|
| `vol:` (empty)   | Kind Node PVC    | ✅                   | ❌                       |
| `vol: ./data/x`  | Docker Host      | ✅                   | ✅                       |
| `vol: /abs/path` | Docker Host      | ✅                   | ✅                       |

**Recommendation**: Always use paths for data you want to keep. Relative paths (e.g., `./data/rpc-config`) resolve to `$DEPLOYMENT_DIR/data/rpc-config` on the Docker Host. The kind-config sketch after the example below shows how these paths become extraMounts.

### Example

```yaml
# In spec.yml
volumes:
  rpc-config: ./data/rpc-config  # Persists to $DEPLOYMENT_DIR/data/rpc-config
  chain-data: ./data/chain       # Persists to $DEPLOYMENT_DIR/data/chain
  temp-cache:                    # Empty = Kind Node PVC (lost on cluster delete)
```
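
When the kind cluster config is generated, those path-backed volumes become `extraMounts` on the kind node, roughly like this (an illustrative sketch; actual host paths are resolved to absolute Docker Host paths, and the in-node paths follow the `/mnt/<volume-name>` convention):

```yaml
# kind cluster config (generated) - illustrative
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    extraMounts:
      - hostPath: /srv/deployments/my-deployment/data/rpc-config   # Docker Host
        containerPath: /mnt/rpc-config                             # inside the Kind Node
      - hostPath: /srv/deployments/my-deployment/data/chain
        containerPath: /mnt/chain
```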

### The Antipattern

Empty-path volumes appear persistent because they survive pod restarts (data lives in the Kind Node container). However, this data is lost when the kind cluster is recreated. This "false persistence" has caused data loss when operators assumed their data was safe.

@ -44,3 +44,4 @@ unlimited_memlock_key = "unlimited-memlock"
runtime_class_key = "runtime-class"
high_memlock_runtime = "high-memlock"
high_memlock_spec_filename = "high-memlock-spec.json"
acme_email_key = "acme-email"

@ -93,6 +93,7 @@ rules:
- get
- create
- update
- delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding

@ -15,7 +15,9 @@
|
||||
|
||||
import click
|
||||
from pathlib import Path
|
||||
import subprocess
|
||||
import sys
|
||||
import time
|
||||
from stack_orchestrator import constants
|
||||
from stack_orchestrator.deploy.images import push_images_operation
|
||||
from stack_orchestrator.deploy.deploy import (
|
||||
@ -228,3 +230,176 @@ def run_job(ctx, job_name, helm_release):
|
||||
|
||||
ctx.obj = make_deploy_context(ctx)
|
||||
run_job_operation(ctx, job_name, helm_release)
|
||||
|
||||
|
||||
@command.command()
|
||||
@click.option("--stack-path", help="Path to stack git repo (overrides stored path)")
|
||||
@click.option(
|
||||
"--spec-file", help="Path to GitOps spec.yml in repo (e.g., deployment/spec.yml)"
|
||||
)
|
||||
@click.option("--config-file", help="Config file to pass to deploy init")
|
||||
@click.option(
|
||||
"--force",
|
||||
is_flag=True,
|
||||
default=False,
|
||||
help="Skip DNS verification",
|
||||
)
|
||||
@click.option(
|
||||
"--expected-ip",
|
||||
help="Expected IP for DNS verification (if different from egress)",
|
||||
)
|
||||
@click.pass_context
|
||||
def restart(ctx, stack_path, spec_file, config_file, force, expected_ip):
|
||||
"""Pull latest code and restart deployment using git-tracked spec.
|
||||
|
||||
GitOps workflow:
|
||||
1. Operator maintains spec.yml in their git repository
|
||||
2. This command pulls latest code (including updated spec.yml)
|
||||
3. If hostname changed, verifies DNS routes to this server
|
||||
4. Syncs deployment directory with the git-tracked spec
|
||||
5. Stops and restarts the deployment
|
||||
|
||||
Data volumes are always preserved. The cluster is never destroyed.
|
||||
|
||||
Stack source resolution (in order):
|
||||
1. --stack-path argument (if provided)
|
||||
2. stack-source field in deployment.yml (if stored)
|
||||
3. Error if neither available
|
||||
|
||||
Note: spec.yml should be maintained in git, not regenerated from
|
||||
commands.py on each restart. Use 'deploy init' only for initial
|
||||
spec generation, then customize and commit to your operator repo.
|
||||
"""
|
||||
from stack_orchestrator.util import get_yaml, get_parsed_deployment_spec
|
||||
from stack_orchestrator.deploy.deployment_create import create_operation
|
||||
from stack_orchestrator.deploy.dns_probe import verify_dns_via_probe
|
||||
|
||||
deployment_context: DeploymentContext = ctx.obj
|
||||
|
||||
# Get current spec info (before git pull)
|
||||
current_spec = deployment_context.spec
|
||||
current_http_proxy = current_spec.get_http_proxy()
|
||||
current_hostname = (
|
||||
current_http_proxy[0]["host-name"] if current_http_proxy else None
|
||||
)
|
||||
|
||||
# Resolve stack source path
|
||||
if stack_path:
|
||||
stack_source = Path(stack_path).resolve()
|
||||
else:
|
||||
# Try to get from deployment.yml
|
||||
deployment_file = (
|
||||
deployment_context.deployment_dir / constants.deployment_file_name
|
||||
)
|
||||
deployment_data = get_yaml().load(open(deployment_file))
|
||||
stack_source_str = deployment_data.get("stack-source")
|
||||
if not stack_source_str:
|
||||
print(
|
||||
"Error: No stack-source in deployment.yml and --stack-path not provided"
|
||||
)
|
||||
print("Use --stack-path to specify the stack git repository location")
|
||||
sys.exit(1)
|
||||
stack_source = Path(stack_source_str)
|
||||
|
||||
if not stack_source.exists():
|
||||
print(f"Error: Stack source path does not exist: {stack_source}")
|
||||
sys.exit(1)
|
||||
|
||||
print("=== Deployment Restart ===")
|
||||
print(f"Deployment dir: {deployment_context.deployment_dir}")
|
||||
print(f"Stack source: {stack_source}")
|
||||
print(f"Current hostname: {current_hostname}")
|
||||
|
||||
# Step 1: Git pull (brings in updated spec.yml from operator's repo)
|
||||
print("\n[1/4] Pulling latest code from stack repository...")
|
||||
git_result = subprocess.run(
|
||||
["git", "pull"], cwd=stack_source, capture_output=True, text=True
|
||||
)
|
||||
if git_result.returncode != 0:
|
||||
print(f"Git pull failed: {git_result.stderr}")
|
||||
sys.exit(1)
|
||||
print(f"Git pull: {git_result.stdout.strip()}")
|
||||
|
||||
# Determine spec file location
|
||||
# Priority: --spec-file argument > repo's deployment/spec.yml > deployment dir
|
||||
# Stack path is like: repo/stack_orchestrator/data/stacks/stack-name
|
||||
# So repo root is 4 parents up
|
||||
repo_root = stack_source.parent.parent.parent.parent
|
||||
if spec_file:
|
||||
# Spec file relative to repo root
|
||||
spec_file_path = repo_root / spec_file
|
||||
else:
|
||||
# Try standard GitOps location in repo
|
||||
gitops_spec = repo_root / "deployment" / "spec.yml"
|
||||
if gitops_spec.exists():
|
||||
spec_file_path = gitops_spec
|
||||
else:
|
||||
# Fall back to deployment directory
|
||||
spec_file_path = deployment_context.deployment_dir / "spec.yml"
|
||||
|
||||
if not spec_file_path.exists():
|
||||
print(f"Error: spec.yml not found at {spec_file_path}")
|
||||
print("For GitOps, add spec.yml to your repo at deployment/spec.yml")
|
||||
print("Or specify --spec-file with path relative to repo root")
|
||||
sys.exit(1)
|
||||
|
||||
print(f"Using spec: {spec_file_path}")
|
||||
|
||||
# Parse spec to check for hostname changes
|
||||
new_spec_obj = get_parsed_deployment_spec(str(spec_file_path))
|
||||
new_http_proxy = new_spec_obj.get("network", {}).get("http-proxy", [])
|
||||
new_hostname = new_http_proxy[0]["host-name"] if new_http_proxy else None
|
||||
|
||||
print(f"Spec hostname: {new_hostname}")
|
||||
|
||||
# Step 2: DNS verification (only if hostname changed)
|
||||
if new_hostname and new_hostname != current_hostname:
|
||||
print(f"\n[2/4] Hostname changed: {current_hostname} -> {new_hostname}")
|
||||
if force:
|
||||
print("DNS verification skipped (--force)")
|
||||
else:
|
||||
print("Verifying DNS via probe...")
|
||||
if not verify_dns_via_probe(new_hostname):
|
||||
print(f"\nDNS verification failed for {new_hostname}")
|
||||
print("Ensure DNS is configured before restarting.")
|
||||
print("Use --force to skip this check.")
|
||||
sys.exit(1)
|
||||
else:
|
||||
print("\n[2/4] Hostname unchanged, skipping DNS verification")
|
||||
|
||||
# Step 3: Sync deployment directory with spec
|
||||
print("\n[3/4] Syncing deployment directory...")
|
||||
deploy_ctx = make_deploy_context(ctx)
|
||||
create_operation(
|
||||
deployment_command_context=deploy_ctx,
|
||||
spec_file=str(spec_file_path),
|
||||
deployment_dir=str(deployment_context.deployment_dir),
|
||||
update=True,
|
||||
network_dir=None,
|
||||
initial_peers=None,
|
||||
)
|
||||
|
||||
# Reload deployment context with updated spec
|
||||
deployment_context.init(deployment_context.deployment_dir)
|
||||
ctx.obj = deployment_context
|
||||
|
||||
# Stop deployment
|
||||
print("\n[4/4] Restarting deployment...")
|
||||
ctx.obj = make_deploy_context(ctx)
|
||||
down_operation(
|
||||
ctx, delete_volumes=False, extra_args_list=[], skip_cluster_management=True
|
||||
)
|
||||
|
||||
# Brief pause to ensure clean shutdown
|
||||
time.sleep(5)
|
||||
|
||||
# Start deployment
|
||||
up_operation(
|
||||
ctx, services_list=None, stay_attached=False, skip_cluster_management=True
|
||||
)
|
||||
|
||||
print("\n=== Restart Complete ===")
|
||||
print("Deployment restarted with git-tracked configuration.")
|
||||
if new_hostname and new_hostname != current_hostname:
|
||||
print(f"\nNew hostname: {new_hostname}")
|
||||
print("Caddy will automatically provision TLS certificate.")
|
||||
|
||||
@ -15,9 +15,12 @@
|
||||
|
||||
import click
|
||||
from importlib import util
|
||||
import json
|
||||
import os
|
||||
import re
|
||||
import base64
|
||||
from pathlib import Path
|
||||
from typing import List
|
||||
from typing import List, Optional
|
||||
import random
|
||||
from shutil import copy, copyfile, copytree, rmtree
|
||||
from secrets import token_hex
|
||||
@ -484,15 +487,180 @@ def init_operation(
|
||||
get_yaml().dump(spec_file_content, output_file)
|
||||
|
||||
|
||||
def _write_config_file(spec_file: Path, config_env_file: Path):
|
||||
# Token pattern: $generate:hex:32$ or $generate:base64:16$
|
||||
GENERATE_TOKEN_PATTERN = re.compile(r"\$generate:(\w+):(\d+)\$")
|
||||
|
||||
|
||||
def _generate_and_store_secrets(config_vars: dict, deployment_name: str):
|
||||
"""Generate secrets for $generate:...$ tokens and store in K8s Secret.
|
||||
|
||||
Called by `deploy create` - generates fresh secrets and stores them.
|
||||
Returns the generated secrets dict for reference.
|
||||
"""
|
||||
from kubernetes import client, config as k8s_config
|
||||
|
||||
secrets = {}
|
||||
for name, value in config_vars.items():
|
||||
if not isinstance(value, str):
|
||||
continue
|
||||
match = GENERATE_TOKEN_PATTERN.search(value)
|
||||
if not match:
|
||||
continue
|
||||
|
||||
secret_type, length = match.group(1), int(match.group(2))
|
||||
if secret_type == "hex":
|
||||
secrets[name] = token_hex(length)
|
||||
elif secret_type == "base64":
|
||||
secrets[name] = base64.b64encode(os.urandom(length)).decode()
|
||||
else:
|
||||
secrets[name] = token_hex(length)
|
||||
|
||||
if not secrets:
|
||||
return secrets
|
||||
|
||||
# Store in K8s Secret
|
||||
try:
|
||||
k8s_config.load_kube_config()
|
||||
except Exception:
|
||||
# Fall back to in-cluster config if available
|
||||
try:
|
||||
k8s_config.load_incluster_config()
|
||||
except Exception:
|
||||
print(
|
||||
"Warning: Could not load kube config, secrets will not be stored in K8s"
|
||||
)
|
||||
return secrets
|
||||
|
||||
v1 = client.CoreV1Api()
|
||||
secret_name = f"{deployment_name}-generated-secrets"
|
||||
namespace = "default"
|
||||
|
||||
secret_data = {k: base64.b64encode(v.encode()).decode() for k, v in secrets.items()}
|
||||
k8s_secret = client.V1Secret(
|
||||
metadata=client.V1ObjectMeta(name=secret_name), data=secret_data, type="Opaque"
|
||||
)
|
||||
|
||||
try:
|
||||
v1.create_namespaced_secret(namespace, k8s_secret)
|
||||
num_secrets = len(secrets)
|
||||
print(f"Created K8s Secret '{secret_name}' with {num_secrets} secret(s)")
|
||||
except client.exceptions.ApiException as e:
|
||||
if e.status == 409: # Already exists
|
||||
v1.replace_namespaced_secret(secret_name, namespace, k8s_secret)
|
||||
num_secrets = len(secrets)
|
||||
print(f"Updated K8s Secret '{secret_name}' with {num_secrets} secret(s)")
|
||||
else:
|
||||
raise
|
||||
|
||||
return secrets
|
||||
|
||||
|
||||
def create_registry_secret(spec: Spec, deployment_name: str) -> Optional[str]:
|
||||
"""Create K8s docker-registry secret from spec + environment.
|
||||
|
||||
Reads registry configuration from spec.yml and creates a Kubernetes
|
||||
secret of type kubernetes.io/dockerconfigjson for image pulls.
|
||||
|
||||
Args:
|
||||
spec: The deployment spec containing image-registry config
|
||||
deployment_name: Name of the deployment (used for secret naming)
|
||||
|
||||
Returns:
|
||||
The secret name if created, None if no registry config
|
||||
"""
|
||||
from kubernetes import client, config as k8s_config
|
||||
|
||||
registry_config = spec.get_image_registry_config()
|
||||
if not registry_config:
|
||||
return None
|
||||
|
||||
server = registry_config.get("server")
|
||||
username = registry_config.get("username")
|
||||
token_env = registry_config.get("token-env")
|
||||
|
||||
if not all([server, username, token_env]):
|
||||
return None
|
||||
|
||||
# Type narrowing for pyright - we've validated these aren't None above
|
||||
assert token_env is not None
|
||||
token = os.environ.get(token_env)
|
||||
if not token:
|
||||
print(
|
||||
f"Warning: Registry token env var '{token_env}' not set, "
|
||||
"skipping registry secret"
|
||||
)
|
||||
return None
|
||||
|
||||
# Create dockerconfigjson format (Docker API uses "password" field for tokens)
|
||||
auth = base64.b64encode(f"{username}:{token}".encode()).decode()
|
||||
docker_config = {
|
||||
"auths": {server: {"username": username, "password": token, "auth": auth}}
|
||||
}
|
||||
|
||||
# Secret name derived from deployment name
|
||||
secret_name = f"{deployment_name}-registry"
|
||||
|
||||
# Load kube config
|
||||
try:
|
||||
k8s_config.load_kube_config()
|
||||
except Exception:
|
||||
try:
|
||||
k8s_config.load_incluster_config()
|
||||
except Exception:
|
||||
print("Warning: Could not load kube config, registry secret not created")
|
||||
return None
|
||||
|
||||
v1 = client.CoreV1Api()
|
||||
namespace = "default"
|
||||
|
||||
k8s_secret = client.V1Secret(
|
||||
metadata=client.V1ObjectMeta(name=secret_name),
|
||||
data={
|
||||
".dockerconfigjson": base64.b64encode(
|
||||
json.dumps(docker_config).encode()
|
||||
).decode()
|
||||
},
|
||||
type="kubernetes.io/dockerconfigjson",
|
||||
)
|
||||
|
||||
try:
|
||||
v1.create_namespaced_secret(namespace, k8s_secret)
|
||||
print(f"Created registry secret '{secret_name}' for {server}")
|
||||
except client.exceptions.ApiException as e:
|
||||
if e.status == 409: # Already exists
|
||||
v1.replace_namespaced_secret(secret_name, namespace, k8s_secret)
|
||||
print(f"Updated registry secret '{secret_name}' for {server}")
|
||||
else:
|
||||
raise
|
||||
|
||||
return secret_name
|
||||
|
||||
|
||||
def _write_config_file(
|
||||
spec_file: Path, config_env_file: Path, deployment_name: Optional[str] = None
|
||||
):
|
||||
spec_content = get_parsed_deployment_spec(spec_file)
|
||||
# Note: we want to write an empty file even if we have no config variables
|
||||
config_vars = spec_content.get("config", {}) or {}
|
||||
|
||||
# Generate and store secrets in K8s if deployment_name provided and tokens exist
|
||||
if deployment_name and config_vars:
|
||||
has_generate_tokens = any(
|
||||
isinstance(v, str) and GENERATE_TOKEN_PATTERN.search(v)
|
||||
for v in config_vars.values()
|
||||
)
|
||||
if has_generate_tokens:
|
||||
_generate_and_store_secrets(config_vars, deployment_name)
|
||||
|
||||
# Write non-secret config to config.env (exclude $generate:...$ tokens)
|
||||
with open(config_env_file, "w") as output_file:
|
||||
if "config" in spec_content and spec_content["config"]:
|
||||
config_vars = spec_content["config"]
|
||||
if config_vars:
|
||||
for variable_name, variable_value in config_vars.items():
|
||||
output_file.write(f"{variable_name}={variable_value}\n")
|
||||
if config_vars:
|
||||
for variable_name, variable_value in config_vars.items():
|
||||
# Skip variables with generate tokens - they go to K8s Secret
|
||||
if isinstance(variable_value, str) and GENERATE_TOKEN_PATTERN.search(
|
||||
variable_value
|
||||
):
|
||||
continue
|
||||
output_file.write(f"{variable_name}={variable_value}\n")
|
||||
|
||||
|
||||
def _write_kube_config_file(external_path: Path, internal_path: Path):
|
||||
@ -507,11 +675,14 @@ def _copy_files_to_directory(file_paths: List[Path], directory: Path):
|
||||
copy(path, os.path.join(directory, os.path.basename(path)))
|
||||
|
||||
|
||||
def _create_deployment_file(deployment_dir: Path):
|
||||
def _create_deployment_file(deployment_dir: Path, stack_source: Optional[Path] = None):
|
||||
deployment_file_path = deployment_dir.joinpath(constants.deployment_file_name)
|
||||
cluster = f"{constants.cluster_name_prefix}{token_hex(8)}"
|
||||
deployment_content = {constants.cluster_id_key: cluster}
|
||||
if stack_source:
|
||||
deployment_content["stack-source"] = str(stack_source)
|
||||
with open(deployment_file_path, "w") as output_file:
|
||||
output_file.write(f"{constants.cluster_id_key}: {cluster}\n")
|
||||
get_yaml().dump(deployment_content, output_file)
|
||||
|
||||
|
||||
def _check_volume_definitions(spec):
|
||||
@ -519,10 +690,14 @@ def _check_volume_definitions(spec):
|
||||
for volume_name, volume_path in spec.get_volumes().items():
|
||||
if volume_path:
|
||||
if not os.path.isabs(volume_path):
|
||||
raise Exception(
|
||||
f"Relative path {volume_path} for volume {volume_name} not "
|
||||
f"supported for deployment type {spec.get_deployment_type()}"
|
||||
)
|
||||
# For k8s-kind: allow relative paths, they'll be resolved
|
||||
# by _make_absolute_host_path() during kind config generation
|
||||
if not spec.is_kind_deployment():
|
||||
deploy_type = spec.get_deployment_type()
|
||||
raise Exception(
|
||||
f"Relative path {volume_path} for volume "
|
||||
f"{volume_name} not supported for {deploy_type}"
|
||||
)
|
||||
|
||||
|
||||
@click.command()
|
||||
@ -616,11 +791,15 @@ def create_operation(
|
||||
generate_helm_chart(stack_name, spec_file, deployment_dir_path)
|
||||
return # Exit early for helm chart generation
|
||||
|
||||
# Resolve stack source path for restart capability
|
||||
stack_source = get_stack_path(stack_name)
|
||||
|
||||
if update:
|
||||
# Sync mode: write to temp dir, then copy to deployment dir with backups
|
||||
temp_dir = Path(tempfile.mkdtemp(prefix="deployment-sync-"))
|
||||
try:
|
||||
# Write deployment files to temp dir (skip deployment.yml to preserve cluster ID)
|
||||
# Write deployment files to temp dir
|
||||
# (skip deployment.yml to preserve cluster ID)
|
||||
_write_deployment_files(
|
||||
temp_dir,
|
||||
Path(spec_file),
|
||||
@ -628,12 +807,14 @@ def create_operation(
|
||||
stack_name,
|
||||
deployment_type,
|
||||
include_deployment_file=False,
|
||||
stack_source=stack_source,
|
||||
)
|
||||
|
||||
# Copy from temp to deployment dir, excluding data volumes and backing up changed files
|
||||
# Exclude data/* to avoid touching user data volumes
|
||||
# Exclude config file to preserve deployment settings (XXX breaks passing config vars
|
||||
# from spec. could warn about this or not exclude...)
|
||||
# Copy from temp to deployment dir, excluding data volumes
|
||||
# and backing up changed files.
|
||||
# Exclude data/* to avoid touching user data volumes.
|
||||
# Exclude config file to preserve deployment settings
|
||||
# (XXX breaks passing config vars from spec)
|
||||
exclude_patterns = ["data", "data/*", constants.config_file_name]
|
||||
_safe_copy_tree(
|
||||
temp_dir, deployment_dir_path, exclude_patterns=exclude_patterns
|
||||
@ -650,6 +831,7 @@ def create_operation(
|
||||
stack_name,
|
||||
deployment_type,
|
||||
include_deployment_file=True,
|
||||
stack_source=stack_source,
|
||||
)
|
||||
|
||||
# Delegate to the stack's Python code
|
||||
@ -670,7 +852,7 @@ def create_operation(
|
||||
)
|
||||
|
||||
|
||||
def _safe_copy_tree(src: Path, dst: Path, exclude_patterns: List[str] = None):
|
||||
def _safe_copy_tree(src: Path, dst: Path, exclude_patterns: Optional[List[str]] = None):
|
||||
"""
|
||||
Recursively copy a directory tree, backing up changed files with .bak suffix.
|
||||
|
||||
@ -721,6 +903,7 @@ def _write_deployment_files(
|
||||
stack_name: str,
|
||||
deployment_type: str,
|
||||
include_deployment_file: bool = True,
|
||||
stack_source: Optional[Path] = None,
|
||||
):
|
||||
"""
|
||||
Write deployment files to target directory.
|
||||
@ -730,7 +913,8 @@ def _write_deployment_files(
|
||||
:param parsed_spec: Parsed spec object
|
||||
:param stack_name: Name of stack
|
||||
:param deployment_type: Type of deployment
|
||||
:param include_deployment_file: Whether to create deployment.yml file (skip for update)
|
||||
:param include_deployment_file: Whether to create deployment.yml (skip for update)
|
||||
:param stack_source: Path to stack source (git repo) for restart capability
|
||||
"""
|
||||
stack_file = get_stack_path(stack_name).joinpath(constants.stack_file_name)
|
||||
parsed_stack = get_parsed_stack_config(stack_name)
|
||||
@ -741,10 +925,15 @@ def _write_deployment_files(
|
||||
|
||||
# Create deployment file if requested
|
||||
if include_deployment_file:
|
||||
_create_deployment_file(target_dir)
|
||||
_create_deployment_file(target_dir, stack_source=stack_source)
|
||||
|
||||
# Copy any config variables from the spec file into an env file suitable for compose
|
||||
_write_config_file(spec_file, target_dir.joinpath(constants.config_file_name))
|
||||
# Use stack_name as deployment_name for K8s secret naming
|
||||
# Extract just the name part if stack_name is a path ("path/to/stack" -> "stack")
|
||||
deployment_name = Path(stack_name).name.replace("_", "-")
|
||||
_write_config_file(
|
||||
spec_file, target_dir.joinpath(constants.config_file_name), deployment_name
|
||||
)
|
||||
|
||||
# Copy any k8s config file into the target dir
|
||||
if deployment_type == "k8s":
|
||||
@ -805,8 +994,9 @@ def _write_deployment_files(
|
||||
)
|
||||
else:
|
||||
# TODO:
|
||||
# this is odd - looks up config dir that matches a volume name, then copies as a mount dir?
|
||||
# AFAICT this is not used by or relevant to any existing stack - roy
|
||||
# This is odd - looks up config dir that matches a volume name,
|
||||
# then copies as a mount dir?
|
||||
# AFAICT not used by or relevant to any existing stack - roy
|
||||
|
||||
# TODO: We should probably only do this if the volume is marked :ro.
|
||||
for volume_name, volume_path in parsed_spec.get_volumes().items():
|
||||
|
||||
stack_orchestrator/deploy/dns_probe.py (new file, 159 lines)
@ -0,0 +1,159 @@
|
||||
# Copyright © 2024 Vulcanize
|
||||
# SPDX-License-Identifier: AGPL-3.0
|
||||
|
||||
"""DNS verification via temporary ingress probe."""
|
||||
|
||||
import secrets
|
||||
import socket
|
||||
import time
|
||||
from typing import Optional
|
||||
import requests
|
||||
from kubernetes import client
|
||||
|
||||
|
||||
def get_server_egress_ip() -> str:
|
||||
"""Get this server's public egress IP via ipify."""
|
||||
response = requests.get("https://api.ipify.org", timeout=10)
|
||||
response.raise_for_status()
|
||||
return response.text.strip()
|
||||
|
||||
|
||||
def resolve_hostname(hostname: str) -> list[str]:
|
||||
"""Resolve hostname to list of IP addresses."""
|
||||
try:
|
||||
_, _, ips = socket.gethostbyname_ex(hostname)
|
||||
return ips
|
||||
except socket.gaierror:
|
||||
return []
|
||||
|
||||
|
||||
def verify_dns_simple(hostname: str, expected_ip: Optional[str] = None) -> bool:
|
||||
"""Simple DNS verification - check hostname resolves to expected IP.
|
||||
|
||||
If expected_ip not provided, uses server's egress IP.
|
||||
Returns True if hostname resolves to expected IP.
|
||||
"""
|
||||
resolved_ips = resolve_hostname(hostname)
|
||||
if not resolved_ips:
|
||||
print(f"DNS FAIL: {hostname} does not resolve")
|
||||
return False
|
||||
|
||||
if expected_ip is None:
|
||||
expected_ip = get_server_egress_ip()
|
||||
|
||||
if expected_ip in resolved_ips:
|
||||
print(f"DNS OK: {hostname} -> {resolved_ips} (includes {expected_ip})")
|
||||
return True
|
||||
else:
|
||||
print(f"DNS WARN: {hostname} -> {resolved_ips} (expected {expected_ip})")
|
||||
return False
|
||||
|
||||
|
||||
def create_probe_ingress(hostname: str, namespace: str = "default") -> str:
|
||||
"""Create a temporary ingress for DNS probing.
|
||||
|
||||
Returns the probe token that the ingress will respond with.
|
||||
"""
|
||||
token = secrets.token_hex(16)
|
||||
|
||||
networking_api = client.NetworkingV1Api()
|
||||
|
||||
# Create a simple ingress that Caddy will pick up
|
||||
ingress = client.V1Ingress(
|
||||
metadata=client.V1ObjectMeta(
|
||||
name="laconic-dns-probe",
|
||||
annotations={
|
||||
"kubernetes.io/ingress.class": "caddy",
|
||||
"laconic.com/probe-token": token,
|
||||
},
|
||||
),
|
||||
spec=client.V1IngressSpec(
|
||||
rules=[
|
||||
client.V1IngressRule(
|
||||
host=hostname,
|
||||
http=client.V1HTTPIngressRuleValue(
|
||||
paths=[
|
||||
client.V1HTTPIngressPath(
|
||||
path="/.well-known/laconic-probe",
|
||||
path_type="Exact",
|
||||
backend=client.V1IngressBackend(
|
||||
service=client.V1IngressServiceBackend(
|
||||
name="caddy-ingress-controller",
|
||||
port=client.V1ServiceBackendPort(number=80),
|
||||
)
|
||||
),
|
||||
)
|
||||
]
|
||||
),
|
||||
)
|
||||
]
|
||||
),
|
||||
)
|
||||
|
||||
networking_api.create_namespaced_ingress(namespace=namespace, body=ingress)
|
||||
return token
|
||||
|
||||
|
||||
def delete_probe_ingress(namespace: str = "default"):
|
||||
"""Delete the temporary probe ingress."""
|
||||
networking_api = client.NetworkingV1Api()
|
||||
try:
|
||||
networking_api.delete_namespaced_ingress(
|
||||
name="laconic-dns-probe", namespace=namespace
|
||||
)
|
||||
except client.exceptions.ApiException:
|
||||
pass # Ignore if already deleted
|
||||
|
||||
|
||||
def verify_dns_via_probe(
|
||||
hostname: str, namespace: str = "default", timeout: int = 30, poll_interval: int = 2
|
||||
) -> bool:
|
||||
"""Verify DNS by creating temp ingress and probing it.
|
||||
|
||||
This definitively proves that traffic to the hostname reaches this cluster.
|
||||
|
||||
Args:
|
||||
hostname: The hostname to verify
|
||||
namespace: Kubernetes namespace for probe ingress
|
||||
timeout: Total seconds to wait for probe to succeed
|
||||
poll_interval: Seconds between probe attempts
|
||||
|
||||
Returns:
|
||||
True if probe succeeds, False otherwise
|
||||
"""
|
||||
# First check DNS resolves at all
|
||||
if not resolve_hostname(hostname):
|
||||
print(f"DNS FAIL: {hostname} does not resolve")
|
||||
return False
|
||||
|
||||
print(f"Creating probe ingress for {hostname}...")
|
||||
create_probe_ingress(hostname, namespace)
|
||||
|
||||
try:
|
||||
# Wait for Caddy to pick up the ingress
|
||||
time.sleep(3)
|
||||
|
||||
# Poll until success or timeout
|
||||
probe_url = f"http://{hostname}/.well-known/laconic-probe"
|
||||
start_time = time.time()
|
||||
last_error = None
|
||||
|
||||
while time.time() - start_time < timeout:
|
||||
try:
|
||||
response = requests.get(probe_url, timeout=5)
|
||||
# For now, just verify we get a response from this cluster
|
||||
# A more robust check would verify a unique token
|
||||
if response.status_code < 500:
|
||||
print(f"DNS PROBE OK: {hostname} routes to this cluster")
|
||||
return True
|
||||
except requests.RequestException as e:
|
||||
last_error = e
|
||||
|
||||
time.sleep(poll_interval)
|
||||
|
||||
print(f"DNS PROBE FAIL: {hostname} - {last_error}")
|
||||
return False
|
||||
|
||||
finally:
|
||||
print("Cleaning up probe ingress...")
|
||||
delete_probe_ingress(namespace)
|
||||
@ -352,11 +352,15 @@ class ClusterInfo:
|
||||
continue
|
||||
|
||||
if not os.path.isabs(volume_path):
|
||||
print(
|
||||
f"WARNING: {volume_name}:{volume_path} is not absolute, "
|
||||
"cannot bind volume."
|
||||
)
|
||||
continue
|
||||
# For k8s-kind, allow relative paths:
|
||||
# - PV uses /mnt/{volume_name} (path inside kind node)
|
||||
# - extraMounts resolve the relative path to Docker Host
|
||||
if not self.spec.is_kind_deployment():
|
||||
print(
|
||||
f"WARNING: {volume_name}:{volume_path} is not absolute, "
|
||||
"cannot bind volume."
|
||||
)
|
||||
continue
|
||||
|
||||
if self.spec.is_kind_deployment():
|
||||
host_path = client.V1HostPathVolumeSource(
|
||||
@ -453,6 +457,16 @@ class ClusterInfo:
|
||||
if "command" in service_info:
|
||||
cmd = service_info["command"]
|
||||
container_args = cmd if isinstance(cmd, list) else cmd.split()
|
||||
# Add env_from to pull secrets from K8s Secret
|
||||
secret_name = f"{self.app_name}-generated-secrets"
|
||||
env_from = [
|
||||
client.V1EnvFromSource(
|
||||
secret_ref=client.V1SecretEnvSource(
|
||||
name=secret_name,
|
||||
optional=True, # Don't fail if no secrets
|
||||
)
|
||||
)
|
||||
]
|
||||
container = client.V1Container(
|
||||
name=container_name,
|
||||
image=image_to_use,
|
||||
@ -460,6 +474,7 @@ class ClusterInfo:
|
||||
command=container_command,
|
||||
args=container_args,
|
||||
env=envs,
|
||||
env_from=env_from,
|
||||
ports=container_ports if container_ports else None,
|
||||
volume_mounts=volume_mounts,
|
||||
security_context=client.V1SecurityContext(
|
||||
@ -476,7 +491,12 @@ class ClusterInfo:
|
||||
volumes = volumes_for_pod_files(
|
||||
self.parsed_pod_yaml_map, self.spec, self.app_name
|
||||
)
|
||||
image_pull_secrets = [client.V1LocalObjectReference(name="laconic-registry")]
|
||||
registry_config = self.spec.get_image_registry_config()
|
||||
if registry_config:
|
||||
secret_name = f"{self.app_name}-registry"
|
||||
image_pull_secrets = [client.V1LocalObjectReference(name=secret_name)]
|
||||
else:
|
||||
image_pull_secrets = []
|
||||
|
||||
annotations = None
|
||||
labels = {"app": self.app_name}
|
||||
|
||||
@ -29,6 +29,7 @@ from stack_orchestrator.deploy.k8s.helpers import (
|
||||
from stack_orchestrator.deploy.k8s.helpers import (
|
||||
install_ingress_for_kind,
|
||||
wait_for_ingress_in_kind,
|
||||
is_ingress_running,
|
||||
)
|
||||
from stack_orchestrator.deploy.k8s.helpers import (
|
||||
pods_in_deployment,
|
||||
@ -289,22 +290,38 @@ class K8sDeployer(Deployer):
|
||||
self.skip_cluster_management = skip_cluster_management
|
||||
if not opts.o.dry_run:
|
||||
if self.is_kind() and not self.skip_cluster_management:
|
||||
# Create the kind cluster
|
||||
create_cluster(
|
||||
self.kind_cluster_name,
|
||||
str(self.deployment_dir.joinpath(constants.kind_config_filename)),
|
||||
# Create the kind cluster (or reuse existing one)
|
||||
kind_config = str(
|
||||
self.deployment_dir.joinpath(constants.kind_config_filename)
|
||||
)
|
||||
# Ensure the referenced containers are copied into kind
|
||||
load_images_into_kind(
|
||||
self.kind_cluster_name, self.cluster_info.image_set
|
||||
actual_cluster = create_cluster(self.kind_cluster_name, kind_config)
|
||||
if actual_cluster != self.kind_cluster_name:
|
||||
# An existing cluster was found, use it instead
|
||||
self.kind_cluster_name = actual_cluster
|
||||
# Only load locally-built images into kind
|
||||
# Registry images (docker.io, ghcr.io, etc.) will be pulled by k8s
|
||||
local_containers = self.deployment_context.stack.obj.get(
|
||||
"containers", []
|
||||
)
|
||||
if local_containers:
|
||||
# Filter image_set to only images matching local containers
|
||||
local_images = {
|
||||
img
|
||||
for img in self.cluster_info.image_set
|
||||
if any(c in img for c in local_containers)
|
||||
}
|
||||
if local_images:
|
||||
load_images_into_kind(self.kind_cluster_name, local_images)
|
||||
# Note: if no local containers defined, all images come from registries
|
||||
self.connect_api()
|
||||
if self.is_kind() and not self.skip_cluster_management:
|
||||
# Configure ingress controller (not installed by default in kind)
|
||||
install_ingress_for_kind()
|
||||
# Wait for ingress to start
|
||||
# (deployment provisioning will fail unless this is done)
|
||||
wait_for_ingress_in_kind()
|
||||
# Skip if already running (idempotent for shared cluster)
|
||||
if not is_ingress_running():
|
||||
install_ingress_for_kind(self.cluster_info.spec.get_acme_email())
|
||||
# Wait for ingress to start
|
||||
# (deployment provisioning will fail unless this is done)
|
||||
wait_for_ingress_in_kind()
|
||||
# Create RuntimeClass if unlimited_memlock is enabled
|
||||
if self.cluster_info.spec.get_unlimited_memlock():
|
||||
_create_runtime_class(
|
||||
@ -315,6 +332,11 @@ class K8sDeployer(Deployer):
|
||||
else:
|
||||
print("Dry run mode enabled, skipping k8s API connect")
|
||||
|
||||
# Create registry secret if configured
|
||||
from stack_orchestrator.deploy.deployment_create import create_registry_secret
|
||||
|
||||
create_registry_secret(self.cluster_info.spec, self.cluster_info.app_name)
|
||||
|
||||
self._create_volume_data()
|
||||
self._create_deployment()
|
||||
|
||||
|
||||
@ -14,11 +14,13 @@
|
||||
# along with this program. If not, see <http:#www.gnu.org/licenses/>.
|
||||
|
||||
from kubernetes import client, utils, watch
|
||||
from kubernetes.client.exceptions import ApiException
|
||||
import os
|
||||
from pathlib import Path
|
||||
import subprocess
|
||||
import re
|
||||
from typing import Set, Mapping, List, Optional, cast
|
||||
import yaml
|
||||
|
||||
from stack_orchestrator.util import get_k8s_dir, error_exit
|
||||
from stack_orchestrator.opts import opts
|
||||
@ -96,16 +98,227 @@ def _run_command(command: str):
|
||||
return result
|
||||
|
||||
|
||||
def _get_etcd_host_path_from_kind_config(config_file: str) -> Optional[str]:
|
||||
"""Extract etcd host path from kind config extraMounts."""
|
||||
import yaml
|
||||
|
||||
try:
|
||||
with open(config_file, "r") as f:
|
||||
config = yaml.safe_load(f)
|
||||
except Exception:
|
||||
return None
|
||||
|
||||
nodes = config.get("nodes", [])
|
||||
for node in nodes:
|
||||
extra_mounts = node.get("extraMounts", [])
|
||||
for mount in extra_mounts:
|
||||
if mount.get("containerPath") == "/var/lib/etcd":
|
||||
return mount.get("hostPath")
|
||||
return None
|
||||
|
||||
|
||||
def _clean_etcd_keeping_certs(etcd_path: str) -> bool:
|
||||
"""Clean persisted etcd, keeping only TLS certificates.
|
||||
|
||||
When etcd is persisted and a cluster is recreated, kind tries to install
|
||||
resources fresh but they already exist. Instead of trying to delete
|
||||
specific stale resources (blacklist), we keep only the valuable data
|
||||
(caddy TLS certs) and delete everything else (whitelist approach).
|
||||
|
||||
The etcd image is distroless (no shell), so we extract the statically-linked
|
||||
etcdctl binary and run it from alpine which has shell support.
|
||||
|
||||
Returns True if cleanup succeeded, False if no action needed or failed.
|
||||
"""
|
||||
db_path = Path(etcd_path) / "member" / "snap" / "db"
|
||||
# Check existence using docker since etcd dir is root-owned
|
||||
check_cmd = (
|
||||
f"docker run --rm -v {etcd_path}:/etcd:ro alpine:3.19 "
|
||||
"test -f /etcd/member/snap/db"
|
||||
)
|
||||
check_result = subprocess.run(check_cmd, shell=True, capture_output=True)
|
||||
if check_result.returncode != 0:
|
||||
if opts.o.debug:
|
||||
print(f"No etcd snapshot at {db_path}, skipping cleanup")
|
||||
return False
|
||||
|
||||
if opts.o.debug:
|
||||
print(f"Cleaning persisted etcd at {etcd_path}, keeping only TLS certs")
|
||||
|
||||
etcd_image = "gcr.io/etcd-development/etcd:v3.5.9"
|
||||
temp_dir = "/tmp/laconic-etcd-cleanup"
|
||||
|
||||
# Whitelist: prefixes to KEEP - everything else gets deleted
|
||||
keep_prefixes = "/registry/secrets/caddy-system"
|
||||
|
||||
# The etcd image is distroless (no shell). We extract the statically-linked
|
||||
# etcdctl binary and run it from alpine which has shell + jq support.
|
||||
cleanup_script = f"""
|
||||
set -e
|
||||
ALPINE_IMAGE="alpine:3.19"
|
||||
|
||||
# Cleanup previous runs
|
||||
docker rm -f laconic-etcd-cleanup 2>/dev/null || true
|
||||
docker rm -f etcd-extract 2>/dev/null || true
|
||||
docker run --rm -v /tmp:/tmp $ALPINE_IMAGE rm -rf {temp_dir}
|
||||
|
||||
# Create temp dir
|
||||
docker run --rm -v /tmp:/tmp $ALPINE_IMAGE mkdir -p {temp_dir}
|
||||
|
||||
# Extract etcdctl binary (it's statically linked)
|
||||
docker create --name etcd-extract {etcd_image}
|
||||
docker cp etcd-extract:/usr/local/bin/etcdctl /tmp/etcdctl-bin
|
||||
docker rm etcd-extract
|
||||
docker run --rm -v /tmp/etcdctl-bin:/src:ro -v {temp_dir}:/dst $ALPINE_IMAGE \
|
||||
sh -c "cp /src /dst/etcdctl && chmod +x /dst/etcdctl"
|
||||
|
||||
# Copy db to temp location
|
||||
docker run --rm \
|
||||
-v {etcd_path}:/etcd:ro \
|
||||
-v {temp_dir}:/tmp-work \
|
||||
$ALPINE_IMAGE cp /etcd/member/snap/db /tmp-work/etcd-snapshot.db
|
||||
|
||||
# Restore snapshot
|
||||
docker run --rm -v {temp_dir}:/work {etcd_image} \
|
||||
etcdutl snapshot restore /work/etcd-snapshot.db \
|
||||
--data-dir=/work/etcd-data --skip-hash-check 2>/dev/null
|
||||
|
||||
# Start temp etcd (runs the etcd binary, no shell needed)
|
||||
docker run -d --name laconic-etcd-cleanup \
|
||||
-v {temp_dir}/etcd-data:/etcd-data \
|
||||
-v {temp_dir}:/backup \
|
||||
{etcd_image} etcd \
|
||||
--data-dir=/etcd-data \
|
||||
--listen-client-urls=http://0.0.0.0:2379 \
|
||||
--advertise-client-urls=http://localhost:2379
|
||||
|
||||
sleep 3
|
||||
|
||||
# Use alpine with extracted etcdctl to run commands (alpine has shell + jq)
|
||||
# Export caddy secrets
|
||||
docker run --rm \
|
||||
-v {temp_dir}:/backup \
|
||||
--network container:laconic-etcd-cleanup \
|
||||
$ALPINE_IMAGE sh -c \
|
||||
'/backup/etcdctl get --prefix "{keep_prefixes}" -w json \
|
||||
> /backup/kept.json 2>/dev/null || echo "{{}}" > /backup/kept.json'
|
||||
|
||||
# Delete ALL registry keys
|
||||
docker run --rm \
|
||||
-v {temp_dir}:/backup \
|
||||
--network container:laconic-etcd-cleanup \
|
||||
$ALPINE_IMAGE /backup/etcdctl del --prefix /registry
|
||||
|
||||
# Restore kept keys using jq
|
||||
docker run --rm \
|
||||
-v {temp_dir}:/backup \
|
||||
--network container:laconic-etcd-cleanup \
|
||||
$ALPINE_IMAGE sh -c '
|
||||
apk add --no-cache jq >/dev/null 2>&1
|
||||
jq -r ".kvs[] | @base64" /backup/kept.json 2>/dev/null | \
|
||||
while read encoded; do
|
||||
key=$(echo $encoded | base64 -d | jq -r ".key" | base64 -d)
|
||||
val=$(echo $encoded | base64 -d | jq -r ".value" | base64 -d)
|
||||
echo "$val" | /backup/etcdctl put "$key"
|
||||
done
|
||||
' || true
|
||||
|
||||
# Save cleaned snapshot
|
||||
docker exec laconic-etcd-cleanup \
|
||||
etcdctl snapshot save /etcd-data/cleaned-snapshot.db
|
||||
|
||||
docker stop laconic-etcd-cleanup
|
||||
docker rm laconic-etcd-cleanup
|
||||
|
||||
# Restore to temp location first to verify it works
|
||||
docker run --rm \
|
||||
-v {temp_dir}/etcd-data/cleaned-snapshot.db:/data/db:ro \
|
||||
-v {temp_dir}:/restore \
|
||||
{etcd_image} \
|
||||
etcdutl snapshot restore /data/db --data-dir=/restore/new-etcd \
|
||||
--skip-hash-check 2>/dev/null
|
||||
|
||||
# Create timestamped backup of original (kept forever)
|
||||
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
|
||||
docker run --rm -v {etcd_path}:/etcd $ALPINE_IMAGE \
|
||||
cp -a /etcd/member /etcd/member.backup-$TIMESTAMP
|
||||
|
||||
# Replace original with cleaned version
|
||||
docker run --rm -v {etcd_path}:/etcd -v {temp_dir}:/tmp-work $ALPINE_IMAGE \
|
||||
sh -c "rm -rf /etcd/member && mv /tmp-work/new-etcd/member /etcd/member"
|
||||
|
||||
# Cleanup temp files (but NOT the timestamped backup in etcd_path)
|
||||
docker run --rm -v /tmp:/tmp $ALPINE_IMAGE rm -rf {temp_dir}
|
||||
rm -f /tmp/etcdctl-bin
|
||||
"""
|
||||
|
||||
result = subprocess.run(cleanup_script, shell=True, capture_output=True, text=True)
|
||||
if result.returncode != 0:
|
||||
if opts.o.debug:
|
||||
print(f"Warning: etcd cleanup failed: {result.stderr}")
|
||||
return False
|
||||
|
||||
if opts.o.debug:
|
||||
print("Cleaned etcd, kept only TLS certificates")
|
||||
return True
|
||||
|
||||
|
||||
def create_cluster(name: str, config_file: str):
|
||||
"""Create or reuse the single kind cluster for this host.
|
||||
|
||||
There is only one kind cluster per host by design. Multiple deployments
|
||||
share this cluster. If a cluster already exists, it is reused.
|
||||
|
||||
Args:
|
||||
name: Cluster name (used only when creating the first cluster)
|
||||
config_file: Path to kind config file (used only when creating)
|
||||
|
||||
Returns:
|
||||
The name of the cluster being used
|
||||
"""
|
||||
existing = get_kind_cluster()
|
||||
if existing:
|
||||
print(f"Using existing cluster: {existing}")
|
||||
return existing
|
||||
|
||||
# Clean persisted etcd, keeping only TLS certificates
|
||||
etcd_path = _get_etcd_host_path_from_kind_config(config_file)
|
||||
if etcd_path:
|
||||
_clean_etcd_keeping_certs(etcd_path)
|
||||
|
||||
print(f"Creating new cluster: {name}")
|
||||
result = _run_command(f"kind create cluster --name {name} --config {config_file}")
|
||||
if result.returncode != 0:
|
||||
raise DeployerException(f"kind create cluster failed: {result}")
|
||||
return name
|
||||
|
||||
|
||||
def destroy_cluster(name: str):
|
||||
_run_command(f"kind delete cluster --name {name}")
|
||||
|
||||
|
||||
def is_ingress_running() -> bool:
|
||||
"""Check if the Caddy ingress controller is already running in the cluster."""
|
||||
try:
|
||||
core_v1 = client.CoreV1Api()
|
||||
pods = core_v1.list_namespaced_pod(
|
||||
namespace="caddy-system",
|
||||
label_selector=(
|
||||
"app.kubernetes.io/name=caddy-ingress-controller,"
|
||||
"app.kubernetes.io/component=controller"
|
||||
),
|
||||
)
|
||||
for pod in pods.items:
|
||||
if pod.status and pod.status.container_statuses:
|
||||
if pod.status.container_statuses[0].ready is True:
|
||||
if opts.o.debug:
|
||||
print("Caddy ingress controller already running")
|
||||
return True
|
||||
return False
|
||||
except ApiException:
|
||||
return False
|
||||
|
||||
|
||||
def wait_for_ingress_in_kind():
|
||||
core_v1 = client.CoreV1Api()
|
||||
for i in range(20):
|
||||
@ -132,7 +345,7 @@ def wait_for_ingress_in_kind():
|
||||
error_exit("ERROR: Timed out waiting for Caddy ingress to become ready")
|
||||
|
||||
|
||||
def install_ingress_for_kind():
|
||||
def install_ingress_for_kind(acme_email: str = ""):
|
||||
api_client = client.ApiClient()
|
||||
ingress_install = os.path.abspath(
|
||||
get_k8s_dir().joinpath(
|
||||
@ -141,7 +354,34 @@ def install_ingress_for_kind():
|
||||
)
|
||||
if opts.o.debug:
|
||||
print("Installing Caddy ingress controller in kind cluster")
|
||||
utils.create_from_yaml(api_client, yaml_file=ingress_install)
|
||||
|
||||
# Template the YAML with email before applying
|
||||
with open(ingress_install) as f:
|
||||
yaml_content = f.read()
|
||||
|
||||
if acme_email:
|
||||
yaml_content = yaml_content.replace('email: ""', f'email: "{acme_email}"')
|
||||
if opts.o.debug:
|
||||
print(f"Configured Caddy with ACME email: {acme_email}")
|
||||
|
||||
# Apply templated YAML
|
||||
yaml_objects = list(yaml.safe_load_all(yaml_content))
|
||||
utils.create_from_yaml(api_client, yaml_objects=yaml_objects)
|
||||
|
||||
# Patch ConfigMap with ACME email if provided
|
||||
if acme_email:
|
||||
if opts.o.debug:
|
||||
print(f"Configuring ACME email: {acme_email}")
|
||||
core_api = client.CoreV1Api()
|
||||
configmap = core_api.read_namespaced_config_map(
|
||||
name="caddy-ingress-controller-configmap", namespace="caddy-system"
|
||||
)
|
||||
configmap.data["email"] = acme_email
|
||||
core_api.patch_namespaced_config_map(
|
||||
name="caddy-ingress-controller-configmap",
|
||||
namespace="caddy-system",
|
||||
body=configmap,
|
||||
)
|
||||
|
||||
|
||||
def load_images_into_kind(kind_cluster_name: str, image_set: Set[str]):
|
||||
@ -324,6 +564,25 @@ def _generate_kind_mounts(parsed_pod_files, deployment_dir, deployment_context):
|
||||
volume_host_path_map = _get_host_paths_for_volumes(deployment_context)
|
||||
seen_host_path_mounts = set() # Track to avoid duplicate mounts
|
||||
|
||||
# Cluster state backup for offline data recovery (unique per deployment)
|
||||
# etcd contains all k8s state; PKI certs needed to decrypt etcd offline
|
||||
deployment_id = deployment_context.id
|
||||
backup_subdir = f"cluster-backups/{deployment_id}"
|
||||
|
||||
etcd_host_path = _make_absolute_host_path(
|
||||
Path(f"./data/{backup_subdir}/etcd"), deployment_dir
|
||||
)
|
||||
volume_definitions.append(
|
||||
f" - hostPath: {etcd_host_path}\n" f" containerPath: /var/lib/etcd\n"
|
||||
)
|
||||
|
||||
pki_host_path = _make_absolute_host_path(
|
||||
Path(f"./data/{backup_subdir}/pki"), deployment_dir
|
||||
)
|
||||
volume_definitions.append(
|
||||
f" - hostPath: {pki_host_path}\n" f" containerPath: /etc/kubernetes/pki\n"
|
||||
)
|
||||
|
||||
# Note these paths are relative to the location of the pod files (at present)
|
||||
# So we need to fix up to make them correct and absolute because kind assumes
|
||||
# relative to the cwd.
|
||||
|
||||
@ -98,6 +98,17 @@ class Spec:
|
||||
def get_image_registry(self):
|
||||
return self.obj.get(constants.image_registry_key)
|
||||
|
||||
def get_image_registry_config(self) -> typing.Optional[typing.Dict]:
|
||||
"""Returns registry auth config: {server, username, token-env}.
|
||||
|
||||
Used for private container registries like GHCR. The token-env field
|
||||
specifies an environment variable containing the API token/PAT.
|
||||
|
||||
Note: Uses 'registry-credentials' key to avoid collision with
|
||||
'image-registry' key which is for pushing images.
|
||||
"""
|
||||
return self.obj.get("registry-credentials")
|
||||
|
||||
def get_volumes(self):
|
||||
return self.obj.get(constants.volumes_key, {})
|
||||
|
||||
@ -117,6 +128,9 @@ class Spec:
|
||||
def get_http_proxy(self):
|
||||
return self.obj.get(constants.network_key, {}).get(constants.http_proxy_key, [])
|
||||
|
||||
def get_acme_email(self):
|
||||
return self.obj.get(constants.network_key, {}).get("acme-email", "")
|
||||
|
||||
def get_annotations(self):
|
||||
return self.obj.get(constants.annotations_key, {})
|
||||
|
||||
@ -179,6 +193,9 @@ class Spec:
|
||||
def get_deployment_type(self):
|
||||
return self.obj.get(constants.deploy_to_key)
|
||||
|
||||
def get_acme_email(self):
|
||||
return self.obj.get(constants.network_key, {}).get(constants.acme_email_key, "")
|
||||
|
||||
def is_kubernetes_deployment(self):
|
||||
return self.get_deployment_type() in [
|
||||
constants.k8s_kind_deploy_type,
|
||||
|
||||