Compare commits

...

12 Commits

Author SHA1 Message Date
A. F. Dudley
5a0f573b0e Allow relative volume paths for k8s-kind deployments
For k8s-kind, relative paths (e.g., ./data/rpc-config) are resolved to
$DEPLOYMENT_DIR/path by _make_absolute_host_path() during kind config
generation. This provides Docker Host persistence that survives cluster
restarts.

Previously, validation threw an exception before paths could be resolved,
making it impossible to use relative paths for persistent storage.

Changes:
- deployment_create.py: Skip relative path check for k8s-kind
- cluster_info.py: Allow relative paths to reach PV generation
- docs/deployment_patterns.md: Document volume persistence patterns

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 23:26:13 -05:00
A. F. Dudley
ee89f9e87d Fix repo root path calculation (4 parents from stack path)
2026-02-02 22:49:19 -05:00
A. F. Dudley
600eb93b4d Add --spec-file option to restart and auto-detect GitOps spec
- Add --spec-file option to specify spec location in repo
- Auto-detect deployment/spec.yml in repo as GitOps location
- Fall back to deployment dir if no repo spec found

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 22:48:19 -05:00
A. F. Dudley
ca3153bb78 Fix restart command for GitOps deployments
- Remove init_operation() from restart - don't regenerate spec from
  commands.py defaults, use existing git-tracked spec.yml instead
- Add docs/deployment_patterns.md documenting GitOps workflow
- Add pre-commit rule to CLAUDE.md
- Fix line length issues in helpers.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 22:18:19 -05:00
A. F. Dudley
be334ca39f Use docker for etcd existence check (root-owned dir)
The etcd directory is root-owned, so a plain shell `test -f` fails.
Use docker with a volume mount to check file existence.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 19:31:45 -05:00
A. F. Dudley
0213ec5d7d Keep timestamped backup of etcd forever
Create member.backup-YYYYMMDD-HHMMSS before cleaning.
Each cluster recreation creates a new backup, preserving history.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 19:30:13 -05:00
A. F. Dudley
a4d8592815 Preserve original etcd backup until restore is verified
Move original to .bak, move new into place, then delete bak.
If anything fails before the swap, original remains intact.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 19:28:53 -05:00
A. F. Dudley
8d6e50b3ae Use whitelist approach for etcd cleanup
Instead of trying to delete specific stale resources (blacklist),
keep only the valuable data (caddy TLS certs) and delete everything
else. This is more robust as we don't need to maintain a list of
all possible stale resources.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 19:27:59 -05:00
A. F. Dudley
51e65857b9 Fix etcd cleanup to use docker for root-owned files
Use docker containers with volume mounts to handle all file
operations on root-owned etcd directories, avoiding the need
for sudo on the host.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 19:22:41 -05:00
A. F. Dudley
ba9f51116d Clear stale CNI resources from persisted etcd before cluster creation
When etcd is persisted (for certificate backup) and a cluster is
recreated, kind tries to install CNI (kindnet) fresh but the
persisted etcd already has those resources, causing 'AlreadyExists'
errors and cluster creation failure.

This fix:
- Detects etcd mount path from kind config
- Before cluster creation, clears stale CNI resources (kindnet, coredns)
- Preserves certificate and other important data

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 19:21:00 -05:00
A. F. Dudley
5214bc8c0c Fix Caddy ingress ACME email and RBAC issues
- Add acme_email_key constant for spec.yml parsing
- Add get_acme_email() method to Spec class
- Modify install_ingress_for_kind() to patch ConfigMap with email
- Pass acme-email from spec to ingress installation
- Add 'delete' verb to leases RBAC for certificate lock cleanup

The acme-email field in spec.yml was previously ignored, causing
Let's Encrypt to fail with "unable to parse email address".
The missing delete permission on leases caused lock cleanup failures.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 19:13:10 -05:00
A. F. Dudley
10716e9d44 feat(deploy): add deployment restart command
Add `laconic-so deployment restart` command that:
- Pulls latest code from stack git repository
- Regenerates spec.yml from stack's commands.py
- Verifies DNS if hostname changed (with --force to skip)
- Syncs deployment directory preserving cluster ID and data
- Stops and restarts deployment with --skip-cluster-management

Also stores stack-source path in deployment.yml during create
for automatic stack location on restart.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-31 19:05:27 -05:00
11 changed files with 684 additions and 24 deletions

View File

@@ -8,6 +8,7 @@ NEVER assume your hypotheses are true without evidence
ALWAYS clearly state when something is a hypothesis
ALWAYS use evidence from the systems you're interacting with to support your claims and hypotheses
ALWAYS run `pre-commit run --all-files` before committing changes
## Key Principles

docs/deployment_patterns.md (new file, 114 lines)
View File

@@ -0,0 +1,114 @@
# Deployment Patterns
## GitOps Pattern
For production deployments, we recommend a GitOps approach where your deployment configuration is tracked in version control.
### Overview
- **spec.yml is your source of truth**: Maintain it in your operator repository
- **Don't regenerate on every restart**: Run `deploy init` once, then customize and commit
- **Use restart for updates**: The restart command respects your git-tracked spec.yml
### Workflow
1. **Initial setup**: Run `deploy init` once to generate a spec.yml template
2. **Customize and commit**: Edit spec.yml with your configuration (hostnames, resources, etc.) and commit to your operator repo
3. **Deploy from git**: Use the committed spec.yml for deployments
4. **Update via git**: Make changes in git, then restart to apply
```bash
# Initial setup (run once)
laconic-so --stack my-stack deploy init --output spec.yml
# Customize for your environment
vim spec.yml # Set hostname, resources, etc.
# Commit to your operator repository
git add spec.yml
git commit -m "Add my-stack deployment configuration"
git push
# On deployment server: deploy from git-tracked spec
laconic-so deploy create \
--spec-file /path/to/operator-repo/spec.yml \
--deployment-dir my-deployment
laconic-so deployment --dir my-deployment start
```
### Updating Deployments
When you need to update a deployment:
```bash
# 1. Make changes in your operator repo
vim /path/to/operator-repo/spec.yml
git commit -am "Update configuration"
git push
# 2. On deployment server: pull and restart
cd /path/to/operator-repo && git pull
laconic-so deployment --dir my-deployment restart
```
The `restart` command:
- Pulls latest code from the stack repository
- Uses your git-tracked spec.yml (does NOT regenerate from defaults)
- Syncs the deployment directory
- Restarts services
### Anti-patterns
**Don't do this:**
```bash
# BAD: Regenerating spec on every deployment
laconic-so --stack my-stack deploy init --output spec.yml
laconic-so deploy create --spec-file spec.yml ...
```
This overwrites your customizations with defaults from the stack's `commands.py`.
**Do this instead:**
```bash
# GOOD: Use your git-tracked spec
git pull # Get latest spec.yml from your operator repo
laconic-so deployment --dir my-deployment restart
```
## Volume Persistence in k8s-kind
k8s-kind has 3 storage layers:
- **Docker Host**: The physical server running Docker
- **Kind Node**: A Docker container simulating a k8s node
- **Pod Container**: Your workload
For k8s-kind, volumes with paths are mounted from Docker Host → Kind Node → Pod via extraMounts.
| spec.yml volume | Storage Location | Survives Pod Restart | Survives Cluster Restart |
|-----------------|------------------|---------------------|-------------------------|
| `vol:` (empty) | Kind Node PVC | ✅ | ❌ |
| `vol: ./data/x` | Docker Host | ✅ | ✅ |
| `vol: /abs/path`| Docker Host | ✅ | ✅ |
**Recommendation**: Always use paths for data you want to keep. Relative paths
(e.g., `./data/rpc-config`) resolve to `$DEPLOYMENT_DIR/data/rpc-config` on the
Docker Host.
### Example
```yaml
# In spec.yml
volumes:
rpc-config: ./data/rpc-config # Persists to $DEPLOYMENT_DIR/data/rpc-config
chain-data: ./data/chain # Persists to $DEPLOYMENT_DIR/data/chain
temp-cache: # Empty = Kind Node PVC (lost on cluster delete)
```
### The Anti-pattern
Empty-path volumes appear persistent because they survive pod restarts (the data
lives in the Kind Node container). However, this data is lost when the kind cluster is
recreated. This "false persistence" has caused data loss when operators assumed
their data was safe.
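As a concrete illustration of the resolution rule above, here is a minimal Python sketch. It is illustrative only: the resolution is actually performed by `_make_absolute_host_path()` during kind config generation, and that helper's exact signature may differ.
```python
from pathlib import Path

# Illustration only: hypothetical helper mirroring the behaviour described
# above (relative volume paths are anchored at the deployment directory).
def resolve_volume_host_path(volume_path: str, deployment_dir: Path) -> Path:
    path = Path(volume_path)
    if path.is_absolute():
        return path  # /abs/path is used verbatim on the Docker Host
    return (deployment_dir / path).resolve()  # ./data/x -> $DEPLOYMENT_DIR/data/x

# Example: ./data/rpc-config under /srv/deployments/my-deployment
print(resolve_volume_host_path("./data/rpc-config",
                               Path("/srv/deployments/my-deployment")))
# -> /srv/deployments/my-deployment/data/rpc-config
```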

View File

@@ -44,3 +44,4 @@ unlimited_memlock_key = "unlimited-memlock"
runtime_class_key = "runtime-class"
high_memlock_runtime = "high-memlock"
high_memlock_spec_filename = "high-memlock-spec.json"
acme_email_key = "acme-email"

View File

@@ -93,6 +93,7 @@ rules:
- get
- create
- update
- delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding

View File

@@ -15,7 +15,9 @@
import click
from pathlib import Path
import subprocess
import sys
import time
from stack_orchestrator import constants
from stack_orchestrator.deploy.images import push_images_operation
from stack_orchestrator.deploy.deploy import (
@@ -228,3 +230,176 @@ def run_job(ctx, job_name, helm_release):
ctx.obj = make_deploy_context(ctx)
run_job_operation(ctx, job_name, helm_release)
@command.command()
@click.option("--stack-path", help="Path to stack git repo (overrides stored path)")
@click.option(
"--spec-file", help="Path to GitOps spec.yml in repo (e.g., deployment/spec.yml)"
)
@click.option("--config-file", help="Config file to pass to deploy init")
@click.option(
"--force",
is_flag=True,
default=False,
help="Skip DNS verification",
)
@click.option(
"--expected-ip",
help="Expected IP for DNS verification (if different from egress)",
)
@click.pass_context
def restart(ctx, stack_path, spec_file, config_file, force, expected_ip):
"""Pull latest code and restart deployment using git-tracked spec.
GitOps workflow:
1. Operator maintains spec.yml in their git repository
2. This command pulls latest code (including updated spec.yml)
3. If hostname changed, verifies DNS routes to this server
4. Syncs deployment directory with the git-tracked spec
5. Stops and restarts the deployment
Data volumes are always preserved. The cluster is never destroyed.
Stack source resolution (in order):
1. --stack-path argument (if provided)
2. stack-source field in deployment.yml (if stored)
3. Error if neither available
Note: spec.yml should be maintained in git, not regenerated from
commands.py on each restart. Use 'deploy init' only for initial
spec generation, then customize and commit to your operator repo.
"""
from stack_orchestrator.util import get_yaml, get_parsed_deployment_spec
from stack_orchestrator.deploy.deployment_create import create_operation
from stack_orchestrator.deploy.dns_probe import verify_dns_via_probe
deployment_context: DeploymentContext = ctx.obj
# Get current spec info (before git pull)
current_spec = deployment_context.spec
current_http_proxy = current_spec.get_http_proxy()
current_hostname = (
current_http_proxy[0]["host-name"] if current_http_proxy else None
)
# Resolve stack source path
if stack_path:
stack_source = Path(stack_path).resolve()
else:
# Try to get from deployment.yml
deployment_file = (
deployment_context.deployment_dir / constants.deployment_file_name
)
deployment_data = get_yaml().load(open(deployment_file))
stack_source_str = deployment_data.get("stack-source")
if not stack_source_str:
print(
"Error: No stack-source in deployment.yml and --stack-path not provided"
)
print("Use --stack-path to specify the stack git repository location")
sys.exit(1)
stack_source = Path(stack_source_str)
if not stack_source.exists():
print(f"Error: Stack source path does not exist: {stack_source}")
sys.exit(1)
print("=== Deployment Restart ===")
print(f"Deployment dir: {deployment_context.deployment_dir}")
print(f"Stack source: {stack_source}")
print(f"Current hostname: {current_hostname}")
# Step 1: Git pull (brings in updated spec.yml from operator's repo)
print("\n[1/4] Pulling latest code from stack repository...")
git_result = subprocess.run(
["git", "pull"], cwd=stack_source, capture_output=True, text=True
)
if git_result.returncode != 0:
print(f"Git pull failed: {git_result.stderr}")
sys.exit(1)
print(f"Git pull: {git_result.stdout.strip()}")
# Determine spec file location
# Priority: --spec-file argument > repo's deployment/spec.yml > deployment dir
# Stack path is like: repo/stack_orchestrator/data/stacks/stack-name
# So repo root is 4 parents up
repo_root = stack_source.parent.parent.parent.parent
if spec_file:
# Spec file relative to repo root
spec_file_path = repo_root / spec_file
else:
# Try standard GitOps location in repo
gitops_spec = repo_root / "deployment" / "spec.yml"
if gitops_spec.exists():
spec_file_path = gitops_spec
else:
# Fall back to deployment directory
spec_file_path = deployment_context.deployment_dir / "spec.yml"
if not spec_file_path.exists():
print(f"Error: spec.yml not found at {spec_file_path}")
print("For GitOps, add spec.yml to your repo at deployment/spec.yml")
print("Or specify --spec-file with path relative to repo root")
sys.exit(1)
print(f"Using spec: {spec_file_path}")
# Parse spec to check for hostname changes
new_spec_obj = get_parsed_deployment_spec(str(spec_file_path))
new_http_proxy = new_spec_obj.get("network", {}).get("http-proxy", [])
new_hostname = new_http_proxy[0]["host-name"] if new_http_proxy else None
print(f"Spec hostname: {new_hostname}")
# Step 2: DNS verification (only if hostname changed)
if new_hostname and new_hostname != current_hostname:
print(f"\n[2/4] Hostname changed: {current_hostname} -> {new_hostname}")
if force:
print("DNS verification skipped (--force)")
else:
print("Verifying DNS via probe...")
if not verify_dns_via_probe(new_hostname):
print(f"\nDNS verification failed for {new_hostname}")
print("Ensure DNS is configured before restarting.")
print("Use --force to skip this check.")
sys.exit(1)
else:
print("\n[2/4] Hostname unchanged, skipping DNS verification")
# Step 3: Sync deployment directory with spec
print("\n[3/4] Syncing deployment directory...")
deploy_ctx = make_deploy_context(ctx)
create_operation(
deployment_command_context=deploy_ctx,
spec_file=str(spec_file_path),
deployment_dir=str(deployment_context.deployment_dir),
update=True,
network_dir=None,
initial_peers=None,
)
# Reload deployment context with updated spec
deployment_context.init(deployment_context.deployment_dir)
ctx.obj = deployment_context
# Stop deployment
print("\n[4/4] Restarting deployment...")
ctx.obj = make_deploy_context(ctx)
down_operation(
ctx, delete_volumes=False, extra_args_list=[], skip_cluster_management=True
)
# Brief pause to ensure clean shutdown
time.sleep(5)
# Start deployment
up_operation(
ctx, services_list=None, stay_attached=False, skip_cluster_management=True
)
print("\n=== Restart Complete ===")
print("Deployment restarted with git-tracked configuration.")
if new_hostname and new_hostname != current_hostname:
print(f"\nNew hostname: {new_hostname}")
print("Caddy will automatically provision TLS certificate.")

View File

@@ -17,7 +17,7 @@ import click
from importlib import util
import os
from pathlib import Path
from typing import List
from typing import List, Optional
import random
from shutil import copy, copyfile, copytree, rmtree
from secrets import token_hex
@@ -507,11 +507,14 @@ def _copy_files_to_directory(file_paths: List[Path], directory: Path):
copy(path, os.path.join(directory, os.path.basename(path)))
def _create_deployment_file(deployment_dir: Path):
def _create_deployment_file(deployment_dir: Path, stack_source: Optional[Path] = None):
deployment_file_path = deployment_dir.joinpath(constants.deployment_file_name)
cluster = f"{constants.cluster_name_prefix}{token_hex(8)}"
deployment_content = {constants.cluster_id_key: cluster}
if stack_source:
deployment_content["stack-source"] = str(stack_source)
with open(deployment_file_path, "w") as output_file:
output_file.write(f"{constants.cluster_id_key}: {cluster}\n")
get_yaml().dump(deployment_content, output_file)
def _check_volume_definitions(spec):
@@ -519,10 +522,14 @@ def _check_volume_definitions(spec):
for volume_name, volume_path in spec.get_volumes().items():
if volume_path:
if not os.path.isabs(volume_path):
raise Exception(
f"Relative path {volume_path} for volume {volume_name} not "
f"supported for deployment type {spec.get_deployment_type()}"
)
# For k8s-kind: allow relative paths, they'll be resolved
# by _make_absolute_host_path() during kind config generation
if not spec.is_kind_deployment():
deploy_type = spec.get_deployment_type()
raise Exception(
f"Relative path {volume_path} for volume "
f"{volume_name} not supported for {deploy_type}"
)
@click.command()
@@ -616,11 +623,15 @@ def create_operation(
generate_helm_chart(stack_name, spec_file, deployment_dir_path)
return # Exit early for helm chart generation
# Resolve stack source path for restart capability
stack_source = get_stack_path(stack_name)
if update:
# Sync mode: write to temp dir, then copy to deployment dir with backups
temp_dir = Path(tempfile.mkdtemp(prefix="deployment-sync-"))
try:
# Write deployment files to temp dir (skip deployment.yml to preserve cluster ID)
# Write deployment files to temp dir
# (skip deployment.yml to preserve cluster ID)
_write_deployment_files(
temp_dir,
Path(spec_file),
@@ -628,12 +639,14 @@
stack_name,
deployment_type,
include_deployment_file=False,
stack_source=stack_source,
)
# Copy from temp to deployment dir, excluding data volumes and backing up changed files
# Exclude data/* to avoid touching user data volumes
# Exclude config file to preserve deployment settings (XXX breaks passing config vars
# from spec. could warn about this or not exclude...)
# Copy from temp to deployment dir, excluding data volumes
# and backing up changed files.
# Exclude data/* to avoid touching user data volumes.
# Exclude config file to preserve deployment settings
# (XXX breaks passing config vars from spec)
exclude_patterns = ["data", "data/*", constants.config_file_name]
_safe_copy_tree(
temp_dir, deployment_dir_path, exclude_patterns=exclude_patterns
@@ -650,6 +663,7 @@
stack_name,
deployment_type,
include_deployment_file=True,
stack_source=stack_source,
)
# Delegate to the stack's Python code
@@ -670,7 +684,7 @@
)
def _safe_copy_tree(src: Path, dst: Path, exclude_patterns: List[str] = None):
def _safe_copy_tree(src: Path, dst: Path, exclude_patterns: Optional[List[str]] = None):
"""
Recursively copy a directory tree, backing up changed files with .bak suffix.
@@ -721,6 +735,7 @@ def _write_deployment_files(
stack_name: str,
deployment_type: str,
include_deployment_file: bool = True,
stack_source: Optional[Path] = None,
):
"""
Write deployment files to target directory.
@@ -730,7 +745,8 @@
:param parsed_spec: Parsed spec object
:param stack_name: Name of stack
:param deployment_type: Type of deployment
:param include_deployment_file: Whether to create deployment.yml file (skip for update)
:param include_deployment_file: Whether to create deployment.yml (skip for update)
:param stack_source: Path to stack source (git repo) for restart capability
"""
stack_file = get_stack_path(stack_name).joinpath(constants.stack_file_name)
parsed_stack = get_parsed_stack_config(stack_name)
@@ -741,7 +757,7 @@
# Create deployment file if requested
if include_deployment_file:
_create_deployment_file(target_dir)
_create_deployment_file(target_dir, stack_source=stack_source)
# Copy any config variables from the spec file into an env file suitable for compose
_write_config_file(spec_file, target_dir.joinpath(constants.config_file_name))
@@ -805,8 +821,9 @@
)
else:
# TODO:
# this is odd - looks up config dir that matches a volume name, then copies as a mount dir?
# AFAICT this is not used by or relevant to any existing stack - roy
# This is odd - looks up config dir that matches a volume name,
# then copies as a mount dir?
# AFAICT not used by or relevant to any existing stack - roy
# TODO: We should probably only do this if the volume is marked :ro.
for volume_name, volume_path in parsed_spec.get_volumes().items():

View File

@@ -0,0 +1,159 @@
# Copyright © 2024 Vulcanize
# SPDX-License-Identifier: AGPL-3.0
"""DNS verification via temporary ingress probe."""
import secrets
import socket
import time
from typing import Optional
import requests
from kubernetes import client
def get_server_egress_ip() -> str:
"""Get this server's public egress IP via ipify."""
response = requests.get("https://api.ipify.org", timeout=10)
response.raise_for_status()
return response.text.strip()
def resolve_hostname(hostname: str) -> list[str]:
"""Resolve hostname to list of IP addresses."""
try:
_, _, ips = socket.gethostbyname_ex(hostname)
return ips
except socket.gaierror:
return []
def verify_dns_simple(hostname: str, expected_ip: Optional[str] = None) -> bool:
"""Simple DNS verification - check hostname resolves to expected IP.
If expected_ip not provided, uses server's egress IP.
Returns True if hostname resolves to expected IP.
"""
resolved_ips = resolve_hostname(hostname)
if not resolved_ips:
print(f"DNS FAIL: {hostname} does not resolve")
return False
if expected_ip is None:
expected_ip = get_server_egress_ip()
if expected_ip in resolved_ips:
print(f"DNS OK: {hostname} -> {resolved_ips} (includes {expected_ip})")
return True
else:
print(f"DNS WARN: {hostname} -> {resolved_ips} (expected {expected_ip})")
return False
def create_probe_ingress(hostname: str, namespace: str = "default") -> str:
"""Create a temporary ingress for DNS probing.
Returns the probe token that the ingress will respond with.
"""
token = secrets.token_hex(16)
networking_api = client.NetworkingV1Api()
# Create a simple ingress that Caddy will pick up
ingress = client.V1Ingress(
metadata=client.V1ObjectMeta(
name="laconic-dns-probe",
annotations={
"kubernetes.io/ingress.class": "caddy",
"laconic.com/probe-token": token,
},
),
spec=client.V1IngressSpec(
rules=[
client.V1IngressRule(
host=hostname,
http=client.V1HTTPIngressRuleValue(
paths=[
client.V1HTTPIngressPath(
path="/.well-known/laconic-probe",
path_type="Exact",
backend=client.V1IngressBackend(
service=client.V1IngressServiceBackend(
name="caddy-ingress-controller",
port=client.V1ServiceBackendPort(number=80),
)
),
)
]
),
)
]
),
)
networking_api.create_namespaced_ingress(namespace=namespace, body=ingress)
return token
def delete_probe_ingress(namespace: str = "default"):
"""Delete the temporary probe ingress."""
networking_api = client.NetworkingV1Api()
try:
networking_api.delete_namespaced_ingress(
name="laconic-dns-probe", namespace=namespace
)
except client.exceptions.ApiException:
pass # Ignore if already deleted
def verify_dns_via_probe(
hostname: str, namespace: str = "default", timeout: int = 30, poll_interval: int = 2
) -> bool:
"""Verify DNS by creating temp ingress and probing it.
This definitively proves that traffic to the hostname reaches this cluster.
Args:
hostname: The hostname to verify
namespace: Kubernetes namespace for probe ingress
timeout: Total seconds to wait for probe to succeed
poll_interval: Seconds between probe attempts
Returns:
True if probe succeeds, False otherwise
"""
# First check DNS resolves at all
if not resolve_hostname(hostname):
print(f"DNS FAIL: {hostname} does not resolve")
return False
print(f"Creating probe ingress for {hostname}...")
create_probe_ingress(hostname, namespace)
try:
# Wait for Caddy to pick up the ingress
time.sleep(3)
# Poll until success or timeout
probe_url = f"http://{hostname}/.well-known/laconic-probe"
start_time = time.time()
last_error = None
while time.time() - start_time < timeout:
try:
response = requests.get(probe_url, timeout=5)
# For now, just verify we get a response from this cluster
# A more robust check would verify a unique token
if response.status_code < 500:
print(f"DNS PROBE OK: {hostname} routes to this cluster")
return True
except requests.RequestException as e:
last_error = e
time.sleep(poll_interval)
print(f"DNS PROBE FAIL: {hostname} - {last_error}")
return False
finally:
print("Cleaning up probe ingress...")
delete_probe_ingress(namespace)
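For reference, a minimal usage sketch of the probe helpers above, as they might be called from an operator script. The module path and function signatures come from these diffs; the hostname and the kubeconfig handling are placeholder assumptions.
```python
from kubernetes import config

from stack_orchestrator.deploy.dns_probe import (
    verify_dns_simple,
    verify_dns_via_probe,
)

# Assumption: a kubeconfig pointing at the kind cluster is available locally.
config.load_kube_config()

hostname = "rpc.example.com"  # placeholder hostname

# Cheap check first: does the name resolve to this server's egress IP?
if not verify_dns_simple(hostname):
    print("Warning: hostname does not resolve to this server's egress IP")

# Definitive check: create the temporary probe ingress and poll it.
if not verify_dns_via_probe(hostname, timeout=60):
    raise SystemExit(f"DNS for {hostname} does not route to this cluster")
```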

View File

@@ -352,11 +352,15 @@ class ClusterInfo:
continue
if not os.path.isabs(volume_path):
print(
f"WARNING: {volume_name}:{volume_path} is not absolute, "
"cannot bind volume."
)
continue
# For k8s-kind, allow relative paths:
# - PV uses /mnt/{volume_name} (path inside kind node)
# - extraMounts resolve the relative path to Docker Host
if not self.spec.is_kind_deployment():
print(
f"WARNING: {volume_name}:{volume_path} is not absolute, "
"cannot bind volume."
)
continue
if self.spec.is_kind_deployment():
host_path = client.V1HostPathVolumeSource(

View File

@@ -301,7 +301,7 @@ class K8sDeployer(Deployer):
self.connect_api()
if self.is_kind() and not self.skip_cluster_management:
# Configure ingress controller (not installed by default in kind)
install_ingress_for_kind()
install_ingress_for_kind(self.cluster_info.spec.get_acme_email())
# Wait for ingress to start
# (deployment provisioning will fail unless this is done)
wait_for_ingress_in_kind()

View File

@@ -96,7 +96,177 @@ def _run_command(command: str):
return result
def _get_etcd_host_path_from_kind_config(config_file: str) -> Optional[str]:
"""Extract etcd host path from kind config extraMounts."""
import yaml
try:
with open(config_file, "r") as f:
config = yaml.safe_load(f)
except Exception:
return None
nodes = config.get("nodes", [])
for node in nodes:
extra_mounts = node.get("extraMounts", [])
for mount in extra_mounts:
if mount.get("containerPath") == "/var/lib/etcd":
return mount.get("hostPath")
return None
def _clean_etcd_keeping_certs(etcd_path: str) -> bool:
"""Clean persisted etcd, keeping only TLS certificates.
When etcd is persisted and a cluster is recreated, kind tries to install
resources fresh but they already exist. Instead of trying to delete
specific stale resources (blacklist), we keep only the valuable data
(caddy TLS certs) and delete everything else (whitelist approach).
The etcd image is distroless (no shell), so we extract the statically-linked
etcdctl binary and run it from alpine which has shell support.
Returns True if cleanup succeeded, False if no action needed or failed.
"""
db_path = Path(etcd_path) / "member" / "snap" / "db"
# Check existence using docker since etcd dir is root-owned
check_cmd = (
f"docker run --rm -v {etcd_path}:/etcd:ro alpine:3.19 "
"test -f /etcd/member/snap/db"
)
check_result = subprocess.run(check_cmd, shell=True, capture_output=True)
if check_result.returncode != 0:
if opts.o.debug:
print(f"No etcd snapshot at {db_path}, skipping cleanup")
return False
if opts.o.debug:
print(f"Cleaning persisted etcd at {etcd_path}, keeping only TLS certs")
etcd_image = "gcr.io/etcd-development/etcd:v3.5.9"
temp_dir = "/tmp/laconic-etcd-cleanup"
# Whitelist: prefixes to KEEP - everything else gets deleted
keep_prefixes = "/registry/secrets/caddy-system"
# The etcd image is distroless (no shell). We extract the statically-linked
# etcdctl binary and run it from alpine which has shell + jq support.
cleanup_script = f"""
set -e
ALPINE_IMAGE="alpine:3.19"
# Cleanup previous runs
docker rm -f laconic-etcd-cleanup 2>/dev/null || true
docker rm -f etcd-extract 2>/dev/null || true
docker run --rm -v /tmp:/tmp $ALPINE_IMAGE rm -rf {temp_dir}
# Create temp dir
docker run --rm -v /tmp:/tmp $ALPINE_IMAGE mkdir -p {temp_dir}
# Extract etcdctl binary (it's statically linked)
docker create --name etcd-extract {etcd_image}
docker cp etcd-extract:/usr/local/bin/etcdctl /tmp/etcdctl-bin
docker rm etcd-extract
docker run --rm -v /tmp/etcdctl-bin:/src:ro -v {temp_dir}:/dst $ALPINE_IMAGE \
sh -c "cp /src /dst/etcdctl && chmod +x /dst/etcdctl"
# Copy db to temp location
docker run --rm \
-v {etcd_path}:/etcd:ro \
-v {temp_dir}:/tmp-work \
$ALPINE_IMAGE cp /etcd/member/snap/db /tmp-work/etcd-snapshot.db
# Restore snapshot
docker run --rm -v {temp_dir}:/work {etcd_image} \
etcdutl snapshot restore /work/etcd-snapshot.db \
--data-dir=/work/etcd-data --skip-hash-check 2>/dev/null
# Start temp etcd (runs the etcd binary, no shell needed)
docker run -d --name laconic-etcd-cleanup \
-v {temp_dir}/etcd-data:/etcd-data \
-v {temp_dir}:/backup \
{etcd_image} etcd \
--data-dir=/etcd-data \
--listen-client-urls=http://0.0.0.0:2379 \
--advertise-client-urls=http://localhost:2379
sleep 3
# Use alpine with extracted etcdctl to run commands (alpine has shell + jq)
# Export caddy secrets
docker run --rm \
-v {temp_dir}:/backup \
--network container:laconic-etcd-cleanup \
$ALPINE_IMAGE sh -c \
'/backup/etcdctl get --prefix "{keep_prefixes}" -w json \
> /backup/kept.json 2>/dev/null || echo "{{}}" > /backup/kept.json'
# Delete ALL registry keys
docker run --rm \
-v {temp_dir}:/backup \
--network container:laconic-etcd-cleanup \
$ALPINE_IMAGE /backup/etcdctl del --prefix /registry
# Restore kept keys using jq
docker run --rm \
-v {temp_dir}:/backup \
--network container:laconic-etcd-cleanup \
$ALPINE_IMAGE sh -c '
apk add --no-cache jq >/dev/null 2>&1
jq -r ".kvs[] | @base64" /backup/kept.json 2>/dev/null | \
while read encoded; do
key=$(echo $encoded | base64 -d | jq -r ".key" | base64 -d)
val=$(echo $encoded | base64 -d | jq -r ".value" | base64 -d)
echo "$val" | /backup/etcdctl put "$key"
done
' || true
# Save cleaned snapshot
docker exec laconic-etcd-cleanup \
etcdctl snapshot save /etcd-data/cleaned-snapshot.db
docker stop laconic-etcd-cleanup
docker rm laconic-etcd-cleanup
# Restore to temp location first to verify it works
docker run --rm \
-v {temp_dir}/etcd-data/cleaned-snapshot.db:/data/db:ro \
-v {temp_dir}:/restore \
{etcd_image} \
etcdutl snapshot restore /data/db --data-dir=/restore/new-etcd \
--skip-hash-check 2>/dev/null
# Create timestamped backup of original (kept forever)
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
docker run --rm -v {etcd_path}:/etcd $ALPINE_IMAGE \
cp -a /etcd/member /etcd/member.backup-$TIMESTAMP
# Replace original with cleaned version
docker run --rm -v {etcd_path}:/etcd -v {temp_dir}:/tmp-work $ALPINE_IMAGE \
sh -c "rm -rf /etcd/member && mv /tmp-work/new-etcd/member /etcd/member"
# Cleanup temp files (but NOT the timestamped backup in etcd_path)
docker run --rm -v /tmp:/tmp $ALPINE_IMAGE rm -rf {temp_dir}
rm -f /tmp/etcdctl-bin
"""
result = subprocess.run(cleanup_script, shell=True, capture_output=True, text=True)
if result.returncode != 0:
if opts.o.debug:
print(f"Warning: etcd cleanup failed: {result.stderr}")
return False
if opts.o.debug:
print("Cleaned etcd, kept only TLS certificates")
return True
def create_cluster(name: str, config_file: str):
# Clean persisted etcd, keeping only TLS certificates
etcd_path = _get_etcd_host_path_from_kind_config(config_file)
if etcd_path:
_clean_etcd_keeping_certs(etcd_path)
result = _run_command(f"kind create cluster --name {name} --config {config_file}")
if result.returncode != 0:
raise DeployerException(f"kind create cluster failed: {result}")
@@ -132,7 +302,7 @@ def wait_for_ingress_in_kind():
error_exit("ERROR: Timed out waiting for Caddy ingress to become ready")
def install_ingress_for_kind():
def install_ingress_for_kind(acme_email: str = ""):
api_client = client.ApiClient()
ingress_install = os.path.abspath(
get_k8s_dir().joinpath(
@@ -143,6 +313,21 @@ def install_ingress_for_kind():
print("Installing Caddy ingress controller in kind cluster")
utils.create_from_yaml(api_client, yaml_file=ingress_install)
# Patch ConfigMap with acme email if provided
if acme_email:
core_v1 = client.CoreV1Api()
configmap = core_v1.read_namespaced_config_map(
name="caddy-ingress-controller-configmap", namespace="caddy-system"
)
configmap.data["email"] = acme_email
core_v1.patch_namespaced_config_map(
name="caddy-ingress-controller-configmap",
namespace="caddy-system",
body=configmap,
)
if opts.o.debug:
print(f"Patched Caddy ConfigMap with email: {acme_email}")
def load_images_into_kind(kind_cluster_name: str, image_set: Set[str]):
for image in image_set:

View File

@@ -179,6 +179,9 @@ class Spec:
def get_deployment_type(self):
return self.obj.get(constants.deploy_to_key)
def get_acme_email(self):
return self.obj.get(constants.network_key, {}).get(constants.acme_email_key, "")
def is_kubernetes_deployment(self):
return self.get_deployment_type() in [
constants.k8s_kind_deploy_type,