Compare commits

...

12 Commits

Author SHA1 Message Date
A. F. Dudley
5a0f573b0e Allow relative volume paths for k8s-kind deployments
For k8s-kind, relative paths (e.g., ./data/rpc-config) are resolved to
$DEPLOYMENT_DIR/path by _make_absolute_host_path() during kind config
generation. This provides Docker Host persistence that survives cluster
restarts.

Previously, validation threw an exception before paths could be resolved,
making it impossible to use relative paths for persistent storage.

Changes:
- deployment_create.py: Skip relative path check for k8s-kind
- cluster_info.py: Allow relative paths to reach PV generation
- docs/deployment_patterns.md: Document volume persistence patterns

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 23:26:13 -05:00
A. F. Dudley
ee89f9e87d Fix repo root path calculation (4 parents from stack path)
2026-02-02 22:49:19 -05:00
A. F. Dudley
600eb93b4d Add --spec-file option to restart and auto-detect GitOps spec
- Add --spec-file option to specify spec location in repo
- Auto-detect deployment/spec.yml in repo as GitOps location
- Fall back to deployment dir if no repo spec found

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 22:48:19 -05:00
A. F. Dudley
ca3153bb78 Fix restart command for GitOps deployments
- Remove init_operation() from restart - don't regenerate spec from
  commands.py defaults, use existing git-tracked spec.yml instead
- Add docs/deployment_patterns.md documenting GitOps workflow
- Add pre-commit rule to CLAUDE.md
- Fix line length issues in helpers.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 22:18:19 -05:00
A. F. Dudley
be334ca39f Use docker for etcd existence check (root-owned dir)
The etcd directory is root-owned, so a plain shell `test -f` fails.
Use docker with a volume mount to check file existence.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 19:31:45 -05:00
A. F. Dudley
0213ec5d7d Keep timestamped backup of etcd forever
Create member.backup-YYYYMMDD-HHMMSS before cleaning.
Each cluster recreation creates a new backup, preserving history.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 19:30:13 -05:00
A. F. Dudley
a4d8592815 Preserve original etcd backup until restore is verified
Move original to .bak, move new into place, then delete bak.
If anything fails before the swap, original remains intact.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 19:28:53 -05:00
A. F. Dudley
8d6e50b3ae Use whitelist approach for etcd cleanup
Instead of trying to delete specific stale resources (blacklist),
keep only the valuable data (caddy TLS certs) and delete everything
else. This is more robust as we don't need to maintain a list of
all possible stale resources.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 19:27:59 -05:00
A. F. Dudley
51e65857b9 Fix etcd cleanup to use docker for root-owned files
Use docker containers with volume mounts to handle all file
operations on root-owned etcd directories, avoiding the need
for sudo on the host.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 19:22:41 -05:00
A. F. Dudley
ba9f51116d Clear stale CNI resources from persisted etcd before cluster creation
When etcd is persisted (for certificate backup) and a cluster is
recreated, kind tries to install CNI (kindnet) fresh but the
persisted etcd already has those resources, causing 'AlreadyExists'
errors and cluster creation failure.

This fix:
- Detects etcd mount path from kind config
- Before cluster creation, clears stale CNI resources (kindnet, coredns)
- Preserves certificate and other important data

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 19:21:00 -05:00
A. F. Dudley
5214bc8c0c Fix Caddy ingress ACME email and RBAC issues
- Add acme_email_key constant for spec.yml parsing
- Add get_acme_email() method to Spec class
- Modify install_ingress_for_kind() to patch ConfigMap with email
- Pass acme-email from spec to ingress installation
- Add 'delete' verb to leases RBAC for certificate lock cleanup

The acme-email field in spec.yml was previously ignored, causing
Let's Encrypt to fail with "unable to parse email address".
The missing delete permission on leases caused lock cleanup failures.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 19:13:10 -05:00
A. F. Dudley
10716e9d44 feat(deploy): add deployment restart command
Add `laconic-so deployment restart` command that:
- Pulls latest code from stack git repository
- Regenerates spec.yml from stack's commands.py
- Verifies DNS if hostname changed (with --force to skip)
- Syncs deployment directory preserving cluster ID and data
- Stops and restarts deployment with --skip-cluster-management

Also stores stack-source path in deployment.yml during create
for automatic stack location on restart.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-31 19:05:27 -05:00
11 changed files with 684 additions and 24 deletions

View File

@@ -8,6 +8,7 @@ NEVER assume your hypotheses are true without evidence
ALWAYS clearly state when something is a hypothesis
ALWAYS use evidence from the systems you're interacting with to support your claims and hypotheses
ALWAYS run `pre-commit run --all-files` before committing changes
## Key Principles

docs/deployment_patterns.md (new file, 114 lines)
View File

@@ -0,0 +1,114 @@
# Deployment Patterns
## GitOps Pattern
For production deployments, we recommend a GitOps approach where your deployment configuration is tracked in version control.
### Overview
- **spec.yml is your source of truth**: Maintain it in your operator repository
- **Don't regenerate on every restart**: Run `deploy init` once, then customize and commit
- **Use restart for updates**: The restart command respects your git-tracked spec.yml
### Workflow
1. **Initial setup**: Run `deploy init` once to generate a spec.yml template
2. **Customize and commit**: Edit spec.yml with your configuration (hostnames, resources, etc.) and commit to your operator repo
3. **Deploy from git**: Use the committed spec.yml for deployments
4. **Update via git**: Make changes in git, then restart to apply
```bash
# Initial setup (run once)
laconic-so --stack my-stack deploy init --output spec.yml
# Customize for your environment
vim spec.yml # Set hostname, resources, etc.
# Commit to your operator repository
git add spec.yml
git commit -m "Add my-stack deployment configuration"
git push
# On deployment server: deploy from git-tracked spec
laconic-so deploy create \
--spec-file /path/to/operator-repo/spec.yml \
--deployment-dir my-deployment
laconic-so deployment --dir my-deployment start
```
### Updating Deployments
When you need to update a deployment:
```bash
# 1. Make changes in your operator repo
vim /path/to/operator-repo/spec.yml
git commit -am "Update configuration"
git push
# 2. On deployment server: pull and restart
cd /path/to/operator-repo && git pull
laconic-so deployment --dir my-deployment restart
```
The `restart` command:
- Pulls latest code from the stack repository
- Uses your git-tracked spec.yml (does NOT regenerate from defaults)
- Syncs the deployment directory
- Restarts services
### Anti-patterns
**Don't do this:**
```bash
# BAD: Regenerating spec on every deployment
laconic-so --stack my-stack deploy init --output spec.yml
laconic-so deploy create --spec-file spec.yml ...
```
This overwrites your customizations with defaults from the stack's `commands.py`.
**Do this instead:**
```bash
# GOOD: Use your git-tracked spec
git pull # Get latest spec.yml from your operator repo
laconic-so deployment --dir my-deployment restart
```
## Volume Persistence in k8s-kind
k8s-kind has 3 storage layers:
- **Docker Host**: The physical server running Docker
- **Kind Node**: A Docker container simulating a k8s node
- **Pod Container**: Your workload
For k8s-kind, volumes with paths are mounted from Docker Host → Kind Node → Pod via extraMounts.
| spec.yml volume | Storage Location | Survives Pod Restart | Survives Cluster Restart |
|-----------------|------------------|---------------------|-------------------------|
| `vol:` (empty) | Kind Node PVC | ✅ | ❌ |
| `vol: ./data/x` | Docker Host | ✅ | ✅ |
| `vol: /abs/path`| Docker Host | ✅ | ✅ |
**Recommendation**: Always use paths for data you want to keep. Relative paths
(e.g., `./data/rpc-config`) resolve to `$DEPLOYMENT_DIR/data/rpc-config` on the
Docker Host.
### Example
```yaml
# In spec.yml
volumes:
rpc-config: ./data/rpc-config # Persists to $DEPLOYMENT_DIR/data/rpc-config
chain-data: ./data/chain # Persists to $DEPLOYMENT_DIR/data/chain
temp-cache: # Empty = Kind Node PVC (lost on cluster delete)
```
### The Anti-pattern
Empty-path volumes appear persistent because they survive pod restarts (the data
lives in the Kind Node container). However, this data is lost when the kind cluster is
recreated. This "false persistence" has caused data loss when operators assumed
their data was safe.
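As a concrete illustration of the resolution rule above, here is a minimal Python sketch. It is illustrative only: the resolution is actually performed by `_make_absolute_host_path()` during kind config generation, and that helper's exact signature may differ.
```python
from pathlib import Path

# Illustration only: hypothetical helper mirroring the behaviour described
# above (relative volume paths are anchored at the deployment directory).
def resolve_volume_host_path(volume_path: str, deployment_dir: Path) -> Path:
    path = Path(volume_path)
    if path.is_absolute():
        return path  # /abs/path is used verbatim on the Docker Host
    return (deployment_dir / path).resolve()  # ./data/x -> $DEPLOYMENT_DIR/data/x

# Example: ./data/rpc-config under /srv/deployments/my-deployment
print(resolve_volume_host_path("./data/rpc-config",
                               Path("/srv/deployments/my-deployment")))
# -> /srv/deployments/my-deployment/data/rpc-config
```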

View File

@@ -44,3 +44,4 @@ unlimited_memlock_key = "unlimited-memlock"
runtime_class_key = "runtime-class"
high_memlock_runtime = "high-memlock"
high_memlock_spec_filename = "high-memlock-spec.json"
acme_email_key = "acme-email"

View File

@@ -93,6 +93,7 @@ rules:
- get
- create
- update
- delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding

View File

@@ -15,7 +15,9 @@
import click
from pathlib import Path
import subprocess
import sys
import time
from stack_orchestrator import constants
from stack_orchestrator.deploy.images import push_images_operation
from stack_orchestrator.deploy.deploy import (
@@ -228,3 +230,176 @@ def run_job(ctx, job_name, helm_release):
ctx.obj = make_deploy_context(ctx)
run_job_operation(ctx, job_name, helm_release)
@command.command()
@click.option("--stack-path", help="Path to stack git repo (overrides stored path)")
@click.option(
"--spec-file", help="Path to GitOps spec.yml in repo (e.g., deployment/spec.yml)"
)
@click.option("--config-file", help="Config file to pass to deploy init")
@click.option(
"--force",
is_flag=True,
default=False,
help="Skip DNS verification",
)
@click.option(
"--expected-ip",
help="Expected IP for DNS verification (if different from egress)",
)
@click.pass_context
def restart(ctx, stack_path, spec_file, config_file, force, expected_ip):
"""Pull latest code and restart deployment using git-tracked spec.
GitOps workflow:
1. Operator maintains spec.yml in their git repository
2. This command pulls latest code (including updated spec.yml)
3. If hostname changed, verifies DNS routes to this server
4. Syncs deployment directory with the git-tracked spec
5. Stops and restarts the deployment
Data volumes are always preserved. The cluster is never destroyed.
Stack source resolution (in order):
1. --stack-path argument (if provided)
2. stack-source field in deployment.yml (if stored)
3. Error if neither available
Note: spec.yml should be maintained in git, not regenerated from
commands.py on each restart. Use 'deploy init' only for initial
spec generation, then customize and commit to your operator repo.
"""
from stack_orchestrator.util import get_yaml, get_parsed_deployment_spec
from stack_orchestrator.deploy.deployment_create import create_operation
from stack_orchestrator.deploy.dns_probe import verify_dns_via_probe
deployment_context: DeploymentContext = ctx.obj
# Get current spec info (before git pull)
current_spec = deployment_context.spec
current_http_proxy = current_spec.get_http_proxy()
current_hostname = (
current_http_proxy[0]["host-name"] if current_http_proxy else None
)
# Resolve stack source path
if stack_path:
stack_source = Path(stack_path).resolve()
else:
# Try to get from deployment.yml
deployment_file = (
deployment_context.deployment_dir / constants.deployment_file_name
)
deployment_data = get_yaml().load(open(deployment_file))
stack_source_str = deployment_data.get("stack-source")
if not stack_source_str:
print(
"Error: No stack-source in deployment.yml and --stack-path not provided"
)
print("Use --stack-path to specify the stack git repository location")
sys.exit(1)
stack_source = Path(stack_source_str)
if not stack_source.exists():
print(f"Error: Stack source path does not exist: {stack_source}")
sys.exit(1)
print("=== Deployment Restart ===")
print(f"Deployment dir: {deployment_context.deployment_dir}")
print(f"Stack source: {stack_source}")
print(f"Current hostname: {current_hostname}")
# Step 1: Git pull (brings in updated spec.yml from operator's repo)
print("\n[1/4] Pulling latest code from stack repository...")
git_result = subprocess.run(
["git", "pull"], cwd=stack_source, capture_output=True, text=True
)
if git_result.returncode != 0:
print(f"Git pull failed: {git_result.stderr}")
sys.exit(1)
print(f"Git pull: {git_result.stdout.strip()}")
# Determine spec file location
# Priority: --spec-file argument > repo's deployment/spec.yml > deployment dir
# Stack path is like: repo/stack_orchestrator/data/stacks/stack-name
# So repo root is 4 parents up
repo_root = stack_source.parent.parent.parent.parent
if spec_file:
# Spec file relative to repo root
spec_file_path = repo_root / spec_file
else:
# Try standard GitOps location in repo
gitops_spec = repo_root / "deployment" / "spec.yml"
if gitops_spec.exists():
spec_file_path = gitops_spec
else:
# Fall back to deployment directory
spec_file_path = deployment_context.deployment_dir / "spec.yml"
if not spec_file_path.exists():
print(f"Error: spec.yml not found at {spec_file_path}")
print("For GitOps, add spec.yml to your repo at deployment/spec.yml")
print("Or specify --spec-file with path relative to repo root")
sys.exit(1)
print(f"Using spec: {spec_file_path}")
# Parse spec to check for hostname changes
new_spec_obj = get_parsed_deployment_spec(str(spec_file_path))
new_http_proxy = new_spec_obj.get("network", {}).get("http-proxy", [])
new_hostname = new_http_proxy[0]["host-name"] if new_http_proxy else None
print(f"Spec hostname: {new_hostname}")
# Step 2: DNS verification (only if hostname changed)
if new_hostname and new_hostname != current_hostname:
print(f"\n[2/4] Hostname changed: {current_hostname} -> {new_hostname}")
if force:
print("DNS verification skipped (--force)")
else:
print("Verifying DNS via probe...")
if not verify_dns_via_probe(new_hostname):
print(f"\nDNS verification failed for {new_hostname}")
print("Ensure DNS is configured before restarting.")
print("Use --force to skip this check.")
sys.exit(1)
else:
print("\n[2/4] Hostname unchanged, skipping DNS verification")
# Step 3: Sync deployment directory with spec
print("\n[3/4] Syncing deployment directory...")
deploy_ctx = make_deploy_context(ctx)
create_operation(
deployment_command_context=deploy_ctx,
spec_file=str(spec_file_path),
deployment_dir=str(deployment_context.deployment_dir),
update=True,
network_dir=None,
initial_peers=None,
)
# Reload deployment context with updated spec
deployment_context.init(deployment_context.deployment_dir)
ctx.obj = deployment_context
# Stop deployment
print("\n[4/4] Restarting deployment...")
ctx.obj = make_deploy_context(ctx)
down_operation(
ctx, delete_volumes=False, extra_args_list=[], skip_cluster_management=True
)
# Brief pause to ensure clean shutdown
time.sleep(5)
# Start deployment
up_operation(
ctx, services_list=None, stay_attached=False, skip_cluster_management=True
)
print("\n=== Restart Complete ===")
print("Deployment restarted with git-tracked configuration.")
if new_hostname and new_hostname != current_hostname:
print(f"\nNew hostname: {new_hostname}")
print("Caddy will automatically provision TLS certificate.")

View File

@@ -17,7 +17,7 @@ import click
from importlib import util
import os
from pathlib import Path
from typing import List
from typing import List, Optional
import random
from shutil import copy, copyfile, copytree, rmtree
from secrets import token_hex
@@ -507,11 +507,14 @@ def _copy_files_to_directory(file_paths: List[Path], directory: Path):
copy(path, os.path.join(directory, os.path.basename(path)))
def _create_deployment_file(deployment_dir: Path):
def _create_deployment_file(deployment_dir: Path, stack_source: Optional[Path] = None):
deployment_file_path = deployment_dir.joinpath(constants.deployment_file_name)
cluster = f"{constants.cluster_name_prefix}{token_hex(8)}"
deployment_content = {constants.cluster_id_key: cluster}
if stack_source:
deployment_content["stack-source"] = str(stack_source)
with open(deployment_file_path, "w") as output_file:
output_file.write(f"{constants.cluster_id_key}: {cluster}\n")
get_yaml().dump(deployment_content, output_file)
def _check_volume_definitions(spec):
@@ -519,10 +522,14 @@ def _check_volume_definitions(spec):
for volume_name, volume_path in spec.get_volumes().items():
if volume_path:
if not os.path.isabs(volume_path):
raise Exception(
f"Relative path {volume_path} for volume {volume_name} not "
f"supported for deployment type {spec.get_deployment_type()}"
)
# For k8s-kind: allow relative paths, they'll be resolved
# by _make_absolute_host_path() during kind config generation
if not spec.is_kind_deployment():
deploy_type = spec.get_deployment_type()
raise Exception(
f"Relative path {volume_path} for volume "
f"{volume_name} not supported for {deploy_type}"
)
@click.command()
@@ -616,11 +623,15 @@ def create_operation(
generate_helm_chart(stack_name, spec_file, deployment_dir_path)
return # Exit early for helm chart generation
# Resolve stack source path for restart capability
stack_source = get_stack_path(stack_name)
if update:
# Sync mode: write to temp dir, then copy to deployment dir with backups
temp_dir = Path(tempfile.mkdtemp(prefix="deployment-sync-"))
try:
# Write deployment files to temp dir (skip deployment.yml to preserve cluster ID)
# Write deployment files to temp dir
# (skip deployment.yml to preserve cluster ID)
_write_deployment_files(
temp_dir,
Path(spec_file),
@@ -628,12 +639,14 @@
stack_name,
deployment_type,
include_deployment_file=False,
stack_source=stack_source,
)
# Copy from temp to deployment dir, excluding data volumes and backing up changed files
# Exclude data/* to avoid touching user data volumes
# Exclude config file to preserve deployment settings (XXX breaks passing config vars
# from spec. could warn about this or not exclude...)
# Copy from temp to deployment dir, excluding data volumes
# and backing up changed files.
# Exclude data/* to avoid touching user data volumes.
# Exclude config file to preserve deployment settings
# (XXX breaks passing config vars from spec)
exclude_patterns = ["data", "data/*", constants.config_file_name]
_safe_copy_tree(
temp_dir, deployment_dir_path, exclude_patterns=exclude_patterns
@@ -650,6 +663,7 @@
stack_name,
deployment_type,
include_deployment_file=True,
stack_source=stack_source,
)
# Delegate to the stack's Python code
@@ -670,7 +684,7 @@
)
def _safe_copy_tree(src: Path, dst: Path, exclude_patterns: List[str] = None):
def _safe_copy_tree(src: Path, dst: Path, exclude_patterns: Optional[List[str]] = None):
"""
Recursively copy a directory tree, backing up changed files with .bak suffix.
@@ -721,6 +735,7 @@ def _write_deployment_files(
stack_name: str,
deployment_type: str,
include_deployment_file: bool = True,
stack_source: Optional[Path] = None,
):
"""
Write deployment files to target directory.
@@ -730,7 +745,8 @@
:param parsed_spec: Parsed spec object
:param stack_name: Name of stack
:param deployment_type: Type of deployment
:param include_deployment_file: Whether to create deployment.yml file (skip for update)
:param include_deployment_file: Whether to create deployment.yml (skip for update)
:param stack_source: Path to stack source (git repo) for restart capability
"""
stack_file = get_stack_path(stack_name).joinpath(constants.stack_file_name)
parsed_stack = get_parsed_stack_config(stack_name)
@@ -741,7 +757,7 @@
# Create deployment file if requested
if include_deployment_file:
_create_deployment_file(target_dir)
_create_deployment_file(target_dir, stack_source=stack_source)
# Copy any config variables from the spec file into an env file suitable for compose
_write_config_file(spec_file, target_dir.joinpath(constants.config_file_name))
@@ -805,8 +821,9 @@
)
else:
# TODO:
# this is odd - looks up config dir that matches a volume name, then copies as a mount dir?
# AFAICT this is not used by or relevant to any existing stack - roy
# This is odd - looks up config dir that matches a volume name,
# then copies as a mount dir?
# AFAICT not used by or relevant to any existing stack - roy
# TODO: We should probably only do this if the volume is marked :ro.
for volume_name, volume_path in parsed_spec.get_volumes().items():

View File

@@ -0,0 +1,159 @@
# Copyright © 2024 Vulcanize
# SPDX-License-Identifier: AGPL-3.0
"""DNS verification via temporary ingress probe."""
import secrets
import socket
import time
from typing import Optional
import requests
from kubernetes import client
def get_server_egress_ip() -> str:
"""Get this server's public egress IP via ipify."""
response = requests.get("https://api.ipify.org", timeout=10)
response.raise_for_status()
return response.text.strip()
def resolve_hostname(hostname: str) -> list[str]:
"""Resolve hostname to list of IP addresses."""
try:
_, _, ips = socket.gethostbyname_ex(hostname)
return ips
except socket.gaierror:
return []
def verify_dns_simple(hostname: str, expected_ip: Optional[str] = None) -> bool:
"""Simple DNS verification - check hostname resolves to expected IP.
If expected_ip not provided, uses server's egress IP.
Returns True if hostname resolves to expected IP.
"""
resolved_ips = resolve_hostname(hostname)
if not resolved_ips:
print(f"DNS FAIL: {hostname} does not resolve")
return False
if expected_ip is None:
expected_ip = get_server_egress_ip()
if expected_ip in resolved_ips:
print(f"DNS OK: {hostname} -> {resolved_ips} (includes {expected_ip})")
return True
else:
print(f"DNS WARN: {hostname} -> {resolved_ips} (expected {expected_ip})")
return False
def create_probe_ingress(hostname: str, namespace: str = "default") -> str:
"""Create a temporary ingress for DNS probing.
Returns the probe token that the ingress will respond with.
"""
token = secrets.token_hex(16)
networking_api = client.NetworkingV1Api()
# Create a simple ingress that Caddy will pick up
ingress = client.V1Ingress(
metadata=client.V1ObjectMeta(
name="laconic-dns-probe",
annotations={
"kubernetes.io/ingress.class": "caddy",
"laconic.com/probe-token": token,
},
),
spec=client.V1IngressSpec(
rules=[
client.V1IngressRule(
host=hostname,
http=client.V1HTTPIngressRuleValue(
paths=[
client.V1HTTPIngressPath(
path="/.well-known/laconic-probe",
path_type="Exact",
backend=client.V1IngressBackend(
service=client.V1IngressServiceBackend(
name="caddy-ingress-controller",
port=client.V1ServiceBackendPort(number=80),
)
),
)
]
),
)
]
),
)
networking_api.create_namespaced_ingress(namespace=namespace, body=ingress)
return token
def delete_probe_ingress(namespace: str = "default"):
"""Delete the temporary probe ingress."""
networking_api = client.NetworkingV1Api()
try:
networking_api.delete_namespaced_ingress(
name="laconic-dns-probe", namespace=namespace
)
except client.exceptions.ApiException:
pass # Ignore if already deleted
def verify_dns_via_probe(
hostname: str, namespace: str = "default", timeout: int = 30, poll_interval: int = 2
) -> bool:
"""Verify DNS by creating temp ingress and probing it.
This definitively proves that traffic to the hostname reaches this cluster.
Args:
hostname: The hostname to verify
namespace: Kubernetes namespace for probe ingress
timeout: Total seconds to wait for probe to succeed
poll_interval: Seconds between probe attempts
Returns:
True if probe succeeds, False otherwise
"""
# First check DNS resolves at all
if not resolve_hostname(hostname):
print(f"DNS FAIL: {hostname} does not resolve")
return False
print(f"Creating probe ingress for {hostname}...")
create_probe_ingress(hostname, namespace)
try:
# Wait for Caddy to pick up the ingress
time.sleep(3)
# Poll until success or timeout
probe_url = f"http://{hostname}/.well-known/laconic-probe"
start_time = time.time()
last_error = None
while time.time() - start_time < timeout:
try:
response = requests.get(probe_url, timeout=5)
# For now, just verify we get a response from this cluster
# A more robust check would verify a unique token
if response.status_code < 500:
print(f"DNS PROBE OK: {hostname} routes to this cluster")
return True
except requests.RequestException as e:
last_error = e
time.sleep(poll_interval)
print(f"DNS PROBE FAIL: {hostname} - {last_error}")
return False
finally:
print("Cleaning up probe ingress...")
delete_probe_ingress(namespace)
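For reference, a minimal usage sketch of the probe helpers above, as they might be called from an operator script. The module path and function signatures come from these diffs; the hostname and the kubeconfig handling are placeholder assumptions.
```python
from kubernetes import config

from stack_orchestrator.deploy.dns_probe import (
    verify_dns_simple,
    verify_dns_via_probe,
)

# Assumption: a kubeconfig pointing at the kind cluster is available locally.
config.load_kube_config()

hostname = "rpc.example.com"  # placeholder hostname

# Cheap check first: does the name resolve to this server's egress IP?
if not verify_dns_simple(hostname):
    print("Warning: hostname does not resolve to this server's egress IP")

# Definitive check: create the temporary probe ingress and poll it.
if not verify_dns_via_probe(hostname, timeout=60):
    raise SystemExit(f"DNS for {hostname} does not route to this cluster")
```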

View File

@@ -352,11 +352,15 @@ class ClusterInfo:
continue
if not os.path.isabs(volume_path):
print(
f"WARNING: {volume_name}:{volume_path} is not absolute, "
"cannot bind volume."
)
continue
# For k8s-kind, allow relative paths:
# - PV uses /mnt/{volume_name} (path inside kind node)
# - extraMounts resolve the relative path to Docker Host
if not self.spec.is_kind_deployment():
print(
f"WARNING: {volume_name}:{volume_path} is not absolute, "
"cannot bind volume."
)
continue
if self.spec.is_kind_deployment():
host_path = client.V1HostPathVolumeSource(

View File

@@ -301,7 +301,7 @@ class K8sDeployer(Deployer):
self.connect_api()
if self.is_kind() and not self.skip_cluster_management:
# Configure ingress controller (not installed by default in kind)
install_ingress_for_kind()
install_ingress_for_kind(self.cluster_info.spec.get_acme_email())
# Wait for ingress to start
# (deployment provisioning will fail unless this is done)
wait_for_ingress_in_kind()

View File

@@ -96,7 +96,177 @@ def _run_command(command: str):
return result
def _get_etcd_host_path_from_kind_config(config_file: str) -> Optional[str]:
"""Extract etcd host path from kind config extraMounts."""
import yaml
try:
with open(config_file, "r") as f:
config = yaml.safe_load(f)
except Exception:
return None
nodes = config.get("nodes", [])
for node in nodes:
extra_mounts = node.get("extraMounts", [])
for mount in extra_mounts:
if mount.get("containerPath") == "/var/lib/etcd":
return mount.get("hostPath")
return None
def _clean_etcd_keeping_certs(etcd_path: str) -> bool:
"""Clean persisted etcd, keeping only TLS certificates.
When etcd is persisted and a cluster is recreated, kind tries to install
resources fresh but they already exist. Instead of trying to delete
specific stale resources (blacklist), we keep only the valuable data
(caddy TLS certs) and delete everything else (whitelist approach).
The etcd image is distroless (no shell), so we extract the statically-linked
etcdctl binary and run it from alpine which has shell support.
Returns True if cleanup succeeded, False if no action needed or failed.
"""
db_path = Path(etcd_path) / "member" / "snap" / "db"
# Check existence using docker since etcd dir is root-owned
check_cmd = (
f"docker run --rm -v {etcd_path}:/etcd:ro alpine:3.19 "
"test -f /etcd/member/snap/db"
)
check_result = subprocess.run(check_cmd, shell=True, capture_output=True)
if check_result.returncode != 0:
if opts.o.debug:
print(f"No etcd snapshot at {db_path}, skipping cleanup")
return False
if opts.o.debug:
print(f"Cleaning persisted etcd at {etcd_path}, keeping only TLS certs")
etcd_image = "gcr.io/etcd-development/etcd:v3.5.9"
temp_dir = "/tmp/laconic-etcd-cleanup"
# Whitelist: prefixes to KEEP - everything else gets deleted
keep_prefixes = "/registry/secrets/caddy-system"
# The etcd image is distroless (no shell). We extract the statically-linked
# etcdctl binary and run it from alpine which has shell + jq support.
cleanup_script = f"""
set -e
ALPINE_IMAGE="alpine:3.19"
# Cleanup previous runs
docker rm -f laconic-etcd-cleanup 2>/dev/null || true
docker rm -f etcd-extract 2>/dev/null || true
docker run --rm -v /tmp:/tmp $ALPINE_IMAGE rm -rf {temp_dir}
# Create temp dir
docker run --rm -v /tmp:/tmp $ALPINE_IMAGE mkdir -p {temp_dir}
# Extract etcdctl binary (it's statically linked)
docker create --name etcd-extract {etcd_image}
docker cp etcd-extract:/usr/local/bin/etcdctl /tmp/etcdctl-bin
docker rm etcd-extract
docker run --rm -v /tmp/etcdctl-bin:/src:ro -v {temp_dir}:/dst $ALPINE_IMAGE \
sh -c "cp /src /dst/etcdctl && chmod +x /dst/etcdctl"
# Copy db to temp location
docker run --rm \
-v {etcd_path}:/etcd:ro \
-v {temp_dir}:/tmp-work \
$ALPINE_IMAGE cp /etcd/member/snap/db /tmp-work/etcd-snapshot.db
# Restore snapshot
docker run --rm -v {temp_dir}:/work {etcd_image} \
etcdutl snapshot restore /work/etcd-snapshot.db \
--data-dir=/work/etcd-data --skip-hash-check 2>/dev/null
# Start temp etcd (runs the etcd binary, no shell needed)
docker run -d --name laconic-etcd-cleanup \
-v {temp_dir}/etcd-data:/etcd-data \
-v {temp_dir}:/backup \
{etcd_image} etcd \
--data-dir=/etcd-data \
--listen-client-urls=http://0.0.0.0:2379 \
--advertise-client-urls=http://localhost:2379
sleep 3
# Use alpine with extracted etcdctl to run commands (alpine has shell + jq)
# Export caddy secrets
docker run --rm \
-v {temp_dir}:/backup \
--network container:laconic-etcd-cleanup \
$ALPINE_IMAGE sh -c \
'/backup/etcdctl get --prefix "{keep_prefixes}" -w json \
> /backup/kept.json 2>/dev/null || echo "{{}}" > /backup/kept.json'
# Delete ALL registry keys
docker run --rm \
-v {temp_dir}:/backup \
--network container:laconic-etcd-cleanup \
$ALPINE_IMAGE /backup/etcdctl del --prefix /registry
# Restore kept keys using jq
docker run --rm \
-v {temp_dir}:/backup \
--network container:laconic-etcd-cleanup \
$ALPINE_IMAGE sh -c '
apk add --no-cache jq >/dev/null 2>&1
jq -r ".kvs[] | @base64" /backup/kept.json 2>/dev/null | \
while read encoded; do
key=$(echo $encoded | base64 -d | jq -r ".key" | base64 -d)
val=$(echo $encoded | base64 -d | jq -r ".value" | base64 -d)
echo "$val" | /backup/etcdctl put "$key"
done
' || true
# Save cleaned snapshot
docker exec laconic-etcd-cleanup \
etcdctl snapshot save /etcd-data/cleaned-snapshot.db
docker stop laconic-etcd-cleanup
docker rm laconic-etcd-cleanup
# Restore to temp location first to verify it works
docker run --rm \
-v {temp_dir}/etcd-data/cleaned-snapshot.db:/data/db:ro \
-v {temp_dir}:/restore \
{etcd_image} \
etcdutl snapshot restore /data/db --data-dir=/restore/new-etcd \
--skip-hash-check 2>/dev/null
# Create timestamped backup of original (kept forever)
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
docker run --rm -v {etcd_path}:/etcd $ALPINE_IMAGE \
cp -a /etcd/member /etcd/member.backup-$TIMESTAMP
# Replace original with cleaned version
docker run --rm -v {etcd_path}:/etcd -v {temp_dir}:/tmp-work $ALPINE_IMAGE \
sh -c "rm -rf /etcd/member && mv /tmp-work/new-etcd/member /etcd/member"
# Cleanup temp files (but NOT the timestamped backup in etcd_path)
docker run --rm -v /tmp:/tmp $ALPINE_IMAGE rm -rf {temp_dir}
rm -f /tmp/etcdctl-bin
"""
result = subprocess.run(cleanup_script, shell=True, capture_output=True, text=True)
if result.returncode != 0:
if opts.o.debug:
print(f"Warning: etcd cleanup failed: {result.stderr}")
return False
if opts.o.debug:
print("Cleaned etcd, kept only TLS certificates")
return True
def create_cluster(name: str, config_file: str):
# Clean persisted etcd, keeping only TLS certificates
etcd_path = _get_etcd_host_path_from_kind_config(config_file)
if etcd_path:
_clean_etcd_keeping_certs(etcd_path)
result = _run_command(f"kind create cluster --name {name} --config {config_file}")
if result.returncode != 0:
raise DeployerException(f"kind create cluster failed: {result}")
@@ -132,7 +302,7 @@ def wait_for_ingress_in_kind():
error_exit("ERROR: Timed out waiting for Caddy ingress to become ready")
def install_ingress_for_kind():
def install_ingress_for_kind(acme_email: str = ""):
api_client = client.ApiClient()
ingress_install = os.path.abspath(
get_k8s_dir().joinpath(
@@ -143,6 +313,21 @@ def install_ingress_for_kind():
print("Installing Caddy ingress controller in kind cluster")
utils.create_from_yaml(api_client, yaml_file=ingress_install)
# Patch ConfigMap with acme email if provided
if acme_email:
core_v1 = client.CoreV1Api()
configmap = core_v1.read_namespaced_config_map(
name="caddy-ingress-controller-configmap", namespace="caddy-system"
)
configmap.data["email"] = acme_email
core_v1.patch_namespaced_config_map(
name="caddy-ingress-controller-configmap",
namespace="caddy-system",
body=configmap,
)
if opts.o.debug:
print(f"Patched Caddy ConfigMap with email: {acme_email}")
def load_images_into_kind(kind_cluster_name: str, image_set: Set[str]):
for image in image_set:

View File

@@ -179,6 +179,9 @@ class Spec:
def get_deployment_type(self):
return self.obj.get(constants.deploy_to_key)
def get_acme_email(self):
return self.obj.get(constants.network_key, {}).get(constants.acme_email_key, "")
def is_kubernetes_deployment(self):
return self.get_deployment_type() in [
constants.k8s_kind_deploy_type,