stack-orchestrator/docs/bug-laconic-so-etcd-cleanup.md
A. F. Dudley 0b52fc99d7 fix: ashburn relay playbooks and document DZ tunnel ACL root cause
2026-03-07 01:44:25 +00:00

Bug: laconic-so etcd cleanup wipes core kubernetes service

Summary

_clean_etcd_keeping_certs() in laconic-stack-orchestrator 1.1.0 deletes the kubernetes service from etcd, breaking cluster networking on restart.

Component

stack_orchestrator/deploy/k8s/helpers.py, function _clean_etcd_keeping_certs()

Reproduction

  1. Deploy with laconic-so to a k8s-kind target with persisted etcd (hostPath mount in kind-config.yml)
  2. laconic-so deployment --dir <dir> stop (destroys cluster)
  3. laconic-so deployment --dir <dir> start (recreates cluster with cleaned etcd)

Symptoms

  • kindnet pods enter CrashLoopBackOff with: panic: unable to load in-cluster configuration, KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined
  • kubectl get svc kubernetes -n default returns NotFound
  • coredns, caddy, local-path-provisioner stuck in Pending (no CNI without kindnet)
  • No pods can be scheduled
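
The second symptom is the quickest one to check from a script. The following is a minimal sketch, assuming kubectl is on the PATH and pointed at the affected cluster. The helper name kubernetes_service_missing and its injectable run parameter are hypothetical, chosen here for illustration only.

```python
import subprocess


def kubernetes_service_missing(run=subprocess.run):
    """Return True if `kubectl get svc kubernetes -n default` reports NotFound.

    `run` defaults to subprocess.run; it is a parameter only so the check
    can be exercised without a live cluster (an assumption of this sketch).
    """
    result = run(
        ["kubectl", "get", "svc", "kubernetes", "-n", "default"],
        capture_output=True,
        text=True,
    )
    # kubectl exits non-zero and writes the NotFound error to stderr
    # when the service object does not exist.
    return result.returncode != 0 and "NotFound" in result.stderr
```

Calling kubernetes_service_missing() after the restart in step 3 should return True on an affected cluster, matching the kubectl symptom listed above.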

Root Cause

_clean_etcd_keeping_certs() uses a whitelist that only preserves /registry/secrets/caddy-system keys. All other etcd keys are deleted, including /registry/services/specs/default/kubernetes — the core kubernetes ClusterIP service that kube-apiserver auto-creates.

When the kind cluster starts with the cleaned etcd, kube-apiserver sees the existing etcd data and does not re-create the kubernetes service. kindnet depends on the KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT environment variables, which the kubelet injects into pods from this service. Without them, kindnet panics.
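
The effect of a whitelist that covers only the Caddy secrets can be illustrated with a small model. This is a sketch of the filtering logic as described above, not the code from helpers.py. The function name keys_to_delete and the sample key names other than /registry/services/specs/default/kubernetes are hypothetical.

```python
# Illustrative model only: this is NOT the implementation in helpers.py.
# It mimics a whitelist that preserves a single key prefix.
PRESERVED_PREFIXES = ("/registry/secrets/caddy-system",)


def keys_to_delete(all_keys, preserved_prefixes=PRESERVED_PREFIXES):
    """Return every key that does not start with a preserved prefix."""
    prefixes = tuple(preserved_prefixes)
    return [key for key in all_keys if not key.startswith(prefixes)]


# Sample etcd keys. Only the kubernetes service key is taken from this
# report; the other two names are made up for the example.
sample_keys = [
    "/registry/secrets/caddy-system/example-tls-cert",
    "/registry/services/specs/default/kubernetes",
    "/registry/pods/kube-system/example-pod",
]

deleted = keys_to_delete(sample_keys)
print(deleted)
```

In this model the Caddy secret survives, while the kubernetes service key lands in the deleted list alongside everything else, which is the behavior this report describes.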

Fix Options

  1. Expand the whitelist to include /registry/services/specs/default/kubernetes and other core cluster resources
  2. Fully wipe etcd instead of selective cleanup — let the cluster bootstrap fresh (simpler, but loses Caddy TLS certs)
  3. Don't persist etcd at all — ephemeral etcd means clean state every restart (recommended for kind deployments)
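
Option 1 could look roughly like the sketch below. It assumes the cleanup can be expressed as a per-key keep-or-delete decision. The names PRESERVED_PREFIXES, PRESERVED_EXACT_KEYS, should_preserve, and partition_keys are hypothetical, and the set of "other core cluster resources" is deliberately left open because this report does not enumerate them.

```python
# Hypothetical sketch of fix option 1; names and structure are assumptions,
# not the existing helpers.py code.

# Keys under these prefixes are kept. The trailing slash is a choice made
# in this sketch so that a namespace such as "caddy-system-other" would not
# match by accident.
PRESERVED_PREFIXES = (
    "/registry/secrets/caddy-system/",
)

# Keys kept by exact match.
PRESERVED_EXACT_KEYS = frozenset({
    "/registry/services/specs/default/kubernetes",
    # Other core cluster resources would be added here once identified.
})


def should_preserve(key):
    """Return True if the etcd key must survive the cleanup."""
    return key in PRESERVED_EXACT_KEYS or key.startswith(PRESERVED_PREFIXES)


def partition_keys(all_keys):
    """Split keys into (kept, deleted) lists, preserving input order."""
    kept = [key for key in all_keys if should_preserve(key)]
    deleted = [key for key in all_keys if not should_preserve(key)]
    return kept, deleted


# Example with made-up key names, except the kubernetes service key.
kept, deleted = partition_keys([
    "/registry/secrets/caddy-system/example-tls-cert",
    "/registry/services/specs/default/kubernetes",
    "/registry/pods/default/example-pod",
])
```

Here kept holds the Caddy secret and the kubernetes service key, and deleted holds only the example pod. Whether preserving this one service key is sufficient for a clean restart is not established by this report, so the exact-match set should be treated as a starting point.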

Workaround

Fully delete the kind cluster before start:

kind delete cluster --name <cluster-name>
laconic-so deployment --dir <dir> start

This forces a fresh etcd bootstrap. Downside: everything else deployed to the cluster (DaemonSets, workloads in other namespaces) is destroyed.

Impact

  • Affects any k8s-kind deployment with persisted etcd
  • Cluster is unrecoverable without a full destroy and recreate
  • All non-laconic-so-managed workloads in the cluster are lost