Add Job and secrets support for k8s-kind deployments #995

Merged
prathamesh merged 26 commits from feature/k8s-jobs into main 2026-03-11 03:56:22 +00:00

26 Commits

Author SHA1 Message Date
aac317503e fix(test): wait for namespace termination before restart
All checks were successful
Lint Checks / Run linter (pull_request) Successful in 6m10s
Deploy Test / Run deploy test suite (pull_request) Successful in 16m28s
K8s Deploy Test / Run deploy test suite on kind/k8s (pull_request) Successful in 25m15s
K8s Deployment Control Test / Run deployment control suite on kind/k8s (pull_request) Successful in 27m14s
Webapp Test / Run webapp test suite (pull_request) Successful in 36m47s
Smoke Test / Run basic test suite (pull_request) Successful in 35m54s
Replace fixed sleep with a polling loop that waits for the deployment
namespace to be fully deleted. Without this, the start command fails
with 403 Forbidden because k8s rejects resource creation in a
namespace that is still terminating.
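The polling pattern described can be sketched in Python (illustrative only — the actual test is a shell script, and the helper name here is hypothetical):

```python
import subprocess
import time

def wait_until_gone(check_cmd: list, timeout: float = 120, interval: float = 5) -> bool:
    """Poll until check_cmd exits non-zero, i.e. the resource is no longer found.

    For a namespace, check_cmd would be e.g. ["kubectl", "get", "namespace", ns]:
    kubectl exits non-zero with NotFound once termination finishes.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = subprocess.run(check_cmd, capture_output=True)
        if result.returncode != 0:
            return True  # resource gone; safe to recreate
        time.sleep(interval)
    return False  # still present after the timeout
```

Unlike a fixed sleep, this returns as soon as the namespace is actually deleted, and fails loudly (returns False) if termination hangs past the timeout.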

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 07:01:27 +00:00
b85c12e4da fix(test): use --skip-cluster-management for stop/start volume test
Some checks failed
Lint Checks / Run linter (pull_request) Successful in 2m50s
Deploy Test / Run deploy test suite (pull_request) Successful in 8m27s
K8s Deploy Test / Run deploy test suite on kind/k8s (pull_request) Failing after 9m20s
K8s Deployment Control Test / Run deployment control suite on kind/k8s (pull_request) Failing after 10m56s
Webapp Test / Run webapp test suite (pull_request) Failing after 14m12s
Smoke Test / Run basic test suite (pull_request) Failing after 15m12s
Recreating a kind cluster in the same CI run fails due to stale
etcd/certs and cgroup detection issues. Use --skip-cluster-management
to reuse the existing cluster, and --delete-volumes to clear PVs so
fresh PVCs can bind on restart.

The volume retention semantics are preserved: bind-mount host path
data survives (filesystem is old), provisioner volumes are fresh
(PVs were deleted).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 06:49:42 +00:00
a1c6c35834 style: wrap long line in cluster_info.py to fix flake8 E501
Some checks failed
Lint Checks / Run linter (pull_request) Successful in 1m58s
Deploy Test / Run deploy test suite (pull_request) Successful in 4m19s
K8s Deployment Control Test / Run deployment control suite on kind/k8s (pull_request) Successful in 5m33s
Webapp Test / Run webapp test suite (pull_request) Successful in 6m47s
Smoke Test / Run basic test suite (pull_request) Successful in 7m30s
K8s Deploy Test / Run deploy test suite on kind/k8s (pull_request) Failing after 9m20s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 06:31:25 +00:00
91f4e5fe38 fix(k8s): use distinct app label for job pods
Some checks failed
Lint Checks / Run linter (pull_request) Failing after 1m34s
Deploy Test / Run deploy test suite (pull_request) Successful in 3m17s
Smoke Test / Run basic test suite (pull_request) Successful in 4m13s
K8s Deployment Control Test / Run deployment control suite on kind/k8s (pull_request) Successful in 4m33s
Webapp Test / Run webapp test suite (pull_request) Successful in 4m57s
K8s Deploy Test / Run deploy test suite on kind/k8s (pull_request) Successful in 5m20s
Job pod templates used the same app={deployment_id} label as
deployment pods, causing pods_in_deployment() to return both.
This made the logs command warn about multiple pods and pick
the wrong one.

Use app={deployment_id}-job for job pod templates so they are
not matched by pods_in_deployment(). The Job metadata itself
retains the original app label for stack-level queries.
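The labeling split can be sketched with plain dicts (illustrative; the real code builds kubernetes client objects, and these function names are hypothetical):

```python
def deployment_pod_labels(deployment_id: str) -> dict:
    # deployment pods keep the plain app label that pods_in_deployment() selects on
    return {"app": deployment_id}

def job_pod_labels(deployment_id: str) -> dict:
    # job pod templates get a distinct app value so the deployment selector skips them
    return {"app": f"{deployment_id}-job"}

def matches_deployment_selector(labels: dict, deployment_id: str) -> bool:
    # dict-level equivalent of the label selector app={deployment_id}
    return labels.get("app") == deployment_id
```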

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 06:26:03 +00:00
68ef9de016 fix(k8s): resolve internal job compose files from data/compose-jobs
Some checks failed
Lint Checks / Run linter (pull_request) Successful in 2m5s
Deploy Test / Run deploy test suite (pull_request) Successful in 4m43s
K8s Deploy Test / Run deploy test suite on kind/k8s (pull_request) Failing after 5m4s
K8s Deployment Control Test / Run deployment control suite on kind/k8s (pull_request) Failing after 6m26s
Webapp Test / Run webapp test suite (pull_request) Successful in 7m36s
Smoke Test / Run basic test suite (pull_request) Successful in 5m32s
resolve_job_compose_file() used Path(stack).parent.parent for the
internal fallback, which resolved to data/stacks/compose-jobs/ instead
of data/compose-jobs/. This meant deploy create couldn't find job
compose files for internal stacks, so they were never copied to the
deployment directory and never created as k8s Jobs.

Use the same data directory resolution pattern as resolve_compose_file.
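The path bug can be illustrated with pathlib (a sketch: the directory layout matches the message, but the function names and compose-file naming convention are assumptions):

```python
from pathlib import Path

def job_compose_path_buggy(stack_file: str, job: str) -> Path:
    # BUG: for a stack file at data/stacks/<name>/stack.yml,
    # parent.parent is data/stacks, not data
    return Path(stack_file).parent.parent / "compose-jobs" / f"docker-compose-{job}.yml"

def job_compose_path_fixed(data_dir: str, job: str) -> Path:
    # resolve against the data directory itself, mirroring resolve_compose_file
    return Path(data_dir) / "compose-jobs" / f"docker-compose-{job}.yml"
```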

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 06:15:15 +00:00
a1b5220e40 fix(test): prevent set -e from killing kubectl queries in test checks
Some checks failed
Lint Checks / Run linter (pull_request) Successful in 2m5s
Deploy Test / Run deploy test suite (pull_request) Successful in 4m52s
K8s Deploy Test / Run deploy test suite on kind/k8s (pull_request) Failing after 6m31s
K8s Deployment Control Test / Run deployment control suite on kind/k8s (pull_request) Successful in 6m58s
Webapp Test / Run webapp test suite (pull_request) Successful in 7m49s
Smoke Test / Run basic test suite (pull_request) Successful in 6m15s
kubectl commands that query jobs or pod specs exit non-zero when the
resource doesn't exist yet. Under set -e, a bare command substitution
like var=$(kubectl ...) aborts the script silently. Add || true so
the polling loop and assertion logic can handle failures gracefully.
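The same tolerant-query idea, expressed as a Python analogue of `out=$(kubectl ...) || true` (illustrative; the test itself is shell, and the helper name is hypothetical):

```python
import subprocess

def tolerant_query(args: list) -> str:
    # never raise or abort on a non-zero exit (resource not created yet);
    # return empty output and let the polling/assertion logic decide
    result = subprocess.run(args, capture_output=True, text=True)
    return result.stdout if result.returncode == 0 else ""
```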

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 05:59:35 +00:00
464215c72a fix(test): replace empty secrets key instead of appending duplicate
Some checks failed
Lint Checks / Run linter (pull_request) Successful in 3m19s
Deploy Test / Run deploy test suite (pull_request) Successful in 5m57s
K8s Deploy Test / Run deploy test suite on kind/k8s (pull_request) Failing after 6m32s
K8s Deployment Control Test / Run deployment control suite on kind/k8s (pull_request) Successful in 7m31s
Webapp Test / Run webapp test suite (pull_request) Successful in 8m1s
Smoke Test / Run basic test suite (pull_request) Successful in 7m38s
deploy init already writes 'secrets: {}' into the spec file. The test
was appending a second secrets block via heredoc, which ruamel.yaml
rejects as a duplicate key. Use sed to replace the empty value instead.
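The edit can be sketched as a text replacement (Python analogue of the sed command; the secret-entry format shown is a hypothetical example, not the documented spec syntax):

```python
def fill_secrets(spec_text: str, secret_name: str) -> str:
    # replace the empty mapping written by `deploy init` rather than
    # appending a second secrets: block, which would be a duplicate key
    return spec_text.replace("secrets: {}", f"secrets:\n  {secret_name}:")
```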

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 05:34:37 +00:00
108f13a09b fix(test): wait for kind cluster cleanup before recreating
Some checks failed
Lint Checks / Run linter (pull_request) Successful in 1m58s
K8s Deploy Test / Run deploy test suite on kind/k8s (pull_request) Failing after 4m11s
Deploy Test / Run deploy test suite (pull_request) Successful in 4m43s
K8s Deployment Control Test / Run deployment control suite on kind/k8s (pull_request) Successful in 6m22s
Webapp Test / Run webapp test suite (pull_request) Successful in 7m42s
Smoke Test / Run basic test suite (pull_request) Successful in 6m26s
Replace the fixed `sleep 20` with a polling loop that waits for
`kind get clusters` to report no clusters. The previous approach
was flaky on CI runners where Docker takes longer to tear down
cgroup hierarchies after `kind delete cluster`.


Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 05:26:48 +00:00
d64046df55 Revert "fix(test): reuse kind cluster on stop/start cycle in deploy test"
This reverts commit 35f179b755.
2026-03-10 05:24:00 +00:00
35f179b755 fix(test): reuse kind cluster on stop/start cycle in deploy test
Use --skip-cluster-management to avoid destroying and recreating the
kind cluster during the stop/start volume retention test. The second
kind create fails on some CI runners due to cgroup detection issues.

Use --delete-volumes to clear PVs so fresh PVCs can bind on restart.
Bind-mount data survives on the host filesystem; provisioner volumes
are recreated fresh.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 05:13:26 +00:00
1375f209d3 test(k8s): add tests for jobs, secrets, labels, and namespace isolation
Add a job compose file for the test stack and extend the k8s deploy
test to verify new features:
- Namespace isolation: pod exists in laconic-{id}, not default
- Stack labels: app.kubernetes.io/stack label set on pods
- Job completion: test-job runs to completion (status.succeeded=1)
- Secrets: spec secrets: key results in envFrom secretRef on pod

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 05:06:31 +00:00
241cd75671 fix(test): use deployment namespace in k8s control test
Some checks failed
Lint Checks / Run linter (pull_request) Successful in 1m55s
Deploy Test / Run deploy test suite (pull_request) Successful in 4m46s
K8s Deploy Test / Run deploy test suite on kind/k8s (pull_request) Failing after 6m2s
K8s Deployment Control Test / Run deployment control suite on kind/k8s (pull_request) Successful in 6m13s
Webapp Test / Run webapp test suite (pull_request) Successful in 7m2s
Smoke Test / Run basic test suite (pull_request) Successful in 5m22s
The deployment control test queried pods with raw kubectl but never
specified a namespace. Since pods now live in laconic-{deployment_id}
instead of default, the query returned empty results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 04:53:52 +00:00
183a188874 ci: upgrade Kind to v0.25.0 and pin kubectl to v1.31.2
Some checks failed
Lint Checks / Run linter (pull_request) Successful in 2m1s
Deploy Test / Run deploy test suite (pull_request) Successful in 4m42s
K8s Deployment Control Test / Run deployment control suite on kind/k8s (pull_request) Failing after 6m39s
Webapp Test / Run webapp test suite (pull_request) Successful in 6m55s
K8s Deploy Test / Run deploy test suite on kind/k8s (pull_request) Successful in 7m7s
Smoke Test / Run basic test suite (pull_request) Successful in 5m20s
Kind v0.20.0 defaults to k8s v1.27.3 which fails on newer CI runners
(kubelet cgroups issue). Upgrade to Kind v0.25.0 (k8s v1.31.2) and
pin kubectl to match.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 04:40:06 +00:00
517e102830 fix(k8s): use deployment namespace for pod/container lookups
pods_in_deployment() and containers_in_pod() hardcoded
namespace="default", but pods are created in the deployment-specific
namespace (laconic-{cluster-id}). This caused logs() to return
"Pods not running" even when pods were healthy.

Add namespace parameter to both functions and pass
self.k8s_namespace from the logs() caller.
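The signature change can be sketched as follows (a sketch modeled on the kubernetes Python client's `CoreV1Api.list_namespaced_pod`; the surrounding details are assumptions):

```python
def pods_in_deployment(core_api, deployment_id: str, namespace: str = "default"):
    # previously the namespace was hardcoded to "default"; pods actually live
    # in the deployment-specific namespace (e.g. laconic-{cluster-id}),
    # so callers such as logs() now pass self.k8s_namespace explicitly
    return core_api.list_namespaced_pod(
        namespace, label_selector=f"app={deployment_id}").items
```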

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 04:40:06 +00:00
ef07b2c86e k8s: extract basename from stack path for labels
Stack.name contains the full absolute path from the spec file's
"stack:" key (e.g. /home/.../stacks/hyperlane-minio). K8s labels
must be <= 63 bytes and alphanumeric. Extract just the directory
basename (e.g. "hyperlane-minio") before using it as a label value.
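A minimal sketch of the extraction (function name hypothetical; the truncation is a defensive extra, not stated in the commit):

```python
from pathlib import Path

def stack_label_value(stack_path: str) -> str:
    # use only the directory basename; a full path contains '/' and can
    # easily exceed the 63-character limit on k8s label values
    return Path(stack_path).name[:63]
```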

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 04:40:06 +00:00
7c8a4d91e7 k8s: add start() hook for post-deployment k8s resource creation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 04:40:06 +00:00
b8702f0bfc k8s: add app.kubernetes.io/stack label to pods and jobs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 04:40:06 +00:00
d7a742032e fix(webapp): use YAML round-trip instead of raw string append in _fixup_url_spec
The secrets: {} key added by init_operation for k8s deployments became
the last key in the spec file, breaking the raw string append that
assumed network: was always last. Replace with proper YAML load/modify/dump.
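The load/modify/dump approach can be sketched with PyYAML for brevity (the commit uses ruamel.yaml, and the key layout shown here is hypothetical):

```python
import yaml

def fixup_url_spec(spec_text: str, url: str) -> str:
    # parse, modify, and re-serialize instead of appending raw strings
    # that assume a particular key (e.g. network:) is last in the file
    spec = yaml.safe_load(spec_text)
    spec["http-proxy"] = [{"host-name": url}]  # hypothetical key layout
    return yaml.safe_dump(spec)
```

Because the whole document is round-tripped, it no longer matters whether `secrets: {}` or any other key happens to come last.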

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 04:40:06 +00:00
b77037c73d fix: remove shadowed os import in cluster_info
Inline `import os` at line 663 shadowed the top-level import,
causing flake8 F402.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 04:40:06 +00:00
8d65cb13a0 fix(k8s): copy configmap dirs for jobs-only stacks during deploy create
The k8s configmap directory copying was inside the `for pod in pods:`
loop. For jobs-only stacks (no pods), the loop never executes, so
configmap files were never copied into the deployment directory.
The ConfigMaps were created as empty objects, leaving volume mounts
with no files.

Move the k8s configmap copying outside the pod loop so it runs
regardless of whether the stack has pods.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 04:40:06 +00:00
47f3068e70 fix(k8s): include job volumes in PVC/ConfigMap/PV creation
For jobs-only stacks, named_volumes_from_pod_files() returned empty
because it only scanned parsed_pod_yaml_map. This caused ConfigMaps
and PVCs declared in the spec to be silently skipped.

- Add _all_named_volumes() helper that scans both pod and job maps
- Guard update() against empty parsed_pod_yaml_map (uncaught 404)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 04:40:06 +00:00
9b304b8990 fix(k8s): remove job-name label that conflicts with k8s auto-label
Kubernetes automatically adds a job-name label to Job pod templates
matching the full Job name. Our custom job-name label used the short
name, causing a 422 validation error. Let k8s manage this label.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 04:40:06 +00:00
e0a8477326 fix(k8s): skip Deployment creation for jobs-only stacks
When a stack defines only jobs: (no pods:), the parsed_pod_yaml_map
is empty. Creating a Deployment with no containers causes a 422 error
from the k8s API. Skip Deployment creation when there are no pods.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 04:40:06 +00:00
74deb3f8d6 feat(k8s): add Job support for non-Helm k8s-kind deployments
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 04:40:06 +00:00
589ed3cf69 docs: update CLI reference to match actual code
cli.md:
- Document `start`/`stop` as preferred commands (`up`/`down` as legacy)
- Add --skip-cluster-management flag for start and stop
- Add --delete-volumes flag for stop
- Add missing subcommands: restart, exec, status, port, push-images, run-job
- Add --helm-chart option to deploy create
- Reorganize deploy vs deployment sections for clarity

deployment_patterns.md:
- Add missing --stack flag to deploy create example

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 04:38:49 +00:00
641052558a feat: add secrets support for k8s deployments
Adds a `secrets:` key to spec.yml that references pre-existing k8s
Secrets by name. SO mounts them as envFrom.secretRef on all pod
containers. Secret contents are managed out-of-band by the operator.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 04:38:48 +00:00