Add Job and secrets support for k8s-kind deployments #995

Merged
prathamesh merged 26 commits from feature/k8s-jobs into main 2026-03-11 03:56:22 +00:00
Member
Part of https://plan.wireit.in/deepstack/browse/VUL-315
prathamesh added 7 commits 2026-03-06 09:18:19 +00:00
Adds a `secrets:` key to spec.yml that references pre-existing k8s
Secrets by name. Stack Orchestrator mounts them as envFrom.secretRef
on all pod containers. Secret contents are managed out-of-band by the
operator.
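A minimal sketch of the mounting step, using plain dicts in place of the real pod-template objects (the function name and dict shapes here are illustrative, not the actual implementation):

```python
def add_secret_refs(container: dict, secret_names: list) -> dict:
    """Attach each named, pre-existing k8s Secret as an envFrom.secretRef
    entry on a container spec. Secret contents are never read here."""
    env_from = container.setdefault("envFrom", [])
    for name in secret_names:
        env_from.append({"secretRef": {"name": name}})
    return container

# applied uniformly to every container in every pod
container = add_secret_refs({"name": "app", "image": "app:latest"}, ["db-creds"])
```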

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
cli.md:
- Document `start`/`stop` as preferred commands (`up`/`down` as legacy)
- Add --skip-cluster-management flag for start and stop
- Add --delete-volumes flag for stop
- Add missing subcommands: restart, exec, status, port, push-images, run-job
- Add --helm-chart option to deploy create
- Reorganize deploy vs deployment sections for clarity

deployment_patterns.md:
- Add missing --stack flag to deploy create example

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When a stack defines only jobs: (no pods:), the parsed_pod_yaml_map
is empty. Creating a Deployment with no containers causes a 422 error
from the k8s API. Skip Deployment creation when there are no pods.
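The guard can be sketched like this (names are illustrative; the real code constructs a kubernetes `V1Deployment`):

```python
def maybe_create_deployment(parsed_pod_yaml_map: dict, create_fn):
    # A Deployment whose pod template has zero containers is rejected by
    # the API server with a 422, so skip creation for jobs-only stacks.
    if not parsed_pod_yaml_map:
        return None
    return create_fn(parsed_pod_yaml_map)
```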

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Kubernetes automatically adds a job-name label to Job pod templates
matching the full Job name. Our custom job-name label used the short
name, causing a 422 validation error. Let k8s manage this label.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
For jobs-only stacks, named_volumes_from_pod_files() returned empty
because it only scanned parsed_pod_yaml_map. This caused ConfigMaps
and PVCs declared in the spec to be silently skipped.

- Add _all_named_volumes() helper that scans both pod and job maps
- Guard update() against empty parsed_pod_yaml_map (uncaught 404)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fix(k8s): copy configmap dirs for jobs-only stacks during deploy create
Some checks failed
Lint Checks / Run linter (pull_request) Failing after 2m2s
Deploy Test / Run deploy test suite (pull_request) Successful in 4m55s
K8s Deploy Test / Run deploy test suite on kind/k8s (pull_request) Failing after 5m47s
K8s Deployment Control Test / Run deployment control suite on kind/k8s (pull_request) Failing after 6m38s
Webapp Test / Run webapp test suite (pull_request) Failing after 7m34s
Smoke Test / Run basic test suite (pull_request) Successful in 5m58s
dc60695100
The k8s configmap directory copying was inside the `for pod in pods:`
loop. For jobs-only stacks (no pods), the loop never executes, so
configmap files were never copied into the deployment directory.
The ConfigMaps were created as empty objects, leaving volume mounts
with no files.

Move the k8s configmap copying outside the pod loop so it runs
regardless of whether the stack has pods.
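The restructure, in schematic form (function names are hypothetical stand-ins for the real staging code):

```python
def stage_deployment_files(pods, stage_pod, copy_configmap_dirs):
    for pod in pods:
        stage_pod(pod)        # per-pod compose/config staging
    # moved outside the loop: must also run when `pods` is empty,
    # otherwise jobs-only stacks get empty ConfigMaps
    copy_configmap_dirs()
```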

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
prathamesh added 1 commit 2026-03-06 09:51:51 +00:00
fix: remove shadowed os import in cluster_info
Some checks failed
Lint Checks / Run linter (pull_request) Successful in 1m58s
Deploy Test / Run deploy test suite (pull_request) Successful in 4m55s
K8s Deploy Test / Run deploy test suite on kind/k8s (pull_request) Failing after 5m14s
K8s Deployment Control Test / Run deployment control suite on kind/k8s (pull_request) Failing after 6m5s
Webapp Test / Run webapp test suite (pull_request) Failing after 6m38s
Smoke Test / Run basic test suite (pull_request) Successful in 7m11s
ac73bb2a73
Inline `import os` at line 663 shadowed the top-level import,
causing flake8 F402.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
prathamesh added 1 commit 2026-03-06 10:01:34 +00:00
fix(webapp): use YAML round-trip instead of raw string append in _fixup_url_spec
Some checks failed
Lint Checks / Run linter (pull_request) Successful in 2m9s
Deploy Test / Run deploy test suite (pull_request) Successful in 4m52s
K8s Deploy Test / Run deploy test suite on kind/k8s (pull_request) Failing after 5m23s
K8s Deployment Control Test / Run deployment control suite on kind/k8s (pull_request) Failing after 6m38s
Webapp Test / Run webapp test suite (pull_request) Successful in 7m52s
Smoke Test / Run basic test suite (pull_request) Successful in 8m1s
be8081c62f
The secrets: {} key added by init_operation for k8s deployments became
the last key in the spec file, breaking the raw string append that
assumed network: was always last. Replace with proper YAML load/modify/dump.
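The load/modify/dump pattern looks roughly like this — using the stdlib `json` module as a stand-in serializer so the sketch stays dependency-free; the real fix round-trips YAML, and the key names are illustrative:

```python
import json

def fixup_spec(spec_text: str, hostname: str) -> str:
    spec = json.loads(spec_text)                        # load
    network = spec.setdefault("network", {})
    network["http-proxy"] = [{"host-name": hostname}]   # modify
    return json.dumps(spec, indent=2)                   # dump: key order
    # in the file no longer matters, unlike the raw string append
```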

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
prathamesh requested review from ashwin 2026-03-06 10:11:43 +00:00
prathamesh added 2 commits 2026-03-09 10:01:39 +00:00
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
k8s: add start() hook for post-deployment k8s resource creation
Some checks failed
Lint Checks / Run linter (pull_request) Successful in 2m7s
Deploy Test / Run deploy test suite (pull_request) Successful in 5m5s
K8s Deploy Test / Run deploy test suite on kind/k8s (pull_request) Failing after 5m19s
K8s Deployment Control Test / Run deployment control suite on kind/k8s (pull_request) Failing after 6m32s
Webapp Test / Run webapp test suite (pull_request) Successful in 6m7s
Smoke Test / Run basic test suite (pull_request) Successful in 3m31s
8769df6c35
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
prathamesh added 1 commit 2026-03-09 12:13:18 +00:00
k8s: extract basename from stack path for labels
Some checks failed
Lint Checks / Run linter (pull_request) Successful in 1m59s
Deploy Test / Run deploy test suite (pull_request) Successful in 4m43s
K8s Deploy Test / Run deploy test suite on kind/k8s (pull_request) Failing after 4m50s
K8s Deployment Control Test / Run deployment control suite on kind/k8s (pull_request) Failing after 6m0s
Webapp Test / Run webapp test suite (pull_request) Successful in 8m36s
Smoke Test / Run basic test suite (pull_request) Successful in 2m41s
8530aa3385
Stack.name contains the full absolute path from the spec file's
"stack:" key (e.g. /home/.../stacks/hyperlane-minio). K8s labels
must be <= 63 bytes and alphanumeric. Extract just the directory
basename (e.g. "hyperlane-minio") before using it as a label value.
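A sketch of the extraction (the sanitize/truncate steps reflect the k8s label-value rules; the function name is illustrative):

```python
import os
import re

def stack_label_value(stack: str) -> str:
    """Derive a valid k8s label value from a stack path."""
    base = os.path.basename(stack.rstrip("/"))
    base = re.sub(r"[^A-Za-z0-9._-]", "-", base)  # chars allowed in label values
    return base[:63]                              # 63-byte limit
```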

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
prathamesh added 1 commit 2026-03-09 13:00:11 +00:00
fix(k8s): use deployment namespace for pod/container lookups
Some checks failed
Lint Checks / Run linter (pull_request) Successful in 1m59s
Deploy Test / Run deploy test suite (pull_request) Successful in 4m46s
K8s Deployment Control Test / Run deployment control suite on kind/k8s (pull_request) Failing after 6m49s
Webapp Test / Run webapp test suite (pull_request) Successful in 7m52s
Smoke Test / Run basic test suite (pull_request) Successful in 8m5s
K8s Deploy Test / Run deploy test suite on kind/k8s (pull_request) Failing after 8m44s
36126545ac
pods_in_deployment() and containers_in_pod() hardcoded
namespace="default", but pods are created in the deployment-specific
namespace (laconic-{cluster-id}). This caused logs() to return
"Pods not running" even when pods were healthy.

Add namespace parameter to both functions and pass
self.k8s_namespace from the logs() caller.
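The signature change, sketched with a stand-in API object so it can be exercised without a cluster (the real caller passes a `kubernetes.client.CoreV1Api`):

```python
def pods_in_deployment(core_api, deployment_id: str, namespace: str = "default"):
    # namespace is now an explicit parameter; logs() passes self.k8s_namespace
    return core_api.list_namespaced_pod(
        namespace=namespace,
        label_selector=f"app={deployment_id}",
    )

class FakeCoreV1Api:
    """Stand-in for kubernetes.client.CoreV1Api, for illustration only."""
    def list_namespaced_pod(self, namespace, label_selector):
        return (namespace, label_selector)
```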

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
prathamesh added 1 commit 2026-03-09 13:19:02 +00:00
ci: upgrade Kind to v0.25.0 and pin kubectl to v1.31.2
Some checks failed
Lint Checks / Run linter (pull_request) Successful in 2m0s
Deploy Test / Run deploy test suite (pull_request) Successful in 4m49s
K8s Deployment Control Test / Run deployment control suite on kind/k8s (pull_request) Failing after 6m39s
Webapp Test / Run webapp test suite (pull_request) Successful in 7m4s
Smoke Test / Run basic test suite (pull_request) Successful in 5m29s
K8s Deploy Test / Run deploy test suite on kind/k8s (pull_request) Failing after 9m48s
a312bb5ee7
Kind v0.20.0 defaults to k8s v1.27.3 which fails on newer CI runners
(kubelet cgroups issue). Upgrade to Kind v0.25.0 (k8s v1.31.2) and
pin kubectl to match.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
prathamesh force-pushed feature/k8s-jobs from a312bb5ee7 to 183a188874 2026-03-10 04:41:07 +00:00 Compare
prathamesh added 1 commit 2026-03-10 04:54:17 +00:00
fix(test): use deployment namespace in k8s control test
Some checks failed
Lint Checks / Run linter (pull_request) Successful in 1m55s
Deploy Test / Run deploy test suite (pull_request) Successful in 4m46s
K8s Deploy Test / Run deploy test suite on kind/k8s (pull_request) Failing after 6m2s
K8s Deployment Control Test / Run deployment control suite on kind/k8s (pull_request) Successful in 6m13s
Webapp Test / Run webapp test suite (pull_request) Successful in 7m2s
Smoke Test / Run basic test suite (pull_request) Successful in 5m22s
241cd75671
The deployment control test queried pods with raw kubectl but did not
specify the namespace. Since pods now live in laconic-{deployment_id}
instead of default, the query returned empty results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
prathamesh added 4 commits 2026-03-10 05:27:44 +00:00
Add a job compose file for the test stack and extend the k8s deploy
test to verify new features:
- Namespace isolation: pod exists in laconic-{id}, not default
- Stack labels: app.kubernetes.io/stack label set on pods
- Job completion: test-job runs to completion (status.succeeded=1)
- Secrets: spec secrets: key results in envFrom secretRef on pod

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use --skip-cluster-management to avoid destroying and recreating the
kind cluster during the stop/start volume retention test. The second
kind create fails on some CI runners due to cgroups detection issues.

Use --delete-volumes to clear PVs so fresh PVCs can bind on restart.
Bind-mount data survives on the host filesystem; provisioner volumes
are recreated fresh.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This reverts commit 35f179b755.
fix(test): wait for kind cluster cleanup before recreating
Some checks failed
Lint Checks / Run linter (pull_request) Successful in 1m58s
K8s Deploy Test / Run deploy test suite on kind/k8s (pull_request) Failing after 4m11s
Deploy Test / Run deploy test suite (pull_request) Successful in 4m43s
K8s Deployment Control Test / Run deployment control suite on kind/k8s (pull_request) Successful in 6m22s
Webapp Test / Run webapp test suite (pull_request) Successful in 7m42s
Smoke Test / Run basic test suite (pull_request) Successful in 6m26s
108f13a09b
Replace the fixed `sleep 20` with a polling loop that waits for
`kind get clusters` to report no clusters. The previous approach
was flaky on CI runners where Docker takes longer to tear down
cgroup hierarchies after `kind delete cluster`.
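The polling idea, transcribed into Python with the condition injected as a callable so the loop itself is testable (the actual fix is a shell loop in the CI script; the empty-stdout check for `kind get clusters` is an assumption about kind's output):

```python
import subprocess
import time

def wait_for(condition, timeout: float = 120.0, interval: float = 2.0) -> bool:
    """Poll `condition` until it returns True or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False

def no_kind_clusters() -> bool:
    # `kind get clusters` lists one cluster name per line on stdout;
    # empty stdout is taken to mean no clusters remain
    out = subprocess.run(["kind", "get", "clusters"],
                         capture_output=True, text=True).stdout.strip()
    return out == ""
```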

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
prathamesh added 1 commit 2026-03-10 05:35:18 +00:00
fix(test): replace empty secrets key instead of appending duplicate
Some checks failed
Lint Checks / Run linter (pull_request) Successful in 3m19s
Deploy Test / Run deploy test suite (pull_request) Successful in 5m57s
K8s Deploy Test / Run deploy test suite on kind/k8s (pull_request) Failing after 6m32s
K8s Deployment Control Test / Run deployment control suite on kind/k8s (pull_request) Successful in 7m31s
Webapp Test / Run webapp test suite (pull_request) Successful in 8m1s
Smoke Test / Run basic test suite (pull_request) Successful in 7m38s
464215c72a
deploy init already writes 'secrets: {}' into the spec file. The test
was appending a second secrets block via heredoc, which ruamel.yaml
rejects as a duplicate key. Use sed to replace the empty value instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
prathamesh removed review request for ashwin 2026-03-10 05:38:31 +00:00
prathamesh added 1 commit 2026-03-10 06:00:35 +00:00
fix(test): prevent set -e from killing kubectl queries in test checks
Some checks failed
Lint Checks / Run linter (pull_request) Successful in 2m5s
Deploy Test / Run deploy test suite (pull_request) Successful in 4m52s
K8s Deploy Test / Run deploy test suite on kind/k8s (pull_request) Failing after 6m31s
K8s Deployment Control Test / Run deployment control suite on kind/k8s (pull_request) Successful in 6m58s
Webapp Test / Run webapp test suite (pull_request) Successful in 7m49s
Smoke Test / Run basic test suite (pull_request) Successful in 6m15s
a1b5220e40
kubectl commands that query jobs or pod specs exit non-zero when the
resource doesn't exist yet. Under set -e, a bare command substitution
like var=$(kubectl ...) aborts the script silently. Add || true so
the polling loop and assertion logic can handle failures gracefully.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
prathamesh added 1 commit 2026-03-10 06:16:26 +00:00
fix(k8s): resolve internal job compose files from data/compose-jobs
Some checks failed
Lint Checks / Run linter (pull_request) Successful in 2m5s
Deploy Test / Run deploy test suite (pull_request) Successful in 4m43s
K8s Deploy Test / Run deploy test suite on kind/k8s (pull_request) Failing after 5m4s
K8s Deployment Control Test / Run deployment control suite on kind/k8s (pull_request) Failing after 6m26s
Webapp Test / Run webapp test suite (pull_request) Successful in 7m36s
Smoke Test / Run basic test suite (pull_request) Successful in 5m32s
68ef9de016
resolve_job_compose_file() used Path(stack).parent.parent for the
internal fallback, which resolved to data/stacks/compose-jobs/ instead
of data/compose-jobs/. This meant deploy create couldn't find job
compose files for internal stacks, so they were never copied to the
deployment directory and never created as k8s Jobs.

Use the same data directory resolution pattern as resolve_compose_file.
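The corrected fallback, schematically (the filename pattern and signature are assumptions for illustration):

```python
from pathlib import Path

def resolve_job_compose_file(data_dir: Path, job_name: str) -> Path:
    # internal jobs live under data/compose-jobs/, mirroring the
    # data/compose/ convention used by resolve_compose_file — NOT
    # under Path(stack).parent.parent, which lands in data/stacks/
    return data_dir / "compose-jobs" / f"docker-compose-{job_name}.yml"
```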

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
prathamesh added 1 commit 2026-03-10 06:26:32 +00:00
fix(k8s): use distinct app label for job pods
Some checks failed
Lint Checks / Run linter (pull_request) Failing after 1m34s
Deploy Test / Run deploy test suite (pull_request) Successful in 3m17s
Smoke Test / Run basic test suite (pull_request) Successful in 4m13s
K8s Deployment Control Test / Run deployment control suite on kind/k8s (pull_request) Successful in 4m33s
Webapp Test / Run webapp test suite (pull_request) Successful in 4m57s
K8s Deploy Test / Run deploy test suite on kind/k8s (pull_request) Successful in 5m20s
91f4e5fe38
Job pod templates used the same app={deployment_id} label as
deployment pods, causing pods_in_deployment() to return both.
This made the logs command warn about multiple pods and pick
the wrong one.

Use app={deployment_id}-job for job pod templates so they are
not matched by pods_in_deployment(). The Job metadata itself
retains the original app label for stack-level queries.
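The label split, as a sketch (dict shapes are illustrative):

```python
def job_labels(deployment_id: str):
    """Labels for the Job object vs. its pod template."""
    job_metadata = {"app": deployment_id}             # stack-level queries
    pod_template = {"app": f"{deployment_id}-job"}    # distinct value, so the
    # app={deployment_id} selector in pods_in_deployment() skips job pods
    return job_metadata, pod_template
```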

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
prathamesh added 1 commit 2026-03-10 06:32:01 +00:00
style: wrap long line in cluster_info.py to fix flake8 E501
Some checks failed
Lint Checks / Run linter (pull_request) Successful in 1m58s
Deploy Test / Run deploy test suite (pull_request) Successful in 4m19s
K8s Deployment Control Test / Run deployment control suite on kind/k8s (pull_request) Successful in 5m33s
Webapp Test / Run webapp test suite (pull_request) Successful in 6m47s
Smoke Test / Run basic test suite (pull_request) Successful in 7m30s
K8s Deploy Test / Run deploy test suite on kind/k8s (pull_request) Failing after 9m20s
a1c6c35834
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
prathamesh added 1 commit 2026-03-10 06:51:42 +00:00
fix(test): use --skip-cluster-management for stop/start volume test
Some checks failed
Lint Checks / Run linter (pull_request) Successful in 2m50s
Deploy Test / Run deploy test suite (pull_request) Successful in 8m27s
K8s Deploy Test / Run deploy test suite on kind/k8s (pull_request) Failing after 9m20s
K8s Deployment Control Test / Run deployment control suite on kind/k8s (pull_request) Failing after 10m56s
Webapp Test / Run webapp test suite (pull_request) Failing after 14m12s
Smoke Test / Run basic test suite (pull_request) Failing after 15m12s
b85c12e4da
Recreating a kind cluster in the same CI run fails due to stale
etcd/certs and cgroup detection issues. Use --skip-cluster-management
to reuse the existing cluster, and --delete-volumes to clear PVs so
fresh PVCs can bind on restart.

The volume retention semantics are preserved: bind-mount host path
data survives (filesystem is old), provisioner volumes are fresh
(PVs were deleted).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
prathamesh added 1 commit 2026-03-10 08:13:24 +00:00
fix(test): wait for namespace termination before restart
All checks were successful
Lint Checks / Run linter (pull_request) Successful in 6m10s
Deploy Test / Run deploy test suite (pull_request) Successful in 16m28s
K8s Deploy Test / Run deploy test suite on kind/k8s (pull_request) Successful in 25m15s
K8s Deployment Control Test / Run deployment control suite on kind/k8s (pull_request) Successful in 27m14s
Webapp Test / Run webapp test suite (pull_request) Successful in 36m47s
Smoke Test / Run basic test suite (pull_request) Successful in 35m54s
aac317503e
Replace fixed sleep with a polling loop that waits for the deployment
namespace to be fully deleted. Without this, the start command fails
with 403 Forbidden because k8s rejects resource creation in a
namespace that is still terminating.
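The wait, sketched in Python with the namespace lookup injected as a callable so no cluster is needed to exercise it (the actual fix is a kubectl polling loop in the test script):

```python
import time

def wait_for_namespace_gone(get_namespace, timeout: float = 180.0,
                            interval: float = 2.0) -> bool:
    """Poll until get_namespace() returns None, i.e. the namespace is fully
    deleted. Creating resources in a Terminating namespace is rejected
    by the API server, so start must not proceed before this returns."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_namespace() is None:
            return True
        time.sleep(interval)
    return False
```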

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
prathamesh scheduled this pull request to auto merge when all checks succeed 2026-03-10 09:40:36 +00:00
prathamesh requested review from AFDudley 2026-03-10 09:44:56 +00:00
prathamesh removed review request for AFDudley 2026-03-11 03:55:48 +00:00
prathamesh canceled auto merging this pull request when all checks succeed 2026-03-11 03:56:13 +00:00
prathamesh merged commit 5af6a83fa2 into main 2026-03-11 03:56:22 +00:00
prathamesh deleted branch feature/k8s-jobs 2026-03-11 03:56:22 +00:00
Reference: cerc-io/stack-orchestrator#995