Observability¶
Status: Alpha. Last Updated: 2026-05-30.
vworkspace-operator emits the signals an operator (the person) needs to answer "is this cluster healthy and what happened on it recently". The four surfaces are Prometheus metrics, structured JSON logs, Kubernetes events on every condition transition, and an audit-event stream posted to the control plane's POST /api/agent/events. Each is documented below with the concrete metric names, log fields, and endpoints.
The principle is "the cluster is the source of truth for what is happening; vWorkspace Server is the place a human reads the summary". Everything the operator publishes locally to Kubernetes is also replicated, in coalesced form, into the audit stream that the AI assistant in Discuss watches; nothing important happens on the cluster without leaving a record in both places.
Prometheus metrics¶
The operator exposes metrics on `:8080/metrics` (the controller-runtime default). Reconcile metrics carry a controller label and, where applicable, a result label; the remaining operator-specific labels are listed per metric below.
Operator-specific metrics¶
| Metric | Type | Labels | Meaning |
|---|---|---|---|
| `vworkspace_operator_reconcile_total` | Counter | `controller=<applicationinstance\|operation\|cluster>`, `result=<requeue\|success\|error>` | Reconciles per controller and outcome. |
| `vworkspace_operator_reconcile_duration_seconds` | Histogram | `controller` | Reconcile wall time. Buckets: 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s. |
| `vworkspace_operator_operation_total` | Counter | `type=<Backup\|Restore\|Upgrade\|Migration\|RunCommand\|Runbook>`, `engine=<velero\|workflow\|job\|helm\|helmHookJob\|volsync\|snapshot>`, `outcome=<succeeded\|failed\|cancelled\|blocked>` | Operation completions, useful for SLO calculations per verb and per engine. |
| `vworkspace_operator_pull_job_lag_seconds` | Gauge | (none) | Age (seconds) of the oldest job the operator has fetched but not yet applied. A persistent non-zero gauge indicates a stuck loop. |
| `vworkspace_operator_connectivity_state` | Gauge | `mode=<pull\|push\|gitops>` | 1 connected, 0 reconnecting, -1 disconnected. The same signal feeds `Cluster.status.conditions[Connected]`. |
| `vworkspace_operator_applied_jobs_total` | Counter | (none) | Pull-mode jobs applied successfully (`internal/agent/metrics.go`). |
| `vworkspace_operator_event_buffer_occupancy` | Gauge | (none) | Current depth of the outbound event buffer (Pull mode). Updated by `internal/agent/events.go` on enqueue, flush, and requeue. |
| `vworkspace_operator_managed_namespaces` | Gauge | (none) | Count of namespaces carrying `app.vworkspace.io/managed-by=vworkspace`. |
| `vworkspace_operator_credential_age_seconds` | Gauge | (none) | Age of the current bootstrap credential in seconds since the credentials Secret was last updated or rotated (`internal/agent/metrics.go`, updated on load/persist/rotation). |
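For a quick check without a Prometheus server, the metrics above can be read straight from the text exposition format. The following is a minimal sketch, not the operator's own tooling: the sample text, its values, and the helper names `parse_metrics` and `connectivity` are assumptions made for illustration, and the parser deliberately ignores optional sample timestamps and escaped label values.

```python
import re

# Hypothetical sample of what a scrape of :8080/metrics might contain.
SAMPLE = """\
# HELP vworkspace_operator_connectivity_state 1 connected, 0 reconnecting, -1 disconnected.
# TYPE vworkspace_operator_connectivity_state gauge
vworkspace_operator_connectivity_state{mode="pull"} 1
vworkspace_operator_pull_job_lag_seconds 12.5
vworkspace_operator_reconcile_total{controller="operation",result="success"} 42
vworkspace_operator_reconcile_total{controller="operation",result="error"} 3
"""

# name, optional {labels}, then a single value (no timestamp support in this sketch).
LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comment lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE.match(line)
        if match is None:
            continue  # unsupported line shape; ignored in this sketch
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def connectivity(samples):
    """Translate the connectivity gauge into (mode, human-readable state)."""
    states = {1.0: "connected", 0.0: "reconnecting", -1.0: "disconnected"}
    for name, labels, value in samples:
        if name == "vworkspace_operator_connectivity_state":
            return labels.get("mode"), states.get(value, "unknown")
    return None, "unknown"


if __name__ == "__main__":
    print(connectivity(parse_metrics(SAMPLE)))
```

In practice the text would come from an HTTP GET against the metrics endpoint rather than an inline string.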
Standard controller-runtime metrics¶
The controller-runtime library contributes the usual workqueue, controller, and webhook metrics: controller_runtime_reconcile_total, controller_runtime_reconcile_errors_total, controller_runtime_reconcile_time_seconds, workqueue_depth, workqueue_adds_total, workqueue_retries_total, workqueue_work_duration_seconds, rest_client_requests_total, etc. We do not override these names; standard tooling and dashboards work unchanged.
Recommended SLOs (starting point)¶
| SLO | Target |
|---|---|
| 99% of reconciles complete in under 1s for applicationinstance. | `histogram_quantile(0.99, sum by (le) (rate(vworkspace_operator_reconcile_duration_seconds_bucket{controller="applicationinstance"}[5m]))) < 1` |
| 99% of Backup operations succeed within 30 minutes (catalog default). | Track via `sum(vworkspace_operator_operation_total{type="Backup",outcome="succeeded"}) / sum(vworkspace_operator_operation_total{type="Backup"})`. Both sides are summed so that the label sets match. |
| Pull-job lag stays under 60s. | `vworkspace_operator_pull_job_lag_seconds < 60` |
| Connectivity state is 1 (connected) over the last 5 minutes. | `avg_over_time(vworkspace_operator_connectivity_state[5m]) > 0.99` |
The SLOs are starting points; an organization is expected to tune them to its own targets.
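To make the reconcile-latency SLO concrete, here is a small worked sketch of how a 99th-percentile estimate is derived from cumulative histogram buckets by linear interpolation, which is the approach `histogram_quantile` takes. The bucket bounds are the ones listed for `vworkspace_operator_reconcile_duration_seconds`; the counts are invented for illustration, and raw counts stand in for the per-window increases that `rate(...[5m])` would supply.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) pairs.

    buckets must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Falls in the +Inf bucket (or an empty one): report the
                # highest finite bound reached so far.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative counts for 1000 observed reconciles.
BUCKETS = [
    (0.005, 100), (0.01, 250), (0.025, 500), (0.05, 700),
    (0.1, 850), (0.25, 940), (0.5, 980), (1.0, 995),
    (2.5, 999), (5.0, 1000), (10.0, 1000), (math.inf, 1000),
]

if __name__ == "__main__":
    p99 = histogram_quantile(0.99, BUCKETS)
    print("p99 estimate (s):", round(p99, 3), "SLO met:", p99 < 1)
```

With these counts the 990th observation falls in the 0.5s to 1s bucket, so the estimate lands between those bounds and the SLO holds. The estimate can never be finer than the bucket layout, which is worth remembering when choosing a target close to a bucket boundary.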
Structured logs¶
The operator's logger is the controller-runtime logger (zap under the hood), configured to emit JSON. Every log line carries a stable set of fields:
| Field | Type | When set |
|---|---|---|
| `level` | string | info, warn, error, debug. |
| `ts` | string | RFC3339 timestamp. |
| `msg` | string | The message. |
| `cluster_id` | string | The cluster's identity (`Cluster.metadata.name`). Always set. |
| `org_id` | string | The owning organization identity. Always set. |
| `namespace` | string | The namespace the reconcile is operating in. Set on namespaced reconciles. |
| `application_instance` | string | ApplicationInstance name. Set in the ApplicationInstance reconciler. |
| `operation_id` | string | `Operation.metadata.uid`. Set in the Operation reconciler. |
| `operator_version` | string | The operator binary's version (semver, e.g., v0.4.2). Always set. |
| `controller` | string | Which controller emitted the line (applicationinstance, operation, cluster). |
| `reconcileID` | string | Per-reconcile correlation id. |
Log levels in practice:
- info: every condition transition, every external write (Helm Release apply, Velero Backup create, Workflow create, Job create), reconciliation start and end, audit-event send.
- warn: retryable failures (control plane unreachable but bootstrap credential intact), admission warnings, operations entering Blocked.
- error: non-retryable failures, panics caught by controller-runtime, credential rotation failures.
- debug: per-step reconciliation traces; off by default. Enabled with `--zap-log-level=debug`.
Sensitive values (chart-value secrets, the operator's own bootstrap credential) are redacted (../security/secrets-handling.md). The structured shape lets kubectl logs ... | jq and any log-aggregation backend slice and dice by cluster, org, namespace, application instance, or operation id without parsing strings.
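As an illustration of that kind of slicing, the sketch below filters JSON log lines by exact field match, roughly what a `jq` `select` expression would do. The field names come from the table above; the sample messages, the `op-1`/`op-2` identifiers, and the `filter_logs` helper are hypothetical.

```python
import json

# Hypothetical log lines using the documented field names.
SAMPLE_LOGS = """\
{"level":"info","ts":"2026-05-28T10:07:10Z","msg":"reconcile start","cluster_id":"cluster-prod-1","org_id":"myteam","controller":"operation","operation_id":"op-1"}
{"level":"error","ts":"2026-05-28T10:07:12Z","msg":"backup failed","cluster_id":"cluster-prod-1","org_id":"myteam","controller":"operation","operation_id":"op-2"}
this line is not JSON
{"level":"info","ts":"2026-05-28T10:07:13Z","msg":"audit event sent","cluster_id":"cluster-prod-1","org_id":"myteam","controller":"operation","operation_id":"op-1"}
"""


def filter_logs(lines, **wanted):
    """Yield parsed records whose fields equal every key=value in wanted."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(record, dict):
            continue
        if all(record.get(key) == value for key, value in wanted.items()):
            yield record


if __name__ == "__main__":
    for rec in filter_logs(SAMPLE_LOGS.splitlines(), operation_id="op-1"):
        print(rec["ts"], rec["level"], rec["msg"])
```

The same function works on a live stream by passing `sys.stdin` instead of the sample lines.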
Kubernetes events¶
The operator emits a Kubernetes Event on every condition transition. This makes kubectl describe applicationinstance/<name> and kubectl describe operation/<name> informative without leaving the cluster. The events are:
| Event reason | Object | When |
|---|---|---|
| `ReconcileStarted` | ApplicationInstance | The reconciler begins a generation. |
| `ReconcileSucceeded` | ApplicationInstance | A generation completes with Ready=True. |
| `ReconcileFailed` | ApplicationInstance | A generation completes with Failed=True. The message is the reason. |
| `HelmReleaseUpgraded` | ApplicationInstance | The underlying HelmRelease transitions to a new revision. |
| `OperationAccepted` | Operation | Validation passed; child resource creation imminent. |
| `OperationBlocked` | Operation | A precondition is unmet; the message names the precondition. |
| `OperationRunning` | Operation | The child resource (Backup, Workflow, Job, ...) is running. |
| `OperationSucceeded` | Operation | The child resource succeeded. Outputs are recorded on the Operation. |
| `OperationFailed` | Operation | The child resource failed. The message includes the engine-specific reason. |
| `ConnectivityConnected` | Cluster | The operator established or re-established the outbound connection to Odoo. |
| `ConnectivityLost` | Cluster | The outbound connection has been failing for the configured grace period. |
| `CredentialRotated` | Cluster | The bootstrap credential was rotated. |
Events are namespace-local for ApplicationInstance and Operation; the Cluster events are in vworkspace-system. Standard event-aggregation tools (Kubernetes' built-in event aggregation; Argo Events' event sources; observability backends that ingest events) work without modification.
Audit events to Odoo¶
Significant events are also posted to the control plane's POST /api/agent/events over the operator's outbound channel (Pull mode) or written through the API server (Push mode). The audit payload is:
```json
{
  "cluster_id": "cluster-prod-1",
  "org_id": "myteam",
  "occurred_at": "2026-05-28T10:07:13Z",
  "namespace": "org-myteam",
  "application_instance": "nextcloud-myteam",
  "operation_id": "6b62...",
  "kind": "OperationSucceeded",
  "reason": "VeleroBackupCompleted",
  "message": "velero.io/Backup completed; 412 items, 18.4 GiB",
  "operator_version": "v0.4.2"
}
```
Audit events are idempotent (stable keys per cluster_id + kind + operation_id) and coalesced into batches: every second, or when the buffer reaches a size threshold, whichever comes first. When the link to the control plane is down, events are queued in a bounded local buffer (vworkspace_operator_event_buffer_occupancy) and flushed on reconnect. Buffer overflow is itself reported as a Cluster.status condition and a ClusterEventBufferOverflow event.
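The buffering behavior described above can be sketched as follows. This is an illustrative model only, not the code in `internal/agent/events.go`: the `EventBuffer` class, the capacity and batch-size values, the key format, and the choice to reject new events on overflow are all assumptions.

```python
from collections import deque


def idempotency_key(event):
    """Stable key from the fields named above; a resend maps to the same key."""
    return "{}/{}/{}".format(
        event["cluster_id"], event["kind"], event.get("operation_id", "")
    )


class EventBuffer:
    """Bounded queue that flushes in batches on a size or time trigger."""

    def __init__(self, send, capacity=1000, batch_size=100, interval=1.0):
        self.send = send              # callable(list_of_events) -> bool
        self.capacity = capacity      # assumed bound on queued events
        self.batch_size = batch_size  # assumed size threshold
        self.interval = interval      # seconds between time-based flushes
        self.queue = deque()
        self.overflowed = False
        self.last_flush = 0.0

    def occupancy(self):
        return len(self.queue)

    def enqueue(self, event, now):
        if len(self.queue) >= self.capacity:
            self.overflowed = True    # would surface as a condition and event
            return False
        self.queue.append(event)
        size_hit = len(self.queue) >= self.batch_size
        time_hit = now - self.last_flush >= self.interval
        if size_hit or time_hit:
            self.flush(now)
        return True

    def flush(self, now):
        self.last_flush = now
        while self.queue:
            count = min(self.batch_size, len(self.queue))
            batch = [self.queue.popleft() for _ in range(count)]
            if not self.send(batch):
                # Link is down: put the batch back in its original order.
                self.queue.extendleft(reversed(batch))
                return False
        return True
```

Because each event carries a stable key, replaying a batch after a failed or ambiguous send is safe as long as the receiver deduplicates on that key.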
In Odoo, the AI assistant in Discuss subscribes to the audit stream for the cluster's organization. The human operator and the AI assistant share one timeline.
Health endpoints¶
The operator exposes two endpoints on :8081:
- `/healthz`: liveness. Returns `200 OK` while the process is running; Kubernetes restarts the pod if this fails.
- `/readyz`: readiness. Returns `200 OK` when the operator has loaded CRDs, validated RBAC, established (or re-established) the Odoo connection (Pull mode), and is ready to reconcile. Returns `503` otherwise; Kubernetes withholds traffic while the operator is in this state. The body includes a small JSON summary of which sub-check failed.
A failing /readyz is a recoverable condition; a failing /healthz indicates a deeper bug that is worth reporting as an issue.
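A script can combine the two probes into a single verdict. The sketch below assumes the endpoints are reachable at a URL of your choosing (for example through a port-forward to `:8081`); the helper names and the verdict strings are hypothetical, and because the exact shape of the `/readyz` JSON body is not specified here, the body is passed through as-is.

```python
import json
import urllib.error
import urllib.request


def probe(url, timeout=2.0):
    """Return (status_code, body); status_code is None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # Non-2xx answers such as 503 arrive as HTTPError with a body.
        return err.code, err.read().decode("utf-8", "replace")
    except (urllib.error.URLError, OSError):
        return None, ""


def classify(healthz, readyz):
    """Map two (status, body) probe results to a short verdict string."""
    if healthz[0] != 200:
        return "not live"
    if readyz[0] == 200:
        return "ready"
    try:
        detail = json.loads(readyz[1])
    except ValueError:
        detail = readyz[1]  # body was not JSON; report it verbatim
    return "live but not ready: {}".format(detail)


if __name__ == "__main__":
    base = "http://localhost:8081"  # assumed address, e.g. via port-forward
    print(classify(probe(base + "/healthz"), probe(base + "/readyz")))
```

Keeping `classify` separate from the network call makes the decision logic easy to exercise without a running operator.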
The Cluster CR as the overall health surface¶
The Cluster CR is the single object an operator (or a script, or the AI assistant) reads to learn whether the cluster is operational. Its status summarizes everything else:
```yaml
status:
  observedGeneration: 4
  operatorVersion: v0.4.2
  fluxVersion: v2.3.0
  veleroVersion: v1.14.0
  managedNamespaces:
    - { name: org-myteam, allowedOperationTemplates: ["backup.velero", "restore.velero", "upgrade.helm"] }
  lastHeartbeat: "2026-05-28T10:07:13Z"
  conditions:
    - { type: Connected, status: "True", reason: ControlPlaneReachable, lastTransitionTime: "2026-05-28T08:00:00Z", message: "Last successful round-trip 4s ago" }
    - { type: Authenticated, status: "True", reason: CredentialValid, lastTransitionTime: "2026-05-28T08:00:00Z" }
    - { type: ControllersHealthy, status: "True", reason: AllControllersReady, lastTransitionTime: "2026-05-28T08:00:00Z" }
    - { type: CRDsRegistered, status: "True", reason: SchemaMatches, lastTransitionTime: "2026-05-28T08:00:00Z" }
    - { type: Disconnected, status: "False", reason: NotApplicable }
```
When a problem occurs, the relevant condition flips. The AI assistant in Odoo reads the same status surface and produces the equivalent human-readable summary in Discuss.
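A script reading this status could reduce it to a list of conditions that are not in their healthy state. The sketch below assumes the status has already been loaded into a dictionary (for example from `kubectl get cluster ... -o json`); the expected values mirror the healthy example above, while the helper name, the `Missing` marker, and the `ControlPlaneUnreachable` reason used in the demonstration are hypothetical.

```python
# Healthy values inferred from the example status; treat as an assumption.
EXPECTED = {
    "Connected": "True",
    "Authenticated": "True",
    "ControllersHealthy": "True",
    "CRDsRegistered": "True",
    "Disconnected": "False",
}


def unhealthy_conditions(status, expected=EXPECTED):
    """Return (type, reason, message) for each condition not at its expected value."""
    seen = {cond["type"]: cond for cond in status.get("conditions", [])}
    problems = []
    for ctype, want in expected.items():
        cond = seen.get(ctype)
        if cond is None:
            problems.append((ctype, "Missing", ""))
        elif cond.get("status") != want:
            problems.append((ctype, cond.get("reason", ""), cond.get("message", "")))
    return problems


if __name__ == "__main__":
    status = {
        "conditions": [
            # Hypothetical failure reason, for illustration only.
            {"type": "Connected", "status": "False", "reason": "ControlPlaneUnreachable"},
            {"type": "Authenticated", "status": "True", "reason": "CredentialValid"},
            {"type": "ControllersHealthy", "status": "True", "reason": "AllControllersReady"},
            {"type": "CRDsRegistered", "status": "True", "reason": "SchemaMatches"},
            {"type": "Disconnected", "status": "False", "reason": "NotApplicable"},
        ]
    }
    for problem in unhealthy_conditions(status):
        print(problem)
```

An empty result means every expected condition is present and healthy; anything else names the condition to investigate first.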
Putting it together: the four-pane overview¶
A working observability setup for a vWorkspace cluster has four panes:
- Cluster status pane. `kubectl get cluster -n vworkspace-system -o yaml` (or its dashboard equivalent). Answers "is the operator healthy".
- Application status pane. `kubectl get applicationinstance -A` plus a per-app drill-down. Answers "is each application healthy".
- Operation status pane. `kubectl get operation -A` plus a per-op drill-down. Answers "what is happening right now and what happened recently".
- Operator metrics pane. A small Grafana dashboard with the operator-specific metrics above. Answers "is the operator itself behaving".
Add to these the Odoo Discuss timeline for the organization, and the human operator and the AI assistant have a complete picture.
Related material¶
- troubleshooting.md: How to follow each of these signals to a root cause.
- upgrades.md: Version-skew rules visible via `Cluster.status.operatorVersion`.
- ../security/threat-model.md: Why structured logs redact secrets.
- ../api/conditions.md: Full condition reason vocabulary.