Commit b8c1927

[OCTRL-1095] add k3s observability docs
1 parent d345c7c commit b8c1927

3 files changed

Lines changed: 142 additions & 0 deletions

control-operator/api/v1alpha1/task_types.go

Lines changed: 11 additions & 0 deletions
@@ -88,13 +88,24 @@ type TaskSpec struct {
 	NodeName string `json:"nodeName,omitempty"`
 }
 
+const (
+	ConditionPodReady          = "PodReady"
+	ConditionGRPCConnected     = "GRPCConnected"
+	ConditionStateInitialized  = "StateInitialized"
+	ConditionStateTransitioned = "StateTransitioned"
+)
+
 // TaskStatus defines the observed state of Task
 type TaskStatus struct {
 	// INSERT ADDITIONAL STATUS FIELD - define observed state of cluster
 	// Important: Run "make" to regenerate code after modifying this file
 	Pod   v1.PodStatus `json:"pod,omitempty"`
 	State string       `json:"state,omitempty"`
 	Error string       `json:"error,omitempty"`
+	// +listType=map
+	// +listMapKey=type
+	// +optional
+	Conditions []metav1.Condition `json:"conditions,omitempty"`
 }
 
 // +kubebuilder:object:root=true

control-operator/api/v1alpha1/zz_generated.deepcopy.go

Lines changed: 7 additions & 0 deletions
Generated file; diff not rendered by default.

docs/k3s_observability.md

Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
# K3s Observability

> ⚠️ **Warning**
> This observability setup is a prototype configured for a specific OpenSearch instance. Adjust `opensearch-config.yml` before deploying to a different environment.

All manifests and server-side configuration live in `control-operator/k3s-observability/`:

```
k3s-observability/
├── manifests/                  # applied via kubectl
│   ├── opensearch-config.yml
│   ├── fluent-bit-events.yml
│   ├── fluent-bit-logs.yml
│   └── fluent-bit-audit.yml
└── other/                      # deployed manually on the k3s server node
    └── audit-policy.yaml
```

## Overview

Three fluent-bit components run inside the k3s cluster and forward data to an external observability stack via the Fluent Forward protocol (port 24224 on `OPENSEARCH_HOST`):

| Manifest | Kind | What it collects |
|---|---|---|
| `fluent-bit-events.yml` | Deployment | Kubernetes `Event` objects: pod lifecycle, gRPC connections, controller-emitted events |
| `fluent-bit-logs.yml` | DaemonSet | Container stdout/stderr from all pods |
| `fluent-bit-audit.yml` | DaemonSet (control-plane only) | Kubernetes API audit log, including full CRD specs on create/update/delete |

The external observability stack (Fluent Bit → OTel Collector → Data Prepper → OpenSearch) receives and processes the forwarded data.

Despite their names, `OPENSEARCH_HOST` and `OPENSEARCH_PORT` in `opensearch-config.yml` point at the Fluent Bit forward input of the external observability stack, not at OpenSearch directly. All k3s Fluent Bit components read these via `envFrom`.

[Reloader](https://github.com/stakater/Reloader) can automatically restart pods, including the fluent-bit ones, whenever their ConfigMap changes. It is optional: the fluent-bit deployment works without it. Each Deployment/DaemonSet has the annotation `reloader.stakater.com/auto: "true"` on the pod template.

## Deployment

### First-time setup

**1. Configure OpenSearch endpoint**

Edit `manifests/opensearch-config.yml` with the correct host and port, then apply all manifests:

```bash
kubectl apply -f control-operator/k3s-observability/manifests/
```

**2. Set up audit logging on the k3s server node**

Copy the audit policy to the server:
```bash
scp control-operator/k3s-observability/other/audit-policy.yaml <server>:/etc/rancher/k3s/audit-policy.yaml
```

Add the following to `/etc/rancher/k3s/config.yaml` on the server (create the file if it doesn't exist):
```yaml
kube-apiserver-arg:
  - "audit-log-path=/var/log/k3s-audit.log"
  - "audit-policy-file=/etc/rancher/k3s/audit-policy.yaml"
  - "audit-log-maxage=7"
  - "audit-log-maxbackup=3"
  - "audit-log-maxsize=100"
```

Restart k3s. If leftover containerd-shim processes block the restart:
```bash
/usr/local/bin/k3s-killall.sh && systemctl start k3s
```

**3. (Optional) Install Reloader**
```bash
kubectl apply -f https://raw.githubusercontent.com/stakater/Reloader/master/deployments/kubernetes/reloader.yaml
```

### Updating config

After any change to the manifests:
```bash
kubectl apply -f control-operator/k3s-observability/manifests/
```

If Reloader is installed, it restarts affected pods automatically when their ConfigMap changes. Without it, pods keep the values they started with (`envFrom` is only read at container start), so restart the affected workloads manually.

## What is recorded and where

### Kubernetes Events (`fluent-bit-events`)

Watches the Kubernetes `Event` API directly. Captures events emitted by kubelet and the ALIECS controllers:

- Pod lifecycle: `Created`, `Started`, `Killing` (explicit kill), `BackOff` (crash loop)
- Task controller: pod IP assignment, gRPC connection established, pod failure detected
- Notable gap: containers that exit on their own do not generate a kubelet `Killing` event; their exit is only visible in pod status. The task controller emits a `PodFailed` event to fill this gap.

Query in OpenSearch: `WHERE attributes.kind = 'Event'`

### Container logs (`fluent-bit-logs`)

Tails `/var/log/containers/*.log` on every node. Captures stdout/stderr from all containers, including the task and environment managers.

The ALIECS controllers are configured with `--zap-encoder=json`, so their log lines are pure JSON. The fluent-bit `merge_log: on` option parses these automatically, lifting structured fields into queryable attributes. The OTel Collector further normalises controller logs, including mapping the Go `level` field (`debug`/`info`/`warn`/`error`) to OTLP `severity_text` and `severity_number`, so that log-level filtering works correctly in OpenSearch Dashboards.

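The level-to-severity mapping itself lives in the OTel Collector configuration, which is not part of this commit. As an illustration only, the sketch below expresses the same mapping in Go. The function name `severity` is hypothetical, and the chosen numbers are an assumption: they are the first value of each OTLP severity range (DEBUG=5, INFO=9, WARN=13, ERROR=17), which may differ from what the collector is actually configured to emit.

```go
package main

import (
	"fmt"
	"strings"
)

// severity returns an OTLP severity_text and severity_number for a zap-style
// level string. Numbers are the first value of each OTLP range; unknown
// levels map to 0, which OTLP defines as unspecified.
func severity(level string) (text string, number int) {
	switch strings.ToLower(level) {
	case "debug":
		return "DEBUG", 5
	case "info":
		return "INFO", 9
	case "warn":
		return "WARN", 13
	case "error":
		return "ERROR", 17
	default:
		return "", 0
	}
}

func main() {
	for _, lvl := range []string{"debug", "info", "warn", "error"} {
		text, num := severity(lvl)
		fmt.Printf("%-5s -> severity_text=%s severity_number=%d\n", lvl, text, num)
	}
}
```

Because `severity_number` is ordered, a dashboard filter such as "WARN and above" reduces to a numeric comparison (`severity_number >= 13` under the assumption above).
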
### Audit log (`fluent-bit-audit`)

Tails `/var/log/k3s-audit.log` on the control-plane node. Records every API server interaction matching the audit policy.

**What is captured:**

| Resource | Level | Verbs |
|---|---|---|
| ALIECS CRDs (Task, Environment, TaskTemplate) | `RequestResponse` (full spec) | create, update, patch, delete |
| Pods | `Metadata` (no body) | create, delete |

`RequestResponse` means the full request and response body is logged, i.e. the complete spec of every Task and Environment CRD at the time it was created or modified. This gives a persistent record of what was deployed even after the CRD is deleted.

`managedFields` is stripped at source via `omitManagedFields: true` in the audit policy. This field uses `.` as a JSON key (Kubernetes FieldsV1 format), which OpenSearch rejects. Removing it at the kube-apiserver level is cleaner than filtering it in the pipeline.

Pod deletion (which sets the pod to Terminating) is captured at `Metadata` level via `verb: delete`.

What is **not** captured: pod status transitions (Running → Terminating → Succeeded/Failed). These are `patch` operations on the Pod object and are excluded to avoid noise.

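To make the difference between the two audit levels concrete, here is a small Go sketch that parses audit log lines and checks whether a request body was recorded. The struct covers only a handful of fields from the `audit.k8s.io/v1` event format, `parseAuditLine` and `hasBody` are hypothetical helpers, and both sample lines (resource and object names included) are invented for illustration rather than copied from a real cluster.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// auditEvent holds the subset of audit.k8s.io/v1 Event fields used here.
type auditEvent struct {
	Level     string `json:"level"`
	Verb      string `json:"verb"`
	ObjectRef struct {
		Resource  string `json:"resource"`
		Namespace string `json:"namespace"`
		Name      string `json:"name"`
	} `json:"objectRef"`
	// RequestObject stays empty when the policy level is Metadata.
	RequestObject json.RawMessage `json:"requestObject,omitempty"`
}

// parseAuditLine decodes one JSON line of the audit log.
func parseAuditLine(line []byte) (auditEvent, error) {
	var ev auditEvent
	err := json.Unmarshal(line, &ev)
	return ev, err
}

// hasBody reports whether the event carries the request body.
func hasBody(ev auditEvent) bool {
	return len(ev.RequestObject) > 0
}

func main() {
	// Illustrative sample lines, not real cluster output.
	lines := []string{
		`{"level":"RequestResponse","verb":"create","objectRef":{"resource":"tasks","namespace":"default","name":"example-task"},"requestObject":{"spec":{"nodeName":"node-1"}}}`,
		`{"level":"Metadata","verb":"delete","objectRef":{"resource":"pods","namespace":"default","name":"example-pod"}}`,
	}
	for _, l := range lines {
		ev, err := parseAuditLine([]byte(l))
		if err != nil {
			fmt.Println("skip malformed line:", err)
			continue
		}
		fmt.Printf("%s %s/%s level=%s body=%t\n",
			ev.Verb, ev.ObjectRef.Resource, ev.ObjectRef.Name, ev.Level, hasBody(ev))
	}
}
```
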
## Audit policy

The audit policy at `other/audit-policy.yaml` is a server-side file read by the kube-apiserver at startup. It is **not** a Kubernetes resource and cannot be applied with `kubectl`. Any change to it requires copying the file to the server and restarting k3s.

Noise excluded by policy: lease updates, node heartbeats, health/metrics endpoints. `managedFields` is excluded from all captured events via `omitManagedFields: true` on the ALIECS CRD rule.
