feat: build an online-to-offline distillation loop for policy improvement beyond static sacred gating #5

Open

opened 2026-03-21 09:11:43 +01:00 by openclaw · 0 comments

Context
The current replay buffer, reward signals, and offline bandit policy are a step in the right direction, but the learning loop is still conservative and mostly observational.

Why this matters
The user wants an agent that keeps getting smarter offline. That requires a disciplined distillation loop from live traces to candidate policies, not only manual bootstrapping and static promotion gates.

Current evidence

- `/home/openclaw/.openclaw/workspace/lib/replay_buffer.py`
- `/home/openclaw/.openclaw/workspace/bin/train-policy-offline`
- `/home/openclaw/.openclaw/workspace/bin/run-sacred-evals`

Deliverables

- Formalize a trace -> label -> candidate-policy -> sacred-gate -> canary pipeline.
- Distinguish observation-only, prefer, and avoid transitions with explicit thresholds.
- Add support for Kimi-labeled hard cases without putting Kimi in the hot path.
- Version candidate policies and promotion decisions.
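
One possible shape for the labeling step, as a minimal sketch only. Everything here is hypothetical: the `Label`, `Thresholds`, and `TransitionStats` names, the threshold values, and the `margin` used to flag hard cases are illustrative assumptions and are not taken from `replay_buffer.py` or the existing scripts. The idea is that a transition stays observation-only unless it has enough samples and clears an explicit reward threshold, and that borderline cases are queued for offline labeling (for example by Kimi) rather than resolved in the hot path.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    OBSERVE = "observation-only"
    PREFER = "prefer"
    AVOID = "avoid"


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical values; the issue does not specify any thresholds.
    prefer_min_reward: float = 0.5
    avoid_max_reward: float = -0.5
    min_samples: int = 5


@dataclass(frozen=True)
class TransitionStats:
    # Hypothetical aggregate over replayed transitions for one (context, action) pair.
    mean_reward: float
    samples: int


def label_transition(stats: TransitionStats, t: Thresholds = Thresholds()) -> Label:
    """Stay observation-only unless the evidence clears an explicit threshold."""
    if stats.samples < t.min_samples:
        return Label.OBSERVE
    if stats.mean_reward >= t.prefer_min_reward:
        return Label.PREFER
    if stats.mean_reward <= t.avoid_max_reward:
        return Label.AVOID
    return Label.OBSERVE


def needs_offline_review(
    stats: TransitionStats, t: Thresholds = Thresholds(), margin: float = 0.1
) -> bool:
    """Flag well-sampled transitions that sit close to a threshold.

    Flagged cases would be written to a queue and labeled in a later batch,
    so the external labeler is never called while serving requests.
    """
    if stats.samples < t.min_samples:
        return False
    near_prefer = abs(stats.mean_reward - t.prefer_min_reward) < margin
    near_avoid = abs(stats.mean_reward - t.avoid_max_reward) < margin
    return near_prefer or near_avoid
```

Keeping the thresholds in a single frozen dataclass would make them easy to record alongside each candidate policy version, so a promotion decision can be traced back to the exact cutoffs that produced its labels.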

Definition of done
A nightly or manual batch can consume new trajectories and produce a versioned candidate policy that is either promoted or rejected by the gate.
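
The definition of done could be exercised by a batch entry point along the following lines. This is a sketch under stated assumptions, not the behavior of `train-policy-offline` or `run-sacred-evals`: the `train` and `gate` callables, the `CandidatePolicy` and `PromotionDecision` records, the content-hash versioning scheme, and the `gate_threshold` default are all hypothetical stand-ins. The canary stage is not modeled; a promoted candidate would be handed to it afterwards.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class CandidatePolicy:
    version: str   # content hash of the weights (hypothetical scheme)
    weights: dict


@dataclass(frozen=True)
class PromotionDecision:
    version: str
    promoted: bool
    gate_score: float
    gate_threshold: float


def make_version(weights: dict) -> str:
    """Derive a deterministic version id from the serialized weights."""
    payload = json.dumps(weights, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def run_batch(
    trajectories: Sequence[dict],
    train: Callable[[Sequence[dict]], dict],
    gate: Callable[[CandidatePolicy], float],
    gate_threshold: float = 1.0,
) -> Tuple[CandidatePolicy, PromotionDecision]:
    """Consume trajectories, build a versioned candidate, and record the gate verdict."""
    weights = train(trajectories)
    candidate = CandidatePolicy(version=make_version(weights), weights=weights)
    score = gate(candidate)
    decision = PromotionDecision(
        version=candidate.version,
        promoted=score >= gate_threshold,
        gate_score=score,
        gate_threshold=gate_threshold,
    )
    return candidate, decision


if __name__ == "__main__":
    # Stub trainer and gate, for illustration only.
    demo_trajectories = [{"action": "a", "reward": 1.0}, {"action": "b", "reward": 0.0}]
    candidate, decision = run_batch(
        demo_trajectories,
        train=lambda trajs: {"n_trajectories": len(trajs)},
        gate=lambda cand: 1.0,
    )
    print(candidate.version, decision.promoted)
```

Returning the decision as its own record, rather than a bare boolean, means rejected candidates leave the same audit trail as promoted ones.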

Reference: openclaw/openclaw-intelligence-core-public#5
