feat: build an online-to-offline distillation loop for policy improvement beyond static sacred gating #5

Open
opened 2026-03-21 09:11:43 +01:00 by openclaw · 0 comments
Owner

Context
The current replay buffer, reward signals, and offline bandit policy are a step in the right direction, but the learning loop is still conservative and mostly observational.

Why this matters
The user wants an agent that keeps getting smarter offline. That requires a disciplined distillation loop from live traces to candidate policies, not only manual bootstrapping and static promotion gates.

Current evidence

  • /home/openclaw/.openclaw/workspace/lib/replay_buffer.py
  • /home/openclaw/.openclaw/workspace/bin/train-policy-offline
  • /home/openclaw/.openclaw/workspace/bin/run-sacred-evals

Deliverables

  • Formalize a trace -> label -> candidate-policy -> sacred-gate -> canary pipeline.
  • Distinguish observation-only, prefer, and avoid transitions with explicit thresholds.
  • Add support for Kimi-labeled hard cases without putting Kimi in the hot path.
  • Version candidate policies and promotion decisions.
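One way the observation-only / prefer / avoid split could look is sketched below. This is an illustration only: the `Label`, `Thresholds`, and `label_transition` names and every numeric threshold are assumptions, not taken from `replay_buffer.py` or any existing code. It assumes each transition has already been aggregated into a mean reward and a sample count.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    OBSERVE = "observation-only"
    PREFER = "prefer"
    AVOID = "avoid"


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical defaults; the issue does not specify actual values.
    min_samples: int = 20      # below this, never act on the evidence
    prefer_above: float = 0.7  # mean reward at or above this -> prefer
    avoid_below: float = 0.3   # mean reward at or below this -> avoid


def label_transition(
    mean_reward: float,
    n_samples: int,
    t: Thresholds = Thresholds(),
) -> Label:
    """Label one aggregated transition using explicit thresholds."""
    if n_samples < t.min_samples:
        return Label.OBSERVE  # too little evidence, keep observing
    if mean_reward >= t.prefer_above:
        return Label.PREFER
    if mean_reward <= t.avoid_below:
        return Label.AVOID
    return Label.OBSERVE  # ambiguous middle band stays observational
```

Keeping the thresholds in one frozen dataclass means they can be stored alongside each candidate policy, so a promotion decision can later be traced back to the exact cutoffs that produced its labels.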

Definition of done
A nightly or manual batch can consume new trajectories and produce a versioned candidate policy that is either promoted or rejected by the gate.
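A minimal sketch of what such a batch might record follows. It assumes a candidate policy can be serialized as a plain dict of weights and that the sacred gate can be reduced to a single score. Both are assumptions: `Candidate`, `Decision`, `make_candidate`, `run_batch`, and the `gate` callable are hypothetical names, and the real interfaces of `train-policy-offline` and `run-sacred-evals` are not shown in this issue. The canary stage is left out.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Candidate:
    version: str   # content hash, so identical inputs give the same version
    parent: str    # version of the policy this candidate was derived from
    weights: dict


@dataclass(frozen=True)
class Decision:
    candidate_version: str
    gate_score: float
    promoted: bool


def make_candidate(weights: dict, parent: str) -> Candidate:
    """Derive a deterministic version id from the candidate's content."""
    payload = json.dumps({"parent": parent, "weights": weights}, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    return Candidate(version=digest, parent=parent, weights=weights)


def run_batch(
    weights: dict,
    parent: str,
    gate: Callable[[Candidate], float],  # hypothetical stand-in for the sacred gate
    min_score: float,
) -> Decision:
    """Build a versioned candidate and record whether the gate promotes it."""
    candidate = make_candidate(weights, parent)
    score = gate(candidate)
    return Decision(candidate.version, score, promoted=score >= min_score)
```

Because the version is a hash of the parent and the weights, rerunning the batch on unchanged inputs yields the same identifier, and both promoted and rejected candidates leave a `Decision` record that can be kept for audit.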

Reference: openclaw/openclaw-intelligence-core-public#5