feat: build an online-to-offline distillation loop for policy improvement beyond static sacred gating #5
Context
The existing replay buffer, reward signals, and offline bandit policy point in the right direction, but the learning loop is still conservative and mostly observational.
Why this matters
The user wants an agent that keeps getting smarter offline. That requires a disciplined distillation loop from live traces to candidate policies, not only manual bootstrapping and static promotion gates.
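A minimal sketch of the core distillation step such a loop could run: consume logged trajectories and reduce them to a greedy candidate policy. All names here (`Trajectory`, `distill_candidate`, the flat state/action/reward schema) are hypothetical illustrations, not the actual `replay_buffer.py` interface:

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """One logged step from a live trace (assumed flat schema)."""
    state: str
    action: str
    reward: float


def distill_candidate(trajectories: list[Trajectory],
                      min_reward: float = 0.0) -> dict[str, str]:
    """Distill a greedy state -> action table from high-reward traces.

    Traces below min_reward are discarded; for each state we keep
    the action that achieved the highest observed reward.
    """
    best: dict[str, tuple[str, float]] = {}
    for t in trajectories:
        if t.reward < min_reward:
            continue
        current = best.get(t.state)
        if current is None or t.reward > current[1]:
            best[t.state] = (t.action, t.reward)
    return {state: action for state, (action, _) in best.items()}
```

In a real loop this table would be replaced by whatever policy class `train-policy-offline` fits; the point is the shape of the step: filter traces, aggregate per state, emit a candidate.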
Current evidence
- /home/openclaw/.openclaw/workspace/lib/replay_buffer.py
- /home/openclaw/.openclaw/workspace/bin/train-policy-offline
- /home/openclaw/.openclaw/workspace/bin/run-sacred-evals

Deliverables
Definition of done
A nightly or manual batch can consume new trajectories and produce a versioned candidate policy that is either promoted or rejected by the gate.
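The promote-or-reject decision could be as small as the sketch below: deterministically version the candidate and admit it only if it beats the incumbent by a margin. `version_policy`, `gate`, and the margin value are illustrative assumptions, not the existing `run-sacred-evals` contract:

```python
import hashlib
import json


def version_policy(policy: dict) -> str:
    """Derive a stable 12-hex-char version id from the policy contents."""
    blob = json.dumps(policy, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]


def gate(candidate_reward: float, incumbent_reward: float,
         margin: float = 0.01) -> bool:
    """Promote only if the candidate clears the incumbent by `margin`."""
    return candidate_reward >= incumbent_reward + margin
```

Content-hash versioning makes reruns idempotent (the same candidate always gets the same id), so the nightly batch can safely be re-executed after a failure.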