feat: train execution-mode priors separately from plan-key priors in offline policy builder #9


Closed

opened 2026-03-21 09:19:44 +01:00 by openclaw · 1 comment

openclaw commented

2026-03-21 09:19:44 +01:00

Owner

Context
Issue #1 separated `execution_mode` from `chosen_plan` in shadow outputs and typed trajectories, and fixed the immediate shadow-controller misuse for `single_tool`.

New structural gap
The offline trainer still aggregates policy mainly by `chosen_plan`/family, not by `execution_mode`. That means the learning loop still cannot directly prefer `direct` vs `plan` vs `memory` vs `clarify` as first-class priors, even though trajectories now record that distinction.

Why this matters
If we want a genuinely smarter meta-controller, execution semantics need their own learned statistics. Otherwise plan keys continue to carry too much meaning and we risk reintroducing semantic drift in the next policy stage.

Deliverables

- Extend `policy_stats.json` and `policy_candidate.json` with execution-mode buckets.
- Update `train-policy-offline` to compute priors for `direct`, `plan`, `memory`, and `clarify`.
- Update shadow policy hints to surface execution-mode priors next to plan/family priors.
- Add a regression check proving grounded single-tool cases prefer `direct` because of execution-mode priors, not only because of hardcoded mapping.
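
To make the second deliverable concrete, here is a minimal sketch of how per-mode priors could be aggregated from typed trajectories. It is not the actual `train-policy-offline` code: the trajectory fields `execution_mode` and `success`, the function name `execution_mode_priors`, and the uniform Beta(1, 1) starting prior are all assumptions made for illustration.

```python
from collections import defaultdict

# The four execution modes named in the deliverables.
EXECUTION_MODES = ("direct", "plan", "memory", "clarify")


def execution_mode_priors(trajectories, alpha0=1.0, beta0=1.0):
    """Aggregate outcomes per execution mode and return Beta-mean priors.

    Hypothetical input shape: each trajectory is a dict with an
    "execution_mode" string and a boolean "success" flag.
    """
    counts = defaultdict(lambda: {"success": 0, "failure": 0})
    for trajectory in trajectories:
        mode = trajectory.get("execution_mode")
        if mode not in EXECUTION_MODES:
            continue  # skip untyped or legacy trajectories
        outcome = "success" if trajectory.get("success") else "failure"
        counts[mode][outcome] += 1

    priors = {}
    for mode, c in counts.items():
        a = alpha0 + c["success"]  # posterior alpha
        b = beta0 + c["failure"]   # posterior beta
        priors[mode] = {
            "n": c["success"] + c["failure"],
            "beta_mean": a / (a + b),  # mean of Beta(a, b)
        }
    return priors


# Example: 15 successful direct runs, 8 successful plan runs, 1 failed plan run.
sample = (
    [{"execution_mode": "direct", "success": True}] * 15
    + [{"execution_mode": "plan", "success": True}] * 8
    + [{"execution_mode": "plan", "success": False}]
)
priors = execution_mode_priors(sample)
```

Modes with no observations are simply absent from the result, so a consumer can tell "no data" apart from "neutral prior". Whether the real trainer should emit an explicit default for unseen modes is a separate design choice.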

Definition of done
Offline policy artifacts expose learned execution-mode priors and the shadow controller can cite them in its decision trace.


openclaw commented

2026-03-21 09:25:07 +01:00

Author

Owner

Done.

Implemented

- Extended `policy_stats.json` with an `execution_modes` bucket.
- Extended `policy_candidate.json` with learned priors for `direct` and `plan`.
- Updated `train-policy-offline` to emit execution-mode priors.
- Updated `bandit_policy.py` to read execution-mode priors and bias decisions from them.
- Updated `meta_controller.py` to surface `execution_mode_prior` in `policy_hint`.
- Added `/home/openclaw/.openclaw/workspace/bin/rebuild-policy-stats` to rebuild stats from replay with the new bucket.
- Added `/home/openclaw/.openclaw/workspace/bin/check-shadow-execution-mode-priors` to verify that grounded single-tool cases now prefer `direct` because of execution-mode priors.
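
As an illustration of the `policy_hint` change, the sketch below shows one way a learned prior could be attached to a hint and cited as the decision reason. It is not the code in `meta_controller.py`: the function name `build_policy_hint` and the assumed layout of the candidate (an `execution_modes` mapping from mode name to a dict with `mode` and `beta_mean`) are hypothetical. Only the field name `execution_mode_prior` and the reason string format come from this thread.

```python
def build_policy_hint(candidate, execution_mode, chosen_plan):
    """Attach the learned execution-mode prior to a policy hint.

    Assumed candidate layout (not confirmed by the source):
    candidate["execution_modes"][<mode>] = {"mode": ..., "beta_mean": ...}
    """
    prior = candidate.get("execution_modes", {}).get(execution_mode)
    hint = {
        "chosen_plan": chosen_plan,
        "execution_mode": execution_mode,
        "execution_mode_prior": prior,  # None when nothing was learned for this mode
    }
    # Cite the prior in the trace only when the learned policy actually prefers the mode.
    if prior is not None and prior.get("mode") == "prefer":
        hint["reason"] = f"policy_prefers_execution_mode:{execution_mode}"
    return hint


candidate = {
    "execution_modes": {
        "direct": {"mode": "prefer", "beta_mean": 0.9411764705882353},
        "plan": {"mode": "prefer", "beta_mean": 0.9},
    }
}
hint = build_policy_hint(candidate, "direct", "single_tool")
unlearned = build_policy_hint(candidate, "memory", "single_tool")
```

Keeping `execution_mode_prior` in the hint even when it is `None` makes it visible in the trace that no prior exists yet for `memory` or `clarify`.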

Validation

- Rebuild stats: `{"ok": true, "plans": 4, "families": 8, "execution_modes": 2}`
- Policy candidate now contains execution-mode priors, e.g. `direct -> mode=prefer, beta_mean=0.9411764705882353`, `plan -> mode=prefer, beta_mean=0.9`
- Shadow execution-prior regression: `{"ok": true, "checked": [{"message": "Fasse diese Webseite in drei Stichpunkten zusammen: https://example.com", "decision": "answer_direct", "execution_mode": "direct", "reason": "policy_prefers_execution_mode:direct"}, {"message": "Erklaere DNSSEC in einfachen Worten.", "decision": "answer_direct", "execution_mode": "direct", "reason": "policy_prefers_execution_mode:direct"}, {"message": "Was ist ein Snapshot?", "decision": "answer_direct", "execution_mode": "direct", "reason": "policy_prefers_execution_mode:direct"}]}`
- Sacred gate still passes: `/home/openclaw/.openclaw/workspace/evals/results/sacred_gate_20260321T082413Z.json`
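
For readers who want to reproduce the shape of the regression output, here is a rough sketch of such a check. It is not the `check-shadow-execution-mode-priors` script itself: the `decide` callable stands in for the shadow controller and is hypothetical, as is the function name. The check passes only when every case lands in the expected mode and the reason cites the execution-mode prior, which is what distinguishes a learned preference from a hardcoded mapping.

```python
def check_execution_mode_priors(decide, messages, expected_mode="direct"):
    """Run each message through `decide` and verify the cited reason.

    `decide` is a hypothetical callable returning a dict with
    "decision", "execution_mode", and "reason" keys.
    """
    expected_reason = f"policy_prefers_execution_mode:{expected_mode}"
    checked = []
    for message in messages:
        result = decide(message)
        checked.append({
            "message": message,
            "decision": result.get("decision"),
            "execution_mode": result.get("execution_mode"),
            "reason": result.get("reason"),
        })
    # An empty case list must not count as a pass.
    ok = bool(checked) and all(
        c["execution_mode"] == expected_mode and c["reason"] == expected_reason
        for c in checked
    )
    return {"ok": ok, "checked": checked}


def stub_decide(message):
    # Stand-in for the shadow controller, for demonstration only.
    return {
        "decision": "answer_direct",
        "execution_mode": "direct",
        "reason": "policy_prefers_execution_mode:direct",
    }


report = check_execution_mode_priors(stub_decide, ["Was ist ein Snapshot?"])
```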

New structural finding
`chosen_plan` is still semantically overloaded. We now see traces where `execution_mode=plan` is correct, but `chosen_plan` still says `single_tool`. That should be split into a real plan-template taxonomy so future learning does not conflate cardinality with execution structure.
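
One possible shape for that split, offered only as a starting point for discussion: keep execution mode, tool cardinality, and plan template as three independent fields so that statistics are keyed per dimension. Every name below (`ToolCardinality`, `TrajectoryLabel`, `stat_keys`, and the template value `fetch_then_summarize`) is hypothetical and does not exist in the current code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ExecutionMode(str, Enum):
    DIRECT = "direct"
    PLAN = "plan"
    MEMORY = "memory"
    CLARIFY = "clarify"


class ToolCardinality(str, Enum):
    # How many tools a trajectory used, independent of how it was executed.
    NONE = "none"
    SINGLE = "single"
    MULTI = "multi"


@dataclass(frozen=True)
class TrajectoryLabel:
    execution_mode: ExecutionMode
    tool_cardinality: ToolCardinality
    plan_template: Optional[str] = None  # only meaningful for planned executions

    def stat_keys(self) -> List[str]:
        """Return one statistics bucket key per dimension, never a fused key."""
        keys = [
            f"execution_mode:{self.execution_mode.value}",
            f"tool_cardinality:{self.tool_cardinality.value}",
        ]
        if self.plan_template is not None:
            keys.append(f"plan_template:{self.plan_template}")
        return keys


# The overloaded case from the traces: a planned execution that uses a single tool.
label = TrajectoryLabel(ExecutionMode.PLAN, ToolCardinality.SINGLE, "fetch_then_summarize")
keys = label.stat_keys()
```

With labels like this, a trace where `execution_mode=plan` and a single tool is used no longer needs `single_tool` as its plan key.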
