refactor: normalize chosen_plan taxonomy so plan templates are not overloaded with single_tool semantics #10

Closed
opened 2026-03-21 09:25:11 +01:00 by openclaw · 1 comment
Owner

Context
Issue #9 successfully taught the offline policy builder to learn execution-mode priors separately from plan-key priors.

New structural gap
Live shadow traces now expose a deeper taxonomy problem: some rows have execution_mode=plan for good reasons, but still carry chosen_plan=single_tool. That means chosen_plan is currently mixing plan-template identity with tool-cardinality shorthand.

Why this matters
If we keep learning on top of an overloaded chosen_plan, the policy layer will keep conflating:

  • direct single-tool execution
  • plan-mode execution that happens to involve one primary tool
  • multi-step templates

This weakens interpretability and will eventually corrupt higher-order planning priors.

Deliverables

  • Define a normalized plan-template taxonomy separate from tool cardinality.
  • Migrate trajectory logging so chosen_plan always names a real template.
  • Keep execution_mode and primary_tool_count as separate fields.
  • Backfill or map older replay data into the new taxonomy.
  • Add regression checks proving plan-mode rows no longer use single_tool as the plan key.
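As a rough illustration of the first three deliverables, the sketch below keeps execution mode, plan template, and tool cardinality as separate fields and maps the legacy shorthand onto a template name. The `TraceRow` type, the `normalize_row` helper, the `"direct"` mode value, and the mapping rule itself are assumptions made for illustration; only the field names and the two template names (taken from the closing comment on this issue) come from the thread.

```python
from dataclasses import dataclass

# Legacy shorthand that the issue says must not be used as a plan key.
LEGACY_SHORTHAND = "single_tool"


@dataclass(frozen=True)
class TraceRow:
    execution_mode: str      # "plan" appears in the issue; "direct" is an assumed value
    chosen_plan: str         # plan-template name
    primary_tool_count: int  # tool cardinality, kept separate from the template


def normalize_row(row: TraceRow) -> TraceRow:
    """Map the legacy shorthand onto a real template name (hypothetical rule)."""
    if row.chosen_plan != LEGACY_SHORTHAND:
        return row  # already a template name; leave untouched
    # Assumed rule: plan-mode rows get the setup-evidence template,
    # everything else is treated as direct execution.
    template = (
        "single_tool_with_setup_evidence"
        if row.execution_mode == "plan"
        else "single_tool_direct"
    )
    return TraceRow(row.execution_mode, template, row.primary_tool_count)
```

The real mapping may well depend on more than `execution_mode`; the point of the sketch is only that the template name is derived once, at normalization time, and that cardinality lives in its own field.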

Definition of done
Replay, policy stats, and shadow traces use a stable plan-template vocabulary that does not overload single_tool.

Author
Owner

Done.

Implemented

  • Added /home/openclaw/.openclaw/workspace/lib/plan_taxonomy.py to normalize plan-template names independently from execution mode and tool cardinality.
  • Updated shadow traces so chosen_plan now names a real template such as single_tool_direct or single_tool_with_setup_evidence.
  • Added primary_tool_count to shadow traces and typed trajectories.
  • Updated backfill tooling with /home/openclaw/.openclaw/workspace/bin/backfill-plan-taxonomy.
  • Added /home/openclaw/.openclaw/workspace/bin/check-plan-taxonomy to prove plan-mode rows no longer use single_tool as the plan key.
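The regression check described in the last item could look roughly like the following. This is a minimal sketch, not the contents of `check-plan-taxonomy`: the function names are hypothetical, and it assumes traces are stored as JSON Lines with `execution_mode` and `chosen_plan` keys.

```python
import json
from typing import Iterable


def find_violations(rows: Iterable[dict]) -> list[dict]:
    """Return plan-mode rows that still use the bare single_tool key."""
    return [
        row for row in rows
        if row.get("execution_mode") == "plan"
        and row.get("chosen_plan") == "single_tool"
    ]


def check_jsonl(lines: Iterable[str]) -> bool:
    """Parse JSON Lines (assumed trace format); True means the check passes."""
    rows = (json.loads(line) for line in lines if line.strip())
    return not find_violations(rows)
```

Returning the offending rows rather than a bare boolean from `find_violations` makes a failing run easier to debug, since the rows that need another backfill pass are immediately visible.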

Validation

  • Backfill result: {"ok": true, "changed_rows": 586}
  • Taxonomy regression passed.
  • Policy stats now use normalized plan keys, e.g. single_tool_direct, service_then_access_clear, memory_then_setup_lookup.
  • Sacred gate still passes: /home/openclaw/.openclaw/workspace/evals/results/sacred_gate_20260321T082906Z.json

New structural finding
Policy artifacts can become stale immediately after validation runs. We need an atomic policy-refresh/snapshot boundary so evals, candidates, and shadow comparisons all refer to the same replay cut.
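One possible way to approach such a boundary is to derive an identifier from the replay cut and stamp every artifact with it, so a mismatch is detectable before results are compared. The sketch below is an assumption about how that might work, not an existing component: `replay_cut_id`, `stamp`, `same_cut`, and the `replay_cut_id` key are all hypothetical names. It detects mismatched cuts; it does not by itself make the refresh atomic.

```python
import hashlib
import json


def replay_cut_id(rows: list[dict]) -> str:
    """Content hash of the replay rows; any change to the cut changes the ID."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def stamp(artifact: dict, cut_id: str) -> dict:
    """Return a copy of the artifact tagged with the cut it was built from."""
    return {**artifact, "replay_cut_id": cut_id}


def same_cut(*artifacts: dict) -> bool:
    """True only if every artifact carries the same, non-missing cut ID."""
    ids = {a.get("replay_cut_id") for a in artifacts}
    return len(ids) == 1 and None not in ids
```

With this in place, an eval run, a policy candidate, and a shadow comparison would each be stamped when built, and a comparison step could refuse to proceed unless `same_cut(...)` holds for all three.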

Reference: openclaw/openclaw-intelligence-core-public#10