chore: close out phase 7 as out of scope

2026-03-21 08:52:51 +01:00
parent 107ac0524b
commit 24b940d526
2 changed files with 373 additions and 2 deletions
--- a/docs/native-template-tool-formatting-plan.md
+++ b/docs/native-template-tool-formatting-plan.md
@@ -0,0 +1,371 @@
 # Native Template Tool Formatting Plan
 This document extracts Phase 7 item 19 from `session-cache-upgrade.md` into a standalone implementation plan.
 The goal is to describe what it would take to move the API server from app-managed tool prompting to model-template-native tool formatting at a later date, so that the work is no longer buried inside the larger session/cache rewrite document.
 ## Summary
 Current state:
 - The app formats tool instructions itself.
 - `PromptBuilder` injects tool definitions into prompt text.
 - `ToolPromptBuilder` produces model-specific tool prompt text and replays assistant tool calls back into prompt history.
 - `UserInput.tools` is currently not used for the API path.
 Proposed future state:
 - The app passes structured tools via `UserInput.tools`.
 - The model's Jinja chat template formats tools natively.
 - The app stops injecting tool instructions into the system prompt for models that are verified to support native template tools.
 - Manual prompt formatting remains available as a fallback.
 This is not a simple flag flip in the current codebase. It is a separate integration project.
 ## Why Consider This Later
 Potential benefits:
 - Less model-specific prompt text generation in app code.
 - Closer alignment with template authors' intended tool formatting.
 - Possible improvement in tool-call quality for models with reliable native tool templates.
 - Reduced duplication between app-side prompt construction and template-side prompt construction.
 Current reasons not to prioritize it immediately:
 - The current manual path is already implemented and tested.
 - Model-template behavior is not uniformly reliable. Phase 6 validation already showed that some local Qwen builds do not consistently honor their own documented thinking-tag contract.
 - The current code does not yet contain a real runtime strategy switch between manual and native tool formatting.
 ## Current Implementation
 Today, the API path does the following:
 1. If tools are present, `PromptBuilder` appends a model-specific tool prompt into the instructions block.
 2. Assistant tool calls in message history are rewritten back into model-native text form.
 3. Tool outputs are also rewritten into model-specific history text.
 4. `UserInput` is built with `tools: nil`.
 5. Output parsing prefers framework-emitted tool calls first, then falls back to text parsing.
 Files involved:
 - `MLXServer/Server/PromptBuilder.swift`
 - `MLXServer/Server/ToolPromptBuilder.swift`
 - `MLXServer/Server/APIServer.swift`
 - `MLXServer/Server/ToolCallParser.swift`
 ## Validated Local Model Templates
 The following observations are based on the local model template files currently present in the MLX Server cache.
 ### Qwen3.5 0.8B, 4B, and 9B
 Local Qwen3.5 templates do appear to support native tool formatting at the template level.
 Observed capabilities in the local `chat_template.jinja` files:
 - explicit `if tools` branch at the top of the template
 - renders a `<tools>` block containing serialized tool definitions
 - instructs the model to emit tool calls in a native Qwen XML format
 - replays prior assistant `tool_calls` in template-native form
 - replays `tool` role messages through `<tool_response>` wrappers
 Implication:
 - Qwen3.5 models are plausible candidates for a future `templateNative` allowlist.
 Important caveat:
 - template support on paper is not enough by itself. Phase 6 validation already showed that local Qwen3.5 builds do not consistently honor every documented template contract, specifically for `<think>...</think>` behavior. Native tool formatting for Qwen therefore still requires runtime validation, not just template inspection.
 ### Gemma 3 4B
 The local Gemma template does not appear to support native tools.
 Observed behavior in the local `chat_template.json`:
 - no `tools` variable handling
 - no native tool-definition rendering path
 - no replay path for assistant `tool_calls`
 - no dedicated `tool` role handling
 - template structure is focused on alternating user/model turns and image placeholders only
 Implication:
 - Gemma must remain on the current manual prompt formatting path unless a different local template or upstream framework behavior is introduced.
 ### Practical Conclusion
 If this work is taken on later, the starting assessment for the initial allowlist is:
 - Qwen3.5 family: possible candidate, but only after runtime validation
 - Gemma 3: not a candidate under the current local template
 ## Target Implementation
 For verified models, the API path should be able to:
 1. Convert OpenAI-format tool definitions into framework-native tool specs.
 2. Pass those tool specs through `UserInput.tools`.
 3. Avoid appending manual tool instructions to the system prompt.
 4. Keep output parsing compatible with both framework-native tool call events and text fallback parsing.
 5. Fall back to the current manual path when native template tool formatting is unsupported or broken.
 ## Impact On TokenPrefixCache And Prompt Reuse
 This change does not require a redesign of `TokenPrefixCache`, but it does affect cache behavior and rollout strategy.
 ### 1. No Core Cache Algorithm Change Is Required
 The current cache key is built from the prepared token sequence returned by `container.prepare(input:)`, plus image fingerprint augmentation for VL models.
 That means:
 - if tool formatting changes the rendered prompt, the token sequence changes
 - if the token sequence changes, the cache key changes automatically
 - prefix, supersequence, and LCP matching continue to work without algorithmic modification
 So the cache implementation itself does not need a new matching strategy just for native-template tools.
 ### 2. Cache Hits Become Strategy-Sensitive
 Even if the semantic request is identical, the manual path and the template-native path may render different prompt text.
 Result:
 - existing cache entries created under `manualPrompt` will usually not hit under `templateNative`
 - this is expected and safe
 - rollout will temporarily reduce cache hit rate for any model moved to the new path until fresh entries are built
 There is no cache migration requirement. Old entries can simply age out.
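 To make the strategy sensitivity concrete, here is a minimal sketch of a longest-common-prefix comparison between two prepared token sequences. It is not the actual `TokenPrefixCache` code; the function name `commonPrefixLength` and the token IDs are made up for illustration, and the only assumption carried over from this document is that cache reuse depends on shared token prefixes.
 ```swift
 /// Length of the longest common prefix of two token sequences.
 func commonPrefixLength(_ a: [Int], _ b: [Int]) -> Int {
     var n = 0
     while n < a.count, n < b.count, a[n] == b[n] {
         n += 1
     }
     return n
 }
 // Made-up token IDs: both renderings share a short preamble, then the
 // tool section is rendered differently by each strategy.
 let manualTokens = [1, 10, 11, 500, 501, 502, 900]
 let nativeTokens = [1, 10, 11, 700, 701, 900]
 let reusable = commonPrefixLength(manualTokens, nativeTokens)
 print("reusable prefix tokens: \(reusable)")  // prints "reusable prefix tokens: 3"
 ```
 In this toy case only the tokens before the tool section can be reused, which is why entries built under one strategy rarely help the other.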
 ### 3. Strategy Changes Can Fragment Cache Reuse
 If the same model sometimes uses `manualPrompt` and sometimes uses `templateNative`, prompt reuse becomes less predictable because token prefixes will diverge.
 Practical effect:
 - more misses across otherwise similar requests
 - less interpretable hit-rate statistics during rollout
 Recommended mitigation:
 - keep strategy stable per model
 - use an explicit allowlist rather than opportunistic per-request switching
 ### 4. Deterministic Tool Serialization Matters More
 `TokenPrefixCache` depends on stable prompt rendering: identical inputs must produce the identical prepared token sequence. If logically identical tool schemas are rendered with different key ordering or formatting across requests, cache hits will degrade.
 This matters more under a native-template path because tool schema serialization moves closer to template/framework behavior.
 Validation requirement:
 - the same tool definitions must render to the same token sequence across runs for a stable cache key
 This should be tested explicitly for any allowlisted model.
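 One way to check the app-side half of this requirement is to serialize tool schemas with sorted keys and compare the output. The sketch below uses Foundation's `JSONSerialization` with the `.sortedKeys` option; the helper name `canonicalJSON` and the sample schema are hypothetical. It only covers serialization the app controls. Whatever the chat template or framework does when rendering `tools` still has to be validated separately.
 ```swift
 import Foundation
 /// Serializes a JSON-compatible schema with sorted keys so that logically
 /// identical schemas always produce the same string.
 func canonicalJSON(_ schema: [String: Any]) throws -> String {
     let data = try JSONSerialization.data(withJSONObject: schema, options: [.sortedKeys])
     return String(decoding: data, as: UTF8.self)
 }
 // The same hypothetical schema, written with two different key orders.
 let schemaA: [String: Any] = [
     "name": "get_weather",
     "parameters": ["type": "object", "required": ["city"]] as [String: Any],
 ]
 let schemaB: [String: Any] = [
     "parameters": ["required": ["city"], "type": "object"] as [String: Any],
     "name": "get_weather",
 ]
 let renderingA = try canonicalJSON(schemaA)
 let renderingB = try canonicalJSON(schemaB)
 let sameRendering = renderingA == renderingB
 ```
 A regression test for an allowlisted model could apply the same idea one level later, by comparing the prepared token sequences from two runs instead of the JSON strings.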
 ### 5. Multi-Turn Replay Has Direct Cache Impact
 The current manual path reconstructs prior assistant tool calls and tool responses in deterministic model-specific text.
 If the native-template path replays history differently, then:
 - second-turn and later requests may produce different token prefixes
 - prefix reuse depth may shrink
 - supersequence and LCP opportunities may change even when conversation meaning is unchanged
 So history replay semantics are not just a correctness concern; they also affect cache reuse quality.
 ### 6. Image-Aware Cache Keying Is Unchanged
 The current vision cache-key augmentation based on image fingerprints is independent of tool formatting.
 Implication:
 - no change is needed to Gemma/Qwen image-aware cache key construction just because tools move from manual prompt text to `UserInput.tools`
 ### 7. Prompt Estimation May Need Adjustment
 Today, `PromptBuilder` estimates prompt size before prepare using app-constructed instruction and message text.
 Under a native-template path, some tool formatting moves inside the template/framework.
 Impact:
 - pre-prepare `estimatedBytes` and `estimatedPromptTokens` may become less representative
 - the actual prepared token count remains authoritative for cache keys and post-prepare accounting
 This does not break `TokenPrefixCache`, but it may require revisiting prompt estimation if UI or request validation depends on the earlier estimate.
 ## Recommended Design
 ### 1. Introduce a Real Strategy Type
 Add an explicit strategy abstraction for the API path.
 Suggested shape:
 ```swift
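 enum ToolFormattingStrategy {
     /// Current behavior: the app injects tool instructions into the prompt text.
     case manualPrompt
     /// Proposed: structured tools go through `UserInput.tools` and the chat template formats them.
     case templateNative
 }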
 ```
 This should become a real code path selector, not just a design note.
 ### 2. Do Not Auto-Detect Aggressively At First
 The original note suggested auto-detecting whether a model template supports tools natively.
 That is possible, but it is risky as an initial rollout because:
 - preparation succeeding does not prove correct tool formatting
 - a template may accept `tools` but produce malformed tool calls
 - model behavior can still vary across quantized or repackaged local builds
 Recommended first rollout:
 - start with an explicit allowlist of models verified to work with native template tools
 - keep all other models on the current manual path
 - only add dynamic detection later if there is a clear need
 ### 3. Add Conversion From API Tools To Framework Tool Specs
 `APIChatCompletionRequest.tools` uses the OpenAI-compatible app model.
 To support template-native formatting, the app will need a conversion layer from:
 - `APIToolDefinition`
 to:
 - the `mlx-swift-lm` native tool specification type used by `UserInput.tools`
 Required work:
 - map function names
 - map descriptions
 - map parameter schemas
 - preserve required vs optional fields
 - confirm how nested object/array schemas must be represented in the framework type
 This conversion should live in a dedicated helper instead of being embedded directly inside `PromptBuilder`.
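 A minimal sketch of such a helper follows. Everything in it is an assumption for illustration: `APIToolDefinitionSketch` stands in for the app's real `APIToolDefinition`, whose fields may differ, and `ToolSpecSketch` assumes the framework accepts OpenAI-style dictionaries, which must be confirmed against the actual `mlx-swift-lm` type before use.
 ```swift
 import Foundation
 // Hypothetical stand-in for the app's `APIToolDefinition`.
 struct APIToolDefinitionSketch {
     let name: String
     let description: String?
     let parameters: [String: Any]  // JSON Schema object, passed through unchanged
 }
 // Assumption: tool specs are plain OpenAI-style dictionaries.
 typealias ToolSpecSketch = [String: Any]
 enum ToolSpecConverter {
     static func convert(_ tool: APIToolDefinitionSketch) -> ToolSpecSketch {
         // The schema is copied as-is so required/optional fields and nested
         // object/array definitions are preserved.
         var function: [String: Any] = [
             "name": tool.name,
             "parameters": tool.parameters,
         ]
         if let description = tool.description {
             function["description"] = description
         }
         return ["type": "function", "function": function]
     }
 }
 let weather = APIToolDefinitionSketch(
     name: "get_weather",
     description: "Look up current weather",
     parameters: [
         "type": "object",
         "properties": ["city": ["type": "string"]],
         "required": ["city"],
     ]
 )
 let spec = ToolSpecConverter.convert(weather)
 ```
 Passing the parameter schema through untouched keeps the converter small, but it also means any schema shape the framework cannot handle will only surface at prepare time or in the rendered prompt.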
 ### 4. Update PromptBuilder To Support Both Paths
 `PromptBuilder` currently always uses the manual path.
 It will need to change so that:
 - on `manualPrompt`, behavior stays the same as today
 - on `templateNative`, manual system-prompt tool injection is skipped
 - on `templateNative`, `UserInput.tools` is populated with converted tool specs
 Important constraint:
 - message-history handling for assistant tool calls and tool outputs may also need strategy-dependent treatment
 The current replay logic assumes the app is responsible for reconstructing model-native text history. If the template-native path expects structured tool state instead, replay rules may need to change.
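 The split between the two paths could look roughly like the sketch below. `PromptPlan` and `planPrompt` are hypothetical names, not existing types in the codebase; `structuredTools` stands in for whatever would eventually be assigned to `UserInput.tools`, and history replay is deliberately left out because its rules are still an open question.
 ```swift
 enum ToolFormattingStrategy {
     case manualPrompt
     case templateNative
 }
 // Hypothetical result type; the real `UserInput` lives in the framework.
 struct PromptPlan {
     var instructions: String
     var structuredTools: [[String: Any]]?  // would feed `UserInput.tools`
 }
 func planPrompt(
     baseInstructions: String,
     tools: [[String: Any]],
     strategy: ToolFormattingStrategy,
     manualToolPrompt: ([[String: Any]]) -> String
 ) -> PromptPlan {
     guard !tools.isEmpty else {
         return PromptPlan(instructions: baseInstructions, structuredTools: nil)
     }
     switch strategy {
     case .manualPrompt:
         // Same as today: tool text is appended to the instructions, tools stay nil.
         let text = baseInstructions + "\n\n" + manualToolPrompt(tools)
         return PromptPlan(instructions: text, structuredTools: nil)
     case .templateNative:
         // The template renders the tools, so the app must not inject them again.
         return PromptPlan(instructions: baseInstructions, structuredTools: tools)
     }
 }
 ```
 Returning exactly one of the two representations per request makes it hard to end up with tools both injected into the text and passed as structured input.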
 ### 5. Verify History Replay Semantics
 This is one of the main reasons item 19 is not a trivial switch.
 Today, history replay is manual:
 - assistant tool calls are converted back into Qwen `<tool_call>` or Gemma `tool_code`
 - tool outputs are converted back into model-specific history text
 Questions that must be answered for a native-template path:
 1. Does the template expect previous assistant tool calls to appear as plain text, structured tool metadata, or both?
 2. Does the template expect tool responses to be represented through normal chat messages only, or via another structured field?
 3. Does the framework already shape those prior turns correctly when `UserInput.tools` is present?
 If the answer is not fully consistent across models, the app will still need model-specific replay logic even under `templateNative`.
 ### 6. Keep Output Parsing Hierarchy As-Is
 The output parsing hierarchy already matches the preferred design:
 1. framework-emitted tool calls first
 2. text parser fallback second
 That part likely does not need architectural change.
 However, the following should still be verified under the new path:
 - non-streaming tool responses
 - streaming tool-call chunks
 - multi-turn tool conversations
 - mixed content plus tool calls
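 The two-level hierarchy described above can be sketched as a single function. `ParsedToolCall` and `resolveToolCalls` are hypothetical names, and the text parser is passed in as a closure so the sketch does not assume anything about the real `ToolCallParser` API.
 ```swift
 struct ParsedToolCall: Equatable {
     let name: String
     let argumentsJSON: String
 }
 /// Prefers tool calls surfaced by the framework; falls back to parsing
 /// the generated text only when the framework reported none.
 func resolveToolCalls(
     frameworkCalls: [ParsedToolCall],
     generatedText: String,
     textParser: (String) -> [ParsedToolCall]
 ) -> [ParsedToolCall] {
     if !frameworkCalls.isEmpty {
         return frameworkCalls
     }
     return textParser(generatedText)
 }
 ```
 Because the fallback only runs when the framework list is empty, the same function serves both `manualPrompt` and `templateNative` without a strategy parameter.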
 ### 7. Add Safe Fallback Behavior
 This feature should not be all-or-nothing.
 Recommended behavior:
 - if the model is not allowlisted, use `manualPrompt`
 - if the model is allowlisted but native template behavior fails validation, fall back to `manualPrompt`
 - avoid silent partial activation, for example populating `UserInput.tools` while still injecting manual tool instructions
 Possible rollout options:
 - compile-time default to manual, enable native only in tests
 - runtime flag for development builds
 - per-model hardcoded allowlist after verification
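 The fallback rules above could be captured in a small resolver, sketched here under stated assumptions: `ToolStrategyResolver` and its `failedValidation` bookkeeping are hypothetical, and the model identifiers in the usage note are placeholders rather than the identifiers the app actually uses.
 ```swift
 enum ToolFormattingStrategy: Equatable {
     case manualPrompt
     case templateNative
 }
 struct ToolStrategyResolver {
     /// Model identifiers verified to work with native template tools.
     let allowlist: Set<String>
     /// Allowlisted models whose native path failed validation (hypothetical bookkeeping).
     var failedValidation: Set<String> = []
     /// Manual prompting is the default; native formatting requires an
     /// allowlist entry with no recorded validation failure.
     func strategy(forModel modelID: String) -> ToolFormattingStrategy {
         guard allowlist.contains(modelID), !failedValidation.contains(modelID) else {
             return .manualPrompt
         }
         return .templateNative
     }
 }
 ```
 For example, `ToolStrategyResolver(allowlist: ["qwen3.5-4b"]).strategy(forModel: "gemma-3-4b")` returns `.manualPrompt`. Resolving by model identifier alone, with no per-request input, also keeps the strategy stable per model, which is what the cache section recommends.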
 ## Suggested Implementation Steps
 1. Add `ToolFormattingStrategy` and wire it through the API prompt-building path.
 2. Add a converter from `APIToolDefinition` to framework-native tool specs.
 3. Update `PromptBuilder` so `UserInput.tools` can be populated for the native path.
 4. Keep manual prompt injection untouched as the fallback path.
 5. Verify how prior assistant tool calls and tool outputs must be replayed for native-template mode.
 6. Start with one verified model only.
 7. Add end-to-end tests for that model.
 8. Expand allowlist only after repeated validation.
 ## Testing Required
 This work would require new focused tests beyond the current manual-path coverage.
 Minimum required coverage:
 - native-template tool path can prepare successfully with tools present
 - model emits tool calls that the framework surfaces correctly
 - non-streaming response returns `finish_reason == "tool_calls"` when appropriate
 - streaming response emits OpenAI-compatible tool-call chunks in the correct order
 - tool-call arguments survive round-trip without schema loss
 - multi-turn tool conversation still replays correctly on the next request
 - fallback to `manualPrompt` still works for models outside the allowlist
 Recommended additional coverage:
 - one test per supported native-template model
 - explicit regression test for malformed tool output
 - replay test with prior assistant tool calls plus tool responses in history
 ## Risks
 Main risks:
 - template behavior differs across local model builds
 - framework-native tool support may accept a tool schema but not format prompts as expected
 - replay semantics may still require model-specific handling, reducing the benefit of the switch
 - debugging becomes harder because part of the prompt construction moves into model templates instead of app code
 ## Recommendation
 Treat this as a future experiment, not pending polish.
 It becomes worth doing only if at least one of these is true:
 - the current manual tool path shows a real correctness bug
 - a verified model demonstrates materially better tool behavior on the native-template path
 - upstream framework support becomes stable and well-documented enough to reduce integration risk
 Until then, the current manual implementation remains the safer default.
--- a/docs/session-cache-upgrade.md
+++ b/docs/session-cache-upgrade.md
@@ -2599,8 +2599,8 @@ Validation note: `InferenceStats.swift` now samples `TokenPrefixCache` directly
 ### Phase 7: Polish
-18. **Qwen3 EOS fix** — Verify first, implement if needed.
+18. **Qwen3 EOS fix** — Deferred unless a real stop-token overrun is reproduced. Keep as a verification-only item; no current evidence in this repo shows that an app-side EOS override is needed.
-19. **Native template tool formatting** — Switch from `.manualPrompt` to `.templateNative` once verified working.
+19. **Native template tool formatting** — Future experiment. See `docs/native-template-tool-formatting-plan.md` for the standalone implementation plan.
 ---