chore: close out phase 7 as out of scope
This commit is contained in:
371
docs/native-template-tool-formatting-plan.md
Normal file
371
docs/native-template-tool-formatting-plan.md
Normal file
@@ -0,0 +1,371 @@
|
|||||||
|
# Native Template Tool Formatting Plan
|
||||||
|
|
||||||
|
This document extracts Phase 7 item 19 from `session-cache-upgrade.md` into a standalone implementation plan.
|
||||||
|
|
||||||
|
The goal is to describe what would be required to move the API server from the current app-managed tool prompting approach to a model-template-native tool formatting approach later, without keeping the work buried inside the larger session/cache rewrite document.
|
||||||
|
|
||||||
|
## Summary
|
||||||
|
|
||||||
|
Current state:
|
||||||
|
|
||||||
|
- The app formats tool instructions itself.
|
||||||
|
- `PromptBuilder` injects tool definitions into prompt text.
|
||||||
|
- `ToolPromptBuilder` produces model-specific tool prompt text and replays assistant tool calls back into prompt history.
|
||||||
|
- `UserInput.tools` is currently not used for the API path.
|
||||||
|
|
||||||
|
Proposed future state:
|
||||||
|
|
||||||
|
- The app passes structured tools via `UserInput.tools`.
|
||||||
|
- The model's Jinja chat template formats tools natively.
|
||||||
|
- The app stops injecting tool instructions into the system prompt for models that are verified to support native template tools.
|
||||||
|
- Manual prompt formatting remains available as a fallback.
|
||||||
|
|
||||||
|
This is not a simple flag flip in the current codebase. It is a separate integration project.
|
||||||
|
|
||||||
|
## Why Consider This Later
|
||||||
|
|
||||||
|
Potential benefits:
|
||||||
|
|
||||||
|
- Less model-specific prompt text generation in app code.
|
||||||
|
- Closer alignment with template authors' intended tool formatting.
|
||||||
|
- Possible improvement in tool-call quality for models with reliable native tool templates.
|
||||||
|
- Reduced duplication between app-side prompt construction and template-side prompt construction.
|
||||||
|
|
||||||
|
Current reasons not to prioritize it immediately:
|
||||||
|
|
||||||
|
- The current manual path is already implemented and tested.
|
||||||
|
- Model-template behavior is not uniformly reliable. Phase 6 validation already showed that some local Qwen builds do not consistently honor their own documented thinking-tag contract.
|
||||||
|
- The current code does not yet contain a real runtime strategy switch between manual and native tool formatting.
|
||||||
|
|
||||||
|
## Current Implementation
|
||||||
|
|
||||||
|
Today, the API path does the following:
|
||||||
|
|
||||||
|
1. If tools are present, `PromptBuilder` appends a model-specific tool prompt into the instructions block.
|
||||||
|
2. Assistant tool calls in message history are rewritten back into model-native text form.
|
||||||
|
3. Tool outputs are also rewritten into model-specific history text.
|
||||||
|
4. `UserInput` is built with `tools: nil`.
|
||||||
|
5. Output parsing prefers framework-emitted tool calls first, then falls back to text parsing.
|
||||||
|
|
||||||
|
Files involved:
|
||||||
|
|
||||||
|
- `MLXServer/Server/PromptBuilder.swift`
|
||||||
|
- `MLXServer/Server/ToolPromptBuilder.swift`
|
||||||
|
- `MLXServer/Server/APIServer.swift`
|
||||||
|
- `MLXServer/Server/ToolCallParser.swift`
|
||||||
|
|
||||||
|
## Validated Local Model Templates
|
||||||
|
|
||||||
|
The following observations are based on the local model template files currently present in the MLX Server cache.
|
||||||
|
|
||||||
|
### Qwen3.5 0.8B, 4B, and 9B
|
||||||
|
|
||||||
|
Local Qwen3.5 templates do appear to support native tool formatting at the template level.
|
||||||
|
|
||||||
|
Observed capabilities in the local `chat_template.jinja` files:
|
||||||
|
|
||||||
|
- explicit `if tools` branch at the top of the template
|
||||||
|
- renders a `<tools>` block containing serialized tool definitions
|
||||||
|
- instructs the model to emit tool calls in a native Qwen XML format
|
||||||
|
- replays prior assistant `tool_calls` in template-native form
|
||||||
|
- replays `tool` role messages through `<tool_response>` wrappers
|
||||||
|
|
||||||
|
Implication:
|
||||||
|
|
||||||
|
- Qwen3.5 models are plausible candidates for a future `templateNative` allowlist.
|
||||||
|
|
||||||
|
Important caveat:
|
||||||
|
|
||||||
|
- template support on paper is not enough by itself. Phase 6 validation already showed that local Qwen3.5 builds do not consistently honor every documented template contract, specifically for `<think>...</think>` behavior. Native tool formatting for Qwen therefore still requires runtime validation, not just template inspection.
|
||||||
|
|
||||||
|
### Gemma 3 4B
|
||||||
|
|
||||||
|
The local Gemma template does not appear to support native tools.
|
||||||
|
|
||||||
|
Observed behavior in the local `chat_template.json`:
|
||||||
|
|
||||||
|
- no `tools` variable handling
|
||||||
|
- no native tool-definition rendering path
|
||||||
|
- no replay path for assistant `tool_calls`
|
||||||
|
- no dedicated `tool` role handling
|
||||||
|
- template structure is focused on alternating user/model turns and image placeholders only
|
||||||
|
|
||||||
|
Implication:
|
||||||
|
|
||||||
|
- Gemma must remain on the current manual prompt formatting path unless a different local template or upstream framework behavior is introduced.
|
||||||
|
|
||||||
|
### Practical Conclusion
|
||||||
|
|
||||||
|
If this work is taken on later, the initial allowlist should be:
|
||||||
|
|
||||||
|
- Qwen3.5 family: possible candidate, but only after runtime validation
|
||||||
|
- Gemma 3: not a candidate under the current local template
|
||||||
|
|
||||||
|
## Target Implementation
|
||||||
|
|
||||||
|
For verified models, the API path should be able to:
|
||||||
|
|
||||||
|
1. Convert OpenAI-format tool definitions into framework-native tool specs.
|
||||||
|
2. Pass those tool specs through `UserInput.tools`.
|
||||||
|
3. Avoid appending manual tool instructions to the system prompt.
|
||||||
|
4. Keep output parsing compatible with both framework-native tool call events and text fallback parsing.
|
||||||
|
5. Fall back to the current manual path when native template tool formatting is unsupported or broken.
|
||||||
|
|
||||||
|
## Impact On TokenPrefixCache And Prompt Reuse
|
||||||
|
|
||||||
|
This change does not require a redesign of `TokenPrefixCache`, but it does affect cache behavior and rollout strategy.
|
||||||
|
|
||||||
|
### 1. No Core Cache Algorithm Change Is Required
|
||||||
|
|
||||||
|
The current cache key is built from the prepared token sequence returned by `container.prepare(input:)`, plus image fingerprint augmentation for VL models.
|
||||||
|
|
||||||
|
That means:
|
||||||
|
|
||||||
|
- if tool formatting changes the rendered prompt, the token sequence changes
|
||||||
|
- if the token sequence changes, the cache key changes automatically
|
||||||
|
- prefix, supersequence, and LCP matching continue to work without algorithmic modification
|
||||||
|
|
||||||
|
So the cache implementation itself does not need a new matching strategy just for native-template tools.
|
||||||
|
|
||||||
|
### 2. Cache Hits Become Strategy-Sensitive
|
||||||
|
|
||||||
|
Even if the semantic request is identical, the manual path and the template-native path may render different prompt text.
|
||||||
|
|
||||||
|
Result:
|
||||||
|
|
||||||
|
- existing cache entries created under `manualPrompt` will usually not hit under `templateNative`
|
||||||
|
- this is expected and safe
|
||||||
|
- rollout will temporarily reduce cache hit rate for any model moved to the new path until fresh entries are built
|
||||||
|
|
||||||
|
There is no cache migration requirement. Old entries can simply age out.
|
||||||
|
|
||||||
|
### 3. Strategy Changes Can Fragment Cache Reuse
|
||||||
|
|
||||||
|
If the same model sometimes uses `manualPrompt` and sometimes uses `templateNative`, prompt reuse becomes less predictable because token prefixes will diverge.
|
||||||
|
|
||||||
|
Practical effect:
|
||||||
|
|
||||||
|
- more misses across otherwise similar requests
|
||||||
|
- less interpretable hit-rate statistics during rollout
|
||||||
|
|
||||||
|
Recommended mitigation:
|
||||||
|
|
||||||
|
- keep strategy stable per model
|
||||||
|
- use an explicit allowlist rather than opportunistic per-request switching
|
||||||
|
|
||||||
|
### 4. Deterministic Tool Serialization Matters More
|
||||||
|
|
||||||
|
TokenPrefixCache depends on byte-stable prompt rendering. If logically identical tool schemas are rendered with different key ordering or formatting across requests, cache hits will degrade.
|
||||||
|
|
||||||
|
This matters more under a native-template path because tool schema serialization moves closer to template/framework behavior.
|
||||||
|
|
||||||
|
Validation requirement:
|
||||||
|
|
||||||
|
- the same tool definitions must render to the same token sequence across runs for a stable cache key
|
||||||
|
|
||||||
|
This should be tested explicitly for any allowlisted model.
|
||||||
|
|
||||||
|
### 5. Multi-Turn Replay Has Direct Cache Impact
|
||||||
|
|
||||||
|
The current manual path reconstructs prior assistant tool calls and tool responses in deterministic model-specific text.
|
||||||
|
|
||||||
|
If the native-template path replays history differently, then:
|
||||||
|
|
||||||
|
- second-turn and later requests may produce different token prefixes
|
||||||
|
- prefix reuse depth may shrink
|
||||||
|
- supersequence and LCP opportunities may change even when conversation meaning is unchanged
|
||||||
|
|
||||||
|
So history replay semantics are not just a correctness concern; they also affect cache reuse quality.
|
||||||
|
|
||||||
|
### 6. Image-Aware Cache Keying Is Unchanged
|
||||||
|
|
||||||
|
The current vision cache-key augmentation based on image fingerprints is independent of tool formatting.
|
||||||
|
|
||||||
|
Implication:
|
||||||
|
|
||||||
|
- no change is needed to Gemma/Qwen image-aware cache key construction just because tools move from manual prompt text to `UserInput.tools`
|
||||||
|
|
||||||
|
### 7. Prompt Estimation May Need Adjustment
|
||||||
|
|
||||||
|
Today, `PromptBuilder` estimates prompt size before prepare using app-constructed instruction and message text.
|
||||||
|
|
||||||
|
Under a native-template path, some tool formatting moves inside the template/framework.
|
||||||
|
|
||||||
|
Impact:
|
||||||
|
|
||||||
|
- pre-prepare `estimatedBytes` and `estimatedPromptTokens` may become less representative
|
||||||
|
- the actual prepared token count remains authoritative for cache keys and post-prepare accounting
|
||||||
|
|
||||||
|
This does not break TokenPrefixCache, but it may require revisiting prompt estimation if UI or request validation depends on the earlier estimate.
|
||||||
|
|
||||||
|
## Recommended Design
|
||||||
|
|
||||||
|
### 1. Introduce a Real Strategy Type
|
||||||
|
|
||||||
|
Add an explicit strategy abstraction for the API path.
|
||||||
|
|
||||||
|
Suggested shape:
|
||||||
|
|
||||||
|
```swift
|
||||||
|
enum ToolFormattingStrategy {
|
||||||
|
case manualPrompt
|
||||||
|
case templateNative
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
This should become a real code path selector, not just a design note.
|
||||||
|
|
||||||
|
### 2. Do Not Auto-Detect Aggressively At First
|
||||||
|
|
||||||
|
The original note suggested auto-detecting whether a model template supports tools natively.
|
||||||
|
|
||||||
|
That is possible, but it is risky as an initial rollout because:
|
||||||
|
|
||||||
|
- preparation succeeding does not prove correct tool formatting
|
||||||
|
- a template may accept `tools` but produce malformed tool calls
|
||||||
|
- model behavior can still vary across quantized or repackaged local builds
|
||||||
|
|
||||||
|
Recommended first rollout:
|
||||||
|
|
||||||
|
- start with an explicit allowlist of models verified to work with native template tools
|
||||||
|
- keep all other models on the current manual path
|
||||||
|
- only add dynamic detection later if there is a clear need
|
||||||
|
|
||||||
|
### 3. Add Conversion From API Tools To Framework Tool Specs
|
||||||
|
|
||||||
|
`APIChatCompletionRequest.tools` uses the OpenAI-compatible app model.
|
||||||
|
|
||||||
|
To support template-native formatting, the app will need a conversion layer from:
|
||||||
|
|
||||||
|
- `APIToolDefinition`
|
||||||
|
|
||||||
|
to:
|
||||||
|
|
||||||
|
- the `mlx-swift-lm` native tool specification type used by `UserInput.tools`
|
||||||
|
|
||||||
|
Required work:
|
||||||
|
|
||||||
|
- map function names
|
||||||
|
- map descriptions
|
||||||
|
- map parameter schemas
|
||||||
|
- preserve required vs optional fields
|
||||||
|
- confirm how nested object/array schemas must be represented in the framework type
|
||||||
|
|
||||||
|
This conversion should live in a dedicated helper instead of being embedded directly inside `PromptBuilder`.
|
||||||
|
|
||||||
|
### 4. Update PromptBuilder To Support Both Paths
|
||||||
|
|
||||||
|
`PromptBuilder` currently always uses the manual path.
|
||||||
|
|
||||||
|
It will need to change so that:
|
||||||
|
|
||||||
|
- on `manualPrompt`, behavior stays the same as today
|
||||||
|
- on `templateNative`, manual system-prompt tool injection is skipped
|
||||||
|
- on `templateNative`, `UserInput.tools` is populated with converted tool specs
|
||||||
|
|
||||||
|
Important constraint:
|
||||||
|
|
||||||
|
- message-history handling for assistant tool calls and tool outputs may also need strategy-dependent treatment
|
||||||
|
|
||||||
|
The current replay logic assumes the app is responsible for reconstructing model-native text history. If the template-native path expects structured tool state instead, replay rules may need to change.
|
||||||
|
|
||||||
|
### 5. Verify History Replay Semantics
|
||||||
|
|
||||||
|
This is one of the main reasons item 19 is not a trivial switch.
|
||||||
|
|
||||||
|
Today, history replay is manual:
|
||||||
|
|
||||||
|
- assistant tool calls are converted back into Qwen `<tool_call>` or Gemma `tool_code`
|
||||||
|
- tool outputs are converted back into model-specific history text
|
||||||
|
|
||||||
|
Questions that must be answered for a native-template path:
|
||||||
|
|
||||||
|
1. Does the template expect previous assistant tool calls to appear as plain text, structured tool metadata, or both?
|
||||||
|
2. Does the template expect tool responses to be represented through normal chat messages only, or via another structured field?
|
||||||
|
3. Does the framework already shape those prior turns correctly when `UserInput.tools` is present?
|
||||||
|
|
||||||
|
If the answer is not fully consistent across models, the app will still need model-specific replay logic even under `templateNative`.
|
||||||
|
|
||||||
|
### 6. Keep Output Parsing Hierarchy As-Is
|
||||||
|
|
||||||
|
The output parsing hierarchy already matches the preferred design:
|
||||||
|
|
||||||
|
1. framework-emitted tool calls first
|
||||||
|
2. text parser fallback second
|
||||||
|
|
||||||
|
That part likely does not need architectural change.
|
||||||
|
|
||||||
|
However, the following should still be verified under the new path:
|
||||||
|
|
||||||
|
- non-streaming tool responses
|
||||||
|
- streaming tool-call chunks
|
||||||
|
- multi-turn tool conversations
|
||||||
|
- mixed content plus tool calls
|
||||||
|
|
||||||
|
### 7. Add Safe Fallback Behavior
|
||||||
|
|
||||||
|
This feature should not be all-or-nothing.
|
||||||
|
|
||||||
|
Recommended behavior:
|
||||||
|
|
||||||
|
- if model is not allowlisted, use `manualPrompt`
|
||||||
|
- if model is allowlisted but native template behavior fails validation, fall back to `manualPrompt`
|
||||||
|
- avoid silent partial activation
|
||||||
|
|
||||||
|
Possible rollout options:
|
||||||
|
|
||||||
|
- compile-time default to manual, enable native only in tests
|
||||||
|
- runtime flag for development builds
|
||||||
|
- per-model hardcoded allowlist after verification
|
||||||
|
|
||||||
|
## Suggested Implementation Steps
|
||||||
|
|
||||||
|
1. Add `ToolFormattingStrategy` and wire it through the API prompt-building path.
|
||||||
|
2. Add a converter from `APIToolDefinition` to framework-native tool specs.
|
||||||
|
3. Update `PromptBuilder` so `UserInput.tools` can be populated for the native path.
|
||||||
|
4. Keep manual prompt injection untouched as the fallback path.
|
||||||
|
5. Verify how prior assistant tool calls and tool outputs must be replayed for native-template mode.
|
||||||
|
6. Start with one verified model only.
|
||||||
|
7. Add end-to-end tests for that model.
|
||||||
|
8. Expand allowlist only after repeated validation.
|
||||||
|
|
||||||
|
## Testing Required
|
||||||
|
|
||||||
|
This work would require new focused tests beyond the current manual-path coverage.
|
||||||
|
|
||||||
|
Minimum required coverage:
|
||||||
|
|
||||||
|
- native-template tool path can prepare successfully with tools present
|
||||||
|
- model emits tool calls that the framework surfaces correctly
|
||||||
|
- non-streaming response returns `finish_reason == "tool_calls"` when appropriate
|
||||||
|
- streaming response emits OpenAI-compatible tool-call chunks in the correct order
|
||||||
|
- tool-call arguments survive round-trip without schema loss
|
||||||
|
- multi-turn tool conversation still replays correctly on the next request
|
||||||
|
- fallback to `manualPrompt` still works for models outside the allowlist
|
||||||
|
|
||||||
|
Recommended additional coverage:
|
||||||
|
|
||||||
|
- one test per supported native-template model
|
||||||
|
- explicit regression test for malformed tool output
|
||||||
|
- replay test with prior assistant tool calls plus tool responses in history
|
||||||
|
|
||||||
|
## Risks
|
||||||
|
|
||||||
|
Main risks:
|
||||||
|
|
||||||
|
- template behavior differs across local model builds
|
||||||
|
- framework-native tool support may accept a tool schema but not format prompts as expected
|
||||||
|
- replay semantics may still require model-specific handling, reducing the benefit of the switch
|
||||||
|
- debugging becomes harder because part of the prompt construction moves into model templates instead of app code
|
||||||
|
|
||||||
|
## Recommendation
|
||||||
|
|
||||||
|
Treat this as a future experiment, not pending polish.
|
||||||
|
|
||||||
|
It becomes worth doing only if at least one of these is true:
|
||||||
|
|
||||||
|
- the current manual tool path shows a real correctness bug
|
||||||
|
- a verified model demonstrates materially better tool behavior on the native-template path
|
||||||
|
- upstream framework support becomes stable and well-documented enough to reduce integration risk
|
||||||
|
|
||||||
|
Until then, the current manual implementation remains the safer default.
|
||||||
@@ -2599,8 +2599,8 @@ Validation note: `InferenceStats.swift` now samples `TokenPrefixCache` directly
|
|||||||
|
|
||||||
### Phase 7: Polish
|
### Phase 7: Polish
|
||||||
|
|
||||||
18. **Qwen3 EOS fix** — Verify first, implement if needed.
|
18. **Qwen3 EOS fix** — Deferred unless a real stop-token overrun is reproduced. Keep as a verification-only item; no current evidence in this repo shows that an app-side EOS override is needed.
|
||||||
19. **Native template tool formatting** — Switch from `.manualPrompt` to `.templateNative` once verified working.
|
19. **Native template tool formatting** — Future experiment. See `docs/native-template-tool-formatting-plan.md` for the standalone implementation plan.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|||||||
Reference in New Issue
Block a user