feat: implement phase 2 of session-cache-upgrade.md

2026-03-20 08:57:54 +01:00
parent e98e5fd88b
commit e40a2f3c45
10 changed files with 1282 additions and 99 deletions

@@ -2564,9 +2564,11 @@ Each step should be independently buildable and testable.
### Phase 2: Core Engine
-4. **`PromptBuilder.swift`** — Convert API messages to UserInput. Test by comparing tokenized output to what ChatSession produces for the same messages.
-5. **`TokenPrefixCache.swift`** — The big one. Build trie + eviction + monitoring. Test: insert entries, verify lookup, verify eviction under memory pressure, verify trie cleanup.
-6. **`InferenceEngine.swift`** — Thin wrapper using `container.perform { ctx in MLXLMCommon.generate(input:cache:parameters:context:) }`. Test: run a simple prompt through it, verify output matches ChatSession output.
+4. [x] **`PromptBuilder.swift`** — Convert API messages to UserInput. Test by comparing tokenized output to what ChatSession produces for the same messages.
+5. [x] **`TokenPrefixCache.swift`** — The big one. Build trie + eviction + monitoring. Test: insert entries, verify lookup, verify eviction under memory pressure, verify trie cleanup.
+6. [x] **`InferenceEngine.swift`** — Thin wrapper using `container.perform { ctx in MLXLMCommon.generate(input:cache:parameters:context:) }`. Test: run a simple prompt through it, verify output matches ChatSession output.
+Validation note: `PromptBuilder.swift` is now covered by both shaping-parity unit tests and a model-backed tokenization parity test against the cached local Gemma 3 4B VLM. `InferenceEngine.swift` is now covered by a model-backed smoke test that compares one-token output and prompt-token counts against `ChatSession` on the same locally cached Gemma model.
### Phase 3: Integration
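The "trie + lookup + trie cleanup" test plan for step 5 in Phase 2 can be illustrated with a small token trie. This is a minimal sketch, not the contents of `TokenPrefixCache.swift`: the names `TrieNode`, `TokenTrie`, `entryID`, `longestPrefix(of:)` and `nodeCount` are hypothetical, and eviction and monitoring are left out.

```swift
// Hypothetical sketch of a token-prefix trie; all names are illustrative.
final class TrieNode {
    var children: [Int: TrieNode] = [:]
    var entryID: Int?   // set when a cached entry ends exactly at this node
}

final class TokenTrie {
    private let root = TrieNode()

    /// Records that a cache entry covers exactly `tokens`.
    func insert(_ tokens: [Int], entryID: Int) {
        var node = root
        for token in tokens {
            if let next = node.children[token] {
                node = next
            } else {
                let next = TrieNode()
                node.children[token] = next
                node = next
            }
        }
        node.entryID = entryID
    }

    /// Returns the longest cached prefix of `tokens`, if there is one.
    func longestPrefix(of tokens: [Int]) -> (entryID: Int, length: Int)? {
        var node = root
        var best: (entryID: Int, length: Int)? = nil
        for (index, token) in tokens.enumerated() {
            guard let next = node.children[token] else { break }
            node = next
            if let id = node.entryID { best = (id, index + 1) }
        }
        return best
    }

    /// Removes the entry for `tokens` and prunes nodes that became empty.
    func remove(_ tokens: [Int]) {
        var path: [(parent: TrieNode, token: Int)] = []
        var node = root
        for token in tokens {
            guard let next = node.children[token] else { return }
            path.append((node, token))
            node = next
        }
        node.entryID = nil
        // Walk back toward the root, dropping nodes with no entry and no children.
        for (parent, token) in path.reversed() {
            guard let child = parent.children[token],
                  child.entryID == nil, child.children.isEmpty else { break }
            parent.children[token] = nil
        }
    }

    /// Number of nodes below the root; handy for leak checks in tests.
    var nodeCount: Int {
        func count(_ node: TrieNode) -> Int {
            node.children.values.reduce(0) { $0 + 1 + count($1) }
        }
        return count(root)
    }
}
```

A cleanup test in this style would insert a few overlapping sequences, remove them again, and assert that `nodeCount` returns to zero.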
@@ -2614,8 +2616,8 @@ Each step should be independently buildable and testable.
### Memory Management
- [ ] Memory budget computed correctly from Metal device
-- [ ] Entries evicted under memory pressure (oldest first)
-- [ ] Expired entries pruned after 30 min idle
+- [x] Entries evicted under memory pressure (oldest first)
+- [x] Expired entries pruned after 30 min idle
- [ ] Trie nodes cleaned up when entries are evicted (no memory leak)
- [ ] `snapshot()` reports accurate memory usage and hit rates
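The two items checked off in the Memory Management list (oldest-first eviction and pruning after 30 minutes idle) could be exercised along the following lines. This is a sketch under stated assumptions: `CacheEntry`, `EntryStore`, `pruneExpired(now:)` and `evict(toFit:)` are hypothetical names, "oldest" is taken to mean least recently accessed, and the byte budget is passed in rather than derived from the Metal device.

```swift
import Foundation

// Hypothetical entry record; the real cache presumably holds KV-cache state too.
struct CacheEntry {
    let id: Int
    let byteSize: Int
    var lastAccess: Date
}

struct EntryStore {
    private(set) var entries: [Int: CacheEntry] = [:]
    let idleLimit: TimeInterval = 30 * 60   // 30 minutes, as in the checklist

    var totalBytes: Int { entries.values.reduce(0) { $0 + $1.byteSize } }

    mutating func insert(_ entry: CacheEntry) { entries[entry.id] = entry }

    /// Drops entries idle for longer than `idleLimit`; returns their ids.
    @discardableResult
    mutating func pruneExpired(now: Date) -> [Int] {
        let expired = entries.values
            .filter { now.timeIntervalSince($0.lastAccess) > idleLimit }
            .map(\.id)
        for id in expired { entries[id] = nil }
        return expired
    }

    /// Evicts least recently accessed entries until usage fits `budget` bytes.
    @discardableResult
    mutating func evict(toFit budget: Int) -> [Int] {
        var evicted: [Int] = []
        var remaining = totalBytes
        let oldestFirst = entries.values.sorted { $0.lastAccess < $1.lastAccess }
        for entry in oldestFirst {
            guard remaining > budget else { break }
            entries[entry.id] = nil
            remaining -= entry.byteSize
            evicted.append(entry.id)
        }
        return evicted
    }
}
```

Passing `now` in explicitly keeps the expiry test deterministic, since no real clock is involved.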
@@ -2628,7 +2630,7 @@ Each step should be independently buildable and testable.
### Streaming
- [ ] SSE JSON is valid and parseable by standard clients
-- [ ] `StreamingSSEEncoder` output matches `JSONEncoder` output byte-for-byte (for content deltas)
+- [x] `StreamingSSEEncoder` output matches `JSONEncoder` output byte-for-byte (for content deltas)
- [ ] Role delta sent once at stream start
- [ ] Tool call chunks sent correctly
- [ ] Final chunk has finish_reason and usage stats
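To make the Streaming checklist concrete, the sketch below frames a JSON payload as a server-sent event and decodes it again. It is not the project's `StreamingSSEEncoder`: the `Chunk` and `Usage` shapes, `sseEvent(for:)` and `decodeEvent(_:)` are hypothetical and heavily simplified. It uses `JSONEncoder` with `.sortedKeys` so that output is stable enough for byte-level comparisons.

```swift
import Foundation

// Hypothetical, simplified chunk shape for illustration only.
struct Usage: Codable, Equatable {
    let promptTokens: Int
    let completionTokens: Int
    enum CodingKeys: String, CodingKey {
        case promptTokens = "prompt_tokens"
        case completionTokens = "completion_tokens"
    }
}

struct Chunk: Codable, Equatable {
    var role: String?
    var content: String?
    var finishReason: String?
    var usage: Usage?
    enum CodingKeys: String, CodingKey {
        case role, content, usage
        case finishReason = "finish_reason"
    }
}

/// Encodes a chunk as one SSE event: "data: <json>" followed by a blank line.
func sseEvent(for chunk: Chunk) throws -> String {
    let encoder = JSONEncoder()
    encoder.outputFormatting = [.sortedKeys]   // deterministic key order
    let json = String(decoding: try encoder.encode(chunk), as: UTF8.self)
    return "data: \(json)\n\n"
}

/// Strips the SSE framing and decodes the JSON payload again.
func decodeEvent(_ event: String) throws -> Chunk {
    let payload = event.dropFirst("data: ".count).dropLast(2)
    return try JSONDecoder().decode(Chunk.self, from: Data(payload.utf8))
}
```

A parity test could encode the same content delta with a hand-written encoder and with `sseEvent(for:)`, then compare the two strings for equality; a final-chunk test could decode the last event and check that `finishReason` and `usage` are both present.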