fix: improve API request handling; still not on par with the internal chat

2026-03-17 21:24:04 +01:00
parent 20f9c0bcc4
commit ed6cc5f5d1
4 changed files with 358 additions and 190 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -36,6 +36,15 @@ open "build/Debug/MLX Server.app"
 | `gemma` | `mlx-community/gemma-3-4b-it-4bit` | Vision + tool use via `tool_code` blocks (128k context) |
 | `qwen` | `mlx-community/Qwen3-VL-4B-Instruct-4bit` | Vision + tool use via `<tool_call>` tags (256k context) |

+## Critical Performance Rule
+
+**Inference speed is the #1 priority.** The token generation loop must never be blocked or slowed by anything else — no MainActor hops, no SwiftUI observation, no synchronous I/O. Everything that isn't inference (stats collection, UI updates, logging) must run on separate threads via loose coupling:
+
+- **`LiveCounters`** (thread-safe singleton with `OSAllocatedUnfairLock`) is the bridge: generation code writes to it directly from any thread with zero actor overhead.
+- **`InferenceStats`** (UI-side, `@Observable @MainActor`) polls `LiveCounters` at 1 Hz via a timer; `LiveCounters` never calls into `InferenceStats`.
+- SSE streaming (`sendSSEEvent`/`sendData`) runs `nonisolated`, off the MainActor, so token sends don't compete with SwiftUI rendering.
+- Never gate token output on UI state, analytics, or any `@MainActor`-isolated code.
+
 ## Key Design Decisions

 - Uses `mlx-swift-lm` (`MLXVLM` / `VLMModelFactory`) as the inference backend — supports both text and vision in a single model load