feat: complete rewrite to swift

2026-03-17 19:12:54 +01:00
parent c80fe97f41
commit 5313b7175e
37 changed files with 3325 additions and 2122 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -1,50 +1,55 @@
 # MLX Server

-OpenAI-compatible API server for local LLMs on Apple Silicon via MLX. Supports Gemma 3 4B and Qwen3 VL 4B (vision + tool use).
+Native macOS SwiftUI app for local LLMs on Apple Silicon via MLX. Provides a chat UI and an embedded OpenAI-compatible API server. Supports vision and tool use.

 ## Quick Start

 ```bash
-# Activate virtual environment
-source .venv/bin/activate
+# Build (requires xcodegen: brew install xcodegen)
+./build.sh

-# Run with Gemma 3 (default)
-./run.sh
-
-# Run with Qwen3
-./run.sh qwen
-
-# Or directly:
-python -m mlx_server.main --model mlx-community/gemma-3-4b-it-4bit --port 1234
-python -m mlx_server.main --model mlx-community/Qwen3-VL-4B-Instruct-4bit --port 1234
+# Run
+open "build/Debug/MLX Server.app"
 ```

 ## Project Structure

- `mlx_server/main.py` — FastAPI server, endpoints, CLI entrypoint
- `mlx_server/engine.py` — Model loading, prompt building, generation (mlx_vlm)
- `mlx_server/models.py` — Pydantic models for OpenAI API request/response types
+- `MLXServer/MLXServerApp.swift` — App entry point, GPU cache config
+- `MLXServer/ContentView.swift` — Main layout, toolbar, keyboard shortcuts
+- `MLXServer/Models/ModelConfig.swift` — Model definitions (alias, repoId, contextLength), resolution
+- `MLXServer/Models/ChatMessage.swift` — Chat message data model
+- `MLXServer/ViewModels/ModelManager.swift` — Model loading/switching via VLMModelFactory, offline-first resolution
+- `MLXServer/ViewModels/ChatViewModel.swift` — Chat state, ChatSession management, API server lifecycle
+- `MLXServer/Server/APIServer.swift` — NWListener HTTP server, SSE streaming, KV cache reuse, vision, tool call handling
+- `MLXServer/Server/APIModels.swift` — OpenAI-compatible Codable structs
+- `MLXServer/Server/ToolCallParser.swift` — Parses tool calls from model output (Gemma tool_code, Qwen XML tags)
+- `MLXServer/Server/ToolPromptBuilder.swift` — Model-specific tool prompt formatting
+- `MLXServer/Utilities/LocalModelResolver.swift` — Resolves HF repo IDs to ~/.cache/huggingface/hub/ snapshots
+- `MLXServer/Utilities/Preferences.swift` — UserDefaults wrapper
+- `project.yml` — xcodegen project spec
+- `build.sh` — Build script (xcodegen + xcodebuild)

 ## Supported Models

 | Alias | HuggingFace ID | Notes |
 |-------|---------------|-------|
 | `gemma` | `mlx-community/gemma-3-4b-it-4bit` | Vision + tool use via `tool_code` blocks (128k context) |
-| `gemma3n` | `mlx-community/gemma-3n-E4B-it-4bit` | Vision/audio/video + tool use via `tool_code` blocks (32k context, ~1.5x faster) |
 | `qwen` | `mlx-community/Qwen3-VL-4B-Instruct-4bit` | Vision + tool use via `<tool_call>` tags (256k context) |

 ## Key Design Decisions

- Uses `mlx_vlm` (not `mlx_lm`) as the inference backend — this supports both text and vision in a single model load
- Model-specific prompt formatting: Gemma converts system→user/assistant pairs and uses `tool_code` blocks; Qwen3 uses native system role and `<tool_call>` XML tags
- Offline-first: if the model is already cached locally (~/.cache/huggingface/hub/), the server resolves the local snapshot path directly — no network requests are made (HEAD checks, update checks, etc.)
- Thread lock on generation (single-request-at-a-time) — MLX models aren't safe for concurrent generation
- Context window size is read from each model's config at load time (Gemma 3 4B: 128k, Qwen3-VL 4B: 256k)
+- Uses `mlx-swift-lm` (`MLXVLM` / `VLMModelFactory`) as the inference backend — supports both text and vision in a single model load
+- Model-specific prompt formatting: Gemma uses `tool_code` blocks; Qwen uses `<tool_call>` XML tags
+- Offline-first: if the model is already cached locally (~/.cache/huggingface/hub/), `LocalModelResolver` resolves the local snapshot path directly — no network requests
+- HTTP server built on `Network.framework` (`NWListener`) — no third-party server dependencies
+- KV cache reuse across API requests — reuses `ChatSession` when conversation history prefix matches
+- GPU cache limit set to 20 MB; cache cleared on model unload

 ## Dependencies

-Managed via `uv` and `pyproject.toml`. Virtual environment in `.venv/`.
+Managed via Swift Package Manager (declared in `project.yml` for xcodegen).

-```bash
-uv pip install -e "."
-```
+| Package | Products |
+|---------|----------|
+| `mlx-swift-lm` | `MLXLLM`, `MLXVLM`, `MLXLMCommon` |
+| `swift-markdown-ui` | `MarkdownUI` |
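
Reviewer note on the "KV cache reuse" bullet above: the diff says a `ChatSession` is reused when the conversation history prefix matches, but `APIServer.swift` is not shown here. The following is only a minimal sketch of what such a prefix check could look like. The `Message` type and both function names are hypothetical, and it assumes that reuse requires the incoming request to extend the entire cached history by at least one new message.

```swift
// Hypothetical message type for illustration; the real structs in
// APIModels.swift / ChatMessage.swift are not shown in this diff.
struct Message: Equatable {
    let role: String
    let content: String
}

/// Counts how many leading messages `cached` and `incoming` have in common.
func sharedPrefixLength(_ cached: [Message], _ incoming: [Message]) -> Int {
    var count = 0
    for (old, new) in zip(cached, incoming) {
        guard old == new else { break }
        count += 1
    }
    return count
}

/// Assumed reuse rule: the cached history must be non-empty, and the incoming
/// request must contain all of it, unchanged, followed by at least one new message.
func canReuseSession(cached: [Message], incoming: [Message]) -> Bool {
    !cached.isEmpty
        && incoming.count > cached.count
        && sharedPrefixLength(cached, incoming) == cached.count
}
```

Under this rule, a request that repeats the cached history and appends one new user message would reuse the session, while a request whose earlier messages were edited or reordered would fall back to a fresh session and a full prompt re-evaluation.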