feat: first tries at save dialog, so far failing

2026-03-18 11:40:43 +01:00
parent af8b8c9532
commit 82a77fdb0a
11 changed files with 445 additions and 128 deletions
--- a/README.md
+++ b/README.md
@@ -1,6 +1,6 @@
 # MLX Server

-Native macOS app for running local LLMs on Apple Silicon via [MLX](https://github.com/ml-explore/mlx). Built with SwiftUI, it provides both a **chat UI** and an embedded **OpenAI-compatible API server**. Supports vision and tool use with automatic model swapping.
+Native macOS app for running local LLMs on Apple Silicon via [MLX](https://github.com/ml-explore/mlx). Built with SwiftUI, it provides both a **chat UI** and an embedded **OpenAI-compatible API server**. Supports vision, tool use, and thinking mode.

 ## Supported Models

@@ -8,6 +8,9 @@ Native macOS app for running local LLMs on Apple Silicon via [MLX](https://githu
 |-------|-------|---------|-------------|
 | `gemma` | `mlx-community/gemma-3-4b-it-4bit` | 128k | Vision, tool use (`tool_code` blocks) |
 | `qwen` | `mlx-community/Qwen3-VL-4B-Instruct-4bit` | 256k | Vision, tool use (`<tool_call>` tags) |
+| `qwen3.5-9b` | `mlx-community/Qwen3.5-9B-4bit` | 256k | Thinking mode, tool use |
+
+Any model in MLX format on HuggingFace can be added — there is no restriction on uploader or architecture.

 ## Quick Start

@@ -20,12 +23,16 @@ open "build/Debug/MLX Server.app"

 ## App Features

-- **Chat interface** with markdown rendering, image attachments (file picker, drag & drop, clipboard paste)
-- **Model picker** in toolbar with local/download status indicators
+- **Chat interface** with markdown rendering, image attachments (file picker, drag & drop, clipboard paste, Finder copy-paste)
+- **Model picker** in toolbar with local/download status indicators and re-download button
+- **Download progress modal** — shows file progress, percentage, and speed when downloading a new model
+- **Thinking mode** — models like Qwen3.5 can reason internally before responding; thinking content appears in a collapsible box. Toggle on/off in Settings.
 - **Streaming responses** with live token display
+- **Export chat** — File > Export Chat (Cmd+Shift+S) saves conversations as Markdown or RTF (Pages-compatible)
 - **Status bar** showing model name, context window, tokens/sec, token counts, GPU memory, API server status
-- **Keyboard shortcuts**: `Cmd+N` (new chat), `Cmd+Return` (send), `Escape` (stop), `Cmd+1/2/3` (switch models)
-- **Settings** (`Cmd+,`): system prompt, API port, API auto-start
+- **Keyboard shortcuts**: `Cmd+N` (new chat), `Cmd+Return` (send), `Escape` (stop), `Cmd+1/2/3/4` (switch models), `Cmd+Shift+S` (export)
+- **Settings** (`Cmd+,`): default model, thinking mode toggle, system prompt, API port, API auto-start, idle unload timeout
+- **Idle auto-unload** — model is unloaded after configurable idle time (resets on both user input and model output), reloaded on next request

 ## API Server

@@ -74,23 +81,29 @@ MLXServer/
 ├── ContentView.swift               — Main layout, toolbar, keyboard shortcuts
 ├── Models/
 │   ├── ModelConfig.swift           — Model definitions, alias/repoId resolution
-│   └── ChatMessage.swift           — Chat message data model
+│   └── ChatMessage.swift           — Chat message data model, thinking tag parser
 ├── ViewModels/
-│   ├── ModelManager.swift          — Model loading/switching via VLMModelFactory
+│   ├── ModelManager.swift          — Model loading/switching, download tracking, idle unload
 │   └── ChatViewModel.swift         — Chat state, ChatSession, API server lifecycle
 ├── Views/
-│   ├── ModelPickerView.swift       — Toolbar model selector
-│   ├── ChatMessagesView.swift      — Scrollable message list with markdown
-│   ├── ChatInputView.swift         — Text input + image attach
+│   ├── ModelPickerView.swift       — Toolbar model selector with re-download
+│   ├── ChatMessagesView.swift      — Scrollable message list with markdown + thinking blocks
+│   ├── ChatInputView.swift         — Text input + image attach (paste, drag, picker)
+│   ├── DownloadModalView.swift     — Model download progress overlay
 │   ├── StatusBarView.swift         — Model info, tok/s, GPU memory, API status
-│   └── SettingsView.swift          — System prompt + API settings
+│   ├── MonitorView.swift           — Inference statistics monitor
+│   └── SettingsView.swift          — System prompt, thinking mode, API, idle settings
+├── Commands/
+│   └── SaveChatCommands.swift      — File menu export command
 ├── Server/
 │   ├── APIServer.swift             — NWListener HTTP server, SSE streaming, KV cache reuse
 │   ├── APIModels.swift             — OpenAI-compatible Codable structs
 │   ├── ToolCallParser.swift        — Parses tool calls from model output
 │   └── ToolPromptBuilder.swift     — Model-specific tool prompt formatting
 └── Utilities/
-    ├── LocalModelResolver.swift    — Offline-first HuggingFace cache resolution
+    ├── LocalModelResolver.swift    — Offline-first HuggingFace cache resolution (sandbox + system)
+    ├── ChatExporter.swift          — Export conversations to Markdown or RTF
+    ├── FocusedValues.swift         — FocusedValue keys for menu bar integration
    └── Preferences.swift           — UserDefaults wrapper

 project.yml     — xcodegen project spec (dependencies, settings, deployment target)
@@ -99,17 +112,11 @@ build.sh        — One-command build script (xcodegen + xcodebuild)

 ## Key Design Decisions

-- Uses `mlx-swift-lm` (`MLXVLM` / `VLMModelFactory`) for inference — supports both text and vision in a single model load
-- **Offline-first**: `LocalModelResolver` checks `~/.cache/huggingface/hub/` for locally-cached snapshots before downloading
+- Uses `mlx-swift-lm` (`MLXVLM` / `VLMModelFactory`) for inference — loads any MLX-format model from HuggingFace
+- **Offline-first**: `LocalModelResolver` checks both the sandboxed app container and `~/.cache/huggingface/hub/` for locally-cached models before downloading
+- **No duplicate storage**: custom `HubApi` with blob cache disabled — models are stored once in the snapshot cache
 - **KV cache reuse** across API requests — reuses `ChatSession` when conversation history prefix matches
+- **Thinking mode**: `enable_thinking` passed via Jinja template context; `<think>` tags parsed in real-time during streaming
 - HTTP server built on `Network.framework` (`NWListener`) — no third-party server dependencies
 - Model-specific prompt formatting: Gemma uses `tool_code` blocks, Qwen uses `<tool_call>` XML tags
 - GPU cache limit set to 20 MB; cache cleared on model unload
-
-## Design Notes
-
-- Uses `mlx_vlm` (not `mlx_lm`) as the backend — supports both text and vision in a single model load
-- Offline-first: if the model is cached locally (`~/.cache/huggingface/hub/`), no network requests are made
-- Thread lock on generation — MLX models aren't safe for concurrent generation
-- KV prefix caching for multi-turn conversations
-- Context window read from each model's config (Gemma 3 4B: 128k, Qwen3-VL 4B: 256k) with automatic summarization fallback
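
The new "Thinking mode" design note in this diff says `<think>` tags are parsed during streaming. The actual parser in `ChatMessage.swift` is not shown here, so the following is only a minimal sketch of one way to do it. The type name `ThinkTagSplitter` and its methods `feed` and `finish` are hypothetical and not taken from the repository. The sketch assumes the model emits literal `<think>` and `</think>` tags, which may be split across streamed chunks.

```swift
import Foundation

/// Hypothetical streaming splitter (not the repository's parser): separates
/// `<think>…</think>` content from the visible answer as chunks arrive.
struct ThinkTagSplitter {
    private static let openTag = "<think>"
    private static let closeTag = "</think>"

    private var insideThink = false
    private var pending = ""   // unprocessed tail that may hold a partial tag

    /// Feed one streamed chunk; returns the text that is safe to display now.
    mutating func feed(_ chunk: String) -> (thinking: String, answer: String) {
        pending += chunk
        var thinking = ""
        var answer = ""

        while !pending.isEmpty {
            let tag = insideThink ? Self.closeTag : Self.openTag
            if let range = pending.range(of: tag) {
                // A complete tag was found: emit the text before it, flip state.
                let before = String(pending[..<range.lowerBound])
                if insideThink { thinking += before } else { answer += before }
                pending = String(pending[range.upperBound...])
                insideThink.toggle()
            } else {
                // No complete tag: hold back the longest suffix that could
                // still turn into the tag once the next chunk arrives.
                let keep = Self.partialTagSuffixLength(of: pending, tag: tag)
                let emit = String(pending.dropLast(keep))
                if insideThink { thinking += emit } else { answer += emit }
                pending = String(pending.suffix(keep))
                break
            }
        }
        return (thinking, answer)
    }

    /// Flush whatever is still buffered when the stream ends.
    mutating func finish() -> (thinking: String, answer: String) {
        let rest = pending
        pending = ""
        return insideThink ? (rest, "") : ("", rest)
    }

    /// Length of the longest suffix of `text` that is a proper prefix of `tag`.
    private static func partialTagSuffixLength(of text: String, tag: String) -> Int {
        let maxLen = min(text.count, tag.count - 1)
        guard maxLen > 0 else { return 0 }
        for len in stride(from: maxLen, through: 1, by: -1) {
            if tag.hasPrefix(String(text.suffix(len))) { return len }
        }
        return 0
    }
}

// Usage: tags split across chunk boundaries are still recognised.
var splitter = ThinkTagSplitter()
for chunk in ["Hi <thi", "nk>plan</th", "ink>Done"] {
    let part = splitter.feed(chunk)
    print("thinking: \"\(part.thinking)\"  answer: \"\(part.answer)\"")
}
```

Holding back a possible partial tag means a few characters may be displayed one chunk late, but tag fragments never reach the chat view or the collapsible thinking box. A UI would append `thinking` to the collapsible block and `answer` to the visible message, then call `finish()` when the stream ends.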