MLX Server
Native macOS SwiftUI app for local LLMs on Apple Silicon via MLX. Provides a chat UI and an embedded OpenAI-compatible API server. Supports vision, tool use, and thinking mode.
Quick Start
Always use ./build.sh to build the project — never call xcodebuild directly. The script runs xcodegen first (to pick up new/removed files) and uses the correct scheme, destination, and build directory.
```bash
# Build (requires xcodegen: brew install xcodegen)
./build.sh

# Run
open "build/Debug/MLX Server.app"
```
Project Structure
- `MLXServer/MLXServerApp.swift`: App entry point, GPU cache config, menu commands
- `MLXServer/ContentView.swift`: Main layout, toolbar, keyboard shortcuts, focused values
- `MLXServer/Models/ModelConfig.swift`: Model definitions (alias, repoId, contextLength), resolution
- `MLXServer/Models/ChatMessage.swift`: Chat message data model, `<think>` tag parsing
- `MLXServer/ViewModels/ModelManager.swift`: Model loading/switching via `VLMModelFactory`, download tracking, idle unload
- `MLXServer/ViewModels/ChatViewModel.swift`: Chat state, `ChatSession` management, API server lifecycle
- `MLXServer/Server/APIServer.swift`: `NWListener` HTTP server, SSE streaming, KV cache reuse, vision, tool call handling
- `MLXServer/Server/APIModels.swift`: OpenAI-compatible Codable structs
- `MLXServer/Server/ToolCallParser.swift`: Parses tool calls from model output (Gemma `tool_code`, Qwen XML tags)
- `MLXServer/Server/ToolPromptBuilder.swift`: Model-specific tool prompt formatting
- `MLXServer/Views/DownloadModalView.swift`: Modal overlay for model download progress
- `MLXServer/Views/ChatMessagesView.swift`: Message bubbles with markdown rendering and collapsible thinking blocks
- `MLXServer/Views/ChatInputView.swift`: Text input, image attach (file picker, drag & drop, Finder copy-paste)
- `MLXServer/Commands/SaveChatCommands.swift`: File > Export Chat menu command
- `MLXServer/Utilities/LocalModelResolver.swift`: Resolves HF repo IDs to local snapshots (sandbox + system cache + flat layouts)
- `MLXServer/Utilities/ChatExporter.swift`: Export conversations to Markdown or RTF (Pages-compatible)
- `MLXServer/Utilities/FocusedValues.swift`: FocusedValue keys for menu bar integration
- `MLXServer/Utilities/Preferences.swift`: UserDefaults wrapper (model, thinking mode, API, idle timeout)
- `project.yml`: xcodegen project spec
- `build.sh`: Build script (xcodegen + xcodebuild)
Supported Models
| Alias | HuggingFace ID | Notes |
|---|---|---|
| `gemma` | `mlx-community/gemma-3-4b-it-4bit` | Vision + tool use via `tool_code` blocks (128k context) |
| `qwen` | `mlx-community/Qwen3-VL-4B-Instruct-4bit` | Vision + tool use via `<tool_call>` tags (256k context) |
| `qwen3.5-9b` | `mlx-community/Qwen3.5-9B-4bit` | Thinking mode, tool use (256k context) |
Any model in MLX format on HuggingFace can be added — no restriction on uploader or architecture.
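As a rough illustration of how an alias table like the one above could be represented and resolved, here is a minimal sketch. The field names `alias`, `repoId`, and `contextLength` come from the project structure list; everything else (the `builtIn` array, the `resolve` function, the fallback for unknown `org/name` IDs, the default context length, and the rounded context sizes) is an assumption made for illustration and may not match the real `ModelConfig.swift`.

```swift
// Hypothetical sketch; the real ModelConfig.swift may be shaped differently.
struct ModelConfig: Equatable {
    let alias: String
    let repoId: String
    let contextLength: Int

    // Context lengths are rounded from the "128k"/"256k" figures in the table.
    static let builtIn: [ModelConfig] = [
        ModelConfig(alias: "gemma",
                    repoId: "mlx-community/gemma-3-4b-it-4bit",
                    contextLength: 128_000),
        ModelConfig(alias: "qwen",
                    repoId: "mlx-community/Qwen3-VL-4B-Instruct-4bit",
                    contextLength: 256_000),
        ModelConfig(alias: "qwen3.5-9b",
                    repoId: "mlx-community/Qwen3.5-9B-4bit",
                    contextLength: 256_000),
    ]

    /// Resolves a short alias or a full repo ID (case-insensitive).
    /// Assumption: an unknown "org/name" string is treated as an ad-hoc
    /// HuggingFace repo with a conservative default context length.
    static func resolve(_ name: String,
                        defaultContextLength: Int = 8_192) -> ModelConfig? {
        let key = name.lowercased()
        if let known = builtIn.first(where: {
            $0.alias == key || $0.repoId.lowercased() == key
        }) {
            return known
        }
        // Not a known alias and not shaped like a repo ID: give up.
        guard name.contains("/") else { return nil }
        return ModelConfig(alias: name,
                           repoId: name,
                           contextLength: defaultContextLength)
    }
}
```

With this shape, `ModelConfig.resolve("gemma")` returns the built-in entry, while a string such as `"some-org/some-model"` yields an ad-hoc config and a bare unknown word yields `nil`.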
Critical Performance Rule
Inference speed is the #1 priority. The token generation loop must never be blocked or slowed by anything else — no MainActor hops, no SwiftUI observation, no synchronous I/O. Everything that isn't inference (stats collection, UI updates, logging) must run on separate threads via loose coupling:
- `LiveCounters` (thread-safe singleton with `OSAllocatedUnfairLock`) is the bridge: generation code writes to it directly from any thread with zero actor overhead.
- `InferenceStats` (UI-side, `@Observable @MainActor`) polls `LiveCounters` at 1 Hz via a timer, never the other way around.
- SSE streaming (`sendSSEEvent`/`sendData`) runs nonisolated off MainActor so token sends don't compete with SwiftUI rendering.
- Never gate token output on UI state, analytics, or any `@MainActor`-isolated code.
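The bridge pattern described above can be sketched roughly as follows. This is not the project's actual `LiveCounters`; the `Snapshot` fields and the `recordToken`, `recordRequest`, and `snapshot` methods are hypothetical names chosen for illustration. The only pieces taken from the text are the singleton shape and the use of `OSAllocatedUnfairLock` (from the `os` module, macOS 13+).

```swift
import Foundation
import os

// Hypothetical sketch of a lock-protected counter bridge.
// Writers (the generation loop) never touch an actor; readers copy a snapshot.
final class LiveCounters: Sendable {
    static let shared = LiveCounters()

    // Plain value type, so a snapshot can be handed to the UI side safely.
    struct Snapshot: Sendable {
        var tokensGenerated = 0
        var requestsServed = 0
    }

    private let state = OSAllocatedUnfairLock(initialState: Snapshot())

    /// Called from the token loop on any thread; a short critical section only.
    func recordToken() {
        state.withLock { $0.tokensGenerated += 1 }
    }

    /// Called once per completed API request.
    func recordRequest() {
        state.withLock { $0.requestsServed += 1 }
    }

    /// Called by the UI-side poller; returns a copy, never a reference.
    func snapshot() -> Snapshot {
        state.withLock { $0 }
    }
}
```

Under this design, a `@MainActor` stats object would call `LiveCounters.shared.snapshot()` from its 1 Hz timer and publish the copied values, so the generation loop never waits on the main actor.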
Key Design Decisions
- Uses `mlx-swift-lm` (MLXVLM / `VLMModelFactory`) as the inference backend; loads any MLX-format model from HuggingFace
- Model-specific prompt formatting: Gemma uses `tool_code` blocks; Qwen uses `<tool_call>` XML tags
- Offline-first: `LocalModelResolver` checks the sandboxed app container, the system `~/.cache/huggingface/hub/`, and flat download layouts; no network requests if the model is cached
- No duplicate storage: custom `HubApi(cache: nil)` with explicit `downloadBase`; models are stored once in the snapshot cache, not duplicated across blob cache and snapshots
- Thinking mode: `enable_thinking` is passed to the Jinja template context via `additionalContext`; `<think>...</think>` tags are parsed in real time during streaming and shown in collapsible UI blocks. Toggleable in Settings.
- Download progress: `isDownloading` state is kept separate from `isLoading`; a modal overlay shows file count, percentage, and speed
- Idle unload: timer resets on both user input and model generation completion (not just request start)
- Chat export: Markdown (user messages as blockquotes) and RTF (Pages-compatible with formatted markdown)
- Finder paste: local event monitor intercepts Cmd+V to check pasteboard for image file URLs before TextField handles it
- HTTP server built on `Network.framework` (`NWListener`); no third-party server dependencies
- KV cache reuse across API requests: reuses `ChatSession` when the conversation history prefix matches
- GPU cache limit set to 20 MB; cache cleared on model unload
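The prefix check behind the KV cache reuse item can be illustrated with a small sketch. This is an assumption about the general technique, not the code in `APIServer.swift`: the `Message` type and the `newMessages(cached:incoming:)` function are hypothetical, and a real implementation would also have to account for how the session stores its own assistant replies.

```swift
// Hypothetical message type; the real API structs live in APIModels.swift.
struct Message: Equatable {
    let role: String
    let content: String
}

/// Decides whether a cached session can be reused for an incoming request.
/// Returns the messages that still have to be fed to the session, or nil
/// when the cached history is not a strict prefix of the incoming history
/// (in which case a fresh session, and a fresh KV cache, is needed).
func newMessages(cached: [Message],
                 incoming: [Message]) -> ArraySlice<Message>? {
    guard !cached.isEmpty,
          incoming.count > cached.count,
          incoming.starts(with: cached) else {
        return nil
    }
    // Only the suffix is new; the prefix is already encoded in the cache.
    return incoming.dropFirst(cached.count)
}
```

When the function returns a non-nil slice, only those trailing messages need to be processed, which is where the time saving comes from on long multi-turn conversations.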
Dependencies
Managed via Swift Package Manager (declared in project.yml for xcodegen).
| Package | Products |
|---|---|
| `mlx-swift-lm` | MLXLLM, MLXVLM, MLXLMCommon |
| `swift-markdown-ui` | MarkdownUI |