MLX Server
Native macOS SwiftUI app for local LLMs on Apple Silicon via MLX. Provides a chat UI and an embedded OpenAI-compatible API server. Supports vision, tool use, and thinking mode.
Quick Start
Always use ./build.sh to build the project — never call xcodebuild directly. The script runs xcodegen first (to pick up new/removed files) and uses the correct scheme, destination, and build directory.
Always use ./test.sh to run tests — it regenerates the Xcode project first and runs the shared MLXServer test scheme so test runs are reproducible.
Tests are required for finished work when the change is reasonably testable. Relevant tests must exist and must pass before work is considered complete.
Pre-existing errors don't exist: every error is your responsibility and you have to fix it before claiming you are done.
# Build (requires xcodegen: brew install xcodegen)
./build.sh
# Test
./test.sh
# Run
open "build/Debug/MLX Server.app"
Project Structure
- MLXServer/MLXServerApp.swift: App entry point, GPU cache config, menu commands
- MLXServer/ContentView.swift: Main layout, toolbar, keyboard shortcuts, focused values
- MLXServer/Models/ModelConfig.swift: Model definitions (alias, repoId, contextLength), resolution
- MLXServer/Models/ChatMessage.swift: Chat message data model, <think> tag parsing
- MLXServer/ViewModels/ModelManager.swift: Model loading/switching via VLMModelFactory, download tracking, idle unload
- MLXServer/ViewModels/ChatViewModel.swift: Chat state, ChatSession management, API server lifecycle
- MLXServer/Server/APIServer.swift: NWListener HTTP server, SSE streaming, KV cache reuse, vision, tool call handling
- MLXServer/Server/APIModels.swift: OpenAI-compatible Codable structs
- MLXServer/Server/ToolCallParser.swift: Parses tool calls from model output (Gemma tool_code, Qwen XML tags)
- MLXServer/Server/ToolPromptBuilder.swift: Model-specific tool prompt formatting
- MLXServer/Views/DownloadModalView.swift: Modal overlay for model download progress
- MLXServer/Views/ChatMessagesView.swift: Message bubbles with markdown rendering and collapsible thinking blocks
- MLXServer/Views/ChatInputView.swift: Text input, image attach (file picker, drag & drop, Finder copy-paste)
- MLXServer/Commands/SaveChatCommands.swift: File > Export Chat menu command
- MLXServer/Utilities/LocalModelResolver.swift: Resolves HF repo IDs to local snapshots (sandbox + system cache + flat layouts)
- MLXServer/Utilities/ChatExporter.swift: Export conversations to Markdown or RTF (Pages-compatible)
- MLXServer/Utilities/FocusedValues.swift: FocusedValue keys for menu bar integration
- MLXServer/Utilities/Preferences.swift: UserDefaults wrapper (model, thinking mode, API, idle timeout)
- project.yml: xcodegen project spec
- build.sh: Build script (xcodegen + xcodebuild)
Supported Models
| Alias | HuggingFace ID | Notes |
|---|---|---|
| gemma | mlx-community/gemma-3-4b-it-4bit | Vision + tool use via tool_code blocks (128k context) |
| qwen | mlx-community/Qwen3.5-4B-MLX-4bit | Vision + thinking mode + tool use via <tool_call> tags (256k context) |
| qwen3.5-0.8b | mlx-community/Qwen3.5-0.8B-4bit | Vision + thinking mode + tool use via <tool_call> tags (256k context) |
| qwen3.5-9b | mlx-community/Qwen3.5-9B-4bit | Vision + thinking mode + tool use via <tool_call> tags (256k context) |
Any model in MLX format on HuggingFace can be added — no restriction on uploader or architecture.
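As a rough illustration of what adding a model might involve, here is a minimal sketch of a model definition with alias resolution. The field names (alias, repoId, contextLength) come from the Project Structure notes; the struct layout, the `known` list, the `resolve` helper, and the exact context-length numbers are assumptions for illustration and may not match the real ModelConfig.swift.

```swift
// Hypothetical sketch: the real ModelConfig in MLXServer/Models may differ.
struct ModelConfig: Equatable {
    let alias: String
    let repoId: String
    // The table lists "128k" / "256k"; the exact token counts used here
    // are an assumption, written as round numbers.
    let contextLength: Int

    // Two entries copied from the Supported Models table.
    static let known: [ModelConfig] = [
        ModelConfig(alias: "gemma",
                    repoId: "mlx-community/gemma-3-4b-it-4bit",
                    contextLength: 128_000),
        ModelConfig(alias: "qwen",
                    repoId: "mlx-community/Qwen3.5-4B-MLX-4bit",
                    contextLength: 256_000),
    ]

    // Resolve a user-supplied name: accept either a short alias or a
    // full HuggingFace repo ID, compared case-insensitively.
    static func resolve(_ name: String) -> ModelConfig? {
        let needle = name.lowercased()
        return known.first {
            $0.alias.lowercased() == needle || $0.repoId.lowercased() == needle
        }
    }
}
```

Under this sketch, supporting another model would mean appending one more entry to `known`.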
Critical Performance Rule
Inference speed is the #1 priority. The token generation loop must never be blocked or slowed by anything else — no MainActor hops, no SwiftUI observation, no synchronous I/O. Everything that isn't inference (stats collection, UI updates, logging) must run on separate threads via loose coupling:
- LiveCounters (thread-safe singleton with OSAllocatedUnfairLock) is the bridge: generation code writes to it directly from any thread with zero actor overhead.
- InferenceStats (UI-side, @Observable @MainActor) polls LiveCounters at 1 Hz via a timer, never the other way around.
- SSE streaming (sendSSEEvent/sendData) runs nonisolated off MainActor so token sends don't compete with SwiftUI rendering.
- Never gate token output on UI state, analytics, or any @MainActor-isolated code.
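The bridge pattern described above could look roughly like the following. This is a sketch, not the project's actual code: the `Snapshot` fields and the `addTokens` / `snapshot` method names are hypothetical, and only the use of OSAllocatedUnfairLock (available on macOS 13 and later) is taken from the text.

```swift
import os

// Hypothetical sketch of a lock-protected counter singleton.
// Field and method names are illustrative assumptions.
final class LiveCounters: Sendable {
    static let shared = LiveCounters()

    struct Snapshot: Sendable {
        var tokensGenerated = 0
    }

    // The lock owns the state, so every access goes through withLock.
    private let state = OSAllocatedUnfairLock(initialState: Snapshot())

    private init() {}

    // Called from the generation loop on any thread; no actor hop needed.
    func addTokens(_ count: Int) {
        state.withLock { $0.tokensGenerated += count }
    }

    // Called by the UI side when its timer fires; returns a copy.
    func snapshot() -> Snapshot {
        state.withLock { $0 }
    }
}
```

The generation loop only ever calls `addTokens`, which holds the lock for a single addition. A MainActor-isolated stats object would call `snapshot()` from its timer, so the UI pulls data and the inference path never waits on it.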
Key Design Decisions
- Uses mlx-swift-lm (MLXVLM/VLMModelFactory) as the inference backend; loads any MLX-format model from HuggingFace
- Model-specific prompt formatting: Gemma uses tool_code blocks; Qwen uses <tool_call> XML tags
- Offline-first: LocalModelResolver checks the sandboxed app container, system ~/.cache/huggingface/hub/, and flat download layouts; no network requests if the model is cached
- No duplicate storage: custom HubApi(cache: nil) with explicit downloadBase; models stored once in the snapshot cache, not duplicated across blob cache and snapshots
- Thinking mode: enable_thinking passed to Jinja template context via additionalContext; <think>...</think> tags parsed in real time during streaming and shown in collapsible UI blocks. Toggleable in Settings.
- Download progress: separate isDownloading state from isLoading; modal overlay shows file count, percentage, speed
- Idle unload: timer resets on both user input and model generation completion (not just request start)
- Chat export: Markdown (user messages as blockquotes) and RTF (Pages-compatible with formatted markdown)
- Finder paste: local event monitor intercepts Cmd+V to check pasteboard for image file URLs before TextField handles it
- HTTP server built on Network.framework (NWListener); no third-party server dependencies
- KV cache reuse across API requests: reuses ChatSession when conversation history prefix matches
- GPU cache limit set to 20 MB; cache cleared on model unload
Dependencies
Managed via Swift Package Manager (declared in project.yml for xcodegen).
| Package | Products |
|---|---|
| mlx-swift-lm | MLXLLM, MLXVLM, MLXLMCommon |
| swift-markdown-ui | MarkdownUI |