MLX Server

Native macOS SwiftUI app for local LLMs on Apple Silicon via MLX. Provides a chat UI and an embedded OpenAI-compatible API server. Supports vision, tool use, and thinking mode.

Quick Start

Always use ./build.sh to build the project — never call xcodebuild directly. The script runs xcodegen first (to pick up new/removed files) and uses the correct scheme, destination, and build directory.

# Build (requires xcodegen: brew install xcodegen)
./build.sh

# Run
open "build/Debug/MLX Server.app"

Project Structure

MLXServer/MLXServerApp.swift — App entry point, GPU cache config, menu commands
MLXServer/ContentView.swift — Main layout, toolbar, keyboard shortcuts, focused values
MLXServer/Models/ModelConfig.swift — Model definitions (alias, repoId, contextLength), resolution
MLXServer/Models/ChatMessage.swift — Chat message data model, <think> tag parsing
MLXServer/ViewModels/ModelManager.swift — Model loading/switching via VLMModelFactory, download tracking, idle unload
MLXServer/ViewModels/ChatViewModel.swift — Chat state, ChatSession management, API server lifecycle
MLXServer/Server/APIServer.swift — NWListener HTTP server, SSE streaming, KV cache reuse, vision, tool call handling
MLXServer/Server/APIModels.swift — OpenAI-compatible Codable structs
MLXServer/Server/ToolCallParser.swift — Parses tool calls from model output (Gemma tool_code, Qwen XML tags)
MLXServer/Server/ToolPromptBuilder.swift — Model-specific tool prompt formatting
MLXServer/Views/DownloadModalView.swift — Modal overlay for model download progress
MLXServer/Views/ChatMessagesView.swift — Message bubbles with markdown rendering and collapsible thinking blocks
MLXServer/Views/ChatInputView.swift — Text input, image attach (file picker, drag & drop, Finder copy-paste)
MLXServer/Commands/SaveChatCommands.swift — File > Export Chat menu command
MLXServer/Utilities/LocalModelResolver.swift — Resolves HF repo IDs to local snapshots (sandbox + system cache + flat layouts)
MLXServer/Utilities/ChatExporter.swift — Export conversations to Markdown or RTF (Pages-compatible)
MLXServer/Utilities/FocusedValues.swift — FocusedValue keys for menu bar integration
MLXServer/Utilities/Preferences.swift — UserDefaults wrapper (model, thinking mode, API, idle timeout)
project.yml — xcodegen project spec
build.sh — Build script (xcodegen + xcodebuild)

Supported Models

Alias	HuggingFace ID	Notes
`gemma`	`mlx-community/gemma-3-4b-it-4bit`	Vision + tool use via `tool_code` blocks (128k context)
`qwen`	`mlx-community/Qwen3-VL-4B-Instruct-4bit`	Vision + tool use via `<tool_call>` tags (256k context)
`qwen3.5-9b`	`mlx-community/Qwen3.5-9B-4bit`	Thinking mode, tool use (256k context)

Any model in MLX format on HuggingFace can be added — no restriction on uploader or architecture.

Critical Performance Rule

Inference speed is the #1 priority. The token generation loop must never be blocked or slowed by anything else — no MainActor hops, no SwiftUI observation, no synchronous I/O. Everything that isn't inference (stats collection, UI updates, logging) must run on separate threads via loose coupling:

LiveCounters (thread-safe singleton with OSAllocatedUnfairLock) is the bridge: generation code writes to it directly from any thread with zero actor overhead.
InferenceStats (UI-side, @Observable @MainActor) polls LiveCounters at 1Hz via a timer — never the other way around.
SSE streaming (sendSSEEvent/sendData) runs nonisolated off MainActor so token sends don't compete with SwiftUI rendering.
Never gate token output on UI state, analytics, or any @MainActor-isolated code.

Key Design Decisions

Uses mlx-swift-lm (MLXVLM / VLMModelFactory) as the inference backend — loads any MLX-format model from HuggingFace
Model-specific prompt formatting: Gemma uses tool_code blocks; Qwen uses <tool_call> XML tags
Offline-first: LocalModelResolver checks the sandboxed app container, system ~/.cache/huggingface/hub/, and flat download layouts — no network requests if model is cached
No duplicate storage: custom HubApi(cache: nil) with explicit downloadBase — models stored once in the snapshot cache, not duplicated across blob cache and snapshots
Thinking mode: enable_thinking passed to Jinja template context via additionalContext; <think>...</think> tags parsed in real-time during streaming and shown in collapsible UI blocks. Toggleable in Settings.
Download progress: separate isDownloading state from isLoading; modal overlay shows file count, percentage, speed
Idle unload: timer resets on both user input and model generation completion (not just request start)
Chat export: Markdown (user messages as blockquotes) and RTF (Pages-compatible with formatted markdown)
Finder paste: local event monitor intercepts Cmd+V to check pasteboard for image file URLs before TextField handles it
HTTP server built on Network.framework (NWListener) — no third-party server dependencies
KV cache reuse across API requests — reuses ChatSession when conversation history prefix matches
GPU cache limit set to 20 MB; cache cleared on model unload

Dependencies

Managed via Swift Package Manager (declared in project.yml for xcodegen).

Package	Products
`mlx-swift-lm`	`MLXLLM`, `MLXVLM`, `MLXLMCommon`
`swift-markdown-ui`	`MarkdownUI`

5.5 KiB Raw Blame History