MLX Server

Native macOS SwiftUI app for local LLMs on Apple Silicon via MLX. Provides a chat UI and an embedded OpenAI-compatible API server. Supports vision, tool use, and thinking mode.

Quick Start

Always use ./build.sh to build the project — never call xcodebuild directly. The script runs xcodegen first (to pick up new/removed files) and uses the correct scheme, destination, and build directory.

Always use ./test.sh to run tests — it regenerates the Xcode project first and runs the shared MLXServer test scheme so test runs are reproducible.

Tests are required for finished work when the change is reasonably testable. Relevant tests must exist and must pass before work is considered complete.

Pre-existing errors don't exist: every error is your responsibility and you have to fix it before claiming you are done.

# Build (requires xcodegen: brew install xcodegen)
./build.sh

# Test
./test.sh

# Run
open "build/Debug/MLX Server.app"

Project Structure

MLXServer/MLXServerApp.swift — App entry point, GPU cache config, menu commands
MLXServer/ContentView.swift — Main layout, toolbar, keyboard shortcuts, focused values
MLXServer/Models/ModelConfig.swift — Model definitions (alias, repoId, contextLength), resolution
MLXServer/Models/ChatMessage.swift — Chat message data model, <think> tag parsing
MLXServer/ViewModels/ModelManager.swift — Model loading/switching via VLMModelFactory, download tracking, idle unload
MLXServer/ViewModels/ChatViewModel.swift — Chat state, ChatSession management, API server lifecycle
MLXServer/Server/APIServer.swift — NWListener HTTP server, SSE streaming, KV cache reuse, vision, tool call handling
MLXServer/Server/APIModels.swift — OpenAI-compatible Codable structs
MLXServer/Server/ToolCallParser.swift — Parses tool calls from model output (Gemma tool_code, Qwen XML tags)
MLXServer/Server/ToolPromptBuilder.swift — Model-specific tool prompt formatting
MLXServer/Views/DownloadModalView.swift — Modal overlay for model download progress
MLXServer/Views/ChatMessagesView.swift — Message bubbles with markdown rendering and collapsible thinking blocks
MLXServer/Views/ChatInputView.swift — Text input, image attach (file picker, drag & drop, Finder copy-paste)
MLXServer/Commands/SaveChatCommands.swift — File > Export Chat menu command
MLXServer/Utilities/LocalModelResolver.swift — Resolves HF repo IDs to local snapshots (sandbox + system cache + flat layouts)
MLXServer/Utilities/ChatExporter.swift — Export conversations to Markdown or RTF (Pages-compatible)
MLXServer/Utilities/FocusedValues.swift — FocusedValue keys for menu bar integration
MLXServer/Utilities/Preferences.swift — UserDefaults wrapper (model, thinking mode, API, idle timeout)
project.yml — xcodegen project spec
build.sh — Build script (xcodegen + xcodebuild)

Supported Models

Alias	HuggingFace ID	Notes
`gemma`	`mlx-community/gemma-3-4b-it-4bit`	Vision + tool use via `tool_code` blocks (128k context)
`qwen`	`mlx-community/Qwen3.5-4B-MLX-4bit`	Vision + thinking mode + tool use via `<tool_call>` tags (256k context)
`qwen3.5-0.8b`	`mlx-community/Qwen3.5-0.8B-4bit`	Vision + thinking mode + tool use via `<tool_call>` tags (256k context)
`qwen3.5-9b`	`mlx-community/Qwen3.5-9B-4bit`	Vision + thinking mode + tool use via `<tool_call>` tags (256k context)

Any model in MLX format on HuggingFace can be added — no restriction on uploader or architecture.

Critical Performance Rule

Inference speed is the #1 priority. The token generation loop must never be blocked or slowed by anything else — no MainActor hops, no SwiftUI observation, no synchronous I/O. Everything that isn't inference (stats collection, UI updates, logging) must run on separate threads via loose coupling:

LiveCounters (thread-safe singleton with OSAllocatedUnfairLock) is the bridge: generation code writes to it directly from any thread with zero actor overhead.
InferenceStats (UI-side, @Observable @MainActor) polls LiveCounters at 1Hz via a timer — never the other way around.
SSE streaming (sendSSEEvent/sendData) runs nonisolated off MainActor so token sends don't compete with SwiftUI rendering.
Never gate token output on UI state, analytics, or any @MainActor-isolated code.

Key Design Decisions

Uses mlx-swift-lm (MLXVLM / VLMModelFactory) as the inference backend — loads any MLX-format model from HuggingFace
Model-specific prompt formatting: Gemma uses tool_code blocks; Qwen uses <tool_call> XML tags
Offline-first: LocalModelResolver checks the sandboxed app container, system ~/.cache/huggingface/hub/, and flat download layouts — no network requests if model is cached
No duplicate storage: custom HubApi(cache: nil) with explicit downloadBase — models stored once in the snapshot cache, not duplicated across blob cache and snapshots
Thinking mode: enable_thinking passed to Jinja template context via additionalContext; <think>...</think> tags parsed in real-time during streaming and shown in collapsible UI blocks. Toggleable in Settings.
Download progress: separate isDownloading state from isLoading; modal overlay shows file count, percentage, speed
Idle unload: timer resets on both user input and model generation completion (not just request start)
Chat export: Markdown (user messages as blockquotes) and RTF (Pages-compatible with formatted markdown)
Finder paste: local event monitor intercepts Cmd+V to check pasteboard for image file URLs before TextField handles it
HTTP server built on Network.framework (NWListener) — no third-party server dependencies
KV cache reuse across API requests — reuses ChatSession when conversation history prefix matches
GPU cache limit set to 20 MB; cache cleared on model unload

Dependencies

Managed via Swift Package Manager (declared in project.yml for xcodegen).

Package	Products
`mlx-swift-lm`	`MLXLLM`, `MLXVLM`, `MLXLMCommon`
`swift-markdown-ui`	`MarkdownUI`

6.1 KiB Raw Blame History