MLX Server
Native macOS app for running local LLMs on Apple Silicon via MLX. Built with SwiftUI, it provides both a chat UI and an embedded OpenAI-compatible API server. Supports both vision-capable and text-only MLX models, plus tool use and thinking mode where the selected model supports them.
Supported Models
| Alias | Model | Context | Loader | Capabilities |
|---|---|---|---|---|
gemma |
mlx-community/gemma-3-4b-it-4bit |
128k | VLMModelFactory |
Vision, tool use (tool_code blocks) |
qwen |
mlx-community/Qwen3.5-4B-MLX-4bit |
256k | VLMModelFactory |
Vision, thinking mode, tool use (<tool_call> tags) |
qwen3.5-0.8b |
mlx-community/Qwen3.5-0.8B-4bit |
256k | VLMModelFactory |
Vision, thinking mode, tool use (<tool_call> tags) |
qwen3.5-9b |
mlx-community/Qwen3.5-9B-4bit |
256k | VLMModelFactory |
Vision, thinking mode, tool use (<tool_call> tags) |
stheno |
synk/L3-8B-Stheno-v3.2-MLX |
8k | LLMModelFactory |
Text-only, llama-based |
violet-lotus |
hobaratio/MN-Violet-Lotus-12B-mlx-4Bit |
32k | LLMModelFactory |
Text-only, Mistral-based |
Any model in MLX format on HuggingFace can be added — there is no restriction on uploader or architecture.
Developer note: the test suite uses qwen3.5-0.8b as the main live-model target because it is substantially faster and lighter than the larger Qwen variants, but some tests still run on Gemma 3 because they validate Gemma-specific prompt shaping, cache-reuse behavior, and tool-call behavior that did not match Qwen3.5 0.8B closely enough.
Quick Start
Requires macOS 15+, Xcode 16.4+, and xcodegen (brew install xcodegen).
./build.sh # Debug build
open "build/Debug/MLX Server.app"
Run tests with the repo entrypoint:
./test.sh
For focused test runs, test.sh also accepts ONLY_TESTING and forwards it to xcodebuild -only-testing:
ONLY_TESTING='MLXServerTests/ModelBackedInferenceValidationTests/testLarge4KImageUsesGemmaResizeConfigAndPreparesSuccessfully' ./test.sh
This is intended for targeted validation while keeping the normal default as the full suite.
App Features
- Chat interface with markdown rendering and model-aware image attachments (file picker, drag & drop, clipboard paste, Finder copy-paste on vision-capable models)
- Scene-based chat starts — New Chat opens a scene picker with Neutral plus saved scenes, each with an optional model override, a scene prompt layered onto the base system prompt, an auto-sent starter prompt, and optional generation-setting overrides for chat-specific behavior
- Model picker in toolbar with curated defaults plus any locally discovered MLX models on disk
- Models window in the menu for downloading a model by HuggingFace ID, inspecting on-disk model sizes, and deleting local model folders
- Download progress modal — shows file progress, percentage, and speed when downloading a new model
- Thinking mode — models like Qwen3.5 can reason internally before responding; thinking content appears in a collapsible box. Toggle on/off in Settings.
- Streaming responses with live token display
- Native chat documents — save chats as
.mlxchatpackage documents, reopen them from File > Open Chat or by double-clicking them in Finder, and continue the conversation with restored model context, thinking blocks, and images - Export chat — File > Export Chat (Cmd+Shift+E) saves conversations as Markdown or RTF (Pages-compatible)
- Status bar showing model name, context window, tokens/sec, token counts, GPU memory, API server status
- Keyboard shortcuts:
Cmd+N(new chat),Cmd+O(open chat document),Cmd+S(save chat document),Cmd+Shift+S(save chat document as),Cmd+Shift+E(export),Cmd+Return(send),Escape(stop),Cmd+1/2/3/4/5(switch models) - Scene management — create and edit reusable roleplay/task presets from the New Chat flow or Settings
- Settings (
Cmd+,): default model, per-model generation defaults (temperature, top-p/top-k, min-p, repetition/presence/frequency penalties, max tokens, thinking mode), base system prompt, scene management, API port, API auto-start, idle unload timeout - Idle auto-unload — model is unloaded after configurable idle time (resets on both user input and model output), reloaded on next request
API Server
The embedded API server (toggle in toolbar) runs on port 1234 by default. Standard OpenAI-compatible endpoints:
GET /v1/models— lists available models withcontext_windowsizesPOST /v1/chat/completions— chat completions (streaming and non-streaming)GET /health— health check
Capability checks are enforced server-side. If a request sends images to a text-only model or tools to a model without tool support, the server returns a 400 invalid_request_error.
When a chat-completions request omits generation parameters, the API server falls back to the saved per-model defaults from Settings. Request-supplied values still take precedence on a per-call basis.
Model Swapping
Send any model ID or alias in the model field. If it differs from the currently loaded model, the server swaps automatically:
curl http://localhost:1234/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "gemma", "messages": [{"role": "user", "content": "Hello"}]}'
Vision
Pass images as base64 data URIs in the image_url content part:
{
"model": "gemma",
"messages": [{
"role": "user",
"content": [
{"type": "text", "text": "What's in this image?"},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
]
}]
}
Text-only models such as stheno reject image inputs.
Tool Use
Pass tools in the tools field (OpenAI format). The server handles model-specific formatting (Gemma tool_code blocks, Qwen <tool_call> XML tags) and parses tool calls from output automatically. When tools are present during streaming, output is buffered to strip tool-call markup before sending to the client.
stheno is currently documented and configured as a plain text model, so tool requests to it are rejected.
Project Structure
MLXServer/
├── MLXServerApp.swift — App entry point, GPU cache config
├── ContentView.swift — Main layout, toolbar, keyboard shortcuts
├── Models/
│ ├── ModelConfig.swift — Model definitions, alias/repoId resolution
│ └── ChatMessage.swift — Chat message data model, thinking tag parser
│ └── ChatScene.swift — Persisted chat scene presets (prompt + model + starter)
├── ViewModels/
│ ├── ModelManager.swift — Model loading/switching, download tracking, idle unload
│ └── ChatViewModel.swift — Chat state, ChatSession, API server lifecycle
│ └── SceneStore.swift — Scene persistence and editing operations
├── Views/
│ ├── SceneSelectionView.swift — New chat scene picker popover
│ ├── SceneManagementView.swift — Scene editor and list management
│ ├── ModelPickerView.swift — Toolbar model selector with re-download
│ ├── ChatMessagesView.swift — Scrollable message list with markdown + thinking blocks
│ ├── ChatInputView.swift — Text input + image attach (paste, drag, picker)
│ ├── DownloadModalView.swift — Model download progress overlay
│ ├── StatusBarView.swift — Model info, tok/s, GPU memory, API status
│ ├── MonitorView.swift — Inference statistics monitor
│ └── SettingsView.swift — System prompt, thinking mode, API, idle settings
├── Commands/
│ └── SaveChatCommands.swift — File menu new/open/save/revert/export commands
├── Documents/
│ ├── ChatDocumentController.swift — Queues Finder/app open-document requests into SwiftUI
│ ├── ChatDocumentManifest.swift — Versioned `.mlxchat` manifest schema
│ ├── ChatDocumentMigration.swift — Manifest schema migration entry point
│ └── ChatDocumentPackage.swift — Package document read/write for `.mlxchat`
├── Server/
│ ├── APIServer.swift — NWListener HTTP server, SSE streaming, KV cache reuse
│ ├── APIModels.swift — OpenAI-compatible Codable structs
│ ├── ToolCallParser.swift — Parses tool calls from model output
│ └── ToolPromptBuilder.swift — Model-specific tool prompt formatting
└── Utilities/
├── LocalModelResolver.swift — Offline-first HuggingFace cache resolution
├── ChatExporter.swift — Export conversations to Markdown or RTF
├── FocusedValues.swift — FocusedValue keys for menu bar integration
└── Preferences.swift — UserDefaults wrapper, including scene persistence
project.yml — xcodegen project spec (dependencies, settings, deployment target)
build.sh — One-command build script (xcodegen + xcodebuild)
Key Design Decisions
- Uses
mlx-swift-lmfor inference —VLMModelFactoryfor vision models andLLMModelFactoryfor text-only models - Offline-first:
LocalModelResolverchecks~/.cache/huggingface/hub/for locally-cached models before downloading - No duplicate storage: custom
HubApiwith blob cache disabled — models are stored once in the snapshot cache - KV cache reuse across API requests — reuses
ChatSessionwhen conversation history prefix matches - Thinking mode:
enable_thinkingpassed via Jinja template context;<think>tags parsed in real-time during streaming - HTTP server built on
Network.framework(NWListener) — no third-party server dependencies - Model-specific prompt formatting: Gemma uses
tool_codeblocks, Qwen uses<tool_call>XML tags - GPU cache limit set to 20 MB; cache cleared on model unload