MLX Server

Native macOS app for running local LLMs on Apple Silicon via MLX. Built with SwiftUI, it provides both a chat UI and an embedded OpenAI-compatible API server. Supports both vision-capable and text-only MLX models, plus tool use and thinking mode where the selected model supports them.

Supported Models

Alias Model Context Loader Capabilities
gemma mlx-community/gemma-3-4b-it-4bit 128k VLMModelFactory Vision, tool use (tool_code blocks)
qwen mlx-community/Qwen3.5-4B-MLX-4bit 256k VLMModelFactory Vision, thinking mode, tool use (<tool_call> tags)
qwen3.5-0.8b mlx-community/Qwen3.5-0.8B-4bit 256k VLMModelFactory Vision, thinking mode, tool use (<tool_call> tags)
qwen3.5-9b mlx-community/Qwen3.5-9B-4bit 256k VLMModelFactory Vision, thinking mode, tool use (<tool_call> tags)
stheno synk/L3-8B-Stheno-v3.2-MLX 8k LLMModelFactory Text-only, llama-based
violet-lotus hobaratio/MN-Violet-Lotus-12B-mlx-4Bit 32k LLMModelFactory Text-only, Mistral-based

Any model in MLX format on HuggingFace can be added — there is no restriction on uploader or architecture.

Developer note: the test suite uses qwen3.5-0.8b as the main live-model target because it is substantially faster and lighter than the larger Qwen variants, but some tests still run on Gemma 3 because they validate Gemma-specific prompt shaping, cache-reuse behavior, and tool-call behavior that did not match Qwen3.5 0.8B closely enough.

Quick Start

Requires macOS 15+, Xcode 16.4+, and xcodegen (brew install xcodegen).

./build.sh            # Debug build
open "build/Debug/MLX Server.app"

Run tests with the repo entrypoint:

./test.sh

For focused test runs, test.sh also accepts ONLY_TESTING and forwards it to xcodebuild -only-testing:

ONLY_TESTING='MLXServerTests/ModelBackedInferenceValidationTests/testLarge4KImageUsesGemmaResizeConfigAndPreparesSuccessfully' ./test.sh

This is intended for targeted validation while keeping the normal default as the full suite.

App Features

  • Chat interface with markdown rendering and model-aware image attachments (file picker, drag & drop, clipboard paste, Finder copy-paste on vision-capable models)
  • Scene-based chat starts — New Chat opens a scene picker with Neutral plus saved scenes, each with an optional model override, a scene prompt layered onto the base system prompt, an auto-sent starter prompt, and optional generation-setting overrides for chat-specific behavior
  • Model picker in toolbar with curated defaults plus any locally discovered MLX models on disk
  • Models window in the menu for downloading a model by HuggingFace ID, inspecting on-disk model sizes, and deleting local model folders
  • Download progress modal — shows file progress, percentage, and speed when downloading a new model
  • Thinking mode — models like Qwen3.5 can reason internally before responding; thinking content appears in a collapsible box. Toggle on/off in Settings.
  • Streaming responses with live token display
  • Native chat documents — save chats as .mlxchat package documents, reopen them from File > Open Chat or by double-clicking them in Finder, and continue the conversation with restored model context, thinking blocks, and images
  • Export chat — File > Export Chat (Cmd+Shift+E) saves conversations as Markdown or RTF (Pages-compatible)
  • Status bar showing model name, context window, tokens/sec, token counts, GPU memory, API server status
  • Keyboard shortcuts: Cmd+N (new chat), Cmd+O (open chat document), Cmd+S (save chat document), Cmd+Shift+S (save chat document as), Cmd+Shift+E (export), Cmd+Return (send), Escape (stop), Cmd+1/2/3/4/5 (switch models)
  • Scene management — create and edit reusable roleplay/task presets from the New Chat flow or Settings
  • Settings (Cmd+,): default model, per-model generation defaults (temperature, top-p/top-k, min-p, repetition/presence/frequency penalties, max tokens, thinking mode), base system prompt, scene management, API port, API auto-start, idle unload timeout
  • Idle auto-unload — model is unloaded after configurable idle time (resets on both user input and model output), reloaded on next request

API Server

The embedded API server (toggle in toolbar) runs on port 1234 by default. Standard OpenAI-compatible endpoints:

  • GET /v1/models — lists available models with context_window sizes
  • POST /v1/chat/completions — chat completions (streaming and non-streaming)
  • GET /health — health check

Capability checks are enforced server-side. If a request sends images to a text-only model or tools to a model without tool support, the server returns a 400 invalid_request_error.

When a chat-completions request omits generation parameters, the API server falls back to the saved per-model defaults from Settings. Request-supplied values still take precedence on a per-call basis.

Model Swapping

Send any model ID or alias in the model field. If it differs from the currently loaded model, the server swaps automatically:

curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma", "messages": [{"role": "user", "content": "Hello"}]}'

Vision

Pass images as base64 data URIs in the image_url content part:

{
  "model": "gemma",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "What's in this image?"},
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
    ]
  }]
}

Text-only models such as stheno reject image inputs.

Tool Use

Pass tools in the tools field (OpenAI format). The server handles model-specific formatting (Gemma tool_code blocks, Qwen <tool_call> XML tags) and parses tool calls from output automatically. When tools are present during streaming, output is buffered to strip tool-call markup before sending to the client.

stheno is currently documented and configured as a plain text model, so tool requests to it are rejected.

Project Structure

MLXServer/
├── MLXServerApp.swift              — App entry point, GPU cache config
├── ContentView.swift               — Main layout, toolbar, keyboard shortcuts
├── Models/
│   ├── ModelConfig.swift           — Model definitions, alias/repoId resolution
│   └── ChatMessage.swift           — Chat message data model, thinking tag parser
│   └── ChatScene.swift             — Persisted chat scene presets (prompt + model + starter)
├── ViewModels/
│   ├── ModelManager.swift          — Model loading/switching, download tracking, idle unload
│   └── ChatViewModel.swift         — Chat state, ChatSession, API server lifecycle
│   └── SceneStore.swift            — Scene persistence and editing operations
├── Views/
│   ├── SceneSelectionView.swift    — New chat scene picker popover
│   ├── SceneManagementView.swift   — Scene editor and list management
│   ├── ModelPickerView.swift       — Toolbar model selector with re-download
│   ├── ChatMessagesView.swift      — Scrollable message list with markdown + thinking blocks
│   ├── ChatInputView.swift         — Text input + image attach (paste, drag, picker)
│   ├── DownloadModalView.swift     — Model download progress overlay
│   ├── StatusBarView.swift         — Model info, tok/s, GPU memory, API status
│   ├── MonitorView.swift           — Inference statistics monitor
│   └── SettingsView.swift          — System prompt, thinking mode, API, idle settings
├── Commands/
│   └── SaveChatCommands.swift      — File menu new/open/save/revert/export commands
├── Documents/
│   ├── ChatDocumentController.swift — Queues Finder/app open-document requests into SwiftUI
│   ├── ChatDocumentManifest.swift   — Versioned `.mlxchat` manifest schema
│   ├── ChatDocumentMigration.swift  — Manifest schema migration entry point
│   └── ChatDocumentPackage.swift    — Package document read/write for `.mlxchat`
├── Server/
│   ├── APIServer.swift             — NWListener HTTP server, SSE streaming, KV cache reuse
│   ├── APIModels.swift             — OpenAI-compatible Codable structs
│   ├── ToolCallParser.swift        — Parses tool calls from model output
│   └── ToolPromptBuilder.swift     — Model-specific tool prompt formatting
└── Utilities/
    ├── LocalModelResolver.swift    — Offline-first HuggingFace cache resolution
    ├── ChatExporter.swift          — Export conversations to Markdown or RTF
    ├── FocusedValues.swift         — FocusedValue keys for menu bar integration
    └── Preferences.swift           — UserDefaults wrapper, including scene persistence

project.yml     — xcodegen project spec (dependencies, settings, deployment target)
build.sh        — One-command build script (xcodegen + xcodebuild)

Key Design Decisions

  • Uses mlx-swift-lm for inference — VLMModelFactory for vision models and LLMModelFactory for text-only models
  • Offline-first: LocalModelResolver checks ~/.cache/huggingface/hub/ for locally-cached models before downloading
  • No duplicate storage: custom HubApi with blob cache disabled — models are stored once in the snapshot cache
  • KV cache reuse across API requests — reuses ChatSession when conversation history prefix matches
  • Thinking mode: enable_thinking passed via Jinja template context; <think> tags parsed in real-time during streaming
  • HTTP server built on Network.framework (NWListener) — no third-party server dependencies
  • Model-specific prompt formatting: Gemma uses tool_code blocks, Qwen uses <tool_call> XML tags
  • GPU cache limit set to 20 MB; cache cleared on model unload
Description
a simple MLX based server for small models to run locally
Readme MIT 3.3 MiB
Languages
Swift 99.6%
Shell 0.4%