
MLX Server

OpenAI-compatible API server for running local LLMs on Apple Silicon via MLX. Supports vision and tool use with automatic model swapping: only one model is loaded in memory at a time, and the server switches models on demand based on the request's model field.

Supported Models

Alias   Model                                      Capabilities
gemma   mlx-community/gemma-3-4b-it-4bit           Vision, tool use (tool_code blocks)
qwen    mlx-community/Qwen3-VL-4B-Instruct-4bit    Vision, tool use (<tool_call> tags)

Quick Start

source .venv/bin/activate

# Start with Gemma 3 (default)
./run.sh

# Start with Qwen3
./run.sh qwen

# Or directly
python -m mlx_server.main --model mlx-community/gemma-3-4b-it-4bit --port 1234

The server starts at http://127.0.0.1:1234.

API

Standard OpenAI-compatible endpoints:

  • GET /v1/models — lists all available models
  • POST /v1/chat/completions — chat completions (streaming and non-streaming)
  • GET /health — health check
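
As a rough illustration, the endpoints above can be reached from Python with only the standard library. This is a sketch, not part of the project: the helper names (build_chat_request, send_json) are made up here, the base URL is the default from Quick Start, and the response is assumed to follow the usual OpenAI chat completions shape.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:1234"  # default address from Quick Start


def build_chat_request(model: str, prompt: str, stream: bool = False) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,  # full model ID or alias
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_json(request: urllib.request.Request) -> dict:
    """Send a request and decode a JSON response body (non-streaming only)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, send_json(build_chat_request("gemma", "Hello")) should return a dict whose reply text sits under choices[0]["message"]["content"], assuming the OpenAI response format. Streaming responses need a different reader and are not covered by this sketch.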

Model Swapping

Send any available model ID (or alias) in the model field. If it differs from the currently loaded model, the server unloads the old one and loads the new one automatically:

# Uses Gemma
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/gemma-3-4b-it-4bit", "messages": [{"role": "user", "content": "Hello"}]}'

# Swaps to Qwen
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/Qwen3-VL-4B-Instruct-4bit", "messages": [{"role": "user", "content": "Hello"}]}'
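
The swap behaviour described above can be pictured with a small sketch. This is not the code in engine.py: ModelSlot and its loader argument are hypothetical names, and the loader stands in for whatever call actually loads a model.

```python
from typing import Callable, Dict, Optional

# Aliases taken from the Supported Models table.
ALIASES: Dict[str, str] = {
    "gemma": "mlx-community/gemma-3-4b-it-4bit",
    "qwen": "mlx-community/Qwen3-VL-4B-Instruct-4bit",
}


class ModelSlot:
    """Holds at most one loaded model and swaps it on demand (hypothetical)."""

    def __init__(self, loader: Callable[[str], object]) -> None:
        self._loader = loader  # stand-in for the real model-loading call
        self.model_id: Optional[str] = None
        self.model: Optional[object] = None

    def get(self, requested: str) -> object:
        """Return the model for this request, loading it only if it changed."""
        model_id = ALIASES.get(requested, requested)  # resolve alias to full ID
        if model_id != self.model_id:
            self.model = None  # drop the old model before loading the new one
            self.model = self._loader(model_id)
            self.model_id = model_id
        return self.model
```

Repeated requests for the same model reuse the loaded instance, so only a change in the model field pays the load cost.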

Vision

Pass images as base64 data URIs or URLs in the image_url content part:

{
  "model": "mlx-community/gemma-3-4b-it-4bit",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "What's in this image?"},
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
    ]
  }]
}
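
A request body like the one above can be assembled in Python as follows. The helper names (image_part, vision_message) are illustrative only, and the default MIME type is an assumption; pass the type that matches your image.

```python
import base64


def image_part(data: bytes, mime: str = "image/png") -> dict:
    """Wrap raw image bytes as an image_url content part with a base64 data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}}


def vision_message(prompt: str, image_bytes: bytes, mime: str = "image/png") -> dict:
    """Build a user message holding a text part followed by one image part."""
    return {
        "role": "user",
        "content": [{"type": "text", "text": prompt}, image_part(image_bytes, mime)],
    }
```

To send a file from disk, read it with pathlib.Path("photo.png").read_bytes() and place the resulting message in the messages list of a chat completions request.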

Tool Use

Pass tools in the tools field (OpenAI format). The server handles model-specific formatting and parses tool calls from the output automatically.
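
To make the parsing step more concrete, here is a minimal sketch of how <tool_call> tags in model output might be turned into OpenAI-style tool calls. It is an assumption about the approach, not the server's actual parser: it presumes each tag wraps a JSON object with name and arguments keys, and both get_weather and the call_N id scheme are invented for the example.

```python
import json
import re
from typing import List

# Example tool definition in OpenAI format (get_weather is hypothetical).
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Assumes each call is a JSON object wrapped in <tool_call>...</tool_call>.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> List[dict]:
    """Extract tool calls from generated text; malformed JSON is skipped."""
    calls = []
    for index, match in enumerate(TOOL_CALL_RE.finditer(text)):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue
        calls.append({
            "id": f"call_{index}",  # made-up id scheme for this sketch
            "type": "function",
            "function": {
                "name": payload.get("name", ""),
                # The OpenAI format carries arguments as a JSON string.
                "arguments": json.dumps(payload.get("arguments", {})),
            },
        })
    return calls
```

On the request side, WEATHER_TOOL would go in the tools list. Gemma's tool_code blocks use a different output format and would need their own parser.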

Installation

Requires Python 3.11+ and Apple Silicon.

uv pip install -e "."

Project Structure

mlx_server/
  main.py    — FastAPI server, endpoints, CLI entrypoint
  engine.py  — Model loading, prompt building, generation (mlx_vlm)
  models.py  — Pydantic models for OpenAI API types

Design Notes

  • Uses mlx_vlm (not mlx_lm) as the backend — supports both text and vision in a single model load
  • Offline-first: if the model is cached locally (~/.cache/huggingface/hub/), no network requests are made
  • Thread lock on generation — MLX models aren't safe for concurrent generation
  • KV prefix caching for multi-turn conversations
  • 128k context window via native model capabilities
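
The generation lock from the notes above could look roughly like this. It is a sketch under assumptions: generate_serialized is a hypothetical wrapper, and the generate callable stands in for the real generation call in engine.py.

```python
import threading
from typing import Callable

# One process-wide lock: only one generation may run at a time.
_generation_lock = threading.Lock()


def generate_serialized(generate: Callable[[str], str], prompt: str) -> str:
    """Run generate(prompt) while holding the lock; other callers wait their turn."""
    with _generation_lock:
        return generate(prompt)
```

Concurrent requests then queue behind the lock instead of running generation in parallel, which trades throughput for safety on a single loaded model.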