chore: added README

2026-03-17 12:07:45 +01:00
parent ef83c24b0b
commit 540b187593
1 changed files with 97 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,97 @@
+# MLX Server
+
+OpenAI-compatible API server for running local LLMs on Apple Silicon via [MLX](https://github.com/ml-explore/mlx). Supports vision and tool use, with automatic model swapping: only one model is held in memory at a time, and the server switches models on demand based on the request's `model` field.
+
+## Supported Models
+
+| Alias | Model | Capabilities |
+|-------|-------|-------------|
+| `gemma` | `mlx-community/gemma-3-4b-it-4bit` | Vision, tool use (`tool_code` blocks) |
+| `qwen` | `mlx-community/Qwen3-VL-4B-Instruct-4bit` | Vision, tool use (`<tool_call>` tags) |
+
+## Quick Start
+
+```bash
+# Activate the virtual environment (set it up first, see Installation)
+source .venv/bin/activate
+
+# Start with Gemma 3 (default)
+./run.sh
+
+# Start with Qwen3-VL
+./run.sh qwen
+
+# Or directly
+python -m mlx_server.main --model mlx-community/gemma-3-4b-it-4bit --port 1234
+```
+
+The server starts at `http://127.0.0.1:1234`.
+
+## API
+
+Standard OpenAI-compatible endpoints:
+
+- `GET /v1/models` — lists all available models
+- `POST /v1/chat/completions` — chat completions (streaming and non-streaming)
+- `GET /health` — health check
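+
+A quick way to exercise these endpoints from the shell is sketched below. It assumes the server is running on the default port, and that streaming is requested with the OpenAI-style `"stream": true` field, which this README does not spell out.
+
+```bash
+# List the models the server can load
+curl http://localhost:1234/v1/models
+
+# Check that the server is up
+curl http://localhost:1234/health
+
+# Streaming chat completion; -N turns off curl's output buffering
+# so chunks are printed as they arrive
+curl -N http://localhost:1234/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model": "mlx-community/gemma-3-4b-it-4bit", "stream": true, "messages": [{"role": "user", "content": "Hello"}]}'
+```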
+
+### Model Swapping
+
+Send any available model ID (or alias) in the `model` field. If it differs from the currently loaded model, the server unloads the old one and loads the new one automatically:
+
+```bash
+# Uses Gemma
+curl http://localhost:1234/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model": "mlx-community/gemma-3-4b-it-4bit", "messages": [{"role": "user", "content": "Hello"}]}'
+
+# Swaps to Qwen
+curl http://localhost:1234/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model": "mlx-community/Qwen3-VL-4B-Instruct-4bit", "messages": [{"role": "user", "content": "Hello"}]}'
+```
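+
+Since aliases are accepted as well, the same swap can presumably be written with the short names from the Supported Models table. A minimal sketch, assuming the alias is passed verbatim in the `model` field:
+
+```bash
+# Swaps to Qwen using its alias instead of the full model ID
+curl http://localhost:1234/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model": "qwen", "messages": [{"role": "user", "content": "Hello"}]}'
+```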
+
+### Vision
+
+Pass images as base64 data URIs or regular URLs in the `url` field of an `image_url` content part:
+
+```json
+{
+  "model": "mlx-community/gemma-3-4b-it-4bit",
+  "messages": [{
+    "role": "user",
+    "content": [
+      {"type": "text", "text": "What's in this image?"},
+      {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
+    ]
+  }]
+}
+```
+
+### Tool Use
+
+Pass tools in the `tools` field (OpenAI format). The server handles model-specific formatting and parses tool calls from the output automatically.
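+
+A request with a tool definition might look like the sketch below. The `get_weather` function and its parameters are hypothetical and only illustrate the OpenAI `tools` schema; they are not part of this project.
+
+```bash
+# Chat completion with one tool declared in OpenAI format.
+# get_weather is a made-up example function.
+curl http://localhost:1234/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "mlx-community/Qwen3-VL-4B-Instruct-4bit",
+    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
+    "tools": [{
+      "type": "function",
+      "function": {
+        "name": "get_weather",
+        "description": "Get the current weather for a city",
+        "parameters": {
+          "type": "object",
+          "properties": {"city": {"type": "string"}},
+          "required": ["city"]
+        }
+      }
+    }]
+  }'
+```
+
+If the model decides to call the tool, an OpenAI-compatible response would be expected to carry the call in the assistant message's `tool_calls` field rather than in `content`.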
+
+## Installation
+
+Requires Python 3.11+ and Apple Silicon.
+
+```bash
+uv pip install -e "."
+```
+
+## Project Structure
+
+```
+mlx_server/
+  main.py    — FastAPI server, endpoints, CLI entrypoint
+  engine.py  — Model loading, prompt building, generation (mlx_vlm)
+  models.py  — Pydantic models for OpenAI API types
+```
+
+## Design Notes
+
+- Uses `mlx_vlm` (not `mlx_lm`) as the backend — supports both text and vision in a single model load
+- Offline-first: if the model is cached locally (`~/.cache/huggingface/hub/`), no network requests are made
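+- Generation is serialized with a thread lock, because MLX models aren't safe for concurrent generation
+- KV prefix caching: the cached prefix of a multi-turn conversation is reused instead of being recomputed on every turn
+- Context window of up to 128k tokens, as supported natively by the models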
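+
+To rely on the offline-first behaviour, a model can be fetched into the Hugging Face cache ahead of time. The commands below are a sketch and not part of this project: `huggingface-cli` ships with the `huggingface_hub` package, and `HF_HUB_OFFLINE=1` is that library's switch for blocking network access. Whether the server needs the variable at all is an assumption, since it is described as avoiding the network whenever the model is already cached.
+
+```bash
+# Download the default model into ~/.cache/huggingface/hub/ once, while online
+huggingface-cli download mlx-community/gemma-3-4b-it-4bit
+
+# Start the server with Hugging Face Hub network access disabled
+HF_HUB_OFFLINE=1 ./run.sh
+```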