chore: added README

2026-03-17 12:07:45 +01:00
parent ef83c24b0b
commit 540b187593
1 changed files with 97 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,97 @@
 # MLX Server
 OpenAI-compatible API server for running local LLMs on Apple Silicon via [MLX](https://github.com/ml-explore/mlx). It supports vision and tool use, and swaps models automatically: only one model is held in memory at a time, and the server switches on demand based on the request's `model` field.
 ## Supported Models
 | Alias | Model | Capabilities |
 |-------|-------|-------------|
 | `gemma` | `mlx-community/gemma-3-4b-it-4bit` | Vision, tool use (`tool_code` blocks) |
 | `qwen` | `mlx-community/Qwen3-VL-4B-Instruct-4bit` | Vision, tool use (`<tool_call>` tags) |
 ## Quick Start
 ```bash
 source .venv/bin/activate
 # Start with Gemma 3 (default)
 ./run.sh
 # Start with Qwen3
 ./run.sh qwen
 # Or directly
 python -m mlx_server.main --model mlx-community/gemma-3-4b-it-4bit --port 1234
 ```
 The server starts at `http://127.0.0.1:1234`.
 ## API
 Standard OpenAI-compatible endpoints:
 - `GET /v1/models` — lists all available models
 - `POST /v1/chat/completions` — chat completions (streaming and non-streaming)
 - `GET /health` — health check
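As a rough illustration of calling these endpoints from Python, the sketch below uses only the standard library. It assumes the default address from Quick Start and the usual OpenAI response shape (`choices[0].message.content`); the helper names `build_chat_request`, `send`, and `main` are made up for this example and are not part of the server.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:1234"  # default address from Quick Start


def build_chat_request(
    model: str, prompt: str, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,  # full model ID or alias
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def main() -> None:
    """Needs a running server; not called automatically."""
    print(send(urllib.request.Request(f"{BASE_URL}/health")))
    print(send(urllib.request.Request(f"{BASE_URL}/v1/models")))
    reply = send(build_chat_request("gemma", "Hello"))
    # Assumes the standard OpenAI non-streaming response layout.
    print(reply["choices"][0]["message"]["content"])
```

Call `main()` once the server is up. Any OpenAI-compatible client library pointed at the same base URL should work in a similar way.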
 ### Model Swapping
 Send any available model ID (or alias) in the `model` field. If it differs from the currently loaded model, the server unloads the old one and loads the new one automatically:
 ```bash
 # Uses Gemma
 curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/gemma-3-4b-it-4bit", "messages": [{"role": "user", "content": "Hello"}]}'
 # Swaps to Qwen
 curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/Qwen3-VL-4B-Instruct-4bit", "messages": [{"role": "user", "content": "Hello"}]}'
 ```
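The swap-on-demand behaviour described above can be sketched roughly as follows. This is not the server's actual code: `ModelSwapper` and `load_fn` are hypothetical names, and the real loading logic lives in `engine.py`. The alias table is copied from Supported Models.

```python
import threading
from typing import Any, Callable, Optional

# Aliases from the Supported Models table.
ALIASES = {
    "gemma": "mlx-community/gemma-3-4b-it-4bit",
    "qwen": "mlx-community/Qwen3-VL-4B-Instruct-4bit",
}


class ModelSwapper:
    """Keeps at most one model loaded and swaps on demand (hypothetical)."""

    def __init__(self, load_fn: Callable[[str], Any]) -> None:
        self._load_fn = load_fn        # stand-in for the real model loader
        self._lock = threading.Lock()  # only one swap at a time
        self.current_id: Optional[str] = None
        self.current_model: Any = None

    def get(self, requested: str) -> Any:
        """Return the model for `requested`, loading it if it is not current."""
        model_id = ALIASES.get(requested, requested)
        with self._lock:
            if model_id != self.current_id:
                # Drop the old reference before loading the new model,
                # so two models are never held at once.
                self.current_model = None
                self.current_id = None
                self.current_model = self._load_fn(model_id)
                self.current_id = model_id
            return self.current_model
```

Repeated requests for the same model (by ID or alias) reuse the loaded instance; only a different `model` value triggers a reload.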
 ### Vision
 Pass images as base64 data URIs or URLs in the `image_url` content part:
 ```json
 {
  "model": "mlx-community/gemma-3-4b-it-4bit",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "What's in this image?"},
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
    ]
  }]
 }
 ```
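To produce a request body like the one above from a local file, a small helper along these lines may be enough. The function names are illustrative only, and the MIME type defaults to PNG as an assumption about the input.

```python
import base64


def image_to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_vision_payload(model: str, question: str, image_bytes: bytes) -> dict:
    """Build a chat completion body with one text part and one image part."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {
                    "type": "image_url",
                    "image_url": {"url": image_to_data_uri(image_bytes)},
                },
            ],
        }],
    }
```

For example, `build_vision_payload("gemma", "What's in this image?", open("photo.png", "rb").read())` yields a dictionary that can be serialized with `json.dumps` and posted to `/v1/chat/completions` (`photo.png` is a placeholder path).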
 ### Tool Use
 Pass tools in the `tools` field (OpenAI format). The server converts them to each model's own format (`tool_code` blocks for Gemma, `<tool_call>` tags for Qwen) and parses tool calls from the output automatically.
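As a sketch of what this involves, the snippet below shows an OpenAI-format tool definition and a simplified parser for Qwen-style `<tool_call>` tags. It is not the server's parser: `get_weather` is a made-up tool, `parse_tool_calls` is a hypothetical helper, and the assumption is that each tag wraps a JSON object with `name` and `arguments` keys.

```python
import json
import re

# OpenAI-format tool definition; `get_weather` is a made-up example tool.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Assumed output format: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> list:
    """Extract tool calls from model output as OpenAI-style entries."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            data = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the request
        if not isinstance(data, dict):
            continue
        calls.append({
            "type": "function",
            "function": {
                "name": data.get("name"),
                # OpenAI clients expect arguments as a JSON string.
                "arguments": json.dumps(data.get("arguments", {})),
            },
        })
    return calls
```

A request would carry `TOOLS` in its `tools` field next to `model` and `messages`. A complete response entry would also need an `id` per call, which this sketch leaves out.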
 ## Installation
 Requires Python 3.11+ and Apple Silicon.
 ```bash
 uv pip install -e "."
 ```
 ## Project Structure
 ```
 mlx_server/
  main.py    — FastAPI server, endpoints, CLI entrypoint
  engine.py  — Model loading, prompt building, generation (mlx_vlm)
  models.py  — Pydantic models for OpenAI API types
 ```
 ## Design Notes
 - Uses `mlx_vlm` (not `mlx_lm`) as the backend, so a single model load serves both text and vision requests
 - Offline-first: if the model is cached locally (`~/.cache/huggingface/hub/`), no network requests are made
 - Generation is guarded by a thread lock, because MLX models aren't safe for concurrent generation
 - KV prefix caching for multi-turn conversations
 - 128k-token context window, relying on the models' native context length
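Two of the notes above, the generation lock and KV prefix caching, can be illustrated with a minimal sketch. The names below are hypothetical and the code does not touch MLX. It only shows the idea: serialize generation behind one lock, and work out how many leading tokens a new prompt shares with the cached one so that only the remainder needs processing.

```python
import threading
from typing import Callable, Sequence

_generation_lock = threading.Lock()


def generate_serialized(generate_fn: Callable[[str], str], prompt: str) -> str:
    """Run generate_fn while holding the lock so generations never overlap."""
    with _generation_lock:
        return generate_fn(prompt)


def common_prefix_len(cached: Sequence[int], new: Sequence[int]) -> int:
    """Count the leading token IDs shared by the cached and the new prompt."""
    shared = 0
    for cached_token, new_token in zip(cached, new):
        if cached_token != new_token:
            break
        shared += 1
    return shared


def tokens_to_process(cached: Sequence[int], new: Sequence[int]) -> Sequence[int]:
    """Return the part of the new prompt not covered by the cached prefix."""
    return new[common_prefix_len(cached, new):]
```

In a multi-turn conversation each new prompt usually starts with the previous turns, so the shared prefix tends to be long and only the latest messages need a fresh forward pass.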