chore: added README
README.md (new file, 97 lines)

# MLX Server

OpenAI-compatible API server for running local LLMs on Apple Silicon via [MLX](https://github.com/ml-explore/mlx). It supports vision and tool use with automatic model swapping: only one model is loaded in memory at a time, and the server switches models on demand based on the request's `model` field.

## Supported Models

| Alias | Model | Capabilities |
|-------|-------|-------------|
| `gemma` | `mlx-community/gemma-3-4b-it-4bit` | Vision, tool use (`tool_code` blocks) |
| `qwen` | `mlx-community/Qwen3-VL-4B-Instruct-4bit` | Vision, tool use (`<tool_call>` tags) |

## Quick Start

```bash
source .venv/bin/activate

# Start with Gemma 3 (default)
./run.sh

# Start with Qwen3
./run.sh qwen

# Or directly
python -m mlx_server.main --model mlx-community/gemma-3-4b-it-4bit --port 1234
```

The server starts at `http://127.0.0.1:1234`.

## API

Standard OpenAI-compatible endpoints:

- `GET /v1/models`: lists all available models
- `POST /v1/chat/completions`: chat completions (streaming and non-streaming)
- `GET /health`: health check

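As a rough illustration of consuming the streaming variant of `/v1/chat/completions`, the sketch below builds a request with the standard library and extracts text fragments from server-sent-event lines. It assumes the server follows the usual OpenAI streaming conventions (`data: {...}` chunks ending with `data: [DONE]`), which this README does not spell out. `build_chat_request` and `iter_text_deltas` are hypothetical helper names, not part of the project.

```python
import json
import urllib.request
from typing import Iterable, Iterator

BASE_URL = "http://127.0.0.1:1234"  # default address from Quick Start


def build_chat_request(model: str, prompt: str, stream: bool = False) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def iter_text_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style SSE lines (assumed format)."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data event
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # assumed end-of-stream marker
        chunk = json.loads(data)
        text = chunk["choices"][0].get("delta", {}).get("content")
        if text:
            yield text
```

With a server running, `urllib.request.urlopen(build_chat_request("gemma", "Hello", stream=True))` should return a response whose decoded lines can be fed to `iter_text_deltas`.
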
### Model Swapping

Send any available model ID (or alias) in the `model` field. If it differs from the currently loaded model, the server unloads the old one and loads the new one automatically:

```bash
# Uses Gemma
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/gemma-3-4b-it-4bit", "messages": [{"role": "user", "content": "Hello"}]}'

# Swaps to Qwen
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/Qwen3-VL-4B-Instruct-4bit", "messages": [{"role": "user", "content": "Hello"}]}'
```

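The swap-on-demand behaviour can be pictured as a single slot that holds at most one model. The following is a minimal sketch under that reading, not the server's actual code: `ModelSlot` and its injected `loader` callable are hypothetical, and only the alias table is taken from the Supported Models section.

```python
from typing import Any, Callable, Optional

# Alias table taken from "Supported Models"
ALIASES = {
    "gemma": "mlx-community/gemma-3-4b-it-4bit",
    "qwen": "mlx-community/Qwen3-VL-4B-Instruct-4bit",
}


class ModelSlot:
    """Holds at most one loaded model and swaps it when a different one is requested."""

    def __init__(self, loader: Callable[[str], Any]) -> None:
        self._loader = loader  # hypothetical: returns a loaded model for a model ID
        self.current_id: Optional[str] = None
        self._model: Any = None

    def get(self, requested: str) -> Any:
        model_id = ALIASES.get(requested, requested)  # resolve alias to full ID
        if model_id != self.current_id:
            self._model = None  # drop the old reference before loading the new model
            self._model = self._loader(model_id)
            self.current_id = model_id
        return self._model
```

Requesting the same model twice, whether by alias or by full ID, reuses the loaded instance. Only a change of model triggers the loader.
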
### Vision

Pass images as base64 data URIs or URLs in the `image_url` content part:

```json
{
  "model": "mlx-community/gemma-3-4b-it-4bit",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "What's in this image?"},
      {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
    ]
  }]
}
```

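Building the data URI by hand is error-prone, so a small helper may be useful on the client side. The sketch below uses only the standard library. The function names are hypothetical, and the message shape simply mirrors the JSON example above.

```python
import base64
import mimetypes


def image_part_from_bytes(data: bytes, mime_type: str = "image/png") -> dict:
    """Wrap raw image bytes as an OpenAI-style `image_url` content part."""
    encoded = base64.b64encode(data).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime_type};base64,{encoded}"}}


def image_part_from_file(path: str) -> dict:
    """Read an image file and guess its MIME type from the file extension."""
    mime_type, _ = mimetypes.guess_type(path)
    with open(path, "rb") as f:
        return image_part_from_bytes(f.read(), mime_type or "application/octet-stream")


def vision_message(question: str, image_part: dict) -> dict:
    """Build a user message that combines a text part and an image part."""
    return {"role": "user", "content": [{"type": "text", "text": question}, image_part]}
```

For example, `vision_message("What's in this image?", image_part_from_file("photo.png"))` yields a message that can be placed in the `messages` array of a request.
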
### Tool Use

Pass tools in the `tools` field (OpenAI format). The server handles model-specific formatting and parses tool calls from the output automatically.

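To make this concrete, here is a hedged sketch of a request body with one tool, plus one way `<tool_call>` tags could be turned back into OpenAI-style tool calls. The `get_weather` tool is made up for illustration. The parser assumes each tag wraps a JSON object with `name` and `arguments` keys, which is an assumption about the Qwen output format and not something this README confirms, and it is not the server's own parser.

```python
import json
import re

# Request body with one tool in OpenAI function-calling format.
# `get_weather` is a hypothetical tool used only for illustration.
request_body = {
    "model": "qwen",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

# Assumed output format: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Extract tool calls from raw model output (assumed JSON-in-tags format)."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            obj = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # ignore malformed blocks rather than fail the whole response
        if "name" not in obj:
            continue
        calls.append({
            "type": "function",
            "function": {
                "name": obj["name"],
                # OpenAI responses carry arguments as a JSON-encoded string
                "arguments": json.dumps(obj.get("arguments", {})),
            },
        })
    return calls
```
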
## Installation

Requires Python 3.11+ and Apple Silicon.

```bash
uv pip install -e "."
```

## Project Structure

```
mlx_server/
  main.py    # FastAPI server, endpoints, CLI entrypoint
  engine.py  # Model loading, prompt building, generation (mlx_vlm)
  models.py  # Pydantic models for OpenAI API types
```

## Design Notes

- Uses `mlx_vlm` (not `mlx_lm`) as the backend, so a single model load supports both text and vision.
- Offline-first: if the model is already cached locally (`~/.cache/huggingface/hub/`), no network requests are made.
- A thread lock serializes generation, because MLX models aren't safe for concurrent generation.
- KV prefix caching for multi-turn conversations.
- 128k-token context window, as supported natively by the models.

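Two of these notes lend themselves to a short sketch: serializing generation behind a lock, and finding how much of a cached prompt can be reused. The code below is an illustration under assumptions, not the project's implementation. `generate_serialized` and `common_prefix_len` are hypothetical names, and treating prefix reuse as a comparison of token IDs is one plausible reading of "KV prefix caching".

```python
import threading
from typing import Callable, Sequence

_generation_lock = threading.Lock()


def generate_serialized(generate: Callable[[str], str], prompt: str) -> str:
    """Run `generate` while holding a global lock so only one call runs at a time."""
    with _generation_lock:
        return generate(prompt)


def common_prefix_len(cached: Sequence[int], new: Sequence[int]) -> int:
    """Count the leading token IDs shared by the cached prompt and the new prompt."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n
```

In a multi-turn conversation the new prompt usually extends the previous one, so if `n = common_prefix_len(cached, new)`, the KV entries for the first `n` tokens could be kept and only `new[n:]` would need to be processed.
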