fix: A1-14c run embedding model on Apple GPU via EMLX with EXLA-CPU fallback

2026-05-29 16:26:33 +02:00
parent d03d033548
commit 84b91750fb
7 changed files with 112 additions and 12 deletions
--- a/SPECGAPS.md
+++ b/SPECGAPS.md
@@ -25,7 +25,7 @@ Gap categories: **SC** = spec correct, fix code | **CS** = code correct, update
 | A1-13 | ~~Git sidebar shows only "Working tree" placeholder~~ | sidebar_views.allium:651-770 | `git_view/1` now builds a full `layout: "git"` view from `BDS.Git` (repository/remote_state/status/history); `SidebarComponents` renders active + not_a_repo states | **Resolved:** `git_view/1` in sidebar.ex assembles branch/upstream/ahead/behind, status files, paginated history (20/page); `render_git_sidebar` renders branch header, sync legend, fetch/pull/push/prune-lfs buttons, commit form, clickable status files (open git_diff), history entries; shell_live wires `git_commit` (closes git_diff tabs), `git_fetch`/`git_pull`/`git_push`/`git_prune_lfs`, `git_initialize`; `BDS.Git.history` enriched with author/date, `BDS.Git.set_remote/2` added; i18n for de/fr/it/es; 3 shell tests + git author/date assertions added |
 | A1-14 | ~~Embedding uses TF-IDF hash projection instead of real neural model~~ | embedding.allium:44-53, invariants RealNeuralModel/ModelCaching/VectorCacheInDb | `Backends.Neural` runs `intfloat/multilingual-e5-small` (e5 weights behind the Xenova id) via Bumblebee+EXLA | **Resolved (core):** added bumblebee/nx/exla deps; `Backends.Neural` is a lazily-loaded GenServer that builds the Bumblebee text-embedding serving on first request (`"query: "` prefix + mean pooling + L2 norm), downloads+caches the model under the app data dir (ModelCaching), and is wired into the supervision tree when configured; vectors now persisted as packed little-endian Float32 BLOB (384×4=1536 bytes) instead of JSON text (VectorCacheInDb) with migration recreating `embedding_keys.vector` as BLOB; `InApp` demoted to documented offline/test stub; test config uses the stub so the suite stays offline; spec EmbeddingModel clarified (Xenova id ↔ intfloat weights via Bumblebee); batched inference via optional `embed_many/2` backend callback (configurable `batch_size`/`sequence_length`; rebuild/index/repair embed in chunks instead of one post at a time) + `NativeAcceleratedExecution` invariant added to spec; 4 tests added (BLOB round-trip, batched-rebuild, Neural model_info/behaviour). **Deferred:** A1-14b USearch HNSW index, A1-14c Apple GPU (EMLX). |
 | A1-14b | ~~USearch HNSW ANN index + debounced persistence not implemented~~ | embedding.allium config/FindSimilar/DebouncedPersistence | `Embeddings.Index` is now an HNSW (hnswlib) ANN index with debounced persistence | **Resolved:** rewrote `Embeddings.Index` as a DB-free GenServer wrapping an hnswlib HNSW graph (cosine, M=16, efConstruction=128, efSearch=64) — O(n·log n) build, O(log n) queries, replacing the O(n²) JSON cosine snapshot; per-project in-memory index + `label→post_id` map; 5s debounced `save_index` + `.meta.json` sidecar, force-save on project switch (`set_active_project`) and shutdown (`terminate`), `forget/1` on project delete; lazy reload from disk with rebuild-from-DB self-heal on miss; `find_similar`/`find_duplicates`/`compute_similarities` rewired (no brute-force fallback); USearch has no Elixir binding so hnswlib provides the identical HNSW algorithm/params (spec reconciled); supervision + dialyzer PLT updated; tests updated for debounced/binary persistence + self-heal. Follow-up hardening: explicit rebuild now forces re-embedding regardless of content_hash (ReindexAll), and model-unavailable errors propagate cleanly (post saves degrade to unindexed + log; rebuild/index return `{:error, reason}` surfaced as a failed task with a user-facing message instead of crashing). |
-| A1-14c | Embedding model runs on CPU only; no Apple GPU acceleration | embedding.allium invariant NativeAcceleratedExecution | `Backends.Neural` uses Bumblebee+EXLA; on Apple Silicon XLA has no Metal backend so inference is native CPU (batched). Apple GPU/Neural Engine unused | Fix code: spike an EMLX (Apple MLX) Nx backend so the model executes on the Apple Silicon GPU; gate by platform/availability with EXLA-CPU fallback; verify Bumblebee serving + defn compiler compatibility and benchmark vs CPU batching |
+| A1-14c | ~~Embedding model runs on CPU only; no Apple GPU acceleration~~ | embedding.allium invariant NativeAcceleratedExecution | `Backends.Neural` now selects the defn compiler at serving-build time: Apple GPU via EMLX (MLX/Metal) on arm64 macOS, EXLA-CPU elsewhere | **Resolved:** added `{:emlx, "~> 0.2.0"}` dep (ships precompiled MLX binaries; EMLX 0.2.0 implements both `EMLX.Backend` and the `Nx.Defn.Compiler` behaviour, GPU-default); `Backends.Neural` gained a pure `select_accelerator/3` policy (`:auto` prefers EMLX only when available **and** on Apple Silicon; explicit `:emlx`/`:exla` honoured; forced `:emlx` degrades to EXLA when unavailable so misconfigured hosts still run), `current_accelerator/0`, and `defn_options/1`; `build_serving` places params on `{EMLX.Backend, device: :gpu}` and compiles with `EMLX` for the EMLX path, keeps `EXLA` otherwise; new `accelerator: :auto` config key; spec `NativeAcceleratedExecution` + `EmbeddingModel` updated; PLT app added; 7 tests added (offline — test config still uses the InApp stub). |
 | A1-15 | ~~Preview vs generation content source strategy undocumented~~ | preview.allium (no invariant), generation.allium (no invariant) | Generation uses only published .md file content (`Generation.Data` snapshots set `content: nil`); preview includes published+draft posts and prefers DB content over file (`Preview.Router` queries `:published`/`:draft`, uses `editor_body`) | **Resolved:** added `PreviewDraftOverlay` invariant to preview.allium and `GenerationPublishedOnly` invariant to generation.allium; both cross-reference each other; code already correct, 3 tests added for draft-in-preview behavior |

 ### A2. Spec Should Update (code is normative)
@@ -186,7 +186,7 @@ All reconciled to follow code. Specs must be self-consistent and match code.

 ## Priority Order for Resolution

-1. **A1-1 through A1-14c** — code must follow spec (includes auto-save, on-demand preview, template lookup, validation gates, real Pagefind, graceful shutdown, real embedding model, HNSW ANN index; only A1-14c = Apple GPU/EMLX acceleration still open)
+1. ~~**A1-1 through A1-14c**~~ — all resolved: auto-save, on-demand preview, template lookup, validation gates, real Pagefind, graceful shutdown, real embedding model, HNSW ANN index, and Apple GPU/EMLX acceleration (A1-14c)
 2. **D1-1 through D1-18** — untested invariants/guarantees
 3. **C-1 through C-3** — internal spec inconsistencies (reconcile to code)
 4. **B1-1 through B1-6** — major code behaviors missing from spec