From 84b91750fbf7985ed6d693866c43d07ba9c472d5 Mon Sep 17 00:00:00 2001 From: Chili Palmer Date: Fri, 29 May 2026 16:26:33 +0200 Subject: [PATCH] fix: A1-14c run embedding model on Apple GPU via EMLX with EXLA-CPU fallback --- SPECGAPS.md | 4 +- config/config.exs | 5 +- lib/bds/embeddings/backends/neural.ex | 63 +++++++++++++++++++- mix.exs | 6 +- mix.lock | 1 + specs/embedding.allium | 13 ++-- test/bds/embeddings/backends/neural_test.exs | 32 ++++++++++ 7 files changed, 112 insertions(+), 12 deletions(-) diff --git a/SPECGAPS.md b/SPECGAPS.md index 1a43195..6202fe7 100644 --- a/SPECGAPS.md +++ b/SPECGAPS.md @@ -25,7 +25,7 @@ Gap categories: **SC** = spec correct, fix code | **CS** = code correct, update | A1-13 | ~~Git sidebar shows only "Working tree" placeholder~~ | sidebar_views.allium:651-770 | `git_view/1` now builds a full `layout: "git"` view from `BDS.Git` (repository/remote_state/status/history); `SidebarComponents` renders active + not_a_repo states | **Resolved:** `git_view/1` in sidebar.ex assembles branch/upstream/ahead/behind, status files, paginated history (20/page); `render_git_sidebar` renders branch header, sync legend, fetch/pull/push/prune-lfs buttons, commit form, clickable status files (open git_diff), history entries; shell_live wires `git_commit` (closes git_diff tabs), `git_fetch`/`git_pull`/`git_push`/`git_prune_lfs`, `git_initialize`; `BDS.Git.history` enriched with author/date, `BDS.Git.set_remote/2` added; i18n for de/fr/it/es; 3 shell tests + git author/date assertions added | | A1-14 | ~~Embedding uses TF-IDF hash projection instead of real neural model~~ | embedding.allium:44-53, invariants RealNeuralModel/ModelCaching/VectorCacheInDb | `Backends.Neural` runs `intfloat/multilingual-e5-small` (e5 weights behind the Xenova id) via Bumblebee+EXLA | **Resolved (core):** added bumblebee/nx/exla deps; `Backends.Neural` is a lazily-loaded GenServer that builds the Bumblebee text-embedding serving on first request (`"query: "` prefix + mean pooling + L2 norm), downloads+caches the model under the app data dir (ModelCaching), and is wired into the supervision tree when configured; vectors now persisted as packed little-endian Float32 BLOB (384×4=1536 bytes) instead of JSON text (VectorCacheInDb) with migration recreating `embedding_keys.vector` as BLOB; `InApp` demoted to documented offline/test stub; test config uses the stub so the suite stays offline; spec EmbeddingModel clarified (Xenova id ↔ intfloat weights via Bumblebee); batched inference via optional `embed_many/2` backend callback (configurable `batch_size`/`sequence_length`; rebuild/index/repair embed in chunks instead of one post at a time) + `NativeAcceleratedExecution` invariant added to spec; 4 tests added (BLOB round-trip, batched-rebuild, Neural model_info/behaviour). **Deferred:** A1-14b USearch HNSW index, A1-14c Apple GPU (EMLX). | | A1-14b | ~~USearch HNSW ANN index + debounced persistence not implemented~~ | embedding.allium config/FindSimilar/DebouncedPersistence | `Embeddings.Index` is now an HNSW (hnswlib) ANN index with debounced persistence | **Resolved:** rewrote `Embeddings.Index` as a DB-free GenServer wrapping an hnswlib HNSW graph (cosine, M=16, efConstruction=128, efSearch=64) — O(n·log n) build, O(log n) queries, replacing the O(n²) JSON cosine snapshot; per-project in-memory index + `label→post_id` map; 5s debounced `save_index` + `.meta.json` sidecar, force-save on project switch (`set_active_project`) and shutdown (`terminate`), `forget/1` on project delete; lazy reload from disk with rebuild-from-DB self-heal on miss; `find_similar`/`find_duplicates`/`compute_similarities` rewired (no brute-force fallback); USearch has no Elixir binding so hnswlib provides the identical HNSW algorithm/params (spec reconciled); supervision + dialyzer PLT updated; tests updated for debounced/binary persistence + self-heal. Follow-up hardening: explicit rebuild now forces re-embedding regardless of content_hash (ReindexAll), and model-unavailable errors propagate cleanly (post saves degrade to unindexed + log; rebuild/index return `{:error, reason}` surfaced as a failed task with a user-facing message instead of crashing). | -| A1-14c | Embedding model runs on CPU only; no Apple GPU acceleration | embedding.allium invariant NativeAcceleratedExecution | `Backends.Neural` uses Bumblebee+EXLA; on Apple Silicon XLA has no Metal backend so inference is native CPU (batched). Apple GPU/Neural Engine unused | Fix code: spike an EMLX (Apple MLX) Nx backend so the model executes on the Apple Silicon GPU; gate by platform/availability with EXLA-CPU fallback; verify Bumblebee serving + defn compiler compatibility and benchmark vs CPU batching | +| A1-14c | ~~Embedding model runs on CPU only; no Apple GPU acceleration~~ | embedding.allium invariant NativeAcceleratedExecution | `Backends.Neural` now selects the defn compiler at serving-build time: Apple GPU via EMLX (MLX/Metal) on arm64 macOS, EXLA-CPU elsewhere | **Resolved:** added `{:emlx, "~> 0.2.0"}` dep (ships precompiled MLX binaries; EMLX 0.2.0 implements both `EMLX.Backend` and the `Nx.Defn.Compiler` behaviour, GPU-default); `Backends.Neural` gained a pure `select_accelerator/3` policy (`:auto` prefers EMLX only when available **and** on Apple Silicon; explicit `:emlx`/`:exla` honoured; forced `:emlx` degrades to EXLA when unavailable so misconfigured hosts still run), `current_accelerator/0`, and `defn_options/1`; `build_serving` places params on `{EMLX.Backend, device: :gpu}` and compiles with `EMLX` for the EMLX path, keeps `EXLA` otherwise; new `accelerator: :auto` config key; spec `NativeAcceleratedExecution` + `EmbeddingModel` updated; PLT app added; 7 tests added (offline — test config still uses the InApp stub). | | A1-15 | ~~Preview vs generation content source strategy undocumented~~ | preview.allium (no invariant), generation.allium (no invariant) | Generation uses only published .md file content (`Generation.Data` snapshots set `content: nil`); preview includes published+draft posts and prefers DB content over file (`Preview.Router` queries `:published`/`:draft`, uses `editor_body`) | **Resolved:** added `PreviewDraftOverlay` invariant to preview.allium and `GenerationPublishedOnly` invariant to generation.allium; both cross-reference each other; code already correct, 3 tests added for draft-in-preview behavior | ### A2. Spec Should Update (code is normative) @@ -186,7 +186,7 @@ All reconciled to follow code. Specs must be self-consistent and match code. ## Priority Order for Resolution -1. **A1-1 through A1-14c** — code must follow spec (includes auto-save, on-demand preview, template lookup, validation gates, real Pagefind, graceful shutdown, real embedding model, HNSW ANN index; only A1-14c = Apple GPU/EMLX acceleration still open) +1. ~~**A1-1 through A1-14c**~~ — all resolved: auto-save, on-demand preview, template lookup, validation gates, real Pagefind, graceful shutdown, real embedding model, HNSW ANN index, and Apple GPU/EMLX acceleration (A1-14c) 2. **D1-1 through D1-18** — untested invariants/guarantees 3. **C-1 through C-3** — internal spec inconsistencies (reconcile to code) 4. **B1-1 through B1-6** — major code behaviors missing from spec diff --git a/config/config.exs b/config/config.exs index b98c1ef..bbd14a2 100644 --- a/config/config.exs +++ b/config/config.exs @@ -68,7 +68,10 @@ config :bds, :embeddings, # Inference is batched: batch_size texts per compiled run, truncated to # sequence_length tokens. Tuning these trades throughput against memory. batch_size: 16, - sequence_length: 256 + sequence_length: 256, + # Hardware acceleration: :auto prefers the Apple GPU (EMLX/Metal) on Apple + # Silicon and falls back to EXLA-CPU elsewhere. Force with :emlx or :exla. + accelerator: :auto # Cache downloaded model files under the app data directory so they persist # across sessions (ModelCaching invariant). Overridden at runtime in prod. diff --git a/lib/bds/embeddings/backends/neural.ex b/lib/bds/embeddings/backends/neural.ex index fe5453c..ab1feed 100644 --- a/lib/bds/embeddings/backends/neural.ex +++ b/lib/bds/embeddings/backends/neural.ex @@ -23,8 +23,17 @@ defmodule BDS.Embeddings.Backends.Neural do compiled for a fixed `batch_size`/`sequence_length` (configurable); shorter sequences mean less wasted transformer compute. - EXLA on Apple Silicon runs on the CPU — XLA has no Metal/GPU backend. See - SPECGAPS A1-14c for the planned EMLX (Apple GPU via MLX) acceleration path. + Hardware acceleration follows the `NativeAcceleratedExecution` invariant. + The serving's defn compiler is chosen at build time: + + * On Apple Silicon (arm64 macOS) with EMLX available, inference runs on the + Apple GPU via MLX/Metal (`compiler: EMLX`, params placed on the + `EMLX.Backend` GPU device). + * Everywhere else — and as a fallback when EMLX is unavailable or explicitly + disabled — it runs on optimised native CPU via XLA (`compiler: EXLA`). + + The accelerator can be pinned with `config :bds, :embeddings, accelerator:` + to `:auto` (default), `:emlx`, or `:exla`. """ @behaviour BDS.Embeddings.Backend @@ -39,6 +48,7 @@ defmodule BDS.Embeddings.Backends.Neural do @default_dimensions 384 @default_batch_size 16 @default_sequence_length 256 + @default_accelerator :auto def child_spec(opts) do %{id: __MODULE__, start: {__MODULE__, :start_link, [opts]}} @@ -124,6 +134,8 @@ defmodule BDS.Embeddings.Backends.Neural do defp build_serving do repo = {:hf, Keyword.get(config(), :model_repo, @default_model_repo)} + accelerator = current_accelerator() + maybe_set_default_backend(accelerator) with {:ok, model_info} <- Bumblebee.load_model(repo), {:ok, tokenizer} <- Bumblebee.load_tokenizer(repo) do @@ -133,13 +145,58 @@ defmodule BDS.Embeddings.Backends.Neural do output_attribute: :hidden_state, embedding_processor: :l2_norm, compile: [batch_size: batch_size(), sequence_length: sequence_length()], - defn_options: [compiler: EXLA] + defn_options: defn_options(accelerator) ) {:ok, serving} end end + # Place model params/tensors on the Apple GPU (Metal) when accelerating with + # EMLX so the compiled inference pass actually runs on-device. EXLA manages + # its own device placement, so nothing to do there. + defp maybe_set_default_backend(:emlx), do: Nx.global_default_backend({EMLX.Backend, device: :gpu}) + defp maybe_set_default_backend(:exla), do: :ok + + @doc false + @spec defn_options(:emlx | :exla) :: keyword() + def defn_options(:emlx), do: [compiler: EMLX] + def defn_options(:exla), do: [compiler: EXLA] + + @doc false + @spec current_accelerator() :: :emlx | :exla + def current_accelerator do + select_accelerator(configured_accelerator(), emlx_available?(), apple_silicon?()) + end + + @doc """ + Pure accelerator-selection policy for `NativeAcceleratedExecution`. + + Prefer the Apple GPU (EMLX) under `:auto` only when it is both available and + running on Apple Silicon; honour an explicit `:emlx`/`:exla` request, but + degrade a forced `:emlx` to EXLA when EMLX is not loaded so a misconfigured + host still gets working CPU inference instead of crashing. + """ + @spec select_accelerator(:auto | :emlx | :exla, boolean(), boolean()) :: :emlx | :exla + def select_accelerator(:exla, _emlx_available?, _apple_silicon?), do: :exla + def select_accelerator(:emlx, true, _apple_silicon?), do: :emlx + def select_accelerator(:emlx, false, _apple_silicon?), do: :exla + def select_accelerator(:auto, true, true), do: :emlx + def select_accelerator(:auto, _emlx_available?, _apple_silicon?), do: :exla + + defp configured_accelerator do + config() |> Keyword.get(:accelerator, @default_accelerator) + end + + defp emlx_available? do + Code.ensure_loaded?(EMLX) and Code.ensure_loaded?(EMLX.Backend) + end + + defp apple_silicon? do + :os.type() == {:unix, :darwin} and + to_string(:erlang.system_info(:system_architecture)) =~ ~r/aarch64|arm/ + end + defp batch_size do config() |> Keyword.get(:batch_size, @default_batch_size) |> max(1) end diff --git a/mix.exs b/mix.exs index 8d803b8..c5f8abc 100644 --- a/mix.exs +++ b/mix.exs @@ -36,6 +36,10 @@ defmodule BDS.MixProject do {:image, "~> 0.67"}, {:nx, "~> 0.10"}, {:exla, "~> 0.10"}, + # Apple Silicon GPU (Metal) acceleration for embedding inference. Ships + # precompiled MLX binaries; the Neural backend prefers it on arm64 macOS + # and falls back to EXLA-CPU elsewhere (SPECGAPS A1-14c). + {:emlx, "~> 0.2.0"}, {:bumblebee, "~> 0.6.3"}, {:hnswlib, "~> 0.1.7"}, {:stemex, "~> 0.2.1"}, @@ -64,7 +68,7 @@ defmodule BDS.MixProject do env = Mix.env() [ - plt_add_apps: [:mix, :inets, :ssl, :nx, :exla, :bumblebee, :hnswlib], + plt_add_apps: [:mix, :inets, :ssl, :nx, :exla, :emlx, :bumblebee, :hnswlib], paths: ["_build/#{env}/lib/bds/ebin"] ] end diff --git a/mix.lock b/mix.lock index fcaf8b3..9506351 100644 --- a/mix.lock +++ b/mix.lock @@ -18,6 +18,7 @@ "ecto_sql": {:hex, :ecto_sql, "3.13.5", "2f8282b2ad97bf0f0d3217ea0a6fff320ead9e2f8770f810141189d182dc304e", [:mix], [{:db_connection, "~> 2.4.1 or ~> 2.5", [hex: :db_connection, repo: "hexpm", optional: false]}, {:ecto, "~> 3.13.0", [hex: :ecto, repo: "hexpm", optional: false]}, {:myxql, "~> 0.7", [hex: :myxql, repo: "hexpm", optional: true]}, {:postgrex, "~> 0.19 or ~> 1.0", [hex: :postgrex, repo: "hexpm", optional: true]}, {:tds, "~> 2.1.1 or ~> 2.2", [hex: :tds, repo: "hexpm", optional: true]}, {:telemetry, "~> 0.4.0 or ~> 1.0", [hex: :telemetry, repo: "hexpm", optional: false]}], "hexpm", "aa36751f4e6a2b56ae79efb0e088042e010ff4935fc8684e74c23b1f49e25fdc"}, "ecto_sqlite3": {:hex, :ecto_sqlite3, "0.22.0", "edab2d0f701b7dd05dcf7e2d97769c106aff62b5cfddc000d1dd6f46b9cbd8c3", [:mix], [{:decimal, "~> 1.6 or ~> 2.0", [hex: :decimal, repo: "hexpm", optional: false]}, {:ecto, "~> 3.13.0", [hex: :ecto, repo: "hexpm", optional: false]}, {:ecto_sql, "~> 3.13.0", [hex: :ecto_sql, repo: "hexpm", optional: false]}, {:exqlite, "~> 0.22", [hex: :exqlite, repo: "hexpm", optional: false]}], "hexpm", "5af9e031bffcc5da0b7bca90c271a7b1e7c04a93fecf7f6cd35bc1b1921a64bd"}, "elixir_make": {:hex, :elixir_make, "0.9.0", "6484b3cd8c0cee58f09f05ecaf1a140a8c97670671a6a0e7ab4dc326c3109726", [:mix], [], "hexpm", "db23d4fd8b757462ad02f8aa73431a426fe6671c80b200d9710caf3d1dd0ffdb"}, + "emlx": {:hex, :emlx, "0.2.0", "f844c5456a8051032da98276f1e5c2282ff822824b139e4788af6e48375d0e1e", [:make, :mix], [{:elixir_make, "~> 0.6", [hex: :elixir_make, repo: "hexpm", optional: false]}, {:nx, "~> 0.10", [hex: :nx, repo: "hexpm", optional: false]}], "hexpm", "24c674d716beca3daf422829ce7d5fc044981d3d0ca93a832a536016c543dd6f"}, "erlex": {:hex, :erlex, "0.2.8", "cd8116f20f3c0afe376d1e8d1f0ae2452337729f68be016ea544a72f767d9c12", [:mix], [], "hexpm", "9d66ff9fedf69e49dc3fd12831e12a8a37b76f8651dd21cd45fcf5561a8a7590"}, "esbuild": {:hex, :esbuild, "0.10.0", "b0aa3388a1c23e727c5a3e7427c932d89ee791746b0081bbe56103e9ef3d291f", [:mix], [{:jason, "~> 1.4", [hex: :jason, repo: "hexpm", optional: false]}], "hexpm", "468489cda427b974a7cc9f03ace55368a83e1a7be12fba7e30969af78e5f8c70"}, "ex_dbus": {:hex, :ex_dbus, "0.1.4", "053df83d45b27ba0b9b6ef55a47253922069a3ace12a2a7dd30d3aff58301e17", [:mix], [{:dbus, "~> 0.8.0", [hex: :dbus, repo: "hexpm", optional: false]}, {:saxy, "~> 1.4.0", [hex: :saxy, repo: "hexpm", optional: false]}], "hexpm", "d8baeaf465eab57b70a47b70e29fdfef6eb09ba110fc37176eebe6ac7874d6d5"}, diff --git a/specs/embedding.allium b/specs/embedding.allium index 3a65bac..44eabc5 100644 --- a/specs/embedding.allium +++ b/specs/embedding.allium @@ -48,7 +48,8 @@ value EmbeddingModel { -- Lazy-loaded: pipeline created on first embedding request, not at startup -- Text preprocessing: prefix all input with "query: " (e5 convention) -- Pooling: mean pooling + L2 normalization - -- Loaded on-device via Bumblebee+EXLA; the canonical e5 weights come from + -- Loaded on-device via Bumblebee (EMLX/Apple GPU or EXLA-CPU, see + -- NativeAcceleratedExecution); the canonical e5 weights come from -- the "intfloat/multilingual-e5-small" repository, surfaced under the -- "Xenova/multilingual-e5-small" model_id identifier. model_id: String -- "Xenova/multilingual-e5-small" @@ -236,10 +237,12 @@ invariant NativeAcceleratedExecution { -- Inference MUST be batched: batch_size inputs are run per compiled -- inference pass and inputs are truncated to a bounded sequence_length, so -- (re)indexing many posts is not serialised one document at a time. - -- Current implementation: Bumblebee + EXLA, which is native CPU on Apple - -- Silicon (XLA has no Metal backend); neighbour search is HNSW (hnswlib). - -- Apple GPU acceleration via EMLX/MLX is tracked as a follow-up - -- (SPECGAPS A1-14c). + -- Current implementation: Bumblebee with a runtime-selected defn compiler. + -- On Apple Silicon the model runs on the Apple GPU via EMLX (MLX/Metal, + -- params placed on the EMLX.Backend GPU device); everywhere else, and as a + -- fallback when EMLX is unavailable, it runs on optimised native CPU via + -- EXLA. Selection is `accelerator: :auto | :emlx | :exla` (default :auto). + -- Neighbour search is HNSW (hnswlib). } invariant ModelCaching { diff --git a/test/bds/embeddings/backends/neural_test.exs b/test/bds/embeddings/backends/neural_test.exs index 3690d56..df7546f 100644 --- a/test/bds/embeddings/backends/neural_test.exs +++ b/test/bds/embeddings/backends/neural_test.exs @@ -36,4 +36,36 @@ defmodule BDS.Embeddings.Backends.NeuralTest do assert BDS.Embeddings.Backend in behaviours end + + describe "accelerator selection (NativeAcceleratedExecution)" do + test "auto prefers Apple GPU (EMLX) when available on Apple Silicon" do + assert Neural.select_accelerator(:auto, true, true) == :emlx + end + + test "auto falls back to EXLA-CPU off Apple Silicon" do + assert Neural.select_accelerator(:auto, true, false) == :exla + end + + test "auto falls back to EXLA-CPU when EMLX is unavailable" do + assert Neural.select_accelerator(:auto, false, true) == :exla + end + + test "explicit :exla is honoured even on Apple Silicon with EMLX present" do + assert Neural.select_accelerator(:exla, true, true) == :exla + end + + test "explicit :emlx is honoured when available" do + assert Neural.select_accelerator(:emlx, true, true) == :emlx + assert Neural.select_accelerator(:emlx, true, false) == :emlx + end + + test "explicit :emlx degrades to EXLA when EMLX is unavailable" do + assert Neural.select_accelerator(:emlx, false, true) == :exla + end + + test "defn options map each accelerator to its native compiler" do + assert Neural.defn_options(:emlx) == [compiler: EMLX] + assert Neural.defn_options(:exla) == [compiler: EXLA] + end + end end