fix: A1-14c run embedding model on Apple GPU via EMLX with EXLA-CPU fallback

This commit is contained in:
2026-05-29 16:26:33 +02:00
parent d03d033548
commit 84b91750fb
7 changed files with 112 additions and 12 deletions

View File

@@ -25,7 +25,7 @@ Gap categories: **SC** = spec correct, fix code | **CS** = code correct, update
| A1-13 | ~~Git sidebar shows only "Working tree" placeholder~~ | sidebar_views.allium:651-770 | `git_view/1` now builds a full `layout: "git"` view from `BDS.Git` (repository/remote_state/status/history); `SidebarComponents` renders active + not_a_repo states | **Resolved:** `git_view/1` in sidebar.ex assembles branch/upstream/ahead/behind, status files, paginated history (20/page); `render_git_sidebar` renders branch header, sync legend, fetch/pull/push/prune-lfs buttons, commit form, clickable status files (open git_diff), history entries; shell_live wires `git_commit` (closes git_diff tabs), `git_fetch`/`git_pull`/`git_push`/`git_prune_lfs`, `git_initialize`; `BDS.Git.history` enriched with author/date, `BDS.Git.set_remote/2` added; i18n for de/fr/it/es; 3 shell tests + git author/date assertions added | | A1-13 | ~~Git sidebar shows only "Working tree" placeholder~~ | sidebar_views.allium:651-770 | `git_view/1` now builds a full `layout: "git"` view from `BDS.Git` (repository/remote_state/status/history); `SidebarComponents` renders active + not_a_repo states | **Resolved:** `git_view/1` in sidebar.ex assembles branch/upstream/ahead/behind, status files, paginated history (20/page); `render_git_sidebar` renders branch header, sync legend, fetch/pull/push/prune-lfs buttons, commit form, clickable status files (open git_diff), history entries; shell_live wires `git_commit` (closes git_diff tabs), `git_fetch`/`git_pull`/`git_push`/`git_prune_lfs`, `git_initialize`; `BDS.Git.history` enriched with author/date, `BDS.Git.set_remote/2` added; i18n for de/fr/it/es; 3 shell tests + git author/date assertions added |
| A1-14 | ~~Embedding uses TF-IDF hash projection instead of real neural model~~ | embedding.allium:44-53, invariants RealNeuralModel/ModelCaching/VectorCacheInDb | `Backends.Neural` runs `intfloat/multilingual-e5-small` (e5 weights behind the Xenova id) via Bumblebee+EXLA | **Resolved (core):** added bumblebee/nx/exla deps; `Backends.Neural` is a lazily-loaded GenServer that builds the Bumblebee text-embedding serving on first request (`"query: "` prefix + mean pooling + L2 norm), downloads+caches the model under the app data dir (ModelCaching), and is wired into the supervision tree when configured; vectors now persisted as packed little-endian Float32 BLOB (384×4=1536 bytes) instead of JSON text (VectorCacheInDb) with migration recreating `embedding_keys.vector` as BLOB; `InApp` demoted to documented offline/test stub; test config uses the stub so the suite stays offline; spec EmbeddingModel clarified (Xenova id ↔ intfloat weights via Bumblebee); batched inference via optional `embed_many/2` backend callback (configurable `batch_size`/`sequence_length`; rebuild/index/repair embed in chunks instead of one post at a time) + `NativeAcceleratedExecution` invariant added to spec; 4 tests added (BLOB round-trip, batched-rebuild, Neural model_info/behaviour). **Deferred:** A1-14b USearch HNSW index, A1-14c Apple GPU (EMLX). | | A1-14 | ~~Embedding uses TF-IDF hash projection instead of real neural model~~ | embedding.allium:44-53, invariants RealNeuralModel/ModelCaching/VectorCacheInDb | `Backends.Neural` runs `intfloat/multilingual-e5-small` (e5 weights behind the Xenova id) via Bumblebee+EXLA | **Resolved (core):** added bumblebee/nx/exla deps; `Backends.Neural` is a lazily-loaded GenServer that builds the Bumblebee text-embedding serving on first request (`"query: "` prefix + mean pooling + L2 norm), downloads+caches the model under the app data dir (ModelCaching), and is wired into the supervision tree when configured; vectors now persisted as packed little-endian Float32 BLOB (384×4=1536 bytes) instead of JSON text (VectorCacheInDb) with migration recreating `embedding_keys.vector` as BLOB; `InApp` demoted to documented offline/test stub; test config uses the stub so the suite stays offline; spec EmbeddingModel clarified (Xenova id ↔ intfloat weights via Bumblebee); batched inference via optional `embed_many/2` backend callback (configurable `batch_size`/`sequence_length`; rebuild/index/repair embed in chunks instead of one post at a time) + `NativeAcceleratedExecution` invariant added to spec; 4 tests added (BLOB round-trip, batched-rebuild, Neural model_info/behaviour). **Deferred:** A1-14b USearch HNSW index, A1-14c Apple GPU (EMLX). |
| A1-14b | ~~USearch HNSW ANN index + debounced persistence not implemented~~ | embedding.allium config/FindSimilar/DebouncedPersistence | `Embeddings.Index` is now an HNSW (hnswlib) ANN index with debounced persistence | **Resolved:** rewrote `Embeddings.Index` as a DB-free GenServer wrapping an hnswlib HNSW graph (cosine, M=16, efConstruction=128, efSearch=64) — O(n·log n) build, O(log n) queries, replacing the O(n²) JSON cosine snapshot; per-project in-memory index + `label→post_id` map; 5s debounced `save_index` + `.meta.json` sidecar, force-save on project switch (`set_active_project`) and shutdown (`terminate`), `forget/1` on project delete; lazy reload from disk with rebuild-from-DB self-heal on miss; `find_similar`/`find_duplicates`/`compute_similarities` rewired (no brute-force fallback); USearch has no Elixir binding so hnswlib provides the identical HNSW algorithm/params (spec reconciled); supervision + dialyzer PLT updated; tests updated for debounced/binary persistence + self-heal. Follow-up hardening: explicit rebuild now forces re-embedding regardless of content_hash (ReindexAll), and model-unavailable errors propagate cleanly (post saves degrade to unindexed + log; rebuild/index return `{:error, reason}` surfaced as a failed task with a user-facing message instead of crashing). | | A1-14b | ~~USearch HNSW ANN index + debounced persistence not implemented~~ | embedding.allium config/FindSimilar/DebouncedPersistence | `Embeddings.Index` is now an HNSW (hnswlib) ANN index with debounced persistence | **Resolved:** rewrote `Embeddings.Index` as a DB-free GenServer wrapping an hnswlib HNSW graph (cosine, M=16, efConstruction=128, efSearch=64) — O(n·log n) build, O(log n) queries, replacing the O(n²) JSON cosine snapshot; per-project in-memory index + `label→post_id` map; 5s debounced `save_index` + `.meta.json` sidecar, force-save on project switch (`set_active_project`) and shutdown (`terminate`), `forget/1` on project delete; lazy reload from disk with rebuild-from-DB self-heal on miss; `find_similar`/`find_duplicates`/`compute_similarities` rewired (no brute-force fallback); USearch has no Elixir binding so hnswlib provides the identical HNSW algorithm/params (spec reconciled); supervision + dialyzer PLT updated; tests updated for debounced/binary persistence + self-heal. Follow-up hardening: explicit rebuild now forces re-embedding regardless of content_hash (ReindexAll), and model-unavailable errors propagate cleanly (post saves degrade to unindexed + log; rebuild/index return `{:error, reason}` surfaced as a failed task with a user-facing message instead of crashing). |
| A1-14c | Embedding model runs on CPU only; no Apple GPU acceleration | embedding.allium invariant NativeAcceleratedExecution | `Backends.Neural` uses Bumblebee+EXLA; on Apple Silicon XLA has no Metal backend so inference is native CPU (batched). Apple GPU/Neural Engine unused | Fix code: spike an EMLX (Apple MLX) Nx backend so the model executes on the Apple Silicon GPU; gate by platform/availability with EXLA-CPU fallback; verify Bumblebee serving + defn compiler compatibility and benchmark vs CPU batching | | A1-14c | ~~Embedding model runs on CPU only; no Apple GPU acceleration~~ | embedding.allium invariant NativeAcceleratedExecution | `Backends.Neural` now selects the defn compiler at serving-build time: Apple GPU via EMLX (MLX/Metal) on arm64 macOS, EXLA-CPU elsewhere | **Resolved:** added `{:emlx, "~> 0.2.0"}` dep (ships precompiled MLX binaries; EMLX 0.2.0 implements both `EMLX.Backend` and the `Nx.Defn.Compiler` behaviour, GPU-default); `Backends.Neural` gained a pure `select_accelerator/3` policy (`:auto` prefers EMLX only when available **and** on Apple Silicon; explicit `:emlx`/`:exla` honoured; forced `:emlx` degrades to EXLA when unavailable so misconfigured hosts still run), `current_accelerator/0`, and `defn_options/1`; `build_serving` places params on `{EMLX.Backend, device: :gpu}` and compiles with `EMLX` for the EMLX path, keeps `EXLA` otherwise; new `accelerator: :auto` config key; spec `NativeAcceleratedExecution` + `EmbeddingModel` updated; PLT app added; 7 tests added (offline — test config still uses the InApp stub). |
| A1-15 | ~~Preview vs generation content source strategy undocumented~~ | preview.allium (no invariant), generation.allium (no invariant) | Generation uses only published .md file content (`Generation.Data` snapshots set `content: nil`); preview includes published+draft posts and prefers DB content over file (`Preview.Router` queries `:published`/`:draft`, uses `editor_body`) | **Resolved:** added `PreviewDraftOverlay` invariant to preview.allium and `GenerationPublishedOnly` invariant to generation.allium; both cross-reference each other; code already correct, 3 tests added for draft-in-preview behavior | | A1-15 | ~~Preview vs generation content source strategy undocumented~~ | preview.allium (no invariant), generation.allium (no invariant) | Generation uses only published .md file content (`Generation.Data` snapshots set `content: nil`); preview includes published+draft posts and prefers DB content over file (`Preview.Router` queries `:published`/`:draft`, uses `editor_body`) | **Resolved:** added `PreviewDraftOverlay` invariant to preview.allium and `GenerationPublishedOnly` invariant to generation.allium; both cross-reference each other; code already correct, 3 tests added for draft-in-preview behavior |
### A2. Spec Should Update (code is normative) ### A2. Spec Should Update (code is normative)
@@ -186,7 +186,7 @@ All reconciled to follow code. Specs must be self-consistent and match code.
## Priority Order for Resolution ## Priority Order for Resolution
1. **A1-1 through A1-14c** — code must follow spec (includes auto-save, on-demand preview, template lookup, validation gates, real Pagefind, graceful shutdown, real embedding model, HNSW ANN index; only A1-14c = Apple GPU/EMLX acceleration still open) 1. ~~**A1-1 through A1-14c**~~ — all resolved: auto-save, on-demand preview, template lookup, validation gates, real Pagefind, graceful shutdown, real embedding model, HNSW ANN index, and Apple GPU/EMLX acceleration (A1-14c)
2. **D1-1 through D1-18** — untested invariants/guarantees 2. **D1-1 through D1-18** — untested invariants/guarantees
3. **C-1 through C-3** — internal spec inconsistencies (reconcile to code) 3. **C-1 through C-3** — internal spec inconsistencies (reconcile to code)
4. **B1-1 through B1-6** — major code behaviors missing from spec 4. **B1-1 through B1-6** — major code behaviors missing from spec

View File

@@ -68,7 +68,10 @@ config :bds, :embeddings,
# Inference is batched: batch_size texts per compiled run, truncated to # Inference is batched: batch_size texts per compiled run, truncated to
# sequence_length tokens. Tuning these trades throughput against memory. # sequence_length tokens. Tuning these trades throughput against memory.
batch_size: 16, batch_size: 16,
sequence_length: 256 sequence_length: 256,
# Hardware acceleration: :auto prefers the Apple GPU (EMLX/Metal) on Apple
# Silicon and falls back to EXLA-CPU elsewhere. Force with :emlx or :exla.
accelerator: :auto
# Cache downloaded model files under the app data directory so they persist # Cache downloaded model files under the app data directory so they persist
# across sessions (ModelCaching invariant). Overridden at runtime in prod. # across sessions (ModelCaching invariant). Overridden at runtime in prod.

View File

@@ -23,8 +23,17 @@ defmodule BDS.Embeddings.Backends.Neural do
compiled for a fixed `batch_size`/`sequence_length` (configurable); compiled for a fixed `batch_size`/`sequence_length` (configurable);
shorter sequences mean less wasted transformer compute. shorter sequences mean less wasted transformer compute.
EXLA on Apple Silicon runs on the CPU — XLA has no Metal/GPU backend. See Hardware acceleration follows the `NativeAcceleratedExecution` invariant.
SPECGAPS A1-14c for the planned EMLX (Apple GPU via MLX) acceleration path. The serving's defn compiler is chosen at build time:
* On Apple Silicon (arm64 macOS) with EMLX available, inference runs on the
Apple GPU via MLX/Metal (`compiler: EMLX`, params placed on the
`EMLX.Backend` GPU device).
* Everywhere else — and as a fallback when EMLX is unavailable or explicitly
disabled — it runs on optimised native CPU via XLA (`compiler: EXLA`).
The accelerator can be pinned with `config :bds, :embeddings, accelerator:`
to `:auto` (default), `:emlx`, or `:exla`.
""" """
@behaviour BDS.Embeddings.Backend @behaviour BDS.Embeddings.Backend
@@ -39,6 +48,7 @@ defmodule BDS.Embeddings.Backends.Neural do
@default_dimensions 384 @default_dimensions 384
@default_batch_size 16 @default_batch_size 16
@default_sequence_length 256 @default_sequence_length 256
@default_accelerator :auto
def child_spec(opts) do def child_spec(opts) do
%{id: __MODULE__, start: {__MODULE__, :start_link, [opts]}} %{id: __MODULE__, start: {__MODULE__, :start_link, [opts]}}
@@ -124,6 +134,8 @@ defmodule BDS.Embeddings.Backends.Neural do
defp build_serving do defp build_serving do
repo = {:hf, Keyword.get(config(), :model_repo, @default_model_repo)} repo = {:hf, Keyword.get(config(), :model_repo, @default_model_repo)}
accelerator = current_accelerator()
maybe_set_default_backend(accelerator)
with {:ok, model_info} <- Bumblebee.load_model(repo), with {:ok, model_info} <- Bumblebee.load_model(repo),
{:ok, tokenizer} <- Bumblebee.load_tokenizer(repo) do {:ok, tokenizer} <- Bumblebee.load_tokenizer(repo) do
@@ -133,13 +145,58 @@ defmodule BDS.Embeddings.Backends.Neural do
output_attribute: :hidden_state, output_attribute: :hidden_state,
embedding_processor: :l2_norm, embedding_processor: :l2_norm,
compile: [batch_size: batch_size(), sequence_length: sequence_length()], compile: [batch_size: batch_size(), sequence_length: sequence_length()],
defn_options: [compiler: EXLA] defn_options: defn_options(accelerator)
) )
{:ok, serving} {:ok, serving}
end end
end end
# Place model params/tensors on the Apple GPU (Metal) when accelerating with
# EMLX so the compiled inference pass actually runs on-device. EXLA manages
# its own device placement, so nothing to do there.
defp maybe_set_default_backend(:emlx), do: Nx.global_default_backend({EMLX.Backend, device: :gpu})
defp maybe_set_default_backend(:exla), do: :ok
@doc false
@spec defn_options(:emlx | :exla) :: keyword()
def defn_options(:emlx), do: [compiler: EMLX]
def defn_options(:exla), do: [compiler: EXLA]
@doc false
@spec current_accelerator() :: :emlx | :exla
def current_accelerator do
select_accelerator(configured_accelerator(), emlx_available?(), apple_silicon?())
end
@doc """
Pure accelerator-selection policy for `NativeAcceleratedExecution`.
Prefer the Apple GPU (EMLX) under `:auto` only when it is both available and
running on Apple Silicon; honour an explicit `:emlx`/`:exla` request, but
degrade a forced `:emlx` to EXLA when EMLX is not loaded so a misconfigured
host still gets working CPU inference instead of crashing.
"""
@spec select_accelerator(:auto | :emlx | :exla, boolean(), boolean()) :: :emlx | :exla
def select_accelerator(:exla, _emlx_available?, _apple_silicon?), do: :exla
def select_accelerator(:emlx, true, _apple_silicon?), do: :emlx
def select_accelerator(:emlx, false, _apple_silicon?), do: :exla
def select_accelerator(:auto, true, true), do: :emlx
def select_accelerator(:auto, _emlx_available?, _apple_silicon?), do: :exla
defp configured_accelerator do
config() |> Keyword.get(:accelerator, @default_accelerator)
end
defp emlx_available? do
Code.ensure_loaded?(EMLX) and Code.ensure_loaded?(EMLX.Backend)
end
defp apple_silicon? do
:os.type() == {:unix, :darwin} and
to_string(:erlang.system_info(:system_architecture)) =~ ~r/aarch64|arm/
end
defp batch_size do defp batch_size do
config() |> Keyword.get(:batch_size, @default_batch_size) |> max(1) config() |> Keyword.get(:batch_size, @default_batch_size) |> max(1)
end end

View File

@@ -36,6 +36,10 @@ defmodule BDS.MixProject do
{:image, "~> 0.67"}, {:image, "~> 0.67"},
{:nx, "~> 0.10"}, {:nx, "~> 0.10"},
{:exla, "~> 0.10"}, {:exla, "~> 0.10"},
# Apple Silicon GPU (Metal) acceleration for embedding inference. Ships
# precompiled MLX binaries; the Neural backend prefers it on arm64 macOS
# and falls back to EXLA-CPU elsewhere (SPECGAPS A1-14c).
{:emlx, "~> 0.2.0"},
{:bumblebee, "~> 0.6.3"}, {:bumblebee, "~> 0.6.3"},
{:hnswlib, "~> 0.1.7"}, {:hnswlib, "~> 0.1.7"},
{:stemex, "~> 0.2.1"}, {:stemex, "~> 0.2.1"},
@@ -64,7 +68,7 @@ defmodule BDS.MixProject do
env = Mix.env() env = Mix.env()
[ [
plt_add_apps: [:mix, :inets, :ssl, :nx, :exla, :bumblebee, :hnswlib], plt_add_apps: [:mix, :inets, :ssl, :nx, :exla, :emlx, :bumblebee, :hnswlib],
paths: ["_build/#{env}/lib/bds/ebin"] paths: ["_build/#{env}/lib/bds/ebin"]
] ]
end end

View File

@@ -18,6 +18,7 @@
"ecto_sql": {:hex, :ecto_sql, "3.13.5", "2f8282b2ad97bf0f0d3217ea0a6fff320ead9e2f8770f810141189d182dc304e", [:mix], [{:db_connection, "~> 2.4.1 or ~> 2.5", [hex: :db_connection, repo: "hexpm", optional: false]}, {:ecto, "~> 3.13.0", [hex: :ecto, repo: "hexpm", optional: false]}, {:myxql, "~> 0.7", [hex: :myxql, repo: "hexpm", optional: true]}, {:postgrex, "~> 0.19 or ~> 1.0", [hex: :postgrex, repo: "hexpm", optional: true]}, {:tds, "~> 2.1.1 or ~> 2.2", [hex: :tds, repo: "hexpm", optional: true]}, {:telemetry, "~> 0.4.0 or ~> 1.0", [hex: :telemetry, repo: "hexpm", optional: false]}], "hexpm", "aa36751f4e6a2b56ae79efb0e088042e010ff4935fc8684e74c23b1f49e25fdc"}, "ecto_sql": {:hex, :ecto_sql, "3.13.5", "2f8282b2ad97bf0f0d3217ea0a6fff320ead9e2f8770f810141189d182dc304e", [:mix], [{:db_connection, "~> 2.4.1 or ~> 2.5", [hex: :db_connection, repo: "hexpm", optional: false]}, {:ecto, "~> 3.13.0", [hex: :ecto, repo: "hexpm", optional: false]}, {:myxql, "~> 0.7", [hex: :myxql, repo: "hexpm", optional: true]}, {:postgrex, "~> 0.19 or ~> 1.0", [hex: :postgrex, repo: "hexpm", optional: true]}, {:tds, "~> 2.1.1 or ~> 2.2", [hex: :tds, repo: "hexpm", optional: true]}, {:telemetry, "~> 0.4.0 or ~> 1.0", [hex: :telemetry, repo: "hexpm", optional: false]}], "hexpm", "aa36751f4e6a2b56ae79efb0e088042e010ff4935fc8684e74c23b1f49e25fdc"},
"ecto_sqlite3": {:hex, :ecto_sqlite3, "0.22.0", "edab2d0f701b7dd05dcf7e2d97769c106aff62b5cfddc000d1dd6f46b9cbd8c3", [:mix], [{:decimal, "~> 1.6 or ~> 2.0", [hex: :decimal, repo: "hexpm", optional: false]}, {:ecto, "~> 3.13.0", [hex: :ecto, repo: "hexpm", optional: false]}, {:ecto_sql, "~> 3.13.0", [hex: :ecto_sql, repo: "hexpm", optional: false]}, {:exqlite, "~> 0.22", [hex: :exqlite, repo: "hexpm", optional: false]}], "hexpm", "5af9e031bffcc5da0b7bca90c271a7b1e7c04a93fecf7f6cd35bc1b1921a64bd"}, "ecto_sqlite3": {:hex, :ecto_sqlite3, "0.22.0", "edab2d0f701b7dd05dcf7e2d97769c106aff62b5cfddc000d1dd6f46b9cbd8c3", [:mix], [{:decimal, "~> 1.6 or ~> 2.0", [hex: :decimal, repo: "hexpm", optional: false]}, {:ecto, "~> 3.13.0", [hex: :ecto, repo: "hexpm", optional: false]}, {:ecto_sql, "~> 3.13.0", [hex: :ecto_sql, repo: "hexpm", optional: false]}, {:exqlite, "~> 0.22", [hex: :exqlite, repo: "hexpm", optional: false]}], "hexpm", "5af9e031bffcc5da0b7bca90c271a7b1e7c04a93fecf7f6cd35bc1b1921a64bd"},
"elixir_make": {:hex, :elixir_make, "0.9.0", "6484b3cd8c0cee58f09f05ecaf1a140a8c97670671a6a0e7ab4dc326c3109726", [:mix], [], "hexpm", "db23d4fd8b757462ad02f8aa73431a426fe6671c80b200d9710caf3d1dd0ffdb"}, "elixir_make": {:hex, :elixir_make, "0.9.0", "6484b3cd8c0cee58f09f05ecaf1a140a8c97670671a6a0e7ab4dc326c3109726", [:mix], [], "hexpm", "db23d4fd8b757462ad02f8aa73431a426fe6671c80b200d9710caf3d1dd0ffdb"},
"emlx": {:hex, :emlx, "0.2.0", "f844c5456a8051032da98276f1e5c2282ff822824b139e4788af6e48375d0e1e", [:make, :mix], [{:elixir_make, "~> 0.6", [hex: :elixir_make, repo: "hexpm", optional: false]}, {:nx, "~> 0.10", [hex: :nx, repo: "hexpm", optional: false]}], "hexpm", "24c674d716beca3daf422829ce7d5fc044981d3d0ca93a832a536016c543dd6f"},
"erlex": {:hex, :erlex, "0.2.8", "cd8116f20f3c0afe376d1e8d1f0ae2452337729f68be016ea544a72f767d9c12", [:mix], [], "hexpm", "9d66ff9fedf69e49dc3fd12831e12a8a37b76f8651dd21cd45fcf5561a8a7590"}, "erlex": {:hex, :erlex, "0.2.8", "cd8116f20f3c0afe376d1e8d1f0ae2452337729f68be016ea544a72f767d9c12", [:mix], [], "hexpm", "9d66ff9fedf69e49dc3fd12831e12a8a37b76f8651dd21cd45fcf5561a8a7590"},
"esbuild": {:hex, :esbuild, "0.10.0", "b0aa3388a1c23e727c5a3e7427c932d89ee791746b0081bbe56103e9ef3d291f", [:mix], [{:jason, "~> 1.4", [hex: :jason, repo: "hexpm", optional: false]}], "hexpm", "468489cda427b974a7cc9f03ace55368a83e1a7be12fba7e30969af78e5f8c70"}, "esbuild": {:hex, :esbuild, "0.10.0", "b0aa3388a1c23e727c5a3e7427c932d89ee791746b0081bbe56103e9ef3d291f", [:mix], [{:jason, "~> 1.4", [hex: :jason, repo: "hexpm", optional: false]}], "hexpm", "468489cda427b974a7cc9f03ace55368a83e1a7be12fba7e30969af78e5f8c70"},
"ex_dbus": {:hex, :ex_dbus, "0.1.4", "053df83d45b27ba0b9b6ef55a47253922069a3ace12a2a7dd30d3aff58301e17", [:mix], [{:dbus, "~> 0.8.0", [hex: :dbus, repo: "hexpm", optional: false]}, {:saxy, "~> 1.4.0", [hex: :saxy, repo: "hexpm", optional: false]}], "hexpm", "d8baeaf465eab57b70a47b70e29fdfef6eb09ba110fc37176eebe6ac7874d6d5"}, "ex_dbus": {:hex, :ex_dbus, "0.1.4", "053df83d45b27ba0b9b6ef55a47253922069a3ace12a2a7dd30d3aff58301e17", [:mix], [{:dbus, "~> 0.8.0", [hex: :dbus, repo: "hexpm", optional: false]}, {:saxy, "~> 1.4.0", [hex: :saxy, repo: "hexpm", optional: false]}], "hexpm", "d8baeaf465eab57b70a47b70e29fdfef6eb09ba110fc37176eebe6ac7874d6d5"},

View File

@@ -48,7 +48,8 @@ value EmbeddingModel {
-- Lazy-loaded: pipeline created on first embedding request, not at startup -- Lazy-loaded: pipeline created on first embedding request, not at startup
-- Text preprocessing: prefix all input with "query: " (e5 convention) -- Text preprocessing: prefix all input with "query: " (e5 convention)
-- Pooling: mean pooling + L2 normalization -- Pooling: mean pooling + L2 normalization
-- Loaded on-device via Bumblebee+EXLA; the canonical e5 weights come from -- Loaded on-device via Bumblebee (EMLX/Apple GPU or EXLA-CPU, see
-- NativeAcceleratedExecution); the canonical e5 weights come from
-- the "intfloat/multilingual-e5-small" repository, surfaced under the -- the "intfloat/multilingual-e5-small" repository, surfaced under the
-- "Xenova/multilingual-e5-small" model_id identifier. -- "Xenova/multilingual-e5-small" model_id identifier.
model_id: String -- "Xenova/multilingual-e5-small" model_id: String -- "Xenova/multilingual-e5-small"
@@ -236,10 +237,12 @@ invariant NativeAcceleratedExecution {
-- Inference MUST be batched: batch_size inputs are run per compiled -- Inference MUST be batched: batch_size inputs are run per compiled
-- inference pass and inputs are truncated to a bounded sequence_length, so -- inference pass and inputs are truncated to a bounded sequence_length, so
-- (re)indexing many posts is not serialised one document at a time. -- (re)indexing many posts is not serialised one document at a time.
-- Current implementation: Bumblebee + EXLA, which is native CPU on Apple -- Current implementation: Bumblebee with a runtime-selected defn compiler.
-- Silicon (XLA has no Metal backend); neighbour search is HNSW (hnswlib). -- On Apple Silicon the model runs on the Apple GPU via EMLX (MLX/Metal,
-- Apple GPU acceleration via EMLX/MLX is tracked as a follow-up -- params placed on the EMLX.Backend GPU device); everywhere else, and as a
-- (SPECGAPS A1-14c). -- fallback when EMLX is unavailable, it runs on optimised native CPU via
-- EXLA. Selection is `accelerator: :auto | :emlx | :exla` (default :auto).
-- Neighbour search is HNSW (hnswlib).
} }
invariant ModelCaching { invariant ModelCaching {

View File

@@ -36,4 +36,36 @@ defmodule BDS.Embeddings.Backends.NeuralTest do
assert BDS.Embeddings.Backend in behaviours assert BDS.Embeddings.Backend in behaviours
end end
describe "accelerator selection (NativeAcceleratedExecution)" do
test "auto prefers Apple GPU (EMLX) when available on Apple Silicon" do
assert Neural.select_accelerator(:auto, true, true) == :emlx
end
test "auto falls back to EXLA-CPU off Apple Silicon" do
assert Neural.select_accelerator(:auto, true, false) == :exla
end
test "auto falls back to EXLA-CPU when EMLX is unavailable" do
assert Neural.select_accelerator(:auto, false, true) == :exla
end
test "explicit :exla is honoured even on Apple Silicon with EMLX present" do
assert Neural.select_accelerator(:exla, true, true) == :exla
end
test "explicit :emlx is honoured when available" do
assert Neural.select_accelerator(:emlx, true, true) == :emlx
assert Neural.select_accelerator(:emlx, true, false) == :emlx
end
test "explicit :emlx degrades to EXLA when EMLX is unavailable" do
assert Neural.select_accelerator(:emlx, false, true) == :exla
end
test "defn options map each accelerator to its native compiler" do
assert Neural.defn_options(:emlx) == [compiler: EMLX]
assert Neural.defn_options(:exla) == [compiler: EXLA]
end
end
end end