perf: A1-14b replace O(n^2) embedding snapshot with hnswlib HNSW index and debounced persistence

2026-05-29 15:36:13 +02:00
parent 744f7543d7
commit 61ff2a77c0
12 changed files with 474 additions and 287 deletions

View File

@@ -24,7 +24,7 @@ Gap categories: **SC** = spec correct, fix code | **CS** = code correct, update
| A1-12 | ~~Real Pagefind integration for search~~ | generation.allium:208 | Functional client-side search: `PagefindUI` defined in bundled `pagefind-ui.js`, fragment index records url/title/body-scoped text per page, search-runtime wires it up | **Resolved:** bundled real `PagefindUI` (fetch index, ranked full-text match, highlighted excerpts) + `pagefind-ui.css` as local assets read into `Pagefind`; index scoped to `data-pagefind-body` (unmarked pages excluded per PagefindHtmlMarking), title from `<title>`/`<h1>`; localized "No results found" label via `data-search-no-results` (de/fr/it/es); 3 unit tests added |
| A1-13 | ~~Git sidebar shows only "Working tree" placeholder~~ | sidebar_views.allium:651-770 | `git_view/1` now builds a full `layout: "git"` view from `BDS.Git` (repository/remote_state/status/history); `SidebarComponents` renders active + not_a_repo states | **Resolved:** `git_view/1` in sidebar.ex assembles branch/upstream/ahead/behind, status files, paginated history (20/page); `render_git_sidebar` renders branch header, sync legend, fetch/pull/push/prune-lfs buttons, commit form, clickable status files (open git_diff), history entries; shell_live wires `git_commit` (closes git_diff tabs), `git_fetch`/`git_pull`/`git_push`/`git_prune_lfs`, `git_initialize`; `BDS.Git.history` enriched with author/date, `BDS.Git.set_remote/2` added; i18n for de/fr/it/es; 3 shell tests + git author/date assertions added |
| A1-14 | ~~Embedding uses TF-IDF hash projection instead of real neural model~~ | embedding.allium:44-53, invariants RealNeuralModel/ModelCaching/VectorCacheInDb | `Backends.Neural` runs `intfloat/multilingual-e5-small` (e5 weights behind the Xenova id) via Bumblebee+EXLA | **Resolved (core):** added bumblebee/nx/exla deps; `Backends.Neural` is a lazily-loaded GenServer that builds the Bumblebee text-embedding serving on first request (`"query: "` prefix + mean pooling + L2 norm), downloads+caches the model under the app data dir (ModelCaching), and is wired into the supervision tree when configured; vectors now persisted as packed little-endian Float32 BLOB (384×4=1536 bytes) instead of JSON text (VectorCacheInDb) with migration recreating `embedding_keys.vector` as BLOB; `InApp` demoted to documented offline/test stub; test config uses the stub so the suite stays offline; spec EmbeddingModel clarified (Xenova id ↔ intfloat weights via Bumblebee); batched inference via optional `embed_many/2` backend callback (configurable `batch_size`/`sequence_length`; rebuild/index/repair embed in chunks instead of one post at a time) + `NativeAcceleratedExecution` invariant added to spec; 4 tests added (BLOB round-trip, batched-rebuild, Neural model_info/behaviour). **Deferred:** A1-14b USearch HNSW index, A1-14c Apple GPU (EMLX). |
| A1-14b | USearch HNSW ANN index + debounced persistence not implemented | embedding.allium:75-87 (config), FindSimilar, invariant DebouncedPersistence | Neighbor lookup still uses the JSON cosine snapshot (`Embeddings.Index`), not a USearch HNSW index; no 5s debounced index persistence (snapshot rebuilt synchronously) | Fix code: replace JSON snapshot with USearch HNSW index file (`embeddings.usearch`, cosine, M=16, efConstruction=128, efSearch=64), label→post_id mapping, 5s debounced save + force-save on project switch/shutdown |
| A1-14b | ~~USearch HNSW ANN index + debounced persistence not implemented~~ | embedding.allium config/FindSimilar/DebouncedPersistence | `Embeddings.Index` is now an HNSW (hnswlib) ANN index with debounced persistence | **Resolved:** rewrote `Embeddings.Index` as a DB-free GenServer wrapping an hnswlib HNSW graph (cosine, M=16, efConstruction=128, efSearch=64) — O(n·log n) build, O(log n) queries, replacing the O(n²) JSON cosine snapshot; per-project in-memory index + `label→post_id` map; 5s debounced `save_index` + `.meta.json` sidecar, force-save on project switch (`set_active_project`) and shutdown (`terminate`), `forget/1` on project delete; lazy reload from disk with rebuild-from-DB self-heal on miss; `find_similar`/`find_duplicates`/`compute_similarities` rewired (no brute-force fallback); USearch has no Elixir binding so hnswlib provides the identical HNSW algorithm/params (spec reconciled); supervision + dialyzer PLT updated; tests updated for debounced/binary persistence + self-heal. |
| A1-14c | Embedding model runs on CPU only; no Apple GPU acceleration | embedding.allium invariant NativeAcceleratedExecution | `Backends.Neural` uses Bumblebee+EXLA; on Apple Silicon XLA has no Metal backend so inference is native CPU (batched). Apple GPU/Neural Engine unused | Fix code: spike an EMLX (Apple MLX) Nx backend so the model executes on the Apple Silicon GPU; gate by platform/availability with EXLA-CPU fallback; verify Bumblebee serving + defn compiler compatibility and benchmark vs CPU batching |
| A1-15 | ~~Preview vs generation content source strategy undocumented~~ | preview.allium (no invariant), generation.allium (no invariant) | Generation uses only published .md file content (`Generation.Data` snapshots set `content: nil`); preview includes published+draft posts and prefers DB content over file (`Preview.Router` queries `:published`/`:draft`, uses `editor_body`) | **Resolved:** added `PreviewDraftOverlay` invariant to preview.allium and `GenerationPublishedOnly` invariant to generation.allium; both cross-reference each other; code already correct, 3 tests added for draft-in-preview behavior |
@@ -186,7 +186,7 @@ All reconciled to follow code. Specs must be self-consistent and match code.
## Priority Order for Resolution
1. **A1-1 through A1-14c** — code must follow spec (includes auto-save, on-demand preview, template lookup, validation gates, real Pagefind, graceful shutdown, real embedding model; A1-14b = USearch HNSW index and A1-14c = Apple GPU/EMLX acceleration still open)
1. **A1-1 through A1-14c** — code must follow spec (includes auto-save, on-demand preview, template lookup, validation gates, real Pagefind, graceful shutdown, real embedding model, HNSW ANN index; only A1-14c = Apple GPU/EMLX acceleration still open)
2. **D1-1 through D1-18** — untested invariants/guarantees
3. **C-1 through C-3** — internal spec inconsistencies (reconcile to code)
4. **B1-1 through B1-6** — major code behaviors missing from spec

View File

@@ -37,7 +37,8 @@ defmodule BDS.Application do
{Task.Supervisor, name: BDS.TCP.TaskSupervisor},
BDS.Scripting.JobStore,
{Task.Supervisor, name: BDS.Scripting.TaskSupervisor},
BDS.Scripting.JobSupervisor
BDS.Scripting.JobSupervisor,
BDS.Embeddings.Index
] ++ embedding_children() ++ desktop_children(current_env())
opts = [strategy: :one_for_one, name: BDS.Supervisor]

View File

@@ -166,11 +166,9 @@ defmodule BDS.Embeddings do
case Repo.get_by(Key, post_id: post.id, project_id: post.project_id) do
%Key{content_hash: ^content_hash} ->
if Keyword.get(opts, :refresh_index, true) and
snapshot_content_hash(post.project_id, post.id) != content_hash do
:ok = rebuild_snapshot(post.project_id)
end
# Embedding is already current. The HNSW index self-heals on query
# (find_similar/find_duplicates rebuild when no index is loaded), so
# there is nothing to refresh here.
:ok
existing_key ->
@@ -361,28 +359,28 @@ defmodule BDS.Embeddings do
{:error, :not_found} ->
{:ok, []}
{:ok, post, source_vector} ->
similar =
case Index.neighbors(post.project_id, post.id, limit) do
{:ok, _post, nil} ->
{:ok, []}
{:ok, post, %Key{} = key} ->
{:ok, query_similar(post.project_id, key, limit)}
end
end
# Queries the HNSW index for a post's neighbours, rebuilding the index from
# the DB vectors if it is not currently loaded (e.g. after a restart).
defp query_similar(project_id, %Key{} = key, limit) do
case Index.neighbors(project_id, key.label, key.vector, limit) do
{:ok, neighbors} ->
neighbors
{:error, :missing} ->
Repo.all(
from key in Key,
where: key.project_id == ^post.project_id and key.post_id != ^post.id
)
|> Enum.map(fn key ->
%{
post_id: key.post_id,
score: cosine_similarity(source_vector, decode_vector(key.vector))
}
end)
|> Enum.sort_by(& &1.score, :desc)
|> Enum.take(max(limit, 0))
end
:ok = rebuild_snapshot(project_id)
{:ok, similar}
case Index.neighbors(project_id, key.label, key.vector, limit) do
{:ok, neighbors} -> neighbors
{:error, :missing} -> []
end
end
end
@@ -395,8 +393,12 @@ defmodule BDS.Embeddings do
{:error, :not_found} ->
{:ok, %{}}
{:ok, post, source_vector} ->
{:ok, _post, nil} ->
{:ok, %{}}
{:ok, post, %Key{} = source_key} ->
target_ids = Enum.uniq(target_post_ids)
source_vector = decode_vector(source_key.vector)
scores =
Repo.all(
@@ -452,47 +454,19 @@ defmodule BDS.Embeddings do
if enabled_for_project?(project_id) do
on_progress = progress_callback(opts)
dismissed = dismissed_pair_keys(project_id)
entries = load_index_entries(project_id)
pairs =
case duplicate_pairs_with_rebuild(project_id, entries, on_progress) do
{:ok, pairs} -> pairs
{:error, :missing} -> []
end
duplicates =
case Index.duplicate_pairs(project_id, @duplicate_threshold, on_progress: on_progress) do
{:ok, pairs} ->
pairs
|> Enum.reject(fn pair -> pair_key(pair.post_id_a, pair.post_id_b) in dismissed end)
|> enrich_duplicate_pairs(project_id)
{:error, :missing} ->
keys =
Repo.all(
from key in Key,
where: key.project_id == ^project_id,
order_by: [asc: key.post_id]
)
total_keys = length(keys)
:ok = report_rebuild_started(on_progress, total_keys, "embedding entries")
keys
|> Enum.with_index(1)
|> Enum.flat_map(fn {left, index} ->
:ok = report_rebuild_progress(on_progress, index, total_keys, "embedding entries")
for right <- keys,
left.post_id < right.post_id,
pair_key(left.post_id, right.post_id) not in dismissed,
similarity =
cosine_similarity(decode_vector(left.vector), decode_vector(right.vector)),
similarity >= @duplicate_threshold do
%{
post_id_a: left.post_id,
post_id_b: right.post_id,
score: similarity
}
end
end)
|> enrich_duplicate_pairs(project_id)
end
:ok = report_rebuild_phase(on_progress, 0.99, "Resolving duplicate candidates")
{:ok, duplicates}
else
@@ -555,17 +529,33 @@ defmodule BDS.Embeddings do
with {:ok, post} <- fetch_post(post_id) do
if enabled_for_project?(post.project_id) do
:ok = ensure_key(post)
case Repo.get_by(Key, post_id: post.id, project_id: post.project_id) do
nil -> {:ok, post, []}
key -> {:ok, post, decode_vector(key.vector)}
end
{:ok, post, Repo.get_by(Key, post_id: post.id, project_id: post.project_id)}
else
{:disabled, post.project_id}
end
end
end
defp duplicate_pairs_with_rebuild(project_id, entries, on_progress) do
case Index.duplicate_pairs(project_id, entries, @duplicate_threshold, on_progress: on_progress) do
{:ok, pairs} ->
{:ok, pairs}
{:error, :missing} ->
:ok = rebuild_snapshot(project_id)
Index.duplicate_pairs(project_id, entries, @duplicate_threshold, on_progress: on_progress)
end
end
defp load_index_entries(project_id) do
Repo.all(
from key in Key,
where: key.project_id == ^project_id,
order_by: [asc: key.post_id]
)
|> Enum.map(fn key -> %{label: key.label, post_id: key.post_id, vector: key.vector} end)
end
defp ensure_key(%Post{} = post) do
case Repo.get_by(Key, post_id: post.id, project_id: post.project_id) do
nil -> sync_post(post)
@@ -704,7 +694,7 @@ defmodule BDS.Embeddings do
end
defp rebuild_snapshot(project_id) do
Index.rebuild(project_id, model_id: model_id(), dimensions: dimensions())
Index.put(project_id, dimensions(), load_index_entries(project_id))
end
defp progress_callback(opts), do: ProgressReporter.callback(opts)
@@ -729,13 +719,6 @@ defmodule BDS.Embeddings do
defp report_rebuild_phase(callback, value, label),
do: ProgressReporter.report_phase(callback, value, label)
defp snapshot_content_hash(project_id, post_id) do
case Index.read(project_id) do
{:ok, snapshot} -> get_in(snapshot, ["entries", post_id, "content_hash"])
_other -> nil
end
end
defp current_embedding_status(nil, _expected_hash), do: "missing"
defp current_embedding_status(%Key{vector: vector}, _expected_hash) when vector in [nil, ""],
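
The query-then-self-heal flow used by `query_similar/3` above (query, rebuild once on `{:error, :missing}`, retry once, otherwise return an empty list) can be sketched in isolation. This is a minimal illustration under stated assumptions, not the project's code: `SelfHealSketch`, the Agent-backed stand-in index, and `rebuild_fun` are hypothetical names introduced here.

```elixir
defmodule SelfHealSketch do
  # `index` is an Agent holding %{project_id => neighbours}. It is a
  # hypothetical stand-in for the real HNSW process.
  def query(index, project_id, rebuild_fun) do
    case lookup(index, project_id) do
      {:ok, neighbours} ->
        neighbours

      {:error, :missing} ->
        # Rebuild from the source of truth, then retry exactly once.
        :ok = Agent.update(index, &Map.put(&1, project_id, rebuild_fun.()))

        case lookup(index, project_id) do
          {:ok, neighbours} -> neighbours
          # A real index may still be unavailable (e.g. no vectors yet).
          {:error, :missing} -> []
        end
    end
  end

  defp lookup(index, project_id) do
    case Agent.get(index, &Map.fetch(&1, project_id)) do
      {:ok, neighbours} -> {:ok, neighbours}
      :error -> {:error, :missing}
    end
  end
end
```

Retrying only once keeps a permanently empty project from looping: a second miss is reported as "no neighbours" rather than triggering another rebuild.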

View File

@@ -1,220 +1,342 @@
defmodule BDS.Embeddings.Index do
@moduledoc false
@moduledoc """
Per-project approximate-nearest-neighbour index over post embeddings.
import Ecto.Query
Backed by an HNSW graph (hnswlib) per the A1-14b / `specs/embedding.allium`
requirement — cosine space, connectivity M=16, efConstruction=128,
efSearch=64. This replaces the previous O(n²) brute-force cosine snapshot:
building is roughly O(n·log n) and queries are roughly O(log n). Results are
approximate nearest neighbours, not an exact ranking.
The process is intentionally **database-free**: callers (running in their own
process, e.g. under the test SQL sandbox) read embedding vectors from the DB
and hand them in. This GenServer owns only the in-memory HNSW graphs, the
`label → post_id` maps, and file persistence.
Persistence (DebouncedPersistence invariant): the index file
(`embeddings.usearch`) plus a small sidecar holding the dimension and the
label→post_id map are written behind a 5s debounce, and force-saved on
project switch / shutdown. On a cold query the index is lazily reloaded from
those files; if they are absent the caller rebuilds from the DB vectors.
"""
use GenServer
alias BDS.Persistence
alias BDS.Embeddings.Key
alias BDS.Projects
alias BDS.ProgressReporter
alias BDS.Repo
@neighbor_limit 21
@debounce_ms 5_000
@space :cosine
@m 16
@ef_construction 128
@ef_search 64
# ─── Public API ─────────────────────────────────────────────
def start_link(opts \\ []) do
GenServer.start_link(__MODULE__, opts, name: __MODULE__)
end
@doc "On-disk path of the HNSW index file for a project."
def path(project_id) when is_binary(project_id) do
Path.join(Projects.project_cache_dir(project_id), "embeddings.usearch")
end
def rebuild(project_id, opts) when is_binary(project_id) and is_list(opts) do
model_id = Keyword.fetch!(opts, :model_id)
dimensions = Keyword.fetch!(opts, :dimensions)
@doc """
(Re)builds the index for a project from the given entries and schedules a
debounced save. `entries` is a list of `%{label:, post_id:, vector:}` where
`vector` is the packed little-endian Float32 BLOB.
"""
def put(project_id, dimensions, entries)
when is_binary(project_id) and is_integer(dimensions) and is_list(entries) do
GenServer.call(__MODULE__, {:put, project_id, dimensions, entries}, :infinity)
end
keys =
Repo.all(
from key in Key,
where: key.project_id == ^project_id,
order_by: [asc: key.post_id]
@doc """
Returns `{:ok, neighbours}` with up to `limit` nearest neighbours of
`query_vector` (the post's packed BLOB), excluding `query_label`. Each
neighbour is a `%{post_id:, score:}` map. Returns `{:error, :missing}` if no
index is available.
"""
def neighbors(project_id, query_label, query_vector, limit)
when is_binary(project_id) and is_integer(query_label) and is_binary(query_vector) do
GenServer.call(__MODULE__, {:neighbors, project_id, query_label, query_vector, limit}, :infinity)
end
@doc """
Finds near-duplicate pairs at/above `threshold` by querying the HNSW graph for
each entry's nearest neighbours (capped at `@neighbor_limit` per entry).
Returns `{:ok, pairs}`, or `{:error, :missing}` if no index is available.
"""
def duplicate_pairs(project_id, entries, threshold, opts \\ [])
when is_binary(project_id) and is_list(entries) and is_number(threshold) do
GenServer.call(
__MODULE__,
{:duplicate_pairs, project_id, entries, threshold, opts},
:infinity
)
end
entries =
keys
|> Enum.map(fn key ->
vector = decode_vector(key.vector)
@doc "Forces a pending save for a project to disk now (e.g. on project switch)."
def flush(project_id) when is_binary(project_id) do
GenServer.call(__MODULE__, {:flush, project_id}, :infinity)
end
@doc "Forces all pending saves to disk now (e.g. on shutdown)."
def flush_all do
GenServer.call(__MODULE__, :flush_all, :infinity)
end
@doc "Drops the in-memory index for a project (e.g. on project deletion)."
def forget(project_id) when is_binary(project_id) do
GenServer.call(__MODULE__, {:forget, project_id}, :infinity)
end
# ─── GenServer ──────────────────────────────────────────────
@impl true
def init(_opts) do
Process.flag(:trap_exit, true)
{:ok, %{}}
end
@impl true
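def handle_call({:put, project_id, dimensions, entries}, _from, state) do
# Carry over any pending save timer so schedule_save/2 can cancel it.
# Otherwise each put inside the debounce window leaves an orphaned timer
# that still fires and persists the index, defeating the debounce.
previous_timer = state |> Map.get(project_id, %{}) |> Map.get(:timer)
entry = %{build_entry(dimensions, entries) | timer: previous_timer}
state = state |> Map.put(project_id, entry) |> schedule_save(project_id)
{:reply, :ok, state}
end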
def handle_call({:neighbors, project_id, query_label, query_vector, limit}, _from, state) do
case ensure_loaded(state, project_id) do
{:ok, %{index: nil}, state} ->
{:reply, {:error, :missing}, state}
{:ok, entry, state} ->
{:reply, {:ok, query_neighbors(entry, query_label, query_vector, limit)}, state}
{:missing, state} ->
{:reply, {:error, :missing}, state}
end
end
def handle_call({:duplicate_pairs, project_id, entries, threshold, opts}, _from, state) do
case ensure_loaded(state, project_id) do
{:ok, %{index: nil}, state} ->
{:reply, {:error, :missing}, state}
{:ok, entry, state} ->
{:reply, {:ok, scan_duplicates(entry, entries, threshold, opts)}, state}
{:missing, state} ->
{:reply, {:error, :missing}, state}
end
end
def handle_call({:flush, project_id}, _from, state) do
{:reply, :ok, save_now(state, project_id)}
end
def handle_call(:flush_all, _from, state) do
state = Enum.reduce(Map.keys(state), state, &save_now(&2, &1))
{:reply, :ok, state}
end
def handle_call({:forget, project_id}, _from, state) do
case Map.get(state, project_id) do
%{timer: timer} when is_reference(timer) -> Process.cancel_timer(timer)
_other -> :ok
end
{:reply, :ok, Map.delete(state, project_id)}
end
@impl true
def handle_info({:save, project_id}, state) do
{:noreply, save_now(state, project_id)}
end
def handle_info(_message, state), do: {:noreply, state}
@impl true
def terminate(_reason, state) do
Enum.each(Map.keys(state), &save_now(state, &1))
:ok
end
# ─── Build / query ──────────────────────────────────────────
defp build_entry(dimensions, []), do: %{index: nil, labels: %{}, dim: dimensions, timer: nil}
defp build_entry(dimensions, entries) do
count = length(entries)
{:ok, index} = HNSWLib.Index.new(@space, dimensions, count, m: @m, ef_construction: @ef_construction)
:ok = HNSWLib.Index.set_ef(index, @ef_search)
tensor =
entries
|> Enum.map(& &1.vector)
|> IO.iodata_to_binary()
|> Nx.from_binary(:f32)
|> Nx.reshape({count, dimensions})
:ok = HNSWLib.Index.add_items(index, tensor, ids: Enum.map(entries, & &1.label))
{key.post_id,
%{
"label" => key.label,
"content_hash" => key.content_hash,
"neighbors" => neighbor_entries(keys, key, vector)
}}
end)
|> Map.new()
payload = %{
"project_id" => project_id,
"model_id" => model_id,
"dimensions" => dimensions,
"updated_at" => Persistence.now_ms(),
"entries" => entries
index: index,
labels: Map.new(entries, &{&1.label, &1.post_id}),
dim: dimensions,
timer: nil
}
write_snapshot(path(project_id), payload, project_id)
end
def read(project_id) when is_binary(project_id) do
project_id
|> candidate_paths()
|> read_snapshot_paths()
end
defp query_neighbors(%{index: index, labels: labels}, query_label, query_vector, limit) do
case query(index, query_vector, limit + 1) do
[] ->
[]
def neighbors(project_id, post_id, limit) when is_binary(project_id) and is_binary(post_id) do
with {:ok, snapshot} <- read(project_id),
%{} = entry <- get_in(snapshot, ["entries", post_id]) do
entry
|> Map.get("neighbors", [])
results ->
results
|> Enum.reject(fn {label, _score} -> label == query_label end)
|> Enum.map(fn {label, score} -> %{post_id: Map.get(labels, label), score: score} end)
|> Enum.reject(&is_nil(&1.post_id))
|> Enum.take(max(limit, 0))
|> Enum.map(fn neighbor ->
%{
post_id: neighbor["post_id"],
score: neighbor["score"]
}
end)
|> then(&{:ok, &1})
else
_ -> {:error, :missing}
end
end
def duplicate_pairs(project_id, threshold, opts \\ []) when is_binary(project_id) do
with {:ok, snapshot} <- read(project_id) do
entries = Map.get(snapshot, "entries", %{})
entry_count = map_size(entries)
on_progress = progress_callback(opts)
defp scan_duplicates(%{index: index, labels: labels}, entries, threshold, opts) do
on_progress = ProgressReporter.callback(opts)
total = length(entries)
:ok = report_scan_started(on_progress, total, "embedding entries")
:ok = report_scan_started(on_progress, entry_count, "embedding entries")
pairs =
entries
|> Enum.with_index(1)
|> Enum.flat_map(fn {{post_id, entry}, index} ->
:ok = report_scan_progress(on_progress, index, entry_count, "embedding entries")
|> Enum.flat_map(fn {entry, position} ->
:ok = report_scan_progress(on_progress, position, total, "embedding entries")
entry
|> Map.get("neighbors", [])
|> Enum.filter(&(&1["score"] >= threshold))
|> Enum.map(fn neighbor ->
{post_id_a, post_id_b} = sort_pair(post_id, neighbor["post_id"])
{{post_id_a, post_id_b},
%{
post_id_a: post_id_a,
post_id_b: post_id_b,
score: neighbor["score"]
}}
index
|> query(entry.vector, @neighbor_limit)
|> Enum.reject(fn {label, _score} -> label == entry.label end)
|> Enum.map(fn {label, score} -> {Map.get(labels, label), score} end)
|> Enum.filter(fn {post_id, score} -> not is_nil(post_id) and score >= threshold end)
|> Enum.map(fn {other_post_id, score} ->
{post_id_a, post_id_b} = sort_pair(entry.post_id, other_post_id)
{{post_id_a, post_id_b}, %{post_id_a: post_id_a, post_id_b: post_id_b, score: score}}
end)
end)
|> Map.new()
|> Map.values()
|> Enum.sort_by(& &1.score, :desc)
end
{:ok, pairs}
else
_ -> {:error, :missing}
# Runs a knn query and returns [{label, similarity}] sorted by descending
# similarity. Cosine distance is converted to similarity as max(0, 1 - d).
defp query(index, query_vector, k) do
case HNSWLib.Index.get_current_count(index) do
{:ok, count} when count > 0 ->
clamped = min(k, count)
case HNSWLib.Index.knn_query(index, query_vector, k: clamped) do
{:ok, labels, distances} ->
Enum.zip(
Nx.to_flat_list(labels),
Enum.map(Nx.to_flat_list(distances), fn distance -> max(0.0, 1.0 - distance) end)
)
{:error, _reason} ->
[]
end
_other ->
[]
end
end
defp neighbor_entries(keys, current_key, current_vector) do
keys
|> Enum.reject(&(&1.post_id == current_key.post_id))
|> Enum.map(fn other_key ->
%{
"post_id" => other_key.post_id,
"label" => other_key.label,
"score" => cosine_similarity(current_vector, decode_vector(other_key.vector))
}
end)
|> Enum.sort_by(& &1["score"], :desc)
|> Enum.take(@neighbor_limit)
# ─── Persistence ────────────────────────────────────────────
defp schedule_save(state, project_id) do
entry = Map.fetch!(state, project_id)
if is_reference(entry.timer), do: Process.cancel_timer(entry.timer)
timer = Process.send_after(self(), {:save, project_id}, @debounce_ms)
Map.put(state, project_id, %{entry | timer: timer})
end
defp write_snapshot(snapshot_path, payload, project_id) do
:ok = Persistence.atomic_write(snapshot_path, Jason.encode!(payload))
legacy_path = legacy_path(snapshot_path)
defp save_now(state, project_id) do
case Map.get(state, project_id) do
nil ->
state
if File.exists?(legacy_path) do
File.rm(legacy_path)
entry ->
if is_reference(entry.timer), do: Process.cancel_timer(entry.timer)
persist(project_id, entry)
Map.put(state, project_id, %{entry | timer: nil})
end
end
cleanup_legacy_project_snapshots(project_id, snapshot_path)
defp persist(_project_id, %{index: nil}), do: :ok
defp persist(project_id, %{index: index, labels: labels, dim: dim}) do
index_path = path(project_id)
File.mkdir_p!(Path.dirname(index_path))
HNSWLib.Index.save_index(index, index_path)
write_meta(index_path, dim, labels)
:ok
rescue
_exception -> :ok
end
defp candidate_paths(project_id) do
current_snapshot_path = path(project_id)
legacy_project_snapshot_path = legacy_project_snapshot_path(project_id)
defp write_meta(index_path, dim, labels) do
payload = %{
"dim" => dim,
"labels" => Enum.map(labels, fn {label, post_id} -> [label, post_id] end)
}
[
current_snapshot_path,
legacy_path(current_snapshot_path),
legacy_project_snapshot_path,
legacy_project_snapshot_path && legacy_path(legacy_project_snapshot_path)
]
|> Enum.filter(&is_binary/1)
|> Enum.uniq()
File.write(meta_path(index_path), Jason.encode!(payload))
end
defp read_snapshot_paths([]), do: {:error, :missing}
defp ensure_loaded(state, project_id) do
case Map.get(state, project_id) do
nil ->
case load_from_disk(project_id) do
{:ok, entry} -> {:ok, entry, Map.put(state, project_id, entry)}
:error -> {:missing, state}
end
defp read_snapshot_paths([snapshot_path | rest]) do
case File.read(snapshot_path) do
{:ok, contents} -> {:ok, Jason.decode!(contents)}
{:error, :enoent} -> read_snapshot_paths(rest)
{:error, reason} -> {:error, reason}
entry ->
{:ok, entry, state}
end
end
defp cleanup_legacy_project_snapshots(project_id, snapshot_path) do
current_paths = [snapshot_path, legacy_path(snapshot_path)]
defp load_from_disk(project_id) do
index_path = path(project_id)
project_id
|> legacy_project_snapshot_path()
|> then(fn legacy_snapshot_path ->
[legacy_snapshot_path, legacy_snapshot_path && legacy_path(legacy_snapshot_path)]
end)
|> Enum.filter(&is_binary/1)
|> Enum.reject(&(&1 in current_paths))
|> Enum.each(fn legacy_snapshot_path ->
if File.exists?(legacy_snapshot_path) do
File.rm(legacy_snapshot_path)
with {:ok, %{dim: dim, labels: labels}} <- read_meta(index_path),
true <- File.exists?(index_path),
{:ok, index} <- HNSWLib.Index.load_index(@space, dim, index_path) do
:ok = HNSWLib.Index.set_ef(index, @ef_search)
{:ok, %{index: index, labels: labels, dim: dim, timer: nil}}
else
_other -> :error
end
end)
rescue
_exception -> :error
end
defp legacy_project_snapshot_path(project_id) do
case Projects.get_project(project_id) do
nil -> nil
project -> Path.join(Projects.project_data_dir(project), "embeddings.usearch")
defp read_meta(index_path) do
with {:ok, contents} <- File.read(meta_path(index_path)),
{:ok, %{"dim" => dim, "labels" => labels}} <- Jason.decode(contents) do
{:ok,
%{
dim: dim,
labels: Map.new(labels, fn [label, post_id] -> {label, post_id} end)
}}
else
_other -> :error
end
end
defp legacy_path(snapshot_path) do
Path.join(Path.dirname(snapshot_path), "embeddings.index.json")
end
# Vectors are stored as a packed little-endian Float32 BLOB; see
# BDS.Embeddings and the VectorCacheInDb invariant in embedding.allium.
defp decode_vector(nil), do: []
defp decode_vector(<<>>), do: []
defp decode_vector(binary) when is_binary(binary) do
for <<value::float-32-little <- binary>>, do: value
end
defp cosine_similarity([], _other), do: 0.0
defp cosine_similarity(_vector, []), do: 0.0
defp cosine_similarity(left, right) do
Enum.zip(left, right)
|> Enum.reduce(0.0, fn {left_value, right_value}, acc -> acc + left_value * right_value end)
|> max(0.0)
end
defp meta_path(index_path), do: index_path <> ".meta.json"
defp sort_pair(post_id_a, post_id_b) when post_id_a <= post_id_b, do: {post_id_a, post_id_b}
defp sort_pair(post_id_a, post_id_b), do: {post_id_b, post_id_a}
defp progress_callback(opts), do: ProgressReporter.callback(opts)
defp report_scan_started(callback, total, label) do
ProgressReporter.report_count_started(callback, total, label,
verb: "Scanning",
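
The debounced persistence in `schedule_save/2` and `save_now/2` follows a common pattern: every change cancels the pending timer and starts a new one, so a burst of changes results in a single save. The sketch below shows that pattern alone, assuming a hypothetical `DebounceSketch` module that records "saves" in memory instead of writing index files. Unlike `save_now/2`, it only saves when a timer is pending.

```elixir
defmodule DebounceSketch do
  use GenServer

  # All names are hypothetical; `debounce_ms` plays the role of @debounce_ms.
  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  def touch(pid, key), do: GenServer.call(pid, {:touch, key})
  def flush(pid, key), do: GenServer.call(pid, {:flush, key})
  def saves(pid), do: GenServer.call(pid, :saves)

  @impl true
  def init(opts) do
    {:ok, %{debounce_ms: Keyword.get(opts, :debounce_ms, 5_000), timers: %{}, saves: []}}
  end

  @impl true
  def handle_call({:touch, key}, _from, state) do
    # Cancel the pending timer for this key before arming a new one.
    cancel(state.timers[key])
    timer = Process.send_after(self(), {:save, key}, state.debounce_ms)
    {:reply, :ok, put_in(state.timers[key], timer)}
  end

  def handle_call({:flush, key}, _from, state), do: {:reply, :ok, save(state, key)}
  def handle_call(:saves, _from, state), do: {:reply, Enum.reverse(state.saves), state}

  @impl true
  def handle_info({:save, key}, state), do: {:noreply, save(state, key)}

  # Records a save only if one is pending for `key`; a stale timer message
  # arriving after a flush is therefore a no-op.
  defp save(state, key) do
    case Map.pop(state.timers, key) do
      {nil, _timers} ->
        state

      {timer, timers} ->
        cancel(timer)
        %{state | timers: timers, saves: [key | state.saves]}
    end
  end

  defp cancel(timer) when is_reference(timer), do: Process.cancel_timer(timer)
  defp cancel(_other), do: :ok
end
```

For example, three `touch/2` calls in quick succession followed by a pause longer than `debounce_ms` produce one recorded save, while `flush/2` saves immediately and disarms the timer.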

View File

@@ -148,6 +148,9 @@ defmodule BDS.Projects do
project ->
now = Persistence.now_ms()
previous_active_id =
Repo.one(from p in Project, where: p.is_active == true, select: p.id)
Repo.transaction(fn ->
Repo.update_all(
from(p in Project, where: p.is_active == true),
@@ -159,8 +162,16 @@ defmodule BDS.Projects do
|> Repo.update!()
end)
|> case do
{:ok, active_project} -> {:ok, active_project}
{:error, reason} -> {:error, reason}
{:ok, active_project} ->
# Force-save the outgoing project's embedding index (DebouncedPersistence).
if is_binary(previous_active_id) and previous_active_id != active_project.id do
BDS.Embeddings.Index.flush(previous_active_id)
end
{:ok, active_project}
{:error, reason} ->
{:error, reason}
end
end
end
@@ -194,6 +205,8 @@ defmodule BDS.Projects do
end)
|> case do
{:ok, deleted_project} ->
BDS.Embeddings.Index.forget(deleted_project.id)
Enum.each(cleanup_dirs, fn dir ->
_ = File.rm_rf(dir)
end)

View File

@@ -37,6 +37,7 @@ defmodule BDS.MixProject do
{:nx, "~> 0.10"},
{:exla, "~> 0.10"},
{:bumblebee, "~> 0.6.3"},
{:hnswlib, "~> 0.1.7"},
{:stemex, "~> 0.2.1"},
{:gettext, "~> 0.24"},
{:tailwind, "~> 0.3", runtime: Mix.env() == :dev},
@@ -63,7 +64,7 @@ defmodule BDS.MixProject do
env = Mix.env()
[
plt_add_apps: [:mix, :inets, :ssl, :nx, :exla, :bumblebee],
plt_add_apps: [:mix, :inets, :ssl, :nx, :exla, :bumblebee, :hnswlib],
paths: ["_build/#{env}/lib/bds/ebin"]
]
end

View File

@@ -28,6 +28,7 @@
"exqlite": {:hex, :exqlite, "0.36.0", "07b4f95d61cb82b8d52946d0639497fa7d32117e09b2c8d25e24a38723c295cb", [:make, :mix], [{:cc_precompiler, "~> 0.1", [hex: :cc_precompiler, repo: "hexpm", optional: false]}, {:db_connection, "~> 2.1", [hex: :db_connection, repo: "hexpm", optional: false]}, {:elixir_make, "~> 0.8", [hex: :elixir_make, repo: "hexpm", optional: false]}, {:table, "~> 0.1.0", [hex: :table, repo: "hexpm", optional: true]}], "hexpm", "cbeca3ce781f9ff07cfa9a87486f3ebd512a143ad6a14ed5c9fca21fe0bf3ae7"},
"fine": {:hex, :fine, "0.1.6", "4bf7151493443c454aac9f2fa2f34f5fefd0346a83fb5586a016c4a135c63247", [:mix], [], "hexpm", "5638eb4495488e885ebec167fa57973e5c35e1a50c344eb7666c90ec1c4e3b12"},
"gettext": {:hex, :gettext, "0.26.2", "5978aa7b21fada6deabf1f6341ddba50bc69c999e812211903b169799208f2a8", [:mix], [{:expo, "~> 0.5.1 or ~> 1.0", [hex: :expo, repo: "hexpm", optional: false]}], "hexpm", "aa978504bcf76511efdc22d580ba08e2279caab1066b76bb9aa81c4a1e0a32a5"},
"hnswlib": {:hex, :hnswlib, "0.1.7", "784afdbfbc9af53e64d4b6da3f685c07039e472636a98fa954ffae5292ad6cc4", [:make, :mix], [{:cc_precompiler, "~> 0.1.0", [hex: :cc_precompiler, repo: "hexpm", optional: false]}, {:elixir_make, "~> 0.8", [hex: :elixir_make, repo: "hexpm", optional: false]}, {:nx, "~> 0.5", [hex: :nx, repo: "hexpm", optional: false]}], "hexpm", "fb43bb675facc8bb1ef0f4f8fec92479fc23317ed0f35c7160b2f95aff3e4742"},
"hpax": {:hex, :hpax, "1.0.3", "ed67ef51ad4df91e75cc6a1494f851850c0bd98ebc0be6e81b026e765ee535aa", [:mix], [], "hexpm", "8eab6e1cfa8d5918c2ce4ba43588e894af35dbd8e91e6e55c817bca5847df34a"},
"html_entities": {:hex, :html_entities, "0.5.2", "9e47e70598da7de2a9ff6af8758399251db6dbb7eebe2b013f2bbd2515895c3c", [:mix], [], "hexpm", "c53ba390403485615623b9531e97696f076ed415e8d8058b1dbaa28181f4fdcc"},
"html_sanitize_ex": {:hex, :html_sanitize_ex, "1.4.4", "271455b4d300d5d53a5d92b5bd1c00ad14c5abf1c9ff87be069af5736496515c", [:mix], [{:mochiweb, "~> 2.15 or ~> 3.1", [hex: :mochiweb, repo: "hexpm", optional: false]}], "hexpm", "12e1754204e7db5df1750df0a5dba1bbdf89260800019ab081f2b046596be56b"},

View File

@@ -63,7 +63,7 @@ value EmbeddingVector {
-- ─── Entities ───────────────────────────────────────────────
entity EmbeddingKey {
label: Integer -- HNSW label for USearch
label: Integer -- HNSW node label / id
post: post/Post
content_hash: String -- SHA-256 of "{title}\n\n{content}"
vector: EmbeddingVector
@@ -75,9 +75,11 @@ entity DismissedDuplicatePair {
-- IDs stored in canonical order (sorted) for dedup
}
-- ─── USearch HNSW Index ─────────────────────────────────────
-- ─── HNSW Index ─────────────────────────────────────────────
config {
-- HNSW approximate-nearest-neighbour index (hnswlib). USearch has no Elixir
-- binding; hnswlib provides the same HNSW algorithm and parameters.
model_id: String = "Xenova/multilingual-e5-small"
embedding_dimensions: Integer = 384
hnsw_metric: String = "cosine"
@@ -86,7 +88,8 @@ config {
hnsw_expansion_search: Integer = 64 -- efSearch
debounce_persist: Duration = 5.seconds
-- Index file: {userData}/projects/{projectId}/embeddings.usearch
-- Key mapping is persisted alongside the embedding records
-- Key mapping (label → post_id) persisted in a sidecar (.meta.json) next
-- to the index file, plus the source-of-truth rows in embedding_keys
batch_size: Integer = 16 -- texts per batched inference run
sequence_length: Integer = 256 -- max tokens per input (truncated)
}
@@ -112,7 +115,7 @@ rule EmbedPost {
let existing = EmbeddingKey{post: post}
if not exists existing or existing.content_hash != hash:
-- Compute embedding vector via local model
-- Upsert into USearch index + embedding_keys DB table
-- Upsert into HNSW index + embedding_keys DB table
-- Debounced index save (5s)
ensures: EmbeddingKeyUpdated(post)
}
@@ -151,9 +154,9 @@ rule IndexUnindexed {
rule FindSimilar {
when: FindSimilarRequested(post, limit)
requires: semantic_similarity_enabled
-- HNSW approximate nearest neighbor search via USearch
-- HNSW approximate nearest neighbor search (hnswlib)
-- Searches index for (limit + 1) neighbors, excludes self
-- Converts USearch cosine distance to similarity: max(0, 1 - distance)
-- Converts HNSW cosine distance to similarity: max(0, 1 - distance)
-- Returns ranked list sorted by descending similarity
ensures: SimilarPostsResult(post, ranked_matches)
}
@@ -162,7 +165,7 @@ rule ComputeSimilarities {
when: ComputeSimilaritiesRequested(source_post, target_post_ids)
requires: semantic_similarity_enabled
-- Exact pairwise cosine similarity between source vector and each target vector
-- Uses in-memory vector cache, NOT USearch search
-- Uses in-memory vector cache, NOT the HNSW index
-- Returns map of post_id -> similarity score
-- Used by InsertPostLinkModal to rank FTS search results
ensures: SimilarityScoresResult(source_post, scores)
@@ -207,7 +210,7 @@ invariant ContentHashSkipsUnchanged {
}
invariant DebouncedPersistence {
-- USearch index persistence is debounced at 5 seconds
-- HNSW index persistence is debounced at 5 seconds
-- Prevents excessive disk I/O during bulk operations
-- Index also force-saved on project switch and app shutdown
}
@@ -234,8 +237,9 @@ invariant NativeAcceleratedExecution {
-- inference pass and inputs are truncated to a bounded sequence_length, so
-- (re)indexing many posts is not serialised one document at a time.
-- Current implementation: Bumblebee + EXLA, which is native CPU on Apple
-- Silicon (XLA has no Metal backend). Apple GPU acceleration via EMLX/MLX
-- is tracked as a follow-up (SPECGAPS A1-14c).
-- Silicon (XLA has no Metal backend); neighbour search is HNSW (hnswlib).
-- Apple GPU acceleration via EMLX/MLX is tracked as a follow-up
-- (SPECGAPS A1-14c).
}
invariant ModelCaching {
@@ -245,7 +249,7 @@ invariant ModelCaching {
}
invariant ProjectIsolation {
-- Each project has its own USearch index file and embedding_keys rows
-- Each project has its own HNSW index file and embedding_keys rows
-- On project switch: save current index, load new project's index
-- Model pipeline shared across projects (not reloaded)
}

View File

@@ -44,32 +44,32 @@ defmodule BDS.Desktop.AutomationTest do
assert :ok = Automation.click(session, "[data-testid='toggle-sidebar']")
snapshot = Automation.snapshot(session)
snapshot = await(session, &(&1.sidebar_visible == false))
assert snapshot.sidebar_visible == false
assert :ok = Automation.press(session, "Meta+B")
snapshot = Automation.snapshot(session)
snapshot = await(session, &(&1.sidebar_visible == true))
assert snapshot.sidebar_visible == true
assert :ok = Automation.press(session, "Meta+J")
snapshot = Automation.snapshot(session)
snapshot = await(session, &(&1.panel_visible == true))
assert snapshot.panel_visible == true
assert :ok = Automation.press(session, "Meta+J")
snapshot = Automation.snapshot(session)
snapshot = await(session, &(&1.panel_visible == false))
assert snapshot.panel_visible == false
assert :ok = Automation.press(session, "Meta+,")
snapshot = Automation.snapshot(session)
snapshot = await(session, &(&1.editor_title == "Settings"))
assert snapshot.editor_title == "Settings"
assert :ok = Automation.click(session, "[data-testid='toggle-assistant']")
snapshot = Automation.snapshot(session)
snapshot = await(session, &(&1.assistant_visible == true))
assert snapshot.assistant_visible == true
assert snapshot.panel_visible == false
@@ -92,7 +92,7 @@ defmodule BDS.Desktop.AutomationTest do
assert :ok = Automation.drag(session, "[data-resize='sidebar']", 90)
snapshot = Automation.snapshot(session)
snapshot = await(session, &(&1.sidebar_width >= 360 and &1.sidebar_width <= 380))
assert snapshot.sidebar_width >= 360
assert snapshot.sidebar_width <= 380
@@ -100,7 +100,7 @@ defmodule BDS.Desktop.AutomationTest do
assert :ok = Automation.reload(session)
snapshot = Automation.snapshot(session)
snapshot = await(session, &(&1.sidebar_visible == true and &1.sidebar_width >= resized_width - 2))
assert snapshot.sidebar_visible == true
assert snapshot.sidebar_width >= resized_width - 2
assert snapshot.sidebar_width <= resized_width + 2
@@ -140,12 +140,12 @@ defmodule BDS.Desktop.AutomationTest do
assert :ok = Automation.native_menu_action(session, "toggle_sidebar")
snapshot = Automation.snapshot(session)
snapshot = await(session, &(&1.sidebar_visible == false))
assert snapshot.sidebar_visible == false
assert :ok = Automation.native_menu_action(session, "edit_preferences")
snapshot = Automation.snapshot(session)
snapshot = await(session, &(&1.editor_title == "Settings"))
assert snapshot.editor_title == "Settings"
end
@@ -175,6 +175,24 @@ defmodule BDS.Desktop.AutomationTest do
end
end
# Polls snapshots until the predicate holds (or times out), returning the
# last snapshot. UI transitions after a keypress/click/menu action are
# asynchronous, so a single immediate snapshot can race under CPU load.
defp await(session, fun, timeout \\ 5_000)
defp await(session, _fun, timeout) when timeout <= 0, do: Automation.snapshot(session)
defp await(session, fun, timeout) do
snapshot = Automation.snapshot(session)
if fun.(snapshot) do
snapshot
else
Process.sleep(50)
await(session, fun, timeout - 50)
end
end
defp wait_until(fun, timeout \\ 5_000)
defp wait_until(fun, timeout) when timeout <= 0, do: fun.()
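
The `await/3` helper above subtracts a fixed 50 ms per iteration, so time spent inside `Automation.snapshot/1` is not counted against the timeout. If a strict wall-clock bound were wanted, one option is a deadline-based loop. The sketch below is an assumption-laden variant, not the commit's code: `PollSketch`, `probe`, and `pred` are hypothetical names, and `probe` stands in for taking a snapshot.

```elixir
defmodule PollSketch do
  # Hypothetical helper: calls `probe` until `pred` holds or the deadline
  # passes, returning the last observed value either way.
  def await(probe, pred, timeout_ms \\ 5_000, interval_ms \\ 50) do
    deadline = System.monotonic_time(:millisecond) + timeout_ms
    loop(probe, pred, deadline, interval_ms)
  end

  defp loop(probe, pred, deadline, interval_ms) do
    value = probe.()

    cond do
      pred.(value) ->
        value

      # Deadline reached: hand back the last value so the caller's own
      # assertions produce the failure message.
      System.monotonic_time(:millisecond) >= deadline ->
        value

      true ->
        Process.sleep(interval_ms)
        loop(probe, pred, deadline, interval_ms)
    end
  end
end
```

As in the original helper, returning the last value on timeout keeps the follow-up `assert` lines responsible for reporting what actually went wrong.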

View File

@@ -319,24 +319,28 @@ defmodule BDS.EmbeddingsTest do
assert {:ok, _indexed} = BDS.Embeddings.index_unindexed(project.id)
# Persistence is debounced (5s); force it to disk to assert the files.
:ok = BDS.Embeddings.Index.flush(project.id)
index_path = BDS.Embeddings.index_path(project.id)
assert File.exists?(index_path)
assert File.exists?(index_path <> ".meta.json")
refute String.starts_with?(index_path, BDS.Projects.project_data_dir(project))
cache_root = Application.fetch_env!(:bds, :project_cache_root) |> Path.expand()
assert index_path == Path.join([cache_root, "projects", project.id, "embeddings.usearch"])
snapshot = index_path |> File.read!() |> Jason.decode!()
assert snapshot["project_id"] == project.id
assert snapshot["model_id"] == "fake/multilingual-e5-small"
assert snapshot["dimensions"] == 384
assert snapshot["entries"][alpha.id]["label"] != nil
assert snapshot["entries"][alpha.id]["content_hash"] != nil
# The sidecar carries the dimension and the label→post_id mapping.
meta = (index_path <> ".meta.json") |> File.read!() |> Jason.decode!()
assert meta["dim"] == 384
post_ids = Enum.map(meta["labels"], fn [_label, post_id] -> post_id end)
assert alpha.id in post_ids
assert beta.id in post_ids
assert Enum.any?(snapshot["entries"][alpha.id]["neighbors"], fn neighbor ->
neighbor["post_id"] == beta.id
end)
# The HNSW index answers nearest-neighbour queries.
assert {:ok, [neighbor]} = BDS.Embeddings.find_similar(alpha.id, 1)
assert neighbor.post_id == beta.id
end
test "embedding index uses the app-internal persisted file name", %{project: project} do
@@ -443,43 +447,76 @@ defmodule BDS.EmbeddingsTest do
refreshed_key = BDS.Repo.get_by!(BDS.Embeddings.Key, project_id: project.id, post_id: post.id)
assert refreshed_key.content_hash == stale_key.content_hash
:ok = BDS.Embeddings.Index.flush(project.id)
assert File.exists?(BDS.Embeddings.index_path(project.id))
end
test "sync_post refreshes snapshot drift when the embedding hash is already current", %{
test "similarity queries keep working when sync_post finds the embedding already current", %{
project: project
} do
assert {:ok, _metadata} =
BDS.Metadata.update_project_metadata(project.id, %{semantic_similarity_enabled: true})
assert {:ok, post} =
assert {:ok, alpha} =
BDS.Posts.create_post(%{
project_id: project.id,
title: "Snapshot Repair",
title: "Alpha",
content: "space rocket orbit mission galaxy",
language: "en"
})
assert {:ok, post} = BDS.Posts.publish_post(post.id)
assert {:ok, beta} =
BDS.Posts.create_post(%{
project_id: project.id,
title: "Beta",
content: "rocket launch orbit mission station",
language: "en"
})
assert {:ok, alpha} = BDS.Posts.publish_post(alpha.id)
assert {:ok, _beta} = BDS.Posts.publish_post(beta.id)
assert {:ok, _indexed} = BDS.Embeddings.index_unindexed(project.id)
key = BDS.Repo.get_by!(BDS.Embeddings.Key, project_id: project.id, post_id: post.id)
index_path = BDS.Embeddings.index_path(project.id)
# Re-syncing with an unchanged content hash is a no-op for the index...
assert :ok = BDS.Embeddings.sync_post(alpha.id)
snapshot = index_path |> File.read!() |> Jason.decode!()
# ...and nearest-neighbour queries still resolve through the HNSW index.
assert {:ok, [neighbor]} = BDS.Embeddings.find_similar(alpha.id, 1)
assert neighbor.post_id == beta.id
end
drifted_snapshot =
put_in(snapshot, ["entries", post.id, "content_hash"], "stale-snapshot-hash")
test "find_similar rebuilds the HNSW index on demand when none is loaded", %{project: project} do
assert {:ok, _metadata} =
BDS.Metadata.update_project_metadata(project.id, %{semantic_similarity_enabled: true})
File.write!(index_path, Jason.encode!(drifted_snapshot))
assert {:ok, alpha} =
BDS.Posts.create_post(%{
project_id: project.id,
title: "Alpha",
content: "space rocket orbit mission galaxy",
language: "en"
})
refute Enum.any?(BDS.Embeddings.diff_reports(project.id), &(&1.entity_id == post.id))
assert {:ok, beta} =
BDS.Posts.create_post(%{
project_id: project.id,
title: "Beta",
content: "rocket launch orbit mission station",
language: "en"
})
assert :ok = BDS.Embeddings.sync_post(post.id)
assert {:ok, _alpha} = BDS.Posts.publish_post(alpha.id)
assert {:ok, _beta} = BDS.Posts.publish_post(beta.id)
assert {:ok, _indexed} = BDS.Embeddings.index_unindexed(project.id)
repaired_snapshot = index_path |> File.read!() |> Jason.decode!()
assert get_in(repaired_snapshot, ["entries", post.id, "content_hash"]) == key.content_hash
# Drop the in-memory index and remove the persisted files, then query: it
# must self-heal by rebuilding from the DB vectors.
:ok = BDS.Embeddings.Index.forget(project.id)
File.rm_rf!(BDS.Projects.project_cache_dir(project.id))
refute Enum.any?(BDS.Embeddings.diff_reports(project.id), &(&1.entity_id == post.id))
assert {:ok, similar} = BDS.Embeddings.find_similar(alpha.id, 1)
assert [%{post_id: post_id}] = similar
assert post_id == beta.id
end
end

View File

@@ -395,6 +395,8 @@ defmodule BDS.MaintenanceTest do
assert {:ok, _indexed} = BDS.Embeddings.index_unindexed(project.id)
index_path = BDS.Embeddings.index_path(project.id)
# Index persistence is debounced (5s); force it to assert the file.
:ok = BDS.Embeddings.Index.flush(project.id)
assert File.exists?(index_path)
Repo.delete_all(from key in BDS.Embeddings.Key, where: key.project_id == ^project.id)
@@ -416,6 +418,8 @@ defmodule BDS.MaintenanceTest do
assert post.id in rebuilt_post_ids
assert Repo.get_by(BDS.Embeddings.Key, project_id: project.id, post_id: post.id) != nil
:ok = BDS.Embeddings.Index.flush(project.id)
assert File.exists?(index_path)
end

View File

@@ -236,6 +236,9 @@ defmodule BDS.MetadataTest do
assert metadata.semantic_similarity_enabled == true
assert BDS.Repo.get_by(BDS.Embeddings.Key, project_id: project.id, post_id: post.id) != nil
# Index persistence is debounced (5s); force it to assert the file.
:ok = BDS.Embeddings.Index.flush(project.id)
assert File.exists?(BDS.Embeddings.index_path(project.id))
end
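
The tests above call `BDS.Embeddings.Index.flush/1` because persistence is debounced at 5 seconds. The commit's actual `Index` module is not shown in this section, so the following is only a sketch of how debounce-plus-flush can be built on a GenServer. `DebounceSketch`, `mark_dirty/1`, and the `save_fun` callback are hypothetical; `save_fun` stands in for writing the index and its sidecar to disk.

```elixir
defmodule DebounceSketch do
  use GenServer

  # Hypothetical API: save_fun is a zero-arity function that persists the index.
  def start_link(save_fun, delay_ms \\ 5_000),
    do: GenServer.start_link(__MODULE__, {save_fun, delay_ms})

  def mark_dirty(pid), do: GenServer.cast(pid, :dirty)
  def flush(pid), do: GenServer.call(pid, :flush)

  @impl true
  def init({save_fun, delay_ms}),
    do: {:ok, %{save: save_fun, delay: delay_ms, timer: nil, dirty: false}}

  @impl true
  def handle_cast(:dirty, state) do
    # Each change restarts the timer, so a burst of writes causes one save.
    cancel(state.timer)
    timer = Process.send_after(self(), :persist, state.delay)
    {:noreply, %{state | timer: timer, dirty: true}}
  end

  @impl true
  def handle_call(:flush, _from, state) do
    # Force the pending save now (tests, project switch, shutdown).
    cancel(state.timer)
    {:reply, :ok, persist(state)}
  end

  @impl true
  def handle_info(:persist, state), do: {:noreply, persist(state)}

  defp persist(%{dirty: true} = state) do
    state.save.()
    %{state | dirty: false, timer: nil}
  end

  defp persist(state), do: %{state | timer: nil}

  defp cancel(nil), do: :ok
  defp cancel(timer), do: Process.cancel_timer(timer)
end

# Ten rapid changes followed by a flush trigger a single save.
{:ok, pid} = DebounceSketch.start_link(fn -> IO.puts("saved") end, 100)
for _ <- 1..10, do: DebounceSketch.mark_dirty(pid)
:ok = DebounceSketch.flush(pid)
```

Tracking a `dirty` flag keeps `flush` idempotent: a stray `:persist` message that arrives after a flush finds nothing to save.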