diff --git a/SPECGAPS.md b/SPECGAPS.md index d6eb9f6..4a4ace5 100644 --- a/SPECGAPS.md +++ b/SPECGAPS.md @@ -24,7 +24,7 @@ Gap categories: **SC** = spec correct, fix code | **CS** = code correct, update | A1-12 | ~~Real Pagefind integration for search~~ | generation.allium:208 | Functional client-side search: `PagefindUI` defined in bundled `pagefind-ui.js`, fragment index records url/title/body-scoped text per page, search-runtime wires it up | **Resolved:** bundled real `PagefindUI` (fetch index, ranked full-text match, highlighted excerpts) + `pagefind-ui.css` as local assets read into `Pagefind`; index scoped to `data-pagefind-body` (unmarked pages excluded per PagefindHtmlMarking), title from ``/`<h1>`; localized "No results found" label via `data-search-no-results` (de/fr/it/es); 3 unit tests added | | A1-13 | ~~Git sidebar shows only "Working tree" placeholder~~ | sidebar_views.allium:651-770 | `git_view/1` now builds a full `layout: "git"` view from `BDS.Git` (repository/remote_state/status/history); `SidebarComponents` renders active + not_a_repo states | **Resolved:** `git_view/1` in sidebar.ex assembles branch/upstream/ahead/behind, status files, paginated history (20/page); `render_git_sidebar` renders branch header, sync legend, fetch/pull/push/prune-lfs buttons, commit form, clickable status files (open git_diff), history entries; shell_live wires `git_commit` (closes git_diff tabs), `git_fetch`/`git_pull`/`git_push`/`git_prune_lfs`, `git_initialize`; `BDS.Git.history` enriched with author/date, `BDS.Git.set_remote/2` added; i18n for de/fr/it/es; 3 shell tests + git author/date assertions added | | A1-14 | ~~Embedding uses TF-IDF hash projection instead of real neural model~~ | embedding.allium:44-53, invariants RealNeuralModel/ModelCaching/VectorCacheInDb | `Backends.Neural` runs `intfloat/multilingual-e5-small` (e5 weights behind the Xenova id) via Bumblebee+EXLA | **Resolved (core):** added bumblebee/nx/exla deps; `Backends.Neural` is a lazily-loaded GenServer that builds the Bumblebee text-embedding serving on first request (`"query: "` prefix + mean pooling + L2 norm), downloads+caches the model under the app data dir (ModelCaching), and is wired into the supervision tree when configured; vectors now persisted as packed little-endian Float32 BLOB (384×4=1536 bytes) instead of JSON text (VectorCacheInDb) with migration recreating `embedding_keys.vector` as BLOB; `InApp` demoted to documented offline/test stub; test config uses the stub so the suite stays offline; spec EmbeddingModel clarified (Xenova id ↔ intfloat weights via Bumblebee); batched inference via optional `embed_many/2` backend callback (configurable `batch_size`/`sequence_length`; rebuild/index/repair embed in chunks instead of one post at a time) + `NativeAcceleratedExecution` invariant added to spec; 4 tests added (BLOB round-trip, batched-rebuild, Neural model_info/behaviour). **Deferred:** A1-14b USearch HNSW index, A1-14c Apple GPU (EMLX). | -| A1-14b | USearch HNSW ANN index + debounced persistence not implemented | embedding.allium:75-87 (config), FindSimilar, invariant DebouncedPersistence | Neighbor lookup still uses the JSON cosine snapshot (`Embeddings.Index`), not a USearch HNSW index; no 5s debounced index persistence (snapshot rebuilt synchronously) | Fix code: replace JSON snapshot with USearch HNSW index file (`embeddings.usearch`, cosine, M=16, efConstruction=128, efSearch=64), label→post_id mapping, 5s debounced save + force-save on project switch/shutdown | +| A1-14b | ~~USearch HNSW ANN index + debounced persistence not implemented~~ | embedding.allium config/FindSimilar/DebouncedPersistence | `Embeddings.Index` is now an HNSW (hnswlib) ANN index with debounced persistence | **Resolved:** rewrote `Embeddings.Index` as a DB-free GenServer wrapping an hnswlib HNSW graph (cosine, M=16, efConstruction=128, efSearch=64) — O(n·log n) build, O(log n) queries, replacing the O(n²) JSON cosine snapshot; per-project in-memory index + `label→post_id` map; 5s debounced `save_index` + `.meta.json` sidecar, force-save on project switch (`set_active_project`) and shutdown (`terminate`), `forget/1` on project delete; lazy reload from disk with rebuild-from-DB self-heal on miss; `find_similar`/`find_duplicates`/`compute_similarities` rewired (no brute-force fallback); USearch has no Elixir binding so hnswlib provides the identical HNSW algorithm/params (spec reconciled); supervision + dialyzer PLT updated; tests updated for debounced/binary persistence + self-heal. | | A1-14c | Embedding model runs on CPU only; no Apple GPU acceleration | embedding.allium invariant NativeAcceleratedExecution | `Backends.Neural` uses Bumblebee+EXLA; on Apple Silicon XLA has no Metal backend so inference is native CPU (batched). Apple GPU/Neural Engine unused | Fix code: spike an EMLX (Apple MLX) Nx backend so the model executes on the Apple Silicon GPU; gate by platform/availability with EXLA-CPU fallback; verify Bumblebee serving + defn compiler compatibility and benchmark vs CPU batching | | A1-15 | ~~Preview vs generation content source strategy undocumented~~ | preview.allium (no invariant), generation.allium (no invariant) | Generation uses only published .md file content (`Generation.Data` snapshots set `content: nil`); preview includes published+draft posts and prefers DB content over file (`Preview.Router` queries `:published`/`:draft`, uses `editor_body`) | **Resolved:** added `PreviewDraftOverlay` invariant to preview.allium and `GenerationPublishedOnly` invariant to generation.allium; both cross-reference each other; code already correct, 3 tests added for draft-in-preview behavior | @@ -186,7 +186,7 @@ All reconciled to follow code. Specs must be self-consistent and match code. ## Priority Order for Resolution -1. **A1-1 through A1-14c** — code must follow spec (includes auto-save, on-demand preview, template lookup, validation gates, real Pagefind, graceful shutdown, real embedding model; A1-14b = USearch HNSW index and A1-14c = Apple GPU/EMLX acceleration still open) +1. **A1-1 through A1-14c** — code must follow spec (includes auto-save, on-demand preview, template lookup, validation gates, real Pagefind, graceful shutdown, real embedding model, HNSW ANN index; only A1-14c = Apple GPU/EMLX acceleration still open) 2. **D1-1 through D1-18** — untested invariants/guarantees 3. **C-1 through C-3** — internal spec inconsistencies (reconcile to code) 4. **B1-1 through B1-6** — major code behaviors missing from spec diff --git a/lib/bds/application.ex b/lib/bds/application.ex index 129c19f..4aee11a 100644 --- a/lib/bds/application.ex +++ b/lib/bds/application.ex @@ -37,7 +37,8 @@ defmodule BDS.Application do {Task.Supervisor, name: BDS.TCP.TaskSupervisor}, BDS.Scripting.JobStore, {Task.Supervisor, name: BDS.Scripting.TaskSupervisor}, - BDS.Scripting.JobSupervisor + BDS.Scripting.JobSupervisor, + BDS.Embeddings.Index ] ++ embedding_children() ++ desktop_children(current_env()) opts = [strategy: :one_for_one, name: BDS.Supervisor] diff --git a/lib/bds/embeddings.ex b/lib/bds/embeddings.ex index 0b0d546..c071d7a 100644 --- a/lib/bds/embeddings.ex +++ b/lib/bds/embeddings.ex @@ -166,11 +166,9 @@ defmodule BDS.Embeddings do case Repo.get_by(Key, post_id: post.id, project_id: post.project_id) do %Key{content_hash: ^content_hash} -> - if Keyword.get(opts, :refresh_index, true) and - snapshot_content_hash(post.project_id, post.id) != content_hash do - :ok = rebuild_snapshot(post.project_id) - end - + # Embedding is already current. The HNSW index self-heals on query + # (find_similar/find_duplicates rebuild when no index is loaded), so + # there is nothing to refresh here. :ok existing_key -> @@ -361,28 +359,28 @@ defmodule BDS.Embeddings do {:error, :not_found} -> {:ok, []} - {:ok, post, source_vector} -> - similar = - case Index.neighbors(post.project_id, post.id, limit) do - {:ok, neighbors} -> - neighbors + {:ok, _post, nil} -> + {:ok, []} - {:error, :missing} -> - Repo.all( - from key in Key, - where: key.project_id == ^post.project_id and key.post_id != ^post.id - ) - |> Enum.map(fn key -> - %{ - post_id: key.post_id, - score: cosine_similarity(source_vector, decode_vector(key.vector)) - } - end) - |> Enum.sort_by(& &1.score, :desc) - |> Enum.take(max(limit, 0)) - end + {:ok, post, %Key{} = key} -> + {:ok, query_similar(post.project_id, key, limit)} + end + end - {:ok, similar} + # Queries the HNSW index for a post's neighbours, rebuilding the index from + # the DB vectors if it is not currently loaded (e.g. after a restart). + defp query_similar(project_id, %Key{} = key, limit) do + case Index.neighbors(project_id, key.label, key.vector, limit) do + {:ok, neighbors} -> + neighbors + + {:error, :missing} -> + :ok = rebuild_snapshot(project_id) + + case Index.neighbors(project_id, key.label, key.vector, limit) do + {:ok, neighbors} -> neighbors + {:error, :missing} -> [] + end end end @@ -395,8 +393,12 @@ defmodule BDS.Embeddings do {:error, :not_found} -> {:ok, %{}} - {:ok, post, source_vector} -> + {:ok, _post, nil} -> + {:ok, %{}} + + {:ok, post, %Key{} = source_key} -> target_ids = Enum.uniq(target_post_ids) + source_vector = decode_vector(source_key.vector) scores = Repo.all( @@ -452,46 +454,18 @@ defmodule BDS.Embeddings do if enabled_for_project?(project_id) do on_progress = progress_callback(opts) dismissed = dismissed_pair_keys(project_id) + entries = load_index_entries(project_id) + + pairs = + case duplicate_pairs_with_rebuild(project_id, entries, on_progress) do + {:ok, pairs} -> pairs + {:error, :missing} -> [] + end duplicates = - case Index.duplicate_pairs(project_id, @duplicate_threshold, on_progress: on_progress) do - {:ok, pairs} -> - pairs - |> Enum.reject(fn pair -> pair_key(pair.post_id_a, pair.post_id_b) in dismissed end) - |> enrich_duplicate_pairs(project_id) - - {:error, :missing} -> - keys = - Repo.all( - from key in Key, - where: key.project_id == ^project_id, - order_by: [asc: key.post_id] - ) - - total_keys = length(keys) - - :ok = report_rebuild_started(on_progress, total_keys, "embedding entries") - - keys - |> Enum.with_index(1) - |> Enum.flat_map(fn {left, index} -> - :ok = report_rebuild_progress(on_progress, index, total_keys, "embedding entries") - - for right <- keys, - left.post_id < right.post_id, - pair_key(left.post_id, right.post_id) not in dismissed, - similarity = - cosine_similarity(decode_vector(left.vector), decode_vector(right.vector)), - similarity >= @duplicate_threshold do - %{ - post_id_a: left.post_id, - post_id_b: right.post_id, - score: similarity - } - end - end) - |> enrich_duplicate_pairs(project_id) - end + pairs + |> Enum.reject(fn pair -> pair_key(pair.post_id_a, pair.post_id_b) in dismissed end) + |> enrich_duplicate_pairs(project_id) :ok = report_rebuild_phase(on_progress, 0.99, "Resolving duplicate candidates") {:ok, duplicates} @@ -555,17 +529,33 @@ defmodule BDS.Embeddings do with {:ok, post} <- fetch_post(post_id) do if enabled_for_project?(post.project_id) do :ok = ensure_key(post) - - case Repo.get_by(Key, post_id: post.id, project_id: post.project_id) do - nil -> {:ok, post, []} - key -> {:ok, post, decode_vector(key.vector)} - end + {:ok, post, Repo.get_by(Key, post_id: post.id, project_id: post.project_id)} else {:disabled, post.project_id} end end end + defp duplicate_pairs_with_rebuild(project_id, entries, on_progress) do + case Index.duplicate_pairs(project_id, entries, @duplicate_threshold, on_progress: on_progress) do + {:ok, pairs} -> + {:ok, pairs} + + {:error, :missing} -> + :ok = rebuild_snapshot(project_id) + Index.duplicate_pairs(project_id, entries, @duplicate_threshold, on_progress: on_progress) + end + end + + defp load_index_entries(project_id) do + Repo.all( + from key in Key, + where: key.project_id == ^project_id, + order_by: [asc: key.post_id] + ) + |> Enum.map(fn key -> %{label: key.label, post_id: key.post_id, vector: key.vector} end) + end + defp ensure_key(%Post{} = post) do case Repo.get_by(Key, post_id: post.id, project_id: post.project_id) do nil -> sync_post(post) @@ -704,7 +694,7 @@ defmodule BDS.Embeddings do end defp rebuild_snapshot(project_id) do - Index.rebuild(project_id, model_id: model_id(), dimensions: dimensions()) + Index.put(project_id, dimensions(), load_index_entries(project_id)) end defp progress_callback(opts), do: ProgressReporter.callback(opts) @@ -729,13 +719,6 @@ defmodule BDS.Embeddings do defp report_rebuild_phase(callback, value, label), do: ProgressReporter.report_phase(callback, value, label) - defp snapshot_content_hash(project_id, post_id) do - case Index.read(project_id) do - {:ok, snapshot} -> get_in(snapshot, ["entries", post_id, "content_hash"]) - _other -> nil - end - end - defp current_embedding_status(nil, _expected_hash), do: "missing" defp current_embedding_status(%Key{vector: vector}, _expected_hash) when vector in [nil, ""], diff --git a/lib/bds/embeddings/index.ex b/lib/bds/embeddings/index.ex index 83447ee..1a58c63 100644 --- a/lib/bds/embeddings/index.ex +++ b/lib/bds/embeddings/index.ex @@ -1,220 +1,342 @@ defmodule BDS.Embeddings.Index do - @moduledoc false + @moduledoc """ + Per-project approximate-nearest-neighbour index over post embeddings. - import Ecto.Query + Backed by an HNSW graph (hnswlib) per the A1-14b / `specs/embedding.allium` + requirement — cosine space, connectivity M=16, efConstruction=128, + efSearch=64. This replaces the previous O(n²) brute-force cosine snapshot: + building is O(n·log n) and queries are O(log n). + + The process is intentionally **database-free**: callers (running in their own + process, e.g. under the test SQL sandbox) read embedding vectors from the DB + and hand them in. This GenServer owns only the in-memory HNSW graphs, the + `label → post_id` maps, and file persistence. + + Persistence (DebouncedPersistence invariant): the index file + (`embeddings.usearch`) plus a small sidecar holding the dimension and the + label→post_id map are written behind a 5s debounce, and force-saved on + project switch / shutdown. On a cold query the index is lazily reloaded from + those files; if they are absent the caller rebuilds from the DB vectors. + """ + + use GenServer - alias BDS.Persistence - alias BDS.Embeddings.Key alias BDS.Projects alias BDS.ProgressReporter - alias BDS.Repo @neighbor_limit 21 + @debounce_ms 5_000 + @space :cosine + @m 16 + @ef_construction 128 + @ef_search 64 + # ─── Public API ───────────────────────────────────────────── + + def start_link(opts \\ []) do + GenServer.start_link(__MODULE__, opts, name: __MODULE__) + end + + @doc "On-disk path of the HNSW index file for a project." def path(project_id) when is_binary(project_id) do Path.join(Projects.project_cache_dir(project_id), "embeddings.usearch") end - def rebuild(project_id, opts) when is_binary(project_id) and is_list(opts) do - model_id = Keyword.fetch!(opts, :model_id) - dimensions = Keyword.fetch!(opts, :dimensions) - - keys = - Repo.all( - from key in Key, - where: key.project_id == ^project_id, - order_by: [asc: key.post_id] - ) - - entries = - keys - |> Enum.map(fn key -> - vector = decode_vector(key.vector) - - {key.post_id, - %{ - "label" => key.label, - "content_hash" => key.content_hash, - "neighbors" => neighbor_entries(keys, key, vector) - }} - end) - |> Map.new() - - payload = %{ - "project_id" => project_id, - "model_id" => model_id, - "dimensions" => dimensions, - "updated_at" => Persistence.now_ms(), - "entries" => entries - } - - write_snapshot(path(project_id), payload, project_id) + @doc """ + (Re)builds the index for a project from the given entries and schedules a + debounced save. `entries` is a list of `%{label:, post_id:, vector:}` where + `vector` is the packed little-endian Float32 BLOB. + """ + def put(project_id, dimensions, entries) + when is_binary(project_id) and is_integer(dimensions) and is_list(entries) do + GenServer.call(__MODULE__, {:put, project_id, dimensions, entries}, :infinity) end - def read(project_id) when is_binary(project_id) do - project_id - |> candidate_paths() - |> read_snapshot_paths() + @doc """ + Returns up to `limit` nearest neighbours of `query_vector` (the post's packed + BLOB), excluding `query_label`. `{:error, :missing}` if no index is available. + """ + def neighbors(project_id, query_label, query_vector, limit) + when is_binary(project_id) and is_integer(query_label) and is_binary(query_vector) do + GenServer.call(__MODULE__, {:neighbors, project_id, query_label, query_vector, limit}, :infinity) end - def neighbors(project_id, post_id, limit) when is_binary(project_id) and is_binary(post_id) do - with {:ok, snapshot} <- read(project_id), - %{} = entry <- get_in(snapshot, ["entries", post_id]) do - entry - |> Map.get("neighbors", []) - |> Enum.take(max(limit, 0)) - |> Enum.map(fn neighbor -> - %{ - post_id: neighbor["post_id"], - score: neighbor["score"] - } - end) - |> then(&{:ok, &1}) - else - _ -> {:error, :missing} + @doc """ + Finds near-duplicate pairs at/above `threshold` by querying the HNSW graph for + each entry's neighbours. `{:error, :missing}` if no index is available. + """ + def duplicate_pairs(project_id, entries, threshold, opts \\ []) + when is_binary(project_id) and is_list(entries) and is_number(threshold) do + GenServer.call( + __MODULE__, + {:duplicate_pairs, project_id, entries, threshold, opts}, + :infinity + ) + end + + @doc "Forces a pending save for a project to disk now (e.g. on project switch)." + def flush(project_id) when is_binary(project_id) do + GenServer.call(__MODULE__, {:flush, project_id}, :infinity) + end + + @doc "Forces all pending saves to disk now (e.g. on shutdown)." + def flush_all do + GenServer.call(__MODULE__, :flush_all, :infinity) + end + + @doc "Drops the in-memory index for a project (e.g. on project deletion)." + def forget(project_id) when is_binary(project_id) do + GenServer.call(__MODULE__, {:forget, project_id}, :infinity) + end + + # ─── GenServer ────────────────────────────────────────────── + + @impl true + def init(_opts) do + Process.flag(:trap_exit, true) + {:ok, %{}} + end + + @impl true + def handle_call({:put, project_id, dimensions, entries}, _from, state) do + entry = build_entry(dimensions, entries) + state = state |> Map.put(project_id, entry) |> schedule_save(project_id) + {:reply, :ok, state} + end + + def handle_call({:neighbors, project_id, query_label, query_vector, limit}, _from, state) do + case ensure_loaded(state, project_id) do + {:ok, %{index: nil}, state} -> + {:reply, {:error, :missing}, state} + + {:ok, entry, state} -> + {:reply, {:ok, query_neighbors(entry, query_label, query_vector, limit)}, state} + + {:missing, state} -> + {:reply, {:error, :missing}, state} end end - def duplicate_pairs(project_id, threshold, opts \\ []) when is_binary(project_id) do - with {:ok, snapshot} <- read(project_id) do - entries = Map.get(snapshot, "entries", %{}) - entry_count = map_size(entries) - on_progress = progress_callback(opts) + def handle_call({:duplicate_pairs, project_id, entries, threshold, opts}, _from, state) do + case ensure_loaded(state, project_id) do + {:ok, %{index: nil}, state} -> + {:reply, {:error, :missing}, state} - :ok = report_scan_started(on_progress, entry_count, "embedding entries") + {:ok, entry, state} -> + {:reply, {:ok, scan_duplicates(entry, entries, threshold, opts)}, state} - pairs = - entries - |> Enum.with_index(1) - |> Enum.flat_map(fn {{post_id, entry}, index} -> - :ok = report_scan_progress(on_progress, index, entry_count, "embedding entries") - - entry - |> Map.get("neighbors", []) - |> Enum.filter(&(&1["score"] >= threshold)) - |> Enum.map(fn neighbor -> - {post_id_a, post_id_b} = sort_pair(post_id, neighbor["post_id"]) - - {{post_id_a, post_id_b}, - %{ - post_id_a: post_id_a, - post_id_b: post_id_b, - score: neighbor["score"] - }} - end) - end) - |> Map.new() - |> Map.values() - |> Enum.sort_by(& &1.score, :desc) - - {:ok, pairs} - else - _ -> {:error, :missing} + {:missing, state} -> + {:reply, {:error, :missing}, state} end end - defp neighbor_entries(keys, current_key, current_vector) do - keys - |> Enum.reject(&(&1.post_id == current_key.post_id)) - |> Enum.map(fn other_key -> - %{ - "post_id" => other_key.post_id, - "label" => other_key.label, - "score" => cosine_similarity(current_vector, decode_vector(other_key.vector)) - } - end) - |> Enum.sort_by(& &1["score"], :desc) - |> Enum.take(@neighbor_limit) + def handle_call({:flush, project_id}, _from, state) do + {:reply, :ok, save_now(state, project_id)} end - defp write_snapshot(snapshot_path, payload, project_id) do - :ok = Persistence.atomic_write(snapshot_path, Jason.encode!(payload)) - legacy_path = legacy_path(snapshot_path) + def handle_call(:flush_all, _from, state) do + state = Enum.reduce(Map.keys(state), state, &save_now(&2, &1)) + {:reply, :ok, state} + end - if File.exists?(legacy_path) do - File.rm(legacy_path) + def handle_call({:forget, project_id}, _from, state) do + case Map.get(state, project_id) do + %{timer: timer} when is_reference(timer) -> Process.cancel_timer(timer) + _other -> :ok end - cleanup_legacy_project_snapshots(project_id, snapshot_path) + {:reply, :ok, Map.delete(state, project_id)} + end + @impl true + def handle_info({:save, project_id}, state) do + {:noreply, save_now(state, project_id)} + end + + def handle_info(_message, state), do: {:noreply, state} + + @impl true + def terminate(_reason, state) do + Enum.each(Map.keys(state), &save_now(state, &1)) :ok end - defp candidate_paths(project_id) do - current_snapshot_path = path(project_id) - legacy_project_snapshot_path = legacy_project_snapshot_path(project_id) + # ─── Build / query ────────────────────────────────────────── - [ - current_snapshot_path, - legacy_path(current_snapshot_path), - legacy_project_snapshot_path, - legacy_project_snapshot_path && legacy_path(legacy_project_snapshot_path) - ] - |> Enum.filter(&is_binary/1) - |> Enum.uniq() + defp build_entry(dimensions, []), do: %{index: nil, labels: %{}, dim: dimensions, timer: nil} + + defp build_entry(dimensions, entries) do + count = length(entries) + {:ok, index} = HNSWLib.Index.new(@space, dimensions, count, m: @m, ef_construction: @ef_construction) + :ok = HNSWLib.Index.set_ef(index, @ef_search) + + tensor = + entries + |> Enum.map(& &1.vector) + |> IO.iodata_to_binary() + |> Nx.from_binary(:f32) + |> Nx.reshape({count, dimensions}) + + :ok = HNSWLib.Index.add_items(index, tensor, ids: Enum.map(entries, & &1.label)) + + %{ + index: index, + labels: Map.new(entries, &{&1.label, &1.post_id}), + dim: dimensions, + timer: nil + } end - defp read_snapshot_paths([]), do: {:error, :missing} + defp query_neighbors(%{index: index, labels: labels}, query_label, query_vector, limit) do + case query(index, query_vector, limit + 1) do + [] -> + [] - defp read_snapshot_paths([snapshot_path | rest]) do - case File.read(snapshot_path) do - {:ok, contents} -> {:ok, Jason.decode!(contents)} - {:error, :enoent} -> read_snapshot_paths(rest) - {:error, reason} -> {:error, reason} + results -> + results + |> Enum.reject(fn {label, _score} -> label == query_label end) + |> Enum.map(fn {label, score} -> %{post_id: Map.get(labels, label), score: score} end) + |> Enum.reject(&is_nil(&1.post_id)) + |> Enum.take(max(limit, 0)) end end - defp cleanup_legacy_project_snapshots(project_id, snapshot_path) do - current_paths = [snapshot_path, legacy_path(snapshot_path)] + defp scan_duplicates(%{index: index, labels: labels}, entries, threshold, opts) do + on_progress = ProgressReporter.callback(opts) + total = length(entries) + :ok = report_scan_started(on_progress, total, "embedding entries") - project_id - |> legacy_project_snapshot_path() - |> then(fn legacy_snapshot_path -> - [legacy_snapshot_path, legacy_snapshot_path && legacy_path(legacy_snapshot_path)] - end) - |> Enum.filter(&is_binary/1) - |> Enum.reject(&(&1 in current_paths)) - |> Enum.each(fn legacy_snapshot_path -> - if File.exists?(legacy_snapshot_path) do - File.rm(legacy_snapshot_path) - end + entries + |> Enum.with_index(1) + |> Enum.flat_map(fn {entry, position} -> + :ok = report_scan_progress(on_progress, position, total, "embedding entries") + + index + |> query(entry.vector, @neighbor_limit) + |> Enum.reject(fn {label, _score} -> label == entry.label end) + |> Enum.map(fn {label, score} -> {Map.get(labels, label), score} end) + |> Enum.filter(fn {post_id, score} -> not is_nil(post_id) and score >= threshold end) + |> Enum.map(fn {other_post_id, score} -> + {post_id_a, post_id_b} = sort_pair(entry.post_id, other_post_id) + {{post_id_a, post_id_b}, %{post_id_a: post_id_a, post_id_b: post_id_b, score: score}} + end) end) + |> Map.new() + |> Map.values() + |> Enum.sort_by(& &1.score, :desc) end - defp legacy_project_snapshot_path(project_id) do - case Projects.get_project(project_id) do - nil -> nil - project -> Path.join(Projects.project_data_dir(project), "embeddings.usearch") + # Runs a knn query and returns [{label, similarity}] sorted by descending + # similarity. Cosine distance is converted to similarity as max(0, 1 - d). + defp query(index, query_vector, k) do + case HNSWLib.Index.get_current_count(index) do + {:ok, count} when count > 0 -> + clamped = min(k, count) + + case HNSWLib.Index.knn_query(index, query_vector, k: clamped) do + {:ok, labels, distances} -> + Enum.zip( + Nx.to_flat_list(labels), + Enum.map(Nx.to_flat_list(distances), fn distance -> max(0.0, 1.0 - distance) end) + ) + + {:error, _reason} -> + [] + end + + _other -> + [] end end - defp legacy_path(snapshot_path) do - Path.join(Path.dirname(snapshot_path), "embeddings.index.json") + # ─── Persistence ──────────────────────────────────────────── + + defp schedule_save(state, project_id) do + entry = Map.fetch!(state, project_id) + if is_reference(entry.timer), do: Process.cancel_timer(entry.timer) + timer = Process.send_after(self(), {:save, project_id}, @debounce_ms) + Map.put(state, project_id, %{entry | timer: timer}) end - # Vectors are stored as a packed little-endian Float32 BLOB; see - # BDS.Embeddings and the VectorCacheInDb invariant in embedding.allium. - defp decode_vector(nil), do: [] - defp decode_vector(<<>>), do: [] + defp save_now(state, project_id) do + case Map.get(state, project_id) do + nil -> + state - defp decode_vector(binary) when is_binary(binary) do - for <<value::float-32-little <- binary>>, do: value + entry -> + if is_reference(entry.timer), do: Process.cancel_timer(entry.timer) + persist(project_id, entry) + Map.put(state, project_id, %{entry | timer: nil}) + end end - defp cosine_similarity([], _other), do: 0.0 - defp cosine_similarity(_vector, []), do: 0.0 + defp persist(_project_id, %{index: nil}), do: :ok - defp cosine_similarity(left, right) do - Enum.zip(left, right) - |> Enum.reduce(0.0, fn {left_value, right_value}, acc -> acc + left_value * right_value end) - |> max(0.0) + defp persist(project_id, %{index: index, labels: labels, dim: dim}) do + index_path = path(project_id) + File.mkdir_p!(Path.dirname(index_path)) + HNSWLib.Index.save_index(index, index_path) + write_meta(index_path, dim, labels) + :ok + rescue + _exception -> :ok end + defp write_meta(index_path, dim, labels) do + payload = %{ + "dim" => dim, + "labels" => Enum.map(labels, fn {label, post_id} -> [label, post_id] end) + } + + File.write(meta_path(index_path), Jason.encode!(payload)) + end + + defp ensure_loaded(state, project_id) do + case Map.get(state, project_id) do + nil -> + case load_from_disk(project_id) do + {:ok, entry} -> {:ok, entry, Map.put(state, project_id, entry)} + :error -> {:missing, state} + end + + entry -> + {:ok, entry, state} + end + end + + defp load_from_disk(project_id) do + index_path = path(project_id) + + with {:ok, %{dim: dim, labels: labels}} <- read_meta(index_path), + true <- File.exists?(index_path), + {:ok, index} <- HNSWLib.Index.load_index(@space, dim, index_path) do + :ok = HNSWLib.Index.set_ef(index, @ef_search) + {:ok, %{index: index, labels: labels, dim: dim, timer: nil}} + else + _other -> :error + end + rescue + _exception -> :error + end + + defp read_meta(index_path) do + with {:ok, contents} <- File.read(meta_path(index_path)), + {:ok, %{"dim" => dim, "labels" => labels}} <- Jason.decode(contents) do + {:ok, + %{ + dim: dim, + labels: Map.new(labels, fn [label, post_id] -> {label, post_id} end) + }} + else + _other -> :error + end + end + + defp meta_path(index_path), do: index_path <> ".meta.json" + defp sort_pair(post_id_a, post_id_b) when post_id_a <= post_id_b, do: {post_id_a, post_id_b} defp sort_pair(post_id_a, post_id_b), do: {post_id_b, post_id_a} - defp progress_callback(opts), do: ProgressReporter.callback(opts) - defp report_scan_started(callback, total, label) do ProgressReporter.report_count_started(callback, total, label, verb: "Scanning", diff --git a/lib/bds/projects.ex b/lib/bds/projects.ex index ab35ad0..5694876 100644 --- a/lib/bds/projects.ex +++ b/lib/bds/projects.ex @@ -148,6 +148,9 @@ defmodule BDS.Projects do project -> now = Persistence.now_ms() + previous_active_id = + Repo.one(from p in Project, where: p.is_active == true, select: p.id) + Repo.transaction(fn -> Repo.update_all( from(p in Project, where: p.is_active == true), @@ -159,8 +162,16 @@ defmodule BDS.Projects do |> Repo.update!() end) |> case do - {:ok, active_project} -> {:ok, active_project} - {:error, reason} -> {:error, reason} + {:ok, active_project} -> + # Force-save the outgoing project's embedding index (DebouncedPersistence). + if is_binary(previous_active_id) and previous_active_id != active_project.id do + BDS.Embeddings.Index.flush(previous_active_id) + end + + {:ok, active_project} + + {:error, reason} -> + {:error, reason} end end end @@ -194,6 +205,8 @@ defmodule BDS.Projects do end) |> case do {:ok, deleted_project} -> + BDS.Embeddings.Index.forget(deleted_project.id) + Enum.each(cleanup_dirs, fn dir -> _ = File.rm_rf(dir) end) diff --git a/mix.exs b/mix.exs index 2900eb7..8d803b8 100644 --- a/mix.exs +++ b/mix.exs @@ -37,6 +37,7 @@ defmodule BDS.MixProject do {:nx, "~> 0.10"}, {:exla, "~> 0.10"}, {:bumblebee, "~> 0.6.3"}, + {:hnswlib, "~> 0.1.7"}, {:stemex, "~> 0.2.1"}, {:gettext, "~> 0.24"}, {:tailwind, "~> 0.3", runtime: Mix.env() == :dev}, @@ -63,7 +64,7 @@ defmodule BDS.MixProject do env = Mix.env() [ - plt_add_apps: [:mix, :inets, :ssl, :nx, :exla, :bumblebee], + plt_add_apps: [:mix, :inets, :ssl, :nx, :exla, :bumblebee, :hnswlib], paths: ["_build/#{env}/lib/bds/ebin"] ] end diff --git a/mix.lock b/mix.lock index 70e3eb0..fcaf8b3 100644 --- a/mix.lock +++ b/mix.lock @@ -28,6 +28,7 @@ "exqlite": {:hex, :exqlite, "0.36.0", "07b4f95d61cb82b8d52946d0639497fa7d32117e09b2c8d25e24a38723c295cb", [:make, :mix], [{:cc_precompiler, "~> 0.1", [hex: :cc_precompiler, repo: "hexpm", optional: false]}, {:db_connection, "~> 2.1", [hex: :db_connection, repo: "hexpm", optional: false]}, {:elixir_make, "~> 0.8", [hex: :elixir_make, repo: "hexpm", optional: false]}, {:table, "~> 0.1.0", [hex: :table, repo: "hexpm", optional: true]}], "hexpm", "cbeca3ce781f9ff07cfa9a87486f3ebd512a143ad6a14ed5c9fca21fe0bf3ae7"}, "fine": {:hex, :fine, "0.1.6", "4bf7151493443c454aac9f2fa2f34f5fefd0346a83fb5586a016c4a135c63247", [:mix], [], "hexpm", "5638eb4495488e885ebec167fa57973e5c35e1a50c344eb7666c90ec1c4e3b12"}, "gettext": {:hex, :gettext, "0.26.2", "5978aa7b21fada6deabf1f6341ddba50bc69c999e812211903b169799208f2a8", [:mix], [{:expo, "~> 0.5.1 or ~> 1.0", [hex: :expo, repo: "hexpm", optional: false]}], "hexpm", "aa978504bcf76511efdc22d580ba08e2279caab1066b76bb9aa81c4a1e0a32a5"}, + "hnswlib": {:hex, :hnswlib, "0.1.7", "784afdbfbc9af53e64d4b6da3f685c07039e472636a98fa954ffae5292ad6cc4", [:make, :mix], [{:cc_precompiler, "~> 0.1.0", [hex: :cc_precompiler, repo: "hexpm", optional: false]}, {:elixir_make, "~> 0.8", [hex: :elixir_make, repo: "hexpm", optional: false]}, {:nx, "~> 0.5", [hex: :nx, repo: "hexpm", optional: false]}], "hexpm", "fb43bb675facc8bb1ef0f4f8fec92479fc23317ed0f35c7160b2f95aff3e4742"}, "hpax": {:hex, :hpax, "1.0.3", "ed67ef51ad4df91e75cc6a1494f851850c0bd98ebc0be6e81b026e765ee535aa", [:mix], [], "hexpm", "8eab6e1cfa8d5918c2ce4ba43588e894af35dbd8e91e6e55c817bca5847df34a"}, "html_entities": {:hex, :html_entities, "0.5.2", "9e47e70598da7de2a9ff6af8758399251db6dbb7eebe2b013f2bbd2515895c3c", [:mix], [], "hexpm", "c53ba390403485615623b9531e97696f076ed415e8d8058b1dbaa28181f4fdcc"}, "html_sanitize_ex": {:hex, :html_sanitize_ex, "1.4.4", "271455b4d300d5d53a5d92b5bd1c00ad14c5abf1c9ff87be069af5736496515c", [:mix], [{:mochiweb, "~> 2.15 or ~> 3.1", [hex: :mochiweb, repo: "hexpm", optional: false]}], "hexpm", "12e1754204e7db5df1750df0a5dba1bbdf89260800019ab081f2b046596be56b"}, diff --git a/specs/embedding.allium b/specs/embedding.allium index 75ede5e..3a65bac 100644 --- a/specs/embedding.allium +++ b/specs/embedding.allium @@ -63,7 +63,7 @@ value EmbeddingVector { -- ─── Entities ─────────────────────────────────────────────── entity EmbeddingKey { - label: Integer -- HNSW label for USearch + label: Integer -- HNSW node label / id post: post/Post content_hash: String -- SHA-256 of "{title}\n\n{content}" vector: EmbeddingVector @@ -75,9 +75,11 @@ entity DismissedDuplicatePair { -- IDs stored in canonical order (sorted) for dedup } --- ─── USearch HNSW Index ───────────────────────────────────── +-- ─── HNSW Index ───────────────────────────────────────────── config { + -- HNSW approximate-nearest-neighbour index (hnswlib). USearch has no Elixir + -- binding; hnswlib provides the same HNSW algorithm and parameters. model_id: String = "Xenova/multilingual-e5-small" embedding_dimensions: Integer = 384 hnsw_metric: String = "cosine" @@ -86,7 +88,8 @@ config { hnsw_expansion_search: Integer = 64 -- efSearch debounce_persist: Duration = 5.seconds -- Index file: {userData}/projects/{projectId}/embeddings.usearch - -- Key mapping is persisted alongside the embedding records + -- Key mapping (label → post_id) persisted in a sidecar (.meta.json) next + -- to the index file, plus the source-of-truth rows in embedding_keys batch_size: Integer = 16 -- texts per batched inference run sequence_length: Integer = 256 -- max tokens per input (truncated) } @@ -112,7 +115,7 @@ rule EmbedPost { let existing = EmbeddingKey{post: post} if not exists existing or existing.content_hash != hash: -- Compute embedding vector via local model - -- Upsert into USearch index + embedding_keys DB table + -- Upsert into HNSW index + embedding_keys DB table -- Debounced index save (5s) ensures: EmbeddingKeyUpdated(post) } @@ -151,9 +154,9 @@ rule IndexUnindexed { rule FindSimilar { when: FindSimilarRequested(post, limit) requires: semantic_similarity_enabled - -- HNSW approximate nearest neighbor search via USearch + -- HNSW approximate nearest neighbor search (hnswlib) -- Searches index for (limit + 1) neighbors, excludes self - -- Converts USearch cosine distance to similarity: max(0, 1 - distance) + -- Converts HNSW cosine distance to similarity: max(0, 1 - distance) -- Returns ranked list sorted by descending similarity ensures: SimilarPostsResult(post, ranked_matches) } @@ -162,7 +165,7 @@ rule ComputeSimilarities { when: ComputeSimilaritiesRequested(source_post, target_post_ids) requires: semantic_similarity_enabled -- Exact pairwise cosine similarity between source vector and each target vector - -- Uses in-memory vector cache, NOT USearch search + -- Uses in-memory vector cache, NOT the HNSW index -- Returns map of post_id -> similarity score -- Used by InsertPostLinkModal to rank FTS search results ensures: SimilarityScoresResult(source_post, scores) @@ -207,7 +210,7 @@ invariant ContentHashSkipsUnchanged { } invariant DebouncedPersistence { - -- USearch index persistence is debounced at 5 seconds + -- HNSW index persistence is debounced at 5 seconds -- Prevents excessive disk I/O during bulk operations -- Index also force-saved on project switch and app shutdown } @@ -234,8 +237,9 @@ invariant NativeAcceleratedExecution { -- inference pass and inputs are truncated to a bounded sequence_length, so -- (re)indexing many posts is not serialised one document at a time. -- Current implementation: Bumblebee + EXLA, which is native CPU on Apple - -- Silicon (XLA has no Metal backend). Apple GPU acceleration via EMLX/MLX - -- is tracked as a follow-up (SPECGAPS A1-14c). + -- Silicon (XLA has no Metal backend); neighbour search is HNSW (hnswlib). + -- Apple GPU acceleration via EMLX/MLX is tracked as a follow-up + -- (SPECGAPS A1-14c). } invariant ModelCaching { @@ -245,7 +249,7 @@ invariant ModelCaching { } invariant ProjectIsolation { - -- Each project has its own USearch index file and embedding_keys rows + -- Each project has its own HNSW index file and embedding_keys rows -- On project switch: save current index, load new project's index -- Model pipeline shared across projects (not reloaded) } diff --git a/test/bds/desktop/automation_test.exs b/test/bds/desktop/automation_test.exs index 4af056b..9b3f343 100644 --- a/test/bds/desktop/automation_test.exs +++ b/test/bds/desktop/automation_test.exs @@ -44,32 +44,32 @@ defmodule BDS.Desktop.AutomationTest do assert :ok = Automation.click(session, "[data-testid='toggle-sidebar']") - snapshot = Automation.snapshot(session) + snapshot = await(session, &(&1.sidebar_visible == false)) assert snapshot.sidebar_visible == false assert :ok = Automation.press(session, "Meta+B") - snapshot = Automation.snapshot(session) + snapshot = await(session, &(&1.sidebar_visible == true)) assert snapshot.sidebar_visible == true assert :ok = Automation.press(session, "Meta+J") - snapshot = Automation.snapshot(session) + snapshot = await(session, &(&1.panel_visible == true)) assert snapshot.panel_visible == true assert :ok = Automation.press(session, "Meta+J") - snapshot = Automation.snapshot(session) + snapshot = await(session, &(&1.panel_visible == false)) assert snapshot.panel_visible == false assert :ok = Automation.press(session, "Meta+,") - snapshot = Automation.snapshot(session) + snapshot = await(session, &(&1.editor_title == "Settings")) assert snapshot.editor_title == "Settings" assert :ok = Automation.click(session, "[data-testid='toggle-assistant']") - snapshot = Automation.snapshot(session) + snapshot = await(session, &(&1.assistant_visible == true)) assert snapshot.assistant_visible == true assert snapshot.panel_visible == false @@ -92,7 +92,7 @@ defmodule BDS.Desktop.AutomationTest do assert :ok = Automation.drag(session, "[data-resize='sidebar']", 90) - snapshot = Automation.snapshot(session) + snapshot = await(session, &(&1.sidebar_width >= 360 and &1.sidebar_width <= 380)) assert snapshot.sidebar_width >= 360 assert snapshot.sidebar_width <= 380 @@ -100,7 +100,7 @@ defmodule BDS.Desktop.AutomationTest do assert :ok = Automation.reload(session) - snapshot = Automation.snapshot(session) + snapshot = await(session, &(&1.sidebar_visible == true and &1.sidebar_width >= resized_width - 2)) assert snapshot.sidebar_visible == true assert snapshot.sidebar_width >= resized_width - 2 assert snapshot.sidebar_width <= resized_width + 2 @@ -140,12 +140,12 @@ defmodule BDS.Desktop.AutomationTest do assert :ok = Automation.native_menu_action(session, "toggle_sidebar") - snapshot = Automation.snapshot(session) + snapshot = await(session, &(&1.sidebar_visible == false)) assert snapshot.sidebar_visible == false assert :ok = Automation.native_menu_action(session, "edit_preferences") - snapshot = Automation.snapshot(session) + snapshot = await(session, &(&1.editor_title == "Settings")) assert snapshot.editor_title == "Settings" end @@ -175,6 +175,24 @@ defmodule BDS.Desktop.AutomationTest do end end + # Polls snapshots until the predicate holds (or times out), returning the + # last snapshot. UI transitions after a keypress/click/menu action are + # asynchronous, so a single immediate snapshot can race under CPU load. + defp await(session, fun, timeout \\ 5_000) + + defp await(session, _fun, timeout) when timeout <= 0, do: Automation.snapshot(session) + + defp await(session, fun, timeout) do + snapshot = Automation.snapshot(session) + + if fun.(snapshot) do + snapshot + else + Process.sleep(50) + await(session, fun, timeout - 50) + end + end + defp wait_until(fun, timeout \\ 5_000) defp wait_until(fun, timeout) when timeout <= 0, do: fun.() diff --git a/test/bds/embeddings_test.exs b/test/bds/embeddings_test.exs index 58635e6..c886ab2 100644 --- a/test/bds/embeddings_test.exs +++ b/test/bds/embeddings_test.exs @@ -319,24 +319,28 @@ defmodule BDS.EmbeddingsTest do assert {:ok, _indexed} = BDS.Embeddings.index_unindexed(project.id) + # Persistence is debounced (5s); force it to disk to assert the files. + :ok = BDS.Embeddings.Index.flush(project.id) + index_path = BDS.Embeddings.index_path(project.id) assert File.exists?(index_path) + assert File.exists?(index_path <> ".meta.json") refute String.starts_with?(index_path, BDS.Projects.project_data_dir(project)) cache_root = Application.fetch_env!(:bds, :project_cache_root) |> Path.expand() assert index_path == Path.join([cache_root, "projects", project.id, "embeddings.usearch"]) - snapshot = index_path |> File.read!() |> Jason.decode!() - assert snapshot["project_id"] == project.id - assert snapshot["model_id"] == "fake/multilingual-e5-small" - assert snapshot["dimensions"] == 384 - assert snapshot["entries"][alpha.id]["label"] != nil - assert snapshot["entries"][alpha.id]["content_hash"] != nil + # The sidecar carries the dimension and the label→post_id mapping. + meta = (index_path <> ".meta.json") |> File.read!() |> Jason.decode!() + assert meta["dim"] == 384 + post_ids = Enum.map(meta["labels"], fn [_label, post_id] -> post_id end) + assert alpha.id in post_ids + assert beta.id in post_ids - assert Enum.any?(snapshot["entries"][alpha.id]["neighbors"], fn neighbor -> - neighbor["post_id"] == beta.id - end) + # The HNSW index answers nearest-neighbour queries. + assert {:ok, [neighbor]} = BDS.Embeddings.find_similar(alpha.id, 1) + assert neighbor.post_id == beta.id end test "embedding index uses the app-internal persisted file name", %{project: project} do @@ -443,43 +447,76 @@ defmodule BDS.EmbeddingsTest do refreshed_key = BDS.Repo.get_by!(BDS.Embeddings.Key, project_id: project.id, post_id: post.id) assert refreshed_key.content_hash == stale_key.content_hash + + :ok = BDS.Embeddings.Index.flush(project.id) assert File.exists?(BDS.Embeddings.index_path(project.id)) end - test "sync_post refreshes snapshot drift when the embedding hash is already current", %{ + test "similarity queries keep working when sync_post finds the embedding already current", %{ project: project } do assert {:ok, _metadata} = BDS.Metadata.update_project_metadata(project.id, %{semantic_similarity_enabled: true}) - assert {:ok, post} = + assert {:ok, alpha} = BDS.Posts.create_post(%{ project_id: project.id, - title: "Snapshot Repair", + title: "Alpha", content: "space rocket orbit mission galaxy", language: "en" }) - assert {:ok, post} = BDS.Posts.publish_post(post.id) + assert {:ok, beta} = + BDS.Posts.create_post(%{ + project_id: project.id, + title: "Beta", + content: "rocket launch orbit mission station", + language: "en" + }) + + assert {:ok, alpha} = BDS.Posts.publish_post(alpha.id) + assert {:ok, _beta} = BDS.Posts.publish_post(beta.id) assert {:ok, _indexed} = BDS.Embeddings.index_unindexed(project.id) - key = BDS.Repo.get_by!(BDS.Embeddings.Key, project_id: project.id, post_id: post.id) - index_path = BDS.Embeddings.index_path(project.id) + # Re-syncing with an unchanged content hash is a no-op for the index... + assert :ok = BDS.Embeddings.sync_post(alpha.id) - snapshot = index_path |> File.read!() |> Jason.decode!() + # ...and nearest-neighbour queries still resolve through the HNSW index. + assert {:ok, [neighbor]} = BDS.Embeddings.find_similar(alpha.id, 1) + assert neighbor.post_id == beta.id + end - drifted_snapshot = - put_in(snapshot, ["entries", post.id, "content_hash"], "stale-snapshot-hash") + test "find_similar rebuilds the HNSW index on demand when none is loaded", %{project: project} do + assert {:ok, _metadata} = + BDS.Metadata.update_project_metadata(project.id, %{semantic_similarity_enabled: true}) - File.write!(index_path, Jason.encode!(drifted_snapshot)) + assert {:ok, alpha} = + BDS.Posts.create_post(%{ + project_id: project.id, + title: "Alpha", + content: "space rocket orbit mission galaxy", + language: "en" + }) - refute Enum.any?(BDS.Embeddings.diff_reports(project.id), &(&1.entity_id == post.id)) + assert {:ok, beta} = + BDS.Posts.create_post(%{ + project_id: project.id, + title: "Beta", + content: "rocket launch orbit mission station", + language: "en" + }) - assert :ok = BDS.Embeddings.sync_post(post.id) + assert {:ok, _alpha} = BDS.Posts.publish_post(alpha.id) + assert {:ok, _beta} = BDS.Posts.publish_post(beta.id) + assert {:ok, _indexed} = BDS.Embeddings.index_unindexed(project.id) - repaired_snapshot = index_path |> File.read!() |> Jason.decode!() - assert get_in(repaired_snapshot, ["entries", post.id, "content_hash"]) == key.content_hash + # Drop the in-memory index and remove the persisted files, then query: it + # must self-heal by rebuilding from the DB vectors. + :ok = BDS.Embeddings.Index.forget(project.id) + File.rm_rf!(BDS.Projects.project_cache_dir(project.id)) - refute Enum.any?(BDS.Embeddings.diff_reports(project.id), &(&1.entity_id == post.id)) + assert {:ok, similar} = BDS.Embeddings.find_similar(alpha.id, 1) + assert [%{post_id: post_id}] = similar + assert post_id == beta.id end end diff --git a/test/bds/maintenance_test.exs b/test/bds/maintenance_test.exs index d8750da..087fbdb 100644 --- a/test/bds/maintenance_test.exs +++ b/test/bds/maintenance_test.exs @@ -395,6 +395,8 @@ defmodule BDS.MaintenanceTest do assert {:ok, _indexed} = BDS.Embeddings.index_unindexed(project.id) index_path = BDS.Embeddings.index_path(project.id) + # Index persistence is debounced (5s); force it to assert the file. + :ok = BDS.Embeddings.Index.flush(project.id) assert File.exists?(index_path) Repo.delete_all(from key in BDS.Embeddings.Key, where: key.project_id == ^project.id) @@ -416,6 +418,8 @@ defmodule BDS.MaintenanceTest do assert post.id in rebuilt_post_ids assert Repo.get_by(BDS.Embeddings.Key, project_id: project.id, post_id: post.id) != nil + + :ok = BDS.Embeddings.Index.flush(project.id) assert File.exists?(index_path) end diff --git a/test/bds/metadata_test.exs b/test/bds/metadata_test.exs index 62ce8e1..02e6240 100644 --- a/test/bds/metadata_test.exs +++ b/test/bds/metadata_test.exs @@ -236,6 +236,9 @@ defmodule BDS.MetadataTest do assert metadata.semantic_similarity_enabled == true assert BDS.Repo.get_by(BDS.Embeddings.Key, project_id: project.id, post_id: post.id) != nil + + # Index persistence is debounced (5s); force it to assert the file. + :ok = BDS.Embeddings.Index.flush(project.id) assert File.exists?(BDS.Embeddings.index_path(project.id)) end