fix: force full re-embed on explicit rebuild and degrade gracefully when embedding model is unavailable
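The first half of this change comes down to one boolean: an explicit rebuild passes `force?` as `true`, which bypasses the content_hash skip. A minimal sketch of that decision follows, assuming a hypothetical `ForceReembedSketch` module and plain maps standing in for the real `Key` rows; only the boolean expression is taken from the diff.

```elixir
defmodule ForceReembedSketch do
  # Hypothetical stand-in for the `needs_embed?` field computed in
  # build_key_rows/5. `existing` is nil for a post that was never indexed,
  # otherwise a map carrying the stored content_hash.
  def needs_embed?(existing, content_hash, force?) do
    force? or is_nil(existing) or existing.content_hash != content_hash
  end
end

unchanged = %{content_hash: "abc"}

# Incremental paths (index_unindexed, repair) skip unchanged posts.
IO.inspect(ForceReembedSketch.needs_embed?(unchanged, "abc", false))
# An explicit rebuild re-embeds even when the hash matches.
IO.inspect(ForceReembedSketch.needs_embed?(unchanged, "abc", true))
# A post without a stored key is always embedded.
IO.inspect(ForceReembedSketch.needs_embed?(nil, "abc", false))
```

Because `is_nil(existing)` is checked before `existing.content_hash`, the short-circuiting `or` never dereferences a missing key.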
@@ -24,7 +24,7 @@ Gap categories: **SC** = spec correct, fix code | **CS** = code correct, update
|
|||||||
| A1-12 | ~~Real Pagefind integration for search~~ | generation.allium:208 | Functional client-side search: `PagefindUI` defined in bundled `pagefind-ui.js`, fragment index records url/title/body-scoped text per page, search-runtime wires it up | **Resolved:** bundled real `PagefindUI` (fetch index, ranked full-text match, highlighted excerpts) + `pagefind-ui.css` as local assets read into `Pagefind`; index scoped to `data-pagefind-body` (unmarked pages excluded per PagefindHtmlMarking), title from `<title>`/`<h1>`; localized "No results found" label via `data-search-no-results` (de/fr/it/es); 3 unit tests added |
|
| A1-12 | ~~Real Pagefind integration for search~~ | generation.allium:208 | Functional client-side search: `PagefindUI` defined in bundled `pagefind-ui.js`, fragment index records url/title/body-scoped text per page, search-runtime wires it up | **Resolved:** bundled real `PagefindUI` (fetch index, ranked full-text match, highlighted excerpts) + `pagefind-ui.css` as local assets read into `Pagefind`; index scoped to `data-pagefind-body` (unmarked pages excluded per PagefindHtmlMarking), title from `<title>`/`<h1>`; localized "No results found" label via `data-search-no-results` (de/fr/it/es); 3 unit tests added |
|
||||||
| A1-13 | ~~Git sidebar shows only "Working tree" placeholder~~ | sidebar_views.allium:651-770 | `git_view/1` now builds a full `layout: "git"` view from `BDS.Git` (repository/remote_state/status/history); `SidebarComponents` renders active + not_a_repo states | **Resolved:** `git_view/1` in sidebar.ex assembles branch/upstream/ahead/behind, status files, paginated history (20/page); `render_git_sidebar` renders branch header, sync legend, fetch/pull/push/prune-lfs buttons, commit form, clickable status files (open git_diff), history entries; shell_live wires `git_commit` (closes git_diff tabs), `git_fetch`/`git_pull`/`git_push`/`git_prune_lfs`, `git_initialize`; `BDS.Git.history` enriched with author/date, `BDS.Git.set_remote/2` added; i18n for de/fr/it/es; 3 shell tests + git author/date assertions added |
|
| A1-13 | ~~Git sidebar shows only "Working tree" placeholder~~ | sidebar_views.allium:651-770 | `git_view/1` now builds a full `layout: "git"` view from `BDS.Git` (repository/remote_state/status/history); `SidebarComponents` renders active + not_a_repo states | **Resolved:** `git_view/1` in sidebar.ex assembles branch/upstream/ahead/behind, status files, paginated history (20/page); `render_git_sidebar` renders branch header, sync legend, fetch/pull/push/prune-lfs buttons, commit form, clickable status files (open git_diff), history entries; shell_live wires `git_commit` (closes git_diff tabs), `git_fetch`/`git_pull`/`git_push`/`git_prune_lfs`, `git_initialize`; `BDS.Git.history` enriched with author/date, `BDS.Git.set_remote/2` added; i18n for de/fr/it/es; 3 shell tests + git author/date assertions added |
|
||||||
| A1-14 | ~~Embedding uses TF-IDF hash projection instead of real neural model~~ | embedding.allium:44-53, invariants RealNeuralModel/ModelCaching/VectorCacheInDb | `Backends.Neural` runs `intfloat/multilingual-e5-small` (e5 weights behind the Xenova id) via Bumblebee+EXLA | **Resolved (core):** added bumblebee/nx/exla deps; `Backends.Neural` is a lazily-loaded GenServer that builds the Bumblebee text-embedding serving on first request (`"query: "` prefix + mean pooling + L2 norm), downloads+caches the model under the app data dir (ModelCaching), and is wired into the supervision tree when configured; vectors now persisted as packed little-endian Float32 BLOB (384×4=1536 bytes) instead of JSON text (VectorCacheInDb) with migration recreating `embedding_keys.vector` as BLOB; `InApp` demoted to documented offline/test stub; test config uses the stub so the suite stays offline; spec EmbeddingModel clarified (Xenova id ↔ intfloat weights via Bumblebee); batched inference via optional `embed_many/2` backend callback (configurable `batch_size`/`sequence_length`; rebuild/index/repair embed in chunks instead of one post at a time) + `NativeAcceleratedExecution` invariant added to spec; 4 tests added (BLOB round-trip, batched-rebuild, Neural model_info/behaviour). **Deferred:** A1-14b USearch HNSW index, A1-14c Apple GPU (EMLX). |
|
| A1-14 | ~~Embedding uses TF-IDF hash projection instead of real neural model~~ | embedding.allium:44-53, invariants RealNeuralModel/ModelCaching/VectorCacheInDb | `Backends.Neural` runs `intfloat/multilingual-e5-small` (e5 weights behind the Xenova id) via Bumblebee+EXLA | **Resolved (core):** added bumblebee/nx/exla deps; `Backends.Neural` is a lazily-loaded GenServer that builds the Bumblebee text-embedding serving on first request (`"query: "` prefix + mean pooling + L2 norm), downloads+caches the model under the app data dir (ModelCaching), and is wired into the supervision tree when configured; vectors now persisted as packed little-endian Float32 BLOB (384×4=1536 bytes) instead of JSON text (VectorCacheInDb) with migration recreating `embedding_keys.vector` as BLOB; `InApp` demoted to documented offline/test stub; test config uses the stub so the suite stays offline; spec EmbeddingModel clarified (Xenova id ↔ intfloat weights via Bumblebee); batched inference via optional `embed_many/2` backend callback (configurable `batch_size`/`sequence_length`; rebuild/index/repair embed in chunks instead of one post at a time) + `NativeAcceleratedExecution` invariant added to spec; 4 tests added (BLOB round-trip, batched-rebuild, Neural model_info/behaviour). **Deferred:** A1-14b USearch HNSW index, A1-14c Apple GPU (EMLX). |
|
||||||
| A1-14b | ~~USearch HNSW ANN index + debounced persistence not implemented~~ | embedding.allium config/FindSimilar/DebouncedPersistence | `Embeddings.Index` is now an HNSW (hnswlib) ANN index with debounced persistence | **Resolved:** rewrote `Embeddings.Index` as a DB-free GenServer wrapping an hnswlib HNSW graph (cosine, M=16, efConstruction=128, efSearch=64) — O(n·log n) build, O(log n) queries, replacing the O(n²) JSON cosine snapshot; per-project in-memory index + `label→post_id` map; 5s debounced `save_index` + `.meta.json` sidecar, force-save on project switch (`set_active_project`) and shutdown (`terminate`), `forget/1` on project delete; lazy reload from disk with rebuild-from-DB self-heal on miss; `find_similar`/`find_duplicates`/`compute_similarities` rewired (no brute-force fallback); USearch has no Elixir binding so hnswlib provides the identical HNSW algorithm/params (spec reconciled); supervision + dialyzer PLT updated; tests updated for debounced/binary persistence + self-heal. |
|
| A1-14b | ~~USearch HNSW ANN index + debounced persistence not implemented~~ | embedding.allium config/FindSimilar/DebouncedPersistence | `Embeddings.Index` is now an HNSW (hnswlib) ANN index with debounced persistence | **Resolved:** rewrote `Embeddings.Index` as a DB-free GenServer wrapping an hnswlib HNSW graph (cosine, M=16, efConstruction=128, efSearch=64) — O(n·log n) build, O(log n) queries, replacing the O(n²) JSON cosine snapshot; per-project in-memory index + `label→post_id` map; 5s debounced `save_index` + `.meta.json` sidecar, force-save on project switch (`set_active_project`) and shutdown (`terminate`), `forget/1` on project delete; lazy reload from disk with rebuild-from-DB self-heal on miss; `find_similar`/`find_duplicates`/`compute_similarities` rewired (no brute-force fallback); USearch has no Elixir binding so hnswlib provides the identical HNSW algorithm/params (spec reconciled); supervision + dialyzer PLT updated; tests updated for debounced/binary persistence + self-heal. Follow-up hardening: explicit rebuild now forces re-embedding regardless of content_hash (ReindexAll), and model-unavailable errors propagate cleanly (post saves degrade to unindexed + log; rebuild/index return `{:error, reason}` surfaced as a failed task with a user-facing message instead of crashing). |
|
||||||
| A1-14c | Embedding model runs on CPU only; no Apple GPU acceleration | embedding.allium invariant NativeAcceleratedExecution | `Backends.Neural` uses Bumblebee+EXLA; on Apple Silicon XLA has no Metal backend so inference is native CPU (batched). Apple GPU/Neural Engine unused | Fix code: spike an EMLX (Apple MLX) Nx backend so the model executes on the Apple Silicon GPU; gate by platform/availability with EXLA-CPU fallback; verify Bumblebee serving + defn compiler compatibility and benchmark vs CPU batching |
|
| A1-14c | Embedding model runs on CPU only; no Apple GPU acceleration | embedding.allium invariant NativeAcceleratedExecution | `Backends.Neural` uses Bumblebee+EXLA; on Apple Silicon XLA has no Metal backend so inference is native CPU (batched). Apple GPU/Neural Engine unused | Fix code: spike an EMLX (Apple MLX) Nx backend so the model executes on the Apple Silicon GPU; gate by platform/availability with EXLA-CPU fallback; verify Bumblebee serving + defn compiler compatibility and benchmark vs CPU batching |
|
||||||
| A1-15 | ~~Preview vs generation content source strategy undocumented~~ | preview.allium (no invariant), generation.allium (no invariant) | Generation uses only published .md file content (`Generation.Data` snapshots set `content: nil`); preview includes published+draft posts and prefers DB content over file (`Preview.Router` queries `:published`/`:draft`, uses `editor_body`) | **Resolved:** added `PreviewDraftOverlay` invariant to preview.allium and `GenerationPublishedOnly` invariant to generation.allium; both cross-reference each other; code already correct, 3 tests added for draft-in-preview behavior |
|
| A1-15 | ~~Preview vs generation content source strategy undocumented~~ | preview.allium (no invariant), generation.allium (no invariant) | Generation uses only published .md file content (`Generation.Data` snapshots set `content: nil`); preview includes published+draft posts and prefers DB content over file (`Preview.Router` queries `:published`/`:draft`, uses `editor_body`) | **Resolved:** added `PreviewDraftOverlay` invariant to preview.allium and `GenerationPublishedOnly` invariant to generation.allium; both cross-reference each other; code already correct, 3 tests added for draft-in-preview behavior |
|
||||||
|
|
||||||
|
|||||||
@@ -120,16 +120,7 @@ defmodule BDS.Desktop.ShellCommands do
|
|||||||
"rebuild_embedding_index",
|
"rebuild_embedding_index",
|
||||||
"Rebuild Embedding Index",
|
"Rebuild Embedding Index",
|
||||||
"Embeddings",
|
"Embeddings",
|
||||||
fn report ->
|
fn report -> rebuild_embedding_index_work(project, report) end
|
||||||
{:ok, rebuilt_post_ids} = Embeddings.rebuild_project(project.id, on_progress: report)
|
|
||||||
report.(1.0, "Embedding index rebuilt")
|
|
||||||
|
|
||||||
%{
|
|
||||||
project_id: project.id,
|
|
||||||
rebuilt_post_ids: rebuilt_post_ids,
|
|
||||||
rebuilt_count: length(rebuilt_post_ids)
|
|
||||||
}
|
|
||||||
end
|
|
||||||
)
|
)
|
||||||
end
|
end
|
||||||
|
|
||||||
@@ -524,20 +515,39 @@ defmodule BDS.Desktop.ShellCommands do
|
|||||||
},
|
},
|
||||||
%{
|
%{
|
||||||
name: "Rebuild Embedding Index",
|
name: "Rebuild Embedding Index",
|
||||||
work: fn report ->
|
work: fn report -> rebuild_embedding_index_work(project, report) end
|
||||||
{:ok, rebuilt_post_ids} = Embeddings.rebuild_project(project.id, on_progress: report)
|
|
||||||
report.(1.0, "Embedding index rebuilt")
|
|
||||||
|
|
||||||
%{
|
|
||||||
project_id: project.id,
|
|
||||||
rebuilt_post_ids: rebuilt_post_ids,
|
|
||||||
rebuilt_count: length(rebuilt_post_ids)
|
|
||||||
}
|
|
||||||
end
|
|
||||||
}
|
}
|
||||||
]
|
]
|
||||||
end
|
end
|
||||||
|
|
||||||
|
defp rebuild_embedding_index_work(project, report) do
|
||||||
|
case Embeddings.rebuild_project(project.id, on_progress: report) do
|
||||||
|
{:ok, rebuilt_post_ids} ->
|
||||||
|
report.(1.0, "Embedding index rebuilt")
|
||||||
|
|
||||||
|
%{
|
||||||
|
project_id: project.id,
|
||||||
|
rebuilt_post_ids: rebuilt_post_ids,
|
||||||
|
rebuilt_count: length(rebuilt_post_ids)
|
||||||
|
}
|
||||||
|
|
||||||
|
{:error, reason} ->
|
||||||
|
{:error, embedding_error_message(reason)}
|
||||||
|
end
|
||||||
|
end
|
||||||
|
|
||||||
|
defp embedding_error_message(reason) do
|
||||||
|
detail =
|
||||||
|
case reason do
|
||||||
|
message when is_binary(message) -> message
|
||||||
|
{:embedding_backend_unavailable, _inner} -> "the embedding service did not start"
|
||||||
|
other -> inspect(other)
|
||||||
|
end
|
||||||
|
|
||||||
|
"Could not build the embedding index: #{detail}. The model is downloaded on first use, " <>
|
||||||
|
"so check your internet connection — or turn off semantic similarity in Settings."
|
||||||
|
end
|
||||||
|
|
||||||
defp run_rebuild_sequence(_group_id, _attrs, []), do: :ok
|
defp run_rebuild_sequence(_group_id, _attrs, []), do: :ok
|
||||||
|
|
||||||
defp run_rebuild_sequence(group_id, attrs, [step | remaining_steps]) do
|
defp run_rebuild_sequence(group_id, attrs, [step | remaining_steps]) do
|
||||||
|
|||||||
@@ -2,6 +2,7 @@ defmodule BDS.Embeddings do
|
|||||||
@moduledoc false
|
@moduledoc false
|
||||||
|
|
||||||
import Ecto.Query
|
import Ecto.Query
|
||||||
|
require Logger
|
||||||
|
|
||||||
alias BDS.Persistence
|
alias BDS.Persistence
|
||||||
alias BDS.Embeddings.DismissedDuplicatePair
|
alias BDS.Embeddings.DismissedDuplicatePair
|
||||||
@@ -75,11 +76,16 @@ defmodule BDS.Embeddings do
|
|||||||
)
|
)
|
||||||
|
|
||||||
existing_keys = preload_keys_by_post_id(project_id, Enum.map(posts, & &1.id))
|
existing_keys = preload_keys_by_post_id(project_id, Enum.map(posts, & &1.id))
|
||||||
rows = build_key_rows(posts, existing_keys, max_label_value(), nil)
|
|
||||||
|
|
||||||
batch_upsert_keys(rows)
|
case build_key_rows(posts, existing_keys, max_label_value(), nil, false) do
|
||||||
:ok = rebuild_snapshot(project_id)
|
{:ok, rows} ->
|
||||||
{:ok, Enum.map(posts, & &1.id)}
|
batch_upsert_keys(rows)
|
||||||
|
:ok = rebuild_snapshot(project_id)
|
||||||
|
{:ok, Enum.map(posts, & &1.id)}
|
||||||
|
|
||||||
|
{:error, _reason} = error ->
|
||||||
|
error
|
||||||
|
end
|
||||||
else
|
else
|
||||||
{:ok, []}
|
{:ok, []}
|
||||||
end
|
end
|
||||||
@@ -106,13 +112,19 @@ defmodule BDS.Embeddings do
|
|||||||
)
|
)
|
||||||
|
|
||||||
existing_keys = preload_keys_by_post_id(project_id)
|
existing_keys = preload_keys_by_post_id(project_id)
|
||||||
rows = build_key_rows(posts, existing_keys, max_label_value(), on_progress)
|
|
||||||
|
|
||||||
batch_upsert_keys(rows)
|
# An explicit rebuild re-embeds every post from scratch (ReindexAll),
|
||||||
|
# ignoring the content_hash skip optimisation.
|
||||||
|
case build_key_rows(posts, existing_keys, max_label_value(), on_progress, true) do
|
||||||
|
{:ok, rows} ->
|
||||||
|
batch_upsert_keys(rows)
|
||||||
|
:ok = report_rebuild_phase(on_progress, 0.99, "Persisting embedding snapshot")
|
||||||
|
:ok = rebuild_snapshot(project_id)
|
||||||
|
{:ok, post_ids}
|
||||||
|
|
||||||
:ok = report_rebuild_phase(on_progress, 0.99, "Persisting embedding snapshot")
|
{:error, _reason} = error ->
|
||||||
:ok = rebuild_snapshot(project_id)
|
error
|
||||||
{:ok, post_ids}
|
end
|
||||||
else
|
else
|
||||||
{:ok, []}
|
{:ok, []}
|
||||||
end
|
end
|
||||||
@@ -172,24 +184,36 @@ defmodule BDS.Embeddings do
|
|||||||
:ok
|
:ok
|
||||||
|
|
||||||
existing_key ->
|
existing_key ->
|
||||||
label = existing_key_label(existing_key) || next_label()
|
case embed_text(raw_text, post.language) do
|
||||||
{:ok, vector} = embed_text(raw_text, post.language)
|
{:ok, vector} ->
|
||||||
|
label = existing_key_label(existing_key) || next_label()
|
||||||
|
|
||||||
(existing_key || %Key{})
|
(existing_key || %Key{})
|
||||||
|> Key.changeset(%{
|
|> Key.changeset(%{
|
||||||
label: label,
|
label: label,
|
||||||
post_id: post.id,
|
post_id: post.id,
|
||||||
project_id: post.project_id,
|
project_id: post.project_id,
|
||||||
content_hash: content_hash,
|
content_hash: content_hash,
|
||||||
vector: encode_vector(vector)
|
vector: encode_vector(vector)
|
||||||
})
|
})
|
||||||
|> Repo.insert_or_update()
|
|> Repo.insert_or_update()
|
||||||
|
|
||||||
if Keyword.get(opts, :refresh_index, true) do
|
if Keyword.get(opts, :refresh_index, true) do
|
||||||
:ok = rebuild_snapshot(post.project_id)
|
:ok = rebuild_snapshot(post.project_id)
|
||||||
|
end
|
||||||
|
|
||||||
|
:ok
|
||||||
|
|
||||||
|
{:error, reason} ->
|
||||||
|
# Embedding is best-effort on post save: if the model is unavailable
|
||||||
|
# (e.g. offline first-use download), leave the post unindexed rather
|
||||||
|
# than failing the save. An explicit reindex surfaces the error.
|
||||||
|
Logger.warning(
|
||||||
|
"Embedding unavailable for post #{post.id}: #{inspect(reason)}; left unindexed"
|
||||||
|
)
|
||||||
|
|
||||||
|
:ok
|
||||||
end
|
end
|
||||||
|
|
||||||
:ok
|
|
||||||
end
|
end
|
||||||
end
|
end
|
||||||
|
|
||||||
@@ -210,11 +234,12 @@ defmodule BDS.Embeddings do
|
|||||||
Repo.one(from key in Key, select: max(key.label)) || 0
|
Repo.one(from key in Key, select: max(key.label)) || 0
|
||||||
end
|
end
|
||||||
|
|
||||||
# Builds the upsert rows for a batch of posts. Posts whose content_hash is
|
# Builds the upsert rows for a batch of posts. Unless `force?` is set, posts
|
||||||
# unchanged are skipped (ContentHashSkipsUnchanged); the rest are embedded in
|
# whose content_hash is unchanged are skipped (ContentHashSkipsUnchanged); the
|
||||||
# batches (see embed_pending/2) so model inference is not serialised one post
|
# rest are embedded in batches (see embed_pending/2) so model inference is not
|
||||||
# at a time. Labels keep their existing value or take the next free integer.
|
# serialised one post at a time. Labels keep their existing value or take the
|
||||||
defp build_key_rows(posts, existing_keys, base_label, on_progress) do
|
# next free integer. Returns `{:error, reason}` if the model is unavailable.
|
||||||
|
defp build_key_rows(posts, existing_keys, base_label, on_progress, force?) do
|
||||||
prepared =
|
prepared =
|
||||||
Enum.map(posts, fn post ->
|
Enum.map(posts, fn post ->
|
||||||
raw_text = compose_embedding_source(post.title, resolve_post_body(post))
|
raw_text = compose_embedding_source(post.title, resolve_post_body(post))
|
||||||
@@ -226,14 +251,20 @@ defmodule BDS.Embeddings do
|
|||||||
existing: existing,
|
existing: existing,
|
||||||
raw_text: raw_text,
|
raw_text: raw_text,
|
||||||
content_hash: content_hash,
|
content_hash: content_hash,
|
||||||
needs_embed?: is_nil(existing) or existing.content_hash != content_hash
|
needs_embed?: force? or is_nil(existing) or existing.content_hash != content_hash
|
||||||
}
|
}
|
||||||
end)
|
end)
|
||||||
|
|
||||||
pending = Enum.filter(prepared, & &1.needs_embed?)
|
pending = Enum.filter(prepared, & &1.needs_embed?)
|
||||||
:ok = report_rebuild_started(on_progress, length(pending), "embedding entries")
|
:ok = report_rebuild_started(on_progress, length(pending), "embedding entries")
|
||||||
vectors_by_post_id = embed_pending(pending, on_progress)
|
|
||||||
|
|
||||||
|
case embed_pending(pending, on_progress) do
|
||||||
|
{:ok, vectors_by_post_id} -> {:ok, collect_rows(prepared, vectors_by_post_id, base_label)}
|
||||||
|
{:error, _reason} = error -> error
|
||||||
|
end
|
||||||
|
end
|
||||||
|
|
||||||
|
defp collect_rows(prepared, vectors_by_post_id, base_label) do
|
||||||
{rows, _next_label} =
|
{rows, _next_label} =
|
||||||
Enum.reduce(prepared, {[], base_label + 1}, fn entry, {acc, next_label} ->
|
Enum.reduce(prepared, {[], base_label + 1}, fn entry, {acc, next_label} ->
|
||||||
if entry.needs_embed? do
|
if entry.needs_embed? do
|
||||||
@@ -258,7 +289,7 @@ defmodule BDS.Embeddings do
|
|||||||
rows
|
rows
|
||||||
end
|
end
|
||||||
|
|
||||||
defp embed_pending([], _on_progress), do: %{}
|
defp embed_pending([], _on_progress), do: {:ok, %{}}
|
||||||
|
|
||||||
defp embed_pending(pending, on_progress) do
|
defp embed_pending(pending, on_progress) do
|
||||||
total = length(pending)
|
total = length(pending)
|
||||||
@@ -268,25 +299,36 @@ defmodule BDS.Embeddings do
|
|||||||
# Group by language so the lexical stub stems consistently; the neural
|
# Group by language so the lexical stub stems consistently; the neural
|
||||||
# backend is multilingual and ignores the language hint.
|
# backend is multilingual and ignores the language hint.
|
||||||
|> Enum.group_by(& &1.post.language)
|
|> Enum.group_by(& &1.post.language)
|
||||||
|> Enum.reduce({%{}, 0}, fn {language, group}, acc ->
|
|> Enum.reduce_while({%{}, 0}, fn {language, group}, acc ->
|
||||||
group
|
group
|
||||||
|> Enum.chunk_every(batch)
|
|> Enum.chunk_every(batch)
|
||||||
|> Enum.reduce(acc, fn chunk, {vectors, done} ->
|
|> Enum.reduce_while(acc, fn chunk, {vectors, done} ->
|
||||||
{:ok, chunk_vectors} = embed_many(Enum.map(chunk, & &1.raw_text), language)
|
case embed_many(Enum.map(chunk, & &1.raw_text), language) do
|
||||||
|
{:ok, chunk_vectors} ->
|
||||||
|
vectors =
|
||||||
|
chunk
|
||||||
|
|> Enum.zip(chunk_vectors)
|
||||||
|
|> Enum.reduce(vectors, fn {entry, vector}, acc ->
|
||||||
|
Map.put(acc, entry.post.id, vector)
|
||||||
|
end)
|
||||||
|
|
||||||
vectors =
|
done = done + length(chunk)
|
||||||
chunk
|
:ok = report_rebuild_progress(on_progress, done, total, "embedding entries")
|
||||||
|> Enum.zip(chunk_vectors)
|
{:cont, {vectors, done}}
|
||||||
|> Enum.reduce(vectors, fn {entry, vector}, acc ->
|
|
||||||
Map.put(acc, entry.post.id, vector)
|
|
||||||
end)
|
|
||||||
|
|
||||||
done = done + length(chunk)
|
{:error, reason} ->
|
||||||
:ok = report_rebuild_progress(on_progress, done, total, "embedding entries")
|
{:halt, {:error, reason}}
|
||||||
{vectors, done}
|
end
|
||||||
end)
|
end)
|
||||||
|
|> case do
|
||||||
|
{:error, reason} -> {:halt, {:error, reason}}
|
||||||
|
accumulator -> {:cont, accumulator}
|
||||||
|
end
|
||||||
end)
|
end)
|
||||||
|> elem(0)
|
|> case do
|
||||||
|
{:error, reason} -> {:error, reason}
|
||||||
|
{vectors, _done} -> {:ok, vectors}
|
||||||
|
end
|
||||||
end
|
end
|
||||||
|
|
||||||
defp batch_upsert_keys([]), do: :ok
|
defp batch_upsert_keys([]), do: :ok
|
||||||
@@ -337,15 +379,20 @@ defmodule BDS.Embeddings do
|
|||||||
)
|
)
|
||||||
|
|
||||||
existing_keys = preload_keys_by_post_id(project_id)
|
existing_keys = preload_keys_by_post_id(project_id)
|
||||||
rows = build_key_rows(posts, existing_keys, max_label_value(), nil)
|
|
||||||
|
|
||||||
batch_upsert_keys(rows)
|
case build_key_rows(posts, existing_keys, max_label_value(), nil, false) do
|
||||||
:ok = rebuild_snapshot(project_id)
|
{:ok, rows} ->
|
||||||
|
batch_upsert_keys(rows)
|
||||||
|
:ok = rebuild_snapshot(project_id)
|
||||||
|
|
||||||
indexed =
|
indexed =
|
||||||
Repo.all(from key in Key, where: key.project_id == ^project_id, select: key.post_id)
|
Repo.all(from key in Key, where: key.project_id == ^project_id, select: key.post_id)
|
||||||
|
|
||||||
{:ok, indexed}
|
{:ok, indexed}
|
||||||
|
|
||||||
|
{:error, _reason} = error ->
|
||||||
|
error
|
||||||
|
end
|
||||||
else
|
else
|
||||||
{:ok, []}
|
{:ok, []}
|
||||||
end
|
end
|
||||||
@@ -677,13 +724,16 @@ defmodule BDS.Embeddings do
|
|||||||
if function_exported?(backend, :embed_many, 2) do
|
if function_exported?(backend, :embed_many, 2) do
|
||||||
backend.embed_many(texts, language: language)
|
backend.embed_many(texts, language: language)
|
||||||
else
|
else
|
||||||
vectors =
|
Enum.reduce_while(texts, {:ok, []}, fn text, {:ok, acc} ->
|
||||||
Enum.map(texts, fn text ->
|
case backend.embed(text, language: language) do
|
||||||
{:ok, vector} = backend.embed(text, language: language)
|
{:ok, vector} -> {:cont, {:ok, [vector | acc]}}
|
||||||
vector
|
{:error, _reason} = error -> {:halt, error}
|
||||||
end)
|
end
|
||||||
|
end)
|
||||||
{:ok, vectors}
|
|> case do
|
||||||
|
{:ok, vectors} -> {:ok, Enum.reverse(vectors)}
|
||||||
|
{:error, _reason} = error -> error
|
||||||
|
end
|
||||||
end
|
end
|
||||||
end
|
end
|
||||||
|
|
||||||
|
|||||||
@@ -101,11 +101,18 @@ defmodule BDS.Maintenance.Repair do
|
|||||||
:file_to_db ->
|
:file_to_db ->
|
||||||
post_ids = Enum.map(items, &metadata_diff_item_entity_id/1)
|
post_ids = Enum.map(items, &metadata_diff_item_entity_id/1)
|
||||||
|
|
||||||
{:ok, repaired_post_ids} = Embeddings.repair_posts(project_id, post_ids)
|
# If the embedding model is unavailable, every item is reported as
|
||||||
repaired_post_ids = MapSet.new(repaired_post_ids)
|
# failed rather than crashing the repair task.
|
||||||
|
repaired =
|
||||||
|
case Embeddings.repair_posts(project_id, post_ids) do
|
||||||
|
{:ok, repaired_post_ids} -> repaired_post_ids
|
||||||
|
{:error, _reason} -> []
|
||||||
|
end
|
||||||
|
|
||||||
|
repaired_set = MapSet.new(repaired)
|
||||||
|
|
||||||
build_batch_repair_result(items, total, on_progress, fn item ->
|
build_batch_repair_result(items, total, on_progress, fn item ->
|
||||||
MapSet.member?(repaired_post_ids, metadata_diff_item_entity_id(item))
|
MapSet.member?(repaired_set, metadata_diff_item_entity_id(item))
|
||||||
end)
|
end)
|
||||||
|
|
||||||
:db_to_file ->
|
:db_to_file ->
|
||||||
|
|||||||
@@ -1,6 +1,8 @@
|
|||||||
defmodule BDS.Metadata do
|
defmodule BDS.Metadata do
|
||||||
@moduledoc false
|
@moduledoc false
|
||||||
|
|
||||||
|
require Logger
|
||||||
|
|
||||||
alias BDS.Embeddings
|
alias BDS.Embeddings
|
||||||
alias BDS.I18n
|
alias BDS.I18n
|
||||||
alias BDS.Persistence
|
alias BDS.Persistence
|
||||||
@@ -653,7 +655,17 @@ defmodule BDS.Metadata do
|
|||||||
) do
|
) do
|
||||||
if previous_state.semantic_similarity_enabled != true and
|
if previous_state.semantic_similarity_enabled != true and
|
||||||
project_metadata.semantic_similarity_enabled == true do
|
project_metadata.semantic_similarity_enabled == true do
|
||||||
{:ok, _indexed_post_ids} = Embeddings.index_unindexed(project_id)
|
# Backfill is best-effort: if the embedding model is unavailable, keep the
|
||||||
|
# setting enabled and log it rather than failing the metadata update.
|
||||||
|
case Embeddings.index_unindexed(project_id) do
|
||||||
|
{:ok, _indexed_post_ids} ->
|
||||||
|
:ok
|
||||||
|
|
||||||
|
{:error, reason} ->
|
||||||
|
Logger.warning(
|
||||||
|
"Embedding backfill skipped for project #{project_id}: #{inspect(reason)}"
|
||||||
|
)
|
||||||
|
end
|
||||||
end
|
end
|
||||||
|
|
||||||
result
|
result
|
||||||
|
|||||||
Binary file not shown.
File diff suppressed because one or more lines are too long
@@ -37,6 +37,40 @@ defmodule BDS.EmbeddingsTest do
|
|||||||
end
|
end
|
||||||
end
|
end
|
||||||
|
|
||||||
|
defmodule CountingBackend do
|
||||||
|
@behaviour BDS.Embeddings.Backend
|
||||||
|
|
||||||
|
@counter :embeddings_force_counter
|
||||||
|
|
||||||
|
@impl true
|
||||||
|
def model_info, do: %{model_id: "counting/multilingual-e5-small", dimensions: 384}
|
||||||
|
|
||||||
|
@impl true
|
||||||
|
def embed(text, opts) do
|
||||||
|
Agent.update(@counter, &(&1 + 1))
|
||||||
|
BDS.Embeddings.Backends.InApp.embed(text, opts)
|
||||||
|
end
|
||||||
|
|
||||||
|
@impl true
|
||||||
|
def embed_many(texts, opts) do
|
||||||
|
Agent.update(@counter, &(&1 + length(texts)))
|
||||||
|
BDS.Embeddings.Backends.InApp.embed_many(texts, opts)
|
||||||
|
end
|
||||||
|
end
|
||||||
|
|
||||||
|
defmodule FailingBackend do
|
||||||
|
@behaviour BDS.Embeddings.Backend
|
||||||
|
|
||||||
|
@impl true
|
||||||
|
def model_info, do: %{model_id: "failing/multilingual-e5-small", dimensions: 384}
|
||||||
|
|
||||||
|
@impl true
|
||||||
|
def embed(_text, _opts), do: {:error, :model_unavailable}
|
||||||
|
|
||||||
|
@impl true
|
||||||
|
def embed_many(_texts, _opts), do: {:error, :model_unavailable}
|
||||||
|
end
|
||||||
|
|
||||||
setup do
|
setup do
|
||||||
:ok = Ecto.Adapters.SQL.Sandbox.checkout(BDS.Repo)
|
:ok = Ecto.Adapters.SQL.Sandbox.checkout(BDS.Repo)
|
||||||
|
|
||||||
@@ -519,4 +553,75 @@ defmodule BDS.EmbeddingsTest do
|
|||||||
assert [%{post_id: post_id}] = similar
|
assert [%{post_id: post_id}] = similar
|
||||||
assert post_id == beta.id
|
assert post_id == beta.id
|
||||||
end
|
end
|
||||||
|
|
||||||
|
test "explicit rebuild re-embeds every post even when content is unchanged", %{project: project} do
|
||||||
|
assert {:ok, _metadata} =
|
||||||
|
BDS.Metadata.update_project_metadata(project.id, %{semantic_similarity_enabled: true})
|
||||||
|
|
||||||
|
{:ok, _agent} = Agent.start_link(fn -> 0 end, name: :embeddings_force_counter)
|
||||||
|
|
||||||
|
Application.put_env(:bds, :embeddings,
|
||||||
|
backend: CountingBackend,
|
||||||
|
model_id: "counting/multilingual-e5-small",
|
||||||
|
dimensions: 384,
|
||||||
|
batch_size: 16
|
||||||
|
)
|
||||||
|
|
||||||
|
for index <- 1..3 do
|
||||||
|
assert {:ok, post} =
|
||||||
|
BDS.Posts.create_post(%{
|
||||||
|
project_id: project.id,
|
||||||
|
title: "Force #{index}",
|
||||||
|
content: "space rocket orbit mission galaxy #{index}",
|
||||||
|
language: "en"
|
||||||
|
})
|
||||||
|
|
||||||
|
assert {:ok, _post} = BDS.Posts.publish_post(post.id)
|
||||||
|
end
|
||||||
|
|
||||||
|
# Ignore embeds triggered while creating/publishing.
|
||||||
|
Agent.update(:embeddings_force_counter, fn _count -> 0 end)
|
||||||
|
|
||||||
|
# index_unindexed honours the content_hash skip: nothing to re-embed.
|
||||||
|
assert {:ok, _indexed} = BDS.Embeddings.index_unindexed(project.id)
|
||||||
|
assert Agent.get(:embeddings_force_counter, & &1) == 0
|
||||||
|
|
||||||
|
# An explicit rebuild re-embeds all three regardless (ReindexAll).
|
||||||
|
assert {:ok, rebuilt} = BDS.Embeddings.reindex_all(project.id)
|
||||||
|
assert length(rebuilt) == 3
|
||||||
|
assert Agent.get(:embeddings_force_counter, & &1) == 3
|
||||||
|
end
|
||||||
|
|
||||||
|
test "embedding operations degrade gracefully when the model is unavailable", %{
|
||||||
|
project: project
|
||||||
|
} do
|
||||||
|
assert {:ok, _metadata} =
|
||||||
|
BDS.Metadata.update_project_metadata(project.id, %{semantic_similarity_enabled: true})
|
||||||
|
|
||||||
|
Application.put_env(:bds, :embeddings,
|
||||||
|
backend: FailingBackend,
|
||||||
|
model_id: "failing/multilingual-e5-small",
|
||||||
|
dimensions: 384
|
||||||
|
)
|
||||||
|
|
||||||
|
# Saving a post must not crash even though embedding fails; it is just left
|
||||||
|
# unindexed.
|
||||||
|
assert {:ok, post} =
|
||||||
|
BDS.Posts.create_post(%{
|
||||||
|
project_id: project.id,
|
||||||
|
title: "Offline",
|
||||||
|
content: "space rocket orbit mission galaxy",
|
||||||
|
language: "en"
|
||||||
|
})
|
||||||
|
|
||||||
|
assert {:ok, post} = BDS.Posts.publish_post(post.id)
|
||||||
|
assert BDS.Repo.get_by(BDS.Embeddings.Key, project_id: project.id, post_id: post.id) == nil
|
||||||
|
|
||||||
|
# Explicit (re)index operations surface a clean error instead of crashing.
|
||||||
|
assert {:error, :model_unavailable} = BDS.Embeddings.reindex_all(project.id)
|
||||||
|
assert {:error, :model_unavailable} = BDS.Embeddings.index_unindexed(project.id)
|
||||||
|
|
||||||
|
# Queries stay safe.
|
||||||
|
assert {:ok, []} = BDS.Embeddings.find_similar(post.id, 5)
|
||||||
|
end
|
||||||
end
|
end
|
||||||
|
|||||||
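The error handling in this diff replaces asserting matches such as `{:ok, vector} = ...` with `Enum.reduce_while/3`, so the first `{:error, reason}` stops the loop and is returned to the caller instead of raising a `MatchError`. A standalone sketch of that pattern, assuming a hypothetical `EmbedAllSketch` module and an `embed_fun` callback in place of the real backend:

```elixir
defmodule EmbedAllSketch do
  # Hypothetical helper: embeds each text in order and stops at the first
  # failure. `embed_fun` stands in for a backend's embed callback and must
  # return {:ok, vector} or {:error, reason}.
  def embed_all(texts, embed_fun) do
    texts
    |> Enum.reduce_while({:ok, []}, fn text, {:ok, acc} ->
      case embed_fun.(text) do
        {:ok, vector} -> {:cont, {:ok, [vector | acc]}}
        {:error, _reason} = error -> {:halt, error}
      end
    end)
    |> case do
      # Vectors were prepended, so restore input order.
      {:ok, vectors} -> {:ok, Enum.reverse(vectors)}
      {:error, _reason} = error -> error
    end
  end
end

# Toy "embedding": the text length as a one-element vector.
ok_backend = fn text -> {:ok, [String.length(text)]} end
failing_backend = fn _text -> {:error, :model_unavailable} end

IO.inspect(EmbedAllSketch.embed_all(["ab", "cde"], ok_backend))
IO.inspect(EmbedAllSketch.embed_all(["ab", "cde"], failing_backend))
```

Callers can then choose their own policy. In the diff, post saves log a warning and continue, while explicit rebuilds return the error so the task is reported as failed.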