Semantic Similarity in bDS

Surface thematically related posts as an impulse — "Have I written something similar?" — inspired by Luhmann's Zettelkasten. Cross-domain connections across 10k+ posts over 20 years are the point, not a flaw. The algorithm finds the surface. The human finds the depth.


Integration Point

InsertModal (src/renderer/components/InsertModal/InsertModal.tsx), link mode.

When the search field is empty or too short (query.length < 2), instead of showing "type at least 2 characters", show 3 to 5 posts that are semantically similar to the post being edited. These are default suggestions: "posts you might want to link to."

Requires threading currentPostId from Editor.tsx → InsertModal (Editor.tsx currently passes only currentPostTags / currentPostCategories).


Stack

| Purpose | Library | npm | Notes |
| --- | --- | --- | --- |
| Embeddings | Hugging Face Transformers.js | @huggingface/transformers | ONNX, local, no API key |
| Vector index | USearch | usearch | HNSW, native C++ via N-API, prebuilt binaries |

Embedding model: Xenova/all-MiniLM-L6-v2 (384 dimensions, ~90 MB on disk, ~150–200 MB RAM, ~100 ms/post inference, handles mixed DE/EN).

Why USearch over alternatives:

  • sqlite-vec — requires loadExtension() on the SQLite driver; bDS uses @libsql/client which doesn't expose it. Eliminated.
  • hnswlib-node — no prebuilt binaries, requires node-gyp compile. Last published 2 years ago. Risk with Electron packaging.
  • vectra — pure JS, zero build issues, but JSON storage (~30 MB for 10k posts). Acceptable fallback.
  • Brute-force in JS — works at 10k (~15ms for the math) but requires loading all embeddings from DB first. DB read overhead with @libsql/client FFI is unknown and potentially dominant.
  • USearch — prebuilt binaries via prebuildify (matches sharp, @libsql/client pattern), actively maintained, HNSW with SIMD, <1ms queries, binary persistence (~6 MB for 10k×384).
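
For comparison, the brute-force alternative from the list above can be sketched in a few lines. This is only an illustration of the baseline, not the approach chosen here. It assumes all embeddings are already in memory and L2-normalised, so that the dot product equals cosine similarity; the names Neighbor and bruteForceTopK are hypothetical.

```typescript
// Brute-force nearest neighbours over in-memory embeddings.
// Assumption: every vector is L2-normalised, so dot product == cosine similarity.

type Neighbor = { postId: string; similarity: number }

function dot(a: Float32Array, b: Float32Array): number {
  let sum = 0
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i]
  return sum
}

function bruteForceTopK(
  queryId: string,
  vectors: Map<string, Float32Array>,
  k = 5,
): Neighbor[] {
  const query = vectors.get(queryId)
  if (!query) return [] // post has no embedding yet

  const scored: Neighbor[] = []
  for (const [postId, vec] of vectors) {
    if (postId === queryId) continue // never suggest the post being edited
    scored.push({ postId, similarity: dot(query, vec) })
  }

  // Highest similarity first; a full sort is fine at roughly 10k entries.
  scored.sort((x, y) => y.similarity - x.similarity)
  return scored.slice(0, k)
}
```

The scoring loop is cheap. As noted above, the open question for this variant is the cost of reading all vectors out of the database first.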

USearch specifics:

  • Keys are BigUint64Array — need a Map<bigint, string> (numeric label → post UUID) persisted alongside the index
  • index.load() loads everything into RAM (~6 MB). index.save() is a full rewrite. Fine for this scale.
  • No incremental flush / WAL — acceptable since mutations are one-at-a-time post edits
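
The label ↔ UUID mapping described above could look roughly like the following. This is a minimal sketch: the class name KeyMap, its methods, and the choice to write labels as decimal strings are assumptions, not part of USearch or of the existing codebase.

```typescript
// Bidirectional mapping between numeric index labels and post UUIDs.
// Hypothetical helper; the JSON layout mirrors { [numericLabel]: postId }.

class KeyMap {
  private labelToPost = new Map<bigint, string>()
  private postToLabel = new Map<string, bigint>()
  private nextLabel = BigInt(0)

  // Return the existing label for a post, or allocate the next free one.
  labelFor(postId: string): bigint {
    const existing = this.postToLabel.get(postId)
    if (existing !== undefined) return existing
    const label = this.nextLabel
    this.nextLabel = label + BigInt(1)
    this.labelToPost.set(label, postId)
    this.postToLabel.set(postId, label)
    return label
  }

  postFor(label: bigint): string | undefined {
    return this.labelToPost.get(label)
  }

  // Forget a post; returns the label so the caller can remove it from the index.
  remove(postId: string): bigint | undefined {
    const label = this.postToLabel.get(postId)
    if (label === undefined) return undefined
    this.postToLabel.delete(postId)
    this.labelToPost.delete(label)
    return label
  }

  // JSON has no bigint type, so labels are written as decimal strings.
  toJSON(): Record<string, string> {
    const out: Record<string, string> = {}
    for (const [label, postId] of this.labelToPost) out[label.toString()] = postId
    return out
  }

  static fromJSON(data: Record<string, string>): KeyMap {
    const map = new KeyMap()
    for (const [key, postId] of Object.entries(data)) {
      const label = BigInt(key)
      map.labelToPost.set(label, postId)
      map.postToLabel.set(postId, label)
      if (label >= map.nextLabel) map.nextLabel = label + BigInt(1)
    }
    return map
  }
}
```

Because toJSON is defined, JSON.stringify(keyMap) produces the file contents directly, and KeyMap.fromJSON(JSON.parse(text)) restores them.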

Architecture

Files on disk

<project-dir>/.bds/
  embeddings.usearch       # USearch binary index
  embeddings-keys.json     # { [numericLabel]: postId } mapping

Engine: EmbeddingEngine (src/main/engine/EmbeddingEngine.ts)

Responsibilities:

  • Load/save USearch index + key map on startup/shutdown
  • Embed post content via @huggingface/transformers
  • Add/update/remove embeddings when posts change
  • Query: given a post ID, return top-k similar post IDs with distances

Key interface:

class EmbeddingEngine {
  async initialize(): Promise<void>           // load index + model
  async embedPost(postId: string, content: string): Promise<void>
  async removePost(postId: string): Promise<void>
  async findSimilar(postId: string, k?: number): Promise<SimilarPost[]>
  async getIndexingProgress(): Promise<{ indexed: number; total: number }>
  async save(): Promise<void>
}

IPC

embeddings:findSimilar(postId: string, k?: number) → SimilarPost[]
embeddings:getProgress() → { indexed: number; total: number }
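
A rough sketch of how the two channels might be registered in the main process. To keep the example free of an Electron import, it uses a structural IpcLike interface that Electron's ipcMain.handle(channel, listener) should fit. The SimilarPost shape, the helper name registerEmbeddingHandlers, and the upper bound on k are assumptions for illustration.

```typescript
// Assumed result shape: "post IDs with distances".
type SimilarPost = { postId: string; distance: number }

// Structural stand-in for Electron's ipcMain (only the method used here).
interface IpcLike {
  handle(channel: string, listener: (event: unknown, ...args: any[]) => unknown): void
}

// Subset of EmbeddingEngine needed by the IPC layer.
interface EmbeddingQueries {
  findSimilar(postId: string, k?: number): Promise<SimilarPost[]>
  getIndexingProgress(): Promise<{ indexed: number; total: number }>
}

const DEFAULT_K = 5
const MAX_K = 50 // assumed cap so a renderer bug cannot request the whole index

function registerEmbeddingHandlers(ipc: IpcLike, engine: EmbeddingQueries): void {
  ipc.handle('embeddings:findSimilar', (_event, postId: unknown, k?: unknown) => {
    // Validate renderer input before touching the index.
    if (typeof postId !== 'string' || postId.length === 0) return []
    const limit =
      typeof k === 'number' && Number.isInteger(k) && k > 0 ? Math.min(k, MAX_K) : DEFAULT_K
    return engine.findSimilar(postId, limit)
  })

  ipc.handle('embeddings:getProgress', () => engine.getIndexingProgress())
}
```

In the app this would presumably be called once at startup as registerEmbeddingHandlers(ipcMain, embeddingEngine).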

Hook into existing post lifecycle

Post create/update/delete events already exist in PostEngine. On post content change → call embeddingEngine.embedPost(). On delete → call embeddingEngine.removePost(). Save index after each mutation.
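
One way to express this hook, as a sketch. The event shape (PostEvent, including previousContent) is an assumption, since the actual PostEngine events are not shown here, and skipping re-embedding when content is unchanged is an optional optimisation rather than something the design requires.

```typescript
// Assumed event shape; PostEngine's real events may carry different fields.
type PostEvent =
  | { type: 'created'; postId: string; content: string }
  | { type: 'updated'; postId: string; content: string; previousContent: string }
  | { type: 'deleted'; postId: string }

type EmbeddingAction =
  | { op: 'embed'; postId: string; content: string }
  | { op: 'remove'; postId: string }

// Subset of EmbeddingEngine used by the hook.
interface EmbeddingWriter {
  embedPost(postId: string, content: string): Promise<void>
  removePost(postId: string): Promise<void>
  save(): Promise<void>
}

// Decide what the index needs; null means there is nothing to do.
function planEmbeddingAction(event: PostEvent): EmbeddingAction | null {
  switch (event.type) {
    case 'created':
      return { op: 'embed', postId: event.postId, content: event.content }
    case 'updated':
      // Skip inference when only metadata (tags, title, ...) changed.
      return event.previousContent === event.content
        ? null
        : { op: 'embed', postId: event.postId, content: event.content }
    case 'deleted':
      return { op: 'remove', postId: event.postId }
  }
}

// Apply one mutation, then persist the index after it.
async function applyEmbeddingAction(
  engine: EmbeddingWriter,
  action: EmbeddingAction,
): Promise<void> {
  if (action.op === 'embed') await engine.embedPost(action.postId, action.content)
  else await engine.removePost(action.postId)
  await engine.save()
}
```

Separating the decision (planEmbeddingAction) from the side effect (applyEmbeddingAction) keeps the decision logic testable without loading the model.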

Initial indexing (10k+ posts)

  • ~100ms per post × 10k = ~17 minutes one-time background job
  • Must run as a low-priority background task after app startup
  • Emit progress events so UI can show "Indexing 3,421 / 10,247…"
  • On git sync to new machine, file watchers fire for all posts → triggers full reindex automatically
  • Model download (~90 MB) on first run — needs progress indicator or opt-in preference
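
The points above can be sketched as a few small helpers. All function names are hypothetical, and the embed callback stands in for EmbeddingEngine.embedPost. Note that yielding between posts only helps responsiveness between inferences; it does not make a single ~100 ms inference non-blocking if it runs on the main thread.

```typescript
// Posts that exist but have no embedding yet.
function findUnindexed(allPostIds: string[], indexedIds: Set<string>): string[] {
  return allPostIds.filter((id) => !indexedIds.has(id))
}

// Rough wall-clock estimate in minutes: count × per-post latency.
function estimateMinutes(postCount: number, msPerPost = 100): number {
  return (postCount * msPerPost) / 60_000
}

// Status line in the "Indexing 3,421 / 10,247…" style.
function formatProgress(indexed: number, total: number): string {
  const fmt = (n: number) => n.toLocaleString('en-US')
  return `Indexing ${fmt(indexed)} / ${fmt(total)}…`
}

// Embed the queue one post at a time, yielding to the event loop between posts.
async function indexInBackground(
  queue: string[],
  embed: (postId: string) => Promise<void>,
  onProgress: (done: number, total: number) => void,
): Promise<void> {
  for (let i = 0; i < queue.length; i++) {
    await embed(queue[i])
    onProgress(i + 1, queue.length)
    // Give pending UI and IPC work a chance to run before the next post.
    await new Promise<void>((resolve) => setTimeout(resolve, 0))
  }
}
```

With the default of 100 ms per post, estimateMinutes(10_000) evaluates to about 16.7, which is where the ~17 minute figure comes from.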

UI Changes

When query.length < 2 and currentPostId is set:

  1. Call embeddings:findSimilar(currentPostId, 5) on mount
  2. Show results in the same result list format, with a subtle header like "Related posts"
  3. Clicking a suggestion works identically to a search result — inserts the link

When query.length >= 2: existing search behavior, unchanged.

Fallback: if embeddings aren't ready (indexing in progress, feature disabled), show the existing "type at least 2 characters" message.
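
The three cases above reduce to one small decision function. The sketch below is illustrative: LinkModeView, LinkModeInput, and resolveLinkModeView are hypothetical names, and embeddingsReady is an assumed flag that is false while indexing is in progress or the feature is disabled.

```typescript
// What the link-mode result area should display.
type LinkModeView =
  | { kind: 'search'; query: string } // existing search behavior
  | { kind: 'related'; postId: string; k: number } // "Related posts" suggestions
  | { kind: 'hint' } // existing "type at least 2 characters" message

interface LinkModeInput {
  query: string
  currentPostId?: string
  embeddingsReady: boolean // assumed flag: index loaded and feature enabled
}

const RELATED_COUNT = 5

function resolveLinkModeView(input: LinkModeInput): LinkModeView {
  const { query, currentPostId, embeddingsReady } = input

  // Two or more characters: behave exactly as before.
  if (query.length >= 2) return { kind: 'search', query }

  // Short or empty query: suggest related posts when that is possible.
  if (currentPostId && embeddingsReady) {
    return { kind: 'related', postId: currentPostId, k: RELATED_COUNT }
  }

  // Fallback: keep the existing hint.
  return { kind: 'hint' }
}
```

A 'related' result would then trigger the embeddings:findSimilar call with the given postId and k.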


Implementation Steps

  1. Test + implement EmbeddingEngine — model loading, embed, add/remove/query against USearch index, save/load persistence
  2. Key map: persist the bigint → postId mapping (the embeddings-keys.json file listed under "Files on disk", or alternatively a small Drizzle table)
  3. Wire into post lifecycle — hook create/update/delete → embedding updates
  4. Background indexer — on startup, diff indexed vs. existing posts, queue unindexed for background embedding with progress events
  5. IPC endpoints: findSimilar, getProgress
  6. InsertModal integration — add currentPostId prop, fetch similar on mount, render as default suggestions
  7. Settings — opt-in preference to enable semantic similarity (triggers model download + initial index)
  8. I18n — all new UI strings through locale files

Constraints

  • Feature must be opt-in (model download + 17 min indexing is not a silent default)
  • No external API calls — fully local
  • Model cached in ~/.cache/huggingface/, index in project .bds/ directory
  • The .bds/ directory inside the project directory must be added to .gitignore (the cache stays local and is not versioned)
  • Total added footprint: ~140 MB on disk (onnxruntime-node ~50 MB + model ~90 MB), ~200 MB RAM at runtime for model + index