Semantic Similarity in bDS
Surface thematically related posts as an impulse — "Have I written something similar?" — inspired by Luhmann's Zettelkasten. Cross-domain connections across 10k+ posts over 20 years are the point, not a flaw. The algorithm finds the surface. The human finds the depth.
Integration Point
InsertModal (src/renderer/components/InsertModal/InsertModal.tsx), link mode.
When the search field is empty (query.length < 2), instead of showing "type at least 2 characters", show the 3–5 posts most semantically similar to the post being edited. These serve as default suggestions: "posts you might want to link to."
Requires threading currentPostId from Editor.tsx → InsertModal (Editor.tsx currently passes only currentPostTags / currentPostCategories).
Stack
| Purpose | Library | npm | Notes |
|---|---|---|---|
| Embeddings | Hugging Face Transformers.js | @huggingface/transformers | ONNX, local, no API key |
| Vector index | USearch | usearch | HNSW, native C++ via N-API, prebuilt binaries |
Embedding model: Xenova/all-MiniLM-L6-v2 — 384 dimensions, ~90 MB on disk, ~150–200 MB RAM, ~100ms/post inference, handles mixed DE/EN.
Why USearch over alternatives:
- sqlite-vec: requires loadExtension() on the SQLite driver; bDS uses @libsql/client, which doesn't expose it. Eliminated.
- hnswlib-node: no prebuilt binaries, requires a node-gyp compile. Last published 2 years ago. Risk with Electron packaging.
- vectra: pure JS, zero build issues, but JSON storage (~30 MB for 10k posts). Acceptable fallback.
- Brute-force in JS: works at 10k (~15ms for the math) but requires loading all embeddings from the DB first. DB read overhead with the @libsql/client FFI is unknown and potentially dominant.
- USearch: prebuilt binaries via prebuildify (matches the sharp, @libsql/client pattern), actively maintained, HNSW with SIMD, <1ms queries, binary persistence (~6 MB for 10k×384).
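For comparison, the brute-force alternative from the list above can be sketched in a few lines. This is a minimal sketch, assuming all embeddings are already in memory as L2-normalized Float32Array vectors; the names ScoredPost, dot, and topKSimilar are hypothetical and not part of bDS.

```typescript
// Hypothetical result shape; not part of the bDS codebase.
interface ScoredPost {
  postId: string;
  score: number; // cosine similarity, higher means more similar
}

// Dot product of two equal-length vectors. For L2-normalized vectors
// this equals their cosine similarity.
function dot(a: Float32Array, b: Float32Array): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Score every stored vector against the query post and keep the k best.
// Assumes all vectors are L2-normalized and share the same dimension.
function topKSimilar(
  queryId: string,
  vectors: Map<string, Float32Array>,
  k = 5,
): ScoredPost[] {
  const query = vectors.get(queryId);
  if (!query) return []; // post has no embedding yet
  const scored: ScoredPost[] = [];
  for (const [postId, vec] of vectors) {
    if (postId === queryId) continue; // never suggest the post itself
    scored.push({ postId, score: dot(query, vec) });
  }
  scored.sort((x, y) => y.score - x.score);
  return scored.slice(0, k);
}
```

The scan itself is linear in the number of posts, which is why the math stays cheap at 10k; the open question noted above is the cost of getting the vectors out of the database in the first place.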
USearch specifics:
- Keys are BigUint64Array: need a Map<bigint, string> (numeric label → post UUID) persisted alongside the index
- index.load() loads everything into RAM (~6 MB). index.save() is a full rewrite. Fine for this scale.
- No incremental flush / WAL: acceptable since mutations are one-at-a-time post edits
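The label-to-UUID mapping described above could look roughly like this. It is a sketch under assumptions: the class name KeyMap and its methods are hypothetical, and the serialized form follows the { [numericLabel]: postId } shape by writing each bigint label as a decimal string, since JSON has no bigint type.

```typescript
// Hypothetical helper; bDS does not ship this class.
class KeyMap {
  private readonly labelToPost = new Map<bigint, string>();
  private readonly postToLabel = new Map<string, bigint>();
  private nextLabel = BigInt(0);

  // Return the existing label for a post, or allocate the next free one.
  labelFor(postId: string): bigint {
    const existing = this.postToLabel.get(postId);
    if (existing !== undefined) return existing;
    const label = this.nextLabel;
    this.nextLabel = label + BigInt(1);
    this.labelToPost.set(label, postId);
    this.postToLabel.set(postId, label);
    return label;
  }

  // Translate a label returned by an index query back to a post UUID.
  postFor(label: bigint): string | undefined {
    return this.labelToPost.get(label);
  }

  // Forget a post; returns the label that should also be removed from the index.
  remove(postId: string): bigint | undefined {
    const label = this.postToLabel.get(postId);
    if (label === undefined) return undefined;
    this.postToLabel.delete(postId);
    this.labelToPost.delete(label);
    return label;
  }

  // Labels become decimal strings because JSON cannot represent bigint.
  serialize(): string {
    const obj: Record<string, string> = {};
    for (const [label, postId] of this.labelToPost) obj[label.toString()] = postId;
    return JSON.stringify(obj);
  }

  static deserialize(json: string): KeyMap {
    const map = new KeyMap();
    const obj = JSON.parse(json) as Record<string, string>;
    for (const [key, postId] of Object.entries(obj)) {
      const label = BigInt(key);
      map.labelToPost.set(label, postId);
      map.postToLabel.set(postId, label);
      // Continue allocating above the highest label seen on disk.
      if (label >= map.nextLabel) map.nextLabel = label + BigInt(1);
    }
    return map;
  }
}
```

Keeping both directions in memory makes updates cheap: re-embedding an edited post reuses its existing label instead of growing the index.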
Architecture
Files on disk
<project-dir>/.bds/
embeddings.usearch # USearch binary index
embeddings-keys.json # { [numericLabel]: postId } mapping
Engine: EmbeddingEngine (src/main/engine/EmbeddingEngine.ts)
Responsibilities:
- Load/save USearch index + key map on startup/shutdown
- Embed post content via @huggingface/transformers
- Add/update/remove embeddings when posts change
- Query: given a post ID, return top-k similar post IDs with distances
Key interface:
class EmbeddingEngine {
async initialize(): Promise<void> // load index + model
async embedPost(postId: string, content: string): Promise<void>
async removePost(postId: string): Promise<void>
async findSimilar(postId: string, k?: number): Promise<SimilarPost[]>
async getIndexingProgress(): Promise<{ indexed: number; total: number }>
async save(): Promise<void>
}
IPC
embeddings:findSimilar(postId: string, k?: number) → SimilarPost[]
embeddings:getProgress() → { indexed: number; total: number }
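One way to register the two channels above is sketched below. Electron's ipcMain.handle(channel, listener) is the real API being mirrored; the IpcMainLike interface stands in for it so the sketch stays self-contained. The SimilarPost shape, the EmbeddingQueries interface, the function name registerEmbeddingIpc, and the default of k = 5 are all assumptions.

```typescript
// Assumed result shape: the document only says "post IDs with distances".
interface SimilarPost {
  postId: string;
  distance: number;
}

// The subset of EmbeddingEngine that the IPC layer needs.
interface EmbeddingQueries {
  findSimilar(postId: string, k?: number): Promise<SimilarPost[]>;
  getIndexingProgress(): Promise<{ indexed: number; total: number }>;
}

// Minimal stand-in for Electron's ipcMain, limited to handle().
interface IpcMainLike {
  handle(
    channel: string,
    listener: (event: unknown, ...args: any[]) => unknown,
  ): void;
}

// Hypothetical registration helper, called once during main-process startup.
function registerEmbeddingIpc(ipc: IpcMainLike, engine: EmbeddingQueries): void {
  ipc.handle("embeddings:findSimilar", (_event, postId: string, k?: number) =>
    engine.findSimilar(postId, k ?? 5), // default k is an assumption
  );
  ipc.handle("embeddings:getProgress", () => engine.getIndexingProgress());
}
```

In the real app the first argument would be Electron's ipcMain, and the renderer would reach these channels through ipcRenderer.invoke or whatever preload bridge bDS already uses.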
Hook into existing post lifecycle
Post create/update/delete events already exist in PostEngine. On post content change → call embeddingEngine.embedPost(). On delete → call embeddingEngine.removePost(). Save index after each mutation.
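The hook described above can be reduced to a single dispatch function. This is a sketch, not the PostEngine API: the PostChange union, the EmbeddingWriter interface, and the function name applyPostChange are hypothetical, and how PostEngine actually emits its events is not shown here.

```typescript
// Hypothetical event payloads; the real PostEngine events may differ.
type PostChange =
  | { kind: "created" | "updated"; postId: string; content: string }
  | { kind: "deleted"; postId: string };

// The subset of EmbeddingEngine that lifecycle wiring needs.
interface EmbeddingWriter {
  embedPost(postId: string, content: string): Promise<void>;
  removePost(postId: string): Promise<void>;
  save(): Promise<void>;
}

// Apply one post change to the index, then persist it.
async function applyPostChange(
  engine: EmbeddingWriter,
  change: PostChange,
): Promise<void> {
  if (change.kind === "deleted") {
    await engine.removePost(change.postId);
  } else {
    // Create and update take the same path: (re-)embed the current content.
    await engine.embedPost(change.postId, change.content);
  }
  // Saving after every mutation is a full rewrite, acceptable at this scale.
  await engine.save();
}
```

Because each call awaits the save before returning, a caller that awaits applyPostChange for one event before dispatching the next never has two saves overlapping.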
Initial indexing (10k+ posts)
- ~100ms per post × 10k = ~17 minutes one-time background job
- Must run as a low-priority background task after app startup
- Emit progress events so UI can show "Indexing 3,421 / 10,247…"
- On git sync to new machine, file watchers fire for all posts → triggers full reindex automatically
- Model download (~90 MB) on first run — needs progress indicator or opt-in preference
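The background job outlined above might be structured as follows. This is a sketch under assumptions: PostSource, IndexProgress, and indexInBackground are hypothetical names, the embed callback stands in for EmbeddingEngine.embedPost, and "low priority" is approximated by yielding to the event loop after each post.

```typescript
// Hypothetical minimal view of a post for indexing purposes.
interface PostSource {
  id: string;
  content: string;
}

interface IndexProgress {
  indexed: number;
  total: number;
}

// Embed every post that is not in the index yet, one at a time,
// reporting progress after each post. Returns the number of posts embedded.
async function indexInBackground(
  posts: PostSource[],
  indexedIds: Set<string>,
  embed: (postId: string, content: string) => Promise<void>,
  onProgress: (progress: IndexProgress) => void,
): Promise<number> {
  // Diff: only posts without an embedding need work.
  const pending = posts.filter((post) => !indexedIds.has(post.id));
  let indexed = posts.length - pending.length;
  for (const post of pending) {
    await embed(post.id, post.content);
    indexed++;
    onProgress({ indexed, total: posts.length });
    // Yield so IPC handling and other main-process work are not starved.
    await new Promise<void>((resolve) => setTimeout(resolve, 0));
  }
  return pending.length;
}
```

Because the job starts from a diff rather than from zero, an interrupted run resumes where it left off on the next startup, provided the index was saved along the way.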
UI Changes
InsertModal (link mode, internal tab)
When query.length < 2 and currentPostId is set:
- Call embeddings:findSimilar(currentPostId, 5) on mount
- Show results in the same result list format, with a subtle header like "Related posts"
- Clicking a suggestion works identically to a search result — inserts the link
When query.length >= 2: existing search behavior, unchanged.
Fallback: if embeddings aren't ready (indexing in progress, feature disabled), show the existing "type at least 2 characters" message.
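The three display cases above reduce to one small decision function. This is a sketch: the name suggestionMode, the SuggestionMode labels, and the embeddingsReady flag are hypothetical, and how the modal learns that embeddings are ready is left open.

```typescript
// Which content the link-mode result list should show.
type SuggestionMode = "search" | "related" | "hint";

function suggestionMode(
  query: string,
  currentPostId: string | undefined,
  embeddingsReady: boolean,
): SuggestionMode {
  // Two or more characters: existing search behavior, unchanged.
  if (query.length >= 2) return "search";
  // Short query plus a known post and a ready index: "Related posts".
  if (currentPostId && embeddingsReady) return "related";
  // Otherwise fall back to the "type at least 2 characters" message.
  return "hint";
}
```

Keeping the decision in a pure function makes the fallback rule easy to unit test without mounting the modal.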
Implementation Steps
- Test + implement EmbeddingEngine: model loading, embed, add/remove/query against USearch index, save/load persistence
- SQLite key map: persist the bigint → postId mapping (simple JSON file or a small Drizzle table)
- Wire into post lifecycle: hook create/update/delete → embedding updates
- Background indexer: on startup, diff indexed vs. existing posts, queue unindexed for background embedding with progress events
- IPC endpoints: findSimilar, getProgress
- InsertModal integration: add currentPostId prop, fetch similar on mount, render as default suggestions
- Settings: opt-in preference to enable semantic similarity (triggers model download + initial index)
- I18n: all new UI strings through locale files
Constraints
- Feature must be opt-in (model download + 17 min indexing is not a silent default)
- No external API calls — fully local
- Model cached in ~/.cache/huggingface/, index in the project's .bds/ directory
- The .bds/ directory inside the project directory must be added to .gitignore (the cache is kept local, not versioned)
- Total added footprint: ~140 MB on disk (onnxruntime-node ~50 MB + model ~90 MB), ~200 MB RAM at runtime for model + index