bDS/BDS_SEMANTIC_SIMILARITY.md
2026-02-28 18:41:55 +01:00


Semantic Similarity in bDS: A Zettelkasten-inspired Feature

Concept

The goal is to surface thematically related posts when writing or viewing a post in bDS — not as an authoritative classification, but as an impulse: "Have I written something similar before? Where could I explore further?"

This is inspired by Niklas Luhmann's Zettelkasten method, where the system would surprise its author with unexpected connections. The key difference from Luhmann's academic use case: bDS serves a personal epistemic network across diverse topics (programming, board games, social topics, professional interests), not a focused research domain. Cross-domain connections are a feature, not a flaw.

The algorithm finds the surface. The human finds the depth.


Technical Approach

Text search (BM25, LIKE queries) only finds shared words. Semantic similarity finds shared meaning — a post about emergent structures in game design can surface next to one about software architecture, even with zero word overlap.

Embeddings

Text is converted into high-dimensional vectors. Similarity becomes geometric proximity. Small, distilled models can do this efficiently without requiring a GPU or external API.
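To make "geometric proximity" concrete, here is a minimal sketch of cosine similarity, the measure typically used to compare embeddings. The function name and the toy 3-dimensional vectors are illustrative assumptions; real embeddings from the model below have 384 dimensions.

```javascript
// Cosine similarity between two equal-length numeric vectors.
// Returns a value in [-1, 1]; higher means the vectors point in a more similar direction.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`)
  }
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  if (normA === 0 || normB === 0) return 0  // a zero vector has no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Toy vectors standing in for real embeddings (hypothetical values).
const gameDesign = [0.9, 0.1, 0.3]
const architecture = [0.8, 0.2, 0.4]
const recipe = [0.0, 1.0, 0.0]

// The two "structure" posts are closer to each other than to the unrelated one.
console.log(cosineSimilarity(gameDesign, architecture) > cosineSimilarity(gameDesign, recipe))  // true
```

When vectors are normalized to unit length, cosine similarity reduces to a plain dot product, which is why many embedding pipelines normalize their output.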

Recommended model: Xenova/all-MiniLM-L6-v2

  • ~90 MB on disk (ONNX format)
  • ~150–200 MB RAM at runtime
  • 50–150 ms inference time per post on CPU
  • 384-dimensional vectors
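  • Trained mainly on English text, so quality on German posts may be weaker; for mixed German/English content, a multilingual 384-dimensional model such as Xenova/paraphrase-multilingual-MiniLM-L12-v2 is worth comparing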
  • No API key, fully local

Lighter alternative: all-MiniLM-L4-v2 (~50 MB, minimal quality difference for this use case)

Node.js library: @huggingface/transformers

  • Runs ONNX models natively in Node.js
  • Downloads and caches the model on first run (by default in a .cache directory inside the installed package; the location can be changed via env.cacheDir)
  • Subsequent runs load from local cache
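A minimal sketch of how the embedText helper used later in this document might look with Transformers.js. The names createEmbedder, loadPipeline, and defaultLoadPipeline are hypothetical; the pipeline task ('feature-extraction') and the pooling and normalize options are part of the library. The loader is injectable so the model is only loaded on first use and can be stubbed in tests.

```javascript
// Default loader: imports Transformers.js lazily and builds the pipeline.
// The model is downloaded on first use and read from the local cache afterwards.
async function defaultLoadPipeline() {
  const { pipeline } = await import('@huggingface/transformers')
  return pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2')
}

// Returns an embedText(text) function that shares a single pipeline instance.
// `loadPipeline` is a hypothetical injection point, mainly useful for tests.
function createEmbedder(loadPipeline = defaultLoadPipeline) {
  let extractorPromise = null  // shared so concurrent calls load the model only once

  return async function embedText(text) {
    if (!extractorPromise) extractorPromise = loadPipeline()
    const extractor = await extractorPromise
    // Mean pooling plus L2 normalization yields one unit-length vector per input text.
    const output = await extractor(text, { pooling: 'mean', normalize: true })
    return Float32Array.from(output.data)
  }
}
```

Calling `const embedText = createEmbedder()` once at startup and reusing the returned function avoids reloading the model for every post.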

Architecture

Storage: sqlite-vec

Since bDS already uses SQLite (via Drizzle ORM) as a caching layer, the natural fit is sqlite-vec — a SQLite extension for vector search by Alex Garcia (the actively maintained successor to sqlite-vss).

Node.js integration:

import Database from 'better-sqlite3'
import * as sqliteVec from 'sqlite-vec'
import { drizzle } from 'drizzle-orm/better-sqlite3'

const sqlite = new Database('bds.sqlite')
sqliteVec.load(sqlite)  // must happen before Drizzle init
const db = drizzle(sqlite)

Schema (raw SQL migration, outside Drizzle schema — virtual tables are not supported by Drizzle):

CREATE VIRTUAL TABLE IF NOT EXISTS post_embeddings
USING vec0(
  post_id TEXT PRIMARY KEY,
  embedding FLOAT[384]
);

Similarity query:

SELECT p.id, p.title, e.distance
FROM post_embeddings e
JOIN posts p ON e.post_id = p.id
WHERE e.embedding MATCH ?
  AND k = 5
ORDER BY distance;
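The query above can be wrapped in a small helper. This is a sketch under assumptions: findRelatedPosts is a hypothetical name, `sqlite` is expected to be the better-sqlite3 handle with sqlite-vec already loaded, and `queryVector` is a Float32Array. Because the post being viewed is normally its own nearest neighbour, the helper asks for one extra result and filters the post itself out.

```javascript
// Returns up to `limit` posts closest to `queryVector`, excluding the post itself.
// `sqlite` is assumed to be a better-sqlite3 Database with sqlite-vec loaded.
function findRelatedPosts(sqlite, postId, queryVector, limit = 5) {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError('limit must be a positive integer')
  }
  // Ask for one extra neighbour: the post's own embedding is normally the top hit.
  const k = limit + 1
  // k is inlined after validation so it reaches the virtual table as an integer literal.
  const statement = sqlite.prepare(`
    SELECT p.id, p.title, e.distance
    FROM post_embeddings e
    JOIN posts p ON e.post_id = p.id
    WHERE e.embedding MATCH ?
      AND k = ${k}
    ORDER BY e.distance
  `)
  // The query vector is passed as raw float32 bytes.
  const blob = Buffer.from(queryVector.buffer, queryVector.byteOffset, queryVector.byteLength)
  const rows = statement.all(blob)

  return rows.filter((row) => row.id !== postId).slice(0, limit)
}
```

When called for a draft that has no stored embedding yet, nothing is filtered out and the slice simply drops the extra row.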

Integration with bDS Hooks

bDS already has file system hooks that fire when posts change (triggered by external edits, e.g. via git sync across machines). The embedding step fits naturally into the existing cache-update hook:

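import { sql } from 'drizzle-orm'

async function onPostChanged(filePath) {
  const post = parseMarkdownFile(filePath)
  const embedding = await embedText(post.content)  // ~100 ms on CPU, done before the transaction starts

  // Drizzle's better-sqlite3 transactions are synchronous and run the callback immediately
  db.transaction((tx) => {
    updatePostCache(post)           // existing cache logic
    // Delete-then-insert, rather than relying on conflict handling in the virtual table
    tx.run(sql`DELETE FROM post_embeddings WHERE post_id = ${post.id}`)
    tx.run(sql`
      INSERT INTO post_embeddings(post_id, embedding)
      VALUES (${post.id}, ${serializeVector(embedding)})
    `)
  })
}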
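The hook relies on a serializeVector helper that the document does not define. One possible sketch, assuming the embedding arrives as a Float32Array or plain number array: sqlite-vec accepts vectors as compact float32 BLOBs, so the helper packs the values into a Buffer. EMBEDDING_DIMENSIONS and deserializeVector are hypothetical additions for illustration.

```javascript
// Must match the FLOAT[384] column declared in the post_embeddings schema.
const EMBEDDING_DIMENSIONS = 384

// Packs a vector into raw float32 bytes suitable for binding as a BLOB.
function serializeVector(vector, dimensions = EMBEDDING_DIMENSIONS) {
  if (vector.length !== dimensions) {
    throw new Error(`Expected ${dimensions} dimensions, got ${vector.length}`)
  }
  const floats = Float32Array.from(vector)  // copies; also accepts plain arrays
  return Buffer.from(floats.buffer, floats.byteOffset, floats.byteLength)
}

// Inverse operation, e.g. for debugging stored embeddings.
function deserializeVector(blob) {
  // Copy into a fresh buffer so the float view starts at an aligned offset.
  const copy = Uint8Array.from(blob)
  return new Float32Array(copy.buffer)
}
```

Checking the dimension count up front turns a model or schema mismatch into a clear error at write time instead of a failed insert.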

Git sync bonus: On first git pull to a new machine, hooks fire for all posts, automatically building the full vector index — no separate setup step needed.
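When hooks fire for every post at once, running all embeddings concurrently could spike CPU and memory. One way to handle this, sketched here as an assumption rather than a description of bDS internals, is a small serial queue that processes change events one at a time. createSerialQueue and the `watcher` in the usage comment are hypothetical names.

```javascript
// Runs async tasks one at a time, in the order they were enqueued.
function createSerialQueue() {
  let tail = Promise.resolve()

  return function enqueue(task) {
    const result = tail.then(() => task())
    // Keep the chain alive after a failure so one bad post does not block the rest.
    tail = result.catch(() => {})
    return result  // callers still see the task's own result or error
  }
}

// Usage sketch (hypothetical watcher object):
// const enqueue = createSerialQueue()
// watcher.on('change', (filePath) => {
//   enqueue(() => onPostChanged(filePath)).catch(console.error)
// })
```

At roughly 100 ms per post, a few hundred posts would be indexed in well under a minute even when processed strictly one after another.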


UX Recommendation

  • Show 3–5 related posts at most: enough for an impulse, not so many that it becomes a management task
  • Label them clearly as "thematically related", not "you should read"
  • A low similarity threshold is fine — unexpected connections are often the most valuable
  • No need for user-facing controls over the algorithm; simplicity serves the use case
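These recommendations could be applied in a small presentation step. The sketch below assumes candidate rows shaped like the similarity query results ({ id, title, distance }); selectRelatedForDisplay, max, and maxDistance are hypothetical names, and the default of no distance cutoff reflects the permissive threshold suggested above.

```javascript
// Picks the closest candidates for display and attaches a neutral label.
// A smaller distance means a more similar post.
function selectRelatedForDisplay(candidates, { max = 5, maxDistance = Infinity } = {}) {
  return candidates
    .filter((candidate) => candidate.distance <= maxDistance)  // filter returns a new array
    .sort((a, b) => a.distance - b.distance)                   // so sorting does not mutate the input
    .slice(0, max)
    .map((candidate) => ({
      id: candidate.id,
      title: candidate.title,
      label: 'Thematically related',
    }))
}
```

Keeping the cutoff and the cap as internal defaults, rather than user settings, matches the recommendation to avoid user-facing controls.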

Key Libraries

| Purpose | Library | npm |
| --- | --- | --- |
| Embedding model (local) | Hugging Face Transformers.js | @huggingface/transformers |
| Vector search in SQLite | sqlite-vec | sqlite-vec |
| SQLite driver | better-sqlite3 | better-sqlite3 |
| ORM (already in bDS) | Drizzle ORM | drizzle-orm |

Philosophical Note

Luhmann's Zettelkasten was monothematic by design — everything fed into a single sociological theory. A personal blog spanning programming, board games, MTG, and everyday life is structurally different. The vector space will reflect that diversity and occasionally bridge domains in ways no intentional tagging system would — which is precisely the point.

The system is not meant to organize knowledge. It is meant to make existing connections visible.