SPFresh: incremental in-place updates for billion scale
ยท 3 min read
inspiration: you already know, im diving deep into ann indexes, and was looking into turbopuffer architecture - which points me to SPFresh making me very curious to know how it works.
SPFresh is a disk based cluster partitioned ANN index, which supports in-place updates and avoids global rebuilds, which are really expensive by continuously local rebalancing in billion scale vectors.
components:
- LIRE -> lightweight incremental re-balance protocol
- A protocol which splits/merges the partitions (postings), wisely without rebuilding the global indexes
- It only re-assigns the partitions of the boundary vectors, during split/merge which violates NPA (nearest partition assignment) rule, as the rule says the vector needs to be assigned to partition where the centroid of it is nearest.
- two conditions to check only the boundary vectors, so you dont scan everything:
- vectors from split posting where, old centroid is nearest compared to new with the boundary vector (might mean, neighbour posting is now closer)
- vectors in the neighbour postings needs check if the new centroid is nearest to the boundary vectors.
- algorithm
insert/delete: append to the nearest posting (partition), mark deletes
split: if a posting grows beyond a limit, we split the posting even smaller
merge: if a posting shrinks than a lower threshold, we merge the postings into a single one
reassignment: after split/merge, it only checks for the vectors that violates NPA rule, and re-assign the vectors to right postings.
example:
Before split: [ A ] --------- [ B ] Centroid_A: (0,0) Centroid_B: (5,0) Vector v = (2,0) โ in A (closer to (0,0)) After split A into A1 & A2: Centroid_A1: (-1,0) Centroid_A2: (2.5,0) Now v is closer to A2 (ok), but some boundary points in B might be closer to A2 โ reassign.
- SPANN
- SPANN , a disk-based, balanced-partition ANN index. it clusters the dataset into many small postings stored on disk and builds an in-memory disk-aware search graph over the partition centroids. at query time, SPANN searches this centroid graph to quickly find the nearest partitions (e.g., top-64) and then scans only those postings for nearest neighbors. it also duplicates a small fraction of vectors near partition boundaries to improve recall.
- SPFresh extend SPANN with LIRE to balance the partitions in the disk efficiently
- The research of SPFresh proves the split -> reassignment converges and wont cascade forever (assumption in euclidean distance)
how to better apply spfresh for your vector index:
- start with partition indexing of the vectors (like SPANN)
- Many small postings (partitions), centroid graph in memory and a fixed “visit-k postings” search budget.
- set thresholds
max_posting_len
,min_posting_len
- reassignment policy
- After split/merge, compute for candidates with two conditions and check only small neighbour fanouts (16-64 postings)
- storage
- Always keep the postings append only, keep a version map so queries skip stale postings, garbage-collect later.
tweak the reassign logic based on the recall, latency of the queries.