SPFresh: incremental in-place updates for billion scale

August 16, 2025 · 3 min read

inspiration: you already know, im diving deep into ann indexes, and was looking into turbopuffer architecture - which points me to SPFresh making me very curious to know how it works.

SPFresh is a disk based cluster partitioned ANN index, which supports in-place updates and avoids global rebuilds, which are really expensive by continuously local rebalancing in billion scale vectors.

components:

LIRE -> lightweight incremental re-balance protocol
- A protocol which splits/merges the partitions (postings), wisely without rebuilding the global indexes
- It only re-assigns the partitions of the boundary vectors, during split/merge which violates NPA (nearest partition assignment) rule, as the rule says the vector needs to be assigned to partition where the centroid of it is nearest.
- two conditions to check only the boundary vectors, so you dont scan everything:
  - vectors from split posting where, old centroid is nearest compared to new with the boundary vector (might mean, neighbour posting is now closer)
  - vectors in the neighbour postings needs check if the new centroid is nearest to the boundary vectors.
- algorithm
  - insert/delete: append to the nearest posting (partition), mark deletes
  - split: if a posting grows beyond a limit, we split the posting even smaller
  - merge: if a posting shrinks than a lower threshold, we merge the postings into a single one
  - reassignment: after split/merge, it only checks for the vectors that violates NPA rule, and re-assign the vectors to right postings.
  - example:
```
Before split:
[ A ] --------- [ B ]
Centroid_A: (0,0)
Centroid_B: (5,0)

Vector v = (2,0) → in A (closer to (0,0))

After split A into A1 & A2:
Centroid_A1: (-1,0)
Centroid_A2: (2.5,0)

Now v is closer to A2 (ok), but some boundary points in B might be closer to A2 → reassign.
```
SPANN
- SPANN , a disk-based, balanced-partition ANN index. it clusters the dataset into many small postings stored on disk and builds an in-memory disk-aware search graph over the partition centroids. at query time, SPANN searches this centroid graph to quickly find the nearest partitions (e.g., top-64) and then scans only those postings for nearest neighbors. it also duplicates a small fraction of vectors near partition boundaries to improve recall.
- SPFresh extend SPANN with LIRE to balance the partitions in the disk efficiently
The research of SPFresh proves the split -> reassignment converges and wont cascade forever (assumption in euclidean distance)

how to better apply spfresh for your vector index:

start with partition indexing of the vectors (like SPANN)
- Many small postings (partitions), centroid graph in memory and a fixed “visit-k postings” search budget.
set thresholds
- max_posting_len, min_posting_len
reassignment policy
- After split/merge, compute for candidates with two conditions and check only small neighbour fanouts (16-64 postings)
storage
- Always keep the postings append only, keep a version map so queries skip stale postings, garbage-collect later.

tweak the reassign logic based on the recall, latency of the queries.