SPFresh: incremental in-place updates for billion scale

  ยท  3 min read

inspiration: you already know, im diving deep into ann indexes, and was looking into turbopuffer architecture - which points me to SPFresh making me very curious to know how it works.

SPFresh is a disk based cluster partitioned ANN index, which supports in-place updates and avoids global rebuilds, which are really expensive by continuously local rebalancing in billion scale vectors.

components:

  • LIRE -> lightweight incremental re-balance protocol
    • A protocol which splits/merges the partitions (postings), wisely without rebuilding the global indexes
    • It only re-assigns the partitions of the boundary vectors, during split/merge which violates NPA (nearest partition assignment) rule, as the rule says the vector needs to be assigned to partition where the centroid of it is nearest.
    • two conditions to check only the boundary vectors, so you dont scan everything:
      • vectors from split posting where, old centroid is nearest compared to new with the boundary vector (might mean, neighbour posting is now closer)
      • vectors in the neighbour postings needs check if the new centroid is nearest to the boundary vectors.
    • algorithm
      • insert/delete: append to the nearest posting (partition), mark deletes

      • split: if a posting grows beyond a limit, we split the posting even smaller

      • merge: if a posting shrinks than a lower threshold, we merge the postings into a single one

      • reassignment: after split/merge, it only checks for the vectors that violates NPA rule, and re-assign the vectors to right postings.

      • example:

        Before split:
        [ A ] --------- [ B ]
        Centroid_A: (0,0)
        Centroid_B: (5,0)
        
        Vector v = (2,0) โ†’ in A (closer to (0,0))
        
        After split A into A1 & A2:
        Centroid_A1: (-1,0)
        Centroid_A2: (2.5,0)
        
        Now v is closer to A2 (ok), but some boundary points in B might be closer to A2 โ†’ reassign.
        
  • SPANN
    • SPANN , a disk-based, balanced-partition ANN index. it clusters the dataset into many small postings stored on disk and builds an in-memory disk-aware search graph over the partition centroids. at query time, SPANN searches this centroid graph to quickly find the nearest partitions (e.g., top-64) and then scans only those postings for nearest neighbors. it also duplicates a small fraction of vectors near partition boundaries to improve recall.
    • SPFresh extend SPANN with LIRE to balance the partitions in the disk efficiently
  • The research of SPFresh proves the split -> reassignment converges and wont cascade forever (assumption in euclidean distance)

how to better apply spfresh for your vector index:

  1. start with partition indexing of the vectors (like SPANN)
    • Many small postings (partitions), centroid graph in memory and a fixed “visit-k postings” search budget.
  2. set thresholds
    • max_posting_len, min_posting_len
  3. reassignment policy
    • After split/merge, compute for candidates with two conditions and check only small neighbour fanouts (16-64 postings)
  4. storage
    • Always keep the postings append only, keep a version map so queries skip stale postings, garbage-collect later.

tweak the reassign logic based on the recall, latency of the queries.