Search on Hrushikesh Dokala

SPFresh: incremental in-place updates for billion scale

Sat, 16 Aug 2025 00:00:00 +0000

inspiration: as you already know, i'm diving deep into ann indexes. i was looking into the turbopuffer architecture, which pointed me to SPFresh, and i got very curious about how it works.

SPFresh is a disk-based, cluster-partitioned ANN index for billion-scale vector sets. it supports in-place updates and avoids global rebuilds, which are really expensive, by continuously rebalancing partitions locally.

components:

LIRE -> lightweight incremental re-balance protocol
- a protocol that splits/merges partitions (postings) without rebuilding the global index
- during a split/merge, it reassigns only the boundary vectors that now violate the NPA (nearest partition assignment) rule. the rule says each vector must be assigned to the partition whose centroid is nearest to it.
- two conditions pick out the boundary vectors to check, so you don't scan everything:
  - vectors from the split posting that were closer to the old centroid than to the new centroids (which might mean a neighbour posting is now closer)
  - vectors in the neighbour postings, which need a check for whether a new centroid is now their nearest.
- algorithm
  - insert/delete: append new vectors to the nearest posting (partition), mark deleted ones
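
to make the two boundary conditions concrete, here is a minimal sketch of how i read them, not SPFresh's actual code. the function names, the toy 2-d points, and the use of plain euclidean distance are all my own assumptions for illustration.

```python
import math

def nearest_centroid(v, centroids):
    """Index of the centroid closest to v (the one the NPA rule picks)."""
    return min(range(len(centroids)), key=lambda i: math.dist(v, centroids[i]))

def split_vector_needs_check(v, old_centroid, new_centroids):
    """Condition 1 (hypothetical helper): v came from the split posting and
    the old centroid was at least as close as every new centroid."""
    d_old = math.dist(v, old_centroid)
    return all(d_old <= math.dist(v, c) for c in new_centroids)

def neighbour_vector_needs_check(v, own_centroid, new_centroids):
    """Condition 2 (hypothetical helper): v lives in a neighbour posting and
    some new centroid is now closer than its own centroid."""
    d_own = math.dist(v, own_centroid)
    return any(math.dist(v, c) < d_own for c in new_centroids)

# toy data: one posting centred at (0, 0) was split into two new postings
old_centroid = (0.0, 0.0)
new_centroids = [(-1.0, 0.0), (1.0, 0.0)]
neighbour_centroid = (4.0, 0.0)

split_posting = [(0.1, 2.0), (1.2, 0.0)]
neighbour_posting = [(2.2, 0.0), (4.0, 1.0)]

# only vectors passing one of the two conditions get re-examined
to_check = [v for v in split_posting
            if split_vector_needs_check(v, old_centroid, new_centroids)]
to_check += [v for v in neighbour_posting
             if neighbour_vector_needs_check(v, neighbour_centroid, new_centroids)]

# re-apply the NPA rule to the flagged vectors only
all_centroids = new_centroids + [neighbour_centroid]
reassigned = {v: nearest_centroid(v, all_centroids) for v in to_check}
```

in this toy run only two of the four vectors are flagged, so the other two are never compared against every centroid again.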

ann (approximate nearest neighbor) indexes

Fri, 15 Aug 2025 00:00:00 +0000

inspiration: i'm working on vector and hybrid search for data catalogs, and got curious about the different index structures used at scale.

firstly, what is an ann index? an approximate nearest neighbor index is a data structure that stores your vectors in a way that lets you avoid comparing a query against all of them. while deep diving into ann, i've learnt about a few interesting indexes, suited to different scales of data. the 2 most popular kinds are graph-based and cluster-based.
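
a small sketch of the "avoid comparing against all of them" idea, using a cluster-based layout. this is a toy under my own assumptions (hand-picked centroids, euclidean distance, made-up function names), not how any particular library does it.

```python
import math

def brute_force_nn(query, vectors):
    """Exact search: compares the query against every vector."""
    return min(range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))

def build_partitions(vectors, centroids):
    """Assign each vector id to the partition of its nearest centroid."""
    partitions = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)),
                      key=lambda c: math.dist(v, centroids[c]))
        partitions[nearest].append(i)
    return partitions

def ann_search(query, vectors, centroids, partitions, nprobe=1):
    """Approximate search: scan only the nprobe partitions closest to the query."""
    order = sorted(range(len(centroids)),
                   key=lambda c: math.dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in partitions[c]]
    if not candidates:
        return None
    return min(candidates, key=lambda i: math.dist(query, vectors[i]))

vectors = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids = [(0.0, 0.5), (10.0, 10.5)]  # assumed to be given, e.g. from k-means
partitions = build_partitions(vectors, centroids)

query = (9.0, 9.0)
exact = brute_force_nn(query, vectors)                        # looks at all 4 vectors
approx = ann_search(query, vectors, centroids, partitions)    # looks at 2 of them
```

here both searches agree, but with `nprobe=1` the approximate one can miss the true neighbour when it sits in a partition that was not probed. that trade is the "approximate" part.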

indexes

Fri, 18 Jul 2025 00:00:00 +0000

an index is a data structure that improves the speed of data retrieval operations by providing quick access to specific information without having to search through every piece of data. (e.g., a book's index: look up the word, then navigate to that page)

forward [ O(n) ]
- we have multiple documents (pages), and we extract words from each document and store it in a data structure
  - doc 1 — “book”, “pen”, “student”
  - doc 2 — “pen”
  - doc 3 — “student”
- a search needs a full scan across the docs to find which docs have the word.
- bring a pen -> do a full text scan on each doc -> Doc 1, 2
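
the forward lookup above can be sketched like this. the dict layout and the `search_forward` name are just my illustration of the three docs listed.

```python
# forward index: each doc maps to the words extracted from it
forward_index = {
    "doc 1": ["book", "pen", "student"],
    "doc 2": ["pen"],
    "doc 3": ["student"],
}

def search_forward(index, word):
    """Scan every doc's word list; cost grows with the number of docs."""
    return [doc for doc, words in index.items() if word in words]

matches = search_forward(forward_index, "pen")
print(matches)
```

every query walks all the docs, which is where the O(n) comes from.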
reverse/inverted [ O(1) ]