<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Search on Hrushikesh Dokala</title><link>https://hrushikesh.dev/tags/search/</link><description>Recent content in Search on Hrushikesh Dokala</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Sat, 16 Aug 2025 00:00:00 +0000</lastBuildDate><atom:link href="https://hrushikesh.dev/tags/search/index.xml" rel="self" type="application/rss+xml"/><item><title>SPFresh: incremental in-place updates for billion scale</title><link>https://hrushikesh.dev/notes/spfresh/</link><pubDate>Sat, 16 Aug 2025 00:00:00 +0000</pubDate><guid>https://hrushikesh.dev/notes/spfresh/</guid><description>&lt;p>inspiration: you already know, im diving deep into &lt;a href="https://hrushikesh.dev/notes/vector-index">ann indexes&lt;/a>, and was looking into turbopuffer architecture - which points me to 
&lt;a href="https://arxiv.org/pdf/2410.14452" class="link-red" target="_blank" rel="noopener noreferrer">SPFresh&lt;/a>, making me very curious to know how it works.&lt;/p>
&lt;p>SPFresh is a disk-based, cluster-partitioned ANN index that supports in-place updates at billion-vector scale. It avoids global rebuilds, which are really expensive, by continuously rebalancing partitions locally.&lt;/p>
&lt;p>&lt;strong>components&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;em>&lt;strong>LIRE&lt;/strong>&lt;/em> -&amp;gt; lightweight incremental re-balance protocol
&lt;ul>
&lt;li>A protocol that splits and merges partitions (postings) without rebuilding the global index.&lt;/li>
&lt;li>During a split/merge, it only reassigns the boundary vectors that violate the NPA (nearest partition assignment) rule, which says each vector must be assigned to the partition whose centroid is nearest to it.&lt;/li>
&lt;li>&lt;strong>two conditions&lt;/strong> pick out the boundary vectors to check, so you don&amp;rsquo;t scan everything:
&lt;ul>
&lt;li>vectors from the split posting that are closer to the old centroid than to either new centroid (which might mean a neighbour posting is now closer)&lt;/li>
&lt;li>vectors in the neighbour postings, which need a check on whether a new centroid is now their nearest.&lt;/li>
&lt;/ul>
&lt;/li>
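&lt;li>a minimal sketch of these two checks in python, not the paper&amp;rsquo;s implementation. it assumes euclidean distance and vectors as plain tuples; &lt;code>needs_check_after_split&lt;/code> and its parameters are hypothetical names, and condition 2 is read here as &amp;ldquo;a new centroid is at least as close as the old centroid was&amp;rdquo;:

```python
import math


def needs_check_after_split(v, old_centroid, new_centroids, from_split_posting):
    """Return True if vector `v` is a reassignment candidate after a split.

    Hypothetical helper: names and signature are illustrative assumptions.
    """
    d_old = math.dist(v, old_centroid)
    d_new = min(math.dist(v, c) for c in new_centroids)
    if from_split_posting:
        # Condition 1: the old centroid was at least as close as every new
        # centroid, so a neighbouring posting might now be nearer.
        return d_new >= d_old
    # Condition 2 (vector lives in a neighbour posting): some new centroid is
    # at least as close as the old centroid was, so it might now be nearest.
    return d_old >= d_new
```

a vector that passes this filter is only a candidate: it still needs the actual nearest-centroid check before it is reassigned.&lt;/li>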
&lt;li>&lt;em>&lt;strong>algorithm&lt;/strong>&lt;/em>
&lt;ul>
&lt;li>
&lt;p>insert/delete: append to the nearest posting (partition), mark deletes&lt;/p></description></item><item><title>ann (approximate nearest neighbor) indexes</title><link>https://hrushikesh.dev/notes/vector-index/</link><pubDate>Fri, 15 Aug 2025 00:00:00 +0000</pubDate><guid>https://hrushikesh.dev/notes/vector-index/</guid><description>&lt;p>inspiration - im working on vector, hybrid search for data catalogs and got curious about the different index structures used at scale.&lt;/p>
&lt;p>firstly, what is an ann index? approximate nearest neighbor index: a data structure that stores your vectors in a way that lets you avoid comparing a query against all of them. so, i was deep diving into ann, and i&amp;rsquo;ve learnt about a few interesting indexes, chosen based on the scale of the data points. the 2 most popular kinds are &lt;strong>graph&lt;/strong> and &lt;strong>cluster&lt;/strong> based.&lt;/p></description></item><item><title>indexes</title><link>https://hrushikesh.dev/notes/indexes/</link><pubDate>Fri, 18 Jul 2025 00:00:00 +0000</pubDate><guid>https://hrushikesh.dev/notes/indexes/</guid><description>&lt;p>an index is a data structure that speeds up data retrieval by providing quick access to specific information without having to search through every piece of data. (e.g. a book&amp;rsquo;s index: look up the word, then navigate to that page)&lt;/p>
&lt;ul>
&lt;li>
&lt;p>forward [ O(n) ]&lt;/p>
&lt;ul>
&lt;li>we have multiple documents (pages), and we extract the words from each document and store them in a data structure keyed by document
&lt;ul>
&lt;li>doc 1 — “book”, “pen”, “student”&lt;/li>
&lt;li>doc 2 — “pen”&lt;/li>
&lt;li>doc 3 — “student”&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>a search needs a full scan across the docs to find which docs have the word.&lt;/li>
&lt;li>query “bring a pen” -&amp;gt; do a full text scan on each doc for “pen” -&amp;gt; doc 1, 2&lt;/li>
&lt;/ul>
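&lt;p>the same example as a minimal python sketch. &lt;code>forward_index&lt;/code> and &lt;code>search&lt;/code> are hypothetical names, and a plain dict stands in for whatever data structure actually holds the index:&lt;/p>

```python
# Hypothetical forward index: doc id -> words extracted from that doc.
forward_index = {
    "doc 1": ["book", "pen", "student"],
    "doc 2": ["pen"],
    "doc 3": ["student"],
}


def search(index, word):
    """Return every doc whose word list contains `word`.

    Every doc has to be visited, so the cost grows with the number of docs.
    """
    return [doc for doc, words in index.items() if word in words]


matches = search(forward_index, "pen")
```

&lt;p>here &lt;code>matches&lt;/code> holds doc 1 and doc 2, and getting there meant touching all three docs.&lt;/p>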
&lt;/li>
&lt;li>
&lt;p>reverse/inverted [ O(1) ]&lt;/p></description></item></channel></rss>