Databases on Hrushikesh Dokala

building snaildb 🐌 : embedded, persistent key-value store written in rust

Tue, 30 Dec 2025 00:00:00 +0000

i decided to experiment with building a key-value store in rust as a fun project, both to learn the language and to dive into the low-level data structures and algorithms used in databases. the inspiration was to see how far i could push a kv store compared to existing solutions like rocksdb and leveldb, especially after reading up on architectural concepts like bare-metal designs, object storage-based databases, and the bf-tree paper. this project is my way of getting hands-on experience and satisfying my curiosity about database internals, and it might turn out to be a production-ready db in the future. (start date: 02 december 2025, not really sure when this gets finished)

bf-tree

Mon, 24 Nov 2025 00:00:00 +0000

Bf-tree

decouple cache pages from disk pages: the cache no longer has to mirror disk 1:1

let’s understand this a little deeper:

the problem with a B-tree: data lives in fixed-size pages (4kb), and the buffer pool caches whole pages in RAM.
to update a record -> read the full page -> modify a few bytes -> write back 4kb
but the cache doesn’t have to mirror disk 1:1, so in this paper they’ve introduced “mini pages”

mini pages - variable-length in-memory fragments, so you don’t need to hold the full 4kb page in the buffer.
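
to put a number on the "modify a few bytes, write back 4kb" cost, here is a tiny sketch. `write_amplification` and the `MiniPage` class are names i made up for illustration, and the class is only a toy (a dict of hot records), not the actual mini page layout from the paper.

```python
PAGE_SIZE = 4096  # bytes, the fixed disk page size used in the text


def write_amplification(bytes_changed, page_size=PAGE_SIZE):
    """Bytes written to disk per byte actually changed when a whole page is rewritten."""
    return page_size / bytes_changed


class MiniPage:
    """Toy in-memory fragment that caches only the hot records of one disk page.

    Hypothetical structure for illustration, not the paper's layout.
    """

    def __init__(self, disk_page_id):
        self.disk_page_id = disk_page_id
        self.records = {}  # key (bytes) -> value (bytes)

    def put(self, key, value):
        self.records[key] = value

    def size_bytes(self):
        # payload only; a real layout would also carry headers and slot metadata
        return sum(len(k) + len(v) for k, v in self.records.items())


mini = MiniPage(disk_page_id=7)
mini.put(b"user:42", b"snail")
print(write_amplification(16))  # 256.0
print(mini.size_bytes())        # 12
```

so a 16-byte update that rewrites a full page writes 256x more than it changed, while the toy mini page holds 12 bytes of payload instead of pinning 4096 bytes in the buffer.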

vitess architecture

Fri, 21 Nov 2025 00:00:00 +0000

first of all, i’m very inspired by @samlambert, ceo of planetscale. i was curious enough to explore planetscale.com (they pitch some of the fastest dbs available in the cloud, backed by fast NVMe drives) and found something interesting: vitess.

there is a lot going on on their website, but vitess is the part that allows mysql dbs to scale horizontally through sharding, which is very interesting. so i thought of digging deeper into it. one of the questions i had was - “what is the exact problem vitess solves for mysql?”.
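
to make "scale horizontally through sharding" concrete, here is a generic sketch of routing a row to a shard by hashing its sharding key. this is just the general idea: the shard names, the `route` function, and the md5 stand-in hash are all my assumptions, and none of it is vitess’s actual routing code.

```python
import hashlib

# hypothetical layout: each shard owns a half-open range of the first hash byte
SHARDS = [("shard_a", 0x00, 0x80), ("shard_b", 0x80, 0x100)]


def keyspace_byte(sharding_key):
    # stand-in hash (md5), chosen only because it is in the stdlib
    return hashlib.md5(sharding_key.encode()).digest()[0]


def route(sharding_key):
    """Return the name of the shard whose range covers the hashed key."""
    b = keyspace_byte(sharding_key)
    for name, lo, hi in SHARDS:
        if lo <= b < hi:
            return name
    raise ValueError("no shard covers this key")


for key in ["user:1", "user:2", "user:3"]:
    print(key, "->", route(key))
```

the same key always lands on the same shard, so each mysql instance only has to hold and serve its own slice of the data.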

SPFresh: incremental in-place updates for billion scale

Sat, 16 Aug 2025 00:00:00 +0000

inspiration: as you already know, i’m diving deep into ann indexes. i was looking into the turbopuffer architecture, which pointed me to SPFresh and made me very curious to know how it works.

SPFresh is a disk-based, cluster-partitioned ANN index that supports in-place updates. it avoids global rebuilds, which are really expensive at billion-vector scale, by continuously rebalancing partitions locally.

components:

LIRE -> lightweight incremental rebalancing protocol
- a protocol that splits/merges the partitions (postings) without rebuilding the global index
- during a split/merge it only re-assigns the boundary vectors that violate the NPA (nearest partition assignment) rule: every vector must be assigned to the partition whose centroid is nearest to it.
- two conditions pick out the boundary vectors worth checking, so you don’t scan everything:
  - vectors from the split posting for which the old centroid was closer than the new centroids (which might mean a neighbouring posting is now the closest)
  - vectors in neighbouring postings for which a new centroid is closer than the old centroid of the split posting (which might mean a new posting is now the closest)
- algorithm
  - insert/delete: append to the nearest posting (partition), mark deletes
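
the two boundary checks above can be sketched like this. it is my reading of the conditions, not code from the paper: the function names are made up, distance is plain euclidean, and a vector that passes a check still needs a real nearest-centroid search before it is moved.

```python
import math


def check_in_split_posting(v, old_centroid, new_centroids):
    """Condition 1: v was in the split posting and the old centroid was at least
    as close as every new one, so a neighbouring posting may now be nearer."""
    d_old = math.dist(v, old_centroid)
    return all(d_old <= math.dist(v, c) for c in new_centroids)


def check_in_neighbour_posting(v, old_centroid, new_centroids):
    """Condition 2: v is in a neighbouring posting and some new centroid is at least
    as close as the old split centroid, so a new posting may now be nearer."""
    d_old = math.dist(v, old_centroid)
    return any(math.dist(v, c) <= d_old for c in new_centroids)


old = (0.0, 0.0)                  # centroid of the posting before the split
new = [(-2.0, 0.0), (2.0, 0.0)]   # centroids of the two postings after the split

print(check_in_split_posting((0.0, 0.5), old, new))      # True
print(check_in_split_posting((1.9, 0.0), old, new))      # False
print(check_in_neighbour_posting((3.0, 1.0), old, new))  # True
print(check_in_neighbour_posting((0.0, 3.0), old, new))  # False
```

only the vectors that return True become reassignment candidates, which is what keeps a split local instead of turning it into a scan over everything.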

ann (approximate nearest neighbor) indexes

Fri, 15 Aug 2025 00:00:00 +0000

inspiration - i’m working on vector and hybrid search for data catalogs and got curious about the different index structures used at scale.

firstly, what is an ann index? an approximate nearest neighbor index is a data structure that stores your vectors in a way that lets you avoid comparing a query against all of them. so, i was deep diving into ann, and i’ve learnt about a few interesting indexes, each suited to a different scale of data points. the 2 most popular kinds are graph-based and cluster-based.
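
a small sketch of "avoid comparing against all of them", using a cluster-based layout. the function names and the hand-picked centroids are assumptions for illustration; a real index would learn the centroids (e.g. with k-means) and keep the clusters on disk.

```python
import math


def brute_force(query, vectors):
    # exact search: compares the query against every vector
    return min(range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))


def build_clusters(vectors, centroids):
    # assign each vector id to the cluster of its nearest centroid
    clusters = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: math.dist(v, centroids[c]))
        clusters[nearest].append(i)
    return clusters


def cluster_search(query, vectors, centroids, clusters, nprobe=1):
    # approximate search: scan only the nprobe clusters with the closest centroids
    order = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in clusters[c]]
    return min(candidates, key=lambda i: math.dist(query, vectors[i]))


vectors = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
centroids = [(0.0, 0.0), (5.0, 5.0)]  # hand-picked for the example
clusters = build_clusters(vectors, centroids)

query = (5.0, 5.0)
print(brute_force(query, vectors))                          # 2
print(cluster_search(query, vectors, centroids, clusters))  # 2
```

here both searches return the same vector, but the cluster search only looked at 2 of the 4 vectors. the "approximate" part is that the true nearest neighbor can sit in a cluster that was not probed, which is the trade-off `nprobe` controls.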

indexes

Fri, 18 Jul 2025 00:00:00 +0000

an index is a data structure that speeds up data retrieval by giving quick access to specific information without having to search through every piece of data. (e.g. a book’s index: look up the word, then jump to that page)

forward [ O(n) ]
- we have multiple documents (pages), and we extract the words from each document and store them per document
  - doc 1 — “book”, “pen”, “student”
  - doc 2 — “pen”
  - doc 3 — “student”
- a search needs a full scan across the docs to find which docs have the word.
- “bring a pen” -> do a full text scan on each doc for “pen” -> doc 1, 2
reverse/inverted [ O(1) ]
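
a minimal sketch of the idea, reusing the three docs from the forward example: flip the mapping so each word points to the docs that contain it. the O(1) is the average cost of a hash-map lookup by word. function names are mine.

```python
# forward index: doc id -> words in that doc
forward = {
    1: ["book", "pen", "student"],
    2: ["pen"],
    3: ["student"],
}


def search_forward(word):
    # O(n): has to look inside every doc
    return [doc for doc, words in forward.items() if word in words]


def build_inverted(forward_index):
    # inverted index: word -> doc ids that contain it
    inverted = {}
    for doc, words in forward_index.items():
        for w in words:
            inverted.setdefault(w, []).append(doc)
    return inverted


inverted = build_inverted(forward)
print(search_forward("pen"))    # [1, 2]
print(inverted.get("pen", []))  # [1, 2]
```

both give doc 1 and 2 for “pen”, but the inverted lookup never touches doc 3. the cost moves to write time, since every new doc has to update the entry of each word it contains.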

do you commit directly to the db? nah bro!

Sun, 15 Jun 2025 00:00:00 +0000

i thought i knew how databases work too, but i was wrong until i read about the wal (write-ahead log).

thought, this is how it works:

( user/api request ) -> ( operation (insert, update, delete) ) -> ( db (write to disk) ) -> ( respond 200 to user )

learnt, it works like this:

( user/api request ) -> ( operation (insert, update, delete) ) -> ( wal (write to buffer in-memory) ) -> ( wal (flush to disk - fsync) ) + ( start async process that applies the change to the data files ) -> ( respond 200 to user )
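
the second flow can be sketched as a toy append-only log. `TinyWal` is a made-up class, the json-lines record format is my assumption, and the "data files" are just a dict that gets updated inline (a real db would do that part asynchronously).

```python
import json
import os
import tempfile


class TinyWal:
    """Toy write-ahead log: every write is durable in the log before it is acked."""

    def __init__(self, path):
        self.path = path
        self.table = {}  # stand-in for the real data files
        self.log = open(path, "a", encoding="utf-8")

    def put(self, key, value):
        record = json.dumps({"op": "put", "key": key, "value": value})
        self.log.write(record + "\n")  # 1. write to the in-memory buffer
        self.log.flush()               # 2. hand the buffer to the os
        os.fsync(self.log.fileno())    # 3. force it onto disk
        self.table[key] = value        # 4. apply the change (deferred in a real db)
        return 200                     # 5. only now respond to the user

    def close(self):
        self.log.close()

    @staticmethod
    def recover(path):
        # after a crash, replay the log in order to rebuild the state
        table = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                rec = json.loads(line)
                if rec["op"] == "put":
                    table[rec["key"]] = rec["value"]
        return table


path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal = TinyWal(path)
wal.put("a", 1)
wal.put("a", 2)
wal.close()
print(TinyWal.recover(path))  # {'a': 2}
```

the point of the ordering is that once the user gets the 200, the change is already on disk in the log, so it can be replayed even if the process dies before the data files are updated.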