indexes
· 2 min read
an index is a data structure that improves the speed of data retrieval operations by providing quick access to specific information without having to search through every piece of data. (Eg, a books index - look for the word and then navigate to that page)
forward [ O(n) ]
- we have multiple documents (pages), and we extract words from each document and store it in a data structure
- doc 1 — “book”, “pen”, “student”
- doc 2 — “pen”
- doc 3 — “student”
- when doing a search, it needs a full scan across the docs to retrieve which docs has the word.
- bring a pen -> do a full text scan on each doc -> Doc 1, 2
- we have multiple documents (pages), and we extract words from each document and store it in a data structure
reverse/inverted [ O(1) ]
- we maintain a table which maps in reverse “words” -> “docs”
- now we can just lookup for the word, and then retrieve the docs that has this
- “pen” — doc 1, 2
- “student” — doc 1, 3
- “book” — doc 1
vector
- a special data structure which organises high dimensional vectors for efficient similarity search
- HNSW - hierarchical navigable small world
- Annoy - approximate nearest neighbours
- IVF - inverted file index
- LSH - locality sensitive hashing
- a special data structure which organises high dimensional vectors for efficient similarity search
b-tree index
- default index type in most of the databases (postgres, mysql, oracle)
- search, insert, delete: O(log n)
- range query: O(log n + k) where k = number of results
here are only a few, there are many more… but just rewinding on the basics.