building bm25
· 2 min read
i was trying to build a search engine that can do both vector and full-text search. so i started with the full-text search side, trying to understand bm25 and build it.
started off by understanding the bm25 ranking function -> it's an extension of the tf-idf algorithm that adds document-length normalisation and saturates tf (term frequency) past a point (so a doc that repeats the same word many times doesn't always sit on top).
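to see what saturation means, here's a tiny sketch of bm25's tf component (k1 is the standard bm25 knob, not something from this post; 1.2-2.0 is typical):

```python
# bm25's term-frequency saturation: the contribution of a term
# grows with tf but is bounded above by k1 + 1
def tf_saturation(tf, k1=1.5):
    return tf * (k1 + 1) / (tf + k1)

for tf in (1, 2, 5, 20, 100):
    print(tf, round(tf_saturation(tf), 2))
```

the first occurrence of a word counts a lot; the hundredth barely moves the score.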
steps involved in building the bm25 index:
- extract tokens, dropping stop words
- build the reverse index
- build the forward index
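the indexing steps above can be sketched roughly like this (the stop-word list and function names are mine, not from the post):

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}  # tiny illustrative list

def tokenize(text):
    # lowercase split, stop words removed
    return [t for t in text.lower().split() if t not in STOP_WORDS]

reverse_index = defaultdict(set)   # token -> set of doc ids containing it
forward_index = {}                 # doc id -> {token: count} for that doc

def add_doc(doc_id, text):
    counts = defaultdict(int)
    for tok in tokenize(text):
        counts[tok] += 1
        reverse_index[tok].add(doc_id)
    forward_index[doc_id] = dict(counts)
```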
steps involved in scoring the docs and returning the top k docs:
- extract tokens, dropping stop words
- apply the ranking function to score the docs
- return the highest scoring k docs
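a minimal sketch of the scoring side, assuming a reverse index (token -> doc ids) and forward index (doc id -> token counts) shaped like the ones described above; k1 and b are the usual bm25 parameters, and the idf uses the standard +0.5 smoothing:

```python
import math
from collections import defaultdict

def bm25_top_k(query_tokens, reverse_index, forward_index, k=10, k1=1.5, b=0.75):
    n_docs = len(forward_index)
    doc_lens = {d: sum(tc.values()) for d, tc in forward_index.items()}
    avg_len = sum(doc_lens.values()) / n_docs
    scores = defaultdict(float)
    for tok in query_tokens:
        docs = reverse_index.get(tok, ())
        if not docs:
            continue
        # smoothed idf: rare terms score higher
        idf = math.log((n_docs - len(docs) + 0.5) / (len(docs) + 0.5) + 1)
        for d in docs:
            tf = forward_index[d][tok]
            # document-length normalisation baked into the denominator
            norm = k1 * (1 - b + b * doc_lens[d] / avg_len)
            scores[d] += idf * tf * (k1 + 1) / (tf + norm)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:k]
```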
while building the bm25 index one doc at a time, i ran into some performance issues:
- it started fast: around 1000 docs/sec in the first batch.
- by the 25th batch (~25,000 docs), it slowed down to ~83 docs/sec.
but why?
i was using lists to store reverse/forward index entries.
➝ this caused O(n) lookups as the index grew.
i was recalculating the average document length after every document addition.
➝ each time, it ran a full sum() over all docs.
optimizations ive made:
- switched to defaultdict(set), which gives O(1) lookups when adding a doc to the reverse index.
- made the average-document-length update lazy: one sum(doc lengths)/n at the end of each batch, instead of after every single doc.
The above optimizations took me to:
| metric | before | after |
| --- | --- | --- |
| indexing 25k docs | 167 seconds | ~6 seconds |
| avg. query time | ~55 ms | ~2 ms |
| 100k+ docs | crashed | scales comfortably |
this is a good start. im still learning about building the indexes and persisting them.
current performance :)
— Hrushikesh Dokala (@Hrushikeshhhh) July 25, 2024