building bm25

  ·  2 min read

i was trying to build a search engine that can perform both vector and full text search. i started with the full text side, trying to understand bm25 and build it myself.

started off by understanding the bm25 ranking function -> it's an extension of the tf-idf algorithm that adds document length normalisation and saturates tf (term frequency) after a point (so a doc that repeats the same word many times doesn't always land on top).
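a minimal sketch of the per-term bm25 score, just to make the saturation concrete. the function name and the parameter values below are my own illustration (k1 and b are the standard bm25 knobs), not code from my index:

```python
import math

def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.5, b=0.75):
    # idf: rarer terms contribute more
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # tf saturation + length normalisation: repeated occurrences give
    # diminishing returns, and long docs are scaled by avg_doc_len
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm

# going from 1 -> 10 occurrences gains far less than 10x the score
s1 = bm25_term_score(1, 100, 10_000, 200, 200)
s10 = bm25_term_score(10, 100, 10_000, 200, 200)
```

with these numbers s10 is only about 2x s1, not 10x — that's the saturation.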

steps involved in building the bm25 index:

  • extract tokens, excluding stop words
  • build the reverse index (token → doc ids)
  • build the forward index (doc id → term counts)
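the indexing steps above can be sketched like this (the stop word list and helper names are placeholders, not my actual code):

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "is", "of", "to", "and"}  # toy stop list

def tokenize(text):
    return [t for t in text.lower().split() if t not in STOP_WORDS]

reverse_index = defaultdict(set)   # token -> set of doc ids containing it
forward_index = {}                 # doc id -> term counts for that doc

def add_doc(doc_id, text):
    counts = defaultdict(int)
    for tok in tokenize(text):
        counts[tok] += 1
        reverse_index[tok].add(doc_id)
    forward_index[doc_id] = dict(counts)

add_doc(1, "the quick brown fox")
add_doc(2, "the lazy brown dog")
```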

steps involved in scoring the docs and returning the top k docs:

  • extract tokens, excluding stop words
  • apply the ranking function to score the docs
  • return the highest scoring k docs
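query-time scoring over those two indexes could look like the sketch below. the tiny hard-coded indexes and the `search` helper are illustrative assumptions, not the real thing:

```python
import heapq
import math
from collections import defaultdict

k1, b = 1.5, 0.75  # standard bm25 parameters

# toy indexes in the shape described above
reverse_index = {"brown": {1, 2}, "fox": {1}, "dog": {2}}
forward_index = {1: {"quick": 1, "brown": 1, "fox": 1},
                 2: {"lazy": 1, "brown": 1, "dog": 1}}

def search(query_tokens, k=10):
    n_docs = len(forward_index)
    doc_lens = {d: sum(c.values()) for d, c in forward_index.items()}
    avg_len = sum(doc_lens.values()) / n_docs
    scores = defaultdict(float)
    for tok in query_tokens:
        postings = reverse_index.get(tok, set())
        if not postings:
            continue
        idf = math.log(1 + (n_docs - len(postings) + 0.5) / (len(postings) + 0.5))
        for doc_id in postings:
            tf = forward_index[doc_id].get(tok, 0)
            norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_lens[doc_id] / avg_len))
            scores[doc_id] += idf * norm
    # return the k highest-scoring docs
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

top = search(["brown", "fox"], k=2)
```

doc 1 matches both query terms, so it comes back first.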

while building the bm25 index linearly, i ran into some performance issues:

  • it started fast: around 1000 docs/sec in the first batch.
  • by the 25th batch (~25,000 docs), it slowed down to ~83 docs/sec.

but why?

  • i was using lists to store reverse/forward index entries.
    ➝ this caused O(n) lookups as the index grew.

  • i was recalculating average document length after every document addition.
    ➝ each time, it ran a full sum() over all docs.

optimizations ive made:

  • switched to defaultdict(set), which gives O(1) lookups when adding a doc to the reverse index.
  • made the average document length lazy: instead of recomputing after every doc, compute sum(doc lengths)/n only once, after the whole batch of documents is added.
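one way to sketch both fixes together — defaultdict(set) for the reverse index, plus a running total of document lengths so the average is O(1) whenever it's needed instead of a full sum() per doc. class and method names here are my own illustration:

```python
from collections import defaultdict

class BM25Index:
    def __init__(self):
        self.reverse_index = defaultdict(set)  # O(1) average insert/lookup
        self.forward_index = {}
        self.total_len = 0                     # running sum of doc lengths

    def add_doc(self, doc_id, tokens):
        counts = defaultdict(int)
        for tok in tokens:
            counts[tok] += 1
            self.reverse_index[tok].add(doc_id)  # no O(n) list scan
        self.forward_index[doc_id] = dict(counts)
        self.total_len += len(tokens)            # incremental, no full sum()

    @property
    def avg_doc_len(self):
        # computed on demand (e.g. once per batch/query), not per add
        return self.total_len / len(self.forward_index)

idx = BM25Index()
idx.add_doc(1, ["quick", "brown", "fox"])
idx.add_doc(2, ["lazy", "dog"])
# avg_doc_len is now (3 + 2) / 2 = 2.5
```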

the above optimizations took me to:

| metric | before | after |
| --- | --- | --- |
| indexing 25k docs | 167 seconds | ~6 seconds |
| avg. query time | ~55 ms | ~2 ms |
| 100k+ docs | crashed | scales comfortably |

this is a good start. i'm still learning about building the indexes and about persisting the index.