Hybrid Search - Ensemble Retriever

Hybrid Vector Search

Hybrid search combines keyword search with vector search. It keeps the exact-term matching of keyword search and adds the semantic lookup that embeddings and vector search provide.

Keyword Search:

For keyword search, GenAI Stack internally uses the BM25 algorithm, which represents text as sparse vectors. BM25 (Best Match 25) is an information retrieval algorithm that scores and ranks documents by their relevance to a search query. It extends the TF-IDF (Term Frequency-Inverse Document Frequency) approach.

Key points about BM25:

  1. Term Frequency (TF): Measures the frequency of a term in a document.

  2. Inverse Document Frequency (IDF): Measures the importance of a term based on its frequency across the entire document collection.

  3. BM25 Weighting: Combines TF and IDF to calculate the relevance score of a document for a given query.

  4. Query Terms: BM25 scores each query term that occurs in the document and sums these per-term scores to get the document's overall score.

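  5. Parameter Tuning: BM25 has two tunable parameters. k1 controls how quickly the contribution of term frequency saturates, and b controls how strongly scores are normalized by document length. Both can be tuned to optimize ranking performance on a given dataset.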
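The key points above can be sketched in a few lines of Python. This is a minimal illustration of one common BM25 variant, not the implementation GenAI Stack uses internally. The function name bm25_scores, the toy corpus, the whitespace tokenization, and the defaults k1=1.5 and b=0.75 are all assumptions made for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`.

    Hypothetical helper for illustration; k1 and b defaults are assumed values.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)  # term frequency within this document
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # IDF: rarer terms across the collection get a higher weight.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # k1 saturates term frequency; b normalizes by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus, tokenized by whitespace (an assumption for this sketch).
corpus = [
    "hybrid search combines keyword and vector search".split(),
    "bm25 ranks documents by keyword relevance".split(),
    "embeddings capture the meaning of text".split(),
]
scores = bm25_scores("keyword search".split(), corpus)
```

In this toy corpus the first document matches both query terms, the second matches only one, and the third matches none, so it scores zero. That zero is the gap semantic search is meant to cover when a relevant document uses different wording.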

Semantic Search:

Semantic search is a search method that aims to improve the accuracy and relevance of search results by understanding the context and meaning behind a search query. Unlike traditional keyword-based search, which primarily relies on matching keywords, semantic search tries to comprehend the intent and context of the user’s query and the content of the documents being searched.

Semantic search strives to mimic human understanding of language and context, ultimately delivering search results that align better with the user’s information needs.

In GenAI Stack, we use the Vector Store retriever for semantic search.
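As a rough sketch of what a vector store retriever does, the example below ranks documents by cosine similarity between a query vector and document vectors. The three-dimensional vectors are hand-made stand-ins for real embeddings, and the names cosine_similarity, semantic_search, and toy_vectors are hypothetical. A real vector store would obtain vectors from an embedding model and typically use an approximate nearest-neighbor index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(query_vec, doc_vecs, top_k=2):
    """Return the ids of the `top_k` documents most similar to the query."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:top_k]]


# Hand-made toy vectors standing in for real embeddings (assumption).
toy_vectors = {
    "doc_cars": [0.9, 0.1, 0.0],
    "doc_trucks": [0.8, 0.2, 0.1],
    "doc_cooking": [0.0, 0.1, 0.9],
}
query_vector = [1.0, 0.0, 0.0]
top_docs = semantic_search(query_vector, toy_vectors)
```

Here the two vehicle documents rank above the cooking document because their vectors point in nearly the same direction as the query vector, regardless of whether they share any keywords with it.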

Ensemble Retriever

The EnsembleRetriever takes a list of retrievers as input, combines the results of their get_relevant_documents() methods, and reranks the combined results using the Reciprocal Rank Fusion algorithm.

By leveraging the strengths of different algorithms, the EnsembleRetriever can often achieve better retrieval performance than any single algorithm on its own.

The most common pattern is to combine a sparse retriever (like BM25) with a dense retriever (like embedding similarity), because their strengths are complementary. This combination is also known as “hybrid search”.
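The fusion of a sparse and a dense ranking can be sketched as follows. This is a minimal illustration of Reciprocal Rank Fusion, not the EnsembleRetriever source code. The function name reciprocal_rank_fusion, the document ids, the optional per-retriever weights, and the constant c=60 (a commonly used value) are assumptions for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, weights=None, c=60):
    """Fuse several ranked lists of document ids into one ranked list.

    Each document receives weight / (c + rank) from every list it appears in,
    where rank starts at 1 for the top result.
    """
    if weights is None:
        weights = [1.0] * len(rankings)

    fused = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += weight / (c + rank)

    # Highest fused score first.
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical outputs of a sparse (BM25) and a dense (vector) retriever.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]
fused_ranking = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Because the method uses only rank positions, the BM25 scores and the similarity scores never need to be put on a common scale. Documents that appear near the top of both lists, such as doc_a and doc_b here, end up ahead of documents returned by only one retriever.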
