How to find more relevant search results by combining traditional keyword-based search with modern vector search

Towards Data Science
hybrid search bar
Search bar with hybrid search capabilities

With the recent interest in Retrieval-Augmented Generation (RAG) pipelines, developers have started discussing challenges in building RAG pipelines with production-ready performance. Just like in many aspects of life, the Pareto Principle also comes into play with RAG pipelines, where achieving the initial 80% is relatively straightforward, but attaining the remaining 20% for production readiness proves challenging.

One commonly repeated theme is to improve the retrieval component of a RAG pipeline with hybrid search.

Developers who have already gained experience building RAG pipelines have started sharing their insights. One commonly repeated theme is to improve the retrieval component of a RAG pipeline with hybrid search.

This article introduces you to the concept of hybrid search, how it can help you improve your RAG pipeline performance by retrieving more relevant results, and when to use it.

Hybrid search is a search technique that combines two or more search algorithms to improve the relevance of search results. Although it is not defined which algorithms are combined, hybrid search most commonly refers to the combination of traditional keyword-based search and modern vector search.

Traditionally, keyword-based search was the obvious choice for search engines. But with the advent of Machine Learning (ML) algorithms, vector embeddings enabled a new search technique — called vector or semantic search — that allowed us to search across data semantically. However, both search techniques have essential tradeoffs to consider:

  • Keyword-based search: While its exact keyword-matching capabilities are beneficial for specific terms, such as product names or industry jargon, it is sensitive to typos and synonyms, which lead it to miss important context.
  • Vector or semantic search: While its semantic search capabilities allow multi-lingual and multi-modal search based on the data’s semantic meaning and make it robust to typos, it can miss essential keywords. Additionally, it depends on the quality of the generated vector embeddings and is sensitive to out-of-domain terms.

Combining keyword-based and vector searches into a hybrid search allows you to leverage the advantages of both search techniques to improve search results’ relevance, especially for text-search use cases.

For example, consider the search query “How to merge two Pandas DataFrames with .concat()?”. The keyword search would help find relevant results for the method .concat(). However, since the word “merge” has synonyms such as “combine”, “join”, and “concatenate”, it would be helpful if we could leverage the context awareness of semantic search (see more details in When Would You Use Hybrid Search).

If you are interested, you can play around with the different keyword-based, semantic, and hybrid search queries to search for movies in this live demo (its implementation is detailed in this article).

Hybrid search combines keyword-based and vector search techniques by fusing their search results and reranking them.

Keyword-based search

Keyword-based search in the context of hybrid search often uses a representation called sparse embeddings, which is why it is also referred to as sparse vector search. Sparse embeddings are vectors with mostly zero values with only a few non-zero values, as shown below.

[0, 0, 0, 0, 0, 1, 0, 0, 0, 24, 3, 0, 0, 0, 0, ...]

Sparse embeddings can be generated with different algorithms. The most commonly used algorithm for sparse embeddings is BM25 (Best match 25), which builds upon the TF-IDF (Term Frequency-Inverse Document Frequency) approach and refines it. In simple terms, BM25 emphasizes the importance of terms based on their frequency in a document relative to their frequency across all documents.

Vector search

Vector search is a modern search technique that has emerged with the advances in ML. Modern ML algorithms, such as Transformers, can generate a numerical representation of data objects in various modalities (text, images, etc.) called vector embeddings.

These vector embeddings are usually densely packed with information and mostly comprised of non-zero values (dense vectors), as shown below. This is why vector search is also known as dense vector search.

[0.634, 0.234, 0.867, 0.042, 0.249, 0.093, 0.029, 0.123, 0.234, ...]

A search query is embedded into the same vector space as the data objects. Then, its vector embedding is used to calculate the closest data objects based on a specified similarity metric, such as cosine distance. The returned search results list the closest data objects ranked by their similarity to the search query.

Fusion of keyword-based and vector search results

Both the keyword-based search and the vector search return a separate set of results, usually a list of search results sorted by their calculated relevance. These separate sets of search results must be combined.

There are many different strategies to combine the ranked results of two lists into one single ranking, as outlined in a paper by Benham and Culpepper [1].

Generally speaking, the search results are usually first scored. These scores can be calculated based on a specified metric, such as cosine distance, or simply just the rank in the search results list.

Then, the calculated scores are weighted with a parameter alpha, which dictates each algorithm’s weighting and impacts the results’ re-ranking.

hybrid_score = (1 - alpha) * sparse_score + alpha * dense_score

Usually, alpha takes a value between 0 and 1, with

  • alpha = 1: Pure vector search
  • alpha = 0: Pure keyword search

Below, you can see a minimal example of the fusion between keyword and vector search with scoring based on the rank and an alpha = 0.5.

Minimal example showcasing the different rankings of keyword-based search, vector search, and hybrid search.
Minimal example of how keyword and vector search results can be fused in hybrid search with scoring based on ranking and an alpha of 0.5 (Image by the author, inspired by Hybrid search explained)

A RAG pipeline has many knobs you can tune to improve its performance. One of these knobs is to improve the relevance of the retrieved context that is then fed into the LLM because if the retrieved context is not relevant for answering a given question, the LLM won’t be able to generate a relevant answer either.

Depending on your context type and query, you have to determine which of the three search techniques is most beneficial for your RAG application. Thus, the parameter alpha, which controls the weighting between keyword-based and semantic search, can be viewed as a hyperparameter that needs to be tuned.

In a common RAG pipeline using LangChain, you would define the retriever component by setting the used vectorstore component as the retriever with the .as_retriever() method as follows:

# Define and populate vector store
# See details here
vectorstore = ...

# Set vectorstore as retriever
retriever = vectorstore.as_retriever()

However, this method only enables semantic search. If you want to enable hybrid search in LangChain, you will need to define a specific retriever component with hybrid search capabilities, such as the WeaviateHybridSearchRetriever:

from langchain.retrievers.weaviate_hybrid_search import WeaviateHybridSearchRetriever

retriever = WeaviateHybridSearchRetriever(
alpha = 0.5, # defaults to 0.5, which is equal weighting between keyword and semantic search
client = client, # keyword arguments to pass to the Weaviate client
index_name = “LangChain”, # The name of the index to use
text_key = “text”, # The name of the text key to use
attributes = [], # The attributes to return in the results

The rest of the vanilla RAG pipeline will stay the same.

This small code change allows you to experiment with different weighting between keyword-based and vector searches. Note that setting alpha = 1 equals a fully semantic search as is the equivalent of defining the retriever from the vectorstore component directly (retriever = vectorstore.as_retriever()).

Hybrid search is ideal for use cases where you want to enable semantic search capabilities for a more human-like search experience but also require exact phrase matching for specific terms, such as product names or serial numbers.

An excellent example is the platform Stack Overflow, which has recently extended its search capabilities with semantic search by using hybrid search.

Initially, Stack Overflow used TF-IDF to match keywords to documents [2]. However, describing the coding problem you are trying to solve can be difficult. It may lead to different results based on the words you use to describe your problem (e.g., combining two Pandas DataFrames can be done in different methods such as merging, joining, and concatenating). Thus, a more context-aware search method, such as semantic search, would be more beneficial for these cases.

However, on the other hand, a common use case of Stack Overflow is to copy-paste error messages. For this case, exact keyword matching is the preferred search method. Also, you will want exact keyword-matching capabilities for method and argument names (e.g., .read_csv() in Pandas).

As you can guess, many similar real-world use cases benefit from context-aware semantic searches but still rely on exact keyword matching. These use cases can strongly benefit from implementing a hybrid search retriever component.

This article introduced the context of hybrid search as a combination of keyword-based and vector searches. Hybrid search merges the search results of the separate search algorithms and re-ranks the search results accordingly.

In hybrid search, the parameter alpha controls the weighting between keyword-based and semantic searches. This parameter alpha can be viewed as a hyperparameter to tune in RAG pipelines to improve the accuracy of search results.

Using the Stack Overflow [2] case study, we showcased how hybrid search can be useful for use cases where semantic search can improve the search experience. However, exact keyword matching is still important when specific terms are frequent.

Source link