1.0 Boosting Search Results with Absorb LMS
By leveraging AI, Absorb LMS provides a powerful search function that returns highly relevant results to users. To deliver this experience, we have incorporated AWS OpenSearch into our platform. AWS OpenSearch is a distributed, community-driven, Apache 2.0-licensed, 100% open-source search and analytics suite.
Fueled by the Apache Lucene search library, AWS OpenSearch offers a host of dynamic search capabilities, including full-text search, k-nearest neighbors (KNN) search, SQL, Anomaly Detection, Machine Learning Commons, and Trace Analytics. Using AWS OpenSearch, Absorb LMS provides an array of features to fine-tune relevance in search results, recognizing that a default, one-size-fits-all relevance function may not suit every use case.
In this blog post, we will outline how we leverage boosting parameters to improve search ranking for our users.
1.1 Query-Time Boosting
Since we extract multiple fields from a document (such as title, tags, and description), we can enhance the document's relevance by boosting individual fields during the query.
The following toy example illustrates how query boosting can improve document ranking. Table 1 lists four documents on classical physics topics that have been indexed in AWS OpenSearch.
Table 1: Doc ID, name, and description of documents indexed in AWS OpenSearch

| Doc ID | Name | Description |
|--------|------|-------------|
| D001 | Introduction to Classical Mechanics: Newtonian | Pre-req: Introduction to Classical Physics. The course gives a brief overview of Newtonian (classical) mechanics. |
| D002 | Introduction to Classical Mechanics: Hamiltonian | Pre-req: Introduction to Classical Physics. The course gives a brief overview of Hamiltonian mechanics. |
| D003 | Introduction to Classical Physics | A prerequisite to Classical Mechanics: Newtonian. |
| D004 | Angular momentum: classical and quantum physics overview | Further exploration on the analysis of angular momentum from both classical and quantum physics perspectives. |
Table 2 illustrates the default document ranking for the query “introduction to physics” (referred to as QID 001 henceforth).
Table 2: Document ranking for the query "introduction to physics" (QID 001)

| Doc ID | Score | Rank |
|--------|-------|------|
| D001 | 1.36 | 2 |
| D002 | 1.61 | 1 |
| D003 | 1.18 | 3 |
| D004 | 1.07 | 4 |
However, if we observe that most users select D003 in the search results, we can infer that it should be ranked higher. To achieve this, we boost the name field by a factor of 2. Table 3 shows that, as a result of this modification, D003 now holds the highest score and therefore the top rank in the search results.
Table 3: Document ranking for QID 001 after boosting the name field

| Doc ID | Score | Rank |
|--------|-------|------|
| D001 | 1.71 | 4 |
| D002 | 1.95 | 2 |
| D003 | 2.35 | 1 |
| D004 | 1.78 | 3 |
1.2 Relevance Dataset: Solving the Labelled Dataset Problem
The best way to identify boosting parameters is to have a labelled dataset of documents for a set of queries and use optimization algorithms to narrow down the boosting weights for the respective fields. The Mean Reciprocal Rank (MRR), negated so that minimizing the loss maximizes ranking quality, can serve as the loss function for such optimizations.
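For reference, MRR is the average, over all queries, of the reciprocal rank of the first relevant result. A minimal sketch follows; the data layout is illustrative, not our production format.

```python
def mean_reciprocal_rank(ranked_results, relevant_docs):
    """Compute MRR over a set of queries.

    ranked_results: dict mapping query ID -> list of doc IDs in ranked order.
    relevant_docs:  dict mapping query ID -> set of relevant doc IDs.
    """
    total = 0.0
    for qid, ranking in ranked_results.items():
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant_docs[qid]:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_results)

# Example: D003 is the relevant document for QID 001, but the default
# ranking from Table 2 places it third, so MRR = 1/3.
print(mean_reciprocal_rank({"001": ["D002", "D001", "D003", "D004"]},
                           {"001": {"D003"}}))  # ≈ 0.33
```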
Labelled datasets, however, are not readily available; creating them is a challenge that requires considerable manual effort. To simplify this, we can use historical search data to assign relevance to documents as a first step. Historical search data gives us insight into the queries users issue and the number of times a given document was selected. For example, Table 4 shows the number of clicks recorded for QID 001 during a specific timeframe. Based on this data, we labelled the document with the highest number of clicks as “highly relevant” and the document with no clicks as “not relevant”.
Table 4: Number of clicks recorded for each document and the assigned relevance

| QID | Doc ID | Number of clicks | Relevance |
|-----|--------|------------------|-----------|
| 001 | D001 | 0 | 0 (Not relevant) |
| 001 | D002 | 500 | 2 (Relevant) |
| 001 | D003 | 1000 | 3 (Highly relevant) |
| 001 | D004 | 100 | 1 (Related) |
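One simple way to derive such labels programmatically is to bucket click counts into relevance grades. Below is a sketch with thresholds chosen to match the toy data above; in practice, thresholds would be derived from the observed click distribution for each query.

```python
def relevance_from_clicks(clicks: int) -> int:
    """Map a raw click count to a graded relevance label (0-3).

    Thresholds are illustrative only.
    """
    if clicks >= 1000:
        return 3  # highly relevant
    if clicks >= 500:
        return 2  # relevant
    if clicks > 0:
        return 1  # related
    return 0      # not relevant

# Click counts for QID 001 from Table 4.
clicks_for_qid_001 = {"D001": 0, "D002": 500, "D003": 1000, "D004": 100}
labels = {doc: relevance_from_clicks(c) for doc, c in clicks_for_qid_001.items()}
print(labels)  # {'D001': 0, 'D002': 2, 'D003': 3, 'D004': 1}
```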
1.3 Optimization
With a dataset and a method to quantify ranking performance in hand, we can proceed to fine-tune our query parameters. Bayesian optimization and grid search are the two most common ways to find an optimal set of parameters. Both are well suited to problems like this one, where evaluating the objective function is expensive and gradients are not available.
Bayesian optimization is the preferred solution when many parameters need to be tuned. It approximates the objective function with a Gaussian process and uses an acquisition function to choose the next sampling point, balancing points likely to improve the objective against points that reduce prediction uncertainty.
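A minimal sketch of such a loop with scikit-optimize's `gp_minimize` is shown below. The objective here is a dummy stand-in with a known optimum; in practice, it would issue the labelled queries with the candidate boosts (for example, via the rank evaluation API discussed in the next section) and return the negative MRR. The field names and search bounds are assumptions.

```python
from skopt import gp_minimize
from skopt.space import Real

# One boost weight per field; the bounds are assumptions for illustration.
space = [
    Real(0.1, 10.0, name="name_boost"),
    Real(0.1, 10.0, name="tags_boost"),
    Real(0.1, 10.0, name="description_boost"),
]

def run_labelled_queries(name_b, tags_b, desc_b):
    """Placeholder: issue the labelled queries with these boosts and return
    the resulting MRR. Replaced here by a dummy with a known optimum at
    (2.0, 1.5, 0.8) so the example runs end to end."""
    return 1.0 / (1.0 + (name_b - 2.0) ** 2
                  + (tags_b - 1.5) ** 2
                  + (desc_b - 0.8) ** 2)

def negative_mrr(boosts):
    """Loss function: negate MRR so that minimizing maximizes ranking quality."""
    name_b, tags_b, desc_b = boosts
    return -run_labelled_queries(name_b, tags_b, desc_b)

result = gp_minimize(negative_mrr, space, n_calls=50, random_state=42)
print("best boosts:", result.x, "best MRR:", -result.fun)
```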
2.0 Approach and Results: An Overview
The results of one such optimization run are shown in Figure 1. For this run, we selected 50,000 queries and assigned relevance to 31,647,318 documents based on observed user clicks. We used the scikit-optimize package and the rank evaluation API, with negative MRR as the loss function. Four fields (name, title, tags, and question) were selected for boosting.
Figure 1: Partial dependence plots for the boost value of each field. The plots along the diagonal are 1D and the off-diagonal plots are 2D. The optimal boost value for each field is indicated with a red star.
These boost values can be further refined by conducting a finer grid search around the values suggested by Bayesian optimization.
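Such a refinement pass can be a simple exhaustive sweep over a small neighbourhood of each suggested value. In the sketch below, the suggested boosts are placeholders and a dummy loss stands in for the negative-MRR objective from the earlier optimization sketch.

```python
import itertools

import numpy as np

# Placeholder loss; in practice, reuse the negative-MRR objective that
# evaluates the labelled queries with the candidate boosts.
def loss(boosts):
    name_b, tags_b, desc_b = boosts
    return (name_b - 2.0) ** 2 + (tags_b - 1.5) ** 2 + (desc_b - 0.8) ** 2

# Sweep a ±20% neighbourhood around the boosts suggested by Bayesian
# optimization (placeholder values), five steps per field.
suggested = [2.0, 1.5, 0.8]
grids = [np.linspace(0.8 * v, 1.2 * v, 5) for v in suggested]

best = min(itertools.product(*grids), key=loss)
print("refined boosts:", best)
```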
To sum up, the search functionality of Absorb LMS, powered by AWS OpenSearch, effectively harnesses machine learning to deliver a superior search experience.