
Evaluating Recommender Systems


Recommender systems are integral to many online platforms, from streaming services to e-commerce websites. They help users navigate vast amounts of content by suggesting items that are likely to be interesting or useful.

However, building effective recommender systems is only part of the challenge; evaluating their performance is equally crucial.

In this post we'll explore the various methods and metrics used to evaluate recommender systems, ensuring they meet user needs and expectations.

Categories of Evaluation Experiments

Evaluating recommender systems can be broadly categorized into three types of experiments:

  • Offline Testing: This involves using static datasets, often historical data, to test the system. Techniques like cross-validation and temporal splits are common. The main advantage is that it doesn't require real users, making it cost-effective and easy to implement. A minimal temporal-split sketch follows this list.
  • Online Testing: This method tests the system "in the wild" with real users. A/B testing is a popular approach, where different groups of users are exposed to recommendations from different systems. This method is predominantly used by companies with access to a large user base.
  • User Studies: These involve direct interaction with users, typically through questionnaires or interviews. User studies can be quantitative or qualitative and are designed to gather detailed feedback on user experience and satisfaction.
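To make the offline setting concrete, here's a minimal sketch of a temporal split. It assumes pandas is available and that the interaction log has a `timestamp` column; both the column name and the function name are illustrative rather than prescribed by any particular library.

```python
import pandas as pd

def temporal_split(interactions: pd.DataFrame, test_fraction: float = 0.2):
    """Split an interaction log into train/test sets by time.

    The most recent interactions form the test set, mimicking how the
    system would be evaluated if it had been deployed at the cutoff point.
    """
    interactions = interactions.sort_values("timestamp")
    cutoff = int(len(interactions) * (1 - test_fraction))
    return interactions.iloc[:cutoff], interactions.iloc[cutoff:]
```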

Evaluation Perspectives

Recommender systems can be evaluated from different perspectives: as a retrieval task, a classification task or a user-centric task.

Evaluation Under Retrieval Aspects

When viewed as a retrieval task, the goal is to compare predicted user-item interactions with known interactions.

Performance measures used in information retrieval (IR) can be applied.

Figure: intersection of retrieved and relevant documents within a document corpus

Recall and Precision

Recall measures the fraction of relevant items that are successfully recommended.

$$\text{Recall} = \frac{|\text{Rel} \cap \text{Ret}|}{|\textcolor{brown}{\text{Rel}}|}$$

Precision measures the fraction of recommended items that are relevant.

$$\text{Precision} = \frac{|\text{Rel} \cap \text{Ret}|}{|\textcolor{blue}{\text{Ret}}|}$$

Here, $\text{Rel}$ is the set of all relevant items, and $\text{Ret}$ is the set of retrieved (recommended) items.

If the goal is to ensure that all potentially relevant items are recommended, high recall is important. This is useful in contexts where missing a relevant item could be detrimental.

If the primary goal is to ensure that most of the recommended items are relevant to the user, high precision is crucial. This is particularly important in contexts where user trust and satisfaction are key, as irrelevant recommendations can frustrate users.

In many real-world applications, a balance between precision and recall is necessary. This balance can be achieved using metrics like the F1 score.

F-measure

The F-measure, sometimes also called the F1-score or F-score, is defined as the harmonic mean of precision and recall, providing a single score that balances both concerns.

$$F = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
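As a minimal sketch (the function name and the set-based inputs are illustrative), all three measures can be computed directly from the relevant and retrieved item sets:

```python
def precision_recall_f1(relevant: set, retrieved: set):
    """Compute precision, recall and F1 from the relevant and retrieved item sets."""
    hits = len(relevant & retrieved)                      # |Rel ∩ Ret|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

# Two of the four recommended items are relevant:
print(precision_recall_f1({"a", "b", "c"}, {"b", "c", "x", "y"}))
# -> (0.5, 0.666..., 0.571...)
```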

Precision@k (P@k)

Precision@k assumes that users are only interested in the top $k$ recommended items.

$$P@k = \frac{|\text{Rel} \cap \text{Ret}[1 .. k]|}{k}$$

💡 Note that this metric assumes all top $k$ items are reviewed regardless of order. However, in practice, order matters, and users are less likely to scroll through long lists to find relevant items lower down.
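A minimal sketch of Precision@k, assuming the recommendations come as a ranked list (the names are illustrative):

```python
def precision_at_k(relevant: set, ranked: list, k: int) -> float:
    """Fraction of the top-k ranked items that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k
```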

Average Precision (AP)

This metric is similar to precision, but instead of using a single cutoff it averages the precision at every rank at which a relevant item is retrieved, providing a single-figure measure of quality.

$$AP = \frac{1}{|\text{Rel}|} \sum_{i=1}^{|\text{Ret}|} \text{relevant}(i) \times P@i$$

where $\text{relevant}(i)$ is an indicator function that returns $1$ if the item at position $i$ in the retrieved list is relevant and $0$ otherwise.
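A sketch of Average Precision that walks down the ranked list and accumulates $P@i$ at every position holding a relevant item (again, the names are illustrative):

```python
def average_precision(relevant: set, ranked: list) -> float:
    """Average of P@i over the positions i where a relevant item appears."""
    hits, precision_sum = 0, 0.0
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / i   # P@i at this relevant position
    return precision_sum / len(relevant) if relevant else 0.0
```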

R-Precision

R-Precision specifically measures the precision of the search results at the position equal to the number of relevant documents $R$ in the entire collection for a given query.

It's particularly useful as it gives an indication of how well the system is performing at retrieving all relevant documents within a limited set of top results.

$$\text{R-Precision} = \frac{|\text{Rel} \cap \text{Ret}_{\text{top R}}|}{|\text{Rel}|}$$

where $\text{Ret}_{\text{top R}}$ represents the top $R$ documents retrieved by the system.
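Since $R = |\text{Rel}|$, R-Precision reduces to precision at the cutoff $R$; a minimal sketch:

```python
def r_precision(relevant: set, ranked: list) -> float:
    """Precision at cutoff R, where R is the number of relevant items."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for item in ranked[:r] if item in relevant) / r
```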

Reciprocal Rank (RR)

This metric assumes that users are satisfied after encountering the first relevant item. Hence, we calculate the reciprocal of the position (rank) of the first retrieved item that is also relevant.

$$RR = \frac{1}{\min\{k : \text{Ret}[k] \in \text{Rel}\}}$$
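A sketch of the Reciprocal Rank; returning $0$ when no relevant item appears in the list is a common convention rather than something the formula itself dictates:

```python
def reciprocal_rank(relevant: set, ranked: list) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / i
    return 0.0
```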

Mean Average Precision (MAP)

Mean Average Precision extends average precision to multiple users by averaging the per-user AP scores.

$$MAP = \frac{1}{|U|} \sum_{u \in U} AP(u)$$

where $U$ is the set of users in the evaluation.
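As a self-contained sketch, MAP simply averages the per-user AP values; the dictionaries keyed by user id are an assumption about how the data is organised, not a requirement:

```python
def mean_average_precision(relevant_by_user: dict, ranked_by_user: dict) -> float:
    """Mean of the per-user Average Precision scores."""
    def ap(relevant: set, ranked: list) -> float:
        hits, total = 0, 0.0
        for i, item in enumerate(ranked, start=1):
            if item in relevant:
                hits += 1
                total += hits / i        # P@i at this relevant position
        return total / len(relevant) if relevant else 0.0

    scores = [ap(relevant_by_user[u], ranked_by_user[u]) for u in relevant_by_user]
    return sum(scores) / len(scores) if scores else 0.0
```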

Discounted Cumulative Gain (DCG)

Moving on from metrics that ignore the position of relevant items entirely, and from those that assume users are satisfied with the first relevant item, we arrive at Discounted Cumulative Gain, which fully accounts for the position of relevant items in the ranking, giving higher weight to items that appear earlier in the list.

$$DCG@k = \text{relevance}(1) + \sum_{i=2}^k \frac{\text{relevance}(i)}{\log_2(i)}$$

where $\text{relevance}(i)$ is a measure of how relevant the item at position $i$ is to the query or user (e.g. binary value or a value derived from a similarity measurement such as cosine similarity). The position $i$ refers to the rank of the item in the list. $\log_2(i)$ serves as a discount factor and reduces the contribution of items that appear lower in the ranking.

The first item's relevance, $\text{relevance}(1)$, is taken as is without any discount; subsequent items from position $2$ to $k$ are then discounted based on their rank in the list.
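A minimal sketch of DCG@k following the formula above, with relevance scores supplied as a dictionary (binary relevance works the same way; the names are illustrative):

```python
import math

def dcg_at_k(relevance_by_item: dict, ranked: list, k: int) -> float:
    """DCG@k: the first position is undiscounted, later ones are divided by log2(i)."""
    score = 0.0
    for i, item in enumerate(ranked[:k], start=1):
        rel = relevance_by_item.get(item, 0.0)
        score += rel if i == 1 else rel / math.log2(i)
    return score
```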

Evaluation Under Classification Aspects

When viewed as a classification task, the goal is to predict ratings for unknown items based on known user ratings.

Mean Absolute Error (MAE)

The Mean Absolute Error metric measures the average absolute difference between predicted and true ratings.

$$MAE = \frac{1}{|T|} \sum_{(u,i) \in T} |\hat r_{u,i} - r_{u,i}|$$

where $T$ represents the test set, a set of user-item pairs $(u,i)$ for which the true rating $r_{u,i}$ is known, and $\hat r_{u,i}$ denotes the predicted rating.

Root Mean Squared Error (RMSE)

RMSE measures the square root of the average squared difference between predicted and true ratings, disproportionately penalizing large prediction errors.

$$RMSE = \sqrt{\frac{1}{|T|} \sum_{(u,i) \in T} (\hat r_{u,i} - r_{u,i})^2 }$$
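Both error metrics can be sketched directly over aligned lists of predicted and true ratings drawn from the test set (the list-based interface is an assumption made for brevity):

```python
import math

def mae(predicted: list, actual: list) -> float:
    """Mean absolute difference between predicted and true ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def rmse(predicted: list, actual: list) -> float:
    """Root of the mean squared difference; large errors weigh more heavily."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))
```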

User-Centric Evaluation

Unlike the traditional quantitative metrics we discussed up until now, which focus on accuracy and performance, user-centric evaluation delves into the qualitative aspects of user experience.

Beyond-Accuracy Metrics

Beyond-Accuracy Metrics aim to capture aspects of the user experience that go beyond simple accuracy measures. These metrics provide a more holistic view of how well a recommender system performs in real-world scenarios.

  • Diversity ensures that the recommended items are not too similar, thereby providing a richer and more engaging user experience (a small sketch of diversity and coverage measures follows this list).
  • Novelty measures the ability of a recommender system to introduce users to new and unexpected items. High novelty can enhance user satisfaction by helping them discover items they might not have found otherwise.
  • Coverage measures the ability of a recommender system to serve all users and give each item a chance to be recommended.
  • Serendipity captures the ability of a recommender system to surprise and delight users by recommending items that are unexpected yet relevant.
  • Explainability refers to the ability of a recommender system to provide clear and understandable reasons for its recommendations.
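Two of these notions lend themselves to a rough sketch. Catalog coverage can be measured as the share of the catalog that appears in at least one recommendation list, and diversity as the average pairwise dissimilarity within a list; the `item_similarity` callable is a hypothetical stand-in for whatever similarity measure the system uses, and both formulations are common choices rather than the only ones:

```python
from itertools import combinations

def catalog_coverage(recommendation_lists: list, catalog: set) -> float:
    """Share of catalog items that appear in at least one recommendation list."""
    recommended = {item for rec_list in recommendation_lists for item in rec_list}
    return len(recommended & catalog) / len(catalog)

def intra_list_diversity(rec_list: list, item_similarity) -> float:
    """Average pairwise dissimilarity (1 - similarity) within a single list."""
    pairs = list(combinations(rec_list, 2))
    if not pairs:
        return 0.0
    return sum(1 - item_similarity(a, b) for a, b in pairs) / len(pairs)
```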

Questionnaires

Questionnaires are a valuable tool for gathering detailed feedback on user experience and satisfaction. They can be designed using quantitative and qualitative methods to capture a wide range of user perceptions and experiences.

  • Quantitative Methods involve structured questions with predefined response options, such as Likert-scale ratings or manual accuracy feedback. They are useful for gathering numerical data that can be analysed statistically.
  • Qualitative Methods rely on open-ended formats, such as open-question surveys, structured interviews or diary studies, to gather detailed and nuanced feedback.

Conclusion

Evaluating recommender systems involves a combination of offline testing, online testing, and user studies. Different perspectives, such as information retrieval, machine learning, and user-centric evaluation, provide a comprehensive view of the system's performance.

Metrics like recall, precision, F-measure, MAE, RMSE, and beyond-accuracy metrics help ensure that the recommender system meets user needs and expectations.

By understanding and applying these evaluation methods, developers can build more effective and user-friendly recommender systems, ultimately enhancing the user experience and satisfaction.

