How to interpret solution metrics in AWS Personalize?

Can someone explain AWS Personalize solution version metrics in layman's terms, or at least tell me what these metrics should ideally look like?

I know nothing about machine learning and wanted to use Personalize because it is marketed as a 'no-previous-knowledge-required' ML SaaS. However, the "solution version metrics" in my solution results seem to require a fairly advanced level of math to understand.

My solution version metrics are as follows:
Normalized discounted cumulative gain
At 5: 0.9881, At 10: 0.9890, At 25: 0.9898
Precision
At 5: 0.1981, At 10: 0.0993, At 25: 0.0399
Mean reciprocal rank
At 25: 0.9833

Research

I have looked through the Personalize Developer's Guide, which includes a short definition of each metric on page 72. I also attempted to skim the Wikipedia articles on discounted cumulative gain and mean reciprocal rank. From that reading, here is my interpretation of each metric:
NDCG = consistency of recommendation relevance; is the first recommendation as relevant as the last?
Precision = relevance of recommendations to users; how relevant are your recommendations across all users?
MRR = relevance of the first recommendation versus the rest of the list; how relevant is your first recommendation to each user?

If these interpretations are correct, then my solution metrics suggest that I am highly consistent at recommending irrelevant content. Is that a valid conclusion?

Okay, my company has Developer Tier Support, so I was able to get an answer to this question from AWS.

Answer Summary

The closer a metric is to "1", the better. My interpretation of the metrics was mostly correct, but my conclusion was wrong.

Apparently, these metrics (and Personalize in general) do not take into account how much a user likes an item. Personalize only cares about how quickly relevant recommendations reach the user. This makes sense: if you reach the 25th item in the queue and haven't liked anything you've seen, you're unlikely to keep looking.

Given that, what happened in my solution is that the first recommendation was relevant, but none of the others were.

Detailed Answer from AWS

I will start with the relatively easier question first: what are the ideal values for these metrics, so that one solution version can be preferred over another? The answer is that for each metric, higher numbers are better. [1] If you have more than one solution version, prefer the one with the higher values for these metrics. Please note that you can create a number of solution versions by Overriding Default Recipe Parameters [2] and by using Hyperparameters [3].

The second question: how to understand and interpret the metrics for an AWS Personalize solution version? I can confirm from my research that the definitions and interpretations you provided for these metrics in the case are valid.

Before I explain each metric, here is a primer on one of the main concepts in machine learning: how are these metrics calculated? The model training step during the creation of a solution version splits the input dataset into two parts, a training dataset (~70%) and a test dataset (~30%). The training dataset is used during model training. Once the model is trained, it is used to predict values for the test dataset, and those predictions are validated against the known (and correct) values in the test dataset. [4]
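The split described above can be sketched roughly as follows. Personalize performs this internally; the `train_test_split` helper and the random 70/30 shuffle here are illustrative assumptions, not Personalize's actual implementation:

```python
import random

def train_test_split(interactions, test_fraction=0.3, seed=42):
    """Shuffle interaction records and hold out ~30% for evaluation."""
    shuffled = interactions[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# 20 toy (user, movie) interaction records
interactions = [(f"user{u}", f"movie{m}") for u in range(5) for m in range(4)]
train, test = train_test_split(interactions)
print(len(train), len(test))  # 14 6
```

The model never sees the held-out 30%, so comparing its predictions against those records gives an honest estimate of how it will behave on unseen users and items.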

I researched further to find more resources to understand the concept behind these metrics and also elaborate further an example provided in the AWS documentation. [1]

"mean_reciprocal_rank_at_25"

Let’s first understand Reciprocal Rank. For example, a movie streaming service uses a solution version to predict a list of 5 recommended movies for a specific user, i.e. A, B, C, D, E. Once these 5 recommended movies are compared against the actual movies liked by that user (in the test dataset), we find that only movies B and E are actually liked by the user. The Reciprocal Rank considers only the first relevant (correct according to the test dataset) recommendation, which is movie B at rank 2, and ignores movie E at rank 5. Thus the Reciprocal Rank is 1/2 = 0.5.

Now let’s expand the above example to understand Mean Reciprocal Rank. [5] Assume we ran predictions for three users and the movies below were recommended.
User 1: A, B, C, D, E (user liked B and E, thus the Reciprocal Rank is 1/2)
User 2: F, G, H, I, J (user liked H and I, thus the Reciprocal Rank is 1/3)
User 3: K, L, M, N, O (user liked K, M and N, thus the Reciprocal Rank is 1)
The Mean Reciprocal Rank is the sum of all the individual Reciprocal Ranks divided by the total number of prediction queries, which is 3: (1/2 + 1/3 + 1)/3 = (0.5 + 0.33 + 1)/3 = 1.83/3 = 0.61
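The three-user example can be checked with a short script (a sketch, not AWS's internal code; `reciprocal_rank` and `mean_reciprocal_rank` are names chosen here for illustration):

```python
def reciprocal_rank(recommended, liked):
    """1 / (1-based position of the first relevant item); 0 if none is relevant."""
    for position, item in enumerate(recommended, start=1):
        if item in liked:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(queries):
    """Average the reciprocal ranks over all (recommended, liked) pairs."""
    return sum(reciprocal_rank(r, l) for r, l in queries) / len(queries)

queries = [
    (["A", "B", "C", "D", "E"], {"B", "E"}),       # first hit at rank 2 -> 1/2
    (["F", "G", "H", "I", "J"], {"H", "I"}),       # first hit at rank 3 -> 1/3
    (["K", "L", "M", "N", "O"], {"K", "M", "N"}),  # first hit at rank 1 -> 1
]
print(round(mean_reciprocal_rank(queries), 2))  # 0.61
```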

In case of AWS Personalize Solution version metrics, the mean of the reciprocal ranks of the first relevant recommendation out of the top 25 recommendations over all queries is called “mean_reciprocal_rank_at_25”.

"precision_at_K"

Precision can be stated as the capability of a model to deliver the relevant elements with the smallest number of recommendations. The concept of precision is described in the following free video available on Coursera. [6] A very good article on the same topic can be found here. [7]

Let’s consider the same example: a movie streaming service uses a solution version to predict a list of 5 recommended movies for a specific user, i.e. A, B, C, D, E. Once these 5 recommended movies are compared against the actual movies liked by that user (correct values in the test dataset), we find that only movies B and E are actually liked by the user. The precision_at_5 is 2 correctly predicted movies out of a total of 5, which can be stated as 2/5 = 0.4.
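As a quick sketch of the same calculation (illustrative code, not the Personalize implementation):

```python
def precision_at_k(recommended, liked, k):
    """Fraction of the top-k recommendations that are relevant."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in liked)
    return hits / k

print(precision_at_k(["A", "B", "C", "D", "E"], {"B", "E"}, 5))  # 0.4
```

Note that precision@K divides by K regardless of how many items the user actually liked, which is why precision@25 in the metrics above is so much lower than precision@5: the denominator grows while the number of hits stays small.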

"normalized_discounted_cumulative_gain_at_K"

This metric uses the concepts of logarithms and the logarithmic scale to assign a weighting factor to relevant items (correct values in the test dataset). A full description of logarithms and the logarithmic scale is beyond the scope of this document; the main objective of using a logarithmic scale is to compress wide-ranging quantities into a small range.

discounted_cumulative_gain_at_K
Let’s consider the same example: a movie streaming service uses a solution version to predict a list of 5 recommended movies for a specific user, i.e. A, B, C, D, E. Once these 5 recommended movies are compared against the actual movies liked by that user (correct values in the test dataset), we find that only movies B and E are actually liked by the user. To produce the discounted cumulative gain (DCG) at 5, each relevant item is assigned a weighting factor (using the logarithmic scale) based on its position in the top 5 recommendations. The value produced by this formula is called the “discounted value”.
The formula is 1/log(1 + position)
As B is at position 2 so the discounted value is = 1/log(1 + 2)
As E is at position 5 so the discounted value is = 1/log(1 + 5)
The discounted cumulative gain (DCG) is calculated by adding the discounted values for both relevant items: DCG = ( 1/log(1 + 2) + 1/log(1 + 5) )

normalized_discounted_cumulative_gain_at_K
First of all, what is the “ideal DCG”? In the above example, the ideal prediction would look like B, E, A, C, D; that is, the relevant items would be at positions 1 and 2 in the ideal case. To produce the “ideal DCG” at 5, each relevant item is assigned a weighting factor (using the logarithmic scale) based on its position in this ideal ranking.
The formula is 1/log(1 + position).
As B is at position 1 so the discounted value is = 1/log(1 + 1)
As E is at position 2 so the discounted value is = 1/log(1 + 2)
The ideal DCG is calculated by adding the discounted values for both relevant items: ideal DCG = ( 1/log(1 + 1) + 1/log(1 + 2) )

The normalized discounted cumulative gain (NDCG) is the DCG divided by the “ideal DCG”. DCG / ideal DCG = (1/log(1 + 2) + 1/log(1 + 5)) / (1/log(1 + 1) + 1/log(1 + 2)) = 0.6241
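The whole calculation can be reproduced in a few lines. This sketch assumes base-10 logarithms; the DCG / ideal DCG ratio is the same for any log base, since changing base multiplies numerator and denominator by the same constant:

```python
import math

def dcg(recommended, liked):
    """Sum 1/log10(1 + position) over the relevant items (1-based positions)."""
    return sum(
        1.0 / math.log10(1 + position)
        for position, item in enumerate(recommended, start=1)
        if item in liked
    )

def ndcg(recommended, liked):
    """DCG of the actual ranking divided by the DCG of the ideal ranking."""
    # Ideal ranking: all relevant items first, keeping their relative order
    ideal = [i for i in recommended if i in liked] + \
            [i for i in recommended if i not in liked]
    return dcg(recommended, liked) / dcg(ideal, liked)

print(f"{ndcg(['A', 'B', 'C', 'D', 'E'], {'B', 'E'}):.4f}")  # 0.6241
```

An NDCG of 1 therefore means every relevant item already sits at the top of the list, which is why values near 1, as in the metrics above, say nothing about how many recommendations were relevant, only how early the relevant ones appeared.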

I hope the information provided above is helpful in understanding the concept behind these metrics.

[1] https://docs.aws.amazon.com/personalize/latest/dg/working-with-training-metrics.html
[2] https://docs.aws.amazon.com/personalize/latest/dg/customizing-solution-config.html
[3] https://docs.aws.amazon.com/personalize/latest/dg/customizing-solution-config-hpo.html
[4] https://medium.com/@m_n_malaeb/recall-and-precision-at-k-for-recommender-systems-618483226c54
[5] https://www.blabladata.com/2014/10/26/evaluating-recommender-systems/
[6] https://www.coursera.org/lecture/ml-foundations/optimal-recommenders-4EQc2
[7] https://medium.com/@bond.kirill.alexandrovich/precision-and-recall-in-recommender-systems-and-some-metrics-stuff-ca2ad385c5f8