Should I encode my ordinal variables before calculating Spearman's rank correlation (scipy)?

Question

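I am using scipy.stats.spearmanr to compute the Spearman rank correlation between two ordinal variables. I am not sure whether I need to encode them first. I tried both approaches, and the function returns a result either way, so I don't know which one is right.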

from scipy import stats

# dummy data comparing one ordinal variable with another
print(stats.spearmanr(['always','never','sometimes','always'], ['high','medium','low','low']))
>> SpearmanrResult(correlation=0.5000000000000001, pvalue=0.4999999999999999)

# encoding
print(stats.spearmanr([3,1,2,3], [3,2,1,1]))
>> SpearmanrResult(correlation=0.05555555555555556, pvalue=0.9444444444444444)

Answer 1

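Unless the alphabetical order of your data happens to match the intended order, you should encode your variables.

Internally, SciPy ranks your data in order to perform the test. For integers, the ordering is simply that of the values themselves, e.g. 1 < 2 < 3. For strings, the ordering is most likely alphabetical, e.g. a < b < c.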
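To make the ranking step concrete, here is a minimal sketch using scipy.stats.rankdata, which as far as I know is the ranking routine spearmanr relies on. The encoding (never=1, sometimes=2, always=3) is the one implied by the question's second call.

```python
from scipy.stats import rankdata

# Encoded answers: never=1, sometimes=2, always=3
encoded = [3, 1, 2, 3]

# The smallest value gets rank 1; tied values share the average of their ranks
ranks = rankdata(encoded)
print(ranks)  # [3.5 1.  2.  3.5]
```

The two 'always' answers tie for ranks 3 and 4, so each gets 3.5. The correlation is computed on these ranks, not on the raw values.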

In your case, the intended order is probably

never < sometimes < always
low < medium < high

However, sorting these lists of values alphabetically yields the (most likely incorrect) order

always < never < sometimes
high < low < medium
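You can check the alphabetical order yourself with plain Python. This sketch only uses the built-in sorted(), which compares strings lexicographically; the variable names are mine.

```python
frequency = ['always', 'never', 'sometimes', 'always']
level = ['high', 'medium', 'low', 'low']

# Distinct labels in the order that string comparison produces
frequency_sorted = sorted(set(frequency))
level_sorted = sorted(set(level))

print(frequency_sorted)  # ['always', 'never', 'sometimes']
print(level_sorted)      # ['high', 'low', 'medium']
```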

If you manually encode these lists as integers, or as string values that sort correctly, you can fix this:

import scipy.stats  # import the submodule explicitly; a bare "import scipy" may not expose scipy.stats

# Incorrect alphabetical order
scipy.stats.spearmanr(['always','never','sometimes','always'], ['high','medium','low','low'])
# SpearmanrResult(correlation=0.5000000000000001, pvalue=0.4999999999999999)

# Incorrect integer order
scipy.stats.spearmanr([1,2,3,1], [1,3,2,2])
# SpearmanrResult(correlation=0.5000000000000001, pvalue=0.4999999999999999)

# Correct integer order
scipy.stats.spearmanr([3,1,2,3], [3,2,1,1])
# SpearmanrResult(correlation=0.05555555555555556, pvalue=0.9444444444444444)

# Correct alphabetical order
scipy.stats.spearmanr(['c','a','b','c'], ['c','b','a','a'])
# SpearmanrResult(correlation=0.05555555555555556, pvalue=0.9444444444444444)
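Rather than typing the integers by hand, you could keep the intended order in a dictionary and map the labels through it. This is just a sketch: the dictionary and variable names are made up for illustration, and the order is the one assumed above (never < sometimes < always, low < medium < high).

```python
from scipy import stats

# Hypothetical mappings that spell out the intended order of each scale
frequency_order = {'never': 1, 'sometimes': 2, 'always': 3}
level_order = {'low': 1, 'medium': 2, 'high': 3}

frequency = ['always', 'never', 'sometimes', 'always']
level = ['high', 'medium', 'low', 'low']

# Translate each label to its position on the scale
frequency_codes = [frequency_order[v] for v in frequency]  # [3, 1, 2, 3]
level_codes = [level_order[v] for v in level]              # [3, 2, 1, 1]

# The result unpacks into the correlation coefficient and the p-value
rho, pvalue = stats.spearmanr(frequency_codes, level_codes)
print(rho, pvalue)
```

A label that is missing from the dictionary raises a KeyError, which is usually what you want: a typo in the data fails loudly instead of being ranked silently.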
