What is the R equivalent of Python's .cat.codes, which converts a categorical variable to integer levels?

In Python, you can use .cat.codes to generate category codes for a variable, for example:

df['col3'] = df['col3'].astype('category').cat.codes

How do you do this in R?

To flesh this out further for @Sid29:

The pandas accessor .cat.codes extracts the numeric representation of the factor levels. The equivalent in R is:

a <- factor(c("good", "bad", "good", "bad", "terrible"))

as.numeric(a)
[1] 2 1 2 1 3

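Note that .cat.codes represents NA (or NaN, which is the same thing here) as -1, whereas the R solution above preserves NA, so the output for those entries is simply NA. Note also that pandas codes start at 0, while R's integer codes start at 1.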
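To make the difference between the two conventions concrete, here is a minimal pure-Python sketch. It only mimics the numbering described above and is not how pandas or R implement it. The helper names (encode_levels, r_style_codes, pandas_style_codes) are made up for illustration, None stands in for NA/NaN, and levels are assumed to be sorted alphabetically, which is the default in both systems for strings.

```python
# Pure-Python sketch of the two numbering conventions (not pandas or R internals).
# All function names here are hypothetical helpers, chosen for illustration.

def encode_levels(values):
    """Sorted unique non-missing values, like the default levels of a factor."""
    return sorted({v for v in values if v is not None})

def r_style_codes(values):
    """1-based codes; missing values stay missing (None stands in for NA)."""
    levels = encode_levels(values)
    index = {level: i + 1 for i, level in enumerate(levels)}
    return [index.get(v) for v in values]  # .get returns None for missing values

def pandas_style_codes(values):
    """0-based codes; missing values become -1."""
    return [-1 if code is None else code - 1 for code in r_style_codes(values)]

values = ["good", "bad", "good", None, "terrible"]
print(r_style_codes(values))       # [2, 1, 2, None, 3]
print(pandas_style_codes(values))  # [1, 0, 1, -1, 2]
```

So, to reproduce pandas-style codes from R's integer codes, you would subtract 1 and replace NA with -1.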

Edit: as.numeric(a) is better. There was some discussion about using the labels function inside as.numeric. See the warning in ?factor:

In particular, as.numeric applied to a factor is meaningless, and may happen by implicit coercion. To transform a factor f to approximately its original numeric values, as.numeric(levels(f))[f] is recommended and slightly more efficient than as.numeric(as.character(f)).

There are some anomalies associated with factors that have NA as a level. It is suggested to use them sparingly, e.g., only for tabulation purposes.

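If you have an NA value, it coerces all of the values to NA, which is why labels was used. Interestingly, c(a) works (see @42's answer below).

Perhaps it is clearer to do the following: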

# if you want numeric code for every value
a <- factor(c("good", "bad", "good", "bad", "terrible"))
as.integer(a)
# 2 1 2 1 3


# unique labels and the values for them
setNames(levels(a), seq_along(levels(a)))
#    1          2          3 
# "bad"     "good" "terrible"
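For comparison, the same kind of code-to-label lookup table can be sketched in plain Python. This is only an illustration under the assumption that levels are the sorted unique values; r_lookup and pandas_lookup are hypothetical names, with the first using R's 1-based numbering and the second the 0-based numbering that .cat.codes uses.

```python
# Build lookup tables from integer code to label (illustrative names only).
values = ["good", "bad", "good", "bad", "terrible"]
levels = sorted(set(values))  # assumed default ordering of the levels

r_lookup = dict(enumerate(levels, start=1))  # 1-based, as in R
pandas_lookup = dict(enumerate(levels))      # 0-based, as in .cat.codes

print(r_lookup)       # {1: 'bad', 2: 'good', 3: 'terrible'}
print(pandas_lookup)  # {0: 'bad', 1: 'good', 2: 'terrible'}
```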