What is the R equivalent of Python's .cat.codes, which converts a categorical variable to integer levels?
In Python, you can use .cat.codes to generate integer codes for a categorical variable, for example:
df['col3'] = df['col3'].astype('category').cat.codes
How do you do this in R?
To flesh this out further for @Sid29:
Python's .cat.codes extracts the numeric representation of the levels of a factor. The equivalent in R is:
a <- factor(c("good", "bad", "good", "bad", "terrible"))
as.numeric(a)
[1] 2 1 2 1 3
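Note that .cat.codes represents NA (or NaN, which is the same thing here) as -1, whereas the R solution above retains the NA values, so the output for those elements is simply NA.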
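To reproduce the pandas numbering in R, one option is to shift the codes and replace NA by hand. This is only a minimal sketch, assuming the goal is the pandas convention (codes start at 0 and missing values become -1); the variable names r_codes and pandas_like are illustrative.

```r
# Factor with one missing value
a <- factor(c("good", "bad", NA, "terrible"))

# R's integer codes are 1-based and keep NA as NA
r_codes <- as.integer(a)

# Shift to 0-based codes and map NA to -1, mimicking pandas' .cat.codes
pandas_like <- r_codes - 1L
pandas_like[is.na(pandas_like)] <- -1L

pandas_like
# 1 0 -1 2
```

Both R and pandas order string levels alphabetically by default, so the shifted codes should line up as long as the level order has not been customized.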
Edit: as.numeric(a) is better. On the discussion about using labels inside as.numeric, see the warning in ?factor:
In particular, as.numeric applied to a factor is meaningless, and may happen by implicit coercion. To transform a factor f to approximately its original numeric values, as.numeric(levels(f))[f] is recommended and slightly more efficient than as.numeric(as.character(f)).
There are some anomalies associated with factors that have NA as a level. It is suggested to use them sparingly, e.g., only for tabulation purposes.
If you have an NA value, it coerces all values to NA, which is the reason for using labels. Interestingly, c(a) works (see @42's answer below).
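The ?factor warning quoted above concerns factors whose labels look like numbers. A small sketch, using a made-up vector f, shows how the integer codes differ from the original values in that case:

```r
# A factor whose labels happen to be numeric strings (illustrative data)
f <- factor(c("10", "20", "10", "30"))

# Integer codes: positions in levels(f), not the original numbers
as.integer(f)
# 1 2 1 3

# Recover the original numeric values, as recommended in ?factor
as.numeric(levels(f))[f]
# 10 20 10 30
```

For the question here, the integer codes are exactly what is wanted, so as.integer() or as.numeric() on the factor is appropriate.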
Perhaps it is clearer to do the following:
# if you want numeric code for every value
a <- factor(c("good", "bad", "good", "bad", "terrible"))
as.integer(a)
# 2 1 2 1 3
# unique labels and the values for them
setNames(levels(a), seq_along(levels(a)))
# 1 2 3
# "bad" "good" "terrible"
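Applied to a data frame column, as in the question, the same idea might look like the sketch below. The data frame df and its contents are made-up stand-ins for the asker's data, and col3_code is a hypothetical column name.

```r
# Hypothetical data frame standing in for the one in the question
df <- data.frame(col3 = c("good", "bad", "good", "terrible"),
                 stringsAsFactors = FALSE)

# Convert to a factor, then take the 1-based integer codes
df$col3_code <- as.integer(factor(df$col3))

df$col3_code
# 2 1 2 3
```

Writing the codes to a new column keeps the original labels available; assigning back to df$col3 instead would mirror the Python line in the question.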