Why does coercing this factor variable from the documentation for factor variables return several NAs?

Question

The documentation for factor gives this code as its first example of constructing a factor variable:

(ff <- factor(substring("statistics", 1:10, 1:10), levels = letters))

The same documentation advises the following:

To transform a factor f to approximately its original numeric values, as.numeric(levels(f))[f] is recommended and slightly more efficient than as.numeric(as.character(f)).

But when I try this on their example, I get nonsense:

> (ff <- factor(substring("statistics", 1:10, 1:10), levels = letters))
 [1] s t a t i s t i c s
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
> ff
 [1] s t a t i s t i c s
Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
> as.numeric(levels(ff))[ff]
 [1] NA NA NA NA NA NA NA NA NA NA
Warning message:
NAs introduced by coercion 
> as.numeric(as.character(ff))
 [1] NA NA NA NA NA NA NA NA NA NA
Warning message:
NAs introduced by coercion

Where is my misunderstanding? I don't see anything unusual about the ff factor variable. It certainly has underlying numbers:

> as.integer(ff)
 [1] 19 20  1 20  9 19 20  9  3 19

Its levels are characters, but I don't see anything odd about that either - factor variables always have character levels.

Answer 1

Once you have created ff, run table(ff). It shows the frequency of every letter of the alphabet, including the letters that never occur, which get a frequency of 0.

Now levels(ff) returns all of these letters as character strings, so as.numeric(levels(ff)) will always return NA: a letter cannot be parsed as a number. The same is true of as.numeric(as.character(ff)).
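A minimal sketch of why the recommended idiom yields only NAs here, using nothing beyond base R. The names lev_num and res are just illustrative.

```r
# The levels of ff are the 26 lowercase letters; none of them parses as a number.
ff <- factor(substring("statistics", 1:10, 1:10), levels = letters)

# suppressWarnings() hides the "NAs introduced by coercion" warning.
lev_num <- suppressWarnings(as.numeric(levels(ff)))
all(is.na(lev_num))   # every level was coerced to NA

# Indexing by a factor uses its integer codes, so this picks
# ten entries out of a vector that contains nothing but NA.
res <- lev_num[ff]
length(res)
all(is.na(res))
```

The idiom itself works as documented; it simply has no numbers to recover when the levels were never numeric.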

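I suspect you may be confusing labels and levels. If you run labels(ff) you get the numbers 1 to 10 as quoted strings. These are just the positions of the ten elements, not the factor's integer codes. Converting them with as.numeric() gives a result of 10 numbers. Run: as.numeric(labels(ff))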
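To make the distinction concrete, here is a small sketch comparing the three things in play: labels, integer codes, and levels. It assumes ff has no names, which is the case for the documentation example; the variable names are illustrative only.

```r
ff <- factor(substring("statistics", 1:10, 1:10), levels = letters)

# labels(): for an unnamed vector, the positions "1".."10" as strings.
pos <- as.numeric(labels(ff))

# as.integer(): the internal codes, i.e. the index of each value in levels(ff).
codes <- as.integer(ff)

# Looking the codes up in the levels recovers the original letters.
back <- levels(ff)[codes]

pos
codes
back
```

Only the codes carry information about the data; the labels would be 1 to 10 for any unnamed factor of length 10.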

I hope this clears up your confusion. If not, let me know.

Output:

R>table(ff)
ff
a b c d e f g h i j k l m n o p q r 
1 0 1 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 
s t u v w x y z 
3 3 0 0 0 0 0 0 

R>levels(ff)
 [1] "a" "b" "c" "d" "e" "f" "g" "h"
 [9] "i" "j" "k" "l" "m" "n" "o" "p"
[17] "q" "r" "s" "t" "u" "v" "w" "x"
[25] "y" "z"

R>labels(ff)
 [1] "1"  "2"  "3"  "4"  "5"  "6" 
 [7] "7"  "8"  "9"  "10"

Edit:

OK, it seems the OP is having trouble with this passage in the documentation:

The interpretation of a factor depends on both the codes and the "levels" attribute. Be careful only to compare factors with the same set of levels (in the same order). In particular, as.numeric applied to a factor is meaningless, and may happen by implicit coercion. To transform a factor f to approximately its original numeric values, as.numeric(levels(f))[f] is recommended and slightly more efficient than as.numeric(as.character(f)).

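What the passage says is that if you have a factor whose values were originally numbers, you should not convert it to numeric directly. For example: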

nums <- c(1, 2, 3, 10)
new_fact <- as.factor(nums)

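Now, if we try to recover the numbers from new_fact by running as.numeric(new_fact), we get 1, 2, 3, 4. That is wrong: those are the internal integer codes, not the original values. All the documentation is saying is that to get back the original numbers you must use as.numeric(as.character(new_fact)) or as.numeric(levels(new_fact))[new_fact], both of which return 1 2 3 10. Hope this helps.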
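Put together as a runnable sketch (codes, via_char and via_levels are illustrative names, not anything from the documentation):

```r
nums <- c(1, 2, 3, 10)
new_fact <- as.factor(nums)   # levels are "1" "2" "3" "10"

# Direct coercion returns the internal codes, so 10 comes back as 4.
codes <- as.numeric(new_fact)

# Both documented idioms go through the level strings instead.
via_char   <- as.numeric(as.character(new_fact))
via_levels <- as.numeric(levels(new_fact))[new_fact]

codes        # 1 2 3 4
via_char     # 1 2 3 10
via_levels   # 1 2 3 10
```

The second idiom converts each distinct level only once and then indexes by the codes, which is presumably why the documentation calls it slightly more efficient.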

r

categorical-data