Can we get factor matrices in R?

Question

It seems that it is not possible to get matrices of factors in R. Is that true? If so, why? If not, how should I do it?

f <- factor(sample(letters[1:5], 20, rep=TRUE), letters[1:5])
m <- matrix(f,4,5)
is.factor(m) # fail.

m <- factor(m,letters[1:5])
is.factor(m) # oh, yes? 
is.matrix(m) # nope. fail. 

dim(f) <- c(4,5) # aha?
is.factor(f) # yes.. 
is.matrix(f) # yes!

# but then I get a strange behavior
cbind(f,f) # is not a factor anymore
head(f,2) # doesn't give the first 2 rows but the first 2 elements of f
# should I worry about it?

Answer 1

In this case, it may walk like a duck and even quack like a duck, but the f from:

f <- factor(sample(letters[1:5], 20, rep=TRUE), letters[1:5])
dim(f) <- c(4,5)

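really isn't a matrix, even though is.matrix() claims that, strictly speaking, it is one. As far as is.matrix() is concerned, to be a matrix f only needs to be a vector and have a dim attribute. By adding that attribute to f you pass the test. As you have seen, however, once you start using f as a matrix, it quickly loses the features that make it a factor (you end up working with the integer codes, or the dimensions get lost).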
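As a small illustration of that gap, the sketch below rebuilds f and compares what is.matrix() reports with what the class machinery sees. The seed is arbitrary and only there for reproducibility; it is not part of the original question.

```r
set.seed(1)  # arbitrary seed, assumed here only for reproducibility
f <- factor(sample(letters[1:5], 20, replace = TRUE), levels = letters[1:5])
dim(f) <- c(4, 5)

is.matrix(f)           # TRUE: only the dim attribute is checked
class(f)               # "factor": the class attribute is unchanged
inherits(f, "matrix")  # FALSE: S3 dispatch does not see a matrix
typeof(f)              # "integer": the underlying storage type
```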

There are really only matrices and arrays for the atomic vector types:

logical,
integer,
real,
complex,
string (or character), and
raw

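Plus, as @hadley reminds me, you can also have list matrices and arrays (by setting the dim attribute on a list object; see, for example, the Matrices & Arrays section of Hadley's book, Advanced R).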
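A minimal sketch of such a list matrix, using made-up elements purely for illustration:

```r
# A list with four elements of different types
l <- list(1L, "a", TRUE, NULL)
dim(l) <- c(2, 2)   # elements are laid out column by column

is.matrix(l)   # TRUE
typeof(l)      # "list"
l[[1, 2]]      # TRUE, the third element, now in row 1, column 2
```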

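Anything outside those types would be coerced to some lower type via as.vector(). This happens in matrix(f, nrow = 3) not because f is atomic according to is.atomic() (which returns TRUE for f because it is stored internally as an integer and typeof(f) returns "integer"), but because it has a class attribute. This sets the OBJECT bit on the internal representation of f, and anything that has a class is supposed to be coerced to one of the atomic types via as.vector():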

matrix <- function(data = NA, nrow = 1, ncol = 1, byrow = FALSE,
                   dimnames = NULL) {
    if (is.object(data) || !is.atomic(data)) 
        data <- as.vector(data)
....

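Adding dimensions via dim<-() is a quick way to create an array without duplicating the object, but it bypasses some of the checks and balances that R would perform if you coerced f to a matrix via the other methods: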

matrix(f, nrow = 3) # or
as.matrix(f)
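To make the coercion concrete, here is a short sketch on a small hand-made factor (not the f from the question) showing that as.vector() turns the factor into its character labels, and that matrix() therefore returns a character matrix:

```r
f <- factor(c("b", "a", "c", "a"), levels = c("a", "b", "c"))

is.object(f)              # TRUE: f has a class attribute
v <- as.vector(f)         # the labels, as a character vector
m <- matrix(f, nrow = 2)  # goes through as.vector() internally

typeof(m)      # "character"
is.factor(m)   # FALSE: class and levels are gone
```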

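This gets found out when you try to use basic functions that work on matrices, or when you rely on method dispatch. Note that after assigning dimensions to f, f is still of class "factor":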

> class(f)
[1] "factor"

which explains the head() behaviour; you are not getting the head.matrix behaviour because f is not a matrix, at least as far as the S3 mechanism is concerned:

> debug(head.matrix)
> head(f) # we don't enter the debugger
[1] d c a d b d
Levels: a b c d e
> undebug(head.matrix)

and the head.default method calls [, for which there is a factor method, hence the observed behaviour:

> debugonce(`[.factor`)
> head(f)
debugging in: `[.factor`(x, seq_len(n))
debug: {
    y <- NextMethod("[")
    attr(y, "contrasts") <- attr(x, "contrasts")
    attr(y, "levels") <- attr(x, "levels")
    class(y) <- oldClass(x)
    lev <- levels(x)
    if (drop) 
        factor(y, exclude = if (anyNA(levels(x))) 
            NULL
        else NA)
    else y
}
....

The cbind() behaviour can be explained from the documented behaviour (from ?cbind, emphasis mine):

The functions cbind and rbind are S3 generic, ...

....

In the default method, all the vectors/matrices must be atomic (see vector) or lists. Expressions are not allowed. Language objects (such as formulae and calls) and pairlists will be coerced to lists: other objects (such as names and external pointers) will be included as elements in a list result. Any classes the inputs might have are discarded (in particular, factors are replaced by their internal codes).

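Again, the fact that f is of class "factor" is defeating you, because the default cbind method will get called and it will strip the levels information and return the internal integer codes, as you observed.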
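The effect can be checked directly. This sketch uses a small hand-made factor with dimensions (an assumption for illustration, not the f from the question):

```r
f <- factor(c("b", "a", "c", "a"), levels = c("a", "b", "c"))
dim(f) <- c(2, 2)

cb <- cbind(f, f)

is.factor(cb)   # FALSE: the class has been discarded
typeof(cb)      # "integer": only the internal codes remain
levels(cb)      # NULL: the labels are gone
dim(cb)         # 2 4
```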

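In many respects, you have to ignore, or at least not fully trust, what the is.foo functions tell you, because they only use simple tests to say whether something is or is not a foo object. From a particular point of view, is.matrix() and is.atomic() are clearly wrong when it comes to f (with dimensions). They are also right in terms of their implementation, or at least their behaviour can be understood from the implementation; I think is.atomic(f) is not correct, but if by "if is of an atomic type" R Core mean "type" to be the thing returned by typeof(f), then is.atomic() is right. A stricter test is is.vector(), which f fails: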

> is.vector(f)
[1] FALSE

because it has attributes beyond a names attribute:

> attributes(f)
$levels
[1] "a" "b" "c" "d" "e"

$class
[1] "factor"

$dim
[1] 4 5
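A quick sketch of how is.vector() reacts to attributes, on toy objects chosen for illustration:

```r
x <- c(a = 1, b = 2)   # only a names attribute
y <- 1:6
dim(y) <- c(2, 3)      # a dim attribute
g <- factor("a")       # levels and class attributes

is.vector(x)   # TRUE: names are allowed
is.vector(y)   # FALSE: dim is not
is.vector(g)   # FALSE: neither are levels and class
```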

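As to how you should get a factor matrix, well, you can't, at least not if you want it to retain the factor information (the labels for the levels). One solution would be to use a character matrix, which would retain the labels: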

> fl <- levels(f)
> fm <- matrix(f, ncol = 5)
> fm
     [,1] [,2] [,3] [,4] [,5]
[1,] "c"  "a"  "a"  "c"  "b" 
[2,] "d"  "b"  "d"  "b"  "a" 
[3,] "e"  "e"  "e"  "c"  "e" 
[4,] "a"  "b"  "b"  "a"  "e"

and we store the levels of f for future use, in case we lose some elements of the matrix along the way.

Or work with the internal integer representation:

> (fm2 <- matrix(unclass(f), ncol = 5))
     [,1] [,2] [,3] [,4] [,5]
[1,]    3    1    1    3    2
[2,]    4    2    4    2    1
[3,]    5    5    5    3    5
[4,]    1    2    2    1    5

and you can always get back to the levels/labels again via:

> fm2[] <- fl[fm2]
> fm2
     [,1] [,2] [,3] [,4] [,5]
[1,] "c"  "a"  "a"  "c"  "b" 
[2,] "d"  "b"  "d"  "b"  "a" 
[3,] "e"  "e"  "e"  "c"  "e" 
[4,] "a"  "b"  "b"  "a"  "e"

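Using a data frame would not seem ideal for this, as each component of the data frame would be treated as a separate factor, whereas you seem to want to treat the array as a single factor with one set of levels.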

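If you really wanted to do what you describe, which is to have a factor matrix, you would most likely need to create your own S3 class to do this, plus all the methods to go with it. For example, you might store the factor matrix as a character matrix but with class "factorMatrix", where you store the levels alongside the factor matrix as an extra attribute, say. Then you would need to write [.factorMatrix, which would grab the levels, then use the default [ method on the matrix, and then add the levels attribute back on again. You could write cbind and head methods as well. The list of required methods would grow quickly, but a simple implementation may suit, and if you make your objects have class c("factorMatrix", "matrix") (i.e. inherit from the "matrix" class), you'll pick up all the properties/methods of the "matrix" class (which will drop the levels and other attributes), so you can at least work with the objects and see where you need to add new methods to fill out the behaviour of the class.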
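A minimal sketch of that idea follows. The constructor factorMatrix(), its lev argument, and the choice to return a plain factor when subsetting drops the dimensions are all assumptions made for illustration; only the class name and the general approach come from the description above.

```r
# Hypothetical constructor: a character matrix that carries its levels
factorMatrix <- function(x, nrow, ncol, lev = sort(unique(as.character(x)))) {
  m <- matrix(as.character(x), nrow = nrow, ncol = ncol)
  attr(m, "levels") <- lev
  class(m) <- c("factorMatrix", "matrix")
  m
}

# Subsetting: use the default `[` on the character data, then restore levels
`[.factorMatrix` <- function(x, ...) {
  lev <- attr(x, "levels")
  y <- NextMethod("[")          # drops the class and levels attributes
  if (is.matrix(y)) {
    attr(y, "levels") <- lev    # still a matrix: put levels and class back
    class(y) <- oldClass(x)
    y
  } else {
    factor(y, levels = lev)     # dimensions dropped: return an ordinary factor
  }
}

fm <- factorMatrix(c("a", "b", "c", "a", "b", "c"), nrow = 2, ncol = 3,
                   lev = c("a", "b", "c"))
sub  <- fm[, 1:2]   # a 2 x 2 factorMatrix that keeps all three levels
row1 <- fm[1, ]     # an ordinary factor with values a, c, b
```

Other generics such as cbind, head, or print would still need their own methods before the class behaves like a factor everywhere.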

Answer 2

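Unfortunately, factor support is not completely universal in R, so many R functions default to treating factors as their internal storage type, which is integer: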

> typeof(factor(letters[1:3]))
[1] "integer"

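This is what happens with matrix and cbind. They don't know how to handle factors, so they fall back on a simpler representation: cbind treats your factor as its integer codes, while matrix coerces it through as.vector(), which yields a character matrix. head is actually the opposite. It does know how to handle a factor, but it never bothers to check whether your factor is also a matrix, so it just treats it like a normal dimensionless factor vector.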

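Your best bet for operating on a matrix of factors is to coerce it to character. Once you are done with your operations, you can restore it to factor form. You could also do this using the integer form, but then you risk weird stuff (for example, you could do matrix multiplication on an integer matrix, but that makes no sense for factors).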
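A small sketch of that round trip, on a hand-made factor and with t() standing in for whatever matrix operation you need (both are assumptions for illustration):

```r
f <- factor(c("b", "a", "c", "a", "b", "c"), levels = c("a", "b", "c"))
lev <- levels(f)                          # keep the levels for later

m  <- matrix(as.character(f), nrow = 2)   # work on a character matrix
m2 <- t(m)                                # any matrix operation, here a transpose

f2 <- factor(m2, levels = lev)            # back to a factor; the dim is dropped
as.character(f2)                          # "b" "c" "b" "a" "a" "c"
```

Note that factor() returns a plain vector, so the dimensions have to be tracked separately if they are needed afterwards.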

Note that if you add class "matrix" to your factor, some (but not all) things start working:

f <- factor(letters[1:9])
dim(f) <- c(3, 3)
class(f) <- c("factor", "matrix")
head(f, 2)

produces:

     [,1] [,2] [,3]
[1,] a    d    g   
[2,] b    e    h   
Levels: a b c d e f g h i

This doesn't fix rbind, etc.
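For instance, the following sketch, which reuses the same construction, suggests that rbind still falls back to the integer codes, since base R provides no rbind method for either class:

```r
f <- factor(letters[1:9])
dim(f) <- c(3, 3)
class(f) <- c("factor", "matrix")

r <- rbind(f, f)

is.factor(r)   # FALSE: the default method discards the class
typeof(r)      # "integer": internal codes only
nrow(r)        # 6
```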

