Converting Summary Table of Binary Outcome to Long Tidy DataFrame

Question

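I want to convert a table that has several categorical variables and a summary of the results of a binary experiment into long format, so that I can easily run a logistic regression model.

Is there an easy way to do this that does not involve making a bunch of vectors with rep() and then combining them into a dataframe? Ideally I would like one function that does this automatically, but maybe I just need to write my own.

For example, if I start with this summary table: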

test   group    success  n 
A      control  1        2
A      treat    2        3
B      control  3        5
B      treat    1        3

I want to be able to switch it back to the following format:

test   group     success
A      control   1
A      control   0
A      treat     1
A      treat     1
A      treat     0
B      control   1
B      control   1
B      control   1
B      control   0
B      control   0
B      treat     1
B      treat     0
B      treat     0

Thanks!

Answer 1

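The reshape package is your friend here. In this case, melt() and untable() are useful for normalizing the data.

If the example summary data.frame is in a variable called df, an abbreviated answer is: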

# melt() and untable() come from the reshape package
library(reshape)

# replace total n with number of failures
df$fail = df$n - df$success
df$n = NULL

# melt and untable the data.frame
df = melt(df)
df = untable(df, df$value)

# recode the results, e.g., here by creating a new data.frame
df = data.frame(
  test = df$test, 
  group = df$group, 
  success = as.numeric(df$variable == "success")
)
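Since the question asks for a single function, the same steps could be wrapped up roughly as follows. This is only a sketch in base R rather than the reshape approach above; the name expand_binary and its arguments are made up for illustration, and it assumes every row satisfies 0 <= success <= n.

```r
# Hypothetical helper: expand a summary table to one row per trial.
# Assumes 0 <= success <= n in every row of the summary.
expand_binary <- function(df, success = "success", n = "n") {
  s <- df[[success]]        # successes per summary row
  f <- df[[n]] - s          # failures per summary row
  # Repeat each summary row once per trial, keeping only the category columns
  idx <- rep(seq_len(nrow(df)), times = s + f)
  out <- df[idx, setdiff(names(df), c(success, n)), drop = FALSE]
  # For each summary row: s ones followed by f zeros
  out[[success]] <- unlist(Map(function(a, b) rep(c(1, 0), times = c(a, b)), s, f))
  rownames(out) <- NULL
  out
}

summary_df <- data.frame(
  test    = c("A", "A", "B", "B"),
  group   = c("control", "treat", "control", "treat"),
  success = c(1, 2, 3, 1),
  n       = c(2, 3, 5, 3)
)

long <- expand_binary(summary_df)
```

Because the rows are expanded one summary row at a time, long comes out in the order shown in the question (successes first within each test/group combination).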

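This is a good example of a very common problem. The idea is to back-calculate the list of data that underlies a cross-tabulation. Given the cross-tabulation, the back-calculated list has one row per datum and contains the attributes of each datum. Here is a post on the inverse of this question.

In "data geek" parlance, this is a question of putting tabulated data into First Normal Form, if that is helpful to anyone. You can google data normalization, which will help you design agile data.frames that can be cross-tabulated and analyzed in many different ways.

In detail, for melt() and untable() to work here, the raw data need to be tweaked a bit to include fail (the number of failures) rather than the total n, but that is simple enough: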

df$fail <- df$n - df$success
df$n <- NULL

gives:

  test   group success fail
1    A control       1    1
2    A   treat       2    1
3    B control       3    2
4    B   treat       1    2

Now we can "melt" the table. melt() can back-calculate the original list of data that was used to create the cross-tabulation.

df <- melt(df)

In this case, we get a new column called variable that contains either "success" or "fail", and a column called value that contains the datum from the original success or fail column.

  test   group variable value
1    A control  success     1
2    A   treat  success     2
3    B control  success     3
4    B   treat  success     1
5    A control     fail     1
6    A   treat     fail     1
7    B control     fail     2
8    B   treat     fail     2
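To make the melt step concrete, here is a rough base R sketch that stacks the success and fail columns by hand. It is not how reshape implements melt(), just one way to arrive at the same long layout; the names measure and melted are illustrative.

```r
# Summary table after replacing n with fail
df <- data.frame(
  test    = c("A", "A", "B", "B"),
  group   = c("control", "treat", "control", "treat"),
  success = c(1, 2, 3, 1),
  fail    = c(1, 1, 2, 2)
)

measure <- c("success", "fail")  # columns to stack

melted <- data.frame(
  # repeat the id columns once per measured column
  df[rep(seq_len(nrow(df)), times = length(measure)), c("test", "group")],
  # label each block of rows with the column it came from
  variable = rep(measure, each = nrow(df)),
  # stack the counts: all success values, then all fail values
  value = unlist(df[measure], use.names = FALSE),
  row.names = NULL
)
```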

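The untable() function repeats each row of a table according to the value of a numeric "count" vector. In this case, df$value is the count vector, because it contains the number of successes and failures.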

df <- untable(df, df$value)

This will yield one record for each datum, either a "success" or a "fail":

    test   group variable value
1      A control  success     1
2      A   treat  success     2
2.1    A   treat  success     2
3      B control  success     3
3.1    B control  success     3
3.2    B control  success     3
4      B   treat  success     1
5      A control     fail     1
6      A   treat     fail     1
7      B control     fail     2
7.1    B control     fail     2
8      B   treat     fail     2
8.1    B   treat     fail     2
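If reshape is not available, the same row repetition can be sketched in base R by indexing the data.frame with a repeated row index. The data below are just the melted counts from above typed in by hand; long is an illustrative name.

```r
# Melted counts, as produced by the melt() step above
df <- data.frame(
  test     = rep(c("A", "A", "B", "B"), times = 2),
  group    = rep(c("control", "treat"), times = 4),
  variable = rep(c("success", "fail"), each = 4),
  value    = c(1, 2, 3, 1, 1, 1, 2, 2)
)

# Repeat row i df$value[i] times by indexing with a repeated row index
long <- df[rep(seq_len(nrow(df)), times = df$value), ]

nrow(long)            # 13, the total number of trials
table(long$variable)  # fail: 6, success: 7
```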

This is the solution. If required, the data can now be recoded to replace "success" with 1 and "fail" with 0 (and to get rid of the extraneous value and variable columns...)

  df <- data.frame(
    test = df$test, 
    group = df$group, 
    success = as.numeric(df$variable == "success")
  )

This returns the requested solution, though the rows are ordered differently:

   test   group success
1     A control       1
2     A   treat       1
3     A   treat       1
4     B control       1
5     B control       1
6     B control       1
7     B   treat       1
8     A control       0
9     A   treat       0
10    B control       0
11    B control       0
12    B   treat       0
13    B   treat       0

Obviously, the data.frame can be re-sorted if necessary; see "How to sort a data.frame in R".
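For instance, one way to get back to the row order shown in the question is base R's order(). This is a sketch that rebuilds the recoded result by hand so it runs on its own.

```r
# Recoded result, in the order produced by the steps above
df <- data.frame(
  test    = c("A", "A", "A", "B", "B", "B", "B", "A", "A", "B", "B", "B", "B"),
  group   = c("control", "treat", "treat", "control", "control", "control", "treat",
              "control", "treat", "control", "control", "treat", "treat"),
  success = c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
)

# Sort by test, then group, then successes before failures
df <- df[order(df$test, df$group, -df$success), ]
rownames(df) <- NULL  # renumber the rows after sorting
```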

r

data-manipulation

data-munging