Aggregating complex dataframe with R (for a beginner)

Question

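I am new to R, and I am trying to learn the best ways to aggregate some data in different ways. I have some programming experience, but I am not yet comfortable with R's syntax.

My data right now:

I have a large data frame containing measurements from a reading-time experiment, in a format similar to the made-up snippet below. Each row represents a single measure, along with descriptive information about it. Each participant takes up many rows in the data frame, and each of those rows represents a different experimental item: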

| Participant | Item | Type | Condition1 | Condition2 | rtMeasure | list    |
|-------------|------|------|------------|------------|-----------|---------|
| 10059       | 215  | Q    | FALSE      | TRUE       | 4215.591  | qiList2 |
| 10059       | 113  | F    | FALSE      | FALSE      | 3472.066  | qiList2 |
| 10059       | 9    | B    | FALSE      | FALSE      | 4201.406  | qiList2 |
| 10059       | 303  | W    | FALSE      | TRUE       | 3619.791  | qiList2 |
| 10060       | 215  | Q    | FALSE      | TRUE       | 4985.057  | qiList2 |
| 10060       | 113  | F    | FALSE      | FALSE      | 3247.489  | qiList2 |
| 10060       | 9    | C    | TRUE       | FALSE      | 2543.65   | qiList2 |
| 10060       | 303  | W    | FALSE      | TRUE       | 3194.199  | qiList2 |
| 10061       | 215  | Q    | FALSE      | TRUE       | 2885.469  | qiList2 |
| 10061       | 113  | F    | FALSE      | FALSE      | 5901.188  | qiList2 |
| 10061       | 9    | D    | FALSE      | TRUE       | 3326.375  | qiList2 |
| 10061       | 303  | W    | FALSE      | TRUE       | 3194.199  | qiList2 |
| 10062       | 215  | Q    | FALSE      | TRUE       | 2885.469  | qiList2 |
| 10062       | 113  | F    | FALSE      | FALSE      | 5901.188  | qiList2 |
| 10062       | 9    | A    | TRUE       | TRUE       | 3326.375  | qiList2 |
| 10062       | 303  | W    | FALSE      | TRUE       | 3194.199  | qiList2 |

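Here is a brief description of the columns:

Participant: a number identifying an individual subject.
Item: the item presented when the measure was recorded, i.e. the item number.
Type: a description of the sentence, sometimes redundant.
- Q, F, W: filler items; for these, the type is redundant with the item number.
- A, B, C, D: different versions of the experimentally manipulated items, i.e. a participant might see 11A and would therefore not see 11B, 11C, or 11D.
Condition1 & Condition2: redundant. A more explicit coding of the manipulation, which is also encoded in the Type column (e.g. Bs are -Condition1, -Condition2; Cs are +Condition1, -Condition2).
rtMeasure: the actual measurement (in this case, reading time in milliseconds).
List: redundant (maps Type to Participant). The list presented to the subject.

What I want to get (exploratory values):

For example, I want to find a given participant's mean rtMeasure for items of type A and B. I also want a given participant's overall mean rtMeasure. I would also like to see similar exploratory values for sentence types across participants.

Do I want to convert to a matrix?

It seems like these operations would be easier if I restructured my data frame into something like Participant by (Item + Type), along with its transposed version, i.e.: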

| Participant | rtMeasure(Item 1, Type A) | rtMeasure(Item 1, Type B) | ... | rtMeasure(Item 323, Type W) |
|-------------|---------------------------|---------------------------|-----|-----------------------------|
| 12345       | 3343.334                  | NA                        | ... | 2342.115                    |
| 12346       | NA                        | 3343.334                  | ... | 2145.23                     |
| 12346       | NA                        | NA                        | ... | 2511.12                     |

And transposed:

| Participant               | 12345  | 12346  | ... | 12400  |
|---------------------------|--------|--------|-----|--------|
| rtMeasure(Item 1, Type A) | 2341.2 | NA     | ... | 1903.6 |
| rtMeasure(Item 1, Type B) | NA     | 3012.4 | ... | NA     |

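The plyr package looks like it can do what I need, but I am not clear on how to attack it.

Would I use a function like this?

I can see the solution being a custom function somewhat like my attempt below, but I don't know how to translate it to R. I am most comfortable with JavaScript syntax, so I will approximate it in that, but assume I have an R data frame to work with.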

// pseudocode only: JavaScript-style syntax mixed with R's `$` column access
// assume data is the dataframe at the start of this post

var participants = valuesOf(data$Participant);
var matrix = [];

for (participantId of participants) {
  var participant = {};
  participant.id = participantId;
  // every row (measure) belonging to this participant
  for (measure of data[data$Participant === participantId]) {
    var measureLabel = measure.Item + ' ' + measure.Type;
    participant[measureLabel] = measure.rtMeasure;
  }
  matrix.push(participant);
}

After the above code executes, I would expect matrix to be an array of participant objects, where the properties are the measure values, labeled "Item Type".

Answer 1

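Following Frank's suggestion, I attempted to create an MCVE. As he hinted might happen, I found the answer I was looking for by forcing myself to actually read through the somewhat intimidating tutorial for the plyr package: The Split-Apply-Combine Strategy for Data Analysis.

I also found Summarizing data at http://www.cookbook-r.com/ helpful.

Basically, I figured out how to use ddply, the plyr function for aggregating a data frame into a different data frame.

In my original question, I asked how to look at:

- the mean rtMeasure for a given participant
- the mean rtMeasure on type A and B items for a given participant
- similar exploratory values for sentence types across participants

I'll outline how I did each of these, in case anyone else finds it useful.

First, load some made-up data: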

> library(plyr)
> df <- read.csv('df.csv')
> df
   participants items types condition1 condition2 rtMeasures
1          1001   101     F      FALSE       TRUE   3852.823
2          1001   213     Q       TRUE       TRUE   2499.445
3          1001     1     C      FALSE      FALSE   2811.198
4          1001   312     W       TRUE       TRUE   2200.470
5          1001   113     F       TRUE      FALSE   2419.663
6          1002   101     F      FALSE       TRUE   1833.647
7          1002   213     Q       TRUE       TRUE   2381.160
8          1002     1     B      FALSE      FALSE   2415.385
9          1002   312     W       TRUE       TRUE   2788.386
10         1002   113     F       TRUE      FALSE   2665.298
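
The file df.csv is not included here. As a minimal sketch, the same made-up data frame can be built inline so the examples can be reproduced; the values are copied from the printout above, and the use of rep() for the repeating columns is just a convenience.

```r
# Rebuild the made-up data shown above without needing df.csv.
df <- data.frame(
  participants = rep(c(1001, 1002), each = 5),
  items        = rep(c(101, 213, 1, 312, 113), times = 2),
  types        = c("F", "Q", "C", "W", "F", "F", "Q", "B", "W", "F"),
  condition1   = rep(c(FALSE, TRUE, FALSE, TRUE, TRUE), times = 2),
  condition2   = rep(c(TRUE, TRUE, FALSE, TRUE, FALSE), times = 2),
  rtMeasures   = c(3852.823, 2499.445, 2811.198, 2200.470, 2419.663,
                   1833.647, 2381.160, 2415.385, 2788.386, 2665.298)
)

str(df)  # 10 observations of 6 variables
```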

The first one is easy.

Use ddply to get the mean rtMeasure for each participant:

> ddply(df, .(participants), summarize, mean=mean(rtMeasures), N=length(participants));
  participants     mean N
1         1001 2756.720 5
2         1002 2416.775 5
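
The same per-participant summary can also be obtained without plyr. This is a minimal base R sketch using aggregate(), not the approach I actually used; only the two columns it needs are rebuilt so that the snippet runs on its own, and the names means and counts are made up for illustration.

```r
# Only the columns needed for this summary, copied from the made-up data.
df <- data.frame(
  participants = rep(c(1001, 1002), each = 5),
  rtMeasures   = c(3852.823, 2499.445, 2811.198, 2200.470, 2419.663,
                   1833.647, 2381.160, 2415.385, 2788.386, 2665.298)
)

# Formula interface: summarise rtMeasures within each level of participants.
means  <- aggregate(rtMeasures ~ participants, data = df, FUN = mean)
counts <- aggregate(rtMeasures ~ participants, data = df, FUN = length)

# Combine the mean and the number of rows per participant in one data frame.
result <- data.frame(participants = means$participants,
                     mean = means$rtMeasures,
                     N = counts$rtMeasures)
print(result)
```

The means should match the ddply output above (about 2756.720 and 2416.775, with N = 5 for each participant).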

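The second one is a bit trickier. There may be a better way, but this works as a quick and dirty solution.

Use ddply to get the mean rtMeasure for each group of types, per participant: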

> ddply(df, .(participants, "is type Q or W"=(types %in% c('Q', 'W'))), summarize, mean=mean(rtMeasures), N=length(participants));
  participants is type Q or W     mean N
1         1001          FALSE 3027.895 3
2         1001           TRUE 2349.958 2
3         1002          FALSE 2304.777 3
4         1002           TRUE 2584.773 2

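To be clear, I split the data on whether the measure's "type" is Q or W. So, for my example, rows where the "is type Q or W" column is FALSE show the participant's means for measures of type A, B, C, D, or F; rows where the column is TRUE show the means for measures of type Q or W. In my real data, these "types" are already binary coded, so it shouldn't be as messy.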
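
One way to make the grouping less messy is to store the test as an ordinary column first and then group on it. The following is a base R sketch with aggregate() rather than ddply; the column name isQW and the object names are hypothetical, chosen only for this example.

```r
# Columns needed for this summary, copied from the made-up data.
df <- data.frame(
  participants = rep(c(1001, 1002), each = 5),
  types        = c("F", "Q", "C", "W", "F", "F", "Q", "B", "W", "F"),
  rtMeasures   = c(3852.823, 2499.445, 2811.198, 2200.470, 2419.663,
                   1833.647, 2381.160, 2415.385, 2788.386, 2665.298)
)

# Precompute the grouping variable: TRUE for type Q or W, FALSE otherwise.
df$isQW <- df$types %in% c("Q", "W")

# Mean and row count for each participant within each group.
byGroup <- aggregate(rtMeasures ~ participants + isQW, data = df, FUN = mean)
nGroup  <- aggregate(rtMeasures ~ participants + isQW, data = df, FUN = length)
byGroup$N <- nGroup$rtMeasures

print(byGroup)
```

The row order may differ from the ddply output, but the four means and counts should be the same.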

Grouping by items, or condition1, or any other descriptor in the data frame is just as easy.

> ddply(df, .(items, types), summarize, mean=mean(rtMeasures), N=length(participants));
  items types     mean N
1     1     B 2415.385 1
2     1     C 2811.198 1
3   101     F 2843.235 2
4   113     F 2542.481 2
5   213     Q 2440.302 2
6   312     W 2494.428 2

Getting fancier...

> ddply(df, .(Context=(condition1==FALSE & condition2==FALSE)), summarize, mean=mean(rtMeasures), N=length(participants));
  Context     mean N
1   FALSE 2580.112 8
2    TRUE 2613.291 2
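
For the wide Participant by (Item + Type) layout asked about in the question, one possible base R sketch uses tapply(). It is not needed for the summaries above and is only an illustration; the names label, wide, and transposed are made up for this example.

```r
# Columns needed for the reshaping, copied from the made-up data.
df <- data.frame(
  participants = rep(c(1001, 1002), each = 5),
  items        = rep(c(101, 213, 1, 312, 113), times = 2),
  types        = c("F", "Q", "C", "W", "F", "F", "Q", "B", "W", "F"),
  rtMeasures   = c(3852.823, 2499.445, 2811.198, 2200.470, 2419.663,
                   1833.647, 2381.160, 2415.385, 2788.386, 2665.298)
)

# One label per measure, in the "Item Type" form described in the question.
label <- paste(df$items, df$types)

# Rows are participants, columns are "Item Type" labels.
# Combinations a participant never saw come out as NA.
wide <- tapply(df$rtMeasures, list(df$participants, label), mean)

# The transposed version: one row per "Item Type", one column per participant.
transposed <- t(wide)

print(wide)
```

Cells can then be looked up by name, for example wide["1001", "1 C"], and is.na(wide["1001", "1 B"]) is TRUE because participant 1001 saw version C of item 1 rather than version B.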
