How to use a lookup table in R without creating duplicates?

Question

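I am wondering whether anyone has a good way to achieve this. I have a data frame in which every observation (= item) belongs to a particular group (= condition) and has a given value: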

# Create sample data.
item       = rep(1:3,2)                               #6 items
condition  = c(rep("control",3), rep("related",3))    #2 conditions
value      = c(10,11,12,20,21,22)                     #6 values          
df         = data.frame(item, condition, value)

  item condition value
1    1   control    10
2    2   control    11
3    3   control    12
4    1   related    20
5    2   related    21
6    3   related    22

I also have a lookup table containing the mean for each group:

# Create lookup table.
condition  = c("control", "related")
mean       = c(11,21)
table      = data.frame(condition, mean)

  condition mean
1   control   11
2   related   21

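I want to modify my original data frame so that it contains a new column, label, which says "low" if the item's value is below the group mean and "high" otherwise. It should look like this: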

# What the output should look like.
# If the item value is less than the group mean, write "low". Write "high" otherwise.
item       = rep(1:3,2)                               
condition  = c(rep("control",3), rep("related",3))    
value      = c(10,11,12,20,21,22)                      
label      = c(rep(c("low", "high", "high"),2))
output     = data.frame(item, condition, value, label)

  item condition value label
1    1   control    10   low
2    2   control    11  high
3    3   control    12  high
4    1   related    20   low
5    2   related    21  high
6    3   related    22  high

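If it were just a matter of copying the group means into my original data frame, I would use merge. But what I need is to take the group mean into account and write a new label for each item that says "low" or "high", depending on the group mean.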

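One thing I tried was to first merge my data frame with table and then use ifelse to compare the value column with the mean column. That works, but I end up with a mean column in my data frame that I don't need (I only need the label column). Of course I could remove the mean column manually, but that seems clunky. So I was wondering: does anyone know a better or more elegant solution?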

Thanks a lot!

Answer 1

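Here are some alternatives. (1) and (2) use only base R, and (2), (3) and (5) do not create a mean column only to delete it again explicitly. In (1), (3) and (4) we use left joins, although inner joins would give the same result with this data; (1a) shows that with an inner join (1) can be written in one line.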

1) merge

m <- merge(df, table, all.x = TRUE)
transform(m, label = ifelse(value < mean, "low", "high"), mean = NULL)

giving the following (note that merge places the key column, condition, first):

  condition item value label
1   control    1    10   low
2   control    2    11  high
3   control    3    12  high
4   related    1    20   low
5   related    2    21  high
6   related    3    22  high

1a) Using an inner join, this can be shortened to:

transform(merge(df, table), label = ifelse(value < mean, "low", "high"), mean = NULL)

2) match

transform(df, 
  label = ifelse(value < table$mean[match(condition, table$condition)], "low", "high")
)

giving the same labels, with the columns in the order of df (item, condition, value, label).
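A closely related base R idiom is to turn the lookup table into a named vector and index it by name. The following is a minimal sketch rather than part of the original answer; the names group_mean and means_per_row are hypothetical and introduced only for illustration.

```r
# Rebuild the sample data from the question so the block runs on its own.
df <- data.frame(item      = rep(1:3, 2),
                 condition = c(rep("control", 3), rep("related", 3)),
                 value     = c(10, 11, 12, 20, 21, 22))
table <- data.frame(condition = c("control", "related"),
                    mean      = c(11, 21))

# Named vector: names are the conditions, values are the group means.
# (group_mean is a hypothetical name, not from the original post.)
group_mean <- setNames(table$mean, as.character(table$condition))

# Look up each row's group mean by name. as.character() guards against
# factor columns, which would otherwise index by integer code; unname()
# keeps the lookup names from leaking into the new column.
means_per_row <- unname(group_mean[as.character(df$condition)])

df$label <- ifelse(df$value < means_per_row, "low", "high")
df
```

Like (2), this never adds a mean column to df, so there is nothing to remove afterwards.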

3) sqldf

library(sqldf)
sqldf("select 
         df.*, 
         case when value < mean 
              then 'low' 
              else 'high' 
              end label
       from df 
       left join 'table' using (condition)")

4) dplyr

library(dplyr)
df %>%
   left_join(table, by = "condition") %>%   # explicit key avoids the "Joining, by" message
   mutate(label = ifelse(value < mean, "low", "high")) %>%
   select(-mean)

5) data.table

library(data.table)
dt <- as.data.table(df)
setkey(dt, "condition")
dt[table, label := ifelse(value < mean, "low", "high")]
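Because := updates dt by reference, the line above prints nothing, and setkey reorders the rows by condition. As a hedged variant (not from the original answer), the sketch below uses an explicit on = join instead of a key, which leaves the row order untouched, and refers to the lookup column as i.mean to make clear it comes from the lookup table. The name lookup_dt is hypothetical.

```r
library(data.table)

# Rebuild the sample data from the question so the block runs on its own.
df <- data.frame(item      = rep(1:3, 2),
                 condition = c(rep("control", 3), rep("related", 3)),
                 value     = c(10, 11, 12, 20, 21, 22))
table <- data.frame(condition = c("control", "related"),
                    mean      = c(11, 21))

dt        <- as.data.table(df)
lookup_dt <- as.data.table(table)   # hypothetical name for the lookup table

# Update join: for each row of dt matched on condition, compare value with
# the mean column of the lookup table (i.mean) and add label by reference.
dt[lookup_dt, label := ifelse(value < i.mean, "low", "high"), on = "condition"]

dt[]   # := returns invisibly, so print the updated table explicitly
```

No mean column is ever added to dt, so there is nothing to drop afterwards.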
