How to calculate a new column based on other columns using a lookup approach in R?

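I am trying to compute a new column in a data frame based on its other columns and a lookup table. Below is a simple example with only a few rows (my real dataset contains millions of rows).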

I have the following lookup table:

  lookup<- data.frame("class"=c(1, 2, 1, 2), "type"=c("A", "B", "B", "A"), 
           "condition1"=c(50, 60, 55, 53), "condition2"=c(80, 85, 86, 83))

  lookup
  class type condition1 condition2
      1    A         50         80
      2    B         60         85
      1    B         55         86
      2    A         53         83

My data frame looks like this:

  data<- data.frame("class"=c(1, 2, 2, 1, 2, 1), 
         "type"=c("A","B", "A", "A", "B", "B"), 
         "percentage_condition1"=c(0.3, 0.6, 0.1, 0.2, 0.4, 0.5), 
         "percentage_condition2"=c(0.7, 0.4, 0.9, 0.8, 0.6, 0.5))


  data
  class type percentage_condition1 percentage_condition2
    1    A                   0.3                   0.7
    2    B                   0.6                   0.4
    2    A                   0.1                   0.9
    1    A                   0.2                   0.8
    2    B                   0.4                   0.6
    1    B                   0.5                   0.5

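I want to create a new column in my data frame, data, that uses the lookup table.

For each row of data, the class and type columns should be matched against the same columns in lookup, and the new column should be computed from the matched row, for example (not real code):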

data$new <- lookup$condition1 * data$percentage_condition1 + lookup$condition2 * data$percentage_condition2

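I know how to do this with if/else statements, but I am trying to do it more efficiently because I am working with a large amount of data. I also know how to do this when the lookup table is keyed on a single column, but I have not managed to make it work with multiple key columns (the class and type columns).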

Thanks for your help and suggestions!

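We can use match on the pasted 'class' and 'type' columns of 'data' and 'lookup' to get an index, use that index to pull the corresponding rows of the 'condition1' and 'condition2' columns, multiply them by the percentage columns of 'data', and take the rowSums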

data$new <- rowSums(lookup[match(paste(data$class, data$type), 
                  paste(lookup$class, lookup$type)), 
               c("condition1", "condition2")] * data[3:4])

data
#  class type percentage_condition1 percentage_condition2  new
#1     1    A                   0.3                   0.7 71.0
#2     2    B                   0.6                   0.4 70.0
#3     2    A                   0.1                   0.9 80.0
#4     1    A                   0.2                   0.8 74.0
#5     2    B                   0.4                   0.6 75.0
#6     1    B                   0.5                   0.5 70.5
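
To make the match step easier to follow, here is a minimal sketch that breaks the one-liner above into its intermediate pieces. The names data_key, lookup_key, idx, matched and new_col are only illustrative and are not part of the original answer.

```r
lookup <- data.frame(class = c(1, 2, 1, 2), type = c("A", "B", "B", "A"),
                     condition1 = c(50, 60, 55, 53),
                     condition2 = c(80, 85, 86, 83))
data <- data.frame(class = c(1, 2, 2, 1, 2, 1),
                   type = c("A", "B", "A", "A", "B", "B"),
                   percentage_condition1 = c(0.3, 0.6, 0.1, 0.2, 0.4, 0.5),
                   percentage_condition2 = c(0.7, 0.4, 0.9, 0.8, 0.6, 0.5))

# Build one composite key per row from both key columns, e.g. "1 A"
data_key   <- paste(data$class, data$type)
lookup_key <- paste(lookup$class, lookup$type)

# For every row of data, the position of the matching row in lookup
idx <- match(data_key, lookup_key)
idx
# 1 2 4 1 2 3

# Expand lookup to one row per row of data, then do the weighted sum
matched <- lookup[idx, c("condition1", "condition2")]
new_col <- rowSums(matched * data[c("percentage_condition1",
                                    "percentage_condition2")])
```

Because idx has one entry per row of data, the expanded lookup lines up row by row with the percentage columns, which is why a plain element-wise multiplication is enough.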

Note: with match, this can be done quite simply


Or using data.table

library(data.table)
setDT(data)[lookup, new := condition1 * percentage_condition1 + 
       condition2 * percentage_condition2, on = .(class, type)]
data
#   class type percentage_condition1 percentage_condition2  new
#1:     1    A                   0.3                   0.7 71.0
#2:     2    B                   0.6                   0.4 70.0
#3:     2    A                   0.1                   0.9 80.0
#4:     1    A                   0.2                   0.8 74.0
#5:     2    B                   0.4                   0.6 75.0
#6:     1    B                   0.5                   0.5 70.5

Or using tidyverse

library(tidyverse)
data %>% 
     left_join(lookup, by = c("class", "type")) %>%
     mutate(new = condition1 * percentage_condition1 + 
       condition2 * percentage_condition2) %>%
     select(names(data), new)
#   class type percentage_condition1 percentage_condition2  new
#1     1    A                   0.3                   0.7 71.0
#2     2    B                   0.6                   0.4 70.0
#3     2    A                   0.1                   0.9 80.0
#4     1    A                   0.2                   0.8 74.0
#5     2    B                   0.4                   0.6 75.0
#6     1    B                   0.5                   0.5 70.5

Or using an SQL-based solution with sqldf

library(sqldf)
str1 <- "SELECT data.class, data.type, data.percentage_condition1, 
  data.percentage_condition2, (data.percentage_condition1 * lookup.condition1 + 
   data.percentage_condition2 * lookup.condition2) as new
   FROM data 
   LEFT JOIN lookup on data.class = lookup.class AND 
   data.type = lookup.type"
sqldf(str1)

Or, as @G.Grothendieck mentioned in the comments, the sqldf solution can be made more compact by using alias identifiers

sqldf("select D.*, L.condition1 * D.[percentage_condition1] + 
       L.condition2 * D.[percentage_condition2] as new 
       from data as D 
       left join lookup as L 
       using(class, type)")

Note: all of the solutions above keep the original row order of the dataset
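
All of these approaches quietly assume that each (class, type) pair appears only once in lookup and that every pair in data has a match. With duplicated keys, match uses the first matching row, while the join-based versions return one row per matching pair. Here is a small sanity check one might run first. It is only a sketch, and key_cols, n_dup and unmatched are illustrative names.

```r
lookup <- data.frame(class = c(1, 2, 1, 2), type = c("A", "B", "B", "A"),
                     condition1 = c(50, 60, 55, 53),
                     condition2 = c(80, 85, 86, 83))
data <- data.frame(class = c(1, 2, 2, 1, 2, 1),
                   type = c("A", "B", "A", "A", "B", "B"),
                   percentage_condition1 = c(0.3, 0.6, 0.1, 0.2, 0.4, 0.5),
                   percentage_condition2 = c(0.7, 0.4, 0.9, 0.8, 0.6, 0.5))

key_cols <- c("class", "type")

# 0 means no (class, type) combination is repeated in lookup
n_dup <- anyDuplicated(lookup[key_cols])

# Row numbers of data whose key has no counterpart in lookup
unmatched <- which(is.na(match(paste(data$class, data$type),
                               paste(lookup$class, lookup$type))))
```

For the example data, n_dup is 0 and unmatched is empty, so every approach above yields exactly one result per row of data.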

One option is to merge data and lookup, and then perform the calculation

df1 <- merge(data, lookup) #This merges by class and type columns

df1$new <- with(df1, (condition1 * percentage_condition1) + 
                     (condition2 * percentage_condition2))


df1
#  class type percentage_condition1 percentage_condition2 condition1 condition2  new
#1     1    A                   0.3                   0.7         50         80 71.0
#2     1    A                   0.2                   0.8         50         80 74.0
#3     1    B                   0.5                   0.5         55         86 70.5
#4     2    A                   0.1                   0.9         53         83 80.0
#5     2    B                   0.6                   0.4         60         85 70.0
#6     2    B                   0.4                   0.6         60         85 75.0
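
Note that merge sorts the result by the key columns by default, so the rows above are no longer in the order of data. If the original order matters, one possible workaround is to carry a row index through the merge and sort on it afterwards. The sketch below assumes a helper column called row_id, which is a hypothetical name and not part of the original answer.

```r
lookup <- data.frame(class = c(1, 2, 1, 2), type = c("A", "B", "B", "A"),
                     condition1 = c(50, 60, 55, 53),
                     condition2 = c(80, 85, 86, 83))
data <- data.frame(class = c(1, 2, 2, 1, 2, 1),
                   type = c("A", "B", "A", "A", "B", "B"),
                   percentage_condition1 = c(0.3, 0.6, 0.1, 0.2, 0.4, 0.5),
                   percentage_condition2 = c(0.7, 0.4, 0.9, 0.8, 0.6, 0.5))

# Remember the original position of every row (row_id is a helper column)
data$row_id <- seq_len(nrow(data))

# Join on both key columns, then restore the original order
df1 <- merge(data, lookup, by = c("class", "type"))
df1 <- df1[order(df1$row_id), ]

df1$new <- with(df1, condition1 * percentage_condition1 +
                     condition2 * percentage_condition2)
df1$new
# 71.0 70.0 80.0 74.0 75.0 70.5
```

By default merge also drops rows of data that have no match in lookup. Passing all.x = TRUE keeps them, with NA in the lookup columns.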