How to calculate a new column based on other columns using a lookup approach in R?
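I am trying to calculate a new column in my data frame based on other columns and a lookup table. Below is a simple example with only a small amount of data (my real dataset contains millions of rows).
I have the following lookup table: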
lookup<- data.frame("class"=c(1, 2, 1, 2), "type"=c("A", "B", "B", "A"),
"condition1"=c(50, 60, 55, 53), "condition2"=c(80, 85, 86, 83))
lookup
class type condition1 condition2
1 A 50 80
2 B 60 85
1 B 55 86
2 A 53 83
My data frame has this shape:
data<- data.frame("class"=c(1, 2, 2, 1, 2, 1),
"type"=c("A","B", "A", "A", "B", "B"),
"percentage_condition1"=c(0.3, 0.6, 0.1, 0.2, 0.4, 0.5),
"percentage_condition2"=c(0.7, 0.4, 0.9, 0.8, 0.6, 0.5))
data
class type percentage_condition1 percentage_condition2
1 A 0.3 0.7
2 B 0.6 0.4
2 A 0.1 0.9
1 A 0.2 0.8
2 B 0.4 0.6
1 B 0.5 0.5
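I want to create a new column in my data frame (called data) that uses the lookup table. For example:
where the class and type columns in data match those in lookup, it would compute a new column in data, e.g. (not real code):
data$new<- lookup$condition1 * data$percentage_condition1 + lookup$condition2 * data$percentage_condition2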
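I know how to do this with if else statements, but I am trying to do it more efficiently because I am working with a large amount of data. I know how to do this with a single column in the lookup table, but I have not managed to do it with multiple columns (the class and type columns).
Thanks for your help and suggestions!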
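We can use match to get, for each combined 'class'/'type' pair in 'data', the index of the matching row in 'lookup', use that index to get the corresponding rows of the 'condition1' and 'condition2' columns, multiply them by the percentage columns of 'data', and take the rowSums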
data$new <- rowSums(lookup[match(paste(data$class, data$type),
paste(lookup$class, lookup$type)),
c("condition1", "condition2")] * data[3:4])
data
# class type percentage_condition1 percentage_condition2 new
#1 1 A 0.3 0.7 71.0
#2 2 B 0.6 0.4 70.0
#3 2 A 0.1 0.9 80.0
#4 1 A 0.2 0.8 74.0
#5 2 B 0.4 0.6 75.0
#6 1 B 0.5 0.5 70.5
Note: with match, we can do this more easily
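To make the composite-key idea more concrete, here is a minimal base R sketch that breaks the one-liner into steps. The intermediate names (`key_data`, `key_lookup`, `idx`, `new_vals`) and the explicit `sep = "\r"` separator are illustrative choices, not part of the answer above; the separator is only a guard against accidental key collisions if the pasted values could themselves contain spaces.

```r
# Same toy data as in the question
lookup <- data.frame(class = c(1, 2, 1, 2), type = c("A", "B", "B", "A"),
                     condition1 = c(50, 60, 55, 53),
                     condition2 = c(80, 85, 86, 83))
data <- data.frame(class = c(1, 2, 2, 1, 2, 1),
                   type = c("A", "B", "A", "A", "B", "B"),
                   percentage_condition1 = c(0.3, 0.6, 0.1, 0.2, 0.4, 0.5),
                   percentage_condition2 = c(0.7, 0.4, 0.9, 0.8, 0.6, 0.5))

# Build one key per row from the two lookup columns
key_data   <- paste(data$class, data$type, sep = "\r")
key_lookup <- paste(lookup$class, lookup$type, sep = "\r")

# For each row of data, the row number of the matching lookup entry
idx <- match(key_data, key_lookup)

# A (class, type) pair missing from lookup would show up as NA in idx
stopifnot(!anyNA(idx))

# Weighted sum using the looked-up conditions
new_vals <- lookup$condition1[idx] * data$percentage_condition1 +
            lookup$condition2[idx] * data$percentage_condition2
```

Because `match` returns one index per row of `data`, the result is already in the order of `data` and can be assigned directly with `data$new <- new_vals`.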
Or using data.table
library(data.table)
setDT(data)[lookup, new := condition1 * percentage_condition1 +
condition2 * percentage_condition2, on = .(class, type)]
data
# class type percentage_condition1 percentage_condition2 new
#1: 1 A 0.3 0.7 71.0
#2: 2 B 0.6 0.4 70.0
#3: 2 A 0.1 0.9 80.0
#4: 1 A 0.2 0.8 74.0
#5: 2 B 0.4 0.6 75.0
#6: 1 B 0.5 0.5 70.5
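In the join above, `condition1` and `condition2` resolve to the columns of `lookup` because `data` has no columns with those names. If the two tables could share non-key column names, data.table's `i.` prefix makes the source explicit. A sketch under that assumption (`dt` and `lk` are illustrative names, not from the answer):

```r
library(data.table)

lk <- data.table(class = c(1, 2, 1, 2), type = c("A", "B", "B", "A"),
                 condition1 = c(50, 60, 55, 53),
                 condition2 = c(80, 85, 86, 83))
dt <- data.table(class = c(1, 2, 2, 1, 2, 1),
                 type = c("A", "B", "A", "A", "B", "B"),
                 percentage_condition1 = c(0.3, 0.6, 0.1, 0.2, 0.4, 0.5),
                 percentage_condition2 = c(0.7, 0.4, 0.9, 0.8, 0.6, 0.5))

# Update join: rows of dt that match a row of lk on (class, type)
# get a new column, added by reference (dt is modified in place).
# The i. prefix refers to columns of the table in the i position (lk).
dt[lk, new := i.condition1 * percentage_condition1 +
               i.condition2 * percentage_condition2,
   on = .(class, type)]
```

Rows of `dt` with no match in `lk` would be left as `NA` in `new`, and the row order of `dt` is unchanged.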
Or using tidyverse
library(tidyverse)
data %>%
left_join(lookup, by = c("class", "type")) %>%
mutate(new = condition1 * percentage_condition1 +
condition2 * percentage_condition2) %>%
select(names(data), new)
# class type percentage_condition1 percentage_condition2 new
#1 1 A 0.3 0.7 71.0
#2 2 B 0.6 0.4 70.0
#3 2 A 0.1 0.9 80.0
#4 1 A 0.2 0.8 74.0
#5 2 B 0.4 0.6 75.0
#6 1 B 0.5 0.5 70.5
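One thing that may be worth checking before running a join such as left_join on millions of rows: if `lookup` contained the same (class, type) pair more than once, the join would duplicate the matching rows of `data`. A small base R check, assuming the toy `lookup` from the question (`n_dup` is an illustrative name):

```r
lookup <- data.frame(class = c(1, 2, 1, 2), type = c("A", "B", "B", "A"),
                     condition1 = c(50, 60, 55, 53),
                     condition2 = c(80, 85, 86, 83))

# anyDuplicated() returns 0 when every (class, type) pair is unique,
# otherwise the index of the first duplicated row
n_dup <- anyDuplicated(lookup[c("class", "type")])
if (n_dup != 0) {
  warning("lookup has duplicated (class, type) keys; a join would add rows")
}
```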
Or using an SQL-based solution with sqldf
library(sqldf)
str1 <- "SELECT data.class, data.type, data.percentage_condition1,
data.percentage_condition2, (data.percentage_condition1 * lookup.condition1 +
data.percentage_condition2 * lookup.condition2) as new
FROM data
LEFT JOIN lookup on data.class = lookup.class AND
data.type = lookup.type"
sqldf(str1)
Or, as @G.Grothendieck mentioned in the comments, by using alias identifiers the sqldf solution can be made more compact
sqldf("select D.*, L.condition1 * D.[percentage_condition1] +
L.condition2 * D.[percentage_condition2] as new
from data as D
left join lookup as L
using(class, type)")
Note: all of the solutions above keep the original order of the dataset
One option is to merge data and lookup and then perform the calculation
df1 <- merge(data, lookup) #This merges by class and type columns
df1$new <- with(df1, (condition1 * percentage_condition1) +
(condition2 * percentage_condition2))
df1
# class type percentage_condition1 percentage_condition2 condition1 condition2 new
#1 1 A 0.3 0.7 50 80 71.0
#2 1 A 0.2 0.8 50 80 74.0
#3 1 B 0.5 0.5 55 86 70.5
#4 2 A 0.1 0.9 53 83 80.0
#5 2 B 0.6 0.4 60 85 70.0
#6 2 B 0.4 0.6 60 85 75.0
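Note that `merge` sorts the result on the key columns by default, so the rows above are no longer in the order of `data`. If the original order matters, one possible workaround is to carry a row index through the merge and sort on it afterwards. The `row_id` column and the `data_id` copy are hypothetical helpers introduced only for this sketch:

```r
lookup <- data.frame(class = c(1, 2, 1, 2), type = c("A", "B", "B", "A"),
                     condition1 = c(50, 60, 55, 53),
                     condition2 = c(80, 85, 86, 83))
data <- data.frame(class = c(1, 2, 2, 1, 2, 1),
                   type = c("A", "B", "A", "A", "B", "B"),
                   percentage_condition1 = c(0.3, 0.6, 0.1, 0.2, 0.4, 0.5),
                   percentage_condition2 = c(0.7, 0.4, 0.9, 0.8, 0.6, 0.5))

# Work on a copy and remember each row's original position
data_id <- data
data_id$row_id <- seq_len(nrow(data_id))

df1 <- merge(data_id, lookup, by = c("class", "type"))
df1 <- df1[order(df1$row_id), ]   # restore the original row order

df1$new <- with(df1, condition1 * percentage_condition1 +
                     condition2 * percentage_condition2)

# Drop the helper column and reset the row names
df1$row_id <- NULL
rownames(df1) <- NULL
```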