Match values from different columns to rows only when several conditions are met
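The scenario is car racing... Sometimes a driver races against a competitor, and sometimes they race alone. The drivers and their skill levels are always completely random. A race ends after 12 laps, and there is one race per day for 10 years. There are hundreds of drivers. An independent observer recorded data during the races, including the driver's speed, but only for one of the drivers! As a result, data is missing. Here are the first 6 rows of the data: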
df <- data.frame(
Driver_name = c("Rick", "Julie", "Denver", "Johny", "Cassandra", "Phillip"),
Driver_level = c("A", "C", "D", "A", "B", "B"),
Driver_speed = c(96, 91, 89, 94, 88, 99),
Competitor= c("Yes", "Yes", "Yes", "Yes", "No", "No"),
Comp_name= c("Julie", "Rick", "Johnny", "Denver", "NA", "NA"),
Comp_level= c("B", "B", "D", "A", "NA", "NA"),
Comp_speed= c("???", "???", "???", "???", "NA", "NA"),
Race_day= c(165, 165, 72, 72, 92, 65),
Lap_number= c(9, 9, 12, 12, 8, 4),
Humidity= c(33, 33, 88, 88, 12, 55),
Temperature= c(28, 28, 12, 12, 20, 28)
)
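Each row represents a different driver, but I need to fill in the competitor's speed! I will enter the speeds manually to demonstrate what I need to do for the rest of the dataset.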
df_1 <- data.frame(
Driver_name = c("Rick", "Julie", "Denver", "Johny", "Cassandra", "Phillip"),
Driver_level = c("A", "C", "D", "A", "B", "B"),
Driver_speed = c(96, 91, 89, 94, 88, 99),
Competitor= c("Yes", "Yes", "Yes", "Yes", "No", "No"),
Comp_name= c("Julie", "Rick", "Johnny", "Denver", "NA", "NA"),
Comp_level= c("B", "B", "D", "A", "NA", "NA"),
Comp_speed= c(91, 96, 94, 89, "NA", "NA"),
Race_day= c(165, 165, 72, 72, 92, 65),
Lap_number= c(9, 9, 12, 12, 8, 4),
Humidity= c(33, 33, 88, 88, 12, 55),
Temperature= c(28, 28, 12, 12, 20, 28)
)
This is an ideal use case for left_join.
Your data:
df <- data.frame(
Driver_name = c("Rick", "Julie", "Denver", "Johny", "Cassandra", "Phillip"),
Driver_level = c("A", "C", "D", "A", "B", "B"),
Driver_speed = c(96, 91, 89, 94, 88, 99),
Competitor= c("Yes", "Yes", "Yes", "Yes", "No", "No"),
Comp_name= c("Julie", "Rick", "Johnny", "Denver", "NA", "NA"),
Comp_level= c("B", "B", "D", "A", "NA", "NA"),
Comp_speed= c("???", "???", "???", "???", "NA", "NA"),
Race_day= c(165, 165, 72, 72, 92, 65),
Lap_number= c(9, 9, 12, 12, 8, 4),
Humidity= c(33, 33, 88, 88, 12, 55),
Temperature= c(28, 28, 12, 12, 20, 28)
)
Load the dplyr package:
#install.packages("dplyr") #if you don't have it
library(dplyr)
Let's drop the Comp_speed column, which currently contains "???" values.
df <- df %>% select(-Comp_speed)
Let's make a second data frame that contains only the names and speeds, renaming Driver_speed to Comp_speed.
df2 <- df %>%
select(Driver_name, Comp_speed = Driver_speed)
Now we can left_join the df data frame with df2, matching Comp_name in df to Driver_name in df2:
df_updated <- df %>%
left_join(df2, by = c("Comp_name" = "Driver_name"))
#> Warning: Column `Comp_name`/`Driver_name` joining factors with different
#> levels, coercing to character vector
Here is the resulting data frame, df_updated:
df_updated
#> Driver_name Driver_level Driver_speed Competitor Comp_name Comp_level
#> 1 Rick A 96 Yes Julie B
#> 2 Julie C 91 Yes Rick B
#> 3 Denver D 89 Yes Johnny D
#> 4 Johny A 94 Yes Denver A
#> 5 Cassandra B 88 No NA NA
#> 6 Phillip B 99 No NA NA
#> Race_day Lap_number Humidity Temperature Comp_speed
#> 1 165 9 33 28 91
#> 2 165 9 33 28 96
#> 3 72 12 88 12 NA
#> 4 72 12 88 12 89
#> 5 92 8 12 20 NA
#> 6 65 4 55 28 NA
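Note that row 3 ends up with Comp_speed NA: in the sample data the driver is spelled "Johny" while the competitor is spelled "Johnny", so the join finds no match. A minimal base R sketch for spotting such names before joining follows. It assumes the string "NA" in Comp_name is a placeholder for a missing value, and the name unmatched is only illustrative.

```r
# Sample data from the question (only the columns needed here).
df <- data.frame(
  Driver_name = c("Rick", "Julie", "Denver", "Johny", "Cassandra", "Phillip"),
  Comp_name   = c("Julie", "Rick", "Johnny", "Denver", "NA", "NA"),
  stringsAsFactors = FALSE
)

# Assumption: the string "NA" stands for "no competitor", so make it a real NA.
df$Comp_name[df$Comp_name == "NA"] <- NA

# Competitor names that never appear as a driver cannot be matched by the join.
comp_names <- df$Comp_name[!is.na(df$Comp_name)]
unmatched  <- setdiff(comp_names, df$Driver_name)
unmatched
#> [1] "Johnny"
```

Any name listed here will produce NA in Comp_speed after the join, so it is worth fixing the spelling in the source data first.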
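Update:
As the OP pointed out, this is not robust for drivers who race each other more than once (my oversight).
Assuming (based on the data) that the Race_day and Lap_number variables are enough to distinguish each head-to-head race, we can simply keep them in our df2 data frame and then add those column names to our left_join. Here is what that looks like.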
df2 <- df %>%
select(Driver_name, Comp_speed = Driver_speed, Race_day, Lap_number)
df_updated <- df %>%
left_join(df2, by = c("Comp_name" = "Driver_name", "Race_day", "Lap_number"))
#> Warning: Column `Comp_name`/`Driver_name` joining factors with different
#> levels, coercing to character vector
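To see why the extra keys matter, here is a small sketch with made-up data in which Rick and Julie meet on two different days. The data frame races and the names lookup_name, lookup_race, by_name, and by_race are hypothetical and not part of the original question.

```r
library(dplyr)

# Hypothetical data: Rick and Julie race each other on two different days.
races <- data.frame(
  Driver_name  = c("Rick", "Julie", "Rick", "Julie"),
  Driver_speed = c(96, 91, 90, 93),
  Comp_name    = c("Julie", "Rick", "Julie", "Rick"),
  Race_day     = c(165, 165, 200, 200),
  Lap_number   = c(9, 9, 12, 12),
  stringsAsFactors = FALSE
)

# Name-only lookup: every Rick row matches every Julie row (and vice versa).
# Recent dplyr versions warn about this many-to-many relationship.
lookup_name <- races %>% select(Driver_name, Comp_speed = Driver_speed)
by_name <- races %>%
  left_join(lookup_name, by = c("Comp_name" = "Driver_name"))
nrow(by_name)   # 8 rows: each race is duplicated

# Lookup that also carries the race identifiers: one match per race.
lookup_race <- races %>%
  select(Driver_name, Comp_speed = Driver_speed, Race_day, Lap_number)
by_race <- races %>%
  left_join(lookup_race,
            by = c("Comp_name" = "Driver_name", "Race_day", "Lap_number"))
by_race$Comp_speed   # 91 96 93 90
```

With the name-only join the row count doubles, which is the symptom to look for in the full dataset.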
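We need to left join df to itself.
!names(df)%in%c("Comp_speed") drops the variable Comp_speed from the first data frame, x.
df[,c("Driver_name","Driver_speed")] keeps only the variables Driver_name and Driver_speed in the second data frame, y.
In summary, Comp_name from x is matched with Driver_name from y, and Driver_speed from y is reported as Driver_speed.y (Driver_speed.y because Driver_speed already exists in df, where it is renamed to Driver_speed.x after the join):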
df <- merge(x=df[,!names(df)%in%c("Comp_speed")],y=df[,c("Driver_name","Driver_speed")],by.x="Comp_name",by.y="Driver_name",all.x=TRUE)
Now we only need to rename "Driver_speed.x","Driver_speed.y" to "Driver_speed","Comp_speed":
library("data.table")
setnames(df,c("Driver_speed.x","Driver_speed.y"),c("Driver_speed","Comp_speed"))
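The merge approach can also be made to respect repeated matchups, under the same assumption that Race_day and Lap_number identify a race. The sketch below uses base R only: it renames the lookup columns before merging, so no .x/.y suffixes appear and data.table is not needed. The names lookup and res are illustrative, and the placeholder string "NA" is written here as a real NA.

```r
df <- data.frame(
  Driver_name  = c("Rick", "Julie", "Denver", "Johny", "Cassandra", "Phillip"),
  Driver_speed = c(96, 91, 89, 94, 88, 99),
  Comp_name    = c("Julie", "Rick", "Johnny", "Denver", NA, NA),
  Race_day     = c(165, 165, 72, 72, 92, 65),
  Lap_number   = c(9, 9, 12, 12, 8, 4),
  stringsAsFactors = FALSE
)

# Lookup table: rename the columns up front so merge() adds no .x/.y suffixes.
lookup <- df[, c("Driver_name", "Driver_speed", "Race_day", "Lap_number")]
names(lookup)[1:2] <- c("Comp_name", "Comp_speed")

# Left join on the competitor's name plus the race identifiers.
# Note: merge() sorts the result by the key columns, so row order changes.
res <- merge(df, lookup,
             by = c("Comp_name", "Race_day", "Lap_number"),
             all.x = TRUE)

res[, c("Driver_name", "Comp_name", "Comp_speed")]
```

Denver still gets NA because of the "Johny"/"Johnny" spelling mismatch in the sample data.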
I think df$Comp_speed <- df$Driver_speed[with(df,match(Comp_name,Driver_name))] does what you need.
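Because match() returns only the first hit, this one-liner picks the wrong row when two drivers meet more than once. One possible workaround, sketched below on made-up data, is to match on a composite key built from the name and the race identifiers. The data frame races and the helpers self_key and comp_key are hypothetical, and the sketch assumes Race_day and Lap_number are enough to identify a race.

```r
# Hypothetical data: Rick and Julie meet twice, Cassandra races alone.
races <- data.frame(
  Driver_name  = c("Rick", "Julie", "Rick", "Julie", "Cassandra"),
  Driver_speed = c(96, 91, 90, 93, 88),
  Comp_name    = c("Julie", "Rick", "Julie", "Rick", NA),
  Race_day     = c(165, 165, 200, 200, 92),
  Lap_number   = c(9, 9, 12, 12, 8),
  stringsAsFactors = FALSE
)

# Composite keys: "who I am in this race" and "who I raced against in this race".
self_key <- paste(races$Driver_name, races$Race_day, races$Lap_number, sep = "|")
comp_key <- paste(races$Comp_name,   races$Race_day, races$Lap_number, sep = "|")

# Look up each competitor's row within the same race; solo races give NA.
races$Comp_speed <- races$Driver_speed[match(comp_key, self_key)]
races$Comp_speed   # 91 96 93 90 NA
```

The separator "|" is arbitrary; any string that cannot occur inside a driver name works.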