Match values from different columns to rows only when several conditions are met

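The scenario is car racing... Sometimes drivers race against a competitor, and sometimes they race alone. The drivers and their skill levels are always completely random. A race ends after 12 laps, and races take place once a day for 10 years. There are hundreds of drivers. An independent observer recorded data during the races, including driver speed, but only for one of the drivers in each race! As a result, data is missing. Here are the first 6 rows of the data: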

    df <- data.frame(
      Driver_name =  c("Rick",  "Julie",  "Denver", "Johny",  "Cassandra", "Phillip"),
      Driver_level = c("A",     "C",      "D",      "A",      "B",         "B"),
      Driver_speed = c(96,       91,       89,       94,       88,          99),
      Competitor=    c("Yes",   "Yes",    "Yes",    "Yes",    "No",        "No"),
      Comp_name=     c("Julie", "Rick",   "Johnny", "Denver", "NA",        "NA"),
      Comp_level=    c("B",     "B",      "D",      "A",      "NA",        "NA"),
      Comp_speed=    c("???",   "???",    "???",    "???",    "NA",        "NA"),
      Race_day=      c(165,      165,      72,       72,       92,          65),
      Lap_number=    c(9,        9,        12,       12,       8,           4),
      Humidity=      c(33,       33,       88,       88,       12,          55),
      Temperature=   c(28,       28,       12,       12,       20,          28)
    )

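Each row represents a different driver, but I need to fill in the competitor's speed! I will enter the speeds manually to demonstrate what I need to do for the rest of the dataset.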

    df_1 <- data.frame(
      Driver_name =  c("Rick",  "Julie",  "Denver", "Johny",  "Cassandra", "Phillip"),
      Driver_level = c("A",     "C",      "D",      "A",      "B",         "B"),
      Driver_speed = c(96,       91,       89,       94,       88,          99),
      Competitor=    c("Yes",   "Yes",    "Yes",    "Yes",    "No",        "No"),
      Comp_name=     c("Julie", "Rick",   "Johnny", "Denver", "NA",        "NA"),
      Comp_level=    c("B",     "B",      "D",      "A",      "NA",        "NA"),
      Comp_speed=    c(91,       96,       94,       89,      "NA",        "NA"),
      Race_day=      c(165,      165,      72,       72,       92,          65),
      Lap_number=    c(9,        9,        12,       12,       8,           4),
      Humidity=      c(33,       33,       88,       88,       12,          55),
      Temperature=   c(28,       28,       12,       12,       20,          28)
    )

This is an ideal case for left_join.

Your data:

    df <- data.frame(
      Driver_name =  c("Rick",  "Julie",  "Denver", "Johny",  "Cassandra", "Phillip"),
      Driver_level = c("A",     "C",      "D",      "A",      "B",         "B"),
      Driver_speed = c(96,       91,       89,       94,       88,          99),
      Competitor=    c("Yes",   "Yes",    "Yes",    "Yes",    "No",        "No"),
      Comp_name=     c("Julie", "Rick",   "Johnny", "Denver", "NA",        "NA"),
      Comp_level=    c("B",     "B",      "D",      "A",      "NA",        "NA"),
      Comp_speed=    c("???",   "???",    "???",    "???",    "NA",        "NA"),
      Race_day=      c(165,      165,      72,       72,       92,          65),
      Lap_number=    c(9,        9,        12,       12,       8,           4),
      Humidity=      c(33,       33,       88,       88,       12,          55),
      Temperature=   c(28,       28,       12,       12,       20,          28)
    )

We load dplyr:

    #install.packages("dplyr") #if you don't have it
    library(dplyr)

Let's drop the Comp_speed column, which currently contains "???" values.

    df <- df %>% select(-Comp_speed)

Let's make a second data frame containing only the names and speeds, renaming Driver_speed to Comp_speed.

    df2 <- df %>% 
      select(Driver_name, Comp_speed = Driver_speed)

Now we can left_join the df data frame with df2, matching Comp_name from df to Driver_name from df2:

    df_updated <- df %>% 
      left_join(df2, by = c("Comp_name" = "Driver_name"))
    #> Warning: Column `Comp_name`/`Driver_name` joining factors with different
    #> levels, coercing to character vector

Here is the resulting data frame, df_updated:

    df_updated
    #>   Driver_name Driver_level Driver_speed Competitor Comp_name Comp_level
    #> 1        Rick            A           96        Yes     Julie          B
    #> 2       Julie            C           91        Yes      Rick          B
    #> 3      Denver            D           89        Yes    Johnny          D
    #> 4       Johny            A           94        Yes    Denver          A
    #> 5   Cassandra            B           88         No        NA         NA
    #> 6     Phillip            B           99         No        NA         NA
    #>   Race_day Lap_number Humidity Temperature Comp_speed
    #> 1      165          9       33          28         91
    #> 2      165          9       33          28         96
    #> 3       72         12       88          12         NA
    #> 4       72         12       88          12         89
    #> 5       92          8       12          20         NA
    #> 6       65          4       55          28         NA
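
The NA in row 3 appears to come from the sample data rather than from the join itself: Comp_name contains "Johnny", while Driver_name spells it "Johny". A quick check along these lines (a base R sketch using only the relevant columns of the posted data; the variable names has_comp and unmatched are made up for illustration) lists competitor names that have no matching driver:

```r
# Relevant columns of the sample data from the question.
df <- data.frame(
  Driver_name = c("Rick", "Julie", "Denver", "Johny", "Cassandra", "Phillip"),
  Competitor  = c("Yes", "Yes", "Yes", "Yes", "No", "No"),
  Comp_name   = c("Julie", "Rick", "Johnny", "Denver", "NA", "NA"),
  stringsAsFactors = FALSE
)

# Only rows that actually had a competitor are expected to match.
has_comp <- df$Competitor == "Yes"

# Competitor names that never appear as a Driver_name cannot be joined.
unmatched <- setdiff(df$Comp_name[has_comp], df$Driver_name)
print(unmatched)
#> [1] "Johnny"
```

Running a check like this before the join makes it easier to tell genuine gaps apart from spelling mismatches.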

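Update:

As the OP pointed out, this is not robust for drivers who race each other more than once (my oversight).

Assuming (based on the data) that the Race_day and Lap_number variables are enough to distinguish each head-to-head race, we just keep them in our df2 data frame and then add those column names to our left_join. Here is what that looks like.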

    df2 <- df %>% 
      select(Driver_name, Comp_speed = Driver_speed, Race_day, Lap_number)

    df_updated <- df %>% 
      left_join(df2, by = c("Comp_name" = "Driver_name", "Race_day", "Lap_number"))
    #> Warning: Column `Comp_name`/`Driver_name` joining factors with different
    #> levels, coercing to character vector
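
To illustrate why the extra keys matter, here is a minimal sketch with hypothetical data in which Rick and Julie meet on two different days. It uses base R merge() rather than dplyr so that it runs without extra packages; the races and lookup data frames and their values are invented for the example.

```r
# Hypothetical data: Rick and Julie race each other on day 165 and day 200.
races <- data.frame(
  Driver_name  = c("Rick", "Julie", "Rick", "Julie"),
  Driver_speed = c(96, 91, 90, 93),
  Comp_name    = c("Julie", "Rick", "Julie", "Rick"),
  Race_day     = c(165, 165, 200, 200),
  Lap_number   = c(9, 9, 12, 12),
  stringsAsFactors = FALSE
)

# Lookup table: each driver's speed, relabelled as a competitor's speed.
lookup <- data.frame(
  Comp_name  = races$Driver_name,
  Comp_speed = races$Driver_speed,
  Race_day   = races$Race_day,
  Lap_number = races$Lap_number,
  stringsAsFactors = FALSE
)

# Joining on the name alone: every row matches both of the competitor's races.
by_name <- merge(races, lookup[c("Comp_name", "Comp_speed")],
                 by = "Comp_name", all.x = TRUE)
nrow(by_name)
#> [1] 8

# Joining on name plus the race keys: one match per row.
by_race <- merge(races, lookup,
                 by = c("Comp_name", "Race_day", "Lap_number"), all.x = TRUE)
nrow(by_race)
#> [1] 4
```

With the name-only key the four input rows become eight, because each row picks up the competitor's speed from both races; adding Race_day and Lap_number keeps the result at one row per driver per race.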

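We need to left join df to itself.
!names(df)%in%c("Comp_speed") drops the variable Comp_speed from the first data frame, x.

df[,c("Driver_name","Driver_speed")] keeps only the variables Driver_name and Driver_speed in the second data frame, y.

In summary, Comp_name from x is matched with Driver_name from y, and Driver_speed from y is reported as Driver_speed.y (it gets the .y suffix because Driver_speed already exists in df, where it is renamed to Driver_speed.x after the join):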

    df <- merge(
      x = df[, !names(df) %in% c("Comp_speed")],
      y = df[, c("Driver_name", "Driver_speed")],
      by.x = "Comp_name", by.y = "Driver_name",
      all.x = TRUE
    )

Now we only need to rename "Driver_speed.x","Driver_speed.y" to "Driver_speed","Comp_speed":

    library("data.table")
    setnames(df,c("Driver_speed.x","Driver_speed.y"),c("Driver_speed","Comp_speed"))
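
If loading data.table only for the rename feels heavy, one possible alternative (a sketch, not the answerer's method) is to give the lookup columns their final names before merging, so that merge() has no clashing names and adds no suffixes. The lookup and merged names below are made up for the example, and only the columns needed for the join are reproduced.

```r
# Columns of the sample data that the join needs.
df <- data.frame(
  Driver_name  = c("Rick", "Julie", "Denver", "Johny", "Cassandra", "Phillip"),
  Driver_speed = c(96, 91, 89, 94, 88, 99),
  Comp_name    = c("Julie", "Rick", "Johnny", "Denver", "NA", "NA"),
  stringsAsFactors = FALSE
)

# Build the lookup table with its final column names up front.
lookup <- setNames(df[c("Driver_name", "Driver_speed")],
                   c("Comp_name", "Comp_speed"))

# The only shared column is the key, so no .x/.y suffixes are added.
merged <- merge(df, lookup, by = "Comp_name", all.x = TRUE)
names(merged)
#> [1] "Comp_name"    "Driver_name"  "Driver_speed" "Comp_speed"
```

Note that merge() sorts the result by the key column by default and moves the key to the front, so the row and column order differ from the original df.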

I think df$Comp_speed <- df$Driver_speed[with(df,match(Comp_name,Driver_name))] does what you need.
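
Because match() returns only the first matching position, the one-liner would pick the wrong row when two drivers meet more than once. A possible extension, assuming Race_day and Lap_number identify a race, is to match on a pasted composite key. The data and the driver_key and comp_key names below are hypothetical.

```r
# Hypothetical data: Rick and Julie race each other on day 165 and day 200.
races <- data.frame(
  Driver_name  = c("Rick", "Julie", "Rick", "Julie"),
  Driver_speed = c(96, 91, 90, 93),
  Comp_name    = c("Julie", "Rick", "Julie", "Rick"),
  Race_day     = c(165, 165, 200, 200),
  Lap_number   = c(9, 9, 12, 12),
  stringsAsFactors = FALSE
)

# Composite keys: who was driving (or competing) in which race.
driver_key <- paste(races$Driver_name, races$Race_day, races$Lap_number, sep = "|")
comp_key   <- paste(races$Comp_name,   races$Race_day, races$Lap_number, sep = "|")

# Look up each competitor's speed in the same race; unmatched keys give NA.
races$Comp_speed <- races$Driver_speed[match(comp_key, driver_key)]
races$Comp_speed
#> [1] 91 96 93 90
```

This keeps the original row order and needs no packages, at the cost of building string keys for every row.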