Match values by string containing another string

Question

I have two data frames. The first looks like this:

month     Joanne K. Rowling   Samuel L. Jackson
2000/01   1                   0
2000/02   1                   1
2000/03   0                   1
2000/04   0                   0
2000/05   0                   1
2000/06   1                   0

test_1<-data.frame("Month"=c("2000/01","2000/02","2000/03","2000/04","2000/05","2000/06"),"Joanne K. Rowling"=c(1,1,0,0,0,1),"Samuel L. Jackson"=c(0,1,1,0,1,0))

The other looks like this:

Name            Score
Samuel Jackson  67
Joanne Rowling  52

test_2<-data.frame("Name"=c("Samuel Jackson","Joanne Rowling"),"Score"=c(67,52))

I want to combine them to get the following data frame:

month     Joanne K. Rowling   Samuel L. Jackson
2000/01   52                  0
2000/02   52                  67
2000/03   0                   67
2000/04   0                   0
2000/05   0                   67
2000/06   52                  0

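where each value of 1 is replaced by the score from test_2. The column names in test_1 may differ slightly from the values in test_2, so the matching should not be a fixed (exact) match. I found one way to do this: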

for (i in seq_len(nrow(test_2))) {
  for (k in seq_len(ncol(test_1))) {
    for (l in seq_len(nrow(test_1))) {
      # column 1 of test_2 holds Name, column 2 holds Score
      if (grepl(test_2[i, 1], colnames(test_1)[k])) {
        if (test_1[l, k] == 1) {
          test_1[l, k] <- test_2[i, 2]
        }
      }
    }
  }
}

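But it is very inefficient, because I have to apply it to a list of data frames. Please try to write an efficient approach with as few loops as possible.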

Answer 1

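I don't think grepl will work directly here, because 'Joanne Rowling' does not match 'Joanne K. Rowling'. You can use stringdist::stringdistmatrix to find the closest match for each column and then multiply by the corresponding values.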

mat <- stringdist::stringdistmatrix(names(test_1)[-1], test_2$Name)
test_1[-1] <- sweep(test_1[-1], 2, test_2$Score[max.col(-mat)], `*`)
test_1

#    Month Joanne K. Rowling Samuel L. Jackson
#1 2000/01                52                 0
#2 2000/02                52                67
#3 2000/03                 0                67
#4 2000/04                 0                 0
#5 2000/05                 0                67
#6 2000/06                52                 0
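
The matching step can also be inspected piece by piece. The following is a minimal sketch, not the answer's exact code: it swaps in base R's utils::adist (generalized Levenshtein distance) for stringdist::stringdistmatrix so that it runs without extra packages, and uses which.min in place of max.col(-mat). The names d, idx and scores are only illustrative. It first shows that grepl finds no match, then picks the closest name for each column.

```r
test_1 <- data.frame(Month = c("2000/01", "2000/02", "2000/03",
                               "2000/04", "2000/05", "2000/06"),
                     "Joanne K. Rowling" = c(1, 1, 0, 0, 0, 1),
                     "Samuel L. Jackson" = c(0, 1, 1, 0, 1, 0),
                     check.names = FALSE)
test_2 <- data.frame(Name = c("Samuel Jackson", "Joanne Rowling"),
                     Score = c(67, 52))

# Substring matching fails: the middle initial breaks the pattern
grepl("Joanne Rowling", "Joanne K. Rowling", fixed = TRUE)

# Edit-distance matrix: one row per score column of test_1,
# one column per name in test_2
d <- utils::adist(names(test_1)[-1], test_2$Name)

# For each score column of test_1, the index of the closest name in test_2
idx <- apply(d, 1, which.min)

# Scores reordered so they line up with the columns of test_1
scores <- test_2$Score[idx]

# Multiply every column by its own score (1 becomes the score, 0 stays 0)
test_1[-1] <- sweep(test_1[-1], 2, scores, `*`)
test_1
```

Because the columns hold only 0 and 1, multiplying by the score has the same effect as replacing each 1 with it.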

To apply this to multiple data frames, you can do the following:

result <- lapply(test_1_list, function(x) {
  mat <- stringdist::stringdistmatrix(names(x)[-1], test_2$Name)
  x[-1] <- sweep(x[-1], 2, test_2$Score[max.col(-mat)], `*`)
  x
})
result

where test_1_list is a list of data frames.
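
Since test_1_list is not defined in the answer, here is a small sketch of what such a list might look like and how the same idea applies to it. The two data frames in the list and the helper name apply_scores are made up for illustration, and utils::adist with which.min stands in for stringdist::stringdistmatrix with max.col so that the example runs in base R.

```r
test_2 <- data.frame(Name = c("Samuel Jackson", "Joanne Rowling"),
                     Score = c(67, 52))

# Hypothetical list of data frames laid out like test_1;
# the second one has its name columns in the opposite order
test_1_list <- list(
  data.frame(Month = c("2000/01", "2000/02"),
             "Joanne K. Rowling" = c(1, 0),
             "Samuel L. Jackson" = c(0, 1),
             check.names = FALSE),
  data.frame(Month = c("2000/03", "2000/04"),
             "Samuel L. Jackson" = c(1, 1),
             "Joanne K. Rowling" = c(0, 1),
             check.names = FALSE)
)

# Hypothetical helper: scale each non-Month column by the score
# of the closest-matching name in lookup
apply_scores <- function(x, lookup) {
  d <- utils::adist(names(x)[-1], lookup$Name)
  idx <- apply(d, 1, which.min)
  x[-1] <- sweep(x[-1], 2, lookup$Score[idx], `*`)
  x
}

result <- lapply(test_1_list, apply_scores, lookup = test_2)
result
```

The matching is recomputed for each data frame, so the column order may differ from one list element to the next.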

Data

test_1<-data.frame("Month"=c("2000/01","2000/02","2000/03","2000/04","2000/05","2000/06"),
                   "Joanne K. Rowling"=c(1,1,0,0,0,1),
                   "Samuel L. Jackson"=c(0,1,1,0,1,0), check.names = FALSE)
test_2<-data.frame("Name"=c("Samuel Jackson","Joanne Rowling"),"Score"=c(67,52))

Answer 2

You can use the replace function and define an index vector that decides which values should be replaced:

# Just for JK Rowling
test_1[,2] <- replace(test_1[,2], test_1[,2] == 1, test_2[2,2])

test_1[,2] == 1 creates an index vector that is TRUE where the value is 1 and FALSE where it is 0.

Then you can simply copy the line for Samuel Jackson.
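
Written out for both columns, this might look like the sketch below. It assumes the row order of test_2 shown in the question (Samuel Jackson in row 1, Joanne Rowling in row 2), so each column is paired with its row by hand rather than by string matching.

```r
test_1 <- data.frame(Month = c("2000/01", "2000/02", "2000/03",
                               "2000/04", "2000/05", "2000/06"),
                     "Joanne K. Rowling" = c(1, 1, 0, 0, 0, 1),
                     "Samuel L. Jackson" = c(0, 1, 1, 0, 1, 0),
                     check.names = FALSE)
test_2 <- data.frame(Name = c("Samuel Jackson", "Joanne Rowling"),
                     Score = c(67, 52))

# Column 2 is Joanne K. Rowling; her score sits in row 2 of test_2
test_1[, 2] <- replace(test_1[, 2], test_1[, 2] == 1, test_2[2, 2])

# Column 3 is Samuel L. Jackson; his score sits in row 1 of test_2
test_1[, 3] <- replace(test_1[, 3], test_1[, 3] == 1, test_2[1, 2])

test_1
```

This needs one line per column, so it suits a handful of names better than a long or changing list of them.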


Tags: r, string-matching