Partial matching of elements in two string columns in R

Question

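I have a large dataset grouped by two identifiers (Group and ID). The Initial column shows the elements present in an initial time period, and the Post column shows the elements that occur after that initial period. A working example is below: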

SampleDF<-data.frame(Group=c(0,0,1),ID=c(2,2,3),
Initial=c('F28D,G06F','F24J','G01N'),
Post=c('G06F','H02G','F23C,H02G,G01N'))

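For each Group/ID combination, I want to compare the elements in Initial and Post to find out when the elements match, when only new elements are present, and when both existing and new elements are present. Ideally, I would end up with a new Type variable with the following output: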

SampleDF<-cbind(SampleDF, 'Type'=rbind(0,1,2))

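where (relative to Initial) 0 indicates that there are no new element(s) in Post, 1 indicates that there are only new element(s) in Post, and 2 indicates that there are both existing and new element(s) in Post.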

Answer 1

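Your case is complicated because both the pattern and the vector change from row to row when using agrepl for string matching. So here I came up with a rather tricky solution that does the job pretty well.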

element_counter <- list()
for (i in seq_along(SampleDF$Initial)) {
  # Split the comma-separated codes of row i into character vectors
  initial_i <- strsplit(as.character(SampleDF$Initial[i]), ",")[[1]]
  post_i <- strsplit(as.character(SampleDF$Post[i]), ",")[[1]]
  if (length(initial_i) > 1) {
    # Several Initial elements: use the whole Post string as the pattern
    element_counter[[i]] <- length(as.character(SampleDF$Post[i])) - sum(agrepl(as.character(SampleDF$Post[i]), initial_i))
  } else {
    # Single Initial element: use it as the pattern against each Post element
    element_counter[[i]] <- length(post_i) - sum(agrepl(as.character(SampleDF$Initial[i]), post_i))
  }
}

SampleDF$Type <- unlist(element_counter)


## SampleDF
#   Group  ID   Initial             Post  Type
#1     0   2  F28D,G06F             G06F    0
#2     0   2       F24J             H02G    1
#3     1   3       G01N   F23C,H02G,G01N    2
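
The loop above relies on approximate matching and, as far as can be told from the code, produces a count of unmatched entries that happens to coincide with the requested codes on the sample data. For comparison, here is a minimal sketch of an exact set comparison that returns the question's Type codes directly. The helper classify_row and the column name TypeExact are hypothetical names introduced only for this illustration, and the sketch assumes that codes must match exactly after trimming whitespace.

```r
# Sample data from the question (typo in the Initial vector corrected).
SampleDF <- data.frame(
  Group = c(0, 0, 1), ID = c(2, 2, 3),
  Initial = c("F28D,G06F", "F24J", "G01N"),
  Post = c("G06F", "H02G", "F23C,H02G,G01N"),
  stringsAsFactors = FALSE
)

# classify_row() is a hypothetical helper, not part of the answer above.
# It returns 0 (no new elements), 1 (only new elements) or 2 (both),
# following the coding used in the question.
classify_row <- function(initial, post) {
  initial_set <- trimws(strsplit(initial, ",", fixed = TRUE)[[1]])
  post_set <- trimws(strsplit(post, ",", fixed = TRUE)[[1]])
  is_new <- !(post_set %in% initial_set)  # TRUE for Post codes absent from Initial
  if (!any(is_new)) {
    0L
  } else if (all(is_new)) {
    1L
  } else {
    2L
  }
}

# Apply the helper row by row; TypeExact is an assumed column name.
SampleDF$TypeExact <- mapply(classify_row, SampleDF$Initial, SampleDF$Post,
                             USE.NAMES = FALSE)
```

Because this uses %in% rather than agrepl, near-identical codes are treated as different elements; whether that is desirable depends on how clean the real data is.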

Answer 2

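I've broken the process into two steps: finding the rows with any new values, then finding the rows with only new values. Adding these two logical vectors together creates the type. The only caveat is that the type definition differs slightly from the one in your question: 0 means no new elements, 1 means both new and existing elements, and 2 means only new elements.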

# This approach needs character columns, not factors, so stringsAsFactors = FALSE
SampleDF<-data.frame(Group=c(0,0,1),ID=c(2,2,3),
                     Initial=c('F28D,G06F','F24J','G01N'),
                     Post=c('G06F','H02G','F23C,H02G,G01N'),
                     stringsAsFactors = FALSE)

# Identify rows where there are new occurrences in Post that are not present in Initial
SampleDF$anyNewOccurrences <- 
  mapply(FUN = function(pattern, x){
    any(!grepl(pattern, x))}, 
    pattern = gsub("," , "|", SampleDF$Initial), 
    x = strsplit(SampleDF$Post, ","))

# Identify rows where there are only new occurrences (no repeated values from Initial)
SampleDF$onlyNewOccurrences <- 
  mapply(FUN = function(pattern, x){
    all(!grepl(pattern, x))}, 
    pattern = gsub("," , "|", SampleDF$Initial), 
    x = strsplit(SampleDF$Post, ","))

# Add the two values together to create a type code
SampleDF$Type <- SampleDF$onlyNewOccurrences + SampleDF$anyNewOccurrences
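
Since this coding differs from the one requested in the question, a small lookup can translate one into the other. The sketch below is only an illustration: the vectors any_new and only_new are hard-coded stand-ins for the two logical columns on the three sample rows, and the names lookup and type_question are hypothetical.

```r
# Stand-ins for anyNewOccurrences and onlyNewOccurrences on the sample rows
# (assumed values, worked out by hand from the sample data).
any_new <- c(FALSE, TRUE, TRUE)
only_new <- c(FALSE, TRUE, FALSE)

# This answer's coding: 0 = nothing new, 1 = new and existing, 2 = only new.
type_answer <- only_new + any_new

# Question's coding: 0 = nothing new, 1 = only new, 2 = new and existing.
# lookup[k + 1] is the question's code for this answer's code k.
lookup <- c(0, 2, 1)
type_question <- lookup[type_answer + 1]
```

On the sample rows this gives 0, 1, 2, which is the Type column the question asked for.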
