String extract from multiple columns using a list in R

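I am trying to use a list to extract information from more than 2 columns (2 columns are given as an example below) and create another column containing whichever string from the list is found in any of those columns, searching the columns in a specified order. The example below shows my data and the desired output. I hope it makes clear what I am looking for.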

A<-c("This contains NYU", "This has NYU", "This has XT", "This has FIT", 
"Something something UNH","I got into UCLA","Hello XT")
B<-c("NYU","UT","USC","FIT","UNA","UCLA", "CA")
data<-data.frame(A,B)

list <- c("NYU","FIT","UCLA","CA","UT","USC")

                        A    B
1       This contains NYU  NYU
2            This has NYU   UT
3             This has XT  USC
4            This has FIT  FIT
5 Something something UNH  UNA
6         I got into UCLA UCLA
7                Hello XT   CA 

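I want the code to search for the strings in the list, looking in column A first; if no string is found there, it should look in column B, and if nothing is found there either, it should return null (shown as NA below). Given the list above, the desired output looks like this.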

                        A    B    C
1       This contains NYU  NYU  NYU
2            This has NYU   UT  NYU
3             This has XT  USC  USC
4            This has FIT  FIT  FIT
5 Something something UNH  UNA <NA>
6         I got into UCLA UCLA UCLA
7                Hello XT   CA   CA

You can convert the list to a regular expression and then apply R's regular expression functions:

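expr <- paste0(list, collapse = "|")
# expr = "NYU|FIT|UCLA|CA|UT|USC" -> regular expression meaning NYU or FIT or ...

data[, "C"] <- NA_character_   # rows with no match in any column stay NA
# Reverse the column order so that A is processed last and overrides B
cols <- rev(setdiff(names(data), "C"))

for (col in cols) {
  values <- as.character(data[, col])   # substr() needs character, not factor
  index <- regexpr(expr, values)
  data[, "C"] <- ifelse(index != -1,
                        substr(values, index, index + attr(index, "match.length") - 1),
                        data[, "C"])
}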
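
One caveat with a plain alternation is that it also matches inside longer words, so "CA" could be picked up from a word such as "CAT". The following is a minimal sketch of a whole-word variant, not part of the answer above. The names `keys` and `first_match` are hypothetical, and it assumes the keys contain no regex metacharacters (they would need escaping otherwise).

```r
# Sample data from the question, kept as character rather than factor
A <- c("This contains NYU", "This has NYU", "This has XT", "This has FIT",
       "Something something UNH", "I got into UCLA", "Hello XT")
B <- c("NYU", "UT", "USC", "FIT", "UNA", "UCLA", "CA")
data <- data.frame(A, B, stringsAsFactors = FALSE)
keys <- c("NYU", "FIT", "UCLA", "CA", "UT", "USC")  # hypothetical name, avoids masking list()

# \\b anchors the alternation at word boundaries, so only whole words match
expr <- paste0("\\b(", paste(keys, collapse = "|"), ")\\b")

# Hypothetical helper: first whole-word match in each element of x, or NA if none
first_match <- function(x, pattern) {
  x <- as.character(x)
  idx <- regexpr(pattern, x, perl = TRUE)
  hit <- !is.na(idx) & idx != -1
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, idx)  # regmatches() returns only the matched pieces
  out
}

matchA <- first_match(data$A, expr)
matchB <- first_match(data$B, expr)
data$C <- ifelse(is.na(matchA), matchB, matchA)  # A takes priority over B
data
```

On the sample data this should reproduce the desired column C, with NA in row 5.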

Hope this helps.

戈塔维亚诺尼

This uses the tokenizers package; load it with library(tokenizers).

Merge the two columns, creating a new column that contains A and B combined:

data$newC <- paste(data$A, data$B, sep = " " )

Then run the loop below, which collects the extracted values in a vector; you can then bind that vector to the existing data frame.

library(tokenizers)

newcolumn <- character(0)

for (p in data$newC) {
  # positions in `list` whose lowercased token also occurs among the tokens of p
  x <- which(is.element(unlist(tokenize_words(list, lowercase = TRUE)),
                        unlist(tokenize_words(p, lowercase = TRUE, stopwords = NULL, simplify = FALSE))))
  # which() returns integer(0) when nothing matches, so test its length
  newcolumn <- append(newcolumn, if (length(x) > 0) list[x[1]] else NA_character_)
}

newcolumn

data <- cbind(data, newcolumn)
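
Note that the loop above returns the first hit in the order of the list, not in the order of the columns. As a rough base R sketch of the same idea without the tokenizers dependency (an approximation of its word splitting, not a drop-in replacement), the tokens can be scanned in text order instead, so a hit from A wins over one from B. The names `keys`, `simple_tokens` and `match_key`, and the small sample vectors, are hypothetical.

```r
A <- c("This contains NYU", "This has nyu!", "Hello XT")
B <- c("UT", "UT", "UNA")
keys <- c("NYU", "FIT", "UCLA", "CA", "UT", "USC")

# Hypothetical helper: lowercase x and split it on runs of non-alphanumeric characters
simple_tokens <- function(x) {
  tokens <- strsplit(tolower(x), "[^[:alnum:]]+")[[1]]
  tokens[nzchar(tokens)]  # drop empty strings left by leading separators
}

# Hypothetical helper: first key (in text order) found among the tokens, or NA
match_key <- function(text, keys) {
  hit <- match(simple_tokens(text), tolower(keys))
  hit <- hit[!is.na(hit)]
  if (length(hit) > 0) keys[hit[1]] else NA_character_
}

combined <- paste(A, B)  # A comes first, so its tokens are checked first
C <- vapply(combined, match_key, character(1), keys = keys, USE.NAMES = FALSE)
C
```

Because matching is done on lowercased tokens, "nyu!" in the second row still resolves to the key "NYU".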

Hope this helps. The output I get matches what you expected.

Another approach could be

# first word of each value in column A that also appears in vector l (NA if none)
C_tempA <- sapply(as.character(df$A), function(x) intersect(strsplit(x, split = " ")[[1]], l)[1], USE.NAMES = FALSE)
# value of column B if it appears in vector l (NA otherwise)
C_tempB <- sapply(as.character(df$B), function(x) intersect(x, l)[1], USE.NAMES = FALSE)

# column C: prefer the match from A, fall back to the match from B
df$C <- ifelse(is.na(C_tempA), C_tempB, C_tempA)

# final data frame
df

The output is:

                        A    B    C
1       This contains NYU  NYU  NYU
2            This has NYU   UT  NYU
3             This has XT  USC  USC
4            This has FIT  FIT  FIT
5 Something something UNH  UNA <NA>
6         I got into UCLA UCLA UCLA
7                Hello XT   CA   CA
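
Since the question mentions more than two columns, the same token-intersection idea can be extended by looping over the columns in priority order. This is only a sketch under assumptions: the data frame `df3`, its third column `D`, and the helper `first_token_in` are hypothetical and do not come from the question.

```r
# Small sample with a hypothetical third column D
df3 <- data.frame(
  A = c("This contains NYU", "This has XT", "Hello XT"),
  B = c("UT", "UNA", "UNA"),
  D = c("USC", "CA", "none"),
  stringsAsFactors = FALSE
)
l <- c("NYU", "FIT", "UCLA", "CA", "UT", "USC")

# Hypothetical helper: first space-separated token of each element found in keys, or NA
first_token_in <- function(x, keys) {
  vapply(strsplit(as.character(x), " ", fixed = TRUE),
         function(tokens) intersect(tokens, keys)[1],
         character(1))
}

priority <- c("A", "B", "D")  # columns in the order they should be searched
result <- rep(NA_character_, nrow(df3))
for (col in priority) {
  found <- first_token_in(df3[[col]], l)
  # only fill rows that have no match from a higher-priority column
  result <- ifelse(is.na(result), found, result)
}
df3$C <- result
df3
```

Adding another column to search then only requires extending `priority`.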

Sample data:

df <- structure(list(A = structure(c(4L, 6L, 7L, 5L, 3L, 2L, 1L), .Label = c("Hello XT", 
"I got into UCLA", "Something something UNH", "This contains NYU", 
"This has FIT", "This has NYU", "This has XT"), class = "factor"), 
    B = structure(c(3L, 7L, 6L, 2L, 5L, 4L, 1L), .Label = c("CA", 
    "FIT", "NYU", "UCLA", "UNA", "USC", "UT"), class = "factor")), .Names = c("A", 
"B"), row.names = c(NA, -7L), class = "data.frame")

l <- c("NYU","FIT","UCLA","CA","UT","USC")