String extract from multiple columns using a list in R
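I am trying to use a list to extract information from more than 2 columns (2 columns are given in the example below) and create another column that contains the string from the list found in any of those columns, while specifying which column should be searched first. I have an example below, along with the desired output. I hope it makes clear what I am looking for.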
A<-c("This contains NYU", "This has NYU", "This has XT", "This has FIT",
"Something something UNH","I got into UCLA","Hello XT")
B<-c("NYU","UT","USC","FIT","UNA","UCLA", "CA")
data<-data.frame(A,B)
list <- c("NYU","FIT","UCLA","CA","UT","USC")
                        A    B
1       This contains NYU  NYU
2            This has NYU   UT
3             This has XT  USC
4            This has FIT  FIT
5 Something something UNH  UNA
6         I got into UCLA UCLA
7                Hello XT   CA
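I want the code to search for the strings in the list, looking in column A first; if no string is found there, it should look in column B, and if nothing is found there either, return null. Given the list above, the desired output looks like this.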
                        A    B    C
1       This contains NYU  NYU  NYU
2            This has NYU   UT  NYU
3             This has XT  USC  USC
4            This has FIT  FIT  FIT
5 Something something UNH  UNA <NA>
6         I got into UCLA UCLA UCLA
7                Hello XT   CA   CA
You can turn the list into a regular expression and then apply R's regular expression functions:
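expr <- paste0(list, collapse = "|")
# expr is "NYU|FIT|UCLA|CA|UT|USC": a regular expression meaning NYU or FIT or ...
# Start from NA (not "") so that rows without any match end up as <NA>
data[, "C"] <- NA_character_
# Walk the columns in reverse order so that earlier columns (A) overwrite later ones (B)
cols <- rev(setdiff(names(data), "C"))
for (col in cols) {
  index <- regexpr(expr, data[, col])
  data[, "C"] <- ifelse(index != -1, substr(data[, col], index, index + attr(index, "match.length") - 1), data[, "C"])
}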
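One caveat with a plain `NYU|FIT|...` pattern is that it can also match inside longer words (for example `UT` inside a longer token). The following is a minimal sketch, not part of the original answer, that adds word boundaries and uses `regmatches()`. The names `keys` and `first_match` are hypothetical, and `stringsAsFactors = FALSE` is an added assumption so the columns are plain character vectors.

```r
# Example data from the question; stringsAsFactors = FALSE is an added assumption
A <- c("This contains NYU", "This has NYU", "This has XT", "This has FIT",
       "Something something UNH", "I got into UCLA", "Hello XT")
B <- c("NYU", "UT", "USC", "FIT", "UNA", "UCLA", "CA")
data <- data.frame(A, B, stringsAsFactors = FALSE)
keys <- c("NYU", "FIT", "UCLA", "CA", "UT", "USC")  # named `list` in the question

# Wrap the alternatives in \b...\b so that only whole words can match
expr <- paste0("\\b(", paste(keys, collapse = "|"), ")\\b")

# Hypothetical helper: first whole-word match in each string, NA when there is none
first_match <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  hit <- !is.na(m) & m > -1
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)  # regmatches() returns only the matched substrings
  out
}

matchA <- first_match(data$A, expr)
matchB <- first_match(data$B, expr)
# Column A takes priority; fall back to column B only where A had no match
data$C <- ifelse(is.na(matchA), matchB, matchA)
data
```

Rows where neither column contains a key keep `NA` in `C`, which prints as `<NA>` for a character column.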
Hope this helps.
Gottavianoni
Load the tokenizers package with library(tokenizers).
Merge the two columns into a new column that contains the combined A and B:
data$newC <- paste(data$A, data$B, sep = " " )
Then run the loop below, which collects the extracted values in a vector; you can then bind that vector to the existing data frame.
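library(tokenizers)
# Lower-cased tokens of the lookup vector, computed once outside the loop
list_tokens <- unlist(tokenize_words(list, lowercase = TRUE))
newcolumn <- character(0)
for (p in data$newC)
{
  # Positions in `list` whose token occurs among the words of this row
  # Note: if several keys occur, the one that comes first in `list` wins, not the one from column A
  x <- which(is.element(list_tokens, unlist(tokenize_words(p, lowercase = TRUE))))
  # x[1] is NA when nothing matches, so unmatched rows get NA
  newcolumn <- append(newcolumn, list[x[1]])
}
newcolumn
data <- cbind(data, newcolumn)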
Hope this helps.
The output I get matches what you expected.
Another approach could be:
# first word of each value in column A that also appears in vector l (NA if none)
C_tempA <- sapply(df$A, function(x) intersect(strsplit(as.character(x), split = " ")[[1]], l)[1], USE.NAMES = FALSE)
# value of column B if it appears in vector l (NA if not)
C_tempB <- sapply(df$B, function(x) intersect(as.character(x), l)[1], USE.NAMES = FALSE)
# column C: take the match from A, and fall back to B where A has none
df$C <- ifelse(is.na(C_tempA), C_tempB, C_tempA)
# final data frame
df
The output is:
                        A    B    C
1       This contains NYU  NYU  NYU
2            This has NYU   UT  NYU
3             This has XT  USC  USC
4            This has FIT  FIT  FIT
5 Something something UNH  UNA   NA
6         I got into UCLA UCLA UCLA
7                Hello XT   CA   CA
Sample data:
df <- structure(list(A = structure(c(4L, 6L, 7L, 5L, 3L, 2L, 1L), .Label = c("Hello XT",
"I got into UCLA", "Something something UNH", "This contains NYU",
"This has FIT", "This has NYU", "This has XT"), class = "factor"),
B = structure(c(3L, 7L, 6L, 2L, 5L, 4L, 1L), .Label = c("CA",
"FIT", "NYU", "UCLA", "UNA", "USC", "UT"), class = "factor")), .Names = c("A",
"B"), row.names = c(NA, -7L), class = "data.frame")
l <- c("NYU","FIT","UCLA","CA","UT","USC")
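Since the question mentions more than two columns, the same word-intersection idea can be wrapped in a function that walks any number of columns in priority order. This is only a sketch under assumptions: the helper name `first_key_in_columns` and the third column `D` are made up for illustration, and words are assumed to be separated by single spaces.

```r
# Hypothetical helper: for each row, return the first key found as a whole
# space-separated word, checking the columns in the order given by `cols`
first_key_in_columns <- function(df, cols, keys) {
  result <- rep(NA_character_, nrow(df))
  for (col in cols) {
    words <- strsplit(as.character(df[[col]]), " ", fixed = TRUE)
    # First word of each cell that is also a key, or NA when there is none
    found <- vapply(words, function(w) intersect(w, keys)[1], character(1))
    # Only fill rows that no earlier (higher-priority) column has filled
    result <- ifelse(is.na(result), found, result)
  }
  result
}

df <- data.frame(
  A = c("This contains NYU", "This has NYU", "This has XT", "This has FIT",
        "Something something UNH", "I got into UCLA", "Hello XT"),
  B = c("NYU", "UT", "USC", "FIT", "UNA", "UCLA", "CA"),
  D = c("x", "x", "x", "x", "UT", "x", "x"),  # made-up third column, not in the question
  stringsAsFactors = FALSE
)
l <- c("NYU", "FIT", "UCLA", "CA", "UT", "USC")

# Search A first, then B, then D
df$C <- first_key_in_columns(df, c("A", "B", "D"), l)
df
```

Changing the order of the `cols` argument changes which column wins when several columns contain a key.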