Filter rows based on presence of a string element from another column

Question

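I am trying to filter out rows in R based on whether a string, or a part/element of a string, from one column is present in another column. An example: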

colA                                      colb                           flag
New York Metropolitan Area                New York                       Yes 
New York Metropolitan Area                York                           Yes
New York Metropolitan Area                New York Area                  Yes
New York Metropolitan Area                Los Angeles                    No

What I have tried so far:

With 2 separate data frames:

df1<- df1 %>% fuzzy_inner_join(df2, by = c("colA" = "colB"), match_fun = str_detect)

This option fails because of parentheses and other special characters, and stripping them all out did not help either.

I joined the 2 data frames on an upper-level hierarchy to limit the rows, which created a single data frame df:

df[, "lookup"] <- gsub(" ", "|", df[,"colB"])

df[,"flag"] <- mapply(grepl, df[,"lookup"], df[,"colA"])

The results are not satisfactory, as only a limited number of rows are filtered.

Thanks in advance.

Answer 1

If I have understood your question correctly, you are trying to match partial strings and get a new column indicating the match:

df1 <- data.frame(colA = rep("New York Metropolitan Area ", 4),
                  colb = c("New York", "York", "New York Area", "Los Angeles") )

My first attempt was a simple str_detect, but this tries to match the entire colb string within colA:

df3 = df1%>%
  mutate(flag =  str_detect(colA, colb))

> df3
                         colA          colb  flag
1 New York Metropolitan Area       New York  TRUE
2 New York Metropolitan Area           York  TRUE
3 New York Metropolitan Area  New York Area FALSE
4 New York Metropolitan Area    Los Angeles FALSE

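This is not quite right: row 3 is FALSE because "New York Area" does not occur as one contiguous substring of colA. In this example you could work around it by first running df1$colb = gsub("Area", "", df1$colb ).

Alternatively: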

library(dplyr) # for pipe
library(stringr) # for str_detect
library(tidyr) # for separate

# separate colb into 3 columns (b1, b2 and b3), one word each (add more names if there are more words);
# fill = "right" pads shorter values with NA on the right instead of warning
df1 = df1 %>% separate(col = colb, into = c("b1","b2","b3"), fill = "right")

# detect contents of columns b1, b2 or b3 in colA and create new column with logical value
df2 = df1%>%
    mutate(flag = str_detect(colA, b1)| 
                  str_detect(colA, b2)|
                  str_detect(colA, b3))

This gives the output:

> df2
                         colA   b1      b2   b3 flag
1 New York Metropolitan Area   New    York <NA> TRUE
2 New York Metropolitan Area  York    <NA> <NA> TRUE
3 New York Metropolitan Area   New    York Area TRUE
4 New York Metropolitan Area   Los Angeles <NA>   NA
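Note that row 4 comes out as NA rather than FALSE, because FALSE | NA is NA in R, and that the number of bN columns has to be chosen in advance. As a rough sketch of one way around both points (not part of the original answer), the words can be kept in a list and tested one at a time in base R. The data frame name demo and the use of fixed = TRUE for literal matching are my own assumptions here:

```r
# Hypothetical copy of the example data (named demo so df1 is left untouched)
demo <- data.frame(
  colA = rep("New York Metropolitan Area", 4),
  colb = c("New York", "York", "New York Area", "Los Angeles"),
  stringsAsFactors = FALSE
)

# Split each colb value into its words, however many there are
words <- strsplit(demo$colb, " ", fixed = TRUE)

# For each row: TRUE if at least one word occurs literally in colA
demo$flag <- mapply(
  function(w, a) {
    any(vapply(w, function(p) grepl(p, a, fixed = TRUE), logical(1)))
  },
  words, demo$colA,
  USE.NAMES = FALSE
)

demo$flag
# TRUE TRUE TRUE FALSE
```

Because any() over only FALSE values is FALSE, the last row is flagged FALSE instead of NA, and demo[demo$flag, ] would keep just the matching rows.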

Answer 2

Here is a base R solution.
The anonymous function shorthand \(x, y) was introduced in R 4.1.0; in older versions of R, use function(x, y).

pattern <- gsub(" ", "|", df1$colb)
i <- mapply(\(x, y)grepl(x, y), pattern, df1$colA)
df1$flag <- c("No", "Yes")[i + 1L]

df1
#                        colA          colb flag
#1 New York Metropolitan Area      New York  Yes
#2 New York Metropolitan Area          York  Yes
#3 New York Metropolitan Area New York Area  Yes
#4 New York Metropolitan Area   Los Angeles   No
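One caveat, given that the question mentions parentheses and other special characters: grepl treats the pattern as a regular expression by default, so such characters are not matched literally. The following is a minimal sketch with made-up values (col_a and col_b are hypothetical, not from the question's data) showing how the default behaviour differs from literal matching with fixed = TRUE:

```r
# Hypothetical values containing regex metacharacters
col_a <- c("New York (NY) Metropolitan Area", "New York NY Metropolitan Area")
col_b <- c("York (NY)", "York (NY)")

# Default: the pattern is a regex, so "(NY)" is a group that matches "NY"
regex_hit <- mapply(function(p, x) grepl(p, x),
                    col_b, col_a, USE.NAMES = FALSE)

# fixed = TRUE: the pattern is matched character for character
fixed_hit <- mapply(function(p, x) grepl(p, x, fixed = TRUE),
                    col_b, col_a, USE.NAMES = FALSE)

regex_hit
# FALSE TRUE
fixed_hit
# TRUE FALSE
```

fixed = TRUE cannot be combined with the "|" alternation used above, since the bar would then be matched literally too; with special characters in the data, the words would have to be tested one at a time or escaped first.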

To remove the rows that do not match the pattern:

df1[i, ]
#                        colA          colb flag
#1 New York Metropolitan Area      New York  Yes
#2 New York Metropolitan Area          York  Yes
#3 New York Metropolitan Area New York Area  Yes
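Keep in mind that this is substring matching, so a word such as "York" would also be found inside "Yorkshire". If only whole words should count, a different technique is to compare the sets of space-separated words. This is a sketch under that assumption, with a hypothetical helper shares_word and made-up data, not the method used above:

```r
# Hypothetical data: the second row only matches as a substring
demo <- data.frame(
  colA = c("New York Metropolitan Area", "Yorkshire Dales"),
  colb = c("York", "York"),
  stringsAsFactors = FALSE
)

# Substring matching: "York" is found inside "Yorkshire"
substring_hit <- mapply(function(p, x) grepl(p, x, fixed = TRUE),
                        demo$colb, demo$colA, USE.NAMES = FALSE)

# Whole-word matching: TRUE if the two strings share at least one word
shares_word <- function(a, b) {
  words_a <- strsplit(a, " ", fixed = TRUE)[[1]]
  words_b <- strsplit(b, " ", fixed = TRUE)[[1]]
  length(intersect(words_a, words_b)) > 0
}
word_hit <- mapply(shares_word, demo$colA, demo$colb, USE.NAMES = FALSE)

substring_hit
# TRUE TRUE
word_hit
# TRUE FALSE
```

Which behaviour is wanted depends on the data; the comparison here is case-sensitive and splits on single spaces only.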

Data

df1 <-
structure(list(colA = c("New York Metropolitan Area", 
"New York Metropolitan Area", "New York Metropolitan Area", 
"New York Metropolitan Area"), colb = c("New York", "York", 
"New York Area", "Los Angeles"), flag = c("Yes", "Yes", "Yes", 
"No")), row.names = c(NA, -4L), class = "data.frame")
