Filter rows based on presence of a string element from another column
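I am trying to filter rows in R based on whether a string, or any part/element of that string, is present in another column. An example is shown below: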
colA colb flag
New York Metropolitan Area New York Yes
New York Metropolitan Area York Yes
New York Metropolitan Area New York Area Yes
New York Metropolitan Area Los Angeles No
- There are 2 different data frames
df1<- df1 %>% fuzzy_inner_join(df2, by = c("colA" = "colB"), match_fun = str_detect)
- I joined the 2 data frames on a higher-level hierarchy to limit the rows, which produced the data frame df
df[, "lookup"] <- gsub(" ", "|", df[,"colB"])
df[,"flag"] <- mapply(grepl, df[,"lookup"], df[,"colA"])
df1 <- data.frame(colA = rep("New York Metropolitan Area ", 4),
colb = c("New York", "York", "New York Area", "Los Angeles") )
My first attempt was a simple str_detect, but this tries to match the entire string in colb against colA:
df3 = df1%>%
mutate(flag = str_detect(colA, colb))
> df3
colA colb flag
1 New York Metropolitan Area New York TRUE
2 New York Metropolitan Area York TRUE
3 New York Metropolitan Area New York Area FALSE
4 New York Metropolitan Area Los Angeles FALSE
That is not quite right, although in this example you could first run df1$colb = gsub("Area", "", df1$colb )
library(dplyr) # for pipe
library(stringr) # for str_detect
library(tidyr) # for separate
# separate colb into 3 columns (b1, b2 and b3), one word each (add more names if colb can hold more words)
# fill = "right" pads short rows with NA on the right without a warning
df1 = df1 %>% separate(col = colb, into = c("b1", "b2", "b3"), fill = "right")
# detect the contents of b1, b2 or b3 in colA and store the result in a new logical column
df2 = df1 %>%
  mutate(flag = str_detect(colA, b1) |
           str_detect(colA, b2) |
           str_detect(colA, b3))
> df2
colA b1 b2 b3 flag
1 New York Metropolitan Area New York <NA> TRUE
2 New York Metropolitan Area York <NA> <NA> TRUE
3 New York Metropolitan Area New York Area TRUE
4 New York Metropolitan Area Los Angeles <NA> NA
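The NA in row 4 comes from R's three-valued logic: b3 is NA for that row, str_detect returns NA for an NA pattern, and FALSE | NA evaluates to NA (whereas TRUE | NA is TRUE, which is why rows 1 and 2 are unaffected). The sketch below reproduces this with plain logical vectors and shows one possible way to turn the NA into FALSE. The names b1_hit, b2_hit, b3_hit and flag_clean are made up for illustration, and the vectors are typed in by hand to mirror what str_detect would return for the four example rows.

```r
# Per-column match results for the four example rows (entered by hand)
b1_hit <- c(TRUE, TRUE, TRUE, FALSE)
b2_hit <- c(TRUE, NA,   TRUE, FALSE)
b3_hit <- c(NA,   NA,   TRUE, NA)

# Same combination as in the mutate() call: NA survives only when no column is TRUE
flag <- b1_hit | b2_hit | b3_hit
print(flag)

# Treat "no word to compare" as "no match": NA becomes FALSE, everything else is kept
flag_clean <- !is.na(flag) & flag
print(flag_clean)
```

Inside the pipeline the same idea could be applied to the flag column after it is created; whether a missing word should count as FALSE is an assumption that depends on the data.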
Here is a base R solution. The anonymous function shorthand \(x, y) was introduced in R 4.1.0; in older versions of R, use function(x, y) instead.
pattern <- gsub(" ", "|", df1$colb)
i <- mapply(\(x, y)grepl(x, y), pattern, df1$colA)
df1$flag <- c("No", "Yes")[i + 1L]
# colA colb flag
#1 New York Metropolitan Area New York Yes
#2 New York Metropolitan Area York Yes
#3 New York Metropolitan Area New York Area Yes
#4 New York Metropolitan Area Los Angeles No
df1[i, ]
# colA colb flag
#1 New York Metropolitan Area New York Yes
#2 New York Metropolitan Area York Yes
#3 New York Metropolitan Area New York Area Yes
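One caveat with the alternation pattern: "New|York" is a regular expression, so it also matches inside longer words (for example "York" inside "Yorkshire"), and any regex metacharacters in colb would be interpreted rather than matched literally. If only whole-word matches should count, a possible alternative is to split both columns into words and compare them as sets. This is a minimal sketch, assuming words are separated by spaces; shares_word is a hypothetical helper name, not a function from any package.

```r
df1 <- data.frame(
  colA = rep("New York Metropolitan Area", 4),
  colb = c("New York", "York", "New York Area", "Los Angeles")
)

# Hypothetical helper: TRUE if the two strings have at least one whole word in common
shares_word <- function(a, b) {
  words_a <- strsplit(a, " +")[[1]]
  words_b <- strsplit(b, " +")[[1]]
  any(words_b %in% words_a)
}

# Apply the helper row by row; USE.NAMES = FALSE keeps the result an unnamed logical vector
i <- mapply(shares_word, df1$colA, df1$colb, USE.NAMES = FALSE)
df1$flag <- ifelse(i, "Yes", "No")

# Keep only the rows where at least one word of colb appears in colA
df1[i, ]
```

For the example data this flags the same rows as the grepl version; the two approaches differ only when a word in colb is a fragment of a longer word in colA or contains regex special characters.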
df1 <-
structure(list(colA = c("New York Metropolitan Area",
"New York Metropolitan Area", "New York Metropolitan Area",
"New York Metropolitan Area"), colb = c("New York", "York",
"New York Area", "Los Angeles"), flag = c("Yes", "Yes", "Yes",
"No")), row.names = c(NA, -4L), class = "data.frame")