Filter rows based on presence of a string element from another column
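I am trying to filter out the relevant rows in R based on whether a string, or a part/element of that string, is present in another column. Example below: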
colA                        colb           flag
New York Metropolitan Area  New York       Yes
New York Metropolitan Area  York           Yes
New York Metropolitan Area  New York Area  Yes
New York Metropolitan Area  Los Angeles    No
Things I have tried so far:
- With the values in 2 different data frames:
df1<- df1 %>% fuzzy_inner_join(df2, by = c("colA" = "colB"), match_fun = str_detect)
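This option failed because of parentheses and other special characters, and cleaning them all out did not help either.
- I joined the 2 data frames on an upper level of the hierarchy to limit the rows, which created a data frame df: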
df[, "lookup"] <- gsub(" ", "|", df[,"colB"])
df[,"flag"] <- mapply(grepl, df[,"lookup"], df[,"colA"])
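The results were unsatisfactory, as only a limited number of rows were filtered.
Thanks in advance.
If I have understood your question correctly, you are trying to match partial strings and get a new column indicating the match: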
df1 <- data.frame(colA = rep("New York Metropolitan Area ", 4),
colb = c("New York", "York", "New York Area", "Los Angeles") )
My first attempt was a simple str_detect, but this tries to match the whole string in colb within colA:
library(dplyr)   # for %>% and mutate
library(stringr) # for str_detect
df3 <- df1 %>%
  mutate(flag = str_detect(colA, colb))
> df3
colA colb flag
1 New York Metropolitan Area New York TRUE
2 New York Metropolitan Area York TRUE
3 New York Metropolitan Area New York Area FALSE
4 New York Metropolitan Area Los Angeles FALSE
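This is not quite right: row 3 is FALSE because "New York Area" does not occur as one contiguous substring of colA. For this particular example, though, you could first run df1$colb = gsub("Area", "", df1$colb ).
Alternatively: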
library(dplyr)   # for pipe, mutate and coalesce
library(stringr) # for str_detect
library(tidyr)   # for separate
# separate colb into 3 columns (called b1, b2 and b3), one word each (add more columns if there are more words);
# fill = "right" pads short rows with NA without a warning
df1 <- df1 %>% separate(col = colb, into = c("b1", "b2", "b3"), fill = "right")
# detect the contents of b1, b2 or b3 in colA and create a new logical column;
# FALSE | NA is NA, so coalesce() turns "no word matched" into FALSE
df2 <- df1 %>%
  mutate(flag = coalesce(str_detect(colA, b1) |
                           str_detect(colA, b2) |
                           str_detect(colA, b3), FALSE))
This gives the output:
> df2
                         colA  b1      b2   b3  flag
1 New York Metropolitan Area  New    York <NA>  TRUE
2 New York Metropolitan Area York    <NA> <NA>  TRUE
3 New York Metropolitan Area  New    York Area  TRUE
4 New York Metropolitan Area  Los Angeles <NA> FALSE
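The approach above splits colb into a fixed number of word columns. As a rough sketch of one way to avoid that limit, the words can instead be joined into a single alternation pattern per row. The data frame name `cities` and the helper column `pattern` are made up for illustration, and colb is still interpreted as a regular expression here, so this assumes it contains no special characters.

```r
library(dplyr)   # for %>%, mutate and filter
library(stringr) # for str_detect, str_replace_all and str_squish

# Illustrative data mirroring the example in the question.
cities <- data.frame(
  colA = rep("New York Metropolitan Area", 4),
  colb = c("New York", "York", "New York Area", "Los Angeles"),
  stringsAsFactors = FALSE
)

flagged <- cities %>%
  mutate(
    # "New York Area" becomes the regex "New|York|Area". str_squish() trims
    # stray whitespace so that no empty alternative (which would match
    # every row) ends up in the pattern.
    pattern = str_replace_all(str_squish(colb), " ", "|"),
    flag = str_detect(colA, pattern)
  )

# Keep only the rows where at least one word of colb occurs in colA.
kept <- flagged %>% filter(flag)
```

Because str_detect is vectorised over both the string and the pattern, each row of colA is tested against its own pattern, and any number of words per row is handled.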
Here is a base R solution.
The anonymous lambda function syntax \(x, y) was introduced in R 4.1.0; with older versions of R, use function(x, y).
pattern <- gsub(" ", "|", df1$colb)
i <- mapply(\(x, y)grepl(x, y), pattern, df1$colA)
df1$flag <- c("No", "Yes")[i + 1L]
df1
# colA colb flag
#1 New York Metropolitan Area New York Yes
#2 New York Metropolitan Area York Yes
#3 New York Metropolitan Area New York Area Yes
#4 New York Metropolitan Area Los Angeles No
To remove the rows that do not match the pattern:
df1[i, ]
# colA colb flag
#1 New York Metropolitan Area New York Yes
#2 New York Metropolitan Area York Yes
#3 New York Metropolitan Area New York Area Yes
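The question mentions that parentheses and other special characters caused trouble. A pattern built with gsub(" ", "|", ...) is treated as a regular expression, so "[CA]" would be read as a character class and an unbalanced "(" would raise an error. The following is only a sketch of one possible workaround: split colb into words and match each word literally with fixed = TRUE. The helper name `any_word_in` and the `demo` data are hypothetical, made up for this illustration.

```r
# Hypothetical helper: TRUE when any space-separated word of `needle`
# occurs literally (no regex interpretation) inside `haystack`.
any_word_in <- function(needle, haystack) {
  words <- strsplit(needle, " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]  # drop empty pieces left by repeated spaces
  any(vapply(words, grepl, logical(1), x = haystack, fixed = TRUE))
}

# Made-up data containing regex metacharacters.
demo <- data.frame(
  colA = c("New York (Metropolitan) Area",
           "New York (Metropolitan) Area",
           "Los Angeles [CA]"),
  colb = c("(Metropolitan)", "Los Angeles", "[CA] Area"),
  stringsAsFactors = FALSE
)

# Row-wise test: colb supplies the words, colA is the text searched.
keep <- mapply(any_word_in, demo$colb, demo$colA, USE.NAMES = FALSE)
demo$flag <- c("No", "Yes")[keep + 1L]

# Rows where at least one word of colb was found in colA.
demo[keep, ]
```

Here rows 1 and 3 are kept, because "(Metropolitan)" and "[CA]" are found as literal text, while row 2 is dropped.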
Data
df1 <-
structure(list(colA = c("New York Metropolitan Area",
"New York Metropolitan Area", "New York Metropolitan Area",
"New York Metropolitan Area"), colb = c("New York", "York",
"New York Area", "Los Angeles"), flag = c("Yes", "Yes", "Yes",
"No")), row.names = c(NA, -4L), class = "data.frame")