stringr package using str_detect - Search for one word and exclude another
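I have a sample project where I need to search strings with the stringr package. To eliminate differences in capitalization, I started with str_to_lower(example$remarks), which makes all the remarks lowercase. The remarks column describes residential properties.
I need to search for the word "shop". However, the word "shopping" also appears in the remarks column, and I don't want that word.
Some observations have: a) only the word "shop"; b) only the word "shopping"; c) neither "shop" nor "shopping"; d) both "shop" and "shopping".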
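When using str_detect(), I want it to return TRUE for the word "shop", but not for the string "shop" inside the word "shopping". Currently, if I run str_detect(example$remarks, "shop"), I get TRUE for both "shop" and "shopping". What I actually want is TRUE only for the 4-character string "shop"; if the characters "shop" appear but are followed by other characters, as in shop(ping), I want the code to exclude that and not report it as TRUE.
Also, if a remark contains both "shop" and "shopping", I want the result to be TRUE because of "shop" alone, not because of "shopping".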
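Ultimately, I'm hoping a single line of code using str_detect() can give me the following results:
- If the remarks observation has only the word "shop" = TRUE
- If the remarks observation has only the word "shopping" = FALSE
- If the remarks observation has neither "shop" nor "shopping" = FALSE
- If the remarks observation has both "shop" and "shopping" = TRUE, detecting only the 4-character string "shop" and not returning TRUE because of the word "shopping".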
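I need to keep all observations in the dataset and cannot exclude any, because I need to create a new column, which I've labeled shop_YN, that gives "Yes" only for observations containing the 4-character string "shop". Once I have the correct str_detect() code, I plan to wrap the result in mutate() and if_else() as follows (except I don't know what pattern to use inside str_detect() to get the results I need):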
shop_YN <- example %>% mutate(shop_YN = if_else(str_detect(example$remarks, ), "Yes", "No"))
Here is a sample of the data using dput():
structure(list(price = c(195000, 213000, 215000, 240000, 241000,
250000, 255000, 256500, 260000, 263500, 265000, 277000, 280000,
280000, 150000), remarks = c("large home with a 1200 sf shop. great location close to shopping.",
"updated home close to shopping & schools.", "nice location. 2br home with updating.",
"huge shop on property!", "close to shopping.", "updated, clean, great location, garage.",
"close to shopping and massive shop on property.", "updated home near shopping, schools, restaurants.",
"large home with updated interior.", "close to schools, updated, stick-built shop 1500sf.",
"home and shop.", "near schools, shopping, restaurants. partially updated home.",
"located close to shopping. high quality home with shop in backyard.",
"brick 2-story. lots of shopping near by. detached garage and large shop in backyard.",
"fixer! needs work.")), row.names = c(NA, -15L), class = c("tbl_df",
"tbl", "data.frame"))
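You are probably looking for word boundaries (\b). Wrap the desired pattern between two word boundaries to match only the whole word, not parts of longer words. In an R string literal the backslash itself must be escaped, so the pattern is written '\\bshop\\b'.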
library(dplyr)
library(stringr)
# df is the tibble from the dput() output above (called example in the question)
df %>% mutate(shop_YN = str_detect(remarks, '\\bshop\\b'))
# A tibble: 15 × 3
price remarks shop_YN
<dbl> <chr> <lgl>
1 195000 large home with a 1200 sf shop. great location close to shopping. TRUE
2 213000 updated home close to shopping & schools. FALSE
3 215000 nice location. 2br home with updating. FALSE
4 240000 huge shop on property! TRUE
5 241000 close to shopping. FALSE
6 250000 updated, clean, great location, garage. FALSE
7 255000 close to shopping and massive shop on property. TRUE
8 256500 updated home near shopping, schools, restaurants. FALSE
9 260000 large home with updated interior. FALSE
10 263500 close to schools, updated, stick-built shop 1500sf. TRUE
11 265000 home and shop. TRUE
12 277000 near schools, shopping, restaurants. partially updated home. FALSE
13 280000 located close to shopping. high quality home with shop in backyard. TRUE
14 280000 brick 2-story. lots of shopping near by. detached garage and large shop in back… TRUE
15 150000 fixer! needs work. FALSE
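To see the word-boundary behaviour in isolation, here is a minimal sketch on a small hand-made vector. The vector `x` and the name `res` are made up for illustration and are not part of the question's data; "workshop" is included only to show that a boundary is also required before the "s".

```r
library(stringr)

# Hypothetical test strings, not taken from the question's data
x <- c("shop", "shopping", "workshop", "shop.", "big shop here")

# In an R string literal the backslash must be doubled:
# "\\b" in R source is the regex token \b (word boundary)
res <- str_detect(x, "\\bshop\\b")
print(res)
# [1]  TRUE FALSE FALSE  TRUE  TRUE
```

Punctuation such as the period in "shop." is not a word character, so a boundary still exists after the "p" and the match succeeds.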
If you want Yes or No instead of a logical shop_YN, just pipe the output of str_detect into ifelse:
df %>% mutate(shop_YN = str_detect(remarks, '\\bshop\\b') %>% ifelse('Yes', 'No'))
We can also use grepl instead of str_detect:
df %>%
  mutate(check = grepl("\\bshop\\b", remarks))
price remarks check
<dbl> <chr> <lgl>
1 195000 large home with a 1200 sf shop. great location close to shopping. TRUE
2 213000 updated home close to shopping & schools. FALSE
3 215000 nice location. 2br home with updating. FALSE
4 240000 huge shop on property! TRUE
5 241000 close to shopping. FALSE
6 250000 updated, clean, great location, garage. FALSE
7 255000 close to shopping and massive shop on property. TRUE
8 256500 updated home near shopping, schools, restaurants. FALSE
9 260000 large home with updated interior. FALSE
10 263500 close to schools, updated, stick-built shop 1500sf. TRUE
11 265000 home and shop. TRUE
12 277000 near schools, shopping, restaurants. partially updated home. FALSE
13 280000 located close to shopping. high quality home with shop in backyard. TRUE
14 280000 brick 2-story. lots of shopping near by. detached garage and large shop in backyard. TRUE
15 150000 fixer! needs work. FALSE
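For the exact shape the question asked for (mutate() wrapping if_else() around str_detect()), a sketch along these lines should work. The three-row tibble is a cut-down copy of the question's sample data, built here only so the snippet runs on its own. Note that inside mutate() the column is referred to as remarks rather than example$remarks.

```r
library(dplyr)
library(stringr)

# Three rows copied from the question's sample data, for illustration only
example <- tibble(
  price = c(195000, 241000, 150000),
  remarks = c(
    "large home with a 1200 sf shop. great location close to shopping.",
    "close to shopping.",
    "fixer! needs work."
  )
)

# Flag rows whose remarks contain "shop" as a whole word
shop_YN <- example %>%
  mutate(shop_YN = if_else(str_detect(remarks, "\\bshop\\b"), "Yes", "No"))

print(shop_YN$shop_YN)
# [1] "Yes" "No"  "No"
```

If the remarks had not already been lowercased, wrapping the pattern as regex("\\bshop\\b", ignore_case = TRUE) would be one way to make the match case-insensitive.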