Using stringr to extract one or multiple words from text string in R

Question

I have the following data frame:

df <- data.frame(city=c("in London", "in Manchester city", "in Sao Paolo"))

I am using str_extract to return the word after 'in' in a separate column.

library(stringr)
str_extract(df$city, '(?<=in\\s)\\w+')

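This works fine for me in 95% of cases. However, in cases such as "Sao Paolo" above, my regex returns "Sao" rather than the full city name.

Can someone help me modify it to capture either:

1) everything up to the end of the text string I am extracting from? or

2) if there is one more word after 'in', return that as well

Many thanks.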

Answer 1

Would this one-liner work for you?

unlist(lapply(strsplit(c("in London", "in Sao Paulo", "in Manchester City"), "in "), function(x) x[2]))
[1] "London"          "Sao Paulo"       "Manchester City"

Answer 2

You can try this:

library(stringr)
df$onlyCity <- str_extract(df$city, '[^in ](.)*')
df
                city        onlyCity
1          in London          London
2 in Manchester city Manchester city
3       in Sao Paolo       Sao Paolo
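
A caveat worth checking before relying on this pattern: `[^in ]` is a negated character class (any single character that is not `i`, `n`, or a space), not a "skip the prefix in" instruction. It gives the right result for the sample data because every city starts with a capital letter. The sketch below illustrates this; the inputs "in nice" and "based in London" are hypothetical and not taken from the question.

```r
library(stringr)

pattern <- "[^in ](.)*"  # pattern from the answer above

# Sample-style input: the first character outside {i, n, space} is "L"
capitalised <- str_extract("in London", pattern)
capitalised
# [1] "London"

# Hypothetical input: a lowercase name starting with "n" and "i" loses those letters
lowercase <- str_extract("in nice", pattern)
lowercase
# [1] "ce"

# Hypothetical input: text before "in" is not skipped at all
with_prefix <- str_extract("based in London", pattern)
with_prefix
# [1] "based in London"
```

If the data may contain such strings, a pattern that refers to the literal prefix "in" is likely the safer choice.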

Answer 3

gsub("^in[ ]*(.*$)", "\\1", df$city)
[1] "London"          "Manchester city" "Sao Paolo"

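This assumes that your strings start with "in", followed by some spaces (it does not matter if there is more than one), and then the text of interest, which is captured from the first non-space character to the end of the string.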
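
As a small illustration of that assumption, the sketch below runs the same pattern over a few inputs that are hypothetical (not from the question): one with several spaces after "in", and one that does not start with "in" at all. It uses `sub` rather than `gsub`, which is equivalent here because the pattern is anchored at the start of the string and can match at most once.

```r
# Hypothetical inputs to probe the pattern's assumptions
x <- c("in London", "in    Sao Paolo", "near Leeds")

# "^in" anchors at the start, "[ ]*" consumes any run of spaces,
# "(.*$)" captures the rest, and "\\1" keeps only the captured group
cities <- sub("^in[ ]*(.*$)", "\\1", x)
cities
# [1] "London"     "Sao Paolo"  "near Leeds"
```

Strings that do not start with "in" are returned unchanged, so it may be worth checking for them separately.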

Answer 4

To match all the rest of the string after the first in followed by a space, you can use

(?<=in\s).+

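The lookbehind checks that the match is preceded by the preposition in and a whitespace character, but does not return them in the match, because lookbehinds are zero-width assertions.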
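
Applied to the data frame from the question, this might look like the sketch below. The column name `onlyCity` is borrowed from Answer 2, and the second pattern is only one possible reading of option 2 in the question (one word, plus an optional second word). Note that inside an R string literal each regex backslash has to be doubled.

```r
library(stringr)

# Sample data from the question
df <- data.frame(city = c("in London", "in Manchester city", "in Sao Paolo"))

# Option 1: everything after "in " up to the end of the string
df$onlyCity <- str_extract(df$city, "(?<=in\\s).+")
df$onlyCity
# [1] "London"          "Manchester city" "Sao Paolo"

# Option 2 (assumed reading): one word, plus a second word if present
one_or_two <- str_extract(df$city, "(?<=in\\s)\\w+(\\s\\w+)?")
one_or_two
# [1] "London"          "Manchester city" "Sao Paolo"
```

The two patterns agree on this sample; they would differ only for names of three or more words, where the second one stops after two.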
