Extracting first word after a specific expression in R

Question

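I have a column with thousands of descriptions like these (example):

Description
Building a hospital in the city of LA, USA
Building a school in the city of NYC, USA
Building shops in the city of Chicago, USA

I'd like to create a column with the first word after "city of", like this:

Description	City
Building a hospital in the city of LA, USA	LA
Building a school in the city of NYC, USA	NYC
Building shops in the city of Chicago, USA	Chicago

After seeing this thread, I tried the code below, but my column is filled only with missing values: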

library(stringr)

df$city <- data.frame(str_extract(df$Description, "(?<=city of:\\s)[^;]+"))

df$city <- data.frame(str_extract(df$Description, "(?<=of:\\s)[^;]+"))

I checked dput(), and the output is the same as the descriptions I see directly in the data frame.

Answer 1

Solution

This should work for the data you have shown:

df$city <- str_extract(df$Description, "(?<=city of )(\\w+)")

df
#>                                  Description    city
#> 1 Building a hospital in the city of LA, USA      LA
#> 2  Building a school in the city of NYC, USA     NYC
#> 3 Building shops in the city of Chicago, USA Chicago
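
To make the difference from the question's attempt concrete, here is a minimal sketch, assuming the sample data looks like the descriptions shown above (the names attempt and fixed are only illustrative). The original pattern asks for a literal colon after "city of", which never appears in the data, so every row comes back as NA. Wrapping the result in data.frame() is also unnecessary, since str_extract() already returns a character vector.

```r
library(stringr)

# Sample data reconstructed from the descriptions shown in the question
df <- data.frame(Description = c("Building a hospital in the city of LA, USA",
                                 "Building a school in the city of NYC, USA",
                                 "Building shops in the city of Chicago, USA"))

# The question's pattern requires "city of:" followed by whitespace;
# the data contains no colon, so nothing matches and every element is NA
attempt <- str_extract(df$Description, "(?<=city of:\\s)[^;]+")
attempt
#> [1] NA NA NA

# Without the colon the lookbehind matches, and \\w+ takes the first word
fixed <- str_extract(df$Description, "(?<=city of )\\w+")
fixed
#> [1] "LA"      "NYC"     "Chicago"
```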

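Alternative

However, if you want the whole string up to the comma (for example, when the city name contains spaces), you could use:

df$city <- str_extract(df$Description, "(?<=city of )(.+)(?=,)")

See the following example: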

df <- data.frame(Description = c("Building a hospital in the city of LA, USA",
                                 "Building a school in the city of NYC, USA",
                                 "Building shops in the city of Chicago, USA",
                                 "Building a church in the city of Salt Lake City, USA"))

str_extract(df$Description, "(?<=the city of )(\\w+)")
#> [1] "LA"      "NYC"     "Chicago" "Salt"   

str_extract(df$Description, "(?<=the city of )(.+)(?=,)")
#> [1] "LA"             "NYC"            "Chicago"        "Salt Lake City"
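
One caveat worth sketching: .+ is greedy, so when a description contains more than one comma after the city, the match runs up to the last comma rather than the first. The input below is a hypothetical description (not from the question) used only to illustrate this, and the [^,]+ variant is one possible way to stop at the first comma, not necessarily the one the answer intended.

```r
library(stringr)

# Hypothetical description with two commas after the city name
x <- "Building a church in the city of Salt Lake City, Utah, USA"

# Greedy .+ grabs as much as possible, so it stops before the LAST comma
greedy <- str_extract(x, "(?<=the city of )(.+)(?=,)")
greedy
#> [1] "Salt Lake City, Utah"

# [^,]+ cannot cross a comma, so it stops before the FIRST comma
first_comma <- str_extract(x, "(?<=the city of )[^,]+")
first_comma
#> [1] "Salt Lake City"
```

Note that the [^,]+ form also returns a result when there is no comma at all (it takes the rest of the string), whereas the lookahead version returns NA in that case.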

Documentation

See ?regex:

Patterns (?=...) and (?!...) are zero-width positive and negative lookahead assertions: they match if an attempt to match the ... forward from the current position would succeed (or not), but use up no characters in the string being processed. Patterns (?<=...) and (?<!...) are the lookbehind equivalents: they do not allow repetition quantifiers nor \C in ....
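
As a small illustration of "zero-width", the sketch below compares the lookbehind with an ordinary capture group. The text before the city has to be present in both cases, but only the capture-group version includes it in the overall match. The variable names are illustrative. Also note, as far as I know, that stringr uses the ICU regex engine rather than the engines described in ?regex, although the lookaround syntax used here is the same.

```r
library(stringr)

x <- c("Building a hospital in the city of LA, USA",
       "Building shops in the city of Chicago, USA")

# Lookbehind: "city of " must precede the match but is not part of it
via_lookbehind <- str_extract(x, "(?<=city of )\\w+")
via_lookbehind
#> [1] "LA"      "Chicago"

# Capture group: str_match() returns a character matrix whose first column is
# the full match ("city of LA") and whose second column is the first group
via_group <- str_match(x, "city of (\\w+)")[, 2]
via_group
#> [1] "LA"      "Chicago"
```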


