Extracting first word after a specific expression in R
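I have a column containing thousands of descriptions like these (example):
Description |
---|
Building a hospital in the city of LA, USA |
Building a school in the city of NYC, USA |
Building shops in the city of Chicago, USA |
I want to create a column with the first word after "city of", like this:
Description | City |
---|---|
Building a hospital in the city of LA, USA | LA |
Building a school in the city of NYC, USA | NYC |
Building shops in the city of Chicago, USA | Chicago |
After seeing this topic, I tried the code below, but my column was filled with only missing values: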
library(stringr)
df$city <- data.frame(str_extract(df$Description, "(?<=city of:\\s)[^;]+"))
df$city <- data.frame(str_extract(df$Description, "(?<=of:\\s)[^;]+"))
I checked dput(), and its output matches the descriptions I see directly in the data frame.
Solution
This should work for the data you have shown:
df$city <- str_extract(df$Description, "(?<=city of )(\\w+)")
df
#>                                  Description    city
#> 1 Building a hospital in the city of LA, USA      LA
#> 2  Building a school in the city of NYC, USA     NYC
#> 3 Building shops in the city of Chicago, USA Chicago
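As a rough illustration of why the original attempt returned only missing values, the sketch below compares the two patterns. The data frame is rebuilt here from the output shown above, and the variable names `attempt` and `fixed` are made up for this example. The likely culprit is the colon inside the lookbehind: the text "city of:" never appears in the descriptions, so nothing can match.

```r
library(stringr)

# Sample data rebuilt from the output shown above (an assumption about df).
df <- data.frame(Description = c(
  "Building a hospital in the city of LA, USA",
  "Building a school in the city of NYC, USA",
  "Building shops in the city of Chicago, USA"
))

# The lookbehind demands a literal colon after "of". No description
# contains "city of:", so every element comes back as NA.
attempt <- str_extract(df$Description, "(?<=city of:\\s)[^;]+")

# Without the colon the lookbehind can match, and \\w+ takes one word.
fixed <- str_extract(df$Description, "(?<=city of\\s)\\w+")

attempt
fixed
```

Note also that str_extract() already returns a character vector, so wrapping it in data.frame() before assigning to df$city should not be needed.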
Alternative
However, if you want the entire string up to the first comma (for example, when the city name contains spaces), you can use:
df$city <- str_extract(df$Description, "(?<=city of )(.+?)(?=,)")
See the following example:
df <- data.frame(Description = c("Building a hospital in the city of LA, USA",
"Building a school in the city of NYC, USA",
"Building shops in the city of Chicago, USA",
"Building a church in the city of Salt Lake City, USA"))
str_extract(df$Description, "(?<=the city of )(\\w+)")
#> [1] "LA" "NYC" "Chicago" "Salt"
str_extract(df$Description, "(?<=the city of )(.+?)(?=,)")
#> [1] "LA" "NYC" "Chicago" "Salt Lake City"
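If stringr is not available, roughly the same two extractions can be sketched in base R with regexpr() and regmatches(). The helper name `extract_first` and the third sample description (one with no "city of" in it) are made up for this illustration. The second pattern uses `[^,]+` instead of a lookahead, which is another way to stop at the first comma.

```r
desc <- c(
  "Building a hospital in the city of LA, USA",
  "Building a church in the city of Salt Lake City, USA",
  "Renovating a bridge"  # hypothetical row without "city of"
)

# Hypothetical helper: first match of a Perl-style pattern, NA if none.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)   # -1 marks "no match"
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)        # regmatches() returns matches only
  out
}

# First word after "city of ".
first_word <- extract_first(desc, "(?<=city of )\\w+")

# Everything after "city of " up to, but not including, the first comma.
up_to_comma <- extract_first(desc, "(?<=city of )[^,]+")

first_word
up_to_comma
```

Rows that do not contain "city of" come back as NA, which matches how str_extract() behaves.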
Documentation
See ?regex:
Patterns (?=...) and (?!...) are zero-width positive and negative
lookahead assertions: they match if an attempt to match the ...
forward from the current position would succeed (or not), but use up
no characters in the string being processed. Patterns (?<=...) and
(?<!...) are the lookbehind equivalents: they do not allow repetition
quantifiers nor \C in ....
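To make "use up no characters" concrete, here is a small base R sketch using perl = TRUE, which is the engine the quoted help page describes. The sample string and variable names are chosen only for this illustration. As far as I know, stringr relies on the ICU engine rather than PCRE, so the exact limits on what a lookbehind may contain can differ there.

```r
x <- "the city of Chicago, USA"

# Lookbehind: "city of " must precede the match but is not returned.
with_lookbehind <- regmatches(x, regexpr("(?<=city of )\\w+", x, perl = TRUE))

# No lookbehind: the prefix is consumed and becomes part of the match.
without_lookbehind <- regmatches(x, regexpr("city of \\w+", x, perl = TRUE))

# Lookahead: the comma must follow the word but is not returned.
with_lookahead <- regmatches(x, regexpr("\\w+(?=,)", x, perl = TRUE))

with_lookbehind     # "Chicago"
without_lookbehind  # "city of Chicago"
with_lookahead      # "Chicago"
```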