Extract non-specific name from string address, ignoring specific patterns

Question

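I have long addresses, and some of them contain only a generic building name, at varying positions, which is what I am trying to extract. I have worked out how to extract the more standardized parts of the address, but I am stuck on extracting the generic names.

Sample data: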

addresses <- c("big fake plaza, 12 this street, district, city",
               "Green mansion, district, city",
               "Block 7c of orange building  district, city",
               "98 main street block a blue plaza, city",
               "tower 10, caribbean coast, district",
               "block 3a, the latitude, city",
               "blue red mansion, 46 pearl street, city",
               "dorsett hotel, city",
               "block 9, Willowland, disctrict, city",
               "tower 2, the coronation, 1 fake street, district")
# assumed: the data frame used below holds this vector in an `address` column
df <- data.frame(address = addresses, stringsAsFactors = FALSE)

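The goal is to extract the non-specific building names, and only those. The plan in the code was to extract the words that do not have a generic building name in front of them, and to ignore any block or tower names.

What I have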

df$add.gen <- str_extract(df$address, "^[^block|^tower](([a-z]+\\s+[a-z]*\\s*[a-z]*\\s*[a-z]*\\s*[a-z]*))(?!building)(?!mansion)(?!garden)(?!house)")

But it is clearly not working.

What I am aiming for

df$add.gen <-

c(NA,
  NA,
  NA,
  NA,
  "caribbean coast",
  "the latitude",
  NA,
  "dorsett hotel",
  "Willowland",
  "the coronation")

Thanks in advance!

Answer 1

You can use

library(stringr)
df$add.gen <- trimws(str_extract(df$address, "(?i)(?<=,|^)(?:(?!\\b(?:city|disc?trict|street|plaza|square|tower|block|mansion|garden|house|building)\\b)[^,])*(?=,|$)"))

See the regex demo.

Details:

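(?i) - case-insensitive matching
(?<=,|^) - immediately to the left, there must be a comma or the start of the string
(?:(?!\b(?:city|disc?trict|street|plaza|square|tower|block|mansion|garden|house|building)\b)[^,])* - any character other than a comma, zero or more times (as many as possible), that is not the first character of any of these whole words: city, disctrict, district, street, plaza, square, tower, block, mansion, garden, house, building
(?=,|$) - immediately to the right, there must be a comma or the end of the string.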
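
In effect, the pattern picks the first comma-separated segment that contains none of the listed words. The sketch below spells out that logic in base R without lookarounds, which may make it easier to follow or to extend. It is not the answer's code: the helper name first_clean_segment and the variable names are hypothetical, and the word list is simply copied from the pattern.

```r
# Generic words copied from the answer's pattern ("disc?trict" also covers the
# "disctrict" misspelling in the sample data).
generic_words <- c("city", "disc?trict", "street", "plaza", "square", "tower",
                   "block", "mansion", "garden", "house", "building")
generic_rx <- paste0("\\b(?:", paste(generic_words, collapse = "|"), ")\\b")

# Hypothetical helper: split one address on commas and return the first
# segment that contains no generic word, or NA if every segment has one.
first_clean_segment <- function(address) {
  segments <- trimws(strsplit(address, ",", fixed = TRUE)[[1]])
  keep <- !grepl(generic_rx, segments, ignore.case = TRUE, perl = TRUE)
  if (any(keep)) segments[keep][1] else NA_character_
}

addresses <- c("tower 10, caribbean coast, district",
               "Green mansion, district, city",
               "dorsett hotel, city")
result <- vapply(addresses, first_clean_segment, character(1), USE.NAMES = FALSE)
print(result)
```

Keeping the words in a vector means a new generic term (for example "court") can be added in one place.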

trimws is needed to remove the leading/trailing whitespace.
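
To see why the trimming matters, the sketch below runs the same pattern over the question's sample data and compares the raw match with the trimmed one. As an assumption made only to keep the example dependency-free, it uses base R's regexpr with perl = TRUE instead of stringr::str_extract; the variable names m, raw and trimmed are illustrative.

```r
addresses <- c("big fake plaza, 12 this street, district, city",
               "Green mansion, district, city",
               "Block 7c of orange building  district, city",
               "98 main street block a blue plaza, city",
               "tower 10, caribbean coast, district",
               "block 3a, the latitude, city",
               "blue red mansion, 46 pearl street, city",
               "dorsett hotel, city",
               "block 9, Willowland, disctrict, city",
               "tower 2, the coronation, 1 fake street, district")

# Same pattern as in the answer; backslashes are doubled inside the R string.
pattern <- "(?i)(?<=,|^)(?:(?!\\b(?:city|disc?trict|street|plaza|square|tower|block|mansion|garden|house|building)\\b)[^,])*(?=,|$)"

# regexpr returns -1 where there is no match; regmatches returns only the hits.
m <- regexpr(pattern, addresses, perl = TRUE)
raw <- rep(NA_character_, length(addresses))
raw[m != -1] <- regmatches(addresses, m)

# A segment that follows a comma still carries the space after that comma.
trimmed <- trimws(raw)

print(raw[5])
print(trimmed)
```

The fifth raw match is " caribbean coast" with a leading space, because the match starts right after the comma; trimws turns it into "caribbean coast" and leaves the NA entries untouched.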

