Removing Custom Words From Text Variables in R
I have a dataset that looks like this:
dat <- data.frame(ID=c(1,2,3,4,5),ADDRESS=c("EAST SS BLVD","SOUTH AA STREET","XX EAST ST","ZZ NORTH ROAD","WEST TR TRAIL"))
> dat
ID ADDRESS
1 1 EAST SS BLVD
2 2 SOUTH AA STREET
3 3 XX EAST ST
4 4 ZZ NORTH ROAD
5 5 WEST TR TRAIL
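I want to remove everything from the address that is not in my list of wanted words. I am using the code below, which is incorrect and does not work.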
dat$FEATURE <- gsub("^[(BLVD)|(BOULEVARD)|(DRIVE)|(DR)|(ROAD)|(RD)|(PL)|(PLACE)
|(SL)|(CIRCLE)|(CT)|(COURT)|(WY)|(WAY)|(ST)|(STREET)|(AVE)
|(AVENUE)|(PKWY)|(WAY)|(PARKWAY)|(LN)|(LANE)|(HWY)|(HIGHWAY)
|(TRAIL$)|(CIR$)]","",dat$ADDRESS)
> dat
ID ADDRESS FEATURE
1 1 EAST SS BLVD AST SS BLVD
2 2 SOUTH AA STREET OUTH AA STREET
3 3 XX EAST ST XX EAST ST
4 4 ZZ NORTH ROAD ZZ NORTH ROAD
5 5 WEST TR TRAIL EST TR TRAIL
My desired output is:
> dat1
ID ADDRESS FEATURE
1 1 EAST SS BLVD BLVD
2 2 SOUTH AA STREET STREET
3 3 XX EAST ST ST
4 4 ZZ NORTH ROAD ROAD
5 5 WEST TR TRAIL TRAIL
I'm not very good with regular expressions. Any help is appreciated, and any references on regular expressions in R would be helpful.
You can use
(?xs).*\b     # any 0+ chars, as many as possible, then a word boundary
(             # Group 1 start:
  BLVD|BOULEVARD|DR(?:IVE)?|R(?:OA)?D|PL(?:ACE)?   # the various words
  |SL|CIRCLE|CT|COURT|WA?Y|ST(?:REET)?|AVE(?:NUE)? # you need to keep
  |PKWY|(?:PARK)?WAY|LN|LANE|HWY|HIGHWAY           # go here
  |TRAIL$|CIR$                                     # and here
)             # Group 1 end
\b            # word boundary
.*            # rest of the string
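Here, (?x) is the free-spacing (comment/verbose) modifier, which lets you format the pattern with whitespace and comments. (?s) is the DOTALL modifier, which lets . match any character, including a newline (it is needed because this is a PCRE pattern; note the perl=TRUE argument). The "\\1" replacement inserts the value of Group 1 into the result, so the whole string is replaced by the matched word.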
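To make the effect of (?x) concrete, here is a minimal sketch with a deliberately shortened keyword list (the two alternatives and the sample address are chosen only for illustration). It shows that whitespace and # comments inside the R string are ignored by PCRE, and that backslashes must be doubled in an R string literal.

```r
# Free-spacing pattern: whitespace and "# ..." comments inside the string
# are ignored by PCRE. The keyword list is shortened for illustration.
pattern <- "(?x) .* \\b          # greedy prefix, then a word boundary
  ( ST(?:REET)? | R(?:OA)?D )    # group 1: the word to keep
  \\b .*                         # word boundary, then the rest"

# "\\1" replaces the whole match with the contents of group 1.
result <- sub(pattern, "\\1", "ZZ NORTH ROAD", perl = TRUE)
print(result)  # [1] "ROAD"
```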
See the R demo:
dat <- data.frame(ID=c(1,2,3,4,5),ADDRESS=c("EAST SS BLVD","SOUTH AA STREET","XX EAST ST","ZZ NORTH ROAD","WEST TR TRAIL"))
dat$FEATURE <- gsub("(?xs).*\\b(BLVD|BOULEVARD|DR(?:IVE)?|R(?:OA)?D|PL(?:ACE)?
|SL|CIRCLE|CT|COURT|WA?Y|ST(?:REET)?|AVE(?:NUE)?
|PKWY|(?:PARK)?WAY|LN|LANE|HWY|HIGHWAY
|TRAIL$|CIR$)\\b.*","\\1",dat$ADDRESS, perl=TRUE)
dat
Output:
ID ADDRESS FEATURE
1 1 EAST SS BLVD BLVD
2 2 SOUTH AA STREET STREET
3 3 XX EAST ST ST
4 4 ZZ NORTH ROAD ROAD
5 5 WEST TR TRAIL TRAIL
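If the keyword list changes often, one option is to build the same kind of pattern from a character vector instead of writing the alternation by hand. The following is only a sketch: the names keep_words and addresses are hypothetical, and the list is a subset of the one in the question.

```r
# Hypothetical, shortened list of words to keep.
keep_words <- c("BLVD", "BOULEVARD", "DRIVE", "DR", "ROAD", "RD",
                "PL", "PLACE", "ST", "STREET", "TRAIL")

# Same shape as the pattern above: greedy prefix, word boundary,
# one captured keyword, word boundary, rest of the string.
pattern <- paste0(".*\\b(", paste(keep_words, collapse = "|"), ")\\b.*")

addresses <- c("EAST SS BLVD", "SOUTH AA STREET", "XX EAST ST",
               "ZZ NORTH ROAD", "WEST TR TRAIL")

# Replace each whole address with the captured keyword.
feature <- sub(pattern, "\\1", addresses, perl = TRUE)
print(feature)
# [1] "BLVD"   "STREET" "ST"     "ROAD"   "TRAIL"
```

Because of the \\b after the group, listing "ST" before "STREET" is harmless: "ST" cannot match inside "STREET". An address that contains none of the keywords is returned unchanged by sub().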
You can do it like this:
#R version 3.3.2
dat <- data.frame(ID=c(1,2,3,4,5),ADDRESS=c("EAST SS BLVD","SOUTH AA STREET","XX EAST ST","ZZ NORTH ROAD","WEST TR TRAIL"))
dat$FEATURE <- gsub("\\b(?!AVE(?:NUE)?|B(?:LV|OULEVAR)D|C(?:IR(?:CLE)?|OURT|T)|DR(?:IVE)?|H(?:IGHWA|W)Y|L(?:ANE|N)|P(?:ARKWAY|KWY|L(?:ACE)?)|R(?:|OA)D|S(?:L|T(?:REET)?)|TRAIL|W(?:AY|Y)).+?\\b","",dat$ADDRESS, perl=TRUE)
dat
http://rextester.com/GGYN78288
https://regex101.com/r/6RcXTi/1
I think this is technically more accurate:
"\\b(?!(?:AVE(?:NUE)?|B(?:LV|OULEVAR)D|C(?:IR(?:CLE)?|OURT|T)|DR(?:IVE)?|H(?:IGHWA|W)Y|L(?:ANE|N)|P(?:ARKWAY|KWY|L(?:ACE)?)|R(?:|OA)D|S(?:L|T(?:REET)?)|TRAIL|W(?:AY|Y))\\b).+?\\b"
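The difference between the two lookahead versions shows up when a word merely begins with one of the keywords. The sketch below uses a shortened keyword list and a made-up address ("STONE HILL ST"), both assumptions for illustration only. Without the \\b inside the lookahead, "STONE" is kept because it starts with "ST"; with it, only whole-word keywords survive.

```r
# Shortened keyword alternation, for illustration only.
kw <- "ST(?:REET)?|R(?:OA)?D|BLVD|TRAIL"

# First version: the lookahead only checks how the word starts.
loose  <- paste0("\\b(?!", kw, ").+?\\b")
# Second version: the lookahead also requires a word boundary after the keyword.
strict <- paste0("\\b(?!(?:", kw, ")\\b).+?\\b")

x <- "STONE HILL ST"  # made-up address

loose_result  <- gsub(loose,  "", x, perl = TRUE)
strict_result <- gsub(strict, "", x, perl = TRUE)

print(loose_result)   # [1] "STONEST"  ("STONE" is kept, spaces are removed)
print(strict_result)  # [1] "ST"
```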