Keep delimiter in strsplit with regex combinations
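I am working with some data that requires me to use strsplit with a combination of regex patterns. I have figured out how to split my string, but I am struggling to apply the existing guidance on keeping the delimiter.
Here is an example of the strings I am scraping: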
text<-c("This activity center is fun and helps give your birds exercise! With climbing ladders, a swing, tightrope and an assortment of engaging toys, the Activity Center has everything your bird needs to relieve stress and boredom all in one place. Relieves stress & boredom Durable & brightly colored wood Easy to clean bottom Simple installation & assemblyMaterial: WoodDimensions (Overall): 12.0 inches (H) x 15.0 inches (W) x 18.5 inches (L)Weight: 6.0 poundsHolds up to: 20.0 poundsIntended Pet Type: BirdCare and Cleaning: Hand washPet activity: ClimbTCIN: 16707835UPC: 030172025594Item Number (DPCI): 083-01-0246Report incorrect product information")
And here is the code that splits the string successfully but trims the delimiter:
strsplit(as.character(text), "[0-9](?=[A-Z])|[a-z](?=[A-Z])|[')'](?=[A-Z])", perl=TRUE)
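As you can see, I am looking for the following positions:
- a lowercase letter immediately followed by an uppercase letter
- a digit immediately followed by an uppercase letter
- a closing parenthesis immediately followed by an uppercase letter
Unfortunately, the output below shows the problem with my code: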
[1] "This activity center is fun and helps give your birds exercise! With climbing ladders, a swing, tightrope and an assortment of engaging toys, the Activity Center has everything your bird needs to relieve stress and boredom all in one place. Relieves stress & boredom Durable & brightly colored wood Easy to clean bottom Simple installation & assembl"
[2] "Material: Woo"
[3] "Dimensions (Overall): 12.0 inches (H) x 15.0 inches (W) x 18.5 inches (L"
[4] "Weight: 6.0 pound"
[5] "Holds up to: 20.0 pound"
[6] "Intended Pet Type: Bir"
[7] "Care and Cleaning: Hand was"
[8] "Pet activity: Clim"
[9] "TCIN: 1670783"
[10] "UPC: 03017202559"
[11] "Item Number (DPCI): 083-01-024"
[12] "Report incorrect product information"
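That is, the last character is dropped from each piece: "assembly" in [1], "Wood" in [2], and so on. How do you keep the delimiter when you are matching a combination of regex alternatives like mine?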
You can move the consuming patterns in your regex into lookbehinds:
> text<-c("This activity center is fun and helps give your birds exercise! With climbing ladders, a swing, tightrope and an assortment of engaging toys, the Activity Center has everything your bird needs to relieve stress and boredom all in one place. Relieves stress & boredom Durable & brightly colored wood Easy to clean bottom Simple installation & assemblyMaterial: WoodDimensions (Overall): 12.0 inches (H) x 15.0 inches (W) x 18.5 inches (L)Weight: 6.0 poundsHolds up to: 20.0 poundsIntended Pet Type: BirdCare and Cleaning: Hand washPet activity: ClimbTCIN: 16707835UPC: 030172025594Item Number (DPCI): 083-01-0246Report incorrect product information")
> strsplit(text, "(?<=[0-9])(?=[A-Z])|(?<=[a-z])(?=[A-Z])|(?<=\\))(?=[A-Z])", perl=TRUE)
[[1]]
[1] "This activity center is fun and helps give your birds exercise! With climbing ladders, a swing, tightrope and an assortment of engaging toys, the Activity Center has everything your bird needs to relieve stress and boredom all in one place. Relieves stress & boredom Durable & brightly colored wood Easy to clean bottom Simple installation & assembly"
[2] "Material: Wood"
[3] "Dimensions (Overall): 12.0 inches (H) x 15.0 inches (W) x 18.5 inches (L)"
[4] "Weight: 6.0 pounds"
[5] "Holds up to: 20.0 pounds"
[6] "Intended Pet Type: Bird"
[7] "Care and Cleaning: Hand wash"
[8] "Pet activity: Climb"
[9] "TCIN: 16707835"
[10] "UPC: 030172025594"
[11] "Item Number (DPCI): 083-01-0246"
[12] "Report incorrect product information"
Here, [0-9] becomes (?<=[0-9]), [a-z] becomes (?<=[a-z]), and [')'] becomes (?<=\)), which has to be written as (?<=\\)) inside an R string literal.
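Since all three alternatives share the same lookahead, they can probably be folded into a single character class. The sketch below is not part of the original answer: the shortened sample string `x` and the names `verbose` and `compact` are made up for illustration, and the check only shows that both patterns agree on this one input.

```r
# Hypothetical shortened sample, not the poster's full string.
x <- "Material: WoodDimensions (Overall): 18.5 inches (L)Weight: 6.0 poundsTCIN: 16707835UPC: 030172025594"

# Pattern from the answer: three lookbehind alternatives, each followed by
# the same lookahead for an uppercase letter.
verbose <- "(?<=[0-9])(?=[A-Z])|(?<=[a-z])(?=[A-Z])|(?<=\\))(?=[A-Z])"

# Assumed equivalent: one lookbehind over a character class that covers
# digits, lowercase letters and ")" (a ")" is literal inside [...]).
compact <- "(?<=[0-9a-z)])(?=[A-Z])"

res_verbose <- strsplit(x, verbose, perl = TRUE)[[1]]
res_compact <- strsplit(x, compact, perl = TRUE)[[1]]

print(res_compact)
# Compare the two results on this sample.
print(identical(res_verbose, res_compact))
```

The compact form is easier to extend: adding another "left-hand" character only means adding it to the class.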
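Note that (?<=...) is a positive lookbehind: it matches a position in the string that is immediately preceded by the pattern defined inside the lookbehind. Because lookbehinds and lookaheads are zero-width, the split happens at that position and no characters are consumed.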
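To make the difference concrete, here is a minimal side-by-side sketch on a short, hypothetical fragment of the sample text. The variable names `consuming` and `zero_width` are illustrative only, and the character class is a shortened stand-in for the patterns above.

```r
# Hypothetical short fragment of the sample text.
x <- "Hand washPet activity: ClimbTCIN: 16707835UPC"

# Consuming version: the character before the uppercase letter is part of
# the match, so strsplit() throws it away.
consuming <- strsplit(x, "[0-9a-z](?=[A-Z])", perl = TRUE)[[1]]

# Zero-width version: the same character is only checked by the lookbehind,
# so it stays at the end of the left-hand piece.
zero_width <- strsplit(x, "(?<=[0-9a-z])(?=[A-Z])", perl = TRUE)[[1]]

print(consuming)
# "Hand was" "Pet activity: Clim" "TCIN: 1670783" "UPC"
print(zero_width)
# "Hand wash" "Pet activity: Climb" "TCIN: 16707835" "UPC"

# With the zero-width split, gluing the pieces back together restores x.
print(paste0(zero_width, collapse = "") == x)
```

Pasting the pieces back together is a cheap sanity check that a split pattern has not dropped anything.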