How to extract string from sentence using regex in R?

Question

I want to extract a string from a sentence using regex in R. I am new to R and don't know where to start or how to do it.

string<-c(".\n                Written by\nJ-S-Golden            \n        
\n        \n         \n                Plot Summary\n    |\n        Plot 
Synopsis\n    \n        \n            Plot Keywords:\n wrongful 
imprisonment\n                        |\n escape from prison\n                        
|\n based on the works of stephen king\n                        |\n 
prison\n                        |\n voice over narration\n            | See 
All (296) »      \n        \n            Taglines:\nFear can hold you 
prisoner. Hope can set you free.        \n        \n")

I have the string above, and the output I want is:

Plot Keywords:
\n wrongful imprisonment\n
|\n escape from prison\n
|\n based on the works of stephen king\n                        
|\n prison\n                        
|\n voice over narration\n            
| See All (296) »      \n        \n

I don't know how to extract clean data from the string. Can anyone help me?

Answer 1

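Here is a solution using base R's sub function. The pattern matches (and includes) the leading text Plot Keywords:. It then uses a tempered dot to match any character up to, but not including, the first label that is followed by a colon. The (?s) flag lets the dot match newlines, and perl=TRUE is required because the pattern uses a lookahead.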

sub("(?s).*(Plot Keywords:(?:(?![^: ]+:).)*).*", "\\1", string, perl=TRUE)

[1] "Plot Keywords:\n wrongful \nimprisonment\n
                    |\n escape from prison\n
                    \n|\n based on the works of
     stephen king\n
                    |\n \nprison\n                        |\n voice over narration\n
        | See \nAll (296) »      \n        \n            "
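
The question asks for clean data, so as a possible follow-up, here is a minimal sketch that applies the same pattern to a shortened input and then splits the captured block into individual keywords. The input `x`, the variable names, and the choice to drop the "See All" entry are assumptions made for illustration, not part of the original answer.

```r
# Hypothetical, shortened input with the same "Label:" layout as in the question.
x <- "Written by\nJ-S-Golden\n Plot Keywords:\n wrongful imprisonment\n |\n prison\n | See All (296)\n Taglines:\nFear can hold you prisoner."

# Same pattern as in the answer: capture from "Plot Keywords:" up to,
# but not including, the next label of the form "word:".
block <- sub("(?s).*(Plot Keywords:(?:(?![^: ]+:).)*).*", "\\1", x, perl = TRUE)

# Drop the label, split on the literal "|" separators, and trim whitespace.
body <- sub("^Plot Keywords:", "", block)
keywords <- trimws(strsplit(body, "|", fixed = TRUE)[[1]])

# Assumption: the trailing "See All (...)" entry is page residue, not a keyword.
keywords <- keywords[!grepl("^See All", keywords)]

print(keywords)
# [1] "wrongful imprisonment" "prison"
```

The tempered dot stops at `Taglines:` because that is the first position where a run of non-space, non-colon characters is immediately followed by a colon. Passing `fixed = TRUE` to `strsplit` matters here, since `|` would otherwise be read as regex alternation.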

In this particular case, a pure regex demo is probably more helpful than an R demo, so here is a link to one:

Demo