Trim pattern in a text between \n\n\n\n

Question

I am cleaning text in R. My text has the following format:

but he could not avoid the subject FULLSTOP \n\n\n\n\nsimilar pieces by the author\n\n\nlife is great 13022015\nreal men don t eath quiche 22042013\nback to the future 01072012\n\n\n\n and as he takes the stage here wednesday night to rally democrats around hillary clinton mr FULLSTOP obama will revisit his own promise to guide the nation into an era of reconciliation and unity harking back to the themes that propelled his improbable rise but that seem even more out of reach today FULLSTOP \n\n\n\n\nobama at convention to lay out stakes for a divided nation \n\n\n\n we get frustrated with political gridlock worry about racial divisions are shocked and saddened by the madness of orlando or nice mr FULLSTOP

I am trying to get rid of

\n\n\n\n\nsimilar pieces by the author\n\n\nlife is great 13022015\nreal men don t eath quiche 22042013\nback to the future 01072012\n\n\n\n

so as to get something like

but he could not avoid the subject FULLSTOP and as he takes the stage here wednesday night to rally democrats around hillary clinton mr FULLSTOP obama will revisit his own promise to guide the nation into an era of reconciliation and unity harking back to the themes that propelled his improbable rise but that seem even more out of reach today FULLSTOP \n\n\n\n\nobama at convention to lay out stakes for a divided nation \n\n\n\n we get frustrated with political gridlock worry about racial divisions are shocked and saddened by the madness of orlando or nice mr FULLSTOP

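I have tried

gsub("\\n{3,}(similar pieces)?.*\\n{3,}", "", my_string) or gsub("\\n{3,}(similar pieces)?.*?\\n{3,}", "", my_string)

but each attempt either over-trims or does not work.

Any help (along with an explanation of what I am doing wrong and why the alternative works) would be much appreciated.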

Answer 1

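You need to match everything from the first run of 5 newlines up to the first run of 4 newlines that follows it.

I suggest using the regex " *\n{5}.*?\n{4} *" (written here as an R string):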

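" *" - zero or more literal spaces
\n{5} - 5 newlines
.*? - zero or more characters, as few as possible, up to the first occurrence of what follows
\n{4} - 4 LF symbols
" *" - zero or more literal spaces (just to trim the match)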

and replace the match with a single space.

Use sub, since you only need 1 replacement:

sub(" *\n{5}.*?\n{4} *", " ", s)
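As a quick check, the call above can be run on a shortened stand-in for the text in the question. This is only a sketch: the variable s and its contents are made up here, and it assumes R's default regex engine, where . also matches a newline.

```r
# Hypothetical, shortened stand-in for the text in the question.
s <- paste0(
  "but he could not avoid the subject FULLSTOP ",
  "\n\n\n\n\nsimilar pieces by the author\n\n\n",
  "life is great 13022015\nback to the future 01072012\n\n\n\n",
  " and as he takes the stage FULLSTOP ",
  "\n\n\n\n\nobama at convention \n\n\n\n we get frustrated"
)

# sub() replaces only the first match: the block that starts at the first
# run of 5 newlines and runs up to the first run of 4 newlines after it.
cleaned <- sub(" *\n{5}.*?\n{4} *", " ", s)

# Is the "similar pieces" block still present?
grepl("similar pieces", cleaned, fixed = TRUE)

# Is the later headline, which is also wrapped in newlines, still present?
grepl("obama at convention", cleaned, fixed = TRUE)

# Print the result with the remaining newlines shown as escapes.
print(cleaned)
```

On this sample the first check should come back FALSE and the second TRUE, because sub stops after one replacement and leaves the second newline-delimited block alone.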

See the R demo.
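On the "what am I doing wrong" part of the question, here is a rough sketch of how the two gsub attempts behave on a shortened sample. The variable s, its contents, and the helper kept are made up for illustration, and the sketch assumes R's default regex engine, where . also matches a newline. In both attempts \\n{3,} accepts any run of three or more newlines, so the three newlines inside the block and the runs around the later headline qualify as delimiters too, and gsub keeps replacing after the first match. With the greedy .* the match runs on to the last such run in the string. With the lazy .*? the first match already stops at the three newlines after "author" and, as far as I can trace it, a second match then starts at the four newlines that close the block and swallows the sentence that follows.

```r
# Hypothetical, shortened stand-in for the text in the question.
s <- paste0(
  "but he could not avoid the subject FULLSTOP ",
  "\n\n\n\n\nsimilar pieces by the author\n\n\n",
  "life is great 13022015\nback to the future 01072012\n\n\n\n",
  " and as he takes the stage FULLSTOP ",
  "\n\n\n\n\nobama at convention \n\n\n\n we get frustrated"
)

# The two attempts from the question, applied unchanged.
greedy <- gsub("\\n{3,}(similar pieces)?.*\\n{3,}", "", s)
lazy   <- gsub("\\n{3,}(similar pieces)?.*?\\n{3,}", "", s)

# The suggested call: exact run lengths and a single replacement.
anchored <- sub(" *\n{5}.*?\n{4} *", " ", s)

# Hypothetical helper: does the sentence after the block survive?
kept <- function(x) grepl("and as he takes the stage", x, fixed = TRUE)

kept(greedy)
kept(lazy)
kept(anchored)
```

On this sample the sentence should be lost by both gsub attempts and kept by the sub call. Pinning the delimiters to exactly 5 and 4 newlines and replacing only once is what keeps the match from spreading into the rest of the text.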

regex

nlp

r

data-cleaning