How to split a piece of text by a word in R? (break the text after a specific word)

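I need to split PDF files into their chapters. In each PDF, I added the word "Hirfar" at the beginning of every chapter so that I can find it and split the text there. Consider the following example: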

t <- c(" Hirfar Mark Zuckerberg has hit back at the testimony of the Facebook whistleblower Frances Haugen, saying her claims the company puts profit over people’s safety are “just not true”.

Hirfar In a blogpost, the Facebook founder and chief executive addressed one of the most damaging statements in Haugen’s opening speech to US senators on Tuesday, that Facebook puts “astronomical profits before people”.

 Hirfar “At the heart of these accusations is this idea that we prioritise profit over safety and wellbeing. That’s just not true,” he said.

Hirfar He added: “The argument that we deliberately push content that makes people angry for profit is deeply illogical. We make money from ads, and advertisers consistently tell us they don’t want their ads next to harmful or angry content.”

Hirfar Zuckerberg said many of the claims made by Haugen – and in the Wall Street Journal, based on documents she leaked – “don’t make any sense”. The most damaging reporting in the WSJ, reiterated at length by Haugen in testimony to the US Senate on Tuesday, was that Facebook failed to act on internal research showing that its Instagram app was damaging teenagers’ mental health.")

Here I used this code to break it into words:

library(stringr)
wrds <- str_split(t, pattern = boundary(type = "word"))

Now I want to find the word "Hirfar" and split this text into 5 separate texts. Each one must run from the first word after a Hirfar to the last word before the next Hirfar.

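We can use a regex lookahead: split on the whitespace that comes right before each Hirfar, so the marker itself is not consumed and stays with its chapter.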

strsplit(t, "\\s+(?=Hirfar)", perl = TRUE)[[1]][-1]

-output

[1] "Hirfar Mark Zuckerberg has hit back at the testimony of the Facebook whistleblower Frances Haugen, saying her claims the company puts profit over people’s safety are “just not true”."                                                                                                                                                                                                        
[2] "Hirfar In a blogpost, the Facebook founder and chief executive addressed one of the most damaging statements in Haugen’s opening speech to US senators on Tuesday, that Facebook puts “astronomical profits before people”."                                                                                                                                                                   
[3] "Hirfar “At the heart of these accusations is this idea that we prioritise profit over safety and wellbeing. That’s just not true,” he said."                                                                                                                                                                                                                                                   
[4] "Hirfar He added: “The argument that we deliberately push content that makes people angry for profit is deeply illogical. We make money from ads, and advertisers consistently tell us they don’t want their ads next to harmful or angry content.”"                                                                                                                                            
[5] "Hirfar Zuckerberg said many of the claims made by Haugen – and in the Wall Street Journal, based on documents she leaked – “don’t make any sense”. The most damaging reporting in the WSJ, reiterated at length by Haugen in testimony to the US Senate on Tuesday, was that Facebook failed to act on internal research showing that its Instagram app was damaging teenagers’ mental health."
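
The `[-1]` above only removes the empty string that appears because `t` starts with a space before the first Hirfar. A minimal sketch on a made-up toy string (`x` and its contents are assumptions for illustration, not the poster's data) shows the same lookahead split, with empty pieces filtered by `nzchar()` so the result does not depend on that leading space:

```r
# Toy input (hypothetical). Unlike t, it does NOT start with whitespace,
# so a blind [-1] here would throw away the first real chapter.
x <- "Hirfar one two\n\nHirfar three four\n\n Hirfar five"

# \\s+ consumes the whitespace between chapters; (?=Hirfar) is a lookahead,
# so the marker is only checked, not consumed, and stays in each piece.
parts <- strsplit(x, "\\s+(?=Hirfar)", perl = TRUE)[[1]]

# Keep only non-empty pieces instead of dropping the first element.
parts <- parts[nzchar(parts)]
print(parts)
```

With `t` from the question, the same two steps should give the five pieces shown above, since the only empty piece is the one before the first Hirfar.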

If the pieces should not include Hirfar:

strsplit(t, "Hirfar\\s+")[[1]][-1]
[1] "Mark Zuckerberg has hit back at the testimony of the Facebook whistleblower Frances Haugen, saying her claims the company puts profit over people’s safety are “just not true”.\n\n"                                                                                                                                                                                                    
[2] "In a blogpost, the Facebook founder and chief executive addressed one of the most damaging statements in Haugen’s opening speech to US senators on Tuesday, that Facebook puts “astronomical profits before people”.\n\n "                                                                                                                                                              
[3] "“At the heart of these accusations is this idea that we prioritise profit over safety and wellbeing. That’s just not true,” he said.\n\n"                                                                                                                                                                                                                                               
[4] "He added: “The argument that we deliberately push content that makes people angry for profit is deeply illogical. We make money from ads, and advertisers consistently tell us they don’t want their ads next to harmful or angry content.”\n\n"                                                                                                                                        
[5] "Zuckerberg said many of the claims made by Haugen – and in the Wall Street Journal, based on documents she leaked – “don’t make any sense”. The most damaging reporting in the WSJ, reiterated at length by Haugen in testimony to the US Senate on Tuesday, was that Facebook failed to act on internal research showing that its Instagram app was damaging teenagers’ mental health."
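
The pieces above still end with the "\n\n" (and sometimes a space) that separated the chapters. A minimal sketch, again on a made-up toy string (`x` and `chapters` are hypothetical names, not from the question), of cleaning them up with base R's `trimws()` and dropping whatever is left empty:

```r
# Toy input (hypothetical), shaped like t: a leading space and blank lines
# between chapters.
x <- " Hirfar one two\n\nHirfar three four\n\n Hirfar five"

# Split on the marker plus the whitespace that follows it; the marker is
# consumed, so it does not appear in the pieces.
chapters <- strsplit(x, "Hirfar\\s+")[[1]]

# Strip leading/trailing spaces and newlines from every piece.
chapters <- trimws(chapters)

# The piece before the first marker is now empty; drop it (and any others).
chapters <- chapters[nzchar(chapters)]
print(chapters)
```

Filtering with `nzchar()` after trimming plays the role of `[-1]` here, and it also copes with input that has extra text-free gaps between markers.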