如何通过 html 标签或正则表达式拆分 txt 文件，以便将其另存为 R 中的单独 txt 文件？

Question

我有 html 和 txt 格式的 LexisNexis 新闻文章批量下载输出。该文件本身包含几篇不同新闻文章的 headers、元数据和 body，我需要将它们系统地分离并另存为独立的 txt 文件。 txt版本的头部是这样的：

> head(textz, 100)
[1] "ï»¿"                                                                              
[2] "                               1 of 103 DOCUMENTS"                                
[3] ""                                                                                 
[4] ""                                                                                 

[5] "                                Foreign Affairs"                                  

[6] ""                                                                                 
[7] "                              May 2013 - June 2013"                               
[8] ""                                                                                 
[9] "Why the U.S. Army Needs Armor Subtitle: The Case for a Balanced Force"            
[10] ""                                                                                 

[11] "BYLINE: Chris McKinney, Mark Elfendahl, and H. R. McMaster Authors BIOS: CHRIS"   
[12] "MCKINNEY is a Lieutenant Colonel in the U.S. Army and an adviser to the Saudi"    
[13] "Arabian National Guard. MARK ELFENDAHL is a Colonel in the U.S. Army and a"       
[14] "student at the Joint Advanced Warfighting School in Norfolk, Virginia. H. R."     
[15] "MCMASTER is a Major General in the U.S. Army and Commander of the Maneuver"       
[16] "Center of Excellence at Fort Benning, Georgia."                                   

[17] ""                                                                                 

[18] "SECTION: Vol. 92 No. 4 PAGE: 129"                                                 

[19] ""                                                                                 

[20] "LENGTH: 2856 words"                                                               

[21] ""                                                                                 

[22] ""                                                                                 

[23] "Ever since World War II, the United States has depended on armored forces --"     
[24] "forces equipped with tanks and other protected vehicles -- to wage its wars."
....
....

html 版本的快照如下所示：

<DOC NUMBER=103>
<DOCFULL> -->
<br><div class="c0">
<p class="c1"><span class="c2">103 of 103 DOCUMENTS</span></p>
</div>
<br><div class="c0">
<br><p class="c1"><span class="c2">The New York Times</span></p>
</div>
<br><div class="c3">
<p class="c1"><span class="c4">July</span>
<span class="c2"> 26, 2011 Tuesday</span>
<span class="c2">Â </span>
<span class="c2">Â <br>Late Edition - Final</span></p>
</div>
<br><div class="c5">
<p class="c6"><span class="c7">A Step Toward Trust With China</span></p>
</div>
<br><div class="c5">
<p class="c6"><span class="c8">BYLINE: </span><span class="c2">By MIKE MULLEN. </span></p>
<p class="c9"><span class="c2">Mike Mullen, a </span>
<span class="c4">Navy admiral,</span><span class="c2"> is the chairman of the Joint Chiefs of Staff.
</span></p>
</div>
<br><div class="c5">
<p class="c6"><span class="c8">SECTION: </span>
<span class="c2">Section A; Column 0; Editorial Desk; OP-ED CONTRIBUTOR; Pg. 23</span></p>
</div>
<br><div class="c5">
<p class="c6"><span class="c8">LENGTH: </span>
<span class="c2">794 words</span></p>
</div>
<br><div class="c5">
<p class="c9"><span class="c2">Washington</span></p>
<p class="c9"><span class="c2">THE military relationship between the United States and China is one of the world's most important. And yet, clouded by some misunderstanding and suspicion, it remains among the most challenging. There are issues on which we disagree and are tempted to confront each other. But there are crucial areas where our interests coincide, on which we must work together.
</span></p>

独特的文档由每个文档中的“[0-9] of [0-9] DOCUMENTS”行分隔，但在 grep 系列和 strsplit 之间我一直无法找到拆分 txt 的方法（或 html) R 中的文件，以一种完全分离组件文章的方式，并允许我将它们保存为独立的 txt 文件。彻底搜索其他问题线程要么没有帮助，要么需要使用 Python。任何建议都会很棒！

Answer 1

库 rvest 使其易于解析 html。您的文档与 <DOCFULL> 和 <DOC NUMBER > headers 不太一致。下面的答案使用您提供的扩展文档来显示下一个文档 (104)。您可以使用 lapply 结构来做其他事情，例如为每篇文章编写一个文本文件。请注意 html_nodes 中的 css 选择器。 html 中似乎没有太多结构，但如果您找到一些模式，您可以使用选择器定位每篇文章的位。

library(rvest)
library(stringr)

articles  <- str_replace_all(doc, "\n", " ") %>%    # remove new line to simplify
  str_replace_all("<DOCFULL>\s+\-\->", " " ) %>%  # remove redundant header
  strsplit("<DOC NUMBER=\d+>") %>%                  # split on DOC NUMBER header
  unlist()                                           # to a vector

# drop the first empty result form the split
articles <- articles[-1]

# use lapply to travers all articles. 
c2_texts <- lapply(articles, function (article) {
  article %>% 
    read_html() %>%           # character input parsed as html
    html_nodes(css=".c2") %>% # find nodes with CSS selector, ex: c2
    html_text() })            # extract text from within the node

c2_texts
# [[1]]
# [1] "103 of 103 DOCUMENTS"                                                                                                                                                                                                                                                                                                                                                           
# [2] "The New York Times"                                                                                                                                                                                                                                                                                                                                                             
# [3] " 26, 2011 Tuesday"                                                                                                                                                                                                                                                                                                                                                              
# [4] "Â "                                                                                                                                                                                                                                                                                                                                                                             
# [5] "Â Late Edition - Final"                                                                                                                                                                                                                                                                                                                                                         
# [6] "By MIKE MULLEN. "                                                                                                                                                                                                                                                                                                                                                               
# [7] "Mike Mullen, a "                                                                                                                                                                                                                                                                                                                                                                
# [8] " is the chairman of the Joint Chiefs of Staff.     "                                                                                                                                                                                                                                                                                                                            
# [9] "Section A; Column 0; Editorial Desk; OP-ED CONTRIBUTOR; Pg. 23"                                                                                                                                                                                                                                                                                                                 
# [10] "794 words"                                                                                                                                                                                                                                                                                                                                                                      
# [11] "Washington"                                                                                                                                                                                                                                                                                                                                                                     
# [12] "THE military relationship between the United States and China is one of the worlds most important. And yet, clouded by some misunderstanding and suspicion, it remains among the most challenging. There are issues on which we disagree and are tempted to confront each other. But there are crucial areas where our interests coincide, on which we must work together.     "
# 
# [[2]]
# [1] "104 of 104 DOCUMENTS" "The Added Item"

Answer 2

要拆分 txt 版本，假设文本在 doc_text 中，并将每个写入按顺序命名的文件 .txt、file2.txt 等

lapply 用于文件写入改编自@P Lapointe

texts <- unlist(strsplit(doc_text, "\s+\d+\sof\s\d+\sDOCUMENTS") )
texts <- texts[-1]  # drop the first empty split

lapply (1:length(texts), function(i){ write(texts[i], paste0("file", i, ".txt"))})

如何通过 html 标签或正则表达式拆分 txt 文件，以便将其另存为 R 中的单独 txt 文件？

How do I split a txt file by html tags or regex in order to save it as separate txt files in R?

html

r

strsplit