How to extract different articles from a single text file?
I have an .rtf/.txt file that contains a collection of newspaper articles.
The .rtf file can be found here. And the .txt file can be found here.
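I would like to extract (1) the date, (2) the title and (3) the body of each article. In the end I want a data frame in which every row is one article, with three columns: title, date and body. As I marked in this screenshot, the title is the sentence in bold (underlined in yellow here) and the body is the paragraphs below it (the blue box here).
I have already managed to extract the dates with a regular expression. However, I cannot extract the title and the body of the articles.
Is it possible to extract the title and the body of the articles from this .rtf/.txt file with regular expressions?
I used the following code: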
library(readr)
library(stringr)
htmlText <- read_file("bild_afd_all.rtf")
#replace "\n" with a space
removeNewLines <- gsub("\n"," ",htmlText)
removeNewLines
# 1. extract the DATE from removeNewLines
date <- str_extract_all(removeNewLines, "\\d{1,2} [A-Z][a-z]+ \\d{4}")[[1]]
# 2. extract the TITLE from removeNewLines
## how?
# 3. extract the BODY from removeNewLines
## how?
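This question is related to a previously answered question: How do I extract dates from .rtf in R. In that post, a regular expression was used to extract the dates from the .rtf file, which is a collection of newspaper articles.
Thank you very much!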
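Well, some sample data would actually have been useful.
However, I suggest doing the following:
- Load the file into R (I generated a sample replacement in the txt variable)
- Remove the empty lines
- Find the indices of the word-count lines
- Based on those indices, pull out everything you are interested in.
I assume the structure is always the same and that the body of an article always starts 9 lines below the word-count line.
See my code.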
library(tidyverse)
#1
txt = c("title1", "", "12 words", "12-12-2004", "BILD", "ZBILD", "BIBU",
"2", "295", "German", "Copyright", "first bla bla bla", "bla bla bla", "bla bla bla",
"bla bla bla", "bla bla bla", "bla bla bla", "bla bla bla", "bla bla bla",
"bla bla bla", "bla bla bla", "bla bla bla", "bla bla bla", "bla bla bla",
"bla bla bla", "bla bla bla", "bla bla bla", "bla bla bla", "bla bla bla",
"bla bla bla", "last bla bla bla", "", "", "", "", "", "title2", "",
"", "12 words", "10-12-2004", "BILD", "ZBILD", "BIBU", "2", "1235",
"German", "Copyright", "first da da da", "da da da", "da da da", "da da da", "da da da",
"da da da", "da da da", "da da da", "da da da", "da da da", "da da da",
"da da da", "da da da", "last da da da", "", "", "", "title3", "",
"", "12 words", "10-12-2004", "BILD", "ZBILD", "BIBU", "2", "1235",
"German", "Copyright", "first info info", "info info", "info info", "info info", "info info",
"info info", "info info", "info info", "last info info")
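#2 drop the empty lines
txt = txt[txt!=""]
#3 locate the "<n> words" lines; each one marks one article
idx_word = which(str_detect(txt, "[0-9]+ +words$"))
# collect the body of every article: it starts 9 lines below the word-count
# line and ends on the line before the next article's title
fart = function(txt, idx_word){
  out = rep("", length(idx_word))
  for(i in seq_along(idx_word)){
    if(i<length(idx_word)){
      idx_txt=(idx_word[i]+9):(idx_word[i+1]-2)
    } else{
      idx_txt=(idx_word[i]+9):length(txt)
    }
    out[i]=paste(txt[idx_txt], collapse="\n")
  }
  out
}
#4 the title is the line above the word count, the date is the line below it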
df = tibble(
title = txt[idx_word-1],
date = txt[idx_word+1],
article = fart(txt, idx_word)
)
Output
# A tibble: 3 x 3
title date article
<chr> <chr> <chr>
1 title1 12-12-2004 "first bla bla bla\nbla bla bla\nbla bla~
2 title2 10-12-2004 "first da da da\nda da da\nda da da\nda ~
3 title3 10-12-2004 "first info info\ninfo info\ninfo info\n~
Adjust it as needed.
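One adjustment you may want: the date column above is still plain character. A minimal sketch of turning it into a Date column, assuming the sample strings are day-month-year (that order is my assumption, the sample data does not say). `parse_article_date` is a hypothetical helper name.

```r
# hypothetical helper: parse date strings into Date objects
# the default format assumes "dd-mm-yyyy", as in the sample data
parse_article_date <- function(x, format = "%d-%m-%Y") {
  as.Date(x, format = format)
}

dates <- parse_article_date(c("12-12-2004", "10-12-2004"))
```

For dates written like "18 December 2014" you could pass format = "%d %B %Y", but %B depends on the locale, so that only works when R runs with English month names.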
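Here is a new version of the program.
Unfortunately, when it comes to thoroughly cleaning the text into a nicely formatted string, you will have to do that yourself. I don't know whether you need it. The same goes for converting umlauts to UTF-8, line breaks, page breaks and so on.
This is only the general approach. As you can see below, it works. The rest is up to you.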
library(fs)
library(tidyverse)
# read all lines of a file; returns an empty vector if the file does not exist
readTxt = function(FileName){
  lines = character()
  if(fs::file_exists(FileName)){
    con = file(FileName, open = "r")
    on.exit(close(con))
    lines = readLines(con)
  }
  lines
}
# strip RTF control words; "\\\\" in the R string is one literal backslash in the regex
remove_f2_b = function(txt) str_replace(txt, "\\\\f2\\\\b ", "")
remove_f1_b0 = function(txt) str_replace(txt, "\\\\f1\\\\b0 ", "")
remove_f1_fs20 = function(txt) str_replace(txt, "\\\\f1\\\\fs20 ", "")
remove_f0_fs24 = function(txt) str_replace(txt, "\\\\f0\\\\fs24 ", "")
remove_cf0 = function(txt) str_replace(txt, "\\\\cf0 ", "")
remove_format = function(txt) txt %>% remove_f2_b() %>% remove_f1_b0() %>%
  remove_f1_fs20() %>% remove_f0_fs24() %>% remove_cf0()
txt = suppressWarnings(readTxt("bild_afd_all.rtf")) %>% remove_format()
txt = txt[txt!=""]
txt = txt[txt!="\\"]  # drop lines that contain only a backslash
view(txt)
#3 in the .rtf the word-count lines end with a literal backslash
idx_word = which(str_detect(txt, "\\d?\\,?\\d+ words\\\\$"))
# body: from 9 lines below the word count up to the line before the next title
fart = function(txt, idx_word){
  out = rep("", length(idx_word))
  for(i in seq_along(idx_word)){
    if(i<length(idx_word)){
      idx_txt=(idx_word[i]+9):(idx_word[i+1]-2)
    } else{
      idx_txt=(idx_word[i]+9):length(txt)
    }
    out[i]=paste(txt[idx_txt], collapse="\n")
  }
  out
}
#4
df = tibble(
  title = txt[idx_word-1],
  date = str_replace(txt[idx_word+1], "\\\\$", ""),  # remove the trailing backslash
  article = fart(txt, idx_word)
)
head(df)
Output
title date article
<chr> <chr> <chr>
1 "Exklusiv-Umfrage; SO DENKEN DIE DEUTSCHEN \'dcER PEGIDA" 18 December 2014 "Berlin - Immer mehr Zulauf f\'fc die sogenannt~
2 "Seite 2" 18 December 2014 "Berlin - Endlich mehr Geld im Portemonnaie: Die~
3 "Ex-Minister Friedrich greift Kanzlerin an" 29 December 2014 "Berlin - Wieder Kurs-Debatten in der Union! Ex-~
4 "Seite 4" 29 December 2014 "KOMMENTAR\\nVon FRITZ ESSER Zusatzbeitrag, Dur~
5 "Wegen Pegida-Schelte; AUSLAND FEIERT MERKEL" 2 January 2015 "Berlin - \"Ich sage allen, die auf solche Demon~
6 "Seite 2" 2 January 2015 "Von\\nHANNO KAUTZ u. RALF SCHULER\\nBerlin - ~
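The `\'dc` and `\'fc` sequences in the output above are RTF hex escapes, each standing for a single byte. A minimal sketch of decoding them, assuming the file uses the Windows-1252 code page (I have not checked the header of this particular file, so treat the code page as an assumption). `decode_rtf_hex` is a hypothetical helper name, not a function from any package.

```r
# hypothetical helper: turn RTF hex escapes such as \'fc into characters
# assumption: the bytes are Windows-1252 (CP1252) encoded
decode_rtf_hex <- function(x) {
  # locate every backslash + apostrophe + two hex digits
  m <- gregexpr("\\\\'[0-9a-fA-F]{2}", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(esc) {
    vapply(esc, function(e) {
      byte <- as.raw(strtoi(substr(e, 3, 4), 16L))
      ch <- iconv(rawToChar(byte), from = "CP1252", to = "UTF-8")
      # keep the escape untouched if the byte is not defined in CP1252
      if (is.na(ch)) e else ch
    }, character(1), USE.NAMES = FALSE)
  })
  x
}

decoded <- decode_rtf_hex(c("Zulauf f\\'fcr die", "Seite 2"))
```

Applied to the data frame it would look like df$title = decode_rtf_hex(df$title), and the same for df$article.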
Version for the .txt file
library(fs)
library(tidyverse)
readTxt = function(FileName){
lines = character()
if(fs::file_exists(FileName)){
con = file(FileName, open = "r")
on.exit(close(con))
lines = readLines(con)
}
lines
}
txt = suppressWarnings(readTxt("bild_afd_all.txt"))
txt = txt[txt!=""]
txt = txt[txt!=" "]
view(txt)
#3
idx_word = which(str_detect(txt, "\\d?\\,?\\d+ words$"))
fart = function(txt, idx_word){
out = rep("", length(idx_word))
for(i in 1:length(idx_word)){
if(i<length(idx_word)){
idx_txt=(idx_word[i]+9):(idx_word[i+1]-2)
} else{
idx_txt=(idx_word[i]+9):length(txt)
}
out[i]=paste(txt[idx_txt], collapse="\n")
}
out
}
#4
df = tibble(
title = txt[idx_word-1],
date = str_replace(txt[idx_word+1], "\\\\$", ""),  # strips a trailing backslash, if any
article = fart(txt, idx_word)
)
head(df, 10)
Output
# A tibble: 10 x 3
title date article
<chr> <chr> <chr>
1 "Exklusiv-Umfrage; SO DENKEN DIE DEUTSCHEN ĂśER PEGIDA" 18 December 2014 "Berlin - Immer mehr Zulauf fĂĽ die sogenannten Pegida-Demonstrationen in Dresden ~
2 "Seite 2" 18 December 2014 "Berlin - Endlich mehr Geld im Portemonnaie: Die verfĂĽbaren Einkommen der deutsch~
3 "Ex-Minister Friedrich greift Kanzlerin an" 29 December 2014 "Berlin - Wieder Kurs-Debatten in der Union! Ex-Innenminister Hans-Peter Friedrich~
4 "Seite 4" 29 December 2014 "KOMMENTAR\nVon FRITZ ESSER Zusatzbeitrag, Durchschnittsbeitrag, Sonderbeitrag - M~
5 "Wegen Pegida-Schelte; AUSLAND FEIERT MERKEL" 2 January 2015 "Berlin - \"Ich sage allen, die auf solche Demonstrationen gehen: Folgen Sie denen~
6 "Seite 2" 2 January 2015 "Von\nHANNO KAUTZ u. RALF SCHULER\nBerlin - Exakt 118 Stufen sind es bis in den se~
7 "ROLF KLEINE " 3 January 2015 "Von\nROLF KLEINE\nAlle Jahre wieder ...\nDie Forderung aus der CSU, abgelehnte As~
8 " AfD -Spitze lät Pegida-Vertreter in Landtag ein" 3 January 2015 "Dresden - Es wird ein brisantes Treffen: AfDCo-Chefin Frauke Petry (39) lud Vertr~
9 "Seite 2" 3 January 2015 "Koblenz - Im Stadtrat von Koblenz kracht es krätig! Anlass ist die geplante Wied~
10 " AfD -Spitze füchtet Zerfall der Partei" 5 January 2015 "Berlin - Der Machtkampf in der AfD-Spitze wird immer häter, die Fürung hät ein~
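The garbled umlauts in the output above look like an encoding mismatch when the file is read. A small sketch of reading the lines with a declared encoding, assuming the .txt file is UTF-8 (I don't know its actual encoding, so this is a guess; "latin1" would be the other value to try). `read_lines_utf8` is a hypothetical helper name.

```r
# hypothetical helper: read all lines and mark them as UTF-8
# assumption: the file really is UTF-8 encoded
read_lines_utf8 <- function(path) {
  if (!file.exists(path)) return(character())
  readLines(path, encoding = "UTF-8", warn = FALSE)
}

# usage with the file from the question:
# txt <- read_lines_utf8("bild_afd_all.txt")
```

This only tells R how to interpret the bytes; it does not convert anything, so if the file is in another encoding the characters will still come out wrong.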