R: parsing text file of quotations / splitting into paragraphs

Question

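I'm looking for an R solution to the problem of parsing a text file of quotations (like the one shown below) into a data.frame with one observation per quotation and the variables text and source, as described below.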

DIAGRAMS are of great utility for illustrating certain questions of vital statistics by
conveying ideas on the subject through the eye, which cannot be so readily grasped when
contained in figures.
--- Florence Nightingale, Mortality of the British Army, 1857

To give insight to statistical information it occurred to me, that making an
appeal to the eye when proportion and magnitude are concerned, is the best and
readiest method of conveying a distinct idea. 
--- William Playfair, The Statistical Breviary (1801), p. 2


Regarding numbers and proportions, the best way to catch the imagination is to speak to the eyes.
--- William Playfair, Elemens de statistique, Paris, 1802, p. XX.

The aim of my carte figurative is to convey promptly to the eye the relation not given quickly by numbers requiring mental calculation.
--- Charles Joseph Minard

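Here, each quotation is a paragraph, separated from the next by "\n\n". Within a paragraph, all lines up to the one beginning with --- make up the text, and whatever follows the --- is the source.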

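I think I can solve this if I can first separate the text lines into paragraphs (delimited by '\n\n+', that is, two or more consecutive newlines, or one or more blank lines), but I'm having trouble doing that.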

Answer 1

This should do most of what you need. I'm assuming you already have the file in a character vector of length 1 called txt:

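library(tidyverse)

# txt: length-1 character vector holding the whole file
txt                                             %>%
strsplit("\n{2,}")                              %>%   # split into paragraphs
unlist()                                        %>%
lapply(function(x) unlist(strsplit(x, "--- "))) %>%   # split text from source
{do.call("rbind", .)}                           %>%   # stack into a 2-column matrix
as.data.frame(stringsAsFactors = FALSE)         %>%
setNames(c("Text", "Source"))                    ->
df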

If you then tidy up the text by replacing the newlines with spaces, you get the following:

df$Text <- gsub("\n", " ", df$Text)
as_tibble(df)
#> # A tibble: 4 x 2
#>   Text                                              Source                             
#>   <chr>                                             <chr>                              
#> 1 "DIAGRAMS are of great utility for illustrating ~ Florence Nightingale, Mortality of~
#> 2 "To give insight to statistical information it o~ William Playfair, The Statistical ~
#> 3 "Regarding numbers and proportions, the best way~ William Playfair, Elemens de stati~
#> 4 "The aim of my carte figurative is to convey pro~ Charles Joseph Minard
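The pipeline above assumes that txt already exists. One way to build it, as a minimal sketch: read the file with readLines() and paste the lines back together with newlines. The temporary file and its two made-up quotes are only stand-ins for the real quotations file.

```r
# Stand-in for the real quotations file (contents are made up for this sketch)
path <- tempfile(fileext = ".txt")
writeLines(c("First line of quote one",
             "second line of quote one.",
             "--- Author One, Some Book",
             "",
             "Quote two on a single line.",
             "--- Author Two"),
           path)

# readLines() returns one element per line; collapse them into a single string
txt <- paste(readLines(path), collapse = "\n")

length(txt)   # 1
```

Collapsing with "\n" means a blank line in the file shows up as two consecutive newlines in txt, which is what the paragraph split relies on.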

Answer 2

Assuming you have loaded the initial text into the variable rawText:

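library(stringr)

# split into paragraphs on runs of two or more newlines
strsplit(rawText, "\n{2,}")[[1]] %>% 
  # split each paragraph once, at the line that starts the source
  str_split_fixed("\n--- ", 2) %>% 
  as.data.frame() %>% 
  setNames(c("text", "source"))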
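The sample file has an extra blank line between the second and third quotations, so the choice of paragraph delimiter matters. A small base R sketch of the difference between splitting on exactly two newlines and on two or more (the toy string is made up for illustration):

```r
# Toy string: one blank line after "a", two blank lines after "b" (made-up data)
x <- "a\n\nb\n\n\nc"

exact  <- strsplit(x, "\n\n")[[1]]     # exactly two newlines
greedy <- strsplit(x, "\n{2,}")[[1]]   # two or more newlines

exact    # "a" "b" "\nc"  (the third piece keeps a leading newline)
greedy   # "a" "b" "c"
```

With the exact pattern, the leftover newline has to be cleaned up afterwards, for example with trimws().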

Answer 3

Assuming your text file quote.txt is in the working directory.

Base R solution: split twice, (1) on \n\n and (2) on ---, then combine the pieces into a data frame.

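# read the file and rebuild it as a single string
quote <- readLines("quote.txt")
quote <- paste(quote, collapse = "\n")

# split into paragraphs, then split each paragraph into text and source
DF <- strsplit(unlist(strsplit(quote, "\n\n")), "---")
# trimws() drops the stray newlines and spaces left around each piece
DF <- data.frame(text   = trimws(sapply(DF, "[[", 1)),
                 source = trimws(sapply(DF, "[[", 2)))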

Output

DF
                                                                                                                                                                                                                                                                                 # text
# 1     DIAGRAMS are of great utility for illustrating certain questions of vital statistics by\nconveying ideas on the subject through the eye, which cannot be so readily grasped when\ncontained in figures.
# 2 To give insight to statistical information it occurred to me, that making an\nappeal to the eye when proportion and magnitude are concerned, is the best and\nreadiest method of conveying a distinct idea.
# 3                                                                                                           Regarding numbers and proportions, the best way to catch the imagination is to speak to the eyes.
# 4                                                                     The aim of my carte figurative is to convey promptly to the eye the relation not given quickly by numbers requiring mental calculation.
#                                                          source
# 1     Florence Nightingale, Mortality of the British Army, 1857
# 2       William Playfair, The Statistical Breviary (1801), p. 2
# 3 William Playfair, Elemens de statistique, Paris, 1802, p. XX.
# 4                                         Charles Joseph Minard
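A line-based variant may also be worth considering, since the question describes the file in terms of lines. The sketch below is an alternative, not the method used above: it groups the lines returned by readLines() into paragraphs with cumsum() on the blank lines, then treats the line starting with --- as the source. The sample lines and the helper parse_quote are hypothetical, introduced only for illustration.

```r
# Made-up sample lines, as readLines() would return them
lines <- c("First line of quote one",
           "second line of quote one.",
           "--- Author One, Some Book",
           "",
           "",
           "Quote two on a single line.",
           "--- Author Two")

blank <- !nzchar(trimws(lines))          # TRUE for empty or whitespace-only lines
para  <- cumsum(blank)[!blank]           # paragraph id for each non-blank line
paras <- split(lines[!blank], para)      # list: one character vector per quote

parse_quote <- function(p) {             # hypothetical helper
  is_src <- startsWith(p, "---")         # the source line begins with ---
  data.frame(
    text   = paste(trimws(p[!is_src]), collapse = " "),
    source = trimws(sub("^-+", "", p[is_src][1])),
    stringsAsFactors = FALSE
  )
}

quotes <- do.call(rbind, lapply(paras, parse_quote))
rownames(quotes) <- NULL
quotes
```

Because the grouping only looks at whether a line is blank, any number of blank lines between quotations is handled the same way, and the text lines are joined with spaces instead of embedded newlines.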
