Parsing text for analysis in R

I have a .txt file containing short articles, and I would like to use R to build a dataset that parses each article and extracts the date, author, journal, title, line number, and the text of each line into a data frame. Every article repeats the same structure and takes the following format:

This is a Title  
December 15, 2005 | Publisher  
Author: JANE DOE  
Section: Movies and More  
2554 Words
Page: C3  
OpenURL  
Link  

Text Text Text Text   
Another line of text  
One more thing  
End of article.   

Citation (asa Style)  
DOE, JANE. 2005. "This is a Title," Publisher, December 15, pp.C3.

Different Title  
December 18, 2005 | Publisher  
Author: JOHN DOE  
Section: News 
662 Words
Page: C8  
OpenURL  
Link  

Here is more text   
It is still text
But also shorter.  

Citation (asa Style)  
DOE, JOHN. 2005. "Different Title," Publisher, December 18, pp.C8.

For each article, I would like to extract the author, publication date, and journal, plus each line of text, to create a data frame like the following:

Date           Journal       Title             Author            Line              Text
15-Dec-2005    Publication   This is a title   Doe, Jane         1                 Text Text Text Text
15-Dec-2005    Publication   This is a title   Doe, Jane         2                 Another line of text
15-Dec-2005    Publication   This is a title   Doe, Jane         3                 One more thing
15-Dec-2005    Publication   This is a title   Doe, Jane         4                 End of article.
18-Dec-2005    Publication   Different Title   Doe, John         1                 Here is more text 
18-Dec-2005    Publication   Different Title   Doe, John         2                 It is still text
18-Dec-2005    Publication   Different Title   Doe, John         3                 But also shorter.

I would then like to convert the data frame above (call it text_df) into tidy text format, restructured one token per row, using the code below:

library(tidytext)
tidy_dat <- text_df %>%
  unnest_tokens(word, text)

I know this is a big ask. Any help would be greatly appreciated.

Since you have many fields to extract, I have sketched out the main idea; you can take it from there :)

First, load the tidyverse and the articles:

library(tidyverse)
articles <- read.delim("C:/Your_Path/temp.txt",
                      stringsAsFactors = FALSE, header = FALSE)
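One caveat: read.delim defaults to quote = "\"", so quotation marks in the text (like those in the citation lines above) are subject to quoting rules. readLines reads every line verbatim and sidesteps this; a small self-contained sketch (with a hypothetical sample written to a temp file for illustration):

```r
# Hypothetical sample written to a temp file for illustration
tmp <- tempfile(fileext = ".txt")
writeLines(c("This is a Title",
             "Author: JANE DOE",
             "",
             "Citation with \"embedded quotes\""), tmp)

# readLines reads every line verbatim, with no quoting rules applied
raw_lines <- readLines(tmp)

# Drop blank lines to mirror read.delim's blank.lines.skip = TRUE,
# then rebuild the one-column data frame the rest of the answer uses
articles <- data.frame(V1 = raw_lines[raw_lines != ""],
                       stringsAsFactors = FALSE)
```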

We can use grep to find the positions of "Link" and "Citation" in the text.

positions <- grep(pattern = "Link|Citation", 
                 x = articles$V1)

Because "Link" always appears before the citation, we can split the positions into a list of pairs.

positions <- split(positions, ceiling(seq_along(positions)/2))
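As a quick illustration with made-up line numbers, ceiling(seq_along(x) / 2) maps the 1st and 2nd elements to group 1, the 3rd and 4th to group 2, and so on, so split() yields one (Link, Citation) pair per article:

```r
# Hypothetical "Link"/"Citation" line numbers for two articles
positions <- c(14, 21, 31, 37)

# ceiling(1:4 / 2) is 1 1 2 2, grouping consecutive pairs
pairs <- split(positions, ceiling(seq_along(positions) / 2))

pairs[["1"]]  # first article's Link and Citation lines
pairs[["2"]]  # second article's
```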

Now we can extract the positions of the authors with the same idea (grep).

authors <- grep(pattern = "Author", 
               x = articles$V1)

It is always good to check that the vectors/lists have the same length; that way you can see whether you extracted more authors than Link/Citation pairs.

length(authors) == length(positions)

Since I assume you have more than two inputs to iterate over (text, author, publication, year, etc.), I used purrr::pmap. Like map/map2, pmap runs a function over one or more lists/vectors; here the function takes the corresponding entries of positions and authors on each iteration and uses them to index into articles:

books <- purrr::pmap(
  list(positions, authors),
  function(position, author) {
    cbind(
      data.frame(text = articles[seq(position[1] + 1, position[2] - 1, 1), ]),
      data.frame(author = articles[author, ]))
  })

Since pmap returns a list, we can bind the data.frames inside the list into a single data.frame.

do.call(rbind.data.frame, books)

Result:

                     text            author
1.1   Text Text Text Text  Author: JANE DOE
1.2  Another line of text  Author: JANE DOE
1.3        One more thing  Author: JANE DOE
1.4       End of article.  Author: JANE DOE
2.1     Here is more text  Author: JOHN DOE
2.2      It is still text  Author: JOHN DOE
2.3     But also shorter.  Author: JOHN DOE

Now you can run whatever tidytext analysis you want.
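To get all the way to the asker's target data frame, here is one self-contained sketch. The row offsets are assumptions based on the sample layout (the "Date | Publisher" row sits one row above "Author:", and the title two rows above), and the articles data frame is a trimmed-down stand-in for the one built above; it ends with the asker's unnest_tokens step:

```r
library(dplyr)
library(purrr)
library(tidytext)

# Trimmed-down stand-in for the articles data frame built above
articles <- data.frame(V1 = c(
  "This is a Title",
  "December 15, 2005 | Publisher",
  "Author: JANE DOE",
  "Section: Movies and More",
  "2554 Words",
  "Page: C3",
  "OpenURL",
  "Link",
  "Text Text Text Text",
  "Another line of text",
  "Citation (asa Style)"),
  stringsAsFactors = FALSE)

pos       <- grep("Link|Citation", articles$V1)
positions <- split(pos, ceiling(seq_along(pos) / 2))
authors   <- grep("Author", articles$V1)

# Assumed layout: title two rows above "Author:",
# "Date | Publisher" one row above it
tidy_articles <- pmap_dfr(
  list(positions, authors),
  function(position, author) {
    header <- strsplit(articles$V1[author - 1], " \\| ")[[1]]
    data.frame(
      Date    = header[1],
      Journal = header[2],
      Title   = articles$V1[author - 2],
      Author  = sub("Author: ", "", articles$V1[author]),
      Text    = articles$V1[seq(position[1] + 1, position[2] - 1)],
      stringsAsFactors = FALSE)
  }) %>%
  group_by(Title) %>%
  mutate(Line = row_number()) %>%
  ungroup()

# One-token-per-row, as in the question
tidy_dat <- tidy_articles %>% unnest_tokens(word, Text)
```

The Line column is numbered per article via group_by(Title); if two articles can share a title, group on a running article id instead.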