Parsing text for analysis in R

I have a .txt file containing short articles, and I would like to use R to build a dataset that parses each article and extracts the date, author, journal, title, line number, and the text of each line into a data frame. Every article repeats the same structure and takes the following format:

This is a Title  
December 15, 2005 | Publisher  
Author: JANE DOE  
Section: Movies and More  
2554 Words
Page: C3  
OpenURL  
Link  

Text Text Text Text   
Another line of text  
One more thing  
End of article.   

Citation (asa Style)  
DOE, JANE. 2005. "This is a Title," Publisher, December 15, pp.C3.

Different Title  
December 18, 2005 | Publisher  
Author: JOHN DOE  
Section: News 
662 Words
Page: C8  
OpenURL  
Link  

Here is more text   
It is still text
But also shorter.  

Citation (asa Style)  
DOE, JOHN. 2005. "Different Title," Publisher, December 18, pp.C8.

For each article, I would like to extract the author, publication date, and journal, plus each line of text, to create a data frame like the following:

Date           Journal       Title             Author            Line              Text
15-Dec-2005    Publication   This is a title   Doe, Jane         1                 Text Text Text Text
15-Dec-2005    Publication   This is a title   Doe, Jane         2                 Another line of text
15-Dec-2005    Publication   This is a title   Doe, Jane         3                 One more thing
15-Dec-2005    Publication   This is a title   Doe, Jane         4                 End of article.
18-Dec-2005    Publication   Different Title   Doe, John         1                 Here is more text 
18-Dec-2005    Publication   Different Title   Doe, John         2                 It is still text
18-Dec-2005    Publication   Different Title   Doe, John         3                 But also shorter.

I would then like to convert the data frame above (call it text_df) into tidy text format, restructured one token per row, using the code below:

library(tidytext)
tidy_dat <- text_df %>%
  unnest_tokens(word, text)

I know this is a big ask. Any help would be greatly appreciated.

Since you have many fields to extract, I have sketched out the main idea; you can take it from there :)

First, load the tidyverse and the articles:

library(tidyverse)
articles <- read.delim("C:/Your_Path/temp.txt",
                      stringsAsFactors = FALSE, header = FALSE)
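One caveat: read.delim defaults to quote = "\"", so quotation marks in the text (like those in the citation lines above) are subject to quoting rules. readLines reads every line verbatim and sidesteps this; a small self-contained sketch (with a hypothetical sample written to a temp file for illustration):

```r
# Hypothetical sample written to a temp file for illustration
tmp <- tempfile(fileext = ".txt")
writeLines(c("This is a Title",
             "Author: JANE DOE",
             "",
             "Citation with \"embedded quotes\""), tmp)

# readLines reads every line verbatim, with no quoting rules applied
raw_lines <- readLines(tmp)

# Drop blank lines to mirror read.delim's blank.lines.skip = TRUE,
# then rebuild the one-column data frame the rest of the answer uses
articles <- data.frame(V1 = raw_lines[raw_lines != ""],
                       stringsAsFactors = FALSE)
```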

We can use grep to find the positions of "Link" and "Citation" in the text.

positions <- grep(pattern = "Link|Citation", 
                 x = articles$V1)

Because "Link" always appears before the citation, we can split the positions into a list of pairs.

positions <- split(positions, ceiling(seq_along(positions)/2))
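As a quick illustration with made-up line numbers, ceiling(seq_along(x) / 2) maps the 1st and 2nd elements to group 1, the 3rd and 4th to group 2, and so on, so split() yields one (Link, Citation) pair per article:

```r
# Hypothetical "Link"/"Citation" line numbers for two articles
positions <- c(14, 21, 31, 37)

# ceiling(1:4 / 2) is 1 1 2 2, grouping consecutive pairs
pairs <- split(positions, ceiling(seq_along(positions) / 2))

pairs[["1"]]  # first article's Link and Citation lines
pairs[["2"]]  # second article's
```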

Now we can extract the positions of the authors with the same idea (grep).

authors <- grep(pattern = "Author", 
               x = articles$V1)

It is always good to check that the vectors/lists have the same length; that way you can see whether you extracted more authors than Link/Citation pairs.

length(authors) == length(positions)

Since I assume you have more than two inputs to iterate over (text, author, publication, year, etc.), I used purrr::pmap. Like map/map2, pmap runs a function over one or more lists/vectors; here the function takes the corresponding entries of positions and authors on each iteration and uses them to index into articles:

books <- purrr::pmap(
  list(positions, authors),
  function(position, author) {
    cbind(
      data.frame(text = articles[seq(position[1] + 1, position[2] - 1, 1), ]),
      data.frame(author = articles[author, ]))
  })

Since pmap returns a list, we can bind the data.frames inside the list into a single data.frame.

do.call(rbind.data.frame, books)

Result:

                     text            author
1.1   Text Text Text Text  Author: JANE DOE
1.2  Another line of text  Author: JANE DOE
1.3        One more thing  Author: JANE DOE
1.4       End of article.  Author: JANE DOE
2.1     Here is more text  Author: JOHN DOE
2.2      It is still text  Author: JOHN DOE
2.3     But also shorter.  Author: JOHN DOE

Now you can run whatever tidytext analysis you want.
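To get all the way to the asker's target data frame, here is one self-contained sketch. The row offsets are assumptions based on the sample layout (the "Date | Publisher" row sits one row above "Author:", and the title two rows above), and the articles data frame is a trimmed-down stand-in for the one built above; it ends with the asker's unnest_tokens step:

```r
library(dplyr)
library(purrr)
library(tidytext)

# Trimmed-down stand-in for the articles data frame built above
articles <- data.frame(V1 = c(
  "This is a Title",
  "December 15, 2005 | Publisher",
  "Author: JANE DOE",
  "Section: Movies and More",
  "2554 Words",
  "Page: C3",
  "OpenURL",
  "Link",
  "Text Text Text Text",
  "Another line of text",
  "Citation (asa Style)"),
  stringsAsFactors = FALSE)

pos       <- grep("Link|Citation", articles$V1)
positions <- split(pos, ceiling(seq_along(pos) / 2))
authors   <- grep("Author", articles$V1)

# Assumed layout: title two rows above "Author:",
# "Date | Publisher" one row above it
tidy_articles <- pmap_dfr(
  list(positions, authors),
  function(position, author) {
    header <- strsplit(articles$V1[author - 1], " \\| ")[[1]]
    data.frame(
      Date    = header[1],
      Journal = header[2],
      Title   = articles$V1[author - 2],
      Author  = sub("Author: ", "", articles$V1[author]),
      Text    = articles$V1[seq(position[1] + 1, position[2] - 1)],
      stringsAsFactors = FALSE)
  }) %>%
  group_by(Title) %>%
  mutate(Line = row_number()) %>%
  ungroup()

# One-token-per-row, as in the question
tidy_dat <- tidy_articles %>% unnest_tokens(word, Text)
```

The Line column is numbered per article via group_by(Title); if two articles can share a title, group on a running article id instead.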