Parsing text for analysis in R
I have a .txt file containing short articles, and I would like to use R to parse each article and build a data frame with the date, author, journal, title, line number, and text of each line of every article. The text data for each article repeats the same structure and is in the following format:
This is a Title
December 15, 2005 | Publisher
Author: JANE DOE
Section: Movies and More
2554 Words
Page: C3
OpenURL
Link
Text Text Text Text
Another line of text
One more thing
End of article.
Citation (asa Style)
DOE, JANE. 2005. "This is a Title," Publisher, December 15, pp.C3.
Different Title
December 18, 2005 | Publisher
Author: JOHN DOE
Section: News
662 Words
Page: C8
OpenURL
Link
Here is more text
It is still text
But also shorter.
Citation (asa Style)
DOE, JOHN. 2005. "Different Title," Publisher, December 18, pp.C8.
For each article, I want to extract the author, publication date, journal, and each line of text to create a data frame that looks like this:
Date Journal Title Author Line Text
15-Dec-2005 Publication This is a title Doe, Jane 1 Text Text Text Text
15-Dec-2005 Publication This is a title Doe, Jane 2 Another line of text
15-Dec-2005 Publication This is a title Doe, Jane 3 One more thing
15-Dec-2005 Publication This is a title Doe, Jane 4 End of article.
18-Dec-2005 Publication Different Title Doe, John 1 Here is more text
18-Dec-2005 Publication Different Title Doe, John 2 It is still text
18-Dec-2005 Publication Different Title Doe, John 3 But also shorter.
I then want to use the code below to convert the data frame above (call it text_df) into tidy text format, restructured in the one-token-per-row format:
library(tidytext)
tidy_dat <- text_df %>%
  unnest_tokens(word, text)
I know this is a big question. Any help would be greatly appreciated.
Since you have many fields to extract, I've sketched out the main idea; you can take it from here :)
First, load the tidyverse and the articles:
library(tidyverse)
articles <- read.delim("C:/Your_Path/temp.txt",
                       stringsAsFactors = FALSE, header = FALSE)
We can use grep to get the positions of "Link" and "Citation" in the text.
positions <- grep(pattern = "Link|Citation",
                  x = articles$V1)
Since Link always appears before the citation, we can split positions into a list of pairs.
positions <- split(positions, ceiling(seq_along(positions)/2))
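As a standalone illustration of the pairing trick (using hypothetical positions, not values from your file): ceiling(seq_along(x)/2) assigns the same group id to each consecutive pair of elements, so split() turns the flat vector of Link/Citation positions into start/end pairs.

```r
# Hypothetical line positions of alternating "Link" and "Citation" markers
positions <- c(8, 16, 21, 29)

# seq_along() gives 1:4; dividing by 2 and taking ceiling() yields 1, 1, 2, 2,
# i.e. the same group id for each consecutive (Link, Citation) pair
groups <- ceiling(seq_along(positions) / 2)

# split() groups the positions by that id, pairing each Link with its Citation
pairs <- split(positions, groups)
# pairs$`1` is c(8, 16); pairs$`2` is c(21, 29)
```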
Now we can use the same idea (grep) to extract the positions of the authors.
authors <- grep(pattern = "Author",
                x = articles$V1)
It's always good to check that the vectors/list have the same length. That way, you can see whether you extracted more authors than links and citations.
length(authors) == length(positions)
Since I assume you have more than two parameters to run over (text, author, publication, year, etc.), I used purrr::pmap. pmap, like map/map2, runs a function over one or more lists/vectors. In this case the function takes, on each iteration, the rows of articles corresponding to positions and author.
books <- purrr::pmap(
  list(positions, authors),
  function(position, author) {
    cbind(
      data.frame(text = articles[seq(position[1] + 1, position[2] - 1, 1), ]),
      data.frame(author = articles[author, ]))})
Since pmap returns a list, we can bind the data.frames inside the list into a single data.frame.
do.call(rbind.data.frame, books)
Result:
text author
1.1 Text Text Text Text Author: JANE DOE
1.2 Another line of text Author: JANE DOE
1.3 One more thing Author: JANE DOE
1.4 End of article. Author: JANE DOE
2.1 Here is more text Author: JOHN DOE
2.2 It is still text Author: JOHN DOE
2.3 But also shorter. Author: JOHN DOE
Now you can run whatever tidytext analysis you want.
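To get closer to the Date/Journal/Title/Author frame you described, the same grep idea extends to the metadata lines. A minimal sketch, assuming the date line always contains " | " separating date and journal, and the title always sits on the line directly above it (a toy articles data frame is defined inline so the snippet is self-contained):

```r
# Tiny self-contained stand-in for the articles data frame (one file line per row)
articles <- data.frame(
  V1 = c("This is a Title",
         "December 15, 2005 | Publisher",
         "Author: JANE DOE"),
  stringsAsFactors = FALSE
)

# Date lines are assumed to be the only lines containing " | "
meta_rows <- grep(" \\| ", articles$V1)

# Title is assumed to be the line above the date line; author the line below it
meta <- data.frame(
  title   = articles$V1[meta_rows - 1],
  date    = sub(" \\|.*$", "", articles$V1[meta_rows]),
  journal = sub("^.* \\| ", "", articles$V1[meta_rows]),
  author  = sub("^Author: ", "", articles$V1[meta_rows + 1]),
  stringsAsFactors = FALSE
)
```

From here, meta can be bound to each article's text (e.g. with the same pmap pattern) and a Line column added with seq_along() per article.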