From PDF text to a tidy data frame with file names in the document column
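I want to analyze the text of almost 300 PDF documents. I currently use the pdftools, tm, and tidytext packages to read the text, convert it to a corpus, then to a document-term matrix, and finally I want to structure it as a tidy data frame.
I have a few questions:
- How do I remove page data (the headers and/or footers at the top and bottom of each PDF page)?
- I would rather have the file names as the values in the document column instead of index numbers.
- The code below includes only 2 PDF files for reproducibility. When I run it on all my files, I get 294 documents in the corpus object, but after tidying I seem to lose some files, because converted %>% distinct(document) returns 275. I would like to know why.
I have the following reproducible script: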
library(tidyverse)
library(tidytext)
library(pdftools)
library(tm)
library(broom)
# Create a temporary empty directory
# (don't worry at the end of this script I'll remove this directory and its files)
dir.create("~/Desktop/sample-pdfs")
# Fill directory with 2 pdf files from my github repo
download.file("https://github.com/thomasdebeus/colourful-facts/raw/master/projects/sample-data/'s-Gravenhage_coalitieakkoord.pdf", destfile = "~/Desktop/sample-pdfs/'s-Gravenhage_coalitieakkoord.pdf")
download.file("https://github.com/thomasdebeus/colourful-facts/raw/master/projects/sample-data/Aa%20en%20Hunze_coalitieakkoord.pdf", destfile = "~/Desktop/sample-pdfs/Aa en Hunze_coalitieakkoord.pdf")
# Create vector of file paths
dir <- "~/Desktop/sample-pdfs"
pdfs <- paste(dir, "/", list.files(dir, pattern = "*.pdf"), sep = "")
# Read the text from pdf's with pdftools package
pdfs_text <- map(pdfs, pdf_text)
# Convert to document-term-matrix
converted <- Corpus(VectorSource(pdfs_text)) %>%
  DocumentTermMatrix()
# Now I want to convert this to a tidy format
converted %>%
  tidy() %>%
  filter(!grepl("[0-9]+", term))
This gives the following output:
# A tibble: 5,305 x 3
document term count
<chr> <chr> <dbl>
1 1 aan 158
2 1 aanbesteding 2
3 1 aanbestedingen 1
4 1 aanbevelingen 1
5 1 aanbieden 3
6 1 aanbieders 1
7 1 aanbod 8
8 1 aandacht 16
9 1 aandachtspunt 3
10 1 aandeel 1
# ... with 5,295 more rows
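This seems to work, but I would rather have the file names ("'s-Gravenhage" and "Aa en Hunze") as the values in the document column instead of index numbers. How do I do that?
Desired output: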
# A tibble: 5,305 x 3
document term count
<chr> <chr> <dbl>
1 's-Gravenhage aan 158
2 's-Gravenhage aanbesteding 2
3 's-Gravenhage aanbestedingen 1
4 's-Gravenhage aanbevelingen 1
5 's-Gravenhage aanbieden 3
6 's-Gravenhage aanbieders 1
7 's-Gravenhage aanbod 8
8 's-Gravenhage aandacht 16
9 's-Gravenhage aandachtspunt 3
10 's-Gravenhage aandeel 1
# ... with 5,295 more rows
To remove the downloaded files and their directory from the desktop, run the following line:
unlink("~/Desktop/sample-pdfs", recursive = TRUE)
Any help is much appreciated!
I would suggest writing a wrapper function for what you want to do, so that you can add each file name as a column.
read_PDF <- function(file){
  pdfs_text <- pdf_text(file)
  converted <- Corpus(VectorSource(pdfs_text)) %>%
    DocumentTermMatrix()
  converted %>%
    tidy() %>%
    filter(!grepl("[0-9]+", term)) %>%
    # add FileName as a column
    mutate(FileName = file)
}
final <- map(pdfs, read_PDF) %>% data.table::rbindlist()
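Note that FileName above holds the full path, while the desired output in the question shows a short label such as "'s-Gravenhage". A minimal base R sketch of deriving that label follows. The helper name doc_label and the "_coalitieakkoord" suffix are assumptions taken from the two sample file names, not part of any package.

```r
# Hypothetical helper: reduce a full path to the short label used in the
# question's desired output. The suffix is treated as a regular expression.
doc_label <- function(path, suffix = "_coalitieakkoord") {
  # drop the directory and the ".pdf" extension
  stem <- tools::file_path_sans_ext(basename(path))
  # drop the shared suffix at the end of the stem, if present
  sub(paste0(suffix, "$"), "", stem)
}

paths <- c("~/Desktop/sample-pdfs/'s-Gravenhage_coalitieakkoord.pdf",
           "~/Desktop/sample-pdfs/Aa en Hunze_coalitieakkoord.pdf")
labels <- doc_label(paths)
print(labels)
```

Inside the wrapper this could replace the last step, for example mutate(FileName = doc_label(file)), if the short label is preferred over the path.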
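Great example!
- I added a few lines to attach the names.
- I am not sure about the lost files; I did not run into that.
- Note that your file names are not very standard. I suggest double-checking them; the first file name starts with an apostrophe. I would also suggest removing the spaces.
- I tested with English documents; you can specify a different language for the corpus.
The code is as follows: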
library(tidyverse)
library(tidytext)
library(pdftools)
library(tm)
library(broom)
# Directory that contains the PDF files
dir <- "PDFs/"
pdfs <- paste0(dir, list.files(dir, pattern = "\\.pdf$"))
names <- list.files(dir, pattern = "\\.pdf$")
# create a table of names, keyed by position in the file list
namesDocs <-
  names %>%
  str_remove(pattern = "\\.pdf$") %>%
  as_tibble() %>%
  mutate(ids = as.character(seq_along(names)))
namesDocs
# Read the text from pdf's with pdftools package
pdfs_text <- map(pdfs, pdftools::pdf_text)
# Convert to document-term-matrix
# add cleaning process
converted <-
  Corpus(VectorSource(pdfs_text)) %>%
  DocumentTermMatrix(
    control = list(removeNumbers = TRUE,
                   stopwords = TRUE,
                   removePunctuation = TRUE))
converted
# Now I want to convert this to a tidy format
# add names of documents
mytable <-
  converted %>%
  tidy() %>%
  arrange(desc(count)) %>%
  left_join(y = namesDocs, by = c("document" = "ids"))
head(mytable)
View(mytable)
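The join above works because the corpus built from VectorSource labels documents by their position ("1", "2", ...) in the input. The same idea can be sketched in base R with a named lookup vector that overwrites the document column in place. The tidy_counts data frame below is toy data with made-up counts standing in for the output of tidy(); it assumes the ids are positions in the file list.

```r
# Toy stand-in for the tidied document-term matrix (counts are made up)
tidy_counts <- data.frame(
  document = c("1", "1", "2"),
  term     = c("aan", "aanbod", "aan"),
  count    = c(3, 1, 2),
  stringsAsFactors = FALSE
)

file_names <- c("'s-Gravenhage_coalitieakkoord.pdf",
                "Aa en Hunze_coalitieakkoord.pdf")

# Lookup: position in the file list (as character) -> name without ".pdf"
lookup <- setNames(sub("\\.pdf$", "", file_names), seq_along(file_names))

# Replace the index with the file name
tidy_counts$document <- unname(lookup[tidy_counts$document])
print(tidy_counts)
```

This only holds if the file list and the text list are in the same order, which is the case when both come from the same list.files() call.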
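You can use tm to read the documents directly into a corpus. The readPDF reader uses pdftools as its engine, so there is no need to first build a set of texts and then push it through a corpus. I created 2 examples. The first follows what you are doing, but goes through the corpus first. The second is based purely on tidyverse + tidytext, with no need to switch between tm, tidytext, etc.
The difference in the number of tokens between the examples is due to the automatic cleaning in tidytext / tokenizer.
If you have a lot of documents to process, you may want to use quanteda as your workhorse, since it can work across multiple cores out of the box and may speed up the tokenizer part. Don't forget to use the stopwords package to get a good list of Dutch stop words. If you need POS tagging for Dutch words, check out the udpipe package.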
library(tidyverse)
library(tidytext)
library(tm)
directory <- "D:/sample-pdfs"
# create corpus from pdfs
converted <- VCorpus(DirSource(directory), readerControl = list(reader = readPDF)) %>%
  DocumentTermMatrix()
converted %>%
  tidy() %>%
  filter(!grepl("[0-9]+", term))
# A tibble: 5,707 x 3
document term count
<chr> <chr> <dbl>
1 's-Gravenhage_coalitieakkoord.pdf "\ade" 4
2 's-Gravenhage_coalitieakkoord.pdf "\adeze" 1
3 's-Gravenhage_coalitieakkoord.pdf "\aeen" 2
4 's-Gravenhage_coalitieakkoord.pdf "\aer" 2
5 's-Gravenhage_coalitieakkoord.pdf "\aextra" 2
6 's-Gravenhage_coalitieakkoord.pdf "\agroei" 1
7 's-Gravenhage_coalitieakkoord.pdf "\ahet" 1
8 's-Gravenhage_coalitieakkoord.pdf "\amet" 1
9 's-Gravenhage_coalitieakkoord.pdf "\aonderwijs," 1
10 's-Gravenhage_coalitieakkoord.pdf "\aop" 11
# ... with 5,697 more rows
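On the question of 294 documents shrinking to 275: tidy() on a document-term matrix returns one row per non-zero entry, so a document that contributes no terms never appears in the tidy output. One possible cause, which is an assumption and not verified against the asker's files, is PDFs from which no text can be extracted (for example scanned images). A minimal base R sketch for spotting such documents before building the corpus, using toy data in place of the list returned by map(pdfs, pdf_text):

```r
# Toy stand-in for the named list of per-page text vectors
pages_by_doc <- list(
  "a.pdf"    = c("Tekst op pagina 1", "Meer tekst"),
  "scan.pdf" = c("", "  \n")   # nothing but whitespace was extracted
)

# A document has text if at least one page is non-empty after trimming
has_text <- vapply(
  pages_by_doc,
  function(pages) any(nzchar(trimws(pages))),
  logical(1)
)

empty_docs <- names(has_text)[!has_text]
print(empty_docs)
```

Documents whose only terms contain digits would also disappear after the filter(!grepl("[0-9]+", term)) step, so comparing the document list before and after that filter may help as well.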
Using only tidytext, without tm:
directory <- "D:/sample-pdfs"
pdfs <- paste(directory, "/", list.files(directory, pattern = "*.pdf"), sep = "")
pdf_names <- list.files(directory, pattern = "*.pdf")
pdfs_text <- map(pdfs, pdftools::pdf_text)
my_data <- tibble(document = pdf_names, text = pdfs_text)
my_data %>%
  unnest(text) %>% # pdfs_text is a list, one character vector of pages per file
  unnest_tokens(word, text, strip_numeric = TRUE) %>% # drops purely numeric tokens
  group_by(document, word) %>%
  summarise(count = n())
# A tibble: 4,646 x 3
# Groups: document [?]
document word count
<chr> <chr> <int>
1 's-Gravenhage_coalitieakkoord.pdf 1e 2
2 's-Gravenhage_coalitieakkoord.pdf 2e 2
3 's-Gravenhage_coalitieakkoord.pdf 3e 1
4 's-Gravenhage_coalitieakkoord.pdf 4e 1
5 's-Gravenhage_coalitieakkoord.pdf aan 164
6 's-Gravenhage_coalitieakkoord.pdf aanbesteding 2
7 's-Gravenhage_coalitieakkoord.pdf aanbestedingen 1
8 's-Gravenhage_coalitieakkoord.pdf aanbestedingsprocedures 1
9 's-Gravenhage_coalitieakkoord.pdf aanbevelingen 1
10 's-Gravenhage_coalitieakkoord.pdf aanbieden 4
# ... with 4,636 more rows
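Neither example removes the page headers and footers asked about in the question. Since pdf_text returns one string per page, one rough option is to drop the first and last lines of every page before tokenizing. The sketch below is base R with a hypothetical helper strip_page_edges; it assumes the header and footer each occupy exactly one line, which has to be checked against the real PDFs.

```r
# Hypothetical helper: drop the first n_top and last n_bottom lines of a page
strip_page_edges <- function(page, n_top = 1, n_bottom = 1) {
  lines <- strsplit(page, "\n", fixed = TRUE)[[1]]
  idx <- seq_along(lines)
  keep <- idx > n_top & idx <= length(lines) - n_bottom
  paste(lines[keep], collapse = "\n")
}

# Toy pages standing in for the output of pdf_text() on one file
pages <- c("Coalitieakkoord 2018\nEerste alinea\nTweede alinea\n1",
           "Coalitieakkoord 2018\nDerde alinea\n2")

cleaned <- vapply(pages, strip_page_edges, character(1), USE.NAMES = FALSE)
print(cleaned)
```

In the pipeline above it could be applied to the text column after unnest(text), one page per row. If the headers vary in height, matching them with a pattern is likely more robust than counting lines.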
I think the simplest approach I have found online is from Julien Brun's text mining tutorial.
You need two packages:
library("readtext")
library("quanteda")
For this code, name your PDFs Author_date and put them in a folder inside your working directory. For example, I put my PDFs in the PDFs folder.
# set path to the PDF
pdf_path <- "PDFs/"
# List the PDFs
pdfs <- list.files(path = pdf_path, pattern = 'pdf$', full.names = TRUE)
# Import the PDFs into R
spill_texts <- readtext(pdfs,
docvarsfrom = "filenames",
sep = "_",
docvarnames = c("First_author", "Year"))
# Transform the pdfs into a corpus object
spill_corpus <- corpus(spill_texts)
spill_corpus
# Some stats about the pdfs
tokenInfo <- summary(spill_corpus)
tokenInfo
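To make the file name convention concrete, here is a base R sketch of the kind of split that docvarsfrom = "filenames" with sep = "_" relies on. It only illustrates the idea and is not readtext's implementation; the file names are hypothetical.

```r
# Hypothetical file names following the Author_date convention
file_names <- c("Smith_2018.pdf", "Jones_2020.pdf")

# Drop the extension, then split each stem on the underscore
stems <- tools::file_path_sans_ext(file_names)
parts <- do.call(rbind, strsplit(stems, "_", fixed = TRUE))

# One column per piece of the file name
docvars <- data.frame(
  First_author = parts[, 1],
  Year         = as.integer(parts[, 2]),
  stringsAsFactors = FALSE
)
print(docvars)
```

With the files from the question, a name like 's-Gravenhage_coalitieakkoord.pdf would split into the municipality and the word "coalitieakkoord", so the docvarnames would need to be chosen accordingly.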