Check for multiple words in string match for text search in R
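Currently I have code that searches for a single word. Can we search for multiple words and write the matched words to a data frame? (For context, see this solution, which works for one word.)
Here is the code: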
library(pdftools)
library(tesseract)
library(stringi)  # provides stri_detect_fixed()

All_files <- Sys.glob("*.pdf")
v1 <- numeric(length(All_files))  # first matching page per file, 0 = no match
word <- "school"
df <- data.frame()
Status <- "Present"
for (i in seq_along(All_files)) {
  file_name <- All_files[i]
  cnt <- pdf_info(All_files[i])$pages
  print(cnt)
  for (j in seq_len(cnt)) {
    # Render page j as a TIFF image and run OCR on it
    img_file <- pdftools::pdf_convert(All_files[i], format = 'tiff', pages = j, dpi = 400)
    text <- ocr(img_file)
    ocr_text <- capture.output(cat(text))
    check <- sapply(ocr_text, paste, collapse = "")
    file.remove(img_file)  # delete the temporary TIFF
    br <- if (length(which(stri_detect_fixed(tolower(check), tolower(word)))) <= 0) "Not Present" else "Present"
    print(br)
    if (br == "Present") {
      v1[i] <- j
      break
    }
  }
  Status <- if (v1[i] == 0) "Not Present" else "Present"
  pages <- if (v1[i] == 0) "-" else
    paste0(tools::file_path_sans_ext(basename(file_name)), "_", v1[i])
  words <- if (v1[i] == 0) "-" else word
  df <- rbind(df, cbind(file_name = basename(file_name),
                        Status, pages = pages, words = words))
}
Here we search for only one word, school. Can we search for multiple words such as school, gym, and swimming pool?
Expected output:
fileName   Status       Page     Words          TEXT
test.pdf   Present      test_1   gym            I go gym regularly
test.pdf   Present      test_3   school         Here is the next school
test1.pdf  Present      test1_4  swimming pool  In swimming pool
test1.pdf  Present      test1_7  gym            next to Gold gym
test2.pdf  Not Present  -        -
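fileName = the file name
Status = "Present" if any of the words is found, else "Not Present"
Page = here "_1" and "_3" give the page number on which the word was found; the word "gym" was found on page "test_1" and the word "school" on page "test_3".
Words = all the words that were found; for example, only "gym" and "school" were found on pages 1 and 3 of the test.pdf file, and only "swimming pool" and "gym" on pages 4 and 7 of the test1.pdf file.
TEXT = the text in which the word was found
Any suggestions on this would be helpful.
Thanks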
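You use an outer loop to go through every PDF in the directory. In the inner loop you then go through all pages of the PDF and extract the text. You want to check, for each document, whether at least one page contains school, gym, or swimming pool. The return values you want are:
- A vector with one entry per PDF document, containing Present or Not present.
- Three vectors of strings describing which word occurred, and on which page of which file.
Right?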
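To make that check concrete before any PDF handling is involved, here is a minimal sketch on in-memory text. The vector `page_texts` is a hypothetical stand-in for the OCR output of a single document (one string per page), and the names `hits`, `found_words` and `found_pages` are made up for illustration. It assumes literal, case-insensitive matching is what is wanted.

```r
# Hypothetical stand-in for OCR output: one string per page of one PDF
page_texts <- c("I go gym regularly", "Nothing here", "Here is the next school")
strings <- c("school", "gym", "swimming pool")

# hits[p, w] is TRUE when page p contains word w (literal match, lowercased)
hits <- vapply(strings,
               function(w) grepl(w, tolower(page_texts), fixed = TRUE),
               logical(length(page_texts)))
# Force a pages-by-words matrix even when the document has a single page
hits <- matrix(hits, nrow = length(page_texts), ncol = length(strings),
               dimnames = list(NULL, strings))

status <- if (any(hits)) "Present" else "Not present"
found_words <- strings[colSums(hits) > 0]  # words seen on at least one page
found_pages <- which(rowSums(hits) > 0)    # pages with at least one word
```

Keeping the full page-by-word matrix, rather than stopping at the first hit, is what lets you report every word and every page afterwards.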
You can skip a few steps in your loop, especially when converting the PDFs to TIFF and reading the text from them with ocr:
library(pdftools)
library(tesseract)

all_files <- Sys.glob("*.pdf")
strings <- c("school", "gym", "swimming pool")

# Read text from the PDFs: one lowercased string per page
texts <- lapply(all_files, function(x) {
  img_file <- pdf_convert(x, format = "tiff", dpi = 400)
  tolower(ocr(img_file))
})

# Check for the presence of each word in strings
pages <- words <- vector("list", length(texts))
for (d in seq_along(texts)) {
  for (w in seq_along(strings)) {
    # Indices of the pages that contain the word (literal match, not regex)
    intermed <- grep(strings[w], texts[[d]], fixed = TRUE)
    words[[d]] <- c(words[[d]],
                    strings[w][length(intermed) > 0])
    pages[[d]] <- unique(c(pages[[d]], intermed))
  }
}

# Organize the data so that it suits your wanted output
fileName <- tools::file_path_sans_ext(basename(all_files))
Page <- unlist(Map(paste0, fileName, "_", pages, collapse = ", "))
Page[lengths(pages) == 0] <- "-"  # no word found on any page
Words <- sapply(words, paste0, collapse = ", ")
Status <- ifelse(nchar(Words) > 0, "Present", "Not present")
data.frame(row.names = fileName, Status = Status, Page = Page, Words = Words)
# Status Page Words
# pdf1 Present pdf1_1, pdf1_2 gym, swimming pool
# pdf2 Present pdf2_2, pdf2_5, pdf2_8, pdf2_3, pdf2_6 school, gym, swimming pool
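It's not as readable as I would like. That is probably because the small requirements on the output need a few minor intermediate steps, which make the code look a bit messy. It works well, though.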
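If you would rather have one row per match, as in the expected output of the question, a long-format variant could look like the sketch below. It is only an illustration under stated assumptions: `texts` is a hypothetical named list standing in for the OCR results (one character vector per PDF, one element per page), the TEXT column simply holds the whole page text, and the file names are rebuilt by appending ".pdf" to the list names.

```r
# Hypothetical stand-in for OCR output: one character vector per PDF,
# one element per page (in practice this would come from ocr())
texts <- list(
  test  = c("I go gym regularly", "blank", "Here is the next school"),
  test2 = c("nothing relevant", "still nothing")
)
strings <- c("school", "gym", "swimming pool")

rows <- list()
for (f in names(texts)) {
  found <- FALSE
  for (p in seq_along(texts[[f]])) {
    for (w in strings) {
      # Literal, case-insensitive match of word w on page p
      if (grepl(w, tolower(texts[[f]][p]), fixed = TRUE)) {
        rows[[length(rows) + 1]] <- data.frame(
          fileName = paste0(f, ".pdf"), Status = "Present",
          Page = paste0(f, "_", p), Words = w, TEXT = texts[[f]][p],
          stringsAsFactors = FALSE)
        found <- TRUE
      }
    }
  }
  # Files without any match still get one placeholder row
  if (!found) {
    rows[[length(rows) + 1]] <- data.frame(
      fileName = paste0(f, ".pdf"), Status = "Not Present",
      Page = "-", Words = "-", TEXT = "-", stringsAsFactors = FALSE)
  }
}
result <- do.call(rbind, rows)
print(result)
```

Collecting the rows in a list and binding them once at the end avoids growing a data frame inside the loop. To show only the matching line instead of the whole page, you could split each page text on newlines before matching.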