Extract a subpart of PDF text in R
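I have a list of .pdf files in a folder. I want to take the first two paragraphs of text from each file and store them in a .csv file. I can convert the PDF to text, but I cannot extract the first two paragraphs.
This is what I have tried: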
setwd("D/All_PDF_Files")
install.packages("pdftools")
install.packages("qdapRegex")
library(pdftools)
library(qdapRegex)
All_files=Sys.glob("*.pdf")
txt <- pdf_text("first.pdf")
cat(txt[1])
rm_between(txt, 'This ', '1. ', extract=TRUE)[[1]]
But this gives me "NA".
The output of cat(txt[1]) is:
"Maharashtra Real Estate Regulatory Authority
REGISTRATION CERTIFICATE OF PROJECT
FORM 'C'
[See rule 6(a)]
This registration is granted under section 5 of the Act to the following project under project registration number :
P52100000255
Project: Ganga Legend A3 And B3.., Plot Bearing / CTS / Survey / Final Plot No.: Sr No 305 P , 306 P and 339 P ,
Village Bavdhan Budruk, Taluka Mulashi,District Pune at Pune (M Corp.), Pune City, Pune, 411001;
1. Goel Ganga Developers (I) Pvt Ltd having its registered office / principal place of business at Tehsil: Pune City,
District: Pune, Pin: 411001.
2. This registration is granted subject to the following conditions, namely:"
The text I want to extract is:
This registration is granted under section 5 of the Act to the following project under project registration number :
P52100000255
Project: Ganga Legend A3 And B3.., Plot Bearing / CTS / Survey / Final Plot No.: Sr No 305 P , 306 P and 339 P ,
Village Bavdhan Budruk, Taluka Mulashi,District Pune at Pune (M Corp.), Pune City, Pune, 411001;
Is there a better way to do this?
library(pdftools)
setwd("D/All_PDF_Files")
All_files <- Sys.glob("*.pdf")
df <- data.frame()
for (i in seq_along(All_files))
{
  file_name <- All_files[i]
  txt <- pdf_text(file_name)
  # split the first page into lines once; strsplit() only uses the first
  # element of `split` here, so one regex covers \r\n, \r and \n
  lines <- unlist(strsplit(txt[1], split = "\r\n|\r|\n"))
  # skip first 4 header rows (you may need to adjust this count according to your use case)
  FirstPara <- lines[1 + 4]
  SecondPara <- lines[2 + 4]
  df <- rbind(df, data.frame(file_name, FirstPara, SecondPara, stringsAsFactors = FALSE))
}
df
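The question also asks for the result to be stored in a .csv file, which the loop above stops short of. A minimal sketch of that last step follows. `extract_lines` is a hypothetical helper introduced here, the sample string stands in for `pdf_text(file)[1]`, and the output goes to a temporary file rather than any path named in the question:

```r
# Hypothetical helper: split one page of text into lines and return the
# lines at the requested positions after skipping `skip` header rows.
extract_lines <- function(page_text, positions, skip = 4) {
  lines <- unlist(strsplit(page_text, split = "\r\n|\r|\n"))
  trimws(lines[positions + skip])
}

# Sample text standing in for pdf_text(file)[1]
page <- paste(
  "Maharashtra Real Estate Regulatory Authority",
  "REGISTRATION CERTIFICATE OF PROJECT",
  "FORM 'C'",
  "[See rule 6(a)]",
  "This registration is granted under section 5 of the Act to the following project under project registration number :",
  "P52100000255",
  sep = "\n"
)

paras <- extract_lines(page, positions = 1:2)
df <- data.frame(file_name = "first.pdf",
                 FirstPara = paras[1],
                 SecondPara = paras[2],
                 stringsAsFactors = FALSE)

# Write the table to disk; a temporary file keeps this sketch from
# touching the working directory
out_file <- tempfile(fileext = ".csv")
write.csv(df, out_file, row.names = FALSE)
```

In a real run the data frame built by the loop would be passed to `write.csv()` once, after the loop has finished, with a file name of your choosing.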
Posting the answer I ended up with, built on @Prem's code, in case anyone needs it.
library(pdftools)
All_files <- Sys.glob("*.pdf")
df <- data.frame()
for (i in seq_along(All_files))
{
  file_name <- All_files[i]
  txt <- pdf_text(file_name)
  # split the first page into lines once; one regex covers \r\n, \r and \n
  lines <- unlist(strsplit(txt[1], split = "\r\n|\r|\n"))
  FirstPara <- lines[1 + 4]
  SecondPara <- lines[2 + 4]
  ThirdPara <- lines[3 + 4]
  # keep only the project name: the text between the first ":" and the next ","
  ThirdPara_new <- sub("[^:]+:\\s*([^,]+),.*", "\\1", ThirdPara)
  # the promoter entry wraps over two lines, so join them before cleaning
  t1 <- lines[4 + 4]
  t2 <- lines[5 + 4]
  conct <- paste(t1, t2)
  # strip the leading "1. " and everything from "having" or "son" onwards
  FourthPara <- gsub(".*1\\. \\s*|having.*|son.*", "", conct)
  df <- rbind(df, data.frame(file_name, SecondPara, ThirdPara_new, FourthPara, stringsAsFactors = FALSE))
}
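Counting lines may be fragile if the number of header rows differs between files. One alternative is a single regular expression that is allowed to span line breaks. The sketch below assumes the wanted block always starts at "This registration" and ends just before the line that begins with "1. ". `extract_block` is a hypothetical helper, and the sample string stands in for `txt[1]`:

```r
# Hypothetical helper: return the text from "This registration" up to,
# but not including, the line that starts with "1. ".
extract_block <- function(page_text) {
  # (?s) lets "." match newlines; the lookahead stops before the
  # numbered item without consuming it
  pattern <- "(?s)This registration.*?(?=[\r\n]+\\s*1\\. )"
  m <- regmatches(page_text, regexpr(pattern, page_text, perl = TRUE))
  if (length(m) == 0) NA_character_ else m
}

# Sample text standing in for txt[1]
page <- paste(
  "[See rule 6(a)]",
  "This registration is granted under section 5 of the Act to the following project under project registration number :",
  "P52100000255",
  "Project: Ganga Legend A3 And B3.., Plot Bearing / CTS / Survey / Final Plot No.: Sr No 305 P , 306 P and 339 P ,",
  "Village Bavdhan Budruk, Taluka Mulashi,District Pune at Pune (M Corp.), Pune City, Pune, 411001;",
  "1. Goel Ganga Developers (I) Pvt Ltd having its registered office / principal place of business at Tehsil: Pune City,",
  sep = "\n"
)

block <- extract_block(page)
# Split the block back into lines if each one should become its own column
block_lines <- unlist(strsplit(block, split = "\r\n|\r|\n"))
```

The helper returns NA when the markers are not found, so files with a different layout show up as missing values instead of stopping the loop.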