从 python 数据框中的链接打开、保存和提取文本 PDF
Open, save and extract text PDFs from links in python dataframe
我想遍历保存在 python 数据框中的 PDF link。目标是打开 PDF link,保存 PDF 并从中提取文本,然后将每个对应 link 的文本保存在新列中。
数据框看起来像这样:
URL
0 https://westafricatradehub.com/wp-content/uploads/2021/07/RFA-WATIH-1295_Senegal-RMNCAH-Activity_English-Version.pdf
1 https://westafricatradehub.com/wp-content/uploads/2021/07/RFA-WATIH-1295_Activit%C3%A9-RMNCAH-S%C3%A9n%C3%A9gal_Version-Fran%C3%A7aise.pdf
2 https://westafricatradehub.com/wp-content/uploads/2021/07/Attachment-2_Full-Application-Template_Senegal-RMNCAH-Activity_English-Version.docx
3 https://westafricatradehub.com/wp-content/uploads/2021/07/Pi%C3%A8ce-Jointe-2_Mod%C3%A8le-de-Demande-Complet_Activit%C3%A9-RMNCAH-S%C3%A9n%C3%A9gal_Version-Fran%C3%A7aise.docx
4 https://westafricatradehub.com/wp-content/uploads/2021/07/Attachment-3_Trade-Hub-Performance-Indicators-Table.xlsx
5 https://westafricatradehub.com/wp-content/uploads/2021/07/Attachment-10_Project-Budget-Template-RMNCAH.xlsx
6 https://westafricatradehub.com/wp-content/uploads/2021/08/Senegal-Health-RFA-Webinar-QA.pdf
7 https://westafricatradehub.com/wp-content/uploads/2021/02/APS-WATIH-1021_Catalytic-Business-Concepts-Round-2.pdf
8 https://westafricatradehub.com/wp-content/uploads/2021/02/APS-WATIH-1021_Concepts-d%E2%80%99Affaires-Catalytiques-2ieme-Tour.pdf
9 https://westafricatradehub.com/wp-content/uploads/2021/06/APS-WATIH-1247_Research-Development-Round-2.pdf
我能够为一个 link 做到这一点,但不能为整个数据帧做到这一点
import urllib.request
pdf_link = "https://westafricatradehub.com/wp-content/uploads/2021/07/RFA-WATIH-1295_Senegal-RMNCAH-Activity_English-Version.pdf"
def download_file(download_url, filename):
response = urllib.request.urlopen(download_url)
file = open(filename + ".pdf", 'wb')
file.write(response.read())
file.close()
download_file(pdf_link, "Test")
#Code to extract text from PDF
import textract
text = textract.process("/Users/fze/Dropbox (LCG Team)/LCG Folder (1)/BD Scan Automation/Python codes/Test.PDF")
print(text)
谢谢!
给你:
import urllib.request
import textract
def download_file(download_url, filename):
response = urllib.request.urlopen(download_url)
file = open(filename + ".pdf", 'wb')
file.write(response.read())
file.close()
df['Text']=''
for i in range(df.shape[0]):
pdf_link=df.iloc[i,0]
download_file(pdf_link, f"pdf_{i}")
text = textract.process(f"/Users/fze/Dropbox (LCG Team)/LCG Folder (1)/BD Scan Automation/Python codes/pdf_{i}.PDF")
df['Text'][i]=text
我想遍历保存在 python 数据框中的 PDF link。目标是打开 PDF link,保存 PDF 并从中提取文本,然后将每个对应 link 的文本保存在新列中。
数据框看起来像这样:
URL
0 https://westafricatradehub.com/wp-content/uploads/2021/07/RFA-WATIH-1295_Senegal-RMNCAH-Activity_English-Version.pdf
1 https://westafricatradehub.com/wp-content/uploads/2021/07/RFA-WATIH-1295_Activit%C3%A9-RMNCAH-S%C3%A9n%C3%A9gal_Version-Fran%C3%A7aise.pdf
2 https://westafricatradehub.com/wp-content/uploads/2021/07/Attachment-2_Full-Application-Template_Senegal-RMNCAH-Activity_English-Version.docx
3 https://westafricatradehub.com/wp-content/uploads/2021/07/Pi%C3%A8ce-Jointe-2_Mod%C3%A8le-de-Demande-Complet_Activit%C3%A9-RMNCAH-S%C3%A9n%C3%A9gal_Version-Fran%C3%A7aise.docx
4 https://westafricatradehub.com/wp-content/uploads/2021/07/Attachment-3_Trade-Hub-Performance-Indicators-Table.xlsx
5 https://westafricatradehub.com/wp-content/uploads/2021/07/Attachment-10_Project-Budget-Template-RMNCAH.xlsx
6 https://westafricatradehub.com/wp-content/uploads/2021/08/Senegal-Health-RFA-Webinar-QA.pdf
7 https://westafricatradehub.com/wp-content/uploads/2021/02/APS-WATIH-1021_Catalytic-Business-Concepts-Round-2.pdf
8 https://westafricatradehub.com/wp-content/uploads/2021/02/APS-WATIH-1021_Concepts-d%E2%80%99Affaires-Catalytiques-2ieme-Tour.pdf
9 https://westafricatradehub.com/wp-content/uploads/2021/06/APS-WATIH-1247_Research-Development-Round-2.pdf
我能够为一个 link 做到这一点,但不能为整个数据帧做到这一点
import urllib.request
pdf_link = "https://westafricatradehub.com/wp-content/uploads/2021/07/RFA-WATIH-1295_Senegal-RMNCAH-Activity_English-Version.pdf"
def download_file(download_url, filename):
response = urllib.request.urlopen(download_url)
file = open(filename + ".pdf", 'wb')
file.write(response.read())
file.close()
download_file(pdf_link, "Test")
#Code to extract text from PDF
import textract
text = textract.process("/Users/fze/Dropbox (LCG Team)/LCG Folder (1)/BD Scan Automation/Python codes/Test.PDF")
print(text)
谢谢!
给你:
import urllib.request
import textract
def download_file(download_url, filename):
response = urllib.request.urlopen(download_url)
file = open(filename + ".pdf", 'wb')
file.write(response.read())
file.close()
df['Text']=''
for i in range(df.shape[0]):
pdf_link=df.iloc[i,0]
download_file(pdf_link, f"pdf_{i}")
text = textract.process(f"/Users/fze/Dropbox (LCG Team)/LCG Folder (1)/BD Scan Automation/Python codes/pdf_{i}.PDF")
df['Text'][i]=text