Downloading txt files with requests in Python
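I want to download multiple txt files from an API. I can download the pdf files with the code below. Could anyone help me adapt the request so that it downloads the txt version of each document instead? Thank you very much.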
import time
from pathlib import Path

import requests

links = ["P167897", "P173997", "P166309"]
for link in links:
    # Query the World Bank document search API for one project id.
    end_point = f"https://search.worldbank.org/api/v2/wds?" \
                f"format=json&includepublicdocs=1&" \
                f"fl=docna,lang,docty,repnb,docdt,doc_authr,available_in&" \
                f"os=0&rows=20&proid={link}&apilang=en"
    documents = requests.get(end_point).json()["documents"]
    for document_data in documents.values():
        try:
            pdf_url = document_data["pdfurl"]
            file_path = Path(f"K:/downloading_text/{link}/{pdf_url.rsplit('/')[-1]}")
            file_path.parent.mkdir(parents=True, exist_ok=True)
            with file_path.open("wb") as f:
                f.write(requests.get(pdf_url).content)
            time.sleep(1)  # pause between downloads
        except KeyError:
            # Entries without a "pdfurl" key are skipped.
            continue
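If you can do without requests, you can usually use curl or wget (provided the URL is publicly accessible), so you can call one of them through subprocess. For example: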
import subprocess
subprocess.run(['wget', 'url'])
https://www.gnu.org/software/wget/
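A slightly fuller sketch of the subprocess approach, assuming wget is installed and on the PATH. The helper names build_wget_command and download_with_wget and the destination directory argument are illustrative and not part of the answer above.

```python
import shutil
import subprocess
from pathlib import Path


def build_wget_command(url, dest_dir):
    """Return the argument list for a wget call that saves `url` into `dest_dir`."""
    # -P (--directory-prefix) sets the directory wget saves into.
    return ["wget", "-P", str(dest_dir), url]


def download_with_wget(url, dest_dir):
    """Download `url` with wget; raise if wget is missing or exits non-zero."""
    if shutil.which("wget") is None:
        raise RuntimeError("wget was not found on PATH")
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(build_wget_command(url, dest_dir), check=True)
```

Passing the command as a list, rather than a single shell string, means the URL is never interpreted by a shell.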
You only need to change the URL from:
.../pdf/Sierra-Leone-AFRICA-WEST-P167897-Sierra-Leone-Free-Education-Project-Procurement-Plan.pdf
to:
.../text/Sierra-Leone-AFRICA-WEST-P167897-Sierra-Leone-Free-Education-Project-Procurement-Plan.txt
This is easily done with str.replace():
import time
from pathlib import Path

import requests

links = ["P167897", "P173997", "P166309"]
for link in links:
    end_point = f"https://search.worldbank.org/api/v2/wds?" \
                f"format=json&includepublicdocs=1&" \
                f"fl=docna,lang,docty,repnb,docdt,doc_authr,available_in&" \
                f"os=0&rows=20&proid={link}&apilang=en"
    #print(requests.get(end_point).json())
    #break
    documents = requests.get(end_point).json()["documents"]
    for document_data in documents.values():
        try:
            pdf_url = document_data["pdfurl"]
            # Derive the txt URL from the pdf URL.
            txt_url = pdf_url.replace('.pdf', '.txt')
            txt_url = txt_url.replace('/pdf/', '/text/')
            print(f"Downloading: {txt_url}")
            # Numeric document id from the path; this slice assumes an "http://" URL.
            uniqueId = txt_url[6:].split('/')[4]
            file_path = Path(
                f"/tmp/{link}/{uniqueId}-{txt_url.rsplit('/')[-1]}"
            )
            file_path.parent.mkdir(parents=True, exist_ok=True)
            with file_path.open("wb") as f:
                f.write(requests.get(txt_url).content)
            time.sleep(1)  # pause between downloads
        except KeyError:
            # Entries without a "pdfurl" key are skipped.
            continue
Output:
Downloading: http://documents.worldbank.org/curated/en/106981614570591392/text/Official-Documents-Grant-Agreement-for-Additional-Financing-Grant-TF0B4694.txt
Downloading: http://documents.worldbank.org/curated/en/331341614570579132/text/Official-Documents-First-Restatement-to-the-Disbursement-Letter-for-Grant-D6810-SL-and-for-Additional-Financing-Grant-TF0B4694.txt
...
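As a possible variation on the two str.replace() calls, the URL can be split on its last path segments, so that a ".pdf" or "/pdf/" appearing elsewhere in the URL is left alone and the document id is found whether the scheme is http or https. This is only a sketch: it assumes the .../<id>/pdf/<name>.pdf layout seen in the output above, and the function names pdf_url_to_txt_url and document_id are made up for illustration.

```python
def pdf_url_to_txt_url(pdf_url):
    """Map a .../pdf/<name>.pdf URL to the matching .../text/<name>.txt URL."""
    # Split off the last two path segments: the folder and the file name.
    base, folder, name = pdf_url.rsplit("/", 2)
    if folder != "pdf" or not name.endswith(".pdf"):
        raise ValueError(f"unexpected pdf URL layout: {pdf_url}")
    return f"{base}/text/{name[:-len('.pdf')]}.txt"


def document_id(url):
    """Return the path segment just before the pdf/text folder."""
    return url.rsplit("/", 3)[-3]
```

The result of pdf_url_to_txt_url can then be passed to requests.get() in place of txt_url, and document_id can stand in for the uniqueId slice.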