Why are my zip files not being output from my Python code?
I want to scrape all the files from this web page; they are zip files: http://data.gdeltproject.org/events/index.html
Here is my code:
    from bs4 import BeautifulSoup as bs
    import requests
    import re

    DOMAIN = "insert here"
    URL = "insert here"

    def get_soup(URL):
        return bs(requests.get(URL).text, 'html.parser')

    for link in get_soup(URL).findAll("a", attrs={'href': re.compile(".zip")}):
        file_link = link.get('href')
        print(file_link)

    with open(link.text, 'wb') as file:
        response = requests.get(DOMAIN + file_link)
        file.write(response.content)
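The code seems to create a file, but the file is empty. I can see all the zip files listed in the Python run output, but they are not written to disk. Can someone help me figure out how to get these files onto my computer? I'm stuck here!
Thanks a lot, Lily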
Can you check whether this:

    for link in get_soup(URL).findAll("a", attrs={'href': re.compile(".zip")}):
        file_link = link.get('href')
        print(file_link)
        with open(link.text, 'wb') as file:
            response = requests.get(DOMAIN + file_link)
            file.write(response.content)
solves the problem, instead of:

    for link in get_soup(URL).findAll("a", attrs={'href': re.compile(".zip")}):
        file_link = link.get('href')
        print(file_link)

    with open(link.text, 'wb') as file:
        response = requests.get(DOMAIN + file_link)
        file.write(response.content)
Python is strict about indentation, so this matters: in your code the with block does not run inside the for loop but once, after the loop has finished.
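To make the effect of the indentation concrete, here is a minimal sketch with no network access. The list `names` and both result lists are made up for illustration; they stand in for the scraped links and the files that would be written.

```python
# Hypothetical file names standing in for the scraped links.
names = ["a.zip", "b.zip", "c.zip"]

# Dedented body: it runs once, after the loop, with the last value of `name`.
handled_after_loop = []
for name in names:
    pass  # the loop only rebinds `name` on each iteration
handled_after_loop.append(name)

# Indented body: it runs once per iteration.
handled_in_loop = []
for name in names:
    handled_in_loop.append(name)

print(handled_after_loop)  # ['c.zip']
print(handled_in_loop)     # ['a.zip', 'b.zip', 'c.zip']
```

With the dedented version only the last link is ever passed to open() and requests.get(), which would explain seeing every link printed but at most one file on disk.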
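Next time I suggest using a GUI debugger (see how easy it is to set up in VS Code or another IDE), or using ipython with ipdb (import ipdb; ipdb.set_trace()).
This is not a complete answer, because if you step through your code with a debugger you should be able to get past this easily. Kudos for your continued learning and persistence :)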
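One more detail in the loop above that stepping through it may surface: the pattern ".zip" is an unescaped regular expression, so the dot matches any character. The sketch below compares it with an escaped, end-anchored pattern. The href values are hypothetical, and it assumes the pattern is applied with search semantics rather than a full match.

```python
import re

# Hypothetical href values; only the first one is a real zip link.
hrefs = ["20130401.export.CSV.zip", "gzip-notes.html", "unzip.txt"]

loose = re.compile(".zip")      # "." matches any single character
strict = re.compile(r"\.zip$")  # literal dot, anchored at the end of the string

loose_hits = [h for h in hrefs if loose.search(h)]
strict_hits = [h for h in hrefs if strict.search(h)]

print(loose_hits)   # all three entries
print(strict_hits)  # ['20130401.export.CSV.zip']
```

On the GDELT index page this probably makes no practical difference, but the stricter pattern avoids trying to download pages that merely contain "zip" in their name.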
    from bs4 import BeautifulSoup
    import httpx
    import trio

    mainurl = "http://data.gdeltproject.org/events/index.html"

    async def downloader(rec):
        # Worker: take (client, link) pairs from the channel until it is closed.
        async with rec:
            async for client, link in rec:
                print(f'[*] Downloading --> {link}')
                # Name the local file after the last path segment of the URL.
                async with await trio.open_file(link.split('/')[-1], 'wb') as f:
                    r = await client.get(link)
                    await f.write(r.content)

    async def main():
        async with httpx.AsyncClient(timeout=None) as client, trio.open_nursery() as nurse:
            r = await client.get(mainurl)
            soup = BeautifulSoup(r.text, 'lxml')
            # mainurl[:36] is "http://data.gdeltproject.org/events/"
            links = [mainurl[:36] + x['href'] for x in soup.select('a[href$=zip]')]
            sender, receiver = trio.open_memory_channel(0)
            # Start three workers, each with its own clone of the receive end.
            async with receiver:
                for _ in range(3):
                    nurse.start_soon(downloader, receiver.clone())

            # Feed the links; closing the sender lets the workers finish.
            async with sender:
                for link in links:
                    await sender.send([client, link])

    if __name__ == "__main__":
        trio.run(main)
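The slice mainurl[:36] in the code above depends on the exact length of the base URL. As a possible alternative (not what this answer uses), the standard library's urllib.parse.urljoin can resolve each href against the index page. A minimal sketch, where the href value is a hypothetical example of what the page might contain:

```python
import posixpath
from urllib.parse import urljoin, urlsplit

mainurl = "http://data.gdeltproject.org/events/index.html"
href = "20130401.export.CSV.zip"  # hypothetical href, for illustration only

# Resolve the relative href against the page it was found on.
link = urljoin(mainurl, href)

# Derive the local file name from the path component of the URL.
filename = posixpath.basename(urlsplit(link).path)

print(link)
print(filename)
```

For a plain relative href like this one, the result is the same string that mainurl[:36] + href produces; urljoin simply keeps working if the base URL changes or if an href turns out to be absolute.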