why are my zip files not being output from code using python?

I want to scrape all the files from this page; they are zip files: http://data.gdeltproject.org/events/index.html

Here is my code:

from bs4 import BeautifulSoup as bs
import requests
import re

DOMAIN = "insert here"
URL = "insert here"

def get_soup(URL):
 return bs(requests.get(URL).text, 'html.parser')


for link in get_soup(URL).findAll("a", attrs={'href': re.compile(".zip")}):
    file_link = link.get('href')
    print(file_link)

with open(link.text, 'wb') as file:
response = requests.get(DOMAIN + file_link)
file.write(response.content)

The code seems to create a file, but its contents are empty. I can see all the zip files in the Python run output, but they aren't in the file. Can someone help me figure out how to get these files onto my computer? I'm stuck!

Many thanks, 百合

Could you check whether changing this:

for link in get_soup(URL).findAll("a", attrs={'href': re.compile(".zip")}):
    file_link = link.get('href')
    print(file_link)

with open(link.text, 'wb') as file:
response = requests.get(DOMAIN + file_link)
file.write(response.content)

to this would solve the problem:

for link in get_soup(URL).findAll("a", attrs={'href': re.compile(".zip")}):
    file_link = link.get('href')
    print(file_link)

    with open(link.text, 'wb') as file:
        response = requests.get(DOMAIN + file_link)
        file.write(response.content)

Since Python is strict about indentation, that is what is hurting you here: your download code does not run inside the for loop, it runs once after the loop finishes.
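A minimal, self-contained illustration of that pitfall, with made-up link names:

```python
# Statements after a for loop run exactly once, and the loop variable keeps
# the value from the final iteration -- so only the last link is "seen".
links = ["a.zip", "b.zip", "c.zip"]
for link in links:
    file_link = link   # reassigned on every pass

print(file_link)       # prints "c.zip": only the last value survives the loop
```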

I'd suggest using a GUI debugger next time (see how easy it is to set one up in VS Code or another IDE with a GUI), or using ipython together with ipdb (import ipdb; ipdb.set_trace()).

This isn't a complete answer on purpose: if you step through your code with a debugger, you should be able to get past this easily. Good luck with your further learning, and keep at it :)
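For reference, once the indentation is fixed, a complete synchronous version might look like the sketch below. The base URL is taken from the page in the question; `filename_from_href` is a helper name I made up, and the streaming download avoids holding a whole zip in memory:

```python
import requests
from bs4 import BeautifulSoup

BASE = "http://data.gdeltproject.org/events/"


def filename_from_href(href):
    # keep only the last path segment, e.g. "events/x.zip" -> "x.zip"
    return href.rsplit("/", 1)[-1]


def download_all(limit=None):
    soup = BeautifulSoup(requests.get(BASE + "index.html").text, "html.parser")
    hrefs = [a["href"] for a in soup.find_all("a", href=True)
             if a["href"].endswith(".zip")]
    for href in hrefs[:limit]:
        # stream=True downloads the body in chunks instead of all at once
        with requests.get(BASE + href, stream=True) as r:
            r.raise_for_status()
            with open(filename_from_href(href), "wb") as f:
                for chunk in r.iter_content(chunk_size=1 << 16):
                    f.write(chunk)
```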

from bs4 import BeautifulSoup
import httpx
import trio

mainurl = "http://data.gdeltproject.org/events/index.html"


async def downloader(rec):
    # each worker pulls (client, link) pairs from the channel until it closes
    async with rec:
        async for client, link in rec:
            print(f'[*] Downloading --> {link}')
            # save under the basename, e.g. ".../20230101.zip" -> "20230101.zip"
            async with await trio.open_file(link.split('/')[-1], 'wb') as f:
                r = await client.get(link)
                await f.write(r.content)


async def main():
    async with httpx.AsyncClient(timeout=None) as client, trio.open_nursery() as nurse:
        r = await client.get(mainurl)
        soup = BeautifulSoup(r.text, 'lxml')
        # mainurl[:36] is the base directory "http://data.gdeltproject.org/events/";
        # the selector matches every <a> whose href ends in "zip"
        links = [mainurl[:36] + x['href'] for x in soup.select('a[href$=zip]')]

        sender, receiver = trio.open_memory_channel(0)

        async with receiver:
            # three concurrent downloaders share the receiving end of the channel
            for _ in range(3):
                nurse.start_soon(downloader, receiver.clone())

            # an unbuffered channel means each send waits for a free worker
            async with sender:
                for link in links:
                    await sender.send([client, link])


if __name__ == "__main__":
    trio.run(main)
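The channel-plus-workers pattern above (a few concurrent consumers fed one link at a time) can be sketched with the standard library's asyncio as a rough analogue, using made-up link names instead of real downloads:

```python
import asyncio


async def worker(queue, results):
    # consume links until the sentinel None arrives
    while True:
        link = await queue.get()
        if link is None:
            queue.task_done()
            break
        results.append(f"downloaded {link}")
        queue.task_done()


async def main():
    # a small maxsize gives backpressure, loosely like an unbuffered channel
    queue = asyncio.Queue(maxsize=1)
    results = []
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(3)]
    for link in ["a.zip", "b.zip"]:
        await queue.put(link)
    for _ in workers:
        await queue.put(None)   # one sentinel per worker so each one exits
    await asyncio.gather(*workers)
    return results


results = asyncio.run(main())
print(sorted(results))  # prints ['downloaded a.zip', 'downloaded b.zip']
```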