BeautifulSoup does not scrape all data
I am trying to scrape a website, but when I run this code it only prints half of the data (including the review data). Here is my script:
from bs4 import BeautifulSoup
from urllib.request import urlopen

inputfile = "Chicago.csv"
f = open(inputfile, "w")
Headers = "Name, Link\n"
f.write(Headers)
url = "https://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228"
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")
page_details = soup.find("dl", {"class": "boccat"})
Readers = page_details.find_all("a")
for i in Readers:
    poll = i.contents[0]
    link = i['href']
    print(poll)
    print(link)
    f.write("{}".format(poll) + ",https://www.chicagoreader.com{}".format(link) + "\n")
f.close()
- Is there a problem with my script's style?
- How can I make the code shorter?
- When should I use find_all versus find so that I don't get an AttributeError? I read the documentation but don't understand it.
To make your code shorter, you can switch to the Requests library. It is easy to use and precise. If you want it even shorter, you can use CSS selectors. Use find to select the container, then find_all to pick out the individual items of that container inside a for loop. Here is the full code:
from bs4 import BeautifulSoup
import csv
import requests

outfile = open("chicagoreader.csv", "w", newline='')
writer = csv.writer(outfile)
writer.writerow(["Name", "Link"])
base = "https://www.chicagoreader.com"
response = requests.get("https://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228")
soup = BeautifulSoup(response.text, "lxml")
for item in soup.select(".boccat dd a"):
    writer.writerow([item.text, base + item.get('href')])
    print(item.text, base + item.get('href'))
outfile.close()
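Rather than closing the file by hand (or forgetting to), a `with` statement closes it automatically, even if an exception is raised partway through the scrape. A minimal stdlib-only sketch, with hypothetical sample rows standing in for the scraped `(name, link)` pairs:

```python
import csv

# Hypothetical rows standing in for the (item.text, link) pairs from the scrape
rows = [("Best Deep Dish", "https://www.chicagoreader.com/best-deep-dish")]

with open("chicagoreader.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["Name", "Link"])  # header row
    writer.writerows(rows)             # data rows
# the file is closed here, whether or not the block raised an exception
```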
Or using find and find_all:
from bs4 import BeautifulSoup
import requests

base = "https://www.chicagoreader.com"
response = requests.get("https://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228")
soup = BeautifulSoup(response.text, "lxml")
for items in soup.find("dl", {"class": "boccat"}).find_all("dd"):
    item = items.find_all("a")[0]
    print(item.text, base + item.get("href"))
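The AttributeError the question asks about happens because find returns a single Tag, or None when nothing matches, while find_all always returns a list. Calling .find_all on the None that find returned is what raises the error. This can be shown with a tiny inline document (the HTML snippet here is made up for illustration):

```python
from bs4 import BeautifulSoup

html = """
<dl class="boccat">
  <dd><a href="/a">First</a></dd>
  <dd><a href="/b">Second</a></dd>
</dl>
"""
soup = BeautifulSoup(html, "html.parser")

# find returns the first matching Tag, or None if nothing matches
container = soup.find("dl", {"class": "boccat"})
missing = soup.find("dl", {"class": "no-such-class"})
print(type(missing))  # <class 'NoneType'>

# missing.find_all("a") would raise AttributeError, so guard first
if container is not None:
    links = container.find_all("a")   # find_all always returns a list
    print([a.text for a in links])    # ['First', 'Second']
```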