BeautifulSoup does not scrape all data
I am trying to scrape a website, but when I run this code it only prints half of the data (including the review data). Here is my script:
from bs4 import BeautifulSoup
from urllib.request import urlopen

inputfile = "Chicago.csv"
f = open(inputfile, "w")
Headers = "Name, Link\n"
f.write(Headers)
url = "https://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228"
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")
page_details = soup.find("dl", {"class": "boccat"})
Readers = page_details.find_all("a")
for i in Readers:
    poll = i.contents[0]
    link = i['href']
    print(poll)
    print(link)
    f.write("{}".format(poll) + ",https://www.chicagoreader.com{}".format(link) + "\n")
f.close()
- Is there a problem with my script's style?
- How can I make the code shorter?
- When should I use find_all versus find so that I don't get an AttributeError? I read the documentation but don't understand it.
To make your code shorter, you can switch to the Requests library. It is easy to use and precise. If you want it even shorter, you can use CSS selectors. Use find to select the container, then find_all to pick out the individual items of that container inside a for loop. Here is the full code:
from bs4 import BeautifulSoup
import csv
import requests

outfile = open("chicagoreader.csv", "w", newline='')
writer = csv.writer(outfile)
writer.writerow(["Name", "Link"])
base = "https://www.chicagoreader.com"
response = requests.get("https://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228")
soup = BeautifulSoup(response.text, "lxml")
for item in soup.select(".boccat dd a"):
    writer.writerow([item.text, base + item.get('href')])
    print(item.text, base + item.get('href'))
outfile.close()
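Rather than closing the file by hand (or forgetting to), a `with` statement closes it automatically, even if an exception is raised partway through the scrape. A minimal stdlib-only sketch, with hypothetical sample rows standing in for the scraped `(name, link)` pairs:

```python
import csv

# Hypothetical rows standing in for the (item.text, link) pairs from the scrape
rows = [("Best Deep Dish", "https://www.chicagoreader.com/best-deep-dish")]

with open("chicagoreader.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["Name", "Link"])  # header row
    writer.writerows(rows)             # data rows
# the file is closed here, whether or not the block raised an exception
```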
Or using find and find_all:
from bs4 import BeautifulSoup
import requests

base = "https://www.chicagoreader.com"
response = requests.get("https://www.chicagoreader.com/chicago/best-of-chicago-2011-food-drink/BestOf?oid=4106228")
soup = BeautifulSoup(response.text, "lxml")
for items in soup.find("dl", {"class": "boccat"}).find_all("dd"):
    item = items.find_all("a")[0]
    print(item.text, base + item.get("href"))
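The AttributeError the question asks about happens because find returns a single Tag, or None when nothing matches, while find_all always returns a list. Calling .find_all on the None that find returned is what raises the error. This can be shown with a tiny inline document (the HTML snippet here is made up for illustration):

```python
from bs4 import BeautifulSoup

html = """
<dl class="boccat">
  <dd><a href="/a">First</a></dd>
  <dd><a href="/b">Second</a></dd>
</dl>
"""
soup = BeautifulSoup(html, "html.parser")

# find returns the first matching Tag, or None if nothing matches
container = soup.find("dl", {"class": "boccat"})
missing = soup.find("dl", {"class": "no-such-class"})
print(type(missing))  # <class 'NoneType'>

# missing.find_all("a") would raise AttributeError, so guard first
if container is not None:
    links = container.find_all("a")   # find_all always returns a list
    print([a.text for a in links])    # ['First', 'Second']
```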