Scraping data with BeautifulSoup issue
Newbie web scraper here. I'm wondering why I can't extract the information I need. The code is below:
from bs4 import BeautifulSoup
import requests

url = "http://www.mortgagenewsdaily.com/mortgage_rates/"
r = requests.get(url)
soup = BeautifulSoup(r.content)
table = soup.find("table", {"class": "rangetable mtg-rates"})
for trs in table.find_all("tr", {"class": "rate-row"}):
    for tds in trs.find_all("td"):
        try:
            product = tds[0].get_text()
            today = tds[1].get_text()
            yesterday = tds[2].get_text()
            change = tds[3].get_text()
            low = tds[4].get_text()
            high = tds[5].get_text()
        except:
            print("-")
            continue
Do I need to pay attention to the classes on the tds? Also, is there a better/simpler way to scrape this information?
Thanks for your help!
Your main problem is that you have an extra loop over the tds.
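To make the failure concrete, here is a minimal sketch that runs against a small inline HTML snippet instead of the live page. The markup is an assumption that only mimics the class names used in the question; the real page may be structured differently. Inside the inner loop each `tds` is a single Tag, and indexing a Tag looks up an HTML attribute, so `tds[0]` raises `KeyError`, which the bare `except` then hides. Indexing the list returned by `find_all("td")` avoids this:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page's structure may differ.
SAMPLE_HTML = """
<table class="rangetable mtg-rates">
  <tr class="rate-row">
    <td>30 Yr FRM</td><td>3.62%</td><td>3.64%</td>
    <td>-0.02</td><td>3.60%</td><td>4.56%</td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", class_="mtg-rates")

rows = []
for tr in table.find_all("tr", class_="rate-row"):
    tds = tr.find_all("td")  # a list of the row's cells
    # Index the list of cells directly; no inner loop over single cells.
    cells = [tds[i].get_text(strip=True) for i in range(len(tds))]
    rows.append(cells)

# Reproduce the original failure: a single Tag is not a list of cells,
# so integer indexing is treated as an attribute lookup.
first_td = table.find("td")
try:
    first_td[0]
    indexing_failed = False
except KeyError:
    indexing_failed = True

print(rows)
print(indexing_failed)
```

Catching only the exception you expect, rather than using a bare `except`, would have surfaced the `KeyError` right away.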
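Also, you can apply the following improvements to the code:
- use CSS Selectors to get the appropriate tr elements
- instead of having a variable for each cell, use a data structure such as a namedtuple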
Improved working code:
from collections import namedtuple

from bs4 import BeautifulSoup
import requests

url = "http://www.mortgagenewsdaily.com/mortgage_rates/"
r = requests.get(url)
soup = BeautifulSoup(r.content)

Item = namedtuple('Item', "product,today,yesterday,change,low,high")
for tr in soup.select("table.mtg-rates tr.rate-row"):
    item = Item(*(td.get_text(strip=True) for td in tr.find_all("td")))
    print(item)
Prints:
Item(product=u'30 Yr FRM', today=u'3.62%', yesterday=u'3.64%', change=u'-0.023.62%', low=u'3.60%', high=u'4.56%')
Item(product=u'15 Yr FRM', today=u'3.00%', yesterday=u'3.02%', change=u'-0.023.00%', low=u'2.98%', high=u'3.55%')
Item(product=u'FHA 30 Year Fixed', today=u'3.25%', yesterday=u'3.25%', change=u'--3.25%', low=u'3.25%', high=u'4.25%')
Item(product=u'Jumbo 30 Year Fixed', today=u'3.61%', yesterday=u'3.63%', change=u'-0.023.61%', low=u'3.58%', high=u'4.38%')
Item(product=u'5/1 Yr ARM', today=u'3.23%', yesterday=u'3.22%', change=u'+0.013.23%', low=u'3.17%', high=u'3.26%')
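Once a row is an `Item`, its cells can be read by name instead of by position. The following is a minimal sketch that uses one of the rows printed above as hard-coded sample data, so it needs no network access:

```python
from collections import namedtuple

Item = namedtuple("Item", "product,today,yesterday,change,low,high")

# Sample row copied from the printed output above (hard-coded for illustration).
item = Item("30 Yr FRM", "3.62%", "3.64%", "-0.023.62%", "3.60%", "4.56%")

print(item.product)  # access by field name
print(item[1])       # positional access still works, as with any tuple

# _asdict() converts the row to a mapping keyed by field name.
as_mapping = dict(item._asdict())
print(as_mapping["low"])
```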
Or, in case of a dict:
headers = ['product', 'today', 'yesterday', ' change', 'low', 'high']
for tr in soup.select("table.mtg-rates tr.rate-row"):
    item = dict(zip(headers, [td.get_text(strip=True) for td in tr.find_all("td")]))
    print(item)
Prints:
{'product': u'30 Yr FRM', 'yesterday': u'3.64%', 'high': u'4.56%', 'low': u'3.60%', 'today': u'3.62%', ' change': u'-0.023.62%'}
{'product': u'15 Yr FRM', 'yesterday': u'3.02%', 'high': u'3.55%', 'low': u'2.98%', 'today': u'3.00%', ' change': u'-0.023.00%'}
{'product': u'FHA 30 Year Fixed', 'yesterday': u'3.25%', 'high': u'4.25%', 'low': u'3.25%', 'today': u'3.25%', ' change': u'--3.25%'}
{'product': u'Jumbo 30 Year Fixed', 'yesterday': u'3.63%', 'high': u'4.38%', 'low': u'3.58%', 'today': u'3.61%', ' change': u'-0.023.61%'}
{'product': u'5/1 Yr ARM', 'yesterday': u'3.22%', 'high': u'3.26%', 'low': u'3.17%', 'today': u'3.23%', ' change': u'+0.013.23%'}
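One practical difference between the two variants is how they react to a row with an unexpected number of cells. The sketch below uses hard-coded cell lists, and the short row is a made-up example. It also assumes the field names are written without the leading space seen in `' change'` above, since namedtuple field names must be valid identifiers. `zip` silently stops at the shorter sequence, while the namedtuple constructor raises `TypeError`:

```python
from collections import namedtuple

headers = ["product", "today", "yesterday", "change", "low", "high"]
Item = namedtuple("Item", headers)

full_row = ["30 Yr FRM", "3.62%", "3.64%", "-0.02", "3.60%", "4.56%"]
short_row = ["30 Yr FRM", "3.62%"]  # made-up row with missing cells

# dict(zip(...)) drops the headers that have no matching cell.
as_dict = dict(zip(headers, short_row))

# The namedtuple constructor insists on exactly one value per field.
try:
    Item(*short_row)
    raised = False
except TypeError:
    raised = True

print(sorted(as_dict))
print(raised)
print(Item(*full_row).change)
```

If malformed rows should be noticed rather than silently truncated, the namedtuple variant is probably the safer default.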