How to scrape content from a page with BeautifulSoup

The question I'm asking is simple, but for me it doesn't work, and I don't know why!

I want to scrape the beer rating from this page https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone with BeautifulSoup, but it doesn't work.

Here is my code:

import requests
import bs4
from bs4 import BeautifulSoup



url = 'https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone'

test_html = requests.get(url).text

soup = BeautifulSoup(test_html, "lxml")

rating = soup.findAll("span", class_="ratingValue")

rating

When I run it, it doesn't work, but if I do the same thing on another page it works fine... I don't know why. Can someone help me? The rating should come out as 4.58.

Thanks, everyone!

You are getting this error because some websites cannot be scraped with Beautiful Soup alone. For those kinds of sites you have to use Selenium:

  • Download the latest ChromeDriver for your operating system from this link
  • Install Selenium with the command "pip install selenium"
# import required modules
from selenium import webdriver
from bs4 import BeautifulSoup
import time, os

current_dir = os.getcwd()
print(current_dir)

# Concatenate the web driver path with your current dir;
# if you are on Windows, change "/" to "\".
# Make sure you placed chromedriver in the current directory.
driver = webdriver.Chrome(current_dir + '/chromedriver')

# driver.get opens the URL in the browser
driver.get('https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone')
time.sleep(1)

# fetch the rendered HTML from the driver
super_html = driver.page_source

# now parse the raw HTML with 'html.parser'
soup = BeautifulSoup(super_html, "html.parser")
rating = soup.find_all("span", itemprop="ratingValue")
rating[0].text

If you print test_html, you will see that you got a 403 Forbidden response.

You should add headers to your GET request (at least a user-agent :) ).

import requests
from bs4 import BeautifulSoup


headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'
}

url = 'https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone'

test_html = requests.get(url, headers=headers).text

soup = BeautifulSoup(test_html, 'html5lib')

rating = soup.find('span', {'itemprop': 'ratingValue'})

print(rating.text)

# 4.58

The reason you are getting a Forbidden status code (HTTP error 403) is that the server refuses to fulfill your request even though it understood it. You will definitely hit this error if you try to scrape many of the more popular websites, which have security features in place to stop bots. So you need to disguise your request!

  1. For that you need to use headers.
  2. You also need to correct the tag attribute whose data you are trying to get, i.e. itemprop.
  3. Use lxml as your tree builder, or any other of your choice.

    import requests
    from bs4 import BeautifulSoup
    
    
    url = 'https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone'
    
    # Add this 
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
    
    test_html = requests.get(url, headers=headers).text      
    
    soup = BeautifulSoup(test_html, 'lxml')
    
    rating = soup.find('span', {'itemprop':'ratingValue'})
    
    print(rating.text)
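The attribute fix in point 2 above is the crux: the original code searched by class_, but on this page the rating span is marked up with microdata, so the attribute is itemprop. A minimal offline sketch (the HTML fragment below is a made-up stand-in modeled on the page's markup):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the recipe page's microdata markup
html = '<span itemprop="ratingValue">4.58</span>'
soup = BeautifulSoup(html, 'html.parser')

# The span carries itemprop, not class, so searching by class finds nothing
print(soup.find('span', class_='ratingValue'))         # None
print(soup.find('span', itemprop='ratingValue').text)  # 4.58
```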
    

The page you are requesting responds with 403 Forbidden, so you may not get an error, but you will get a blank result []. To avoid this we add a user agent; this code will get you the result you want.
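A side note on why the failure is silent: findAll / find_all returns a list, which is simply empty when nothing matches, while find returns None, so a blocked request never raises an exception. A small sketch with a made-up error page:

```python
from bs4 import BeautifulSoup

# Made-up stand-in for a 403 error page: no ratingValue span anywhere
soup = BeautifulSoup('<html><body>403 Forbidden</body></html>', 'html.parser')

# find_all returns an empty list on no match; find returns None
print(soup.find_all('span', itemprop='ratingValue'))  # []
print(soup.find('span', itemprop='ratingValue'))      # None
```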

import urllib.request
from bs4 import BeautifulSoup

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'

url = "https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone"
headers={'User-Agent':user_agent} 

request=urllib.request.Request(url,None,headers) #The assembled request
response = urllib.request.urlopen(request)
soup = BeautifulSoup(response, "lxml")

rating = soup.find('span', {'itemprop':'ratingValue'})

rating.text
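You can sanity-check the assembled urllib request without hitting the network: the Request object already carries the header, and urllib stores header names in Capitalized form internally, so it is queried as 'User-agent'. A quick sketch:

```python
import urllib.request

# Same user-agent string and URL as in the answer above
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
url = 'https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone'

request = urllib.request.Request(url, None, {'User-Agent': user_agent})

# urllib normalizes header names, so the lookup key is 'User-agent'
print(request.get_header('User-agent'))
print(request.get_full_url())
```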