Python 2.7 BeautifulSoup4 is returning an empty set

I'm trying to get links from a Google search using bs4, but my code returns an empty set.

import requests
from bs4 import BeautifulSoup

website = "https://www.google.co.uk/?gws_rd=ssl#q=science"

response=requests.get(website)

soup = BeautifulSoup(response.content)

link_info = soup.find_all("h3", {"class": "r"})
print link_info

`<h3 class="r">` is the link for every result, not just the first one.

In response I get `[]`, and the same happens for any other class I try to request, including `<div class="rc">`.

Here is a screenshot of what I'm after.

Try the code below:

url = 'http://www.google.com/search?'
params = {'q': 'science'}
response = requests.get(url, params=params).content
soup = BeautifulSoup(response)
link_info = soup.find_all("h3", {"class": "r"})
print link_info
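A minimal sketch of why the original URL returns nothing useful: everything after `#` is a URL fragment, which the browser handles client-side and never sends to the server, so Google never sees `q=science`. Shown here with the Python 3 `urllib.parse` module; in Python 2.7 the same function lives in the `urlparse` module:

```python
from urllib.parse import urlparse

# The original URL puts the search term after '#', making it a fragment.
url = "https://www.google.co.uk/?gws_rd=ssl#q=science"
parts = urlparse(url)

print(parts.query)     # gws_rd=ssl   <- the only thing the server receives
print(parts.fragment)  # q=science    <- kept by the browser, never sent

# Passing params to requests builds a real query string instead:
# requests.get("http://www.google.com/search", params={"q": "science"})
```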

You're looking for this:

# select container with needed elements and grab each element in a loop
for result in soup.select('.tF2Cxc'):

  # grabs each <a> tag from the container and then grabs an href attribute
  link = result.select_one('.yuRUbf a')['href']

Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser, and at a CSS selectors reference.
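As a self-contained sketch of how `select` and `select_one` behave with those selectors, here is a hard-coded HTML snippet that imitates the result-container structure (the class names `.tF2Cxc` and `.yuRUbf` are Google's at the time of writing and change periodically):

```python
from bs4 import BeautifulSoup

# Hard-coded HTML imitating Google's result markup (class names change over time).
html = """
<div class="tF2Cxc">
  <div class="yuRUbf"><a href="https://example.com/a">First</a></div>
</div>
<div class="tF2Cxc">
  <div class="yuRUbf"><a href="https://example.com/b">Second</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns every matching container in document order;
# select_one() returns the first match inside each container.
for result in soup.select(".tF2Cxc"):
    print(result.select_one(".yuRUbf a")["href"])
# https://example.com/a
# https://example.com/b
```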

The code from the answer above will throw an error, because there is no longer any such `r` CSS selector: Google has changed it.


Make sure you're using a user-agent, because the default `requests` user-agent is `python-requests`. Google blocks such a request because it knows it's a bot and not a "real" user visit, and you'll receive different HTML with some sort of error. A user-agent fakes a real user visit by adding this information to the HTTP request headers.

I wrote a dedicated blog post about how to reduce the chance of being blocked while web scraping search engines, which covers multiple solutions.

Pass a user-agent in the request headers:

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get('YOUR_URL', headers=headers)
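You can check what `requests` sends by default. A quick sketch: `requests.utils.default_user_agent()` returns the string used when no user-agent header is set.

```python
import requests

# The default User-Agent advertises the library itself, which is easy to block:
print(requests.utils.default_user_agent())
# e.g. python-requests/2.31.0 (version depends on your install)
```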

Code and example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "samurai cop what does katana mean",
  "gl": "us",
  "hl": "en"
}

html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc')[:5]:
  link = result.select_one('.yuRUbf a')['href']

  print(link)

--------
'''
https://www.youtube.com/watch?v=paTW3wOyIYw
https://www.quotes.net/mquote/1060647
https://www.reddit.com/r/NewTubers/comments/47hw1g/what_does_katana_mean_it_means_japanese_sword_2/
https://www.imdb.com/title/tt0130236/characters/nm0360481
http://www.subzin.com/quotes/Samurai+Cop/What+does+Katana+mean%3F+-+It+means+Japanese+sword
'''

Alternatively, you can achieve the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.

The difference in your case is that you don't have to deal with picking the correct selector or figuring out why something doesn't work as expected and then maintaining it over time. Instead, you only need to iterate over structured JSON and quickly get the data you want.

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
  "engine": "google",
  "q": "samurai cop what does katana mean",
  "hl": "en",
  "gl": "us",
  "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"][:5]:
  print(result['link'])

--------
'''
https://www.youtube.com/watch?v=paTW3wOyIYw
https://www.quotes.net/mquote/1060647
https://www.reddit.com/r/NewTubers/comments/47hw1g/what_does_katana_mean_it_means_japanese_sword_2/
https://www.imdb.com/title/tt0130236/characters/nm0360481
http://www.subzin.com/quotes/Samurai+Cop/What+does+Katana+mean%3F+-+It+means+Japanese+sword
'''

Disclaimer, I work for SerpApi.