Extract a specific paragraph while scraping a Google search result

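I'm currently working on a web scraping task, and I need to extract the description of a city from a Google search result.

Say I want the description of the city of Madrid. I run the search and get the following result:

This is the source code of the target div: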

<div jscontroller="GCSbhd" class="kno-rdesc" jsaction="seM7Qe:c0XUbe;Iigoee:c0XUbe;rcuQ6b:npT2md">
    <h3 class="Uo8X3b OhScic zsYMMe">Description</h3>
    <span>Située au centre de l'Espagne, Madrid, sa capitale, est une ville dotée d'élégants boulevards et de vastes parcs très bien entretenus comme le Retiro. Elle est réputée pour ses riches collections d'œuvres d'art européennes, avec notamment celles du musée du Prado, réalisées par Goya, Velázquez et d'autres maîtres espagnols. Au cœur de la vieille Madrid des Habsbourgs se trouve la Plaza&nbsp;Mayor, bordée de portiques, et, à proximité, le Palais royal baroque et son Armurerie, qui comporte des armes historiques.
        <span>
            <span class="eHaQD"> ―&nbsp;Google
            </span>
        </span>
    </span>
</div>

I tried scraping the content by selecting the <h3> tag and then selecting its sibling, but the result is None. This is the code I used:

import requests
from bs4 import BeautifulSoup
url_PresMadrid = "https://www.google.com/search?q=madrid"
req_PresPadrid = requests.get(url_PresMadrid)
soup_PresMadrid = BeautifulSoup(req_PresPadrid.content, 'html.parser')
target_div_PresMadrid = soup_PresMadrid.find('h3', {'class': 'Uo8X3b OhScic zsYMMe'})
print(target_div_PresMadrid)

I even tried selecting the parent <div>, the only one whose class doesn't change, but the code returns None. This is the code for it:

import requests
from bs4 import BeautifulSoup
url_PresMadrid = "https://www.google.com/search?q=madrid"
req_PresPadrid = requests.get(url_PresMadrid)
soup_PresMadrid = BeautifulSoup(req_PresPadrid.content, 'html.parser')
target_div_PresMadrid = soup_PresMadrid.find('div', {'class': 'liYKde g VjDLd'})
print(target_div_PresMadrid)

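Can anyone help me understand how the search engine works so that I can extract the paragraph?

If you disable JavaScript in your browser, you'll see that the paragraph you want is actually inside a div with the class BNeawe s3v9rd AP7Wnd: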

<div class="BNeawe s3v9rd AP7Wnd">
 Madrid, Spain's central capital, is a city of elegant boulevards and expansive, manicured parks such as the Buen Retiro. It’s renowned for its rich repositories of European art, including the Prado Museum’s works by Goya, Velázquez and other Spanish masters. The heart of old Hapsburg Madrid is the portico-lined Plaza Mayor, and nearby is the baroque Royal Palace and Armory, displaying historic weaponry.
</div>

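The requests library doesn't execute JavaScript, so the HTML it receives is the non-JavaScript version of the page. You therefore need to target this class, BNeawe s3v9rd AP7Wnd.

Several elements share this class name, but because find() returns only the first match, you can use it: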

import requests
from bs4 import BeautifulSoup


url_PresMadrid = "https://www.google.com/search?q=madrid"
req_PresPadrid = requests.get(url_PresMadrid)
soup_PresMadrid = BeautifulSoup(req_PresPadrid.content, "html.parser")
target_div_PresMadrid = soup_PresMadrid.find("div", {"class": "BNeawe s3v9rd AP7Wnd"})
print(target_div_PresMadrid.text)

Output:

Madrid, Spain's central capital, is a city of elegant boulevards and expansive, manicured parks such as the Buen Retiro. It’s renowned for its rich repositories of European art, including the Prado Museum’s works by Goya, Velázquez and other Spanish masters. The heart of old Hapsburg Madrid is the portico-lined Plaza Mayor, and nearby is the baroque Royal Palace and Armory, displaying historic weaponry.
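Since find() returns None when nothing matches (which is what happened in the question), it can help to check the result before reading .text. Here is a minimal sketch of that check. It runs against a small hand-written HTML string instead of a live Google page, and the helper name first_text_by_class is my own, not part of BeautifulSoup. The class names are copied from above and may not match what Google serves today.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a fetched page (assumption: real markup will differ).
SAMPLE_HTML = """
<div class="BNeawe s3v9rd AP7Wnd">First block: the description.</div>
<div class="BNeawe s3v9rd AP7Wnd">Second block: something else.</div>
"""


def first_text_by_class(html, class_string):
    """Return the text of the first <div> whose class attribute equals
    class_string exactly, or None when no such <div> exists."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find("div", {"class": class_string})
    # find() returns None on no match, so guard before touching the node
    return node.get_text(strip=True) if node is not None else None


print(first_text_by_class(SAMPLE_HTML, "BNeawe s3v9rd AP7Wnd"))
print(first_text_by_class(SAMPLE_HTML, "no-such-class"))
```

The first call picks the first of the two matching blocks; the second returns None instead of raising AttributeError.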

See also:

  • Web-scraping JavaScript page with Python

You're looking for this:

soup.select_one('.zsYMMe+ span') # css selector for knowledge graph description

Try the SelectorGadget Chrome extension to grab CSS selectors (see also: CSS selectors reference).
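To show what the `+` (adjacent sibling) combinator does, here is a small sketch that runs the selector against a shortened copy of the markup from the question rather than a live page. The description text is a placeholder, and the second selector is only there to show the no-match case.

```python
from bs4 import BeautifulSoup

# Shortened copy of the knowledge graph markup from the question;
# the description text is a placeholder.
HTML = """
<div class="kno-rdesc">
    <h3 class="Uo8X3b OhScic zsYMMe">Description</h3>
    <span>Madrid description text.<span><span class="eHaQD"> ― Google</span></span></span>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# "A + B" matches B only when B is the element right after A under the same parent
node = soup.select_one(".zsYMMe + span")
print(node.get_text())

# Nothing follows the outer <div>, so this selector matches nothing
missing = soup.select_one(".kno-rdesc + span")
print(missing)
```

select_one() returns None when there is no match, which is why the full example guards the .text access.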

Make sure you pass a user-agent in the request headers to decrease the number of blocked requests (see: What is my user-agent?).

Code and full example in the online IDE:
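If you want to confirm what will actually be sent before hitting Google, one option is to build the request without sending it. This is only a sketch: it uses requests.Request(...).prepare(), and the user-agent string is an example value, not a recommendation.

```python
import requests

# Example user-agent value (assumption: any current browser string works similarly)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36"
}
params = {"q": "Madrid", "hl": "en"}

# prepare() builds the final URL and headers but performs no network I/O
prepared = requests.Request(
    "GET", "https://www.google.com/search", headers=headers, params=params
).prepare()

print(prepared.url)
print(prepared.headers["User-Agent"])
```

Nothing is sent here; it only shows how params end up in the query string and that the header is attached.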

from bs4 import BeautifulSoup
import requests  # the 'lxml' parser used below requires the lxml package to be installed

headers = {
    "User-agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  'q': 'Madrid',
  'hl': 'en',
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

# not every knowledge graph has a snippet (description); select_one() then returns None
# and .text raises AttributeError, hence the try/except
try:
    snippet = soup.select_one('.zsYMMe+ span').text
except AttributeError:
    snippet = None
print(snippet)

----
'''
Madrid, Spain's central capital, is a city of elegant boulevards and expansive, manicured parks such as the Buen Retiro. It’s renowned for its rich repositories of European art, including the Prado Museum’s works by Goya, Velázquez and other Spanish masters. The heart of old Hapsburg Madrid is the portico-lined Plaza Mayor, and nearby is the baroque Royal Palace and Armory, displaying historic weaponry. ― Google
'''
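Note that the output above ends with the "― Google" attribution, because it sits in a nested span inside the selected one. If you don't want it, one possible approach is to drop the nested span before reading the text. This is a sketch against a shortened copy of the question's markup with placeholder text, and it assumes the attribution always lives in a direct child span.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question; description text is a placeholder.
HTML = """
<div class="kno-rdesc">
    <h3 class="Uo8X3b OhScic zsYMMe">Description</h3>
    <span>Madrid description text.<span><span class="eHaQD"> ― Google</span></span></span>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
span = soup.select_one(".zsYMMe + span")

# Assumption: the attribution is wrapped in a direct child <span>.
# recursive=False limits the search to direct children, so each
# element is removed exactly once.
for nested in span.find_all("span", recursive=False):
    nested.decompose()

description = span.get_text(strip=True)
print(description)
```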

Alternatively, you can use the Google Knowledge Graph API from SerpApi. It's a paid API with a free plan.

Code to integrate:

from serpapi import GoogleSearch
import os

params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google",
    "q": "Madrid",
    "hl": "en",
}

search = GoogleSearch(params)
results = search.get_dict()

snippet = results['knowledge_graph']['description']
print(snippet)

-------
'''
Madrid, Spain's central capital, is a city of elegant boulevards and expansive, manicured parks such as the Buen Retiro. It’s renowned for its rich repositories of European art, including the Prado Museum’s works by Goya, Velázquez and other Spanish masters. The heart of old Hapsburg Madrid is the portico-lined Plaza Mayor, and nearby is the baroque Royal Palace and Armory, displaying historic weaponry. 
'''
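As with the scraping version, not every query comes back with a knowledge graph, so indexing the result directly can raise KeyError. A small sketch of a safer lookup follows. The helper name and both sample dictionaries are made up for illustration, and they only mirror the two keys used in the code above.

```python
def knowledge_graph_description(results):
    """Return results['knowledge_graph']['description'] if both keys exist, else None."""
    return results.get("knowledge_graph", {}).get("description")


# Hand-written samples (assumption: real responses contain many more fields)
with_graph = {
    "knowledge_graph": {
        "title": "Madrid",
        "description": "Madrid, Spain's central capital, is a city of elegant boulevards.",
    }
}
without_graph = {"organic_results": []}

print(knowledge_graph_description(with_graph))
print(knowledge_graph_description(without_graph))
```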

Disclaimer: I work for SerpApi.