How to scrape related searches on google?
I'm trying to scrape the Google related searches for a given list of keywords and then output them to a CSV file. My problem is getting Beautiful Soup to identify the related-searches HTML tag.
Here is an example of such a tag in the page source:
<div data-ved="2ahUKEwitr8CPkLT3AhVRVsAKHVF-C80QmoICKAV6BAgEEBE">iphone xr</div>
Here is my webdriver setup:
from selenium import webdriver
user_agent = 'Chrome/100.0.4896.60'
webdriver_options = webdriver.ChromeOptions()
webdriver_options.add_argument('user-agent={0}'.format(user_agent))
capabilities = webdriver_options.to_capabilities()
capabilities["acceptSslCerts"] = True
capabilities["acceptInsecureCerts"] = True
And here is my code:
import time
import pandas as pd
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By

queries = ["iphone"]
driver = webdriver.Chrome(options=webdriver_options, desired_capabilities=capabilities, port=4444)
df2 = []
driver.get("https://google.com")
time.sleep(3)
driver.find_element(By.CSS_SELECTOR, "[aria-label='Agree to the use of cookies and other data for the purposes described']").click()

# get_current_related_searches
for query in queries:
    driver.get("https://google.com/search?q=" + query)
    time.sleep(3)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    p = soup.find_all('div data-ved')
    print(p)
    d = pd.DataFrame({'loop': 1, 'source': query, 'from': query, 'to': [s.text for s in p]})
    terms = d["to"]
    df2.append(d)
    time.sleep(3)

df = pd.concat(df2).reset_index(drop=False)
df.to_csv("related_searches.csv")
It's the p = soup.find_all call that isn't right; I'm just not sure how to get BS to identify these specific HTML tags. Any help would be great :)
@jakecohensol, as you pointed out, the selector in p = soup.find_all is wrong. The correct CSS selector is .y6Uyqe .AB4Wff.
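Those class names (.y6Uyqe, .AB4Wff) are machine-generated and change from time to time, so it's worth understanding why the original call failed: find_all('div data-ved') looks for a tag literally named "div data-ved". To match on an attribute instead, pass it through attrs. A minimal sketch, using the sample tag from the question:

from bs4 import BeautifulSoup

html = '<div data-ved="2ahUKEwitr8CPkLT3AhVRVsAKHVF-C80QmoICKAV6BAgEEBE">iphone xr</div>'
soup = BeautifulSoup(html, "html.parser")

print(soup.find_all('div data-ved'))                   # [] -- treated as a tag named "div data-ved"
print(soup.find_all("div", attrs={"data-ved": True}))  # matches any div carrying a data-ved attribute

On a real results page almost every div carries a data-ved attribute, which is why a class-based selector like .y6Uyqe .AB4Wff is still the practical choice here.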
Chrome/100.0.4896.60 is not a valid User-Agent header. Google blocks requests with agent strings like that. With a full User-Agent string, Google returns the proper HTML response.
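A quick way to check whether a given User-Agent gets through is to inspect the response before parsing. A minimal sketch (the header string is just an example of a full browser UA, and Google's exact blocking behaviour varies):

import requests

headers = {"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 14526.89.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.133 Safari/537.36"}

response = requests.get("https://google.com/search", params={"q": "iphone"}, headers=headers)
print(response.status_code)  # 200 means normal HTML came back; 429 usually means you were rate limited
print(response.url)          # a redirect to a google.com/sorry/ URL is another common sign of blocking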
Google related searches can be scraped without a browser. It will be faster and more reliable.
Here is your fixed code snippet (link to the full code in online IDE):
import time
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {
    # a full browser User-Agent so Google returns the regular HTML response
    "User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 14526.89.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.133 Safari/537.36"
}

queries = ["iphone", "pixel", "samsung"]
df2 = []

# get_current_related_searches
for query in queries:
    params = {"q": query}
    response = requests.get("https://google.com/search", params=params, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    # each related search sits in an .AB4Wff element inside the .y6Uyqe block
    p = soup.select(".y6Uyqe .AB4Wff")

    d = pd.DataFrame(
        {"loop": 1, "source": query, "from": query, "to": [s.text for s in p]}
    )
    terms = d["to"]
    df2.append(d)
    time.sleep(3)  # brief pause between queries to avoid being rate limited

df = pd.concat(df2).reset_index(drop=False)
df.to_csv("related_searches.csv")
Sample output:
,index,loop,source,from,to
0,0,1,iphone,iphone,iphone 13
1,1,1,iphone,iphone,iphone 12
2,2,1,iphone,iphone,iphone x
3,3,1,iphone,iphone,iphone 8
4,4,1,iphone,iphone,iphone 7
5,5,1,iphone,iphone,iphone xr
6,6,1,iphone,iphone,find my iphone
7,0,1,pixel,pixel,pixel 6
8,1,1,pixel,pixel,google pixel
9,2,1,pixel,pixel,pixel phone
10,3,1,pixel,pixel,pixel 6 pro
11,4,1,pixel,pixel,pixel 3
12,5,1,pixel,pixel,google pixel price
13,6,1,pixel,pixel,pixel 6 release date
14,0,1,samsung,samsung,samsung galaxy
15,1,1,samsung,samsung,samsung tv
16,2,1,samsung,samsung,samsung tablet
17,3,1,samsung,samsung,samsung account
18,4,1,samsung,samsung,samsung mobile
19,5,1,samsung,samsung,samsung store
20,6,1,samsung,samsung,samsung a21s
21,7,1,samsung,samsung,samsung login
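A side note on the output above: the unnamed first column is the DataFrame index that to_csv writes by default, and the index column is the old per-query index kept by reset_index(drop=False). If you only want the data columns, a small tweak to the last two lines (same variables as in the snippet above):

df = pd.concat(df2).reset_index(drop=True)      # discard the old per-query index
df.to_csv("related_searches.csv", index=False)  # do not write the DataFrame index column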