How to print Google Search results properly with bs4?
I have working code that prints the search titles first and then the URLs, but it prints a lot of URLs between the website titles. How can I print them in a format like the one below and avoid printing the same URL 10 times for each result:
1) Title url
2) Title url
and so on...
My code:
import requests
from bs4 import BeautifulSoup

search = input("Search:")
page = requests.get(f"https://www.google.com/search?q=" + search)

soup = BeautifulSoup(page.content, "html5lib")
links = soup.findAll("a")
heading_object = soup.find_all('h3')

for info in heading_object:
    x = info.getText()
    print(x)
    for link in links:
        link_href = link.get('href')
        if "url?q=" in link_href:
            y = (link.get('href').split("?q=")[1].split("&sa=U")[0])
            print(y)
If you get the titles and the links separately, then you can use zip() to group them into pairs:
for info, link in zip(heading_object, links):
    info = info.getText()
    link = link.get('href')
    if "?q=" in link:
        link = link.split("?q=")[1].split("&sa=U")[0]
    print(info, link)
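If you also want the numbered "1) Title url" format from the question, a minimal sketch on top of the same zip() pairing is to wrap it with enumerate():

for i, (info, link) in enumerate(zip(heading_object, links), start=1):
    info = info.getText()
    link = link.get('href')
    if "?q=" in link:
        link = link.split("?q=")[1].split("&sa=U")[0]
    print(f"{i}) {info} {link}")  # e.g. "1) Title url"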
But this can be a problem when some title or link is missing from the page, because it will create wrong pairs: it will match a title with the link from the next element. Instead, you should search for the elements that hold both the title and the link, and inside each of them look for a single title and a single link to build the pair. If a title or link is missing, you can put in some default value, and it won't create wrong pairs.
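A minimal sketch of that idea, assuming each organic result lives in a container like <div class="g"> (treat the class name as a placeholder, Google changes its markup periodically):

# look for the container that holds both the <h3> title and the <a> link
for result in soup.find_all('div', class_='g'):
    heading = result.find('h3')
    anchor = result.find('a')
    title = heading.getText() if heading else "No title"          # default keeps the pairs aligned
    link = anchor.get('href', "No link") if anchor else "No link"  # default keeps the pairs aligned
    if "?q=" in link:
        link = link.split("?q=")[1].split("&sa=U")[0]
    print(title, link)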
You're looking for this:
for result in soup.select('.yuRUbf'):
    title = result.select_one('.DKV0Md').text
    url = result.a['href']
    print(f'{title}, {url}\n')  # prints TITLE, URL followed by a new line
If you're using an f-string, the proper way to use it is like this:
page = requests.get(f"https://www.google.com/search?q=" + search) # not proper f-string
page = requests.get(f"https://www.google.com/search?q={search}") # proper f-string
Code:
import requests, lxml
from bs4 import BeautifulSoup
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "python memes",
    "hl": "en"
}

soup = BeautifulSoup(requests.get('https://www.google.com/search', headers=headers, params=params).text, 'lxml')

for result in soup.select('.yuRUbf'):
    title = result.select_one('.DKV0Md').text
    url = result.a['href']
    print(f'{title}, {url}\n')
--------
'''
35 Funny And Best Python Programming Memes - CodeItBro, https://www.codeitbro.com/funny-python-programming-memes/
ML Memes (@python.memes_) • Instagram photos and videos, https://www.instagram.com/python.memes_/?hl=en
28 Python Memes ideas - Pinterest, https://in.pinterest.com/codeitbro/python-memes/
'''
Alternatively, you can achieve the same thing with the Google Organic Results API from SerpApi. It's a paid API with a free plan.
One of the differences is that you only need to iterate over the JSON rather than figure out how to scrape things.
Code to integrate:
from serpapi import GoogleSearch
import os
params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google",
    "q": "python memes",
    "hl": "en"
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    title = result['title']
    url = result['link']
    print(f'{title}, {url}\n')
-------
'''
35 Funny And Best Python Programming Memes - CodeItBro, https://www.codeitbro.com/funny-python-programming-memes/
ML Memes (@python.memes_) • Instagram photos and videos, https://www.instagram.com/python.memes_/?hl=en
28 Python Memes ideas - Pinterest, https://in.pinterest.com/codeitbro/python-memes/
'''
Disclaimer, I work for SerpApi.