How to print Google Search results properly with bs4?
I have working code that prints the search titles first and then the URLs, but it prints a lot of URLs between the website titles. How can I print them in a format like the one below and avoid printing the same URL 10 times for each result:
1) Title url
2) Title url
and so on...
My code:
import requests
from bs4 import BeautifulSoup

search = input("Search:")
page = requests.get(f"https://www.google.com/search?q=" + search)

soup = BeautifulSoup(page.content, "html5lib")
links = soup.findAll("a")
heading_object = soup.find_all('h3')

for info in heading_object:
    x = info.getText()
    print(x)
    for link in links:
        link_href = link.get('href')
        if "url?q=" in link_href:
            y = (link.get('href').split("?q=")[1].split("&sa=U")[0])
            print(y)
If you get the titles and the links separately, then you can use zip() to group them into pairs:
for info, link in zip(heading_object, links):
    info = info.getText()
    link = link.get('href')
    if "?q=" in link:
        link = link.split("?q=")[1].split("&sa=U")[0]
    print(info, link)
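If you also want the numbered "1) Title url" format from the question, a minimal sketch on top of the same zip() pairing is to wrap it with enumerate():

for i, (info, link) in enumerate(zip(heading_object, links), start=1):
    info = info.getText()
    link = link.get('href')
    if "?q=" in link:
        link = link.split("?q=")[1].split("&sa=U")[0]
    print(f"{i}) {info} {link}")  # e.g. "1) Title url"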
But this can be a problem when some title or link is missing from the page, because it will create wrong pairs: it will match a title with the link from the next element. Instead, you should search for the elements that hold both the title and the link, and inside each of them look for a single title and a single link to build the pair. If a title or link is missing, you can put in some default value, and it won't create wrong pairs.
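A minimal sketch of that idea, assuming each organic result lives in a container like <div class="g"> (treat the class name as a placeholder, Google changes its markup periodically):

# look for the container that holds both the <h3> title and the <a> link
for result in soup.find_all('div', class_='g'):
    heading = result.find('h3')
    anchor = result.find('a')
    title = heading.getText() if heading else "No title"          # default keeps the pairs aligned
    link = anchor.get('href', "No link") if anchor else "No link"  # default keeps the pairs aligned
    if "?q=" in link:
        link = link.split("?q=")[1].split("&sa=U")[0]
    print(title, link)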
You're looking for this:
for result in soup.select('.yuRUbf'):
    title = result.select_one('.DKV0Md').text
    url = result.a['href']
    print(f'{title}, {url}\n')  # prints TITLE, URL followed by a new line
If you're using an f-string, the proper way to use it is like this:
page = requests.get(f"https://www.google.com/search?q=" + search) # not proper f-string
page = requests.get(f"https://www.google.com/search?q={search}") # proper f-string
Code:
import requests, lxml
from bs4 import BeautifulSoup
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "python memes",
    "hl": "en"
}

soup = BeautifulSoup(requests.get('https://www.google.com/search', headers=headers, params=params).text, 'lxml')

for result in soup.select('.yuRUbf'):
    title = result.select_one('.DKV0Md').text
    url = result.a['href']
    print(f'{title}, {url}\n')
--------
'''
35 Funny And Best Python Programming Memes - CodeItBro, https://www.codeitbro.com/funny-python-programming-memes/
ML Memes (@python.memes_) • Instagram photos and videos, https://www.instagram.com/python.memes_/?hl=en
28 Python Memes ideas - Pinterest, https://in.pinterest.com/codeitbro/python-memes/
'''
Alternatively, you can achieve the same thing with the Google Organic Results API from SerpApi. It's a paid API with a free plan.
One of the differences is that you only need to iterate over the JSON rather than figure out how to scrape things.
Code to integrate:
from serpapi import GoogleSearch
import os
params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google",
    "q": "python memes",
    "hl": "en"
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    title = result['title']
    url = result['link']
    print(f'{title}, {url}\n')
-------
'''
35 Funny And Best Python Programming Memes - CodeItBro, https://www.codeitbro.com/funny-python-programming-memes/
ML Memes (@python.memes_) • Instagram photos and videos, https://www.instagram.com/python.memes_/?hl=en
28 Python Memes ideas - Pinterest, https://in.pinterest.com/codeitbro/python-memes/
'''
Disclaimer, I work for SerpApi.