Pull Data/Links from Google Searches using Beautiful Soup
Evening folks,
I want to ask Google a question and pull all of the relevant links from its respective search results (i.e. I search "site:wikipedia.com Thomas Jefferson" and it gives me wiki.com/jeff, wiki.com/tom, etc.)
Here is my code:
from bs4 import BeautifulSoup
from urllib2 import urlopen
query = 'Thomas Jefferson'
query.replace(" ", "+")
#replaces whitespace with a plus sign for Google compatibility purposes
soup = BeautifulSoup(urlopen("https://www.google.com/?gws_rd=ssl#q=site:wikipedia.com+" + query), "html.parser")
#creates soup and opens URL for Google. Begins search with site:wikipedia.com so only wikipedia
#links show up. Uses html parser.
for item in soup.find_all('h3', attrs={'class' : 'r'}):
    print item.string
#Guides BS to h3 class "r" where green Wikipedia URLs are located, then prints URLs
#Limiter code to only pull top 5 results
My goal is to set the query variable, have Python query Google, and have Beautiful Soup pull all of the "green" links, if you will.
Here is a picture of a Google results page
I simply want to pull the green links in their entirety. Oddly, Google's source code is "hidden" (a symptom of their search architecture), so Beautiful Soup can't just grab the hrefs out of the h3 tags. I can see the h3 hrefs in Inspect Element, but not when viewing the page source.
Here is a picture of the Inspect Element
My question: how do I pull the top 5 most relevant green links from Google via BeautifulSoup if I can't access their source code, only Inspect Element?
PS: To give an idea of what I'm trying to accomplish, here are two relatively close Stack Overflow questions like mine:
beautiful soup extract a href from google search
How to collect data of Google Search with beautiful soup using python
This won't work with a hash search (#q=site:wikipedia.com like you have), as that loads the data in via AJAX rather than serving you the full, parseable HTML with the results in it. You should use this instead:
soup = BeautifulSoup(urlopen("https://www.google.com/search?gbv=1&q=site:wikipedia.com+" + query), "html.parser")
As a reference, I disabled JavaScript and performed a Google search to get this URL structure.
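Putting that together with the question's code, a minimal Python 2 sketch (whether the h3.r markup still matches depends on Google's current HTML, so treat this as illustrative, not definitive):

from bs4 import BeautifulSoup
from urllib2 import urlopen

query = 'Thomas Jefferson'
query = query.replace(" ", "+")  # note: str.replace() returns a new string, so assign it back

# gbv=1 requests the basic (no-JavaScript) version of the results page,
# so the result links are present in the served HTML rather than loaded via AJAX
soup = BeautifulSoup(urlopen("https://www.google.com/search?gbv=1&q=site:wikipedia.com+" + query), "html.parser")

for item in soup.find_all('h3', attrs={'class': 'r'}):
    print item.string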
When I tried searching with JavaScript disabled, I got a different URL than Rob M. did -
https://www.google.com/search?q=site:wikipedia.com+Thomas+Jefferson&gbv=1&sei=YwHNVpHLOYiWmQHk3K24Cw
To make this work for any query, you should first make sure your query has no spaces in it (that's why you'd get a 400: Bad Request). You can do this using urllib.quote_plus():
query = "Thomas Jefferson"
query = urllib.quote_plus(query)
This will urlencode all spaces as plus signs - creating a valid URL.
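For example, in a Python 2 shell:

>>> import urllib
>>> urllib.quote_plus("Thomas Jefferson")
'Thomas+Jefferson'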
However, this does not work with urllib - you get a 403: Forbidden. I got it to work by using the python-requests module like this:
import requests
import urllib
from bs4 import BeautifulSoup
query = 'Thomas Jefferson'
query = urllib.quote_plus(query)
r = requests.get('https://www.google.com/search?q=site:wikipedia.com+{}&gbv=1&sei=YwHNVpHLOYiWmQHk3K24Cw'.format(query))
soup = BeautifulSoup(r.text, "html.parser")
#creates soup and opens URL for Google. Begins search with site:wikipedia.com so only wikipedia
#links show up. Uses html parser.
links = []
for item in soup.find_all('h3', attrs={'class' : 'r'}):
    links.append(item.a['href'][7:]) # [7:] strips the /url?q= prefix
#Guides BS to h3 class "r" where green Wikipedia URLs are located, then prints URLs
#Limiter code to only pull top 5 results
Printing the links gives:
print links
# [u'http://en.wikipedia.com/wiki/Thomas_Jefferson&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggUMAA&usg=AFQjCNG6INz_xj_-p7mpoirb4UqyfGxdWA',
# u'http://www.wikipedia.com/wiki/Jefferson%25E2%2580%2593Hemings_controversy&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggeMAE&usg=AFQjCNEjCPY-HCdfHoIa60s2DwBU1ffSPg',
# u'http://en.wikipedia.com/wiki/Sally_Hemings&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggjMAI&usg=AFQjCNGxy4i7AFsup0yPzw9xQq-wD9mtCw',
# u'http://en.wikipedia.com/wiki/Monticello&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggoMAM&usg=AFQjCNE4YlDpcIUqJRGghuSC43TkG-917g',
# u'http://en.wikipedia.com/wiki/Thomas_Jefferson_University&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggtMAQ&usg=AFQjCNEDuLjZwImk1G1OnNEnRhtJMvr44g',
# u'http://www.wikipedia.com/wiki/Jane_Randolph_Jefferson&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggyMAU&usg=AFQjCNHmXJMI0k4Bf6j3b7QdJffKk97tAw',
# u'http://en.wikipedia.com/wiki/United_States_presidential_election,_1800&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFgg3MAY&usg=AFQjCNEqsc9jDsDetf0reFep9L9CnlorBA',
# u'http://en.wikipedia.com/wiki/Isaac_Jefferson&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFgg8MAc&usg=AFQjCNHKAAgylhRjxbxEva5IvDA_UnVrTQ',
# u'http://en.wikipedia.com/wiki/United_States_presidential_election,_1796&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFghBMAg&usg=AFQjCNHviErFQEKbDlcnDZrqmxGuiBG9XA',
# u'http://en.wikipedia.com/wiki/United_States_presidential_election,_1804&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFghGMAk&usg=AFQjCNEJZSxCuXE_Dzm_kw3U7hYkH7OtlQ']
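The printed links still carry Google's tracking parameters (&sa=, &ved=, &usg=). A small follow-up sketch, assuming the links list built above, that trims them and keeps only the top 5 results the question asked for:

top_links = [link.split('&sa=')[0] for link in links[:5]]
# split on '&sa=' and keep the part before it, dropping the tracking suffix
for link in top_links:
    print link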
Actually, there's no need to disable JavaScript. It's more likely because you need to specify a user-agent to act as a "real" user visit.
When no user-agent is specified while using the requests library, it defaults to python-requests, so Google or another search engine recognizes the request as coming from a bot/script and might block it. The HTML you receive back will then contain some sort of error with different elements, which is why you were getting empty results.
Check what your user-agent is, or see a list of user-agents.
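For instance, one quick way to see what requests sends by default is to hit httpbin.org's echo endpoint (a third-party testing service; the exact version string will vary):

import requests

# httpbin echoes back the User-Agent header it received
print(requests.get('https://httpbin.org/user-agent').json())
# {'user-agent': 'python-requests/2.x.x'}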
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

response = requests.get('https://www.google.com/search?q=site:wikipedia.com thomas edison', headers=headers).text
soup = BeautifulSoup(response, 'lxml')

for links in soup.find_all('div', class_='yuRUbf'):
    link = links.a['href']
    print(link)

# or using the select() method, which accepts CSS selectors
for links in soup.select('.yuRUbf a'):
    link = links['href']
    print(link)
Output:
https://en.wikipedia.com/wiki/Edison,_New_Jersey
https://en.wikipedia.com/wiki/Motion_Picture_Patents_Company
https://www.wikipedia.com/wiki/Thomas_E._Murray
https://en.wikipedia.com/wiki/Incandescent_light_bulb
https://en.wikipedia.com/wiki/Phonograph_cylinder
https://en.wikipedia.com/wiki/Emile_Berliner
https://wikipedia.com/wiki/Consolidated_Edison
https://www.wikipedia.com/wiki/hello
https://www.wikipedia.com/wiki/Tom%20Alston
https://en.wikipedia.com/wiki/Edison_screw
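To match the question's "top 5" requirement with this markup, the CSS-selector results can simply be sliced, e.g. (assuming the same soup as above):

# keep only the first 5 organic result links
top_five = [a['href'] for a in soup.select('.yuRUbf a')[:5]]
print(top_five)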
Alternatively, you can use the Google Search Engine Results API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to figure out which HTML elements to scrape in order to extract the data, how to bypass blocks from Google or other search engines, or how to maintain the scraper over time (if something in the HTML changes).
Example code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "site:wikipedia.com thomas edison",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(f"Link: {result['link']}")
Output:
Link: https://en.wikipedia.com/wiki/Edison,_New_Jersey
Link: https://en.wikipedia.com/wiki/Motion_Picture_Patents_Company
Link: https://www.wikipedia.com/wiki/Thomas_E._Murray
Link: https://en.wikipedia.com/wiki/Incandescent_light_bulb
Link: https://en.wikipedia.com/wiki/Phonograph_cylinder
Link: https://en.wikipedia.com/wiki/Emile_Berliner
Link: https://wikipedia.com/wiki/Consolidated_Edison
Link: https://www.wikipedia.com/wiki/hello
Link: https://www.wikipedia.com/wiki/Tom%20Alston
Link: https://en.wikipedia.com/wiki/Edison_screw
Disclaimer, I work for SerpApi.
P.S. There's a dedicated web scraping blog of mine.