Beautifulsoup Returning Wrong href Value
I am using the following SERP code for some SEO work, but when I try to read the href
attribute I get incorrect results: the output shows other, weird URLs from the page rather than the ones I expect. What is wrong with my code?
import html
import requests
from bs4 import BeautifulSoup

URL = "https://www.google.com/search?q=beautiful+soup&rlz=1C1GCEB_enIN922IN922&oq=beautiful+soup&aqs=chrome..69i57j69i60l3.2455j0j7&sourceid=chrome&ie=UTF-8"
r = requests.get(URL)
webPage = html.unescape(r.text)
soup = BeautifulSoup(webPage, 'html.parser')

gresults = soup.findAll('h3')
for result in gresults:
    print(result.text)
    links = result.parent.parent.find_all('a', href=True)
    for link in links:
        print(link.get('href'))
The output looks like this:
/url?q=https://www.crummy.com/software/BeautifulSoup/bs4/doc/&sa=U&ved=2ahUKEwjv6-q3tJ30AhX_r1YBHU9OAeMQFnoECAAQAg&usg=AOvVaw2Q
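(As a side note, these `/url?q=...` values are Google's redirect wrappers. If you do end up with one, the real destination can be recovered from its `q` query parameter with the standard library; a small sketch using the URL from the output above:)

```python
from urllib.parse import urlparse, parse_qs

# A Google redirect link as returned in the output above
redirect = ("/url?q=https://www.crummy.com/software/BeautifulSoup/bs4/doc/"
            "&sa=U&ved=2ahUKEwjv6-q3tJ30AhX_r1YBHU9OAeMQFnoECAAQAg&usg=AOvVaw2Q")

# The actual target URL is stored in the "q" query parameter
real_url = parse_qs(urlparse(redirect).query)["q"][0]
print(real_url)  # https://www.crummy.com/software/BeautifulSoup/bs4/doc/
```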
What happens?
Selecting <h3> only gives you a result set that contains unwanted elements. Moving up to the parent
is fine, but calling find_all()
(do not use the old syntax findAll()
in new code) is not necessary, and it will also give you <a>
elements you probably do not want.
How to fix it?
Select your target elements more specifically; then you can use:
result.parent.parent.find('a', href=True).get('href')
But I would recommend the following example instead.
Example
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
url = 'http://www.google.com/search?q=beautiful+soup'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

data = []
for result in soup.select('#search a h3'):
    data.append({
        'title': result.text,
        'url': result.parent['href'],
    })
data
Output
[{'title': 'Beautiful Soup 4.9.0 documentation - Crummy',
'url': 'https://www.crummy.com/software/BeautifulSoup/bs4/doc/'},
{'title': 'Beautiful Soup Tutorial: Web Scraping mit Python',
'url': 'https://lerneprogrammieren.de/beautiful-soup-tutorial/'},
{'title': 'Beautiful Soup 4 - Web Scraping mit Python | HelloCoding',
'url': 'https://hellocoding.de/blog/coding-language/python/beautiful-soup-4'},
{'title': 'Beautiful Soup - Wikipedia',
'url': 'https://de.wikipedia.org/wiki/Beautiful_Soup'},
{'title': 'Beautiful Soup (HTML parser) - Wikipedia',
'url': 'https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)'},
{'title': 'Beautiful Soup Documentation — Beautiful Soup 4.4.0 ...',
'url': 'https://beautiful-soup-4.readthedocs.io/'},
{'title': 'BeautifulSoup4 - PyPI',
'url': 'https://pypi.org/project/beautifulsoup4/'},
{'title': 'Web Scraping und Parsen von HTML in Python mit Beautiful ...',
'url': 'https://www.twilio.com/blog/web-scraping-und-parsen-von-html-python-mit-beautiful-soup'}]
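Why result.parent['href'] works: in this markup each result's <h3> sits directly inside its <a> tag. A minimal, made-up snippet (not Google's real markup) demonstrating the same traversal:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup mimicking one search result
html_doc = """
<div id="search">
  <a href="https://example.com/page"><h3>Example title</h3></a>
</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# The <h3> matched by the selector has the <a> as its direct parent,
# so the link is available via .parent['href']
data = [{'title': h3.text, 'url': h3.parent['href']}
        for h3 in soup.select('#search a h3')]
print(data)  # [{'title': 'Example title', 'url': 'https://example.com/page'}]
```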
1. It will return all <h3>
elements from the HTML, including text from sections such as "Related searches", "Videos", and "People also ask", which in this case is not what you are looking for.
gresults = soup.findAll('h3')
2. This way of searching is fine in some cases, but it is not preferred here, because you are doing it somewhat blindly: if one of the .parent
nodes (elements) disappears, it will throw an error.
Instead of doing all of this, call an appropriate CSS
selector (more on that below) without this method chaining, which can become unreadable (if there are many parent nodes).
result.parent.parent.find_all()
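To see why blind .parent chaining is fragile, here is a toy snippet (made-up markup, not Google's): once the expected <a> is missing, find() returns None, and a subsequent .get('href') on that result would raise AttributeError.

```python
from bs4 import BeautifulSoup

# Made-up markup where the structure changed: no <a> next to the <h3>
html_doc = "<div><h3>Some title</h3></div>"
soup = BeautifulSoup(html_doc, 'html.parser')

h3 = soup.find('h3')
# find() returns None when nothing matches, so chaining another call
# onto the result would blow up with AttributeError
link = h3.parent.find('a', href=True)
print(link)  # None -> link.get('href') here would raise AttributeError
```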
3. get('href')
would work, but you get URLs like this because no user-agent
is passed in the request headers
, which is needed to "act" as a real user visit. When a user-agent
is passed in the request headers
, you will get the proper URLs you expect (I don't know the proper explanation for this behavior).
If you don't pass a user-agent
to the request headers
while using the requests
library, it defaults to python-requests, so Google or other search engines (websites) understand that it's a bot/script, and might block the request, or the received HTML will be different from the one you see in your browser. Check what your user-agent
is. List of user-agents
.
Passing a user-agent
in the request headers
:
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get('URL', headers=headers)
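You can check the default yourself: requests exposes the User-Agent string it sends when you don't override it (the exact version number will differ per install):

```python
import requests

# Default User-Agent that requests sends when none is set explicitly
default_ua = requests.utils.default_user_agent()
print(default_ua)  # e.g. "python-requests/2.31.0"
```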
To make it work, you need to:
1. Find a container with all the needed data (check out the SelectorGadget extension) by calling a specific CSS
selector. CSS
selectors reference.
Think of the container as a box with stuff inside from which you'll grab items by specifying which item you want to get. In your case, it would be (without using two for
loops):
# .yuRUbf -> container
for result in soup.select('.yuRUbf'):
    # .DKV0Md -> CSS selector for the title, which is located inside the container
    title = result.select_one('.DKV0Md').text
    # grab <a> and extract the href attribute.
    # .get('href') is equivalent to ['href']
    link = result.select_one('a')['href']
Full code and example in the online IDE:
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582'
}

response = requests.get('https://www.google.com/search?q=beautiful+soup', headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# enumerate() adds a counter to an iterable and returns it
# https://www.programiz.com/python-programming/methods/built-in/enumerate
for index, result in enumerate(soup.select('.yuRUbf')):
    position = index + 1
    title = result.select_one('.DKV0Md').text
    link = result.select_one('a')['href']
    print(position, title, link, sep='\n')
# part of the output
'''
1
Beautiful Soup 4.9.0 documentation - Crummy
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
2
Beautiful Soup Documentation — Beautiful Soup 4.4.0 ...
https://beautiful-soup-4.readthedocs.io/
3
BeautifulSoup4 - PyPI
https://pypi.org/project/beautifulsoup4/
'''
Alternatively, you can achieve the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that it is built for tasks like this. You don't have to figure out which CSS
selector to use, how to bypass blocks from Google or other search engines, or maintain the code over time (if something in the HTML changes). Instead, focus on the data you want to get. Check out the playground (login required).
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "api_key": os.getenv("API_KEY"),  # YOUR API KEY
    "engine": "google",               # search engine
    "q": "Beautiful Soup",            # query
    "hl": "en"                        # language
    # other parameters
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    position = result["position"]  # website rank position
    title = result["title"]
    link = result["link"]
    print(position, title, link, sep="\n")
# part of the output
'''
1
Beautiful Soup 4.9.0 documentation - Crummy
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
2
Beautiful Soup Documentation — Beautiful Soup 4.4.0 ...
https://beautiful-soup-4.readthedocs.io/
3
BeautifulSoup4 - PyPI
https://pypi.org/project/beautifulsoup4/
'''
Disclaimer, I work for SerpApi.
P.S. I have a dedicated web scraping blog.