Beautifulsoup Returning Wrong href Value
I am using the following SERP code for some SEO work, but when I try to read the href
attribute I get incorrect results: the output shows other, weird URLs from the page rather than the ones I expect. What is wrong with my code?
import html
import requests
from bs4 import BeautifulSoup

URL = "https://www.google.com/search?q=beautiful+soup&rlz=1C1GCEB_enIN922IN922&oq=beautiful+soup&aqs=chrome..69i57j69i60l3.2455j0j7&sourceid=chrome&ie=UTF-8"
r = requests.get(URL)
webPage = html.unescape(r.text)
soup = BeautifulSoup(webPage, 'html.parser')

gresults = soup.findAll('h3')
for result in gresults:
    print(result.text)
    links = result.parent.parent.find_all('a', href=True)
    for link in links:
        print(link.get('href'))
The output looks like this:
/url?q=https://www.crummy.com/software/BeautifulSoup/bs4/doc/&sa=U&ved=2ahUKEwjv6-q3tJ30AhX_r1YBHU9OAeMQFnoECAAQAg&usg=AOvVaw2Q
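(As a side note, these `/url?q=...` values are Google's redirect wrappers. If you do end up with one, the real destination can be recovered from its `q` query parameter with the standard library; a small sketch using the URL from the output above:)

```python
from urllib.parse import urlparse, parse_qs

# A Google redirect link as returned in the output above
redirect = ("/url?q=https://www.crummy.com/software/BeautifulSoup/bs4/doc/"
            "&sa=U&ved=2ahUKEwjv6-q3tJ30AhX_r1YBHU9OAeMQFnoECAAQAg&usg=AOvVaw2Q")

# The actual target URL is stored in the "q" query parameter
real_url = parse_qs(urlparse(redirect).query)["q"][0]
print(real_url)  # https://www.crummy.com/software/BeautifulSoup/bs4/doc/
```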
What happens?
Selecting <h3> only gives you a result set that contains unwanted elements. Moving up to the parent
is fine, but calling find_all()
(do not use the old syntax findAll()
in new code) is not necessary, and it will also give you <a>
elements you probably do not want.
How to fix it?
Select your target elements more specifically; then you can use:
result.parent.parent.find('a', href=True).get('href')
But I would recommend the following example instead.
Example
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
url = 'http://www.google.com/search?q=beautiful+soup'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

data = []
for result in soup.select('#search a h3'):
    data.append({
        'title': result.text,
        'url': result.parent['href'],
    })
data
Output
[{'title': 'Beautiful Soup 4.9.0 documentation - Crummy',
'url': 'https://www.crummy.com/software/BeautifulSoup/bs4/doc/'},
{'title': 'Beautiful Soup Tutorial: Web Scraping mit Python',
'url': 'https://lerneprogrammieren.de/beautiful-soup-tutorial/'},
{'title': 'Beautiful Soup 4 - Web Scraping mit Python | HelloCoding',
'url': 'https://hellocoding.de/blog/coding-language/python/beautiful-soup-4'},
{'title': 'Beautiful Soup - Wikipedia',
'url': 'https://de.wikipedia.org/wiki/Beautiful_Soup'},
{'title': 'Beautiful Soup (HTML parser) - Wikipedia',
'url': 'https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)'},
{'title': 'Beautiful Soup Documentation — Beautiful Soup 4.4.0 ...',
'url': 'https://beautiful-soup-4.readthedocs.io/'},
{'title': 'BeautifulSoup4 - PyPI',
'url': 'https://pypi.org/project/beautifulsoup4/'},
{'title': 'Web Scraping und Parsen von HTML in Python mit Beautiful ...',
'url': 'https://www.twilio.com/blog/web-scraping-und-parsen-von-html-python-mit-beautiful-soup'}]
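Why result.parent['href'] works: in this markup each result's <h3> sits directly inside its <a> tag. A minimal, made-up snippet (not Google's real markup) demonstrating the same traversal:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup mimicking one search result
html_doc = """
<div id="search">
  <a href="https://example.com/page"><h3>Example title</h3></a>
</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# The <h3> matched by the selector has the <a> as its direct parent,
# so the link is available via .parent['href']
data = [{'title': h3.text, 'url': h3.parent['href']}
        for h3 in soup.select('#search a h3')]
print(data)  # [{'title': 'Example title', 'url': 'https://example.com/page'}]
```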
1. It will return all <h3>
elements from the HTML, including text from sections such as "Related searches", "Videos", and "People also ask", which in this case is not what you are looking for.
gresults = soup.findAll('h3')
2. This way of searching is fine in some cases, but it is not preferred here, because you are doing it somewhat blindly: if one of the .parent
nodes (elements) disappears, it will throw an error.
Instead of doing all of this, call an appropriate CSS
selector (more on that below) without this method chaining, which can become unreadable (if there are many parent nodes).
result.parent.parent.find_all()
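To see why blind .parent chaining is fragile, here is a toy snippet (made-up markup, not Google's): once the expected <a> is missing, find() returns None, and a subsequent .get('href') on that result would raise AttributeError.

```python
from bs4 import BeautifulSoup

# Made-up markup where the structure changed: no <a> next to the <h3>
html_doc = "<div><h3>Some title</h3></div>"
soup = BeautifulSoup(html_doc, 'html.parser')

h3 = soup.find('h3')
# find() returns None when nothing matches, so chaining another call
# onto the result would blow up with AttributeError
link = h3.parent.find('a', href=True)
print(link)  # None -> link.get('href') here would raise AttributeError
```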
3. get('href')
would work, but you get URLs like this because no user-agent
is passed in the request headers
, which is needed to "act" as a real user visit. When a user-agent
is passed in the request headers
, you will get the proper URLs you expect (I don't know the proper explanation for this behavior).
If you don't pass a user-agent
to the request headers
while using the requests
library, it defaults to python-requests, so Google or other search engines (websites) understand that it's a bot/script, and might block the request, or the received HTML will be different from the one you see in your browser. Check what your user-agent
is. List of user-agents
.
Passing a user-agent
in the request headers
:
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get('URL', headers=headers)
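You can check the default yourself: requests exposes the User-Agent string it sends when you don't override it (the exact version number will differ per install):

```python
import requests

# Default User-Agent that requests sends when none is set explicitly
default_ua = requests.utils.default_user_agent()
print(default_ua)  # e.g. "python-requests/2.31.0"
```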
To make it work, you need to:
1. Find a container with all the needed data (check out the SelectorGadget extension) by calling a specific CSS
selector. CSS
selectors reference.
Think of the container as a box with stuff inside from which you'll grab items by specifying which item you want to get. In your case, it would be (without using two for
loops):
# .yuRUbf -> container
for result in soup.select('.yuRUbf'):
    # .DKV0Md -> CSS selector for the title, which is located inside the container
    title = result.select_one('.DKV0Md').text
    # grab <a> and extract the href attribute.
    # .get('href') is equivalent to ['href']
    link = result.select_one('a')['href']
Full code and example in the online IDE:
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582'
}

response = requests.get('https://www.google.com/search?q=beautiful+soup', headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

# enumerate() adds a counter to an iterable and returns it
# https://www.programiz.com/python-programming/methods/built-in/enumerate
for index, result in enumerate(soup.select('.yuRUbf')):
    position = index + 1
    title = result.select_one('.DKV0Md').text
    link = result.select_one('a')['href']
    print(position, title, link, sep='\n')
# part of the output
'''
1
Beautiful Soup 4.9.0 documentation - Crummy
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
2
Beautiful Soup Documentation — Beautiful Soup 4.4.0 ...
https://beautiful-soup-4.readthedocs.io/
3
BeautifulSoup4 - PyPI
https://pypi.org/project/beautifulsoup4/
'''
Alternatively, you can achieve the same thing by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that it is built for tasks like this. You don't have to figure out which CSS
selector to use, how to bypass blocks from Google or other search engines, or maintain the code over time (if something in the HTML changes). Instead, focus on the data you want to get. Check out the playground (login required).
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "api_key": os.getenv("API_KEY"),  # YOUR API KEY
    "engine": "google",               # search engine
    "q": "Beautiful Soup",            # query
    "hl": "en"                        # language
    # other parameters
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    position = result["position"]  # website rank position
    title = result["title"]
    link = result["link"]
    print(position, title, link, sep="\n")
# part of the output
'''
1
Beautiful Soup 4.9.0 documentation - Crummy
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
2
Beautiful Soup Documentation — Beautiful Soup 4.4.0 ...
https://beautiful-soup-4.readthedocs.io/
3
BeautifulSoup4 - PyPI
https://pypi.org/project/beautifulsoup4/
'''
Disclaimer, I work for SerpApi.
P.S. I have a dedicated web scraping blog.