Scrape Span Text from Google

Question

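I'm a beginner trying to scrape text from Google search results, but I always get an empty result.

I have a list of names, and for each one I need the Google search result text from <span class="st">.

I tried using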

text_results = soup.find_all("span", attrs={'class':'st'})

but text_results comes back as []

It should return the description text.

Code:

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

# `data` (a DataFrame with a 'Names' column) and `tags` (a string of extra
# search keywords appended to every name) are defined elsewhere.
names = data['Names']  # list of names

rows = []
for name in names:
    link = "https://www.google.com/search?q=" + name + tags
    response = requests.get(link)
    soup = bs(response.text, "html.parser")

    # This is the call that comes back empty
    text_results = soup.find_all("span", attrs={'class': 'st'})
    print(text_results)

    # Keep the name itself instead of stripping it back out of the URL:
    # str.lstrip/rstrip remove a set of characters, not a prefix or suffix
    rows.append({'Name': name, 'Link': link, 'About': text_results})

soup_df = pd.DataFrame(rows)

Expected result:

About                           | Name       | Link
Joan Smith. Engineer at Apple...|JOAN S SMITH|https://www.google...
Joey Smith. Engineer at Apple...|JOEY S SMITH|https://www.google...
John Smith. Engineer at Apple...|JOHN S SMITH|https://www.google...
Josh Smith. Engineer at Apple...|JOSH S SMITH|https://www.google...

Actual result:

About | Name       | Link
[]    |JOAN S SMITH|https://www.google.com/search?q=JOAN S SMITH..
[]    |JOEY S SMITH|https://www.google.com/search?q=JOEY S SMITH..
[]    |JOHN S SMITH|https://www.google.com/search?q=JOHN S SMITH..
[]    |JOSH S SMITH|https://www.google.com/search?q=JOSH S SMITH..

Answer 1

It looks like Google returns different HTML to your script than what you see in the browser. You should change this selector:

 soup.find_all("span", attrs={'class':'st'})

to some other path that is valid for the HTML you actually receive.
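One way to find a valid path is to look at which classes actually appear in the HTML your script receives. The following is a minimal sketch of that idea, not a ready-made selector: class_census is a hypothetical helper, and sample_html is an invented stand-in for response.text, since Google's real markup differs and changes over time.

```python
from collections import Counter

from bs4 import BeautifulSoup


def class_census(html):
    """Count (tag name, class) pairs so candidate selectors can be spotted."""
    soup = BeautifulSoup(html, "html.parser")
    counts = Counter()
    for tag in soup.find_all(True):        # True matches every tag
        for cls in tag.get("class", []):   # bs4 returns class as a list
            counts[(tag.name, cls)] += 1
    return counts


# Hypothetical stand-in for response.text (NOT real Google markup).
sample_html = """
<div class="result"><a href="#">Joan Smith</a><div class="desc">Engineer ...</div></div>
<div class="result"><a href="#">Joey Smith</a><div class="desc">Engineer ...</div></div>
"""

census = class_census(sample_html)
for (tag_name, cls), n in census.most_common():
    print(tag_name, cls, n)
```

If ("span", "st") is missing from the census, as it is for this sample, the original find_all call has nothing to match, and the classes that repeat once per result are the ones worth trying instead.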

Answer 2

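Make sure you are sending a user-agent. That could be why you get empty results, because Google will eventually block your requests. Check what your user-agent is. I answered a similar question a while ago.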

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get('YOUR_URL', headers=headers)
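Applied to the list of names from the question, this could look roughly like the sketch below. build_search_url and fetch_html are hypothetical helper names, and tags is assumed to be a plain string of extra keywords. The query is URL-encoded because the names contain spaces. Whether this particular user-agent string is enough to get the browser-style HTML is not guaranteed.

```python
from urllib.parse import urlencode

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
    )
}


def build_search_url(name, tags=""):
    """Return a Google search URL with the query properly URL-encoded."""
    query = f"{name} {tags}".strip()
    return "https://www.google.com/search?" + urlencode({"q": query})


def fetch_html(url):
    """Fetch a page while sending the user-agent header above."""
    import requests  # imported here so build_search_url works without it

    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # fail loudly on a blocked or failed request
    return response.text


# Hypothetical input values; fetch_html(url) would perform the actual request.
url = build_search_url("JOAN S SMITH", "engineer")
print(url)
```

Printing response.text (or saving it to a file) for one name is a quick way to confirm that the markup now matches what the browser shows before running the whole list.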

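Alternatively, you can use the Google Organic Results API from SerpApi to get this output. It's a paid API with a free plan.

The difference is that you only need to iterate over structured JSON and pick out what you want, rather than figuring out how to make the scraping work.

Part of the JSON: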

 {
  "position": 1,
  "title": "Bill Clinton - Wikipedia",
  "link": "https://en.wikipedia.org/wiki/Bill_Clinton",
  "displayed_link": "https://en.wikipedia.org › wiki › Bill_Clinton",
  "snippet": "William Jefferson Clinton is an American lawyer and politician who served as the 42nd president of the United States from 1993 to 2001. Prior to his presidency, ...",
  "sitelinks": {
    "inline": [
      {
        "title": "Presidency of Bill Clinton",
        "link": "https://en.wikipedia.org/wiki/Presidency_of_Bill_Clinton"
      }
    ]
  }
}

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "bill clinton",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(f"Description text: {result['snippet']}\n")

Output from replit.com:

Description text: William Jefferson Clinton is an American lawyer and politician who served as the 42nd president of the United States from 1993 to 2001. Prior to his presidency, ...

Description text: Bill Clinton is an American politician from Arkansas who served as the 42nd President of the United States (1993-2001). He took office at the end of the Cold War ...

Description text: William Jefferson Clinton, the first Democratic president in six decades to be elected twice, led the U.S. to the longest economic expansion in American history, ...

Description text: Bill Clinton, byname of William Jefferson Clinton, original name William Jefferson Blythe III, (born August 19, 1946, Hope, Arkansas, U.S.), 42nd president of the ...

Description text: Bill Clinton was the 42nd president of the United States, serving from 1993 to 2001. In 1978 Clinton became the youngest governor in the ...

Description text: President Bill Clinton. 3834926 likes · 1078 talking about this. Founder, Clinton Foundation and 42nd President of the United States. Posts by Bill...

Description text: William Jefferson Clinton spent the first six years of his life in Hope, Arkansas, where he was born on August 19, 1946. His father, William Jefferson Blythe, had ...
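To get from that output to the About / Name / Link table in the question, each parsed response can be reduced to one row. This is only a sketch under assumptions: first_snippet_row is a hypothetical helper, the sample dict is an abbreviated copy of the JSON shown earlier, and Link here holds the top result's URL rather than the Google search URL. It uses .get() so a response with no results or no snippet yields an empty string instead of a KeyError.

```python
def first_snippet_row(name, results):
    """Build one table row from a parsed results dict shaped like the JSON above."""
    organic = results.get("organic_results", [])
    first = organic[0] if organic else {}
    return {
        "About": first.get("snippet", ""),
        "Name": name,
        "Link": first.get("link", ""),
    }


# Abbreviated sample in the same shape as the JSON shown earlier.
sample_results = {
    "organic_results": [
        {
            "position": 1,
            "title": "Bill Clinton - Wikipedia",
            "link": "https://en.wikipedia.org/wiki/Bill_Clinton",
            "snippet": "William Jefferson Clinton is an American lawyer and politician ...",
        }
    ]
}

row = first_snippet_row("BILL CLINTON", sample_results)
empty = first_snippet_row("NO RESULTS", {})
print(row)
print(empty)
```

Collecting one such row per name in a list and passing that list to pd.DataFrame gives the table.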

Disclaimer: I work for SerpApi.

python

beautifulsoup

google-search

web-scraping

pandas