Scrape Span Text from Google
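I'm a beginner trying to scrape text from Google search results, but I keep getting empty results.
I have a list of names, and for each one I need the Google search result text from <span class="st">.
I tried
text_results = soup.find_all("span", attrs={'class':'st'})
but text_results is [].
It should return the description text.
Code: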
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

names = data['Names']  # list of names (data is loaded elsewhere)
# tags is a string of extra search terms appended to every name (defined elsewhere)
list_url = ["https://www.google.com/search?q=" + name + tags for name in names]

rows = []
for l in list_url:
    url = requests.get(l)
    soup = bs(url.text, "html.parser")
    # This is the call that comes back empty
    text_results = soup.find_all("span", attrs={'class': 'st'})
    print(text_results)
    rows.append({'Name': l, 'Link': l, 'About': text_results})

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
soup_df = pd.DataFrame(rows)
# Recover the name by dropping the URL prefix and the trailing tags (pandas >= 1.4)
soup_df['Name'] = soup_df['Name'].str.removeprefix("https://www.google.com/search?q=")
soup_df['Name'] = soup_df['Name'].str.removesuffix(tags)
Expected result:
About | Name | Link
Joan Smith. Engineer at Apple...|JOAN S SMITH|https://www.google...
Joey Smith. Engineer at Apple...|JOEY S SMITH|https://www.google...
John Smith. Engineer at Apple...|JOHN S SMITH|https://www.google...
Josh Smith. Engineer at Apple...|JOSH S SMITH|https://www.google...
Actual result:
About | Name | Link
[] |JOAN S SMITH|https://www.google.com/search?q=JOAN S SMITH..
[] |JOEY S SMITH|https://www.google.com/search?q=JOEY S SMITH..
[] |JOHN S SMITH|https://www.google.com/search?q=JOHN S SMITH..
[] |JOSH S SMITH|https://www.google.com/search?q=JOSH S SMITH..
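It looks like the HTML Google returns to your script differs from what you get in the browser. You should change the selector
soup.find_all("span", attrs={'class':'st'})
to some other path that actually exists in the returned HTML.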
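One way to find a valid path is to look at which class names actually appear on the span elements in the HTML your script received. The sketch below is only an illustration: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra dependencies, and the names SpanClassCounter and span_class_counts as well as the sample HTML are made up for this example.

```python
from collections import Counter
from html.parser import HTMLParser


class SpanClassCounter(HTMLParser):
    """Count the class names that appear on <span> tags."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        for key, value in attrs:
            if key == "class" and value:
                # A class attribute may hold several space-separated names.
                self.counts.update(value.split())


def span_class_counts(html):
    """Return a Counter mapping span class names to how often they occur."""
    parser = SpanClassCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


# Hypothetical HTML standing in for url.text; real Google markup differs.
sample_html = (
    '<div><span class="abc">Joan Smith. Engineer at Apple...</span>'
    '<span class="abc xyz">Joey Smith. Engineer at Apple...</span>'
    '<span>no class here</span></div>'
)
counts = span_class_counts(sample_html)
print(counts.most_common())
print("st" in counts)  # False here, which would explain an empty find_all result
```

Calling span_class_counts(url.text) on the real response shows whether st is present at all. If it is not, pick a class that is, or save url.text to a file and inspect it by hand.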
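Make sure you're sending a user-agent header. A missing one is a likely reason for the empty result, because Google will eventually block your requests. Check what your user-agent is. I covered this in an answer some time ago.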
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get('YOUR_URL', headers=headers)
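To apply this to a whole list of names, it can help to build each search URL with proper encoding and to send the same headers on every request. The following is a minimal sketch under assumptions: build_search_url and query_from_url are hypothetical helpers, tags is assumed to be a plain string of extra search terms, and the network call itself is left as a comment.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Same user-agent string as above, split across lines for readability.
HEADERS = {
    "User-agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
    )
}


def build_search_url(name, tags=""):
    """Return a Google search URL with the query safely URL-encoded."""
    query = f"{name} {tags}".strip()
    return "https://www.google.com/search?" + urlencode({"q": query})


def query_from_url(url):
    """Recover the decoded search query from a URL built above."""
    return parse_qs(urlparse(url).query)["q"][0]


url = build_search_url("JOAN S SMITH", "engineer")
print(url)
print(query_from_url(url))
# With requests installed, the request itself would be:
# response = requests.get(url, headers=HEADERS, timeout=10)
```

Parsing the query back out of the URL avoids str.lstrip, which treats its argument as a set of characters rather than a literal prefix.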
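Alternatively, you can use the Google Organic Results API from SerpApi to get this output. It's a paid API with a free plan.
The difference is that you only need to iterate over structured JSON and pick out what you want, rather than figuring out how to make the scraping work.
Part of the JSON: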
{
  "position": 1,
  "title": "Bill Clinton - Wikipedia",
  "link": "https://en.wikipedia.org/wiki/Bill_Clinton",
  "displayed_link": "https://en.wikipedia.org › wiki › Bill_Clinton",
  "snippet": "William Jefferson Clinton is an American lawyer and politician who served as the 42nd president of the United States from 1993 to 2001. Prior to his presidency, ...",
  "sitelinks": {
    "inline": [
      {
        "title": "Presidency of Bill Clinton",
        "link": "https://en.wikipedia.org/wiki/Presidency_of_Bill_Clinton"
      }
    ]
  }
}
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "bill clinton",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(f"Description text: {result['snippet']}\n")
Output from replit.com:
Description text: William Jefferson Clinton is an American lawyer and politician who served as the 42nd president of the United States from 1993 to 2001. Prior to his presidency, ...
Description text: Bill Clinton is an American politician from Arkansas who served as the 42nd President of the United States (1993-2001). He took office at the end of the Cold War ...
Description text: William Jefferson Clinton, the first Democratic president in six decades to be elected twice, led the U.S. to the longest economic expansion in American history, ...
Description text: Bill Clinton, byname of William Jefferson Clinton, original name William Jefferson Blythe III, (born August 19, 1946, Hope, Arkansas, U.S.), 42nd president of the ...
Description text: Bill Clinton was the 42nd president of the United States, serving from 1993 to 2001. In 1978 Clinton became the youngest governor in the ...
Description text: President Bill Clinton. 3834926 likes · 1078 talking about this. Founder, Clinton Foundation and 42nd President of the United States. Posts by Bill...
Description text: William Jefferson Clinton spent the first six years of his life in Hope, Arkansas, where he was born on August 19, 1946. His father, William Jefferson Blythe, had ...
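To get from these results to the Name / Link / About table in the question, the snippets can be collected per name. The sketch below is an illustration under assumptions: rows_for_name is a hypothetical helper, it works on a dictionary shaped like the JSON excerpt above, and it uses .get() in case a result carries no snippet. The sample data is shortened from that excerpt, and its second entry is made up.

```python
def rows_for_name(name, results):
    """Turn one search response (a dict) into rows of Name / Link / About."""
    rows = []
    for result in results.get("organic_results", []):
        rows.append({
            "Name": name,
            "Link": result.get("link", ""),
            # Fall back to an empty string in case a result has no snippet.
            "About": result.get("snippet", ""),
        })
    return rows


# Sample data: the first entry is shortened from the JSON excerpt,
# the second is a made-up entry without link or snippet.
sample_results = {
    "organic_results": [
        {
            "position": 1,
            "title": "Bill Clinton - Wikipedia",
            "link": "https://en.wikipedia.org/wiki/Bill_Clinton",
            "snippet": "William Jefferson Clinton is an American lawyer ...",
        },
        {"position": 2, "title": "Made-up entry"},
    ]
}

rows = rows_for_name("bill clinton", sample_results)
for row in rows:
    print(row)
```

In real use, results would come from search.get_dict() for each name, and passing the combined list of rows to pd.DataFrame would build the table.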
Disclaimer, I work for SerpApi.