Beautiful soup - html 解析器 returns 点而不是网络上可见的字符串
Beautiful soup - html parser returns dots instead of string visible on web
我正在尝试从以下 HTML:
中获取演员的数量:https://apify.com/store
<div class="ActorStore-statusNbHits">
<span class="ActorStore-statusNbHitsNumber">895</span>results</div>
当我发送 get
请求并使用 BeautifulSoup
解析响应时:
r = requests.get(base_url)
soup = BeautifulSoup(r.text, "html.parser")
return soup.find("span", class_="ActorStore-statusNbHitsNumber").text
我得到三个点 ...
而不是数字 895
元素是 <span class="ActorStore-statusNbHitsNumber">...</span>
我怎样才能得到号码?
如果您在浏览器中检查网络调用(按 F12)并按 XHR
过滤,您会看到数据已加载 动态地 通过发送 POST
请求:
您可以通过发送正确的 json
数据来模拟该请求。不需要 BeautifulSoup
你可以只使用 requests
模块。
这是一个完整的工作示例:
import requests
data = {
"query": "",
"page": 0,
"hitsPerPage": 24,
"restrictSearchableAttributes": [],
"attributesToHighlight": [],
"attributesToRetrieve": [
"title",
"name",
"username",
"userFullName",
"stats",
"description",
"pictureUrl",
"userPictureUrl",
"notice",
"currentPricingInfo",
],
}
response = requests.post(
"https://ow0o5i3qo7-dsn.algolia.net/1/indexes/prod_PUBLIC_STORE/query?x-algolia-agent=Algolia%20for%20JavaScript%20(4.12.1)%3B%20Browser%20(lite)&x-algolia-api-key=0ecccd09f50396a4dbbe5dbfb17f4525&x-algolia-application-id=OW0O5I3QO7",
json=data,
)
print(response.json()["nbHits"])
输出:
895
要查看所有 JSON 数据以访问 key/value 对,您可以使用:
from pprint import pprint
pprint(response.json(), indent=4)
部分输出:
{ 'exhaustiveNbHits': True,
'exhaustiveTypo': True,
'hits': [ { 'currentPricingInfo': None,
'description': 'Crawls arbitrary websites using the Chrome '
'browser and extracts data from pages using '
'a provided JavaScript code. The actor '
'supports both recursive crawling and lists '
'of URLs and automatically manages '
'concurrency for maximum performance. This '
"is Apify's basic tool for web crawling and "
'scraping.',
'name': 'web-scraper',
'objectID': 'moJRLRc85AitArpNN',
'pictureUrl': 'https://apify-image-uploads-prod.s3.amazonaws.com/moJRLRc85AitArpNN/Zn8vbWTika7anCQMn-SD-02-02.png',
'stats': { 'lastRunStartedAt': '2022-03-06T21:57:00.831Z',
'totalBuilds': 104,
'totalMetamorphs': 102660,
'totalRuns': 68036112,
'totalUsers': 23492,
'totalUsers30Days': 1726,
'totalUsers7Days': 964,
'totalUsers90Days': 3205},
我正在尝试从以下 HTML:
中获取演员的数量:https://apify.com/store<div class="ActorStore-statusNbHits">
<span class="ActorStore-statusNbHitsNumber">895</span>results</div>
当我发送 get
请求并使用 BeautifulSoup
解析响应时:
r = requests.get(base_url)
soup = BeautifulSoup(r.text, "html.parser")
return soup.find("span", class_="ActorStore-statusNbHitsNumber").text
我得到三个点 ...
而不是数字 895
元素是 <span class="ActorStore-statusNbHitsNumber">...</span>
我怎样才能得到号码?
如果您在浏览器中检查网络调用(按 F12)并按 XHR
过滤,您会看到数据已加载 动态地 通过发送 POST
请求:
您可以通过发送正确的 json
数据来模拟该请求。不需要 BeautifulSoup
你可以只使用 requests
模块。
这是一个完整的工作示例:
import requests
data = {
"query": "",
"page": 0,
"hitsPerPage": 24,
"restrictSearchableAttributes": [],
"attributesToHighlight": [],
"attributesToRetrieve": [
"title",
"name",
"username",
"userFullName",
"stats",
"description",
"pictureUrl",
"userPictureUrl",
"notice",
"currentPricingInfo",
],
}
response = requests.post(
"https://ow0o5i3qo7-dsn.algolia.net/1/indexes/prod_PUBLIC_STORE/query?x-algolia-agent=Algolia%20for%20JavaScript%20(4.12.1)%3B%20Browser%20(lite)&x-algolia-api-key=0ecccd09f50396a4dbbe5dbfb17f4525&x-algolia-application-id=OW0O5I3QO7",
json=data,
)
print(response.json()["nbHits"])
输出:
895
要查看所有 JSON 数据以访问 key/value 对,您可以使用:
from pprint import pprint
pprint(response.json(), indent=4)
部分输出:
{ 'exhaustiveNbHits': True,
'exhaustiveTypo': True,
'hits': [ { 'currentPricingInfo': None,
'description': 'Crawls arbitrary websites using the Chrome '
'browser and extracts data from pages using '
'a provided JavaScript code. The actor '
'supports both recursive crawling and lists '
'of URLs and automatically manages '
'concurrency for maximum performance. This '
"is Apify's basic tool for web crawling and "
'scraping.',
'name': 'web-scraper',
'objectID': 'moJRLRc85AitArpNN',
'pictureUrl': 'https://apify-image-uploads-prod.s3.amazonaws.com/moJRLRc85AitArpNN/Zn8vbWTika7anCQMn-SD-02-02.png',
'stats': { 'lastRunStartedAt': '2022-03-06T21:57:00.831Z',
'totalBuilds': 104,
'totalMetamorphs': 102660,
'totalRuns': 68036112,
'totalUsers': 23492,
'totalUsers30Days': 1726,
'totalUsers7Days': 964,
'totalUsers90Days': 3205},