Python - Issue Scraping with BeautifulSoup

Question

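As a personal project, I am trying to scrape the Stack Overflow jobs page using Beautiful Soup 4 and urllib. I am running into a problem when trying to scrape the links to all 50 jobs listed on each page. I am using a regex to identify these links. Even though I reference the tag correctly, I am facing these two specific issues:

Instead of the 50 links clearly visible in the source code, I get only 25 results each time as my output (after accounting for removing an initial irrelevant link).
The links in my output are ordered differently from how they appear in the source code.

Here is my code. Any help with this would be greatly appreciated: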

import bs4
import urllib.request
import re


# Obtain the source code to parse

sauce = urllib.request.urlopen('https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p&pg=0').read()

soup = bs4.BeautifulSoup(sauce, 'html.parser')

snippet = soup.find_all("script",type="application/ld+json")
strsnippet = str(snippet)

print(strsnippet)

joburls = re.findall(r'https://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', strsnippet)

print("Urls: ",joburls)
print(len(joburls))

Answer 1

Disclaimer: I worked out parts of this answer myself.

from bs4 import BeautifulSoup
import requests
import json

# note: link is slightly different; yours just redirects here
link = 'https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p'
r = requests.get(link)
soup = BeautifulSoup(r.text, 'html.parser')

s = soup.find('script', type='application/ld+json')
urls = [el['url'] for el in json.loads(s.text)['itemListElement']]

print(len(urls))
# Output: 50

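Process:

Use soup.find rather than soup.find_all. This returns a single bs4.element.Tag holding the JSON, instead of a list of tags.
json.loads(s.text) is a nested dict. Access the value of the itemListElement key to get the list of per-job dicts, then collect each one's url value into a list.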
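
The JSON step above can be tried offline. This is a minimal sketch, assuming the script tag's text has the shape the answer relies on (an itemListElement list whose entries each carry a url key). The sample_ld_json string and its example.com URLs are made up for illustration and are not taken from the real page, which would list 50 items.

```python
import json

# Hypothetical stand-in for s.text, the contents of the
# <script type="application/ld+json"> tag. The example.com URLs are invented.
sample_ld_json = """
{
  "@context": "http://schema.org",
  "@type": "ItemList",
  "itemListElement": [
    {"@type": "ListItem", "position": 1, "url": "https://example.com/jobs/101"},
    {"@type": "ListItem", "position": 2, "url": "https://example.com/jobs/102"},
    {"@type": "ListItem", "position": 3, "url": "https://example.com/jobs/103"}
  ]
}
"""

# json.loads turns the text into nested Python objects (a dict at the top level).
data = json.loads(sample_ld_json)

# The value under "itemListElement" is a list of dicts, kept in document order.
items = data["itemListElement"]

# Collect the "url" value of each entry; the list keeps the same order as the JSON.
urls = [item["url"] for item in items]

print(len(urls))
print(urls)
```

Because the data is parsed as JSON rather than matched with a regex over the stringified tags, each entry yields exactly one URL and the list order follows the order of the entries in the source.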

Tags: python-3.x, web-scraping, beautifulsoup, urllib