Scraping data using CSS selectors (Python, BS4)

Question

This is my first time scraping data with CSS selectors.

I am having trouble scraping the content of an anchor.

Here is my code:

import requests
from bs4 import BeautifulSoup

url = "https://weworkremotely.com/remote-jobs/search?utf8=✓&term=ruby"
wwr_result = requests.get(url)
wwr_soup = BeautifulSoup(wwr_result.text, "html.parser")
posts = wwr_soup.find_all("li", {"class": "feature"})
link = post.select("#category-2 > article > ul > li:nth-child(1) > a[href]")

title = post.find("span", {"class": "title"}).get_text()
company = post.find("span", {"class": "company"}).get_text()
location = post.find("span", {"class": "region company"}).get_text()
link = post.select("#category-2 > article > ul > li:nth-child(1) > a[href]")

print({"title": title, "company": company, "location": location, "link": f"https://weworkremotely.com/{link}"})

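I want to scrape the content of the anchor so that I can build a link for each post. That is why I added [href].

But it does not work; it only scrapes the content of all the subcategories.

How can I change it to scrape only the anchor's content?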

Answer 1

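Assuming you have already correctly selected the jobs of interest from everything listed, you need a loop, and inside the loop you extract the first href attribute that contains the substring -jobs, i.e. post.select_one('[href*=-jobs]'):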

import requests
from bs4 import BeautifulSoup

url = "https://weworkremotely.com/remote-jobs/search?utf8=✓&term=ruby"
wwr_result = requests.get(url)
wwr_soup = BeautifulSoup(wwr_result.text, "html.parser")
posts = wwr_soup.find_all("li", {"class": "feature"})

for post in posts:
    print('https://weworkremotely.com' + post.select_one('a[href*=-jobs]')['href'])
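
The same approach can be tried offline. The following is only a minimal sketch: the HTML snippet is made up to loosely resemble the listing markup (an assumption, not the real page), and urljoin from the standard library is swapped in for string concatenation when building the absolute URL. It also guards against posts where select_one finds no matching anchor and returns None.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://weworkremotely.com"

# Hypothetical markup, loosely modelled on the listing page; not the real HTML.
SAMPLE_HTML = """
<ul>
  <li class="feature">
    <a href="/company/acme">Acme</a>
    <a href="/remote-jobs/acme-ruby-developer">
      <span class="company">Acme</span>
      <span class="title">Ruby Developer</span>
    </a>
  </li>
  <li class="feature">
    <a href="/remote-jobs/globex-rails-engineer">
      <span class="company">Globex</span>
      <span class="title">Rails Engineer</span>
    </a>
  </li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

jobs = []
for post in soup.find_all("li", {"class": "feature"}):
    # select_one returns the first matching tag, or None if nothing matches.
    anchor = post.select_one('a[href*="-jobs"]')
    if anchor is None:
        continue
    jobs.append({
        "title": post.find("span", {"class": "title"}).get_text(strip=True),
        "company": post.find("span", {"class": "company"}).get_text(strip=True),
        # urljoin resolves the relative href against the site root.
        "link": urljoin(BASE_URL, anchor["href"]),
    })

for job in jobs:
    print(job)
```

Note that select always returns a list of tags, which is why interpolating its result directly into a URL string does not produce a usable link; select_one returns a single tag whose href can be read with anchor["href"].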

To get all the listings on the page, switch to:

posts = wwr_soup.select('li:has(.tooltip)')
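
To illustrate the difference between the two ways of selecting posts, here is a small sketch on made-up markup (the snippet and its class names other than feature and tooltip are assumptions). The :has() pseudo-class keeps every li that contains a descendant with class tooltip, whether or not the li itself carries the feature class.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: job rows contain a .tooltip element, the last row does not.
SAMPLE_HTML = """
<ul>
  <li class="feature"><span class="tooltip">Acme</span><a href="/remote-jobs/a">A</a></li>
  <li><span class="tooltip">Globex</span><a href="/remote-jobs/b">B</a></li>
  <li class="view-all"><a href="/categories/x">View all</a></li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Only rows explicitly marked with the "feature" class.
featured = soup.find_all("li", {"class": "feature"})

# Every row that has a descendant with class "tooltip".
all_posts = soup.select("li:has(.tooltip)")

print(len(featured), len(all_posts))
```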
