Scraping data using a CSS selector (Python, BS4)
This is my first time scraping data with CSS selectors,
and I ran into a problem when scraping the content of an anchor.
Here is my code:
import requests
from bs4 import BeautifulSoup
url = "https://weworkremotely.com/remote-jobs/search?utf8=✓&term=ruby"
wwr_result = requests.get(url)
wwr_soup = BeautifulSoup(wwr_result.text, "html.parser")
posts = wwr_soup.find_all("li", {"class": "feature"})
link = post.select("#category-2 > article > ul > li:nth-child(1) > a[href]")
title = post.find("span", {"class": "title"}).get_text()
company = post.find("span", {"class": "company"}).get_text()
location = post.find("span", {"class": "region company"}).get_text()
link = post.select("#category-2 > article > ul > li:nth-child(1) > a[href]")
print {"title": title, "company": company, "location": location, "link":f"https://weworkremotely.com/{link}"}
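I want to scrape the anchor's href so that I can build a link for each post. That is why I added [href] to the selector.
But it doesn't work: it scrapes the content of all the subcategories instead.
How can I change it so that it scrapes only the anchor's href?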
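Assuming you have already correctly selected the jobs of interest from all those listed, you need a loop, and inside the loop you extract the first href attribute that contains the substring -jobs, i.e. post.select_one('[href*=-jobs]')['href']: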
import requests
from bs4 import BeautifulSoup
url = "https://weworkremotely.com/remote-jobs/search?utf8=✓&term=ruby"
wwr_result = requests.get(url)
wwr_soup = BeautifulSoup(wwr_result.text, "html.parser")
posts = wwr_soup.find_all("li", {"class": "feature"})
for post in posts:
    print('https://weworkremotely.com' + post.select_one('a[href*=-jobs]')['href'])
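The difference between select() and select_one() can be checked offline. The following is only a minimal sketch: the HTML fragment and its paths are made up for illustration and are not the real page, and urllib.parse.urljoin is used here as one possible way to build the absolute URL instead of string concatenation.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on a job listing; not the real page.
html = """
<ul>
  <li class="feature">
    <a href="/company/acme">Acme</a>
    <a href="/remote-jobs/acme-ruby-dev"><span class="title">Ruby Dev</span></a>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
post = soup.find("li", {"class": "feature"})

# select() returns a list of matching tags, even with [href] in the selector,
# so formatting it into a string yields markup rather than a URL path.
matches = post.select("a[href]")

# select_one() returns the first matching Tag; index it to read the attribute.
href = post.select_one('a[href*="-jobs"]')["href"]

# urljoin avoids doubled or missing slashes when combining base and path.
link = urljoin("https://weworkremotely.com", href)
print(link)
```

Quoting the attribute value ("-jobs") is optional here, but it keeps the selector valid if the substring ever contains characters that are not allowed in an unquoted value.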
To get all the listings on the page, switch to:
posts = wwr_soup.select('li:has(.tooltip)')
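To see how the two ways of selecting posts differ, here is a small offline sketch. The markup is hypothetical; it only assumes, as the selector above does, that every job item contains an element with class tooltip while other list items do not.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two job items (one featured) and one non-job item.
html = """
<ul>
  <li class="feature"><span class="tooltip">Featured</span><a href="/remote-jobs/a">A</a></li>
  <li><span class="tooltip">Company</span><a href="/remote-jobs/b">B</a></li>
  <li class="view-all"><a href="/categories/x">View all</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Matches only the items carrying the "feature" class.
featured = soup.find_all("li", {"class": "feature"})

# :has(.tooltip) matches every <li> that contains a descendant with that class.
all_posts = soup.select("li:has(.tooltip)")

# Read the first link of each selected item.
hrefs = [post.select_one("a[href]")["href"] for post in all_posts]
print(len(featured), hrefs)
```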