Can't narrow down the search criteria in a web scraper to search "job titles" and count the frequency of each one
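For some work I'm doing, I need to collect data on job titles and how often each one appears in the search results, so I decided to enlist Python to help. The only problem is that I can't figure out why the code snippet I found doesn't give me the information I need.
Here is what I have so far: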
import requests
from bs4 import BeautifulSoup
from collections import Counter
from string import punctuation
# We get the url
r = requests.get("https://www.usajobs.gov/Search/Results?j=0602&d=VA&p=1")
soup = BeautifulSoup(r.content, "html.parser")
# We get the words within divs
text_div = (''.join(s.findAll(text=True))for s in soup.findAll('div'))
c_div = Counter((x.rstrip(punctuation).lower() for y in text_div for x in y.split()))
total = c_div
print(total)
I know this involves inspecting the page's HTML, but I don't know what to enter so that the scraper narrows down to just these titles:
<a id="usajobs-search-result-0" class="usajobs-search-result--core__title search-joa-link" href="/GetJob/ViewDetails/568337700" itemprop="title" data-document-id="568337700">
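Any help is much appreciated.
The data is loaded dynamically by sending a POST request to:
https://www.usajobs.gov/Search/ExecuteSearch
See this example to get the correct job titles. (You can change the "Page" key to specify the page number.)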
import requests
data = {
    "JobTitle": [],
    "GradeBucket": [],
    "JobCategoryCode": ["0602"],
    "JobCategoryFamily": [],
    "LocationName": [],
    "PostingChannel": [],
    "Department": ["VA"],
    "Agency": [],
    "PositionOfferingTypeCode": [],
    "TravelPercentage": [],
    "PositionScheduleTypeCode": [],
    "SecurityClearanceRequired": [],
    "PositionSensitivity": [],
    "ShowAllFilters": [],
    "HiringPath": [],
    "SocTitle": [],
    "MCOTags": [],
    "CyberWorkRole": [],
    "CyberWorkGrouping": [],
    "Page": "1",  # <-- Change page number here
    "UniqueSearchID": "9d417c5e-adc2-469c-af1d-e786cc41bc97",
    "IsAuthenticated": "false",
}
response = requests.post(
    "https://www.usajobs.gov/Search/ExecuteSearch", json=data
).json()
job_titles = [job["Title"] for job in response["Jobs"]]
print(job_titles)
Output:
['Psychiatrist - OCA', 'Physician - Electromyography (Temporary)', 'Physician Owensboro CBOC PC', 'Physician-Primary Care', 'OPHTHALMOLOGIST', 'UROLOGIST', 'PHYSICIAN (OTOLARYNGOLOGIST', 'Physician-Hospitalist', 'Physician - Hemotology/Oncology', 'Academic Gastroenterologist', 'Physician - Gastroenterologist', 'Physician - Orthopedic Surgeon', 'Physician (Internal Medicine or Family Practice)', 'Physician (Regular Ft)- Hematologist/Oncologist', 'Physician- Hematologist/Oncologist', 'Physician - Diagnostic Radiologist', 'Physician (Psychiatrist)', 'Physician (Endocrinologist)', 'Physician (Cardiologist)', 'Physician (Neurologist)', 'Physician (Chief Hospitalist)', 'Physician (Hospitalist)', 'Physician (Medical Director of Extended Care/Chief of Geriatrics)', 'Physician (Primary Care)', 'Physician (Hematologist/Oncologist)']
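The question also asks for the frequency of each title, which the snippet above stops short of. One way to get there, as a rough sketch only: feed the titles into collections.Counter. The helper names count_titles and count_words are made up for this illustration, and the short job_titles list is a hand-picked sample from the output above; in practice you would pass in the full list built from response["Jobs"]. How titles should be normalized (case, punctuation) is an assumption here and depends on what you consider "the same" title.

```python
from collections import Counter
from string import punctuation

# Sample taken from the output above; in practice use the full job_titles list.
job_titles = [
    "Physician-Primary Care",
    "Physician (Primary Care)",
    "Physician (Hospitalist)",
    "Physician-Hospitalist",
    "OPHTHALMOLOGIST",
    "UROLOGIST",
]


def count_titles(titles):
    """Count whole titles, ignoring case and surrounding whitespace."""
    return Counter(title.strip().casefold() for title in titles)


def count_words(titles):
    """Count individual words across all titles, ignoring case and punctuation."""
    # Map every punctuation character to a space so that
    # "Physician-Hospitalist" splits into two separate words.
    table = str.maketrans(punctuation, " " * len(punctuation))
    return Counter(
        word
        for title in titles
        for word in title.casefold().translate(table).split()
    )


title_counts = count_titles(job_titles)
word_counts = count_words(job_titles)

# most_common(n) returns the n most frequent entries with their counts.
print(title_counts.most_common(3))
print(word_counts.most_common(3))
```

Counting whole titles treats "Physician-Hospitalist" and "Physician (Hospitalist)" as different entries, while counting words groups them under "hospitalist"; which of the two is more useful depends on how inconsistently the postings are written.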