How can I extract the Workday job post link with Python?
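I'm new to Python and web scraping. Recently I've been stuck on how Workday generates the links to its job posts. Normally, I found that a job post link follows the pattern below, and all of its elements can be extracted (the bold text is fixed):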
https://employer's domain.com/en-US/employers'subtext/job/location/job title_job ID
For example, UPenn's Workday home page is:
https://wd1.myworkdaysite.com/recruiting/upenn/careers-at-penn
Take this job post:
Program Coordinator for Community Care,
JR00035938 | VPUL | Posted Yesterday
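So the link composed for this job post should look like this:
The correct one on the website is:
As you can see, the elements displayed on the page (which are also what Python scrapes from the HTML source) differ from the real link, even though the pattern is right.
In the source inspected with Chrome, the job ID is JR00035938, without the extra "-1".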
<span class="gwt-InlineLabel WEAG WD5F" title="JR00035938 | VPUL | Posted Yesterday" id="gwt-uid-106" data-automation-id="compositeSubHeaderOne">JR00035938 | VPUL | Posted Yesterday</span>
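For reference, the fields in that span can be read from its title attribute with the standard library alone. This is only a minimal sketch, assuming the three-field "ID | location | posted" layout seen above. SubHeaderParser and split_subheader are hypothetical names used for illustration, not part of Workday or any library.

```python
from html.parser import HTMLParser


class SubHeaderParser(HTMLParser):
    """Collect the title attribute of every compositeSubHeaderOne span."""

    def __init__(self):
        super().__init__()
        self.subheaders = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "span" and attr_map.get("data-automation-id") == "compositeSubHeaderOne":
            self.subheaders.append(attr_map.get("title", ""))


def split_subheader(text):
    """Split 'ID | location | posted' into fields (assumed layout)."""
    parts = [p.strip() for p in text.split("|")]
    # Pad so that a shorter-than-expected string still yields three fields.
    job_id, location, posted = (parts + ["", "", ""])[:3]
    return {"job_id": job_id, "location": location, "posted": posted}


# The span quoted in the question, used as sample input.
html = ('<span class="gwt-InlineLabel WEAG WD5F" '
        'title="JR00035938 | VPUL | Posted Yesterday" id="gwt-uid-106" '
        'data-automation-id="compositeSubHeaderOne">'
        'JR00035938 | VPUL | Posted Yesterday</span>')

parser = SubHeaderParser()
parser.feed(html)
records = [split_subheader(t) for t in parser.subheaders]
print(records)
```

This recovers only the ID as displayed (JR00035938). It cannot recover the suffix that appears in the real link.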
And this is not the only oddity; there are many other differences.
Here are a couple of examples:
1)
Research Specialist A/B (Pennsylvania Muscle Institute)
JR00035941 | Clinical Research Building - 7th Floor | Posted Yesterday
Its code:
<div class="gwt-Label WCCP WLAP" data-automation-id="promptOption" id="promptOption-gwt-uid-99" data-automation-label="Research Specialist A/B (Pennsylvania Muscle Institute)" title="Research Specialist A/B (Pennsylvania Muscle Institute)" aria-label="Research Specialist A/B (Pennsylvania Muscle Institute)" role="link" tabindex="0">Research Specialist A/B (Pennsylvania Muscle Institute)</div>
<span class="gwt-InlineLabel WEAG WD5F" title="JR00035941 | Clinical Research Building - 7th Floor | Posted Yesterday" id="gwt-uid-100" data-automation-id="compositeSubHeaderOne">JR00035941 | Clinical Research Building - 7th Floor | Posted Yesterday</span>
Its link not only has an extra suffix after the job ID, but also drops or rewrites the part of the job title after the slash.
2)
Research Investigator/Research Investigator Sr. (Dept. of Radiology)
JR00033660 | HUP | Posted Yesterday
Its code:
<div class="gwt-Label WCCP WLAP" data-automation-id="promptOption" id="promptOption-gwt-uid-107" data-automation-label="Research Investigator/Research Investigator Sr. (Dept. of Radiology)" title="Research Investigator/Research Investigator Sr. (Dept. of Radiology)" aria-label="Research Investigator/Research Investigator Sr. (Dept. of Radiology)" role="link" tabindex="0">
Research
<span class=" WHK2 WIK2 ">Investigator/Research</span>
Investigator Sr. (Dept. of Radiology)
</div>
<span class="gwt-InlineLabel WEAG WD5F" title="JR00033660 | HUP | Posted Yesterday" id="gwt-uid-108" data-automation-id="compositeSubHeaderOne">JR00033660 | HUP | Posted Yesterday</span>
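And its link:
So finally, my question: what pattern does Workday use to generate a job post's link? Is there any way to get the link, given that Workday apparently prevents others from extracting the data? There is no a/href/src for the job post link.
Thanks in advance!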
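The links are added dynamically from a GET request, so you can always get them that way and not have to worry about trying to replicate the pattern.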
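import requests
# Ask for JSON rather than HTML via the Accept header.
headers = {'User-Agent': 'Mozilla/5.0', 'Accept': 'application/json,application/xml'}
r = requests.get('https://wd1.myworkdaysite.com/en-US/recruiting/upenn/careers-at-penn', headers=headers)
r.raise_for_status()  # stop early on an HTTP error
# Each list item carries the relative path of its job post in title.commandLink.
items = r.json()['body']['children'][0]['children'][0]['listItems']
links = ['https://wd1.myworkdaysite.com' + i['title']['commandLink'] for i in items]
print(links)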
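The indexing above raises an error if the JSON layout ever differs, so a more defensive variant may be useful. The following is only a sketch: it assumes the same nesting (body, children, listItems, title.commandLink) used in the snippet above. extract_job_links is a hypothetical helper name, and the sample payload is made up so the function can be tried offline.

```python
from urllib.parse import urljoin

BASE_URL = "https://wd1.myworkdaysite.com"


def extract_job_links(payload, base_url=BASE_URL):
    """Return absolute job links from a payload with the assumed nesting."""
    try:
        items = payload["body"]["children"][0]["children"][0]["listItems"]
    except (KeyError, IndexError, TypeError):
        return []  # the layout differs from the assumed one
    links = []
    for item in items:
        command_link = item.get("title", {}).get("commandLink")
        if command_link:  # skip items without a link
            links.append(urljoin(base_url, command_link))
    return links


# Made-up payload with the assumed shape; the path is a placeholder,
# not a real posting.
sample = {"body": {"children": [{"children": [{"listItems": [
    {"title": {"commandLink": "/en-US/recruiting/upenn/careers-at-penn/job/EXAMPLE"}},
    {"title": {}},
]}]}]}}

print(extract_job_links(sample))
print(extract_job_links({"body": {}}))
```

With a live response, the call would be extract_job_links(r.json()). An empty list then means either that there are no postings or that the layout has changed.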