How can I extract the Workday job post link with Python?

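I am new to Python and web scraping. Recently I have been stuck on the pattern Workday uses to generate job post links. Normally, I find that a job post link follows the pattern below, and every element of it can be extracted (the bold text is fixed):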

https://employer's domain.com/en-US/employers'subtext/job/location/job title_job ID

For example, UPenn's Workday home page is:

https://wd1.myworkdaysite.com/recruiting/upenn/careers-at-penn

Take this job post:

Program Coordinator for Community Care, JR00035938 | VPUL | Posted Yesterday

So, following the pattern, the link for this job post should look like this:

https://wd1.myworkdaysite.com/en-US/recruiting/upenn/careers-at-penn/job/VPUL/Program-Coordinator-for-Community-Care_JR00035938

The correct link on the website is:

https://wd1.myworkdaysite.com/en-US/recruiting/upenn/careers-at-penn/job/VPUL/Program-Coordinator-for-Community-Care_JR00035938-1

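As you can see, the elements shown on the page (that is, the HTML source scraped with Python) differ from the real link, even though the pattern is correct. In the source inspected in Chrome, the job ID is JR00035938, without the extra "-1".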
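The naive pattern described above can be sketched as follows. The helper name build_job_url and its parameters are hypothetical, chosen only for illustration; the sketch simply replaces spaces with hyphens and does not attempt to reproduce whatever Workday actually does.

```python
def build_job_url(base, site_path, location, title, job_id):
    """Assemble a job link from page elements using the naive pattern."""
    # Assumption: the title slug is just the title with spaces turned into hyphens
    slug = title.replace(' ', '-')
    return f"{base}/en-US/{site_path}/job/{location}/{slug}_{job_id}"


guessed = build_job_url(
    'https://wd1.myworkdaysite.com',
    'recruiting/upenn/careers-at-penn',
    'VPUL',
    'Program Coordinator for Community Care',
    'JR00035938',
)

# The link that the website actually uses for this post
actual = ('https://wd1.myworkdaysite.com/en-US/recruiting/upenn/careers-at-penn'
          '/job/VPUL/Program-Coordinator-for-Community-Care_JR00035938-1')

print(guessed == actual)         # False: the real link carries a suffix
print(guessed + '-1' == actual)  # True for this particular post
```

For this post the guess is off only by the "-1" suffix, which cannot be derived from anything visible in the page source.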

 <span class="gwt-InlineLabel WEAG WD5F" title="JR00035938   |   VPUL   |   Posted Yesterday" id="gwt-uid-106" data-automation-id="compositeSubHeaderOne">JR00035938   |   VPUL   |   Posted Yesterday</span>

And this is not the only oddity; there are many other differences. Here are a few examples:

1)

Research Specialist A/B (Pennsylvania Muscle Institute) JR00035941 | Clinical Research Building - 7th Floor | Posted Yesterday

Its code:

<div class="gwt-Label WCCP WLAP" data-automation-id="promptOption" id="promptOption-gwt-uid-99" data-automation-label="Research Specialist A/B (Pennsylvania Muscle Institute)" title="Research Specialist A/B (Pennsylvania Muscle Institute)" aria-label="Research Specialist A/B (Pennsylvania Muscle Institute)" role="link" tabindex="0">Research Specialist A/B (Pennsylvania Muscle Institute)</div>
<span class="gwt-InlineLabel WEAG WD5F" title="JR00035941   |   Clinical Research Building - 7th Floor   |   Posted Yesterday" id="gwt-uid-100" data-automation-id="compositeSubHeaderOne">JR00035941   |   Clinical Research Building - 7th Floor   |   Posted Yesterday</span>

Its link not only has the extra suffix after the job ID, but also drops or rewrites the part of the job title after the slash:

https://wd1.myworkdaysite.com/en-US/recruiting/upenn/careers-at-penn/job/Clinical-Research-Building---7th-Floor/Research-Specialist-A--Physiology-_JR00035941-1

2)

Research Investigator/Research Investigator Sr. (Dept. of Radiology) JR00033660 | HUP | Posted Yesterday

Its code:

<div class="gwt-Label WCCP WLAP" data-automation-id="promptOption" id="promptOption-gwt-uid-107" data-automation-label="Research Investigator/Research Investigator Sr. (Dept. of Radiology)" title="Research Investigator/Research Investigator Sr. (Dept. of Radiology)" aria-label="Research Investigator/Research Investigator Sr. (Dept. of Radiology)" role="link" tabindex="0">
Research 
<span class=" WHK2 WIK2 ">Investigator/Research</span> 
Investigator Sr. (Dept. of Radiology)
</div>
<span class="gwt-InlineLabel WEAG WD5F" title="JR00033660   |   HUP   |   Posted Yesterday" id="gwt-uid-108" data-automation-id="compositeSubHeaderOne">JR00033660   |   HUP   |   Posted Yesterday</span>

And its link:

https://wd1.myworkdaysite.com/en-US/recruiting/upenn/careers-at-penn/job/HUP/Research-Investigator-Sr--Dept-of-Radiology-_JR00033660-1

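So, finally, my question: what pattern does Workday use to generate a job post link? And is there any way to get the link, given that Workday apparently tries to prevent others from extracting the data? There is no a tag, href, or src attribute holding the job post link in the page source.

Thanks in advance!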

The links are added dynamically from a GET request, so you can always get them that way instead of worrying about trying to replicate the pattern.

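import requests

BASE_URL = 'https://wd1.myworkdaysite.com'

# Ask for JSON so the server returns the page data rather than HTML
headers = {'User-Agent': 'Mozilla/5.0', 'Accept': 'application/json,application/xml'}

r = requests.get(BASE_URL + '/en-US/recruiting/upenn/careers-at-penn', headers=headers)
r.raise_for_status()  # stop early on an HTTP error

# Each list item carries the job's relative link under title -> commandLink
items = r.json()['body']['children'][0]['children'][0]['listItems']
links = [BASE_URL + i['title']['commandLink'] for i in items]
print(links)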
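If the JSON layout changes, the chained indexing above raises an exception. As a rough sketch, the same extraction can be wrapped in a function that fails softly and can be tried offline. The function name extract_job_links and the sample payload are hypothetical; the payload only mimics the keys used in the request above, and the commandLink value is assumed to be the path portion of the real link shown in the question.

```python
from urllib.parse import urljoin


def extract_job_links(payload, base_url):
    """Collect absolute job links from a Workday-style JSON payload."""
    try:
        items = payload['body']['children'][0]['children'][0]['listItems']
    except (KeyError, IndexError, TypeError):
        return []  # the layout differs from the one assumed here
    links = []
    for item in items:
        command_link = item.get('title', {}).get('commandLink')
        if command_link:
            # commandLink is assumed to be a path starting with "/"
            links.append(urljoin(base_url, command_link))
    return links


# Hypothetical payload, trimmed to the keys the function reads
sample = {
    'body': {'children': [{'children': [{'listItems': [
        {'title': {'commandLink':
            '/en-US/recruiting/upenn/careers-at-penn/job/VPUL/'
            'Program-Coordinator-for-Community-Care_JR00035938-1'}},
        {'title': {}},  # an item without a link is skipped
    ]}]}]}
}

print(extract_job_links(sample, 'https://wd1.myworkdaysite.com'))
```

With a live response, the call would be extract_job_links(r.json(), 'https://wd1.myworkdaysite.com').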