Jobsite scraping with beautifulsoup

I want to scrape job description information from this site, but I only seem to get unrelated text. Here is how the soup object is created:

import bs4
import urllib.request
from urllib.request import urlopen

url = 'https://www.glassdoor.com/Job/boston-full-stack-engineer-jobs-SRCH_IL.0,6_IC1154532_KO7,26.htm?jl=3188635682&guid=0000016a8432102e99e9b5232325d3d5&pos=102&src=GD_JOB_AD&srs=MY_JOBS&s=58&ao=599212'
req = urllib.request.Request(url, headers={'User-Agent': 'Magic Browser'})
soup = bs4.BeautifulSoup(urlopen(req), 'html.parser')

# Collect every div/li/ul in the body and print any that carry a single string
divliul = soup.body.findAll(['div', 'li', 'ul'])
for i in divliul:
    if i.string is not None:
        print(i.string)

If you browse the site for a moment, you'll see that the soup only seems to contain elements from the left-hand column and nothing from the job description container. I thought it might be a urllib request issue, but I tried simply downloading the html file and reading it that way, and the result was similar. Output:

Jobs
Company Reviews
Company Reviews
Companies near you
 Best Buy Reviews in Boston
 Target Reviews in Boston
 IBM Reviews in Boston
 AT&T Reviews in Boston
 The Home Depot Reviews in Boston
 Walmart Reviews in Boston

 Macy's Reviews in Boston
 Microsoft Reviews in Boston
 Deloitte Reviews in Boston
 Amazon Reviews in Boston
 Bank of America Reviews in Boston
 Wells Fargo Reviews in Boston
Company Culture
 Best Places to Work
 12 Companies That Will Pay You to Travel the World
 7 Types of Companies You Should Never Work For
 20 Companies Hiring for the Best Jobs In America
 How to Become the Candidate Recruiters Can’t Resist
 13 Companies With Enviable Work From Home Options
 New On Glassdoor
Salaries
Interviews
Salary Calculator
Account Settings
Account Settings
Account Settings
Account Settings
empty notification btn
My Profile
Saved Jobs
Email & Alerts
Contributions
My Resumes
Company Follows
Account
Help / Contact Us
Account Settings
Account Settings
Account Settings
empty notification btn
For Employers
For Employers
Unlock Employer Account
Unlock Employer Account
Post a Job
Post a Job
Employer Branding
Job Advertising
Employer Blog
Talk to Sales
 Post Jobs Free
Full Stack Engineer Jobs in Boston, MA
Jobs
Companies
Salaries
Interviews
Full Stack Engineer
EASY APPLY
EASY APPLY
Full Stack Engineer | Noodle.com
EASY APPLY
EASY APPLY
Full Stack Engineer
Hot
Software Engineer
EASY APPLY
EASY APPLY
Senior Software Engineer
EASY APPLY
EASY APPLY
We're Hiring
We're Hiring
Full Stack Engineer
Hot
Software Engineer
Hot
Hot
Full Stack Engineer
We're Hiring
Full Stack Software Engineer
EASY APPLY
EASY APPLY
We're Hiring
We're Hiring
Software Engineer
New
New
Full Stack Engineer
EASY APPLY
EASY APPLY
We're Hiring
We're Hiring
Pre-Sales Engineer / Full-Stack Developer
Top Company
Top Company
Full Stack Software Engineer
Software Engineer
Top Company
Top Company
Associate Software Engineer
Full Stack Software Engineer
Software Engineer
New
New
Mid-level Full Stack Software Engineer (Java/React
EASY APPLY
EASY APPLY
Junior Software Engineer - Infrastructure
Software Engineer
Software Engineer
New
New
Associate Software Engineer
C# Engineer - Full Stack
EASY APPLY
EASY APPLY
Software Engineer, Platform
Software Engineer
EASY APPLY
EASY APPLY
Software Engineer
Associate Software Engineer
Software Engineer
Software Engineer
Software Engineer - Features
EASY APPLY
EASY APPLY
 Page 1 of 81
Previous
1
2
3
4
5
Next
 People Also Searched
 Top Cities for Full Stack Engineer:  
 Top Companies for full stack engineer in Boston, MA:  
 Help / Contact Us
 Terms of Use
 Privacy & Cookies (New)
Copyright © 2008–2019, Glassdoor, Inc. "Glassdoor" and logo are proprietary trademarks of Glassdoor, Inc.
 Email me jobs for:
Create a Job Alert
Your job alert has been created.
Create more job alerts for related jobs with one click:

You can extract some IDs from that page and concatenate them into a url which the page uses to retrieve the json that populates the cards on the right as you scroll. Process the json to extract whatever info you want.

Finding the urls - as you scroll down on the left, the content on the right updates, so I looked in the network tab for activity associated with those updates. The new urls generated during scrolling appeared to share a common string with varying parts, i.e. likely a query string format. I guessed the changing parts came from the page itself (some looked like generated IDs that could be kept static/ignored - an experience-based assumption I then tested). I searched the html for the kind of identifiers I'd expect the server to use to distinguish jobs, i.e. the two sets of ids. Take either of the two IDs concatenated into the url string from the network tab and Ctrl+F the page HTML for them; you'll see where those values come from.

from bs4 import BeautifulSoup as bs
import requests
import re

results = []
with requests.Session() as s:
    # Template for the json endpoint observed in the network tab; the two {} slots
    # take the ad order id and the job listing id for each card
    url = 'https://www.glassdoor.co.uk/Job/json/details.htm?pos=&ao={}&s=58&guid=0000016a88f962649d396c5b606d567b&src=GD_JOB_AD&t=SR&extid=1&exst=OL&ist=&ast=OL&vt=w&slr=true&cs=1_1d8f42ad&cb=1557076206569&jobListingId={}&gdToken=uo8hehXn6nNuwhjMyBW14w:3RBFWgOD-0e7hK8o-Fgo0bUtD6jw5wJ3UujVq6L-v0ux9mlLjMxjW8-KF9xsDk41j7I11QHOHgcj9LBoWYaCxg:wAFOqHzOjgAxIGQVmbyibsaECrQO-HWfxb8Ugq-x_tU'
    headers = {'User-Agent': 'Mozilla/5.0'}
    r = s.get('https://www.glassdoor.co.uk/Job/boston-full-stack-engineer-jobs-SRCH_IL.0,6_IC1154532_KO7,26.htm?jl=3188635682&s=58&pos=102&src=GD_JOB_AD&srs=MY_JOBS&guid=0000016a8432102e99e9b5232325d3d5&ao=599212&countryRedirect=true', headers=headers)
    soup = bs(r.content, 'lxml')
    # First set of ids: the data-ad-order-id attribute on each job card
    ids = [item['data-ad-order-id'] for item in soup.select('[data-ad-order-id]')]
    # Second set of ids: the jobIds array embedded in a script on the page
    p1 = re.compile(r"jobIds':\[(.*)'segmentType'", re.DOTALL)
    init = p1.findall(r.text)[0]
    p2 = re.compile(r"(\d{10})")
    job_ids = p2.findall(init)
    # Pair each ad order id with its job listing id
    loop_var = list(zip(ids, job_ids))

    for x, y in loop_var:
        # Request the json details for each job and collect the responses
        data = s.get(url.format(x, y), headers=headers).json()
        results.append(data)
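
Each entry in results is the raw json for one job. The json structure isn't shown here, so the key names in the sketch below ('job', 'description') are assumptions - print one response first and adjust the lookup to match the real structure before relying on it:

# A minimal sketch, assuming the description html sits under data['job']['description'];
# these key names are hypothetical and need to be confirmed against a real response.
for data in results:
    desc_html = data.get('job', {}).get('description', '')
    if desc_html:
        text = bs(desc_html, 'lxml').get_text(separator='\n')
        print(text)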

I found this alternative solution using selenium and chrome

import bs4
from selenium import webdriver

# Point this at your local chromedriver; url is the Glassdoor listing page from the question
driver = webdriver.Chrome(executable_path=r"C:\Users\username\Downloads\chromedriver_win32\chromedriver.exe")
url = 'https://www.glassdoor.com/Job/boston-full-stack-engineer-jobs-SRCH_IL.0,6_IC1154532_KO7,26.htm?jl=3188635682&guid=0000016a8432102e99e9b5232325d3d5&pos=102&src=GD_JOB_AD&srs=MY_JOBS&s=58&ao=599212'
driver.get(url)

# Chrome executes the page's javascript, so the rendered source includes the description container
html = driver.page_source
soup = bs4.BeautifulSoup(html, 'lxml')

for tag in soup.find_all("div", class_="jobDescriptionContent desc"):
    print(tag.text)
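
If page_source is read before the page has finished rendering, the description container may still be missing. One way to guard against that is an explicit wait before grabbing the source; a minimal sketch using selenium's WebDriverWait (the CSS selector is just the class from the find_all call above, and the 10-second timeout is an arbitrary choice):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one job description container to be present,
# then read the rendered source as before
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.jobDescriptionContent.desc"))
)
html = driver.page_source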