Problem with data extraction from Indeed by BeautifulSoup

Question

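I am trying to extract the job description of each post from the Indeed website, but the result is not what I expected!

I wrote some code to fetch the job descriptions. I am using Python 2.7 and the latest BeautifulSoup. When you open the page and click on a job, the related information appears on the right side of the screen. I need to extract the job description for every job on this page. My code: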

import sys
import urllib2
from BeautifulSoup import BeautifulSoup

url = "https://www.indeed.com/jobs?q=construction%20manager&l=Houston%2C%20TX&vjk=8000b2656aae5c08"
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)

N = soup.findAll("div", {"id" : "vjs-desc"})
print N

I expected to see results, but the output is []. Is it because the id is not unique? If so, how should I edit my code?

Answer 1

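The #vjs-desc element is generated by JavaScript, and its content comes from an Ajax request. The HTML that urllib2 downloads does not contain it, which is why findAll returns []. To get the descriptions, you need to perform that Ajax request yourself.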
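A quick way to check this kind of problem is to look for the id in the raw HTML before parsing it. The sketch below is only an illustration: the helper name `element_id_in_static_html` and both sample strings are made up for this example and are not taken from Indeed's real markup.

```python
import re


def element_id_in_static_html(html, element_id):
    """Return True if an id="..." attribute with this value appears in the raw markup."""
    pattern = r"""id\s*=\s*["']%s["']""" % re.escape(element_id)
    return re.search(pattern, html) is not None


# Hypothetical markup as downloaded by an HTTP client (no JavaScript executed).
static_html = (
    '<div id="jobs">'
    "<script>jobKeysWithInfo['8000b2656aae5c08'] = true;</script>"
    "</div>"
)

# Hypothetical markup after the browser has run the page's JavaScript.
rendered_html = static_html + '<div id="vjs-desc">Job description text</div>'

print(element_id_in_static_html(static_html, "vjs-desc"))
print(element_id_in_static_html(rendered_html, "vjs-desc"))
```

If the check fails on the downloaded HTML while the element is visible in the browser's inspector, the element is most likely added by a script, and no change to the findAll arguments will make it appear.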

# -*- coding: utf-8 -*-
# Python 2.7, as in the question

# requests makes it easier to create HTTP requests and sessions
import re
import urllib

import requests
# the 'html.parser' argument used below is the BeautifulSoup 4 (bs4) API
from bs4 import BeautifulSoup

url = "https://www......"

# create a session
s = requests.session()
html = s.get(url).text

# extract job IDs
job_ids = ','.join(re.findall(r"jobKeysWithInfo\['(.+?)'\]", html))
ajax_url = 'https://www.indeed.com/rpc/jobdescs?jks=' + urllib.quote(job_ids)
# do the Ajax request and convert the response to JSON
ajax_content = s.get(ajax_url).json()
print(ajax_content)

# each item pairs a job ID with the HTML of its description
for job_id, desc in ajax_content.items():
    print(job_id)
    soup = BeautifulSoup(desc, 'html.parser')
    # or try this
    # soup = BeautifulSoup(desc.decode('unicode-escape'), 'html.parser')
    print(soup.text.encode('utf-8'))
    print('==============================')
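To see the job-ID step in isolation, here is a minimal offline sketch that applies the same regular expression to a made-up page fragment and builds the request URL. The `sample_html` string and its second job key are assumptions for illustration; the `jobKeysWithInfo` pattern and the `/rpc/jobdescs` endpoint are the ones used in the code above, and the site may no longer serve them in this form.

```python
import re

try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2.7

# Hypothetical fragment of the search page; real pages are much larger.
sample_html = """
<script>
jobKeysWithInfo['8000b2656aae5c08'] = true;
jobKeysWithInfo['aaaabbbbccccdddd'] = true;
</script>
"""

# Collect every key that appears as jobKeysWithInfo['...'].
job_ids = re.findall(r"jobKeysWithInfo\['(.+?)'\]", sample_html)

# Join the keys with commas and percent-encode the result for the query string.
ajax_url = "https://www.indeed.com/rpc/jobdescs?jks=" + quote(",".join(job_ids))

print(job_ids)
print(ajax_url)
```

The non-greedy `(.+?)` matters here: a greedy `(.+)` could run to the last `']` on a line that holds several keys and return them as one string.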

python

urllib2

beautifulsoup