How to get the job description using scrapy?

Question

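I am new to scrapy and XPath, but I have been programming in Python for some time. I would like to use scrapy to get the email, the name of the person making the offer, and the phone number from the page https://www.germanystartupjobs.com/job/joblift-berlin-germany-3-working-student-offpage-seo-french-market/. As you can see, the email and phone are provided as plain text inside the <p> tags, which makes them hard to extract.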

My idea is to first get the text in the Job Overview, or at least all the text describing the job, and then use a regex to get the email, the phone number and, if possible, the name of the person.

So I started the scrapy shell with the following command: scrapy shell https://www.germanystartupjobs.com/job/joblift-berlin-germany-3-working-student-offpage-seo-french-market/ and got the response from there.

Now I am trying to get all the text from the div job_description, but I actually get nothing useful. I used

full_des = response.xpath('//div[@class="job_description"]/text()').extract()

It returns [u'\t\t\t\n\t\t ']

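How do I get all the text from the page mentioned? Obviously, extracting the attributes mentioned above will come afterwards, but first things first.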

Update: the following selection just returns []: response.xpath('//div[@class="job_description"]/div[@class="container"]/div[@class="row"]/text()').extract()

Answer 1

You are very close with

full_des = response.xpath('//div[@class="job_description"]/text()').extract()

The div tag actually has no text of its own besides what you are getting.

<div class="job_description" (...)>
    "This is the text you are getting"
    <p>"This is the text you want"</p>
</div>

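As you can see, the text you get with response.xpath('//div[@class="job_description"]/text()').extract() is the text that sits directly between the opening and closing div tags, not the text inside the tags nested within the div. For that you need: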

response.xpath('//div[@class="job_description"]//*/text()').extract()

This selects every element nested inside div[@class="job_description"], at any depth, and returns the text of each (see here for what the different XPath expressions do).
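To make the difference between an element's own text and the text of its nested elements concrete, here is a minimal sketch using only the Python standard library. It is an analogue, not Scrapy's selector API: xml.etree.ElementTree stands in for the selector, and the markup, the name and the phone number are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup modelled on the snippet above.
html = (
    '<div class="job_description">\n\t\t'
    '<p>Contact: <b>Jane Doe</b></p>'
    '<p>Phone: 030 1234567</p>'
    '</div>'
)

div = ET.fromstring(html)

# Text that sits directly inside the div, before its first child element.
# This corresponds to what /text() returned in the question: whitespace only.
direct_text = div.text

# Text of the div and of every nested element, in document order.
# Unlike //*/text(), itertext() also yields the div's own direct text first.
all_text = list(div.itertext())

print(repr(direct_text))
print(all_text)
```

The first print shows only whitespace, while the second list also contains the text of the nested <p> and <b> elements, which is the part the question is after.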

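You will see that this XPath also returns a lot of useless text, since you still get all the \n and so on. To avoid that, I suggest narrowing your XPath down to the element you want rather than taking a broad approach.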

For example, the whole job description would be in

response.xpath('//div[@class="col-sm-5 justify-text"]//*/text()').extract()
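Once the text fragments are extracted, the whitespace-only entries can be dropped and the regex step from the question applied. The following is a rough sketch under stated assumptions: the raw list is invented sample data rather than real output from the page, the helper name clean is hypothetical, and both patterns are deliberately loose and would need tuning against real postings.

```python
import re

# Hypothetical sample of what .extract() might return; real output will differ.
raw = [
    '\t\t\t\n\t\t ',
    'Working Student Offpage SEO',
    '\n',
    'Contact: jane.doe@example.com ',
    'Phone: +49 30 1234567\n',
]

def clean(fragments):
    """Strip each fragment and drop the ones that are only whitespace."""
    return [f.strip() for f in fragments if f.strip()]

lines = clean(raw)
text = ' '.join(lines)

# Loose illustrative patterns, not a complete email or phone grammar.
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')
PHONE_RE = re.compile(r'\+?\d[\d\s()/-]{6,}\d')

emails = EMAIL_RE.findall(text)
phones = PHONE_RE.findall(text)

print(lines)
print(emails)
print(phones)
```

Extracting the name of the person is harder to do with a regex alone, since it depends on how each posting phrases the contact line.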

python

xpath

scrapy-spider