Python Scrapy: extracting specific XPath fields

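I have the following structure (sample). I am using Scrapy to extract the details. I need to extract the 'href' attribute and the link text, such as 'Accounting'. I am using the code below. I am new to XPath, so any help with extracting these specific fields would be appreciated.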

<div class = 'something'>
    <ul>
        <li><a href="http://jobsearch.about.com/od/job-title-samples/a/accounting-job-titles.htm" data-component="link" data-source="inlineLink" data-type="internalLink" data-ordinal="1">Accounting</a></li> 

        <li><a href="http://jobsearch.about.com/od/job-title-samples/a/admin-job-titles.htm" data-component="link" data-source="inlineLink" data-type="internalLink" data-ordinal="2">Administrative</a></li> 

        <li><a href="http://jobsearch.about.com/od/job-titles/a/advertising-job-titles.htm" data-component="link" data-source="inlineLink" data-type="internalLink" data-ordinal="3">Advertising</a></li> 

        <li><a href="http://jobsearch.about.com/od/job-title-samples/fl/airline-industry-jobs.htm" data-component="link" data-source="inlineLink" data-type="internalLink" data-ordinal="4">Airline</a></li> 
    </ul>
</div>

My code is:

from scrapy.spider import BaseSpider

from jobfetch.items import JobfetchItem

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose


class JobFetchSpider(BaseSpider):
    """Spider for regularly updated livingsocial.com site, San Francisco Page"""
    name = "Jobsearch"
    allowed_domains = ["jobsearch.about.com/"]
    start_urls = ['http://jobsearch.about.com/od/job-titles/fl/job-titles-a-z.htm']

    def parse(self, response):
        count = 0
        for sel in response.xpath('//*[@id="main"]/div/div[2]/div[1]/div/div[2]/article/div[2]/ul[1]'):
            item = JobfetchItem()
            item['title'] = sel.extract()
            item['link'] = sel.extract()
            count = count+1
            print item

        yield item

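The problems in your code:

  • yield item should be inside the loop, since that is where you instantiate the items
  • your XPath is messy and not very reliable, since it depends heavily on the positions of elements inside their parent tags and starts almost from the top-level parent of the document
  • your XPath is incorrect: it should go down to the a elements inside the li elements inside the ul
  • sel.extract() would only give you the extracted ul element as a whole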

For example, here is a version of the spider that uses a CSS selector to reach the links inside the li tags:

import scrapy

from jobfetch.items import JobfetchItem


class JobFetchSpider(scrapy.Spider):
    name = "Jobsearch"
    # Domain only: a trailing slash or path here would not match the request host
    allowed_domains = ["jobsearch.about.com"]
    start_urls = ['http://jobsearch.about.com/od/job-titles/fl/job-titles-a-z.htm']

    def parse(self, response):
        # Each sel is a single <a> element, so the XPath calls below are relative to it
        for sel in response.css('article[itemprop="articleBody"] div.expert-content-text > ul > li > a'):
            item = JobfetchItem()
            item['title'] = sel.xpath('text()').extract()[0]
            item['link'] = sel.xpath('@href').extract()[0]
            yield item

Running the spider produces:

{'link': u'http://jobsearch.about.com/od/job-title-samples/a/accounting-job-titles.htm', 'title': u'Accounting'}
{'link': u'http://jobsearch.about.com/od/job-title-samples/a/admin-job-titles.htm', 'title': u'Administrative'}
...
{'link': u'http://jobsearch.about.com/od/job-title-samples/fl/yacht-job-titles.htm', 'title': u'Yacht Jobs'}

FYI, the same elements could also be selected with xpath():

//article[@itemprop="articleBody"]//div[@class="expert-content-text"]/ul/li/a

Use the expressions below to extract the data you want to scrape.

In [1]: response.xpath('//div[@class="expert-content-text"]/ul/li/a/text()').extract()
Out[1]: 

[u'Accounting',
 u'Administrative',
 u'Advertising',
 u'Airline',
 u'Animal',
 u'Alternative Energy',
 u'Auction House',
 u'Banking',
 u'Biotechnology',
 u'Business',
 u'Business Intelligence',
 u'Chef',
 u'College Admissions',
 u'College Alumni Relations and Development ',
 u'College Student Services',
 u'Construction',
 u'Consulting',
 u'Corporate',
 u'Cruise Ship',
 u'Customer Service',
 u'Data Science',
 u'Engineering',
 u'Entry Level Jobs',
 u'Environmental',
 u'Event Planning',
 u'Fashion',
 u'Film',
 u'First Job',
 u'Fundraiser',
 u'Healthcare/Medical',
 u'Health/Safety',
 u'Hospitality',
 u'Human Resources',
 u'Human Services / Social Work',
 u'Information Technology (IT)',
 u'Insurance',
 u'International Affairs / Development',
 u'International Business',
 u'Investment Banking',
 u'Law Enforcement',
 u'Legal',
 u'Maintenance',
 u'Management',
 u'Manufacturing',
 u'Marketing',
 u'Media',
 u'Museum',
 u'Music',
 u'Non Profit',
 u'Nursing',
 u'Outdoor ',
 u'Public Administration',
 u'Public Relations',
 u'Purchasing',
 u'Radio',
 u'Real Estate ',
 u'Restaurant',
 u'Retail',
 u'Sales',
 u'School',
 u'Science',
 u'Ski and Snow Jobs',
 u'Social Media',
 u'Social Work',
 u'Sports',
 u'Television',
 u'Trades',
 u'Transportation',
 u'Travel',
 u'Yacht Jobs']


In [2]: response.xpath('//div[@class="expert-content-text"]/ul/li/a/@href').extract()

Out[2]: 
[u'http://jobsearch.about.com/od/job-title-samples/a/accounting-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/admin-job-titles.htm',
 u'http://jobsearch.about.com/od/job-titles/a/advertising-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/airline-industry-jobs.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/animal-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/alternative-energy-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/auction-house-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/banking-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/biotechnology-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/business-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/business-intelligence-job-titles.htm',
 u'http://culinaryarts.about.com/od/culinaryfundamentals/a/whatisachef.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/college-admissions-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/college-alumni-relations-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/college-student-service-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/construction-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/consulting-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/c-level-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/cruise-ship-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/customer-service-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/data-science-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/engineering-job-titles.htm',
 u'http://jobsearch.about.com/od/best-jobs/a/best-entry-level-jobs.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/environmental-job-titles.htm',
 u'http://eventplanning.about.com/od/eventcareers/tp/corporateevents.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/fashion-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/film-job-titles.htm',
 u'http://jobsearch.about.com/od/justforstudents/a/first-job-list.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/fundraiser-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/health-care-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/health-safety-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/hospitality-job-titles.htm',
 u'http://humanresources.about.com/od/HR-Roles-And-Responsibilities/fl/human-resources-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/human-services-social-work-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/it-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/insurance-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/international-affairs-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/international-business-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/investment-banking-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/law-enforcement-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/legal-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/maintenance-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/management-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/manufacturing-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/marketing-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/media-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/museum-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/music-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/nonprofit-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/nursing-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/outdoor-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/public-administration-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/public-relations-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/purchasing-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/radio-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/real-estate-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/restaurant-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/retail-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/sales-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/high-school-middle-school-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/science-job-titles.htm',
 u'http://jobsearch.about.com/od/skiandsnowjobs/a/skijob2_2.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/social-media-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/social-work-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/sports-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/television-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/trades-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/a/transportation-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/travel-job-titles.htm',
 u'http://jobsearch.about.com/od/job-title-samples/fl/yacht-job-titles.htm']
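
The two lists above line up by position, so one way to turn them into title/link pairs is to zip them. This is only a sketch: the lists are shortened stand-ins copied from the output above, pair_up is a hypothetical helper, and pairing by position assumes every link contributed exactly one text node (otherwise the lists could drift out of step, which is why selecting per a element is the safer route).

```python
# Shortened stand-ins for the two lists returned in the shell session above.
titles = ['Accounting', 'Outdoor ', 'Real Estate ']
links = [
    'http://jobsearch.about.com/od/job-title-samples/a/accounting-job-titles.htm',
    'http://jobsearch.about.com/od/job-title-samples/fl/outdoor-job-titles.htm',
    'http://jobsearch.about.com/od/job-title-samples/a/real-estate-job-titles.htm',
]


def pair_up(titles, links):
    """Zip parallel lists into dicts, trimming stray whitespace in titles."""
    return [{'title': t.strip(), 'link': l} for t, l in zip(titles, links)]


items = pair_up(titles, links)
for item in items:
    print(item)
```

The strip() call is there because a few of the titles in the output carry trailing spaces.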