Scrapy response returning [] but prints in terminal

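I am trying to scrape Indeed.com and want to get the information for each job from its own div. The response prints in the terminal, but when I write it to a file or run the spider, I get a blank file and no items are returned. How can I fix this?

I have already tried making my XPath relative to the container it is pulled from, but it still comes back blank.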

    def parse(self, response):
        html = response.body
        container3 = response.xpath(".//div[contains(@class,'jobsearch-SerpJobCard unifiedRow row result clickcard')]").extract()
        print(container3)
        with open('container.txt', 'w') as cont:
            cont.write(container3)
        cont.close()
        title = Selector(response=container3).xpath(".//*[@class='title']/a/@title").get()
        titles = container3.xpath(".//*[@class='title']/a/@title").getall()
        locations = container3.xpath(".//*[@class= 'sjcl']/span/text()").getall()
        companies = container3.xpath(".//*[@class= 'company']/a/text()").getall()
        summarys = container3.xpath(".//*[@class= 'summary']/.").getall()
        links = response.css("div.title a::attr(href)").getall()

        webscrape = WebscrapeItem()
        webscrape['title'] = []
        webscrape['company'] = []
        webscrape['location'] = []
        webscrape['desc'] = []
        webscrape['link'] = []
        for link in links:
            self.links.append('https://www.indeed.com/' + link)
            webscrape['link'].append('https://www.indeed.com/' + link)

        for title, local in itertools.zip_longest(titles, locations):
            webscrape['title'].append(title)
            webscrape['location'].append(local)

        for suma, com in itertools.zip_longest(summarys, companies):
            webscrape['desc'].append(suma)
            webscrape['company'].append(com)

        yield webscrape

Output of container3:


<div class="jobsearch-SerpJobCard unifiedRow row result clickcard" id="pj_23e4270b7501bb9b" data-jk="23e4270b7501bb9b" data-empn="5625259597886418" data-ci="291406065">\n\n    <div class="title">\n        <a target="_blank" id="sja2" href="/pagead/clk?mo=r&amp;ad=-6NYlbfkN0AGcPE08CwaySIkGkcc_oP1ITgH03VIz0r4xVHFv1QhAqfdykiPOMynTjgufJX7HvDowBKp7j-7NHJP9GOjbo56Vjxh5NURcHO8VKHA2Y_kPQaP89uziwg10G1Cy7gxqliSnkyvAjNozb3dIZaFvs20PbgIEbVp-Hlps87Ix3AR1T6shfkApixB3pFjOLL7mVL86YGAk8ZDtjg1RSW02V3Z21NoirneOsjdmwulvgL84YrSuUydYlJaqi5F8aPMUi7pz0h9-mKPlGF9g2xadVCCe2GDYCw9Svjigifq0j5m6WWsToS9ZsU4_uJu3ZNLRr92Eiwq9QHaT2tJcVrjqtO1X7Lz2bHVDj0RBD_MvoO_FmG0_Sr_tCm8gCxu55S7Vk4GEi0nBslmfj4br8hgZ1AuLs4D_XWmJF6MErKJSgPJFZWn7X2SAlVC&amp;p=2&amp;fvj=1&amp;vjs=3" onmousedown="sjomd(\'sja2\'); clk(\'sja2\');" onclick=" setRefineByCookie([]); sjoc(\'sja2\', 0); convCtr(\'SJ\')" rel="noopener nofollow" title="EMS Executive Director" class="jobtitle turnstileLink " data-tn-element="jobTitle">\n            EMS Executive Director</a>\n\n        </div>\n\n    <div class="sjcl">\n        <div>\n    <span class="company">\n        <a data-tn-element="companyName" class="turnstileLink" target="_blank" href="/cmp/Remsa-1" onmousedown="this.href = appendParamsOnce(this.href, \'from=SERP&amp;campaignid=serp-linkcompanyname&amp;fromjk=23e4270b7501bb9b&amp;jcid=1075eae744bf7959\')" rel="noopener">\n        REMSA</a></span>\n\n    <a data-tn-element="reviewStars" data-tn-variant="cmplinktst2" class="turnstileLink slNoUnderline " href="/cmp/Remsa-1/reviews" title="Remsa reviews" onmousedown="this.href = appendParamsOnce(this.href, \'?campaignid=cmplinktst2&amp;from=SERP&amp;jt=EMS+Executive+Director&amp;fromjk=23e4270b7501bb9b&amp;jcid=1075eae744bf7959\');" target="_blank" rel="noopener">\n            <span class="ratings" aria-label="3.9 out of 5 star rating"><span class="rating" style="width:44.4px"><!-- --></span></span>\n<span class="slNoUnderline">7 reviews</span>\n            </a>\n    </div>\n<div 
id="recJobLoc_23e4270b7501bb9b" class="recJobLoc" data-rc-loc="United States" style="display: none"></div>\n\n        <div class="location ">United States</div>\n                </div>\n\n    <div class="summary">\n            Responsible for the <b>financial</b>, operational and management performance of Healthcare services for the company. Directs daily operations in support of the mission…</div>

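I expect each 'jobsearch-SerpJobCard unifiedRow row result clickcard' div to be extracted into a list, and then to use relative XPaths on that list to get the title, location, company, and summary.

Instead, container3 comes back empty and no items are returned. Here is the response.text from the finished spider: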

"{\"status\": \"ok\", \"items\": [], \"items_dropped\": [], \"stats\": {\"downloader/request_bytes\": 1132, \"downloader/request_count\": 3, \"downloader/request_method_count/GET\": 2, \"downloader/request_method_count/POST\": 1, \"downloader/response_bytes\": 1012262, \"downloader/response_count\": 3, \"downloader/response_status_count/200\": 2, \"downloader/response_status_count/404\": 1, \"finish_reason\": \"finished\", \"finish_time\": \"2019-08-21 06:29:40\", \"log_count/DEBUG\": 3, \"log_count/ERROR\": 1, \"log_count/INFO\": 8, \"log_count/WARNING\": 1, ...
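One thing worth checking, given the `log_count/ERROR: 1` in the stats: `.extract()` returns a plain Python list of strings, and `cont.write()` only accepts a single `str`. If that call raises, `parse` stops before it reaches `yield`, which would be consistent with `"items": []`. This is only a guess from the stats shown, not a confirmed diagnosis. The sketch below reproduces the file-writing part in isolation; the sample strings and the temporary path are made up for illustration.

```python
import os
import tempfile

# Stand-in for what `.extract()` returns: a plain list of HTML strings.
# (Hypothetical sample data, not real Indeed markup.)
container3 = ['<div class="title">A</div>', '<div class="title">B</div>']

# Write into a throwaway directory so the example leaves nothing behind.
path = os.path.join(tempfile.mkdtemp(), "container.txt")

# Passing the list straight to write() fails, as in the question's code.
try:
    with open(path, "w", encoding="utf-8") as cont:
        cont.write(container3)
    error = None
except TypeError as exc:
    error = exc

# Joining the strings first gives write() the single str it expects.
# The `with` block closes the file, so no explicit close() is needed.
with open(path, "w", encoding="utf-8") as cont:
    cont.write("\n".join(container3))

with open(path, encoding="utf-8") as cont:
    written = cont.read()

print(type(error).__name__)
print(written)
```

The same reasoning applies to `Selector(response=container3)` and `container3.xpath(...)`: once `.extract()` has been called, the result is a list of strings and no longer has selector methods.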

Try this; it works:

        for item in response.xpath('//div[@class="jobsearch-SerpJobCard unifiedRow row result"]'):
            titles = item.xpath(".//*[@class='title']/a/@title").getall()
            print(titles)
            locations = item.xpath(".//*[@class= 'sjcl']/span/text()").getall()
            print(locations)

Output:

['Python Developer Freshers Trainees', 'Python Developer', 'Python Developer', 'Python Developer', 'Python Developers', 'Software Trainee', 'Python\Django Developer', 'Hiring 2016 / 2017 / 2018 / 2019 freshers as software trainee', 'Python/Django Developer', 'Senior Python Developer']
['Kochi, Kerala', 'Kochi, Kerala', 'Kochi, Kerala', 'Kochi, Kerala', 'Kochi, Kerala', 'Kochi, Kerala', 'Kochi, Kerala', 'Kochi, Kerala', 'Kochi, Kerala', 'Kochi, Kerala']