Scrapy response returning [] but prints in terminal
I'm trying to scrape Indeed.com and want to grab the information related to each job from its respective div. The response prints in the terminal, but when I write to a file or run the spider, I get a blank file and no items are returned. How do I fix this?
I've already tried changing my XPaths to be relative to the container they pull from, but it still comes back blank.
def parse(self, response):
    html = response.body
    container3 = response.xpath(".//div[contains(@class,'jobsearch-SerpJobCard unifiedRow row result clickcard')]").extract()
    print(container3)
    with open('container.txt', 'w') as cont:
        cont.write(container3)
        cont.close()
    title = Selector(response=container3).xpath(".//*[@class='title']/a/@title").get()
    titles = container3.xpath(".//*[@class='title']/a/@title").getall()
    locations = container3.xpath(".//*[@class= 'sjcl']/span/text()").getall()
    companies = container3.xpath(".//*[@class= 'company']/a/text()").getall()
    summarys = container3.xpath(".//*[@class= 'summary']/.").getall()
    links = response.css("div.title a::attr(href)").getall()
    webscrape = WebscrapeItem()
    webscrape['title'] = []
    webscrape['company'] = []
    webscrape['location'] = []
    webscrape['desc'] = []
    webscrape['link'] = []
    for link in links:
        self.links.append('https://www.indeed.com/' + link)
        webscrape['link'].append('https://www.indeed.com/' + link)
    for title, local in itertools.zip_longest(titles, locations):
        webscrape['title'].append(title)
        webscrape['location'].append(local)
    for suma, com in itertools.zip_longest(summarys, companies):
        webscrape['desc'].append(suma)
        webscrape['company'].append(com)
    yield webscrape
container3 output:
<div class="jobsearch-SerpJobCard unifiedRow row result clickcard" id="pj_23e4270b7501bb9b" data-jk="23e4270b7501bb9b" data-empn="5625259597886418" data-ci="291406065">\n\n <div class="title">\n <a target="_blank" id="sja2" href="/pagead/clk?mo=r&ad=-6NYlbfkN0AGcPE08CwaySIkGkcc_oP1ITgH03VIz0r4xVHFv1QhAqfdykiPOMynTjgufJX7HvDowBKp7j-7NHJP9GOjbo56Vjxh5NURcHO8VKHA2Y_kPQaP89uziwg10G1Cy7gxqliSnkyvAjNozb3dIZaFvs20PbgIEbVp-Hlps87Ix3AR1T6shfkApixB3pFjOLL7mVL86YGAk8ZDtjg1RSW02V3Z21NoirneOsjdmwulvgL84YrSuUydYlJaqi5F8aPMUi7pz0h9-mKPlGF9g2xadVCCe2GDYCw9Svjigifq0j5m6WWsToS9ZsU4_uJu3ZNLRr92Eiwq9QHaT2tJcVrjqtO1X7Lz2bHVDj0RBD_MvoO_FmG0_Sr_tCm8gCxu55S7Vk4GEi0nBslmfj4br8hgZ1AuLs4D_XWmJF6MErKJSgPJFZWn7X2SAlVC&p=2&fvj=1&vjs=3" onmousedown="sjomd(\'sja2\'); clk(\'sja2\');" onclick=" setRefineByCookie([]); sjoc(\'sja2\', 0); convCtr(\'SJ\')" rel="noopener nofollow" title="EMS Executive Director" class="jobtitle turnstileLink " data-tn-element="jobTitle">\n EMS Executive Director</a>\n\n </div>\n\n <div class="sjcl">\n <div>\n <span class="company">\n <a data-tn-element="companyName" class="turnstileLink" target="_blank" href="/cmp/Remsa-1" onmousedown="this.href = appendParamsOnce(this.href, \'from=SERP&campaignid=serp-linkcompanyname&fromjk=23e4270b7501bb9b&jcid=1075eae744bf7959\')" rel="noopener">\n REMSA</a></span>\n\n <a data-tn-element="reviewStars" data-tn-variant="cmplinktst2" class="turnstileLink slNoUnderline " href="/cmp/Remsa-1/reviews" title="Remsa reviews" onmousedown="this.href = appendParamsOnce(this.href, \'?campaignid=cmplinktst2&from=SERP&jt=EMS+Executive+Director&fromjk=23e4270b7501bb9b&jcid=1075eae744bf7959\');" target="_blank" rel="noopener">\n <span class="ratings" aria-label="3.9 out of 5 star rating"><span class="rating" style="width:44.4px"><!-- --></span></span>\n<span class="slNoUnderline">7 reviews</span>\n </a>\n </div>\n<div id="recJobLoc_23e4270b7501bb9b" class="recJobLoc" data-rc-loc="United States" style="display: none"></div>\n\n <div class="location 
">United States</div>\n </div>\n\n <div class="summary">\n Responsible for the <b>financial</b>, operational and management performance of Healthcare services for the company. Directs daily operations in support of the mission…</div>
I expect each 'jobsearch-SerpJobCard unifiedRow row result clickcard' to be extracted into a list, and then to pull the title, location, company, and summary from that list with relative XPaths.
Instead, I get a blank container3 and no items are returned. Here is the response.text information from the finished spider.
"{\"status\": \"ok\", \"items\": [], \"items_dropped\": [], \"stats\": {\"downloader/request_bytes\": 1132, \"downloader/request_count\": 3, \"downloader/request_method_count/GET\": 2, \"downloader/request_method_count/POST\": 1, \"downloader/response_bytes\": 1012262, \"downloader/response_count\": 3, \"downloader/response_status_count/200\": 2, \"downloader/response_status_count/404\": 1, \"finish_reason\": \"finished\", \"finish_time\": \"2019-08-21 06:29:40\", \"log_count/DEBUG\": 3, \"log_count/ERROR\": 1, \"log_count/INFO\": 8, \"log_count/WARNING\": 1, ...
Check this, it works:
for item in response.xpath('//div[@class="jobsearch-SerpJobCard unifiedRow row result"]'):
    titles = item.xpath(".//*[@class='title']/a/@title").getall()
    print(titles)
    locations = item.xpath(".//*[@class= 'sjcl']/span/text()").getall()
    print(locations)
Output:
['Python Developer Freshers Trainees', 'Python Developer', 'Python Developer', 'Python Developer', 'Python Developers', 'Software Trainee', 'Python\Django Developer', 'Hiring 2016 / 2017 / 2018 / 2019 freshers as software trainee', 'Python/Django Developer', 'Senior Python Developer']
['Kochi, Kerala', 'Kochi, Kerala', 'Kochi, Kerala', 'Kochi, Kerala', 'Kochi, Kerala', 'Kochi, Kerala', 'Kochi, Kerala', 'Kochi, Kerala', 'Kochi, Kerala', 'Kochi, Kerala']