Can't get Scrapy to return text in Div
I can't get Scrapy to return the text from this div. When it does return data, it returns far more than I expected.
Target HTML:
<div class="DivTimeSpan" title="Full Time">12:00 PM - 09:00 PM </div>
Attempt 1:
    def parse_schedule(self, response):
        s_item = ScheduleItem()
        for sel in response.xpath("//div[@class='DivTimeSpan']"):
            s_item['schedule'] = sel.select('//text()').extract()
        return s_item
Returns:
"\r\n\r\n ", "\r\n ", "\r\n \r\n\r\n var allowedUrls = [];\r\n allowedUrls.push(\"Login.net\");\r\n allowedUrls.push(\"Login\");\r\n allowedUrls.push(\"AccountLogin.net\");\r\n allowedUrls.push(\"AccountLogin\");\r\n allowedUrls.push(\"CreateAccount\");\r\n allowedUrls.push(\"CreateAccount.net\");\r\n allowedUrls.push(\"UpdateAccount\");\r\n allowedUrls.push(\"UpdateAccount.net\");\r\n allowedUrls.push(\"CreateResellersAccount\");\r\n allowedUrls.push(\"CreateResellersAccount.net\");\r\n allowedUrls.push(\"CreateQqestSAASAccount\");\r\n
"11:00 AM - 09:00 PM", "12:00 PM - 09:00 PM", "12:00 PM - 09:00 PM", "12:00 PM - 09:00 PM", "12:00 PM - 09:00 PM"
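The full output is probably thousands of lines long and includes text from outside the div I specified.
I understood //text() to return the text of the element and its children. The HTML element I'm targeting has no children, so I assumed it would return only the data in the div.
Next I tried using just "/text()". That is the only change.
Attempt 2:
        for sel in response.xpath("//div[@class='DivTimeSpan']"):
            s_item['schedule'] = sel.select('/text()').extract()
        return s_item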
Returns:
[{"schedule": []}]
Desired result:
[{"schedule": ["11:00 AM - 09:00 PM", "12:00 PM - 09:00 PM", "12:00 PM - 09:00 PM", "12:00 PM - 09:00 PM", "12:00 PM - 09:00 PM"]}]
The URL I'm scraping is behind a company login, so I can't share the actual URL.
Elisha's post pointed me in the right direction, thanks!!! :)
Answer:
        for sel in response.xpath("//div[@class='DivTimeSpan']"):
            s_item['schedule'] = map(unicode.strip, sel.select('//div/text()').extract())
        return s_item
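For readers on Python 3: `unicode` no longer exists there, and `map` returns a lazy iterator rather than a list, so the stripping step above needs a small change. A minimal sketch, assuming the extracted values are plain strings; the `raw` list is a hypothetical stand-in for what `.extract()` returns, not real output from the page:

```python
# Hypothetical raw values, shaped like the strings .extract() might return.
raw = ["11:00 AM - 09:00 PM ", "\r\n    ", "\r\n 12:00 PM - 09:00 PM "]

# Strip surrounding whitespace and drop entries that are empty afterwards.
schedule = [text.strip() for text in raw if text.strip()]

print(schedule)
```

Building a list directly also keeps the item serializable, which a bare `map` object would not be.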
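Your second attempt is closer to extracting the value. However, you need to extract the text from the node rather than from the document root:
    s_item['schedule'] = sel.select('/div/text()').extract()[0]
If the document contains more tags (other than div), you can try:
    s_item['schedule'] = sel.select('//div/text()').extract()[0]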
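The difference between document-wide text and text scoped to the matched elements can be sketched without Scrapy. The example below swaps in the standard library's `xml.etree.ElementTree` for Scrapy selectors and runs on a hypothetical, well-formed stand-in for the page (the real one is behind a login), so treat it as an illustration of the idea rather than spider code:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the real page; the markup is an assumption.
PAGE = """
<html>
  <head><script>var allowedUrls = [];</script></head>
  <body>
    <div class="DivTimeSpan" title="Full Time">11:00 AM - 09:00 PM </div>
    <div class="DivTimeSpan" title="Full Time">12:00 PM - 09:00 PM </div>
    <div class="Other">not a schedule</div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Document-wide text: every text node on the page, script contents included.
all_text = list(root.itertext())

# Scoped text: only the text held directly by each matched div, stripped.
schedule = [
    (div.text or "").strip()
    for div in root.findall(".//div[@class='DivTimeSpan']")
]

print(schedule)
```

In Scrapy, the corresponding relative query on each matched selector would be `sel.xpath('./text()')`. A path that begins with `/` or `//` is evaluated from the document root even when it is called on a sub-selector, which is consistent with Attempt 1 returning the script text and Attempt 2 returning an empty list.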