HTML scraping td content via XPath yields no data
I am trying to extract some data from my college website for a project. Here is my code, but the item's fields don't contain any data.
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
import scrapy
from vasavi.items import VasaviItem

class MySpider(InitSpider):
    name = 'myspider'
    allowed_domains = ['domainsite']
    login_page = 'domainsite/index.aspx'
    start_urls = ['domainsite/My_Info.aspx']

    def init_request(self):
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
            formdata={'txtLoginID': 'srichakra', 'txtPWD': '12345'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        if "SRI CHAKRA GOUD" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin..
            return self.initialized()

    def parse(self, response):
        print "Parsing"
        item = VasaviItem()
        ur = response.url
        print ur
        item['rollno'] = response.xpath('//*[@id="divStudInfo"]/table/tbody/tr[2]/td[1]/text()').extract()
        item['name'] = response.css('#divStudInfo > table > tbody > tr:nth-child(3) > td:nth-child(2)::text').extract()
        item['Marks'] = response.xpath('//*[@id="divStudySummary"]/table/tbody/tr[3]/td[9]/a/text()').extract()
        yield item
I'm not allowed to post more than 2 URLs here, so I have replaced every http://www.domain.com with domainsite.

Output:
2015-01-03 18:45:06+0530 [myspider] INFO: Spider opened
2015-01-03 18:45:06+0530 [myspider] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-01-03 18:45:06+0530 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-01-03 18:45:06+0530 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-01-03 18:45:07+0530 [myspider] DEBUG: Crawled (200) <GET domainsite> (referer: None)
2015-01-03 18:45:09+0530 [myspider] DEBUG: Redirecting (302) to <GET domainsite/My_Info.aspx> from <POST domainsite/index.aspx>
2015-01-03 18:45:15+0530 [myspider] DEBUG: Crawled (200) <GET domainsite/My_Info.aspx> (referer: domainsite/index.aspx)
2015-01-03 18:45:15+0530 [myspider] DEBUG: Successfully logged in. Let's start crawling!
2015-01-03 18:45:21+0530 [myspider] DEBUG: Crawled (200) <GET domainsite/My_Info.aspx>(referer: domainsite/My_Info.aspx)
Parsing
domainsite/My_Info.aspx
2015-01-03 18:45:21+0530 [myspider] DEBUG: Scraped from <200 domainsite/My_Info.aspx>
{'rollno': [], 'Marks': [], 'name': []}
2015-01-03 18:45:21+0530 [myspider] INFO: Closing spider (finished)
2015-01-03 18:45:21+0530 [myspider] INFO: Stored json feed (1 items) in: vce.json
2015-01-03 18:45:21+0530 [myspider] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1370,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 92491,
'downloader/response_count': 4,
'downloader/response_status_count/200': 3,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 1, 3, 13, 15, 21, 528000),
'item_scraped_count': 1,
'log_count/DEBUG': 8,
'log_count/INFO': 8,
'request_depth_max': 2,
'response_received_count': 3,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2015, 1, 3, 13, 15, 6, 518000)}
2015-01-03 18:45:21+0530 [myspider] INFO: Spider closed (finished)
As other commenters have pointed out, you really need to show the HTML input. If I had to guess, I would say that tbody is not actually present in the page source — see e.g. this question or this question. tbody appears in both of the XPath expressions you show, and in the CSS selector as well.

To test this hypothesis, skip the tbody element in the expressions:
item['rollno'] = response.xpath('//*[@id="divStudInfo"]/table//tr[2]/td[1]/text()').extract()
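You can reproduce the underlying behaviour locally without the site. The sketch below uses Python's standard-library `xml.etree.ElementTree` as a stand-in for Scrapy's selectors, on a made-up minimal fragment (the table contents are invented for illustration): a path that includes a `tbody` step finds nothing when the markup has no `tbody`, while the same path without it matches.

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for the raw HTML a server might send. Browsers insert
# a <tbody> element when building the DOM, so "Copy XPath" in DevTools
# includes it even though it never appears in the source.
raw = """
<div id="divStudInfo">
  <table>
    <tr><td>Roll No</td><td>Name</td></tr>
    <tr><td>12345</td><td>SRI CHAKRA GOUD</td></tr>
  </table>
</div>
"""

root = ET.fromstring(raw)

# Path with the tbody step, as copied from the browser: matches nothing.
print(root.findall("table/tbody/tr[2]/td[1]"))   # []

# Same path with tbody skipped: finds the cell.
print(root.find("table/tr[2]/td[1]").text)       # 12345
```

The same reasoning applies to your CSS selector: replacing `> tbody >` with a plain descendant combinator (e.g. `#divStudInfo table tr:nth-child(3) > td:nth-child(2)::text`) avoids depending on an element the parser may never see.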