How to iterate through XML child nodes in Scrapy with Python?
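I want to scrape the comments on this page, but I can't figure out how to iterate over the children of the node that contains the comments and extract the data points.
This is part of the HTML: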
<div class="comment">
<div class="comment-user">
<div class="comment-user-avatar">
<a href="https://www.picuki.com/profile/alexandera_300">
<img src="https://scontent-yyz1-1.cdninstagram.com/v/t51.2885-19/s150x150/98342975_2815537605343770_6875611169034338304_n.jpg?_nc_ht=scontent-yyz1-1.cdninstagram.com&_nc_ohc=VjMtcOxXuaQAX_ZCqee&oh=4cf78fecbadcb57a81672c6edecc15a2&oe=5F02D580" alt="alexandera_300">
</a>
</div>
<div class="comment-user-nickname">
<a href="https://www.picuki.com/profile/alexandera_300">@alexandera_300</a>
</div>
</div>
<div class="comment-text">
#followforfollowback
</div>
</div>
<div class="comment">
<div class="comment-user">
<div class="comment-user-avatar">
<a href="https://www.picuki.com/profile/coxlogan2008">
<img src="https://scontent-yyz1-1.cdninstagram.com/v/t51.2885-19/s150x150/101229634_275138197009045_1475918829270859776_n.jpg?_nc_ht=scontent-yyz1-1.cdninstagram.com&_nc_ohc=e4gTZqQGpEAAX_7U-Q0&oh=36b7f5d1a0d7069f2447f4a318edec7d&oe=5F004A54" alt="coxlogan2008">
</a>
</div>
<div class="comment-user-nickname">
<a href="https://www.picuki.com/profile/coxlogan2008">@coxlogan2008</a>
</div>
</div>
<div class="comment-text">
</div>
</div>
The Python snippet I'm using is this:
def parse_post(self, response):
    img_url = response.meta['img_url']
    caption = response.meta['caption']
    url = response.meta['url']
    comments = response.xpath('//div[@id="commantsPlace"]/text()')
    for comment in comments:
        likes = response.xpath('.//span[@class="icon-thumbs-up-alt"]/text()').get()
        # need to put a regex here to get just the number value:
        num_of_comments = response.xpath('.//span[@id="commentsCount"]/text()').get()
        comment_user_name = comment.xpath('.//*[@class="comment-user-nickname"]/a/text()').get()
        comment_text = comment.xpath('.//*[@class="comment-text"]/text()').get()
        yield {'img_url': img_url,
               'caption': caption,
               'url': url,
               'likes': likes,
               'num_of_comments': num_of_comments,
               'comment_user_name': comment_user_name,
               'comment_text': comment_text}
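However, when I run this, I only get the data of the first comment n times. Can someone help me? I don't understand why the code doesn't iterate over the nodes.
Thanks in advance!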
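I think your problem comes from the XPath for 'comments'. By selecting only text(), you get text nodes rather than the comment elements, so the relative queries inside the loop have no element to search in.
The following change made it work for me: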
# the likes & number of comments only have to be taken once, should not be part of the loop
likes = response.xpath('.//span[@class="icon-thumbs-up-alt"]/text()').get()
num_of_comments = response.xpath('.//span[@id="commentsCount"]/text()').get()
comments = response.xpath('//div[@id="commantsPlace"]/*[@class="comment"]')
for comment in comments:
    comment_user_name = comment.xpath('.//*[@class="comment-user-nickname"]/a/text()').get()
    comment_text = comment.xpath('.//*[@class="comment-text"]/text()').get()
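
To see the idea in isolation, here is a minimal sketch that runs without Scrapy. It swaps in the standard library's xml.etree.ElementTree for Scrapy's selectors, and the SAMPLE markup and the extract_comments helper are assumptions made for illustration: a simplified, well-formed version of the HTML from the question, wrapped in the commantsPlace div. The point it shows is the same as above: select the comment elements themselves, then run queries starting with "." against each one.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the comment markup in the question
# (avatar images omitted; the wrapping div is assumed from the XPath above).
SAMPLE = """
<div id="commantsPlace">
  <div class="comment">
    <div class="comment-user">
      <div class="comment-user-nickname">
        <a href="https://www.picuki.com/profile/alexandera_300">@alexandera_300</a>
      </div>
    </div>
    <div class="comment-text">
      #followforfollowback
    </div>
  </div>
  <div class="comment">
    <div class="comment-user">
      <div class="comment-user-nickname">
        <a href="https://www.picuki.com/profile/coxlogan2008">@coxlogan2008</a>
      </div>
    </div>
    <div class="comment-text">
    </div>
  </div>
</div>
"""


def extract_comments(markup):
    """Return one dict per comment element found directly under the root."""
    root = ET.fromstring(markup)
    rows = []
    # Select the comment *elements*, not their text, so each can be queried.
    for comment in root.findall("./div[@class='comment']"):
        # Paths starting with "." are resolved relative to the current comment.
        name = comment.findtext(".//div[@class='comment-user-nickname']/a", default="")
        text = comment.findtext(".//div[@class='comment-text']", default="")
        rows.append({"comment_user_name": name.strip(),
                     "comment_text": text.strip()})
    return rows


if __name__ == "__main__":
    for row in extract_comments(SAMPLE):
        print(row)
```

In the spider itself, the equivalent is the response.xpath(...) call that selects the comment elements, followed by comment.xpath('.//...') inside the loop, with the yield placed inside the loop so that one item is emitted per comment.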