How can I get the actual text from a Beautiful Soup class tag?
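- Python version: 3.8
- bs4 library
I have the following HTML, which shows 2 of the 20-plus reviews I am scraping. To save space I have left out the rest, but you can imagine these blocks repeating.
I need to retrieve "sml-rank-stars sml-str40 star" (shown on the second line here) from each review.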
<div class="review-rank">
<span class="sml-rank-stars sml-str40 star"></span>
<span class="score">
<span class="item">
口味:3.5
</span>
<span class="item">
环境:4.0
</span>
<span class="item">
服务:3.5
</span>
<span class="item">人均:200元</span>
</span>
</div>
<div class="review-rank">
<span class="sml-rank-stars sml-str35 star"></span>
<span class="score">
<span class="item">
口味:3.0
</span>
<span class="item">
环境:4.5
</span>
<span class="item">
服务:3.0
</span>
</span>
</div>
Here is what I have tried so far:
for review in review_items.find_all('div', class_='main-review'):
    review_rank = review.find('div', class_='review-rank')
    star_rank = []
    for review in review_rank.find_all('span')[:1]:
        star_rank.append(review.get('class'))
print(star_rank)
This gives me the output:
[['sml-rank-stars', 'sml-str5', 'star']]
I can then use this code to get just the number:
star_rank[0][1][7:]
Output:
'5'
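The problem is that I only get one of the reviews; I need this line for every review, stored in a list.
My desired output is something like this, or anything I can iterate over to get the star count of each review: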
[['sml-rank-stars', 'sml-str40', 'star'],
['sml-rank-stars', 'sml-str35', 'star'],
['sml-rank-stars', 'sml-str50', 'star'],
['sml-rank-stars', 'sml-str40', 'star'],
['sml-rank-stars', 'sml-str40', 'star'],
['sml-rank-stars', 'sml-str50', 'star'],
['sml-rank-stars', 'sml-str50', 'star'],
['sml-rank-stars', 'sml-str45', 'star'],
['sml-rank-stars', 'sml-str10', 'star'],
['sml-rank-stars', 'sml-str35', 'star'],
['sml-rank-stars', 'sml-str45', 'star'],
['sml-rank-stars', 'sml-str40', 'star'],
['sml-rank-stars', 'sml-str45', 'star'],
['sml-rank-stars', 'sml-str10', 'star'],
['sml-rank-stars', 'sml-str5', 'star']]
I have figured out how to print the results like this with the following code, but I need to save them to a list or something else I can iterate over.
for review in review_items.find_all('div', class_='main-review'):
    review_rank = review.find('div', class_='review-rank')
    for review in review_rank.find_all('span')[:1]:
        print(review.get('class'))
Output:
['sml-rank-stars', 'sml-str40', 'star']
['sml-rank-stars', 'sml-str35', 'star']
['sml-rank-stars', 'sml-str50', 'star']
['sml-rank-stars', 'sml-str40', 'star']
['sml-rank-stars', 'sml-str40', 'star']
['sml-rank-stars', 'sml-str50', 'star']
['sml-rank-stars', 'sml-str50', 'star']
['sml-rank-stars', 'sml-str45', 'star']
['sml-rank-stars', 'sml-str10', 'star']
['sml-rank-stars', 'sml-str35', 'star']
['sml-rank-stars', 'sml-str45', 'star']
['sml-rank-stars', 'sml-str40', 'star']
['sml-rank-stars', 'sml-str45', 'star']
['sml-rank-stars', 'sml-str10', 'star']
['sml-rank-stars', 'sml-str5', 'star']
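Iterate over all .review-rank elements with select. To get only the rating, use a list comprehension on the class list:
star_rank = []
for r in soup.select('.review-rank'):
    # r.span is the first <span> in the block; keep only the 'sml-strNN' class and strip its prefix
    star_rank.append([s.replace('sml-str','') for s in r.span['class'] if 'sml-str' in s][0])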
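Or, following your example (I don't know the overall structure of review_items, or whether each review holds one or several .review-rank blocks):
star_rank = []
for review in review_items.find_all('div', class_='main-review'):
    # a separate name for the inner loop avoids shadowing the outer `review`
    for rank in review.find_all('div', class_='review-rank'):
        star_rank.append([s.replace('sml-str','') for s in rank.span['class'] if 'sml-str' in s][0])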
Output:
['40', '35']
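If you want numbers rather than strings, the prefix stripping can be pulled into a small helper. This is only a sketch: `stars_from_classes` is a hypothetical name, not part of bs4, and it assumes that 'sml-str40' means 4.0 stars and 'sml-str5' means 0.5 stars, which the page itself does not confirm. It takes the kind of class list that `tag.get('class')` returns and does not depend on the position of the rating token, unlike `star_rank[0][1][7:]`.

```python
import re

# Matches the rating token, e.g. 'sml-str40' -> '40'.
# The 'sml-str' prefix is taken from the HTML in the question.
_STAR_TOKEN = re.compile(r'^sml-str(\d+)$')


def stars_from_classes(classes):
    """Return the star rating as a float, or None if no rating token is present."""
    for token in classes:
        match = _STAR_TOKEN.match(token)
        if match:
            # Assumption: the digits are tenths of a star (40 -> 4.0, 5 -> 0.5).
            return int(match.group(1)) / 10
    return None


# Class lists in the shape shown in the question's output.
class_lists = [
    ['sml-rank-stars', 'sml-str40', 'star'],
    ['sml-rank-stars', 'sml-str35', 'star'],
    ['sml-rank-stars', 'sml-str5', 'star'],
]
ratings = [stars_from_classes(c) for c in class_lists]
print(ratings)  # [4.0, 3.5, 0.5]
```

In the loops above you would call it as `star_rank.append(stars_from_classes(r.span['class']))`. Returning None instead of indexing with `[0]` means a block without a rating class does not raise IndexError.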