Extract tags from soup with BeautifulSoup
'''
<div class="kt-post-card__body">
<div class="kt-post-card__title">Example_1</div>
<div class="kt-post-card__description">Example_2</div>
<div class="kt-post-card__bottom">
<span class="kt-post-card__bottom-description kt-text-truncate" title="Example_3">Example_4</span>
</div>
</div>
'''
According to the picture I attached, I want to extract all "kt-post-card__body" elements and then, from each of them, extract:
("kt-post-card__title", "kt-post-card__description")
as a list.
I tried this:
ads = soup.find_all('div',{'class':'kt-post-card__body'})
but with ads[0].div I can only access "kt-post-card__title", even though "kt-post-card__body" has three child tags, including "kt-post-card__description" and "kt-post-card__bottom". Why is that?
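Try this:
ads = soup.find_all('div',{'class':'kt-post-card__body'})
ads[0]
I think you only get the first div because you called ads[0].div, which is shorthand for ads[0].find('div') and therefore returns only the first matching descendant.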
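To make the point above concrete, here is a minimal sketch. It assumes the markup from the question (with the quote on the first class attribute closed) and the built-in 'html.parser'; the variable names are only illustrative. It compares ads[0].div-style attribute access with find_all(..., recursive=False), which should return every direct child div of the card.

```python
from bs4 import BeautifulSoup

html_doc = '''
<div class="kt-post-card__body">
<div class="kt-post-card__title">Example_1</div>
<div class="kt-post-card__description">Example_2</div>
<div class="kt-post-card__bottom">
<span class="kt-post-card__bottom-description kt-text-truncate" title="Example_3">Example_4</span>
</div>
</div>
'''

soup = BeautifulSoup(html_doc, 'html.parser')
card = soup.find('div', {'class': 'kt-post-card__body'})

# Attribute access is shorthand for find(): it returns only the first match.
first_div = card.div
same_as_find = card.find('div')

# recursive=False restricts the search to direct children of the card.
children = card.find_all('div', recursive=False)
child_classes = [child['class'][0] for child in children]

print(first_div['class'])
print(child_classes)
```

So the other child tags are still there; .div just never looks past the first one.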
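Since your question is not entirely clear: to extract the classes:
for e in soup.select('.kt-post-card__body'):
    # get('class', []) avoids a TypeError for descendants without a class attribute
    print([c for t in e.find_all() for c in t.get('class', [])])
Output:
['kt-post-card__title', 'kt-post-card__description', 'kt-post-card__bottom', 'kt-post-card__bottom-description', 'kt-text-truncate']
To get the texts, you also have to iterate over the ResultSet. You can then access each element's text to fill your list, or use stripped_strings.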
Example
from bs4 import BeautifulSoup
html_doc='''
<div class="kt-post-card__body">
<div class="kt-post-card__title">Example_1</div>
<div class="kt-post-card__description">Example_2</div>
<div class="kt-post-card__bottom">
<span class="kt-post-card__bottom-description kt-text-truncate" title="Example_3">Example_4</span>
</div>
</div>
'''
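# name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html_doc, 'html.parser')
for e in soup.select('.kt-post-card__body'):
    data = [
        e.select_one('.kt-post-card__title').text,
        e.select_one('.kt-post-card__description').text
    ]
    print(data)
Output: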
['Example_1', 'Example_2']
or, to collect every text fragment inside the element:
print(list(e.stripped_strings))
Output:
['Example_1', 'Example_2', 'Example_4']
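If the real page has several cards and some of them lack a field, select_one() returns None and .text would raise an AttributeError. The following is only a sketch under that assumption: the two-card markup and the helper text_or_none are hypothetical and not part of the original question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two cards; the second one has no description.
html_doc = '''
<div class="kt-post-card__body">
<div class="kt-post-card__title">Example_1</div>
<div class="kt-post-card__description">Example_2</div>
</div>
<div class="kt-post-card__body">
<div class="kt-post-card__title">Example_5</div>
</div>
'''


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node else None


soup = BeautifulSoup(html_doc, 'html.parser')

# One [title, description] list per card, in document order.
rows = [
    [text_or_none(card, '.kt-post-card__title'),
     text_or_none(card, '.kt-post-card__description')]
    for card in soup.select('.kt-post-card__body')
]
print(rows)
```

Keeping a None placeholder keeps the columns aligned, which helps if the lists are later written to a CSV file or a DataFrame.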