BeautifulSoup - Getting all the children from a tag instead of only the first
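I'm creating a script that collects data from a website. However, I'm having trouble collecting only specific information. The part of the HTML that is causing me problems is as follows: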
<div class="Content">
<article>
<blockquote class="messageText 1234">
I WANT THIS
<br/>
I WANT THIS 2
<br/>
</a>
<br/>
</blockquote>
</article>
</div>
<div class="Content">
<article>
<blockquote class="messageText 1234">
<a class="IDENTIFIER" href="WEBSITE">
</a>
NO WANT THIS
<br/>
<br/>
NO WANT THIS
<br/>
<br/>
NO WANT THIS
<div class="messageTextEndMarker">
</div>
</blockquote>
</article>
</div>
I'm trying to create a procedure that prints only the "I WANT THIS" parts. I have the following script:
import requests
from bs4 import BeautifulSoup

url = ''
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

for a in soup.find_all('div', class_='panels'):
    for b in a.find_all('form', class_='section'):
        for c in b.find_all('div', class_='message'):
            for d in c.find_all('div', class_='primaryContent'):
                for d in d.find_all('div', class_='messageContent'):
                    # [0] picks the first blockquote inside *each* messageContent div;
                    # iterating over it yields that blockquote's child nodes.
                    for e in d.find_all('blockquote', class_='messageText 1234')[0]:
                        print(e.string)
My idea with the code was to extract only the parts from the first blockquote element. However, I get the text from all of the blockquotes:
I WANT THIS
NO WANT THIS
NO WANT THIS
NO WANT THIS
How can I achieve this?
Why not use select_one to isolate the first block and then stripped_strings to separate out the text strings?
from bs4 import BeautifulSoup as bs
html = ''' your html'''
soup = bs(html, 'lxml')
print([s for s in soup.select_one('.Content .messageText').stripped_strings])
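As a self-contained illustration of this approach, here is a minimal sketch that runs the same idea against a trimmed copy of the question's HTML. The helper name first_block_strings and the SAMPLE_HTML constant are made up for this example, the stray </a> from the original snippet is left out, and the built-in 'html.parser' is used in place of 'lxml' so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question
# (assumption: the stray </a> in the first block is omitted).
SAMPLE_HTML = """
<div class="Content">
<article>
<blockquote class="messageText 1234">
I WANT THIS
<br/>
I WANT THIS 2
<br/>
</blockquote>
</article>
</div>
<div class="Content">
<article>
<blockquote class="messageText 1234">
<a class="IDENTIFIER" href="WEBSITE">
</a>
NO WANT THIS
<br/>
NO WANT THIS
<div class="messageTextEndMarker">
</div>
</blockquote>
</article>
</div>
"""


def first_block_strings(html):
    """Return the non-empty text strings of the first matching blockquote."""
    soup = BeautifulSoup(html, "html.parser")
    # select_one returns only the first element matching the CSS selector,
    # or None when nothing matches.
    block = soup.select_one(".Content .messageText")
    if block is None:
        return []
    # stripped_strings yields each text node with surrounding whitespace
    # removed and skips nodes that contain only whitespace.
    return list(block.stripped_strings)


print(first_block_strings(SAMPLE_HTML))
```

With this sample, only the two wanted strings from the first blockquote should come back, because select_one stops at the first match and never visits the second .Content block.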