How do I print the contents of tags in order with BeautifulSoup?
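I am trying to get the text from the HTML page posted below. I tried a loop, but it does not print the strings in order. I need the strings printed in the following order: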
text1:text_1.1
text2:text2.2,2.2
...
I need to print the output shown above.
<ul>
<li>
<b>text1:</b>
<a><a href="search.php?origin=">text_1.1</a>
</li>
<li>
<b>text2</b>
<a href="search.php?origin=">text_2.1</a>
<a href="search.php?origin">text_2.2</a>
</li>
<li>
<b>text4</b>
<a href="search.php?origin=">text_4.1</a>
<a href="search.php?origin=">text_4.2</a>
<a href="search.php?origin=">text_4.3</a>
<a href="search.php?origin=">text_4.4</a>
</li>
<li>
<b>text5</b>
<a href="search.php?origin=">text5.1</a>
</li>
<li>
<b>text6</b>
<a href="search.php?origin=">text6.1</a>
<a href="search.php?origin=">text6.2</a>
<a href="search.php?origin=">text6.3</a>
<li>
<b>text7</b>
<a href="search.php?origin=">text7.1</a>
<font color="green">text7.2</font>
</li>
<li>
<b>text8</b>
<a href="dwres.php?resource=">2 </a>
</ul>
from bs4 import BeautifulSoup
# html holds the markup posted in the question
for t in BeautifulSoup(html).find("ul").find_all("a"):
    print(t.text)
Output:
text_1.1
text_2.1
text_2.2
text_4.1
text_4.2
text_4.3
text_4.4
text5.1
text6.1
text6.2
text6.3
text7.1
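If you need the text of both the a and b tags:
ul = BeautifulSoup(html).find("ul")
b = [b.text for b in ul.find_all("b")]
a = [a.text for a in ul.find_all("a")]
You will need to decide how to pair up the output, because there are clearly more a tags than b tags.
You can also get the li tags and, for each one, access its b tag and join the text of its a tags, which gives what you seem to want:
ul = BeautifulSoup(html).find("ul")
li = ul.find_all("li")
for ele in li:
    print("{}:{}".format(ele.b.text, "".join([a.text for a in ele.find_all("a")])))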
text1::text_1.1
text2:text_2.1text_2.2
text4:text_4.1text_4.2text_4.3text_4.4
text5:text5.1
text6:text6.1text6.2text6.3
text7:text7.1
text8:2
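For the comma-separated output the question asks for, the same per-li loop can be adjusted slightly. The following is only a minimal sketch, assuming Python 3 and the built-in html.parser; the sample markup is a shortened, well-formed stand-in for the page in the question, and format_items is a hypothetical helper name, not something from the original answer. With malformed markup such as the question's, the result may differ depending on which parser is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed sample in the shape of the question's markup.
html = """
<ul>
  <li><b>text1:</b> <a href="search.php?origin=">text_1.1</a></li>
  <li><b>text2</b> <a href="search.php?origin=">text_2.1</a> <a href="search.php?origin=">text_2.2</a></li>
  <li><b>text8</b> <a href="dwres.php?resource=">2 </a></li>
</ul>
"""


def format_items(markup):
    """Return one 'label:link1,link2' string per <li>, in document order."""
    soup = BeautifulSoup(markup, "html.parser")
    lines = []
    for li in soup.find_all("li"):
        if li.b is None:  # skip list items that have no <b> label
            continue
        # Strip whitespace and a trailing colon so "text1:" does not print as "text1::"
        label = li.b.get_text(strip=True).rstrip(":")
        links = [a.get_text(strip=True) for a in li.find_all("a")]
        links = [text for text in links if text]  # drop empty anchors
        lines.append("{}:{}".format(label, ",".join(links)))
    return lines


for line in format_items(html):
    print(line)
```

Joining with "," instead of "" keeps the individual link texts distinguishable, and get_text(strip=True) removes stray whitespace such as the trailing space in the last link.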
Find all the <li> elements so you can group their contents by the <b> tag. You probably want a dictionary to map them, but to preserve document order you could use a collections.OrderedDict() object:
from collections import OrderedDict
results = OrderedDict()
for li in soup.find_all('li'):
    bold = li.b
    if bold is None:
        continue
    results[bold.get_text(strip=True)] = [
        link.get_text(strip=True) for link in li.find_all('a')
    ]
Demo:
>>> from bs4 import BeautifulSoup
>>> from collections import OrderedDict
>>> soup = BeautifulSoup('''\
... <ul>
... <li>
... <b>text1:</b>
... <a><a href="search.php?origin=">text_1.1</a>
... </li>
... <li>
... <b>text2</b>
... <a href="search.php?origin=">text_2.1</a>
... <a href="search.php?origin">text_2.2</a>
... </li>
... <li>
... <b>text4</b>
... <a href="search.php?origin=">text_4.1</a>
... <a href="search.php?origin=">text_4.2</a>
... <a href="search.php?origin=">text_4.3</a>
... <a href="search.php?origin=">text_4.4</a>
... </li>
... <li>
... <b>text5</b>
... <a href="search.php?origin=">text5.1</a>
... </li>
... <li>
... <b>text6</b>
... <a href="search.php?origin=">text6.1</a>
... <a href="search.php?origin=">text6.2</a>
... <a href="search.php?origin=">text6.3</a>
... <li>
... <b>text7</b>
... <a href="search.php?origin=">text7.1</a>
... <font color="green">text7.2</font>
... </li>
... <li>
... <b>text8</b>
... <a href="dwres.php?resource=">2 </a>
... </ul>
... ''')
>>> results = OrderedDict()
>>> for li in soup.find_all('li'):
...     bold = li.b
...     if bold is None:
...         continue
...     results[bold.get_text(strip=True)] = [
...         link.get_text(strip=True) for link in li.find_all('a')
...     ]
...
>>> results
OrderedDict([(u'text1:', [u'', u'text_1.1']), (u'text2', [u'text_2.1', u'text_2.2']), (u'text4', [u'text_4.1', u'text_4.2', u'text_4.3', u'text_4.4']), (u'text5', [u'text5.1']), (u'text6', [u'text6.1', u'text6.2', u'text6.3']), (u'text7', [u'text7.1']), (u'text8', [u'2'])])
>>> for key, elems in results.items():
...     print('{}: {}'.format(key, ', '.join(elems)))
...
text1:: , text_1.1
text2: text_2.1, text_2.2
text4: text_4.1, text_4.2, text_4.3, text_4.4
text5: text5.1
text6: text6.1, text6.2, text6.3
text7: text7.1
text8: 2
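The print call could be folded into the first loop, but by building a dictionary you can now process the data further: write it to a file, send it somewhere else, and so on.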
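As an illustration of that further processing, here is a minimal sketch that serializes the grouped results as JSON. It assumes Python 3.7 or later, where a plain dict preserves insertion order, so OrderedDict is not strictly needed; the sample markup, the group_links helper name, and the results.json file name mentioned in the comment are all hypothetical.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical, well-formed sample in the shape of the question's markup.
html = """
<ul>
  <li><b>text2</b> <a href="search.php?origin=">text_2.1</a> <a href="search.php?origin=">text_2.2</a></li>
  <li><b>text5</b> <a href="search.php?origin=">text5.1</a></li>
</ul>
"""


def group_links(markup):
    """Map each <b> label to the list of link texts in the same <li>."""
    soup = BeautifulSoup(markup, "html.parser")
    results = {}  # plain dicts keep insertion order on Python 3.7+
    for li in soup.find_all("li"):
        bold = li.b
        if bold is None:  # skip list items that have no <b> label
            continue
        results[bold.get_text(strip=True)] = [
            link.get_text(strip=True) for link in li.find_all("a")
        ]
    return results


results = group_links(html)
serialized = json.dumps(results, indent=2)
print(serialized)

# To persist the data, the same string could be written to a file, for example:
# with open("results.json", "w", encoding="utf-8") as fh:
#     fh.write(serialized)
```

Keeping the parsing step separate from the output step means the same dictionary can feed a file, a network call, or the formatted print loop shown in the demo.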