How to get text between two sets of tags in Python

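I am trying to get both the text inside the tags and the text that sits between one set of tags and the next. I have tried the approaches below, but I am not getting what I want. Can anyone help? Thanks a lot.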

text = '''
 <b>Doc Type: </b>AABB
<br />
<b>Doc No: </b>BBBBF
<br />
<b>System No: </b>aaa bbb
<br />
<b>VCode: </b>040000033
<br />
<b>G Code: </b>000045
<br />  
'''

Expected output:

Doc Type: AABB
Doc No:   BBBBF
System No: aaa bbb
VCode: 040000033
G Code: 000045

The code I tried. It only gives me the text inside the tags, not the text outside them:

from bs4 import BeautifulSoup

soup = BeautifulSoup(text, "html.parser")
print(soup.find_all('b'))

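I also tried the following, but it gives me all the text on the page. I only want the text inside the tags and the text right after them:

soup = BeautifulSoup(text, "html.parser")
lines = ''.join(soup.text)
print(lines)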

Current output:

Doc Type: 
Doc No:   
System No: 
VCode: 
G Code: 

Try this:

from bs4 import BeautifulSoup

text = '''
 <b>Doc Type: </b>AABB
<br />
<b>Doc No: </b>BBBBF
<br />
<b>System No: </b>aaa bbb
<br />
<b>VCode: </b>040000033
<br />
<b>G Code: </b>000045
<br />  
'''

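soup = BeautifulSoup(text, "html.parser")
# stripped_strings yields every non-empty text node with surrounding whitespace removed
result = list(soup.stripped_strings)
# Join each label with the value that follows it (items 0+1, 2+3, ...)
print("\n".join(" ".join(result[i:i + 2]) for i in range(0, len(result), 2)))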

Output:

Doc Type: AABB
Doc No: BBBBF
System No: aaa bbb
VCode: 040000033
G Code: 000045

You can use .next_sibling on each <b> element to get the text that follows it.

Code:

html = '''
 <b>Doc Type: </b>AABB
<br />
<b>Doc No: </b>BBBBF
<br />
<b>System No: </b>aaa bbb
<br />
<b>VCode: </b>040000033
<br />
<b>G Code: </b>000045
<br />'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
bs = soup.find_all('b')


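for each in bs:
    # next_sibling is the text node that directly follows the <b> tag
    each_following_text = each.next_sibling.strip()
    # strip the label as well, so only one space separates label and value
    print(f'{each.get_text(strip=True)} {each_following_text}')

Output:

Doc Type: AABB
Doc No: BBBBF
System No: aaa bbb
VCode: 040000033
G Code: 000045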
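
The .next_sibling approach assumes that every <b> is followed directly by a text node. A slightly more defensive variant is sketched below; the helper name label_value_pairs and the choice to return a dict are my own assumptions, not something from the question. It treats a missing sibling, or a sibling that is another tag, as an empty value.

```python
from bs4 import BeautifulSoup, NavigableString

html = '''
 <b>Doc Type: </b>AABB
<br />
<b>Doc No: </b>BBBBF
<br />
<b>System No: </b>aaa bbb
<br />
'''


def label_value_pairs(markup):
    """Map each <b> label to the text node that directly follows it.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(markup, "html.parser")
    pairs = {}
    for b in soup.find_all("b"):
        # "Doc Type: " -> "Doc Type"
        label = b.get_text(strip=True).rstrip(":")
        sibling = b.next_sibling
        # Only a text node counts as a value; a tag or None becomes ""
        value = sibling.strip() if isinstance(sibling, NavigableString) else ""
        pairs[label] = value
    return pairs


pairs = label_value_pairs(html)
for label, value in pairs.items():
    print(f"{label}: {value}")
```

Returning a dict makes it easy to look up a single field, for example pairs["Doc No"], but it would drop duplicate labels; a list of tuples may suit you better if labels can repeat.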

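If you find the parent tag (it is not shown in the question), you can get the whole text: access its string content through .text and then apply some formatting, such as removing blank lines.

With the lxml parser, BeautifulSoup adds an html tag when one is missing, which is why my example uses soup.html. Assuming you know the real parent tag, replacing soup.html with soup.find(<your parent tag>) should do it.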

html = '''
 <b>Doc Type: </b>AABB
<br />
<b>Doc No: </b>BBBBF
<br />
<b>System No: </b>aaa bbb
<br />
<b>VCode: </b>040000033
<br />
<b>G Code: </b>000045
<br />
'''
from bs4 import BeautifulSoup

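# The lxml parser (installed separately) wraps the fragment in <html><body>,
# so soup.html exists even though the input has no html tag
soup = BeautifulSoup(html, 'lxml')

parent_tag = soup.html
# Keep only the non-blank lines of the parent's text
s = '\n'.join(line.strip() for line in parent_tag.text.split('\n') if line.strip())
print(s)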
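
If the real page does have a wrapper element around these lines, the parent-tag idea could look roughly like the sketch below. The <div id="details"> wrapper and the extra <p> are hypothetical, added only to show that text outside the parent is left out. It uses html.parser, which does not add an html tag to a fragment, so the parent is located with soup.find instead of soup.html.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <div id="details"> wrapper is an assumption,
# it does not appear in the question
html = '''
<div id="details">
<b>Doc Type: </b>AABB
<br />
<b>Doc No: </b>BBBBF
<br />
</div>
<p>unrelated text</p>
'''

soup = BeautifulSoup(html, "html.parser")

# Restrict the search to the wrapper so the rest of the page is ignored
parent_tag = soup.find("div", id="details")

# Drop blank lines and surrounding whitespace from the parent's text
lines = [line.strip() for line in parent_tag.get_text().splitlines() if line.strip()]
print("\n".join(lines))
```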