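How to get text between two sets of tags in Python
I am trying to get both the text inside the tags and the text that sits between one set of tags and the next. I have tried a few things but I am not getting what I want. Can anyone help? Thanks a lot.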
text = '''
<b>Doc Type: </b>AABB
<br />
<b>Doc No: </b>BBBBF
<br />
<b>System No: </b>aaa bbb
<br />
<b>VCode: </b>040000033
<br />
<b>G Code: </b>000045
<br />
'''
Expected output:
Doc Type: AABB
Doc No: BBBBF
System No: aaa bbb
VCode: 040000033
G Code: 000045
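Code I tried. This only gives me the text inside the tags, not the text outside them:
soup = BeautifulSoup(text, "html.parser")
print(soup.find_all('b'))
I also tried the following, but it gives me all the text on the page. I only want the text inside the tags and the text that follows them:
soup = BeautifulSoup(text, "html.parser")
lines = ''.join(soup.text)
print(lines)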
Current output:
Doc Type:
Doc No:
System No:
VCode:
G Code:
Try this:
from bs4 import BeautifulSoup
text = '''
<b>Doc Type: </b>AABB
<br />
<b>Doc No: </b>BBBBF
<br />
<b>System No: </b>aaa bbb
<br />
<b>VCode: </b>040000033
<br />
<b>G Code: </b>000045
<br />
'''
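# Collect every non-empty text node, stripped of surrounding whitespace.
# string=True is the current name for the older text=True argument.
result = [
    i.getText(strip=True) for i in
    BeautifulSoup(text, "html.parser").find_all(string=True)
    if i.getText(strip=True)
]
# Join the nodes two at a time: each label with the value that follows it.
print("\n".join([" ".join(result[i:i + 2]) for i in range(0, len(result), 2)]))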
Output:
Doc Type: AABB
Doc No: BBBBF
System No: aaa bbb
VCode: 040000033
G Code: 000045
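The same pairing idea can be sketched a little more compactly with `stripped_strings`, which yields each text node already stripped and skips the whitespace-only ones. This is only a sketch: the helper name `pair_labels` and the shortened sample are made up for illustration, and it assumes the text nodes strictly alternate label, value.

```python
from bs4 import BeautifulSoup


def pair_labels(html):
    """Return 'label value' lines, assuming text nodes alternate label/value."""
    soup = BeautifulSoup(html, "html.parser")
    # stripped_strings skips whitespace-only nodes and strips the rest.
    parts = list(soup.stripped_strings)
    # Join the nodes two at a time: label first, then the value after it.
    return [" ".join(parts[i:i + 2]) for i in range(0, len(parts), 2)]


# Shortened sample in the same shape as the question's markup.
sample = "<b>Doc Type: </b>AABB\n<br />\n<b>Doc No: </b>BBBBF\n<br />\n"
lines = pair_labels(sample)
print("\n".join(lines))
```

If a value is missing for one label, the pairing shifts by one from that point on, so this only fits markup as regular as the sample.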
You can use .next_sibling on each element.
Code:
html = '''
<b>Doc Type: </b>AABB
<br />
<b>Doc No: </b>BBBBF
<br />
<b>System No: </b>aaa bbb
<br />
<b>VCode: </b>040000033
<br />
<b>G Code: </b>000045
<br />'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
bs = soup.find_all('b')
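for each in bs:
    # next_sibling is the text node right after </b>; strip its whitespace.
    eachFollowingText = each.next_sibling.strip()
    # Strip the label too, so only one space separates label and value.
    print(f'{each.text.strip()} {eachFollowingText}')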
Output:
Doc Type: AABB
Doc No: BBBBF
System No: aaa bbb
VCode: 040000033
G Code: 000045
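One thing to watch with `.next_sibling`: if a `<b>` tag is followed directly by another tag, or by nothing at all, the sibling is a `Tag` or `None` and calling `.strip()` on it will not do what you want. A slightly more defensive sketch is below. The function name `label_value_pairs` and the sample with an empty field are assumptions added for illustration, not part of the question.

```python
from bs4 import BeautifulSoup, NavigableString


def label_value_pairs(html):
    """Return (label, value) tuples; value is '' when no text follows the <b>."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for b in soup.find_all("b"):
        sibling = b.next_sibling
        # Only a text node counts as the value; a Tag (e.g. <br>) or None does not.
        value = sibling.strip() if isinstance(sibling, NavigableString) else ""
        pairs.append((b.get_text(strip=True), value))
    return pairs


# Hypothetical sample: the middle field has no value after its label.
sample = "<b>Doc Type: </b>AABB<br /><b>Empty: </b><br /><b>G Code: </b>000045"
pairs = label_value_pairs(sample)
for label, value in pairs:
    print(f"{label} {value}".rstrip())
```

Returning tuples instead of printing directly also makes it easy to build a dict from the result if you need lookups by label.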
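Find the parent tag (it is not given in the question) to get the full text, then access its string content via .text and apply some formatting, such as removing empty lines.
With the lxml parser, BeautifulSoup always adds an html tag if one is missing, hence soup.html in my example. Assuming you know the real parent tag, replacing soup.html with soup.find(my parent tag) should solve it.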
html = '''
<b>Doc Type: </b>AABB
<br />
<b>Doc No: </b>BBBBF
<br />
<b>System No: </b>aaa bbb
<br />
<b>VCode: </b>040000033
<br />
<b>G Code: </b>000045
<br />
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
parent_tag = soup.html
s = '\n'.join(line for line in parent_tag.text.split('\n') if line != '')
print(s)
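As a sketch of the "known parent tag" case: the question does not show the real parent, so the `<div id="details">` wrapper and the footer paragraph below are purely hypothetical. The point is only that selecting the parent first keeps unrelated page text out of the result.

```python
from bs4 import BeautifulSoup

# Hypothetical wrapper: the real parent tag is not shown in the question.
html = '''<div id="details">
<b>Doc Type: </b>AABB
<br />
<b>Doc No: </b>BBBBF
<br />
</div>
<p>unrelated footer</p>'''

soup = BeautifulSoup(html, "html.parser")
# Restrict the search to the parent so the footer text is excluded.
parent = soup.find("div", id="details")
# Split the parent's text into lines and drop the blank ones.
lines = [line.strip() for line in parent.get_text().splitlines() if line.strip()]
print("\n".join(lines))
```

This relies on each label and value sitting on the same source line, as they do in the question's markup. If the HTML were minified onto one line, the `.next_sibling` approach would be the safer choice.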