Python BeautifulSoup - How to extract this text
Current Python script:
import win_unicode_console
win_unicode_console.enable()
import requests
from bs4 import BeautifulSoup
data = '''
<div class="info">
<h1>Company Title</h1>
<p class="type">Company type</p>
<p class="address"><strong>ZIP, City</strong></p>
<p class="address"><strong>Street 123</strong></p>
<p style="margin-top:10px;"> Phone: <strong>(111) 123-456-78</strong><br />
Fax: <strong>(222) 321-654-87</strong><br />
Phone: <strong>(333) 87-654-321</strong><br />
Fax: <strong>(444) 000-1111-2222</strong><br />
</p>
<p style="margin-top:10px;"> E-mail: <a href="mailto:mail@domain.com">mail@domain.com</a><br />
E-mail: <a href="mailto:mail2@domain.com">mail2@domain.com</a><br />
</p>
<p> Web: <a href="http://www.domain.com" target="_blank">www.domain.com</a><br />
</p>
<p style="margin-top:10px;"> ID: <strong>123456789</strong><br />
VAT: <strong>987654321</strong> </p>
<p class="del" style="margin-top:10px;">Some info:</p>
<ul>
<li><a href="#category">» Category</a></li>
</ul>
</div>
'''
html = BeautifulSoup(data, "html.parser")
p = html.find_all('p', attrs={'class': None})
for pp in p:
    print(pp.contents)
It returns the following:
[' Phone: ', <strong>123-456-78</strong>, <br/>, '\n\t\tFax: ', <strong>321-654-87</strong>, <br/>, '\n\t\tPhone: ', <strong>87-654-321</strong>, <br/>, '\n\t\tFax: ', <strong>000-1111-2222</strong>, <br/>, '\n']
[' E-mail: ', <a href="mailto:mail@domain.com">mail@domain.com</a>, <br/>, '\n\tE-mail: ', <a href="mailto:mail2@domain.com">mail2@domain.com</a>, <br/>, '\n']
[' Web: ', <a href="http://www.domain.com" target="_blank">www.domain.com</a>, <br/>, '\n']
[' ID: ', <strong>123456789</strong>, <br/>, '\n\t\tVAT: ', <strong>987654321</strong>, ' ']
Question:
I don't know how to extract the text for phone, fax, e-mail, ID, and VAT and build arrays from it, for example:
phones = [123-456-78, 87-654-321]
faxes = [321-654-87, 000-1111-2222]
emails = [mail@domain.com, mail2@domain.com]
id = [123456789]
vat = [987654321]
You can group the data with a defaultdict after splitting:
from collections import defaultdict

html = BeautifulSoup(data, "html.parser")
p = html.find_all('p', attrs={'class': None})
d = defaultdict(list)
for pp in p:
    # Split on any whitespace to get alternating label/value tokens.
    spl = iter(pp.text.split())
    for ele in spl:
        # Each label is followed by its value; drop the label's trailing colon.
        d[ele.rstrip(":")].append(next(spl).rstrip())
print(d)
defaultdict(<class 'list'>, {'Phone': ['123-456-78', '87-654-321'],
'Fax': ['321-654-87', '000-1111-2222'], 'E-mail': ['mail@domain.com',
'mail2@domain.com'], 'VAT': ['987654321'], 'Web': ['www.domain.com'],
'ID': ['123456789']})
Splitting the text gives lists of data:
['Phone:', '123-456-78', 'Fax:', '321-654-87', 'Phone:', '87-654-321', 'Fax:', '000-1111-2222']
['E-mail:', 'mail@domain.com', 'E-mail:', 'mail2@domain.com']
['Web:', 'www.domain.com']
['ID:', '123456789', 'VAT:', '987654321']
So we use every two elements as a key/value pair, appending to the existing list when a key repeats.
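The pairing described above can also be sketched without a manual iterator, by zipping the even-indexed tokens (labels) with the odd-indexed ones (values). This is only an illustration: `tokens` is copied from the first list above, and `grouped` is a name chosen for this example.

```python
from collections import defaultdict

# Tokens as produced by splitting one paragraph's text on whitespace
# (copied from the first list above).
tokens = ['Phone:', '123-456-78', 'Fax:', '321-654-87',
          'Phone:', '87-654-321', 'Fax:', '000-1111-2222']

grouped = defaultdict(list)
# tokens[0::2] are the labels, tokens[1::2] the values; zip pairs them up.
for label, value in zip(tokens[0::2], tokens[1::2]):
    grouped[label.rstrip(':')].append(value)

print(dict(grouped))
# {'Phone': ['123-456-78', '87-654-321'], 'Fax': ['321-654-87', '000-1111-2222']}
```

Like the iterator version, this only works while no value contains whitespace, since every token is assumed to be either a label or a complete value.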
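To handle your edit, where the fax and phone numbers contain spaces, just split the text into lines with splitlines and split each line once on whitespace: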
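from collections import defaultdict

d = defaultdict(list)
for pp in p:
    # One "Label: value" entry per line.
    spl = pp.text.splitlines()
    for ele in spl:
        # Split only once so spaces inside the value are kept.
        k, v = ele.strip().split(None, 1)
        d[k.rstrip(":")].append(v.rstrip())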
Output:
defaultdict(<class 'list'>, {'Fax': ['(222) 321-654-87', '(444) 000-1111-2222'],
'Web': ['www.domain.com'], 'ID': ['123456789'], 'E-mail': ['mail@domain.com', 'mail2@domain.com'],
'VAT': ['987654321'], 'Phone': ['(111) 123-456-78', '(333) 87-654-321']})
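If the real pages are less tidy than the sample, the line-based approach could be made a bit more forgiving, and the separate lists asked for in the question can then be read straight out of the dictionary. The following is a minimal sketch under a few assumptions: `parse_labelled_lines` and `sample_text` are hypothetical names, `sample_text` stands in for the text of the unclassed `<p>` tags, and lines without a colon or without a value are simply skipped.

```python
from collections import defaultdict

def parse_labelled_lines(text):
    """Group 'Label: value' lines by label, skipping lines with no colon or no value."""
    grouped = defaultdict(list)
    for line in text.splitlines():
        # partition splits at the first colon only, so values may contain colons.
        label, sep, value = line.partition(':')
        label, value = label.strip(), value.strip()
        if sep and label and value:
            grouped[label].append(value)
    return grouped

# Stand-in for the text of the unclassed <p> tags (assumed, not scraped here).
sample_text = """ Phone: (111) 123-456-78
Fax: (222) 321-654-87
Phone: (333) 87-654-321
Fax: (444) 000-1111-2222

 E-mail: mail@domain.com
E-mail: mail2@domain.com
 ID: 123456789
VAT: 987654321 """

d = parse_labelled_lines(sample_text)
phones = d['Phone']
faxes = d['Fax']
emails = d['E-mail']
ids = d['ID']   # named ids rather than id to avoid shadowing the built-in id()
vat = d['VAT']

print(phones)  # ['(111) 123-456-78', '(333) 87-654-321']
```

Using `partition(':')` instead of splitting on whitespace means a label no longer has to be a single word, at the cost of requiring a colon on every line that should be kept.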