BS4: getting an exact match between unclosed <br> tags
from bs4 import BeautifulSoup
html = '''<tbody id="plaintiff-body">
<tr>
<td><img id="plaimg0001" src="/CaseInformationOnline/images/minus.png" onclick="showhide('pladetail0001','','plaimg0001')"></td>
<td>JENEE BENNETT</td>
<td></td>
<td>COURTNEY L HANNA</td>
</tr>
<tr id="pladetail0001" style="" valign="top">
<td></td>
<td>2348 WOODBROOK CIR N<br>UNIT D<br>COLUMBUS, OH 43223</td>
<td></td>
<td>JOSEPH & JOSEPH CO LPA <br>SUITE 200<br>155 W MAIN ST<br>COLUMBUS, OH 43215<br>(614) 449-8282<br><br>DEBORAH L MCNINCH<br>JOSEPH & JOSEPH CO LPA <br>THE WATERFORD, SUITE 200 <br>155 W MAIN ST<br>COLUMBUS, OH 43215<br>(614) 449-8282<br><br>S K DODDERER<br>155 W MAIN STREET<br>#200<br>COLUMBUS, OH 43215<br>(614) 449-8282</td>
</tr>
</tbody>'''
soup = BeautifulSoup(html, 'lxml')
att = [x.get_text(strip=True, separator=' ') for x in soup.select(
'#plaintiff-body tr:first-child > td:nth-child(4), #plaintiff-body tr:nth-child(2) > td:last-child')]
print(att)
Current output:
['COURTNEY L HANNA', 'JOSEPH & JOSEPH CO LPA SUITE 200 155 W MAIN ST COLUMBUS, OH 43215 (614) 449-8282 DEBORAH L MCNINCH JOSEPH & JOSEPH CO LPA THE WATERFORD, SUITE 200 155 W MAIN ST COLUMBUS, OH 43215 (614) 449-8282 S K DODDERER 155 W MAIN STREET #200 COLUMBUS, OH 43215 (614) 449-8282']
Desired output:
['COURTNEY L HANNA', 'JOSEPH & JOSEPH CO LPA SUITE 200 155 W MAIN ST COLUMBUS, OH 43215 (614) 449-8282']
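How can I achieve this?
I was thinking of passing a function (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-function) and looping over the matches, stopping the loop as soon as I find an empty br.
Alternatively, I could take x itself instead of x.get_text(), split it on ><, take the first index, and then use https://w3lib.readthedocs.io/en/latest/w3lib.html?highlight=remove#w3lib.html.remove_tags
Happy to know if there is a direct solution with CSS or an easy solution.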
You can stop as soon as you find two consecutive <br> tags:
soup = BeautifulSoup(html, 'lxml')
tds = soup.select('#plaintiff-body tr:first-child > td:nth-child(4), #plaintiff-body tr:nth-child(2) > td:last-child')
output = []
for td in tds:
    entry = []
    last_el = None
    for el in td.descendants:
        if el.name == 'br':
            # Two <br> in a row mark the end of the first block.
            # The None check covers a cell that starts with <br>.
            if last_el is not None and last_el.name == 'br':
                break
        else:
            # Text nodes have name None; collect their stripped text.
            entry.append(el.get_text(strip=True))
        last_el = el
    output.append(' '.join(entry))
print(output)
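"Happy to know if there is a direct solution with CSS ..."
The main problem here is that the adjacent sibling combinator br+br in CSS ignores all non-element nodes between the elements, including comments, text, and whitespace. As far as CSS is concerned, every <br> in that cell follows another <br>, so a selector cannot tell you where two of them really occur back to back.
So your idea of a function that checks the tags would be my approach as well: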
from bs4 import BeautifulSoup
html = '''<tbody id="plaintiff-body">
<tr>
<td><img id="plaimg0001" src="/CaseInformationOnline/images/minus.png" onclick="showhide('pladetail0001','','plaimg0001')"></td>
<td>JENEE BENNETT</td>
<td></td>
<td>COURTNEY L HANNA</td>
</tr>
<tr id="pladetail0001" style="" valign="top">
<td></td>
<td>2348 WOODBROOK CIR N<br>UNIT D<br>COLUMBUS, OH 43223</td>
<td></td>
<td>JOSEPH & JOSEPH CO LPA <br>SUITE 200<br>155 W MAIN ST<br>COLUMBUS, OH 43215<br>(614) 449-8282<br><br>DEBORAH L MCNINCH<br>JOSEPH & JOSEPH CO LPA <br>THE WATERFORD, SUITE 200 <br>155 W MAIN ST<br>COLUMBUS, OH 43215<br>(614) 449-8282<br><br>S K DODDERER<br>155 W MAIN STREET<br>#200<br>COLUMBUS, OH 43215<br>(614) 449-8282</td>
</tr>
</tbody>'''
soup = BeautifulSoup(html, 'lxml')
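def check(x):
    s = []
    # Pair each node with its successor; the trailing None keeps the last
    # node from being dropped when the cell has no double <br>.
    for a, b in zip(x, x[1:] + [None]):
        # Two equal adjacent nodes are the consecutive <br> tags: stop there.
        if a == b:
            break
        # Text nodes have no tag name.
        if a.name is None:
            s.append(a.text.strip())
    return ' '.join(s)

att = [
    check(x.contents) if len(x.contents) > 1 else x.get_text(strip=True)
    for x in soup.select('#plaintiff-body tr:first-child > td:nth-child(4), #plaintiff-body tr:nth-child(2) > td:last-child')
]
print(att)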
Output:
['COURTNEY L HANNA', 'JOSEPH & JOSEPH CO LPA SUITE 200 155 W MAIN ST COLUMBUS, OH 43215 (614) 449-8282']
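To make the point about br+br concrete, here is a small sketch. The `sample` markup is made up for illustration, and `html.parser` is used instead of `lxml` only to avoid the extra dependency. It compares what the CSS selector matches with what a check on `previous_sibling` finds:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample: only the last two <br> tags are truly back to back.
sample = "<div>A<br>B<br>C<br><br>D</div>"
soup = BeautifulSoup(sample, "html.parser")

# The adjacent sibling combinator skips text nodes, so every <br> whose
# previous *element* sibling is a <br> matches, text in between or not.
css_matches = soup.select("br + br")

# previous_sibling walks the real node list, text nodes included.
consecutive = [
    br for br in soup.find_all("br")
    if isinstance(br.previous_sibling, Tag) and br.previous_sibling.name == "br"
]

print(len(css_matches), len(consecutive))  # 3 1
```

So the selector over-matches, while inspecting the actual sibling nodes finds the single real double break.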
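Another version:
import re

# Turn every <br> into a newline, so a double <br> becomes a blank line.
for br in soup.select("br"):
    br.replace_with("\n")

out = [
    # Keep only the text before the first blank line, then collapse whitespace.
    re.sub(r"\s{2,}|\n", " ", td.text.split("\n\n")[0])
    for td in soup.select("td:last-child")
]
print(out)
Prints: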
['COURTNEY L HANNA', 'JOSEPH & JOSEPH CO LPA SUITE 200 155 W MAIN ST COLUMBUS, OH 43215 (614) 449-8282']
With:
html = '''<tbody id="plaintiff-body">
<tr><td><img id="plaimg0001" src="/CaseInformationOnline/images/minus.png" onclick="showhide('pladetail0001','','plaimg0001')"></td><td>TIMOTHY MOORE</td><td></td><td>TIMOTHY MOORE</td></tr><tr id="pladetail0001" style="" valign="top"><td></td><td>62 KEENE DRIVE<br>WESTERVILLE, OH 43081</td><td></td><td>62 KEENE DRIVE<br>WESTERVILLE, OH 43081</td></tr>
</tbody>'''
Prints:
['TIMOTHY MOORE', '62 KEENE DRIVE WESTERVILLE, OH 43081']
With:
html = '''<tbody id="plaintiff-body">
<tr><td><img id="plaimg0001" src="/CaseInformationOnline/images/minus.png" onclick="showhide('pladetail0001','','plaimg0001')"></td><td>CENA PEDRO</td><td></td><td>ELIZABETH R WERNER</td></tr><tr id="pladetail0001" style="" valign="top"><td></td><td>33 W WEISHEIMER RD<br>COLUMBUS, OH 43215</td><td></td><td>THE NIGH LAW GROUP, LLC <br>300 S. 2ND STREET<br>COLUMBUS, OH 43215<br>(614) 379-6444<br><br>JOSEPH A NIGH<br>THE NIGH LAW GROUP, LLC <br>300 S. 2ND STREET<br>COLUMBUS, OH 43215<br>(614) 379-6444</td></tr>
</tbody>'''
Prints:
['ELIZABETH R WERNER', 'THE NIGH LAW GROUP, LLC 300 S. 2ND STREET COLUMBUS, OH 43215 (614) 379-6444']
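To try the same logic on several snippets like the ones above, it could be wrapped in a small helper. This is only a sketch: `first_blocks` and its `selector` parameter are names invented here, the `sample` markup is a shortened stand-in, and `html.parser` replaces `lxml` just to keep it dependency-free.

```python
import re
from bs4 import BeautifulSoup

def first_blocks(html, selector="td:last-child"):
    """Return the text before the first double <br> of each matched cell."""
    soup = BeautifulSoup(html, "html.parser")
    # A double <br> becomes a blank line once each <br> is a newline.
    for br in soup.select("br"):
        br.replace_with("\n")
    return [
        re.sub(r"\s{2,}|\n", " ", td.get_text().split("\n\n")[0]).strip()
        for td in soup.select(selector)
    ]

# Hypothetical, shortened markup in the same shape as the real table.
sample = (
    "<table><tr><td>x</td><td>NAME</td></tr>"
    "<tr><td></td><td>FIRM <br>1 MAIN ST<br>CITY<br><br>OTHER<br>2 MAIN ST</td></tr></table>"
)
result = first_blocks(sample)
print(result)
```

Because the soup is built inside the function, the caller's own parse tree is not modified by the <br> replacement.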