Extract text blocks separated by <br> inside a tag that has no title
I have a web page that contains a series of tags with a specific class. The tags I'm interested in look like this:
<span class="my-span-class">
"Text of interest before break"
<br>
"Text of interest after break"
</span>
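These elements have no title; they are just tags filled with text, and in each one the two pieces of text are separated by a single <br> tag. I'd like my final result to put the "Text of interest before break" and the "Text of interest after break" into separate lists, like this: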
my_list_1 [Text of interest before break #1, Text of interest before break #2, Text of interest before break #3, etc...]
my_list_2 [Text of interest after break #1, Text of interest after break #2, Text of interest after break #3, etc...]
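However, I'm struggling to get two separate lists out of the code below. It currently outputs the two strings joined together, like this: "Text of interest before breakText of interest after break"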
from bs4 import BeautifulSoup
import urllib.request

# "html.html" is a placeholder; urlopen needs a full URL (http://, file://, ...)
f = urllib.request.urlopen("html.html")
soup = BeautifulSoup(f, "html.parser")

# get the tag type that looks like the element shown above
myText = soup.find_all("span", class_="my-span-class")
results = []
for i in myText:
    results.append(i.text.strip())
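I'd like to initialize a separate list (i.e. results_2 = []) and store the "Text of interest after break" there, keeping the first results list only for the "Text of interest before break".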
You can use itertools.groupby to group the nodes before and after the <br>.
I've made it more robust by also handling non-text elements before and after the <br>.
from bs4 import BeautifulSoup
import itertools

soup = BeautifulSoup('''
<span class="my-span-class">
  before break 1
  <span>before break 1.1</span>
  <br>
  after break 1
</span>
<span class="my-span-class">
  before break 2
  <br>
  after break 2
  <span>after break 2.1</span>
</span>
''', 'html.parser')

befores, afters = [], []
for it in soup.select('.my-span-class'):
    # this will give you three groups
    groups = [list(g) for _, g in itertools.groupby(it.children, lambda c: c.name != 'br')]
    # we just need items before br and after br
    before, after = [g for g in groups if g[0].name != 'br']
    befores.extend(before)
    afters.extend(after)

print(befores)
print(afters)
Prints:
['\n before break 1\n ', <span>before break 1.1</span>, '\n', '\n before break 2\n ']
['\n after break 1\n', '\n after break 2\n ', <span>after break 2.1</span>, '\n']
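This should be enough to demonstrate how to partition the children under an element.
The only thing left to do is to iterate over befores and afters and clean up each item.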
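The clean-up step mentioned above could look roughly like the following. This is only a sketch: the helper names `node_text` and `clean` are made up for illustration, and it assumes each outer span contains exactly one direct `<br>` child and that you want one whitespace-normalized string per side rather than a flat list of nodes.

```python
from bs4 import BeautifulSoup, Tag
import itertools

html = """
<span class="my-span-class">
  before break 1
  <span>before break 1.1</span>
  <br>
  after break 1
</span>
<span class="my-span-class">
  before break 2
  <br>
  after break 2
  <span>after break 2.1</span>
</span>
"""

def node_text(node):
    # Tags expose get_text(); plain text nodes are converted with str()
    return node.get_text() if isinstance(node, Tag) else str(node)

def clean(nodes):
    # Join the text of every node in the group, then collapse runs of whitespace
    joined = " ".join(node_text(n) for n in nodes)
    return " ".join(joined.split())

soup = BeautifulSoup(html, "html.parser")
befores, afters = [], []
for it in soup.find_all("span", class_="my-span-class"):
    groups = [list(g) for _, g in itertools.groupby(it.children, lambda c: c.name != "br")]
    # assumes exactly one <br>, so exactly two non-br groups remain
    before, after = [g for g in groups if g[0].name != "br"]
    befores.append(clean(before))
    afters.append(clean(after))

print(befores)
print(afters)
```

Using append instead of extend keeps one entry per outer span, so the two lists stay aligned index by index.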
Based on your HTML, you can use contents to get the values from the tag.
contents[0] will return the first string.
contents[-1] will return the last string.
from bs4 import BeautifulSoup

html='''<span class="my-span-class">
Text of interest before break
<br>
Text of interest after break
</span>
<span class="my-span-class">
Text of interest before break 1
<br>
Text of interest after break 1
</span>
<span class="my-span-class">
Text of interest before break 2
<br>
Text of interest after break 2
</span>
'''

soup = BeautifulSoup(html, 'html.parser')
Beforelist=[]
Afterlist=[]
for item in soup.find_all("span", class_="my-span-class"):
    # first child is the text before <br>, last child is the text after it
    Beforelist.append(item.contents[0].strip())
    Afterlist.append(item.contents[-1].strip())

print(Beforelist)
print(Afterlist)
Output:
['Text of interest before break', 'Text of interest before break 1', 'Text of interest before break 2']
['Text of interest after break', 'Text of interest after break 1', 'Text of interest after break 2']
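Indexing contents works when the first and last children are plain strings. If a span may start or end with a nested tag, or may have no <br> at all, one possible variation is to locate the <br> and walk its siblings instead. The sketch below is an assumption-laden illustration, not part of the answer above: `split_on_br` and `_text` are hypothetical helper names, and the sample HTML is invented.

```python
from bs4 import BeautifulSoup, Tag

html = """
<span class="my-span-class">Text before <b>bold</b><br>Text after</span>
<span class="my-span-class">No break here</span>
"""

def _text(nodes):
    # Concatenate the text of tags and strings, then normalize whitespace
    parts = (n.get_text() if isinstance(n, Tag) else str(n) for n in nodes)
    return " ".join("".join(parts).split())

def split_on_br(tag):
    # Only look at direct children so a <br> nested deeper is ignored
    br = tag.find("br", recursive=False)
    if br is None:
        return _text(tag.children), ""
    # previous_siblings yields nodes nearest-first, so reverse to document order
    before = reversed(list(br.previous_siblings))
    return _text(before), _text(br.next_siblings)

soup = BeautifulSoup(html, "html.parser")
befores, afters = [], []
for item in soup.find_all("span", class_="my-span-class"):
    before, after = split_on_br(item)
    befores.append(before)
    afters.append(after)

print(befores)
print(afters)
```

A span without a <br> contributes an empty string to the second list here; whether that or skipping the span is preferable depends on the page.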
You can also combine .stripped_strings with zip(*iterable) to unpack them into separate sequences.
myTexts = (tag.stripped_strings for tag in soup.find_all("span", class_="my-span-class"))
before, after = zip(*myTexts)
>>> before
('Text of interest before break', 'Text of interest before break 1', 'Text of interest before break 2')
>>> after
('Text of interest after break', 'Text of interest after break 1', 'Text of interest after break 2')
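Note that unpacking zip(*...) into two names relies on every span yielding exactly two strings. If some spans might be missing the text after the <br>, a possible workaround is itertools.zip_longest with a fill value. The sketch below uses invented sample HTML and assumes no span yields more than two strings.

```python
from itertools import zip_longest
from bs4 import BeautifulSoup

html = """
<span class="my-span-class">first before<br>first after</span>
<span class="my-span-class">second before<br></span>
"""

soup = BeautifulSoup(html, "html.parser")
spans = soup.find_all("span", class_="my-span-class")

# One list of stripped strings per span, e.g. ['first before', 'first after']
rows = [list(tag.stripped_strings) for tag in spans]

# Transpose the rows; short rows are padded with "" instead of truncating
before, after = zip_longest(*rows, fillvalue="")

print(before)
print(after)
```

This pads from the right, so a span that lacks the text before the <br> would have its only string placed in before; the sibling-based or groupby-based approaches are safer for that case.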
You can try htql:
import htql
page="""
<span class="my-span-class">
Text of interest before break #1
<br>
Text of interest after break #1
</span>
<span class="my-span-class">
Text of interest before break #2
<br>
Text of interest after break #2
</span>
"""
results1 = htql.query(page, "<span (class='my-span-class')>.<br>1:px &trim ")
results2 = htql.query(page, "<span (class='my-span-class')>.<br>1:fx &trim ")
It produces:
>>> results1
[('Text of interest before break #1',), ('Text of interest before break #2',)]
>>> results2
[('Text of interest after break #1',), ('Text of interest after break #2',)]