Python re.findall pattern for a varying number of matches
<tr>
11:15
12:15
13:15
</tr>
<tr>
18:15
19:15
20:15
</tr>
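In this case the output should be: [ (11:15, 12:15, 13:15), (18:15, 19:15, 20:15) ]
My pattern: (\d\d:\d\d)[\s\S]*?(\d\d:\d\d)[\s\S]*?(\d\d:\d\d)[\s\S]*?</tr>
It only works when every tr tag contains exactly 3 hours,
but it should also work when a tr tag contains 1 to 3 hours (all in the same \d\d:\d\d format).
Here is another example, for which my pattern no longer works: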
<tr>12:00 13:00</tr>
<tr>14:00 15:00 16:00</tr>
<tr>12:00</tr>
Output should be: [ (12:00, 13:00, ), (14:00, 15:00, 16:00), (12:00, , ) ]
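One more thing: the hours are not separated only by whitespace. In the real file there is markup between them, so as the separator I used [\s\S]*? or [\w\s<>="-/:;?|]*?
An hour is either in a simple span or in a longer form.
Example: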
<tr>
<span class="na">16:00</span>
<span>|</span><a href="http:/21.28.147.68/msi/default.aspx?event_id=52514&typetran=1&ReturnLink=http://www.kino.pl/kina/przedwiosnie/repertuar.php" class="toolBox" data-hasqtip="true" aria-describedby="qtip-0">20:45</td>
</tr>
I would parse the HTML with an HTML parser, find all tr elements in the table, and split the contents of each row with str.split(), which handles both spaces and newlines. Example using the BeautifulSoup parser:
from bs4 import BeautifulSoup
data = """
<table>
<tr>
11:15
12:15
13:15
</tr>
<tr>
18:15
19:15
20:15
</tr>
<tr>12:00 13:00</tr>
<tr>14:00 15:00 16:00</tr>
<tr>12:00</tr>
</table>"""
soup = BeautifulSoup(data, "html.parser")
result = [row.text.split() for row in soup.table.find_all("tr")]
print(result)
Prints:
[['11:15', '12:15', '13:15'],
['18:15', '19:15', '20:15'],
['12:00', '13:00'],
['14:00', '15:00', '16:00'],
['12:00']]
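"An hour is either in simple span or in longer form."
In that case it is even better to find every text node inside each tr that matches the pattern and take its text:
import re
# string= is the current name of the argument that older BeautifulSoup versions called text=
result = [[elm.strip() for elm in row.find_all(string=re.compile(r"\d\d:\d\d"))]
          for row in soup.table.find_all("tr")]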
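If installing BeautifulSoup is not an option, the same idea (collect the text nodes of each tr and keep the ones that look like hours) can be sketched with the standard library's html.parser. This is only a minimal sketch: the class name RowTimes, the sample markup, and the example.invalid URL are made up for illustration, and the unbalanced closing tag from the question is assumed to be a proper </a>.

```python
import re
from html.parser import HTMLParser

TIME = re.compile(r"\d\d:\d\d")

class RowTimes(HTMLParser):
    """Collects the hh:mm strings found in the text of each <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of hours per <tr>
        self._current = None  # hours of the row being read, or None outside a row

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._current = []

    def handle_endtag(self, tag):
        if tag == "tr" and self._current is not None:
            self.rows.append(self._current)
            self._current = None

    def handle_data(self, data):
        # Only text nodes arrive here, so attribute values are never searched.
        if self._current is not None:
            self._current.extend(TIME.findall(data))

# Hypothetical sample modelled on the markup in the question.
html = """
<table>
<tr>
<span class="na">16:00</span>
<span>|</span><a href="http://example.invalid/show?id=1" class="toolBox">20:45</a>
</tr>
<tr>12:00 13:00</tr>
</table>
"""

parser = RowTimes()
parser.feed(html)
parser.close()
print(parser.rows)  # [['16:00', '20:45'], ['12:00', '13:00']]
```

Because only text nodes are inspected, it does not matter how many spans or links wrap each hour.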
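If you prefer regular expressions, you can use this. Each row is captured lazily first, then the hours are extracted from it:
import re
found = []
# lazy .*? with re.DOTALL stops at the first </tr>, so each match is a single row
for row in re.findall(r'<tr>(.*?)</tr>', data, re.DOTALL):
    found.append(re.findall(r'\d\d:\d\d', row))
# for the three-row example: found == [['12:00', '13:00'], ['14:00', '15:00', '16:00'], ['12:00']]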
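For the nested markup from the question, a regex-only version probably needs one extra step: removing the tags before searching, so that digits inside attributes cannot be mistaken for hours. The following is a rough sketch under that assumption. The sample markup and the example.invalid URL (which deliberately contains 10:30 in its query string) are invented for illustration, and the tag-stripping pattern assumes no attribute value contains a > character.

```python
import re

# Hypothetical sample modelled on the markup in the question.
html = """
<tr>
<span class="na">16:00</span>
<span>|</span><a href="http://example.invalid/?t=10:30" class="toolBox">20:45</a>
</tr>
<tr><span>12:00</span></tr>
"""

found = []
# Lazy match so that every <tr>...</tr> pair is captured separately.
for row in re.findall(r"<tr>(.*?)</tr>", html, re.DOTALL):
    # Replace each tag (including its attributes) with a space.
    text = re.sub(r"<[^>]*>", " ", row)
    found.append(re.findall(r"\b\d\d:\d\d\b", text))

print(found)  # [['16:00', '20:45'], ['12:00']]
```

The 10:30 inside the href is not reported, because the whole tag is gone before the hours are searched for.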
Try this regex-based solution:
import re
input = """
<tr>
11:15
12:15
13:15
</tr>
<tr>
18:15
19:15
20:15
</tr>
<tr>12:00 13:00</tr>
<tr>14:00 15:00 16:00</tr>
<tr>12:00</tr>
"""
# Python 3: print is a function; raw strings keep the backslashes in the patterns intact
print([re.findall(r'\d\d:\d\d', tr) for tr in re.findall(r'<tr>([^<]*)</tr>', input)])
Output:
[['11:15', '12:15', '13:15'],
['18:15', '19:15', '20:15'],
['12:00', '13:00'],
['14:00', '15:00', '16:00'],
['12:00']]
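The question asks for fixed-width tuples such as (12:00, , ), while the approaches above return lists of varying length. One possible way to bridge that gap is to pad each row with empty strings. This is only a sketch: the function name times_per_row and its width parameter are hypothetical, and rows with more than width hours are simply truncated.

```python
import re

def times_per_row(html, width=3):
    """Return one tuple of exactly `width` items per <tr>, padded with ''."""
    result = []
    for row in re.findall(r"<tr>(.*?)</tr>", html, re.DOTALL):
        times = re.findall(r"\d\d:\d\d", row)[:width]
        times += [""] * (width - len(times))  # pad short rows
        result.append(tuple(times))
    return result

sample = "<tr>12:00 13:00</tr>\n<tr>14:00 15:00 16:00</tr>\n<tr>12:00</tr>"
print(times_per_row(sample))
# [('12:00', '13:00', ''), ('14:00', '15:00', '16:00'), ('12:00', '', '')]
```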