使用正则表达式查找和插入标签
Find and insert tags using regex
我正在将一本书从 PDF 格式转换为 epub 格式。但是标题不在 header 标签内,因此尝试使用正则表达式替换它的 python 函数。
示例文本:
<p class="calibre1"><a id="p1"></a>Chapter 370: Slamming straight on</p>
<p class="softbreak"> </p>
<p class="calibre1">Hearing Yan Zhaoge’s suggestion, the Jade Sea City martial practitioners here were all stunned.</p>
<p class="calibre1"><a id="p7"></a>Chapter 372: Yan Zhaoge’s plan</p>
<p class="softbreak"> </p>
<p class="calibre1">Yan Zhaoge and Ah Hu sat on Pan-Pan’s back, black water swirling about Pan-Pan’s entire body, keeping away the seawater as he shot forward at lightning speed.</p>
我尝试使用正则表达式
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
pattern = r"</a>(?i)chapter [0-9]+: [\w\s]+(.*)<br>"
list = re.findall(pattern, match.group())
for x in list:
x = "</a>(?i)chapter [0-9]+: [\w\s]+(.?)<br>"
x = s.split("</a>", 1)[0] + '</a><h2>' + s.split("a>",1)[1]
x = s.split("<br>", 1)[0] + '</h2><br>' + s.split("<br>",1)[1]
return match.group()
和
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
pattern = r"</a>(?i)chapter [0-9]+: [\w\s]+(.*)<br>"
s.replace(re.match(pattern, s), r'<h2>[=12=]')
但仍然没有得到预期的结果。我想要的是...
输入
</a>Chapter 370: Slamming straight on</p>
输出
</a><h2>Chapter 370: Slamming straight on</h2></p>
将在所有类似实例中添加 h2 标签
Jean-François 的评论会更好,但如果我们必须这样做,我猜我们会从这个表达式开始:
(<\/a>)([^<]+)?(<\/p>)
(<\/a>)(chapter\s+[0-9]+[^<]+)?(<\/p>)
被替换为:
<h2></h2>
Demo 1
Demo 2
测试
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(<\/a>)(chapter\s+[0-9]+[^<]+)?(<\/p>)"
test_str = ("<p class=\"calibre1\"><a id=\"p1\"></a>Chapter 370: Slamming straight on</p>\n"
"<p class=\"softbreak\"> </p>\n"
"<p class=\"calibre1\">Hearing Yan Zhaoge’s suggestion, the Jade Sea City martial practitioners here were all stunned.</p>\n"
"<p class=\"calibre1\"><a id=\"p7\"></a>Chapter 372: Yan Zhaoge’s plan</p>\n"
"<p class=\"softbreak\"> </p>\n"
"<p class=\"calibre1\">Yan Zhaoge and Ah Hu sat on Pan-Pan’s back, black water swirling about Pan-Pan’s entire body, keeping away the seawater as he shot forward at lightning speed.</p>")
subst = "\1<h2>\2</h2>\3"
# You can manually specify the number of replacements by changing the 4th argument
result = re.sub(regex, subst, test_str, 0, re.MULTILINE | re.IGNORECASE)
if result:
print (result)
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
regex
不应该用来解析xml。看 :
Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms
(Why shouldn't you..
会是一个更好的标题)
但是,您可以改用 BeautifulSoup:
from bs4 import BeautifulSoup
data = """<p class="calibre1"><a id="p1"></a>Chapter 370: Slamming straight on</p>
<p class="softbreak"> </p>
<p class="calibre1">Hearing Yan Zhaoge’s suggestion, the Jade Sea City martial practitioners here were all stunned.</p>
<p class="calibre1"><a id="p7"></a>Chapter 372: Yan Zhaoge’s plan</p>
<p class="softbreak"> </p>
<p class="calibre1">Yan Zhaoge and Ah Hu sat on Pan-Pan’s back, black water swirling about Pan-Pan’s entire body, keeping away the seawater as he shot forward at lightning speed.</p>
i t"""
soup = BeautifulSoup(data, 'lxml')
for x in soup.find_all('p', {'class':'calibre1'}):
link = x.find('a')
title = x.text
corrected_title = soup.new_tag('h2')
corrected_title.append(title)
if link:
x.string=''
corrected_title = soup.new_tag('h2')
corrected_title.append(title)
link.append(corrected_title)
x.append(link)
print(soup.body)
输出
<body>
<p class="calibre1">
<a id="p1">
<h2>Chapter 370: Slamming straight on</h2>
</a>
</p>
<p class="softbreak"> </p>
<p class="calibre1">Hearing Yan Zhaoge’s suggestion, the Jade Sea City martial practitioners here were all stunned.</p>
<p class="calibre1">
<a id="p7">
<h2>Chapter 372: Yan Zhaoge’s plan</h2>
</a>
</p>
<p class="softbreak"> </p>
<p class="calibre1">Yan Zhaoge and Ah Hu sat on Pan-Pan’s back, black water swirling about Pan-Pan’s entire body, keeping away the seawater as he shot forward at lightning speed.</p>
i t
</body>
我正在将一本书从 PDF 格式转换为 epub 格式。但是标题不在 header 标签内,因此尝试使用正则表达式替换它的 python 函数。
示例文本:
<p class="calibre1"><a id="p1"></a>Chapter 370: Slamming straight on</p>
<p class="softbreak"> </p>
<p class="calibre1">Hearing Yan Zhaoge’s suggestion, the Jade Sea City martial practitioners here were all stunned.</p>
<p class="calibre1"><a id="p7"></a>Chapter 372: Yan Zhaoge’s plan</p>
<p class="softbreak"> </p>
<p class="calibre1">Yan Zhaoge and Ah Hu sat on Pan-Pan’s back, black water swirling about Pan-Pan’s entire body, keeping away the seawater as he shot forward at lightning speed.</p>
我尝试使用正则表达式
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
pattern = r"</a>(?i)chapter [0-9]+: [\w\s]+(.*)<br>"
list = re.findall(pattern, match.group())
for x in list:
x = "</a>(?i)chapter [0-9]+: [\w\s]+(.?)<br>"
x = s.split("</a>", 1)[0] + '</a><h2>' + s.split("a>",1)[1]
x = s.split("<br>", 1)[0] + '</h2><br>' + s.split("<br>",1)[1]
return match.group()
和
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
pattern = r"</a>(?i)chapter [0-9]+: [\w\s]+(.*)<br>"
s.replace(re.match(pattern, s), r'<h2>[=12=]')
但仍然没有得到预期的结果。我想要的是...
输入
</a>Chapter 370: Slamming straight on</p>
输出
</a><h2>Chapter 370: Slamming straight on</h2></p>
将在所有类似实例中添加 h2 标签
Jean-François 的评论会更好,但如果我们必须这样做,我猜我们会从这个表达式开始:
(<\/a>)([^<]+)?(<\/p>)
(<\/a>)(chapter\s+[0-9]+[^<]+)?(<\/p>)
被替换为:
<h2></h2>
Demo 1
Demo 2
测试
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(<\/a>)(chapter\s+[0-9]+[^<]+)?(<\/p>)"
test_str = ("<p class=\"calibre1\"><a id=\"p1\"></a>Chapter 370: Slamming straight on</p>\n"
"<p class=\"softbreak\"> </p>\n"
"<p class=\"calibre1\">Hearing Yan Zhaoge’s suggestion, the Jade Sea City martial practitioners here were all stunned.</p>\n"
"<p class=\"calibre1\"><a id=\"p7\"></a>Chapter 372: Yan Zhaoge’s plan</p>\n"
"<p class=\"softbreak\"> </p>\n"
"<p class=\"calibre1\">Yan Zhaoge and Ah Hu sat on Pan-Pan’s back, black water swirling about Pan-Pan’s entire body, keeping away the seawater as he shot forward at lightning speed.</p>")
subst = "\1<h2>\2</h2>\3"
# You can manually specify the number of replacements by changing the 4th argument
result = re.sub(regex, subst, test_str, 0, re.MULTILINE | re.IGNORECASE)
if result:
print (result)
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
regex
不应该用来解析xml。看 :
Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms
(Why shouldn't you..
会是一个更好的标题)
但是,您可以改用 BeautifulSoup:
from bs4 import BeautifulSoup
data = """<p class="calibre1"><a id="p1"></a>Chapter 370: Slamming straight on</p>
<p class="softbreak"> </p>
<p class="calibre1">Hearing Yan Zhaoge’s suggestion, the Jade Sea City martial practitioners here were all stunned.</p>
<p class="calibre1"><a id="p7"></a>Chapter 372: Yan Zhaoge’s plan</p>
<p class="softbreak"> </p>
<p class="calibre1">Yan Zhaoge and Ah Hu sat on Pan-Pan’s back, black water swirling about Pan-Pan’s entire body, keeping away the seawater as he shot forward at lightning speed.</p>
i t"""
soup = BeautifulSoup(data, 'lxml')
for x in soup.find_all('p', {'class':'calibre1'}):
link = x.find('a')
title = x.text
corrected_title = soup.new_tag('h2')
corrected_title.append(title)
if link:
x.string=''
corrected_title = soup.new_tag('h2')
corrected_title.append(title)
link.append(corrected_title)
x.append(link)
print(soup.body)
输出
<body>
<p class="calibre1">
<a id="p1">
<h2>Chapter 370: Slamming straight on</h2>
</a>
</p>
<p class="softbreak"> </p>
<p class="calibre1">Hearing Yan Zhaoge’s suggestion, the Jade Sea City martial practitioners here were all stunned.</p>
<p class="calibre1">
<a id="p7">
<h2>Chapter 372: Yan Zhaoge’s plan</h2>
</a>
</p>
<p class="softbreak"> </p>
<p class="calibre1">Yan Zhaoge and Ah Hu sat on Pan-Pan’s back, black water swirling about Pan-Pan’s entire body, keeping away the seawater as he shot forward at lightning speed.</p>
i t
</body>