How to prettify HTML so tag attributes will remain in one single line?
I have this piece of code:
text = """<html><head></head><body>
<h1 style="
text-align: center;
">Main site</h1>
<div>
<p style="
color: blue;
text-align: center;
">text1
</p>
<p style="
color: blueviolet;
text-align: center;
">text2
</p>
</div>
<div>
<p style="text-align:center">
<img src="./foo/test.jpg" alt="Testing static images" style="
">
</p>
</div>
</body></html>
"""
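import re

import bs4


def prettify(soup, indent_width=4):
    # prettify() indents with one space per level; repeating each line's
    # leading whitespace indent_width times widens the indentation
    r = re.compile(r'^(\s*)', re.MULTILINE)
    return r.sub(r'\1' * indent_width, soup.prettify())


soup = bs4.BeautifulSoup(text, "html.parser")
print(prettify(soup))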
The current output of the snippet above is:
<html>
    <head>
    </head>
    <body>
        <h1 style="
text-align: center;
">
            Main site
        </h1>
        <div>
            <p style="
color: blue;
text-align: center;
">
                text1
            </p>
            <p style="
color: blueviolet;
text-align: center;
">
                text2
            </p>
        </div>
        <div>
            <p style="text-align:center">
                <img alt="Testing static images" src="./foo/test.jpg" style="
"/>
            </p>
        </div>
    </body>
</html>
I'd like to figure out how to format the output so it becomes this instead:
<html>
    <head>
    </head>
    <body>
        <h1 style="text-align: center;">
            Main site
        </h1>
        <div>
            <p style="color: blue;text-align: center;">
                text1
            </p>
            <p style="color: blueviolet;text-align: center;">
                text2
            </p>
        </div>
        <div>
            <p style="text-align:center">
                <img alt="Testing static images" src="./foo/test.jpg" style=""/>
            </p>
        </div>
    </body>
</html>
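In other words, I'd like to keep HTML statements such as <tag attrib1=value1 attrib2=value2 ... attribn=valuen> on a single line if possible. By "if possible" I mean without screwing up the values of the attributes themselves (value1, value2, ..., valuen).
Is this possible to achieve with beautifulsoup4? As far as I've read in the docs, it seems you can use a custom formatter, but I don't know how to write one that meets the described requirements.
EDIT:
@alecxe's solution is quite simple; unfortunately it fails in some more complex cases such as the one below: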
test1 = """
<div id="dialer-capmaign-console" class="fill-vertically" style="flex: 1 1 auto;">
<div id="sessionsGrid" data-columns="[
{ field: 'dialerSession.startTime', format:'{0:G}', title:'Start time', width:122 },
{ field: 'dialerSession.endTime', format:'{0:G}', title:'End time', width:122, attributes: {class:'tooltip-column'}},
{ field: 'conversationStartTime', template: cty.ui.gct.duration_dialerSession_conversationStartTime_endTime, title:'Duration', width:80},
{ field: 'dialerSession.caller.lastName',template: cty.ui.gct.person_dialerSession_caller_link, title:'Caller', width:160 },
{ field: 'noteType',template:cty.ui.gct.nameDescription_noteType, title:'Note type', width:150, attributes: {class:'tooltip-column'}},
{ field: 'note', title:'Note'}
]">
</div>
</div>
"""
from bs4 import BeautifulSoup
import re


def prettify(soup, indent_width=4, single_lines=True):
    if single_lines:
        for tag in soup():
            for attr in tag.attrs:
                print(tag.attrs[attr], tag.attrs[attr].__class__)
                # Fails when the value is a list (e.g. class), see the traceback
                tag.attrs[attr] = " ".join(
                    tag.attrs[attr].replace("\n", " ").split())
    # Repeat each line's leading whitespace indent_width times
    r = re.compile(r'^(\s*)', re.MULTILINE)
    return r.sub(r'\1' * indent_width, soup.prettify())


def html_beautify(text):
    soup = BeautifulSoup(text, "html.parser")
    return prettify(soup)


print(html_beautify(test1))
Traceback:
dialer-capmaign-console <class 'str'>
['fill-vertically'] <class 'list'>
Traceback (most recent call last):
  File "d:\mcve\x.py", line 35, in <module>
    print(html_beautify(test1))
  File "d:\mcve\x.py", line 33, in html_beautify
    return prettify(soup)
  File "d:\mcve\x.py", line 25, in prettify
    tag.attrs[attr].replace("\n", " ").split())
AttributeError: 'list' object has no attribute 'replace'
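BeautifulSoup tries to preserve the newlines and multiple spaces found in the attribute values of the input HTML.
One workaround is to iterate over the element attributes and clean them up before prettifying: remove the newlines and replace runs of consecutive spaces with a single space: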
for tag in soup():
    for attr in tag.attrs:
        tag.attrs[attr] = " ".join(tag.attrs[attr].replace("\n", " ").split())

print(soup.prettify())
Prints:
<html>
 <head>
 </head>
 <body>
  <h1 style="text-align: center;">
   Main site
  </h1>
  <div>
   <p style="color: blue; text-align: center;">
    text1
   </p>
   <p style="color: blueviolet; text-align: center;">
    text2
   </p>
  </div>
  <div>
   <p style="text-align:center">
    <img alt="Testing static images" src="./foo/test.jpg" style=""/>
   </p>
  </div>
 </body>
</html>
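The question asks about a custom formatter. As a rough sketch of that route (not something the original answer proposes), the same whitespace cleanup can probably be done at output time by subclassing `HTMLFormatter` and overriding its `attributes()` hook, which leaves the parsed tree untouched. The names `normalize_ws` and `SingleLineAttributes` are made up for illustration, and this assumes a reasonably recent Beautiful Soup 4 release that ships `bs4.formatter`:

```python
from bs4 import BeautifulSoup
from bs4.dammit import EntitySubstitution
from bs4.formatter import HTMLFormatter


def normalize_ws(value):
    """Collapse newlines and runs of whitespace into single spaces."""
    if isinstance(value, (list, tuple)):
        # Multi-valued attributes such as class are stored as lists
        return [" ".join(item.split()) for item in value]
    if isinstance(value, str):
        return " ".join(value.split())
    return value  # None (valueless attribute) and other types pass through


class SingleLineAttributes(HTMLFormatter):
    """Hypothetical formatter that cleans attribute values when rendering."""

    def attributes(self, tag):
        # The base class yields (name, value) pairs sorted by name
        for name, value in super().attributes(tag):
            yield name, normalize_ws(value)


# Pass the "minimal"-style entity substitution explicitly so that
# characters such as & and < are still escaped in the output.
formatter = SingleLineAttributes(
    entity_substitution=EntitySubstitution.substitute_xml
)

html = '<p class="a" style="\ncolor: blue;\ntext-align: center;\n">text1</p>'
soup = BeautifulSoup(html, "html.parser")
print(soup.prettify(formatter=formatter))
```

Because only the rendering changes, `soup.p["style"]` still holds the original multi-line value afterwards, which may or may not be what you want.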
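UPDATE (to address multi-valued attributes such as class):
You only need a slight modification that adds special handling for the case where an attribute value is of type list: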
for tag in soup():
    tag.attrs = {
        attr: [" ".join(attr_value.replace("\n", " ").split()) for attr_value in value]
        if isinstance(value, list)
        else " ".join(value.replace("\n", " ").split())
        for attr, value in tag.attrs.items()
    }
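Putting the pieces together, here is a minimal sketch of a single helper that cleans both string and list attribute values and then widens the indentation the way the question's `prettify()` does. The function names (`collapse_ws`, `reindent`, `prettify_single_line`) are illustrative only, and the indentation regex matches spaces rather than `\s` so that it cannot swallow newlines:

```python
import re


def collapse_ws(value):
    """Collapse newlines and runs of whitespace into single spaces."""
    if isinstance(value, list):
        # Multi-valued attributes (e.g. class) are parsed into lists
        return [" ".join(item.split()) for item in value]
    if isinstance(value, str):
        return " ".join(value.split())
    return value


def reindent(pretty_html, indent_width=4):
    """Repeat each line's leading spaces indent_width times."""
    return re.sub(
        r"^( *)",
        lambda match: match.group(1) * indent_width,
        pretty_html,
        flags=re.MULTILINE,
    )


def prettify_single_line(html, indent_width=4):
    """Parse html, clean every attribute value, return indented markup."""
    # Imported here so the two helpers above stay usable without bs4
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(True):
        tag.attrs = {
            name: collapse_ws(value) for name, value in tag.attrs.items()
        }
    return reindent(soup.prettify(), indent_width)
```

Calling `print(prettify_single_line(test1))` with the `test1` string from the question should no longer raise the `AttributeError`, since list values take the first branch of `collapse_ws`.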
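While BeautifulSoup is more commonly used, HTML Tidy may be a better choice if you're dealing with quirks and have more specific requirements.
After installing the library for Python (pip install pytidylib), try the following code: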
from tidylib import Tidy

tidy = Tidy()

# assign the HTML string to text
config = {
    "doctype": "omit",
    # "show-body-only": True
}

# tidy_document returns (document, errors); print only the document
print(tidy.tidy_document(text, options=config)[0])
tidy.tidy_document returns a tuple containing the HTML and any errors that may have occurred. This code will output
<html>
<head>
<title></title>
</head>
<body>
<h1 style="text-align: center;">
Main site
</h1>
<div>
<p style="color: blue; text-align: center;">
text1
</p>
<p style="color: blueviolet; text-align: center;">
text2
</p>
</div>
<div>
<p style="text-align:center">
<img src="./foo/test.jpg" alt="Testing static images" style="">
</p>
</div>
</body>
</html>
Uncommenting "show-body-only": True for the second example gives:
<div id="dialer-capmaign-console" class="fill-vertically" style="flex: 1 1 auto;">
<div id="sessionsGrid" data-columns="[ { field: 'dialerSession.startTime', format:'{0:G}', title:'Start time', width:122 }, { field: 'dialerSession.endTime', format:'{0:G}', title:'End time', width:122, attributes: {class:'tooltip-column'}}, { field: 'conversationStartTime', template: cty.ui.gct.duration_dialerSession_conversationStartTime_endTime, title:'Duration', width:80}, { field: 'dialerSession.caller.lastName',template: cty.ui.gct.person_dialerSession_caller_link, title:'Caller', width:160 }, { field: 'noteType',template:cty.ui.gct.nameDescription_noteType, title:'Note type', width:150, attributes: {class:'tooltip-column'}}, { field: 'note', title:'Note'} ]"></div>
</div>
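For more options and customization, see more configuration. There are some wrapping options specific to attributes that may help. As you can see, empty elements take up only one line, and html-tidy automatically tries to add things like the DOCTYPE, head, and title tags.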
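As a hedged illustration of the wrapping options mentioned above, the sketch below collects the HTML Tidy settings that seem most relevant to keeping attributes on one line. The option names (`wrap`, `wrap-attributes`, `literal-attributes`) come from the HTML Tidy option list; the dictionary name and the `tidy_single_line` helper are hypothetical, and the exact output depends on the libtidy version installed:

```python
# Hypothetical option set for pytidylib; keys are HTML Tidy option names.
single_line_options = {
    "doctype": "omit",        # same as in the snippet above
    "wrap": 0,                # 0 disables line wrapping altogether
    "wrap-attributes": 0,     # do not line-wrap attribute values
    "literal-attributes": 0,  # let Tidy normalize whitespace in values
}


def tidy_single_line(html, options=None):
    """Run HTML Tidy on html and return only the cleaned document."""
    # Imported here because pytidylib also needs the native libtidy
    from tidylib import Tidy

    document, _errors = Tidy().tidy_document(
        html, options=options or single_line_options
    )
    return document
```

With `literal-attributes` switched off (the Tidy default), newlines and repeated spaces inside attribute values are replaced by single spaces, which matches the behavior shown in the outputs above.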