How to output XML from BeautifulSoup without extraneous newlines?

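I am using Python and BeautifulSoup to parse and access elements of an XML document. I modify the values of a few elements and then write the XML back to the file. The problem is that the updated XML file contains newlines at the start and end of each XML element's text value, so the file looks like this: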

<annotation>
 <folder>
  Definitiva
 </folder>
 <filename>
  armas_229.jpg
 </filename>
 <path>
  /tmp/tmpygedczp5/handgun/images/armas_229.jpg
 </path>
 <size>
  <width>
   1800
  </width>
  <height>
   1426
  </height>
  <depth>
   3
  </depth>
 </size>
 <segmented>
  0
 </segmented>
 <object>
  <name>
   handgun
  </name>
  <pose>
   Unspecified
  </pose>
  <truncated>
   0
  </truncated>
  <difficult>
   0
  </difficult>
  <bndbox>
   <xmin>
    1001
   </xmin>
   <ymin>
    549
   </ymin>
   <xmax>
    1453
   </xmax>
   <ymax>
    1147
   </ymax>
  </bndbox>
 </object>
</annotation>

Instead, I would like the output file to look like this:

<annotation>
 <folder>Definitiva</folder>
 <filename>armas_229.jpg</filename>
 <path>/tmp/tmpygedczp5/handgun/images/armas_229.jpg</path>
 <size>
  <width>1800</width>
  <height>1426</height>
  <depth>3</depth>
 </size>
 <segmented>0</segmented>
 <object>
  <name>handgun</name>
  <pose>Unspecified</pose>
  <truncated>0</truncated>
  <difficult>0</difficult>
  <bndbox>
   <xmin>1001</xmin>
   <ymin>549</ymin>
   <xmax>1453</xmax>
   <ymax>1147</ymax>
  </bndbox>
 </object>
</annotation>

I open the file and get the "soup" like this:

    with open(pascal_xml_file_path) as pascal_file:
        pascal_contents = pascal_file.read()
    soup = BeautifulSoup(pascal_contents, "xml")

After I have finished modifying a few of the document's values, I write the document back to the file using BeautifulSoup.prettify, like so:

    with open(pascal_xml_file_path, "w") as pascal_file:
        pascal_file.write(soup.prettify())

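My assumption is that BeautifulSoup.prettify adds these superfluous/gratuitous newlines by default, and there doesn't appear to be a good way to modify this behavior. Have I missed something in the BeautifulSoup documentation, or am I truly unable to modify this behavior and need to use another approach to write the XML to a file? Maybe I'm better off rewriting this with xml.etree.ElementTree instead?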

Consider XSLT with Python's third-party module, lxml (which you possibly already have through its BeautifulSoup integration). Specifically, use the identity transform to copy the XML as is, then add a template that runs normalize-space() on all text nodes.

XSLT (save as an .xsl file, which is a special kind of .xml file, or embed it as a string)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output indent="yes"/>
    <xsl:strip-space elements="*"/>

    <!-- IDENTITY TRANSFORM -->
    <xsl:template match="@*|node()">
      <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
    </xsl:template>

    <!-- RUN normalize-space() ON ALL TEXT NODES -->
    <xsl:template match="text()">
        <xsl:copy-of select="normalize-space()"/>
    </xsl:template>            
</xsl:stylesheet>

Python

import lxml.etree as et

# LOAD FROM STRING OR PARSE FROM FILE
str_xml = '''...'''    
str_xsl = '''...'''

doc = et.fromstring(str_xml)
style = et.fromstring(str_xsl)

# INITIALIZE TRANSFORMER AND RUN 
transformer = et.XSLT(style)
result = transformer(doc)

# PRINT TO SCREEN
print(result)

# SAVE TO DISK
with open('Output.xml', 'wb') as f:
    f.write(result)
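
For intuition about what the text() template does: normalize-space() trims leading and trailing whitespace and also collapses every internal run of whitespace into a single space, so it changes more than just the newlines around a value. Below is a rough pure-Python approximation, not part of lxml. The helper name normalize_space is made up for this illustration, and it treats only space, tab, CR and LF as whitespace, as XPath 1.0 does.

```python
import re

# XML/XPath 1.0 whitespace: space, tab, carriage return, line feed
_XML_WS = re.compile(r"[ \t\r\n]+")


def normalize_space(text: str) -> str:
    """Approximate XPath normalize-space() for a single string."""
    # Collapse each whitespace run to one space, then trim both ends
    return _XML_WS.sub(" ", text).strip(" ")


print(repr(normalize_space("\n   handgun\n  ")))
print(repr(normalize_space("  two \t\n words ")))
```

If internal whitespace in your text values matters, trimming only the ends (str.strip in Python) may be the safer choice.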

My assumption is that the BeautifulSoup.prettify is adding these superfluous/gratuitous newlines by default, and there doesn't appear to be a good way to modify this behavior.

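It does so in two methods of the bs4.Tag class: decode and decode_contents.

Reference: Source file on GitHub

If you just need a temporary fix, you can monkey patch these two methods.

Here is my implementation: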

from bs4 import Tag, NavigableString, BeautifulSoup
from bs4.element import AttributeValueWithCharsetSubstitution, EntitySubstitution


def decode(
    self, indent_level=None,
    eventual_encoding="utf-8", formatter="minimal"
):
    if not callable(formatter):
        formatter = self._formatter_for_name(formatter)

    attrs = []
    if self.attrs:
        for key, val in sorted(self.attrs.items()):
            if val is None:
                decoded = key
            else:
                if isinstance(val, list) or isinstance(val, tuple):
                    val = ' '.join(val)
                elif not isinstance(val, str):
                    val = str(val)
                elif (
                    isinstance(val, AttributeValueWithCharsetSubstitution)
                    and eventual_encoding is not None
                ):
                    val = val.encode(eventual_encoding)

                text = self.format_string(val, formatter)
                decoded = (
                    str(key) + '='
                    + EntitySubstitution.quoted_attribute_value(text))
            attrs.append(decoded)
    close = ''
    closeTag = ''
    prefix = ''
    if self.prefix:
        prefix = self.prefix + ":"

    if self.is_empty_element:
        close = '/'
    else:
        closeTag = '</%s%s>' % (prefix, self.name)

    pretty_print = self._should_pretty_print(indent_level)
    space = ''
    indent_space = ''
    if indent_level is not None:
        indent_space = (' ' * (indent_level - 1))
    if pretty_print:
        space = indent_space
        indent_contents = indent_level + 1
    else:
        indent_contents = None
    contents = self.decode_contents(
        indent_contents, eventual_encoding, formatter)

    if self.hidden:
        # This is the 'document root' object.
        s = contents
    else:
        s = []
        attribute_string = ''
        if attrs:
            attribute_string = ' ' + ' '.join(attrs)
        if indent_level is not None:
            # Even if this particular tag is not pretty-printed,
            # we should indent up to the start of the tag.
            s.append(indent_space)
        s.append('<%s%s%s%s>' % (
                prefix, self.name, attribute_string, close))
        has_tag_child = False
        if pretty_print:
            for item in self.children:
                if isinstance(item, Tag):
                    has_tag_child = True
                    break
            if has_tag_child:
                s.append("\n")
        s.append(contents)
        if not has_tag_child:
            s[-1] = s[-1].strip()
        if pretty_print and contents and contents[-1] != "\n":
            s.append("")
        if pretty_print and closeTag:
            if has_tag_child:
                s.append(space)
        s.append(closeTag)
        if indent_level is not None and closeTag and self.next_sibling:
            # Even if this particular tag is not pretty-printed,
            # we're now done with the tag, and we should add a
            # newline if appropriate.
            s.append("\n")
        s = ''.join(s)
    return s


def decode_contents(
    self,
    indent_level=None,
    eventual_encoding="utf-8",
    formatter="minimal"
):
    # First off, turn a string formatter into a function. This
    # will stop the lookup from happening over and over again.
    if not callable(formatter):
        formatter = self._formatter_for_name(formatter)

    pretty_print = (indent_level is not None)
    s = []
    for c in self:
        text = None
        if isinstance(c, NavigableString):
            text = c.output_ready(formatter)
        elif isinstance(c, Tag):
            s.append(
                c.decode(indent_level, eventual_encoding, formatter)
            )
        if text and indent_level and not self.name == 'pre':
            text = text.strip()
        if text:
            if pretty_print and not self.name == 'pre':
                s.append(" " * (indent_level - 1))
            s.append(text)
            if pretty_print and not self.name == 'pre':
                s.append("")
    return ''.join(s)


Tag.decode = decode
Tag.decode_contents = decode_contents

After that, when I run print(soup.prettify()), the output is:

<annotation>
 <folder>Definitiva</folder>
 <filename>armas_229.jpg</filename>
 <path>/tmp/tmpygedczp5/handgun/images/armas_229.jpg</path>
 <size>
  <width>1800</width>
  <height>1426</height>
  <depth>3</depth>
 </size>
 <segmented>0</segmented>
 <object>
  <name>handgun</name>
  <pose>Unspecified</pose>
  <truncated>0</truncated>
  <difficult>0</difficult>
  <bndbox>
   <xmin>1001</xmin>
   <ymin>549</ymin>
   <xmax>1453</xmax>
   <ymax>1147</ymax>
  </bndbox>
 </object>
</annotation>

I made a lot of assumptions while doing this. I just wanted to show that it is possible.
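
If patching bs4 feels too fragile (the two functions above rely on private methods such as _formatter_for_name and _should_pretty_print), a lighter option is to leave bs4 untouched and post-process the string that prettify() returns. The following is only a sketch under a few assumptions: the text of a leaf element sits on a single line, and no element needs its surrounding whitespace preserved. collapse_leaf_text is a hypothetical helper name, not a bs4 API.

```python
import re

# ">" + newline/indent + one line of text + newline/indent + "</"
_LEAF = re.compile(r">\s*\n\s*([^<>\n]+?)\s*\n\s*</")


def collapse_leaf_text(pretty_xml: str) -> str:
    """Pull text-only element content back onto the same line as its tags."""
    return _LEAF.sub(r">\1</", pretty_xml)


pretty = "<size>\n <width>\n  1800\n </width>\n <height>\n  1426\n </height>\n</size>"
print(collapse_leaf_text(pretty))
```

With the code from the question, this would be used as pascal_file.write(collapse_leaf_text(soup.prettify())).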

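It turns out that if I use xml.etree.ElementTree instead of BeautifulSoup, I get the indentation I want directly. For example, the code below reads an XML file, strips any newlines/whitespace from the text of each element, and then writes the tree to an XML file.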

import argparse
from xml.etree import ElementTree


# ------------------------------------------------------------------------------
def reformat(
        input_xml: str,
        output_xml: str,
):
    tree = ElementTree.parse(input_xml)

    # remove extraneous newlines and whitespace from text elements
    for element in tree.iter():
        if element.text:
            element.text = element.text.strip()

    # write the updated XML into the annotations output directory
    tree.write(output_xml)


# ------------------------------------------------------------------------------
if __name__ == "__main__":

    # parse the command line arguments
    args_parser = argparse.ArgumentParser()
    args_parser.add_argument(
        "--in",
        required=True,
        type=str,
        help="file path of original XML",
    )
    args_parser.add_argument(
        "--out",
        required=True,
        type=str,
        help="file path of reformatted XML",
    )
    args = vars(args_parser.parse_args())

    reformat(
        args["in"],
        args["out"],
    )
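
One caveat, as far as I can tell: for an element that has children, element.text holds the whitespace between its opening tag and its first child, so stripping it removes that line break as well. A minimal sketch of one way around this, assuming Python 3.9 or later (where xml.etree.ElementTree.indent exists): strip both text and tail, then let indent() rebuild consistent indentation. The function name reformat_string and the one-space indent are choices made for this illustration.

```python
import xml.etree.ElementTree as ET


def reformat_string(xml_text: str, space: str = " ") -> str:
    """Strip whitespace around all text, then re-indent the whole tree."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if element.text:
            element.text = element.text.strip()
        if element.tail:
            element.tail = element.tail.strip()
    # Requires Python 3.9+; inserts newlines and indentation between elements
    ET.indent(root, space=space)
    return ET.tostring(root, encoding="unicode")


sample = "<size>\n  <width>\n   1800\n  </width>\n  <height>\n   1426\n  </height>\n</size>"
print(reformat_string(sample))
```

The same two steps should carry over to the file-based version above by calling ElementTree.indent(tree) just before tree.write(output_xml).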

I wrote some code that does the prettifying without any additional libraries.

Prettify logic

# Recursive helper (do not call this function directly)
def _get_prettified(tag, curr_indent, indent):
    out = ''
    for x in tag.find_all(recursive=False):
        if len(x.find_all()) == 0:
            # Leaf element: keep its text on one line (x.string is None for an empty tag)
            content = (x.string or '').strip(' \n')
        else:
            # Element with child tags: recurse one indentation level deeper
            content = '\n' + _get_prettified(x, curr_indent + ' ' * indent, indent) + curr_indent

        attrs = ' '.join([f'{k}="{v}"' for k, v in x.attrs.items()])
        out += curr_indent + ('<%s %s>' % (x.name, attrs) if len(attrs) > 0 else '<%s>' % x.name) + content + '</%s>\n' % x.name

    return out

# Call this function
def get_prettified(tag, indent):
    return _get_prettified(tag, '', indent)

Your input

source = """<annotation>
 <folder>
  Definitiva
 </folder>
 <filename>
  armas_229.jpg
 </filename>
 <path>
  /tmp/tmpygedczp5/handgun/images/armas_229.jpg
 </path>
 <size>
  <width>
   1800
  </width>
  <height>
   1426
  </height>
  <depth>
   3
  </depth>
 </size>
 <segmented>
  0
 </segmented>
 <object>
  <name>
   handgun
  </name>
  <pose>
   Unspecified
  </pose>
  <truncated>
   0
  </truncated>
  <difficult>
   0
  </difficult>
  <bndbox>
   <xmin>
    1001
   </xmin>
   <ymin>
    549
   </ymin>
   <xmax>
    1453
   </xmax>
   <ymax>
    1147
   </ymax>
  </bndbox>
 </object>
</annotation>"""

Output

from bs4 import BeautifulSoup

bs = BeautifulSoup(source, 'html.parser')
output = get_prettified(bs, indent=2)
print(output)

# Prints the following
<annotation>
  <folder>Definitiva</folder>
  <filename>armas_229.jpg</filename>
  <path>/tmp/tmpygedczp5/handgun/images/armas_229.jpg</path>
  <size>
    <width>1800</width>
    <height>1426</height>
    <depth>3</depth>
  </size>
  <segmented>0</segmented>
  <object>
    <name>handgun</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>1001</xmin>
      <ymin>549</ymin>
      <xmax>1453</xmax>
      <ymax>1147</ymax>
    </bndbox>
  </object>
</annotation>

Run the code: https://replit.com/@bikcrum/BeautifulSoup-Prettifier
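
One thing to watch with a hand-written serializer like this: BeautifulSoup returns text and attribute values already decoded, so characters such as & and < are written back unescaped, which can produce invalid XML. A minimal sketch of two helpers that could replace the inline string formatting, using only the standard library's xml.sax.saxutils. The names render_text and render_open_tag are hypothetical and not part of the code above.

```python
from xml.sax.saxutils import escape, quoteattr


def render_text(text):
    """Trim a leaf element's text and escape &, < and >."""
    return escape((text or "").strip())


def render_open_tag(name, attrs):
    """Build an opening tag with safely quoted attribute values."""
    parts = [name]
    for key, value in attrs.items():
        # Some parsers return multi-valued attributes (e.g. class) as lists
        if isinstance(value, (list, tuple)):
            value = " ".join(value)
        parts.append(f"{key}={quoteattr(str(value))}")
    return "<" + " ".join(parts) + ">"


print(render_open_tag("object", {"id": "a&b"}) + render_text("  x < y \n") + "</object>")
```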