逐行读取 XML 而不将整个文件加载到内存

Question

这是我的 XML:

的结构

<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="4" PostTypeId="1" AcceptedAnswerId="7" CreationDate="2008-07-31T21:42:52.667" Score="756" ViewCount="63468" Body="&lt;p&gt;I want to use a &lt;code&gt;Track-Bar&lt;/code&gt; to change a &lt;code&gt;Form&lt;/code&gt;'s opacity.&lt;/p&gt;&#xA;&lt;p&gt;This is my code:&lt;/p&gt;&#xA;&lt;pre class=&quot;lang-cs prettyprint-override&quot;&gt;&lt;code&gt;decimal trans = trackBar1.Value / 5000;&#xA;this.Opacity = trans;&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;p&gt;When I build the application, it gives the following error:&lt;/p&gt;&#xA;&lt;blockquote&gt;&#xA;&lt;pre class=&quot;lang-none prettyprint-override&quot;&gt;&lt;code&gt;Cannot implicitly convert type decimal to double&#xA;&lt;/code&gt;&lt;/pre&gt;&#xA;&lt;/blockquote&gt;&#xA;&lt;p&gt;I have tried using &lt;code&gt;trans&lt;/code&gt; and &lt;code&gt;double&lt;/code&gt;, but then the &lt;code&gt;Control&lt;/code&gt; doesn't work. This code worked fine in a past VB.NET project.&lt;/p&gt;&#xA;" OwnerUserId="8" LastEditorUserId="3072350" LastEditorDisplayName="Rich B" LastEditDate="2021-02-26T03:31:15.027" LastActivityDate="2021-11-15T21:15:29.713" Title="How to convert a Decimal to a Double in C#?" Tags="&lt;c#&gt;&lt;floating-point&gt;&lt;type-conversion&gt;&lt;double&gt;&lt;decimal&gt;" AnswerCount="12" CommentCount="4" FavoriteCount="59" CommunityOwnedDate="2012-10-31T16:42:47.213" ContentLicense="CC BY-SA 4.0" />
  <row Id="6" PostTypeId="1" AcceptedAnswerId="31" CreationDate="2008-07-31T22:08:08.620" Score="313" ViewCount="22477" Body="&lt;p&gt;I have an absolutely positioned &lt;code&gt;div&lt;/code&gt; containing several children, one of which is a relatively positioned &lt;code&gt;div&lt;/code&gt;. When I use a &lt;code&gt;percentage-based width&lt;/code&gt; on the child &lt;code&gt;div&lt;/code&gt;, it collapses to &lt;code&gt;0 width&lt;/code&gt; on IE7, but not on Firefox or Safari.&lt;/p&gt;&#xA;&lt;p&gt;If I use &lt;code&gt;pixel width&lt;/code&gt;, it works. If the parent is relatively positioned, the percentage width on the child works.&lt;/p&gt;&#xA;&lt;ol&gt;&#xA;&lt;li&gt;Is there something I'm missing here?&lt;/li&gt;&#xA;&lt;li&gt;Is there an easy fix for this besides the &lt;code&gt;pixel-based width&lt;/code&gt; on the child?&lt;/li&gt;&#xA;&lt;li&gt;Is there an area of the CSS specification that covers this?&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;" OwnerUserId="9" LastEditorUserId="9134576" LastEditorDisplayName="user14723686" LastEditDate="2021-01-29T18:46:45.963" LastActivityDate="2021-01-29T18:46:45.963" Title="Why did the width collapse in the percentage width child element in an absolutely positioned parent on Internet Explorer 7?" Tags="&lt;html&gt;&lt;css&gt;&lt;internet-explorer-7&gt;" AnswerCount="7" CommentCount="0" FavoriteCount="13" ContentLicense="CC BY-SA 4.0" />
</posts>

我可以一个一个地加载每个 row 而无需将整个 XML 文件加载到内存中吗？例如打印所有的标题

Answer 1

您可以尝试这样的操作：

while line:= file.readline():

Answer 2

提供 XML 文件的结构与示例中所示的完全相同，然后 BeautifulSoup 可用于解析相关行。像这样：

from bs4 import BeautifulSoup as BS
with open('my.xml') as xml:
    for line in map(str.strip, xml):
        if line.startswith('<row'):
            soup = BS(line, 'lxml')
            if row := soup.find('row'):
                if title := row.get('title'):
                    print(title)

Answer 3

是的，你可以使用open()，它会return一个文件对象而不是将文件内容读入RAM。所以你想做这样的事情：

with open('file_name') as file:
    for row in file:
        print(row)

Answer 4

XML 中的“行”非常无关紧要；相关单位是元素、属性、开始标签、结束标签等。

流式解析器（通常称为 SAX 解析器，尽管严格来说 SAX 是 Java API）将递增地将文档交付给应用程序，不是一次一行，而是一次一个句法单位。

参见示例Python SAX Parser

逐行读取 XML 而不将整个文件加载到内存

Read XML line by line without loading whole file to memory

python

xml

dataset