How to generate an XML file from the first record of a large (10 GB) XML file without getting a memory error?
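I have a large 10 GB XML file, and I want to create a new XML file that is generated from the first record of the large file. I tried doing this in Java and in Python, but I got a memory error because I was loading the whole data.
In another post, someone suggested that XSLT is the best solution for this. I am new to XSLT and don't know how to do this in XSLT, so please suggest a stylesheet for doing this...
Sample of the large XML file (10 GB):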
<MemberDataExport xmlns="http://www.payback.net/lmsglobal/batch/memberdataexport" xmlns:types="http://www.payback.net/lmsglobal/xsd/v1/types">
  <MembershipInfoListItem>
    <MembershipIdentifier>PB00000000001956044</MembershipIdentifier>
    <ParticipationStatus>1</ParticipationStatus>
    <DataSharing>1</DataSharing>
    <MasterInfo>
      <Gender>1</Gender>
      <Salutation>1</Salutation>
      <FirstName>Hazel</FirstName>
      <LastName>Sweetman</LastName>
      <DateOfBirth>1957-03-25</DateOfBirth>
    </MasterInfo>
  </MembershipInfoListItem>
  <Header>
    <BusinessPartner>CHILIS_US</BusinessPartner>
    <FileType>mde</FileType>
    <FileNumber>17</FileNumber>
    <FormatVariant>1</FormatVariant>
    <NumberOfRecords>22</NumberOfRecords>
    <CreationDate>2016-06-07T12:00:46-07:00</CreationDate>
  </Header>
  <MembershipInfoListItem>
    <MembershipIdentifier>PB00000000001956044</MembershipIdentifier>
    <ParticipationStatus>1</ParticipationStatus>
    <DataSharing>1</DataSharing>
    <MasterInfo>
      <Gender>1</Gender>
      <Salutation>1</Salutation>
      <FirstName>Hazel</FirstName>
      <LastName>Sweetman</LastName>
      <DateOfBirth>1957-03-25</DateOfBirth>
    </MasterInfo>
  </MembershipInfoListItem>
  .....
  .....
</MemberDataExport>
I want to create a file like this:
<MemberDataExport xmlns="http://www.payback.net/lmsglobal/batch/memberdataexport" xmlns:types="http://www.payback.net/lmsglobal/xsd/v1/types">
  <MembershipInfoListItem>
    <MembershipIdentifier>PB00000000001956044</MembershipIdentifier>
    <ParticipationStatus>1</ParticipationStatus>
    <DataSharing>1</DataSharing>
    <MasterInfo>
      <Gender>1</Gender>
      <Salutation>1</Salutation>
      <FirstName>Hazel</FirstName>
      <LastName>Sweetman</LastName>
      <DateOfBirth>1957-03-25</DateOfBirth>
    </MasterInfo>
  </MembershipInfoListItem>
</MemberDataExport>
Is there any other way to do this without getting a memory error? Please suggest that as well.
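You haven't shown your code, so we can't tell what you are doing right or wrong. However, I would bet that a tree-building (DOM-style) parser loads the whole file in order to check that the syntax is correct, that no tags are missing, and so on, and that will certainly cause an out-of-memory error on a 10 GB file.
So in this case my approach would be to read the file line by line with a BufferedReader (see How to read a large text file line by line using Java?) and stop once you reach the line that contains your end tag, i.e. </MembershipInfoListItem>: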
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// `file` is the path of the large input file.
// The root start tag is read from the input itself (it comes before the
// first record), so it is not added to the StringBuilder up front;
// otherwise it would appear twice in the result.
StringBuilder sb = new StringBuilder();
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
    String line;
    while ((line = br.readLine()) != null) {
        sb.append(line);
        sb.append(System.lineSeparator());
        // Stop right after the end tag of the first record.
        if (line.contains("</MembershipInfoListItem>")) {
            break;
        }
    }
    // Close the root element so the result is well-formed.
    sb.append("</MemberDataExport>");
} catch (IOException ex) {
    // log or rethrow
}
Now sb.toString() will return what you want.
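Since the question also mentions Python, the same line-based idea can be sketched there as well. This is only a minimal sketch under stated assumptions: the function name copy_until_end_tag and its parameters are made up for illustration, the input is assumed to be UTF-8, and every tag is assumed to sit on its own line as in the sample (if the whole file were a single line, that line would still be read into memory at once). It writes straight to the output file instead of collecting the text in memory first.

```python
def copy_until_end_tag(src, dst,
                       end_tag='</MembershipInfoListItem>',
                       root_end_tag='</MemberDataExport>'):
    """Copy lines from src to dst up to and including the first line
    that contains end_tag, then close the root element.

    Hypothetical helper; it is not XML-aware and relies on the layout
    of the sample (one tag per line, UTF-8 encoding).
    """
    with open(src, encoding='utf-8') as fin, \
         open(dst, 'w', encoding='utf-8') as fout:
        for line in fin:              # reads one line at a time
            fout.write(line)
            if end_tag in line:
                break                 # first record is complete
        fout.write(root_end_tag + '\n')
```

Calling copy_until_end_tag('input1.xml', 'result1.xml') would then produce the small file without ever holding more than one line of the input in memory.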
In Python (which you mentioned besides Java), you can use ElementTree.iterparse and then break out of the parsing once you have found the elements you want to copy:
import xml.etree.ElementTree as ET

count = 0
copy = 1  # set this to the number of second level (i.e. children of the root) elements you want to copy
level = -1

for event, elem in ET.iterparse('input1.xml', events=('start', 'end')):
    if event == 'start':
        level = level + 1
        if level == 0:
            # root element: create the result tree with the same root tag
            result = ET.ElementTree(ET.Element(elem.tag))
    if event == 'end':
        level = level - 1
        if level == 0:
            # a child of the root has been read completely
            count = count + 1
            if count <= copy:
                result.getroot().append(elem)
            else:
                break

result.write('result1.xml', encoding='UTF-8', xml_declaration=True,
             default_namespace='http://www.payback.net/lmsglobal/batch/memberdataexport')
As for preserving the namespace prefixes better, I have had some success using the start-ns event and registering the collected namespaces on ElementTree:
import xml.etree.ElementTree as ET

count = 0
copy = 1  # set this to the number of second level (i.e. children of the root) elements you want to copy
level = -1

for event, elem in ET.iterparse('input1.xml', events=('start', 'end', 'start-ns')):
    if event == 'start':
        level = level + 1
        if level == 0:
            # root element: create the result tree with the same root tag
            result = ET.ElementTree(ET.Element(elem.tag))
    if event == 'end':
        level = level - 1
        if level == 0:
            # a child of the root has been read completely
            count = count + 1
            if count <= copy:
                result.getroot().append(elem)
            else:
                break
    if event == 'start-ns':
        # for start-ns events, elem is a (prefix, uri) tuple
        ET.register_namespace(elem[0], elem[1])

result.write('result1.xml', encoding='UTF-8', xml_declaration=True)
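The loop above only breaks after one more record than requested has been parsed. A possible variant, sketched here under assumptions, stops as soon as the last wanted record is complete and wraps the logic in a function. The name copy_first_records and its parameters are hypothetical, and the file is opened explicitly so that it is closed again when the loop is left early.

```python
import xml.etree.ElementTree as ET


def copy_first_records(src, dst, copy=1):
    """Copy the first `copy` children of the root element of src into dst.

    Hypothetical helper; returns the number of records copied.
    """
    level = -1
    count = 0
    result = None
    with open(src, 'rb') as f:
        events = ('start', 'end', 'start-ns')
        for event, payload in ET.iterparse(f, events=events):
            if event == 'start-ns':
                # payload is a (prefix, uri) tuple
                prefix, uri = payload
                ET.register_namespace(prefix, uri)
            elif event == 'start':
                level += 1
                if level == 0:
                    # root element: start the result tree with the same tag
                    result = ET.ElementTree(ET.Element(payload.tag))
            elif event == 'end':
                level -= 1
                if level == 0:
                    # a child of the root is complete: copy it
                    result.getroot().append(payload)
                    count += 1
                    if count >= copy:
                        break  # nothing after this point is read
    if result is not None:
        result.write(dst, encoding='UTF-8', xml_declaration=True)
    return count
```

With copy=1 only the root start tag and the first record are ever parsed, so memory use should not depend on the size of the input. Note that ET.register_namespace changes a module-wide registry, so the registered prefixes remain in effect for later serialization in the same process.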