Predictive splitting of an XML file by byte size
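I have an XML message xmlStr that must be split into smaller XML messages, each less than or equal to maxSizeBytes. This is done by taking the document's root and its first child element as the base of each smaller XML message, then taking a certain number of <Smt> elements and putting them into the newly formed (smaller) XML message.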
<?xml version="1.0"?>
<Bas>
<Hdr>
<Smt>...</Smt>
<Smt>...</Smt>
<Smt>...</Smt>
</Hdr>
</Bas>
Currently, I measure the size of the whole message:
int smtNodesPerMessage = (int)Math.Ceiling((double)ASCIIEncoding.ASCII.GetByteCount(xmlStr) / (double)maxSizeBytes);
and then put smtNodesPerMessage nodes into each smaller XML message:
//doc is original XDocument message
XDocument splitXML = new XDocument(new XElement(doc.Root.Name,
doc.Root.Descendants("Hdr")));
splitXML.Root.Add(batchOfSmt);
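I quickly found that the byte size of the smaller XML files was greater than maxSizeBytes, because XDocument adds extra characters to each message, which increases the byte size.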
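The basic algorithm is:
- Get the size of the document with an empty Hdr element. Note that the default encoding is UTF-8, so I use Encoding.Default.GetByteCount to calculate the size of the document and of its elements. (On .NET Framework, Encoding.Default is the system ANSI code page, so Encoding.UTF8 is the safer choice there.)
- Clone this empty-Hdr document for each sub-document.
- Before adding each Smt element, check whether the sub-document size would exceed the maximum.
Code with comments: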
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;

var doc = XDocument.Load("data.xml");
var hdr = doc.Root.Element("Hdr");
var elements = hdr.Elements().ToList();
hdr.RemoveAll(); // safe to remove the child elements: they are kept in the list
hdr.Value = ""; // otherwise the empty element is serialized compactly as <Hdr />

// calculate the size of the sub-document 'template'
// (Encoding.Default is UTF-8 on .NET Core and later; on .NET Framework use Encoding.UTF8)
var sb = new StringBuilder();
using (XmlWriter writer = XmlWriter.Create(sb))
    doc.Save(writer);
var outerSizeInBytes = Encoding.Default.GetByteCount(sb.ToString());

var maxSizeInBytes = 250; // limit used in the example below
var subDocumentIndex = 0; // used just for naming sub-document files
var subDocumentSizeBytes = outerSizeInBytes; // initial size of any sub-document
var subDocument = new XDocument(doc); // clone 'template'

foreach (var smt in elements)
{
    var currentElementSizeBytes = Encoding.Default.GetByteCount(smt.ToString());
    // start a new sub-document when the next element would not fit, unless the
    // current sub-document is still empty (case when a single element is too big)
    if (maxSizeInBytes < subDocumentSizeBytes + currentElementSizeBytes
        && subDocumentSizeBytes != outerSizeInBytes)
    {
        subDocument.Save($"doc{++subDocumentIndex}.xml");
        subDocument = new XDocument(doc);
        subDocumentSizeBytes = outerSizeInBytes;
    }
    subDocument.Root.Element("Hdr").Add(smt);
    subDocumentSizeBytes += currentElementSizeBytes;
}

// if the current sub-document has elements added, save it too
if (outerSizeInBytes < subDocumentSizeBytes)
    subDocument.Save($"doc{++subDocumentIndex}.xml");
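The sizes tracked above are estimates built from per-element strings, so the saved file can differ by a few bytes (indentation, line breaks, the encoding name written into the XML declaration). As a minimal sketch, assuming UTF-8 output without a byte order mark, a helper like the one below measures the exact serialized size of a sub-document by writing it to a MemoryStream. The names XmlSizeCheck and SerializedSizeInBytes are hypothetical and not part of the code above.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class XmlSizeCheck
{
    // Hypothetical helper: number of bytes the document occupies when
    // written as UTF-8 without a byte order mark (an assumption; adjust
    // the settings to match how the messages are really sent or saved).
    public static int SerializedSizeInBytes(XDocument document, bool indent = true)
    {
        var settings = new XmlWriterSettings
        {
            Encoding = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false),
            Indent = indent
        };
        using (var stream = new MemoryStream())
        {
            // Dispose the writer first so everything is flushed to the stream.
            using (var writer = XmlWriter.Create(stream, settings))
            {
                document.Save(writer);
            }
            return stream.ToArray().Length;
        }
    }
}
```

Calling XmlSizeCheck.SerializedSizeInBytes(subDocument) just before subDocument.Save(...) gives a figure to compare against maxSizeInBytes. If the output is written with a UTF-8 byte order mark, allow three more bytes.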
With the following source and a maximum size of 250 bytes, you get three documents:
<?xml version="1.0"?>
<Bas>
<Hdr>
<Smt>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</Smt>
<Smt>Contrary to popular belief, Lorem Ipsum is not simply random text.</Smt>
<Smt>It has survived not only five centuries,
but also the leap into electronic typesetting, remaining essentially unchanged.</Smt>
<Smt>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</Smt>
</Hdr>
</Bas>
doc1 (223 bytes):
<?xml version="1.0" encoding="utf-8"?>
<Bas>
<Hdr>
<Smt>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</Smt>
<Smt>Contrary to popular belief, Lorem Ipsum is not simply random text.</Smt>
</Hdr>
</Bas>
doc2 (259 bytes; a single element that exceeds the limit on its own):
<?xml version="1.0" encoding="utf-8"?>
<Bas>
<Hdr>
<Smt>It has survived not only five centuries,
but also the leap into electronic typesetting, remaining essentially unchanged.</Smt>
</Hdr>
</Bas>
doc3 (128 bytes, the last one):
<?xml version="1.0" encoding="utf-8"?>
<Bas>
<Hdr>
<Smt>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</Smt>
</Hdr>
</Bas>