Parsing several files (one by one) with rapidXML to use less memory

Question

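I need to read a large XML file (~5.4 GB). I noticed that parsing a file with rapidXML uses about 6 times as much RAM as the file's size on disk (so parsing a 200 MB file takes ~1.2 GB of RAM, and the 5.4 GB file would take ~32.4 GB!). To avoid swapping, I decided to split the file into smaller chunks (using the 'xml-split' tool from the comma library) and read these chunks one by one. I can read and parse the XML files correctly.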

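The problem: when I reach the end of the first file, I can successfully open the second one, but the first file still occupies memory, even if I clear the rapidxml::xml_document<> and/or delete the rapidxml::file<>. Here is the header file: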

//*1st code snippet*
//.h file
#include "rapidxml_utils.hpp"        //Implicitly includes 'rapidxml.hpp'
...
private:
  std::basic_ifstream<char> inStream;
  rapidxml::file<>* sumoXmlFile;
  rapidxml::xml_document<> doc;
  uint16_t fcdFileIndex;               //initialized at 0
...

Here is the code that opens a new XML file:

//*2nd code snippet*
//.cc file
bool parseNextFile()
{
  //check if file exists (filenames are : fcd0.xml, fcd1.xml, fcd2.xml, etc.)
  struct stat buffer;
  std::string fileName = std::string("fcd") + std::to_string(fcdFileIndex) + ".xml";
  bool fileExists = (stat(fileName.c_str(), &buffer) == 0);

  if(!fileExists)
    return false;

  //"increment" the name for the next file (when this method will be recalled)
  fcdFileIndex++;

  //open a reading stream, create the 'file' and parse it
  inStream.open(fileName.c_str(), std::basic_ifstream<char>::in);
  sumoXmlFile = new rapidxml::file<>(inStream);
  doc.parse<0>(sumoXmlFile->data());

  return true;
}

I call parseNextFile() once at the start of my code (to open the first file). After that, the update() method is called periodically:

//*3rd code snippet*
void update()
{
  //Read next tag
  rapidxml::xml_node<>* node = doc.first_node("timestep");

  //If no 'timestep' tags are left, clean and parse the next file.
  if(!node)
  {
    doc.clear();         //**not sure**
    delete sumoXmlFile;  //**not sure**
    inStream.close();    //**not sure**

    if(parseNextFile())  //See 2nd code snippet
      node = doc.first_node("timestep");
    else
      return;
  }

  //read the children nodes of the current 'timestep'
  for(rapidxml::xml_node<>* veh = node->first_node(); veh; veh = node->first_node())
  {
    ...
    //read some attributes using 'veh->first_attribute("...")'
    ...

    node->remove_first_node();
  }

  doc.remove_first_node();
}

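The problem is (I think) the 'cleaning' (the lines marked 'not sure' in the previous snippet). I tried several combinations of clear() and delete, and calling the memory_pool destructor. None of these attempts freed the memory. I also tried opening the XML files directly with

sumoXmlFile = new rapidxml::file<>(fileName.c_str()); //see 2nd code snippet

instead of creating the ifstream manually.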

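To summarize: when I open the first XML file, it loads successfully and uses some RAM. When I am done with it, I try to clean/delete/clear the memory pool (without success) and open the second file (successfully). At this point, both the 1st and the 2nd file occupy memory. Parsing the second file works fine (and so do the third, the fourth, etc.), but at some point the RAM fills up.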

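(Finally) my question: am I doing something wrong when freeing the memory used by the first file? Is it possible to free the used memory and then read the next file? I don't mind destroying the XML files in the process if necessary.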
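For illustration only, the lifetime I am aiming for (release one chunk completely before loading the next) can be sketched without rapidXML. In this sketch, `LoadedChunk`, `loadChunk`, and `processAllChunks` are hypothetical names, and a plain `std::vector<char>` stands in for the text buffer that rapidxml::file<> owns; the parsed document would live in the same owned object. It does not explain why the memory stays allocated in my code; it only makes the ownership explicit through `std::unique_ptr`, so the previous chunk is destroyed exactly once.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-file state. In the real code this would
// hold the rapidxml::file<> and the xml_document<> that points into it.
struct LoadedChunk {
    std::vector<char> text;  // whole file contents plus a terminating '\0'
};

// Loads "<prefix><index>.xml". Returns nullptr when the file cannot be opened,
// which is how the end of the chunk sequence is detected here.
std::unique_ptr<LoadedChunk> loadChunk(const std::string& prefix, unsigned index) {
    std::ifstream in(prefix + std::to_string(index) + ".xml", std::ios::binary);
    if (!in)
        return nullptr;

    std::unique_ptr<LoadedChunk> chunk(new LoadedChunk());
    chunk->text.assign(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
    chunk->text.push_back('\0');  // in-situ parsers expect zero-terminated text
    return chunk;
}

// Walks the chunks in order and returns the total number of bytes read.
// At most one chunk is alive at any time.
std::size_t processAllChunks(const std::string& prefix) {
    std::size_t totalBytes = 0;
    std::unique_ptr<LoadedChunk> current;

    for (unsigned index = 0; ; ++index) {
        current.reset();                    // destroy the previous chunk first
        current = loadChunk(prefix, index); // only then load the next one
        if (!current)
            break;

        assert(!current->text.empty());     // always holds: '\0' was appended
        totalBytes += current->text.size() - 1;  // placeholder for real parsing
    }
    return totalBytes;
}
```

Calling `processAllChunks("fcd")` would visit fcd0.xml, fcd1.xml, and so on until a file is missing. Because the pointer is reset rather than deleted by hand, a second cleanup pass after the last file cannot delete the same object twice.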

(For completeness: this code is actually part of an OMNeT++ simulation, and the XML file is generated by SUMO. I am sure the XML file contains no errors.)

Thank you for any help or hints!

Answer 1

I solved the problem by parsing the XML file with a Python script. The script extracts the useful information from the XML file and writes it to a CSV file, which is then read line by line in OMNeT (C++) using a std::istream.
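The C++ side of this could look roughly like the sketch below. The column layout (time, vehicle id, x, y) and the names `VehicleSample` and `readNextSample` are assumptions made for illustration; the actual CSV produced by the script may differ. Only one line is held in memory at a time, which is the point of the approach.

```cpp
#include <exception>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical record layout; adjust to the columns the script really writes.
struct VehicleSample {
    double time;
    std::string vehicleId;
    double x;
    double y;
};

// Reads the next non-empty CSV line from `in` into `out`.
// Returns false at end of input or when a line cannot be parsed.
bool readNextSample(std::istream& in, VehicleSample& out) {
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty())
            continue;  // skip blank lines

        std::istringstream fields(line);
        std::string time, x, y;
        if (!std::getline(fields, time, ',') ||
            !std::getline(fields, out.vehicleId, ',') ||
            !std::getline(fields, x, ',') ||
            !std::getline(fields, y))
            return false;  // fewer than four columns

        try {
            out.time = std::stod(time);
            out.x = std::stod(x);
            out.y = std::stod(y);
        } catch (const std::exception&) {
            return false;  // a numeric column did not contain a number
        }
        return true;
    }
    return false;  // no more lines
}
```

A periodic update() could then call `readNextSample` on an open std::ifstream once per record instead of walking a DOM tree, so memory use no longer depends on the size of the input file.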

memory-management

rapidxml

xml-parsing