有效地解析 XML 中 PHP 中的大文件以生成 SQL
Parse Large XML File in PHP Efficiently to Generate SQL
我正在尝试解析一个大 XML 文件并将其加载到 MySQL。我已经使用 simplexml
来解析它,它工作得很好,但是对于这个大的 XML 文件来说它的速度变慢了。现在我正在尝试使用 XMLReader
.
这是 XML 的示例:
<?xml version="1.0" encoding="UTF-8"?>
<drug type="biotech" created="2005-06-13" updated="2015-02-23">
<drugbank-id primary="true">DB00001</drugbank-id>
<drugbank-id>BIOD00024</drugbank-id>
<drugbank-id>BTD00024</drugbank-id>
<name>Lepirudin</name>
<description>Lepirudin is identical </description>
<cas-number>120993-53-5</cas-number>
<groups>
<group>approved</group>
</groups>
<pathways>
<pathway>
<smpdb-id>SMP00278</smpdb-id>
<name>Lepirudin Action Pathway</name>
<drugs>
<drug>
<drugbank-id>DB00001</drugbank-id>
<name>Lepirudin</name>
</drug>
<drug>
<drugbank-id>DB01373</drugbank-id>
<name>Calcium</name>
</drug>
</drugs>
...
</drug>
<drug type="biotech" created="2005-06-15" updated="2015-02-25">
...
</drug>
这是我使用 simplexml
的方法:
<?php
$xml = simplexml_load_file('drugbank.xml');
$servername = "localhost"; // Example : localhost
$username = "root";
$password = "pass";
$dbname = "dbname";
// Create connection
$conn = new mysqli($servername, $username, $password, $dbname);
// Check connection
if ($conn->connect_error) {
die("Connection failed: " . $conn->connect_error);
}
$xmlObject_count = $xml->drug->count();
for ($i=0; $i < $xmlObject_count; $i++) {
$name = $xml->drug[$i]->name;
$description = $xml->drug[$i]->description;
$casnumber = $xml->drug[$i]->{'cas-number'};
// ...
$created = $xml->drug[$i]['created'];
$updated = $xml->drug[$i]['updated'];
$type = $xml->drug[$i]['type'];
$sql = "INSERT INTO `drug` (name, description,cas_number,created,updated,type)
VALUES ('$name', '$description','$casnumber','$created','$updated','$type')";
if ($conn->query($sql) === TRUE) {
$last_id = $conn->insert_id;
} else {
echo "outer else Error: " . $sql . "<br>" . $conn->error. "<br>" ;
}
}
$conn->close();
它工作正常,它给了我 7,789 行。但是,我想使用 XMLReader
来解析它。但是 XMLReader
的问题我发现它提供了超过 35,000 行。
如果您查看 XML,您会发现在 <drug />
节点内还有其他一些 <drugs><drug>
子节点。我该如何克服这个问题?
这是我使用 XMLReader
的程序:
<?php
$servername = "localhost"; // Example : localhost
$username = "root";
$password = "pass";
$dbname = "dbname";
// Create connection
$conn = new mysqli($servername, $username, $password, $dbname);
// Check connection
if ($conn->connect_error) {
die("Connection failed: " . $conn->connect_error);
}
$reader = new XMLReader();
$reader->open('drugbank.xml');
while ($reader->read())
{
if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'drug')
{
$doc = new DOMDocument('1.0', 'UTF-8');
$xml = simplexml_import_dom($doc->importNode($reader->expand(),true));
$name = $xml->name;
$description = $xml->description;
$casnumber = $xml->{'cas-number'};
// ...
$sql = "INSERT INTO `drug` (name, description,cas_number,created,updated,type)
VALUES ('$name', '$description','$casnumber','$created','$updated','$type')";
if ($conn->query($sql) === TRUE) {
$last_id = $conn->insert_id;
} else {
echo "outer else Error: " . $sql . "<br>" . $conn->error. "<br>" ;
}
}
}
$conn->close();
对于这个例子,我发现它提供了超过 35,000 行。
好的,我为您提供了一个在执行速度、内存使用和数据库负载方面有很大改进的工作示例:
<?php
define('INSERT_BATCH_SIZE', 500);
define('DRUG_XML_FILE', 'drugbank.xml');
$servername = "localhost"; // Example : localhost
$username = "root";
$password = "pass";
$dbname = "dbname";
function parseXml($mysql)
{
$drugs = array();
$xmlReader = new XMLReader();
$xmlReader->open(DRUG_XML_FILE);
// Move our pointer to the first <drug /> element.
while ($xmlReader->read() && $xmlReader->name !== 'drug') ;
$drugCount = 0;
$totalDrugs = 0;
// Iterate over the outer <drug /> elements.
while ($xmlReader->name == 'drug')
{
// Convert the node into a SimpleXMLElement for ease of use.
$item = new SimpleXMLElement($xmlReader->readOuterXML());
$name = $item->name;
$description = $item->description;
$casNumber = $item->{'cas-number'};
$created = $item['created'];
$updated = $item['updated'];
$type = $item['type'];
$drugs[] = "('$name', '$description','$casNumber','$created','$updated','$type')";
$drugCount++;
$totalDrugs++;
// Once we've reached the desired batch size, insert the batch and reset the counter.
if ($drugCount >= INSERT_BATCH_SIZE)
{
batchInsertDrugs($mysql, $drugs);
$drugCount = 0;
}
// Go to next <drug />.
$xmlReader->next('drug');
}
$xmlReader->close();
// Insert the leftovers from the last batch.
batchInsertDrugs($mysql, $drugs);
echo "Inserted $totalDrugs total drugs.";
}
function batchInsertDrugs($mysql, &$drugs)
{
// Generate a batched INSERT statement.
$statement = "INSERT INTO `drug` (name, description, cas_number, created, updated, type) VALUES";
$statement = $statement . ' ' . implode(",\n", $drugs);
echo $statement, "\n";
// Run the batch INSERT.
if ($mysql->query($statement))
{
echo "Inserted " . count($drugs) . " drugs.";
}
else
{
echo "INSERT Error: " . $statement . "<br>" . $mysql->error. "<br>" ;
}
// Clear the buffer.
$drugs = array();
}
// Create MySQL connection.
$mysql = new mysqli($servername, $username, $password, $dbname);
if ($mysql->connect_error)
{
die("Connection failed: " . $mysql->connect_error);
}
parseXml($mysql);
我使用 same dataset 测试了这个例子。
以您现在的方式使用 SimpleXML 会导致在内存中解析整个文档,这既缓慢又占用大量内存。此方法使用 XMLReader, which is a fast pull-parser. You can probably make this faster still using the PHP SAX XML Parser,但它的模式有点复杂,上面的示例将明显优于您开始时的示例。
我的示例中的另一个显着变化是我们正在使用 MySQL 批量插入,因此我们实际上只在处理每个 500
(可配置)项时访问数据库。您可以调整此数字以获得更好的性能。在某一点之后,查询将变得太大 MySQL 无法处理,但您一次可以做的事情比 500
多得多。
如果您希望我进一步解释其中的任何部分,或者您对此有任何疑问,请在评论中告诉我! :)
我正在尝试解析一个大 XML 文件并将其加载到 MySQL。我已经使用 simplexml
来解析它,它工作得很好,但是对于这个大的 XML 文件来说它的速度变慢了。现在我正在尝试使用 XMLReader
.
这是 XML 的示例:
<?xml version="1.0" encoding="UTF-8"?>
<drug type="biotech" created="2005-06-13" updated="2015-02-23">
<drugbank-id primary="true">DB00001</drugbank-id>
<drugbank-id>BIOD00024</drugbank-id>
<drugbank-id>BTD00024</drugbank-id>
<name>Lepirudin</name>
<description>Lepirudin is identical </description>
<cas-number>120993-53-5</cas-number>
<groups>
<group>approved</group>
</groups>
<pathways>
<pathway>
<smpdb-id>SMP00278</smpdb-id>
<name>Lepirudin Action Pathway</name>
<drugs>
<drug>
<drugbank-id>DB00001</drugbank-id>
<name>Lepirudin</name>
</drug>
<drug>
<drugbank-id>DB01373</drugbank-id>
<name>Calcium</name>
</drug>
</drugs>
...
</drug>
<drug type="biotech" created="2005-06-15" updated="2015-02-25">
...
</drug>
这是我使用 simplexml
的方法:
<?php
$xml = simplexml_load_file('drugbank.xml');
$servername = "localhost"; // Example : localhost
$username = "root";
$password = "pass";
$dbname = "dbname";
// Create connection
$conn = new mysqli($servername, $username, $password, $dbname);
// Check connection
if ($conn->connect_error) {
die("Connection failed: " . $conn->connect_error);
}
$xmlObject_count = $xml->drug->count();
for ($i=0; $i < $xmlObject_count; $i++) {
$name = $xml->drug[$i]->name;
$description = $xml->drug[$i]->description;
$casnumber = $xml->drug[$i]->{'cas-number'};
// ...
$created = $xml->drug[$i]['created'];
$updated = $xml->drug[$i]['updated'];
$type = $xml->drug[$i]['type'];
$sql = "INSERT INTO `drug` (name, description,cas_number,created,updated,type)
VALUES ('$name', '$description','$casnumber','$created','$updated','$type')";
if ($conn->query($sql) === TRUE) {
$last_id = $conn->insert_id;
} else {
echo "outer else Error: " . $sql . "<br>" . $conn->error. "<br>" ;
}
}
$conn->close();
它工作正常,它给了我 7,789 行。但是,我想使用 XMLReader
来解析它。但是 XMLReader
的问题我发现它提供了超过 35,000 行。
如果您查看 XML,您会发现在 <drug />
节点内还有其他一些 <drugs><drug>
子节点。我该如何克服这个问题?
这是我使用 XMLReader
的程序:
<?php
$servername = "localhost"; // Example : localhost
$username = "root";
$password = "pass";
$dbname = "dbname";
// Create connection
$conn = new mysqli($servername, $username, $password, $dbname);
// Check connection
if ($conn->connect_error) {
die("Connection failed: " . $conn->connect_error);
}
$reader = new XMLReader();
$reader->open('drugbank.xml');
while ($reader->read())
{
if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'drug')
{
$doc = new DOMDocument('1.0', 'UTF-8');
$xml = simplexml_import_dom($doc->importNode($reader->expand(),true));
$name = $xml->name;
$description = $xml->description;
$casnumber = $xml->{'cas-number'};
// ...
$sql = "INSERT INTO `drug` (name, description,cas_number,created,updated,type)
VALUES ('$name', '$description','$casnumber','$created','$updated','$type')";
if ($conn->query($sql) === TRUE) {
$last_id = $conn->insert_id;
} else {
echo "outer else Error: " . $sql . "<br>" . $conn->error. "<br>" ;
}
}
}
$conn->close();
对于这个例子,我发现它提供了超过 35,000 行。
好的,我为您提供了一个在执行速度、内存使用和数据库负载方面有很大改进的工作示例:
<?php
define('INSERT_BATCH_SIZE', 500);
define('DRUG_XML_FILE', 'drugbank.xml');
$servername = "localhost"; // Example : localhost
$username = "root";
$password = "pass";
$dbname = "dbname";
function parseXml($mysql)
{
$drugs = array();
$xmlReader = new XMLReader();
$xmlReader->open(DRUG_XML_FILE);
// Move our pointer to the first <drug /> element.
while ($xmlReader->read() && $xmlReader->name !== 'drug') ;
$drugCount = 0;
$totalDrugs = 0;
// Iterate over the outer <drug /> elements.
while ($xmlReader->name == 'drug')
{
// Convert the node into a SimpleXMLElement for ease of use.
$item = new SimpleXMLElement($xmlReader->readOuterXML());
$name = $item->name;
$description = $item->description;
$casNumber = $item->{'cas-number'};
$created = $item['created'];
$updated = $item['updated'];
$type = $item['type'];
$drugs[] = "('$name', '$description','$casNumber','$created','$updated','$type')";
$drugCount++;
$totalDrugs++;
// Once we've reached the desired batch size, insert the batch and reset the counter.
if ($drugCount >= INSERT_BATCH_SIZE)
{
batchInsertDrugs($mysql, $drugs);
$drugCount = 0;
}
// Go to next <drug />.
$xmlReader->next('drug');
}
$xmlReader->close();
// Insert the leftovers from the last batch.
batchInsertDrugs($mysql, $drugs);
echo "Inserted $totalDrugs total drugs.";
}
function batchInsertDrugs($mysql, &$drugs)
{
// Generate a batched INSERT statement.
$statement = "INSERT INTO `drug` (name, description, cas_number, created, updated, type) VALUES";
$statement = $statement . ' ' . implode(",\n", $drugs);
echo $statement, "\n";
// Run the batch INSERT.
if ($mysql->query($statement))
{
echo "Inserted " . count($drugs) . " drugs.";
}
else
{
echo "INSERT Error: " . $statement . "<br>" . $mysql->error. "<br>" ;
}
// Clear the buffer.
$drugs = array();
}
// Create MySQL connection.
$mysql = new mysqli($servername, $username, $password, $dbname);
if ($mysql->connect_error)
{
die("Connection failed: " . $mysql->connect_error);
}
parseXml($mysql);
我使用 same dataset 测试了这个例子。 以您现在的方式使用 SimpleXML 会导致在内存中解析整个文档,这既缓慢又占用大量内存。此方法使用 XMLReader, which is a fast pull-parser. You can probably make this faster still using the PHP SAX XML Parser,但它的模式有点复杂,上面的示例将明显优于您开始时的示例。
我的示例中的另一个显着变化是我们正在使用 MySQL 批量插入,因此我们实际上只在处理每个 500
(可配置)项时访问数据库。您可以调整此数字以获得更好的性能。在某一点之后,查询将变得太大 MySQL 无法处理,但您一次可以做的事情比 500
多得多。
如果您希望我进一步解释其中的任何部分,或者您对此有任何疑问,请在评论中告诉我! :)