PHP replace text in Office OpenXML files with XMLWriter/XMLReader

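I am using XMLReader to find text in an Office OpenXML document and XMLWriter to write it to an xliff file. I then modify the text in that other XML file, and now I want to rebuild the OpenXML document. I am using the XML iterator class as suggested in this question.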

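I want to replace the node content in the original file with the node content from the xliff file, checking whether the node's count matches the id attribute. So the 10th node would be replaced by the g element with id 10, if it exists.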

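What happens with my code right now is that it does not replace the tag content. It generates an empty, self-closing tag and puts the original content after it. Right after this tag, it closes the document.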

The xliff file - segments.xliff

<?xml version="1.0"?>
<xliff>
 <file original="/home/brgwe507/public_html/previas/wp-content/uploads/sites/9/2015/03/Cap32.docx" datatype="x-noveritis" source-language="pt-BR">
  <body>
   <trans-unit id="177">
    <source><g id="217">In a thermodynamic process, energy is transferred to or from a system by two primary methods.</g></source><seg-source><mrk mtype="seg" id="1"><g id="217">In a thermodynamic process, energy is transferred to or from a system by two primary methods.</g></mrk></seg-source>
    <target><mrk mtype="seg" id="1"><g id="217">tradução segmento1.</g></mrk> </target>
   </trans-unit>
   <trans-unit id="178">
    <source><g id="217">The first method to be considered is work and the second, which will follow in Section 3.2, is heat transfer.</g></source><seg-source><mrk mtype="seg" id="2"><g id="217">The first method to be considered is work and the second, which will follow in Section 3.2, is heat transfer.</g></mrk></seg-source>
    <target><mrk mtype="seg" id="2"><g id="217">tradução segmento 2</g></mrk> </target>
   </trans-unit>
   <trans-unit id="179">
    <source><g id="218">Work, designated </g><g id="219">W</g><g id="220">, is defined in mechanics as the product of a force and the distance moved in the direction of the force.</g></source><seg-source><mrk mtype="seg" id="3"><g id="218">Work, designated </g><g id="219">W</g><g id="220">, is defined in mechanics as the product of a force and the distance moved in the direction of the force.</g></mrk></seg-source>
    <target><mrk mtype="seg" id="3"><g id="218">tradução</g><g id="219">teste</g><g id="220">, segmento 3</g></mrk> </target>
   </trans-unit>
   <trans-unit id="180">
    <source><g id="220">A more general definition of work is used in thermodynamics:</g><g id="221">Work</g><g id="222">, an interaction between a system and its surroundings, is done by a system if the sole external effect on the surroundings could be the raising of a weight.</g></source><seg-source><mrk mtype="seg" id="4"><g id="220">A more general definition of work is used in thermodynamics:</g><g id="221">Work</g><g id="222">, an interaction between a system and its surroundings, is done by a system if the sole external effect on the surroundings could be the raising of a weight.</g></mrk></seg-source>
    <target><mrk mtype="seg" id="4"><g id="220">tradução deste segmento:</g><g id="221">para</g><g id="222">teste de tradução segmento 4.</g></mrk> </target>
   </trans-unit>
   <trans-unit id="181">
    <source><g id="222">The magnitude of the work is the product of the weight and the distance it could be </g><g id="223">lifted.This</g><g id="224"> definition allows a battery to do work since the energy produced by the battery could be the lifting of a weight, as suggested in Fig.</g></source><seg-source><mrk mtype="seg" id="5"><g id="222">The magnitude of the work is the product of the weight and the distance it could be </g><g id="223">lifted.This</g><g id="224"> definition allows a battery to do work since the energy produced by the battery could be the lifting of a weight, as suggested in Fig.</g></mrk></seg-source>
    <target><mrk mtype="seg" id="5"><g id="222">tradução para teste </g><g id="223">xliff.</g><g id="224"> semgneto 5 ladsfoienfoqeiwnf</g></mrk> </target>
   </trans-unit>
   <trans-unit id="182">
    <source><g id="224">3.2.Work has unit</g><g id="225">s of N </g><g id="226">[S]</g><g id="227"> </g><g id="228">m 5 J.</g></source><seg-source><mrk mtype="seg" id="6"><g id="224">3.2.Work has unit</g><g id="225">s of N </g><g id="226">[S]</g><g id="227"> </g><g id="228">m 5 J.</g></mrk></seg-source>
    <target><mrk mtype="seg" id="6"><g id="224">3.2. teste</g><g id="225">1 de 7 </g><g id="226">[S]</g><g id="227"> </g><g id="228">segmento.</g></mrk> </target>
   </trans-unit>
   <trans-unit id="183">
    <source><g id="228">The work done per unit mass, or </g><g id="229">specific work</g><g id="230">, is</g></source><seg-source><mrk mtype="seg" id="7"><g id="228">The work done per unit mass, or </g><g id="229">specific work</g><g id="230">, is</g></mrk></seg-source>
    <target><mrk mtype="seg" id="7"><g id="228">Para tradução </g><g id="229">segmento</g><g id="230">, é</g></mrk> </target>
   </trans-unit>
  </body>
 </file>
</xliff>

The original document.xml to be updated

<w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml">
<w:body>
<w:p w:rsidR="000C0514" w:rsidRPr="004F10D0" w:rsidRDefault="004F10D0" w:rsidP="004F10D0">
<w:pPr>
<w:rPr>
<w:b/>
</w:rPr>
</w:pPr>
<w:r w:rsidRPr="004F10D0">
<w:rPr>
<w:b/>
</w:rPr>
<w:t>CHAPTER 3</w:t>
</w:r>
</w:p>
...
<w:p w:rsidR="000C0514" w:rsidRPr="009D4166" w:rsidRDefault="004F10D0" w:rsidP="004F10D0">
<w:pPr>
<w:rPr>
<w:b/>
</w:rPr>
</w:pPr>
<w:r w:rsidRPr="009D4166">
<w:rPr>
<w:b/>
</w:rPr>
<w:t>Figure 3.57</w:t>
</w:r>
</w:p>
<w:sectPr w:rsidR="000C0514" w:rsidRPr="009D4166" w:rsidSect="004F10D0">
<w:headerReference w:type="even" r:id="rId7"/>
<w:pgSz w:w="11905" w:h="16840"/>
<w:pgMar w:top="1417" w:right="1701" w:bottom="1417" w:left="1701" w:header="0" w:footer="1305" w:gutter="0"/>
<w:cols w:space="720"/>
</w:sectPr>
</w:body>
</w:document>

PHP code

    $xmlInputFile  = 'document.xml';
    $xmlOutputFile = 'new_document.xml';
    $xmlxliff = 'segments.xliff';

    $reader = new XMLReader();
    $reader->open($xmlInputFile);

    $writer = new XMLWriter();
    $writer->openUri($xmlOutputFile);

    $iterator = new XMLWritingIteration($writer, $reader);

    $segmentos = new XMLReader();
    $segmentos->open($xmlxliff);

    $writer->startDocument();
    $t=0;
    foreach ($iterator as $node) {
        $isElement = $node->nodeType === XMLReader::ELEMENT;

        if ($isElement && $node->name === 'w:t') {
        // increase <w:t> counter and find the same g id in the xliff
        $t++;
        $writer->startElement($node->name);
            while ($segmentos->read()){
                if ($segmentos->nodeType == XMLREADER::ELEMENT && $segmentos->name === 'g'){
                $gid = $segmentos->getAttribute('id');
                if ($gid === $t){
                    $texto = $segmentos->readInnerXML();
                    $writer->text($texto);
                }
                }
            }
            $writer->endElement();
        }else {
        // handle everything else
        $iterator->write();
        }
    }
    $writer->endDocument();

And the output in new_document.xml:

<?xml version="1.0"?>
<w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml">
 <w:body>
  <w:p w:rsidR="000C0514" w:rsidRPr="004F10D0" w:rsidRDefault="004F10D0" w:rsidP="004F10D0">
   <w:pPr>
    <w:rPr>
     <w:b/>
    </w:rPr>
   </w:pPr>
   <w:r w:rsidRPr="004F10D0">
    <w:rPr>
    <w:b/> 
    </w:rPr>
     <w:t/><--self closing <w:t> tag
    CHAPTER 3 <-- original text was not replaced and now is outside the tag
    </w:r>
   </w:p>
  </w:body> <-- body closing tag after first paragraph
</w:document> <-- document closing tag
<w:p w:rsidR="000C0514" w:rsidRPr="004F10D0" w:rsidRDefault="000C0514" w:rsidP="004F10D0"/> <-- more content after document closing tag
<w:p w:rsidR="004F10D0" w:rsidRDefault="004F10D0" w:rsidP="004F10D0">... 

First of all, the code was indeed a bit problematic. I have updated XMLReaderIterator to version 0.1.8, which also contains a small fix that is useful for your case.

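The general problem with the flow in your example is that you do not forward the reading iterator. Because of that, those parts are written later on, which is why you see them at the end of the document. So writing alone is not enough; you also need to skip the elements in the read iterator that you want to replace: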

$writer->startElement($node->name);

$node->next();
$iterator->skipNextRead();

$writer->text(sprintf("TEXT #%d", $textCount));
$writer->endElement();

After starting the element, $node->next(); skips all children (child nodes) of the current $node element. This is necessary so that they are not output later on.

Then $iterator->skipNextRead() tells the foreach not to advance again (next() has already moved on, and XMLReader is forward-only). This method is new in XMLWritingIteration as of v0.1.8, so you need to update.
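To see the skipping behaviour in isolation, here is a minimal sketch that uses only PHP's built-in XMLReader, without the XMLReaderIterator library. The sample XML and the $visited array are made up for illustration. It shows that once next() has been called on an element, the reader already sits on the following node, so read() must not be called again in that round; this is analogous to what skipNextRead() is described as doing for the foreach.

```php
<?php
// Hypothetical sample document: <t> has nested content that should be skipped.
$xml = '<r><t>old <b>text</b></t><u>kept</u></r>';

$reader = new XMLReader();
$reader->XML($xml);

$visited = []; // names of the element nodes the loop actually stops on

$more = $reader->read();
while ($more) {
    $isElement = $reader->nodeType === XMLReader::ELEMENT;

    if ($isElement) {
        $visited[] = $reader->name;
    }

    if ($isElement && $reader->name === 't') {
        // next() jumps past the whole <t> subtree. The reader is now on the
        // node that follows </t>, so no extra read() is done in this round.
        $more = $reader->next();
    } else {
        $more = $reader->read();
    }
}

$reader->close();

echo implode(',', $visited), "\n";
```

The nested b element is never visited because it lives inside the skipped subtree, while u is still reached.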

The whole example (using your sample XML):

require('xmlreader-iterators.php'); // require XMLReaderIterator library

$xmlInputFile = 'data/worddocument.xml';
$xmlXliffFile = 'data/segments.xliff';

$reader = new XMLReader();
$reader->open($xmlInputFile);

$writer = new XMLWriter();
$writer->openMemory();

$iterator = new XMLWritingIteration($writer, $reader);

$writer->startDocument();

$textCount = 0;
foreach ($iterator as $node) {
    $isElement = $node->nodeType === XMLReader::ELEMENT;

    if ($isElement && $node->name === 'w:t') {
        $textCount++;

        $writer->startElement($node->name);

        $node->next();
        $iterator->skipNextRead();

        $writer->text(sprintf("TEXT #%d", $textCount));
        $writer->endElement();
    } else {
        // handle everything else
        $iterator->write();
    }
}

$writer->endDocument();
echo $writer->outputMemory(true);

Output:

<?xml version="1.0"?>
<w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml">
    <w:body>
        <w:p w:rsidR="000C0514" w:rsidRPr="004F10D0" w:rsidRDefault="004F10D0" w:rsidP="004F10D0">
            <w:pPr>
                <w:rPr>
                    <w:b/>
                </w:rPr>
            </w:pPr>
            <w:r w:rsidRPr="004F10D0">
                <w:rPr>
                    <w:b/>
                </w:rPr>
                <w:t>TEXT #1</w:t>
            </w:r>
        </w:p>
        ...
        <w:p w:rsidR="000C0514" w:rsidRPr="009D4166" w:rsidRDefault="004F10D0" w:rsidP="004F10D0">
            <w:pPr>
                <w:rPr>
                    <w:b/>
                </w:rPr>
            </w:pPr>
            <w:r w:rsidRPr="009D4166">
                <w:rPr>
                    <w:b/>
                </w:rPr>
                <w:t>TEXT #2</w:t>
            </w:r>
        </w:p>
        <w:sectPr w:rsidR="000C0514" w:rsidRPr="009D4166" w:rsidSect="004F10D0">
            <w:headerReference w:type="even" r:id="rId7"/>
            <w:pgSz w:w="11905" w:h="16840"/>
            <w:pgMar w:top="1417" w:right="1701" w:bottom="1417" w:left="1701" w:header="0" w:footer="1305" w:gutter="0"/>
            <w:cols w:space="720"/>
        </w:sectPr>
    </w:body>
</w:document>

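I think this is more like the kind of output you want to achieve. If the xliff file is not that large, it might be better not to parse it with XMLReader but to use SimpleXMLElement or DOMDocument instead. Both have XPath, which should be very handy for finding the IDs in there and quickly collecting the matching content.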
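As a rough sketch of that idea, the snippet below loads a shortened, made-up excerpt of the xliff with SimpleXMLElement and builds an id-to-text map from the g elements inside target. The helper name translationFor() and the $translations array are hypothetical, and restricting the lookup to target is an assumption about which text you want. Two things in the question's loop are worth noting here: XMLReader::getAttribute() returns a string, so a strict === comparison against the integer counter never matches, and the inner while reads the xliff reader to its end on the first w:t. A map built once up front avoids both.

```php
<?php
// Shortened, hypothetical excerpt of segments.xliff (same structure, less text).
$xliff = <<<'XML'
<xliff>
 <file original="example.docx" datatype="x-noveritis" source-language="pt-BR">
  <body>
   <trans-unit id="177">
    <source><g id="217">In a thermodynamic process.</g></source>
    <target><mrk mtype="seg" id="1"><g id="217">tradução segmento1.</g></mrk> </target>
   </trans-unit>
   <trans-unit id="179">
    <source><g id="218">Work, designated </g><g id="219">W</g></source>
    <target><mrk mtype="seg" id="3"><g id="218">tradução</g><g id="219">teste</g></mrk> </target>
   </trans-unit>
  </body>
 </file>
</xliff>
XML;

$doc = new SimpleXMLElement($xliff);

// Collect the text of every <g id="..."> below a <target> element, keyed by id.
$translations = [];
foreach ($doc->xpath('//target//g[@id]') as $g) {
    $translations[(string) $g['id']] = (string) $g;
}

/**
 * Hypothetical helper: returns the translated text for a counter value,
 * or null when the xliff has no <g> with that id.
 */
function translationFor(array $translations, int $count): ?string
{
    return $translations[(string) $count] ?? null;
}

var_dump(translationFor($translations, 219));
```

Inside the foreach of the example above, the sprintf("TEXT #%d", $textCount) placeholder could then be swapped for the looked-up text, falling back to the original content when the helper returns null.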