How to get the position of a hyperlink tag while parsing DOCX document.xml with PHP?

Question

My goal is to parse a DOCX file with PHP and extract every hyperlink in the following format:

<start of hyperlink(number of the first element of hyperlink in text)>, <end of hyperlink(number of the last element of hyperlink in text)>, <hyperlink text>

For example:

input: "Hello, absolutely terrible{adjective: distressing}(you cannot see this in .docx file) world!"

output: {19, 26, "adjective: distressing"}

So far I have code that extracts all hyperlinks as plain text, but I cannot get their position numbers within the text. Here is my code:

define("dir", "Dictations");
define("test_file", "Dictation_Text.docx");

/**
 * Collects the w:anchor attribute of every hyperlink in a DOCX file.
 *
 * @param string $filename Path to the .docx file
 * @return array|string List of anchors, or an error message on failure
 */
function getHyperLinks($filename) {
    $dataFile = "word/document.xml";
    if (pathinfo($filename, PATHINFO_EXTENSION) !== "docx") {
        return "DOCX files only supported";
    }
    $zip = new ZipArchive;
    if ($zip->open($filename) !== true) {
        return "Couldn't open archive " . $filename;
    }
    // Read the main document part, then release the archive handle.
    $data = $zip->getFromName($dataFile);
    $zip->close();
    if ($data === false) {
        return "File " . $dataFile . " couldn't be found in " . $filename;
    }
    $parser = xml_parser_create();
    xml_parse_into_struct($parser, $data, $values, $indexes);
    xml_parser_free($parser);
    $result = array();
    // xml_parse_into_struct upper-cases tag and attribute names by default.
    foreach (($indexes["W:HYPERLINK"] ?? array()) as $ind) {
        // Only opening tags carry attributes; skip links without w:anchor.
        if ($values[$ind]["type"] == "open" && isset($values[$ind]["attributes"]["W:ANCHOR"])) {
            $result[] = $values[$ind]["attributes"]["W:ANCHOR"];
        }
    }
    return $result;
}

#TODO: getting filename from front by $_GET
$document = dir . "/" . test_file;
$result = getHyperLinks($document);
if (is_array($result)) {
    foreach ($result as $res) {
        echo $res . "\n";
    }
}
else {
    echo $result;
}

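I cannot find any XML attribute that gives the starting position of a hyperlink. Please tell me how to get it, whether there is a way to get it from the XML object, or whether there is a more convenient way to parse a DOCX file that gives me all the information I need.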

Answer 1

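Your approach is generally fine, but you are looking in the wrong file for the link data.

In a .docx file, the target of a link is not stored in document.xml. Weird, right? document.xml only holds the w:hyperlink element, which refers to its target through an r:id attribute (internal links to a bookmark use w:anchor instead).

The targets themselves are listed in word/_rels/document.xml.rels (or header1.xml.rels, etc., for links in headers and other parts).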

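If you want to see the format, rename your .docx to .zip. You can then unzip it and look at all the .xml files inside. Each link gets one line of XML (a Relationship element), so if you only need the links themselves, you may not need to parse document.xml at all.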
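
As a rough illustration of reading the relationships part, here is a minimal sketch using PHP's DOM extension. The function name listHyperlinkTargets and the sample XML are made up for this example; the function takes the contents of word/_rels/document.xml.rels as a string and keeps only the relationships whose Type ends in "/hyperlink".

```php
<?php
// Hypothetical helper: maps relationship Id => Target for hyperlinks only.
// $relsXml is assumed to be the contents of word/_rels/document.xml.rels.
function listHyperlinkTargets(string $relsXml): array
{
    $doc = new DOMDocument();
    $doc->loadXML($relsXml);

    $targets = [];
    foreach ($doc->getElementsByTagName('Relationship') as $rel) {
        // Hyperlink relationships have a Type URI ending in "/hyperlink".
        if (substr($rel->getAttribute('Type'), -10) === '/hyperlink') {
            $targets[$rel->getAttribute('Id')] = $rel->getAttribute('Target');
        }
    }
    return $targets;
}

// Hand-written sample, not taken from a real document.
$sampleRels = <<<'XML'
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
  <Relationship Id="rId1" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/styles" Target="styles.xml"/>
  <Relationship Id="rId5" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/hyperlink" Target="https://example.com/" TargetMode="External"/>
</Relationships>
XML;

$targets = listHyperlinkTargets($sampleRels);
print_r($targets);
```

With a real file, the string could come from $zip->getFromName("word/_rels/document.xml.rels") on an opened ZipArchive.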

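If you do need the context, match each relationship's Id attribute against the r:id attribute of the corresponding w:hyperlink element in document.xml.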
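
For the positions asked about in the question, one possible approach (a sketch, not the only way) is to walk document.xml in document order, keep a running count of the characters seen in w:t elements, and record that count on entering and leaving each w:hyperlink. The names walkNode, textLength, and hyperlinkPositions are invented for this example. It assumes positions are 1-based counts of Unicode code points, and that paragraph breaks, tabs (w:tab), and line breaks (w:br) contribute nothing to the count; adjust that if your plain-text extraction inserts characters for them.

```php
<?php
const W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';
const R_NS = 'http://schemas.openxmlformats.org/officeDocument/2006/relationships';

// Counts UTF-8 code points without requiring the mbstring extension.
function textLength(string $text): int
{
    return (int) preg_match_all('/./su', $text);
}

// Depth-first walk that advances $offset on w:t and records w:hyperlink spans.
function walkNode(DOMNode $node, int &$offset, array &$links): void
{
    if ($node instanceof DOMElement && $node->namespaceURI === W_NS) {
        if ($node->localName === 't') {
            $offset += textLength($node->textContent);
            return;
        }
        if ($node->localName === 'hyperlink') {
            $start = $offset + 1; // 1-based position of the first linked character
            foreach ($node->childNodes as $child) {
                walkNode($child, $offset, $links);
            }
            $links[] = [
                'start'  => $start,
                'end'    => $offset, // position of the last linked character
                'anchor' => $node->getAttributeNS(W_NS, 'anchor'),
                'rId'    => $node->getAttributeNS(R_NS, 'id'),
            ];
            return;
        }
    }
    if (!$node->hasChildNodes()) {
        return;
    }
    foreach ($node->childNodes as $child) {
        walkNode($child, $offset, $links);
    }
}

// $documentXml is assumed to be the contents of word/document.xml.
function hyperlinkPositions(string $documentXml): array
{
    $doc = new DOMDocument();
    $doc->loadXML($documentXml);
    $offset = 0;
    $links = [];
    walkNode($doc->documentElement, $offset, $links);
    return $links;
}

// Hand-written sample modelled on the example in the question.
$sample = <<<'XML'
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:r><w:t xml:space="preserve">Hello, absolutely </w:t></w:r>
      <w:hyperlink w:anchor="adjective: distressing">
        <w:r><w:t>terrible</w:t></w:r>
      </w:hyperlink>
      <w:r><w:t xml:space="preserve"> world!</w:t></w:r>
    </w:p>
  </w:body>
</w:document>
XML;

$links = hyperlinkPositions($sample);
print_r($links); // for this sample: start 19, end 26
```

A hyperlink with no text yields an end that is one less than its start. For external links, the rId value is the key to look up in word/_rels/document.xml.rels; it is an empty string when the attribute is absent, as in this sample.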

php

xml

docx

hyperlink

xml-parsing