How do I parse a file with XML-like structure, but with self-closing tags next to content (instead of enclosing the content)

Question

I have a file with the following structure. It's not XML, but I need to somehow make JSON out of it.

So while I would expect the file to look like this:

<chapter>
<line> Some text which I want to grab. </line>
<line> Some more text which I want to grab. </line>
<line> Even more text which I want to grab. </line>
</chapter>

The structure is actually as follows:

<chapter>
<line /> Some text which I want to grab.
<line /> Some more text which I want to grab.
<line /> Even more text which I want to grab.
</chapter>

So the 'lines' of each chapter just sit next to the self-closing line tags. Can you recommend a way of grabbing these? Possibly in javascript / nodejs?

Answer 1

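The format is valid XML, so you can use regular XML techniques, i.e. DOMParser, to parse the content.

However, you need to be a little clever when parsing the lines: find each line element and collect all of its following siblings that are text nodes (there should be only one, but the code I've provided doesn't make any assumptions).

You didn't specify the output "structure", but here is one approach that outputs nested arrays: the first level is the chapters, and each chapter has an array of lines.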

var xml = `<chapter>
<line /> Some text which I want to grab.
<line /> Some more text which I want to grab.
<line /> Even more text which I want to grab.
</chapter>`;

// Parse the string as XML (DOMParser is available in browsers)
var parser = new DOMParser();
var content = parser.parseFromString(xml, 'application/xml');
var chapters = content.getElementsByTagName('chapter');

// Build an array of chapters, each of which is an array of line texts
var obj = [].reduce.call(chapters, function(result, chapter) {
    var lines = chapter.getElementsByTagName('line');
    result.push([].reduce.call(lines, function(result, line) {
        var text = '';
        // Concatenate the text nodes (nodeType 3) that directly follow this <line />
        for (var node = line.nextSibling; node && node.nodeType === 3; node = node.nextSibling) {
            text += node.nodeValue;
        }
        // The text keeps its surrounding whitespace; use text.trim() if that is unwanted
        result.push(text);
        return result;
    }, []));
    return result;
}, []);
console.log(JSON.stringify(obj));
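
Since the question also mentions nodejs: DOMParser is a browser API and is not a global in Node.js. As a rough, dependency-free alternative (a different technique from the DOM approach above, not a real XML parser), the same nested arrays can be sketched with plain string handling. The helper name parseChapters is hypothetical, and the sketch assumes that chapter tags carry no attributes, that nothing but plain text sits between the line tags, and that no XML entities need decoding.

```javascript
// Hypothetical helper (not part of the original answer): extracts the text that
// follows each self-closing <line /> tag, grouped by <chapter>.
// Assumes simple input: no attributes on <chapter>, no nested tags, no entities.
function parseChapters(source) {
    var chapters = [];
    var chapterPattern = /<chapter>([\s\S]*?)<\/chapter>/g;
    var match;
    while ((match = chapterPattern.exec(source)) !== null) {
        var lines = match[1]
            .split(/<line\s*\/>/)   // pieces of text between consecutive <line /> tags
            .slice(1)               // drop whatever precedes the first <line />
            .map(function (text) { return text.trim(); });
        chapters.push(lines);
    }
    return chapters;
}

var sample = `<chapter>
<line /> Some text which I want to grab.
<line /> Some more text which I want to grab.
<line /> Even more text which I want to grab.
</chapter>`;

console.log(JSON.stringify(parseChapters(sample)));
```

Unlike the DOM version, this one trims each line, and it would break on input that uses more of XML than the sample shows, so treat it as a starting point only.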

Addressing the comments, first some documentation:

DOMParser documentation

Array#reduce documentation

Function#call documentation

Now, to explain the [].reduce.call(array, fn) code:

[].reduce.call is shorthand for Array.prototype.reduce.call

getElementsByTagName returns an HTMLCollection. It behaves like an array, except that it isn't one. There are several ways to create an array from an HTMLCollection. The most primitive:

var array = [];
for(var i = 0; i < collection.length; i++) {
    array[i] = collection[i];
}

or

var array = Array.prototype.slice.call(collection);

or (ES2015+), which is not available in IE unless you polyfill it (see the documentation):

var array = Array.from(collection);

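However, using the .call method on [].reduce allows the first argument (the this argument) to be any array-like object (one with a length and indexed elements), not just an array, so it works just like calling array.reduce(fn) on the array from above. It's a way to treat an HTMLCollection as an array without needing an intermediate variable.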
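
To see the borrowing trick outside a browser, here is a minimal sketch that uses a plain object as a hypothetical stand-in for an HTMLCollection (indexed properties plus length, but no array methods). The names collection and join are made up for illustration.

```javascript
// Hypothetical stand-in for an HTMLCollection: indexed properties plus length,
// but none of the Array methods.
var collection = { 0: 'a', 1: 'b', 2: 'c', length: 3 };

// Reducer that simply concatenates the items
function join(result, item) {
    return result + item;
}

// Borrow Array.prototype.reduce directly, with the array-like object as `this`...
var direct = [].reduce.call(collection, join, '');

// ...or convert to a real array first, then call reduce on it.
var converted = Array.prototype.slice.call(collection).reduce(join, '');

console.log(typeof collection.reduce); // "undefined"
console.log(direct, converted);        // abc abc
```

Both routes give the same result; the first just skips the intermediate array.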


javascript

parsing

node.js

xml-parsing

domparser