R: XML: XPATH: Get title from html tag

Question

I have an HTML file that contains thousands of entries with the following structure.

<li class="li1">
  <div class="div1">
    <div class="div2">    
      <div class="div3">
        <a class="a1">
            <strong class="strong1">name</strong>
            <div class="div4">2ndname</div>
        </a>
        <small class="small1">
            <a href="URL" class="a2" title="INFO I WANT!">
                <div class="div5">time</div>
            </a>
        </small>
      </div>
      <p class="p1">Main info</p>



        </div>
    </div>
  </div>
</li>

I am using R with the CSS package to extract the information. This is what works so far.

library(XML)
library(CSS)
doc <- htmlParse("myfile")
name <- cssApply(doc, ".li1>.div1>.div2>.div3>.a1>.strong1", cssCharacter)
# the second name sits in div4, inside the same link as strong1
secondname <- cssApply(doc, ".li1>.div1>.div2>.div3>.a1>.div4", cssCharacter)

I want to get the value of the title attribute of the link, so I am using the XML package directly. I tried:

uh<-xpathApply(doc, "//li[@class='li1']/div[@class='div1']/div[@class='div2']/div[@class='div3']/small[@class='small1']/a[@class='a2']", xmlGetAttr, "title")

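But I only get NULL. Any help would be much appreciated. I have read "attribute value extraction in XML using R" and several other posts, but I can't find what I am doing wrong. Thanks again!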

Answer 1

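Actually, I had misunderstood the structure of the data in the file. The example written here works with the parser, but the real data did not, because I had missed one level of nesting.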
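
For anyone hitting the same thing, here is a minimal sketch of how such a missing level can be spotted and worked around. It assumes the extra level is a wrapper div between div2 and div3; the class name "extra" and the inline HTML string are made up for illustration and are not from the real file. The sketch counts matches for each growing prefix of the strict path, then swaps the child steps for the descendant axis (//) so the intermediate levels no longer matter.

```r
library(XML)

# Hypothetical entry with one unexpected wrapper level ("extra") around div3
html <- '<li class="li1">
  <div class="div1"><div class="div2"><div class="extra"><div class="div3">
    <small class="small1">
      <a href="URL" class="a2" title="INFO I WANT!">time</a>
    </small>
  </div></div></div></div>
</li>'
doc <- htmlParse(html, asText = TRUE)

# Steps of the strict child-by-child path used in the question
steps <- c("li[@class='li1']", "div[@class='div1']", "div[@class='div2']",
           "div[@class='div3']", "small[@class='small1']", "a[@class='a2']")

# Count matches for each growing prefix; the first zero marks the broken step
counts <- vapply(seq_along(steps), function(i) {
  path <- paste0("//", paste(steps[seq_len(i)], collapse = "/"))
  length(getNodeSet(doc, path))
}, integer(1))
print(data.frame(step = steps, matches = counts))

# The strict path finds nothing because div3 is not a direct child of div2
strict_path <- paste0("//", paste(steps, collapse = "/"))
strict <- xpathSApply(doc, strict_path, xmlGetAttr, "title")

# Descendant axis: any nesting depth between li1 and small1 is accepted
titles <- xpathSApply(doc, "//li[@class='li1']//small[@class='small1']/a[@class='a2']",
                      xmlGetAttr, "title")
print(titles)
```

The descendant version is less strict about the layout, so it can also match links you did not intend if the same classes appear elsewhere inside an entry. Checking the match counts first shows which level the path was missing.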

xml

xpath

r