Specify CSS when nth-child() changes between pages

Question

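I am scraping several HTML pages and can't specify a CSS path that works on all of them. The pages are laid out slightly differently, so the position given by nth-child(#) is off by one from page to page. The element I'm interested in, 'Unit Code', sits at nth-child(20) on some pages and at nth-child(21) on others.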

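I'll be running this against several hundred pages, so I need to work out how to deal with this change in position. The code below runs with nth-child(21) and, predictably, returns the wrong text for the second URL.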

I am using the rvest package.

library(rvest)
urls <- data.frame('site' = 1:2, 'urls' = c('https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-1&unit=SLE010',
                        'https://www.deakin.edu.au/current-students/unitguides/UnitGuide.php?year=2015&semester=TRI-2&unit=SLE339'))

urls$urls <- as.character(urls$urls)

# read_html() replaces html(), which later rvest releases deprecated
uCode <- sapply(seq_len(nrow(urls)), function(x)
               read_html(urls[x,2]) %>% 
               html_nodes(css='#wmt_content > div:nth-child(21) > p.STANDARD') %>% 
               html_text())

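The HTML for each page is large; the two pages are the first and second URLs above. The part of the HTML that contains the unit code, plus a few surrounding divs, looks like this: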

 <div class="UnitGuideElementItem">
    <a name="0-UNIT-CODE"></a>
    <p style="font-size: 100%;" class="BOLD">
        "Unit code"
        <br>
        "&nbsp;"
        <br>
    <p style="font-size: 100%" class="STANDARD">
        "SLE334"
        <br>
    </p>
  </div>
  <div class="UnitGuideElementItem">
    <a name="0-UNIT-TITLE"></a>
    <p style="font-size: 100%;" class="BOLD">
       "Unit title"
       <br>
       "&nbsp;"
       <br>
    <p style="font-size: 100%" class="STANDARD">
       "Medical Microbiology and Immunology"
        <br>
  </div>
  <div class="UnitGuideElementItem">
     <a name="0-CONTACT-HOURS"></a>
     <p style="font-size: 100%;" class="BOLD">
        "Contact hours"
        <br>
        "&nbsp;"
        <br>
     <p style="font-size: 100%" class="STANDARD">
        "3 x 1 hour class per week, 5 x 3 hour practicals per trimester."
     <br>
  </div>

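Apart from the 0-UNIT-CODE in the <a> tag, nothing in this part of the HTML distinguishes it from the other parts. From the w3schools page I worked out how to find the <a> tag, but I can't figure out how to specify its <p> sibling from that node. This gets me to the <a> tag: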

# Matches every element whose name attribute ends in "CODE" (here, the <a> anchor)
uCode <- sapply(seq_len(nrow(urls)), function(x)
               read_html(urls[x,2]) %>% 
               html_nodes(css='[name$=CODE]') %>% 
               html_text())

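Does anyone know how I can select the 'same' element, e.g. the sibling of name="0-UNIT-CODE", from HTML files when the element's position changes from page to page? Or, more generally, how do you return the content of a tag that can only be located through a tag of a different type with the same parent?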

Edit: Included the package name. Included links to the pages and more HTML for clarification.

Answer 1

I'm not sure what exactly you want to select. I think you want all the p elements that follow the given anchor:

a[name="0-UNIT-CODE"] ~ p {
   /* selects every p sibling that follows an anchor whose name attribute is "0-UNIT-CODE" */
}
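
A minimal sketch of how this selector could be passed to rvest, not tested against the real pages. The inline pages and the make_page() helper are invented for illustration and only mimic the structure shown in the question; narrowing the selector to p.STANDARD is an assumption on my part, so that the label paragraph (class BOLD) is skipped.

```r
library(rvest)

# Hypothetical helper: builds a small page with n_extra filler divs placed
# before the unit-code block, so the block's child position differs per page.
make_page <- function(n_extra, code) {
  filler <- paste(rep('<div class="filler"><p class="STANDARD">noise</p></div>',
                      n_extra), collapse = "")
  read_html(paste0(
    '<div id="wmt_content">', filler,
    '<div class="UnitGuideElementItem"><a name="0-UNIT-CODE"></a>',
    '<p class="BOLD">Unit code</p><p class="STANDARD">', code, '</p></div>',
    '<div class="UnitGuideElementItem"><a name="0-UNIT-TITLE"></a>',
    '<p class="BOLD">Unit title</p><p class="STANDARD">A title</p></div>',
    '</div>'))
}

# The unit-code div is the 3rd child on one page and the 4th on the other.
pages <- list(make_page(2, "SLE010"), make_page(3, "SLE339"))

# The general sibling combinator (~) locates the paragraph relative to the
# anchor, so the position of the enclosing div no longer matters.
css_codes <- vapply(pages, function(pg) {
  pg %>%
    html_nodes(css = 'a[name="0-UNIT-CODE"] ~ p.STANDARD') %>%
    html_text()
}, character(1))

print(css_codes)
```

Because the match is anchored on the name attribute rather than on a child index, the same selector should apply to both layouts described in the question.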

Answer 2

You can use XPath's "following-sibling" axis: "find the <p class=STANDARD> that is a sibling of <a name=0-UNIT-CODE> and comes after it".

# read_html() replaces html(), which later rvest releases deprecated
uCode <- sapply(seq_len(nrow(urls)), function(x)
               read_html(urls[x,2]) %>% 
               html_nodes(xpath="//a[@name='0-UNIT-CODE']/following-sibling::p[@class='STANDARD']") %>% 
               html_text())

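//a[@name='0-UNIT-CODE'] finds the <a> with name="0-UNIT-CODE" (note: I first expected //a[local-name()='0-UNIT-CODE'] to work, but local-name() returns the element's tag name, here a, rather than the value of its name attribute, so it matches nothing).
/following-sibling::p[@class='STANDARD'] selects the siblings that follow that <a> and are <p> elements with class STANDARD.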
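
As a rough sketch, the same XPath can be wrapped in a small helper so that other fields (for example 0-UNIT-TITLE) reuse it. The get_field() and build_page() functions and the inline pages below are hypothetical and only imitate the markup from the question; returning NA when nothing matches is an assumption about what is convenient when looping over hundreds of pages.

```r
library(rvest)

# Hypothetical helper: returns the text of the first <p class="STANDARD">
# that follows <a name="anchor_name">, or NA if the page has no such anchor.
get_field <- function(page, anchor_name) {
  xp <- sprintf("//a[@name='%s']/following-sibling::p[@class='STANDARD']",
                anchor_name)
  txt <- page %>% html_nodes(xpath = xp) %>% html_text()
  if (length(txt) == 0) NA_character_ else trimws(txt[1])
}

# Hypothetical stand-in pages: n_extra filler divs shift the unit-code block.
build_page <- function(n_extra, code) {
  filler <- paste(rep('<div><p class="STANDARD">noise</p></div>', n_extra),
                  collapse = "")
  read_html(paste0(
    '<div id="wmt_content">', filler,
    '<div class="UnitGuideElementItem"><a name="0-UNIT-CODE"></a>',
    '<p class="BOLD">Unit code<br></p>',
    '<p class="STANDARD">\n  ', code, ' <br></p></div></div>'))
}

pages <- list(build_page(19, "SLE010"), build_page(20, "SLE339"))

# One value per page, regardless of where the block sits.
unit_codes <- vapply(pages, get_field, character(1),
                     anchor_name = "0-UNIT-CODE")
print(unit_codes)
```

Returning NA instead of an empty result keeps the output the same length as the list of pages, which makes it easier to spot pages where the anchor is missing.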
