How to get an HTML element based on the later content of another tag rather than its class?

Question

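I am converting an HTML file into a nice, tidy CSV. The file is full of tables and has very few classes. There are three types of table, and they all share the same structure. The only difference is the content of the "th" elements that come after the element I am interested in. How can I get only the content of the tables that have specific text in the "th" ("text_that_I_want_to_get")? Is there a way to insert a class into each type of table with R?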

Type 1 table

<tr>
    <th class="array">text_that_I_want_to_get</th>
    <td class="array">
        <table>
            <thead>
                <tr>
                    <th class="string">name</th>
                    <th class="string">mean</th>
                    <th class="string">stdev</th>
                </tr>
            </thead>
            <tbody>

Type 2 table

<tr>
    <th class="array">text_that_I_want_to_get</th>
    <td class="array">
        <table>
            <thead>
                <tr>
                    <th class="string">name</th>
                    <th class="array">answers</th>
                </tr>
            </thead>
            <tbody>

Type 3 table

<tr>
    <th class="array">text_that_I_want_to_get</th>
    <td class="array">
        <table>
            <thead>
                <tr>
                    <th class="string">Reference</th>
                </tr>
            </thead>
            <tbody>

Answer 1

You need the following three XPath expressions:

xpath1 <- "//td[table[./thead/tr/th/text() = 'stdev']]/preceding-sibling::th"
xpath2 <- "//td[table[./thead/tr/th/text() = 'answers']]/preceding-sibling::th"
xpath3 <- "//td[table[./thead/tr/th/text() = 'Reference']]/preceding-sibling::th"

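Each expression first finds the td node at the root of one of the three table types (the td whose child table has a header cell with the given text), then steps back to its preceding th sibling, which holds the text you want.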
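To make the two steps visible, here is a minimal sketch that runs the first XPath in two stages on an inline HTML string. The fragment is a hypothetical completion of the type 1 snippet from the question: the closing tags, the empty tbody, and the outer table wrapper are assumptions, since the original snippets are truncated.

```r
library(rvest)

# Hypothetical, completed version of the "type 1" fragment.
# The outer <table> wrapper and all closing tags are assumptions.
html <- '
<table>
  <tr>
    <th class="array">text_that_I_want_to_get</th>
    <td class="array">
      <table>
        <thead>
          <tr>
            <th class="string">name</th>
            <th class="string">mean</th>
            <th class="string">stdev</th>
          </tr>
        </thead>
        <tbody></tbody>
      </table>
    </td>
  </tr>
</table>'

doc <- read_html(html)

# Step 1: the td whose child table has a header cell reading "stdev"
td <- html_nodes(doc, xpath = "//td[table[./thead/tr/th/text() = 'stdev']]")

# Step 2: from that td, step back to the th that precedes it in the same row
label <- html_nodes(td, xpath = "./preceding-sibling::th")
html_text(label)
```

Splitting the expression this way is only for illustration; the single combined XPath selects the same node.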

So, to get "text_that_I_want_to_get" for table type 1, you would run:

library(rvest)  # provides read_html(), html_nodes(), html_text() and %>%
read_html(url) %>% html_nodes(xpath = xpath1) %>% html_text()
#> [1] "text_that_I_want_to_get"

You can do the same with xpath2 and xpath3 to get the text for table type 2 and table type 3.
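The question also asks whether a class can be inserted into each table type with R. One possible approach, offered as a sketch rather than as part of the answer above, is to reuse the same header-text predicates to locate each nested table and tag it with xml2's xml_set_attr(). The make_row() helper, the labels label_a, label_b, and label_c, and the class names type1, type2, and type3 are all hypothetical and exist only to make the example self-contained.

```r
library(rvest)
library(xml2)

# Hypothetical helper: one outer row whose nested table has the given headers
make_row <- function(label, headers) {
  ths <- paste0("<th>", headers, "</th>", collapse = "")
  paste0("<tr><th class='array'>", label, "</th><td class='array'>",
         "<table><thead><tr>", ths, "</tr></thead><tbody></tbody></table>",
         "</td></tr>")
}

# Made-up document containing one table of each type
html <- paste0(
  "<table>",
  make_row("label_a", c("name", "mean", "stdev")),  # type 1
  make_row("label_b", c("name", "answers")),        # type 2
  make_row("label_c", "Reference"),                 # type 3
  "</table>"
)
doc <- read_html(html)

# Header text that identifies each table type (same text as in the XPaths above)
markers <- c(type1 = "stdev", type2 = "answers", type3 = "Reference")

# Tag each nested table with a class named after its type (modifies doc in place)
for (type in names(markers)) {
  xp <- sprintf("//td[table[./thead/tr/th/text() = '%s']]/table", markers[[type]])
  xml_set_attr(html_nodes(doc, xpath = xp), "class", type)
}

# Once tagged, each type can be selected by class instead of by header text
labels <- vapply(names(markers), function(type) {
  xp <- sprintf("//td[table[@class = '%s']]/preceding-sibling::th", type)
  html_text(html_nodes(doc, xpath = xp))
}, character(1))
labels
```

Whether tagging is worth the extra step depends on how often the tables are queried afterwards; for a single extraction, the three XPaths alone are enough.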

r

web-scraping

rvest

xml2