Scraping un-nested HTML with scrapy

Question

I am using the excellent scrapy project to try to scrape the following HTML:

<div id="bio">
    <b>Birthplace:&nbsp;</b><a href="/tags/?id=90" target="_blank">Ireland</a>
    <br>
    <b>Location:&nbsp;</b><a href="/tags/?id=294" target="_blank">London</a>, 
    <a href="/tags/?id=64" target="_blank">UK</a>
    <br>
    <b>Ethnicity:&nbsp;</b><a href="/tags/?id=4" target="_blank">Caucasian</a><br>
</div>

Another example (from a different page):

<div id="bio">
    <b>Birthplace:&nbsp;</b><a href="/tags/?id=100" target="_blank">United States</a>
    <br>
    <b>Location:&nbsp;</b><a href="/tags/?id=345" target="_blank">Baltimore</a>, 
    <a href="/tags/?id=190" target="_blank">Maryland</a>,
    <a href="/tags/?id=190" target="_blank">United States</a>
    <br>
    <b>Ethnicity:&nbsp;</b><a href="/tags/?id=4" target="_blank">Black</a><br>
</div>

The output I am looking for is:

["London", "UK"]
["Baltimore", "Maryland", "United States"]

As you can see, a state or province is sometimes included, so it is not as easy as just selecting the first 2 <a> tags.

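Solutions I can think of:

Detect the comma that immediately follows an <a> element, and stop when there is no comma (last element)
Find all <a> tags between the <b> element and the <br> element
Get a list of countries that have states/provinces and filter by value (I would rather not do this)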
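
The second idea (collecting the <a> tags that sit between the Location <b> label and the next <br>) can be sketched without XPath, using only Python's standard-library html.parser. This is a minimal sketch, not a scrapy feature: the names LocationParser, extract_location and SAMPLE are made up for illustration, and it assumes every field starts with a <b> label and ends at a <br> or at the next <b>.

```python
from html.parser import HTMLParser


class LocationParser(HTMLParser):
    """Collect the text of <a> tags between the Location label and the next <br>."""

    def __init__(self):
        super().__init__()
        self.in_label = False     # currently inside a <b> label
        self.in_location = False  # between the Location label and the next <br>/<b>
        self.in_link = False      # currently inside an <a> that belongs to Location
        self.places = []

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.in_label = True
            self.in_location = False  # a new label ends the previous field
        elif tag == "br":
            self.in_location = False  # a line break ends the field
        elif tag == "a" and self.in_location:
            self.in_link = True
            self.places.append("")

    def handle_endtag(self, tag):
        if tag == "b":
            self.in_label = False
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_label and data.strip().startswith("Location"):
            self.in_location = True
        elif self.in_link:
            self.places[-1] += data


def extract_location(html):
    """Return the list of link texts that follow the Location label."""
    parser = LocationParser()
    parser.feed(html)
    parser.close()
    return parser.places


# First sample page from the question.
SAMPLE = """
<div id="bio">
    <b>Birthplace:&nbsp;</b><a href="/tags/?id=90" target="_blank">Ireland</a>
    <br>
    <b>Location:&nbsp;</b><a href="/tags/?id=294" target="_blank">London</a>,
    <a href="/tags/?id=64" target="_blank">UK</a>
    <br>
    <b>Ethnicity:&nbsp;</b><a href="/tags/?id=4" target="_blank">Caucasian</a><br>
</div>
"""

print(extract_location(SAMPLE))
```

Because the parser stops at the next <br> or <b>, it does not need to know which label follows Location.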

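Edit:

To clarify, the 2 examples above are from different pages. Also, the Ethnicity element is sometimes absent; in its place there may be Birthday or some other field. The order of the Label: entries is not guaranteed, and the data is very unstructured, which is what makes this difficult.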

Answer 1

The following XPath expression:

//b[contains(.,'Location')]/following-sibling::a[not(preceding-sibling::b[contains(.,'Ethnicity')])]/text()

translates to

//b[contains(.,'Location')]       Select `b` elements anywhere in the document and only
                                  if their text content contains "Location"
/following-sibling::a             Of those `b` elements select following sibling
                                  elements `a` 
[not(preceding-sibling::b         but only if they (i.e. the `a` elements) are not
                                  preceded by a `b` element
[contains(.,'Ethnicity')])]       whose text nodes contain "Ethnicity"
/text()                           return all text nodes of those `a` elements

and yields (individual results separated by -------):

London
-----------------------
UK
-----------------------
Baltimore
-----------------------
Maryland
-----------------------
United States

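The expression relies on the fact that the a elements you are looking for sit between the b element containing Location and the b element containing Ethnicity. Is that always the case?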

EDIT: In response to your edit, try an expression like the following:

//b[contains(.,'Location')]/following-sibling::a[not(preceding-sibling::b[preceding-sibling::b[contains(.,'Location')]])]/text()
