Scraping un-nested html with scrapy

I'm using the excellent scrapy project to try to scrape the following HTML:

<div id="bio">
    <b>Birthplace:&nbsp;</b><a href="/tags/?id=90" target="_blank">Ireland</a>
    <br>
    <b>Location:&nbsp;</b><a href="/tags/?id=294" target="_blank">London</a>, 
    <a href="/tags/?id=64" target="_blank">UK</a>
    <br>
    <b>Ethnicity:&nbsp;</b><a href="/tags/?id=4" target="_blank">Caucasian</a><br>
</div>

Another example (from a different page):

<div id="bio">
    <b>Birthplace:&nbsp;</b><a href="/tags/?id=100" target="_blank">United States</a>
    <br>
    <b>Location:&nbsp;</b><a href="/tags/?id=345" target="_blank">Baltimore</a>, 
    <a href="/tags/?id=190" target="_blank">Maryland</a>,
    <a href="/tags/?id=190" target="_blank">United States</a>
    <br>
    <b>Ethnicity:&nbsp;</b><a href="/tags/?id=4" target="_blank">Black</a><br>
</div>

The output I'm looking for is:

["London", "UK"]
["Baltimore", "Maryland", "United States"]

As you can see, a state or province is sometimes included, so it isn't as easy as just selecting the first 2 <a> tags.

Solutions I can think of:

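EDIT:

To clarify, the 2 examples above come from different pages. Also, the <b>Ethnicity</b> element is sometimes missing; it could be Birthday or some other field instead. The order of the <b>Label:</b> elements is not guaranteed, and the data is very unstructured, which is what makes this difficult.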

The following XPath expression:

//b[contains(.,'Location')]/following-sibling::a[not(preceding-sibling::b[contains(.,'Ethnicity')])]/text()

translates to

//b[contains(.,'Location')]       Select `b` elements anywhere in the document and only
                                  if their text content contains "Location"
/following-sibling::a             Of those `b` elements select following sibling
                                  elements `a` 
[not(preceding-sibling::b         but only if they (i.e. the `a` elements) are not
                                  preceded by a `b` element
[contains(.,'Ethnicity')])]       whose text nodes contain "Ethnicity"
/text()                           return all text nodes of those `a` elements

and yields (individual results separated by -------)

London
-----------------------
UK
-----------------------
Baltimore
-----------------------
Maryland
-----------------------
United States

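It relies on the fact that the `a` elements you are looking for always sit between the `b` element that contains "Location" and the `b` element that contains "Ethnicity". Is that always the case?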
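For anyone who wants to try the expression outside a spider, here is a minimal sketch that runs it against the two sample snippets from the question. It uses lxml directly rather than Scrapy, which is my substitution to keep the example standalone; inside a Scrapy callback the same expression string should be usable with `response.xpath(...)`. The names `LOCATION_XPATH`, `PAGE_1`, `PAGE_2` and `extract_location` are made up for this illustration.

```python
from lxml import html

# The XPath expression from above, split across lines for readability.
LOCATION_XPATH = (
    "//b[contains(.,'Location')]"
    "/following-sibling::a"
    "[not(preceding-sibling::b[contains(.,'Ethnicity')])]"
    "/text()"
)

# The two sample snippets from the question (each from a different page).
PAGE_1 = """
<div id="bio">
    <b>Birthplace:&nbsp;</b><a href="/tags/?id=90" target="_blank">Ireland</a>
    <br>
    <b>Location:&nbsp;</b><a href="/tags/?id=294" target="_blank">London</a>,
    <a href="/tags/?id=64" target="_blank">UK</a>
    <br>
    <b>Ethnicity:&nbsp;</b><a href="/tags/?id=4" target="_blank">Caucasian</a><br>
</div>
"""

PAGE_2 = """
<div id="bio">
    <b>Birthplace:&nbsp;</b><a href="/tags/?id=100" target="_blank">United States</a>
    <br>
    <b>Location:&nbsp;</b><a href="/tags/?id=345" target="_blank">Baltimore</a>,
    <a href="/tags/?id=190" target="_blank">Maryland</a>,
    <a href="/tags/?id=190" target="_blank">United States</a>
    <br>
    <b>Ethnicity:&nbsp;</b><a href="/tags/?id=4" target="_blank">Black</a><br>
</div>
"""


def extract_location(page_source):
    """Return the link texts that belong to the Location label."""
    tree = html.document_fromstring(page_source)
    # text() results are str subclasses; convert them to plain str.
    return [str(text) for text in tree.xpath(LOCATION_XPATH)]


if __name__ == "__main__":
    print(extract_location(PAGE_1))  # ['London', 'UK']
    print(extract_location(PAGE_2))  # ['Baltimore', 'Maryland', 'United States']
```

Note that the "United States" link under Birthplace on the second page is not picked up, because it precedes the Location label and so is not on the `following-sibling` axis.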


EDIT: In response to your edit, try an expression along these lines:

//b[contains(.,'Location')]/following-sibling::a[not(preceding-sibling::b[preceding-sibling::b[contains(.,'Location')]])]/text()
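This version keeps only the `a` elements that have no other `b` label between them and the Location label, so it should not matter which label comes next, or whether one follows at all. Below is a small sketch to check that, again using lxml instead of Scrapy as a standalone stand-in. The two test pages are hypothetical: they reuse the London/UK links from the question, but the Birthday row and the label order are invented for illustration, as are the names `ANY_ORDER_XPATH` and `extract_location_any_order`.

```python
from lxml import html

# The expression from the EDIT above, split across lines for readability.
ANY_ORDER_XPATH = (
    "//b[contains(.,'Location')]"
    "/following-sibling::a"
    "[not(preceding-sibling::b"
    "[preceding-sibling::b[contains(.,'Location')]])]"
    "/text()"
)

# Hypothetical page: Location comes first and is followed by a Birthday
# label instead of Ethnicity (the Birthday value is invented).
LOCATION_FIRST_PAGE = """
<div id="bio">
    <b>Location:&nbsp;</b><a href="/tags/?id=294" target="_blank">London</a>,
    <a href="/tags/?id=64" target="_blank">UK</a>
    <br>
    <b>Birthday:&nbsp;</b><a href="#" target="_blank">1 January</a>
    <br>
    <b>Birthplace:&nbsp;</b><a href="/tags/?id=90" target="_blank">Ireland</a><br>
</div>
"""

# Hypothetical page: Location is the last label, so no <b> follows it.
LOCATION_LAST_PAGE = """
<div id="bio">
    <b>Birthplace:&nbsp;</b><a href="/tags/?id=90" target="_blank">Ireland</a>
    <br>
    <b>Location:&nbsp;</b><a href="/tags/?id=294" target="_blank">London</a>,
    <a href="/tags/?id=64" target="_blank">UK</a>
    <br>
</div>
"""


def extract_location_any_order(page_source):
    """Return the link texts between Location and the next label, if any."""
    tree = html.document_fromstring(page_source)
    return [str(text) for text in tree.xpath(ANY_ORDER_XPATH)]


if __name__ == "__main__":
    print(extract_location_any_order(LOCATION_FIRST_PAGE))  # ['London', 'UK']
    print(extract_location_any_order(LOCATION_LAST_PAGE))   # ['London', 'UK']
```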