Extracting all the text between two bold tags (<b>) using XPath

Question

This is my HTML element:

<div class="abstract-content selected" id="en-abstract">
    <p>
        <b>Introduction.</b> 
         Against the backdrop of increasing resistance to conventional antibiotics, bacteriocins represent an attractive alternative, given their potent activity, novel modes of action and perceived lack of issues with resistance.
        <b>Aim.</b>
         In this study, the nature of the antibacterial activity of a clinical isolate of 
        <i>Streptococcus gallolyticus</i>
         was investigated.
        <b>Methods.</b>
         Optimization of the production of an inhibitor from strain AB39 was performed using different broth media and supplements. Purification was carried out using size exclusion, ion exchange and HPLC. Gel diffusion agar overlay, MS/MS, 
        <i>de novo</i>
         peptide sequencing and genome mining were used in a proteogenomics approach to facilitate identification of the genetic basis for production of the inhibitor.
        <b>Results.</b>
         Strain AB39 was identified as representing 
        <i>Streptococcus gallolyticus</i>
         subsp. 
        <i>pasteurianus</i>
         and the successful production and purification of the AB39 peptide, named nisin P, with a mass of 3133.78 Da, was achieved using BHI broth with 10 % serum. Nisin P showed antibacterial activity towards clinical isolates of drug-resistant bacteria, including methicillin-resistant 
        <i>Staphylococcus aureus</i>
         , vancomycin-resistant 
        <i>Enterococcus</i>
         and penicillin-resistant 
        <i>Streptococcus pneumoniae</i>
         . In addition, the peptide exhibited significant stability towards high temperature, wide pH and certain proteolytic enzymes and displayed very low toxicity towards sheep red blood cells and Vero cells.
        <b>Conclusion.</b>
         To the best of our knowledge, this study represents the first production, purification and characterization of nisin P. Further study of nisin P may reveal its potential for treating or preventing infections caused by antibiotic-resistant Gram-positive bacteria, or those evading vaccination regimens.
    </p>
</div>

Here, I want to extract the "headings" from the <b> tags, and the corresponding values from the text that follows each of them.

Example: "AIM": In this study, the nature of the antibacterial activity of a clinical isolate of Streptococcus gallolyticus was investigated.

Is there any way to achieve this using XPath? Note: I am using Scrapy to do the extraction.

I have used

response.xpath("//p//text()[normalize-space()][preceding-sibling::*/self::b]")

which returns the values for all the headings as separate chunks:

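[u' Against the backdrop of increasing resistance to conventional antibiotics, bacteriocins represent an attractive alternative, given their potent activity, novel modes of action and perceived lack of issues with resistance.',
 u' In this study, the nature of the antibacterial activity of a clinical isolate of ',
 u' was investigated.',
 u' Optimization of the production of an inhibitor from strain AB39 was performed using different broth media and supplements. Purification was carried out using size exclusion, ion exchange and HPLC. Gel diffusion agar overlay, MS/MS, ',
 u' peptide sequencing and genome mining were used in a proteogenomics approach to facilitate identification of the genetic basis for production of the inhibitor.',
 u' Strain AB39 was identified as representing ',
 u' subsp. ',
 u' and the successful production and purification of the AB39 peptide, named nisin P, with a mass of 3133.78 Da, was achieved using BHI broth with 10 % serum. Nisin P showed antibacterial activity towards clinical isolates of drug-resistant bacteria, including methicillin-resistant ',
 u' , vancomycin-resistant ',
 u' and penicillin-resistant ',
 u' . In addition, the peptide exhibited significant stability towards high temperature, wide pH and certain proteolytic enzymes and displayed very low toxicity towards sheep red blood cells and Vero cells.',
 u' To the best of our knowledge, this study represents the first production, purification and characterization of nisin P. Further study of nisin P may reveal its potential for treating or preventing infections caused by antibiotic-resistant Gram-positive bacteria, or those evading vaccination regimens.\n \n\n \n ']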

Any guidance would be helpful!

Answer 1

The fastest way is probably to grab everything with string(//p) and then split it with ordinary text-manipulation functions.
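A minimal sketch of that idea in Python, assuming the markup is well-formed enough for the standard-library parser. SAMPLE is a shortened stand-in for the abstract, and split_by_headings is a hypothetical helper name. In a Scrapy callback the two inputs would presumably come from response.xpath("string(//p)").get() and response.xpath("//p/b/text()").getall() instead of ElementTree.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the abstract in the question (assumption).
SAMPLE = """<div class="abstract-content selected" id="en-abstract">
    <p>
        <b>Introduction.</b>
         Bacteriocins represent an attractive alternative.
        <b>Aim.</b>
         The activity of a clinical isolate of
        <i>Streptococcus gallolyticus</i>
         was investigated.
        <b>Conclusion.</b>
         This study represents the first production of nisin P.
    </p>
</div>"""


def split_by_headings(full_text, headings):
    """Map each heading to the text between it and the next heading."""
    sections = {}
    rest = full_text
    for i, heading in enumerate(headings):
        # Drop everything up to and including the current heading.
        _, _, rest = rest.partition(heading)
        if i + 1 < len(headings):
            # The body ends where the next heading starts.
            body = rest.partition(headings[i + 1])[0]
        else:
            # The last heading owns the remainder of the paragraph.
            body = rest
        # Collapse runs of whitespace, like XPath normalize-space().
        sections[heading] = " ".join(body.split())
    return sections


p = ET.fromstring(SAMPLE).find("p")
full_text = "".join(p.itertext())            # same idea as string(//p)
headings = [b.text for b in p.findall("b")]  # same idea as //p/b/text()
sections = split_by_headings(full_text, headings)

for heading, body in sections.items():
    print(heading, "->", body)
```

Like any split on heading text, this misfires if a heading string such as "Aim." also appears inside an earlier section body.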

With XPath, you can try the following.

Get all the headings (returns 5 elements):

//b/text()

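Get the corresponding descriptions, including the text of the italic tags, with these XPaths (returns 5*1 elements, one string per expression):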

normalize-space(substring-before(substring-after(string(//p),//b[.="Introduction."]),//b[.="Aim."]))
normalize-space(substring-before(substring-after(string(//p),//b[.="Aim."]),//b[.="Methods."]))
normalize-space(substring-before(substring-after(string(//p),//b[.="Methods."]),//b[.="Results."]))
normalize-space(substring-before(substring-after(string(//p),//b[.="Results."]),//b[.="Conclusion."]))
normalize-space(substring-after(string(//p),//b[.="Conclusion."]))

If you don't know the heading text in advance, you can index the <b> elements by position instead (//b[1], //b[2], ...). Use count(//b) to find the highest index.
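The positional variant can be sketched as a loop that builds the same substring expressions from an index. This is an illustration under assumptions: it uses lxml directly (Scrapy's selectors are built on it), SAMPLE is a shortened stand-in for the abstract, and the paths are narrowed to //p/b so that <b> elements elsewhere on a real page do not interfere.

```python
from lxml import html

# Shortened stand-in for the abstract in the question (assumption).
SAMPLE = """<div class="abstract-content selected" id="en-abstract">
    <p>
        <b>Introduction.</b>
         Bacteriocins represent an attractive alternative.
        <b>Aim.</b>
         The activity of a clinical isolate of
        <i>Streptococcus gallolyticus</i>
         was investigated.
        <b>Conclusion.</b>
         This study represents the first production of nisin P.
    </p>
</div>"""

tree = html.fromstring(SAMPLE)

# count() returns a float in XPath 1.0, so convert it for range().
n = int(tree.xpath("count(//p/b)"))

sections = {}
for i in range(1, n + 1):
    heading = tree.xpath("string((//p/b)[%d])" % i)
    if i < n:
        # Text between the i-th heading and the next one.
        expr = (
            "normalize-space(substring-before("
            "substring-after(string(//p), (//p/b)[%d]), (//p/b)[%d]))"
            % (i, i + 1)
        )
    else:
        # The last heading has no successor: take everything after it.
        expr = "normalize-space(substring-after(string(//p), (//p/b)[%d]))" % i
    sections[heading] = tree.xpath(expr)

for heading, body in sections.items():
    print(heading, "->", body)
```

In Scrapy, the same expression strings could be passed to response.xpath(expr).get(), which returns the string result.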

EDIT: Alternative XPaths:

normalize-space(//text()[preceding::b="Introduction." and following::b="Aim."])
normalize-space(//text()[preceding::b="Aim." and following::b="Methods."])
normalize-space(//text()[preceding::b="Methods." and following::b="Results."])
normalize-space(//text()[preceding::b="Results." and following::b="Conclusion."])
normalize-space(//text()[preceding::b="Conclusion."])
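One caveat with the alternative expressions: in XPath 1.0 (which Scrapy uses through lxml), normalize-space() applied to a node-set converts only the first node in document order to a string. Whenever a section contains an <i> element, the wrapped form therefore stops at the first text node. The sketch below, on a shortened stand-in for the abstract (SAMPLE is an assumption), compares the wrapped expression with selecting the text nodes and joining them in Python.

```python
from lxml import html

# Shortened stand-in for the abstract in the question (assumption).
SAMPLE = """<div class="abstract-content selected" id="en-abstract">
    <p>
        <b>Introduction.</b>
         Bacteriocins represent an attractive alternative.
        <b>Aim.</b>
         The activity of a clinical isolate of
        <i>Streptococcus gallolyticus</i>
         was investigated.
        <b>Conclusion.</b>
         This study represents the first production of nisin P.
    </p>
</div>"""

tree = html.fromstring(SAMPLE)

# All text nodes located after the "Aim." heading and before "Conclusion.".
between = '//text()[preceding::b="Aim." and following::b="Conclusion."]'

# Wrapped form: only the first matching text node is converted to a string.
wrapped = tree.xpath("normalize-space(%s)" % between)

# Unwrapped form: take every matching text node and join them in Python.
nodes = tree.xpath(between)
joined = " ".join("".join(nodes).split())

print(repr(wrapped))
print(repr(joined))
```

In a Scrapy callback, response.xpath(between).getall() should give the same list of strings to join.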


html

python

xpath

xpath-1.0

scrapy