Python lxml extract text when a tag exists in the middle of the text

Question

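I am trying to parse and extract all of the text inside the claim-text tags and prepare it for a CSV, so that each claim tag gets one column containing all of its claim text.

Claims come in basically two styles. In the first, <claim id="CLM-00001" num="00001">, claim-text tags are nested inside another claim-text tag. In the second, as in <claim id="CLM-00002" num="00002">, there is a <claim-ref> tag in the middle of the text (this seems to be my problem).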

<claims id="claims">
        <claim id="CLM-00001" num="00001">
            <claim-text>1. A method of forming an amorphous metal foam formed of an amorphous metal powder comprising:
                <claim-text>mixing at least one amorphous metal powder and at least one gas-splitting propellant powder into a propellant filled amorphous metal powder mixture, such that upon decomposition of the gas-splitting propellant powder, gas-containing pores are created within the amorphous metal powder mixture;</claim-text>
                <claim-text>compacting the mixture such that the amorphous metal powder particles are bonded to one another to form a gas-tight seal around the gas-splitting propellant powder particles, the mixture being compacted at a compacting temperature and pressure sufficient to allow for bonding of the mixture, wherein the temperature is below any crystalline transition temperature of the amorphous metal powder, and for a duration not exceeding a time for any crystalline transformation of said amorphous metal powder at the compacting temperature and pressure;</claim-text>
                <claim-text>cooling the compacted mixture at a cooling rate sufficient that the amorphous metal powder mixture remains amorphous;</claim-text>
                <claim-text>expanding the compacted amorphous metal powder mixture to form a foam material, said expansion being conducted at an expansion temperature below any crystalline transition temperature of the amorphous metal powder, but sufficiently high to allow bubble expansion, at a surrounding pressure sufficient to promote expansion arising from a difference between a pressure in the gas-containing pores and the surrounding pressure, and for a duration not exceeding the time for any crystalline transformation to take place; and</claim-text>
                <claim-text>cooling the expanded foam material in order to allow the foam material to remain amorphous.</claim-text>
            </claim-text>
        </claim>
        <claim id="CLM-00002" num="00002">
            <claim-text>2. The method according to <claim-ref idref="CLM-00001">claim 1</claim-ref> wherein the gas-splitting propellant powder decomposes during expansion.</claim-text>
        </claim>
        <claim id="CLM-00003" num="00003">
            <claim-text>3. The method according to <claim-ref idref="CLM-00001">claim 1</claim-ref> wherein the gas-splitting propellant powder decomposes during compaction.</claim-text>
        </claim>
...
...
...
</claims>

I have tried this: Python element tree - extract text from element, stripping tags
and this: python xml.etree.ElementTree remove empty tag in the middle of text

I tried the itertext() method, which gives me this for the first claim tag (it gives me everything I need for that column):

['1. A method of forming an amorphous metal foam formed of an amorphous metal powder comprising:\n                ', 'mixing at least one amorphous metal powder and at least one gas-splitting propellant powder into a propellant filled amorphous metal powder mixture, such that upon decomposition of the gas-splitting propellant powder, gas-containing pores are created within the amorphous metal powder mixture;', '\n                ', 'compacting the mixture such that the amorphous metal powder particles are bonded to one another to form a gas-tight seal around the gas-splitting propellant powder particles, the mixture being compacted at a compacting temperature and pressure sufficient to allow for bonding of the mixture, wherein the temperature is below any crystalline transition temperature of the amorphous metal powder, and for a duration not exceeding a time for any crystalline transformation of said amorphous metal powder at the compacting temperature and pressure;', '\n                ', 'cooling the compacted mixture at a cooling rate sufficient that the amorphous metal powder mixture remains amorphous;', '\n                ', 'expanding the compacted amorphous metal powder mixture to form a foam material, said expansion being conducted at an expansion temperature below any crystalline transition temperature of the amorphous metal powder, but sufficiently high to allow bubble expansion, at a surrounding pressure sufficient to promote expansion arising from a difference between a pressure in the gas-containing pores and the surrounding pressure, and for a duration not exceeding the time for any crystalline transformation to take place; and', '\n                ', 'cooling the expanded foam material in order to allow the foam material to remain amorphous.', '\n            ', '\n        ']

Now, for the next claim tag, <claim id="CLM-00002" num="00002">, it should ideally give me:

The method according to wherein the gas-splitting propellant powder decomposes during expansion.

But it gives me:

['2. The method according to ', '\n        ']

The code I am using is:

result = []
for doc in root.xpath('//claims/claim/claim-text'):
    textwork = doc.getparent().itertext('claim-text')
    b = []
    for texts in textwork:
        b.append(texts)

    result.append([b])
write_all_to_csv(result, FILENAME_CLAIMS)

Note: this code is a simplified version. I also extract other things from the claims that work fine; I shortened it to focus on the problem.

Answer 1

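Just remove the tag name from the itertext() call and it will extract all of the relevant text in the tag. When a tag name is passed, itertext() only yields text that belongs to elements with that tag, and in lxml the text that follows </claim-ref> is stored as the tail of the <claim-ref> element, so the filter drops it along with "claim 1". Hope this helps.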
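To see why the tag filter loses the text, it can help to look at where lxml stores each piece. The following is only a small sketch on a one-claim sample modeled on claim 2 from the question (the indentation whitespace is left out for brevity); the variable names are illustrative.

```python
from lxml import etree

# A one-claim sample mirroring claim 2 from the question.
claim = etree.fromstring(
    '<claim id="CLM-00002" num="00002">'
    '<claim-text>2. The method according to '
    '<claim-ref idref="CLM-00001">claim 1</claim-ref>'
    ' wherein the gas-splitting propellant powder decomposes during expansion.'
    '</claim-text></claim>'
)

claim_text = claim.find('claim-text')
claim_ref = claim_text.find('claim-ref')

# Text before the first child is stored on the parent as .text.
print(repr(claim_text.text))
# Text after </claim-ref> is stored on <claim-ref> itself as .tail.
print(repr(claim_ref.tail))

# Filtering by tag skips <claim-ref>, and its tail goes with it.
filtered = list(claim.itertext('claim-text'))
# No filter: every text and tail node, in document order.
unfiltered = list(claim.itertext())
full_text = ''.join(unfiltered)

print(filtered)
print(full_text)
```

The filtered list stops right before the reference, which matches the truncated output in the question, while the unfiltered call returns the whole sentence.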

from lxml import etree

# `xml` is assumed to hold the <claims> document from the question as a string.
root = etree.fromstring(xml)
result = []
for doc in root.xpath('//claims/claim/claim-text'):
    # itertext() without a tag filter yields every text and tail node under
    # <claim>, including the text inside and after <claim-ref>.
    textwork = ''.join(doc.getparent().itertext())
    result.append([textwork])
print(result)
# write_all_to_csv(result, FILENAME_CLAIMS)
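Since the goal is one CSV column per claim, the joined text will probably also need its newlines and indentation collapsed. Here is a minimal sketch of one way to do that, assuming an abbreviated stand-in for the claims XML; `claims_to_rows` is a hypothetical helper, not the `write_all_to_csv` function from the question, and the standard `csv` module writes to an in-memory buffer here instead of a file.

```python
import csv
import io

from lxml import etree


def claims_to_rows(root):
    """Return one [claim_id, text] row per <claim>, with whitespace collapsed."""
    rows = []
    for claim in root.iter('claim'):
        raw = ''.join(claim.itertext())
        # split() with no arguments breaks on any whitespace run, so the
        # newlines and indentation between nested <claim-text> tags collapse.
        text = ' '.join(raw.split())
        rows.append([claim.get('id'), text])
    return rows


# Abbreviated sample with the same two claim styles as the question.
sample = """<claims id="claims">
    <claim id="CLM-00001" num="00001">
        <claim-text>1. A method comprising:
            <claim-text>mixing two powders;</claim-text>
            <claim-text>cooling the mixture.</claim-text>
        </claim-text>
    </claim>
    <claim id="CLM-00002" num="00002">
        <claim-text>2. The method according to <claim-ref idref="CLM-00001">claim 1</claim-ref> wherein the powder decomposes.</claim-text>
    </claim>
</claims>"""

rows = claims_to_rows(etree.fromstring(sample))

# Write the rows as CSV; swap the buffer for an open file in real use.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Iterating over the claim elements directly avoids the getparent() round trip and yields exactly one row per claim, whichever of the two styles it uses.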

python

xml

lxml