Python XML: finding the specific location of a tag

I am currently using lxml.etree in Python to parse an XML file. I am running into some issues with extracting the text inside element tags.

Below is sample code illustrating my current problem.

<body> <P> Title 1 </P> This is a sample text after the the p tag that contains the title. </body>

<body> sample text sample text sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. </body>

<body> sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. <P> Another P tag not containing a title </P></body> 

<body> <P> Title 2 </P> This is a sample text after the the p tag that contains the title. </body>

My problem is as follows:

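When a title exists, I use the first P tag to capture the title of each body tag. The title is (in most cases) the first P tag directly after the body tag (see lines 1 and 4 of the sample code). I do not have a fixed list of title names, which is why I capture titles this way.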

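The problem arises when a body has no title but does contain a P tag somewhere that does not come directly after the body tag (see lines 2 and 3 of the sample code). The program takes the first P tag and its text as the title. In this case the P tag is not a title and should not be treated as one, but because it is, any text before the P tag is ignored and is not written to the new text file.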

To illustrate further, here is what currently gets written to the text file.

Title 1 : This is a sample text after the the p tag that contains the title.
not a title : This is sample text after a p tag that does not contain a title.
not a title : This is sample text after a p tag that does not contain a title. Another P tag not containing a title
Title 2 : This is a sample text after the the p tag that contains the title.

Desired output to the text file:

Title 1 : This is a sample text after the the p tag that contains the title.
sample text sample text sample text sample text not a title This is sample text after a p tag that does not contain a title.
sample text sample text not a title This is sample text after a p tag that does not contain a title. Another P tag not containing a title
Title 2 : This is a sample text after the the p tag that contains the title.

Possible solution:

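1. Is there a way to find the position of the first P tag? If the first P tag comes immediately after the body tag, I want to keep it. For any other P tag, I want to remove the tag but keep its text. I can do the latter with the strip_tags() function in lxml.etree.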
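For reference, a minimal sketch of how strip_tags() behaves on a body whose P tag is not at the start. The one-line sample string is a shortened, hypothetical version of the data above, and the variable names are only for illustration.

```python
from lxml import etree

# Hypothetical, shortened sample: the P tag does not directly follow <body>.
body = etree.fromstring(
    "<body>sample text <P>not a title</P> more text</body>"
)

# strip_tags() modifies the tree in place: the <P> elements are removed,
# but their text and tail are merged into the parent element.
etree.strip_tags(body, "P")

# With no child elements left, all of the text now lives in body.text.
flattened = body.text
```

Note that strip_tags() removes every matching P in the subtree it is given, so it would still need to be applied only to the bodies (or P tags) that do not hold a title.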

Any insight into this problem, or other possible solutions, would be greatly appreciated. Thanks in advance!

I was able to identify the titles using BeautifulSoup and a regular expression.

from bs4 import BeautifulSoup
import re


markup = """<body> <P> Title 1 </P> This is a sample text after the the p tag that contains the title. </body>

<body> sample text sample text sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. </body>

<body> sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. <P> Another P tag not containing a title </P></body> 

<body> <P> Title 2 </P> This is a sample text after the the p tag that contains the title. </body>"""


# html.parser lowercases tag names, so <P> is serialized back as <p>.
parsed = BeautifulSoup(markup, 'html.parser')

bodies = parsed.select('body')

for body in bodies:
    # A title exists only when <p> is the first thing after <body>,
    # with nothing but whitespace in between.
    has_title = re.match(r'<body>\s*<p>', str(body)) is not None
    if has_title:
        print(body.p.text)
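
The same check can probably be done without a regular expression, since an element's .text attribute holds whatever sits between its opening tag and its first child. The following is only a sketch under a few assumptions: the bodies are wrapped in a hypothetical <root> element so the string parses as one XML document, the sample bodies are shortened, and format_body is a made-up helper name. It falls back to the standard library's ElementTree when lxml is not installed, because only attributes shared by both APIs are used.

```python
try:
    from lxml import etree
except ImportError:  # the stdlib API is enough for this sketch
    import xml.etree.ElementTree as etree

# Shortened sample bodies, wrapped in a hypothetical <root> element
# so that the string parses as a single XML document.
markup = """<root>
<body> <P> Title 1 </P> Text after the title. </body>
<body> leading text <P> not a title </P> trailing text. </body>
<body> text <P> a </P> more <P> b </P></body>
</root>"""


def format_body(body):
    """Return 'title : text' if the first P directly follows <body>,
    otherwise all of the body's text with the P tags dropped."""
    # body.text is whatever sits between <body> and its first child.
    leading = (body.text or "").strip()
    has_title = len(body) > 0 and body[0].tag == "P" and not leading

    if has_title:
        title = (body[0].text or "").strip()
        # Everything after the title: its tail plus all later nodes.
        rest = [body[0].tail or ""]
        for child in body[1:]:
            rest.extend(child.itertext())
            rest.append(child.tail or "")
        return title + " : " + " ".join("".join(rest).split())

    # No title: flatten the whole body and normalise whitespace.
    return " ".join("".join(body.itertext()).split())


lines = [format_body(b) for b in etree.fromstring(markup).iter("body")]
```

Here lines holds one formatted string per body, ready to be written to the text file. Unlike the BeautifulSoup version, tag names stay case-sensitive, so the comparison is against "P" rather than "p".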