Python XML: finding the specific location of a tag
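I am currently parsing an XML file with lxml.etree in Python.
I am running into some problems extracting the text inside element tags.
Below is sample markup that shows my current problem.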
<body> <P> Title 1 </P> This is a sample text after the the p tag that contains the title. </body>
<body> sample text sample text sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. </body>
<body> sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. <P> Another P tag not containing a title </P></body>
<body> <P> Title 2 </P> This is a sample text after the the p tag that contains the title. </body>
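My problem is as follows:
I use the first P tag to capture the title of each body tag, if there is one. The title is (in most cases) the first P tag directly after the body tag (see lines 1 and 4 of the sample). I do not have a specific list of title names, which is why I capture titles this way.
The problem arises when a body has no title but contains a P tag somewhere that does not directly follow the body tag (lines 2 and 3 of the sample): the program takes that first P tag and its text as the title. In this case the P tag is not a title and should not be treated as one, but because it is, any text before the P tag is ignored and never written to the new text file.
To illustrate further, here is what currently gets written to the text file: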
Title 1 : This is a sample text after the the p tag that contains the title.
not a title : This is sample text after a p tag that does not contain a title.
not a title : This is sample text after a p tag that does not contain a title. Another P tag not containing a title
Title 2 : This is a sample text after the the p tag that contains the title.
Desired output to the text file:
Title 1 : This is a sample text after the the p tag that contains the title.
sample text sample text sample text sample text not a title This is sample text after a p tag that does not contain a title.
sample text sample text not a title This is sample text after a p tag that does not contain a title. Another P tag not containing a title
Title 2 : This is a sample text after the the p tag that contains the title.
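Possible solution:
1. Is there any way to find the position of the first P tag? If the first P tag comes immediately after the body tag, I want to keep it. Any other P tags I want to remove while keeping their text, which I can do with the strip_tags() function in lxml.etree.
Any insight into this problem, or other possible solutions, is greatly appreciated. Thanks in advance!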
I was able to identify the titles using BeautifulSoup and a regular expression.
from bs4 import BeautifulSoup
import re
markup = """<body> <P> Title 1 </P> This is a sample text after the the p tag that contains the title. </body>
<body> sample text sample text sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. </body>
<body> sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. <P> Another P tag not containing a title </P></body>
<body> <P> Title 2 </P> This is a sample text after the the p tag that contains the title. </body>"""
# html.parser lowercases tag names, so <P> becomes <p> in the parsed tree
soup = BeautifulSoup(markup, 'html.parser')
bodies = soup.select('body')
for body in bodies:
    # A title exists only when <p> directly follows <body>, whitespace aside
    has_title = re.search(r'<body>\s*<p>', str(body)) is not None
    if has_title:
        print(body.p.text)
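The same check can probably be done without a regular expression, using only the element tree: a body has a title when its first child is a P element and the text before that child (`body.text`) is empty or whitespace. The sketch below is one possible way to produce the desired output from the question, not a definitive implementation. The helper names `normalize`, `text_of`, and `body_to_line` are hypothetical, the `<root>` wrapper is an assumption added because the sample has several top-level body elements, and the text is read with `itertext()` instead of mutating the tree with `strip_tags()`. It only uses calls shared by lxml.etree and the standard library's ElementTree, so it falls back to the latter if lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # fallback: the calls used below also exist in the stdlib
    import xml.etree.ElementTree as etree

markup = """<body> <P> Title 1 </P> This is a sample text after the the p tag that contains the title. </body>
<body> sample text sample text sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. </body>
<body> sample text sample text <P> not a title </P> This is sample text after a p tag that does not contain a title. <P> Another P tag not containing a title </P></body>
<body> <P> Title 2 </P> This is a sample text after the the p tag that contains the title. </body>"""


def normalize(text):
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(text.split())


def text_of(element):
    """All text inside an element, excluding the element's own tail."""
    return "".join(element.itertext())


def body_to_line(body):
    """Return 'title : rest' when the first P directly follows <body>,
    otherwise the plain text of the whole body."""
    children = list(body)
    leading = (body.text or "").strip()  # text between <body> and the first child
    if children and children[0].tag == "P" and not leading:
        title_el = children[0]
        # Everything after the title: its tail, then each later sibling's text and tail
        parts = [title_el.tail or ""]
        for sibling in children[1:]:
            parts.append(text_of(sibling))
            parts.append(sibling.tail or "")
        return f"{normalize(text_of(title_el))} : {normalize(''.join(parts))}"
    # No title: keep all text, including text inside non-title P tags
    return normalize(text_of(body))


# Assumption: wrap the sample in a single root so it parses as one XML document
root = etree.fromstring("<root>" + markup + "</root>")
lines = [body_to_line(body) for body in root.iter("body")]
for line in lines:
    print(line)
```

Because the function only reads `text` and `tail`, the parsed tree is left unchanged, so the same elements can still be processed in other ways afterwards.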