Retrieving text data from <content:encoded> in XML file

I have an XML file that looks like this:

<rss version="2.0"
    xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:wfw="http://wellformedweb.org/CommentAPI/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:wp="http://wordpress.org/export/1.2/"
>

<channel>

<item>
        <title>Label: some_title&quot;</title>
        <link>some_link</link>
        <pubDate>some_date</pubDate>
        <dc:creator><![CDATA[University]]></dc:creator>
        <guid isPermaLink="false">https://link.link</guid>
        <description></description>
        <content:encoded><![CDATA[[vc_row][vc_column][vc_column_text]<strong>some text<a href="https://link.link" target="_blank" rel="noopener noreferrer">text</a> some more text</strong><!--more-->

[caption id="attachment_344" align="aligncenter" width="524"]<img class="-image-" src="link.link.png" alt="" width="524" height="316" /> <em>A <a href="link.link" target="_blank" rel="noopener noreferrer">screenshot</a> by the people</em>[/caption]

&nbsp;

<strong>some more text</strong>

&nbsp;
<div class="entry-content">

<em>Leave your comments</em>

</div>
<div class="post-meta wf-mobile-collapsed">
<div class="entry-meta"></div>
</div>
[/vc_column_text][/vc_column][/vc_row][vc_row][vc_column][/vc_column][/vc_row][vc_row][vc_column][dt_quote]<strong><b>RESEARCH | ARTICLE </b></strong>University[/dt_quote][/vc_column][/vc_row]]]></content:encoded>
        <excerpt:encoded><![CDATA[]]></excerpt:encoded>
</item>
some more <item> </item>s here
</channel>

I want to extract the raw text from the <content:encoded> section, without the tags and URLs. I have tried this with BeautifulSoup and Scrapy, as well as other lxml methods. Most of them return an empty list.

Is there a way I can retrieve this information without using regex?

Many thanks.

UPDATE

I open the XML file using:

from bs4 import BeautifulSoup as bs

content = []
with open(xml_file, "r") as file:
    content = file.readlines()
    content = "".join(content)
    xml = bs(content, "lxml")
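
Since the export uses namespaced tags such as <content:encoded>, parsing with the "xml" features keeps the prefixed tag names and the CDATA payload intact, which the answer below relies on. A minimal sketch of the same step (readlines() plus "".join() is simply file.read()), assuming lxml is installed and xml_file is the same path:

from bs4 import BeautifulSoup as bs

with open(xml_file, "r", encoding="utf-8") as file:
    xml = bs(file.read(), "xml")            # XML mode keeps the "content:" prefix

node = xml.find("content:encoded")          # the tag can be looked up by its prefixed name
print(node.text[:100] if node is not None else "no <content:encoded> found")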

Then I tried this with Scrapy:

from scrapy.http import HtmlResponse

response = HtmlResponse(url=xml_file, encoding='utf-8')

response.selector.register_namespace('content',
                                     'http://purl.org/rss/1.0/modules/content/')
response.xpath('channel/item/content:encoded').getall()

which returns an empty list.
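
A likely reason for the empty list is that the HtmlResponse is built from the file path alone, without a body, so there is nothing to select (and the XPath is relative rather than anchored at the root). A minimal sketch that feeds the file contents to a Scrapy Selector in XML mode, assuming xml_file is the same path as above:

from scrapy.selector import Selector

with open(xml_file, "r", encoding="utf-8") as f:
    sel = Selector(text=f.read(), type="xml")   # parse the text as XML, not HTML

sel.register_namespace('content',
                       'http://purl.org/rss/1.0/modules/content/')
# text() yields the CDATA payload of each <content:encoded> element
print(sel.xpath('//channel/item/content:encoded/text()').getall())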

And tried the code from the first answer:

soup = bs(xml.select_one("content:encoded").text, "html.parser")
text = "\n".join(
    s.get_text(strip=True, separator=" ") for s in soup.select("strong"))
print(text)

And got this error: Only the following pseudo-classes are implemented: nth-of-type.
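
The colon is the culprit here: in a CSS selector, :encoded is read as a pseudo-class (like :hover), hence the message. The answer below sidesteps this by parsing with the "xml" features and writing the namespace prefix with a pipe instead:

# ":encoded" is interpreted as a CSS pseudo-class -> the error above
xml.select_one("content:encoded")
# with a soup parsed via BeautifulSoup(..., "xml"), the prefix is written with a pipe
xml.select_one("content|encoded")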

When I open the file with lxml, I run this for loop:

import html

data = {}
n = 0

for item in xml.findall('item'):
  id = 'claim_id_' + str(n)
  keys = {}
  title = item.find('title').text
  keys['label'] = title.split(': ')[0]
  keys['claim'] = title.split(': ')[1]
  if item.find('content:encoded'):
    keys['text'] = bs(html.unescape(item.encoded.text), 'lxml')
  data[id] = keys
  print(data)
  n += 1

It saves the label and the claim nicely, but there is nothing for the text. Now that I open the file with BeautifulSoup, it returns this error: 'NoneType' object is not callable
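
Two details here seem to explain both symptoms. When the file is opened with BeautifulSoup, the method is find_all, not findall, so xml.findall is treated as a tag lookup, returns None, and calling it gives 'NoneType' object is not callable. When it is opened with lxml or ElementTree, the prefixed name needs an explicit namespace map, and the result of find() should be compared with None, because an element without children is falsy even when it was found. A minimal sketch of the lookup with ElementTree (lxml's find() accepts the same namespaces argument):

from xml.etree import ElementTree as ET

# the prefix declared on the <rss> root for <content:encoded>
NS = {"content": "http://purl.org/rss/1.0/modules/content/"}

root = ET.parse(xml_file).getroot()
for item in root.iter("item"):
    encoded = item.find("content:encoded", namespaces=NS)
    if encoded is not None and encoded.text:   # explicit None check, not truthiness
        print(encoded.text[:100])              # the CDATA payload, still full of HTML and shortcodes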

If you only need the text inside the <strong> tags, you can use my example. Otherwise, it looks like only regex will do here:

from bs4 import BeautifulSoup

xml_doc = """
<rss version="2.0"
    xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:wfw="http://wellformedweb.org/CommentAPI/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:wp="http://wordpress.org/export/1.2/"
>

...the XML from the question...

</rss>
"""

soup = BeautifulSoup(xml_doc, "xml")

soup = BeautifulSoup(soup.select_one("content|encoded").text, "html.parser")

text = "\n".join(
    s.get_text(strip=True, separator=" ") for s in soup.select("strong")
)
print(text)

Prints:

some text text some more text
some more text
RESEARCH | ARTICLE
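
If you need all of the text rather than only the <strong> parts, get_text() over the whole decoded fragment works too, but the [vc_row]/[caption] shortcodes are plain text to an HTML parser and come along with it, which is why the example above limits itself to <strong> (and why regex is the fallback). A sketch, where soup_xml stands for the document parsed with BeautifulSoup(xml_doc, "xml") in the first step:

# soup_xml = BeautifulSoup(xml_doc, "xml"), i.e. the first soup from the example above
inner = BeautifulSoup(soup_xml.select_one("content|encoded").text, "html.parser")
print(inner.get_text(" ", strip=True))   # includes shortcode markers such as [vc_row]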

In the end, I used regular expressions (regex) to get the text part.

import re
import xml.etree.ElementTree as ET

root = ET.parse(xml_file).getroot()   # assuming the tree comes from ElementTree; lxml's etree works the same way

for item in root.iter('item'):
  for grandchild in item:   # getchildren() is deprecated; iterating the element gives the same children
    if 'encoded' in grandchild.tag and grandchild.text:   # skip the empty <excerpt:encoded>
      text = grandchild.text
      text = re.sub(r'\[.*?\]', "", text)   # gets rid of the [shortcodes] in square brackets
      text = re.sub(r'<.*?>', "", text)     # gets rid of the HTML tags between < and >
      text = text.replace("&nbsp;", "")     # gets rid of &nbsp; entities
      text = " ".join(text.split())         # collapses runs of whitespace