Convert a string into a Beautiful Soup object

Question

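I'm new to Python and to posting here, so any help is much appreciated! I'm trying to use Beautiful Soup to dynamically parse more than 30 different RSS blog feeds. Surprisingly, they are not standardized. So I first created a list of all the potential XML tags I want to grab, which I named headers: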

headers = ['title', 'description', 'author', 'credit', 'pubDate', 'link', 'origLink']

Then I grab all the tags from the RSS feed I'm trying to scrape and put them into their own list, named tags:

import requests
from bs4 import BeautifulSoup as bs
requests.packages.urllib3.disable_warnings()

headers = ['title', 'description', 'author', 'credit', 'pubDate', 'link', 'origLink']

url = 'https://www.zdnet.com/blog/security/rss.xml'
resp = requests.get(url, verify=False)
soup = bs(resp.text, features='xml')
data = soup.find_all('item')

tags = [tag.name for tag in data[0].find_all()]
print(tags)

Then I build a new list of tags, n_tags, containing the elements that appear in both lists:

n_tags = [i for i in headers if i in tags]
print(n_tags)

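Then I loop over all the items in data (all the blog posts on the page), and for each one I loop over all the elements in my new tag list (all the tags relevant to that blog). Where I'm stuck is that n_tags is a list of strings, not soup objects.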

The manual way to parse the feed would be:

for item in data:
    print(item.title.text)
    print(item.description.text)
    print(item.pubDate.text)
    print(item.credit.text)
    print(item.link.text)

However, I want to loop over the list of tags and plug them into the code to get the content of the XML tags.

for item in data:
    for el in n_tags:
        content = item + "." + el + ".text"
        print(content)

This returns an error:

TypeError: unsupported operand type(s) for +: 'Tag' and 'str'

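I need to convert the strings in the list into soup "Tag" objects so that I can concatenate them. I tried converting the Tag object back to a string and rebuilding the whole string as a soup object, but with no success. It doesn't raise an error, it just returns None: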

content = str(item) + "." + el + ".text"
print(soup.content)

The closest I've gotten is:

for item in data:
    for el in n_tags:
        content = str(item) + "." + el + ".text"
        print(content)

It does return content, but not what I'm looking for: the ".text" doesn't seem to be applied, and the blog post content is repeated for every element in the list.

I'm out of ideas. Thanks for reading, and let me know if you have any questions.

Answer 1

I'm not sure I understand your question, but it seems you're trying to select the text from only specific elements in the RSS feed.

You can try this script (which uses a CSS selector):

import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.zdnet.com/blog/security/rss.xml'
soup = bs(requests.get(url).content, 'html.parser')

headers = ['title', 'description', 'author', 'credit', 'pubDate', 'link', 'origLink']

for tag in soup.select(','.join(headers)):
    print(tag.text)

Prints:

ZDNet | security RSS

Tue, 05 May 2020 00:15:23 +0000

ZDNet | security RSS

US financial industry regulator warns of widespread phishing campaign
FINRA warns of phishing campaign aimed at stealing members' Microsoft Office or SharePoint passwords.
Mon, 04 May 2020 23:29:00 +0000

Academics turn PC power units into speakers to leak secrets from air-gapped systems
POWER-SUPPLaY technique uses "singing capacitor" phenomenon for data exfiltration.
Mon, 04 May 2020 16:06:00 +0000

... and so on.
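
If you would rather keep the per-item loop from the question, one possible approach is to pass each tag name, as a plain string, to the item's find() method instead of building an attribute path by string concatenation. The following is only a minimal sketch: SAMPLE_FEED is a made-up inline feed standing in for the real response, and extract_items is a hypothetical helper name, not part of Beautiful Soup. It assumes lxml is installed, since features='xml' depends on it.

```python
from bs4 import BeautifulSoup

# Hypothetical inline feed standing in for the real RSS response.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <description>Summary one</description>
      <pubDate>Mon, 04 May 2020 23:29:00 +0000</pubDate>
      <link>https://example.com/1</link>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/2</link>
    </item>
  </channel>
</rss>"""

headers = ['title', 'description', 'author', 'credit', 'pubDate', 'link', 'origLink']


def extract_items(xml_text, wanted):
    """Return one dict per <item>, keyed by the wanted tag names that are present."""
    soup = BeautifulSoup(xml_text, features='xml')  # assumption: lxml is installed
    rows = []
    for item in soup.find_all('item'):
        row = {}
        for name in wanted:
            tag = item.find(name)  # look the tag up by its name string
            if tag is not None:    # skip tags this item does not have
                row[name] = tag.get_text(strip=True)
        rows.append(row)
    return rows


rows = extract_items(SAMPLE_FEED, headers)
for row in rows:
    print(row)
```

Because missing tags are skipped per item, there is no need to precompute n_tags from the first item only. The sketch uses the xml parser, as in the question, because it keeps tag names such as pubDate and origLink in their original case; html.parser lowercases tag names, so a case-sensitive find('pubDate') would not match there.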


python

rss

beautifulsoup

rss-reader

python-3.x