How to use an RSS feed in Python?

I have never worked with RSS feeds before, and I can't seem to find the feed's URL.

The page that offers the RSS feed:

https://www.sec.gov/edgar/browse/?CIK=717826&owner=exclude

I am using feedparser:

import feedparser

rss_url = 'https://www.sec.gov/edgar/browse/?CIK=717826/.rss'
Feed = feedparser.parse(rss_url)
pointer = Feed.entries[1]
# result is empty

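I think I am using the wrong link, and I can't seem to find the right one. I tried viewing the page source for the RSS button but found no link there. When I click the button, it downloads an XML file.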

Can someone help me understand how to find this link?

The link on the RSS button is the correct one:

https://data.sec.gov/rss?cik=717826&type=3,4,5&exclude=true&count=40

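Getting an XML document when you go there is also the expected behavior: RSS is an XML-based format, so the feedparser library works on the actual XML content. It parses that content and lets you access the result through a Python API.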

For example, on the page

https://www.sec.gov/edgar/browse/?CIK=717826&owner=exclude

the third row is

Securities to be offered to employees in employee benefit plans

and the RSS feed (in XML format) contains the same entry:

<entry>
    <category label="form type" scheme="https://www.sec.gov/" term="S-8" />
    <content-type type="text/xml">
      <acceptance-date-time>2022-01-10T06:13:38.000Z</acceptance-date-time>
      <accession-number>0001193125-22-004979</accession-number>
      <act>33</act>
      <file-number>333-262071</file-number>
      <filing-date>2022-01-10</filing-date>
      <filing-href>https://www.sec.gov/Archives/edgar/data/717826/000119312522004979/0001193125-22-004979-index.htm</filing-href>
      <film-number>22519561</film-number>
      <form-name>Securities to be offered to employees in employee benefit plans</form-name>
      <size>220338</size>
    </content-type>
    <id>urn:tag:sec.gov,2021:accession-number=0001193125-22-004979</id>
    <link href="https://www.sec.gov/Archives/edgar/data/717826/000119312522004979/0001193125-22-004979-index.htm" rel="alternate" type="text/html" />
    <summary type="html"> &lt;strong&gt;Filed:&lt;/strong&gt; 2022-01-10 &lt;strong&gt;AccNo:&lt;/strong&gt; 0001193125-22-004979 &lt;strong&gt;Size:&lt;/strong&gt; 221KB </summary>
    <title>Securities to be offered to employees in employee benefit plans</title>
    <updated>2022-02-23T20:41:16.245Z</updated>
  </entry>
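
As a rough illustration of the point that the feed is plain XML, the sketch below reads a trimmed copy of the entry above with Python's standard-library xml.etree.ElementTree. This is not what feedparser does internally. The trimmed XML string and the helper name `read_entry` are assumptions made for this example, and the real feed most likely declares an XML namespace on its root element that this fragment leaves out.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the entry shown above (assumption: no XML namespace,
# exactly as the fragment is printed in this answer).
ENTRY_XML = """
<entry>
  <category label="form type" scheme="https://www.sec.gov/" term="S-8" />
  <content-type type="text/xml">
    <accession-number>0001193125-22-004979</accession-number>
    <filing-date>2022-01-10</filing-date>
  </content-type>
  <link href="https://www.sec.gov/Archives/edgar/data/717826/000119312522004979/0001193125-22-004979-index.htm" rel="alternate" type="text/html" />
  <title>Securities to be offered to employees in employee benefit plans</title>
</entry>
"""


def read_entry(xml_text):
    """Return a few fields of one <entry> element as a plain dict."""
    entry = ET.fromstring(xml_text.strip())
    return {
        "form_type": entry.find("category").get("term"),
        "title": entry.findtext("title"),
        "link": entry.find("link").get("href"),
        "filing_date": entry.findtext("content-type/filing-date"),
    }


fields = read_entry(ENTRY_XML)
print(fields["form_type"], fields["filing_date"], fields["title"])
```

feedparser saves you from writing this kind of lookup by hand and also copes with namespaces and malformed feeds, which is why it is the more practical choice here.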

Update:

On the other hand, when you change the code to use the URL from the RSS button

from pprint import pprint

import feedparser

rss_url = 'https://data.sec.gov/rss?cik=717826&type=3,4,5&exclude=true&count=40'
f = feedparser.parse(rss_url)
pprint(f)

you will see that the site is blocking your request:

{'bozo': 1,
 'bozo_exception': SAXParseException('mismatched tag'),
 'encoding': 'utf-8',
 'entries': [],
 'feed': {'html': {'xmlns': 'http://www.w3.org/1999/xhtml'},
          'meta': {'content': 'text/html; charset=UTF-8',
                   'http-equiv': 'Content-Type'},
          'summary': '<div id="header">U.S. Securities and Exchange '
                     'Commission</div>\n'
                     '<div id="content">\n'
                     '<h1>Your Request Originates from an Undeclared Automated '
                     'Tool</h1>\n'
                     '<p>To allow for equitable access to all users, SEC '
                     'reserves the right to limit requests originating from '
                     'undeclared automated tools. Your request has been '
                     'identified as part of a network of automated tools '
                     'outside of the acceptable policy and will be managed '
                     'until action is taken to declare your traffic.</p>\n'
                     '\n'
                     '<p>Please declare your traffic by updating your user '
                     'agent to include company specific information.</p>\n'
                     '\n'
                     '\n'
                     '<p>For best practices on efficiently downloading '
                     'information from SEC.gov, including the latest EDGAR '
                     'filings, visit <a href="https://www.sec.gov/developer" '
                     'target="_blank">sec.gov/developer</a>. You can also <a '
                     'href="https://public.govdelivery.com/accounts/USSEC/subscriber/new?topic_id=USSEC_260" '
                     'target="_blank">sign up for email updates</a> on the SEC '
                     'open data program, including best practices that make it '
                     'more efficient to download data, and SEC.gov '
                     'enhancements that may impact scripted downloading '
                     'processes. For more information, contact <a '
                     'href="mailto:opendata@sec.gov">opendata@sec.gov</a>.</p>\n'
                     '\n'
                     '<p>For more information, please see the SEC’s <a '
                     'href="https://data.sec.gov/rss?cik=717826&amp;type=3,4,5&amp;exclude=true&amp;count=40#internet">Web '
                     'Site Privacy and Security Policy</a>. Thank you for your '
                     'interest in the U.S. Securities and Exchange '
                     'Commission.\n'
                     '<p>Reference ID: 0.563d1602.1645724603.4d26f4e</p>\n'
                     '<div class="grey_box">\n'
                     '<h2>More Information</h2>\n'
                     '<h3><a name="internet">Internet Security '
                     'Policy</a></h3>\n'
                     '\n'
                     '<p>By using this site, you are agreeing to security '
                     'monitoring and auditing. For security purposes, and to '
                     'ensure that the public service remains available to '
                     'users, this government computer system employs programs '
                     'to monitor network traffic to identify unauthorized '
                     'attempts to upload or change information or to otherwise '
                     'cause damage, including attempts to deny service to '
                     'users.</p>\n'
                     '\n'
                     '<p>Unauthorized attempts to upload information and/or '
                     'change information on any portion of this site are '
                     'strictly prohibited and are subject to prosecution under '
                     'the Computer Fraud and Abuse Act of 1986 and the '
                     'National Information Infrastructure Protection Act of '
                     '1996 (see Title 18 U.S.C. §§ 1001 and 1030).</p>\n'
                     '\n'
                     '<p>To ensure our website performs well for all users, '
                     'the SEC monitors the frequency of requests for SEC.gov '
                     'content to ensure automated searches do not impact the '
                     'ability of others to access SEC.gov content. We reserve '
                     'the right to block IP addresses that submit excessive '
                     'requests.  Current guidelines limit users to a total of '
                     'no more than 10 requests per second, regardless of the '
                     'number of machines used to submit requests. </p>\n'
                     '\n'
                     '<p>If a user or application submits more than 10 '
                     'requests per second, further requests from the IP '
                     'address(es) may be limited for a brief period. Once the '
                     'rate of requests has dropped below the threshold for 10 '
                     'minutes, the user may resume accessing content on '
                     'SEC.gov. This SEC practice is designed to limit '
                     'excessive automated searches on SEC.gov and is not '
                     'intended or expected to impact individuals browsing the '
                     'SEC.gov website. </p>\n'
                     '\n'
                     '<p>Note that this policy may change as the SEC manages '
                     'SEC.gov to ensure that the website performs efficiently '
                     'and remains available to all users.</p>\n'
                     '</div>\n'
                     '<br />\n'
                     '<p class="note"><b>Note:</b> We do not offer technical '
                     'support for developing or debugging scripted downloading '
                     'processes.</p>\n'
                     '</div>'},
 'headers': {'cache-control': 'max-age=0, no-cache, no-store',
             'connection': 'close',
             'content-encoding': 'gzip',
             'content-length': '2177',
             'content-type': 'text/html',
             'date': 'Thu, 24 Feb 2022 17:43:24 GMT',
             'expires': 'Thu, 24 Feb 2022 17:43:24 GMT',
             'mime-version': '1.0',
             'pragma': 'no-cache',
             'server': 'AkamaiGHost',
             'strict-transport-security': 'max-age=31536000 ; preload',
             'vary': 'Accept-Encoding'},
 'href': 'https://data.sec.gov/rss?cik=717826&type=3,4,5&exclude=true&count=40',
 'namespaces': {'xhtml': 'http://www.w3.org/1999/xhtml'},
 'status': 403,
 'version': ''}
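
The fields that give the block away in this output are `status`, `bozo`, and the empty `entries` list. A minimal sketch of a check on those fields follows. The helper name `describe_result` is hypothetical, and it relies only on dict-style `.get()` access, so it is exercised here with a plain dict standing in for the result printed above.

```python
def describe_result(parsed):
    """Return a short diagnosis for a feedparser-style result mapping."""
    status = parsed.get("status")  # only present when the feed came over HTTP
    entries = parsed.get("entries", [])
    if status is not None and status >= 400:
        return f"HTTP error {status}: the server refused the request"
    if parsed.get("bozo") and not entries:
        return "response could not be parsed as a feed"
    if not entries:
        return "feed parsed, but it contains no entries"
    return f"ok: {len(entries)} entries"


# Plain dict mirroring the relevant keys of the blocked response above.
blocked = {"bozo": 1, "entries": [], "status": 403, "version": ""}
print(describe_result(blocked))
```

Running a check like this before indexing into `entries` makes an empty result easier to explain than a bare IndexError.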

To fix this, see the developer section of the SEC documentation, in particular the part on programmatic access. You have to send a proper User-Agent:

from pprint import pprint

import feedparser

rss_url = 'https://data.sec.gov/rss?cik=717826&type=3,4,5&exclude=true&count=40'
f = feedparser.parse(rss_url, agent="Sample Company Name AdminContact@DOMAIN.com")
pprint(f)
print(len(f.entries))  # 21
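
Once the request goes through, each item in `f.entries` can be read like a dictionary. The sketch below collects the title and link of every entry. `list_filings` is a hypothetical helper, and the sample data is a hand-written stand-in built from the entry shown earlier, not live output.

```python
def list_filings(entries):
    """Return (title, link) pairs for feed entries, skipping incomplete ones."""
    filings = []
    for entry in entries:
        title = entry.get("title")
        link = entry.get("link")
        if title and link:
            filings.append((title.strip(), link))
    return filings


# Stand-in for f.entries (assumption: entries expose "title" and "link" keys).
sample_entries = [
    {
        "title": "Securities to be offered to employees in employee benefit plans",
        "link": "https://www.sec.gov/Archives/edgar/data/717826/000119312522004979/0001193125-22-004979-index.htm",
    },
    {"title": "Entry without a link"},
]

for title, link in list_filings(sample_entries):
    print(title, "->", link)
```

With the real feed you would call `list_filings(f.entries)`. Keep in mind that `f.entries[0]` is the first entry; the `Feed.entries[1]` in the question picks the second one.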