How to use an RSS feed in Python?
I have never worked with RSS feeds before, and I can't seem to find the URL of the feed.
The page that offers the RSS feed:
https://www.sec.gov/edgar/browse/?CIK=717826&owner=exclude
I am using feedparser:
import feedparser
rss_url = 'https://www.sec.gov/edgar/browse/?CIK=717826/.rss'
Feed = feedparser.parse(rss_url)
pointer = Feed.entries[1]
# result is empty
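I think I am using the wrong link, and I can't seem to find the right one. I tried viewing the page source around the RSS button, but did not find a link there. When I click the button, it downloads an XML file.
Can someone help me understand how to find this link?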
The link on the RSS button is the correct one:
https://data.sec.gov/rss?cik=717826&type=3,4,5&exclude=true&count=40
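Getting an XML document when you go there is also the expected behavior: RSS is an XML-based format, so the feedparser library works on the actual XML content. It parses that content and lets you access the result through a Python API.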
For example, on the page
https://www.sec.gov/edgar/browse/?CIK=717826&owner=exclude
the third row is
Securities to be offered to employees in employee benefit plans
and the RSS feed (in XML format) contains the same entry:
<entry>
<category label="form type" scheme="https://www.sec.gov/" term="S-8" />
<content-type type="text/xml">
<acceptance-date-time>2022-01-10T06:13:38.000Z</acceptance-date-time>
<accession-number>0001193125-22-004979</accession-number>
<act>33</act>
<file-number>333-262071</file-number>
<filing-date>2022-01-10</filing-date>
<filing-href>https://www.sec.gov/Archives/edgar/data/717826/000119312522004979/0001193125-22-004979-index.htm</filing-href>
<film-number>22519561</film-number>
<form-name>Securities to be offered to employees in employee benefit plans</form-name>
<size>220338</size>
</content-type>
<id>urn:tag:sec.gov,2021:accession-number=0001193125-22-004979</id>
<link href="https://www.sec.gov/Archives/edgar/data/717826/000119312522004979/0001193125-22-004979-index.htm" rel="alternate" type="text/html" />
<summary type="html"> <strong>Filed:</strong> 2022-01-10 <strong>AccNo:</strong> 0001193125-22-004979 <strong>Size:</strong> 221KB </summary>
<title>Securities to be offered to employees in employee benefit plans</title>
<updated>2022-02-23T20:41:16.245Z</updated>
</entry>
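As a rough illustration of how such an entry looks from the Python side, here is a minimal sketch. It assumes feedparser's usual mapping of feed elements to dict-like entries (`title`, `link`, `updated`, and `tags` for the `<category>` element). The helper name `summarize_entry` and the `sample_entry` dict are hypothetical stand-ins so the example runs without a network request.

```python
def summarize_entry(entry):
    """Pick a few commonly used fields from one dict-like feed entry."""
    # feedparser exposes <category> elements as a list of dicts under "tags".
    tags = entry.get("tags") or []
    return {
        "title": entry.get("title"),
        "link": entry.get("link"),
        "updated": entry.get("updated"),
        "form_type": tags[0].get("term") if tags else None,
    }


# Hand-written stand-in for one parsed entry (values copied from the XML above).
sample_entry = {
    "title": "Securities to be offered to employees in employee benefit plans",
    "link": "https://www.sec.gov/Archives/edgar/data/717826/"
            "000119312522004979/0001193125-22-004979-index.htm",
    "updated": "2022-02-23T20:41:16.245Z",
    "tags": [{"term": "S-8", "scheme": "https://www.sec.gov/", "label": "form type"}],
}

print(summarize_entry(sample_entry))
```

With a real parsed feed `f`, the same helper could be applied as `summarize_entry(f.entries[0])`. Fields missing from an entry simply come back as `None`.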
Update:
On the other hand, when you change your code to use the URL from the RSS button
from pprint import pprint
import feedparser
rss_url = 'https://data.sec.gov/rss?cik=717826&type=3,4,5&exclude=true&count=40'
f = feedparser.parse(rss_url)
pprint(f)
you will see that the site is blocking your request:
{'bozo': 1,
'bozo_exception': SAXParseException('mismatched tag'),
'encoding': 'utf-8',
'entries': [],
'feed': {'html': {'xmlns': 'http://www.w3.org/1999/xhtml'},
'meta': {'content': 'text/html; charset=UTF-8',
'http-equiv': 'Content-Type'},
'summary': '<div id="header">U.S. Securities and Exchange '
'Commission</div>\n'
'<div id="content">\n'
'<h1>Your Request Originates from an Undeclared Automated '
'Tool</h1>\n'
'<p>To allow for equitable access to all users, SEC '
'reserves the right to limit requests originating from '
'undeclared automated tools. Your request has been '
'identified as part of a network of automated tools '
'outside of the acceptable policy and will be managed '
'until action is taken to declare your traffic.</p>\n'
'\n'
'<p>Please declare your traffic by updating your user '
'agent to include company specific information.</p>\n'
'\n'
'\n'
'<p>For best practices on efficiently downloading '
'information from SEC.gov, including the latest EDGAR '
'filings, visit <a href="https://www.sec.gov/developer" '
'target="_blank">sec.gov/developer</a>. You can also <a '
'href="https://public.govdelivery.com/accounts/USSEC/subscriber/new?topic_id=USSEC_260" '
'target="_blank">sign up for email updates</a> on the SEC '
'open data program, including best practices that make it '
'more efficient to download data, and SEC.gov '
'enhancements that may impact scripted downloading '
'processes. For more information, contact <a '
'href="mailto:opendata@sec.gov">opendata@sec.gov</a>.</p>\n'
'\n'
'<p>For more information, please see the SEC’s <a '
'href="https://data.sec.gov/rss?cik=717826&type=3,4,5&exclude=true&count=40#internet">Web '
'Site Privacy and Security Policy</a>. Thank you for your '
'interest in the U.S. Securities and Exchange '
'Commission.\n'
'<p>Reference ID: 0.563d1602.1645724603.4d26f4e</p>\n'
'<div class="grey_box">\n'
'<h2>More Information</h2>\n'
'<h3><a name="internet">Internet Security '
'Policy</a></h3>\n'
'\n'
'<p>By using this site, you are agreeing to security '
'monitoring and auditing. For security purposes, and to '
'ensure that the public service remains available to '
'users, this government computer system employs programs '
'to monitor network traffic to identify unauthorized '
'attempts to upload or change information or to otherwise '
'cause damage, including attempts to deny service to '
'users.</p>\n'
'\n'
'<p>Unauthorized attempts to upload information and/or '
'change information on any portion of this site are '
'strictly prohibited and are subject to prosecution under '
'the Computer Fraud and Abuse Act of 1986 and the '
'National Information Infrastructure Protection Act of '
'1996 (see Title 18 U.S.C. §§ 1001 and 1030).</p>\n'
'\n'
'<p>To ensure our website performs well for all users, '
'the SEC monitors the frequency of requests for SEC.gov '
'content to ensure automated searches do not impact the '
'ability of others to access SEC.gov content. We reserve '
'the right to block IP addresses that submit excessive '
'requests. Current guidelines limit users to a total of '
'no more than 10 requests per second, regardless of the '
'number of machines used to submit requests. </p>\n'
'\n'
'<p>If a user or application submits more than 10 '
'requests per second, further requests from the IP '
'address(es) may be limited for a brief period. Once the '
'rate of requests has dropped below the threshold for 10 '
'minutes, the user may resume accessing content on '
'SEC.gov. This SEC practice is designed to limit '
'excessive automated searches on SEC.gov and is not '
'intended or expected to impact individuals browsing the '
'SEC.gov website. </p>\n'
'\n'
'<p>Note that this policy may change as the SEC manages '
'SEC.gov to ensure that the website performs efficiently '
'and remains available to all users.</p>\n'
'</div>\n'
'<br />\n'
'<p class="note"><b>Note:</b> We do not offer technical '
'support for developing or debugging scripted downloading '
'processes.</p>\n'
'</div>'},
'headers': {'cache-control': 'max-age=0, no-cache, no-store',
'connection': 'close',
'content-encoding': 'gzip',
'content-length': '2177',
'content-type': 'text/html',
'date': 'Thu, 24 Feb 2022 17:43:24 GMT',
'expires': 'Thu, 24 Feb 2022 17:43:24 GMT',
'mime-version': '1.0',
'pragma': 'no-cache',
'server': 'AkamaiGHost',
'strict-transport-security': 'max-age=31536000 ; preload',
'vary': 'Accept-Encoding'},
'href': 'https://data.sec.gov/rss?cik=717826&type=3,4,5&exclude=true&count=40',
'namespaces': {'xhtml': 'http://www.w3.org/1999/xhtml'},
'status': 403,
'version': ''}
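A result like the one above can be recognized in code instead of by reading the full dump. The following is only a sketch: the helper name `feed_problem` is hypothetical, and it relies on the `status`, `bozo`, `bozo_exception` and `entries` keys that appear in the output above. The `blocked` dict is a trimmed, hand-written copy of that output so the example runs offline.

```python
def feed_problem(result):
    """Return a short description of what went wrong, or None if the feed looks usable."""
    status = result.get("status")
    # "status" is only present when the feed was fetched over HTTP.
    if status is not None and status >= 400:
        return f"HTTP {status}: the server refused the request"
    entries = result.get("entries")
    # bozo can also be set for minor issues, so only treat it as fatal
    # when nothing could be parsed at all.
    if result.get("bozo") and not entries:
        return f"malformed feed: {result.get('bozo_exception')}"
    if not entries:
        return "feed parsed but contains no entries"
    return None


# Trimmed copy of the blocked response shown above.
blocked = {"bozo": 1, "bozo_exception": "mismatched tag", "entries": [], "status": 403}
print(feed_problem(blocked))
```

Checking the status first matters here because the 403 page is HTML, so the parse error is only a symptom of the blocked request.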
To fix this, have a look at the developer section of the SEC documentation, in particular the part on programmatic access. You have to send a proper User-Agent:
from pprint import pprint
import feedparser
rss_url = 'https://data.sec.gov/rss?cik=717826&type=3,4,5&exclude=true&count=40'
f = feedparser.parse(rss_url, agent="Sample Company Name AdminContact@DOMAIN.com")
pprint(f)
print(len(f.entries)) # 21
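If the feed is needed for more than one company, the URL from the RSS button can be assembled rather than copied by hand. This is a sketch under assumptions: the helper `build_rss_url` is hypothetical, and the meaning of the `type`, `exclude` and `count` query parameters is inferred only from the single URL shown in this thread, not from SEC documentation.

```python
from urllib.parse import urlencode


def build_rss_url(cik, types=(3, 4, 5), exclude=True, count=40):
    """Rebuild the RSS-button URL for a given CIK (parameter meanings are assumed)."""
    params = {
        "cik": cik,
        "type": ",".join(str(t) for t in types),
        "exclude": "true" if exclude else "false",
        "count": count,
    }
    # safe="," keeps the commas in "3,4,5" unescaped, matching the original URL.
    return "https://data.sec.gov/rss?" + urlencode(params, safe=",")


print(build_rss_url(717826))
```

The returned string can then be passed to `feedparser.parse(...)` together with the `agent` argument, as in the snippet above.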