I am scraping news from an RSS feed using Python 3.7, but I am not getting the exact information. Help me get the proper data

Question

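Here I am trying to fetch news from an RSS feed, but I am not getting accurate information. I am using requests and BeautifulSoup to do this. I have the following object.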

<item>
 <title>
  US making very good headway in respect to Covid-19 vaccines: Donald Trump
 </title>
 <description>
  <a href="https://timesofindia.indiatimes.com/international/us/us-making-very-good-headway-in-respect-to-covid-19-vaccines-donald-trump/articleshow/76399892.cms"><img border="0" hspace="10" align="left" style="margin-top:3px;margin-right:5px;" src="https://timesofindia.indiatimes.com/photo/76399892.cms" /></a>Washington, Jun 16 () The United States is making very good headway in respect to vaccines for the coronavirus pandemic and also therapeutically, President Donald Trump has said.
 </description>
 <link>
  https://timesofindia.indiatimes.com/international/us/us-making-very-good-headway-in-respect-to-covid-19-vaccines-donald-trump/articleshow/76399892.cms
 </link>
 <guid>
  https://timesofindia.indiatimes.com/international/us/us-making-very-good-headway-in-respect-to-covid-19-vaccines-donald-trump/articleshow/76399892.cms
 </guid>
 <pubDate>
  Mon, 15 Jun 2020 22:11:06 PT
 </pubDate>
</item>

The code I am using for this is here:

def timesofindiaNews():
    URL = 'https://timesofindia.indiatimes.com/rssfeeds_us/72258322.cms'

    page = requests.get(URL)
    soup = BeautifulSoup(page.content, features = 'xml')

    # print(soup.prettify())

    news_elems = soup.find_all('item')
    news = []
    print(news_elems[0].prettify())
    for news_elem in news_elems:

        title = news_elem.title.text
        news_description = news_elem.description.text       
        image = news_elem.description.img
        # news_date = news_elem.pubDate.text
        news_link = news_elem.link.text

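I want only the description text from the description tag, but the tag also contains extra details (the link and image markup) that I do not need in the description. The code above gives the following output.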

    {
      "image": null,
      "news_description": "<a href=\"https://timesofindia.indiatimes.com/international/us/us-making-very-good-headway-in-respect-to-covid-19-vaccines-donald-trump/articleshow/76399892.cms\"><img border=\"0\" hspace=\"10\" align=\"left\" style=\"margin-top:3px;margin-right:5px;\" src=\"https://timesofindia.indiatimes.com/photo/76399892.cms\" /></a>Washington, Jun 16 () The United States is making very good headway in respect to vaccines for the coronavirus pandemic and also therapeutically, President Donald Trump has said.",
      "news_link": "https://timesofindia.indiatimes.com/international/us/us-making-very-good-headway-in-respect-to-covid-19-vaccines-donald-trump/articleshow/76399892.cms",
      "source": "trucknews",
      "title": "US making very good headway in respect to Covid-19 vaccines: Donald Trump"
    }

Expected output:

    {
      "image": "image/link/from/the/description",
      "news_description": "Washington, Jun 16 () The United States is making very good headway in respect to vaccines for the coronavirus pandemic and also therapeutically, President Donald Trump has said.",
      "news_link": "https://timesofindia.indiatimes.com/international/us/us-making-very-good-headway-in-respect-to-covid-19-vaccines-donald-trump/articleshow/76399892.cms",
      "source": "trucknews",
      "title": "US making very good headway in respect to Covid-19 vaccines: Donald Trump"
    }

Answer 1

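Inside the description, < and > are escaped to &lt; and &gt;. That is why I use formatter=None and then re-parse the result so the markup can be handled; see news_description in the code. I think this gives the result you want. You can try: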

import requests
from bs4 import BeautifulSoup
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3"}


def timesofindiaNews():
    URL = 'https://timesofindia.indiatimes.com/rssfeeds_us/72258322.cms'

    page = requests.get(URL,headers=headers)
    soup = BeautifulSoup(page.text, 'xml')

    # print(soup.prettify())

    news_elems = soup.find_all('item')
    news = []
    # print(news_elems[0].prettify())
    for news_elem in news_elems:

        title = news_elem.title.text
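        n_description = news_elem.description
        # formatter=None emits the description without escaping < and >,
        # so the embedded <a>/<img> markup can be parsed again
        store = n_description.prettify(formatter=None)
        sp = BeautifulSoup(store, 'xml')
        # the text node right after the <a> tag is the actual description
        news_description = sp.find("a").next_sibling
        print(news_description)
        # take the image URL from the re-parsed markup
        image = sp.find("img")["src"]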
        # news_date = news_elem.pubDate.text
        news_link = news_elem.link.text


timesofindiaNews()

The output will be:

Washington, Jun 16 () The United States is making very good headway in respect to vaccines for the coronavirus pandemic and also therapeutically, President Donald Trump has said.

The proposed suspension could extend into the government's new fiscal year beginning October 1, when many new visas are issued, The Wall Street Journal reported on Thursday, quoting unnamed administration officials.

The team of researchers at the University of Georgia (UGA) in the US noted that the SARS-CoV-2 protein PLpro is essential for the replication and the ability of the virus to suppress host immune function.

After two weeks of protests over the death of George Floyd, hundreds of New Yorkers took to the streets again calling for reform in law enforcement and the withdrawal of police department funding.

Indian-origin California Senator Kamala Harris has joined former vice president and 2020 Democratic presidential nominee Joe Biden to raise USD 3.5 million for the upcoming November elections.


and so on....
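
As an alternative sketch (not the approach used in the answer above), the same split of the description into image link and plain text can be done with only the standard library: xml.etree.ElementTree reads the feed, and html.parser walks the HTML fragment stored in each description. The names SAMPLE_RSS, DescriptionParser and parse_items, and the example.com URLs, are made up for illustration. The sample assumes the description arrives entity-escaped, as described above. For the real feed you would pass page.content from requests to parse_items instead of the sample string.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample shaped like the <item> in the question.
# The HTML inside <description> is entity-escaped (assumption).
SAMPLE_RSS = """<rss version="2.0"><channel>
<item>
  <title>
    US making very good headway in respect to Covid-19 vaccines
  </title>
  <description>&lt;a href="https://example.com/article.cms"&gt;&lt;img border="0" src="https://example.com/photo.cms" /&gt;&lt;/a&gt;Washington, Jun 16 () The United States is making very good headway.</description>
  <link>https://example.com/article.cms</link>
</item>
</channel></rss>"""


class DescriptionParser(HTMLParser):
    """Collects the first <img> src and the visible text of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.image = None
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />
        if tag == "img" and self.image is None:
            self.image = dict(attrs).get("src")

    def handle_data(self, data):
        self.text_parts.append(data)


def parse_items(rss_text):
    """Return one dict per <item>, with the description split into text and image."""
    root = ET.fromstring(rss_text)
    news = []
    for item in root.iter("item"):
        parser = DescriptionParser()
        # ElementTree already unescapes &lt; and &gt;, so this is raw HTML
        parser.feed(item.findtext("description", default=""))
        parser.close()
        news.append({
            "title": item.findtext("title", default="").strip(),
            # collapse runs of whitespace left over from the markup
            "news_description": " ".join("".join(parser.text_parts).split()),
            "image": parser.image,
            "news_link": item.findtext("link", default="").strip(),
        })
    return news


if __name__ == "__main__":
    for entry in parse_items(SAMPLE_RSS):
        print(entry)
```

Because the text is collected from every text node rather than from the sibling of the a tag, this sketch also copes with items whose description has no link or image; in that case image simply stays None.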

python

xml

rss