Regex BS4 output for part of <link>
I am parsing this RSS feed:
https://www.google.com/alerts/feeds/12700550304290381537/6239785894655863043
I am using the following code:
import requests
from bs4 import BeautifulSoup

url = "https://www.google.com/alerts/feeds/12700550304290381537/6239785894655863043"
resp = requests.get(url)
soup = BeautifulSoup(resp.content, features='xml')
items = soup.findAll('entry')

news_items = []
for item in items:
    news_item = {}
    news_item['title'] = item.title.text
    news_item['link'] = item.link['href']
    news_item['published'] = item.published.text
    news_item['source'] = item.link
    news_items.append(news_item)

news_items[0]
I get the following output:
{'link': <link href="https://www.google.com/url?rct=j&sa=t&url=https://duitslandinstituut.nl/artikel/38250/duitsland-lanceert-corona-tracing-app&ct=ga&cd=CAIyGWFlODkwMWNhMWM0YmE4ODU6bmw6bmw6Tkw&usg=AFQjCNHDFPconO3h8mpzJh92x4HrjPL2tQ"/>,
'published': '2020-06-11T15:33:11Z',
'source': <link href="https://www.google.com/url?rct=j&sa=t&url=https://duitslandinstituut.nl/artikel/38250/duitsland-lanceert-corona-tracing-app&ct=ga&cd=CAIyGWFlODkwMWNhMWM0YmE4ODU6bmw6bmw6Tkw&usg=AFQjCNHDFPconO3h8mpzJh92x4HrjPL2tQ"/>,
'title': 'Duitsland lanceert <b>corona</b>-tracing-<b>app</b>'}
However, the output I am looking for is:
{'link': 'https://duitslandinstituut.nl/artikel/38250/duitsland-lanceert-corona-tracing-app&ct=ga&cd=CAIyGWFlODkwMWNhMWM0YmE4ODU6bmw6bmw6Tkw&usg=AFQjCNHDFPconO3h8mpzJh92x4HrjPL2tQ',
'published': '2020-06-11T15:33:11Z',
'source': 'Duitslandinstituut',
'title': 'Duitsland lanceert corona-tracing-app'}
So, first, I want to get rid of the Google redirect part of the link. Second, I would like the source to be the name that follows the second 'https://', with a capital letter. Third, I want to strip any markup such as <b> tags from the title. I intend to put the results into a bibliography, so the text cannot contain any computer code.
I have tried to fix this in BS4 but without success. Someone suggested doing it with regex after loading the data into a pandas DataFrame, but I am not familiar with regex and the examples are hard to follow. Does anyone have a solution?
Try changing the for loop as follows:
for item in items:
    news_item = {}
    news_item['link'] = item.link['href']
    news_item['published'] = item.published.text
    # Take the part after the embedded 'https://' and use its second
    # dot-separated piece as the source name, e.g. 'www.nrc.nl/...' -> 'Nrc'
    source = item.link['href'].split('//')[2].split('.')[1].capitalize()
    news_item['source'] = source
    # Re-parse the title so tags such as <b> are dropped and only the text remains
    n_s = BeautifulSoup(item.title.text, 'lxml')
    new_t = ''.join(n_s.find_all(text=True))
    news_item['title'] = new_t
    news_items.append(news_item)

for item in news_items:
    print(item)
Output (when I ran it):
{'link': 'https://www.google.com/url?rct=j&sa=t&url=https://www.nrc.nl/nieuws/2020/06/12/de-nieuwe-corona-app-een-balanceeract-tussen-te-streng-en-te-soft-a4002678&ct=ga&cd=CAIyGWFlODkwMWNhMWM0YmE4ODU6bmw6bmw6Tkw&usg=AFQjCNFc54u6UszfKuIsSWFHQ_JTeqfIQA', 'published': '2020-06-12T14:37:30Z', 'source': 'Nrc', 'title': "De nieuwe corona-app: een balanceeract tussen 'te streng' en 'te soft'"}
{'link': 'https://www.google.com/url?rct=j&sa=t&url=https://www.standaard.be/cnt/dmf20200612_04989287&ct=ga&cd=CAIyGWFlODkwMWNhMWM0YmE4ODU6bmw6bmw6Tkw&usg=AFQjCNHtIbdXB6q3hcvnNTvG7KC76fV7xQ', 'published': '2020-06-12T11:46:32Z', 'source': 'Standaard', 'title': 'Mobiele coronateams en app tegen tweede golf'}
and so on.
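If a regex-based approach is preferred after all, here is a minimal sketch along the same lines, assuming every href embeds the real article URL in a url= parameter and that <b> is the only markup appearing in the titles (the example values are copied from the first entry in the output above):

import re

# Values copied from the first entry shown above
href = ('https://www.google.com/url?rct=j&sa=t&url='
        'https://duitslandinstituut.nl/artikel/38250/duitsland-lanceert-corona-tracing-app'
        '&ct=ga&cd=CAIyGWFlODkwMWNhMWM0YmE4ODU6bmw6bmw6Tkw&usg=AFQjCNHDFPconO3h8mpzJh92x4HrjPL2tQ')
title = 'Duitsland lanceert <b>corona</b>-tracing-<b>app</b>'

# The real article URL sits between 'url=' and the next '&'
link = re.search(r'url=(https?://[^&]+)', href).group(1)

# The source is the name right after the scheme, skipping a leading 'www.'
source = re.search(r'https?://(?:www\.)?([^./]+)', link).group(1).capitalize()

# Drop simple tags such as <b> and </b> from the title
clean_title = re.sub(r'</?\w+>', '', title)

print(link)         # https://duitslandinstituut.nl/artikel/38250/duitsland-lanceert-corona-tracing-app
print(source)       # Duitslandinstituut
print(clean_title)  # Duitsland lanceert corona-tracing-app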
If you do not want to use regex, you can use the .replace method on strings, and urllib.parse.urlparse to get the domain from the URL.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

def parse(url):
    news_items = []
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text)
    items = soup.find_all('entry')
    for item in items:
        # Drop the <b> tags Google puts around the matched search terms
        title = item.title.text.replace('<b>', '').replace('</b>', '')
        # Strip the Google redirect prefix and cut off the tracking parameters
        link = item.link['href'].replace(
            'https://www.google.com/url?rct=j&sa=t&url=', '').split('&')[0]
        # Take the second piece of the domain, e.g. 'www.nrc.nl' -> 'Nrc'
        source = urlparse(link).netloc.split('.')[1].title()
        published = item.published.text
        news_items.append(dict(zip(
            ['link', 'published', 'source', 'title'],
            [link, published, source, title]
        )))
    return news_items
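A quick usage sketch for the function above, using the feed URL from the question:

feed_url = 'https://www.google.com/alerts/feeds/12700550304290381537/6239785894655863043'
for news_item in parse(feed_url):
    print(news_item)

Note that stripping a fixed prefix with .replace relies on every href starting with exactly that Google redirect string; if the parameter order ever changes, reading the url query parameter with urllib.parse.parse_qs would be a more robust alternative.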