Regex BS4 output for part of <link>
I am parsing this RSS feed:
https://www.google.com/alerts/feeds/12700550304290381537/6239785894655863043
I am using the following code:
import requests
from bs4 import BeautifulSoup

url = "https://www.google.com/alerts/feeds/12700550304290381537/6239785894655863043"
resp = requests.get(url)
soup = BeautifulSoup(resp.content, features='xml')
items = soup.findAll('entry')

news_items = []
for item in items:
    news_item = {}
    news_item['title'] = item.title.text
    news_item['link'] = item.link['href']
    news_item['published'] = item.published.text
    news_item['source'] = item.link
    news_items.append(news_item)

news_items[0]
I get the following output:
{'link': <link href="https://www.google.com/url?rct=j&sa=t&url=https://duitslandinstituut.nl/artikel/38250/duitsland-lanceert-corona-tracing-app&ct=ga&cd=CAIyGWFlODkwMWNhMWM0YmE4ODU6bmw6bmw6Tkw&usg=AFQjCNHDFPconO3h8mpzJh92x4HrjPL2tQ"/>,
'published': '2020-06-11T15:33:11Z',
'source': <link href="https://www.google.com/url?rct=j&sa=t&url=https://duitslandinstituut.nl/artikel/38250/duitsland-lanceert-corona-tracing-app&ct=ga&cd=CAIyGWFlODkwMWNhMWM0YmE4ODU6bmw6bmw6Tkw&usg=AFQjCNHDFPconO3h8mpzJh92x4HrjPL2tQ"/>,
'title': 'Duitsland lanceert <b>corona</b>-tracing-<b>app</b>'}
However, the output I am looking for is:
{'link': 'https://duitslandinstituut.nl/artikel/38250/duitsland-lanceert-corona-tracing-app&ct=ga&cd=CAIyGWFlODkwMWNhMWM0YmE4ODU6bmw6bmw6Tkw&usg=AFQjCNHDFPconO3h8mpzJh92x4HrjPL2tQ',
'published': '2020-06-11T15:33:11Z',
'source': 'Duitslandinstituut',
'title': 'Duitsland lanceert corona-tracing-app'}
So, first, I want to get rid of the Google redirect part of the link. Second, I would like the source to be the name that follows the second 'https://', with a capital letter. Third, I want to strip any markup such as <b> tags from the title. I intend to put the results into a bibliography, so the text cannot contain any computer code.
I have tried to fix this in BS4 but without success. Someone suggested doing it with regex after loading the data into a pandas DataFrame, but I am not familiar with regex and the examples are hard to follow. Does anyone have a solution?
Try changing the for loop as follows:
for item in items:
    news_item = {}
    news_item['link'] = item.link['href']
    news_item['published'] = item.published.text
    # Take the part after the embedded 'https://' and use its second
    # dot-separated piece as the source name, e.g. 'www.nrc.nl/...' -> 'Nrc'
    source = item.link['href'].split('//')[2].split('.')[1].capitalize()
    news_item['source'] = source
    # Re-parse the title so tags such as <b> are dropped and only the text remains
    n_s = BeautifulSoup(item.title.text, 'lxml')
    new_t = ''.join(n_s.find_all(text=True))
    news_item['title'] = new_t
    news_items.append(news_item)

for item in news_items:
    print(item)
Output (when I ran it):
{'link': 'https://www.google.com/url?rct=j&sa=t&url=https://www.nrc.nl/nieuws/2020/06/12/de-nieuwe-corona-app-een-balanceeract-tussen-te-streng-en-te-soft-a4002678&ct=ga&cd=CAIyGWFlODkwMWNhMWM0YmE4ODU6bmw6bmw6Tkw&usg=AFQjCNFc54u6UszfKuIsSWFHQ_JTeqfIQA', 'published': '2020-06-12T14:37:30Z', 'source': 'Nrc', 'title': "De nieuwe corona-app: een balanceeract tussen 'te streng' en 'te soft'"}
{'link': 'https://www.google.com/url?rct=j&sa=t&url=https://www.standaard.be/cnt/dmf20200612_04989287&ct=ga&cd=CAIyGWFlODkwMWNhMWM0YmE4ODU6bmw6bmw6Tkw&usg=AFQjCNHtIbdXB6q3hcvnNTvG7KC76fV7xQ', 'published': '2020-06-12T11:46:32Z', 'source': 'Standaard', 'title': 'Mobiele coronateams en app tegen tweede golf'}
and so on.
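If a regex-based approach is preferred after all, here is a minimal sketch along the same lines, assuming every href embeds the real article URL in a url= parameter and that <b> is the only markup appearing in the titles (the example values are copied from the first entry in the output above):

import re

# Values copied from the first entry shown above
href = ('https://www.google.com/url?rct=j&sa=t&url='
        'https://duitslandinstituut.nl/artikel/38250/duitsland-lanceert-corona-tracing-app'
        '&ct=ga&cd=CAIyGWFlODkwMWNhMWM0YmE4ODU6bmw6bmw6Tkw&usg=AFQjCNHDFPconO3h8mpzJh92x4HrjPL2tQ')
title = 'Duitsland lanceert <b>corona</b>-tracing-<b>app</b>'

# The real article URL sits between 'url=' and the next '&'
link = re.search(r'url=(https?://[^&]+)', href).group(1)

# The source is the name right after the scheme, skipping a leading 'www.'
source = re.search(r'https?://(?:www\.)?([^./]+)', link).group(1).capitalize()

# Drop simple tags such as <b> and </b> from the title
clean_title = re.sub(r'</?\w+>', '', title)

print(link)         # https://duitslandinstituut.nl/artikel/38250/duitsland-lanceert-corona-tracing-app
print(source)       # Duitslandinstituut
print(clean_title)  # Duitsland lanceert corona-tracing-app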
If you do not want to use regex, you can use the .replace method on strings, and urllib.parse.urlparse to get the domain from the URL.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

def parse(url):
    news_items = []
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text)
    items = soup.find_all('entry')
    for item in items:
        # Drop the <b> tags Google puts around the matched search terms
        title = item.title.text.replace('<b>', '').replace('</b>', '')
        # Strip the Google redirect prefix and cut off the tracking parameters
        link = item.link['href'].replace(
            'https://www.google.com/url?rct=j&sa=t&url=', '').split('&')[0]
        # Take the second piece of the domain, e.g. 'www.nrc.nl' -> 'Nrc'
        source = urlparse(link).netloc.split('.')[1].title()
        published = item.published.text
        news_items.append(dict(zip(
            ['link', 'published', 'source', 'title'],
            [link, published, source, title]
        )))
    return news_items
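A quick usage sketch for the function above, using the feed URL from the question:

feed_url = 'https://www.google.com/alerts/feeds/12700550304290381537/6239785894655863043'
for news_item in parse(feed_url):
    print(news_item)

Note that stripping a fixed prefix with .replace relies on every href starting with exactly that Google redirect string; if the parameter order ever changes, reading the url query parameter with urllib.parse.parse_qs would be a more robust alternative.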