Filtering a list of HTML links by a specific keyword using Python
I am trying to extract links from a list of links by matching a specific word in each link. Below is the code I use to get the URLs:
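import urllib.request  # a bare "import urllib" does not reliably expose urllib.request
from bs4 import BeautifulSoup as bs

url = 'https://fbref.com/en/squads/b8fd03ef/Manchester-City-Stats'
html_page = urllib.request.urlopen(url)
soup = bs(html_page, "html.parser")

links = []
player_link = []
# Collect the href of every <a> tag; get() returns None when href is missing
for link in soup.find_all('a'):
    links.append(link.get('href'))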
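The code above stores the list of links in the variable links.
I want to create a new list that contains only the links that include the specific word summary.
The expected output (partial), which should be stored in the new list player_list, looks like this: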
player_list =['/en/players/3bb7b8b4/matchlogs/2021-2022/summary/Ederson-Match-Logs',
'/en/players/3eb22ec9/matchlogs/2021-2022/summary/Bernardo-Silva-Match-Logs',
'/en/players/bd6351cd/matchlogs/2021-2022/summary/Joao-Cancelo-Match-Logs',
'/en/players/31c69ef1/matchlogs/2021-2022/summary/Ruben-Dias-Match-Logs',
'/en/players/6434f10d/matchlogs/2021-2022/summary/Rodri-Match-Logs',
'/en/players/119b9a8e/matchlogs/2021-2022/summary/Aymeric-Laporte-Match-Logs']
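I tried exploring some previous posts, but without success. What can I try next?
You can check both conditions in a list comprehension (the link is non-empty and it contains summary):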
out = [x for x in links if x and 'summary' in x]
Output:
['/en/players/3bb7b8b4/matchlogs/2021-2022/summary/Ederson-Match-Logs',
'/en/players/3eb22ec9/matchlogs/2021-2022/summary/Bernardo-Silva-Match-Logs',
'/en/players/bd6351cd/matchlogs/2021-2022/summary/Joao-Cancelo-Match-Logs',
'/en/players/31c69ef1/matchlogs/2021-2022/summary/Ruben-Dias-Match-Logs',
'/en/players/6434f10d/matchlogs/2021-2022/summary/Rodri-Match-Logs',
...
'/en/players/02aed921/matchlogs/2021-2022/summary/Cieran-Slicker-Match-Logs',
'/en/players/c19a2df1/matchlogs/2021-2022/summary/Josh-Wilson-Esbrand-Match-Logs']
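To see why the `x and` check matters, here is a minimal offline sketch. It assumes the scraped list contains None entries (which is what link.get('href') returns for an <a> tag with no href); the sample values and the name unique_player_list are made up for illustration. Without the truthiness check, 'summary' in None would raise a TypeError. The de-duplication step is optional and only an assumption about what you might want, since a page can link to the same URL more than once.

```python
# Hypothetical sample standing in for the scraped `links` list.
links = [
    None,  # an <a> tag without an href attribute
    '',    # an empty href
    '/en/squads/b8fd03ef/Manchester-City-Stats',
    '/en/players/3bb7b8b4/matchlogs/2021-2022/summary/Ederson-Match-Logs',
    '/en/players/3bb7b8b4/matchlogs/2021-2022/summary/Ederson-Match-Logs',
]

# `x and ...` skips None and '' before the substring test runs
player_list = [x for x in links if x and 'summary' in x]

# Optional: drop repeated links while keeping first-seen order
unique_player_list = list(dict.fromkeys(player_list))

print(player_list)
print(unique_player_list)
```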
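Instead of filtering the list at the end, you can select your targets more specifically and filter from the start. The following list comprehension selects only the <a> elements whose href contains summary and concatenates each href with your base URL: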
['https://fbref.com'+e['href'] for e in soup.select('a[href*="summary"]')]
Example
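import urllib.request  # a bare "import urllib" does not reliably expose urllib.request
from bs4 import BeautifulSoup as bs

url = 'https://fbref.com/en/squads/b8fd03ef/Manchester-City-Stats'
html_page = urllib.request.urlopen(url)
soup = bs(html_page, "html.parser")

# The CSS selector a[href*="summary"] matches <a> tags whose href contains "summary"
summaryUrls = ['https://fbref.com'+e['href'] for e in soup.select('a[href*="summary"]')]
print(summaryUrls)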
Output
['https://fbref.com/en/players/3bb7b8b4/matchlogs/2021-2022/summary/Ederson-Match-Logs',
'https://fbref.com/en/players/3eb22ec9/matchlogs/2021-2022/summary/Bernardo-Silva-Match-Logs',
'https://fbref.com/en/players/bd6351cd/matchlogs/2021-2022/summary/Joao-Cancelo-Match-Logs',
'https://fbref.com/en/players/31c69ef1/matchlogs/2021-2022/summary/Ruben-Dias-Match-Logs',
'https://fbref.com/en/players/6434f10d/matchlogs/2021-2022/summary/Rodri-Match-Logs',
'https://fbref.com/en/players/119b9a8e/matchlogs/2021-2022/summary/Aymeric-Laporte-Match-Logs',
'https://fbref.com/en/players/ed1e53f3/matchlogs/2021-2022/summary/Phil-Foden-Match-Logs',
'https://fbref.com/en/players/86dd77d1/matchlogs/2021-2022/summary/Kyle-Walker-Match-Logs',
'https://fbref.com/en/players/b400bde0/matchlogs/2021-2022/summary/Raheem-Sterling-Match-Logs',
'https://fbref.com/en/players/e46012d4/matchlogs/2021-2022/summary/Kevin-De-Bruyne-Match-Logs',
'https://fbref.com/en/players/b0b4fd3e/matchlogs/2021-2022/summary/Jack-Grealish-Match-Logs',
'https://fbref.com/en/players/819b3158/matchlogs/2021-2022/summary/Ilkay-Gundogan-Match-Logs',
'https://fbref.com/en/players/b66315ae/matchlogs/2021-2022/summary/Gabriel-Jesus-Match-Logs',
'https://fbref.com/en/players/892d5bb1/matchlogs/2021-2022/summary/Riyad-Mahrez-Match-Logs',
'https://fbref.com/en/players/5eecec3d/matchlogs/2021-2022/summary/John-Stones-Match-Logs',...]
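As a possible variation on the string concatenation above, the standard library's urllib.parse.urljoin can build the absolute URLs. This is only a sketch: the hrefs list is a hand-written stand-in for the values returned by the selector, and the names base_url and summary_urls are chosen here for illustration. One reason to consider it is that urljoin leaves an href that is already absolute unchanged, whereas plain concatenation would prepend the base a second time.

```python
from urllib.parse import urljoin

base_url = 'https://fbref.com'

# Hypothetical hrefs: one relative (as in the output above), one already absolute
hrefs = [
    '/en/players/3bb7b8b4/matchlogs/2021-2022/summary/Ederson-Match-Logs',
    'https://fbref.com/en/players/6434f10d/matchlogs/2021-2022/summary/Rodri-Match-Logs',
]

# urljoin resolves relative paths against the base and keeps absolute URLs as they are
summary_urls = [urljoin(base_url, h) for h in hrefs]
print(summary_urls)
```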