Scraping URL links in a table

I have been able to scrape other data without any problems, and I can also scrape the URL links with the code below.

import requests
from bs4 import BeautifulSoup

response2 = requests.get(url2)
soup = BeautifulSoup(response2.text, 'lxml')

# print every link on the page
for link in soup.find_all('a', href=True):
    print(link['href'])

However, I am now facing two challenges:

1- I am only interested in the URL highlighted in each row (the event link).

2- How can I use these links to scrape data from each page in turn (as if I wrote new code for each link, replacing the url in urlfix below)?

import pandas as pd
import requests

urlfix = 'https://www.rootsandrain.com/organiser21/uci/events/filters/dh/'
responsefix = requests.get(urlfix)
dffix = pd.read_html(responsefix.text)[0]

# remove times and other unneeded columns
dffix.drop('Time', axis=1, inplace=True)
dffix.drop('Time.1', axis=1, inplace=True)
dffix.drop('Competitors', axis=1, inplace=True)

#rename columns
dffix.rename(columns = {dffix.columns[3] : 'Win_M'}, inplace = True)
dffix.rename(columns = {dffix.columns[4] : 'Win_F'}, inplace = True)


# filter for event type
dffix['Worldchamps'] = dffix['Event'].str.contains(r'World Championships', na=True)
dffix['Worldcup'] = dffix['Event'].str.contains(r'World Cup', na=True)
# ~ negates the match ("does not contain"); | matches either of the two patterns
dffix['Miscrace'] = ~dffix['Event'].str.contains(r'World Championships|World Cup', na=True)


with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(dffix)
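For reference, a minimal sketch of how those boolean columns can be used as row masks, continuing from the dffix built above (worldcup_races and misc_races are just illustrative names):

# boolean columns work directly as masks on the frame
worldcup_races = dffix[dffix['Worldcup']]
misc_races = dffix[dffix['Miscrace']]
print(worldcup_races.head())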

[Screenshot of the webpage]

To get the event links, just use the CSS selector .future td:nth-child(2) a:

for link in soup.select('.future td:nth-child(2) a'):
    print(link['href'], link.text)

Note: for future posts - each post should focus on a single question to stay on topic - anything else is meant to be asked as a new question.

Just to point you in a direction: select your elements more specifically, and note that you have to concatenate each href with the baseUrl.

The following list comprehension creates a list of URLs that you can iterate over to scrape the detail tables - it uses a CSS selector to select the rows in the tbody of the table with id T1 and concatenates the href of the first <a> in each row with the baseUrl:

['https://www.rootsandrain.com'+row.a['href'] for row in soup.select('#T1 tbody tr')]
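As a side note, urllib.parse.urljoin from the standard library handles the concatenation more robustly (it keeps hrefs that are already absolute intact); a minimal sketch, with base_url and url_list as illustrative names:

from urllib.parse import urljoin

base_url = 'https://www.rootsandrain.com'
# urljoin leaves absolute hrefs untouched and normalises slashes
url_list = [urljoin(base_url, row.a['href']) for row in soup.select('#T1 tbody tr')]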

Keep in mind that there is also pagination, and that some detail pages have no results, ... - if you get stuck there, ask a new question and also provide the expected output. Thanks.
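Purely as a sketch of how the pagination could be followed - assuming the pager renders the next page as an <a rel="next"> link, which is an assumption about the markup that you should verify on the real page:

import requests
from bs4 import BeautifulSoup

page_url = 'https://www.rootsandrain.com/organiser21/uci/events/filters/dh/'
event_links = []

while page_url:
    soup = BeautifulSoup(requests.get(page_url).text, 'lxml')
    event_links += ['https://www.rootsandrain.com' + row.a['href']
                    for row in soup.select('#T1 tbody tr') if row.a]
    # assumption: the pager exposes <a rel="next" href="...">; adjust the selector if not
    next_link = soup.select_one('a[rel="next"]')
    page_url = 'https://www.rootsandrain.com' + next_link['href'] if next_link else None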

Example

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.rootsandrain.com/organiser21/uci/events/filters/dh/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')

urlList = ['https://www.rootsandrain.com'+row.a['href'] for row in soup.select('#T1 tbody tr')]

data = []

for url in urlList:
    try:
        data.append(pd.read_html(url)[0])
    except ValueError:
        # pd.read_html raises ValueError when a page contains no tables
        print(f'No tables found:{url}')

pd.concat(data)
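If the original row indices from the individual pages are not needed, pd.concat(data, ignore_index=True) yields one clean running index instead of repeated per-page indices.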

Output

...
No tables found:https://www.rootsandrain.com/event9599/2022-jul-9-mercedes-benz-uci-world-cup-dh-4-lenzerheide/
No tables found:https://www.rootsandrain.com/event9598/2022-jun-11-mercedes-benz-uci-world-cup-dh-3-leogang/
No tables found:https://www.rootsandrain.com/event9597/2022-may-22-mercedes-benz-uci-world-cup-dh-2-fort-william/
...
Unnamed: 0 Pos⇧ Bib Name Unnamed: 4 Licence YoB Sponsors km/h sector1 + sector2 + sector3 + sector4 + sector5 = Qualifier km/h.1 sector1 +.1 sector2 +.1 sector3 +.1 sector4 +.1 sector5 =.1 Run 1 Diff sector3 = sector3 =.1
nan 1st 3 Loïc BRUNI nan 1.00075e+10 1994 Specialized Gravity 57.781 28.973s1 1:08.4101 40.922s1 31.328s6 24.900s11 3:14.5331 59.062 28.697s1 1:08.8755 40.703s1 31.067s16 24.037s3 3:13.3791 - nan nan
nan 2nd 7 Troy BROSNAN nan 1.00073e+10 1993 Canyon Collective Factory Team 56.258 29.331s8 1:09.1763 42.676s6 30.488s2 24.493s2 3:16.1643 59.023 29.008s5 1:09.40313 41.363s8 30.121s2 23.905s2 3:13.8002 0.421s nan nan
nan 3rd 16 Ángel SUÁREZ ALONSO nan 1.00088e+10 1995 COMMENCAL 21 54.1939 30.077s26 1:18.27071 1:16.68773 2:00.79772 26.728s67 5:32.55972 58.067 28.991s4 1:09.2669 41.973s16 29.531s1 24.249s7 3:14.0103 0.631s nan nan