Scraping URL links in a table

I have been able to scrape other data without any problems, and I can also scrape the URL links with the code below.

import requests
from bs4 import BeautifulSoup

response2 = requests.get(url2)
soup = BeautifulSoup(response2.text, 'lxml')

# print every link on the page
for link in soup.find_all('a', href=True):
    print(link['href'])

However, I am now facing two challenges:

1- I am only interested in the URL highlighted in each row (the event link).

2- How can I use these links to scrape data from each page in turn (as if I wrote new code for each link, replacing the url in urlfix below)?

import pandas as pd
import requests

urlfix = 'https://www.rootsandrain.com/organiser21/uci/events/filters/dh/'
responsefix = requests.get(urlfix)
dffix = pd.read_html(responsefix.text)[0]

# remove times and other unneeded columns
dffix.drop('Time', axis=1, inplace=True)
dffix.drop('Time.1', axis=1, inplace=True)
dffix.drop('Competitors', axis=1, inplace=True)

#rename columns
dffix.rename(columns = {dffix.columns[3] : 'Win_M'}, inplace = True)
dffix.rename(columns = {dffix.columns[4] : 'Win_F'}, inplace = True)


# filter for event type
dffix['Worldchamps'] = dffix['Event'].str.contains(r'World Championships', na=True)
dffix['Worldcup'] = dffix['Event'].str.contains(r'World Cup', na=True)
# ~ negates the match ("does not contain"); | matches either of the two patterns
dffix['Miscrace'] = ~dffix['Event'].str.contains(r'World Championships|World Cup', na=True)


with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(dffix)
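For reference, a minimal sketch of how those boolean columns can be used as row masks, continuing from the dffix built above (worldcup_races and misc_races are just illustrative names):

# boolean columns work directly as masks on the frame
worldcup_races = dffix[dffix['Worldcup']]
misc_races = dffix[dffix['Miscrace']]
print(worldcup_races.head())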

[Screenshot of the webpage]

To get the event links, just use the CSS selector .future td:nth-child(2) a:

for link in soup.select('.future td:nth-child(2) a'):
    print(link['href'], link.text)

Note: for future posts - each post should focus on a single question to stay on topic - anything else is meant to be asked as a new question.

Just to point you in a direction: select your elements more specifically, and note that you have to concatenate each href with the baseUrl.

The following list comprehension creates a list of URLs that you can iterate over to scrape the detail tables - it uses a CSS selector to select the rows in the tbody of the table with id T1 and concatenates the href of the first <a> in each row with the baseUrl:

['https://www.rootsandrain.com'+row.a['href'] for row in soup.select('#T1 tbody tr')]
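As a side note, urllib.parse.urljoin from the standard library handles the concatenation more robustly (it keeps hrefs that are already absolute intact); a minimal sketch, with base_url and url_list as illustrative names:

from urllib.parse import urljoin

base_url = 'https://www.rootsandrain.com'
# urljoin leaves absolute hrefs untouched and normalises slashes
url_list = [urljoin(base_url, row.a['href']) for row in soup.select('#T1 tbody tr')]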

Keep in mind that there is also pagination, and that some detail pages have no results, ... - if you get stuck there, ask a new question and also provide the expected output. Thanks.
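Purely as a sketch of how the pagination could be followed - assuming the pager renders the next page as an <a rel="next"> link, which is an assumption about the markup that you should verify on the real page:

import requests
from bs4 import BeautifulSoup

page_url = 'https://www.rootsandrain.com/organiser21/uci/events/filters/dh/'
event_links = []

while page_url:
    soup = BeautifulSoup(requests.get(page_url).text, 'lxml')
    event_links += ['https://www.rootsandrain.com' + row.a['href']
                    for row in soup.select('#T1 tbody tr') if row.a]
    # assumption: the pager exposes <a rel="next" href="...">; adjust the selector if not
    next_link = soup.select_one('a[rel="next"]')
    page_url = 'https://www.rootsandrain.com' + next_link['href'] if next_link else None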

Example

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.rootsandrain.com/organiser21/uci/events/filters/dh/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')

urlList = ['https://www.rootsandrain.com'+row.a['href'] for row in soup.select('#T1 tbody tr')]

data = []

for url in urlList:
    try:
        data.append(pd.read_html(url)[0])
    except ValueError:
        # pd.read_html raises ValueError when a page contains no tables
        print(f'No tables found:{url}')

pd.concat(data)
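If the original row indices from the individual pages are not needed, pd.concat(data, ignore_index=True) yields one clean running index instead of repeated per-page indices.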

Output

...
No tables found:https://www.rootsandrain.com/event9599/2022-jul-9-mercedes-benz-uci-world-cup-dh-4-lenzerheide/
No tables found:https://www.rootsandrain.com/event9598/2022-jun-11-mercedes-benz-uci-world-cup-dh-3-leogang/
No tables found:https://www.rootsandrain.com/event9597/2022-may-22-mercedes-benz-uci-world-cup-dh-2-fort-william/
...
Unnamed: 0 Pos⇧ Bib Name Unnamed: 4 Licence YoB Sponsors km/h sector1 + sector2 + sector3 + sector4 + sector5 = Qualifier km/h.1 sector1 +.1 sector2 +.1 sector3 +.1 sector4 +.1 sector5 =.1 Run 1 Diff sector3 = sector3 =.1
nan 1st 3 Loïc BRUNI nan 1.00075e+10 1994 Specialized Gravity 57.781 28.973s1 1:08.4101 40.922s1 31.328s6 24.900s11 3:14.5331 59.062 28.697s1 1:08.8755 40.703s1 31.067s16 24.037s3 3:13.3791 - nan nan
nan 2nd 7 Troy BROSNAN nan 1.00073e+10 1993 Canyon Collective Factory Team 56.258 29.331s8 1:09.1763 42.676s6 30.488s2 24.493s2 3:16.1643 59.023 29.008s5 1:09.40313 41.363s8 30.121s2 23.905s2 3:13.8002 0.421s nan nan
nan 3rd 16 Ángel SUÁREZ ALONSO nan 1.00088e+10 1995 COMMENCAL 21 54.1939 30.077s26 1:18.27071 1:16.68773 2:00.79772 26.728s67 5:32.55972 58.067 28.991s4 1:09.2669 41.973s16 29.531s1 24.249s7 3:14.0103 0.631s nan nan