Scraping URL links in a table
I have been able to scrape other data without any problem, and I can also scrape the URL links using the code below.
import requests
from bs4 import BeautifulSoup

response2 = requests.get(url2)
soup = BeautifulSoup(response2.text, 'lxml')
for link in soup.find_all('a', href=True):
    print(link['href'])
However, I am now facing two challenges:
1- I am only interested in the highlighted URL in each row (the event link).
2- How can I use these links to scrape data from each page in turn (as if I had a new copy of the code below for each link, with its URL substituted into urlfix)?
import requests
import pandas as pd

urlfix = 'https://www.rootsandrain.com/organiser21/uci/events/filters/dh/'
responsefix = requests.get(urlfix)
dffix = pd.read_html(responsefix.text)[0]

# remove times and other data
dffix.drop('Time', axis=1, inplace=True)
dffix.drop('Time.1', axis=1, inplace=True)
dffix.drop('Competitors', axis=1, inplace=True)

# rename columns
dffix.rename(columns={dffix.columns[3]: 'Win_M'}, inplace=True)
dffix.rename(columns={dffix.columns[4]: 'Win_F'}, inplace=True)

# filter for event type
dffix['Worldchamps'] = dffix['Event'].str.contains(r'World Championships', na=True)
dffix['Worldcup'] = dffix['Event'].str.contains(r'World Cup', na=True)
# ~ negates the match ("does not contain"); | combines the two patterns
dffix['Miscrace'] = ~dffix['Event'].str.contains(r'World Championships|World Cup', na=True)

with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(dffix)
Screenshot of the webpage
To get only the event links, use the CSS selector .future td:nth-child(2) a:
for link in soup.select('.future td:nth-child(2) a'):
    print(link['href'], link.text)
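For reference, a minimal self-contained sketch of that approach could look like the following; it assumes the listing page from the question as the starting URL and joins each relative href to the site root with urljoin (both the base URL and the 'lxml' parser are assumptions carried over from the question's code):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# listing page taken from the question; the base URL is assumed from its event links
url = 'https://www.rootsandrain.com/organiser21/uci/events/filters/dh/'
base = 'https://www.rootsandrain.com'

soup = BeautifulSoup(requests.get(url).text, 'lxml')

# absolute event URLs from the second column of each matching row
event_urls = [urljoin(base, a['href']) for a in soup.select('.future td:nth-child(2) a')]
print(event_urls)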
Note: for future posts - each post should contain only one question to keep it focused - the others should be asked as a new question.
Just to point you in a direction - select your elements more specifically, and note that you have to concatenate the href with the baseUrl. The following list comprehension will create a list of URLs that you can iterate over to fetch the detail tables - it uses CSS selectors to select the tbody of the table with id T1 and concatenates the href of the first <a> in each row with the baseUrl:
['https://www.rootsandrain.com'+row.a['href'] for row in soup.select('#T1 tbody tr')]
Keep in mind that there is also pagination, and there are detail pages without results, ... - if you get stuck there, ask a new question and also provide your expected output. Thanks.
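Purely as an illustration of that pagination caveat (not part of the original answer), a loop like the one below could walk the listing pages by following a "next" link; the selector a[rel="next"] is a hypothetical guess at the pager markup and would need to be checked against the actual page:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = 'https://www.rootsandrain.com'
page = '/organiser21/uci/events/filters/dh/'  # first listing page, from the question
event_urls = []

while page:
    soup = BeautifulSoup(requests.get(urljoin(base, page)).text, 'lxml')
    event_urls += [urljoin(base, row.a['href']) for row in soup.select('#T1 tbody tr')]
    nxt = soup.select_one('a[rel="next"]')  # hypothetical pager selector - verify against the real markup
    page = nxt['href'] if nxt else None

print(len(event_urls))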
Example
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.rootsandrain.com/organiser21/uci/events/filters/dh/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')

# build the list of absolute event URLs from the rows of table #T1
urlList = ['https://www.rootsandrain.com' + row.a['href'] for row in soup.select('#T1 tbody tr')]

data = []
for url in urlList:
    try:
        data.append(pd.read_html(url)[0])
    except ValueError:
        # pd.read_html raises ValueError when a page has no tables
        print(f'No tables found:{url}')

pd.concat(data)
Output
...
No tables found:https://www.rootsandrain.com/event9599/2022-jul-9-mercedes-benz-uci-world-cup-dh-4-lenzerheide/
No tables found:https://www.rootsandrain.com/event9598/2022-jun-11-mercedes-benz-uci-world-cup-dh-3-leogang/
No tables found:https://www.rootsandrain.com/event9597/2022-may-22-mercedes-benz-uci-world-cup-dh-2-fort-william/
...
Unnamed: 0 | Pos⇧ | Bib | Name | Unnamed: 4 | Licence | YoB | Sponsors | km/h | sector1 + | sector2 + | sector3 + | sector4 + | sector5 = | Qualifier | km/h.1 | sector1 +.1 | sector2 +.1 | sector3 +.1 | sector4 +.1 | sector5 =.1 | Run 1 | Diff | sector3 = | sector3 =.1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
nan | 1st | 3 | Loïc BRUNI | nan | 1.00075e+10 | 1994 | Specialized Gravity | 57.781 | 28.973s1 | 1:08.4101 | 40.922s1 | 31.328s6 | 24.900s11 | 3:14.5331 | 59.062 | 28.697s1 | 1:08.8755 | 40.703s1 | 31.067s16 | 24.037s3 | 3:13.3791 | - | nan | nan |
nan | 2nd | 7 | Troy BROSNAN | nan | 1.00073e+10 | 1993 | Canyon Collective Factory Team | 56.258 | 29.331s8 | 1:09.1763 | 42.676s6 | 30.488s2 | 24.493s2 | 3:16.1643 | 59.023 | 29.008s5 | 1:09.40313 | 41.363s8 | 30.121s2 | 23.905s2 | 3:13.8002 | 0.421s | nan | nan |
nan | 3rd | 16 | Ángel SUÁREZ ALONSO | nan | 1.00088e+10 | 1995 | COMMENCAL 21 | 54.1939 | 30.077s26 | 1:18.27071 | 1:16.68773 | 2:00.79772 | 26.728s67 | 5:32.55972 | 58.067 | 28.991s4 | 1:09.2669 | 41.973s16 | 29.531s1 | 24.249s7 | 3:14.0103 | 0.631s | nan | nan |
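One optional refinement, not shown in the original answer: tagging each frame with the event URL before concatenating keeps every row traceable to its source page. The column name source_url is just an illustrative choice.

data = []
for url in urlList:
    try:
        df = pd.read_html(url)[0]
        df['source_url'] = url  # illustrative column name, added only for traceability
        data.append(df)
    except ValueError:
        print(f'No tables found:{url}')

results = pd.concat(data, ignore_index=True)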