Scraping with BeautifulSoup: Scraping a specific column in a table from an HTML page

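I am trying to use Python to get the data in the second column of a table on the Shakemap site: the entries with a code of the form "CATAC2021aaaa", where "aaaa" stands for the letters that follow the year. These are the IDs of the events.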

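I tried the code below to access the second column of the table and retrieve the ID from each row, but so far I have had no success. Does anyone know where I went wrong and how to correct it?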

from bs4 import BeautifulSoup
from urllib import request

page = request.urlopen('http://shakemapcam.ethz.ch/archive/').read()
soup = BeautifulSoup(page)

desired_table = soup.findAll('table')[2]

# Find the columns you want data from
headers = desired_table.findAll('th')
desired_columns = []
for th in headers:
    if 'CATAC2021' in th.string:
        desired_columns.append([headers.index(th), th.getText()])

# Iterate through each row grabbing the data from the desired columns
rows = desired_table.findAll('tr')

for row in rows[1:]:
    cells = row.findAll('td')
    row_name = row.findNext('th').getText()
    for column in desired_columns:
        print(cells[column[0]].text, row_name, column[1])

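I would use pandas here to grab the table, then use a regular expression to extract the pattern (after the four digits and before the first /). Note that the table also has an Event ID column, so make sure you know the difference. I named mine eventId.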

import pandas as pd

url = 'http://shakemapcam.ethz.ch/archive/'
# read_html returns a list of tables; take the last one
df = pd.read_html(url, header=0)[-1]
# Group 0: text before the four-digit year; group 1: text between the year and the first " /"
parts = df['Name/Epicenter'].str.extract(r'(.*)\d{4}(.*)(\s//?.*)(//?.*)')
df['eventId'] = parts[1]
df['prefix'] = parts[0]

Output:

print(df[['Name/Epicenter','prefix','eventId']])
                                      Name/Epicenter     prefix eventId
0         CATAC2021efod / 6.354496002 / -76.18144226      CATAC    efod
1         CATAC2021edxe / 15.67289066 / -93.40866852      CATAC    edxe
2         CATAC2021ebzg / 9.406171799 / -84.55581665      CATAC    ebzg
3         CATAC2021eayx / 14.03658199 / -92.30122375      CATAC    eayx
4         CATAC2021eayx / 14.03546429 / -92.30183411      CATAC    eayx
                                             ...        ...     ...
1574   ineterloc2018acor / 12.21397209 / -86.7282486  ineterloc    acor
1575  ineterloc2018acor / 12.21113586 / -86.73029327  ineterloc    acor
1576  ineterloc2018acor / 12.20839691 / -86.73122406  ineterloc    acor
1577  ineterloc2018aatd / 16.59416389 / -86.35289764  ineterloc    aatd
1578  ineterloc2018aatd / 16.64553833 / -86.26078796  ineterloc    aatd

[1579 rows x 3 columns]
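
To check the extraction without fetching the page, the same idea can be sketched with the standard re module on two Name/Epicenter values copied from the output above. The pattern and the helper name split_code are my own assumptions, not part of the code above. The sketch assumes the prefix contains no digits, the year is always four digits, and the ID runs up to the first space or slash.

```python
import re

# Assumed layout: <prefix without digits><four-digit year><ID> / lat / lon
CODE_RE = re.compile(r"^(?P<prefix>\D+)(?P<year>\d{4})(?P<event_id>[^\s/]+)\s*/")


def split_code(value):
    """Return (prefix, year, event_id), or None if the value does not fit the assumed layout."""
    match = CODE_RE.match(value)
    if match is None:
        return None
    return match.group("prefix"), match.group("year"), match.group("event_id")


# Sample values taken from the Name/Epicenter column shown in the output
samples = [
    "CATAC2021efod / 6.354496002 / -76.18144226",
    "ineterloc2018aatd / 16.59416389 / -86.35289764",
]
for sample in samples:
    print(split_code(sample))
```

Because the groups are named, passing CODE_RE.pattern to df['Name/Epicenter'].str.extract should give columns called prefix, year and event_id in a single call, although I have not run that against the live page.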