Click a date range button and crawl one HTML table in Python
I am trying to scrape a small table of data from here; my attempt is as follows:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://oilprice.com/rig-count'
# html = urllib.request.urlopen(url)
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
contents = soup.find_all('div', {'class': 'info_table'})
print(contents[0].children)

rows = []
for child in contents[0].children:
    row = []
    for td in child:
        print(td)  # does not work after this line
        try:
            row.append(td.text.replace('\n', ''))
        except:
            continue
    if len(row) > 0:
        rows.append(row)

df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)
Since the output of contents is a rather large chunk of HTML data, I don't know how to extract it correctly and save it as a dataframe. Could someone share an answer or give me some hints? Thanks.
You can use:
table = soup.find('div', {'class': 'info_table'})
data = [[cell.text.strip() for cell in row.find_all('div')]
        for row in table.find_all('div', recursive=False)]
df = pd.DataFrame(data[1:], columns=data[0])
Output:
>>> df
              Date  Oil Rigs  Gas Rigs  Total Rigs  Frac Spread  Production Million Bpd
0     4th Mar 2022       519       130         650          280
1    25th Feb 2022       522       127         650          290
2    18th Feb 2022       520       124         645          283                   11.60
3    11th Feb 2022       516       118         635          275                   11.60
4     4th Feb 2022       497       116         613          264                   11.60
..             ...       ...       ...         ...          ...                     ...
358  26th Dec 2014      1499       340        1840          367                    9.12
359  19th Dec 2014      1536       338        1875          415                    9.13
360  12th Dec 2014      1546       346        1893          411                    9.14
361   5th Dec 2014      1575       344        1920          428                    9.12
362  21st Nov 2014      1574       355        1929          452                    9.08

[363 rows x 6 columns]
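The recursive=False argument is what makes this work: it restricts the outer find_all to the table's direct child divs (the rows), while the inner find_all descends into each row for the cells. A minimal sketch with made-up markup mirroring that assumed structure:

```python
from bs4 import BeautifulSoup

# Made-up markup: one table div containing one row div with two cell divs.
html = '<div class="info_table"><div><div>a</div><div>b</div></div></div>'
table = BeautifulSoup(html, 'html.parser').find('div', {'class': 'info_table'})

rows_only = table.find_all('div', recursive=False)  # only the row div
all_divs = table.find_all('div')                    # row div plus both cells

print(len(rows_only), len(all_divs))  # 1 3
```

Without recursive=False, the row-level search would also match the nested cell divs and the row/cell structure would be lost.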
Update

A lazy solution that lets Pandas guess the data types is to convert your data to CSV:
import io

table = soup.find('div', {'class': 'info_table'})
data = ['\t'.join(cell.text.strip() for cell in row.find_all('div'))
        for row in table.find_all('div', recursive=False)]

buf = io.StringIO()
buf.writelines('\n'.join(data))
buf.seek(0)

df = pd.read_csv(buf, sep='\t', parse_dates=['Date'])
Output:
>>> df
          Date  Oil Rigs  Gas Rigs  Total Rigs  Frac Spread  Production Million Bpd
0   2022-03-04       519       130         650          280                     NaN
1   2022-02-25       522       127         650          290                     NaN
2   2022-02-18       520       124         645          283                   11.60
3   2022-02-11       516       118         635          275                   11.60
4   2022-02-04       497       116         613          264                   11.60
..         ...       ...       ...         ...          ...                     ...
358 2014-12-26      1499       340        1840          367                    9.12
359 2014-12-19      1536       338        1875          415                    9.13
360 2014-12-12      1546       346        1893          411                    9.14
361 2014-12-05      1575       344        1920          428                    9.12
362 2014-11-21      1574       355        1929          452                    9.08

[363 rows x 6 columns]
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 363 entries, 0 to 362
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   Date                    363 non-null    datetime64[ns]
 1   Oil Rigs                363 non-null    int64
 2   Gas Rigs                363 non-null    int64
 3   Total Rigs              363 non-null    int64
 4   Frac Spread             363 non-null    int64
 5   Production Million Bpd  360 non-null    float64
dtypes: datetime64[ns](1), float64(1), int64(4)
memory usage: 17.1 KB
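The CSV round trip is not strictly necessary: the same dtypes can be obtained in place with to_datetime and to_numeric. A sketch on two made-up rows shaped like the scraped strings (column names are taken from the table above; the empty production cell becomes NaN):

```python
import pandas as pd

cols = ['Date', 'Oil Rigs', 'Gas Rigs', 'Total Rigs',
        'Frac Spread', 'Production Million Bpd']
data = [['4th Mar 2022', '519', '130', '650', '280', ''],
        ['18th Feb 2022', '520', '124', '645', '283', '11.60']]
df = pd.DataFrame(data, columns=cols)

# Drop the ordinal suffix ("4th" -> "4") so to_datetime can parse the dates.
df['Date'] = pd.to_datetime(
    df['Date'].str.replace(r'(\d+)(st|nd|rd|th)', r'\1', regex=True))

# Everything else becomes numeric; empty strings turn into NaN.
df[cols[1:]] = df[cols[1:]].apply(pd.to_numeric, errors='coerce')
```

errors='coerce' maps any unparseable cell (such as the blank production figures in the most recent rows) to NaN instead of raising.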
The best answer should involve the smallest change to your code: you only need re to match the cell contents sensibly:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import bs4
import re

url = 'https://oilprice.com/rig-count'
# html = urllib.request.urlopen(url)
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
contents = soup.find_all('div', {'class': 'info_table'})

rows = []
for child in contents[0].children:
    row = []
    for td in child:
        if isinstance(td, bs4.element.Tag):
            data = re.sub(r'\s', '', re.findall(r'(<[/]?[a-zA-Z].*?>)([\s\S]*?)?(<[/]?[a-zA-Z].*?>)', str(td))[0][1])
            row.append(data)
    if row != []:
        rows.append(row)

df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)
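For what it's worth, the same minimal change works without regular expressions: get_text(strip=True) pulls the cell text directly, and unlike re.sub(r'\s', '', ...) it keeps internal spaces, so a header like "Oil Rigs" is not collapsed to "OilRigs". A sketch on a made-up two-row snippet of the assumed markup:

```python
import bs4
from bs4 import BeautifulSoup
import pandas as pd

# Made-up snippet shaped like the assumed info_table markup:
# one div per row, one div per cell.
html = """<div class="info_table">
  <div><div>Date</div><div>Oil Rigs</div></div>
  <div><div>4th Mar 2022</div><div>519</div></div>
</div>"""
contents = BeautifulSoup(html, 'html.parser').find_all('div', {'class': 'info_table'})

rows = []
for child in contents[0].children:
    if isinstance(child, bs4.element.Tag):  # skip whitespace text nodes
        row = [td.get_text(strip=True) for td in child.find_all('div')]
        if row:
            rows.append(row)

df = pd.DataFrame(rows[1:], columns=rows[0])
```

The isinstance check serves the same purpose as in the regex version: iterating .children yields the newline text nodes between row divs, which must be skipped.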
Here I apply a list comprehension:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://oilprice.com/rig-count'
req = requests.get(url).text
lst = []
soup = BeautifulSoup(req, 'lxml')
data = [x.get_text().replace('\t', '').replace('\n\n',' ').replace('\n','') for x in soup.select('div.info_table_holder div div.info_table_row')]
lst.extend(data)
df = pd.DataFrame(lst, columns=['Data'])
print(df)
Output:
                                        Data
0              4th Mar 2022 519 130 650 280
1             25th Feb 2022 522 127 650 290
2       18th Feb 2022 520 124 645 283 11.60
3       11th Feb 2022 516 118 635 275 11.60
4        4th Feb 2022 497 116 613 264 11.60
...                                      ...
2007           4th Feb 2000 157 387 0 0 0 0
2008          28th Jan 2000 171 381 0 0 0 0
2009          21st Jan 2000 186 338 0 0 0 0
2010          14th Jan 2000 169 342 0 0 0 0
2011           7th Jan 2000 134 266 0 0 0 0

[2012 rows x 1 columns]
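To turn that single 'Data' column into proper columns, the date part (always three tokens) can be split off from the numeric fields. A sketch on made-up rows of the same shape:

```python
import pandas as pd

# Made-up rows in the same shape as the single-column output above.
lst = ['4th Mar 2022 519 130 650 280',
       '18th Feb 2022 520 124 645 283 11.60']
df = pd.DataFrame(lst, columns=['Data'])

tokens = df['Data'].str.split()
df['Date'] = tokens.str[:3].str.join(' ')  # first three tokens form the date
df['Values'] = tokens.str[3:]              # remaining tokens are the figures
```

Note the rows have a variable number of trailing figures (the older rows here carry extra zero columns), so splitting off a fixed-length prefix is safer than assuming a fixed total field count.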