Creating a pandas DataFrame from a text file of long-term climate data

I have a text file (.DAT) containing daily climate data for a single station.

This is the URL of the dataset.

import pandas as pd

daily_data_file=r".._may24_SD.DAT"

df = pd.read_csv(daily_data_file, skiprows=[5], delimiter=r"\s+", names=['YEAR', 'DATE', 'JAN', 'FEB','MAR','APR','MAY','JUN','JUL','AUG','SEP','OCT','NOV','DEC'])

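It creates the DataFrame, but there is a problem.

Some months have 31 days, some have 30, and February has 28 or 29, so the rows for days 29, 30 and 31 contain blank cells for the months that lack those days.

Because runs of whitespace are treated as a single delimiter, the blank cells are skipped: on those last three rows of each year the remaining values shift to the left, and the NaN values end up at the right-hand end of the row, as in the output below.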

In [4]: df
Out [4]: 

         YEAR   DATE    JAN   FEB   MAR  APR    MAY  JUN    JUL     AUG  SEP    OCT  NOV    DEC
0        YEAR   DATE    JAN  FEB    MAR  APR    MAY  JUN    JUL     AUG  SEP    OCT  NOV    DEC
1        1901   1       0.0  0.0    0.3  0.0    3.7  0.9    11.1    0.1  2.5    0.0  0.0    0.0
2        1901   2       0.0  0.0    16.5 0.0    12.3 0.0    11.4    2.7  4.9    0.0  0.0    0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3803     2019   27      0.0  0.0    0.0  0.0    0.0  4.4    12.9    1.1  10.2   6.8  0.0    0.0
3804     2019   28      0.0  0.0    0.0  0.1    0.0  6.0    7.3     0.1  0.3    9.8  0.0    0.0
3805     2019   29      0.0  0.0    0.0  0.0    7.5  7.5    0.6     0.8  8.3    0.0  0.0    NaN
3806    2019    30      0.0  0.0    0.0  0.0    10.2 10.0   3.9     2.0  2.3    0.0  0.0    NaN
3807    2019    31      0.0  0.0    0.0  15.7   24.0 4.5    1.2     NaN  NaN    NaN  NaN    NaN
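The shift can be reproduced on a small scale. The sketch below uses a made-up two-row sample (not taken from the dataset) in which the February cell for day 30 is blank. With a whitespace delimiter the blank contributes no field, so the March value lands under FEB and the NaN ends up in the last column.

```python
import io

import pandas as pd

# Made-up sample: day 30 has no February value, so that cell is blank.
sample = (
    "YEAR DATE  JAN  FEB  MAR\n"
    "2019   28  0.0  1.5  0.2\n"
    "2019   30  0.0       0.7\n"
)

# Runs of whitespace collapse into one delimiter, so the second data row
# is parsed as four fields and padded with NaN on the right.
df = pd.read_csv(io.StringIO(sample), sep=r"\s+")
print(df)
```

Here the 0.7 that belongs to MAR is reported under FEB, which is the same effect as in the full output above.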

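How should the text file be delimited so that the data keeps its original form, i.e. with NaN values in the corresponding month columns for days 29, 30 and 31,

rather than having the values shifted to the left of the DataFrame?

The data in the text file is formatted like this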

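  • This kind of data is best handled by read_fwf()
  • For the column inference to work properly, it is given 32 lines of fixed-format data (the header row plus 31 data rows)
  • Once all the data is in the DataFrame, clean it up by testing that YEAR is numeric, which excludes the blank lines and header rows scattered through the data
  • Finally, set the expected data type on every column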
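As a minimal sketch of the first two points, the made-up sample below (not taken from the dataset) has a blank February cell on day 30. read_fwf() infers column boundaries from character positions in the rows it samples (controlled by infer_nrows), so the blank stays in its own column and becomes NaN instead of letting later values slide left.

```python
import io

import pandas as pd

# Made-up fixed-width sample: the FEB cell on day 30 is blank.
sample = (
    "YEAR DATE  JAN  FEB  MAR\n"
    "2019   28  0.0  1.5  0.2\n"
    "2019   30  0.0       0.7\n"
)

# infer_nrows=3 lets the parser look at all three lines (header included)
# when working out where each column starts and ends.
df = pd.read_fwf(io.StringIO(sample), infer_nrows=3)
print(df)
```

The inference only sees a column if at least one sampled line has text at that position, which is presumably why the answer feeds it a full 31-day block first.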
import requests
import pandas as pd
import numpy as np
import io
from pathlib import Path

# download sample data and save to file...
url = "https://raw.githubusercontent.com/abhilashsinghimd/AASD_Geojson/main/25_may24_SD1.DAT"
res = requests.get(url)
with open(Path.cwd().joinpath("SO_example.DAT"), "w") as f: f.write(res.text)
    
# read file from your file system here...
with open(Path.cwd().joinpath("SO_example.DAT"), "r") as f: text = f.read()
    
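lines = text.split("\n")
df = pd.read_fwf(
    io.StringIO(
        # line index 6 supplies the column names; indices 8..38 are the first
        # 31 data rows, so column inference sees one complete block of days
        "\n".join(lines[6:7] + lines[8 : 8 + 31])
        # then the rest of the file, unchanged
        + "\n".join(lines[8 + 31 :])
    ),
    infer_nrows=32,
)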

print(f"expected row count:{(2019-1900)*31}")
# exclude header rows littered through data
df = df.loc[~pd.to_numeric(df["YEAR"], errors="coerce").isna()]
# convert to expected datatypes
df = df.assign(**{c:df.loc[:,c].astype("int" if c in ["YEAR","DATE"] else "float") for c in df.columns})

pd.set_option("display.width",100)
print(df)

Output

expected row count:3689
      YEAR  DATE  JAN  FEB   MAR  APR   MAY   JUN   JUL   AUG   SEP  OCT  NOV  DEC
0     1901     1  0.0  0.0   0.3  0.0   3.7   0.9  11.1   0.1   2.5  0.0  0.0  0.0
1     1901     2  0.0  0.0  16.5  0.0  12.3   0.0  11.4   2.7   4.9  0.0  0.0  0.0
2     1901     3  0.0  0.0   0.0  0.0   1.2   0.0   1.3   1.9   0.6  0.0  0.0  0.0
3     1901     4  0.0  0.0   0.0  0.0   1.2   0.0   7.6  20.5   2.5  0.0  0.0  0.0
4     1901     5  0.0  0.0   0.0  1.9   0.0   0.0  18.7  41.4   2.6  0.0  0.0  0.0
...    ...   ...  ...  ...   ...  ...   ...   ...   ...   ...   ...  ...  ...  ...
4156  2019    27  0.0  0.0   0.0  0.0   0.0   4.4  12.9   1.1  10.2  6.8  0.0  0.0
4157  2019    28  0.0  0.0   0.0  0.1   0.0   6.0   7.3   0.1   0.3  9.8  0.0  0.0
4158  2019    29  0.0  NaN   0.0  0.0   0.0   7.5   7.5   0.6   0.8  8.3  0.0  0.0
4159  2019    30  0.0  NaN   0.0  0.0   0.0  10.2  10.0   3.9   2.0  2.3  0.0  0.0
4160  2019    31  0.0  NaN   0.0  NaN   0.0   NaN  15.7  24.0   NaN  4.5  NaN  1.2

[3689 rows x 14 columns]
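One way to sanity-check a parse like this is to confirm that days which cannot exist on the calendar are NaN in their own columns. The sketch below is an illustration only: the helper name nonexistent_days_are_nan and the two-row demo frame are assumptions, not part of the code above. It also cannot tell a nonexistent day from a genuinely missing observation, since both are NaN.

```python
import numpy as np
import pandas as pd

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]
# Months that never have a 31st; February also never has a 30th.
NO_31ST = ["FEB", "APR", "JUN", "SEP", "NOV"]


def nonexistent_days_are_nan(df: pd.DataFrame) -> bool:
    """Return True if every calendar-impossible day holds NaN."""
    day31 = df.loc[df["DATE"] == 31, NO_31ST]
    feb30 = df.loc[df["DATE"] == 30, "FEB"]
    return bool(day31.isna().all().all() and feb30.isna().all())


# Made-up frame standing in for a correctly parsed result (days 30 and 31).
demo = pd.DataFrame(0.0, index=range(2), columns=MONTHS)
demo.insert(0, "DATE", [30, 31])
demo.insert(0, "YEAR", [2019, 2019])
demo.loc[demo["DATE"] == 30, "FEB"] = np.nan
demo.loc[demo["DATE"] == 31, NO_31ST] = np.nan

# A left-shifted parse would put a number where no day exists.
shifted = demo.copy()
shifted.loc[1, "FEB"] = 24.0

print(nonexistent_days_are_nan(demo))
print(nonexistent_days_are_nan(shifted))
```

Applied to the DataFrame produced above, the same helper would be called as nonexistent_days_are_nan(df).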