Creating a Pandas DataFrame from a text file of long-term climate data
I have a text file (.DAT) containing daily climate data for a single station. I read it like this:
daily_data_file=r".._may24_SD.DAT"
df = pd.read_csv(daily_data_file, skiprows=[5], delimiter=r"\s+", names=['YEAR', 'DATE', 'JAN', 'FEB','MAR','APR','MAY','JUN','JUL','AUG','SEP','OCT','NOV','DEC'])
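This creates the DataFrame, but some months have 31 days, some have 30, and February has 28 or 29. Because the blank fields for the missing days are swallowed by the whitespace delimiter, the values on the rows for days 29, 30, and 31 are shifted to the left of the DataFrame, leaving the NaN values at the end of the row, as in the output here.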
In [4]: df
Out [4]:
YEAR DATE JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC
0 YEAR DATE JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC
1 1901 1 0.0 0.0 0.3 0.0 3.7 0.9 11.1 0.1 2.5 0.0 0.0 0.0
2 1901 2 0.0 0.0 16.5 0.0 12.3 0.0 11.4 2.7 4.9 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3803 2019 27 0.0 0.0 0.0 0.0 0.0 4.4 12.9 1.1 10.2 6.8 0.0 0.0
3804 2019 28 0.0 0.0 0.0 0.1 0.0 6.0 7.3 0.1 0.3 9.8 0.0 0.0
3805 2019 29 0.0 0.0 0.0 0.0 7.5 7.5 0.6 0.8 8.3 0.0 0.0 NaN
3806 2019 30 0.0 0.0 0.0 0.0 10.2 10.0 3.9 2.0 2.3 0.0 0.0 NaN
3807 2019 31 0.0 0.0 0.0 15.7 24.0 4.5 1.2 NaN NaN NaN NaN NaN
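How should the text file be delimited so that the data keeps its original layout, that is, with NaN values in the corresponding month columns for the 29th, 30th, and 31st, rather than having the values shifted to the left of the DataFrame?
The data in the text file is formatted like this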
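- This kind of data is best handled by read_fwf()
- For column-width inference to work properly, the first 32 rows passed to it are clean fixed-format data
- Once all the data is in the DataFrame, clean it up by testing that YEAR is numeric; this excludes the blank lines and header rows scattered through the data
- Finally, set the expected data types on all columns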
import io
from pathlib import Path

import pandas as pd
import requests

# download sample data and save to file...
url = "https://raw.githubusercontent.com/abhilashsinghimd/AASD_Geojson/main/25_may24_SD1.DAT"
res = requests.get(url)
sample_path = Path.cwd().joinpath("SO_example.DAT")
with open(sample_path, "w") as f:
    f.write(res.text)

# read file from your file system here...
with open(sample_path, "r") as f:
    text = f.read()

lines = text.split("\n")
# the slicing assumes the column header is at line index 6 and the first
# year's 31 daily rows start at index 8, so the first 32 rows handed to
# read_fwf() are clean fixed-width data for infer_nrows to work with
df = pd.read_fwf(
    io.StringIO(
        "\n".join(lines[6:7] + lines[8 : 8 + 31])
        + "\n".join(lines[8 + 31 :])
    ),
    infer_nrows=32,
)
print(f"expected row count:{(2019-1900)*31}")
# exclude header rows littered through data
df = df.loc[~pd.to_numeric(df["YEAR"], errors="coerce").isna()]
# convert to expected datatypes
df = df.assign(**{c: df.loc[:, c].astype("int" if c in ["YEAR", "DATE"] else "float") for c in df.columns})
pd.set_option("display.width", 100)
print(df)
Output
expected row count:3689
YEAR DATE JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC
0 1901 1 0.0 0.0 0.3 0.0 3.7 0.9 11.1 0.1 2.5 0.0 0.0 0.0
1 1901 2 0.0 0.0 16.5 0.0 12.3 0.0 11.4 2.7 4.9 0.0 0.0 0.0
2 1901 3 0.0 0.0 0.0 0.0 1.2 0.0 1.3 1.9 0.6 0.0 0.0 0.0
3 1901 4 0.0 0.0 0.0 0.0 1.2 0.0 7.6 20.5 2.5 0.0 0.0 0.0
4 1901 5 0.0 0.0 0.0 1.9 0.0 0.0 18.7 41.4 2.6 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4156 2019 27 0.0 0.0 0.0 0.0 0.0 4.4 12.9 1.1 10.2 6.8 0.0 0.0
4157 2019 28 0.0 0.0 0.0 0.1 0.0 6.0 7.3 0.1 0.3 9.8 0.0 0.0
4158 2019 29 0.0 NaN 0.0 0.0 0.0 7.5 7.5 0.6 0.8 8.3 0.0 0.0
4159 2019 30 0.0 NaN 0.0 0.0 0.0 10.2 10.0 3.9 2.0 2.3 0.0 0.0
4160 2019 31 0.0 NaN 0.0 NaN 0.0 NaN 15.7 24.0 NaN 4.5 NaN 1.2
[3689 rows x 14 columns]
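To see the difference between the two readers without downloading anything, here is a minimal sketch on a tiny in-memory sample. The sample data, the 6-character field width, and the helper name fixed_width_line are all made up for illustration; they are not taken from the real file. The sketch passes explicit colspecs rather than relying on inference, which is a reasonable alternative when the column widths are known.

```python
import io

import pandas as pd

WIDTH = 6  # assumed field width, chosen for this illustration only


def fixed_width_line(*fields):
    """Hypothetical helper: right-justify each field in a WIDTH-wide slot."""
    return "".join(str(f).rjust(WIDTH) for f in fields)


names = ["YEAR", "DATE", "JAN", "FEB", "MAR"]
# Made-up sample: FEB is blank on day 30, and the header row repeats mid-file.
sample = "\n".join(
    [
        fixed_width_line(*names),
        fixed_width_line(1901, 28, 0.0, 1.5, 0.3),
        fixed_width_line(1901, 30, 2.0, "", 4.1),
        fixed_width_line(*names),
        fixed_width_line(1902, 30, 0.7, "", 9.9),
    ]
)

# Whitespace-delimited read: the blank FEB field disappears, so the MAR
# value slides left into FEB and the NaN ends up in the last column.
shifted = pd.read_csv(io.StringIO(sample), sep=r"\s+", names=names, skiprows=1)

# Fixed-width read: each column is a character range, so a blank field
# stays in its own column and becomes NaN.
colspecs = [(i * WIDTH, (i + 1) * WIDTH) for i in range(len(names))]
fixed = pd.read_fwf(io.StringIO(sample), colspecs=colspecs)

# Drop the repeated header row by keeping only rows whose YEAR is numeric,
# then convert every column from strings to numbers.
fixed = fixed.loc[pd.to_numeric(fixed["YEAR"], errors="coerce").notna()]
fixed = fixed.apply(pd.to_numeric).reset_index(drop=True)

print(shifted)
print(fixed)
```

On this sample the whitespace-delimited read puts 4.1 under FEB and leaves MAR empty on the day-30 row, while the fixed-width read keeps FEB as NaN and 4.1 under MAR.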