Type-checking Pandas DataFrames
I would like to type-check Pandas DataFrames, i.e. I want to specify which column labels a DataFrame must have and which dtypes are stored in them. A crude implementation (inspired by this question) would work like this:
from collections import namedtuple

Col = namedtuple('Col', 'label, type')

def dataframe_check(*specification):
    def check_accepts(f):
        assert len(specification) <= f.__code__.co_argcount

        def new_f(*args, **kwds):
            for (df, specs) in zip(args, specification):
                spec_columns = [spec.label for spec in specs]
                assert (df.columns == spec_columns).all(), \
                    "Columns don't match specs {}".format(spec_columns)
                spec_dtypes = [spec.type for spec in specs]
                assert (df.dtypes == spec_dtypes).all(), \
                    "Dtypes don't match specs {}".format(spec_dtypes)
            return f(*args, **kwds)

        new_f.__name__ = f.__name__
        return new_f
    return check_accepts
I don't mind the complexity of the checking function itself, but it adds a lot of boilerplate.
@dataframe_check([Col('a', int), Col('b', int)],    # df1
                 [Col('a', int), Col('b', float)])  # df2
def f(df1, df2):
    return df1 + df2

f(df, df)
Is there a more Pythonic way of type-checking DataFrames? Something that looks more like the new Python 3.6 static type checking? Can it be done in mypy?
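One way to get closer to the look of Python 3.6 type hints is to move the spec into the function's annotations and read `__annotations__` in the decorator. This is a sketch of my own (not from the question), and note that mypy itself cannot verify dtypes, since they only exist at runtime; the check still happens on each call:

```python
import pandas as pd

def df_checked(f):
    """Validate DataFrame arguments against dict specs stored in f's annotations."""
    # Annotations are arbitrary objects, so a {column: dtype} dict is legal here.
    specs = [v for k, v in f.__annotations__.items() if k != 'return']

    def wrapper(*args, **kwargs):
        for df, spec in zip(args, specs):
            assert list(df.columns) == list(spec), \
                'Columns {} do not match spec {}'.format(list(df.columns), list(spec))
            for col, dtype in spec.items():
                assert df[col].dtype == dtype, \
                    '{}: dtype {} does not match spec {}'.format(col, df[col].dtype, dtype)
        return f(*args, **kwargs)

    wrapper.__name__ = f.__name__
    return wrapper

@df_checked
def add(df1: {'a': 'int64', 'b': 'int64'},
        df2: {'a': 'int64', 'b': 'float64'}):
    return df1 + df2

df1 = pd.DataFrame({'a': [1], 'b': [2]})
df2 = pd.DataFrame({'a': [3], 'b': [4.0]})
print(add(df1, df2))
```

The signature now documents the expected frames, at the cost of repurposing annotations for something static checkers will ignore.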
Perhaps not the most Pythonic way, but using a dict as your specification might do the trick (keys as column names, values as data types):
import pandas as pd

df = pd.DataFrame(columns=['col1', 'col2'])
df['col1'] = df['col1'].astype('int')
df['col2'] = df['col2'].astype('str')

cols_dtypes_req = {'col1': 'int', 'col2': 'object'}  # 'str' dtype is 'object' in pandas

def check_df(dataframe, specs):
    for colname in specs:
        if colname not in dataframe:
            return 'Column missing.'
        elif dataframe[colname].dtype != specs[colname]:
            return 'Data type incorrect.'
    for dfcol in dataframe:
        if dfcol not in specs:
            return 'Unexpected dataframe column.'
    return 'Dataframe meets specifications.'

print(check_df(df, cols_dtypes_req))
Try pandera:
A data validation library for scientists, engineers, and analysts seeking correctness.