Python columns validation using pandas schema
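I am trying to validate my DataFrame columns using PandasSchema. I am stuck on validating some columns, such as:
1. ip_address - should contain an IP address in the format 1.1.1.1, or be null; any other value should raise an error.
2. initial_date - format yyyy-mm-dd h:m:s or mm-dd-yyyy h:m:s, etc.
3. customertype - should be in ['type1', 'type2', 'type3']; anything else should raise an error.
4. customer satisfied = yes/no or blank.
5. customerid - should not be longer than 5 characters, e.g. cus01, cus02.
6. time - should be in %:%: format or h:m:s format; anything else should raise an exception.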
from pandas_schema import Column, Schema

def check_string(sr):
    try:
        str(sr)
    except InvalidOperation:
        return False
    return True

def check_datetime(self,dec):
    try:
        datetime.datetime.strptime(dec, self.date_format)
        return True
    except:
        return False

def check_int(num):
    try:
        int(num)
    except ValueError:
        return False
    return True

string_validation=[CustomElementValidation(lambda x: check_string(x).str.len()>5 ,'Field Correct')]
int_validation = [CustomElementValidation(lambda i: check_int(i), 'is not integer')]
contain_validation = [CustomElementValidation(lambda y: check_string(y) not in['type1','type2','type3'], 'Filed is correct')]
date_time_validation=[CustomElementValidation(lambda dt: check_datetime(dt).strptime('%m/%d/%Y %H:%M %p'),'is not a date time')]
null_validation = [CustomElementValidation(lambda d: d is not np.nan, 'this field cannot be null')]

schema = Schema([
    Column('CompanyID', string_validation + null_validation),
    Column('initialdate', date_time_validation),
    Column('customertype', contain_validation),
    Column('ip', string_validation),
    Column('customersatisfied', string_validation)])

errors = schema.validate(combined_df)
errors_index_rows = [e.row for e in errors]
pd.DataFrame({'col':errors}).to_csv('errors.csv')
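I just had a look at the PandasSchema documentation, and most (if not all) of what you are looking for is available out of the box; its built-in validation classes are worth a look.
As a quick attempt at solving your problem, something similar to this should work: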
from pandas_schema import Column, Schema
from pandas_schema.validation import (
    DateFormatValidation,
    InListValidation,
    IsDtypeValidation,
    MatchesPatternValidation,
)

schema = Schema([
    # Match a string of length between 1 and 5 (anchored with ^...$ so the whole value must fit)
    Column('CompanyID', [MatchesPatternValidation(r"^.{1,5}$")]),
    # Match a date-like string in an ISO 8601-like format (https://www.iso.org/iso-8601-date-and-time-format.html)
    Column('initialdate', [DateFormatValidation("%Y-%m-%d %H:%M:%S")], allow_empty=True),
    # Match only strings in the following list
    Column('customertype', [InListValidation(["type1", "type2", "type3"])]),
    # Match an IP address RegEx (https://www.oreilly.com/library/view/regular-expressions-cookbook/9780596802837/ch07s16.html)
    # Anchored so the whole value must be an address; empty values are allowed, as the question asks
    Column('ip', [MatchesPatternValidation(r"^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$")], allow_empty=True),
    # Match only strings in the following list
    Column('customersatisfied', [InListValidation(["yes", "no"])], allow_empty=True)
])
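The schema above leaves a few of the question's rules open: the second date layout (mm-dd-yyyy h:m:s), the time column, and the valid range of each IP octet. A minimal sketch of how those checks might look as plain per-element predicates, using only the standard library. The function names and the exact format strings are my own assumptions based on the wording of the question, not part of PandasSchema.

```python
import ipaddress
from datetime import datetime

# Assumed formats, read off the question's "yyyy-mm-dd h:m:s or mm-dd-yyyy h:m:s".
DATETIME_FORMATS = ("%Y-%m-%d %H:%M:%S", "%m-%d-%Y %H:%M:%S")
# Assumed meaning of "h:m:s" for the time column.
TIME_FORMATS = ("%H:%M:%S",)


def matches_any_format(value, formats):
    """Return True if value parses with at least one strptime format."""
    for fmt in formats:
        try:
            datetime.strptime(value, fmt)
            return True
        except (TypeError, ValueError):
            # TypeError: value is not a string (e.g. None or NaN)
            # ValueError: value does not fit this format
            continue
    return False


def is_valid_datetime(value):
    """Hypothetical helper: accept either of the two assumed date layouts."""
    return matches_any_format(value, DATETIME_FORMATS)


def is_valid_time(value):
    """Hypothetical helper: accept h:m:s times such as 09:05:00."""
    return matches_any_format(value, TIME_FORMATS)


def is_valid_ipv4(value):
    """Hypothetical helper: stricter than the regex, each octet must be 0-255."""
    if not isinstance(value, str):
        return False
    try:
        ipaddress.IPv4Address(value)
        return True
    except ValueError:
        return False
```

Each predicate takes one cell value and returns a bool, so it could be wrapped in the CustomElementValidation class the question already uses, for example CustomElementValidation(is_valid_time, 'is not a valid time'), and placed in the matching Column.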