Issues Converting a Python Code Block to a Function
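In my analyses I frequently use a block of code to standardize the descriptions of the device types that customers use to access an Internet provider's services. The code block is as follows: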
# Standardize devices_desc labels
###-- SMARTPHONE
devices_df["devices_desc"] = devices_df["devices_desc"].replace(
    ["SMART PHONE", "SMARTPHONE"], "SMARTPHONE"
)
###-- FEATURE_PHONE
devices_df["devices_desc"] = devices_df["devices_desc"].replace(
    ["FEATURE PHONE"], "FEATURE_PHONE"
)
###-- BASIC_PHONE
devices_df["devices_desc"] = devices_df["devices_desc"].replace(
    ["BASIC PHONE", "BASIC"], "BASIC_PHONE"
)
###-- TABLET
devices_df["devices_desc"] = devices_df["devices_desc"].replace(
    ["TABLETS", "TABLET"], "TABLET"
)
###-- MODEM/GSM_GATEWAY
devices_df["devices_desc"] = devices_df["devices_desc"].replace(
    [
        "MODEM/GSM GATEWAY",
        "DONGLE",
        "PLUGGABLE CARD (E.G. USB STICK)",
        "MODEM/GSM GATEWAY",
    ],
    "MODEM/GSM_GATEWAY",
)
###-- M2M_EQUIPMENT
devices_df["devices_desc"] = devices_df["devices_desc"].replace(
    ["M2M EQUIPMENT"], "M2M_EQUIPMENT"
)
###-- NA/UNKNOWN
devices_df["devices_desc"] = devices_df["devices_desc"].replace(
    [np.nan, "UNKNOWN", "NA", "OTHER", "-"], "UNDEFINED"
)
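devices_df is a data frame and devices_desc is one of its columns. I use pandas (Anaconda distribution) for my analysis. I decided to convert this code block into a function so that I can reuse it across all the files I use for analysis. Here is my initial attempt: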
def fix_cust_device_type(devices_desc):
    if devices_desc in ["BASIC PHONE", "BASIC"]:
        return "BASIC_PHONE"
    if devices_desc in ["FEATURE PHONE"]:
        return "FEATURE_PHONE"
    if devices_desc in ["SMART PHONE", "SMARTPHONE"]:
        return "SMARTPHONE"
    if devices_desc in ["TABLETS", "TABLET"]:
        return "TABLET"
    if devices_desc in [
        "MODEM/GSM GATEWAY",
        "DONGLE",
        "PLUGGABLE CARD (E.G. USB STICK)",
        "MODEM/GSM GATEWAY",
    ]:
        return "MODEM/GSM_GATEWAY"
    if devices_desc in ["M2M EQUIPMENT"]:
        return "M2M_EQUIPMENT"
    else:
        return "UNDEFINED"
I tried to apply the function as follows:
devices_df["devices_desc"] = devices_df["devices_desc"].apply(fix_cust_device_type)
However, I get the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-28-abd75c9eeb58> in <module>
----> 1 devices_df["devices_desc"] = GSM_Data["devices_desc"].apply(fix_cust_device_type)
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
4198 else:
4199 values = self.astype(object)._values
-> 4200 mapped = lib.map_infer(values, f, convert=convert_dtype)
4201
4202 if len(mapped) and isinstance(mapped[0], Series):
pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()
<ipython-input-23-bad83bc1b381> in fix_cust_device_type(devices_desc)
1 def fix_cust_device_type(devices_desc):
----> 2 if devices_desc in ["BASIC PHONE", "BASIC"]:
3 return "BASIC_PHONE"
4
5 if devices_desc in ["FEATURE PHONE"]:
pandas\_libs\missing.pyx in pandas._libs.missing.NAType.__bool__()
**TypeError: boolean value of NA is ambiguous**
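My attempts to identify the cause of the error have been unsuccessful. I would like to understand the following:
- The cause of the error
- How to fix the error
- A Pythonic way to implement my proposed solution
Please help. Thank you.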
It is hard to tell without any data (or an MWE), but judging from the error message, your data frame appears to contain missing values (pd.NA).
When I run your code on a simple example, everything works, e.g.:
df = pd.DataFrame({"devices_desc": ["BASIC", "DONGLE"]})
df["devices_desc"].apply(fix_cust_device_type)
# Out:
# 0 BASIC_PHONE
# 1 MODEM/GSM_GATEWAY
But when I include missing data, I get the error you posted:
df = pd.DataFrame({"devices_desc": ["BASIC", pd.NA]})
df["devices_desc"].apply(fix_cust_device_type)
# --> TypeError: boolean value of NA is ambiguous
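To see why the `in` test fails, it can help to reproduce it outside the function. The sketch below assumes the missing entries really are `pd.NA`, as the traceback suggests; `membership_raises` is a hypothetical helper used only for illustration. A membership test compares the value with each list element using `==`. With `pd.NA` that comparison yields `pd.NA` rather than `False`, and converting it to a boolean raises the `TypeError`. A float NaN compares as unequal instead, so it would not trigger this error.

```python
import numpy as np
import pandas as pd


def membership_raises(value, options):
    """Hypothetical helper: report whether `value in options` raises TypeError."""
    try:
        value in options
    except TypeError:
        return True
    return False


# Comparing pd.NA with a string yields pd.NA, not False.
comparison = pd.NA == "BASIC"
print(comparison is pd.NA)

# `in` needs a plain True/False from each comparison, so pd.NA breaks it...
print(membership_raises(pd.NA, ["BASIC PHONE", "BASIC"]))

# ...whereas a float NaN simply compares as unequal and nothing is raised.
print(membership_raises(np.nan, ["BASIC PHONE", "BASIC"]))
```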
So you should check your data. If the NA values are legitimate, handle them in fix_cust_device_type, for example by adding the following at the start of the function:
    if pd.isna(devices_desc):
        return "NA"  # or any string, according to your needs
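Put together, the function from the question with this guard might look like the sketch below. It returns "UNDEFINED" for missing values, which is an assumption based on how the original replace block in the question treats NaN; any other label would work the same way.

```python
import pandas as pd


def fix_cust_device_type(devices_desc):
    # Handle missing values first so the `in` tests never see pd.NA.
    # "UNDEFINED" is an assumption taken from the question's replace block.
    if pd.isna(devices_desc):
        return "UNDEFINED"
    if devices_desc in ["BASIC PHONE", "BASIC"]:
        return "BASIC_PHONE"
    if devices_desc in ["FEATURE PHONE"]:
        return "FEATURE_PHONE"
    if devices_desc in ["SMART PHONE", "SMARTPHONE"]:
        return "SMARTPHONE"
    if devices_desc in ["TABLETS", "TABLET"]:
        return "TABLET"
    if devices_desc in [
        "MODEM/GSM GATEWAY",
        "DONGLE",
        "PLUGGABLE CARD (E.G. USB STICK)",
    ]:
        return "MODEM/GSM_GATEWAY"
    if devices_desc in ["M2M EQUIPMENT"]:
        return "M2M_EQUIPMENT"
    # Anything not recognised above falls through to the catch-all label.
    return "UNDEFINED"


df = pd.DataFrame({"devices_desc": ["BASIC", pd.NA, "DONGLE"]})
result = df["devices_desc"].apply(fix_cust_device_type)
print(result)
```

Note that this function labels every unrecognised value "UNDEFINED", whereas the original replace block leaves unrecognised values unchanged.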
If the NA values should not be there, remove them, e.g. with df.dropna() or df.dropna(subset=["devices_desc"]).
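The two dropna calls differ in scope, which the toy example below tries to show. The `region` column is hypothetical and exists only to make the difference visible; counting the missing values first is just one way to check the data before dropping anything.

```python
import pandas as pd

# Toy data; the `region` column is hypothetical and only there to show the
# difference between the two calls.
df = pd.DataFrame(
    {
        "devices_desc": ["BASIC", pd.NA, "DONGLE"],
        "region": ["NORTH", "SOUTH", None],
    }
)

# Count the missing device descriptions before deciding what to do with them.
n_missing = df["devices_desc"].isna().sum()

# Drops every row that has a missing value in any column.
dropped_any = df.dropna()

# Drops only the rows where devices_desc itself is missing.
dropped_subset = df.dropna(subset=["devices_desc"])

print(n_missing, len(dropped_any), len(dropped_subset))
```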
Another way to handle the problem is as follows:
- Convert your function into a dictionary
# This is a short version just for showcase
replace_dict = {'BASIC': 'BASIC_PHONE', 'DONGLE': 'MODEM/GSM_GATEWAY'}
- Use the replace method with that dictionary (no apply is needed, and missing values are left as they are instead of raising an error)
df = pd.DataFrame({"devices_desc": ["BASIC", "DONGLE", pd.NA]})
df["devices_desc"] = df["devices_desc"].replace(replace_dict)
# Content of df:
# devices_desc
# 0 BASIC_PHONE
# 1 MODEM/GSM_GATEWAY
# 2 <NA>
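For reuse across files, the dictionary idea could be wrapped in a small function. The sketch below is one possible shape, not the only one: the names `DEVICE_LABELS`, `RAW_TO_LABEL` and `standardize_device_desc` are made up for illustration, and the spellings come from the question. It uses `Series.map` followed by `fillna` rather than `replace`, so that missing and unrecognised values both become "UNDEFINED", which matches the behaviour of the function in the question.

```python
import pandas as pd

# Canonical label -> raw spellings (taken from the question).
DEVICE_LABELS = {
    "SMARTPHONE": ["SMART PHONE", "SMARTPHONE"],
    "FEATURE_PHONE": ["FEATURE PHONE"],
    "BASIC_PHONE": ["BASIC PHONE", "BASIC"],
    "TABLET": ["TABLETS", "TABLET"],
    "MODEM/GSM_GATEWAY": [
        "MODEM/GSM GATEWAY",
        "DONGLE",
        "PLUGGABLE CARD (E.G. USB STICK)",
    ],
    "M2M_EQUIPMENT": ["M2M EQUIPMENT"],
}

# Invert to raw spelling -> canonical label for direct lookup.
RAW_TO_LABEL = {
    raw: label for label, raws in DEVICE_LABELS.items() for raw in raws
}


def standardize_device_desc(series, default="UNDEFINED"):
    """Map raw device descriptions to canonical labels.

    Values that are missing or not in RAW_TO_LABEL become `default`.
    """
    # map() yields NaN for anything not found in the dictionary,
    # including missing values; fillna() then applies the default.
    return series.map(RAW_TO_LABEL).fillna(default)


raw = pd.Series(["BASIC", "DONGLE", pd.NA, "OTHER", "SMART PHONE"])
print(standardize_device_desc(raw))
```

If unrecognised values should be kept as they are, as in the original replace block, `series.replace(RAW_TO_LABEL)` can be used in place of the map call.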