Issues Converting Python Code Block to Function

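In my analyses I frequently use a block of code to standardize the descriptions of the device types that customers use to access an internet provider's services. The code block is as follows: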

# Standardize devices_desc labels
###-- SMARTPHONE
devices_df["devices_desc"] = devices_df["devices_desc"].replace(
    ["SMART PHONE", "SMARTPHONE"], "SMARTPHONE"
)
###-- FEATURE_PHONE
devices_df["devices_desc"] = devices_df["devices_desc"].replace(
    ["FEATURE PHONE"], "FEATURE_PHONE"
)
###-- BASIC_PHONE
devices_df["devices_desc"] = devices_df["devices_desc"].replace(
    ["BASIC PHONE", "BASIC"], "BASIC_PHONE"
)
###-- TABLET
devices_df["devices_desc"] = devices_df["devices_desc"].replace(
    ["TABLETS", "TABLET"], "TABLET"
)
###-- MODEM/GSM_GATEWAY
devices_df["devices_desc"] = devices_df["devices_desc"].replace(
    [
        "MODEM/GSM GATEWAY",
        "DONGLE",
        "PLUGGABLE CARD (E.G. USB STICK)",
        "MODEM/GSM GATEWAY",
    ],
    "MODEM/GSM_GATEWAY",
)
###-- M2M_EQUIPMENT
devices_df["devices_desc"] = devices_df["devices_desc"].replace(
    ["M2M EQUIPMENT"], "M2M_EQUIPMENT"
)
###-- NA/UNKNOWN
devices_df["devices_desc"] = devices_df["devices_desc"].replace(
    [np.nan, "UNKNOWN", "NA", "OTHER", "-"], "UNDEFINED"
)

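devices_df is the data frame, and devices_desc is a column in devices_df. I use pandas (Anaconda distribution) for my analysis. I decided to convert this code block into a function so that it can be reused across all the files I use for analysis. Here is my initial attempt: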

def fix_cust_device_type(devices_desc):
    if devices_desc in ["BASIC PHONE", "BASIC"]:
        return "BASIC_PHONE"
    if devices_desc in ["FEATURE PHONE"]:
        return "FEATURE_PHONE"
    if devices_desc in ["SMART PHONE", "SMARTPHONE"]:
        return "SMARTPHONE"
    if devices_desc in ["TABLETS", "TABLET"]:
        return "TABLET"
    if devices_desc in [
        "MODEM/GSM GATEWAY",
        "DONGLE",
        "PLUGGABLE CARD (E.G. USB STICK)",
        "MODEM/GSM GATEWAY",
    ]:
        return "MODEM/GSM_GATEWAY"
    if devices_desc in ["M2M EQUIPMENT"]:
        return "M2M_EQUIPMENT"
    else:
        return "UNDEFINED"

I tried applying the function as follows:

devices_df["devices_desc"] = devices_df["devices_desc"].apply(fix_cust_device_type)

However, I get the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-28-abd75c9eeb58> in <module>
----> 1 devices_df["devices_desc"] = GSM_Data["devices_desc"].apply(fix_cust_device_type)

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   4198             else:
   4199                 values = self.astype(object)._values
-> 4200                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   4201 
   4202         if len(mapped) and isinstance(mapped[0], Series):

pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()

<ipython-input-23-bad83bc1b381> in fix_cust_device_type(devices_desc)
      1 def fix_cust_device_type(devices_desc):
----> 2     if devices_desc in ["BASIC PHONE", "BASIC"]:
      3         return "BASIC_PHONE"
      4 
      5     if devices_desc in ["FEATURE PHONE"]:

pandas\_libs\missing.pyx in pandas._libs.missing.NAType.__bool__()

**TypeError: boolean value of NA is ambiguous**

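My efforts to determine the cause of the error have been unsuccessful. I would like to understand the following:

  1. The cause of the error
  2. How to correct the error
  3. A pythonic way to implement my proposed solution

Please help. Thank you.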

It is hard to tell without any data (or an MWE), but judging from the error message, your data frame seems to contain missing values (pd.NA).

When I run your code on a simple example, everything works, e.g.:

df = pd.DataFrame({"devices_desc": ["BASIC", "DONGLE"]})
df["devices_desc"].apply(fix_cust_device_type)

# Out:
# 0          BASIC_PHONE
# 1    MODEM/GSM_GATEWAY

But when I include missing data, I get the error you posted:

df = pd.DataFrame({"devices_desc": ["BASIC", pd.NA]})
df["devices_desc"].apply(fix_cust_device_type)

# --> TypeError: boolean value of NA is ambiguous
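As for why this happens, the following is a minimal sketch of what appears to be going on inside the `in` test. It assumes a pandas version that provides pd.NA (1.0 or later); the variable names are only illustrative.

```python
import pandas as pd

value = pd.NA

# Comparing pd.NA with a string does not yield True or False;
# the result is pd.NA itself ("unknown").
comparison = value == "BASIC"

# `value in some_list` compares `value` with each item and needs a
# plain boolean from every comparison. pd.NA refuses to be converted
# to a boolean, so the membership test raises TypeError.
try:
    value in ["BASIC PHONE", "BASIC"]
    message = None
except TypeError as exc:
    message = str(exc)

print(comparison)
print(message)
```

So the function itself is fine for ordinary strings; it only fails once a pd.NA value reaches one of the `if ... in [...]` lines.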

So you should check your data. If NA values are acceptable, handle them in fix_cust_device_type, for example by adding the following code at the start of the function:

if pd.isna(devices_desc):
    return "NA"  # or any string that suits your needs

If NA values are not valid, you should drop them, e.g. with df.dropna() or df.dropna(subset=["devices_desc"]).
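For reference, here is a sketch of the whole function with the pd.isna guard in place. Returning "UNDEFINED" for missing values is an assumption taken from the last replace call in the question's original code block, and listing "UNKNOWN", "NA", "OTHER" and "-" explicitly is redundant with the final return; adjust both to your needs.

```python
import pandas as pd


def fix_cust_device_type(devices_desc):
    # Check for missing values first, so that pd.NA / NaN never
    # reaches the `in` tests below.
    if pd.isna(devices_desc):
        return "UNDEFINED"
    if devices_desc in ["BASIC PHONE", "BASIC"]:
        return "BASIC_PHONE"
    if devices_desc in ["FEATURE PHONE"]:
        return "FEATURE_PHONE"
    if devices_desc in ["SMART PHONE", "SMARTPHONE"]:
        return "SMARTPHONE"
    if devices_desc in ["TABLETS", "TABLET"]:
        return "TABLET"
    if devices_desc in [
        "MODEM/GSM GATEWAY",
        "DONGLE",
        "PLUGGABLE CARD (E.G. USB STICK)",
    ]:
        return "MODEM/GSM_GATEWAY"
    if devices_desc in ["M2M EQUIPMENT"]:
        return "M2M_EQUIPMENT"
    # Everything else, including "UNKNOWN", "NA", "OTHER" and "-".
    return "UNDEFINED"


df = pd.DataFrame({"devices_desc": ["BASIC", pd.NA, "TABLETS"]})
result = df["devices_desc"].apply(fix_cust_device_type)
print(result)
```

Note that, unlike the original chain of replace calls, this function turns every unrecognized label into "UNDEFINED" rather than leaving it unchanged.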

Another way to approach the problem is as follows:

  1. Convert your function into a dictionary
# This is a short version just for showcase
replace_dict = {'BASIC': 'BASIC_PHONE', 'DONGLE': 'MODEM/GSM_GATEWAY'}
  2. Use the replace method with that dictionary (no apply needed, and missing values are handled)
df = pd.DataFrame({"devices_desc": ["BASIC", "DONGLE", pd.NA]})
df["devices_desc"] = df["devices_desc"].replace(replace_dict)

# Content of df:
#         devices_desc
# 0        BASIC_PHONE
# 1  MODEM/GSM_GATEWAY
# 2               <NA>
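To make this reusable across files, the dictionary approach could be wrapped in a small function. The sketch below is one possible shape, not a definitive implementation: the names DEVICE_LABELS and standardize_device_desc are hypothetical, the mapping is copied from the question's original code block, and filling missing values with "UNDEFINED" is an assumption based on that block.

```python
import pandas as pd

# Hypothetical module-level mapping, built from the labels in the question.
DEVICE_LABELS = {
    "SMART PHONE": "SMARTPHONE",
    "FEATURE PHONE": "FEATURE_PHONE",
    "BASIC PHONE": "BASIC_PHONE",
    "BASIC": "BASIC_PHONE",
    "TABLETS": "TABLET",
    "MODEM/GSM GATEWAY": "MODEM/GSM_GATEWAY",
    "DONGLE": "MODEM/GSM_GATEWAY",
    "PLUGGABLE CARD (E.G. USB STICK)": "MODEM/GSM_GATEWAY",
    "M2M EQUIPMENT": "M2M_EQUIPMENT",
    "UNKNOWN": "UNDEFINED",
    "NA": "UNDEFINED",
    "OTHER": "UNDEFINED",
    "-": "UNDEFINED",
}


def standardize_device_desc(series):
    """Return a copy of `series` with standardized device labels."""
    # replace() swaps known labels; fillna() then handles pd.NA / NaN.
    return series.replace(DEVICE_LABELS).fillna("UNDEFINED")


series = pd.Series(["BASIC", "DONGLE", pd.NA, "SMARTPHONE", "-"])
out = standardize_device_desc(series)
print(out)
```

Like the original chain of replace calls, this leaves labels that are not in the dictionary unchanged. It would be applied as devices_df["devices_desc"] = standardize_device_desc(devices_df["devices_desc"]).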