How to make new pandas column based on multiple conditionals including 'isnull', 'or' and if colB 'isin' colA -like statements?

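First time asking. Is there a way to get a new df column that covers all three statements (or, isnull-like, isin-like) without iterating in a for loop, keeping the code in the spirit of pandas? I have tried suggestions from several threads that deal with individual aspects of common conditional problems, but every iteration I have tried usually leads to "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()." or produces incorrect results. Below are example data and code from several attempts. My goal is (1) [...] (... in 'comp_unit' means my function is not working properly) and (2) no repeated company names (because 'unit_desc' sometimes already [incorrectly] contains the company name, e.g. row 2).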

Desired dataframe

company      unit_desc                 comp_new     comp_unit
Generic      Some description          NaN          Some description
NaN          Unit with features        NaN          Unit with features
Some LLC     Some LLC Xtra cool space  Some LLC     Some LLC Xtra cool space
Another LLC  Unit with features        Another LLC  Another LLC Unit with features
Another LLC  Basic unit                Another LLC  Another LLC Basic unit
Some LLC     basic unit                Some LLC     Some LLC basic unit

Imports and initial example df

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'company': ['Generic', np.nan, 'Some LLC', 'Another LLC', 'Another LLC', 'Some LLC'], 
    'unit_desc': ['Some description', 'Unit with features', 'Some LLC Xtra cool space', 'Unit with features', 'Basic unit', 'basic unit'],
    })

ATTEMPT 0: Using np.where
ATTEMPT 0 result: ValueError as above

def my_func(df, unit, comp, bad_info_list):
    """Return new dataframe with new column combining company and unit descriptions

    Args:
        df (DataFrame): Pandas dataframe with product and brand info
        unit (str): name of unit description column
        comp (str): name of company name column
        bad_info_list (list): list of unwanted terms
    """

    # (WORKS) Make new company column filtering out unwanted terms
    df["comp_new"] = df[comp].apply(lambda x: x if x not in bad_info_list else np.nan)
    # (!!!START OF NOT WORKING!!!)
    # Make new column with brand and product descriptions
    df["comp_unit"] = np.where(
        (df["comp_new"].isnull().all() or df["comp_new"].isin(df[unit])),
        df[unit],
        (df["comp_new"] + " " + df[unit]),
    )
    # (!!!END OF NOT WORKING!!!)
    
    return df

df_new = my_func(df, "unit_desc", "company", bad_info_list=["Generic", "Name"])
print(df_new)  

ATTEMPT 1: Using np.where with the ValueError suggestions, as shown in the inline comment
ATTEMPT 1 result:

def my_func(df, unit, comp, bad_info_list):

    # (WORKS) Make new company column filtering out unwanted terms
    df["comp_new"] = df[comp].apply(lambda x: x if x not in bad_info_list else np.nan)
    # (!!!START OF NOT WORKING!!!)
    # Make new column with brand and product descriptions
    df["comp_unit"] = np.where(
        ((df["comp_new"].isnull().all()) | (df["comp_new"].isin(df[unit]))), # Swap .all() with other options
        df[unit],
        (df["comp_new"] + " " + df[unit]),
    )
    # (!!!END OF NOT WORKING!!!)
    
    return df


df_new = my_func(df, "unit_desc", "company", bad_info_list=["Generic", "Name"])
print(df_new)

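ATTEMPT 1.5: Same as 1, except .isnull().all() is swapped with == np.nan
ATTEMPT 1.5 result: Incorrect results
I find it odd that the isin statement raises no ambiguity error. Perhaps it is not working as intended?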

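ATTEMPT 2: Using if/elif/else and different suggestions from the ValueError
It seems this could be solved with a for loop for each condition, but shouldn't there be another way?
ATTEMPT 2 result: see the bullet points of ATTEMPT 1

def my_func(df, unit, comp, bad_info_list):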
    # (WORKS) Make new company column filtering out unwanted terms
    df["comp_new"] = df[comp].apply(lambda x: x if x not in bad_info_list else np.nan)
    # (!!!START OF NOT WORKING!!!)
    if df["comp_new"].isnull(): # Tried .all(), .any(), .item(), etc. just before ":"
        df["comp_unit"] = df[unit]
    elif df["comp_new"].isin(df[unit]): # Tried .all(), etc. just before ":"
        df["comp_unit"] = df[unit]
    else:
        df["comp_unit"] = df["comp_new"] + " " + df[unit]
    # (!!!END OF NOT WORKING!!!)
    
    return df

df_new = my_func(df, "unit_desc", "company", bad_info_list=["Generic", "Name"])
print(df_new)

ATTEMPT 3: Using if/elif/else combined with apply
ATTEMPT 3 result: AttributeError: 'float' object has no attribute 'isin'

bad_info_list=["Generic", "Name"]
df["comp_new"] = df["company"].apply(lambda x: x if x not in bad_info_list else np.nan)

def comp_unit_merge(df):
    if df["comp_new"] == np.nan: #.isnull().item():
        return df["unit_desc"]
    elif df["comp_new"].isin(df["unit_desc"]): # AttributeError: 'float' object has no attribute 'isin'
        return df["unit_desc"]
    else:
        return df["comp_new"] + " " + df["unit_desc"]
    
df["comp_unit"] = df.apply(comp_unit_merge, axis=1)
print(df)

ATTEMPT 4: Using np.select(conditions, values)
ATTEMPT 4 result: Incorrect results
The company name is not included in the last rows

def my_func(df, unit, comp, bad_info_list):
    # (WORKS) Make new company column filtering out unwanted terms
    df["comp_new"] = df[comp].apply(lambda x: x if x not in bad_info_list else np.nan)
    # (!!!START OF NOT WORKING!!!)
    conditions = [
        ((df["comp_new"] == np.nan) | (df["comp_new"].isin(df[comp]))),
        (df["comp_new"] != np.nan),
    ]
    values = [
        (df[unit]),
        (df["comp_new"] + " " + df[unit]),
    ]
    df["comp_unit"] = np.select(conditions, values)
    # (!!!END OF NOT WORKING!!!)
    
    return df

df_new = my_func(df, "unit_desc", "company", bad_info_list=["Generic", "Name"])
print(df_new)

Try filling the NaN values first and then adding the two columns together

df = pd.DataFrame({
    'company': ['Generic', np.nan, 'Some LLC', 'Another LLC', 'Another LLC', 'Some LLC'],
    'unit_desc': ['Some description', 'Unit with features', 'Some LLC Xtra cool space', 'Unit with features', 'Basic unit', 'basic unit'],
    })
df = df.fillna('')
df['new_col'] = df['company'] + ' ' + df['unit_desc']

>>> df
       company                 unit_desc                            new_col
0      Generic          Some description           Generic Some description
1                     Unit with features                 Unit with features
2     Some LLC  Some LLC Xtra cool space  Some LLC Some LLC Xtra cool space
3  Another LLC        Unit with features     Another LLC Unit with features
4  Another LLC                Basic unit             Another LLC Basic unit
5     Some LLC                basic unit                Some LLC basic unit

def my_func(dataframe, unit, comp, bad_info_list):

    df = dataframe.copy()

    df['comp_new'] = df[comp].apply(lambda x: x if x not in bad_info_list else np.nan)

    idx = df[df.apply(lambda x: str(x['comp_new']) in str(x[unit]), axis=1) | df['comp_new'].isnull()].index

    df['comp_unit'] = np.where(
        df.index.isin(idx),
        df[unit],
        df['comp_new'] + ' ' + df[unit]
    )

    return df

new_df = my_func(df, 'unit_desc', 'company', ['Generic', 'Name'])
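
One possible reason the isin attempts in the question gave incorrect results rather than an error: Series.isin tests whether each value is exactly equal to some element of the values passed in, across the whole column. It does not test whether a value is a substring of the other column in the same row, which is what the function above does with `in`. A minimal sketch, using a two-row frame named `demo` that is made up for illustration:

```python
import pandas as pd

# Hypothetical two-row frame, made up for this illustration
demo = pd.DataFrame({
    "comp_new": ["Some LLC", "Another LLC"],
    "unit_desc": ["Some LLC Xtra cool space", "Basic unit"],
})

# isin: exact match against every value in unit_desc, not a substring test
exact_match = demo["comp_new"].isin(demo["unit_desc"])

# Row-wise substring test, pairing each company with its own description
row_substring = pd.Series(
    [comp in unit for comp, unit in zip(demo["comp_new"], demo["unit_desc"])],
    index=demo.index,
)

print(exact_match.tolist())    # [False, False]
print(row_substring.tolist())  # [True, False]
```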

If I understand correctly, "Attempt 0" is close, but the condition is incorrect. Try this:

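# Inside my_func, where `unit` holds the name of the unit description column
df["comp_unit"] = np.where(
    # df.apply (not Series.apply) passes whole rows; str() keeps the `in` test from failing on NaN
    ((df["comp_new"].isnull()) | (df.apply(lambda row: str(row["comp_new"]) in row[unit], axis="columns"))),
    df[unit],
    (df["comp_new"] + " " + df[unit]),
)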
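
To illustrate why the form of the condition matters: Python's `or` asks for a single truth value of a whole Series, which is what raises the ValueError quoted in the question, while `|` combines two boolean Series element by element. Calling .all() first collapses the Series to one bool, so the per-row information is lost. A minimal sketch with a made-up Series `s`:

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", np.nan, "b"])  # made-up data for illustration

# Element-wise OR keeps one boolean per row
mask = s.isnull() | s.isin(["b"])
print(mask.tolist())  # [False, True, True]

# .all() reduces to a single bool: True only if every value is null
print(bool(s.isnull().all()))  # False

# `or` needs one truth value for the whole Series, which is ambiguous
try:
    s.isnull() or s.isin(["b"])
except ValueError as err:
    print("ValueError:", err)
```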

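Pandas is not good at vectorized operations between Series (or columns) that contain strings, so for now you have to fall back on apply(..., axis=1). I would use it only once: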

bad_info_list=["Generic", "Name"]

df_new = df.assign(comp_new = df.apply(
    lambda row: row['unit_desc'] if pd.isna(row['company']) or
    row['company'] in bad_info_list or
    row['unit_desc'].startswith(row['company'])
    else ' '.join(row), axis=1))

It does not change the original df and produces the expected result:

       company                 unit_desc                        comp_new
0      Generic          Some description                Some description
1          NaN        Unit with features              Unit with features
2     Some LLC  Some LLC Xtra cool space        Some LLC Xtra cool space
3  Another LLC        Unit with features  Another LLC Unit with features
4  Another LLC                Basic unit          Another LLC Basic unit
5     Some LLC                basic unit             Some LLC basic unit
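
As a possible alternative to apply(..., axis=1), the same row-wise rule can be written as a list comprehension over the two columns zipped together. This is only a sketch under assumptions: the helper name `combine` and the output column name `comp_unit` are chosen here for illustration, while the rule itself (NaN, unwanted term, or startswith) follows the lambda above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "company": ["Generic", np.nan, "Some LLC", "Another LLC", "Another LLC", "Some LLC"],
    "unit_desc": ["Some description", "Unit with features", "Some LLC Xtra cool space",
                  "Unit with features", "Basic unit", "basic unit"],
})
bad_info_list = ["Generic", "Name"]

def combine(company, unit_desc):
    """Hypothetical helper: prefix unit_desc with company unless that is unwanted."""
    # pd.isna is checked first so the string operations never see a NaN
    if pd.isna(company) or company in bad_info_list or unit_desc.startswith(company):
        return unit_desc
    return f"{company} {unit_desc}"

# zip walks both columns in lockstep, one (company, unit_desc) pair per row
df_new = df.assign(
    comp_unit=[combine(c, u) for c, u in zip(df["company"], df["unit_desc"])]
)
print(df_new)
```

Like the apply version, this leaves the original df untouched because assign returns a new dataframe.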

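When axis=1 is used, the applied function receives a row as its argument. In most cases, indexing into that row gives you string objects, except when a NaN is encountered.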

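NumPy NaN is actually a float. So when you try to perform string operations on the company column, such as checking whether unit_desc contains the company, this throws an error for rows that contain NaN.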
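
A short sketch of that failure mode, using a literal description string chosen for illustration:

```python
import numpy as np

# NaN is an ordinary Python float
print(isinstance(np.nan, float))  # True

# A float on the left of `in` with a string on the right raises TypeError
try:
    np.nan in "Unit with features"
except TypeError as err:
    print("TypeError:", err)
```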

NumPy has a function isnan, but calling it on a string also throws an error. So any row that has an actual company value would cause problems with that check.

You can use isinstance to check the data type, or you can remove NaN from the data beforehand.
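
The np.isnan behavior described above can be sketched as follows, together with pd.isna (used in an earlier answer), which accepts both strings and floats. The last line also hints at why the `== np.nan` comparisons in the question never match.

```python
import numpy as np
import pandas as pd

# np.isnan rejects strings
try:
    np.isnan("Some LLC")
except TypeError as err:
    print("TypeError:", err)

# pd.isna accepts both strings and floats
print(pd.isna("Some LLC"))  # False
print(pd.isna(np.nan))      # True

# NaN never compares equal, even to itself, so `== np.nan` is always False
print(np.nan == np.nan)     # False
```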


This example removes NaN beforehand.

badlist=["Generic", "Name"]

def merge(row):
    if row['company'] in badlist:
        return row['unit_desc']
    if row['company'] in row['unit_desc']:
        return row['unit_desc']
    return f"{row['company']} {row['unit_desc']}".strip()

df['company'] = df['company'].fillna('')
df['comp_unit'] = df.apply(merge, axis=1)
print(df)



Here is an alternative that detects NaN safely:

badlist=["Generic", "Name"]

def merge(row):
    if isinstance(row['company'], float) and np.isnan(row['company']):
        return row['unit_desc']
    if row['company'] in badlist:
        return row['unit_desc']
    if row['company'] in row['unit_desc']:
        return row['unit_desc']
    return f"{row['company']} {row['unit_desc']}".strip()

df['comp_unit'] = df.apply(merge, axis=1)
print(df)
