How to make a new pandas column based on multiple conditionals, including 'isnull', 'or', and "if colB 'isin' colA"-like statements?
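First-time asker. Is there a way to get a new df column that combines all three statements (or, isnull-like, isin-like) without iterating over a for loop, keeping the code within the spirit of Pandas? I've tried suggestions from several threads that deal with individual aspects of common conditional problems, but every iteration I've tried usually either leads to "ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()." or produces incorrect results. Below are example data and code from multiple attempts. My goal is the desired dataframe below, in which (1) 'comp_unit' contains no NaN (a NaN in 'comp_unit' means my function is not working properly) and (2) company names are not repeated (since 'unit_desc' sometimes already [incorrectly] contains the company name, as in row 2).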
Desired dataframe
company | unit_desc | comp_new | comp_unit |
---|---|---|---|
Generic | Some description | NaN | Some description |
NaN | Unit with features | NaN | Unit with features |
Some LLC | Some LLC Xtra cool space | Some LLC | Some LLC Xtra cool space |
Another LLC | Unit with features | Another LLC | Another LLC Unit with features |
Another LLC | Basic unit | Another LLC | Another LLC Basic unit |
Some LLC | basic unit | Some LLC | Some LLC basic unit |
Imports and initial example df
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'company': ['Generic', np.nan, 'Some LLC', 'Another LLC', 'Another LLC', 'Some LLC'],
    'unit_desc': ['Some description', 'Unit with features', 'Some LLC Xtra cool space', 'Unit with features', 'Basic unit', 'basic unit'],
})
ATTEMPT 0: Using np.where
ATTEMPT 0 Result: ValueError as above
def my_func(df, unit, comp, bad_info_list):
    """Return new dataframe with new column combining company and unit descriptions

    Args:
        df (DataFrame): Pandas dataframe with product and brand info
        unit (str): name of unit description column
        comp (str): name of company name column
        bad_info_list (list): list of unwanted terms
    """
    # (WORKS) Make new company column filtering out unwanted terms
    df["comp_new"] = df[comp].apply(lambda x: x if x not in bad_info_list else np.nan)
    # (!!!START OF NOT WORKING!!!)
    # Make new column with brand and product descriptions
    df["comp_unit"] = np.where(
        (df["comp_new"].isnull().all() or df["comp_new"].isin(df[unit])),
        df[unit],
        (df["comp_new"] + " " + df[unit]),
    )
    # (!!!END OF NOT WORKING!!!)
    return df

df_new = my_func(df, "unit_desc", "company", bad_info_list=["Generic", "Name"])
print(df_new)
ATTEMPT 1: Using np.where and the ValueError suggestions, as noted in the inline comment
ATTEMPT 1 Result:
- Using .all(): seems to consider all matches across the whole Series, so it produces incorrect results
- Using .any(): seems to consider any match across the whole Series, so it produces incorrect results
- Using .item(): seems to check the size of the whole Series, so it produces ValueError: can only convert an array of size 1 to a Python scalar
- Using .bool(): returns the same ValueError as before
def my_func(df, unit, comp, bad_info_list):
    # (WORKS) Make new company column filtering out unwanted terms
    df["comp_new"] = df[comp].apply(lambda x: x if x not in bad_info_list else np.nan)
    # (!!!START OF NOT WORKING!!!)
    # Make new column with brand and product descriptions
    df["comp_unit"] = np.where(
        ((df["comp_new"].isnull().all()) | (df["comp_new"].isin(df[unit]))),  # Swap .all() with other options
        df[unit],
        (df["comp_new"] + " " + df[unit]),
    )
    # (!!!END OF NOT WORKING!!!)
    return df

df_new = my_func(df, "unit_desc", "company", bad_info_list=["Generic", "Name"])
print(df_new)
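For readers following along, here is a minimal sketch of why the reductions above cannot work as row-wise conditions. It uses a small made-up Series (not the question's data): `.all()` and `.any()` collapse the whole column into one boolean, `|` combines two boolean Series row by row, and putting a Series on the left of `or` forces Python to ask for a single truth value, which is what raises the ambiguity error.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration.
s = pd.Series(["Some LLC", np.nan, "Another LLC"])

# Element-wise test: one boolean per row.
is_null = s.isnull()
print(is_null.tolist())  # [False, True, False]

# Reductions collapse the whole column into a single boolean.
all_null = bool(is_null.all())
any_null = bool(is_null.any())
print(all_null, any_null)  # False True

# `|` combines two boolean Series row by row.
other = pd.Series([True, False, False])
combined = is_null | other
print(combined.tolist())  # [True, True, False]

# `or` needs the truth value of its left operand; a Series has none.
raised = False
try:
    is_null or other
except ValueError:
    raised = True
print(raised)  # True
```

So inside np.where the condition should stay an unreduced boolean Series built with `|` and `&`.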
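ATTEMPT 1.5: Same as 1, except with .isnull().all() swapped for == np.nan
ATTEMPT 1.5 Result: incorrect results
I find it odd that the isin statement raises no ambiguity error. Perhaps it isn't working as intended?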
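A short sketch of what is likely going on here, again on made-up values: comparing against np.nan with `==` is always False because NaN is not equal to anything (itself included), and `isin` is already element-wise, so it raises no ambiguity error, but it tests whole-value membership rather than substring containment.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration.
s = pd.Series(["Some LLC", np.nan])
units = pd.Series(["Some LLC Xtra cool space", "Unit with features"])

# NaN never compares equal, so `== np.nan` finds nothing.
eq_nan = (s == np.nan).tolist()
print(np.nan == np.nan)  # False
print(eq_nan)            # [False, False]

# isnull() is the element-wise test that does find the missing value.
null_mask = s.isnull().tolist()
print(null_mask)         # [False, True]

# isin checks whether each value equals one of the listed values.
isin_mask = s.isin(units).tolist()
print(isin_mask)         # [False, False]

# A substring check is a different operation.
is_substring = "Some LLC" in units.iloc[0]
print(is_substring)      # True
```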
ATTEMPT 2: Using if/elif/else and different suggestions from the ValueError
It seems this could be solved with a for loop over each condition, but shouldn't there be another way?
ATTEMPT 2 Result: see the bullet points from ATTEMPT 1
def my_func(df, unit, comp, bad_info_list):
    # (WORKS) Make new company column filtering out unwanted terms
    df["comp_new"] = df[comp].apply(lambda x: x if x not in bad_info_list else np.nan)
    # (!!!START OF NOT WORKING!!!)
    if df["comp_new"].isnull():  # Tried .all(), .any(), .item(), etc. just before ":"
        df["comp_unit"] = df[unit]
    elif df["comp_new"].isin(df[unit]):  # Tried .all(), etc. just before ":"
        df["comp_unit"] = df[unit]
    else:
        df["comp_unit"] = df["comp_new"] + " " + df[unit]
    # (!!!END OF NOT WORKING!!!)
    return df

df_new = my_func(df, "unit_desc", "company", bad_info_list=["Generic", "Name"])
print(df_new)
ATTEMPT 3: Using if/elif/else combined with apply
ATTEMPT 3 Result: AttributeError: 'float' object has no attribute 'isin'
bad_info_list = ["Generic", "Name"]
df["comp_new"] = df["company"].apply(lambda x: x if x not in bad_info_list else np.nan)

def comp_unit_merge(df):
    if df["comp_new"] == np.nan:  # .isnull().item():
        return df["unit_desc"]
    elif df["comp_new"].isin(df["unit_desc"]):  # AttributeError: 'float' object has no attribute 'isin'
        return df["unit_desc"]
    else:
        return df["comp_new"] + " " + df["unit_desc"]

df["comp_unit"] = df.apply(comp_unit_merge, axis=1)
print(df)
ATTEMPT 4: Using np.select(conditions, values)
ATTEMPT 4 Result: incorrect results
The company name is not included in the last rows
def my_func(df, unit, comp, bad_info_list):
    # (WORKS) Make new company column filtering out unwanted terms
    df["comp_new"] = df[comp].apply(lambda x: x if x not in bad_info_list else np.nan)
    # (!!!START OF NOT WORKING!!!)
    conditions = [
        ((df["comp_new"] == np.nan) | (df["comp_new"].isin(df[comp]))),
        (df["comp_new"] != np.nan),
    ]
    values = [
        (df[unit]),
        (df["comp_new"] + " " + df[unit]),
    ]
    df["comp_unit"] = np.select(conditions, values)
    # (!!!END OF NOT WORKING!!!)
    return df

df_new = my_func(df, "unit_desc", "company", bad_info_list=["Generic", "Name"])
print(df_new)
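One reason the last rows lose their company name in this attempt: the first condition checks comp_new against the company column (`df[comp]`), so it is True for every row that has a company at all, and np.select takes the first matching condition. Below is a rough sketch of a np.select variant that might do what was intended. The `already_named` helper and the trimmed-down data are my own assumptions, not part of the question.

```python
import numpy as np
import pandas as pd

# Hypothetical trimmed-down data with comp_new already computed.
df = pd.DataFrame({
    "comp_new": [np.nan, np.nan, "Some LLC", "Another LLC"],
    "unit_desc": ["Some description", "Unit with features",
                  "Some LLC Xtra cool space", "Basic unit"],
})

# Row-wise substring test; pd.notna guards the rows without a company.
already_named = pd.Series(
    [pd.notna(c) and c in u for c, u in zip(df["comp_new"], df["unit_desc"])],
    index=df.index,
)

# One condition for "keep the description as is"; everything else
# falls through to the default, which prepends the company name.
conditions = [df["comp_new"].isnull() | already_named]
choices = [df["unit_desc"]]
fallback = df["comp_new"] + " " + df["unit_desc"]

df["comp_unit"] = np.select(conditions, choices, default=fallback)
result = df["comp_unit"].tolist()
print(result)
```

With a single condition, np.where would do the same job; np.select only starts to pay off once there are several distinct outcomes.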
First, try filling the NaN values and then adding the two columns together
df = pd.DataFrame({
    'company': ['Generic', np.nan, 'Some LLC', 'Another LLC', 'Another LLC', 'Some LLC'],
    'unit_desc': ['Some description', 'Unit with features', 'Some LLC Xtra cool space', 'Unit with features', 'Basic unit', 'basic unit'],
})
df = df.fillna('')
df['new_col'] = df['company'] + ' ' + df['unit_desc']
>>> df
       company                 unit_desc                            new_col
0      Generic          Some description           Generic Some description
1                     Unit with features                 Unit with features
2     Some LLC  Some LLC Xtra cool space  Some LLC Some LLC Xtra cool space
3  Another LLC        Unit with features     Another LLC Unit with features
4  Another LLC                Basic unit             Another LLC Basic unit
5     Some LLC                basic unit                Some LLC basic unit
def my_func(dataframe, unit, comp, bad_info_list):
    # Work on a copy so the caller's dataframe is left untouched
    df = dataframe.copy()
    df['comp_new'] = df[comp].apply(lambda x: x if x not in bad_info_list else np.nan)
    # Rows where the company is missing or already appears in the description
    idx = df[df.apply(lambda x: str(x['comp_new']) in str(x[unit]), axis=1) | df['comp_new'].isnull()].index
    df['comp_unit'] = np.where(
        df.index.isin(idx),
        df[unit],
        df['comp_new'] + ' ' + df[unit]
    )
    return df

new_df = my_func(df, 'unit_desc', 'company', ['Generic', 'Name'])
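If I understand correctly, "Attempt 0" is close, but the condition is incorrect. Try this:
df["comp_unit"] = np.where(
    # Row-wise substring check; pd.notna guards the rows where comp_new is NaN
    df["comp_new"].isnull()
    | df.apply(lambda row: pd.notna(row["comp_new"]) and row["comp_new"] in row[unit], axis="columns"),
    df[unit],
    df["comp_new"] + " " + df[unit],
)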
Pandas is not good at vectorized operations between Series (or columns) containing strings, so you will have to fall back on apply(..., axis=1). I would use it only once:
bad_info_list = ["Generic", "Name"]
df_new = df.assign(comp_new=df.apply(
    lambda row: row['unit_desc'] if pd.isna(row['company']) or
                                    row['company'] in bad_info_list or
                                    row['unit_desc'].startswith(row['company'])
    else ' '.join(row), axis=1))
It does not change the original df and gives, as expected:
       company                 unit_desc                        comp_new
0      Generic          Some description                Some description
1          NaN        Unit with features              Unit with features
2     Some LLC  Some LLC Xtra cool space        Some LLC Xtra cool space
3  Another LLC        Unit with features  Another LLC Unit with features
4  Another LLC                Basic unit          Another LLC Basic unit
5     Some LLC                basic unit             Some LLC basic unit
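Note that ' '.join(row) joins every value in the row, so the one-liner above relies on the dataframe holding only the two string columns. A possible variant that names the columns explicitly is sketched below. The `combine` helper and the extra `price` column are hypothetical, added only to show the difference.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "company": ["Generic", np.nan, "Some LLC", "Another LLC"],
    "unit_desc": ["Some description", "Unit with features",
                  "Some LLC Xtra cool space", "Basic unit"],
    "price": [10, 20, 30, 40],  # hypothetical extra, non-string column
})
bad_info_list = ["Generic", "Name"]

def combine(row):
    """Prefix the description with the company unless that is unwanted."""
    company, unit = row["company"], row["unit_desc"]
    if pd.isna(company) or company in bad_info_list or unit.startswith(company):
        return unit
    return f"{company} {unit}"

# assign returns a new dataframe; the original df keeps its columns.
df_new = df.assign(comp_new=df.apply(combine, axis=1))
result = df_new["comp_new"].tolist()
print(result)
```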
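When using axis=1, the applied function receives a row as its argument. Indexing into that row gives you string objects in most cases, except where a NaN is encountered.
Numpy NaNs are actually floats. So when you attempt string operations on the company column, such as checking whether unit_desc contains the company, this throws an error for rows that contain NaN.
Numpy has a function isnan, but calling it on a string also throws an error. So any row that has an actual company value will cause problems for that check.
You could check the data type with isinstance, or you could remove the NaNs from your data ahead of time.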
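The points above can be checked quickly in isolation. This is only an illustrative sketch on literal values. It also shows pd.isna, which (unlike np.isnan) accepts both floats and strings.

```python
import numpy as np
import pandas as pd

# NaN is an ordinary Python float.
nan_type = type(np.nan).__name__
print(nan_type)  # float

# np.isnan works on floats but rejects strings.
isnan_on_str_raises = False
try:
    np.isnan("Some LLC")
except TypeError:
    isnan_on_str_raises = True
print(isnan_on_str_raises)  # True

# A substring test with a float on the left raises as well.
in_raises = False
try:
    np.nan in "Unit with features"
except TypeError:
    in_raises = True
print(in_raises)  # True

# pd.isna handles both kinds of value without raising.
isna_float = pd.isna(np.nan)
isna_str = pd.isna("Some LLC")
print(isna_float, isna_str)  # True False
```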
This example removes the NaNs ahead of time.
badlist = ["Generic", "Name"]

def merge(row):
    if row['company'] in badlist:
        return row['unit_desc']
    # An empty company string (former NaN) is always "in" the description
    if row['company'] in row['unit_desc']:
        return row['unit_desc']
    return f"{row['company']} {row['unit_desc']}".strip()

df['company'] = df['company'].fillna('')
df['comp_unit'] = df.apply(merge, axis=1)
print(df)
Here's an online runnable version.
This is an alternative that detects NaN safely:
badlist = ["Generic", "Name"]

def merge(row):
    # Only call np.isnan on floats, since it raises on strings
    if isinstance(row['company'], float) and np.isnan(row['company']):
        return row['unit_desc']
    if row['company'] in badlist:
        return row['unit_desc']
    if row['company'] in row['unit_desc']:
        return row['unit_desc']
    return f"{row['company']} {row['unit_desc']}".strip()

df['comp_unit'] = df.apply(merge, axis=1)
print(df)