Remove certain elements of a list for multiple columns in a DataFrame

Question

I have a table like this:

Column1	Column2	Column3	Column4	Column5
100	John	[-, 1]	[brown, yellow]	[nan, nan]
200	Stefan	[nan, 2]	[nan, yellow]	[-, accepted]

As you can see, columns 3 to 5 consist entirely of lists. What I want is to remove the dash (-) and "nan" elements from the lists in these columns.

So the final output should look like this:

Column1	Column2	Column3	Column4	Column5
100	John	[1]	[brown, yellow]	[]
200	Stefan	[2]	[yellow]	[accepted]

The closest I have been able to get is with the following:

Table1["Column3"] = Table1["Column3"].apply(lambda x: [el for el in x if el != '-' and pd.notna(el)])

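The problem is that I don't know how to apply it to all the DataFrame columns that consist of lists. This is a simplified example; the original has almost 15 such columns, and I'd like to know whether there is a way to do this without writing a separate statement like this for each of the 15 columns.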

Answer 1

Try this:

import numpy as np
import pandas as pd

# data
df = pd.DataFrame({'Column1': [100, 200],
                   'Column2': ['John', 'Stefan'],
                   'Column3': [['-', 1], [np.nan, 2]],
                   'Column4': [['brown', 'yellow'], [np.nan, 'yellow']],
                   'Column5': [[np.nan, np.nan], ['-', 'accepted']]})

# stack and explode to get the list elements out of lists
exp_df = df.set_index(['Column1', 'Column2']).stack().explode()
# mask that filters out dash and nans
m = exp_df.ne('-') & exp_df.notna()
# after using m, aggregate back to lists
exp_df[m].groupby(level=[0,1,2]).agg(list).unstack(fill_value=[]).reset_index()
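
To see what the intermediate steps produce, here is a minimal sketch on a one-row frame (the smaller frame and the name `kept` are my own, for illustration only). After `stack().explode()` each list element gets its own row, indexed by Column1, Column2 and the original column name, so the mask can be applied element by element.

```python
import numpy as np
import pandas as pd

# Illustrative one-row frame (not the full data from the question)
small = pd.DataFrame({'Column1': [100],
                      'Column2': ['John'],
                      'Column3': [['-', 1]],
                      'Column4': [['brown', np.nan]]})

# One row per list element; index levels are Column1, Column2, column name
exp_df = small.set_index(['Column1', 'Column2']).stack().explode()

# True only for elements that are neither a dash nor NaN
m = exp_df.ne('-') & exp_df.notna()

# Elements that survive the filter, still labelled by their source column
kept = exp_df[m]
print(kept)
```

The third index level is what lets `groupby(level=[0,1,2])` put the surviving elements back into one list per original cell.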

Answer 2

Another solution:

for c in df.columns:
    df[c] = df[c].apply(
        lambda x: [v for v in x if v != "-" and pd.notna(v)]
        if isinstance(x, list)
        else x
    )

print(df)

Prints:

   Column1 Column2 Column3          Column4     Column5
0      100    John     [1]  [brown, yellow]          []
1      200  Stefan     [2]         [yellow]  [accepted]
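
If you'd rather not loop over every column, the same comprehension can be wrapped in a small helper that takes the column names explicitly. This is only a sketch: `clean_list_columns` and its `unwanted` parameter are names I made up, not part of pandas.

```python
import numpy as np
import pandas as pd


def clean_list_columns(df, columns, unwanted=("-",)):
    """Return a copy of df with unwanted values and NaN removed from list cells.

    Hypothetical helper: only the given columns are touched, and cells
    that are not lists are left unchanged.
    """
    out = df.copy()
    for c in columns:
        out[c] = out[c].apply(
            lambda x: [v for v in x if v not in unwanted and pd.notna(v)]
            if isinstance(x, list)
            else x
        )
    return out


df = pd.DataFrame({'Column1': [100, 200],
                   'Column2': ['John', 'Stefan'],
                   'Column3': [['-', 1], [np.nan, 2]],
                   'Column4': [['brown', 'yellow'], [np.nan, 'yellow']],
                   'Column5': [[np.nan, np.nan], ['-', 'accepted']]})

cleaned = clean_list_columns(df, ['Column3', 'Column4', 'Column5'])
print(cleaned)
```

Because the helper builds new lists and assigns them to a copy, the original `df` keeps its unfiltered lists.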

Answer 3

If I understand your goal correctly, here is how I would approach it.

import pandas as pd
import numpy as np
from typing import Any

### 1. Replicate your DataFrame. nan here is np.nan; I'm not sure what nan is in your df.
df = pd.DataFrame({
    'col_1':[100,200],
    'col_2':['John','Stefan'],
    'col_3':[['-', 1],[np.nan,2]],
    'col_4':[['brown', 'yellow'],[np.nan, 'yellow']]
})

### 2. remove: drops dashes and NaN from a cell's list; non-list cells are returned unchanged
def remove(element: Any) -> Any:
    if not isinstance(element, list):  # in case some cell value is not a list
        return element
    return [x for x in element if x != '-' and pd.notna(x)]

### 3. detect_col_element_as_list: returns True if any cell in the given column is a list
def detect_col_element_as_list(element: pd.Series) -> bool:
    return any(isinstance(x, list) for x in element)

### 4. Collect all columns that have at least one list-valued cell
cols_contain_list = [col for col in df.columns if detect_col_element_as_list(df[col])]

### 5. Apply remove to every column that holds lists
for col in cols_contain_list:
    df[col] = df[col].apply(remove)
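
A side note on the NaN check, as a small sketch: a membership test against a list containing `np.nan` only matches the `np.nan` object itself, because NaN never compares equal to anything, including another NaN. `pd.isna` and `pd.notna` don't have this limitation. The variable names below are just for illustration.

```python
import numpy as np
import pandas as pd

other_nan = float("nan")  # a NaN value that is not the np.nan object

# `in` checks identity first, so the np.nan object itself is found
same_object_found = np.nan in ['-', np.nan]

# A different NaN object is not found, since NaN == NaN is False
other_object_found = other_nan in ['-', np.nan]

# pd.isna recognises both
both_detected = pd.isna(np.nan) and pd.isna(other_nan)

print(same_object_found, other_object_found, both_detected)
```

So if the NaN values in your real data don't come from `np.nan` directly, a `pd.notna` check is the safer filter.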

Let me know if this is what you wanted.
