Remove certain elements of a list for multiple columns in a DataFrame
I have a table like this:
Column1 | Column2 | Column3 | Column4 | Column5 |
---|---|---|---|---|
100 | John | [-, 1] | [brown, yellow] | [nan, nan] |
200 | Stefan | [nan, 2] | [nan, yellow] | [-, accepted] |
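As you can see, columns 3-5 consist entirely of lists. What I want is to remove the dash (-) and "nan" elements from the lists in these columns.
So the final output should look like this: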
Column1 | Column2 | Column3 | Column4 | Column5 |
---|---|---|---|---|
100 | John | [1] | [brown, yellow] | [] |
200 | Stefan | [2] | [yellow] | [accepted] |
The closest I could get to this result was with the following line:
Table1["Column3"] = Table1["Column3"].apply(lambda x: [el for el in x if el != '-' if pd.isnull(el) == False])
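The problem is that I don't know how to apply it to every list-valued column in the DataFrame.
This is a simplified example. The original has nearly 15 columns, and I would like to know whether there is a way to do this without writing such a line separately for each of them.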
Try this:
import numpy as np
import pandas as pd

# data
df = pd.DataFrame({'Column1': [100, 200],
                   'Column2': ['John', 'Stefan'],
                   'Column3': [['-', 1], [np.nan, 2]],
                   'Column4': [['brown', 'yellow'], [np.nan, 'yellow']],
                   'Column5': [[np.nan, np.nan], ['-', 'accepted']]})
# stack and explode to get the list elements out of lists
exp_df = df.set_index(['Column1', 'Column2']).stack().explode()
# mask that filters out dash and nans
m = exp_df.ne('-') & exp_df.notna()
# after using m, aggregate back to lists
exp_df[m].groupby(level=[0,1,2]).agg(list).unstack(fill_value=[]).reset_index()
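Another solution:
# keep only the elements that are neither a dash nor NaN; leave non-list cells untouched
for c in df.columns:
    df[c] = df[c].apply(
        lambda x: [v for v in x if v != "-" and pd.notna(v)]
        if isinstance(x, list)
        else x
    )
print(df)
Prints: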
Column1 Column2 Column3 Column4 Column5
0 100 John [1] [brown, yellow] []
1 200 Stefan [2] [yellow] [accepted]
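If you would rather name the list-valued columns explicitly and leave the original DataFrame unmodified, the same comprehension can be moved into a small helper. This is only a sketch: `clean_list`, `list_cols` and `cleaned` are hypothetical names introduced here, and it assumes the list elements are scalars (strings, numbers, NaN).

```python
import numpy as np
import pandas as pd


def clean_list(cell):
    """Drop '-' and missing values from a list cell; return other cells unchanged."""
    if not isinstance(cell, list):
        return cell
    return [v for v in cell if v != "-" and pd.notna(v)]


df = pd.DataFrame({
    "Column1": [100, 200],
    "Column2": ["John", "Stefan"],
    "Column3": [["-", 1], [np.nan, 2]],
    "Column4": [["brown", "yellow"], [np.nan, "yellow"]],
    "Column5": [[np.nan, np.nan], ["-", "accepted"]],
})

# Hypothetical explicit selection; adjust to the real list-valued columns.
list_cols = ["Column3", "Column4", "Column5"]

# assign() returns a new DataFrame, so df itself is left untouched.
cleaned = df.assign(**{c: df[c].map(clean_list) for c in list_cols})
```

With 15 columns, `list_cols` could just as well be built programmatically, for example by checking which columns contain list cells.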
If I understand your goal correctly, this is how I would approach it.
import numpy as np
import pandas as pd
from typing import Any

# 1. Replicate your DataFrame. NaN here is np.nan; I'm not sure what the NaN in your df is.
df = pd.DataFrame({
    'col_1': [100, 200],
    'col_2': ['John', 'Stefan'],
    'col_3': [['-', 1], [np.nan, 2]],
    'col_4': [['brown', 'yellow'], [np.nan, 'yellow']],
})

# 2. remove: drop dashes and NaN from a list cell; non-list cells are returned unchanged.
def remove(element: Any) -> Any:
    if not isinstance(element, list):
        return element
    return [x for x in element if x != '-' and pd.notna(x)]

# 3. detect_col_element_as_list: return True if any cell in the column is a list.
def detect_col_element_as_list(element: pd.Series) -> bool:
    return any(isinstance(x, list) for x in element)

# 4. Collect all columns that have list cells.
cols_contain_list = [col for col in df.columns if detect_col_element_as_list(df[col])]

# 5. Apply remove to every column that has list cells.
for col in cols_contain_list:
    df[col] = df[col].apply(remove)
Let me know if this is what you wanted.
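A side note on why `pd.notna` / `pd.isnull` is a safer NaN check here than comparing against `np.nan` directly. NaN never compares equal to anything, itself included, so a membership test only catches a NaN that happens to be the very same object as `np.nan`. The snippet below is an illustrative sketch with made-up values, not part of any of the answers:

```python
import math

import numpy as np
import pandas as pd

# float("nan") is a NaN that is a different object from np.nan.
values = ["-", np.nan, float("nan"), 1, "yellow"]

# Membership test: list containment checks identity first, then equality,
# so np.nan is matched by identity but the other NaN slips through.
by_membership = [v for v in values if v not in ["-", np.nan]]

# pd.notna treats every NaN as missing, whichever object it is.
by_notna = [v for v in values if v != "-" and pd.notna(v)]

print(len(by_membership))  # 3: one NaN survives
print(by_notna)            # [1, 'yellow']
```

Whether this matters depends on where the NaN values in your lists come from, which the question does not say.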