Replace values in Pandas DataFrame columns based on substrings matching symbols in a list
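I am trying to remove some bad data from two columns of a DataFrame. The columns are prone to corruption, where symbols show up inside the column values. I want to check every value in the two columns and, wherever a symbol appears, replace the whole value with ''.
For example: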
import pandas as pd
bad_chars = [')', ',', '@', '/', '!', '&', '*', '.', '_', ' ']
d = {'p1' : [1,2,3,4,5,6],
'p2' : ['abc*', 'abc@', 'zxya', '&sdf', 'p xx', 'abcd'],
'p3' : ['abc', 'abc.', 'zxya', '&sdf', 'p xx', 'abcd']}
df = pd.DataFrame(d)
   p1    p2    p3
0   1  abc*   abc
1   2  abc@  abc.
2   3  zxya  zxya
3   4  &sdf  &sdf
4   5  p xx  p xx
5   6  abcd  abcd
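I have been trying to use a list comprehension to loop over bad_chars and replace the affected values in columns p2 and p3 with an empty string '', but without success. The result should look something like this: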
   p1    p2    p3
0   1         abc
1   2
2   3  zxya  zxya
3   4
4   5
5   6  abcd  abcd
Once that is done, I would like to remove every row that contains an empty cell in column p2, column p3, or both.
   p1    p2    p3
0   3  zxya  zxya
1   6  abcd  abcd
Here you go:
import pandas as pd
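# Raw strings keep the backslashes intact so each symbol is matched literally by the regex
bad_chars = [r'\)', r'\,', r'\@', r'\/', r'\!', r'\&', r'\*', r'\.', r'\_', r'\ ']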
d = {'p1' : [1,2,3,4,5,6],
'p2' : ['abc*', 'abc@', 'zx_ya', '&sdf', 'p xx', 'abcd'],
'p3' : ['abc', 'abc.', 'zxya', '&sdf', 'p xx', 'abcd']}
df = pd.DataFrame(d)
df.loc[df['p2'].str.contains('|'.join(bad_chars)), 'p2'] = None
df.loc[df['p3'].str.contains('|'.join(bad_chars)), 'p3'] = None
df = df.dropna(subset=['p2', 'p3'])
df
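Note that I changed bad_chars: each symbol is now prefixed with a backslash so that str.contains, which treats the pattern as a regular expression, matches it literally.
Here is another option for you to try.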
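As a variation on the hand-escaped list, the backslashes can also be generated with `re.escape` from the standard library, so the original bad_chars list from the question can be used unchanged. This is only a sketch under that assumption; the names `pattern` and `clean` are made up for illustration.

```python
import re
import pandas as pd

bad_chars = [')', ',', '@', '/', '!', '&', '*', '.', '_', ' ']

# re.escape adds whatever backslashes each symbol needs to match literally
pattern = '|'.join(re.escape(ch) for ch in bad_chars)

df = pd.DataFrame({
    'p1': [1, 2, 3, 4, 5, 6],
    'p2': ['abc*', 'abc@', 'zxya', '&sdf', 'p xx', 'abcd'],
    'p3': ['abc', 'abc.', 'zxya', '&sdf', 'p xx', 'abcd'],
})

# Blank out every cell that contains at least one bad symbol
for col in ['p2', 'p3']:
    df.loc[df[col].str.contains(pattern), col] = None

# Drop rows where either column was blanked (hypothetical name: clean)
clean = df.dropna(subset=['p2', 'p3']).reset_index(drop=True)
print(clean)
```

Building the pattern this way means a new symbol can be added to bad_chars without working out whether it needs escaping.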
import pandas as pd
bad_chars = [')', ',', '@', '/', '!', '&', '*', '.', '_', ' ']
d = {'p1' : [1,2,3,4,5,6],
'p2' : ['abc*', 'abc@', 'zxya', '&sdf', 'p xx', 'abcd'],
'p3' : ['abc', 'abc.', 'zxya', '&sdf', 'p xx', 'abcd']}
df = pd.DataFrame(d)
for i in df.index:
    # Build a True/False list for each cell by checking every character
    # against bad_chars with a list comprehension
    p2_chks = [char in bad_chars for char in df.at[i, "p2"]]
    p3_chks = [char in bad_chars for char in df.at[i, "p3"]]
    # If True appears in either check list, delete the row
    if (True in p2_chks) or (True in p3_chks):
        print("{}: p2 or p3 contains a bad character".format(i))
        df = df.drop(i)
# Reindex the df rows. Use drop=True so the old index
# is not added as a new column
df = df.reset_index(drop=True)
print(df)
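The same per-character check can also be expressed as a boolean mask, which avoids dropping rows one at a time inside a loop. The following is a rough sketch of that idea, not a replacement for the answer above; `has_bad_char`, `bad_rows` and `result` are hypothetical names.

```python
import pandas as pd

# A set makes each membership test cheap
bad_chars = {')', ',', '@', '/', '!', '&', '*', '.', '_', ' '}

df = pd.DataFrame({
    'p1': [1, 2, 3, 4, 5, 6],
    'p2': ['abc*', 'abc@', 'zxya', '&sdf', 'p xx', 'abcd'],
    'p3': ['abc', 'abc.', 'zxya', '&sdf', 'p xx', 'abcd'],
})

def has_bad_char(value):
    # True if any single character of the cell is in bad_chars
    return any(ch in bad_chars for ch in value)

# A row is bad if either p2 or p3 contains a bad character
bad_rows = df['p2'].map(has_bad_char) | df['p3'].map(has_bad_char)

# Keep only the good rows and renumber the index from 0
result = df[~bad_rows].reset_index(drop=True)
print(result)
```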
Please try this:
import pandas as pd
import numpy as np
bad_chars = [')', ',', '@', '/', '!', '&', '*', '.', '_', ' ']
d = {'p1' : [1,2,3,4,5,6],
'p2' : ['abc*', 'abc@', 'zxya', '&sdf', 'p xx', 'abcd'],
'p3' : ['abc', 'abc.', 'zxya', '&sdf', 'p xx', 'abcd']}
df = pd.DataFrame(d)
def check_char(text):
    # Return NaN as soon as any bad character is found in the cell
    for char in bad_chars:
        if char in text:
            return np.nan
    return text

check_cols = ['p2', 'p3']
for col in check_cols:
    df[col] = df[col].apply(check_char)
df = df.dropna(subset=check_cols)
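One detail worth checking: dropna keeps the original row labels, whereas the desired output in the question is numbered 0 and 1. A small sketch, assuming the same data as above, shows the difference and adds a reset_index step; `kept` and `result` are names chosen here for illustration.

```python
import numpy as np
import pandas as pd

bad_chars = [')', ',', '@', '/', '!', '&', '*', '.', '_', ' ']
check_cols = ['p2', 'p3']

df = pd.DataFrame({
    'p1': [1, 2, 3, 4, 5, 6],
    'p2': ['abc*', 'abc@', 'zxya', '&sdf', 'p xx', 'abcd'],
    'p3': ['abc', 'abc.', 'zxya', '&sdf', 'p xx', 'abcd'],
})

def check_char(text):
    # NaN if the cell contains any bad character, otherwise the cell itself
    return np.nan if any(ch in text for ch in bad_chars) else text

for col in check_cols:
    df[col] = df[col].apply(check_char)

kept = df.dropna(subset=check_cols)
print(kept.index.tolist())   # the surviving rows keep their original labels

# Renumber from 0 to match the layout shown in the question
result = kept.reset_index(drop=True)
print(result)
```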