pandas: compare columns row-wise and remove duplicates compared to the first column
I have a DataFrame as follows:
import pandas as pd
data = {'name': ['the weather is good', ' we need fresh air','today is sunny', 'we are lucky'],
'name_1': ['we are lucky','the weather is good', ' we need fresh air','today is sunny'],
'name_2': ['the weather is good', 'today is sunny', 'we are lucky',' we need fresh air'],
'name_3': [ 'today is sunny','the weather is good',' we need fresh air', 'we are lucky']}
df = pd.DataFrame(data)
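I want to compare the columns row-wise (that is, compare values that share the same index) and replace duplicates with the word 'same' when they have the same value as the first column. My desired output is: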
                  name               name_1               name_2               name_3
0  the weather is good         we are lucky                 same       today is sunny
1    we need fresh air  the weather is good       today is sunny  the weather is good
2       today is sunny    we need fresh air         we are lucky    we need fresh air
3         we are lucky       today is sunny    we need fresh air                 same
To find these values, I tried the following:
import numpy as np
np.where(df['name'].eq(df['name_1'])|df['name'].eq(df['name_2'])|df['name'].eq(df['name_3']))
But to replace them, I do not know how to formulate (condition, x, y) for np.where(). The following returns 'same' for the whole columns 'name' and 'name_3':
np.where(df['name'].eq(df['name_1'])|df['name'].eq(df['name_2'])|df['name'].eq(df['name_3']),'same',df)
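A likely reason whole columns get overwritten: the combined condition is a one-dimensional boolean Series of length 4, and NumPy broadcasts a 1-D array against the 4x4 frame along the last axis, so it acts as a column mask instead of a row mask. A minimal sketch on the same data, where the names `cond` and `out` are only illustrative:

```python
import numpy as np
import pandas as pd

data = {'name': ['the weather is good', ' we need fresh air', 'today is sunny', 'we are lucky'],
        'name_1': ['we are lucky', 'the weather is good', ' we need fresh air', 'today is sunny'],
        'name_2': ['the weather is good', 'today is sunny', 'we are lucky', ' we need fresh air'],
        'name_3': ['today is sunny', 'the weather is good', ' we need fresh air', 'we are lucky']}
df = pd.DataFrame(data)

# One boolean per row: True where 'name' matches any of the other columns.
cond = (df['name'].eq(df['name_1'])
        | df['name'].eq(df['name_2'])
        | df['name'].eq(df['name_3'])).to_numpy()

# Shape (4,) against shape (4, 4): the mask is lined up with the columns,
# so positions 0 and 3 select the columns 'name' and 'name_3'.
out = np.where(cond, 'same', df.to_numpy())

print(cond.shape, out.shape)
print(out)
```

Because the condition has one entry per row while the replacement is needed per cell, the comparison has to be made column by column (or as a 2-D mask) rather than collapsed into a single Series.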
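IIUC, you want to check which values in the columns 'name_1', 'name_2' and 'name_3' are equal to the value in column 'name' in the same row, replace those values with 'same', and leave everything else unchanged. You are right to use numpy.where, but try rewriting your statement as: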
import numpy as np
cols = ['name_1','name_2','name_3']
for c in cols:
    # keep the value unless it equals the 'name' value in the same row
    df[c] = np.where(df['name'].eq(df[c]),'same',df[c])
This gives you:
                  name               name_1               name_2               name_3
0  the weather is good         we are lucky                 same       today is sunny
1    we need fresh air  the weather is good       today is sunny  the weather is good
2       today is sunny    we need fresh air         we are lucky    we need fresh air
3         we are lucky       today is sunny    we need fresh air                 same
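As a possible alternative to the loop (not part of the answer above), the same replacement can be sketched without iterating over columns by building a 2-D boolean mask with `DataFrame.eq(..., axis=0)` and applying it with `DataFrame.mask`. The variable name `matches` is only illustrative:

```python
import pandas as pd

data = {'name': ['the weather is good', ' we need fresh air', 'today is sunny', 'we are lucky'],
        'name_1': ['we are lucky', 'the weather is good', ' we need fresh air', 'today is sunny'],
        'name_2': ['the weather is good', 'today is sunny', 'we are lucky', ' we need fresh air'],
        'name_3': ['today is sunny', 'the weather is good', ' we need fresh air', 'we are lucky']}
df = pd.DataFrame(data)

cols = ['name_1', 'name_2', 'name_3']

# axis=0 aligns df['name'] with the row index, so every column in `cols`
# is compared against the 'name' value of the same row.
matches = df[cols].eq(df['name'], axis=0)

# Replace only the matching cells; all other cells keep their value.
df[cols] = df[cols].mask(matches, 'same')

print(df)
```

Both versions compare exact strings, so values that differ only by leading or trailing whitespace are treated as different.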