Python: Dropping specific rows in a dataframe and keeping a specific one
Let's say I have this dataframe:
import pandas as pd

Name = ['ID', 'Country', 'IBAN','ID_info_1', 'Dan_Age', 'ID_info_1','Dan_city','ID_info_1','Dan_country','ID_info_1', 'ID_info_2', 'ID_info_2','ID_info_2', 'Dan_sex', 'Dan_Age', 'Dan_country','Dan_sex' , 'Dan_city','Dan_country' ]
Value = ['TAMARA_CO', 'GERMANY','FR56', '12', '18','25','Berlin','34', '55','345','432', '43', 'GER', 'M', '22', 'FRA', 'M', 'Madrid', 'ESP']
Ccy = ['','','','EUR','EUR','EUR','','EUR','','','','EUR','EUR','USD','USD','','CHF', '','DKN']
Group = ['0','0','0','1','1','2','2','3','3','4','1','2','3','4','2','2','2','3','3']
df = pd.DataFrame({'Name':Name, 'Value' : Value, 'Ccy' : Ccy,'Group':Group})
print(df)
           Name      Value  Ccy  Group
0            ID  TAMARA_CO           0
1       Country    GERMANY           0
2          IBAN       FR56           0
3     ID_info_1         12  EUR      1
4       Dan_Age         18  EUR      1
5     ID_info_1         25  EUR      2
6      Dan_city     Berlin           2
7     ID_info_1         34  EUR      3
8   Dan_country         55           3
9     ID_info_1        345           4
10    ID_info_2        432           1
11    ID_info_2         43  EUR      2
12    ID_info_2        GER  EUR      3
13      Dan_sex          M  USD      4
14      Dan_Age         22  USD      2
15  Dan_country        FRA           2
16      Dan_sex          M  CHF      2
17     Dan_city     Madrid           3
18  Dan_country        ESP  DKN      3
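I want to shrink this dataframe! I only want to reduce the rows whose Name contains the string 'info', keeping for each of them just the row with the highest value in the 'Group' column. In this dataframe, that means keeping the row 'ID_info_1' in Group 4 and the row 'ID_info_2' in Group 3. In addition, I want to change their value in the 'Group' column to 1.
So in the end I want to get this new dataframe, with the index reset as well: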
           Name      Value  Ccy  Group
0            ID  TAMARA_CO           0
1       Country    GERMANY           0
2          IBAN       FR56           0
3     ID_info_1         12  EUR      1
4       Dan_Age         18  EUR      1
5      Dan_city     Berlin           2
6   Dan_country         55           3
7     ID_info_1        345           1
8     ID_info_2        GER  EUR      1
9       Dan_sex          M  USD      4
10      Dan_Age         22  USD      2
11  Dan_country        FRA           2
12      Dan_sex          M  CHF      2
13     Dan_city     Madrid           3
14  Dan_country        ESP  DKN      3
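Does anyone have an efficient idea?
Thanks
You can create a mask with a lambda function that searches for the string 'info' in the Name column, and then compare the values in the Group column.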
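# Index labels of the rows to drop
arr = []
# True for rows whose Name contains 'info'
mask = df.apply(lambda x: 'info' in x['Name'], axis=1)
for info in df[mask]['Name'].unique():
    # Lowest Group for this name (Group holds strings, so this compares
    # lexicographically, which is fine for single digits)
    min_val = df.loc[df['Name'] == info]['Group'].min()
    # Mark every row of this name above the minimum for removal
    arr += list(df[(df['Name'] == info) & (df['Group'] > min_val)].index)
df.drop(arr, inplace=True)
# drop=True discards the old index instead of keeping it as a column
df.reset_index(drop=True, inplace=True)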
           Name      Value  Ccy  Group
0            ID  TAMARA_CO           0
1       Country    GERMANY           0
2          IBAN       FR56           0
3     ID_info_1         12  EUR      1
4       Dan_Age         18  EUR      1
5      Dan_city     Berlin           2
6   Dan_country         55           3
7     ID_info_2        432           1
8       Dan_sex          M  USD      4
9       Dan_Age         22  USD      2
10  Dan_country        FRA           2
11      Dan_sex          M  CHF      2
12     Dan_city     Madrid           3
13  Dan_country        ESP  DKN      3
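I know the df doesn't look 100% like what you wanted, but this is how I understood your question. Let me know if I'm wrong.
Edit
Re-read the question and edited some of the code.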
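As a side note, the mask-and-loop idea above can probably be written without `apply` or an explicit loop. The following is a minimal sketch, not the answerer's code: it uses a small made-up subset of the question's data, converts Group to integers first (an assumption, since the original column holds strings), and keeps the lowest Group per 'info' name, matching the interpretation in this answer.

```python
import pandas as pd

# Small illustrative subset of the question's data (not the full frame)
df = pd.DataFrame({
    'Name': ['ID', 'ID_info_1', 'Dan_Age', 'ID_info_1', 'ID_info_2', 'ID_info_2'],
    'Group': ['0', '1', '1', '2', '1', '3'],
})

# Assumption: compare groups numerically rather than as strings
df['Group'] = pd.to_numeric(df['Group'])

# Boolean mask: True where Name contains the substring 'info'
mask = df['Name'].str.contains('info')

# Lowest Group per Name, broadcast back onto every row
group_min = df.groupby('Name')['Group'].transform('min')

# Keep all non-'info' rows, plus 'info' rows that sit at their minimum Group
result = df[~mask | (df['Group'] == group_min)].reset_index(drop=True)
print(result)
```

Swapping `'min'` for `'max'` in the `transform` call would keep the highest Group instead, which is what the question asks for.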
How about this:
# select rows with "info"
di = df[df.Name.str.contains('info')]
# Find the rows below max for removal
di = di[di.groupby('Name')['Group'].transform('max') != di['Group']]
# Remove those rows and set a new index as requested
df = df.drop(di.index).reset_index(drop=True)
# Change group to one on remaining "info" rows
df.loc[df.Name.str.contains('info'), 'Group'] = 1
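For anyone who wants to run this end to end, here is a self-contained sketch of the same steps applied to the question's data. The helper name `keep_max_info_rows` and its parameters are hypothetical, and converting Group to integers is an assumption on my part (the original column holds strings, where `max` compares lexicographically and would stop working once a group reaches 10). Note that this follows the rule as stated in the question, so the 'ID_info_1' row in Group 1 is dropped, even though the asker's expected table still shows it.

```python
import pandas as pd

Name = ['ID', 'Country', 'IBAN', 'ID_info_1', 'Dan_Age', 'ID_info_1', 'Dan_city',
        'ID_info_1', 'Dan_country', 'ID_info_1', 'ID_info_2', 'ID_info_2',
        'ID_info_2', 'Dan_sex', 'Dan_Age', 'Dan_country', 'Dan_sex', 'Dan_city',
        'Dan_country']
Value = ['TAMARA_CO', 'GERMANY', 'FR56', '12', '18', '25', 'Berlin', '34', '55',
         '345', '432', '43', 'GER', 'M', '22', 'FRA', 'M', 'Madrid', 'ESP']
Ccy = ['', '', '', 'EUR', 'EUR', 'EUR', '', 'EUR', '', '', '', 'EUR', 'EUR',
       'USD', 'USD', '', 'CHF', '', 'DKN']
Group = ['0', '0', '0', '1', '1', '2', '2', '3', '3', '4', '1', '2', '3', '4',
         '2', '2', '2', '3', '3']
df = pd.DataFrame({'Name': Name, 'Value': Value, 'Ccy': Ccy, 'Group': Group})


def keep_max_info_rows(frame, pattern='info', new_group=1):
    """Hypothetical helper: for names containing `pattern`, keep only the
    row with the highest Group and set its Group to `new_group`."""
    out = frame.copy()
    # Assumption: treat Group as numeric so max() is not lexicographic
    out['Group'] = pd.to_numeric(out['Group'])
    is_info = out['Name'].str.contains(pattern, regex=False)
    # Highest Group per Name, aligned with the original rows
    group_max = out.groupby('Name')['Group'].transform('max')
    # Keep every non-matching row, plus matching rows at their maximum
    keep = ~is_info | (out['Group'] == group_max)
    # Relabel the matching rows, then filter and rebuild the index
    out.loc[is_info, 'Group'] = new_group
    return out[keep].reset_index(drop=True)


result = keep_max_info_rows(df)
print(result)
```

The function works on a copy, so the original `df` is left untouched and can be reused to try a different `pattern` or `new_group`.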