根据重复项更改数据框中的值 - python
changing values in data frame based on duplicates - python
我有一个超过 100k 行的相当大的数据集,其中有许多重复项和一些缺失值或错误值。尝试在下面的代码片段中简化问题。
sampleData = {
'BI Business Name' : ['AAA', 'BBB', 'CCC', 'DDD','DDD'],
'BId Postcode' : ['NW1 8NZ', 'NW1 8NZ', 'WC2N 4AA', 'CV7 9JY', 'CV7 9JY'],
'BI Website' : ['www@1', 'www@1', 'www@2', 'www@3', np.nan],
'BI Telephone' : ['999', '999', '666', np.nan, '12345']
}
df = pd.DataFrame(sampleData)
我正在尝试更改基于重复行的值,因此如果任何三个字段匹配,那么第四个字段也应该匹配。我应该得到这样的结果:
result = {
'BI Business Name' : ['AAA', 'AAA', 'CCC', 'DDD','DDD'],
'BId Postcode' : ['NW1 8NZ', 'NW1 8NZ', 'WC2N 4AA', 'CV7 9JY', 'CV7 9JY'],
'BI Website' : ['www@1', 'www@1', 'www@2', 'www@3', 'www@3'],
'BI Telephone' : ['999', '999', '666', '12345', '12345']
}
df = pd.DataFrame(result)
我找到了非常冗长的方法-这里只显示更改名称的部分。
df['Phone_code_web'] = df['BId Postcode'] + df['BI Website'] + df['BI Telephone']
reference_name = df[['BI Business Name', 'BI Telephone', 'BId Postcode','BI Website']]
reference_name = reference_name.dropna()
reference_name['Phone_code_web'] = reference_name['BId Postcode'] + reference_name['BI Website'] +
reference_name['BI Telephone']
duplicate_ref = reference_name[reference_name['Phone_code_web'].duplicated()]
reference_name = pd.concat([reference_name,duplicate_ref]).drop_duplicates(keep=False)
reference_name
def replace_name(row):
try:
old_name = row['BI Business Name']
reference = row['Phone_code_web']
new_name = reference_name[reference_name['Phone_code_web']==reference].iloc[0,0]
print(new_name)
return new_name
except Exception as e:
return old_name
df['BI Business Name']=df.apply(replace_name, axis=1)
df
有更简单的方法吗?
你可以试试这个:
import pandas as pd
sampleData = {
'BI Business Name': ['AAA', 'BBB', 'CCC', 'DDD','DDD'],
'BId Postcode': ['NW1 8NZ', 'NW1 8NZ', 'WC2N 4AA', 'CV7 9JY', 'CV7 9JY'],
'BI Website': ['www@1', 'www@1', 'www@2', 'www@3', np.nan],
'BI Telephone': ['999', '999', '666', np.nan, '12345']
}
df = pd.DataFrame(sampleData)
print(df)
def fill_gaps(_df, _x): # _df and _x are local variables that represent the dataframe and one of its rows, respectively
# pd.isnull(_x) = list of Booleans indicating which columns have NaNs
# df.columns[pd.isnull(_x)] = list of columns whose value is a NaN
for col in df.columns[pd.isnull(_x)]:
# len(set(y) & set(_x)) = length of the intersection of the row being considered (_x) and each of the other rows in turn (y)
# the mask is a list of Booleans which are True if:
# 1) y[col] is not Null (e.g. for row 3 we need to replace (BI Telephone = NaN) with a non-NaN 'BI Telephone' value)
# 2) and the length of the intersection above is at least 3 (as required)
mask = df.apply(lambda y: pd.notnull(y[col]) and len(set(y) & set(_x)) == 3, axis=1)
# if the mask has at least one "True" value, select the value in the corresponding column (if there are several possible values, select the first one)
_x[col] = df[mask][col].iloc[0] if any(mask) else _x[col]
return _x
# Apply the logic described above to each row in turn (x = each row)
df = df.apply(lambda x: fill_gaps(df, x), axis=1)
print(df)
输出:
BI Business Name BId Postcode BI Website BI Telephone
0 AAA NW1 8NZ www@1 999
1 BBB NW1 8NZ www@1 999
2 CCC WC2N 4AA www@2 666
3 DDD CV7 9JY www@3 NaN
4 DDD CV7 9JY NaN 12345
BI Business Name BId Postcode BI Website BI Telephone
0 AAA NW1 8NZ www@1 999
1 BBB NW1 8NZ www@1 999
2 CCC WC2N 4AA www@2 666
3 DDD CV7 9JY www@3 12345
4 DDD CV7 9JY www@3 12345
我有一个超过 100k 行的相当大的数据集,其中有许多重复项和一些缺失值或错误值。尝试在下面的代码片段中简化问题。
sampleData = {
'BI Business Name' : ['AAA', 'BBB', 'CCC', 'DDD','DDD'],
'BId Postcode' : ['NW1 8NZ', 'NW1 8NZ', 'WC2N 4AA', 'CV7 9JY', 'CV7 9JY'],
'BI Website' : ['www@1', 'www@1', 'www@2', 'www@3', np.nan],
'BI Telephone' : ['999', '999', '666', np.nan, '12345']
}
df = pd.DataFrame(sampleData)
我正在尝试更改基于重复行的值,因此如果任何三个字段匹配,那么第四个字段也应该匹配。我应该得到这样的结果:
result = {
'BI Business Name' : ['AAA', 'AAA', 'CCC', 'DDD','DDD'],
'BId Postcode' : ['NW1 8NZ', 'NW1 8NZ', 'WC2N 4AA', 'CV7 9JY', 'CV7 9JY'],
'BI Website' : ['www@1', 'www@1', 'www@2', 'www@3', 'www@3'],
'BI Telephone' : ['999', '999', '666', '12345', '12345']
}
df = pd.DataFrame(result)
我找到了非常冗长的方法-这里只显示更改名称的部分。
df['Phone_code_web'] = df['BId Postcode'] + df['BI Website'] + df['BI Telephone']
reference_name = df[['BI Business Name', 'BI Telephone', 'BId Postcode','BI Website']]
reference_name = reference_name.dropna()
reference_name['Phone_code_web'] = reference_name['BId Postcode'] + reference_name['BI Website'] +
reference_name['BI Telephone']
duplicate_ref = reference_name[reference_name['Phone_code_web'].duplicated()]
reference_name = pd.concat([reference_name,duplicate_ref]).drop_duplicates(keep=False)
reference_name
def replace_name(row):
try:
old_name = row['BI Business Name']
reference = row['Phone_code_web']
new_name = reference_name[reference_name['Phone_code_web']==reference].iloc[0,0]
print(new_name)
return new_name
except Exception as e:
return old_name
df['BI Business Name']=df.apply(replace_name, axis=1)
df
有更简单的方法吗?
你可以试试这个:
import pandas as pd
sampleData = {
'BI Business Name': ['AAA', 'BBB', 'CCC', 'DDD','DDD'],
'BId Postcode': ['NW1 8NZ', 'NW1 8NZ', 'WC2N 4AA', 'CV7 9JY', 'CV7 9JY'],
'BI Website': ['www@1', 'www@1', 'www@2', 'www@3', np.nan],
'BI Telephone': ['999', '999', '666', np.nan, '12345']
}
df = pd.DataFrame(sampleData)
print(df)
def fill_gaps(_df, _x): # _df and _x are local variables that represent the dataframe and one of its rows, respectively
# pd.isnull(_x) = list of Booleans indicating which columns have NaNs
# df.columns[pd.isnull(_x)] = list of columns whose value is a NaN
for col in df.columns[pd.isnull(_x)]:
# len(set(y) & set(_x)) = length of the intersection of the row being considered (_x) and each of the other rows in turn (y)
# the mask is a list of Booleans which are True if:
# 1) y[col] is not Null (e.g. for row 3 we need to replace (BI Telephone = NaN) with a non-NaN 'BI Telephone' value)
# 2) and the length of the intersection above is at least 3 (as required)
mask = df.apply(lambda y: pd.notnull(y[col]) and len(set(y) & set(_x)) == 3, axis=1)
# if the mask has at least one "True" value, select the value in the corresponding column (if there are several possible values, select the first one)
_x[col] = df[mask][col].iloc[0] if any(mask) else _x[col]
return _x
# Apply the logic described above to each row in turn (x = each row)
df = df.apply(lambda x: fill_gaps(df, x), axis=1)
print(df)
输出:
BI Business Name BId Postcode BI Website BI Telephone
0 AAA NW1 8NZ www@1 999
1 BBB NW1 8NZ www@1 999
2 CCC WC2N 4AA www@2 666
3 DDD CV7 9JY www@3 NaN
4 DDD CV7 9JY NaN 12345
BI Business Name BId Postcode BI Website BI Telephone
0 AAA NW1 8NZ www@1 999
1 BBB NW1 8NZ www@1 999
2 CCC WC2N 4AA www@2 666
3 DDD CV7 9JY www@3 12345
4 DDD CV7 9JY www@3 12345