Python Pandas: How to groupby and compare columns
Here is my DataFrame 'df':
match         name                  group
adamant       Adamant Home Network  86
adamant       ADAMANT, Ltd.         86
adamant bild  TOV Adamant-Bild      86
360works      360WORKS              94
360works      360works.com          94
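For each group number, I want to compare the names one by one and check whether they match the same word in the 'match' column.
So the desired output would be counts: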
If they match we count it as 'TP' and if not we count it as 'FN'.
I tried to count the matching words for each group number, but this does not help at all with what I want:
df.groupby('group').count()
Does anyone know how to do this?
If I understand your question correctly, this should do the trick:
import re
import pandas

df = pandas.DataFrame([['adamant', 'Adamant Home Network', 86], ['adamant', 'ADAMANT, Ltd.', 86],
                       ['adamant bild', "TOV Adamant-Bild", 86], ['360works', '360WORKS', 94],
                       ['360works ', "360works.com ", 94]], columns=['match', 'name', 'group'])

def my_function(group):
    for i, row in group.iterrows():
        # keep only the letters of each value, lowercase them, and look for an inclusion
        if ''.join(re.findall("[a-zA-Z]+", row['match'])).lower() not in ''.join(
                re.findall("[a-zA-Z]+", row['name'])).lower():
            # if one of the inclusions fails, we return 'FN'
            return 'FN'
    # if all inclusions succeed, we return 'TP'
    return 'TP'

res_series = df.groupby('group').apply(my_function)
res_series.name = 'count'
res_df = res_series.reset_index()
print(res_df)
This gives you the following DataFrame:
   group count
0     86    TP
1     94    TP
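The answer above returns a single label per group, while the question asks for counts. A minimal sketch of one way to get per-group counts, assuming the same letters-only, lowercase inclusion rule is what counts as a match. The helper names `letters_only` and `label_row` are made up for this illustration and are not part of the original answer:

```python
import re

import pandas as pd

df = pd.DataFrame(
    [
        ["adamant", "Adamant Home Network", 86],
        ["adamant", "ADAMANT, Ltd.", 86],
        ["adamant bild", "TOV Adamant-Bild", 86],
        ["360works", "360WORKS", 94],
        ["360works", "360works.com", 94],
    ],
    columns=["match", "name", "group"],
)


def letters_only(text):
    """Keep ASCII letters only and lowercase them (hypothetical helper)."""
    return "".join(re.findall("[a-zA-Z]+", text)).lower()


def label_row(row):
    # 'TP' when the normalized 'match' value is contained in the normalized 'name'.
    return "TP" if letters_only(row["match"]) in letters_only(row["name"]) else "FN"


# One label per row, aligned with df's index.
labels = df.apply(label_row, axis=1).rename("label")

# One row per group, one column per label; a label that never occurs gets 0.
counts = pd.crosstab(df["group"], labels).reindex(columns=["TP", "FN"], fill_value=0)
print(counts)
```

Note that the letters-only rule drops digits, so '360works' is compared as 'works'. If digits matter for your data, the pattern would need to be widened, for example to "[a-zA-Z0-9]+".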
This function compares the name and match columns row by row for each group it is given:
def apply_func(df):
    x = df['name'] == df['match']
    return x.map({False:'FN', True:'TP'})

In [683]: temp.join(temp.groupby('group').apply(apply_func).reset_index(), rsuffix='_1', how='left')
Out[683]:
          match                  name  group  group_1  level_1   0
0       adamant  Adamant Home Network     86       86        0  FN
1       adamant         ADAMANT, Ltd.     86       86        1  FN
2  adamant bild      TOV Adamant-Bild     86       86        2  FN
3      360works              360WORKS     94       94        3  FN
4      360works          360works.com     94       94        4  FN
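Every row comes out as a mismatch here because `==` is an exact, case-sensitive comparison. Since the comparison is element-wise, it can also be written without groupby, apply, or join. A rough sketch, assuming the data from the question; the column names `strict` and `caseless` are invented for the example, and ignoring case is only one possible way to loosen the rule:

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["adamant", "Adamant Home Network", 86],
        ["adamant", "ADAMANT, Ltd.", 86],
        ["adamant bild", "TOV Adamant-Bild", 86],
        ["360works", "360WORKS", 94],
        ["360works", "360works.com", 94],
    ],
    columns=["match", "name", "group"],
)

to_label = {True: "TP", False: "FN"}

# Exact equality, the same test apply_func performs, computed on whole columns.
df["strict"] = df["name"].eq(df["match"]).map(to_label)

# Assumed looser rule: compare after lowercasing both columns.
df["caseless"] = df["name"].str.lower().eq(df["match"].str.lower()).map(to_label)

print(df)
```

With the case-insensitive rule only '360WORKS' equals its 'match' value; the other names contain extra words or punctuation, so an inclusion test would be needed to catch them.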