Combining dataframes in Python to a dictionary using one of the dataframes as key
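I have 3 dataframes containing daily data: unique codes, names, and scores. The first value in row 1 is called Rank, followed by the dates; the first column, under Rank, holds the rank number (this first column is used as the index).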
**df1** UNIQUE CODES
Rank 12/8/2017 12/9/2017 .... 1/3/2018
1 Code_1 Code_3 Code_4
2 Code_2 Code_1 Code_2
...
1000 Code_5 Code_6 Code_7
**df2** NAMES
Rank 12/8/2017 12/9/2017 .... 1/3/2018
1 Jon Maria Peter
2 Brian Jon Maria
...
1000 Chris Tim Charles
**df3** SCORES
Rank 12/8/2017 12/9/2017 .... 1/3/2018
1 10 20 30
2 15 10 40
...
1000 25 15 20
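Desired output:
I want to combine these dataframes into a dictionary, using the df1 codes as keys, so that it looks like this:
dictionary = {'Code_1': ['Jon', 20], 'Code_2': ['Brian', 15]}
Because competitors repeat, I need to sum their scores across all of the date columns. So in the example above, the score for Jon (Code_1) would include the scores from 12/8/2017 and 12/9/2017.
There are 1000 rows and 26 columns plus the index, so I need a way to capture all of them. I think a nested loop could work here, but I don't have enough experience to build one that works.
Finally, I would like to sort the dictionary by highest score. Please suggest any solution to this, or a more straightforward way to combine this data and get a score ranking.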
I have attached pictures of the dataframes containing the names, codes, and scores.
[images: names, codes, scores]
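I used the solution suggested below on the 3 dataframes I have. Note that hashtags stands for the codes, players for the names, and trophies for the scores: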
# reshape to get dates into rows
hashtags_reshaped = pd.melt(hashtags, id_vars = ['Rank'],
                            value_vars = hashtags.columns,
                            var_name = 'Date',
                            value_name = 'Code').drop('Rank', axis = 1)
# reshape to get dates into rows
players_reshaped = pd.melt(players, id_vars = ['Rank'],
                           value_vars = hashtags.columns,
                           var_name = 'Date',
                           value_name = 'Name').drop('Rank', axis = 1)
# reshape to get the dates into rows
trophies_reshaped = pd.melt(trophies, id_vars = ['Rank'],
                            value_vars = hashtags.columns,
                            var_name = 'Date',
                            value_name = 'Score').drop('Rank', axis = 1)
# merge the three together.
# This _assumes_ that the dfs are all in the same order and that all the data matches up.
merged_df = pd.DataFrame([hashtags_reshaped['Date'],
                          hashtags_reshaped['Code'], players_reshaped['Name'],
                          trophies_reshaped['Score']]).T
print(merged_df)
# group by code, name, and date; sum the scores together if multiple exist for a given code-name-date grouping
grouped_df = merged_df.groupby(['Code', 'Name', 'Date']).sum().sort_values('Score', ascending = False)
print(grouped_df)
summed_df = merged_df.drop('Date', axis = 1) \
    .groupby(['Code', 'Name']).sum() \
    .sort_values('Score', ascending = False).reset_index()
summed_df['li'] = list(zip(summed_df.Name, summed_df.Score))
print(summed_df)
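But I get strange output: the summed scores should be in the hundreds or thousands (since the average score is 200-300 and the average participation frequency is 4-6 times). The scores I get are far off, although they are matched to the correct codes and names.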
summed_df:
0 (MandiBralaX, 996871590076253)
1 (Arso_C, 9955130513430)
2 (ThatRainbowGuy, 9946)
3 (fabi, 9940)
4 (Dogão, 991917)
5 (Hierbo, 99168)
6 (Clyde, 9916156180128)
7 (.A.R.M.I.N., 9916014310187143)
8 (keftedokofths, 9900)
9 (⚽AngelSosa⚽, 990)
10 (Totoo98, 99)
group_df:
Code Name Score \
0 #JL2J02LY MandiBralaX 996871590076253
1 #80JQ90VC Arso_C 9955130513430
2 #9GGC2CUQ ThatRainbowGuy 9946
3 #8LL989QV fabi 9940
4 #9PPC89L Dogão 991917
5 #2JPLQ8JP8 Hierbo 99168
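One possible reading of the output above, offered as an assumption rather than a confirmed diagnosis: the totals appear in lexicographic order (every value starts with 99, and 9946 ranks above 991917), which is what you would see if the Score column held strings. In that case sum() concatenates the values instead of adding them. The small frame below is hypothetical data built only to show the effect; pd.to_numeric is one way to convert before summing.

```python
import pandas as pd

# Hypothetical miniature of the melted data. dtype=object mirrors what
# pd.DataFrame([...]).T produces, and the scores are strings.
merged_df = pd.DataFrame({
    'Code': ['#A', '#B', '#A'],
    'Name': ['Jon', 'Brian', 'Jon'],
    'Score': ['200', '300', '250'],
}, dtype=object)

# Summing strings concatenates them instead of adding them.
as_text = merged_df.groupby(['Code', 'Name'])['Score'].sum()
print(as_text)

# Converting to numbers first gives arithmetic sums.
merged_df['Score'] = pd.to_numeric(merged_df['Score'])
as_numbers = merged_df.groupby(['Code', 'Name'])['Score'].sum()
print(as_numbers)
```

If the real scores contain non-numeric entries, pd.to_numeric raises by default; passing errors='coerce' turns such entries into NaN so they can be inspected.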
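This should get you most of the way there. I did not create the dictionary at the end exactly as you specified: while you may need that format, you will end up with nested dictionaries or lists, because each code has 1 name but potentially many dates and scores associated with it. How would you like those recorded: as lists, dictionaries, etc.?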
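The code below returns the grouped dataframe; you can output it directly to a dictionary (as shown), but you will probably want to specify the format in detail, especially if you need an ordered dictionary. (Dictionaries were inherently unordered before Python 3.7; if you need an ordered dictionary on those versions you must from collections import OrderedDict and review that documentation.)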
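As a small illustration of the ordering point, here is a sketch using hypothetical (code, (name, score)) pairs that are already sorted by score. On Python 3.7 and later a plain dict keeps insertion order just as OrderedDict does; one remaining difference is that OrderedDict equality is order-sensitive.

```python
from collections import OrderedDict

# Hypothetical pairs, already sorted by score (highest first).
pairs = [('3', ('Name_3', 30)), ('1', ('Name_1', 20)), ('2', ('Name_2', 20))]

plain = dict(pairs)
ordered = OrderedDict(pairs)

# On Python 3.7+ both iterate in insertion order.
print(list(plain))
print(list(ordered))

# Plain dicts compare equal regardless of order; OrderedDicts do not.
same_plain = dict(pairs) == dict(reversed(pairs))
same_ordered = OrderedDict(pairs) == OrderedDict(reversed(pairs))
print(same_plain, same_ordered)
```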
import pandas as pd

# create the dfs; note that 'Code' is set up as a string
df1 = pd.DataFrame({'Rank': [1, 2], '12/8/2017': ['1', '2'], '12/9/2017': ['3', '1']})
df1.set_index('Rank', inplace = True)
# reshape to get dates into rows
# (reset_index() turns 'Rank' back into a column so melt can use it as id_vars)
df1_reshaped = pd.melt(df1.reset_index(), id_vars = ['Rank'],
                       value_vars = list(df1.columns),
                       var_name = 'Date',
                       value_name = 'Code').drop('Rank', axis = 1)
#print(df1_reshaped)

# create the second df
df2 = pd.DataFrame({'Rank': [1, 2], '12/8/2017': ['Name_1', 'Name_2'], '12/9/2017': ['Name_3', 'Name_1']})
df2.set_index('Rank', inplace = True)
# reshape to get dates into rows
df2_reshaped = pd.melt(df2.reset_index(), id_vars = ['Rank'],
                       value_vars = list(df1.columns),
                       var_name = 'Date',
                       value_name = 'Name').drop('Rank', axis = 1)
#print(df2_reshaped)

# create the third df; the scores are strings here
df3 = pd.DataFrame({'Rank': [1, 2], '12/8/2017': ['10', '20'], '12/9/2017': ['30', '10']})
df3.set_index('Rank', inplace = True)
# reshape to get the dates into rows
df3_reshaped = pd.melt(df3.reset_index(), id_vars = ['Rank'],
                       value_vars = list(df1.columns),
                       var_name = 'Date',
                       value_name = 'Score').drop('Rank', axis = 1)
#print(df3_reshaped)

# merge the three together.
# This _assumes_ that the dfs are all in the same order and that all the data matches up.
merged_df = pd.DataFrame([df1_reshaped['Date'], df1_reshaped['Code'], df2_reshaped['Name'], df3_reshaped['Score']]).T
# convert the scores to numbers; summing strings would concatenate them instead of adding
merged_df['Score'] = pd.to_numeric(merged_df['Score'])
print(merged_df)

# group by code, name, and date; sum the scores together if multiple exist for a given code-name-date grouping
grouped_df = merged_df.groupby(['Code', 'Name', 'Date']).sum().sort_values('Score', ascending = False)
print(grouped_df)

# drop the date and sum per code and name, highest score first
summed_df = merged_df.drop('Date', axis = 1) \
    .groupby(['Code', 'Name']).sum() \
    .sort_values('Score', ascending = False).reset_index()
summed_df['li'] = list(zip(summed_df.Name, summed_df.Score))
print(summed_df)
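The merge above relies on the three reshaped frames being in the same row order. As a hedged alternative, the sketch below aligns them on the (Rank, date) index instead, using stack() and pd.concat. The frame names codes, names, and scores are hypothetical stand-ins for df1, df2, and df3, with Rank as the index.

```python
import pandas as pd

# Hypothetical stand-ins for df1, df2, df3, with Rank as the index.
dates = ['12/8/2017', '12/9/2017']
rank = pd.Index([1, 2], name='Rank')
codes = pd.DataFrame([['1', '3'], ['2', '1']], index=rank, columns=dates)
names = pd.DataFrame([['Name_1', 'Name_3'], ['Name_2', 'Name_1']], index=rank, columns=dates)
scores = pd.DataFrame([[10, 30], [20, 10]], index=rank, columns=dates)

# stack() turns each wide frame into a Series indexed by (Rank, date);
# concat(axis=1) lines the three Series up on that shared index.
long_df = pd.concat(
    {'Code': codes.stack(), 'Name': names.stack(), 'Score': scores.stack()},
    axis=1,
)
# Make sure the scores are numeric before summing.
long_df['Score'] = pd.to_numeric(long_df['Score'])

# One row per (Code, Name) with the summed score, highest first.
totals = (long_df.groupby(['Code', 'Name'], as_index=False)['Score']
          .sum()
          .sort_values('Score', ascending=False))
print(totals)
```

The rest of this answer continues from summed_df in the code above, not from this sketch.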
Unsorted dictionary:
d = dict(zip(summed_df.Code, summed_df.li))
print(d)
You can of course create an OrderedDict directly, and probably should:
from collections import OrderedDict
d2 = OrderedDict(zip(summed_df.Code, summed_df.li))
print(d2)
summed_df:
Code Name Score li
0 3 Name_3 30 (Name_3, 30)
1 1 Name_1 20 (Name_1, 20)
2 2 Name_2 20 (Name_2, 20)
d:
{'3': ('Name_3', 30), '1': ('Name_1', 20), '2': ('Name_2', 20)}
d2, sorted:
OrderedDict([('3', ('Name_3', 30)), ('1', ('Name_1', 20)), ('2', ('Name_2', 20))])
This returns your (name, score) as a tuple rather than a list, but it should get you most of the way there.
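If the list form from the question is needed, a dict comprehension can build it. This is a minimal sketch: the frame below is a hypothetical stand-in for summed_df with the values from the example output, and the name result is mine.

```python
import pandas as pd

# Hypothetical stand-in for summed_df: one row per code, scores already summed.
summed_df = pd.DataFrame({
    'Code': ['3', '1', '2'],
    'Name': ['Name_3', 'Name_1', 'Name_2'],
    'Score': [30, 20, 20],
})

# Sort by score, highest first, then build {code: [name, score]}.
ranked = summed_df.sort_values('Score', ascending=False)
result = {
    code: [name, int(score)]  # int() converts the NumPy integer to a plain int
    for code, name, score in zip(ranked['Code'], ranked['Name'], ranked['Score'])
}
print(result)
```

On Python 3.7 and later the resulting dictionary iterates in the sorted order, because plain dicts keep insertion order.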