Pandas isin() output to string and general code optimisation
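I have just started using Python to analyse data at work, so I could really use some help here :)
I have a df of African countries with a bunch of indicators, and another df whose dimensions represent groupings: if a country belongs to a group, its name appears in that group's column.
Here is a snapshot: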
# indicators df
df_indicators= pd.DataFrame({'Country': ['Algeria', 'Angola', 'Benin'],
'Commitment to CAADP Process': [np.nan, 0.1429, 0.8571]})
# groupings df
df_groupings= pd.DataFrame({'Fragile': ['Somalia', 'Angola', 'Benin'],
'SSA': ['Angola', 'Benin', 'South Africa']})
# what I would like to have without doing it manually
df_indicators['Fragile'] = ['Not Fragile', 'Fragile', 'Fragile']
df_indicators['SSA'] = ['Not SSA', 'SSA', 'SSA']
df_indicators
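I would like to add dimensions to the indicators df that tell me whether each country is a fragile state (and likewise for the other groupings). That is why I have the second df, which lists the countries belonging to each category.
I used isin() to check membership, but what I really want is that, for example, in the new dimension "Fragile", TRUE values are replaced by "FRAGILE" and FALSE values by "NOT FRAGILE", instead of plain TRUE and FALSE.
Needless to say, if you see any way to improve this code, I am very eager to learn from the pros! Especially if you work in the field of Sustainable Development Goals statistics.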
import pandas as pd
import numpy as np
excel_file = 'Downloads/BR_Data.xlsx'
indicators = pd.read_excel(excel_file, sheet_name="Indicators", header=1)
groupings = pd.read_excel(excel_file, sheet_name="Groupings", header=0)
# Title countries in the Sub-Saharan Africa dimension
decap = groupings["Sub-Saharan Africa (World Bank List)"].str.title()
groupings["Sub-Saharan Africa (World Bank List)"] = decap
# Create list of legal country names
legal_tags = {"Côte d’Ivoire":"Ivory Coast", "Cote D'Ivoire":"Ivory Coast", "Democratic Republic of the Congo":"DR Congo", "Congo, Dem. Rep.":"DR Congo",
"Congo, Repub. of the":"DR Congo", "Congo, Rep.": "DR Congo", "Dr Congo": "DR Congo", "Central African Rep.":"Central African Republic", "Sao Tome & Principe":
"Sao Tome and Principe", "Gambia, The":"Gambia"}
# I am sure there is a way to insert a list of the column names instead of copy pasting the name of every column label 5 times
groupings.replace({"Least Developing Countries in Africa (UN Classification, used by WB)" : legal_tags}, inplace = True)
groupings.replace({"Oil Exporters (McKinsey Global Institute)" : legal_tags}, inplace = True)
groupings.replace({"Sub-Saharan Africa (World Bank List)" : legal_tags}, inplace = True)
groupings.replace({"African Fragile and Conflict Affected Aread (OECD)" : legal_tags}, inplace = True)
groupings
# If the country in the indicators df is found in the groupings df, assign True to the new column [LDC] => CAN I REPLACE TRUE WITH "LDC" etc...?
indicators["LDC"] = indicators["Country"].isin(groupings["Least Developing Countries in Africa (UN Classification, used by WB)"])
indicators["Fragile"] = indicators["Country"].isin(groupings["African Fragile and Conflict Affected Aread (OECD)"])
indicators["Oil"] = indicators["Country"].isin(groupings["Oil Exporters (McKinsey Global Institute)"])
indicators["SSA"] = indicators["Country"].isin(groupings["Sub-Saharan Africa (World Bank List)"])
indicators["Landlock"] = indicators["Country"].isin(groupings['Landlocked (UNCTAD List)'])
# When I concatenate the data frames of the groupings average I get an index with a bunch of true, false, true, false etc...
df = indicators.merge(groupings, left_on = "Country", right_on= "Country", how ="right")
labels = ['African Fragile and Conflict Affected Aread (OECD)', 'Sub-Saharan Africa (World Bank List)', 'Landlocked (UNCTAD List)', 'North Africa (excl. Middle east)', 'Oil Exporters (McKinsey Global Institute)', 'Least Developing Countries in Africa (UN Classification, used by WB)']
df.drop(labels, axis = 1, inplace = True)
df.loc['mean'] = df.mean()
df_regions = df[:].groupby('Regional Group').mean()
df_LDC = df[:].groupby('LDC').mean()
df_Oil = df[:].groupby('Oil').mean()
df_SSA = df[:].groupby('SSA').mean()
df_landlock = df[:].groupby('Landlock').mean()
df_fragile = df[:].groupby('Fragile').mean()
frames = [df_regions, df_Oil, df_SSA, df_landlock, df_fragile]
result = pd.concat(frames)
result
You can apply a function to the Series instead of using isin():
def get_value(x, y, choice):
    # Return choice[0] if x is found in y, otherwise choice[1]
    if x in y:
        return choice[0]
    else:
        return choice[1]

indicators["Fragile"] = indicators["Country"].apply(get_value, y=groupings["..."].tolist(), choice=["Fragile", "Not Fragile"])
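The tolist() call is needed here, because `x in y` on a pandas Series checks the index labels rather than the values. This code applies the function to every element of the Country column and returns the first choice if the country is in the list and the second choice otherwise.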
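A vectorised alternative keeps isin() and converts its boolean result to labels with np.where. The following is only a sketch built on the snapshot data from the question; the helper name `add_group_labels` is hypothetical, and it assumes that each grouping column name can double as the label.

```python
import numpy as np
import pandas as pd

# Snapshot data from the question
df_indicators = pd.DataFrame({
    'Country': ['Algeria', 'Angola', 'Benin'],
    'Commitment to CAADP Process': [np.nan, 0.1429, 0.8571],
})
df_groupings = pd.DataFrame({
    'Fragile': ['Somalia', 'Angola', 'Benin'],
    'SSA': ['Angola', 'Benin', 'South Africa'],
})


def add_group_labels(indicators, groupings, columns):
    """Return a copy of `indicators` with one label column per grouping.

    Hypothetical helper: the label is the grouping column name, and
    non-members get 'Not <name>'.
    """
    out = indicators.copy()
    for col in columns:
        # Boolean mask: is the country listed in this grouping column?
        member = out['Country'].isin(groupings[col].dropna())
        # Turn True/False into the two string labels
        out[col] = np.where(member, col, 'Not ' + col)
    return out


labelled = add_group_labels(df_indicators, df_groupings, ['Fragile', 'SSA'])
print(labelled)
```

With string labels in place, a later `groupby('Fragile')` would be indexed by 'Fragile' and 'Not Fragile' rather than by True and False.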
Hope this helps.
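On the side question in the code comments about repeating replace() once per column: one option is to select the columns with a list and call replace() once with the flat name mapping. This is a minimal sketch; the frame, the short column names, and the reduced `legal_tags` mapping below are hypothetical stand-ins for the real spreadsheet data.

```python
import pandas as pd

# Hypothetical, shortened stand-ins for the real grouping columns
groupings = pd.DataFrame({
    'SSA': ["Cote D'Ivoire", 'Gambia, The', 'Angola'],
    'Oil': ['Angola', 'Congo, Dem. Rep.', 'Gabon'],
})

# Reduced version of the name-harmonisation mapping from the question
legal_tags = {
    "Cote D'Ivoire": 'Ivory Coast',
    'Gambia, The': 'Gambia',
    'Congo, Dem. Rep.': 'DR Congo',
}

# Columns to clean, listed once instead of one replace() call per column
cols = ['SSA', 'Oil']
groupings[cols] = groupings[cols].replace(legal_tags)
print(groupings)
```

Passing a flat dict to DataFrame.replace() applies the mapping to every selected column, so the column names only need to appear in the list.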