Pandas isin() output to string and general code optimisation

Question

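I have just started using Python to analyse data at work, so I could really use some help here :)

I have a df of African countries with a bunch of indicators, and another df whose dimensions represent groupings: if a country belongs to a group, its name appears in that group's column.

Here is a snapshot: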

import numpy as np
import pandas as pd

# indicators df
df_indicators= pd.DataFrame({'Country': ['Algeria', 'Angola', 'Benin'], 
                   'Commitment to CAADP Process': [np.nan, 0.1429, 0.8571]})
# groupings df
df_groupings= pd.DataFrame({'Fragile': ['Somalia', 'Angola', 'Benin'], 
                   'SSA': ['Angola', 'Benin', 'South Africa']})
# what I would like to have without doing it manually
df_indicators['Fragile'] = ['Not Fragile', 'Fragile', 'Fragile']
# Angola and Benin both appear in the SSA column above; Algeria does not
df_indicators['SSA'] = ['Not SSA', 'SSA', 'SSA']

df_indicators

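I would like to add dimensions to this df that tell me whether a country is a fragile state (and likewise for the other groupings). That is why I have the second df, which lists the countries belonging to each category.

I use isin() to check for membership, but what I would really like is for the new dimension, "Fragile" for example, to hold "Fragile" where the result is TRUE and "Not Fragile" where it is FALSE, instead of TRUE and FALSE.

Needless to say, if you see any way to improve this code, I am eager to learn from the pros! Especially if you work in SDG statistics.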

import pandas as pd
import numpy as np
excel_file = 'Downloads/BR_Data.xlsx'
# The keyword is sheet_name in current pandas; the older sheetname spelling was removed
indicators = pd.read_excel(excel_file, sheet_name="Indicators", header=1)
groupings = pd.read_excel(excel_file, sheet_name="Groupings", header=0)

# Title countries in the Sub-Saharan Africa dimension

decap = groupings["Sub-Saharan Africa (World Bank List)"].str.title()
groupings["Sub-Saharan Africa (World Bank List)"] = decap

# Create list of legal country names

legal_tags = {"Côte d’Ivoire":"Ivory Coast", "Cote D'Ivoire":"Ivory Coast", "Democratic Republic of the Congo":"DR Congo", "Congo, Dem. Rep.":"DR Congo", 
              "Congo, Repub. of the":"DR Congo", "Congo, Rep.": "DR Congo", "Dr Congo": "DR Congo", "Central African Rep.":"Central African Republic", "Sao Tome & Principe":
              "Sao Tome and Principe", "Gambia, The":"Gambia"}

# I am sure there is a way to insert a list of the column names instead of copy pasting the name of every column label 5 times

groupings.replace({"Least Developing Countries in Africa (UN Classification, used by WB)" : legal_tags}, inplace = True)
groupings.replace({"Oil Exporters (McKinsey Global Institute)" : legal_tags}, inplace = True)
groupings.replace({"Sub-Saharan Africa (World Bank List)" : legal_tags}, inplace = True)
groupings.replace({"African Fragile and Conflict Affected Aread (OECD)" : legal_tags}, inplace = True)
groupings

# If the country in the indicators df is found in the groupings df, assign True to the new column [LDC] => CAN I REPLACE TRUE WITH "LDC" etc...? 

indicators["LDC"] = indicators["Country"].isin(groupings["Least Developing Countries in Africa (UN Classification, used by WB)"])
indicators["Fragile"] = indicators["Country"].isin(groupings["African Fragile and Conflict Affected Aread (OECD)"])
indicators["Oil"] = indicators["Country"].isin(groupings["Oil Exporters (McKinsey Global Institute)"])
indicators["SSA"] = indicators["Country"].isin(groupings["Sub-Saharan Africa (World Bank List)"])
indicators["Landlock"] = indicators["Country"].isin(groupings['Landlocked (UNCTAD List)'])

# When I concatenate the data frames of the groupings average I get an index with a bunch of true, false, true, false etc...
df = indicators.merge(groupings, left_on = "Country", right_on= "Country", how ="right")
labels = ['African Fragile and Conflict Affected Aread (OECD)', 'Sub-Saharan Africa (World Bank List)', 'Landlocked (UNCTAD List)', 'North Africa (excl. Middle east)', 'Oil Exporters (McKinsey Global Institute)', 'Least Developing Countries in Africa (UN Classification, used by WB)']
df.drop(labels, axis = 1, inplace = True)
df.loc['mean'] = df.mean()
df_regions = df[:].groupby('Regional Group').mean()
df_LDC = df[:].groupby('LDC').mean()
df_Oil = df[:].groupby('Oil').mean()
df_SSA = df[:].groupby('SSA').mean()
df_landlock = df[:].groupby('Landlock').mean()
df_fragile = df[:].groupby('Fragile').mean()

frames = [df_regions, df_Oil, df_SSA, df_landlock, df_fragile]
result = pd.concat(frames)

result

Answer 1

You can apply a function to the Series instead of using isin():

def get_value(x, y, choice):
    if x in y:
        return choice[0]
    else:
        return choice[1]

indicators["Fragile"] = indicators["Country"].apply(get_value, y=groupings["..."].tolist(), choice=["Fragile", "Not Fragile"])

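I'm not 100% sure you need tolist(), but this code applies the function to every row of the column and returns the first choice if the country is found and the second choice if it is not.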
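On the tolist() question: `x in some_series` tests the Series index rather than its values, so converting to a list does matter for the `in` check. As an alternative to apply(), the boolean mask from isin() can be turned into labels directly with numpy.where. A minimal sketch using the sample frames from the question; the helper name `label_membership` is made up for illustration:

```python
import numpy as np
import pandas as pd

df_indicators = pd.DataFrame({
    'Country': ['Algeria', 'Angola', 'Benin'],
    'Commitment to CAADP Process': [np.nan, 0.1429, 0.8571],
})
df_groupings = pd.DataFrame({
    'Fragile': ['Somalia', 'Angola', 'Benin'],
    'SSA': ['Angola', 'Benin', 'South Africa'],
})

# `in` on a Series looks at the index labels (0, 1, 2), not at the values
in_index = 'Angola' in df_groupings['Fragile']
in_values = 'Angola' in df_groupings['Fragile'].tolist()


def label_membership(countries, members, yes, no):
    """Hypothetical helper: `yes` where the country is in `members`, else `no`."""
    return np.where(countries.isin(members), yes, no)


df_indicators['Fragile'] = label_membership(
    df_indicators['Country'], df_groupings['Fragile'], 'Fragile', 'Not Fragile')
df_indicators['SSA'] = label_membership(
    df_indicators['Country'], df_groupings['SSA'], 'SSA', 'Not SSA')
```

Because isin() and numpy.where both work on the whole column at once, this avoids calling a Python function once per row.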

Hope this helps,
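On the general optimisation part of the question, the repeated replace() and isin() lines could probably be collapsed into one multi-column replace and one loop, and pd.concat accepts `keys` to record which grouping each block of means came from. A sketch under stated assumptions: the two small frames stand in for the Excel sheets, and the short column names, the `group_columns` mapping, and the `Score` values are invented for illustration, not taken from the real workbook:

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the "Indicators" and "Groupings" sheets
indicators = pd.DataFrame({
    'Country': ['Ivory Coast', 'Gambia', 'Angola', 'Algeria'],
    'Score': [0.5, 0.7, 0.1, 0.9],
})
groupings = pd.DataFrame({
    'LDC list': ["Cote D'Ivoire", 'Gambia, The', 'Angola'],
    'Oil list': ['Angola', 'Algeria', 'Nigeria'],
})
legal_tags = {"Cote D'Ivoire": 'Ivory Coast', 'Gambia, The': 'Gambia'}

# New label column -> source column in `groupings` (hypothetical names)
group_columns = {'LDC': 'LDC list', 'Oil': 'Oil list'}

# One replace over all grouping columns instead of one call per column
cols = list(group_columns.values())
groupings[cols] = groupings[cols].replace(legal_tags)

# Build every label column in a single loop
for new_col, source_col in group_columns.items():
    indicators[new_col] = np.where(
        indicators['Country'].isin(groupings[source_col]),
        new_col, 'Not ' + new_col)

# Group means with readable labels; `keys` adds an outer index level per grouping
result = pd.concat(
    [indicators.groupby(col)['Score'].mean() for col in group_columns],
    keys=list(group_columns),
)
```

With string labels in place of True/False, the concatenated index reads as pairs such as ('Oil', 'Not Oil') rather than a run of repeated booleans.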
