Pandas dataframe change values in a column based on conditions

Question

I have a large dataframe, shown below.

The data used as an example here, 'education_val.csv', can be found at https://github.com/ENLK/Py-Projects-/blob/master/education_val.csv

import pandas as pd 

edu = pd.read_csv('education_val.csv')
del edu['Unnamed: 0']
edu.head(10)

ID  Year    Education
22445   1991    higher education
29925   1991    No qualifications
76165   1991    No qualifications
223725  1991    Other
280165  1991    intermediate qualifications
333205  1991    No qualifications
387605  1991    higher education
541285  1991    No qualifications
541965  1991    No qualifications
599765  1991    No qualifications

The values in the Education column are:

edu.Education.value_counts()

intermediate qualifications 153705
higher education    67020
No qualifications   55842
Other   36915

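I want to replace the values in the Education column in the following way:

If an ID has the value higher education in the Education column in some year, then all future years for that ID should also have higher education in the Education column.
If an ID has the value intermediate qualifications in some year, then all future years for that ID should have intermediate qualifications in the Education column. However, if higher education occurs for this ID in any subsequent year, then higher education replaces intermediate qualifications in the following years, regardless of whether Other or No qualifications occur.

For example, in the DataFrame below, ID 22445 has the value higher education in 1991, so all subsequent values of Education for 22445 should be replaced with higher education, through 2017.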

edu.loc[edu['ID'] == 22445]

ID  Year    Education
22445   1991    higher education
22445   1992    higher education
22445   1993    higher education
22445   1994    higher education
22445   1995    higher education
22445   1996    intermediate qualifications
22445   1997    intermediate qualifications
22445   1998    Other
22445   1999    No qualifications
22445   2000    intermediate qualifications
22445   2001    intermediate qualifications
22445   2002    intermediate qualifications
22445   2003    intermediate qualifications
22445   2004    intermediate qualifications
22445   2005    intermediate qualifications
22445   2006    intermediate qualifications
22445   2007    intermediate qualifications
22445   2008    intermediate qualifications
22445   2010    intermediate qualifications
22445   2011    intermediate qualifications
22445   2012    intermediate qualifications
22445   2013    intermediate qualifications
22445   2014    intermediate qualifications
22445   2015    intermediate qualifications
22445   2016    intermediate qualifications
22445   2017    intermediate qualifications

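Similarly, ID 1587125 in the DataFrame below has the value intermediate qualifications in 1991, which changes to higher education in 1993. All subsequent values in the Education column for 1587125 in future years (from 1993 onward) should be higher education.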

edu.loc[edu['ID'] == 1587125]

ID  Year    Education
1587125 1991    intermediate qualifications
1587125 1992    intermediate qualifications
1587125 1993    higher education
1587125 1994    higher education
1587125 1995    higher education
1587125 1996    higher education
1587125 1997    higher education
1587125 1998    higher education
1587125 1999    higher education
1587125 2000    higher education
1587125 2001    higher education
1587125 2002    higher education
1587125 2003    higher education
1587125 2004    Other
1587125 2005    No qualifications
1587125 2006    intermediate qualifications
1587125 2007    intermediate qualifications
1587125 2008    intermediate qualifications
1587125 2010    intermediate qualifications
1587125 2011    higher education
1587125 2012    higher education
1587125 2013    higher education
1587125 2014    higher education
1587125 2015    higher education
1587125 2016    higher education
1587125 2017    higher education

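There are 12,057 unique IDs in the data, and the Year column spans 1991 to 2017. How can I change the values of Education for all 12,057 IDs according to the conditions above? I'm not sure how to do this in a uniform way for every unique ID. The sample data used here is available at the GitHub link above. Many thanks.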

Answer 1

You can do this using categorical data, like this:

df = pd.read_csv('https://raw.githubusercontent.com/ENLK/Py-Projects-/master/education_val.csv')

eddtype = pd.CategoricalDtype(['No qualifications', 
                               'Other',
                               'intermediate qualifications',
                               'higher education'], 
                               ordered=True)
df['EducationCat'] = df['Education'].str.strip().astype(eddtype)

df['EduMax'] = df.sort_values('Year').groupby('ID')['EducationCat']\
                 .transform(lambda x: eddtype.categories[x.cat.codes.cummax()] )

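It is broken down explicitly so you can see the data manipulations I am using.

Create an ordered categorical dtype for the education levels
Next, change the dtype of the Education column to that categorical dtype (EducationCat)
Use the categorical codes to perform the cummax calculation
Index back into the categories with the cummax result to return the category (EduMax)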

Output:

df[df['ID'] == 1587125]

            ID  Year                    Education                 EducationCat                       EduMax
18      1587125  1991  intermediate qualifications  intermediate qualifications  intermediate qualifications
12075   1587125  1992  intermediate qualifications  intermediate qualifications  intermediate qualifications
24132   1587125  1993             higher education             higher education             higher education
36189   1587125  1994             higher education             higher education             higher education
48246   1587125  1995             higher education             higher education             higher education
60303   1587125  1996             higher education             higher education             higher education
72360   1587125  1997             higher education             higher education             higher education
84417   1587125  1998             higher education             higher education             higher education
96474   1587125  1999             higher education             higher education             higher education
108531  1587125  2000             higher education             higher education             higher education
120588  1587125  2001             higher education             higher education             higher education
132645  1587125  2002             higher education             higher education             higher education
144702  1587125  2003             higher education             higher education             higher education
156759  1587125  2004                        Other                        Other             higher education
168816  1587125  2005            No qualifications            No qualifications             higher education
180873  1587125  2006  intermediate qualifications  intermediate qualifications             higher education
192930  1587125  2007  intermediate qualifications  intermediate qualifications             higher education
204987  1587125  2008  intermediate qualifications  intermediate qualifications             higher education
217044  1587125  2010  intermediate qualifications  intermediate qualifications             higher education
229101  1587125  2011             higher education             higher education             higher education
241158  1587125  2012             higher education             higher education             higher education
253215  1587125  2013             higher education             higher education             higher education
265272  1587125  2014             higher education             higher education             higher education
277329  1587125  2015             higher education             higher education             higher education
289386  1587125  2016             higher education             higher education             higher education
301443  1587125  2017             higher education             higher education             higher education
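
The same idea can be sketched on a small made-up frame, which makes the steps easier to check by eye. This is only a minimal sketch: the `toy` data, the `edu_dtype` and `running_max` names are illustrative and not part of the original answer, and it uses `groupby(...).cummax()` on the category codes instead of a lambda inside `transform`.

```python
import pandas as pd

# Hypothetical toy data, not taken from education_val.csv
toy = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2, 2],
    "Year": [1991, 1992, 1993, 1994, 1991, 1992, 1993],
    "Education": [
        "intermediate qualifications", "higher education", "Other", "No qualifications",
        "Other", "intermediate qualifications", "No qualifications",
    ],
}).sort_values(["ID", "Year"])

# Ordered categorical: position in the list is the rank of the level
levels = ["No qualifications", "Other",
          "intermediate qualifications", "higher education"]
edu_dtype = pd.CategoricalDtype(levels, ordered=True)

# Integer codes follow the category order, so a cumulative max per ID
# gives the highest level seen so far
codes = toy["Education"].str.strip().astype(edu_dtype).cat.codes
running_max = codes.groupby(toy["ID"]).cummax()

# Turn the codes back into labels
toy["EduMax"] = pd.Categorical.from_codes(running_max.to_numpy(), dtype=edu_dtype)
print(toy)
```

Because the frame is sorted by ID and Year before the codes are computed, the cumulative max runs in chronological order within each ID.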

Answer 2

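Education levels clearly have an order. Your problem can be restated as a "rolling max" problem: what is the highest level of education a person has attained as of a given year?

Try this: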

# A dictionary mapping each label to a rank
mappings = {e: i for i, e in enumerate(['No qualifications', 'Other', 'intermediate qualifications', 'higher education'])}

# Convert the label to its rank
edu['Education'] = edu['Education'].map(mappings)

# The gist of the solution: an expanding max level of education per person
tmp = edu.sort_values('Year').groupby('ID')['Education'].expanding().max()

# The first index level in tmp is the ID, the second level is the original index
# We only need the original index, hence the droplevel
# We also convert the rank back to the label (swapping keys and values in the mappings dictionary)
tmp = tmp.droplevel(0).map({v: k for k, v in mappings.items()})

edu['Education'] = tmp

Test:

edu[edu['ID'] == 1587125]

    ID  Year                    Education
1587125  1991  intermediate qualifications
1587125  1992  intermediate qualifications
1587125  1993             higher education
1587125  1994             higher education
1587125  1995             higher education
1587125  1996             higher education
1587125  1997             higher education
1587125  1998             higher education
1587125  1999             higher education
1587125  2000             higher education
1587125  2001             higher education
1587125  2002             higher education
1587125  2003             higher education
1587125  2004             higher education
1587125  2005             higher education
1587125  2006             higher education
1587125  2007             higher education
1587125  2008             higher education
1587125  2010             higher education
1587125  2011             higher education
1587125  2012             higher education
1587125  2013             higher education
1587125  2014             higher education
1587125  2015             higher education
1587125  2016             higher education
1587125  2017             higher education
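
One thing worth checking against the question: ranking all four labels means that Other followed by No qualifications is also carried forward as Other, which the question does not ask for. If only higher education and intermediate qualifications should be carried forward, a variant along the following lines might work. It is a sketch on made-up data; the `sticky` mapping and the `EducationFilled` column are assumptions for illustration, not part of the answer above.

```python
import pandas as pd

# Hypothetical toy data, not taken from education_val.csv
toy = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "Year": [1991, 1992, 1993, 1991, 1992, 1993],
    "Education": [
        "Other", "No qualifications", "intermediate qualifications",
        "intermediate qualifications", "higher education", "Other",
    ],
}).sort_values(["ID", "Year"])

# Only these two levels are carried forward; every other label gets rank 0
sticky = {"intermediate qualifications": 1, "higher education": 2}
rank = toy["Education"].map(sticky).fillna(0).astype(int)

# Highest sticky rank seen so far for each ID
best = rank.groupby(toy["ID"]).cummax()

# Rank 0 maps to NaN here, so those rows fall back to their original label
labels = best.map({v: k for k, v in sticky.items()})
toy["EducationFilled"] = labels.fillna(toy["Education"])
print(toy)
```

Here ID 1 keeps Other and No qualifications unchanged until intermediate qualifications first appears, while ID 2 stays at higher education once it has been reached.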

Answer 3

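You can iterate over the IDs, and then over the years. The DataFrame is in chronological order, so if a person has 'higher education' or 'intermediate qualifications' in one cell, you can save that knowledge and apply it to subsequent cells: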

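# Assumes edu has a unique index (the default after read_csv), so that
# edu.at[idx, ...] writes to exactly one row.
# Sorting by Year makes sure each person's rows are visited chronologically.
for _, rows in edu.sort_values('Year').groupby('ID'):
    # booleans to keep track of education statuses we've seen
    higher_ed = False
    inter_qual = False

    for idx, row in rows.iterrows():
        # check for intermediate qualifications
        if inter_qual:
            # write to edu itself; assigning to row would only change a copy
            edu.at[idx, 'Education'] = 'intermediate qualifications'
        elif row['Education'] == 'intermediate qualifications':
            inter_qual = True

        # check for higher education
        if higher_ed:
            edu.at[idx, 'Education'] = 'higher education'
        elif row['Education'] == 'higher education':
            higher_ed = True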

It doesn't matter that we may overwrite a status more than once: if a person has both 'intermediate qualifications' and 'higher education', we just need to make sure 'higher education' is set last.

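I don't usually recommend using for loops to process a DataFrame, but each cell value may depend on the values above it, and the DataFrame isn't so large as to make this infeasible.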
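
The bookkeeping in the loop can also be pulled out into a plain function that handles one person's labels at a time, which is easier to test in isolation. The following is a rough sketch under that assumption; `carry_forward`, the `toy` frame and the `Filled` column are hypothetical names, not part of the answer above.

```python
import pandas as pd

def carry_forward(labels):
    """Carry the best of the two sticky levels forward through one person's labels."""
    best = None  # highest sticky level seen so far, if any
    filled = []
    for label in labels:
        if label == "higher education":
            best = label
        elif label == "intermediate qualifications" and best is None:
            best = label
        # before any sticky level appears, keep the original label
        filled.append(label if best is None else best)
    return filled

# Hypothetical toy data, not taken from education_val.csv
toy = pd.DataFrame({
    "ID": [7, 7, 7, 7, 7, 8, 8],
    "Year": [1991, 1992, 1993, 1994, 1995, 1991, 1992],
    "Education": [
        "Other", "intermediate qualifications", "No qualifications",
        "higher education", "Other",
        "higher education", "No qualifications",
    ],
}).sort_values(["ID", "Year"])

# Apply the function to each ID separately, keeping the original row index
toy["Filled"] = toy.groupby("ID")["Education"].transform(
    lambda s: pd.Series(carry_forward(s.tolist()), index=s.index)
)
print(toy)
```

The state in `best` starts fresh for every ID because the function is called once per group.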
