Pandas dataframe change values in a column based on conditions
I have a large dataframe, shown below.
The data used as an example here, 'education_val.csv', can be found at https://github.com/ENLK/Py-Projects-/blob/master/education_val.csv
import pandas as pd
edu = pd.read_csv('education_val.csv')
del edu['Unnamed: 0']
edu.head(10)
ID Year Education
22445 1991 higher education
29925 1991 No qualifications
76165 1991 No qualifications
223725 1991 Other
280165 1991 intermediate qualifications
333205 1991 No qualifications
387605 1991 higher education
541285 1991 No qualifications
541965 1991 No qualifications
599765 1991 No qualifications
The values in the Education column are:
edu.Education.value_counts()
intermediate qualifications 153705
higher education 67020
No qualifications 55842
Other 36915
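I want to replace the values in the Education column as follows:
If an ID has the value higher education in the Education column in some year, then all future years for that ID should also have higher education in the Education column.
If an ID has the value intermediate qualifications in some year, then all future years for that ID should have intermediate qualifications in the Education column. However, if higher education appears for that ID in any later year, then higher education replaces intermediate qualifications from that year onward, regardless of whether Other or No qualifications occur.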
For example, in the DataFrame below, ID 22445 has the value higher education in 1991, so all subsequent Education values for 22445 should be replaced with higher education, all the way through 2017.
edu.loc[edu['ID'] == 22445]
ID Year Education
22445 1991 higher education
22445 1992 higher education
22445 1993 higher education
22445 1994 higher education
22445 1995 higher education
22445 1996 intermediate qualifications
22445 1997 intermediate qualifications
22445 1998 Other
22445 1999 No qualifications
22445 2000 intermediate qualifications
22445 2001 intermediate qualifications
22445 2002 intermediate qualifications
22445 2003 intermediate qualifications
22445 2004 intermediate qualifications
22445 2005 intermediate qualifications
22445 2006 intermediate qualifications
22445 2007 intermediate qualifications
22445 2008 intermediate qualifications
22445 2010 intermediate qualifications
22445 2011 intermediate qualifications
22445 2012 intermediate qualifications
22445 2013 intermediate qualifications
22445 2014 intermediate qualifications
22445 2015 intermediate qualifications
22445 2016 intermediate qualifications
22445 2017 intermediate qualifications
Similarly, ID 1587125 in the DataFrame below has the value intermediate qualifications in 1991, which changes to higher education in 1993. All subsequent values in the Education column for 1587125 (from 1993 onward) should be higher education.
edu.loc[edu['ID'] == 1587125]
ID Year Education
1587125 1991 intermediate qualifications
1587125 1992 intermediate qualifications
1587125 1993 higher education
1587125 1994 higher education
1587125 1995 higher education
1587125 1996 higher education
1587125 1997 higher education
1587125 1998 higher education
1587125 1999 higher education
1587125 2000 higher education
1587125 2001 higher education
1587125 2002 higher education
1587125 2003 higher education
1587125 2004 Other
1587125 2005 No qualifications
1587125 2006 intermediate qualifications
1587125 2007 intermediate qualifications
1587125 2008 intermediate qualifications
1587125 2010 intermediate qualifications
1587125 2011 higher education
1587125 2012 higher education
1587125 2013 higher education
1587125 2014 higher education
1587125 2015 higher education
1587125 2016 higher education
1587125 2017 higher education
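There are 12,057 unique IDs in the data, and the Year column runs from 1991 to 2017. How can I change the Education values for all 12,057 IDs according to the conditions above? I'm not sure how to do this in a uniform way for all unique IDs. The GitHub link above points to the sample data used as the example here. Many thanks.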
You can do this using categorical data, like this:
df = pd.read_csv('https://raw.githubusercontent.com/ENLK/Py-Projects-/master/education_val.csv')
eddtype = pd.CategoricalDtype(['No qualifications',
                               'Other',
                               'intermediate qualifications',
                               'higher education'],
                              ordered=True)
df['EducationCat'] = df['Education'].str.strip().astype(eddtype)
df['EduMax'] = df.sort_values('Year').groupby('ID')['EducationCat']\
                 .transform(lambda x: eddtype.categories[x.cat.codes.cummax()])
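It is broken down explicitly so you can see the data manipulations I'm using.
- Create an education categorical dtype with order
- Next, change the dtype of the Education column to use that categorical dtype (EducationCat)
- Use the categorical codes to do the cummax calculation
- Index into the categories to return the category defined by the cummax calculation (EduMax)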
Output:
df[df['ID'] == 1587125]
ID Year Education EducationCat EduMax
18 1587125 1991 intermediate qualifications intermediate qualifications intermediate qualifications
12075 1587125 1992 intermediate qualifications intermediate qualifications intermediate qualifications
24132 1587125 1993 higher education higher education higher education
36189 1587125 1994 higher education higher education higher education
48246 1587125 1995 higher education higher education higher education
60303 1587125 1996 higher education higher education higher education
72360 1587125 1997 higher education higher education higher education
84417 1587125 1998 higher education higher education higher education
96474 1587125 1999 higher education higher education higher education
108531 1587125 2000 higher education higher education higher education
120588 1587125 2001 higher education higher education higher education
132645 1587125 2002 higher education higher education higher education
144702 1587125 2003 higher education higher education higher education
156759 1587125 2004 Other Other higher education
168816 1587125 2005 No qualifications No qualifications higher education
180873 1587125 2006 intermediate qualifications intermediate qualifications higher education
192930 1587125 2007 intermediate qualifications intermediate qualifications higher education
204987 1587125 2008 intermediate qualifications intermediate qualifications higher education
217044 1587125 2010 intermediate qualifications intermediate qualifications higher education
229101 1587125 2011 higher education higher education higher education
241158 1587125 2012 higher education higher education higher education
253215 1587125 2013 higher education higher education higher education
265272 1587125 2014 higher education higher education higher education
277329 1587125 2015 higher education higher education higher education
289386 1587125 2016 higher education higher education higher education
301443 1587125 2017 higher education higher education higher education
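The steps in this answer can be tried on a small frame before running them on the full file. The sketch below is only an illustration: the `toy` data, the `code` and `max_code` column names, and the use of `pd.Categorical.from_codes` to turn codes back into labels are assumptions for the example, not part of the original answer.

```python
import pandas as pd

# Hypothetical toy data, not taken from education_val.csv
toy = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2, 2],
    'Year': [1991, 1992, 1993, 1991, 1992, 1993],
    'Education': ['intermediate qualifications', 'Other', 'higher education',
                  'higher education', 'No qualifications', 'Other'],
})

# Ordered categorical dtype: position in the list defines the rank
levels = ['No qualifications', 'Other',
          'intermediate qualifications', 'higher education']
eddtype = pd.CategoricalDtype(levels, ordered=True)

# Integer code of each label under the ordered dtype (0 = lowest level)
toy['code'] = toy['Education'].astype(eddtype).cat.codes

# Running maximum of the codes within each ID, in chronological order
toy = toy.sort_values(['ID', 'Year'])
toy['max_code'] = toy.groupby('ID')['code'].cummax()

# Convert the running-maximum codes back into ordered categorical labels
toy['EduMax'] = pd.Categorical.from_codes(toy['max_code'].to_numpy(), dtype=eddtype)

print(toy[['ID', 'Year', 'Education', 'EduMax']])
```

Labels that do not match one of the categories exactly (for example because of stray whitespace) get the code -1 and come back as missing values, which is presumably why the answer calls `.str.strip()` before converting.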
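Education levels clearly have an order. Your problem can be restated as a "running maximum" problem: what is the highest education level a person has reached as of a given year?
Try this: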
# A dictionary mapping each label to a rank
mappings = {e: i for i, e in enumerate(['No qualifications', 'Other', 'intermediate qualifications', 'higher education'])}
# Convert the label to its rank
edu['Education'] = edu['Education'].map(mappings)
# The gist of the solution: an expanding max level of education per person
tmp = edu.sort_values('Year').groupby('ID')['Education'].expanding().max()
# The first index level in tmp is the ID, the second level is the original index
# We only need the original index, hence the droplevel
# We also convert the rank back to the label (swapping keys and values in the mappings dictionary)
tmp = tmp.droplevel(0).map({v: k for k, v in mappings.items()})
edu['Education'] = tmp
Test:
edu[edu['ID'] == 1587125]
ID Year Education
1587125 1991 intermediate qualifications
1587125 1992 intermediate qualifications
1587125 1993 higher education
1587125 1994 higher education
1587125 1995 higher education
1587125 1996 higher education
1587125 1997 higher education
1587125 1998 higher education
1587125 1999 higher education
1587125 2000 higher education
1587125 2001 higher education
1587125 2002 higher education
1587125 2003 higher education
1587125 2004 higher education
1587125 2005 higher education
1587125 2006 higher education
1587125 2007 higher education
1587125 2008 higher education
1587125 2010 higher education
1587125 2011 higher education
1587125 2012 higher education
1587125 2013 higher education
1587125 2014 higher education
1587125 2015 higher education
1587125 2016 higher education
1587125 2017 higher education
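To see why the `droplevel(0)` step is needed, it can help to look at the index that `groupby(...).expanding().max()` produces. The following is a minimal sketch on made-up data (the `toy` frame and the `EduMax` and `alt` names are assumptions for illustration); the ranks are already integers here, as they would be after the `map(mappings)` step.

```python
import pandas as pd

# Hypothetical toy data; Education already holds integer ranks
toy = pd.DataFrame({
    'ID': [1, 1, 2, 2],
    'Year': [1991, 1992, 1991, 1992],
    'Education': [2, 1, 0, 3],
})

# Expanding maximum per ID; the result carries a two-level index:
# level 0 is the ID, level 1 is the original row index
tmp = toy.sort_values('Year').groupby('ID')['Education'].expanding().max()
print(tmp.index.nlevels)

# Dropping the ID level leaves the original row index, so the
# assignment below aligns each value with the row it came from
toy['EduMax'] = tmp.droplevel(0)

# groupby(...).cummax() gives the same running maximum while keeping
# the original index, so no droplevel is needed with it
alt = toy.sort_values('Year').groupby('ID')['Education'].cummax()

print(toy)
```

Note that `expanding().max()` returns floats, which is why the answer can still map the ranks back to labels: `2.0` and `2` look up the same dictionary key.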
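You can loop over the IDs, and then over the years. The DataFrame is in chronological order, so if a person has 'higher education' or 'intermediate qualifications' in one cell, you can remember that and apply it to the subsequent cells: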
# make sure the rows are in chronological order within each ID
edu = edu.sort_values(['ID', 'Year'])
for _, levels in edu.groupby('ID')['Education']:
    # booleans to keep track of education statuses we've seen
    higher_ed = False
    inter_qual = False
    for idx, value in levels.items():
        # record what this year's original value tells us
        if value == 'higher education':
            higher_ed = True
        elif value == 'intermediate qualifications':
            inter_qual = True
        # write back the highest status seen so far; higher education wins
        if higher_ed:
            edu.at[idx, 'Education'] = 'higher education'
        elif inter_qual:
            edu.at[idx, 'Education'] = 'intermediate qualifications'
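It doesn't matter if a cell's status gets overwritten more than once. If a person has both 'intermediate qualifications' and 'higher education', we only need to make sure that 'higher education' is the one that ends up set.
I generally don't recommend using for loops to process a DataFrame, but here each cell's value may depend on the values above it, and the DataFrame isn't so large that this becomes impractical.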
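One difference from the running-maximum answers is worth spelling out: the loop only carries forward 'intermediate qualifications' and 'higher education', so an 'Other' followed by 'No qualifications' is left as it is. If that behaviour is what you want without a Python loop, a vectorized version could look like the sketch below. The function name `carry_forward_top_levels`, the `LEVELS` and `RANK` helpers, and the toy data are all assumptions for illustration, and the sketch assumes every label in Education matches one of the four levels exactly.

```python
import pandas as pd

LEVELS = ['No qualifications', 'Other',
          'intermediate qualifications', 'higher education']
RANK = {label: i for i, label in enumerate(LEVELS)}


def carry_forward_top_levels(frame):
    """Carry forward only the two highest levels within each ID."""
    out = frame.sort_values(['ID', 'Year']).copy()
    rank = out['Education'].map(RANK)
    # Highest rank seen so far for each ID, in chronological order
    best = rank.groupby(out['ID']).cummax()
    # Overwrite only once the running maximum has reached
    # 'intermediate qualifications' or above; otherwise keep the original
    threshold = RANK['intermediate qualifications']
    carried = best.where(best >= threshold, rank)
    out['Education'] = carried.map(dict(enumerate(LEVELS)))
    return out


# Hypothetical toy data, not taken from education_val.csv
toy = pd.DataFrame({
    'ID': [1] * 6,
    'Year': [1991, 1992, 1993, 1994, 1995, 1996],
    'Education': ['Other', 'No qualifications', 'intermediate qualifications',
                  'Other', 'higher education', 'No qualifications'],
})
result = carry_forward_top_levels(toy)
print(result)
```

In this toy example the first two rows stay 'Other' and 'No qualifications', while a plain running maximum would turn the second row into 'Other'.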