How can I extract the edition type, month, and year from this book price dataset in a simple way?
import pandas as pd

df = pd.DataFrame({'Edition_TypeDate': [
    '2016', '5 Oct 2017', '2017', '2 Aug 2009', 'Illustrated, Import',
    'Import, 22 Feb 2018', 'Import, 14 Dec 2017', 'Import, 1 Mar 2018',
    'Abridged, Audiobook, Box set', 'International Edition, 26 Apr 2012',
    'Import, 2018', 'Box set, 15 Jun 2014', 'Unabridged, 6 Jul 2007']})
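This is one of the columns in my book dataset. From this column I want three new columns:
1. Edition_Type --> values such as Import or Illustrated, or null if not mentioned
2. Edition_Month --> values such as Aug or Oct, or null if not mentioned
3. Edition _Year --> values such as 2016, 2017, or 2018, or null if not mentioned
How can I do this? Please help me define a function that I can apply to this column.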
You can use Series.str.extract with a regular expression in which | means "or" between the keywords. For the year, (\d{4}$) matches 4 digits at the end of the string:
df['Edition_Type'] = df['Edition_TypeDate'].str.extract(r'(Import|Illustrated)')
df['Edition_Month'] = df['Edition_TypeDate'].str.extract(r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)')
df['Edition _Year'] = df['Edition_TypeDate'].str.extract(r'(\d{4}$)')
print(df)
                      Edition_TypeDate Edition_Type Edition_Month Edition _Year
0                                 2016          NaN           NaN          2016
1                           5 Oct 2017          NaN           Oct          2017
2                                 2017          NaN           NaN          2017
3                           2 Aug 2009          NaN           Aug          2009
4                  Illustrated, Import  Illustrated           NaN           NaN
5                  Import, 22 Feb 2018       Import           Feb          2018
6                  Import, 14 Dec 2017       Import           Dec          2017
7                   Import, 1 Mar 2018       Import           Mar          2018
8         Abridged, Audiobook, Box set          NaN           NaN           NaN
9   International Edition, 26 Apr 2012          NaN           Apr          2012
10                        Import, 2018       Import           NaN          2018
11                Box set, 15 Jun 2014          NaN           Jun          2014
12              Unabridged, 6 Jul 2007          NaN           Jul          2007
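Since the question asks for a function that can be applied to the column, here is a minimal sketch that wraps the same three Series.str.extract calls. The function name split_edition and its col parameter are hypothetical, chosen only for illustration, and expand=False is an addition so that each call returns a Series rather than a one-column DataFrame. The patterns and column names are copied from the answer unchanged.

```python
import pandas as pd

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"


def split_edition(df, col="Edition_TypeDate"):
    """Return a copy of df with edition type, month and year columns added.

    Hypothetical helper: the name and signature are illustrative only.
    """
    out = df.copy()
    s = out[col]
    # str.extract keeps only the first match, so a row listing both
    # 'Illustrated' and 'Import' yields whichever appears first.
    out["Edition_Type"] = s.str.extract(r"(Import|Illustrated)", expand=False)
    out["Edition_Month"] = s.str.extract(rf"({MONTHS})", expand=False)
    # '$' anchors the 4 digits to the end of the string.
    # Column name (including the space) is kept as written in the question.
    out["Edition _Year"] = s.str.extract(r"(\d{4}$)", expand=False)
    return out


df = pd.DataFrame({"Edition_TypeDate": [
    "2016", "5 Oct 2017", "2017", "2 Aug 2009", "Illustrated, Import",
    "Import, 22 Feb 2018", "Import, 14 Dec 2017", "Import, 1 Mar 2018",
    "Abridged, Audiobook, Box set", "International Edition, 26 Apr 2012",
    "Import, 2018", "Box set, 15 Jun 2014", "Unabridged, 6 Jul 2007"]})

result = split_edition(df)
print(result)
```

The extracted year is a string, not a number. If a numeric column is needed, one option is to convert it afterwards with pd.to_numeric, which leaves the missing values as NaN.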