How to get a substring of one column into another column of a pandas DataFrame using regex
My DataFrame looks like this:
CONTRACT Expiry Strike_price Option_type
0 AXISBANK26May2022 CE 660
1 AXISBANK26May2022 CE 690
2 AXISBANK26May2022 PE 670
3 BANKNIFTY19May2022 PE 30200
4 BANKNIFTY19May2022 PE 31200
5 BANKNIFTY26May2022 PE 34300
My desired output:
CONTRACT Expiry Strike_price Option_type
0 AXISBANK26May2022 CE 660 26May2022 660 CE
1 AXISBANK26May2022 CE 690 26May2022 690 CE
2 AXISBANK26May2022 PE 670 26May2022 670 PE
3 BANKNIFTY19May2022 PE 30200 19May2022 30200 PE
4 BANKNIFTY19May2022 PE 31200 19May2022 31200 PE
5 BANKNIFTY26May2022 PE 34300 26May2022 34300 PE
I tried this:
df['Expiry'] = df['CONTRACT'].str.extract(r'(\d{2}\D{3}\d{4})')
df['Strike_price']= df['CONTRACT'].str.extract(r'(\d{5})')
df['Option_type']= df['CONTRACT'].str.extract(r'(\D\D)')
Please help me get the correct columns without the spaces.
Thanks.
You can use
df[['Expiry','Option_type','Strike_price']] = df['CONTRACT'].str.extract(r'(\d{2}[^\W\d]{3}\d{4})\s+([A-Z]+)\s+(\d+)$', expand=True)
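See the regex demo. Details:
(\d{2}[^\W\d]{3}\d{4})
- Group 1: two digits, three letters/underscores, four digits
\s+
- one or more whitespace characters
([A-Z]+)
- Group 2: one or more uppercase ASCII letters
\s+
- one or more whitespace characters
(\d+)
- Group 3: one or more digits
$
- end of string.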
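The breakdown above can be checked against a single string with Python's built-in re module. This is only a minimal sketch: the sample string comes from the question's data, and the variable names are illustrative.

```python
import re

# Same pattern as in the answer: expiry date, option type, strike price.
pattern = re.compile(r'(\d{2}[^\W\d]{3}\d{4})\s+([A-Z]+)\s+(\d+)$')

sample = 'AXISBANK26May2022 CE 660'
match = pattern.search(sample)

# search() scans for the first position where the whole pattern matches,
# so the leading symbol name (AXISBANK) is skipped rather than captured.
expiry, option_type, strike_price = match.groups()
print(expiry, option_type, strike_price)  # 26May2022 CE 660
```

str.extract applies the same kind of search to every row, which is why the symbol name does not need its own group.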
Alternatively, you can use named capturing groups to extract the data into a separate DataFrame whose columns are already named and in the expected order, and then merge the two:
import pandas as pd
df = pd.DataFrame({'CONTRACT':['AXISBANK26May2022 CE 660','BANKNIFTY19May2022 PE 30200']})
df_extr = df['CONTRACT'].str.extract(r'(?P<Expiry>\d{2}[^\W\d]{3}\d{4})\s+(?P<Option_type>[A-Z]+)\s+(?P<Strike_price>\d+)$', expand=True)
df = df.merge(df_extr, left_index=True, right_index=True)
Output:
>>> df
CONTRACT Expiry Option_type Strike_price
0 AXISBANK26May2022 CE 660 26May2022 CE 660
1 BANKNIFTY19May2022 PE 30200 19May2022 PE 30200
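One thing worth keeping in mind with either variant: str.extract returns NaN in every group for a row that does not match the pattern, and the captured strike price is a string, not a number. Below is a minimal sketch of how that could be checked and converted, assuming pandas is installed. The malformed row and the variable names are made up for illustration and are not part of the question's data.

```python
import pandas as pd

df = pd.DataFrame({'CONTRACT': ['AXISBANK26May2022 CE 660',
                                'BANKNIFTY19May2022 PE 30200',
                                'MALFORMED ROW']})  # hypothetical bad row

pattern = (r'(?P<Expiry>\d{2}[^\W\d]{3}\d{4})\s+'
           r'(?P<Option_type>[A-Z]+)\s+'
           r'(?P<Strike_price>\d+)$')
extracted = df['CONTRACT'].str.extract(pattern, expand=True)

# Rows where the pattern did not match come back as NaN in every column.
unmatched = extracted['Expiry'].isna()

# The captured digits are strings; convert them to numbers.
# Unmatched rows stay NaN after the conversion.
extracted['Strike_price'] = pd.to_numeric(extracted['Strike_price'])

# join() aligns on the index, like the index-based merge above.
result = df.join(extracted)
```

Inspecting df[unmatched] before joining is one way to spot contract strings that the pattern does not cover.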
One option is str.extract:
pattern = r"[A-Z]+(\d+\D+\d+)\s+([A-Z]+)\s+(\d+)"
extracts = df.CONTRACT.str.extract(pattern)
extracts = extracts.set_axis(['Expiry', 'Option_type', 'Strike_price'], axis = 1)
df.assign(**extracts)
CONTRACT Expiry Option_type Strike_price
0 AXISBANK26May2022 CE 660 26May2022 CE 660
1 AXISBANK26May2022 CE 690 26May2022 CE 690
2 AXISBANK26May2022 PE 670 26May2022 PE 670
3 BANKNIFTY19May2022 PE 30200 19May2022 PE 30200
4 BANKNIFTY19May2022 PE 31200 19May2022 PE 31200
5 BANKNIFTY26May2022 PE 34300 26May2022 PE 34300
Another approach is to use str.split with a regex, but it is longer and, I suspect, more error-prone:
extracts = (df
.CONTRACT
.str.split(r"(\d+\D+\d+)|\s+", expand = True)
.dropna(how = 'all', axis = 1)
.loc[:, lambda df: df.ne('').any()]
.iloc[:, 1:])
extracts = extracts.set_axis(['Expiry', 'Option_type', 'Strike_price'], axis = 1)
df.assign(**extracts)
CONTRACT Expiry Option_type Strike_price
0 AXISBANK26May2022 CE 660 26May2022 CE 660
1 AXISBANK26May2022 CE 690 26May2022 CE 690
2 AXISBANK26May2022 PE 670 26May2022 PE 670
3 BANKNIFTY19May2022 PE 30200 19May2022 PE 30200
4 BANKNIFTY19May2022 PE 31200 19May2022 PE 31200
5 BANKNIFTY26May2022 PE 34300 26May2022 PE 34300