How to extract substrings of one column into other columns of a pandas DataFrame using regex

My DataFrame looks like this:

CONTRACT                                Expiry   Strike_price  Option_type

0     AXISBANK26May2022 CE 660
1     AXISBANK26May2022 CE 690
2     AXISBANK26May2022 PE 670
3     BANKNIFTY19May2022 PE 30200
4     BANKNIFTY19May2022 PE 31200
5     BANKNIFTY26May2022 PE 34300

My desired output:

CONTRACT                                Expiry   Strike_price  Option_type

0     AXISBANK26May2022 CE 660         26May2022   660            CE
1     AXISBANK26May2022 CE 690         26May2022   690            CE
2     AXISBANK26May2022 PE 670         26May2022   670            PE
3     BANKNIFTY19May2022 PE 30200      19May2022   30200          PE
4     BANKNIFTY19May2022 PE 31200      19May2022   31200          PE
5     BANKNIFTY26May2022 PE 34300      26May2022   34300          PE

I tried this:

df['Expiry']= df['CONTRACT'].str.extract(r'(\d{2}\D{3}\d{4})')
df['Strike_price']= df['CONTRACT'].str.extract(r'(\d{5})')
df['Option_type']= df['CONTRACT'].str.extract(r'(\D\D)')

Please help me get the correct columns without spaces. Thanks.

You can use

df[['Expiry','Option_type','Strike_price']] = df['CONTRACT'].str.extract(r'(\d{2}[^\W\d]{3}\d{4})\s+([A-Z]+)\s+(\d+)$', expand=True)

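See the regex demo. Details:

  • (\d{2}[^\W\d]{3}\d{4}) - Group 1: two digits, three letters or underscores, four digits
  • \s+ - one or more whitespace characters
  • ([A-Z]+) - Group 2: one or more uppercase ASCII letters
  • \s+ - one or more whitespace characters
  • (\d+) - Group 3: one or more digits
  • $ - end of string.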
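
To see what each group captures, the pattern can also be tried outside pandas with the standard re module. This is only a minimal sketch: the helper name split_contract is hypothetical (not from the answer), and the sample strings are copied from the question.

```python
import re

# Same pattern as in the answer above: expiry, option type, strike price.
PATTERN = re.compile(r'(\d{2}[^\W\d]{3}\d{4})\s+([A-Z]+)\s+(\d+)$')

# Sample values copied from the question.
samples = [
    'AXISBANK26May2022 CE 660',
    'BANKNIFTY19May2022 PE 30200',
]


def split_contract(text):
    """Hypothetical helper: return the three captured groups, or None if no match."""
    match = PATTERN.search(text)
    return match.groups() if match else None


results = [split_contract(s) for s in samples]
for sample, parts in zip(samples, results):
    print(sample, '->', parts)
```

Each tuple is ordered as (Expiry, Option_type, Strike_price), which is why the column list passed to the assignment has to follow the same order.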

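Alternatively, you can use named capturing groups to extract the data into a separate DataFrame, where the column names are already defined and in the expected order, and then merge the two: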

import pandas as pd
df = pd.DataFrame({'CONTRACT':['AXISBANK26May2022 CE 660','BANKNIFTY19May2022 PE 30200']})
df_extr = df['CONTRACT'].str.extract(r'(?P<Expiry>\d{2}[^\W\d]{3}\d{4})\s+(?P<Option_type>[A-Z]+)\s+(?P<Strike_price>\d+)$', expand=True)
df = df.merge(df_extr, left_index=True, right_index=True)

Output:

>>> df
                      CONTRACT     Expiry Option_type Strike_price
0     AXISBANK26May2022 CE 660  26May2022          CE          660
1  BANKNIFTY19May2022 PE 30200  19May2022          PE        30200
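
Note that str.extract returns strings. If numeric strike prices or real dates are needed afterwards, something along these lines might work. It is a sketch only: the variable name out and the '%d%b%Y' date format are assumptions based on values such as 26May2022, and astype(int) assumes every row matched the pattern.

```python
import pandas as pd

df = pd.DataFrame({'CONTRACT': ['AXISBANK26May2022 CE 660',
                                'BANKNIFTY19May2022 PE 30200']})

pattern = (r'(?P<Expiry>\d{2}[^\W\d]{3}\d{4})\s+'
           r'(?P<Option_type>[A-Z]+)\s+'
           r'(?P<Strike_price>\d+)$')

# Join the extracted columns back onto the original frame by index.
out = df.join(df['CONTRACT'].str.extract(pattern))

# Extracted values are strings; convert explicitly.
# astype(int) would fail on NaN, so this assumes every row matched.
out['Strike_price'] = out['Strike_price'].astype(int)

# Assumed format: day, abbreviated English month name, four-digit year.
out['Expiry'] = pd.to_datetime(out['Expiry'], format='%d%b%Y')

print(out.dtypes)
```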

One option is str.extract:

pattern = r"[A-Z]+(\d+\D+\d+)\s+([A-Z]+)\s+(\d+)"
extracts = df.CONTRACT.str.extract(pattern)
extracts = extracts.set_axis(['Expiry', 'Option_type', 'Strike_price'], axis = 1)

df.assign(**extracts)

                      CONTRACT     Expiry Option_type Strike_price
0     AXISBANK26May2022 CE 660  26May2022          CE          660
1     AXISBANK26May2022 CE 690  26May2022          CE          690
2     AXISBANK26May2022 PE 670  26May2022          PE          670
3  BANKNIFTY19May2022 PE 30200  19May2022          PE        30200
4  BANKNIFTY19May2022 PE 31200  19May2022          PE        31200
5  BANKNIFTY26May2022 PE 34300  26May2022          PE        34300

Another option is str.split with a regex, but it is a longer approach and, I suspect, more error-prone:

extracts = (df
            .CONTRACT
            .str.split(r"(\d+\D+\d+)|\s+", expand = True)
            .dropna(how = 'all', axis = 1)
            .loc[:, lambda df: df.ne('').any()]
            .iloc[:, 1:])

extracts = extracts.set_axis(['Expiry', 'Option_type', 'Strike_price'], axis = 1)

df.assign(**extracts)

                      CONTRACT     Expiry Option_type Strike_price
0     AXISBANK26May2022 CE 660  26May2022          CE          660
1     AXISBANK26May2022 CE 690  26May2022          CE          690
2     AXISBANK26May2022 PE 670  26May2022          PE          670
3  BANKNIFTY19May2022 PE 30200  19May2022          PE        30200
4  BANKNIFTY19May2022 PE 31200  19May2022          PE        31200
5  BANKNIFTY26May2022 PE 34300  26May2022          PE        34300
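
To illustrate why the split approach needs the extra clean-up steps, here is what the same pattern does with Python's re.split on a single value. This sketch assumes that the pandas regex split follows the same semantics as re.split; the names parts and cleaned are only for illustration.

```python
import re

# Same split pattern as above: either the captured date-like token, or whitespace.
parts = re.split(r"(\d+\D+\d+)|\s+", "AXISBANK26May2022 CE 660")
print(parts)
# ['AXISBANK', '26May2022', '', None, 'CE', None, '660']

# The capture group contributes None whenever the whitespace branch matched,
# and an empty string appears between the date and the following space.
# Dropping falsy entries and the leading symbol leaves the three fields.
cleaned = [p for p in parts if p][1:]
print(cleaned)
# ['26May2022', 'CE', '660']
```

The None and empty-string entries correspond to the columns that dropna and the ne('') filter remove in the pandas version, and the leading symbol is what iloc[:, 1:] discards.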