Pandas MultiIndex from regex on column

I have a pandas DataFrame that looks like this:

import pandas as pd

df = pd.DataFrame(
    [
        ['JoeSmith', 5],
        ['CathySmith', 3],
        ['BrianSmith', 12],
        ['MarySmith', 67],
        ['JoeJones', 23],
        ['CathyJones', 98],
        ['BrianJones', 438],
        ['MaryJones', 75],
        ['JoeCollins', 56],
        ['CathyCollins', 125],
        ['BrianCollins', 900],
        ['MaryCollins', 321],
    ], columns = ['Name', 'Value']
)

print(df)

            Name  Value
0       JoeSmith      5
1     CathySmith      3
2     BrianSmith     12
3      MarySmith     67
4       JoeJones     23
5     CathyJones     98
6     BrianJones    438
7      MaryJones     75
8     JoeCollins     56
9   CathyCollins    125
10  BrianCollins    900
11   MaryCollins    321

The first column, 'Name', needs to be split into first name and surname, and the two parts placed into a MultiIndex:

               Value
Joe   Smith        5
Cathy Smith        3
Brian Smith       12
Mary  Smith       67
Joe   Jones       23
Cathy Jones       98
Brian Jones      438
Mary  Jones       75
Joe   Collins     56
Cathy Collins    125
Brian Collins    900
Mary  Collins    321

Solution

import pandas as pd

pattern = r'.*\b([A-Z][a-z]*)([A-Z][a-z]*)\b.*'
names = df.Name.str.extract(pattern, expand=True)
midx = pd.MultiIndex.from_tuples(names.values.tolist())
df.index = midx
df[['Value']]

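Explanation

pattern matches an uppercase letter A-Z followed by any number of lowercase letters a-z, followed by another uppercase letter A-Z and any number of lowercase letters a-z. The two capture groups split the match in two: the first name and the surname.

pd.MultiIndex.from_tuples creates the MultiIndex.

names.values.tolist() converts the extracted DataFrame into a list of lists, which from_tuples interprets as tuples.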
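
To see what the two capture groups pick up, the same pattern can be tried on single strings with Python's built-in re module. This is only an illustrative sketch; the helper name split_name is hypothetical and not part of the answer above.

```python
import re

# Same pattern as in the solution above.
pattern = r'.*\b([A-Z][a-z]*)([A-Z][a-z]*)\b.*'

def split_name(name):
    """Return (first_name, surname) for a CamelCase name (hypothetical helper)."""
    match = re.match(pattern, name)
    # group(1) is the first capitalized run, group(2) the second.
    return match.group(1), match.group(2)

print(split_name('JoeSmith'))      # ('Joe', 'Smith')
print(split_name('CathyCollins'))  # ('Cathy', 'Collins')
```

Series.str.extract applies the same matching to every row and returns one column per capture group.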

I think you can use extract to pull out the name and surname, then set_index, and finally drop the Name column:

df[['name','surname']] = df.Name.str.extract(r'([A-Z][a-z]*)([A-Z][a-z]*)', expand=True)
df = df.set_index(['name','surname']).drop('Name', axis=1)
print(df)
               Value
name  surname       
Joe   Smith        5
Cathy Smith        3
Brian Smith       12
Mary  Smith       67
Joe   Jones       23
Cathy Jones       98
Brian Jones      438
Mary  Jones       75
Joe   Collins     56
Cathy Collins    125
Brian Collins    900
Mary  Collins    321
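
A possible variant of the same idea, sketched here on a shortened sample of the question's data: with named capture groups, str.extract uses the group names as column names, so the new columns do not have to be assigned by hand. The variable names names and result are only illustrative.

```python
import pandas as pd

# Shortened sample of the data from the question.
df = pd.DataFrame(
    [['JoeSmith', 5], ['CathyJones', 98], ['BrianCollins', 900]],
    columns=['Name', 'Value'],
)

# Named groups (?P<...>) become the column names of the extracted frame.
names = df['Name'].str.extract(
    r'(?P<name>[A-Z][a-z]*)(?P<surname>[A-Z][a-z]*)', expand=True
)

# Attach the new columns, index on them, and drop the original column.
result = df.join(names).set_index(['name', 'surname']).drop(columns='Name')
print(result)
```

The result has the same shape as the output above: a two-level index (name, surname) and a single Value column.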