Pandas MultiIndex from regex on column
I have a pandas DataFrame that looks like this:
import pandas as pd

df = pd.DataFrame(
    [
        ['JoeSmith', 5],
        ['CathySmith', 3],
        ['BrianSmith', 12],
        ['MarySmith', 67],
        ['JoeJones', 23],
        ['CathyJones', 98],
        ['BrianJones', 438],
        ['MaryJones', 75],
        ['JoeCollins', 56],
        ['CathyCollins', 125],
        ['BrianCollins', 900],
        ['MaryCollins', 321],
    ],
    columns=['Name', 'Value'],
)
print(df)
            Name  Value
0       JoeSmith      5
1     CathySmith      3
2     BrianSmith     12
3      MarySmith     67
4       JoeJones     23
5     CathyJones     98
6     BrianJones    438
7      MaryJones     75
8     JoeCollins     56
9   CathyCollins    125
10  BrianCollins    900
11   MaryCollins    321
The first column, 'Name', needs to be split into first name and surname, and the two parts moved into a MultiIndex:
               Value
Joe   Smith        5
Cathy Smith        3
Brian Smith       12
Mary  Smith       67
Joe   Jones       23
Cathy Jones       98
Brian Jones      438
Mary  Jones       75
Joe   Collins     56
Cathy Collins    125
Brian Collins    900
Mary  Collins    321
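Solution
import pandas as pd

# Two capture groups: first name, then surname (a capital followed by lowercase letters each).
pattern = r'.*\b([A-Z][a-z]*)([A-Z][a-z]*)\b.*'
names = df.Name.str.extract(pattern, expand=True)
# Each row of `names` becomes one (first name, surname) tuple in the index.
midx = pd.MultiIndex.from_tuples(names.values.tolist())
df.index = midx
df[['Value']]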
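Explanation
pattern matches a run of letters that starts with an uppercase A-Z followed by any number of lowercase a-z, then another uppercase A-Z followed by any number of lowercase a-z. The two capture groups split that match into its two parts.
pd.MultiIndex.from_tuples creates the MultiIndex.
names.values.tolist() converts the extracted DataFrame into a list of lists, each of which is interpreted as a tuple.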
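To see what the pattern captures on a single string, it can be tried with the standard library's re module. This is only a sketch: split_name is a hypothetical helper introduced here for illustration, and re.search is used on the assumption that it mirrors how str.extract takes the first match in each string.

```python
import re

pattern = r'.*\b([A-Z][a-z]*)([A-Z][a-z]*)\b.*'

def split_name(s):
    """Hypothetical helper: return (first, last) or None if the pattern does not match."""
    m = re.search(pattern, s)
    return m.groups() if m else None

print(split_name('JoeSmith'))      # ('Joe', 'Smith')
print(split_name('CathyCollins'))  # ('Cathy', 'Collins')
print(split_name('joesmith'))      # None: no capital letters to split on
```

Strings that do not fit the two-capital shape return None here; in str.extract the corresponding rows would come back as NaN, so it may be worth checking for missing values before building the index.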
I think you can use extract to pull out the name and surname, then set_index, and finally drop the Name column:
df[['name','surname']] = df.Name.str.extract(r'([A-Z][a-z]*)([A-Z][a-z]*)', expand=True)
df = df.set_index(['name','surname']).drop('Name', axis=1)
print(df)
               Value
name  surname
Joe   Smith        5
Cathy Smith        3
Brian Smith       12
Mary  Smith       67
Joe   Jones       23
Cathy Jones       98
Brian Jones      438
Mary  Jones       75
Joe   Collins     56
Cathy Collins    125
Brian Collins    900
Mary  Collins    321
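A possible variation on the extract approach is to name the capture groups in the regex, so the extracted columns (and therefore the index levels) get their names without a separate assignment. The sketch below assumes pandas 0.24 or later for pd.MultiIndex.from_frame and uses an abbreviated three-row sample of the data; the names parts and result are introduced here for illustration only.

```python
import pandas as pd

# Abbreviated sample of the question's data.
df = pd.DataFrame(
    [['JoeSmith', 5], ['CathyJones', 98], ['BrianCollins', 900]],
    columns=['Name', 'Value'],
)

# Named groups become the column names of the extracted DataFrame.
parts = df['Name'].str.extract(
    r'(?P<name>[A-Z][a-z]*)(?P<surname>[A-Z][a-z]*)', expand=True
)

# Build the MultiIndex directly from the two extracted columns.
result = df[['Value']].copy()
result.index = pd.MultiIndex.from_frame(parts)
print(result)
```

This leaves the original df untouched and avoids adding temporary columns that then have to be dropped, at the cost of one extra intermediate DataFrame.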