Is there a Pandas solution (e.g. with numba or Cython) to `transform`/`apply` with an index, on a MultiIndexed DataFrame?
Is there a Pandas solution (e.g. with numba or Cython) to `transform`/`apply` with the index?
I know I can use iterrows, itertuples, iteritems or items. But what I want to do should be trivial to vectorize… I've built a simple proxy to my actual use-case (runnable code):
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.randn(8, 4),
    index=[np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
           np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])])

namednumber2numbername = {
    'one': ('zero', 'one', 'two', 'three', 'four',
            'five', 'six', 'seven', 'eight', 'nine'),
    'two': ('i', 'ii', 'iii', 'iv', 'v',
            'vi', 'vii', 'viii', 'ix', 'x')
}

def namednumber2numbername_applicator(series):
    def to_s(value):
        # `str` stands in for six's `string_types`, which is not imported here
        if pd.isnull(value) or isinstance(value, str): return value
        value = np.ushort(value)
        if value > 10: return value
        # TODO: Figure out idx of `series.name` at this `value`… instead of `'one'`
        return namednumber2numbername['one'][value]
    return series.apply(to_s)

df.transform(namednumber2numbername_applicator)
Actual output:
              0      1      2      3
bar one   zero   zero    one  65535
    two   zero   zero   zero   zero
baz one   zero   zero   zero   zero
    two   zero    two   zero   zero
foo one  65535   zero   zero   zero
    two   zero  65535  65534   zero
qux one   zero    one   zero   zero
    two   zero   zero   zero   zero
Output I want:
              0      1      2      3
bar one   zero   zero    one  65535
    two      i      i      i      i
baz one   zero   zero   zero   zero
    two      i    iii      i      i
foo one  65535   zero   zero   zero
    two      i  65535  65534      i
qux one   zero    one   zero   zero
    two      i      i      i      i
Possibly related: How to query MultiIndex index columns values in pandas
Essentially I'm looking for the same behaviour as JavaScript's Array.prototype.map (which passes along the idx).
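I wrote a very fast version of the transform to get these results. You can also do the np.ushort conversion inside the list comprehension and it is still fast, but doing it once outside is much faster:
import time

df = pd.DataFrame(
    np.random.randn(8, 4**7),
    index=[np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
           np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])])

start = time.time()
# Convert the whole frame once; rebuilding it keeps an integer dtype,
# which the tuple lookup below needs.
df = pd.DataFrame(np.ushort(df), index=df.index, columns=df.columns)
# axis=1 passes each row to the lambda; x.name[1] is the row's second index level.
df = df.transform(lambda x: [i if i > 10 else namednumber2numbername[x.name[1]][i] for i in x], axis=1)
end = time.time()
print(end - start)
# 1.150895118713379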
Here is the timing at the original size:
df = pd.DataFrame(
    np.random.randn(8, 4),
    index=[np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
           np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])])

start = time.time()
df = pd.DataFrame(np.ushort(df), index=df.index, columns=df.columns)
df = df.transform(lambda x: [i if i > 10 else namednumber2numbername[x.name[1]][i] for i in x], axis=1)
end = time.time()
print(end - start)
# 0.005067110061645508

In [453]: df
Out[453]:
              0      1      2      3
bar one   zero   zero    one   zero
    two      i      i      i      i
baz one   zero   zero   zero   zero
    two      i      i     ii      i
foo one  65535   zero  65535   zero
    two      i      i      i      i
qux one   zero   zero   zero   zero
    two      i      i      i     ii
I got a one-liner:
df.transform(lambda x: [np.ushort(value) if np.ushort(value) > 10 else namednumber2numbername[pos[1]][np.ushort(value)] for pos, value in x.items()])
              0      1      2      3
bar one   zero   zero   zero   zero
    two      i      i     ii      i
baz one  65534   zero  65535   zero
    two     ii      i  65535      i
foo one   zero   zero   zero   zero
    two     ii      i      i     ii
qux one  65535   zero   zero   zero
    two      i      i      i      i
OK, a version without .items():
def what(x):
    # Each (level 0, level 1) group holds a single row, so x has one element.
    first = x.iloc[0]
    if type(first) == np.float64:
        if np.ushort(first) > 10:
            return np.ushort(first)
        else:
            return namednumber2numbername[x.index[0][1]][np.ushort(first)]

df.groupby(level=[0, 1]).transform(what)
              0      1      2      3
bar one   zero    one   zero   zero
    two      i     ii  65535      i
baz one   zero   zero  65535   zero
    two      i      i      i      i
foo one   zero    one   zero   zero
    two      i      i      i      i
qux one    two   zero   zero  65534
    two      i      i      i     ii
And another one-liner, without .items as you requested! We group on levels 0 and 1 and then perform the calculation to determine the values:
df.groupby(level=[0, 1]).transform(lambda x: np.ushort(x.iloc[0]) if type(x.iloc[0]) == np.float64 and np.ushort(x.iloc[0]) > 10 else namednumber2numbername[x.index[0][1]][np.ushort(x.iloc[0])])
              0      1      2      3
bar one   zero    one   zero   zero
    two      i     ii  65535      i
baz one   zero   zero  65535   zero
    two      i      i      i      i
foo one   zero    one   zero   zero
    two      i      i      i      i
qux one    two   zero   zero  65534
    two      i      i      i     ii
To get the other values, I did this:
df.transform(lambda x: [ str(x.name[0]) + '_' + str(x.name[1]) + '_' + str( pos)+ '_' +str(value) for pos,value in x.items()])
print('Transformed DataFrame:\n',
df.transform(what), sep='')
Transformed DataFrame:
α ... ω ε
f a b c ... b c j
one α_a_one_79.96465755359696 α_b_one_31.32938096131651 α_c_one_2.61444370203201 ... ω_b_one_35.7457972161041 ω_c_one_40.224465043054195 ε_j_one_43.527184108357496
two α_a_two_42.66244395377804 α_b_two_65.92020941618344 α_c_two_77.26467264185487 ... ω_b_two_40.91908469505522 ω_c_two_50.395561828234555 ε_j_two_71.67418483119914
one α_a_one_47.9769845681328 α_b_one_38.90671671550259 α_c_one_67.13601594352508 ... ω_b_one_23.23799084164898 ω_c_one_63.551178212994465 ε_j_one_16.975582723809303
Here is one without .items:
df.transform(lambda x: ['_'.join((x.name[0], x.name[1], idx, str(i) if isinstance(i, float) else '0')) for idx, i in zip(x.index, x)])
Output:
α ... ω ε
f a b c ... b c j
one α_a_one_79.96465755359696 α_b_one_31.32938096131651 α_c_one_2.61444370203201 ... ω_b_one_35.7457972161041 ω_c_one_40.224465043054195 ε_j_one_43.527184108357496
two α_a_two_42.66244395377804 α_b_two_65.92020941618344 α_c_two_77.26467264185487 ... ω_b_two_40.91908469505522 ω_c_two_50.395561828234555 ε_j_two_71.67418483119914
one α_a_one_47.9769845681328 α_b_one_38.90671671550259 α_c_one_67.13601594352508 ... ω_b_one_23.23799084164898 ω_c_one_63.551178212994465 ε_j_one_16.975582723809303
I also did it this way, without grouping:
df.T.apply(lambda x: x.name[0] + '_' + x.name[1] + '_' + df.T.eq(x).columns + '_' + x.astype(str), axis=1).T
Or, even better and simplest:
df.T.apply(lambda x: x.name[0] + '_' + x.name[1] + '_' + x.index + '_' + x.astype(str), axis=1).T
Or:
df.T.transform(lambda x: x.name[0] + '_' + x.name[1] + '_' + x.index + '_' + x.astype(str), axis=1).T
Or without .T (here x is a row, so x.index holds the column labels and x.name the row label):
df.transform(lambda x: x.index.get_level_values(0) + '_' + x.index.get_level_values(1) + '_' + x.name + '_' + x.astype(str), axis=1)
α ... ω ε
f a b c ... b c j
one α_a_one_79.96465755359696 α_b_one_31.32938096131651 α_c_one_2.61444370203201 ... ω_b_one_35.7457972161041 ω_c_one_40.224465043054195 ε_j_one_43.527184108357496
two α_a_two_42.66244395377804 α_b_two_65.92020941618344 α_c_two_77.26467264185487 ... ω_b_two_40.91908469505522 ω_c_two_50.395561828234555 ε_j_two_71.67418483119914
one α_a_one_47.9769845681328 α_b_one_38.90671671550259 α_c_one_67.13601594352508 ... ω_b_one_23.23799084164898 ω_c_one_63.551178212994465 ε_j_one_16.975582723809303
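Since the labelling above is plain string concatenation, it can also be written without any per-row or per-column function call. The following is only a sketch under assumptions: the small frame `demo` and the variable names are hypothetical, and it relies on NumPy broadcasting over object arrays rather than on anything the answer itself uses.

```python
import numpy as np
import pandas as pd

# Hypothetical frame shaped like the one in the output above:
# two-level column labels, single-level row labels.
demo = pd.DataFrame(
    [[1.5, 2.5], [3.5, 4.5]],
    index=["one", "two"],
    columns=pd.MultiIndex.from_tuples([("α", "a"), ("ω", "b")]),
)

# One prefix per column ("α_a", "ω_b") and one label per row.
col_prefix = np.array(["_".join(col) for col in demo.columns], dtype=object)
row_label = np.array(demo.index, dtype=object)
cells = demo.astype(str).to_numpy(dtype=object)

# Broadcasting a (1, n_cols) array against an (n_rows, 1) array builds
# every column/row combination in one step.
labelled = col_prefix[None, :] + "_" + row_label[:, None] + "_" + cells
result = pd.DataFrame(labelled, index=demo.index, columns=demo.columns)
print(result)
```

Object arrays still concatenate element by element in Python, so this mainly removes the pandas call overhead; it is not a compiled string kernel.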
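`transform` applies the function to each column by default. You can apply it to each row instead by setting the axis parameter to 1 or 'columns'. You then have access to the row's index label and can pass its second field to your function:
def namednumber2numbername_applicator(series):
    def to_s(value, name):
        if pd.isnull(value): return value
        value = np.ushort(value)
        if value > 10: return value
        return namednumber2numbername[name][value]
    # With axis=1, series.name is the row's (level 0, level 1) label.
    return series.apply(to_s, args=(series.name[1],))

df.transform(namednumber2numbername_applicator, axis=1)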
Result:
              0      1      2      3
bar one  65535   zero   zero  65535
    two     ii      i    iii  65535
baz one  65535   zero   zero  65535
    two      i      i  65535      i
foo one   zero   zero   zero   zero
    two      i  65535      i      i
qux one   zero   zero   zero  65535
    two      i      i      i      i
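To check what a row-wise function actually receives, here is a minimal sketch. The frame `demo` is hypothetical, and it uses `apply` rather than `transform` only because it returns one scalar per row for inspection.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame with the same kind of two-level row index.
demo = pd.DataFrame(
    np.arange(4).reshape(2, 2),
    index=pd.MultiIndex.from_tuples([("bar", "one"), ("bar", "two")]),
)

# With axis=1 the function gets one row at a time as a Series;
# the row's MultiIndex label arrives as a tuple in `row.name`.
seen = demo.apply(lambda row: row.name[1], axis=1)
print(seen.tolist())
```

The second element of that tuple is the key used to pick the right lookup tuple in the answer above.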
An example using Series.map:
class dict_default_key(dict):
    # Series.map uses __missing__ for keys that are not in the dict,
    # so unmapped values pass through unchanged instead of becoming NaN.
    def __missing__(self, key):
        return key

number_names = [
    'zero',
    'one',
    'two',
    'three',
    'four',
    'five',
    'six',
    'seven',
    'eight',
    'nine'
]
roman_numerals = [
    'i', 'ii', 'iii', 'iv', 'v', 'vi', 'vii', 'viii', 'ix', 'x'
]
name_mapping = {
    'one': dict_default_key(
        {c: v for c, v in enumerate(number_names)}
    ),
    'two': dict_default_key(
        {c: v for c, v in enumerate(roman_numerals)}
    )
}

def translate(series):
    key = series.name[1]
    row_map = name_mapping[key]
    result = series.map(row_map)
    return result

ushorts = df.apply(np.ushort)
ushorts.apply(translate, axis=1)
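The pass-through behaviour this answer relies on can be seen in isolation. A minimal sketch, with the class name `KeepMissing` and the sample values chosen only for illustration:

```python
import pandas as pd

class KeepMissing(dict):
    """Hypothetical stand-in for dict_default_key: return the key itself
    when it has no entry."""
    def __missing__(self, key):
        return key

names = KeepMissing({0: "zero", 1: "one"})
s = pd.Series([0, 1, 65535])

# Series.map honours __missing__ on dict subclasses, so 65535 is kept
# instead of being turned into NaN.
mapped = s.map(names)
print(mapped.tolist())
```

With a plain `dict` in place of `KeepMissing`, the last element would come back as NaN.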
Here is another approach using reindex and np.where():
def myf(dataframe, dictionary):
    cond1 = dataframe.isna()
    cond2 = np.ushort(dataframe) > 10
    # One row of names per row of `dataframe`, picked by the second index level.
    m = (pd.DataFrame.from_dict(dictionary, orient='index')
         .reindex(dataframe.index.get_level_values(1)))
    m.index = pd.MultiIndex.from_arrays((dataframe.index.get_level_values(0), m.index))
    arr = np.where(cond1 | cond2, np.ushort(dataframe),
                   m[m.columns.intersection(dataframe.columns)])
    return pd.DataFrame(arr, dataframe.index, dataframe.columns)

myf(df, namednumber2numbername)
              0      1      2      3
bar one   zero    one    two  three
    two  65535     ii    iii  65535
baz one   zero    one  65535  three
    two      i     ii    iii     iv
foo one   zero  65535    two  three
    two      i     ii    iii     iv
qux one   zero  65535    two  65535
    two      i     ii    iii     iv
Steps followed:
- This function creates a dataframe from the dictionary (m) and reindexes it like the original.
- After this, we add an extra level to make it a MultiIndex, the same as the original dataframe (print m inside the function to see it).
- Then we check the condition: whether the dataframe value is null or its np.ushort value is more than 10.
- If the condition matches, return the np.ushort of the dataframe value, else the values from the matching columns of m.
Let me know if there is any check or step I have missed, or anything you would like incorporated, since I think this is one way to avoid row-wise computation.
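The result above fills each cell with the name for its column position (zero, one, two, three across the first row) rather than for the value stored in the cell. If the lookup is meant to follow the value, as in the desired output in the question, one possible row-free sketch uses NumPy integer-array indexing instead of reindex. The function name `lookup_by_value`, the frame `demo`, and the `< 10` bound (the length of each tuple) are assumptions made for this illustration. The sketch also assumes every second-level label is a key of the mapping and that the frame already holds integers.

```python
import numpy as np
import pandas as pd

namednumber2numbername = {
    "one": ("zero", "one", "two", "three", "four",
            "five", "six", "seven", "eight", "nine"),
    "two": ("i", "ii", "iii", "iv", "v",
            "vi", "vii", "viii", "ix", "x"),
}

def lookup_by_value(frame, mapping):
    """Replace each integer in range with the name chosen by the row's
    second index level; leave every other value untouched."""
    keys = list(mapping)
    # 2-D object table: one row per key, one column per nameable number.
    table = np.array([mapping[k] for k in keys], dtype=object)
    # Position of each row's second-level label within `keys`.
    row_key = pd.Index(keys).get_indexer(frame.index.get_level_values(1))
    values = frame.to_numpy()
    in_range = (values >= 0) & (values < table.shape[1])
    # Out-of-range cells get a dummy position 0; they are masked out below.
    safe = np.where(in_range, values, 0).astype(np.intp)
    picked = table[row_key[:, None], safe]
    out = np.where(in_range, picked, values.astype(object))
    return pd.DataFrame(out, index=frame.index, columns=frame.columns)

# Hypothetical integer input, standing in for the np.ushort-converted frame.
demo = pd.DataFrame(
    [[0, 2], [0, 2], [65535, 1], [9, 65534]],
    index=pd.MultiIndex.from_arrays(
        [["bar", "bar", "baz", "baz"], ["one", "two", "one", "two"]]),
)
result = lookup_by_value(demo, namednumber2numbername)
print(result)
```

The whole lookup is a single indexing operation on `table`, so no Python function runs per row or per cell.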
Here is how I would solve this:
# 1. Rewrite functions to include a parameter for `idx`
def some_fun_name(value, idx):
    value = np.ushort(value)
    if value > 10:
        return value
    else:
        return namednumber2numbername[idx][value]

def apply_some_fun_name(s):
    # Every row in a level-1 group shares the same label, so take the only one.
    idx = list(s.index.get_level_values(1).unique())[0]
    return s.transform(some_fun_name, idx=idx)

# 2. Apply function over the keys of the multi-index, replacing while operating:
df = df.groupby(level=1).transform(apply_some_fun_name)

# 3. I got the following result while using `np.random.seed(1)`:
              0      1      2      3
bar one    one   zero   zero  65535
    two      i  65534     ii      i
baz one   zero   zero    one  65534
    two      i      i     ii  65535
foo one   zero   zero   zero   zero
    two  65535     ii      i      i
qux one   zero   zero   zero   zero
    two      i      i      i      i
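To see why taking the single unique level-1 value works inside each group, here is a small sketch listing which rows `groupby(level=1)` puts together. The frame `demo` is hypothetical and holds zeros only, since just the index matters here.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    np.zeros((4, 2)),
    index=pd.MultiIndex.from_arrays(
        [["bar", "bar", "baz", "baz"], ["one", "two", "one", "two"]]),
)

# Iterating a groupby yields (key, sub-frame) pairs; with level=1 the key is
# the second index level, so all rows of a group share one lookup tuple.
groups = {key: list(sub.index) for key, sub in demo.groupby(level=1)}
print(groups)
```

Each group therefore needs only one dictionary lookup, whatever the number of rows it contains.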