Error calculating entropy over pandas series
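I'm trying to calculate the entropy of a pandas Series. Specifically, I group the strings in Direction into consecutive runs, using this line:
diff_dir = df.iloc[0:,1].ne(df.iloc[0:,1].shift()).cumsum()
It gives every row in Direction a run label that stays the same while the string repeats and increases by one each time the string changes. For each run of identical Direction strings, I then want to calculate the entropy of X and Y.
With this code, the run labels for the sample data are: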
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 3
9 3
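As a quick check of the labelling step on its own, the sketch below applies the same ne/shift/cumsum idiom to the ten sample directions. The names `direction`, `changed` and `run_id` are chosen here for illustration and are not part of the original code.

```python
import pandas as pd

direction = pd.Series(['Left'] * 5 + ['Right'] * 3 + ['Left'] * 2)

# True wherever a value differs from the previous row
# (the first row is compared with NaN, so it is always True)
changed = direction.ne(direction.shift())

# The running total of change flags gives one integer label per consecutive run
run_id = changed.cumsum()

print(run_id.tolist())  # [1, 1, 1, 1, 1, 2, 2, 2, 3, 3]
```

Note that the second block of 'Left' rows gets its own label (3) rather than being merged with the first.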
This code used to work, but now it returns an error. I'm not sure whether that is the result of an upgrade.
import pandas as pd
import numpy as np

def ApEn(U, m = 2, r = 0.2):
    '''
    Approximate Entropy
    Quantify the amount of regularity over time-series data.
    Input parameters:
    U = Time series
    m = Length of compared run of data (subseries length)
    r = Filtering level (tolerance). A positive number
    '''
    def _maxdist(x_i, x_j):
        return max([abs(ua - va) for ua, va in zip(x_i, x_j)])

    def _phi(m):
        x = [U.tolist()[i:i + m] for i in range(N - m + 1)]
        C = [len([1 for x_j in x if _maxdist(x_i, x_j) <= r]) / (N - m + 1.0) for x_i in x]
        return (N - m + 1.0)**(-1) * sum(np.log(C))

    N = len(U)
    return abs(_phi(m + 1) - _phi(m))

def Entropy(df):
    '''
    Calculate entropy for individual direction
    '''
    df = df[['Time','Direction','X','Y']]
    diff_dir = df.iloc[0:,1].ne(df.iloc[0:,1].shift()).cumsum()
    # Calculate ApEn grouped by direction.
    df['ApEn_X'] = df.groupby(diff_dir)['X'].transform(ApEn)
    df['ApEn_Y'] = df.groupby(diff_dir)['Y'].transform(ApEn)
    return df

df = pd.DataFrame(np.random.randint(0,50, size = (10, 2)), columns=list('XY'))
df['Time'] = range(1, len(df) + 1)
direction = ['Left','Left','Left','Left','Left','Right','Right','Right','Left','Left']
df['Direction'] = direction
# Calculate defensive regularity
entropy = Entropy(df)
Error:
    return (N - m + 1.0)**(-1) * sum(np.log(C))
ZeroDivisionError: 0.0 cannot be raised to a negative power
You have to handle your ZeroDivisions. Maybe like this:
def _phi(m):
    if N == m - 1:
        return 0
    ...
After that you will run into a length mismatch in the groupbys: df and diff_X must have the same length.
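It looks like, when ApEn._phi() is called, particular values of N and m can make N - m + 1 evaluate to 0. That value then has to be raised to the power -1, which is undefined (see also Why does zero raised to the power of negative one equal infinity?).
To illustrate, I tried to reproduce your exact scenario, and in the first iteration of the transform operation the following happened:
U is: 1     0
      2    48
(the first group has 2 elements)
N is: 2
m is: 3
When you reach the return value of _phi(), you are computing (N - m + 1.0)**-1 = (2 - 3 + 1)**-1 = 0**-1, which is undefined. Perhaps the key point is this: you say you are grouping by individual direction and passing the U array to the approximate entropy function, but you are actually grouping by diff_X and diff_Y, which, by the nature of that method, produces very small groups. As I understand it, if you want the approximate entropy for each direction, you only need to group by 'Direction':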
def Entropy(df):
    '''
    Calculate entropy for individual direction
    '''
    # Calculate ApEn grouped by direction.
    df['ApEn_X'] = df.groupby('Direction')['X'].transform(ApEn)
    df['ApEn_Y'] = df.groupby('Direction')['Y'].transform(ApEn)
    return df
This produces a dataframe like the following:
entropy.head()
   Time Direction   X   Y    ApEn_X    ApEn_Y
0     1      Left  28  47  0.035091  0.035091
1     2        Up   8  47  0.013493  0.046520
2     3        Up   0  32  0.013493  0.046520
3     4     Right  34   8  0.044452  0.044452
4     5     Right  49  27  0.044452  0.044452
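The difference between the two groupings shows up directly in the group sizes. Here is a minimal sketch using the ten sample directions from the question; the names `run_id`, `by_run` and `by_direction` are illustrative, and the X values are placeholders.

```python
import pandas as pd

df = pd.DataFrame({
    'Direction': ['Left'] * 5 + ['Right'] * 3 + ['Left'] * 2,
    'X': range(10),  # placeholder values, only the group sizes matter here
})

# Label consecutive runs of the same direction
run_id = df['Direction'].ne(df['Direction'].shift()).cumsum()

by_run = df.groupby(run_id)['X'].size()
by_direction = df.groupby('Direction')['X'].size()

print(by_run.tolist())         # [5, 3, 2]
print(by_direction.to_dict())  # {'Left': 7, 'Right': 3}
```

Grouping by 'Direction' pools non-adjacent runs into one series, so it avoids tiny groups but answers a slightly different question than a per-run entropy. Which one is appropriate depends on what the entropy is meant to describe.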
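The problem is caused by this expression:
(N - m + 1.0)**(-1)
Consider N == 1. Since N = len(U), this happens when a group produced by the groupby has size 1. With m == 2 this ends up as
(1 - 2 + 1)**-1 == 0**-1
and 0**-1 is undefined, hence the error.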
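To make the arithmetic concrete, here is a small sketch listing the group sizes for which N - m + 1 is zero. The helper name `zero_denominator_sizes` is hypothetical. Because ApEn calls both _phi(m) and _phi(m + 1), a group of size 2 is affected as well as a group of size 1 when m == 2.

```python
def zero_denominator_sizes(m=2):
    """Group sizes N for which N - k + 1 == 0 for a window length k in (m, m + 1)."""
    # ApEn evaluates _phi(m) and _phi(m + 1), so both window lengths matter
    return sorted(k - 1 for k in (m, m + 1))

sizes = zero_denominator_sizes(m=2)
print(sizes)  # [1, 2]

# Python refuses to raise a float zero to a negative power
raised = False
try:
    (2 - 3 + 1.0) ** (-1)
except ZeroDivisionError:
    raised = True
print(raised)  # True
```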
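Now, looking at it from a theoretical standpoint: how would you define the approximate entropy of a time series with only one value? It is highly unpredictable, so the entropy should be as high as possible. For this case, let's set it to np.nan to indicate that it is undefined (entropy is always greater than or equal to 0).
Code: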
import pandas as pd
import numpy as np

def ApEn(U, m = 2, r = 0.2):
    '''
    Approximate Entropy
    Quantify the amount of regularity over time-series data.
    Input parameters:
    U = Time series
    m = Length of compared run of data (subseries length)
    r = Filtering level (tolerance). A positive number
    '''
    def _maxdist(x_i, x_j):
        # Largest element-wise absolute difference between two windows
        return max([abs(ua - va) for ua, va in zip(x_i, x_j)])

    def _phi(m):
        # N - m + 1 is the number of windows of length m; when it is
        # exactly 0 the result is undefined, so return NaN instead of 0**-1
        if (N - m + 1) == 0:
            return np.nan
        x = [U.tolist()[i:i + m] for i in range(N - m + 1)]
        C = [len([1 for x_j in x if _maxdist(x_i, x_j) <= r]) / (N - m + 1.0) for x_i in x]
        return (N - m + 1)**(-1) * sum(np.log(C))

    N = len(U)
    return abs(_phi(m + 1) - _phi(m))

def Entropy(df):
    '''
    Calculate entropy for individual direction
    '''
    # Work on a copy so the new columns are not written to a slice of the caller's frame
    df = df[['Time','Direction','X','Y']].copy()
    # Label each consecutive run of identical Direction values
    diff_dir = df.iloc[0:,1].ne(df.iloc[0:,1].shift()).cumsum()
    # Calculate ApEn grouped by direction.
    df['ApEn_X'] = df.groupby(diff_dir)['X'].transform(ApEn)
    df['ApEn_Y'] = df.groupby(diff_dir)['Y'].transform(ApEn)
    return df

np.random.seed(0)
df = pd.DataFrame(np.random.randint(0,50, size = (10, 2)), columns=list('XY'))
df['Time'] = range(1, len(df) + 1)
direction = ['Left','Left','Left','Left','Left','Right','Right','Right','Left','Left']
df['Direction'] = direction
# Calculate defensive regularity
print(Entropy(df))
Output:
   Time Direction   X   Y    ApEn_X    ApEn_Y
0     1      Left   6  16  0.287682  0.287682
1     2      Left  22   6  0.287682  0.287682
2     3      Left  16   5  0.287682  0.287682
3     4      Left   5  48  0.287682  0.287682
4     5      Left  11  21  0.287682  0.287682
5     6     Right  44  25  0.693147  0.693147
6     7     Right  14  12  0.693147  0.693147
7     8     Right  43  40  0.693147  0.693147
8     9      Left  46  44       NaN       NaN
9    10      Left  49   2       NaN       NaN
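The identical values in ApEn_X and ApEn_Y can be checked by hand. With r = 0.2 and integer data, every window in the output above matches only itself, so each entry of C is 1 / (number of windows) and _phi reduces to the log of that fraction. The sketch below works this out under that assumption; `apen_all_distinct` is a hypothetical helper, not part of the answer's code.

```python
import math

def apen_all_distinct(n, m=2):
    """ApEn of a run of length n when every window matches only itself (assumes n > m)."""
    phi_m = math.log(1.0 / (n - m + 1))  # n - m + 1 windows of length m
    phi_m1 = math.log(1.0 / (n - m))     # n - m windows of length m + 1
    return abs(phi_m1 - phi_m)

print(round(apen_all_distinct(5), 6))  # 0.287682, i.e. log(4/3)
print(round(apen_all_distinct(3), 6))  # 0.693147, i.e. log(2)
```

In other words, for such short runs with a tight tolerance the result depends only on the run length, not on the data.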
A larger sample (which runs into the 0**-1 problem):
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0,50, size = (100, 2)), columns=list('XY'))
df['Time'] = range(1, len(df) + 1)
direction = ['Left','Right','Up','Down']
df['Direction'] = np.random.choice((direction), len(df))
print (Entropy(df))
Output:
    Time Direction   X   Y  ApEn_X  ApEn_Y
0      1      Left  44  47     NaN     NaN
1      2      Left   0   3     NaN     NaN
2      3      Down   3  39     NaN     NaN
3      4     Right   9  19     NaN     NaN
4      5        Up  21  36     NaN     NaN
..   ...       ...  ..  ..     ...     ...
95    96        Up  19  33     NaN     NaN
96    97      Left  40  32     NaN     NaN
97    98        Up  36   6     NaN     NaN
98    99      Left  21  31     NaN     NaN
99   100     Right  13   7     NaN     NaN
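The NaN values in this sample are a consequence of run length. When four directions are drawn independently and uniformly, a run continues with probability 1/4, so only about one run in sixteen reaches length 3, and any run of length m or less ends up as NaN with the guarded _phi. Here is a rough sketch for counting the short runs; the names `directions`, `run_len` and `too_short` are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
directions = pd.Series(np.random.choice(['Left', 'Right', 'Up', 'Down'], 100))

# Label consecutive runs, then count how many rows each run contains
run_id = directions.ne(directions.shift()).cumsum()
run_len = run_id.value_counts()

m = 2
# Runs with at most m rows hit N - k + 1 == 0 for k == m or k == m + 1
too_short = int((run_len <= m).sum())

print(len(run_len), "runs in total,", too_short, "of them too short for ApEn")
```

If per-run entropy is really what is needed, the data has to contain runs longer than m rows for the result to be defined.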