Adding a new pandas df column based on row-wise operations
I have a dataframe like this:
    Interesting  genre_1      probabilities
1   no           Empty        0.251306
2   yes          Empty        0.042043
3   no           Alternative  5.871099
4   yes          Alternative  5.723896
5   no           Blues        0.027028
6   yes          Blues        0.120248
7   no           Children's   0.207213
8   yes          Children's   0.426679
9   no           Classical    0.306316
10  yes          Classical    1.044135
I want to compute the Gini index within each genre, based on the Interesting column. After that, I want to add those values in a new pandas column.
This is the function to get the Gini index:
# Gini function
# a and b are the quantities of each class
def gini(a, b):
    a1 = (a / (a + b)) ** 2
    b1 = (b / (a + b)) ** 2
    return 1 - (a1 + b1)
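EDIT* Sorry, there was an error in my desired final dataframe. Whether a row is interesting or not matters when choosing prob(A) and prob(B), but the Gini score will be the same for both rows, because it measures how much impurity we get when classifying a song as interesting or not. So if the probabilities are around 50/50%, the Gini score reaches its maximum (0.5), because we are equally likely to be wrong whether we pick interesting or not interesting.
So for the first two rows, the Gini index would be: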
a=no; b=Empty -> gini(0.251306, 0.042043)= 0.245559831601612
a=yes; b=Empty -> gini(0.042043, 0.251306)= 0.245559831601612
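As a quick sanity check of the two lines above, here is a minimal sketch that calls the same gini function with the arguments swapped. The variable names no_empty and yes_empty are made up for this illustration; the numbers are the first two rows of the dataframe.

```python
def gini(a, b):
    """Gini impurity for two class quantities a and b."""
    total = a + b
    return 1 - ((a / total) ** 2 + (b / total) ** 2)

# Probabilities of the 'no' and 'yes' rows for the genre "Empty"
no_empty, yes_empty = 0.251306, 0.042043

g_no = gini(no_empty, yes_empty)
g_yes = gini(yes_empty, no_empty)
print(g_no, g_yes)

# Equal quantities give the maximum impurity for two classes
print(gini(1.0, 1.0))
```

Because the formula is symmetric in a and b, both rows of a genre end up with the same score, whichever one is passed first.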
Then I would like to get something like this:
    Interesting  genre_1      probabilities  GINI INDEX
1   no           Empty        0.251306       0.245559831601612
2   yes          Empty        0.042043       0.245559831601612
3   no           Alternative  5.871099       0.4999194135183881
4   yes          Alternative  5.723896       0.4999194135183881
5   no           Blues        0.027028       ..
6   yes          Blues        0.120248
7   no           Children's   0.207213
8   yes          Children's   0.426679
9   no           Classical    0.306316       ..
10  yes          Classical    1.044135       ..
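I am not sure how the Interesting column affects all of this, but I strongly recommend using numpy.where() to create the new column. The syntax looks something like this: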
import numpy as np
df['GINI INDEX'] = np.where(__condition__,__what to do if true__,__what to do if false__)
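To make the placeholder syntax concrete, here is a minimal sketch with the three arguments filled in. The rule it applies is only an assumption for illustration: 'yes' rows are paired with the row directly above them, and 'no' rows get 0.5. The small dataframe re-types the first four rows of the question, and prev, cur and pair_gini are made-up names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Interesting": ["no", "yes", "no", "yes"],
    "genre_1": ["Empty", "Empty", "Alternative", "Alternative"],
    "probabilities": [0.251306, 0.042043, 5.871099, 5.723896],
})

# Probability of the row above (NaN for the first row) and of the row itself
prev = df["probabilities"].shift(1)
cur = df["probabilities"]

# Gini impurity of each row paired with the row above it
pair_gini = 1 - ((prev / (prev + cur)) ** 2 + (cur / (prev + cur)) ** 2)

# np.where(condition, value_if_true, value_if_false) chooses element-wise
df["GINI INDEX"] = np.where(df["Interesting"] == "yes", pair_gini, 0.5)
print(df)
```

np.where evaluates both branches for every row and then picks one per row, so the NaN produced for the first row is harmless as long as that row is a 'no' row.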
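OK, I think I see what you mean now. The code below does not care whether the Interesting value is 'yes' or 'no'. What you want, though, is to compute the GINI coefficient for each row in one of two ways, depending on that row's Interesting value. If Interesting == 'no', the result is 0.5, because a == b. If Interesting is 'yes', you need a = probability[i] and b = probability[i+1]. So skip this part and go to the updated code further down.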
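import pandas as pd

# sep=r'\s+' splits on any whitespace (delim_whitespace=True is deprecated in recent pandas)
df = pd.read_csv('df.txt', sep=r'\s+')
# A NumPy array makes [i] positional, whatever the index labels are
probs = df['probabilities'].to_numpy()

def ROLLING_GINI(probabilities):
    # The first row has no predecessor: pair it with itself, which gives 0.5
    a1 = (probabilities[0] / (probabilities[0] + probabilities[0])) ** 2
    b1 = (probabilities[0] / (probabilities[0] + probabilities[0])) ** 2
    res = 1 - (a1 + b1)
    yield res
    # Each remaining value pairs row i with row i+1
    for i in range(len(probabilities) - 1):
        a1 = (probabilities[i] / (probabilities[i] + probabilities[i + 1])) ** 2
        b1 = (probabilities[i + 1] / (probabilities[i] + probabilities[i + 1])) ** 2
        res = 1 - (a1 + b1)
        yield res

df['GINI'] = list(ROLLING_GINI(probs))
print(df)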
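This is where the real trouble starts, because if I understand your idea correctly, you cannot compute the last GINI value: your dataframe does not allow it. The important point is that the last Interesting value in the dataframe is 'yes'. That means I have to use a = probability[i] and b = probability[i+1]. But your dataframe has no row 11. You have 10 rows, and at row i == 10 you would need the probability from row 11 to compute the GINI coefficient. So for your idea to work, the last Interesting value would have to be 'no'; otherwise you will always get an index error.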
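The missing "row 11" can be made visible without triggering an error. The following is only a sketch: it re-types the last two rows of the dataframe (Classical) and uses Series.shift(-1) to line each row up with the row below it. The names next_probability and has_next are made up for this illustration.

```python
import pandas as pd

# Last two rows of the example data, labelled 9 and 10 as in the question
probabilities = pd.Series([0.306316, 1.044135], index=[9, 10])

# shift(-1) moves every value up one position, so each row sees the row below it
next_probability = probabilities.shift(-1)
print(next_probability)

# The final row has no successor: its entry is NaN rather than an IndexError
has_next = next_probability.notna()
print(has_next)
```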
Here is the code:
import pandas as pd

# sep=r'\s+' splits on any whitespace (delim_whitespace=True is deprecated in recent pandas)
df = pd.read_csv('df.txt', sep=r'\s+')

def ROLLING_GINI(dataframe):
    # NumPy arrays make [i] positional, whatever the index labels are
    probabilities = dataframe['probabilities'].to_numpy()
    how_to_calculate = dataframe['Interesting'].to_numpy()
    # Stops one row early: the last row has no row i+1 to pair with
    for i in range(len(dataframe) - 1):
        if how_to_calculate[i] == 'yes':
            a1 = (probabilities[i] / (probabilities[i] + probabilities[i + 1])) ** 2
            b1 = (probabilities[i + 1] / (probabilities[i] + probabilities[i + 1])) ** 2
            res = 1 - (a1 + b1)
            yield res
        elif how_to_calculate[i] == 'no':
            # a == b here, so the result is always 0.5
            a1 = (probabilities[i] / (probabilities[i] + probabilities[i])) ** 2
            b1 = (probabilities[i] / (probabilities[i] + probabilities[i])) ** 2
            res = 1 - (a1 + b1)
            yield res

GINI = list(ROLLING_GINI(df))
print('All GINI coefficients: %s' % GINI)
print('Length of all calculatable GINI coefficients: %s' % len(GINI))
print('Number of rows in the dataframe: %s' % len(df))
print('The last Interesting value is: %s' % df['Interesting'].iloc[-1])
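Third edit (sorry for being late):
So it does work if I apply the indexing correctly. The problem was that I wanted to use the next probability rather than the previous one. So it is a = probabilities[i-1] and b = probabilities[i].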
import pandas as pd

# sep=r'\s+' splits on any whitespace (delim_whitespace=True is deprecated in recent pandas)
df = pd.read_csv('df.txt', sep=r'\s+')

def ROLLING_GINI(dataframe):
    # NumPy arrays make [i] positional, whatever the index labels are
    probabilities = dataframe['probabilities'].to_numpy()
    how_to_calculate = dataframe['Interesting'].to_numpy()
    for i in range(len(dataframe)):
        if how_to_calculate[i] == 'yes':
            # Pair with the previous row; assumes the first row is 'no',
            # otherwise i-1 == -1 would wrap around to the last row
            a1 = (probabilities[i - 1] / (probabilities[i - 1] + probabilities[i])) ** 2
            b1 = (probabilities[i] / (probabilities[i - 1] + probabilities[i])) ** 2
            res = 1 - (a1 + b1)
            yield res
        elif how_to_calculate[i] == 'no':
            # a == b here, so the result is always 0.5
            a1 = (probabilities[i] / (probabilities[i] + probabilities[i])) ** 2
            b1 = (probabilities[i] / (probabilities[i] + probabilities[i])) ** 2
            res = 1 - (a1 + b1)
            yield res

GINI = list(ROLLING_GINI(df))
print('All GINI coefficients: %s' % GINI)
print('Length of all calculatable GINI coefficients: %s' % len(GINI))
print('Number of rows in the dataframe: %s' % len(df))
print('The last Interesting value is: %s' % df['Interesting'].iloc[-1])
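For comparison, here is a vectorized sketch that gives both rows of a genre the same score, as in the desired table in the question. It is not the loop above: it groups by genre_1 instead of looking at neighbouring rows, and it assumes every genre has exactly one 'no' row and one 'yes' row. The dataframe is re-typed from the question, and total and share are made-up names.

```python
import pandas as pd

df = pd.DataFrame({
    "Interesting": ["no", "yes"] * 5,
    "genre_1": ["Empty", "Empty", "Alternative", "Alternative", "Blues", "Blues",
                "Children's", "Children's", "Classical", "Classical"],
    "probabilities": [0.251306, 0.042043, 5.871099, 5.723896, 0.027028,
                      0.120248, 0.207213, 0.426679, 0.306316, 1.044135],
}, index=range(1, 11))

# Sum of the 'no' and 'yes' probabilities per genre, broadcast back to each row
total = df.groupby("genre_1")["probabilities"].transform("sum")

# Share of this row within its genre; with two rows the other share is 1 - share
share = df["probabilities"] / total

df["GINI INDEX"] = 1 - (share ** 2 + (1 - share) ** 2)
print(df)
```

Since the pairing comes from the genre label rather than from row order, this version does not depend on the 'no' row appearing directly before the 'yes' row.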