Python DataFrame - groupby and centroid calculation
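I have a DataFrame with two columns: one holds a category and the other holds a 300-dimensional vector. For each value in the category column I have many 300-dimensional vectors. What I need is to group the DataFrame by the category column and, at the same time, get the centroid of all the vectors belonging to each category.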
Category    Vector
Balance     [1,2,1,-5,...,9]
Inquiry     [-5,3,1,5,...,10]
Card        [-3,1,2,3,...,1]
Balance     [1,3,-2,1,-5,...,7]
Card        [3,1,3,4,...,2]
So in the case above, the desired output would be:
Category    Vector
Balance     [1,2.5,-0.5,-2,...,8]
Inquiry     [-5,3,1,5,...,10]
Card        [0,1,2.5,3.5,...,1.5]
I have already written the following function, which takes an array of vectors and computes their centroid:
import numpy as np

def get_intent_centroid(array):
    centroid = np.zeros(len(array[0]))
    for vector in array:
        centroid = centroid + vector
    return centroid/len(array)
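So I just need a fast way to apply the function above together with a groupby on the DataFrame.
Please excuse the formatting of the DataFrames; I don't know how to format them properly.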
Following the OP's request, here is a way to do it with plain lists:
vectorsList = list(df["Vector"])
catList = list(df["Category"])

# create a dict entry for each category and initialise it with a list of 300 zeros
dictOfCats = {}
for each in set(catList):
    dictOfCats[each] = [0] * 300

# loop through vectorsList and catList, adding each vector to its own category's running sum
for i in range(0, len(catList)):
    currentVec = dictOfCats[catList[i]]
    for j in range(0, len(vectorsList[i])):
        currentVec[j] = vectorsList[i][j] + currentVec[j]
    dictOfCats[catList[i]] = currentVec

# now each entry in the dict holds a sum; divide it by the count of its category
# the frequency could be calculated with groupby; since only lists are used here, it is counted with a dict
catFreq = {}
for eachCat in catList:
    if eachCat in catFreq:
        catFreq[eachCat] = catFreq[eachCat] + 1
    else:
        catFreq[eachCat] = 1

for eachKey in dictOfCats:
    currentVec = dictOfCats[eachKey]
    newCurrentVec = [x / catFreq[eachKey] for x in currentVec]
    dictOfCats[eachKey] = newCurrentVec

# now change this dictOfCats to a DataFrame again
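Note that the code may contain errors, since I have not checked it against your data. It will be computationally expensive, but it should do the job if you cannot work out a solution with pandas. If you do come up with a pandas solution, please post it as an answer.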
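The last comment in the code above leaves the conversion back to a DataFrame open. A minimal sketch of one way to do it follows; the `dictOfCats` contents here are a hypothetical stand-in (short 5-dimensional centroids rather than real 300-dimensional ones), and the column names `Category` and `Vector` are assumed to match the question.

```python
import pandas as pd

# Hypothetical stand-in for the dictOfCats produced by the list-based code
dictOfCats = {
    "Balance": [1.0, 2.5, -0.5, -2.0, 8.0],
    "Card": [0.0, 1.0, 2.5, 3.5, 1.5],
    "Inquiry": [-5.0, 3.0, 1.0, 5.0, 10.0],
}

# One row per category; each cell of "Vector" holds the whole centroid list
centroids_df = pd.DataFrame({
    "Category": list(dictOfCats.keys()),
    "Vector": list(dictOfCats.values()),
})
```

Each centroid stays a single list-valued cell, which mirrors the two-column layout asked for in the question.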
import pandas as pd
import numpy as np

df = pd.DataFrame(
    [
        {'category': 'Balance', 'vector': [1,2,1,-5,9]},
        {'category': 'Inquiry', 'vector': [-5,3,1,5,10]},
        {'category': 'Card', 'vector': [-3,1,2,3,1]},
        {'category': 'Balance', 'vector': [1,3,-2,1,7]},
        {'category': 'Card', 'vector': [3,1,3,4,2]}
    ]
)

def get_intent_centroid(array):
    centroid = np.zeros(len(array[0]))
    for vector in array:
        centroid = centroid + vector
    return centroid/len(array)

df.groupby('category')['vector'].apply(lambda x: get_intent_centroid(x.tolist()))
Output:
category
Balance [1.0, 2.5, -0.5, -2.0, 8.0]
Card [0.0, 1.0, 2.5, 3.5, 1.5]
Inquiry [-5.0, 3.0, 1.0, 5.0, 10.0]
Name: vector, dtype: object
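The result above is a Series indexed by category, whereas the question asks for a two-column table. A possible follow-up step, sketched here on the same 5-dimensional sample data, is to call `reset_index()` on the grouped result; the variable name `centroids` is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        {'category': 'Balance', 'vector': [1, 2, 1, -5, 9]},
        {'category': 'Inquiry', 'vector': [-5, 3, 1, 5, 10]},
        {'category': 'Card', 'vector': [-3, 1, 2, 3, 1]},
        {'category': 'Balance', 'vector': [1, 3, -2, 1, 7]},
        {'category': 'Card', 'vector': [3, 1, 3, 4, 2]},
    ]
)

def get_intent_centroid(array):
    # Sum the vectors, then divide by how many there are
    centroid = np.zeros(len(array[0]))
    for vector in array:
        centroid = centroid + vector
    return centroid / len(array)

# reset_index() turns the category index back into an ordinary column
centroids = (
    df.groupby('category')['vector']
      .apply(lambda x: get_intent_centroid(x.tolist()))
      .reset_index()
)
```

`centroids` then has the columns `category` and `vector`, with one NumPy array per row in the `vector` column.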
This should work without converting to a list:
def get_intent_centroid(array):
    centroid = np.zeros(len(array.iloc[0]))
    for vector in array:
        centroid = centroid + vector
    # divide by the number of vectors in the group, not by the vector length
    return centroid/len(array)

df.groupby('Category')['Vector'].apply(get_intent_centroid)
The centroid of a list of vectors is just the mean of each dimension across the vectors, so this can be simplified a lot.
df.groupby('Category')['Vector'].apply(lambda x: np.mean(x.tolist(), axis=0))
It should be faster than the explicit-loop approaches, because the averaging is done inside NumPy.
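Since the centroid is a per-dimension mean, another option is to expand the vectors into one numeric column per dimension and let pandas take the grouped mean directly. The sketch below is an alternative not taken from the answers above; it assumes every vector has the same length and uses 5-dimensional sample data in place of the 300-dimensional vectors. The names `wide` and `centroids` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["Balance", "Inquiry", "Card", "Balance", "Card"],
    "Vector": [
        [1, 2, 1, -5, 9],
        [-5, 3, 1, 5, 10],
        [-3, 1, 2, 3, 1],
        [1, 3, -2, 1, 7],
        [3, 1, 3, 4, 2],
    ],
})

# Expand the list column into one numeric column per dimension,
# keeping the category as the row index
wide = pd.DataFrame(df["Vector"].tolist(), index=df["Category"])

# The column-wise mean within each category is that category's centroid
centroids = wide.groupby(level=0).mean()
```

Here `centroids` is a DataFrame with one row per category and one column per dimension, so `centroids.loc["Balance"].tolist()` gives that category's centroid as a list.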