Find sum of squares after grouping by two columns

Question

I have a dataset that looks like this:

Value         Type       X_sq    
-1.975767     Weather   
-0.540979     Fruits
-2.359127     Fruits
-2.815604     Corona
-0.929755     Weather

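I want to go through each row and compute the running sum of squares over that row and all rows above it, counting only rows whose Type matches. I want to store this value in the X_sq column.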

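For example, the first row has nothing above it, so the result is just (-1.975767 * -1.975767). In the second row, there is no Fruits row above it, so the result is just -0.540979 * -0.540979. In the third row, however, scanning the preceding rows shows that Fruits has already appeared, so we should take the X_sq value of the most recent Fruits row and add the new square to it.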

Value         Type       X_sq
-1.975767     Weather   -1.975767 * -1.975767    = x
-0.540979     Fruits    -0.540979 * -0.540979    = y
-2.359127     Fruits    y + ( -2.359127 x -2.359127)  
-2.815604     Corona    -2.815604 * -2.815604
-0.929755     Weather   x + (-0.929755 * -0.929755)

I tried this, and it works well:

df['sumOfSquares'] = df['value'].pow(2).groupby(df['type']).cumsum()
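
For reference, here is a self-contained sketch of this one-column case using the sample rows above. The lowercase column names follow the snippet; the explicit loop and the names `running` and `expected` are illustrative assumptions, added only to make the logic visible.

```python
import pandas as pd

df = pd.DataFrame({
    "value": [-1.975767, -0.540979, -2.359127, -2.815604, -0.929755],
    "type": ["Weather", "Fruits", "Fruits", "Corona", "Weather"],
})

# Square each value, then keep a running total within each type.
df["sumOfSquares"] = df["value"].pow(2).groupby(df["type"]).cumsum()

# The same logic as an explicit loop: one running total per type.
running = {}
expected = []
for value, kind in zip(df["value"], df["type"]):
    running[kind] = running.get(kind, 0.0) + value ** 2
    expected.append(running[kind])

print(df)
```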

However, I now want to group by two columns, so that both Country and Type must match.

Value         Type       X_sq    Country
-1.975767     Weather            Albania
-0.540979     Fruits             Brazil      --should be grouped
-2.359127     Fruits             Brazil      --should be grouped
-2.815604     Corona             Albania
-0.929755     Weather            Chine

I tried this (here Type = themes):

df['sumOfSquares'] = df['value'].pow(2).groupby(['themes', 'suppliers_country']).cumsum()

However, it gives me this error, even though 'themes' exists in the dataset:

----> 1 df['sumOfSquares'] = df['avg_country_tone'].pow(2).groupby(['themes', 'suppliers_country']).cumsum()
     

File /usr/local/Cellar/ipython/8.0.1/libexec/lib/python3.10/site-packages/pandas/core/series.py:1929, in Series.groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna)
   1925 axis = self._get_axis_number(axis)
   1927 # error: Argument "squeeze" to "SeriesGroupBy" has incompatible type
   1928 # "Union[bool, NoDefault]"; expected "bool"
-> 1929 return SeriesGroupBy(
   1930     obj=self,
   1931     keys=by,
   1932     axis=axis,
   1933     level=level,
   1934     as_index=as_index,
   1935     sort=sort,
   1936     group_keys=group_keys,
   1937     squeeze=squeeze,  # type: ignore[arg-type]
   1938     observed=observed,
   1939     dropna=dropna,
   1940 )

File /usr/local/Cellar/ipython/8.0.1/libexec/lib/python3.10/site-packages/pandas/core/groupby/groupby.py:882, in GroupBy.__init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated, dropna)
    879 if grouper is None:
    880     from pandas.core.groupby.grouper import get_grouper
--> 882     grouper, exclusions, obj = get_grouper(
    883         obj,
    884         keys,
    885         axis=axis,
    886         level=level,
    887         sort=sort,
    888         observed=observed,
    889         mutated=self.mutated,
    890         dropna=self.dropna,
    891     )
    893 self.obj = obj
    894 self.axis = obj._get_axis_number(axis)

File /usr/local/Cellar/ipython/8.0.1/libexec/lib/python3.10/site-packages/pandas/core/groupby/grouper.py:882, in get_grouper(obj, key, axis, level, sort, observed, mutated, validate, dropna)
    880         in_axis, level, gpr = False, gpr, None
    881     else:
--> 882         raise KeyError(gpr)
    883 elif isinstance(gpr, Grouper) and gpr.key is not None:
    884     # Add key to exclusions
    885     exclusions.add(gpr.key)

KeyError: 'themes'
This happens even though themes is there (themes = Type).

Answer 1

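The error occurs because you are grouping a pandas Series, and a Series has no keys named 'themes' or 'suppliers_country'. To group a Series, you have to pass other Series as the groupby argument, not column-name strings. Try concatenating the string columns into a single Series and grouping by that: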

df['sumOfSquares'] = df['Value'].pow(2).groupby(df.Type+"__"+df.Country).cumsum()

Alternatively, you can group by the two Series directly (which I think was your original idea):

df['sumOfSquares'] = df['Value'].pow(2).groupby([df.Type,df.Country]).cumsum()
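
To make the difference concrete, here is a minimal sketch on the sample rows from the question. It reproduces the KeyError and then applies both fixes. The variable names (`squares`, `by_concat`, `by_list`, `raised`) are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Value": [-1.975767, -0.540979, -2.359127, -2.815604, -0.929755],
    "Type": ["Weather", "Fruits", "Fruits", "Corona", "Weather"],
    "Country": ["Albania", "Brazil", "Brazil", "Albania", "Chine"],
})

# A Series of squares: it no longer carries the Type or Country columns.
squares = df["Value"].pow(2)

# Passing column names fails, because the Series cannot look them up.
try:
    squares.groupby(["Type", "Country"]).cumsum()
    raised = False
except KeyError:
    raised = True

# Fix 1: one combined key Series (assumes "__" never occurs in the data).
by_concat = squares.groupby(df["Type"] + "__" + df["Country"]).cumsum()

# Fix 2: a list of key Series, which pandas aligns on the index.
by_list = squares.groupby([df["Type"], df["Country"]]).cumsum()

df["sumOfSquares"] = by_list
print(df)
```

On this data both fixes give the same column. The list form is probably the safer default, since it does not depend on choosing a separator that is absent from the values.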

Answer 2

You can create a helper column (named new here), so that you can keep your own solution and specify the column names in groupby:

df['sumOfSquares'] = (df.assign(new = df['avg_country_tone'].pow(2))
                        .groupby(['themes', 'suppliers_country'])['new']
                        .cumsum())
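
As a rough check, the same pattern can be run on the sample rows from the question. This sketch assumes the column names Value, Type and Country in place of avg_country_tone, themes and suppliers_country; `by_name` and `by_series` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    "Value": [-1.975767, -0.540979, -2.359127, -2.815604, -0.929755],
    "Type": ["Weather", "Fruits", "Fruits", "Corona", "Weather"],
    "Country": ["Albania", "Brazil", "Brazil", "Albania", "Chine"],
})

# assign() returns a copy that includes the helper column, so the keys can
# be given as plain column names; df itself does not gain a "new" column.
by_name = (df.assign(new=df["Value"].pow(2))
             .groupby(["Type", "Country"])["new"]
             .cumsum())

# For comparison: grouping the Series of squares by the key Series directly.
by_series = df["Value"].pow(2).groupby([df["Type"], df["Country"]]).cumsum()

df["sumOfSquares"] = by_name
print(df)
```

Both routes should produce the same values; the helper-column form simply keeps the groupby call on the DataFrame, where column names are valid keys.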

Answer 3

If you want to combine the Type and Country columns to get a single total per group, use:

out = df.assign(X_sq=df['Value'].pow(2)).groupby(['Type', 'Country'])['X_sq'] \
        .sum().reset_index()
print(out)

# Output
      Type  Country      X_sq
0   Corona  Albania  7.927626
1   Fruits   Brazil  5.858138
2  Weather  Albania  3.903655
3  Weather    Chine  0.864444
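
Note that this aggregation returns one row per (Type, Country) group, whereas the question asks for a running value on every row. The sketch below, on the same sample rows, contrasts the two; the names `running`, `totals` and `last_running` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Value": [-1.975767, -0.540979, -2.359127, -2.815604, -0.929755],
    "Type": ["Weather", "Fruits", "Fruits", "Corona", "Weather"],
    "Country": ["Albania", "Brazil", "Brazil", "Albania", "Chine"],
})

squares = df["Value"].pow(2)
keys = [df["Type"], df["Country"]]

running = squares.groupby(keys).cumsum()  # one value per original row
totals = squares.groupby(keys).sum()      # one value per (Type, Country) group

# The last running value within each group equals that group's total.
last_running = running.groupby(keys).last()

print(running)
print(totals)
```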
