Find the sum of squares after grouping by two columns
I have a dataset that looks like this:
Value Type X_sq
-1.975767 Weather
-0.540979 Fruits
-2.359127 Fruits
-2.815604 Corona
-0.929755 Weather
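I want to go through each row and compute the running sum of squares over the rows above it (only where the Type matches), and put that value in the X_sq column.
For example, the first row has nothing above it, so the result is just (-1.975767 x -1.975767). In the second row there is no Fruits row above, so it is just -0.540979 x -0.540979. In the third row, however, scanning the preceding rows shows that Fruits has already appeared, so we should take the X_sq value of the most recent Fruits row and add the new square to it.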
Value Type X_sq
-1.975767 Weather -1.975767 * -1.975767 = x
-0.540979 Fruits -0.540979 * -0.540979 = y
-2.359127 Fruits y + ( -2.359127 x -2.359127)
-2.815604 Corona -2.815604 * -2.815604
-0.929755 Weather x + (-0.929755 * -0.929755)
I tried this, and it works well:
df['sumOfSquares'] = df['value'].pow(2).groupby(df['type']).cumsum()
However, I now want to group by two columns, so that both Country and Type match.
Value Type X_sq Country
-1.975767 Weather Albania
-0.540979 Fruits Brazil --should be grouped
-2.359127 Fruits Brazil --should be grouped
-2.815604 Corona Albania
-0.929755 Weather Chine
Here is what I tried (Type = themes):
df['sumOfSquares'] = df['value'].pow(2).groupby(['themes', 'suppliers_country']).cumsum()
However, even though 'themes' exists in the dataset, it gives me this error:
----> 1 df['sumOfSquares'] = df['avg_country_tone'].pow(2).groupby(['themes', 'suppliers_country']).cumsum()
File /usr/local/Cellar/ipython/8.0.1/libexec/lib/python3.10/site-packages/pandas/core/series.py:1929, in Series.groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna)
1925 axis = self._get_axis_number(axis)
1927 # error: Argument "squeeze" to "SeriesGroupBy" has incompatible type
1928 # "Union[bool, NoDefault]"; expected "bool"
-> 1929 return SeriesGroupBy(
1930 obj=self,
1931 keys=by,
1932 axis=axis,
1933 level=level,
1934 as_index=as_index,
1935 sort=sort,
1936 group_keys=group_keys,
1937 squeeze=squeeze, # type: ignore[arg-type]
1938 observed=observed,
1939 dropna=dropna,
1940 )
File /usr/local/Cellar/ipython/8.0.1/libexec/lib/python3.10/site-packages/pandas/core/groupby/groupby.py:882, in GroupBy.__init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated, dropna)
879 if grouper is None:
880 from pandas.core.groupby.grouper import get_grouper
--> 882 grouper, exclusions, obj = get_grouper(
883 obj,
884 keys,
885 axis=axis,
886 level=level,
887 sort=sort,
888 observed=observed,
889 mutated=self.mutated,
890 dropna=self.dropna,
891 )
893 self.obj = obj
894 self.axis = obj._get_axis_number(axis)
File /usr/local/Cellar/ipython/8.0.1/libexec/lib/python3.10/site-packages/pandas/core/groupby/grouper.py:882, in get_grouper(obj, key, axis, level, sort, observed, mutated, validate, dropna)
880 in_axis, level, gpr = False, gpr, None
881 else:
--> 882 raise KeyError(gpr)
883 elif isinstance(gpr, Grouper) and gpr.key is not None:
884 # Add key to exclusions
885 exclusions.add(gpr.key)
KeyError: 'themes'
even though themes is there. Themes = type
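The error occurs because you are calling groupby on a pandas Series, and a Series has no keys named 'themes' or 'suppliers_country'. To group a Series, you have to pass another Series (aligned with it) as the groupby argument, not a column name string.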
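A minimal sketch of that difference. The frame below reuses the question's column names, but the values are made up purely for illustration:

```python
import pandas as pd

# Illustrative data only; column names follow the question, values are assumed.
df = pd.DataFrame({
    "avg_country_tone": [-1.0, 2.0, 3.0, -4.0, 5.0],
    "themes": ["Weather", "Fruits", "Fruits", "Corona", "Weather"],
    "suppliers_country": ["Albania", "Brazil", "Brazil", "Albania", "China"],
})

# pow(2) returns a Series; it no longer knows about the other columns of df.
squares = df["avg_country_tone"].pow(2)

try:
    # String labels are looked up on the Series itself, which has no such keys.
    squares.groupby(["themes", "suppliers_country"]).cumsum()
    raised = False
except KeyError:
    raised = True

# Passing the key Series works because they share the index of `squares`.
result = squares.groupby([df["themes"], df["suppliers_country"]]).cumsum()
print(raised)
print(result.tolist())
```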
Try concatenating the string columns into a single Series and grouping by that:
df['sumOfSquares'] = df['Value'].pow(2).groupby(df.Type+"__"+df.Country).cumsum()
Alternatively, you can group by the two Series directly (I think this was your original idea):
df['sumOfSquares'] = df['Value'].pow(2).groupby([df.Type,df.Country]).cumsum()
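Both variants can be tried on the sample rows from the question. This is only a sketch: the frame is rebuilt by hand from the table in the question, and the names by_two and by_key are introduced here for illustration.

```python
import pandas as pd

# Sample rows copied from the question's table.
df = pd.DataFrame({
    "Value": [-1.975767, -0.540979, -2.359127, -2.815604, -0.929755],
    "Type": ["Weather", "Fruits", "Fruits", "Corona", "Weather"],
    "Country": ["Albania", "Brazil", "Brazil", "Albania", "Chine"],
})

squares = df["Value"].pow(2)

# Group by a list of two key Series.
by_two = squares.groupby([df["Type"], df["Country"]]).cumsum()

# Group by one concatenated string key.
by_key = squares.groupby(df["Type"] + "__" + df["Country"]).cumsum()

df["sumOfSquares"] = by_two
print(df)
```

On this data the two forms should agree. The list of Series is probably the safer choice in general, since a concatenated key can collide if the separator also occurs inside the values.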
You can create a new helper column, here called new, so that you can keep your original approach and pass the column names to groupby:
df['sumOfSquares'] = (df.assign(new = df['avg_country_tone'].pow(2))
.groupby(['themes', 'suppliers_country'])['new']
.cumsum())
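As a quick check, the helper-column form should give the same result as grouping the squared Series by key Series. The sketch below assumes the question's real column names and fills them with the sample values from the question; the variable names via_assign and via_series are hypothetical.

```python
import pandas as pd

# Assumed mapping: Value -> avg_country_tone, Type -> themes,
# Country -> suppliers_country, with the question's sample rows.
df = pd.DataFrame({
    "avg_country_tone": [-1.975767, -0.540979, -2.359127, -2.815604, -0.929755],
    "themes": ["Weather", "Fruits", "Fruits", "Corona", "Weather"],
    "suppliers_country": ["Albania", "Brazil", "Brazil", "Albania", "Chine"],
})

# DataFrame.groupby accepts column names, because the frame still has them.
via_assign = (df.assign(new=df["avg_country_tone"].pow(2))
                .groupby(["themes", "suppliers_country"])["new"]
                .cumsum())

# Series.groupby needs the key Series themselves.
via_series = (df["avg_country_tone"].pow(2)
                .groupby([df["themes"], df["suppliers_country"]])
                .cumsum())

df["sumOfSquares"] = via_assign
print(df)
```

Because cumsum is a transformation, the result keeps the original row index, so it can be assigned straight back to the frame.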
If you want to aggregate over the Type and Country columns to get one total per group, use:
out = df.assign(X_sq=df['Value'].pow(2)).groupby(['Type', 'Country'])['X_sq'] \
.sum().reset_index()
print(out)
# Output
Type Country X_sq
0 Corona Albania 7.927626
1 Fruits Brazil 5.858138
2 Weather Albania 3.903655
3 Weather Chine 0.864444