TypeError: incompatible index of inserted column with frame index when grouping 2 columns
I have a dataset that looks like this (plus some other columns):
Value Theme Country
-1.975767 Weather China
-0.540979 Fruits China
-2.359127 Fruits China
-2.815604 Corona Brazil
-0.929755 Weather UK
-0.929755 Weather UK
I want to find the standard deviation of the values after grouping by theme and country (as explained here: calculate standard deviation by grouping two columns):
df = pd.read_csv('./Brazil.csv')
df['std'] = df.groupby(['themes', 'country'])['value'].std()
However, I currently get this error:
File /usr/local/Cellar/ipython/8.0.1/libexec/lib/python3.10/site-packages/pandas/core/frame.py:3656, in DataFrame.__setitem__(self, key, value)
3653 self._setitem_array([key], value)
3654 else:
3655 # set column
-> 3656 self._set_item(key, value)
File /usr/local/Cellar/ipython/8.0.1/libexec/lib/python3.10/site-packages/pandas/core/frame.py:3833, in DataFrame._set_item(self, key, value)
3823 def _set_item(self, key, value) -> None:
3824 """
3825 Add series to DataFrame in specified column.
3826
(...)
3831 ensure homogeneity.
3832 """
-> 3833 value = self._sanitize_column(value)
3835 if (
3836 key in self.columns
3837 and value.ndim == 1
3838 and not is_extension_array_dtype(value)
3839 ):
3840 # broadcast across multiple columns if necessary
3841 if not self.columns.is_unique or isinstance(self.columns, MultiIndex):
File /usr/local/Cellar/ipython/8.0.1/libexec/lib/python3.10/site-packages/pandas/core/frame.py:4534, in DataFrame._sanitize_column(self, value)
4532 # We should never get here with DataFrame value
4533 if isinstance(value, Series):
-> 4534 return _reindex_for_setitem(value, self.index)
4536 if is_list_like(value):
4537 com.require_length_match(value, self.index)
File /usr/local/Cellar/ipython/8.0.1/libexec/lib/python3.10/site-packages/pandas/core/frame.py:10985, in _reindex_for_setitem(value, index)
10981 if not value.index.is_unique:
10982 # duplicate axis
10983 raise err
-> 10985 raise TypeError(
10986 "incompatible index of inserted column with frame index"
10987 ) from err
10988 return reindexed_value
TypeError: incompatible index of inserted column with frame index
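For context, the traceback ends in `_reindex_for_setitem`, which suggests the assignment fails while pandas tries to align the inserted Series with the frame's index. The sketch below is only an illustration under assumptions: it rebuilds the six sample rows from the question (the real `Brazil.csv` is not available) and uses the capitalised column names from the sample table. It shows that the grouped result has one entry per (Theme, Country) group and is indexed by those keys, while the frame has one row per observation and a plain integer index.

```python
import io
import pandas as pd

# Sample rows copied from the question; the real Brazil.csv is not available.
text_csv = """Value,Theme,Country
-1.975767,Weather,China
-0.540979,Fruits,China
-2.359127,Fruits,China
-2.815604,Corona,Brazil
-0.929755,Weather,UK
-0.929755,Weather,UK"""
df = pd.read_csv(io.StringIO(text_csv))

# One standard deviation per (Theme, Country) group.
group_std = df.groupby(['Theme', 'Country'])['Value'].std()

# The aggregated Series is indexed by the group keys, not by row number.
print(type(group_std.index).__name__, list(group_std.index.names))
print(type(df.index).__name__)

# The lengths differ too: one entry per group versus one row per observation.
print(len(group_std), len(df))
```

Because the two indexes have nothing in common, `df['std'] = ...` has no way to match the group values to rows, which is consistent with the "incompatible index" message.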
You can use the rolling method to calculate the cumulative standard deviation of each group.
Code
import pandas as pd
# Create a sample dataframe
import io
text_csv = '''Value,Theme,Country
-1.975767,Weather,China
-0.540979,Fruits,China
-2.359127,Fruits,China
-2.815604,Corona,Brazil
-0.929755,Weather,UK
-0.929755,Weather,UK'''
df = pd.read_csv(io.StringIO(text_csv))
# Calculate cumulative standard deviations
df_std = df.groupby(['Theme', 'Country'], as_index=False)['Value'].rolling(len(df), min_periods=1).std()
# Merge the original df with the cumulative std values
df_std = df.join(df_std.drop(['Theme', 'Country'], axis=1).rename(columns={'Value': 'CorrectedStd'}))
Output
      Value    Theme Country  CorrectedStd
0  -1.97577  Weather   China           nan
1 -0.540979   Fruits   China           nan
2  -2.35913   Fruits   China       1.28562
3   -2.8156   Corona  Brazil           nan
4 -0.929755  Weather      UK           nan
5 -0.929755  Weather      UK             0
Using DataFrame.expanding, and removing the group-key levels from the result's index with DataFrame.droplevel so that the new column aligns with the frame, should be a simpler solution:
df['std'] = (df.groupby(['Theme', 'Country'])['Value']
.expanding()
.std()
.droplevel([0,1]))
print(df)
Value Theme Country std
0 -1.975767 Weather China NaN
1 -0.540979 Fruits China NaN
2 -2.359127 Fruits China 1.285625
3 -2.815604 Corona Brazil NaN
4 -0.929755 Weather UK NaN
5 -0.929755 Weather UK 0.000000
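Both answers above give a cumulative (expanding) standard deviation within each group. If what is wanted is instead a single standard deviation per group, repeated on every row of that group, one possible alternative (not taken from the answers above) is `GroupBy.transform`, which returns a result aligned with the original index. A minimal sketch, assuming the sample rows from the question and a hypothetical column name `group_std`:

```python
import io
import pandas as pd

# Sample rows copied from the question.
text_csv = """Value,Theme,Country
-1.975767,Weather,China
-0.540979,Fruits,China
-2.359127,Fruits,China
-2.815604,Corona,Brazil
-0.929755,Weather,UK
-0.929755,Weather,UK"""
df = pd.read_csv(io.StringIO(text_csv))

# transform('std') computes the std once per group and broadcasts it back
# to every row of that group, so the result shares df's index.
# 'group_std' is a hypothetical column name chosen for this example.
df['group_std'] = df.groupby(['Theme', 'Country'])['Value'].transform('std')

print(df)
```

With this approach both Fruits/China rows carry the same value, and groups with a single row get NaN, because the sample standard deviation is undefined for one observation.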