复杂切片

Complex slicing

我正在尝试执行具有多个条件的切片但没有成功。 这是我的数据框的样子

我有很多国家,这些国家的名字被存储为索引。对于所有这些国家,我有 7 个不同的指标,用于两个不同的年份。

我的目标是 select 所有国家(及其指标),其中 'GDP per capita (constant 2005 US$')' 优于或等于先前定义的阈值 (gdp_min),或者被命名为 'China'、'India' 或 'Brazil'.

为此,我尝试了很多不同的方法,但仍然找不到方法。 这是我最后一次尝试,但出现错误。

gdp_set = final_set[final_set['Indicator Name'] == 'GDP per capita (constant 2005 US$)']['2013'] >= gdp_min | final_set.loc[['China', 'India', 'Brazil']]

--------------------------------------------------------------------------- TypeError Traceback (most recent call last) ~\anaconda3\lib\site-packages\pandas\core\ops\array_ops.py in na_logical_op(x, y, op) 301 # (xint or xbool) and (yint or bool) --> 302 result = op(x, y) 303 except TypeError:

~\anaconda3\lib\site-packages\pandas\core\roperator.py in ror_(left, right) 55 def ror_(left, right): ---> 56 return operator.or_(right, left) 57

TypeError: ufunc 'bitwise_or' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

During handling of the above exception, another exception occurred:

ValueError Traceback (most recent call last) ~\anaconda3\lib\site-packages\pandas\core\ops\array_ops.py in na_logical_op(x, y, op) 315 try: --> 316 result = libops.scalar_binop(x, y, op) 317 except (

~\anaconda3\lib\site-packages\pandas_libs\ops.pyx in pandas._libs.ops.scalar_binop()

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

The above exception was the direct cause of the following exception:

TypeError Traceback (most recent call last) ~\AppData\Local\Temp/ipykernel_16016/3232205269.py in ----> 1 gdp_set = final_set[final_set['Indicator Name'] == 'GDP per capita (constant 2005 US$)']['2013'] >= gdp_min | final_set.loc[['China', 'India', 'Brazil']]

~\anaconda3\lib\site-packages\pandas\core\generic.py in array_ufunc(self, ufunc, method, *inputs, **kwargs) 2030 self, ufunc: np.ufunc, method: str, *inputs: Any, **kwargs: Any
2031 ): -> 2032 return arraylike.array_ufunc(self, ufunc, method, *inputs, **kwargs) 2033 2034 # ideally we would define this to avoid the getattr checks, but

~\anaconda3\lib\site-packages\pandas\core\arraylike.py in array_ufunc(self, ufunc, method, *inputs, **kwargs) 251 252 # for binary ops, use our custom dunder methods --> 253 result = maybe_dispatch_ufunc_to_dunder_op(self, ufunc, method, *inputs, **kwargs) 254 if result is not NotImplemented: 255 return result

~\anaconda3\lib\site-packages\pandas_libs\ops_dispatch.pyx in pandas._libs.ops_dispatch.maybe_dispatch_ufunc_to_dunder_op()

~\anaconda3\lib\site-packages\pandas\core\ops\common.py in new_method(self, other) 67 other = item_from_zerodim(other) 68 ---> 69 return method(self, other) 70 71 return new_method

~\anaconda3\lib\site-packages\pandas\core\arraylike.py in ror(self, other) 72 @unpack_zerodim_and_defer("ror") 73 def ror(self, other): ---> 74 return self.logical_method(other, roperator.ror) 75 76 @unpack_zerodim_and_defer("xor")

~\anaconda3\lib\site-packages\pandas\core\frame.py in _arith_method(self, other, op) 6864 self, other = ops.align_method_FRAME(self, other, axis, flex=True, level=None)
6865 -> 6866 new_data = self._dispatch_frame_op(other, op, axis=axis) 6867 return self._construct_result(new_data)
6868

~\anaconda3\lib\site-packages\pandas\core\frame.py in _dispatch_frame_op(self, right, func, axis) 6891 # i.e. scalar, faster than checking np.ndim(right) == 0 6892
with np.errstate(all="ignore"): -> 6893 bm = self._mgr.apply(array_op, right=right) 6894 return type(self)(bm) 6895

~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in apply(self, f, align_keys, ignore_failures, **kwargs) 323 try: 324 if callable(f): --> 325 applied = b.apply(f, **kwargs) 326 else: 327 applied = getattr(b, f)(**kwargs)

~\anaconda3\lib\site-packages\pandas\core\internals\blocks.py in apply(self, func, **kwargs) 379 """ 380 with np.errstate(all="ignore"): --> 381 result = func(self.values, **kwargs) 382 383 return self._split_op_result(result)

~\anaconda3\lib\site-packages\pandas\core\ops\array_ops.py in logical_op(left, right, op) 390 filler = fill_int if is_self_int_dtype and is_other_int_dtype else fill_bool 391 --> 392 res_values = na_logical_op(lvalues, rvalues, op) 393 # error: Cannot call function of unknown type 394 res_values = filler(res_values) # type: ignore[operator]

~\anaconda3\lib\site-packages\pandas\core\ops\array_ops.py in na_logical_op(x, y, op) 323 ) as err: 324 typ = type(y).name --> 325 raise TypeError( 326 f"Cannot perform '{op.name}' with a dtyped [{x.dtype}] array " 327 f"and scalar of type [{typ}]"

TypeError: Cannot perform 'ror_' with a dtyped [float64] array and scalar of type [bool]

错误很长,但据我了解,问题来自第二个条件,它与 'OR' ( | ) 不兼容。

你们知道我该怎么做吗?我唯一能看到的是用当前索引名称创建一个新列,这样过滤可能与 OR 条件一起工作。

IIUC,使用:

m1 = final_set['Indicator Name'].eq('GDP per capita (constant 2005 US$)')
m2 = fina_set['2013'] >= gdp_min

countries = list(final_set.index[m1 & m2])+['China', 'India', 'Brazil']

gdp_set = final_set[final_set.index.isin(countries)]

更新:

这应该可以满足您的要求:

gdp_set = final_set.loc[list(
    {'China', 'India', 'Brazil'} | 
    set(final_set[((final_set['Indicator Name'] == 'GDP per capita (constant 2005 US$)') & 
        (final_set['2013'] >= gdp_min))].index)
    )]

解释:

  • 创建一个 set,其中包含 'China', 'India', 'Brazil' 与任何索引值(即 Country Name 值)的 set 的联合,其中值为 [=18] 的行=] 匹配目标并且 2013 列的值至少与 gdp_min.
  • 一样大
  • 在此 set 中的国家/地区过滤 final_set 转换为 list 并将生成的数据帧放入 gdp_set.

完整测试代码:

import pandas as pd
final_set = pd.DataFrame({
'Country Name':['Andorra']*6 + ['Argentina']*4 + ['China']*2 + ['India']*2 + ['Brazil']*2,
'Indicator Name':[f'Indicator {i}' for i in range(1, 6)] + ['GDP per capita (constant 2005 US$)'] + [f'Indicator {i}' for i in range(1, 4)] + ['GDP per capita (constant 2005 US$)'] + [f'Indicator {i}'if i % 2 else 'GDP per capita (constant 2005 US$)' for i in range(1,7)],
'2002': [10000.0/2]*6 + [15000.0/2]*4 + [8000.0/2]*6,
'2013': [10000.0]*6 + [15000.0]*4 + [8000.0]*6,
'Currency Unit':['Euro']*6 + ['Argentine peso']*4 + ['RMB']*2 + ['INR']*2 + ['Brazilian real']*2,
'Region':['Europe & Central Asia']*6 + ['Latin America & Caribbean']*4 + ['Asia']*2 + ['South Asia']*2 + ['Latin America & Caribbean']*2,
'GDP per capita (constant 2005 US$)': [10000.0]*6 + [15000.0]*4 + [8000.0]*6
}).set_index('Country Name')
print(final_set)
gdp_min = 14000.0

gdp_set = final_set.loc[list(
    {'China', 'India', 'Brazil'} | 
    set(final_set[((final_set['Indicator Name'] == 'GDP per capita (constant 2005 US$)') & 
        (final_set['2013'] >= gdp_min))].index)
    )]
print(gdp_set)

输入:

                                  Indicator Name    2002     2013   Currency Unit                     Region  GDP per capita (constant 2005 US$)
Country Name
Andorra                              Indicator 1  5000.0  10000.0            Euro      Europe & Central Asia                             10000.0
Andorra                              Indicator 2  5000.0  10000.0            Euro      Europe & Central Asia                             10000.0
Andorra                              Indicator 3  5000.0  10000.0            Euro      Europe & Central Asia                             10000.0
Andorra                              Indicator 4  5000.0  10000.0            Euro      Europe & Central Asia                             10000.0
Andorra                              Indicator 5  5000.0  10000.0            Euro      Europe & Central Asia                             10000.0
Andorra       GDP per capita (constant 2005 US$)  5000.0  10000.0            Euro      Europe & Central Asia                             10000.0
Argentina                            Indicator 1  7500.0  15000.0  Argentine peso  Latin America & Caribbean                             15000.0
Argentina                            Indicator 2  7500.0  15000.0  Argentine peso  Latin America & Caribbean                             15000.0
Argentina                            Indicator 3  7500.0  15000.0  Argentine peso  Latin America & Caribbean                             15000.0
Argentina     GDP per capita (constant 2005 US$)  7500.0  15000.0  Argentine peso  Latin America & Caribbean                             15000.0
China                                Indicator 1  4000.0   8000.0             RMB                       Asia                              8000.0
China         GDP per capita (constant 2005 US$)  4000.0   8000.0             RMB                       Asia                              8000.0
India                                Indicator 3  4000.0   8000.0             INR                 South Asia                              8000.0
India         GDP per capita (constant 2005 US$)  4000.0   8000.0             INR                 South Asia                              8000.0
Brazil                               Indicator 5  4000.0   8000.0  Brazilian real  Latin America & Caribbean                              8000.0
Brazil        GDP per capita (constant 2005 US$)  4000.0   8000.0  Brazilian real  Latin America & Caribbean                              8000.0

输出:

                                  Indicator Name    2002     2013   Currency Unit                     Region  GDP per capita (constant 2005 US$)
Country Name
Brazil                               Indicator 5  4000.0   8000.0  Brazilian real  Latin America & Caribbean                              8000.0
Brazil        GDP per capita (constant 2005 US$)  4000.0   8000.0  Brazilian real  Latin America & Caribbean                              8000.0
China                                Indicator 1  4000.0   8000.0             RMB                       Asia                              8000.0
China         GDP per capita (constant 2005 US$)  4000.0   8000.0             RMB                       Asia                              8000.0
India                                Indicator 3  4000.0   8000.0             INR                 South Asia                              8000.0
India         GDP per capita (constant 2005 US$)  4000.0   8000.0             INR                 South Asia                              8000.0
Argentina                            Indicator 1  7500.0  15000.0  Argentine peso  Latin America & Caribbean                             15000.0
Argentina                            Indicator 2  7500.0  15000.0  Argentine peso  Latin America & Caribbean                             15000.0
Argentina                            Indicator 3  7500.0  15000.0  Argentine peso  Latin America & Caribbean                             15000.0
Argentina     GDP per capita (constant 2005 US$)  7500.0  15000.0  Argentine peso  Latin America & Caribbean                             15000.0

使用查询怎么样?

# min GDP (I used an example number
gdp_min = 3000.0

# Country name set.
countries = {"China", "India", "Brazil"}

# Create string expression to evaluate on DataFrame.
# Note: Backticks should be used for non-standard pandas field names 
# (including names that begin with a numerical value.
expression = f"(`Indicator Name` == 'GDP per capita (constant 2005 US$)' & `2013` >= {gdp_min})"
# Add each country name as 'or' clause for second part of expression.
expression += "or (" + " or ".join([f"`Country Name` == '{n}'" for n in countries]) + ")"

# Collect resulting DataFrame to new variable.
gdp_set = final_set.query(expression)