pandas lookup with long and nested conditions

Question

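I want to perform a lookup in a dataframe with pandas. It would be built from a series of nested if/else statements, similar to what is outlined in "Pandas dataframe add a field based on multiple if statements", except that I want to use up to 13 different variables. That seems to get messy very quickly. Is there some notation or other nice feature that lets me specify such long, nested conditions in pandas? So far np.where() http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.where.html looks like my best option.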

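Is there a shortcut if I only match on equality for all conditions?

Am I forced to write out every conditional filter? Can I have a single expression that selects the resulting (single) lookup value?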

Edit: ideally I do not want to match

df.loc[df['column_name'] == some_value]

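for each value, i.e. 13 * the number of category levels (say 7) would be a lot of different values; in particular when combinations of conditions such as df.loc[df['first'] == some_value][df['second'] == otherValue1] occur, i.e. when they are all chained.

Edit

A minimal example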

import pandas as pd

df = pd.DataFrame({'ageGroup': [1, 2, 2, 1],
                   'first2DigitsOfPostcode': ['12', '23', '12', '12'],
                   'valueOfProduct': ['low', 'medium', 'high', 'low'],
                   'lookup_join_value': ['foo', 'bar', 'foo', 'baz']})

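defines the lookup table. It is generated by an SQL query that groups by all columns and aggregates the values, so (because of the Cartesian product) all value combinations should be represented in the lookup table.

A new record might look like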

new_values = pd.DataFrame({'ageGroup': [1],
                           'first2DigitsOfPostcode': ['12'],
                           'valueOfProduct': ['low']})

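Assuming all conditions require an equality match (if that makes things easier), how can I automate the lookup over all conditions?

I found

pd.lookup ("Vectorized look-up of values in Pandas dataframe"), which seems to work for a single column / condition.
Maybe a merge is a solution? But that did not really produce the desired lookup result.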

Edit 2

The second answer looks quite interesting. But

mask = df.drop('lookup_join_value', axis=1).isin(new_values)
print(mask)
print(df[mask])
print(df[mask]['lookup_join_value'])

unfortunately returns only NaN for the lookup value.

Answer 1

df.query is an option if you can write the query as an expression using the column names.

So you could do something like this:

query_string = 'some long (but valid) boolean query'

Example from the pandas docs:

>>> from numpy.random import randn
>>> from pandas import DataFrame
>>> df = DataFrame(randn(10, 2), columns=list('ab'))
>>> df.query('a > b')
# similar to this
>>> df[df.a > df.b]

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html
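
For the equality-only case in the question, the query string does not have to be written by hand. The following is a minimal sketch, assuming the conditions are available as a plain dict of column name to value and that the column names are valid Python identifiers. The helper name build_equality_query is hypothetical and not part of pandas.

```python
import pandas as pd

# Lookup table from the question
df = pd.DataFrame({'ageGroup': [1, 2, 2, 1],
                   'first2DigitsOfPostcode': ['12', '23', '12', '12'],
                   'valueOfProduct': ['low', 'medium', 'high', 'low'],
                   'lookup_join_value': ['foo', 'bar', 'foo', 'baz']})

# Conditions of one new record, written as a plain dict (assumption)
record = {'ageGroup': 1,
          'first2DigitsOfPostcode': '12',
          'valueOfProduct': 'low'}


def build_equality_query(conditions):
    """Hypothetical helper: join `column == value` terms with `and`."""
    # !r applies repr(), which quotes strings and leaves integers bare
    return ' and '.join(f'{col} == {val!r}' for col, val in conditions.items())


query_string = build_equality_query(record)
result = df.query(query_string)['lookup_join_value'].tolist()
print(query_string)
print(result)
```

With 13 variables only the dict grows; the expression itself stays a single line. Relying on repr() is only safe for simple values such as plain integers and strings.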

Answer 2

I think df.isin() fits what you are asking for.

Using your example df, plus these two:

exists = pd.DataFrame({'ageGroup': [1],
                 'first2DigitsOfPostcode': ['12'],
                 'valueOfProduct' : 'low'})
new = pd.DataFrame({'ageGroup': [1],
                 'first2DigitsOfPostcode': ['12'],
                 'valueOfProduct' : 'high'})

you can then check which values match, whether all of them or only some:

df.isin(exists.values[0])
Out[46]: ageGroup first2DigitsOfPostcode valueOfProduct 
0    True                   True           True 
1    False                  False          False
2    False                  True           False
3    True                   True           True

df.isin(new.values[0])
Out[46]: ageGroup first2DigitsOfPostcode valueOfProduct 
0    True                   True           False
1    False                  False          False
2    False                  True           True
3    True                   True           False

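If your "query" is a list rather than a dataframe, you don't need the ".values[0]" bit. The problem with passing the dataframe built from the dictionary directly is that isin also tries to match on the index.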
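
The index matching can be made visible with a short sketch. It assumes the df and new_values frames from the question and also shows why the attempt under "Edit 2" ends up with NaN in the lookup column. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'ageGroup': [1, 2, 2, 1],
                   'first2DigitsOfPostcode': ['12', '23', '12', '12'],
                   'valueOfProduct': ['low', 'medium', 'high', 'low'],
                   'lookup_join_value': ['foo', 'bar', 'foo', 'baz']})
new_values = pd.DataFrame({'ageGroup': [1],
                           'first2DigitsOfPostcode': ['12'],
                           'valueOfProduct': ['low']})

keys = df.drop('lookup_join_value', axis=1)

# DataFrame argument: index and column labels must match, and new_values
# only has index 0, so no other row can ever be flagged
mask_frame = keys.isin(new_values)

# List argument: every cell is tested for membership, whatever its index
mask_list = keys.isin(new_values.iloc[0].tolist())

rows_frame = mask_frame.all(axis=1).tolist()
rows_list = mask_list.all(axis=1).tolist()
print(rows_frame)
print(rows_list)

# The mask has no 'lookup_join_value' column, so df[mask_frame] blanks
# that column out in every row
masked = df[mask_frame]
print(masked['lookup_join_value'].isna().all())
```

Note that list membership ignores which column a value came from, so a value from one column can accidentally match in another.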

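It is not clear to me from your question what you want returned, but you can subset based on whether all (or some) of the columns in a row match: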

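# Both snippets assume df holds only the three key columns, as in the output above

# Returns matching rows (axis=1: every column in the row must match)
df[df.isin(exists.values[0]).all(axis=1)]

# Returns rows where the first two columns match and the third does not
matches = df.isin(new.values[0]).values
df[[item == [True, True, False] for item in matches.tolist()]]

...there is probably a smarter way to write that last one.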
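
One possible tidier variant, sketched under the assumption that the conditions are given as a dict of column name to value: compare each column only against its own value and reduce over the row. The helper name match_rows is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'ageGroup': [1, 2, 2, 1],
                   'first2DigitsOfPostcode': ['12', '23', '12', '12'],
                   'valueOfProduct': ['low', 'medium', 'high', 'low'],
                   'lookup_join_value': ['foo', 'bar', 'foo', 'baz']})


def match_rows(frame, record, columns=None):
    """Hypothetical helper: mask of rows equal to `record` on `columns`."""
    columns = list(record) if columns is None else list(columns)
    wanted = pd.Series(record)[columns]
    # eq() aligns the Series index with the column labels, so each value
    # is compared only against its own column
    return frame[columns].eq(wanted).all(axis=1)


exists = {'ageGroup': 1, 'first2DigitsOfPostcode': '12', 'valueOfProduct': 'low'}
new = {'ageGroup': 1, 'first2DigitsOfPostcode': '12', 'valueOfProduct': 'high'}

# Rows matching on all three key columns
full = df[match_rows(df, exists)]
# Rows matching on the first two key columns only
partial = df[match_rows(df, new, ['ageGroup', 'first2DigitsOfPostcode'])]
print(full)
print(partial)
```

Because only the named key columns are compared, this also works when df still contains lookup_join_value.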

Answer 3

Now that I understand your requirements better, a dataframe merge is probably the better choice:

IN: df.merge(new_values, how='inner')
OUT:   ageGroup first2DigitsOfPostcode lookup_join_value valueOfProduct
0         1                     12               foo            low
1         1                     12               baz            low

Definitely shorter than the other answer I gave! I'll leave the old one up in case it inspires someone else.
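
The same idea extends to several new records at once. This is a minimal sketch, assuming a left join is wanted so that records without a match are kept; new_records and key_columns are illustrative names, and the third record is made up to show the unmatched case.

```python
import pandas as pd

df = pd.DataFrame({'ageGroup': [1, 2, 2, 1],
                   'first2DigitsOfPostcode': ['12', '23', '12', '12'],
                   'valueOfProduct': ['low', 'medium', 'high', 'low'],
                   'lookup_join_value': ['foo', 'bar', 'foo', 'baz']})

# Three new records; the last combination does not exist in the lookup table
new_records = pd.DataFrame({'ageGroup': [1, 2, 1],
                            'first2DigitsOfPostcode': ['12', '12', '99'],
                            'valueOfProduct': ['low', 'high', 'high']})

key_columns = ['ageGroup', 'first2DigitsOfPostcode', 'valueOfProduct']

# how='left' keeps every new record; unmatched ones get NaN as lookup value
looked_up = new_records.merge(df, on=key_columns, how='left')
print(looked_up)
```

In the example data the key (1, '12', 'low') maps to both foo and baz, so that record appears twice in the result. If each key is supposed to have exactly one lookup value, passing validate='many_to_one' to merge should raise an error when the lookup table contains duplicate keys.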

