In a pandas DataFrame, find whether a True value in column A is its first occurrence since the last True in column B

Question

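I am looking for the most efficient way to find whether a True value in column A is the first occurrence since the last True value in column B.

In the examples below, the expected output is column C.

Example 1: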

import pandas as pd

df = pd.DataFrame({
    'A': [False, False, True, False, True, False, True, False, True],
    'B': [True, False, False, False, False, True, False, False, False],
    'C': [False, False, True, False, False, False, True, False, False]
})

	A	B	C
0	False	True	False
1	False	False	False
2	True	False	True
3	False	False	False
4	True	False	False
5	False	True	False
6	True	False	True
7	False	False	False
8	True	False	False

Example 2:

df = pd.DataFrame({
    'A': [True, False, False, True, False, True, False, True, False],
    'B': [False, True, False, False, False, False, True, False, False],
    'C': [False, False, False, True, False, False, False, True, False]
})

	A	B	C
0	True	False	False
1	False	True	False
2	False	False	False
3	True	False	True
4	False	False	False
5	True	False	False
6	False	True	False
7	True	False	True
8	False	False	False

Example 3:

Here you can find a .csv file with a bigger example

Answer 1

Here is one way to do it, though perhaps not the best.

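# Module-level state: True once a True has been seen in B and not yet matched by a True in A.
# Reset it to False before running the apply again.
is_occurred = False

def is_first_occurrence_since(column_to_check, column_occurence):
    global is_occurred
    # First True in A after a True in B: report it and clear the flag.
    if is_occurred and column_to_check:
        is_occurred = False
        return True
    # A True in B sets the flag for the next True in A.
    elif not is_occurred and column_occurence:
        is_occurred = True
    return False

# Rows are visited in order; the result is a boolean Series matching column C.
df.apply(lambda row: is_first_occurrence_since(row['A'], row['B']), axis=1)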
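The same row-by-row logic can be sketched without a global variable, which makes it safe to run more than once. This is only an illustration: the function name `first_a_since_b` and the local flag `armed` are hypothetical and not part of the answer above.

```python
import pandas as pd

def first_a_since_b(a, b):
    """Flag each True in `a` that is the first one since the last True in `b`."""
    armed = False  # True once a True in `b` has been seen and not yet matched
    flags = []
    for a_val, b_val in zip(a, b):
        if armed and a_val:
            # First True in `a` after a True in `b`.
            armed = False
            flags.append(True)
        else:
            if not armed and b_val:
                armed = True
            flags.append(False)
    return flags

df = pd.DataFrame({
    'A': [False, False, True, False, True, False, True, False, True],
    'B': [True, False, False, False, False, True, False, False, False],
})
df['C'] = first_a_since_b(df['A'], df['B'])
print(df['C'].tolist())
# [False, False, True, False, False, False, True, False, False]
```

Like the original, this is a plain Python loop, so it is unlikely to be the fastest option on large frames.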

Answer 2

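You can group the DataFrame the way you describe with a groupby on the cumulative sum of column "B". You can then use idxmax to get the index of the first occurrence of True in column "A" within each group. Once you have those indices, you can create the new column "C".

Using idxmax is a small trick: we are not actually interested in the maximum itself, since column "A" only contains True and False. idxmax returns the index of the first occurrence of the maximum value (here, the first True in each group), which is exactly what we want. If a group contains no True at all, idxmax returns the index of its first False, which is why the code also checks max.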
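To see the two building blocks in isolation, here is a minimal sketch using the values from example 1. The variable names `b`, `a_group`, and `labels` are illustrative only.

```python
import pandas as pd

# Column "B" from example 1: each True starts a new group.
b = pd.Series([True, False, False, False, False, True, False, False, False])
labels = b.cumsum()
print(labels.tolist())
# [1, 1, 1, 1, 1, 2, 2, 2, 2]

# Rows 0-4 of column "A" (group 1): idxmax gives the label of the first True.
a_group = pd.Series([False, False, True, False, True])
print(a_group.idxmax())
# 2
```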

df = pd.DataFrame({
    'A': [False, False, True, False, True, False, True, False, True],
    'B': [True, False, False, False, False, True, False, False, False],
})

# get a dataframe of the position of the max as well as the max value
indices_df = df["A"].groupby(df["B"].cumsum()).agg(["idxmax", "max"])

# mask to filter out the 0th group
skip_0th = (indices_df.index > 0)

# mask to filter out groups who do not have True as a value
groups_with_true = (indices_df["max"] == True)

# combine masks and retrieve the appropriate index
indices = indices_df.loc[skip_0th & groups_with_true, "idxmax"]

df["C"] = False
df.loc[indices, "C"] = True

print(df)
       A      B      C
0  False   True  False
1  False  False  False
2   True  False   True
3  False  False  False
4   True  False  False
5  False   True  False
6   True  False   True
7  False  False  False
8   True  False  False

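Updated for example 2.

We can handle the case in example 2, where column "A" contains a True before the first True in column "B", by slicing the indices Series to exclude any entry whose label is 0 (that is, slicing by label from 1 to the end). This works because the groupby assigns integer labels derived from .cumsum. In example 1 the smallest label is 1 (because the first value in column "B" is True), whereas in example 2 the smallest label is 0. Since we do not want group 0 to affect the result, we can simply slice it off indices.

When we assign "C" after slicing the indices Series, every value before the first True in column "B" is correctly ignored.

Enough words, let's look at some code.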

Example 1

print(indices)
1    2
2    6

# Slicing here doesn't change anything, since indices does not have
#  a value corresponding to label position 0
indices = indices.loc[1:]
print(indices)
1    2
2    6

Example 2

print(indices)
0    0
1    3
2    7

# we don't want to include the value from label position 0 in `indices`
#  so we can use slicing to remove it

indices = indices.loc[1:]
print(indices)
1    3
2    7
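Putting the pieces together, the steps described in this answer could be wrapped in a single function, as in the sketch below. The name `mark_first_a_since_b` is hypothetical. The cast to int is an assumption made here so that max is 0 or 1, and the sketch assumes the DataFrame has a unique index.

```python
import pandas as pd

def mark_first_a_since_b(df):
    """Flag the first True in "A" after each True in "B" (illustrative helper)."""
    groups = df["B"].cumsum()
    # Cast to int so that max is 0 or 1 (an assumption made for this sketch).
    stats = df["A"].astype(int).groupby(groups).agg(["idxmax", "max"])
    # Label-based slice: drop group 0, the rows before the first True in "B".
    stats = stats.loc[1:]
    # Keep only the groups that contain at least one True in "A".
    first_labels = stats.loc[stats["max"] == 1, "idxmax"]
    result = pd.Series(False, index=df.index)
    result.loc[first_labels.tolist()] = True
    return result

example_2 = pd.DataFrame({
    'A': [True, False, False, True, False, True, False, True, False],
    'B': [False, True, False, False, False, False, True, False, False],
})
example_2["C"] = mark_first_a_since_b(example_2)
print(example_2["C"].tolist())
# [False, False, False, True, False, False, False, True, False]
```

The True at row 0 of column "A" falls in group 0 and is dropped by the slice, so it is not flagged.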
