Find adjacent regions in Pandas Series

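I want to select all regions of values greater than 1, provided they are connected to an element with a value greater than 5. Two values are not connected if they are separated by a 0.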

For the following dataset,

pd.Series(data = [0,2,0,2,3,6,3,0])

the output should be

pd.Series(data = [False,False,False,True,True,True,True,False])

I solved it myself in an ugly way, see below. Still, I would like to know whether there is a better way.

import pandas as pd

test_series = pd.Series(data = [0,2,0,2,3,6,3,0])

# Make a boolean DataFrame.
# Column 0 marks values greater than 1 but not greater than 5; column 1 marks values above 5.
# The boolean series we are looking for is column 1 after it has been modified by the loop below.
bool_df = pd.DataFrame(data= [(test_series>1), (test_series>5)]).T
bool_df.loc[:,0] = (bool_df.loc[:,0])&(~bool_df.loc[:,1])



k=0 # k is a positional counter over the rows where column 0 is True (1 < value <= 5)
while k < len(bool_df.loc[bool_df.loc[:,0],0]):
    i = bool_df.loc[bool_df.loc[:,0],0].index[k] # the bool_df index corresponding to k
    if i > 0: # avoid negative indices
        if bool_df.loc[i-1,1]: # Check if the previous entry had a value above 5
            bool_df.loc[i,1] = True
            k+=1
        else: 
            j=i
            while bool_df.loc[j,0]: # find the end of the streak of 1 < value <= 5
                j+=1
            bool_df.loc[i:j,1] = bool_df.loc[j,1] # set the whole streak to the value found at its end (True if > 5, False if <= 1)
            k = sum(bool_df.loc[bool_df.loc[:,0],0].index<j) 
    else:
        k+=1

Well, it looks like I found a one-liner using the pandas groupby function:

import pandas as pd

ts = pd.Series(data = [0,2,0,2,3,6,3,0])

# The flag column allows me to identify sequences. Here 0s are included 
# in the "sequence", but as you can see in next line doesn't matter 
df = pd.concat([ts, (ts==0).cumsum()], axis = 1, keys = ['val', 'flag'])

#   val  flag
#0    0     1
#1    2     1
#2    0     2
#3    2     2
#4    3     2
#5    6     2
#6    3     2
#7    0     3

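# For each group (having the same flag), I do a boolean AND of two conditions:
# the group contains a value above 5  AND  the element itself is above 1 (which excludes zeros).
# Multiplying x by the boolean (x>5).any() zeroes out every group without a value above 5.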
df.groupby('flag').transform(lambda x: (x>5).any() * x > 1)

#Out[32]: 
#     val
#0  False
#1  False
#2  False
#3   True
#4   True
#5   True
#6   True
#7  False

In case you are wondering, you can collapse everything into one line:

ts.groupby((ts==0).cumsum()).transform(lambda x: (x>5).any() * x > 1).astype(bool)
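
For what it's worth, the same idea can probably be written without a lambda, which avoids relying on operator precedence in `(x>5).any() * x > 1`. The following is only a sketch: the function name `connected_mask` and the parameters `low` and `high` are made up for illustration and are not part of the original answer. It broadcasts each group's maximum back to its members with `transform("max")` and then combines the two conditions explicitly.

```python
import pandas as pd

def connected_mask(ts, low=1, high=5):
    """Hypothetical helper: True where a value is > low and its
    zero-delimited group contains at least one value > high."""
    # Each zero starts a new group, so runs separated by a 0 get different labels.
    groups = (ts == 0).cumsum()
    # Broadcast every group's maximum back onto the members of that group.
    group_max = ts.groupby(groups).transform("max")
    # The element itself must exceed `low`; this also excludes the zeros.
    return (ts > low) & (group_max > high)

ts = pd.Series(data=[0, 2, 0, 2, 3, 6, 3, 0])
mask = connected_mask(ts)
print(mask.tolist())
```

Because the result comes from comparing two Series, it is already boolean, so no `astype(bool)` is needed here.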

I am leaving my first approach here for reference:

import itertools
import pandas as pd

def flatten(l):
    # Util function to flatten a list of lists
    # e.g. [[1], [2,3]] -> [1,2,3]
    return list(itertools.chain(*l))

ts = pd.Series(data = [0,2,0,2,3,6,3,0])
#Get data as list
values = ts.values.tolist()

# From what I understand, the 0s delimit subsequences (so numbers are not
# connected if separated by a 0)

# Get location of zeros
gap_loc = [idx for (idx, el) in enumerate(values) if el==0]  
# Re-create pandas series
gap_series = pd.Series(False, index = gap_loc)

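# Get values and locations of the subsequences (i.e. separated by zeros).
# Note: only stretches between two consecutive zeros are covered, so this
# assumes the series starts and ends with a 0.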
valid_loc = [range(prev_gap+1,gap) for prev_gap, gap in zip(gap_loc[:-1],gap_loc[1:])]
list_seq = [values[prev_gap+1:gap] for prev_gap, gap in zip(gap_loc[:-1],gap_loc[1:])]
# list_seq = [[2], [2, 3, 6, 3]]

# Verify your condition
check_condition = [[el>1 and any(map(lambda x: x>5, sublist)) for el in sublist] 
                     for sublist in list_seq]
# Put results back into a pandas Series
valid_series = pd.Series(flatten(check_condition), index = flatten(valid_loc))

# Put everything together:
result = pd.concat([gap_series, valid_series], axis = 0).sort_index()

#result
#Out[101]: 
#0    False
#1    False
#2    False
#3     True
#4     True
#5     True
#6     True
#7    False
#dtype: bool
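
One caveat with the list-based approach above: it pairs consecutive zero positions, so anything before the first zero or after the last zero does not end up in `result`. If that matters, a plain-Python variant based on `itertools.groupby` might be an option. This is only a sketch under that assumption; the name `connected_mask_runs` and the parameters `low` and `high` are hypothetical.

```python
from itertools import groupby

import pandas as pd

def connected_mask_runs(ts, low=1, high=5):
    """Hypothetical helper: same rule as above, computed run by run."""
    out = []
    # groupby splits the values into alternating runs of zeros and non-zeros.
    for is_zero, run in groupby(ts.tolist(), key=lambda v: v == 0):
        run = list(run)
        # A non-zero run qualifies only if it contains a value above `high`.
        keep = (not is_zero) and any(v > high for v in run)
        # Within a qualifying run, keep the elements above `low`.
        out.extend(keep and v > low for v in run)
    return pd.Series(out, index=ts.index)

# A series that neither starts nor ends with a zero.
no_edge_zeros = pd.Series([2, 3, 6, 0, 2])
runs_mask = connected_mask_runs(no_edge_zeros)
print(runs_mask.tolist())
```

On the original example data this should give the same mask as the groupby one-liner.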