Find adjacent regions in Pandas Series
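I want to select all regions whose values are greater than 1, provided they are connected to an element whose value is greater than 5.
Two values are not connected if they are separated by a 0.
For the following dataset,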
pd.Series(data = [0,2,0,2,3,6,3,0])
the output should be
pd.Series(data = [False,False,False,True,True,True,True,False])
I solved it myself in an ugly way, see below. Still, I would like to know whether there is a better way.
import pandas as pd

test_series = pd.Series(data = [0,2,0,2,3,6,3,0])
# Make a boolean DataFrame.
# Column 0 marks values above 1 but not above 5, and column 1 marks values above 5.
# The boolean series we are looking for is column 1 after it has been modified in the following way.
bool_df = pd.DataFrame(data= [(test_series>1), (test_series>5)]).T
bool_df.loc[:,0] = (bool_df.loc[:,0])&(~bool_df.loc[:,1])
k=0 # k is a position within the bool_df rows where column 0 is True
while k < len(bool_df.loc[bool_df.loc[:,0],0]):
    i = bool_df.loc[bool_df.loc[:,0],0].index[k] # the bool_df index corresponding to k
    if i > 0: # avoid negative indices
        if bool_df.loc[i-1,1]: # check if the previous entry had a value above 5
            bool_df.loc[i,1] = True
            k+=1
        else:
            j=i
            while bool_df.loc[j,0]: # find the end of the streak of 1 < values <= 5
                j+=1
            bool_df.loc[i:j,1] = bool_df.loc[j,1] # set the whole streak to the value found at its end, either >5 or <=1
            k = sum(bool_df.loc[bool_df.loc[:,0],0].index<j)
    else:
        k+=1
Well, it looks like I found a one-liner using the pandas groupby function:
import pandas as pd
ts = pd.Series(data = [0,2,0,2,3,6,3,0])
# The flag column lets me identify sequences. The 0s are included
# in each "sequence", but as the next step shows, that doesn't matter.
df = pd.concat([ts, (ts==0).cumsum()], axis = 1, keys = ['val', 'flag'])
#   val  flag
#0    0     1
#1    2     1
#2    0     2
#3    2     2
#4    3     2
#5    6     2
#6    3     2
#7    0     3
# For each group (rows sharing the same flag), I do a boolean AND of two conditions:
# the group has any value above 5 AND the value itself is above 1 (which excludes zeros)
df.groupby('flag').transform(lambda x: (x>5).any() * x > 1)
#Out[32]:
#     val
#0  False
#1  False
#2  False
#3   True
#4   True
#5   True
#6   True
#7  False
In case you are wondering, you can collapse everything into one line:
ts.groupby((ts==0).cumsum()).transform(lambda x: (x>5).any() * x > 1).astype(bool)
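If the lambda feels a bit dense, the same grouping idea can be spelled out step by step. This is only a sketch: the function name `connected_regions` and the parameters `low`, `high` and `sep` are made up for illustration, and it swaps the lambda for a group-wise maximum, which should give the same mask on the example data.

```python
import pandas as pd

def connected_regions(ts, low=1, high=5, sep=0):
    """Hypothetical helper: flag values > low whose sep-delimited run contains a value > high."""
    # Every separator starts a new run label; the separator itself
    # belongs to the run it opens, which is harmless because it fails `> low`.
    run_id = (ts == sep).cumsum()
    # Broadcast each run's maximum back to all of its rows.
    run_max = ts.groupby(run_id).transform("max")
    # Keep values above `low` that sit in a run whose maximum exceeds `high`.
    return (ts > low) & (run_max > high)

ts = pd.Series([0, 2, 0, 2, 3, 6, 3, 0])
mask = connected_regions(ts)
print(mask.tolist())
# [False, False, False, True, True, True, True, False]
```

Because the run labels come from a cumulative sum over the separators, this version also copes with a series that does not begin or end with 0.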
I'm leaving my first approach here for reference:
import itertools
import pandas as pd

def flatten(l):
    # Util function to flatten a list of lists
    # e.g. [[1], [2,3]] -> [1,2,3]
    return list(itertools.chain(*l))

ts = pd.Series(data = [0,2,0,2,3,6,3,0])
# Get data as list
values = ts.values.tolist()
# From what I understand, the 0s delimit subsequences (so numbers are not
# connected if separated by a 0).
# Note: the steps below assume the series starts and ends with a 0.
# Get location of zeros
gap_loc = [idx for (idx, el) in enumerate(values) if el==0]
# Re-create pandas series
gap_series = pd.Series(False, index = gap_loc)
# Get values and locations of the subsequences (i.e. separated by zeros)
valid_loc = [range(prev_gap+1,gap) for prev_gap, gap in zip(gap_loc[:-1],gap_loc[1:])]
list_seq = [values[prev_gap+1:gap] for prev_gap, gap in zip(gap_loc[:-1],gap_loc[1:])]
# list_seq = [[2], [2, 3, 6, 3]]
# Verify your condition
check_condition = [[el>1 and any(map(lambda x: x>5, sublist)) for el in sublist]
                   for sublist in list_seq]
# Put results back into a pandas Series
valid_series = pd.Series(flatten(check_condition), index = flatten(valid_loc))
# Put everything together:
result = pd.concat([gap_series, valid_series], axis = 0).sort_index()
#result
#Out[101]:
#0    False
#1    False
#2    False
#3     True
#4     True
#5     True
#6     True
#7    False
#dtype: bool
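The list-based version pairs up consecutive zero positions, so anything before the first 0 or after the last 0 never makes it into the result. One way around that, sketched here with `itertools.groupby` from the standard library, is to split the values into runs directly. The helper name `mask_runs` and its parameters are hypothetical, not part of the original answer.

```python
from itertools import groupby

import pandas as pd

def mask_runs(values, low=1, high=5, sep=0):
    """Hypothetical helper: per-element mask built from runs delimited by `sep`."""
    out = []
    # groupby yields alternating runs of separators and non-separators.
    for is_sep, run in groupby(values, key=lambda v: v == sep):
        run = list(run)
        # A run qualifies only if it is not a separator run and holds a value above `high`.
        keep = (not is_sep) and any(v > high for v in run)
        out.extend(keep and v > low for v in run)
    return out

ts = pd.Series([2, 0, 2, 3, 6, 3])  # no leading or trailing 0 here
result = pd.Series(mask_runs(ts.tolist()), index=ts.index)
print(result.tolist())
# [False, False, True, True, True, True]
```

On the original data `[0,2,0,2,3,6,3,0]` this should produce the same mask as the groupby one-liner.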