Counting length of intersection of a list with pandas column of lists
I have a list of unique random integers and a DataFrame with a column of lists, as shown below:
>>> panel
[1, 10, 9, 5, 6]
>>> df
col1
0 [1, 5]
1 [2, 3, 4]
2 [9, 10, 6]
My desired output is the length of the overlap between panel and each individual list in the DataFrame:
>>> result
col1 res
0 [1, 5] 2
1 [2, 3, 4] 0
2 [9, 10, 6] 3
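Currently I'm using the apply function, but I was wondering if there is a faster way, since I need to create a lot of panels and loop through this task for each panel.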
# My version right now
def cntOverlap(panel, series):
    # Typically the lists inside df will be much shorter than panel,
    # so I think the fastest way would be converting the panel into a set
    # and loop through the lists within the dataframe
    return sum(1 for x in series if x in panel)
    #return len(np.setxor1d(list(panel), series))
    #return len(panel.difference(series))

for i, panel in enumerate(list_of_panels):
    panel = set(panel)
    df[f"panel_{i}"] = df["col1"].apply(lambda x: cntOverlap(panel, x))
You can use explode (available from pandas 0.25+) and isin:
df['col1'].explode().isin(panel).sum(level=0)
Output:
0 2.0
1 0.0
2 3.0
Name: col1, dtype: float64
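On pandas 2.0 and later, the `level` argument of `Series.sum` is no longer available. A minimal sketch of the same explode/isin idea using `groupby(level=0)` instead, assuming the sample `df` and `panel` from the question:

```python
import pandas as pd

panel = [1, 10, 9, 5, 6]
df = pd.DataFrame({"col1": [[1, 5], [2, 3, 4], [9, 10, 6]]})

# One row per list element; the original row label is repeated in the index.
exploded = df["col1"].explode()

# Flag the elements found in panel, then count the flags per original row.
df["res"] = exploded.isin(panel).groupby(level=0).sum()
```

The assignment aligns on the index, so each count lands next to the list it came from.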
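Because each row holds variable-length data, we need to iterate (explicitly or implicitly, i.e. under the hood) and stay within Python. However, we can keep the work done per iteration to a minimum. Following that philosophy, here's one with array assignment and some masking: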
import numpy as np

# l is the input list of unique random integers (panel in the question)
s = df.col1
max_num = 10 # largest value in df and l; if not known use: max(max(l), max(map(max, s)))
map_ar = np.zeros(max_num+1, dtype=bool)
map_ar[l] = 1
df['res'] = [map_ar[v].sum() for v in s]
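Since the question loops over many panels, the lookup-table idea can be sketched for that case as follows. This assumes all values are non-negative integers with a modest maximum and that no list in the column is empty; the second panel and the name `rows` are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [[1, 5], [2, 3, 4], [9, 10, 6]]})
# The second panel is hypothetical, added only to show the loop.
list_of_panels = [[1, 10, 9, 5, 6], [2, 3, 9]]

# Convert each row to an index array once and reuse it for every panel.
rows = [np.asarray(v, dtype=np.intp) for v in df["col1"]]

# The table must cover the largest value in both the column and the panels.
max_num = max(max(map(max, df["col1"])), max(map(max, list_of_panels)))

for i, panel in enumerate(list_of_panels):
    lookup = np.zeros(max_num + 1, dtype=bool)
    lookup[panel] = True  # mark the values that belong to this panel
    df[f"panel_{i}"] = [int(lookup[r].sum()) for r in rows]
```

Only the small boolean table is rebuilt per panel; the per-row index arrays are shared across all panels.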
Or use a 2D array assignment to reduce the per-iteration work further:
map_ar = np.zeros((len(df),max_num+1), dtype=bool)
map_ar[:,l] = 1
for i,v in enumerate(s):
    map_ar[i,v] = 0
df['res'] = len(l)-map_ar.sum(1)
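A self-contained run of the 2D variant on the question's sample data, checked against a plain set intersection. This is only a sketch: it relies on `l` having no duplicates (otherwise `len(l)` overcounts), and the mask needs `len(df) * (max_num + 1)` booleans of memory. The names `res_2d` and `expected` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

l = [1, 10, 9, 5, 6]
df = pd.DataFrame({"col1": [[1, 5], [2, 3, 4], [9, 10, 6]]})
s = df["col1"]
max_num = max(max(l), max(map(max, s)))

# Start with every panel value marked in every row ...
map_ar = np.zeros((len(df), max_num + 1), dtype=bool)
map_ar[:, l] = True

# ... then clear the values that actually occur in each row's list.
for i, v in enumerate(s):
    map_ar[i, v] = False

# Panel values that were cleared are exactly the overlap.
res_2d = len(l) - map_ar.sum(axis=1)

# Reference result from set intersection, for comparison.
expected = [len(set(l) & set(v)) for v in s]
```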