For Loop in a Data Frame to find near duplicate rows

Question

I have a list of values like this:

l = [0,1,1,1,0,0,1,0,1,0]

I am trying to find near-duplicate rows (rows that differ from the list in one or two digits) in a data frame like the following:

Keep in mind that the real data has many more rows and columns; this is just a sample data frame.

import pandas as pd

df = pd.DataFrame({'a': [0, 1, 0], 'b': [0, 1, 0], 'c': [1, 1, 0], 'd': [1, 0, 1], 'e': [1, 1, 0],
                   'f': [0, 1, 1], 'g': [0, 1, 0], 'h': [1, 1, 0], 'i': [1, 1, 0], 'j': [0, 1, 1]},
                  index=['x', 'y', 'z'])

   a  b  c  d  e  f  g  h  i  j
x  0  0  1  1  1  0  0  1  1  0
y  1  1  1  0  1  1  1  1  1  1
z  0  0  0  1  0  1  0  0  0  1

Answer 1

You can use df.eq(l).sum(axis=1) to count, for each row, the number of positions where the row and the list hold the same value:

l = [0,1,1,1,0,0,1,0,1,0]
df.eq(l).sum(axis=1)

x    6
y    4
z    4
dtype: int64
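
Since the question is phrased in terms of differences, it may be more direct to count mismatches instead of matches. The following is a minimal sketch that reuses the sample frame and list from the question; the name `mismatches` is introduced here only for illustration, and it assumes the list has exactly one value per column.

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 0], 'b': [0, 1, 0], 'c': [1, 1, 0], 'd': [1, 0, 1], 'e': [1, 1, 0],
                   'f': [0, 1, 1], 'g': [0, 1, 0], 'h': [1, 1, 0], 'i': [1, 1, 0], 'j': [0, 1, 1]},
                  index=['x', 'y', 'z'])
l = [0, 1, 1, 1, 0, 0, 1, 0, 1, 0]

# df.ne(l) compares every row to l position by position;
# summing the True values per row gives the number of differing positions.
mismatches = df.ne(l).sum(axis=1)
print(mismatches)
```

For the sample data this gives 4 for x and 6 for y and z, which is len(l) minus the match counts shown above.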

To keep only the rows that differ from the list in at most a given number of positions, filter with a threshold:

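diff = 4  # maximum number of differing positions allowed
# keep rows with at least len(l) - diff matching positions
df[df.eq(l).sum(axis=1).ge(len(l)-diff)]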

Output:

   a  b  c  d  e  f  g  h  i  j
x  0  0  1  1  1  0  0  1  1  0
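
If the filter is needed in several places, it could be wrapped in a small helper. This is only a sketch under stated assumptions: `rows_within` and `max_diff` are hypothetical names, not part of pandas, and the length check is an added safeguard rather than something the answer requires.

```python
import pandas as pd

def rows_within(df, values, max_diff):
    """Return the rows of df that differ from values in at most max_diff positions."""
    if len(values) != df.shape[1]:
        raise ValueError("values must have one entry per column")
    # Count differing positions per row, then keep rows at or under the limit.
    n_diff = df.ne(values).sum(axis=1)
    return df[n_diff.le(max_diff)]

df = pd.DataFrame({'a': [0, 1, 0], 'b': [0, 1, 0], 'c': [1, 1, 0], 'd': [1, 0, 1], 'e': [1, 1, 0],
                   'f': [0, 1, 1], 'g': [0, 1, 0], 'h': [1, 1, 0], 'i': [1, 1, 0], 'j': [0, 1, 1]},
                  index=['x', 'y', 'z'])
l = [0, 1, 1, 1, 0, 0, 1, 0, 1, 0]

print(rows_within(df, l, 4))  # same filter as above: only row x remains
print(rows_within(df, l, 2))  # empty for the sample data
```

With the "one or two digit" limit from the question (max_diff=2), none of the three sample rows qualifies, because the closest row, x, already differs in four positions.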

Tags: python, hash, for-loop, duplicates, dataframe