How to use numpy vectorization to find the values that differ in the same column of two huge dataframes?

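I have two dataframes, df1 and df2, which both have the same column, date. df1 has tens of millions of rows; its date column is incomplete and may contain duplicates. df2 has a few thousand rows; its date column is complete and has no duplicates. How can I use numpy vectorization to find the date values that exist in df2 but not in df1, and produce a numpy ndarray? I tried np.where and groupby.size, but could not find the right way to use them.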

You can also build on the solution recommended in the comments:

df2[~np.in1d(df2['date'],df1['date'])]['date'].to_numpy()
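
To make the one-liner concrete, here is a minimal sketch on toy data. The two small frames are made up for illustration, and np.isin is used in place of np.in1d (the NumPy documentation recommends isin for new code); the logic is otherwise the same.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real frames (hypothetical data).
# df1 is incomplete and contains a duplicate; df2 is complete and unique.
df1 = pd.DataFrame({'date': ['2023-01-01', '2023-01-01', '2023-01-03']})
df2 = pd.DataFrame({'date': ['2023-01-01', '2023-01-02',
                             '2023-01-03', '2023-01-04']})

# True where a df2 date also occurs somewhere in df1.
present = np.isin(df2['date'].to_numpy(), df1['date'].to_numpy())

# Keep only the df2 dates that never occur in df1, as an ndarray.
missing = df2.loc[~present, 'date'].to_numpy()
print(missing)
```

The boolean mask has one entry per row of df2, so the result keeps df2's row order.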

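Assuming I understood correctly, I generated a demo dataset following your description. My finding is that the recommended solution is simple, but its time cost is not low.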

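  • If you can afford to wait, the recommended solution is an option. Scanning a df1 of 10,000 / 100,000 rows takes about 0.4 / 4 seconds and the cost grows linearly, so with tens of millions of rows I expect it to become unbearable for your application;

  • If you have enough memory, you can use my "set solution", which gives the same (correct) result and takes only 0.1885 seconds on a df1 of 3,000,000 rows.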

--------------- Experiment [10000] starts ------------------------------
Recommended solution [10000] costs 0.49669885635375977 seconds
Set solution [10000] costs 0.001967191696166992 seconds
[ True]
--------------- Experiment [100000] starts ------------------------------
Recommended solution [100000] costs 4.4930009841918945 seconds
Set solution [100000] costs 0.006981611251831055 seconds
[ True]
--------------- Experiment [1000000] starts ------------------------------
Set solution [1000000] costs 0.06779003143310547 seconds
--------------- Experiment [3000000] starts ------------------------------
Set solution [3000000] costs 0.18847417831420898 seconds

Here is my code:

# %%
import time
import random
import datetime
import numpy as np
import pandas as pd

# %%
n = int(3e3)
today = datetime.date.today()
delta = datetime.timedelta(days=1)

for k in [int(1e4), int(1e5), int(1e6), int(3e6)]:
    print(
        f'--------------- Experiment [{k}] starts ------------------------------')
    days = ['{}'.format(today + i * delta) for i in range(n)]

    # Generate df2
    df2 = pd.DataFrame()
    df2['date'] = days
    df2

    # Generate df1
    df1 = pd.DataFrame()
    df1['date'] = random.choices(days[:-10], k=k)
    df1

    # Recommended solution.
    # It is skipped when k >= 1e6, since it would take too long.
    if k < 1e6:
        t0 = time.time()
        o1 = df2[~np.in1d(df2['date'], df1['date'])]['date']
        print(f'Recommended solution [{k}] costs', time.time() - t0, 'seconds')
        o1

    # Set solution
    t0 = time.time()
    set1 = set(df1['date'])
    o2 = df2['date'][df2['date'].map(lambda x: x not in set1)]
    print(f'Set solution [{k}] costs', time.time() - t0, 'seconds')
    o2

    # Compare the results of the two solutions
    if k < 1e6:
        print((o1 == o2).unique())

# %%
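
A variant of the set idea, not benchmarked in the experiment above, is to let pandas do the membership test with Series.isin instead of mapping a Python lambda over df2. The sketch below uses small made-up frames; whether it is faster or slower than the set lookup on the real data is an assumption that would need to be measured.

```python
import pandas as pd

# Hypothetical toy data shaped like the question: df1 has gaps and
# duplicates, df2 is complete and unique.
df1 = pd.DataFrame({'date': ['2023-02-01', '2023-02-03', '2023-02-03']})
df2 = pd.DataFrame({'date': ['2023-02-01', '2023-02-02', '2023-02-03']})

# Series.isin returns a boolean Series aligned with df2['date'].
in_df1 = df2['date'].isin(df1['date'])

# Same selection as the set solution, expressed without a lambda.
o3 = df2.loc[~in_df1, 'date'].to_numpy()
print(o3)
```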


I found another approach using np.setdiff1d. The code is as follows:

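from time import time
start = time()
# Deduplicate df1's dates first, then keep the df2 dates that are not in df1
n2 = np.setdiff1d(df2.date.values, df1.date.drop_duplicates().to_numpy())
stop = time()  # stop the clock before printing so only the lookup is timed
print('len df1:%d  df2:%d' % (len(df1), len(df2)))
print(stop - start)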

The result is:

len df1: 11351055  df2:7406  
1.6254549026489258
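
One detail worth knowing about np.setdiff1d: with its default arguments it returns the unique values in sorted order, not in df2's row order. The minimal sketch below shows this on made-up data, where df2 is deliberately unsorted.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames; df2 is deliberately out of order.
df1 = pd.DataFrame({'date': ['2023-01-03', '2023-01-01', '2023-01-01']})
df2 = pd.DataFrame({'date': ['2023-01-04', '2023-01-01',
                             '2023-01-02', '2023-01-03']})

# Values of df2.date that do not occur in df1.date.
n2 = np.setdiff1d(df2['date'].to_numpy(), df1['date'].to_numpy())
print(n2)
```

If the original row order of df2 matters, a boolean mask over df2 keeps it, whereas this call does not.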