Pandas/Dask: Filter dataframe from multiindex or two other columns of a second dataframe?
Given a
grouped_id_date = ddf.groupby(['my_id', 'my_date']).count().compute()
we receive a new DataFrame that counts how many rows exist per pair:
+------------+------------+----+-------------------+
| my_id | my_date | || | my_value (random) |
+------------+------------+----+-------------------+
| MultiIndex | MultiIndex | || | Normal Column |
| A | 2020-06-03 | || | 5 |
| A | 2020-06-04 | || | 3 |
| B | 2020-06-03 | || | 3 |
| C | 2020-06-04 | || | 4 |
+------------+------------+----+-------------------+
Now I want to go back to ddf and .loc only those rows that have a my_count > 3. What is a good way to achieve this?
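For reference, the setup described above can be reproduced on a toy frame (plain pandas standing in for dask here; the column values are taken from the table):

```python
import pandas as pd

# Toy stand-in for ddf with the question's columns (plain pandas instead of dask)
ddf = pd.DataFrame({
    'my_id':   ['A'] * 8 + ['B'] * 3 + ['C'] * 4,
    'my_date': ['2020-06-03'] * 5 + ['2020-06-04'] * 3
             + ['2020-06-03'] * 3 + ['2020-06-04'] * 4,
    'my_value': range(15),
})

# Count rows per (my_id, my_date) pair -> MultiIndexed count frame
grouped_id_date = ddf.groupby(['my_id', 'my_date']).count()
print(grouped_id_date['my_value'])
```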
My current solution is the following. It works, but it feels like... there has to be a better way:
condition = None
for i, (pair, row) in enumerate(grouped_id_date.iterrows()):
    if i == 1000:
        break  # not sure where the MaxRecursion exception kicks in..
    my_id, my_date = pair  # unpack the (my_id, my_date) MultiIndex entry
    pair_condition = (ddf.my_id == my_id) & (ddf.my_date == my_date)
    if condition is None:
        condition = pair_condition
    else:
        condition = condition | pair_condition
result = ddf.loc[condition]  # Works, but slow, and you hit a RecursionError at some point.
The dataframe has 500,000,000 rows, so there shouldn't be too much shuffling and so on..
Something like this should work:
grouped_id_date = grouped_id_date[grouped_id_date['my_value'] > 3]
valid_pairs = set(grouped_id_date.index.tolist())  # set -> O(1) membership tests
all_pairs = list(ddf[['my_id', 'my_date']].values)
mask = [(my_id, my_date) in valid_pairs for (my_id, my_date) in all_pairs]
result = ddf[mask]
The idea is to build your own boolean mask. You know that every pair in the grouped data must exist in the original dataframe ddf. You extract the MultiIndex containing all valid pairs into a list, then extract all pairs from ddf and check each one for membership.
Disclaimer: I have not tested this code. The logic should be correct, but there may be hidden typos leading to syntax errors or the like.
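Exercised in plain pandas on toy data, the approach looks like this (the set conversion is my addition for fast lookups; a real dask frame would need a different route for the mask, such as the isin approach in the next answer):

```python
import pandas as pd

# Toy data: (A, 2020-06-03) has 5 rows, (B, 2020-06-03) only 2, (C, 2020-06-04) has 4
ddf = pd.DataFrame({
    'my_id':   ['A'] * 5 + ['B'] * 2 + ['C'] * 4,
    'my_date': ['2020-06-03'] * 7 + ['2020-06-04'] * 4,
    'my_value': range(11),
})
grouped_id_date = ddf.groupby(['my_id', 'my_date']).count()

# Keep only pairs with more than 3 rows
grouped_id_date = grouped_id_date[grouped_id_date['my_value'] > 3]

# A set gives O(1) membership tests per row
valid_pairs = set(grouped_id_date.index.tolist())

# Build the boolean mask from the original frame's pairs
all_pairs = list(ddf[['my_id', 'my_date']].itertuples(index=False, name=None))
mask = [pair in valid_pairs for pair in all_pairs]
result = ddf[mask]
print(len(result))  # 9: only the A and C groups survive
```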
Here is what I came up with:
QS_MIN_ROWS_PER_GROUP = 3
# Build groups for each my_id+my_date combination (take a look at what's in there)
grouped_myid_mydate = ddf_c.groupby(['my_id', 'my_date'])
# Count the number of occurrences for that id on that day.
quotes_per_myid_mydate_all = grouped_myid_mydate.count().compute()
# Apply the filter based on the groupby (this compares the rows per group against a pre-defined threshold).
qs_myid_mydate_combinations = quotes_per_myid_mydate_all.loc[quotes_per_myid_mydate_all.my_id > QS_MIN_ROWS_PER_GROUP]
# Get the valid pairs from the MultiIndex
valid_pairs = qs_myid_mydate_combinations.index.tolist()
# Build a list that can be searched via a newly added search column,
# which contains the values of both columns to compare with.. Nasty
valid_pairs_formated = []
for pair in valid_pairs:
    valid_pairs_formated.append('%s;%s' % (pair[0], pair[1]))
print(valid_pairs_formated)
# Add the new search column to the central DataFrame. This assumes no ';' in the columns!
ddf_c['pair_code'] = ddf_c.my_id + ';' + ddf_c.my_date.astype(str)
Then we can filter pair_code against valid_pairs_formated:
is_in_valid_set_of_combinations = ddf_c.pair_code.isin(valid_pairs_formated)
Let's check whether the result is plausible:
is_in_valid_set_of_combinations.value_counts().compute() # you can skip this
>> Output:
True 246641219
False 11377
Name: pair_code, dtype: int64
Okay, looking good.
# Finally reach the target: filter the original DataFrame
# (note: is_in_valid_set_of_combinations is a standalone series, not a column of ddf_c)
ddf_c = ddf_c.loc[is_in_valid_set_of_combinations]
# And check the row count
len(ddf_c.index)
> 246641219
# And remove that nasty search column:
ddf_c = ddf_c.drop(columns=['pair_code'])
A lot of code for an 'n'-column comparison... but it works.
If you really know your data, you can also make some assumptions and build a numeric encoding, which makes computing and filtering faster: we assume my_id is smaller than 100000 and can build a new column pair_code_numeric:
PAIR_CODE_OFFSET_FOR_SID = 100000
col_name = 'pair_code_numeric'
ddf_c[col_name] = ((ddf_c.index.dt.year * (10000 * PAIR_CODE_OFFSET_FOR_SID))
                   + (ddf_c.index.dt.month * (100 * PAIR_CODE_OFFSET_FOR_SID))
                   + (ddf_c.index.dt.day * PAIR_CODE_OFFSET_FOR_SID)
                   + ddf_c.s_id)
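The arithmetic can be sanity-checked on a single timestamp (pair_code_numeric here is a hypothetical scalar helper mirroring the column expression, not part of the original code):

```python
import pandas as pd

PAIR_CODE_OFFSET_FOR_SID = 100000

# year*10^9 + month*10^7 + day*10^5 + s_id -> unique as long as s_id < 100000
def pair_code_numeric(ts, s_id):
    return (ts.year * (10000 * PAIR_CODE_OFFSET_FOR_SID)
            + ts.month * (100 * PAIR_CODE_OFFSET_FOR_SID)
            + ts.day * PAIR_CODE_OFFSET_FOR_SID
            + s_id)

code = pair_code_numeric(pd.Timestamp('2019-05-22 09:10:00.011433'), 210)
print(code)  # 2019052200210
```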
So what comes out is:
# view data without e0x formatting
ddf_c[col_name].apply(lambda x: '%0.f' % x, meta='int64').head()
2019-05-22 09:10:00.011433 2019052200210
2019-05-22 09:10:03.690125 2019052200175
2019-05-22 09:10:04.160046 2019052200448
The rest is then a straightforward groupby & locate on a single column. The .groupby first:
v = True  # verbose flag
grouped_pair_code = ddf_c.groupby([col_name])
# Count the number of rows per pair code
# (one approach chosen here, but you can apply the method to everything).
quotes_per_pair_code_all = grouped_pair_code.count().compute()
if v: print('Got %s %s combinations before Q/S' % (quotes_per_pair_code_all.shape[0], col_name))
# Get the valid pair_code_numeric combinations from the groupby by counting the rows per group.
# The minimum is a hundred rows (qs_pair_code_combinations is a user-defined helper applying that threshold).
qs_pair_code_combis = qs_pair_code_combinations(quotes_per_pair_code_all=quotes_per_pair_code_all, QS_MIN_L1_ROWS_PER_DAY=100, v=False)
ddf_c = client.persist(ddf_c)
Output:
Got 3467 pair_code_numeric combinations before Q/S
Got 2646 valid pair_code_numeric_combis
Then we can simply create a new column showing whether the row is valid, and .loc on it:
valid_pairs_numeric = qs_pair_code_combis.index.tolist()
ddf_c['is_in_valid_set_of_combis'] = ddf_c[col_name].isin(valid_pairs_numeric)
Finally, we can filter the huge dask.DataFrame:
len(ddf_c.loc[ddf_c.is_in_valid_set_of_combis])
# > 246641219 (Correct after filtering)