Pandas/Dask: Filter dataframe from multiindex or two other columns of a second dataframe?
Given a
grouped_id_date = ddf.groupby(['my_id', 'my_date']).count().compute()
we receive a new DataFrame that counts how many rows exist per pair:
+------------+------------+----+-------------------+
| my_id | my_date | || | my_value (random) |
+------------+------------+----+-------------------+
| MultiIndex | MultiIndex | || | Normal Column |
| A | 2020-06-03 | || | 5 |
| A | 2020-06-04 | || | 3 |
| B | 2020-06-03 | || | 3 |
| C | 2020-06-04 | || | 4 |
+------------+------------+----+-------------------+
Now I want to go back to ddf and .loc only those rows that have a my_count > 3. What is a good way to achieve this?
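For reference, the setup described above can be reproduced on a toy frame (plain pandas standing in for dask here; the column values are taken from the table):

```python
import pandas as pd

# Toy stand-in for ddf with the question's columns (plain pandas instead of dask)
ddf = pd.DataFrame({
    'my_id':   ['A'] * 8 + ['B'] * 3 + ['C'] * 4,
    'my_date': ['2020-06-03'] * 5 + ['2020-06-04'] * 3
             + ['2020-06-03'] * 3 + ['2020-06-04'] * 4,
    'my_value': range(15),
})

# Count rows per (my_id, my_date) pair -> MultiIndexed count frame
grouped_id_date = ddf.groupby(['my_id', 'my_date']).count()
print(grouped_id_date['my_value'])
```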
My current solution is the following. It works, but it feels like... there has to be a better way:
condition = None
for i, (pair, row) in enumerate(grouped_id_date.iterrows()):
    if i == 1000:
        break  # not sure where the MaxRecursion exception kicks in..
    my_id, my_date = pair  # unpack the (my_id, my_date) MultiIndex entry
    pair_condition = (ddf.my_id == my_id) & (ddf.my_date == my_date)
    if condition is None:
        condition = pair_condition
    else:
        condition = condition | pair_condition
result = ddf.loc[condition]  # Works, but slow, and you hit a RecursionError at some point.
The dataframe has 500,000,000 rows, so there shouldn't be too much shuffling and so on..
Something like this should work:
grouped_id_date = grouped_id_date[grouped_id_date['my_value'] > 3]
valid_pairs = set(grouped_id_date.index.tolist())  # set -> O(1) membership tests
all_pairs = list(ddf[['my_id', 'my_date']].values)
mask = [(my_id, my_date) in valid_pairs for (my_id, my_date) in all_pairs]
result = ddf[mask]
The idea is to build your own boolean mask. You know that every pair in the grouped data must exist in the original dataframe ddf. You extract the MultiIndex containing all valid pairs into a list, then extract all pairs from ddf and check each one for membership.
Disclaimer: I have not tested this code. The logic should be correct, but there may be hidden typos leading to syntax errors or the like.
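Exercised in plain pandas on toy data, the approach looks like this (the set conversion is my addition for fast lookups; a real dask frame would need a different route for the mask, such as the isin approach in the next answer):

```python
import pandas as pd

# Toy data: (A, 2020-06-03) has 5 rows, (B, 2020-06-03) only 2, (C, 2020-06-04) has 4
ddf = pd.DataFrame({
    'my_id':   ['A'] * 5 + ['B'] * 2 + ['C'] * 4,
    'my_date': ['2020-06-03'] * 7 + ['2020-06-04'] * 4,
    'my_value': range(11),
})
grouped_id_date = ddf.groupby(['my_id', 'my_date']).count()

# Keep only pairs with more than 3 rows
grouped_id_date = grouped_id_date[grouped_id_date['my_value'] > 3]

# A set gives O(1) membership tests per row
valid_pairs = set(grouped_id_date.index.tolist())

# Build the boolean mask from the original frame's pairs
all_pairs = list(ddf[['my_id', 'my_date']].itertuples(index=False, name=None))
mask = [pair in valid_pairs for pair in all_pairs]
result = ddf[mask]
print(len(result))  # 9: only the A and C groups survive
```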
Here is what I came up with:
QS_MIN_ROWS_PER_GROUP = 3
# Build groups for each my_id+my_date combination (take a look at what's in there)
grouped_myid_mydate = ddf_c.groupby(['my_id', 'my_date'])
# Count the number of occurrences for that id on that day.
quotes_per_myid_mydate_all = grouped_myid_mydate.count().compute()
# Apply the filter based on the groupby (this compares the rows per group against a pre-defined threshold).
qs_myid_mydate_combinations = quotes_per_myid_mydate_all.loc[quotes_per_myid_mydate_all.my_id > QS_MIN_ROWS_PER_GROUP]
# Get the valid pairs from the MultiIndex
valid_pairs = qs_myid_mydate_combinations.index.tolist()
# Build a list that can be searched via a newly added search column,
# which contains the values of both columns to compare with.. Nasty
valid_pairs_formated = []
for pair in valid_pairs:
    valid_pairs_formated.append('%s;%s' % (pair[0], pair[1]))
print(valid_pairs_formated)
# Add the new search column to the central DataFrame. This assumes no ';' in the columns!
ddf_c['pair_code'] = ddf_c.my_id + ';' + ddf_c.my_date.astype(str)
Then we can filter pair_code against valid_pairs_formated:
is_in_valid_set_of_combinations = ddf_c.pair_code.isin(valid_pairs_formated)
Let's check whether the result is plausible:
is_in_valid_set_of_combinations.value_counts().compute() # you can skip this
>> Output:
True 246641219
False 11377
Name: pair_code, dtype: int64
Okay, looking good.
# Finally reach the target: filter the original DataFrame
# (note: is_in_valid_set_of_combinations is a standalone series, not a column of ddf_c)
ddf_c = ddf_c.loc[is_in_valid_set_of_combinations]
# And check the row count
len(ddf_c.index)
> 246641219
# And remove that nasty search column:
ddf_c = ddf_c.drop(columns=['pair_code'])
A lot of code for an 'n'-column comparison... but it works.
If you really know your data, you can also make some assumptions and build a numeric encoding, which makes computing and filtering faster: we assume my_id is smaller than 100000 and can build a new column pair_code_numeric:
PAIR_CODE_OFFSET_FOR_SID = 100000
col_name = 'pair_code_numeric'
ddf_c[col_name] = ((ddf_c.index.dt.year * (10000 * PAIR_CODE_OFFSET_FOR_SID))
                   + (ddf_c.index.dt.month * (100 * PAIR_CODE_OFFSET_FOR_SID))
                   + (ddf_c.index.dt.day * PAIR_CODE_OFFSET_FOR_SID)
                   + ddf_c.s_id)
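The arithmetic can be sanity-checked on a single timestamp (pair_code_numeric here is a hypothetical scalar helper mirroring the column expression, not part of the original code):

```python
import pandas as pd

PAIR_CODE_OFFSET_FOR_SID = 100000

# year*10^9 + month*10^7 + day*10^5 + s_id -> unique as long as s_id < 100000
def pair_code_numeric(ts, s_id):
    return (ts.year * (10000 * PAIR_CODE_OFFSET_FOR_SID)
            + ts.month * (100 * PAIR_CODE_OFFSET_FOR_SID)
            + ts.day * PAIR_CODE_OFFSET_FOR_SID
            + s_id)

code = pair_code_numeric(pd.Timestamp('2019-05-22 09:10:00.011433'), 210)
print(code)  # 2019052200210
```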
So what comes out is:
# view data without e0x formatting
ddf_c[col_name].apply(lambda x: '%0.f' % x, meta='int64').head()
2019-05-22 09:10:00.011433 2019052200210
2019-05-22 09:10:03.690125 2019052200175
2019-05-22 09:10:04.160046 2019052200448
The rest is then a straightforward groupby & locate on a single column. The .groupby first:
v = True  # verbose flag
grouped_pair_code = ddf_c.groupby([col_name])
# Count the number of rows per pair code
# (one approach chosen here, but you can apply the method to everything).
quotes_per_pair_code_all = grouped_pair_code.count().compute()
if v: print('Got %s %s combinations before Q/S' % (quotes_per_pair_code_all.shape[0], col_name))
# Get the valid pair_code_numeric combinations from the groupby by counting the rows per group.
# The minimum is a hundred rows (qs_pair_code_combinations is a user-defined helper applying that threshold).
qs_pair_code_combis = qs_pair_code_combinations(quotes_per_pair_code_all=quotes_per_pair_code_all, QS_MIN_L1_ROWS_PER_DAY=100, v=False)
ddf_c = client.persist(ddf_c)
Output:
Got 3467 pair_code_numeric combinations before Q/S
Got 2646 valid pair_code_numeric_combis
Then we can simply create a new column showing whether the row is valid, and .loc on it:
valid_pairs_numeric = qs_pair_code_combis.index.tolist()
ddf_c['is_in_valid_set_of_combis'] = ddf_c[col_name].isin(valid_pairs_numeric)
Finally, we can filter the huge dask.DataFrame:
len(ddf_c.loc[ddf_c.is_in_valid_set_of_combis])
# > 246641219 (Correct after filtering)