Subset dataframe with another dataframe of a different length

Question

I have a dataframe of interacting chromosome pairs, each end represented by a chromosome (chr) and a position (pos), like this:

>>>import pandas as pd

>>>df1
chr1     pos1     chr2     pos2
chr1    54278    chr13    68798
chr1    32145     chr7  1248798
... 
[162689366 rows x 4 columns]

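In the real dataset, the rows are sorted by chr1, then by pos1, chr2, and pos2.

I have another dataset containing the regions whose interactions I want to look at, in the following format: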

>>>df2
chr     start     stop     comment
chr1    54275    55080   cluster-1
chr1   515523   515634   cluster-2
...
chr13   68760    70760
...
[69 rows x 4 columns]

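I want to subset df1 so that a row is kept if and only if both ends of the interaction (chr1-pos1 and chr2-pos2) fall within a start-stop range in df2.

In this example, the final dataframe would look like this: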

>>>df3
chr1    pos1      chr2     pos2
chr1    54278    chr13    68798
...

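I have been trying to use the .between function in pandas to do this stepwise (first for the first chr-pos pair, then for the second), but without success. I have tried this in both Python 2.7 and Python 3.6.

This is what I tried: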

>>>df3 = df1[(df1['chr1'].isin(df2.chr)) & df1['pos1'].between(df1.pos1(df2.start),df1.pos1(df2.stop))]

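The .isin part seems to work, but I get an error from the .between function. I think it is because the dataframes have different lengths, but I can't be sure.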

>>>df1['pos1'].between(df2.start,df2.stop)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/pandas/core/series.py", line 2412, in between
    lmask = self >= left
  File "/usr/lib/python2.7/dist-packages/pandas/core/ops.py", line 699, in wrapper
    raise ValueError('Series lengths must match to compare')
ValueError: Series lengths must match to compare

Any help is greatly appreciated!

Answer 1

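Someone may have a more elegant solution, but I would join df2 to df1 twice, so that everything ends up in one dataset and the comparison becomes easy.

df2 is basically a lookup table, and df2.chr should be matched against df1.chr1 and df1.chr2 separately.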

df_all = df1.merge(df2,
                   how='inner',
                   left_on='chr1',
                   right_on='chr') \
            .merge(df2,
                   how='inner',
                   left_on='chr2',
                   right_on='chr',
                   suffixes=('_r1', '_r2'))

Note the suffixes: pos1 is tested against the range start_r1 to stop_r1, and pos2 against the range start_r2 to stop_r2.

df3 = df_all[(df_all['pos1'] \
                  .between(df_all['start_r1'], df_all['stop_r1'])) &
             (df_all['pos2'] \
                  .between(df_all['start_r2'], df_all['stop_r2']))]

# Back to four original columns again
df3 = df3[['chr1', 'pos1', 'chr2', 'pos2']]
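
To make the two-merge approach above easier to try out, here is a minimal self-contained sketch on toy data. The sample rows are made up for illustration (only the first df1 row and the three df2 regions echo the question), and the final drop_duplicates step is my own addition, on the assumption that df2 may contain overlapping regions and that df1 itself has no repeated rows.

```python
import pandas as pd

# Toy stand-ins for the question's data; values are illustrative only.
df1 = pd.DataFrame({
    "chr1": ["chr1", "chr1", "chr1"],
    "pos1": [54278, 32145, 515600],
    "chr2": ["chr13", "chr7", "chr13"],
    "pos2": [68798, 1248798, 99999],
})
df2 = pd.DataFrame({
    "chr": ["chr1", "chr1", "chr13"],
    "start": [54275, 515523, 68760],
    "stop": [55080, 515634, 70760],
    "comment": ["cluster-1", "cluster-2", ""],
})

# Join the lookup table once per end of the interaction. In the second
# merge, the df2 columns already present get '_r1' and the new ones '_r2'.
df_all = (
    df1.merge(df2, how="inner", left_on="chr1", right_on="chr")
       .merge(df2, how="inner", left_on="chr2", right_on="chr",
              suffixes=("_r1", "_r2"))
)

# All columns now live in one frame with one index, so the row-wise
# comparison in .between no longer hits a length mismatch.
in_region_1 = df_all["pos1"].between(df_all["start_r1"], df_all["stop_r1"])
in_region_2 = df_all["pos2"].between(df_all["start_r2"], df_all["stop_r2"])

df3 = (
    df_all.loc[in_region_1 & in_region_2, ["chr1", "pos1", "chr2", "pos2"]]
          .drop_duplicates()          # assumption: collapse multi-region hits
          .reset_index(drop=True)
)
print(df3)
```

On this toy input only the chr1:54278 / chr13:68798 row survives. One thing to keep in mind: each inner merge produces one row per combination of a df1 row and a df2 region on the same chromosome, so with a df1 of the size in the question the intermediate frame may be much larger than df1, and processing df1 in chunks might be necessary.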

python

numpy

genome

pandas