How to select values of a dataset if they fall within the ranges of another dataset in Python?

Question

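I have 2 genetic datasets. One has rows of genomic ranges, each with a measurement value assigned; the other has specific positions in the genome. I want to match each row of the second dataset to the rows of the first whose range contains its genomic position. A position must fall within the range and be on the same chromosome.

For example: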

dataset1:
Chromosome  Start  End   Score
    1       100    200     50
    1       200    250     10
    2       10     20000   40
    2       100    200     20
    3       100    200     10

dataset2:
Chromosome  Position
    1          150
    2          157
    2          1067
    3           5

Matching the positions in dataset 2 that fall within the Start and End range in dataset 1 (and have the same Chromosome), and assigning them the score, would give:

Chromosome  Position     Score
    1          150         50
    2          157         20
    2          1067        40
    3           5          NA

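A position on chromosome 2 matches 2 scores, so it needs to be duplicated to be assigned both scores. I usually code this in R and don't use Python often, so I don't know where to start in Python. Are there any functions I should look into that can do this?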

Answer 1

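Assuming dataset1 and dataset2 are pandas DataFrames, the following code returns every row of dataset2 whose position falls within a range, one row per matching range: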

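# Pair every position with every range on the same chromosome (inner join).
df = dataset2.merge(dataset1, on='Chromosome')
# Keep only the pairs whose position lies inside the range (bounds inclusive).
res = df[(df.Position >= df.Start) & (df.Position <= df.End)][['Chromosome', 'Position', 'Score']]

print(res)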

Output:

   Chromosome  Position  Score
0           1       150     50
2           2       157     40
3           2       157     20
4           2      1067     40
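
The output above drops position 5 on chromosome 3, because it matches no range, while the question expects that row to be kept with a missing score. One possible way to keep it, sketched below, is to left-merge the filtered matches back onto dataset2. This is only an illustration, not part of the original answer: the DataFrames are rebuilt from the example tables in the question, the names `hits` and `res_all` are made up for this sketch, and it assumes dataset2 contains no duplicate (Chromosome, Position) rows.

```python
import pandas as pd

# Example data copied from the tables in the question.
dataset1 = pd.DataFrame({
    "Chromosome": [1, 1, 2, 2, 3],
    "Start": [100, 200, 10, 100, 100],
    "End": [200, 250, 20000, 200, 200],
    "Score": [50, 10, 40, 20, 10],
})
dataset2 = pd.DataFrame({
    "Chromosome": [1, 2, 2, 3],
    "Position": [150, 157, 1067, 5],
})

# Same as the answer: pair positions with ranges on the same chromosome,
# then keep the pairs where the position lies inside the range.
df = dataset2.merge(dataset1, on="Chromosome")
hits = df[(df.Position >= df.Start) & (df.Position <= df.End)][
    ["Chromosome", "Position", "Score"]
]

# Left-merge the matches back onto dataset2 so that positions without any
# matching range are kept, with a missing value (NaN) as their Score.
# Assumes dataset2 has no duplicate (Chromosome, Position) rows.
res_all = dataset2.merge(hits, on=["Chromosome", "Position"], how="left")

print(res_all)
```

With the example data this should give five rows: the four matches from the answer plus position 5 on chromosome 3 with a missing score. Because of the missing value, pandas stores the Score column as floating point.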

python

bioinformatics