pandas merge intervals by range
I have a pandas DataFrame that looks like this:
chrom start end probability read
0 chr1 1 10 0.99 read1
1 chr1 5 25 0.99 read2
2 chr1 15 25 0.99 read2
3 chr1 30 40 0.75 read4
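What I want to do is merge intervals that are on the same chromosome (chrom column) and whose coordinates (start, end) overlap. In some cases, when several intervals overlap one another, intervals that do not overlap each other directly should still be merged. See rows 0 and 2 in the example above and the merged output below.
For the merged elements, I want to sum their probabilities (probability column) and count the unique elements in the 'read' column.
For the example above this gives the following output. Note that rows 0, 1 and 2 have been merged: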
chrom start end probability read
0 chr1 1 20 2.97 2
1 chr1 30 40 0.75 1
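So far I have been doing this with pybedtools merge, but it turns out to be slow when run millions of times (my case). I am therefore looking for other options, and pandas is the obvious one. I know that pandas groupby lets me apply different operations to the columns being merged, such as nunique and sum, which are the ones I need. However, pandas groupby only merges rows with exactly the same 'chrom', 'start' and 'end' coordinates.
My problem is that I don't know how to use pandas to merge my rows based on the coordinates (chrom, start, end) and then apply the sum and nunique operations.
Is there a fast way to do this?
Thanks!
PS: As I said in the question, I am doing this millions of times, so speed is a big issue. I therefore cannot use pybedtools or pure Python, which are too slow for my purpose.
Thanks!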
IIUC
df.groupby((df.end.shift()-df.start).lt(0).cumsum()).agg({'chrom':'first','start':'first','end':'last','probability':'sum','read':'nunique'})
Out[417]:
chrom start end probability read
0 chr1 1 20 2.97 2
1 chr1 30 40 0.75 1
More info on how the group key is created:
(df.end.shift()-df.start).lt(0).cumsum()
Out[418]:
0 0
1 0
2 0
3 1
dtype: int32
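To make the one-liner easier to follow, here is a minimal sketch that builds the same key step by step on the four rows from the question. The intermediate names (`prev_end`, `gap`, `new_group`) are only illustrative. Note that the key compares each row with the row directly before it, so it assumes the frame is already sorted by start and holds a single chromosome.

```python
import pandas as pd

# The four rows from the question.
df = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr1", "chr1"],
    "start": [1, 5, 15, 30],
    "end": [10, 25, 25, 40],
    "probability": [0.99, 0.99, 0.99, 0.75],
    "read": ["read1", "read2", "read2", "read4"],
})

prev_end = df["end"].shift()   # end of the previous row (NaN for the first row)
gap = prev_end - df["start"]   # negative when a row starts after the previous row ends
new_group = gap.lt(0)          # True marks the first row of a new group (NaN -> False)
key = new_group.cumsum()       # running count of group starts, used as the group label

print(key.tolist())  # [0, 0, 0, 1]
```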
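As @root suggested, the accepted answer does not generalize to similar cases. For example, if we add an extra row with range 2-3 to the example from the question: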
df = pd.DataFrame({'chrom': ['chr1','chr1','chr1','chr1','chr1'],
'start': [1, 2, 5, 15, 30],
'end': [10, 3, 20, 25, 40],
'probability': [0.99, 0.99, 0.99, 0.99, 0.75],
'read': ['read1','read2','read2','read2','read4']})
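...the suggested aggregate function outputs a dataframe in which position 4, although it lies within the range 1-10, is no longer captured. The ranges 1-10, 2-3, 5-20 and 15-25 all overlap and should therefore be grouped together.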
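The behaviour described here can be reproduced with a short sketch that applies the shift-based key from the accepted answer to the five-row frame. The variable names `key` and `out` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr1", "chr1", "chr1"],
    "start": [1, 2, 5, 15, 30],
    "end": [10, 3, 20, 25, 40],
    "probability": [0.99, 0.99, 0.99, 0.99, 0.75],
    "read": ["read1", "read2", "read2", "read2", "read4"],
})

# Group key from the accepted answer: a new group starts whenever a row
# begins after the end of the row directly before it.
key = (df["end"].shift() - df["start"]).lt(0).cumsum()

out = df.groupby(key).agg({"chrom": "first", "start": "first", "end": "last",
                           "probability": "sum", "read": "nunique"})
print(out[["start", "end"]].values.tolist())  # [[1, 3], [5, 25], [30, 40]]
```

The short 2-3 row hides the fact that 1-10 is still open, so the first group is reported as 1-3 and the rows from 5 onwards are split off.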
One solution is the following approach (using the aggregate function suggested by @W-B and the method for combining intervals posted by @CentAu).
# Union intervals by @CentAu
from sympy import Interval, Union

def union(data):
    """ Union of a list of intervals e.g. [(1,2),(3,4)] """
    intervals = [Interval(begin, end) for (begin, end) in data]
    u = Union(*intervals)
    return [u] if isinstance(u, Interval) \
        else list(u.args)

# Get intervals for rows
def f(x, position=None):
    """
    Returns an interval for the row. The start and stop position indicate the minimum
    and maximum position of all overlapping ranges within the group.
    Args:
        position (str, optional): Returns an integer indicating start or stop position.
    """
    intervals = union(x)
    if position and position.lower() == 'start':
        group = x.str[0].apply(lambda y: [l.start for g, l in enumerate(intervals) if l.contains(y)][0])
    elif position and position.lower() == 'end':
        group = x.str[0].apply(lambda y: [l.end for g, l in enumerate(intervals) if l.contains(y)][0])
    else:
        group = x.str[0].apply(lambda y: [l for g, l in enumerate(intervals) if l.contains(y)][0])
    return group

# Combine start and end into a single column
df['start_end'] = df[['start', 'end']].apply(list, axis=1)

# Assign each row to an interval and add start/end columns
df['start_interval'] = df[['chrom',
                           'start_end']].groupby(['chrom']).transform(f, 'start')
df['end_interval'] = df[['chrom',
                         'start_end']].groupby(['chrom']).transform(f, 'end')

# Aggregate rows, using approach by @W-B
df.groupby(['chrom', 'start_interval', 'end_interval']).agg({'probability': 'sum',
                                                             'read': 'nunique'}).reset_index()
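...this yields a first row with a summed probability of 3.96, because four rows are combined instead of three.
Although this approach should be more general, it is not necessarily fast! Hopefully others can suggest faster alternatives.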
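One possible faster alternative, offered here as a sketch rather than as part of any of the answers, is to stay in pandas and compare each start with the running maximum of all earlier ends on the same chromosome instead of only the previous row's end. The function name `merge_intervals` and the helper column names are made up for illustration. As in the accepted answer, an interval that starts exactly where the previous one ends is treated as overlapping.

```python
import pandas as pd

def merge_intervals(df):
    """Merge overlapping intervals per chromosome; sum probability, count unique reads."""
    df = df.sort_values(["chrom", "start", "end"]).reset_index(drop=True)
    # Furthest end seen so far within each chromosome ...
    run_max = df.groupby("chrom")["end"].cummax()
    # ... shifted by one row, so each row only sees the rows before it.
    prev_max_end = run_max.groupby(df["chrom"]).shift()
    # A row opens a new cluster if it is the first row of its chromosome (NaN)
    # or starts after every earlier interval on that chromosome has ended.
    new_cluster = prev_max_end.isna() | (df["start"] > prev_max_end)
    cluster = new_cluster.cumsum().rename("cluster")
    return (df.groupby(cluster)
              .agg(chrom=("chrom", "first"), start=("start", "min"), end=("end", "max"),
                   probability=("probability", "sum"), read=("read", "nunique"))
              .reset_index(drop=True))

df = pd.DataFrame({
    "chrom": ["chr1", "chr1", "chr1", "chr1", "chr1"],
    "start": [1, 2, 5, 15, 30],
    "end": [10, 3, 20, 25, 40],
    "probability": [0.99, 0.99, 0.99, 0.99, 0.75],
    "read": ["read1", "read2", "read2", "read2", "read4"],
})
result = merge_intervals(df)
print(result[["start", "end", "read"]].values.tolist())  # [[1, 25, 2], [30, 40, 1]]
```

Everything here is a built-in groupby or cumulative operation, so no Python-level loop over rows is needed. Whether it is fast enough for millions of calls would have to be measured on real data.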
Here is an answer using pyranges and pandas. Its advantage is that it does the merge very quickly, is easy to parallelize, and is very fast even in single-core mode.
Setup:
import pandas as pd
import pyranges as pr
import numpy as np
rows = int(1e7)
gr = pr.random(rows)
gr.probability = np.random.rand(rows)
gr.read = np.arange(rows)
print(gr)
# +--------------+-----------+-----------+--------------+----------------------+-----------+
# | Chromosome | Start | End | Strand | probability | read |
# | (category) | (int32) | (int32) | (category) | (float64) | (int64) |
# |--------------+-----------+-----------+--------------+----------------------+-----------|
# | chr1 | 149953099 | 149953199 | + | 0.7536048547309669 | 0 |
# | chr1 | 184344435 | 184344535 | + | 0.9358130407479777 | 1 |
# | chr1 | 238639916 | 238640016 | + | 0.024212603310159064 | 2 |
# | chr1 | 95180042 | 95180142 | + | 0.027139751993808026 | 3 |
# | ... | ... | ... | ... | ... | ... |
# | chrY | 34355323 | 34355423 | - | 0.8843190383030953 | 999996 |
# | chrY | 1818049 | 1818149 | - | 0.23138017743097572 | 999997 |
# | chrY | 10101456 | 10101556 | - | 0.3007915302642412 | 999998 |
# | chrY | 355910 | 356010 | - | 0.03694752911338561 | 999999 |
# +--------------+-----------+-----------+--------------+----------------------+-----------+
# Stranded PyRanges object has 1,000,000 rows and 6 columns from 25 chromosomes.
# For printing, the PyRanges was sorted on Chromosome and Strand.
Execution:
def praderas(df):
    grpby = df.groupby("Cluster")
    prob = grpby.probability.sum()
    prob.name = "ProbSum"
    n = grpby.read.count()
    n.name = "Count"
    return df.merge(prob, on="Cluster").merge(n, on="Cluster")
%time result = gr.cluster().apply(praderas)
# 11.4s !
result[result.Count > 2]
# +--------------+-----------+-----------+--------------+----------------------+-----------+-----------+--------------------+-----------+
# | Chromosome | Start | End | Strand | probability | read | Cluster | ProbSum | Count |
# | (category) | (int32) | (int32) | (category) | (float64) | (int64) | (int32) | (float64) | (int64) |
# |--------------+-----------+-----------+--------------+----------------------+-----------+-----------+--------------------+-----------|
# | chr1 | 52952 | 53052 | + | 0.7411051557901921 | 59695 | 70 | 2.2131010082513884 | 3 |
# | chr1 | 52959 | 53059 | + | 0.9979036360671423 | 356518 | 70 | 2.2131010082513884 | 3 |
# | chr1 | 53029 | 53129 | + | 0.47409221639405397 | 104776 | 70 | 2.2131010082513884 | 3 |
# | chr1 | 64657 | 64757 | + | 0.32465233067499366 | 386140 | 88 | 1.3880589602361695 | 3 |
# | ... | ... | ... | ... | ... | ... | ... | ... | ... |
# | chrY | 59356855 | 59356955 | - | 0.3877207561218887 | 9966373 | 8502533 | 1.182153891322546 | 4 |
# | chrY | 59356865 | 59356965 | - | 0.4007557656399032 | 9907364 | 8502533 | 1.182153891322546 | 4 |
# | chrY | 59356932 | 59357032 | - | 0.33799123310907786 | 9978653 | 8502533 | 1.182153891322546 | 4 |
# | chrY | 59356980 | 59357080 | - | 0.055686136451676305 | 9994845 | 8502533 | 1.182153891322546 | 4 |
# +--------------+-----------+-----------+--------------+----------------------+-----------+-----------+--------------------+-----------+
# Stranded PyRanges object has 606,212 rows and 9 columns from 24 chromosomes.
# For printing, the PyRanges was sorted on Chromosome and Strand.
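The result above keeps every original row and attaches the per-cluster sum and row count to it, and `count()` counts rows rather than unique reads. To get one merged row per cluster, as the question asks, a plain pandas aggregation over the `Cluster` column could be used. The sketch below assumes a DataFrame with the column names shown in the table above. The small `clustered` frame and the `UniqueReads` column name are made up for illustration.

```python
import pandas as pd

# Hypothetical clustered frame, shaped like the table above (Cluster already assigned).
clustered = pd.DataFrame({
    "Chromosome": ["chr1", "chr1", "chr1", "chr1"],
    "Start": [1, 5, 15, 30],
    "End": [10, 20, 25, 40],
    "probability": [0.99, 0.99, 0.99, 0.75],
    "read": ["read1", "read2", "read2", "read4"],
    "Cluster": [1, 1, 1, 2],
})

# Collapse each cluster to a single row: outer coordinates, summed probability,
# and the number of distinct reads.
merged = (clustered.groupby("Cluster", as_index=False)
          .agg(Chromosome=("Chromosome", "first"),
               Start=("Start", "min"),
               End=("End", "max"),
               ProbSum=("probability", "sum"),
               UniqueReads=("read", "nunique")))
print(merged)
```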
This can be solved using bioframe.
df = pd.DataFrame({'chrom': ['chr1','chr1','chr1','chr1','chr1'],
'start': [1, 2, 5, 15, 30],
'end': [10, 3, 20, 25, 40],
'probability': [0.99, 0.99, 0.99, 0.99, 0.75],
'read': ['read1','read2','read2','read2','read4']})
import bioframe as bf
bfm = bf.merge(df.iloc[:,:3],min_dist=0)
bf_close = bf.closest(bfm, df, suffixes=('_1','_2'), k=df.shape[0])
bf_close = bf_close[bf_close['distance'] == 0]
bf_close.groupby(['chrom_1','start_1','end_1']).agg({'probability_2':'sum'}).reset_index()
chrom_1 start_1 end_1 probability_2
0 chr1 1 25 3.96
1 chr1 30 40 0.75