How to merge two pandas dataframes (or transfer values) by comparing ranges of values
Given the following data:
data01 =
contig start end haplotype_block
2 5207 5867 1856
2 155667 155670 2816
2 67910 68022 2
2 68464 68483 3
2 525 775 132
2 118938 119559 1157
data02 =
contig start last feature gene_id gene_name transcript_id
2 5262 5496 exon scaffold_200003.1 CP5 scaffold_200003.1
2 5579 5750 exon scaffold_200003.1 CP5 scaffold_200003.1
2 5856 6032 exon scaffold_200003.1 CP5 scaffold_200003.1
2 6115 6198 exon scaffold_200003.1 CP5 scaffold_200003.1
2 916 1201 exon scaffold_200001.1 NA scaffold_200001.1
2 614 789 exon scaffold_200001.1 NA scaffold_200001.1
2 171 435 exon scaffold_200001.1 NA scaffold_200001.1
2 2677 2806 exon scaffold_200002.1 NA scaffold_200002.1
2 2899 3125 exon scaffold_200002.1 NA scaffold_200002.1
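Problem:
- I want to compare the ranges (start - end) of these two dataframes.
- If the ranges overlap, I want to transfer the gene_id and gene_name values from data02 into new columns in data01.
I tried this (using pandas):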
data01['gene_id'] = ""
data01['gene_name'] = ""
data01['gene_id'] = data01['gene_id'].\
apply(lambda x: data02['gene_id']\
if range(data01['start'], data01['end'])\
<= range(data02['start'], data02['last']) else 'NA')
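How can I improve this code? I am sticking with pandas for now, but I am open to a dictionary-based approach if it solves the problem better. Please explain the process, though; I want to learn rather than just get an answer.
Thanks,
Expected output: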
contig start end haplotype_block gene_id gene_name
2 5207 5867 1856 scaffold_200003.1,scaffold_200003.1,scaffold_200003.1 CP5,CP5,CP5
# gene_id and gene_name are repeated three times because three intervals from data02 (5262-5496, 5579-5750, 5856-6032) overlap (or touch) the data01 interval (5207-5867).
# So whenever ranges from the two dataframes overlap, copy the gene_id and gene_name,
# and simply put NA in gene_id and gene_name for non-overlapping ranges.
2 155667 155670 2816 NA NA
2 67910 68022 2 NA NA
2 68464 68483 3 NA NA
2 525 775 132 scaffold_200001.1 NA
2 118938 119559 1157 NA NA
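I know you are using Python, but your problem can easily be solved with a classic bioinformatics tool, bedtools intersect: http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html
Both of your input files follow the standard BED format: http://bedtools.readthedocs.io/en/latest/content/general-usage.html
Bedtools intersect gives you fine-grained control over what counts as an intersection or overlap between two regions. I believe it can also operate directly on bgzipped input.
You should use an interval tree in Python; interval trees are very efficient and memory friendly. I tried something similar and ran into some problems that were resolved later, but here is the code I wrote:
Using Interval tree to find overlapping regions
You can build on that code.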
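The linked code is not reproduced here, so the following is only a minimal sketch of the interval tree idea in plain Python, not the code from that answer. The class name `IntervalNode` and the `(start, end, payload)` tuple layout are assumptions made for illustration, and the intervals are treated as closed, so touching ranges count as overlapping.

```python
class IntervalNode:
    """Node of a centered interval tree over closed (start, end, payload) tuples."""

    def __init__(self, intervals):
        # Use the median endpoint as the center; the interval owning that
        # endpoint always lands in `here`, so each child is strictly smaller.
        points = sorted(p for s, e, _ in intervals for p in (s, e))
        self.center = points[len(points) // 2]
        self.here = [iv for iv in intervals if iv[0] <= self.center <= iv[1]]
        left = [iv for iv in intervals if iv[1] < self.center]
        right = [iv for iv in intervals if iv[0] > self.center]
        self.left = IntervalNode(left) if left else None
        self.right = IntervalNode(right) if right else None

    def query(self, start, end):
        """Return every stored interval that overlaps the closed range [start, end]."""
        hits = [iv for iv in self.here if iv[0] <= end and iv[1] >= start]
        # Intervals in the left child end before the center, so they can
        # only overlap when the query starts before the center.
        if self.left is not None and start < self.center:
            hits += self.left.query(start, end)
        # Intervals in the right child start after the center.
        if self.right is not None and end > self.center:
            hits += self.right.query(start, end)
        return hits


# Exons from data02 as (start, last, gene_id).
exons = [
    (5262, 5496, "scaffold_200003.1"), (5579, 5750, "scaffold_200003.1"),
    (5856, 6032, "scaffold_200003.1"), (6115, 6198, "scaffold_200003.1"),
    (916, 1201, "scaffold_200001.1"), (614, 789, "scaffold_200001.1"),
    (171, 435, "scaffold_200001.1"), (2677, 2806, "scaffold_200002.1"),
    (2899, 3125, "scaffold_200002.1"),
]
tree = IntervalNode(exons)

# A few haplotype blocks from data01 as (start, end).
for start, end in [(5207, 5867), (155667, 155670), (525, 775)]:
    ids = [gene_id for _, _, gene_id in sorted(tree.query(start, end))]
    print(start, end, ",".join(ids) or "NA")
```

With several contigs, one would presumably build one tree per contig and query the tree that matches each data01 row.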
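# Vectorized overlap test with NumPy broadcasting.
import numpy as np
import pandas as pd

s1 = data01['start'].values
e1 = data01['end'].values
s2 = data02['start'].values
e2 = data02['last'].values

# overlap[a, b] is True when data01 row a and data02 row b overlap.
# Closed intervals overlap exactly when each one starts before the other ends;
# this also catches a data02 interval that fully contains a data01 interval.
overlap = (s1[:, None] <= e2) & (e1[:, None] >= s2)

g = data02['gene_id'].values
n = data02['gene_name'].values

# i holds positions in data01, j holds the matching positions in data02.
i, j = np.where(overlap)
# Map positions in data01 back to its index labels.
idx_map = {i_: data01.index[i_] for i_ in pd.unique(i)}

def make_series(m):
    # Join all matching values per data01 row into one comma-separated string.
    s = pd.Series(m[j]).fillna('').groupby(i).agg(','.join)
    # rename() maps the index labels; rows with no real value become NaN.
    return s.rename(idx_map).replace('', np.nan)

data01.assign(
    gene_id=make_series(g),
    gene_name=make_series(n),
)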
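Since the question asked for an explanation of the process, here is a hedged sketch of the same overlap test written with plain pandas operations: pair rows on contig, keep the overlapping pairs, collapse them per data01 row, and join back. It uses a shortened copy of the sample data, and the helper name `join_values` and the `_exon` suffix are assumptions for illustration. Unlike the broadcasting version, it only compares intervals on the same contig.

```python
import pandas as pd

# Shortened copies of the question's tables.
data01 = pd.DataFrame({
    "contig": [2, 2, 2],
    "start": [5207, 155667, 525],
    "end": [5867, 155670, 775],
    "haplotype_block": [1856, 2816, 132],
})
data02 = pd.DataFrame({
    "contig": [2, 2, 2, 2, 2],
    "start": [5262, 5579, 5856, 6115, 614],
    "last": [5496, 5750, 6032, 6198, 789],
    "gene_id": ["scaffold_200003.1"] * 4 + ["scaffold_200001.1"],
    "gene_name": ["CP5"] * 4 + [None],
})

# Step 1: pair every block with every exon on the same contig.
# reset_index() keeps the original data01 row label in a column named "index".
pairs = data01.reset_index().merge(data02, on="contig", suffixes=("", "_exon"))

# Step 2: keep only pairs whose closed intervals overlap.
hits = pairs[(pairs["start"] <= pairs["last"]) & (pairs["end"] >= pairs["start_exon"])]

# Step 3: collapse the hits of each block into comma-separated strings.
def join_values(values):
    return ",".join(values.fillna("NA").astype(str))

collapsed = hits.groupby("index")[["gene_id", "gene_name"]].agg(join_values)

# Step 4: attach the collapsed columns; blocks without hits stay NaN.
result = data01.join(collapsed)
print(result.fillna("NA"))
```

The merge builds every block/exon pair per contig before filtering, so memory grows with the product of the two table sizes; for large inputs the interval tree, bedtools, or pyranges approaches are likely the better fit.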
If you want something much faster than bedtools and/or native to the Python scientific stack, try pyranges:
import pyranges as pr
c1 = """Chromosome Start End haplotype_block
2 5207 5867 1856
2 155667 155670 2816
2 67910 68022 2
2 68464 68483 3
2 525 775 132
2 118938 119559 1157"""
c2 = """Chromosome Start End feature gene_id gene_name transcript_id
2 5262 5496 exon scaffold_200003.1 CP5 scaffold_200003.1
2 5579 5750 exon scaffold_200003.1 CP5 scaffold_200003.1
2 5856 6032 exon scaffold_200003.1 CP5 scaffold_200003.1
2 6115 6198 exon scaffold_200003.1 CP5 scaffold_200003.1
2 916 1201 exon scaffold_200001.1 NA scaffold_200001.1
2 614 789 exon scaffold_200001.1 NA scaffold_200001.1
2 171 435 exon scaffold_200001.1 NA scaffold_200001.1
2 2677 2806 exon scaffold_200002.1 NA scaffold_200002.1
2 2899 3125 exon scaffold_200002.1 NA scaffold_200002.1"""
gr1, gr2 = pr.from_string(c1), pr.from_string(c2)
j = gr1.join(gr2).sort()
print(j)
# +--------------+-----------+-----------+-------------------+-----------+-----------+------------+-------------------+-------------+-------------------+
# | Chromosome | Start | End | haplotype_block | Start_b | End_b | feature | gene_id | gene_name | transcript_id |
# | (category) | (int32) | (int32) | (int64) | (int32) | (int32) | (object) | (object) | (object) | (object) |
# |--------------+-----------+-----------+-------------------+-----------+-----------+------------+-------------------+-------------+-------------------|
# | 2 | 525 | 775 | 132 | 614 | 789 | exon | scaffold_200001.1 | nan | scaffold_200001.1 |
# | 2 | 5207 | 5867 | 1856 | 5262 | 5496 | exon | scaffold_200003.1 | CP5 | scaffold_200003.1 |
# | 2 | 5207 | 5867 | 1856 | 5579 | 5750 | exon | scaffold_200003.1 | CP5 | scaffold_200003.1 |
# | 2 | 5207 | 5867 | 1856 | 5856 | 6032 | exon | scaffold_200003.1 | CP5 | scaffold_200003.1 |
# +--------------+-----------+-----------+-------------------+-----------+-----------+------------+-------------------+-------------+-------------------+
# Unstranded PyRanges object has 4 rows and 10 columns from 1 chromosomes.
# For printing, the PyRanges was sorted on Chromosome.
print(j.df)
# Chromosome Start End haplotype_block Start_b End_b feature gene_id gene_name transcript_id
# 0 2 525 775 132 614 789 exon scaffold_200001.1 NaN scaffold_200001.1
# 1 2 5207 5867 1856 5262 5496 exon scaffold_200003.1 CP5 scaffold_200003.1
# 2 2 5207 5867 1856 5579 5750 exon scaffold_200003.1 CP5 scaffold_200003.1
# 3 2 5207 5867 1856 5856 6032 exon scaffold_200003.1 CP5 scaffold_200003.1