How to lump repeated values into the same interval?
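I have a spatial dataset similar to the one below. It has an "ID", an "Assay" from the analysis machine, a From interval and a To interval. I want to go through it from top to bottom by ID and Assay, find repeated Assay values, and merge them if they repeat back to back (one immediately after another). I tried groupby and aggregate, but that ends up lumping together all equal Assay values, whereas I only want adjacent rows to be merged. Hopefully the example below makes sense. Thanks in advance!
The expected output below is what I want, but my code does not produce it.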
import pandas as pd
df = pd.DataFrame({
"ID": [ 1, 1, 1, 1, 2, 2, 3, 3, 5, 5, 5, 5],
"Assay": [ 3, 3, 4, 3, 3, 6, 4, 4, 1, 1, 2, 2],
"From": [ 7, 8, 9,10, 0, 8,12,15, 0, 5,10,15],
"To": [13,14,15,16,17,18,13,100,5,10,15,25]
})
result = df.groupby(["ID", "Assay"]).agg({"From":['first'], "To":['last']})
Expected output:
          From   To
         first last
ID Assay
1  3         7   14
   4         9   15
   3        10   16
2  3         0   17
   6         8   18
3  4        12  100
5  1         0    5
   1         5   10
   2        10   15
   2        15   25
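We can use diff + ne + cumsum to label runs of consecutive equal Assay values; then select the rows whose Assay is greater than or equal to 3 and aggregate them with groupby.agg.
Then concatenate this result with the rows that were filtered out to build the final output: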
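# label runs of consecutive equal Assay values
df['groups'] = df['Assay'].diff().ne(0).cumsum()
# only rows with Assay >= 3 are merged
msk = df['Assay'].ge(3)
tmp = (df[msk].groupby(['ID','Assay', 'groups'], sort=False)
       .agg({'From':'first', 'To':'last'}).reset_index())
# a stable sort keeps rows that share a group label in their original order
out = pd.concat((tmp, df[~msk])).sort_values('groups', kind='stable').drop(columns='groups').reset_index(drop=True)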
Output:
ID Assay From To
0 1 3 7 14
1 1 4 9 15
2 1 3 10 16
3 2 3 0 17
4 2 6 8 18
5 3 4 12 100
6 5 1 0 5
7 5 1 5 10
8 5 2 10 15
9 5 2 15 25
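As a variant, here is a minimal sketch that starts a new run whenever either ID or Assay changes from the previous row, so the run label never spans two IDs. It assumes every back-to-back run should be merged, including the Assay 1 and 2 rows that the expected output in the question keeps separate; the names `keys`, `run_id` and `merged` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "ID":    [1, 1, 1, 1, 2, 2, 3, 3, 5, 5, 5, 5],
    "Assay": [3, 3, 4, 3, 3, 6, 4, 4, 1, 1, 2, 2],
    "From":  [7, 8, 9, 10, 0, 8, 12, 15, 0, 5, 10, 15],
    "To":    [13, 14, 15, 16, 17, 18, 13, 100, 5, 10, 15, 25],
})

keys = df[["ID", "Assay"]]
# A new run starts wherever ID or Assay differs from the previous row;
# the cumulative sum turns those start flags into one label per run.
run_id = keys.ne(keys.shift()).any(axis=1).cumsum()

# Assumption: every run is merged, regardless of the Assay value.
merged = (
    df.groupby(run_id, sort=False)
      .agg(ID=("ID", "first"), Assay=("Assay", "first"),
           From=("From", "first"), To=("To", "last"))
      .reset_index(drop=True)
)
print(merged)
```

To keep the low Assay rows unmerged, as in the expected output, the same `run_id` could be combined with a mask such as `df['Assay'].ge(3)` in the way the answer above does.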
itertools.groupby can help you with this, or you can take a look at a convtools-based solution:
from convtools.contrib.tables import Table
from convtools import conversion as c
iter_rows = Table.from_csv("input.csv", header=True).into_iter_rows(dict)
# store the converter in a variable for further reuse;
# this is a normal ad hoc function
converter = (
c.chunk_by(c.item("ID"), c.item("Assay"))
.aggregate(
{
"ID": c.ReduceFuncs.First(c.item("ID")),
"Assay": c.ReduceFuncs.First(c.item("Assay")),
"From first": c.ReduceFuncs.First(c.item("From")),
"To last": c.ReduceFuncs.Last(c.item("To")),
}
)
.gen_converter()
)
iter_new_rows = converter(iter_rows)
assert list(iter_new_rows) == [
{'ID': '1', 'Assay': '3', 'From first': '7', 'To last': '14'},
{'ID': '1', 'Assay': '4', 'From first': '9', 'To last': '15'},
{'ID': '1', 'Assay': '3', 'From first': '10', 'To last': '16'},
{'ID': '2', 'Assay': '3', 'From first': '0', 'To last': '17'},
{'ID': '2', 'Assay': '6', 'From first': '8', 'To last': '18'},
{'ID': '3', 'Assay': '4', 'From first': '12', 'To last': '100'},
{'ID': '5', 'Assay': '1', 'From first': '0', 'To last': '10'},
{'ID': '5', 'Assay': '2', 'From first': '10', 'To last': '25'}]
# # or if a csv file is needed
# Table.from_rows(iter_new_rows).into_csv("output.csv")
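For the itertools.groupby route mentioned above, a minimal standard-library sketch might look like the following. It relies on the fact that itertools.groupby only groups adjacent items with equal keys, which is the back-to-back behaviour asked for. The helper name `merge_runs` and the in-memory `rows` list are assumptions made for illustration, and like the convtools version it merges every run, including the Assay 1 and 2 rows.

```python
from itertools import groupby
from operator import itemgetter

columns = {
    "ID":    [1, 1, 1, 1, 2, 2, 3, 3, 5, 5, 5, 5],
    "Assay": [3, 3, 4, 3, 3, 6, 4, 4, 1, 1, 2, 2],
    "From":  [7, 8, 9, 10, 0, 8, 12, 15, 0, 5, 10, 15],
    "To":    [13, 14, 15, 16, 17, 18, 13, 100, 5, 10, 15, 25],
}
# Turn the column lists into one dict per row.
rows = [dict(zip(columns, values)) for values in zip(*columns.values())]


def merge_runs(rows):
    """Collapse consecutive rows sharing (ID, Assay) into one interval."""
    merged = []
    # groupby starts a new group every time the key changes, so equal
    # keys that are not adjacent end up in separate groups.
    for (id_, assay), run in groupby(rows, key=itemgetter("ID", "Assay")):
        run = list(run)
        merged.append({
            "ID": id_,
            "Assay": assay,
            "From": run[0]["From"],   # start of the first row in the run
            "To": run[-1]["To"],      # end of the last row in the run
        })
    return merged


result = merge_runs(rows)
for row in result:
    print(row)
```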