Fill dataframe values per column, by row index, if position is present in range
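I have a list of start and end coordinates of ranges, and I want to fill a pandas df according to whether each position falls inside one of those ranges.
The number of rows is predetermined, and the rows are filled with '0'. For example, if a column has the range 1,3, then rows (index) 1-3 are filled with '1'.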
import pandas as pd

d = {
    'a': [[0,2], [3,7], [13,23], [24,25]],
    'b': [[1,5], [8,12], [15,18], [20,24]],
}
presabsdict = {}
for G in d.keys():
    refpositions = list('0'*50)
    positions = d.get(G)
    for pos in positions:
        pos2 = pos[1]
        pos1 = pos[0]
        poslength = (pos2-pos1)
        # end coordinate is inclusive, hence pos2+1
        refpositions[pos1:(pos2+1)] = (list('1'*(poslength+1)))
    presabsdict[G] = refpositions
df = pd.DataFrame.from_dict(presabsdict,orient='index').transpose()
df["Sitespresent"] = df.astype(int).sum(axis=1).astype(int)
print(df)
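This is extremely inefficient for large datasets. The end goal is the 'Sitespresent' column, so a solution that drops the dataframe altogether would work too.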
You could do something like this:
import pandas as pd
# d is the dict of ranges from the question
refpositions = pd.DataFrame({'pos': range(50)})
# every range from every column, closed on both ends
intervals = pd.arrays.IntervalArray(
    [pd.Interval(start, end, closed='both') for v in d.values() for start, end in v])
# each position as a zero-length closed interval
pos_as_intv = [pd.Interval(i, i, closed='both') for i in refpositions.pos]
# Walk through overlaps and count
refpositions['total'] = [intervals.overlaps(x).sum() for x in pos_as_intv]
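The per-position overlap count above can also be sketched with a difference array and a cumulative sum, which avoids one overlap test per position. This is only an illustration, not part of the original answer: the function name `count_overlaps` is hypothetical, and it assumes every range satisfies 0 <= start <= end with inclusive ends. Like the interval version, it counts ranges rather than columns, so it matches 'Sitespresent' only when ranges within a column do not overlap each other.

```python
import numpy as np

d = {
    'a': [[0, 2], [3, 7], [13, 23], [24, 25]],
    'b': [[1, 5], [8, 12], [15, 18], [20, 24]],
}

def count_overlaps(ranges_by_key, length):
    """Count, for each position 0..length-1, how many inclusive ranges contain it."""
    bounds = np.array([r for ranges in ranges_by_key.values() for r in ranges])
    # Clip so that ranges reaching past `length` do not index out of bounds.
    starts = np.minimum(bounds[:, 0], length)
    stops = np.minimum(bounds[:, 1] + 1, length)  # first position after each range
    diff = np.zeros(length + 1, dtype=np.int64)
    np.add.at(diff, starts, 1)    # +1 where a range opens (repeated indices accumulate)
    np.add.at(diff, stops, -1)    # -1 just past its inclusive end
    # The running sum of the deltas is the number of open ranges at each position.
    return np.cumsum(diff)[:length]

total = count_overlaps(d, 50)
print(total)
```

The work is proportional to the number of ranges plus the number of positions, rather than their product.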
Another approach:
import pandas as pd
import numpy as np

def range_array(ranges, length):
    # 0/1 presence vector for one column
    grid = np.zeros(length, dtype=np.uint8)
    for rng in ranges:
        # end coordinate is inclusive, as in the question
        grid[rng[0]:rng[1] + 1] = 1
    return grid

def make_df(ranges_list, length):
    df_dict = {}
    for i, ranges in enumerate(ranges_list):
        df_dict[i] = range_array(ranges, length)
    return pd.DataFrame.from_dict(df_dict)

a = [[0,2], [0,7], [0,23], [0,25]]
b = [[1,5], [8,12], [15,18], [20,34]]
c = [[1,2], [9,12], [5,11], [20,14]]
d = [[4,6], [5,12], [15,21], [20,44]]
e = [[2,5], [3,12], [15,19], [20,54]]
ranges_list = [a,b,c,d,e]
length = 50  # positions at or beyond length are silently clipped by NumPy slicing
df = make_df(ranges_list, length)
df["sum"] = df.sum(axis=1)
print(df)
Here, length only needs to exceed the highest single coordinate in the ranges.
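Since the question says only the total column matters, the per-column grids can also be summed directly without building a DataFrame. The following is a minimal sketch under stated assumptions: the name `sites_present` and its default for `length` (one more than the largest coordinate seen) are hypothetical, and range ends are treated as inclusive, as in the question.

```python
import numpy as np

def sites_present(ranges_list, length=None):
    """Return, per position, how many range lists (columns) cover it."""
    if length is None:
        # Assumed default: just large enough to hold the highest coordinate.
        length = max(max(r) for ranges in ranges_list for r in ranges) + 1
    total = np.zeros(length, dtype=np.int64)
    for ranges in ranges_list:
        # Boolean mask, so overlapping ranges in one column count only once.
        covered = np.zeros(length, dtype=bool)
        for start, end in ranges:
            covered[start:end + 1] = True  # a reversed range gives an empty slice
        total += covered
    return total

d = {
    'a': [[0, 2], [3, 7], [13, 23], [24, 25]],
    'b': [[1, 5], [8, 12], [15, 18], [20, 24]],
}
sites = sites_present(list(d.values()), length=50)
print(sites)
```

Only two arrays of size length are alive at any time, so memory use does not grow with the number of columns.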