Fill dataframe values per column, by row index, if position is present in range

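I have lists of start and end coordinates for ranges, and I want to fill a pandas DataFrame according to whether each row position falls inside one of those ranges.

The number of rows is predetermined and every cell starts as '0'. For example, if a column has the range 1,3, then rows (index) 1-3 are filled with '1'.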

import pandas as pd

d={
    'a': [[0,2], [3,7], [13,23], [24,25]],
    'b': [[1,5], [8,12], [15,18], [20,24]],
}
presabsdict = {}

for G in d.keys():
    refpositions = list('0'*50)
    positions = d.get(G)
    for pos in positions:
        pos2 = pos[1]
        pos1 = pos[0]
        poslength = (pos2-pos1)
        refpositions[pos1:(pos2+1)] = (list('1'*(poslength+1)))
    presabsdict[G] = refpositions

df = pd.DataFrame.from_dict(presabsdict,orient='index').transpose()
df["Sitespresent"] = df.astype(int).sum(axis=1).astype(int)
print(df)

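This is extremely inefficient for large datasets. The end goal is the 'Sitespresent' column, so a solution that skips the DataFrame entirely would also work.

You can do something like this: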

import pandas as pd

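# d is the dictionary of ranges from the question
refpositions = pd.DataFrame({'pos': range(50)})
# All ranges from every key, as intervals that include both endpoints
intervals = pd.arrays.IntervalArray([pd.Interval(start, end, closed='both') for v in d.values() for start, end in v], closed='both')
# Each row position as a zero-width closed interval
pos_as_intv = [pd.Interval(i, i, closed='both') for i in refpositions.pos]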

# Walk through overlaps and count
refpositions['total'] = [intervals.overlaps(x).sum() for x in pos_as_intv]
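
The same per-position count can also be sketched with plain NumPy broadcasting instead of Interval objects. This assumes the ranges are inclusive integer pairs as in the question; the names `pairs`, `starts`, `ends`, `inside` and `counts` are illustrative and not part of the answer above.

```python
import numpy as np

d = {
    'a': [[0, 2], [3, 7], [13, 23], [24, 25]],
    'b': [[1, 5], [8, 12], [15, 18], [20, 24]],
}

# Flatten every [start, end] pair from all keys into one 2-column array.
pairs = np.array([rng for ranges in d.values() for rng in ranges])
starts, ends = pairs[:, 0], pairs[:, 1]

positions = np.arange(50)

# Compare every position with every range (both ends inclusive),
# giving a boolean array of shape (n_ranges, n_positions).
inside = (starts[:, None] <= positions) & (positions <= ends[:, None])

# Number of ranges covering each position.
counts = inside.sum(axis=0)
```

Note that this builds a ranges-by-positions boolean array, so memory use grows with both the number of ranges and the number of positions.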

Another approach:

import pandas as pd
import numpy as np

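def range_array(ranges, length):
    # One column: 1 where a position falls inside any range, else 0
    grid = np.zeros(length, dtype=np.uint8)
    for rng in ranges:
        # + 1 so the end coordinate is included, as in the question
        grid[rng[0]:rng[1] + 1] = 1
    return grid

def make_df(ranges_list, length):
    # One column per list of ranges
    df_dict = {}
    for i, ranges in enumerate(ranges_list):
        df_dict[i] = range_array(ranges, length)
    return pd.DataFrame.from_dict(df_dict)
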
a = [[0,2], [0,7], [0,23], [0,25]]
b = [[1,5], [8,12], [15,18], [20,34]]
c = [[1,2], [9,12], [5,11], [20,14]]
d = [[4,6], [5,12], [15,21], [20,44]]
e = [[2,5], [3,12], [15,19], [20,54]]

ranges_list = [a,b,c,d,e]
length = 50
df = make_df(ranges_list, length)
df["sum"] = df.sum(axis=1)

print(df)

where length only needs to exceed the highest single coordinate in the ranges.
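
Since the question says only the per-position total matters, the DataFrame could be skipped altogether. Below is a minimal sketch using a difference array and a cumulative sum. It assumes inclusive integer ranges with start <= end and every end coordinate below `length`; `count_coverage` is a hypothetical helper name, not something from the answers above.

```python
import numpy as np

def count_coverage(ranges, length):
    """Count how many inclusive [start, end] ranges cover each position."""
    ranges = np.asarray(ranges)
    # One extra slot so that end + 1 is always a valid index.
    diff = np.zeros(length + 1, dtype=np.int64)
    # +1 where a range starts, -1 just past where it ends.
    # np.add.at accumulates correctly when the same index repeats.
    np.add.at(diff, ranges[:, 0], 1)
    np.add.at(diff, ranges[:, 1] + 1, -1)
    # The running sum is the number of ranges open at each position.
    return np.cumsum(diff)[:length]

d = {
    'a': [[0, 2], [3, 7], [13, 23], [24, 25]],
    'b': [[1, 5], [8, 12], [15, 18], [20, 24]],
}

# Pool the ranges from every key, since only the total is needed.
all_ranges = [rng for ranges in d.values() for rng in ranges]
sites_present = count_coverage(all_ranges, 50)
```

The work here is proportional to the number of ranges plus the number of positions, rather than their product, which may matter for large inputs.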