Optimizing the rewriting of the range values in a column into separate rows

Question

I have a dataframe clothes_acc with a column shoe_size that contains values like:

index   shoe_size
134     37-38
963     43-45
968     39-42
969     43-45
970     37-39

What I want to do is write every value of the range on its own row. So I would get:

index   shoe_size
134     37
134     38
963     43
963     44
963     45
968     39
968     40
968     41
968     42
...

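At the moment I have the following code, which works but is very slow for a dataframe with 500k rows. (clothes_acc actually contains other values in that column that are not important here, which is why I took the subset of the dataframe with the values shown above and stored it in the tmp variable.)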

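import pandas as pd
from tqdm import tqdm

for i, row in tqdm(tmp.iterrows(), total=tmp.shape[0]):
    # Remove the original row that holds the range string
    clothes_acc = clothes_acc.drop([i])
    # Parse "37-38" into [37, 38]
    spl = [int(s) for s in row['shoe_size'].split('-')]
    for j in range(spl[0], spl[1] + 1):
        replicate = row.copy()
        replicate['shoe_size'] = str(j)
        # DataFrame.append was removed in pandas 2.0; on newer versions use
        # pd.concat([clothes_acc, replicate.to_frame().T]) instead
        clothes_acc = clothes_acc.append(replicate)

clothes_acc.reset_index(drop=True, inplace=True)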

Can anyone suggest an improvement?

Answer 1

Convert the string ranges to lists of integer sizes and call explode():

df['shoe_size'] = df.apply(lambda x: 
    [i for i in range(int(x['shoe_size'].split('-')[0]), int(x['shoe_size'].split('-')[1]) + 1)], 
    axis=1)
df = df.explode(column='shoe_size')

For example, if df is:

df = pd.DataFrame({
    'shoe_size': ['37-38', '43-45', '39-42', '43-45', '37-39']
})

...this will give the following result:

  shoe_size
0        37
0        38
1        43
1        44
1        45
2        39
2        40
2        41
2        42
3        43
3        44
3        45
4        37
4        38
4        39
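
As a possible variation on the same idea, here is a minimal sketch that splits each range only once with str.split and builds the lists with zip, which avoids the row-wise apply. The helper name expand_ranges is hypothetical, and the sketch assumes every value has the form low-high. The final astype is only needed if you want an integer column, since explode() leaves an object dtype.

```python
import pandas as pd


def expand_ranges(df, column="shoe_size"):
    """Return a copy of df with each 'low-high' string expanded to one row per integer."""
    # Split every "low-high" string once into two integer columns (labelled 0 and 1)
    bounds = df[column].str.split("-", expand=True).astype(int)
    out = df.copy()
    # Build one list of sizes per row; zip avoids a row-wise DataFrame.apply
    sizes = [list(range(lo, hi + 1)) for lo, hi in zip(bounds[0], bounds[1])]
    out[column] = pd.Series(sizes, index=out.index)
    # explode() repeats the other columns (and the index) for each list element
    return out.explode(column).astype({column: int})


df = pd.DataFrame({
    "index": [134, 963, 968, 969, 970],
    "shoe_size": ["37-38", "43-45", "39-42", "43-45", "37-39"],
})
result = expand_ranges(df).reset_index(drop=True)
print(result)
```

Whether this is noticeably faster than the apply version depends on the data, so it is worth timing both on a sample of the real dataframe.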

Answer 2

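One option (more memory intensive) is to extract the bounds of each range, cross-merge with every possible value, and then keep only the rows where the merged value falls within the range. This works reasonably well when many products have overlapping shoe sizes, so the cross join does not become extremely large. Note that merge(how='cross') requires pandas 1.2 or later.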

import numpy as np
import pandas as pd

# Bring ranges over to df
ranges = (clothes_acc['shoe_size'].str.split('-', expand=True)
            .apply(pd.to_numeric)
            .rename(columns={0: 'lower', 1: 'upper'}))
clothes_acc = pd.concat([clothes_acc, ranges], axis=1)
#   index shoe_size  lower  upper
#0    134     37-38     37     38
#1    963     43-45     43     45
#2    968     39-42     39     42
#3    969     43-45     43     45
#4    970     37-39     37     39  

vals = pd.DataFrame({'shoe_size': np.arange(clothes_acc.lower.min(), 
                                            clothes_acc.upper.max()+1)})

res = (clothes_acc.drop(columns='shoe_size')
            .merge(vals, how='cross')
            .query('lower <= shoe_size <= upper')
            .drop(columns=['lower', 'upper']))

print(res)
    index  shoe_size
0     134         37
1     134         38
15    963         43
16    963         44
17    963         45
20    968         39
21    968         40
22    968         41
23    968         42
33    969         43
34    969         44
35    969         45
36    970         37
37    970         38
38    970         39
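
To judge beforehand whether the cross join is affordable, a rough sketch like the following compares the number of intermediate rows the merge creates with the number of rows that survive the filter. The helper name cross_join_cost is made up for illustration, and it assumes the lower and upper columns built above.

```python
import pandas as pd


def cross_join_cost(df):
    """Return (intermediate_rows, final_rows) for the cross-join approach."""
    # Every row is paired with every value between the global min and max
    n_values = int(df["upper"].max() - df["lower"].min() + 1)
    intermediate_rows = len(df) * n_values
    # Only the values inside each row's own range survive the filter
    final_rows = int((df["upper"] - df["lower"] + 1).sum())
    return intermediate_rows, final_rows


bounds = pd.DataFrame({
    "lower": [37, 43, 39, 43, 37],
    "upper": [38, 45, 42, 45, 39],
})
intermediate, final = cross_join_cost(bounds)
print(intermediate, final)  # 45 15
```

For the five sample rows the merge builds 45 rows to keep 15. With 500k rows and the same span of nine sizes, the intermediate frame would hold 4.5 million rows.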

Tags: python, pandas, dataframe, optimization