Faster way of building string combinations (with separator) than using a for loop?

Question

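I am working with a relatively large dataset (in Python and Pandas) and am trying to build combinations of multiple columns as strings.

Let's say I have two lists, x and y, where x = ["sector_1", "sector_2", "sector_3", ...] and y = [7, 19, 21, ...].

I have been using a for loop to build the combinations, combined = ["sector_1--7", "sector_1--19", "sector_1--21", "sector_2--7", "sector_2--19", ...], with the separator here defined as --.

My current code looks like this: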

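import numpy as np
import pandas as pd

sep = '--'
combined = np.empty(0, dtype='object')
for x_value in x:
    for y_value in y:
        # np.append returns a new array on every call
        combined = np.append(combined, str(x_value) + sep + str(y_value))
combined = pd.DataFrame(combined)
# split the joined strings back into two columns
combined = combined.iloc[:, 0].str.split(sep, expand=True)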

The code above works, but I was wondering whether there is a better way (ideally one that is more efficient at runtime).

Answer 1

Try this:

import itertools as it
combined = [f'{a}--{b}' for a, b in it.product(x, y)]

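Output (x and y were defined here with a literal ... as their last element, hence the Ellipsis entries):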

>>> combined
['sector_1--7',
 'sector_1--19',
 'sector_1--21',
 'sector_1--Ellipsis',
 'sector_2--7',
 'sector_2--19',
 'sector_2--21',
 'sector_2--Ellipsis',
 'sector_3--7',
 'sector_3--19',
 'sector_3--21',
 'sector_3--Ellipsis',
 'Ellipsis--7',
 'Ellipsis--19',
 'Ellipsis--21',
 'Ellipsis--Ellipsis']
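A large part of the cost in the original loop is that np.append builds a new array on every call, whereas a list comprehension over it.product builds the list once. As a minimal sketch of the same idea with the separator kept configurable (the helper name combine and the short sample lists are made up for illustration, not taken from the question):

```python
import itertools as it


def combine(x, y, sep="--"):
    """Join every (x, y) pair with sep, in the same order as the nested loops."""
    return [f"{a}{sep}{b}" for a, b in it.product(x, y)]


# Illustrative sample data; the lists may have different lengths.
x = ["sector_1", "sector_2"]
y = [7, 19, 21]

combined = combine(x, y)

# Reference result built with the nested loops from the question.
looped = []
for x_value in x:
    for y_value in y:
        looped.append(str(x_value) + "--" + str(y_value))
```

Both combined and looped should hold the same six strings in the same order, since it.product iterates the last argument fastest, just like the inner loop.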

Instead of all of that, you should use a combination of np.tile and np.repeat:

combined_df = pd.DataFrame({0: np.repeat(x, len(y)), 1: np.tile(y, len(x))})

Output:

>>> combined_df
           0         1
0   sector_1         7
1   sector_1        19
2   sector_1        21
3   sector_1  Ellipsis
4   sector_2         7
5   sector_2        19
6   sector_2        21
7   sector_2  Ellipsis
8   sector_3         7
9   sector_3        19
10  sector_3        21
11  sector_3  Ellipsis
12  Ellipsis         7
13  Ellipsis        19
14  Ellipsis        21
15  Ellipsis  Ellipsis
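To check how this behaves when the two lists have different lengths, here is a small sketch in which each x value is repeated len(y) times and the whole of y is tiled len(x) times, so both columns end up with len(x) * len(y) rows. The function name cross_frame and the sample data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def cross_frame(x, y):
    """Two-column DataFrame holding every (x, y) pair, without a Python loop."""
    return pd.DataFrame({
        0: np.repeat(x, len(y)),  # sector_1, sector_1, sector_1, sector_2, ...
        1: np.tile(y, len(x)),    # 7, 19, 21, 7, 19, 21, ...
    })


# Illustrative sample data with unequal lengths.
x = ["sector_1", "sector_2"]
y = [7, 19, 21]

combined_df = cross_frame(x, y)

# If the joined strings are still needed, they can be built column-wise.
joined = combined_df[0] + "--" + combined_df[1].astype(str)
```

This produces the two columns directly, so the join-then-split step from the question is not needed unless the joined strings themselves are wanted.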


python

numpy

runtime

dataframe

pandas