What is the best practice to aggregate substrings in string values with Python?

Question

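I have a very specific problem to solve. I have a DataFrame with strings (paths) and an associated metric (Count). To make the results more readable, I would like to aggregate the paths within each string under one condition: as soon as a substring occurs more than once in a row, a multiplier or something similar should be added to that substring at that point.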

Example input:
'SEO > direct_&c_(notset) > direct_&c_(notset) > direct_&c_(notset) > SEO'

Desired output:
'SEO > 3 x (direct_&c_(notset)) > SEO'

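As you can see, the substring "SEO" should not be aggregated, because the order matters. The input represents a user path, so important information would be lost if the distinct substrings were simply counted.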

Answer 1

You can use itertools.groupby to find matching adjacent components; this returns them grouped, so you can then use more_itertools.ilen ("iterator length") to get the count in each group (1 if the component is not repeated).

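from itertools import groupby

from more_itertools import ilen  # third-party: pip install more-itertools

in_str = 'SEO > direct_&c_(notset) > direct_&c_(notset) > direct_&c_(notset) > SEO'

out_list = []
# groupby yields one (component, group) pair per run of equal adjacent components
for component, group in groupby(in_str.split(' > ')):
    count = ilen(group)  # number of items in this run
    if count == 1:
        out_list.append(component)
    else:
        out_list.append('%s x (%s)' % (count, component))

out_str = ' > '.join(out_list)
print(out_str)  # SEO > 3 x (direct_&c_(notset)) > SEO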

If you don't want to use the more_itertools library, you can write this instead:

count = sum(1 for _ in group)

This does the same thing as ilen, but is messier to read.
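
Since the question mentions a DataFrame, it may help to wrap the same groupby idea in a reusable function. The following is only a sketch using the standard library; the function name collapse_runs and its sep parameter are made up for illustration and do not come from the question.

```python
from itertools import groupby


def collapse_runs(path, sep=" > "):
    """Collapse consecutive repeats in a separator-delimited path string.

    Hypothetical helper: the name and the `sep` parameter are illustrative.
    """
    parts = []
    # groupby only merges *adjacent* equal components, so order is preserved
    for component, group in groupby(path.split(sep)):
        count = sum(1 for _ in group)  # length of this run
        parts.append(component if count == 1 else f"{count} x ({component})")
    return sep.join(parts)


example = "SEO > direct_&c_(notset) > direct_&c_(notset) > direct_&c_(notset) > SEO"
print(collapse_runs(example))
# SEO > 3 x (direct_&c_(notset)) > SEO
```

With pandas, something like df['Path'].apply(collapse_runs) should then apply it row by row, assuming the path column is named 'Path' (the actual column name is not given in the question).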