Create cartesian product of strings by replacing substrings from a list

I have a dictionary that maps placeholders to lists of their possible values, as shown below:

{
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan'],
    # and so on ...
}

I want to create all possible combinations of strings by replacing the placeholders (i.e. ~GPE~ and ~PERSON~) in a template:

"My name is ~PERSON~. I travel to ~GPE~ with ~PERSON~ every year."

The expected output is:

"My name is John Davies. I travel to UK with Tom Banton every year."
"My name is John Davies. I travel to UK with Joe Morgan every year."
"My name is John Davies. I travel to USA with Tom Banton every year."
"My name is John Davies. I travel to USA with Joe Morgan every year."
"My name is Tom Banton. I travel to UK with John Davies every year."
"My name is Tom Banton. I travel to UK with Joe Morgan every year."
"My name is Tom Banton. I travel to USA with John Davies every year."
"My name is Tom Banton. I travel to USA with Joe Morgan every year."
"My name is Joe Morgan. I travel to UK with Tom Banton every year."
"My name is Joe Morgan. I travel to UK with John Davies every year."
"My name is Joe Morgan. I travel to USA with Tom Banton every year."
"My name is Joe Morgan. I travel to USA with John Davies every year."

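Also note that the values for a given key in the dictionary are never repeated within the same sentence. For example, I do not want: "My name is Joe Morgan. I travel to USA with Joe Morgan every year." (So it is not exactly a cartesian product, but close enough.)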

I am new to Python and have been experimenting with the re module, but I could not find a solution to this problem.

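EDIT

The main problem I am facing is that replacing a substring changes the length of the string, which makes subsequent modifications difficult. This is especially true because the same placeholder can occur more than once in the string. The snippet below illustrates the problem: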

import re

label_dict = {
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan']
}


template = "My name is ~PERSON~. I travel to ~GPE~ with ~PERSON~ every year."

for label in label_dict.keys():
    modified_string = template
    offset = 0
    for match in re.finditer(r'{}'.format(label), template):
        for label_text in label_dict.get(label, []):
            start, end = match.start() + offset, match.end() + offset
            offset += (len(label_text) - (end - start))
#             print ("Match was found at {start}-{end}: {match}".format(start = start, end = end, match = match.group()))
            modified_string = modified_string[: start] + label_text + modified_string[end: ]
            print(modified_string)

This gives the following incorrect output:

My name is ~PERSON~. I travel to UK with ~PERSON~ every year.
My name is ~PERSON~. I travel USA with ~PERSON~ every year.
My name is John Davies. I travel to ~GPE~ with ~PERSON~ every year.
My name is JohTom Banton. I travel to ~GPE~ with ~PERSON~ every year.
My name is JohToJoe Morgan. I travel to ~GPE~ with ~PERSON~ every year.
My name is JohToJoe Morgan. I travel to ~GPE~ with John Davies every year.
My name is JohToJoe Morgan. I travel to ~GPE~ with JohTom Banton every year.
My name is JohToJoe Morgan. I travel to ~GPE~ with JohToJoe Morgan every year.

Here are two approaches, or three if you count the new code I just added. All of them produce the desired output.

Nested loops

data_in ={
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan']
}

data_out = []
for gpe in data_in['~GPE~']:
    for person1 in data_in['~PERSON~']:
        for person2 in data_in['~PERSON~']:
            if person1 != person2: 
                data_out.append(f'My name is {person1}. I travel to {gpe} with {person2} every year.')

print('\n'.join(data_out))

List comprehension

data_in ={
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan']
}

data_out = [
    f'My name is {person1}. I travel to {gpe} with {person2} every year.'
    for gpe in data_in['~GPE~']
    for person1 in data_in['~PERSON~']
    for person2 in data_in['~PERSON~']
    if person1 != person2
]

print('\n'.join(data_out))
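
As a possible variation on the two snippets above, the same loops can be expressed with itertools. This is only a sketch: it assumes the values in each list are unique, because permutations distinguishes items by position rather than by value.

```python
from itertools import permutations, product

data_in = {
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan']
}

# permutations(..., 2) yields ordered pairs of two different list entries,
# so no explicit person1 != person2 check is needed.
data_out = [
    f'My name is {person1}. I travel to {gpe} with {person2} every year.'
    for gpe, (person1, person2) in product(data_in['~GPE~'], permutations(data_in['~PERSON~'], 2))
]

print('\n'.join(data_out))
```

Like the versions above, this still hard-codes the template, so it only helps when the sentence structure is known in advance.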

Using merge from Pandas

Note that this code requires Pandas 1.2 or later.

import pandas as pd

data = {
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan'],
    # and so on ...
}

country = pd.DataFrame({'country':data['~GPE~']})
person = pd.DataFrame({'person':data['~PERSON~']})

cart = country.merge(person, how='cross').merge(person, how='cross')

cart.columns = ['country', 'person1', 'person2']

cart = cart.query('person1 != person2').reset_index(drop=True)

cart['sentence'] = cart.apply(lambda row: f"My name is {row['person1']}. I travel to {row['country']} with {row['person2']} every year." , axis=1)

sentences = cart['sentence'].to_list()

print('\n'.join(sentences))
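
All three approaches above hard-code the sentence. If the template itself has to stay a plain string with placeholders, as in the question, one possible generalisation is sketched below. The function name expand is made up for this example, and the sketch assumes that no label is a substring of another label and that the values in each list are unique. Passing a function to re.sub lets the re module rebuild the string, so no offsets need to be tracked.

```python
import re
from itertools import permutations, product


def expand(template, label_dict):
    """Yield every filling of template in which a label never reuses a value."""
    # Only consider labels that actually occur in the template.
    labels = [label for label in label_dict if label in template]
    if not labels:
        yield template
        return
    # One pattern that matches any placeholder; re.escape guards against
    # labels containing regex metacharacters.
    pattern = re.compile('|'.join(re.escape(label) for label in labels))
    # A label that occurs k times needs k different values, in order.
    choices = [permutations(label_dict[label], template.count(label)) for label in labels]
    for combo in product(*choices):
        # One iterator per label hands out that label's values left to right.
        pending = {label: iter(values) for label, values in zip(labels, combo)}
        yield pattern.sub(lambda m: next(pending[m.group()]), template)


label_dict = {
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan']
}
template = "My name is ~PERSON~. I travel to ~GPE~ with ~PERSON~ every year."

sentences = list(expand(template, label_dict))
print('\n'.join(sentences))
```

If a label occurs more often in the template than it has values, permutations yields nothing for it, so the sketch produces no sentences at all rather than repeating a value.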