Create cartesian product of strings by replacing substrings from a list
I have a dictionary that maps placeholders to lists of their possible values, as shown below:
{
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan'],
    # and so on ...
}
I want to create every possible combination of strings by replacing the placeholders (i.e. ~GPE~ and ~PERSON~) in a template:
"My name is ~PERSON~. I travel to ~GPE~ with ~PERSON~ every year."
The expected output is:
"My name is John Davies. I travel to UK with Tom Banton every year."
"My name is John Davies. I travel to UK with Joe Morgan every year."
"My name is John Davies. I travel to USA with Tom Banton every year."
"My name is John Davies. I travel to USA with Joe Morgan every year."
"My name is Tom Banton. I travel to UK with John Davies every year."
"My name is Tom Banton. I travel to UK with Joe Morgan every year."
"My name is Tom Banton. I travel to USA with John Davies every year."
"My name is Tom Banton. I travel to USA with Joe Morgan every year."
"My name is Joe Morgan. I travel to UK with Tom Banton every year."
"My name is Joe Morgan. I travel to UK with John Davies every year."
"My name is Joe Morgan. I travel to USA with Tom Banton every year."
"My name is Joe Morgan. I travel to USA with John Davies every year."
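Also note that the values for a given key in the dictionary must not repeat within the same sentence. For example, I do not want: "My name is Joe Morgan. I travel to USA with Joe Morgan every year." (So it is not exactly a cartesian product, but close enough.)
I am new to Python and have been trying the re module, but I could not find a solution to this problem.
EDIT
The main problem I face is that replacing a substring changes the length of the string, which makes subsequent modifications difficult. This is made worse by the fact that the same placeholder can occur several times in the string. Here is a snippet that illustrates it: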
import re

label_dict = {
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan']
}
template = "My name is ~PERSON~. I travel to ~GPE~ with ~PERSON~ every year."
for label in label_dict.keys():
    modified_string = template
    offset = 0
    for match in re.finditer(r'{}'.format(label), template):
        for label_text in label_dict.get(label, []):
            start, end = match.start() + offset, match.end() + offset
            offset += (len(label_text) - (end - start))
            # print ("Match was found at {start}-{end}: {match}".format(start = start, end = end, match = match.group()))
            modified_string = modified_string[: start] + label_text + modified_string[end: ]
            print(modified_string)
This gives the following incorrect output:
My name is ~PERSON~. I travel to UK with ~PERSON~ every year.
My name is ~PERSON~. I travel USA with ~PERSON~ every year.
My name is John Davies. I travel to ~GPE~ with ~PERSON~ every year.
My name is JohTom Banton. I travel to ~GPE~ with ~PERSON~ every year.
My name is JohToJoe Morgan. I travel to ~GPE~ with ~PERSON~ every year.
My name is JohToJoe Morgan. I travel to ~GPE~ with John Davies every year.
My name is JohToJoe Morgan. I travel to ~GPE~ with JohTom Banton every year.
My name is JohToJoe Morgan. I travel to ~GPE~ with JohToJoe Morgan every year.
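The offset bookkeeping described in the edit can likely be avoided by letting re.sub do the replacing: when it is given a function instead of a replacement string, it calls that function once per match and assembles the result itself. Below is a minimal sketch that fills the template for one chosen assignment of values. The names fill and assignment are hypothetical, and it assumes the assignment holds at least one value per occurrence of each placeholder.

```python
import re
from collections import deque

template = "My name is ~PERSON~. I travel to ~GPE~ with ~PERSON~ every year."

def fill(template, assignment):
    """Replace placeholder occurrences, left to right, with the next queued value.

    `assignment` (hypothetical name) maps each placeholder to the values to use,
    in order of appearance; it is assumed to be non-empty.
    """
    queues = {label: deque(values) for label, values in assignment.items()}
    pattern = re.compile("|".join(re.escape(label) for label in queues))
    # re.sub calls the function once per match and builds the new string,
    # so no manual start/end offset tracking is needed.
    return pattern.sub(lambda m: queues[m.group()].popleft(), template)

print(fill(template, {"~PERSON~": ["John Davies", "Tom Banton"], "~GPE~": ["UK"]}))
```

This only produces a single sentence; enumerating every valid assignment is a separate step.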
Here are two approaches, or three if you count the new code I just added. All of them produce the desired output.
Nested loops
data_in = {
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan']
}
data_out = []
for gpe in data_in['~GPE~']:
    for person1 in data_in['~PERSON~']:
        for person2 in data_in['~PERSON~']:
            # Skip sentences in which the same person appears twice.
            if person1 != person2:
                data_out.append(f'My name is {person1}. I travel to {gpe} with {person2} every year.')
print('\n'.join(data_out))
List comprehension
data_in = {
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan']
}
data_out = [
    f'My name is {person1}. I travel to {gpe} with {person2} every year.'
    for gpe in data_in['~GPE~']
    for person1 in data_in['~PERSON~']
    for person2 in data_in['~PERSON~']
    if person1 != person2
]
print('\n'.join(data_out))
Using merge from Pandas
Note that this code requires Pandas 1.2 or later.
import pandas as pd

data = {
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan'],
    # and so on ...
}
country = pd.DataFrame({'country': data['~GPE~']})
person = pd.DataFrame({'person': data['~PERSON~']})
# Cross-join countries with people twice to get every (country, person1, person2) row.
cart = country.merge(person, how='cross').merge(person, how='cross')
cart.columns = ['country', 'person1', 'person2']
# Drop rows in which the same person appears twice.
cart = cart.query('person1 != person2').reset_index(drop=True)
cart['sentence'] = cart.apply(lambda row: f"My name is {row['person1']}. I travel to {row['country']} with {row['person2']} every year.", axis=1)
sentences = cart['sentence'].to_list()
print('\n'.join(sentences))
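All three approaches hard-code the sentence and the two labels. If the template or the dictionary can vary, one possible generalization is sketched below. It is not part of the original answer, and the function name expand is hypothetical. It assumes a non-empty dictionary and that "no repeats" means values of the same label must differ within one sentence. It splits the template on the placeholders, then combines itertools.permutations (distinct values per label) with itertools.product (across labels).

```python
import re
from itertools import permutations, product

def expand(template, label_dict):
    """Yield every filled-in template in which no value of a label repeats."""
    pattern = "|".join(re.escape(label) for label in label_dict)
    # The capturing group makes re.split keep the placeholders:
    # literal text lands at even indices, placeholders at odd indices.
    parts = re.split(f"({pattern})", template)
    slots = parts[1::2]                  # placeholders in order of appearance
    labels = list(dict.fromkeys(slots))  # distinct labels, first-seen order
    # Per label: every ordered pick of distinct values, one per occurrence.
    choices = [permutations(label_dict[label], slots.count(label)) for label in labels]
    for combo in product(*choices):
        values = {label: iter(picked) for label, picked in zip(labels, combo)}
        filled = parts[:]
        filled[1::2] = [next(values[slot]) for slot in slots]
        yield "".join(filled)

label_dict = {
    "~GPE~": ['UK', 'USA'],
    "~PERSON~": ['John Davies', 'Tom Banton', 'Joe Morgan'],
}
template = "My name is ~PERSON~. I travel to ~GPE~ with ~PERSON~ every year."
sentences = list(expand(template, label_dict))
print('\n'.join(sentences))
```

With the data from the question this yields the same 12 sentences, though not necessarily in the order listed there. If a label occurs more often in the template than it has values, permutations produces nothing for it and no sentences are generated.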