Remove item from list if it does not match substring, regardless of the formatting
I have the following dataframe:
import pandas as pd

df = pd.DataFrame()
df['full_string'] = [['apples and bananas', 'applesandbananasamongstothers', 'something else'],
                     ['ApplesandBananas', 'apples and Bananas', 'bananas']]
df['substring'] = ['apples and bananas', 'apples and bananas']
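The desired outcome is to keep the items in df['full_string'] that contain text found in df['substring'], taking into account:
- case: upper/lower case does not matter
- spacing between words
- the items may contain other text unrelated to the text in df['substring']
Desired outcome: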
df['outcome'] = [['apples and bananas', 'applesandbananasamongstothers'],
                 ['ApplesandBananas', 'apples and Bananas', 'bananas']]
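What I tried was taking the first keyword of df['substring'] and using it as the matcher against df['full_string']. However, this does not let me keep the 'bananas' element in the second row of the dataframe.
(This does not work well on the dummy data):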
first_keyword = []
for i in df['substring']:
    first_keyword.append(i.split(' ', 1)[0])
df['first_keyword'] = first_keyword
# Note: x[1] is a list of strings here, so calling .lower() on it raises AttributeError
df['C'] = [x[0].lower() in (x[1].lower()) for x in zip(df['first_keyword'], df['full_string'])]
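To simplify the example, I chose to work with a plain list containing your dummy data. You will need to adapt it to your problem.
In addition, I interpreted your sentence "The desired outcome is to keep the items in df['full_string'] which contain text that is found in df['substring']" as text = word, i.e. an item is kept if it contains at least one word of a substring.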
full_str = ['apples and bananas', 'applesandbananasamongstothers', 'something else',
            'ApplesandBananas', 'apples and Bananas', 'bananas']
sub_str = ['apples and bananas', 'red and blue']
# Extract words from sub strings
words_in_sub = [elt.split() for elt in sub_str]
# Flatten and remove duplicates
words_in_sub = list(set([item for sublist in words_in_sub for item in sublist]))
# Init output
output = list()
# Loop on the strings in full string
for full_s in full_str:
    # Loop on the words to look for
    for word in words_in_sub:
        if word.lower() in full_s.lower():
            output.append(full_s)
            break
Output:
In: output
Out:
['apples and bananas',
'applesandbananasamongstothers',
'ApplesandBananas',
'apples and Bananas',
'bananas']
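Lower/upper case is handled in the if condition, where both sides are lower-cased. Spacing is handled by the in operator, because each word is searched for on its own. The presence of other text in full_s is also handled by the in operator, which returns True if the word appears anywhere in the string. The only case that returns False while the word could still be considered present is when the word itself is split in two by a space, e.g. 'bana naan dapp les'. That example would not be kept in the output list.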
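If the split-word case above matters, one option is to strip all whitespace from both sides before comparing. The following is only a minimal sketch under the same "text = word" interpretation; the helper names `normalize` and `contains_any_word` are hypothetical and not part of the question.

```python
def normalize(text):
    """Lower-case the text and remove all whitespace."""
    return "".join(text.lower().split())


def contains_any_word(full_s, sub_s):
    """Return True if any word of sub_s occurs in full_s, ignoring case and spacing."""
    target = normalize(full_s)
    return any(normalize(word) in target for word in sub_s.split())


# The word 'apples' is split by a space here, but is still found after normalization
print(contains_any_word("bana naan dapp les", "apples and bananas"))
# No word of the substring occurs in this string
print(contains_any_word("something else", "apples and bananas"))
```

Keep in mind that matching on any single word is loose: a short word such as 'and' will also match unrelated strings like 'sandwich', so you may want to filter out such words first.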
Edit: multiple rows. You could also flatten the lists and use the first code.
full_str = [['apples and bananas', 'applesandbananasamongstothers', 'something else'],
            ['ApplesandBananas', 'apples and Bananas', 'bananas']]
sub_str = [['apples and bananas'], ['apples and bananas']]
# Assuming same number of rows between full_str and sub_str
# And you want to keep element of full_str[k] according to sub strings in sub_str[k]
number_of_rows = len(full_str)
# Collect one filtered list per row (otherwise output is overwritten on every iteration)
outcome = list()
for k in range(number_of_rows):
    # Extract words from sub strings
    words_in_sub = [elt.split() for elt in sub_str[k]]
    # Flatten and remove duplicates
    words_in_sub = list(set([item for sublist in words_in_sub for item in sublist]))
    # Init output
    output = list()
    # Loop on the strings in full string
    for full_s in full_str[k]:
        # Loop on the words to look for
        for word in words_in_sub:
            if word.lower() in full_s.lower():
                output.append(full_s)
                break
    # Store the result for row k
    outcome.append(output)
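To bring the per-row logic back to the DataFrame from the question, something like the following could work. It is a sketch, not a definitive implementation: it assumes pandas is installed, keeps the same any-word interpretation, and `keep_matching` is a hypothetical helper name.

```python
import pandas as pd

df = pd.DataFrame({
    "full_string": [
        ["apples and bananas", "applesandbananasamongstothers", "something else"],
        ["ApplesandBananas", "apples and Bananas", "bananas"],
    ],
    "substring": ["apples and bananas", "apples and bananas"],
})


def keep_matching(items, substring):
    """Keep the items that contain at least one word of substring (case-insensitive)."""
    words = {word.lower() for word in substring.split()}
    return [item for item in items if any(word in item.lower() for word in words)]


# Filter each row's list against that row's substring
df["outcome"] = [
    keep_matching(items, substring)
    for items, substring in zip(df["full_string"], df["substring"])
]
print(df["outcome"].tolist())
```

A plain list comprehension over zip is used here instead of df.apply(..., axis=1) so that the returned lists are stored as-is, one list per row.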