How to remove inception words in list?
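Given a list of strings that contains "inception" words, how do I remove the inception words? How do I find the bigger inception word?
Let's define an inception word as a word that appears as part of a bigger word in the same list.
Task: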
To make it very clear, if a list contains ['a', 'b', 'a b c'], remove
'a' and 'b', because there is a bigger element in the list that contains
both 'a' and 'b'.
Example 1, [in]:
[u'dose rate', u'object', u'dose', u'rate', u'computation']
[out]:
[u'dose rate', u'object', u'computation']
Example 2, [in]:
[u'shift', u'magnetic', u'system', u'magnetic sensor', u'phase shift', u'phase', u'output', u'sensor', u'sensing', u'sensor system']
Since 'magnetic', 'sensor', 'system', 'magnetic sensor' and 'sensor system' all exist, either of the following is acceptable.
Desired output, [out]:
[u'system', u'magnetic sensor', u'phase shift', u'output', u'sensing']
or [out]:
[u'magnetic', u'phase shift', u'output', u'sensing', u'sensor system']
I have tried the following, but it doesn't give the desired output:
>>> ls = [u'dose rate', u'object', u'dose', u'rate', u'computation']
>>> set([i for i in ls for j in ls if i!=j or i not in j])
set([u'dose rate', u'object', u'rate', u'computation', u'dose'])
>>> set([j for i in ls for j in ls if i!=j or i not in j])
set([u'rate', u'object', u'dose rate', u'computation', u'dose'])
>>> set([j for j in ls for i in ls if i!=j or i not in j])
set([u'dose rate', u'object', u'rate', u'computation', u'dose'])
Given a list of words:
>>> words = [u'dose rate', u'object', u'dose', u'rate', u'computation']
and a definition of an inception word:
>>> inception = lambda x: any(x in w for w in words if len(x) < len(w))
we can construct a list of non-inception words like this:
>>> [w for w in words if not inception(w)]
[u'dose rate', u'object', u'computation']
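For comparison, here is a small sketch of why the comprehensions in the question return every element. The filter `i != j or i not in j` is true for any pair of distinct strings, so each word passes as soon as it is paired with any other word. The sketch is written for Python 3 and the variable names (`pairs`, `always_true`, `attempt`, `fixed`) are only illustrative.

```python
ls = ['dose rate', 'object', 'dose', 'rate', 'computation']

# The question's filter is true for every pair of distinct strings,
# because the left side of the `or` already holds.
pairs = [(i, j) for i in ls for j in ls if i != j]
always_true = all(i != j or i not in j for i, j in pairs)

# So the question's comprehension keeps everything.
attempt = set(i for i in ls for j in ls if i != j or i not in j)

# Requiring the test to hold against *every* other word gives the filter
# that was presumably intended (equivalent to the any()-based version).
fixed = [i for i in ls if all(i == j or i not in j for j in ls)]

print(always_true)          # True
print(attempt == set(ls))   # True
print(fixed)                # ['dose rate', 'object', 'computation']
```

The nested `for ... for ... if` form tests one pair at a time, while `any()` or `all()` looks at all partners of a word before deciding.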
So to satisfy the first example, you could do something like,
>>> words = [u'dose rate', u'object', u'dose', u'rate', u'computation']
>>> [w1 for w1 in words if not any(w1 in w2 for w2 in words if w2 != w1)]
[u'dose rate', u'object', u'computation']
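But your second example shows that your requirement is a bit more complex: the same small word cannot be used more than once to account for a longer string.
Unfortunately, a one-liner won't do it. Try something like,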
def remove_comprising(words):
    seen = set()
    result_words = []
    for word in words:
        for small_word in words:
            if small_word in word and small_word != word:
                if small_word in seen:
                    word = word.replace(small_word, '')
                else:
                    seen.add(small_word)
        result_words.append(word)
    return [word.strip() for word in result_words if word not in seen]
Then we get the correct result for Example 1,
>>> words = [u'dose rate', u'object', u'dose', u'rate', u'computation']
>>> remove_comprising(words)
[u'dose rate', u'object', u'computation']
and Example 2,
>>> words = [u'shift', u'magnetic', u'system', u'magnetic sensor', u'phase shift', u'phase', u'output', u'sensor', u'sensing', u'sensor system']
>>> remove_comprising(words)
[u'magnetic sensor', u'phase shift', u'output', u'sensing', u'system']
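One caveat worth spelling out: `x in w` and `small_word in word` are substring tests, so 'rate' would also count as part of 'separate'. The question does not say whether that is wanted. If whole-word matching is what is intended (an assumption), a sketch for the simple case of Example 1 could compare whitespace-separated tokens instead. The helper names `is_token_sublist` and `remove_whole_word_inceptions` are hypothetical.

```python
def is_token_sublist(small, big):
    """True if the token list `small` occurs contiguously inside `big`."""
    n = len(small)
    return any(big[k:k + n] == small for k in range(len(big) - n + 1))


def remove_whole_word_inceptions(words):
    # Assumption: entries are phrases whose words are separated by whitespace.
    tokens = [w.split() for w in words]
    return [
        w for w, t in zip(words, tokens)
        # Drop w only if a longer entry contains all of its tokens in order.
        if not any(len(t) < len(o) and is_token_sublist(t, o) for o in tokens)
    ]


words = ['separate', 'rate', 'object']

# Substring test: 'rate' is dropped because it occurs inside 'separate'.
print([w for w in words if not any(w in o for o in words if o != w)])
# ['separate', 'object']

# Token test: 'rate' survives because no longer entry contains it as a word.
print(remove_whole_word_inceptions(words))
# ['separate', 'rate', 'object']
```

This only covers the plain filter; it does not implement the "use each small word once" rule from Example 2.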
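It's a bit complicated to read and the implementation isn't Pythonic, but it should do the trick.
The basic idea is: evaluate each word in the list and flag whether it should be included.
Then use that flag to actually output the words.
The catch is that you want to handle words that can be part of 2 other bigger words, which makes the flagging more granular (not just keep or reject, but keep, keep-continued and reject).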
import copy

def inception(wordlist):
    # don't want to mutilate the original list
    new_wordlist = copy.deepcopy(wordlist)
    # find length of wordlist to know when original length is traversed
    word_count = len(new_wordlist)
    output_set = set()
    output_list = []  # flags existence, -1 = evaluation postponed, 0 = exclude, 1 = include
    eval_list = []
    # iterate through list
    for idx, word in enumerate(new_wordlist):
        inner_words = word.split()
        # if it's only 1 word, evaluate at the end
        # Can be made smarter to reject earlier
        if len(inner_words) == 1 and idx < word_count:
            output_list.append(-1)
            eval_list.append(word)
            new_wordlist.append(word)
            continue
        # Flag existence of inner words if they haven't been found
        existence = 0
        for in_wrd in inner_words:
            if in_wrd in output_set:
                output_list.append(0)
            else:
                # keep continued
                existence += 1
                output_set.add(in_wrd)
                output_list.append(existence)
            eval_list.append(in_wrd)
    # now evaluate by position of flags
    final_set = set()
    for idx, word in enumerate(eval_list):
        if output_list[idx] > 0:
            # combine if words are in order
            if output_list[idx] > 1:
                final_set.remove(eval_list[idx-1])
                word = ' '.join([eval_list[idx-1], eval_list[idx]])
            final_set.add(word)
    return list(final_set)
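I have only tested this with the 2 sets you provided. If you have sets that fail, please add them in the comments and I will of course correct it.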
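The flagging idea can also be sketched in two explicit passes, which may be easier to follow: multi-word phrases first "claim" their inner words, then single words are kept only if nothing claimed them. This is not the code above, only a rough restatement under the assumption that words are separated by whitespace; the name `remove_inception_words` is hypothetical, and behaviour beyond the two examples from the question has not been compared.

```python
def remove_inception_words(words):
    """Hypothetical two-pass variant; assumes whitespace-separated words."""
    claimed = set()  # single words already accounted for
    kept = []

    # Pass 1: each multi-word phrase claims the inner words nobody claimed yet.
    for phrase in words:
        parts = phrase.split()
        if len(parts) < 2:
            continue
        fresh = [p for p in parts if p not in claimed]
        claimed.update(fresh)
        if fresh:
            # Keep only the part of the phrase that is still unclaimed.
            kept.append(' '.join(fresh))

    # Pass 2: keep single words that no phrase claimed.
    for word in words:
        if len(word.split()) == 1 and word not in claimed:
            claimed.add(word)  # also drops repeated single words
            kept.append(word)
    return kept


ex1 = ['dose rate', 'object', 'dose', 'rate', 'computation']
ex2 = ['shift', 'magnetic', 'system', 'magnetic sensor', 'phase shift',
       'phase', 'output', 'sensor', 'sensing', 'sensor system']

print(remove_inception_words(ex1))
# ['dose rate', 'object', 'computation']
print(remove_inception_words(ex2))
# ['magnetic sensor', 'phase shift', 'system', 'output', 'sensing']
```

Unlike `inception`, this returns a list in a stable order (phrases first, then single words) rather than the arbitrary order of a set.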